How to use multiprocessing to loop through a big list of URLs?
Problem: Check a list of over 1000 URLs and get the return code (status_code) for each one.
The script I have works, but it is very slow.
I think there has to be a better, more Pythonic (more beautiful) way of doing this, where I can spawn 10 or 20 threads to check the URLs and collect the responses, i.e.:
200 -> www.yahoo.com
404 -> www.badurl.com
...
Input file: Url10.txt
www.example.com
www.yahoo.com
www.testsite.com
....
import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()
print(urls)

for url in urls:
    url = 'http://' + url  # Add http:// to each url (there has to be a better way to do this)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)
Challenge: Improve speed with multiprocessing.
With multiprocessing
But it is not working. I get the following error (note: I am not sure I have even implemented this correctly):
AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>
--
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
In this case your task is I/O bound, not processor bound: it takes far longer for a website to reply than it does for your CPU to run once through your script (not counting the TCP request). This means the workers spend nearly all of their time waiting on the network, so running them in separate processes (which is what multiprocessing does) buys you nothing that cheaper threads would not. What you want is multithreading. One way to get it is the little-documented, perhaps poorly named, multiprocessing.dummy, which offers the same Pool API backed by threads:
import requests
from multiprocessing.dummy import Pool as ThreadPool

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)  # Make the pool of worker threads
    results = pool.map(get_status, urls)  # Open the urls in their own threads
    pool.close()  # No more tasks will be submitted to the pool
    pool.join()  # Wait for the worker threads to exit
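The same idea can also be written with concurrent.futures.ThreadPoolExecutor from the standard library and applied to the kind of list described in the question. The following is only a minimal sketch, not the answer author's code: the helper names with_scheme, fetch_status, safe_status and check_urls, the default of 20 workers and the 5-second timeout are all assumptions made for illustration. The fetch parameter exists so the pool logic can be tried with a stand-in function instead of real network calls.

```python
from concurrent.futures import ThreadPoolExecutor


def with_scheme(url, default="http"):
    """Prefix a bare host such as 'www.example.com' with a scheme if it has none."""
    return url if "://" in url else f"{default}://{url}"


def fetch_status(url, timeout=5):
    """Return the HTTP status code for url (performs a real request)."""
    # requests is third-party; it is imported here so the other helpers
    # can be used with a substitute fetch function when it is not installed.
    import requests
    return requests.get(url, timeout=timeout).status_code


def safe_status(url, fetch=fetch_status):
    """Return (url, status), or (url, None) if the request raised."""
    try:
        return url, fetch(url)
    except Exception:  # deliberately broad: any failure means "no status"
        return url, None


def check_urls(urls, workers=20, fetch=fetch_status):
    """Check urls on a pool of threads; results keep the input order."""
    # Skip blank lines and make sure every entry has a scheme.
    targets = [with_scheme(u.strip()) for u in urls if u.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: safe_status(u, fetch), targets))
```

With the question's input file, this could be called as check_urls(lines), where lines is the result of f.read().splitlines(), and each (url, status) pair could then be printed in the "200 -> www.yahoo.com" style. A status of None marks a URL that could not be reached.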
See here for examples of multiprocessing vs multithreading in Python.
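To get a feel for why threads help with I/O-bound work, the waiting can be simulated without touching the network. This is a rough sketch under stated assumptions: fake_request is a hypothetical stand-in that sleeps for 50 ms in place of a real HTTP call, and the counts (10 calls, 10 workers) are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(delay=0.05):
    """Stand-in for a network call: the thread only waits, as it would on a socket."""
    time.sleep(delay)
    return 200


def timed(run):
    """Call run() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


def run_sequential(n):
    """Make n simulated requests one after another."""
    return [fake_request() for _ in range(n)]


def run_threaded(n, workers=10):
    """Make n simulated requests on a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: fake_request(), range(n)))


seq_result, seq_time = timed(lambda: run_sequential(10))
thr_result, thr_time = timed(lambda: run_threaded(10))
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The sequential run has to sit through every wait in turn, while the threaded run overlaps them, so it should finish in a fraction of the time. Real requests behave the same way as long as the time is dominated by waiting for the server.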