Scrapy Splash crawler: ReactorNotRestartable
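Problem description
I developed a Scrapy Splash scraper on Windows 10 using Visual Studio Code.
When I run my scraper like this, without the runner.py file, it works and writes the scraped content to "out.json":
scrapy crawl mytest -o out.json
However, when I run the scraper in debug mode in Visual Studio Code using the runner.py file, it fails on the execute line (full code below):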
Exception has occurred: ReactorNotRestartable
exception: no description
  File "C:\scrapy\hw_spiders\spiders\runner.py", line 8, in <module>
    execute(
I have already checked:
- Scrapy - Reactor not Restartable
- Scrapy raises ReactorNotRestartable when CrawlerProcess is ran twice
- ReactorNotRestartable error in while loop with scrapy
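Judging from these posts, the problem seems to occur when a second crawl is started (for example, calling crawl several times instead of starting it once). However, I cannot see where I would be doing that.
I also read there about potential problems with while loops and the Twisted reactor, but I cannot see those in my code either.
So I currently don't know where I need to fix my code.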
runner.py
# https://newbedev.com/debugging-scrapy-project-in-visual-studio-code
import os
from scrapy.cmdline import execute

os.chdir(os.path.dirname(os.path.realpath(__file__)))

try:
    execute(
        [
            'scrapy',
            'crawl',
            'mytest',
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    pass
launch.json
{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}
settings.json
{
    "python.analysis.extraPaths": [
        "./hw_spiders"
    ]
}
middlewares.py
from scrapy import signals
from itemadapter import is_item, ItemAdapter


class MySpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
pipelines.py
from itemadapter import ItemAdapter


class MyPipeline:
    def process_item(self, item, spider):
        return item
settings.py
BOT_NAME = 'hw_spiders'
SPIDER_MODULES = ['hw_spiders.spiders']
NEWSPIDER_MODULE = 'hw_spiders.spiders'
ROBOTSTXT_OBEY = True
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    # 'hw_spiders.middlewares.MySpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    # 'hw_spiders.middlewares.MyDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}
SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
ROBOTSTXT_OBEY = False
mytest.py
import json
import re
import os
import scrapy
import time
from scrapy_splash import SplashRequest
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from ..myitems import RentalItem


class MyTest_Spider(scrapy.Spider):
    name = 'mytest'
    start_urls = ['<hidden>']

    def start_requests(self):
        yield SplashRequest(
            self.start_urls[0], self.parse
        )

    def parse(self, response):
        object_links = response.css('div.wrapper div.inner33 > a::attr(href)').getall()
        for link in object_links:
            yield scrapy.Request(link, self.parse_object)
        next_page = response.css('div.nav-links a.next.page-numbers::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_object(self, response):
        item = RentalItem()
        item['url'] = response.url
        object_features = response.css('table.info tr')
        for feature in object_features:
            try:
                feature_title = feature.css('th::text').get().strip()
                feature_info = feature.css('td::text').get().strip()
            except AttributeError:
                # .get() returned None because the row has no th/td text
                continue
        item['thumbnails'] = response.css("ul#objects li a img::attr(src)").getall()
        yield item
Update 1
I have now removed runner.py from my project and only have .vscode\launch.json.
When I open the file mytest.py in Visual Studio Code and press F5 to debug, I see the following output:
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Try the new cross-platform PowerShell https://aka.ms/pscore6
PS C:\scrapy\hw_spiders> & 'C:\Users\Adam\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Adam\.vscode\extensions\ms-python.python-2021.11.1422169775\pythonFiles\lib\python\debugpy\launcher' '51812' '--' '-m' 'scrapy' 'runspider' 'c:\scrapy\hw_spiders\spiders\mytest.py'
2021-11-19 14:19:02 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: hw_spiders)
2021-11-19 14:19:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform Windows-10-10.0.19041-SP0
2021-11-19 14:19:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
Usage
=====
scrapy runspider [options] <spider_file>
runspider: error: Unable to load 'c:\scrapy\hw_spiders\spiders\mytest.py': attempted relative import with no known parent package
This must be the line from ..myitems import RentalItem, but I don't know why it fails.
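The error message above can be reproduced without Scrapy. The sketch below builds a throwaway package (the name demo_pkg is hypothetical; the question's package is hw_spiders) and loads the same spider file in two ways: once by file path with no parent package, which is roughly what happens when a single file is handed to runspider, and once through its package. It is an illustration under those assumptions, not Scrapy's actual loading code.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Throwaway layout that mirrors the question (demo_pkg is a hypothetical name):
#   demo_pkg/myitems.py          defines RentalItem
#   demo_pkg/spiders/mytest.py   does "from ..myitems import RentalItem"
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
(pkg / "spiders").mkdir(parents=True)
(pkg / "__init__.py").write_text("")
(pkg / "spiders" / "__init__.py").write_text("")
(pkg / "myitems.py").write_text("class RentalItem(dict):\n    pass\n")
spider_file = pkg / "spiders" / "mytest.py"
spider_file.write_text("from ..myitems import RentalItem\n")


def load_as_standalone_file(path):
    """Load a .py file by path as a top-level module with no parent package."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# 1) Loaded as a standalone file: there is no parent package,
#    so the ".." in the relative import cannot be resolved.
try:
    load_as_standalone_file(spider_file)
    standalone_error = None
except ImportError as exc:
    standalone_error = str(exc)
print("standalone:", standalone_error)

# 2) Imported through its package: ".." resolves to demo_pkg.
sys.path.insert(0, str(root))
module = importlib.import_module("demo_pkg.spiders.mytest")
print("as package:", module.RentalItem.__name__)
```

The question already reports that scrapy crawl mytest -o out.json works; that command finds the spider through SPIDER_MODULES, so the file is imported as part of its package. A launch configuration that passes crawl mytest instead of runspider ${file} would presumably take the same route, although that is a suggestion and is not stated in the answer.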
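Recommended answer
You should either create a runner.py file and run it with the default Python launch configuration, or have no runner.py file and use the Scrapy launch.json (as shown in your question), but not both.
It looks like the article linked in your question simply copied all the answers from this Stack Overflow question and combined them without context.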
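One general Python behaviour is worth keeping in mind when combining the two approaches: any tool that imports runner.py, rather than running it as the main script, still executes its top-level code, including an unguarded execute(...) call. The sketch below shows this with two hypothetical runner files in which a list append stands in for scrapy.cmdline.execute, so it runs without Scrapy installed. It illustrates the import behaviour only and is not a confirmed diagnosis of the traceback in the question.

```python
import importlib.util
import tempfile
from pathlib import Path

# Two hypothetical runner scripts. Appending to "calls" stands in for
# scrapy.cmdline.execute([...]); the file names are made up for this sketch.
UNGUARDED_SOURCE = (
    "calls = []\n"
    "calls.append('execute')  # runs whenever the file is imported\n"
)
GUARDED_SOURCE = (
    "calls = []\n"
    "\n"
    "def main():\n"
    "    calls.append('execute')\n"
    "\n"
    "if __name__ == '__main__':\n"
    "    main()  # runs only when the file is the main script\n"
)


def import_from_path(name, path):
    """Import a .py file by path, as a tool that loads the file would."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


workdir = Path(tempfile.mkdtemp())
unguarded_path = workdir / "runner_unguarded.py"
guarded_path = workdir / "runner_guarded.py"
unguarded_path.write_text(UNGUARDED_SOURCE)
guarded_path.write_text(GUARDED_SOURCE)

unguarded = import_from_path("runner_unguarded", unguarded_path)
guarded = import_from_path("runner_guarded", guarded_path)

# Importing the unguarded file already triggered the stand-in call;
# importing the guarded file did not.
print("unguarded calls:", unguarded.calls)
print("guarded calls:", guarded.calls)
```

If a runner.py is kept, wrapping the execute(...) call in an if __name__ == '__main__': guard means it starts a crawl only when run directly.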