How to fix callback inside the spider in python scrapy? - python-3.x

I'm creating a web scraper and want to callback to get sub-pages, but it seems not working correctly and no result, any one can help ?
Here is my code
class YellSpider(scrapy.Spider):
name = "yell"
start_urls = [url, ]
def parse(self, response):
pageNum = 0
pages = len(response.xpath(".//div[#class='col-sm-14 col-md-16 col-lg-14 text-center']/*"))
for page in range(pages):
pageNum = page + 1
for x in range(5):
num = random.randint(5, 8)
time.sleep(num)
for item in response.xpath(".//a[#href][contains(#href,'/#view=map')][contains(#href,'/biz/')]"):
subcategory = base_url + item.xpath("./#href").extract_first().replace("/#view=map", "")
sub_req = scrapy.Request(subcategory, callback=self.parse_details)
yield sub_req
next_page = base_url + "ucs/UcsSearchAction.do?&selectedClassification=" + classification + "&keywords=" + keyword + "&location=" + city + "&pageNum=" + str(
pageNum + 1)
if next_page:
yield scrapy.Request(next_page, self.parse)
def parse_details(self, sub_req):
for x in range(5):
num = random.randint(1, 5)
# time.sleep(num)
name = sub_req.xpath(".//h1[#class='text-h1 businessCard--businessName']/text()").extract_first()
address = " ".join(
sub_req.xpath(".//span[#class='address'][#itemprop='address']/child::node()/text()").extract())
telephone = sub_req.xpath(".//span[#class='business--telephoneNumber']/text()").extract_first()
web = sub_req.xpath(
".//div[#class='row flexColumns-sm-order-8 floatedColumns-md-right floatedColumns-lg-right floatedColumns-md-19 floatedColumns-lg-19']//a[#itemprop='url']/#href").extract_first()
hours = ""
overview = ""
yield {
'Name': name,
'Address': address,
'Telephone': telephone,
'Web Site': web
}
I want to callback from response to parse_details
I expect to loop over all adds in sub_req and scrape the data from it
and here is the logfile:
2019-06-17 15:50:33 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Web_Scraper)
2019-06-17 15:50:33 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-10-10.0.16299-SP0
2019-06-17 15:50:33 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Web_Scraper', 'LOG_FILE': 'output.log', 'NEWSPIDER_MODULE': 'Web_Scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Web_Scraper.spiders']}
2019-06-17 15:50:33 [scrapy.extensions.telnet] INFO: Telnet Password: 91c423015a2cd984
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-06-17 15:50:34 [scrapy.core.engine] INFO: Spider opened
2019-06-17 15:50:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-17 15:50:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-17 15:50:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/robots.txt> (referer: None)
2019-06-17 15:50:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1> (referer: None)
2019-06-17 15:50:49 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.yell.com//biz/pennine-tuition-services-liverpool-9465687> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/> from <GET https://www.yell.com//biz/pennine-tuition-services-liverpool-9465687>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/> from <GET https://www.yell.com//biz/home-maths-tutoring-liverpool-7467622>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/> from <GET https://www.yell.com//biz/kumon-maths-and-english-wallasey-8945913>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tution-wallasey-7574939/> from <GET https://www.yell.com//biz/maths-tution-wallasey-7574939>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/> from <GET https://www.yell.com//biz/patrick-haslam-tutoring-liverpool-8777349>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tution-wallasey-8327361/> from <GET https://www.yell.com//biz/maths-tution-wallasey-8327361>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/> from <GET https://www.yell.com//biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/> from <GET https://www.yell.com//biz/tutor-services-liverpool-liverpool-8134849>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/> from <GET https://www.yell.com//biz/advanced-maths-tutorials-liverpool-3755223>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/> from <GET https://www.yell.com//biz/kumon-maths-and-english-liverpool-8903743>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/> from <GET https://www.yell.com//biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/north-west-tutors-liverpool-901511208/> from <GET https://www.yell.com//biz/north-west-tutors-liverpool-901511208>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/triple-m-education-prenton-8934754/> from <GET https://www.yell.com//biz/triple-m-education-prenton-8934754>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tution-wallasey-7574939/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tuition-wallasey-901339881/> from <GET https://www.yell.com//biz/maths-tuition-wallasey-901339881>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tution-wallasey-8327361/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tution-wallasey-7574939/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/activett-liverpool-901311152/> from <GET https://www.yell.com//biz/activett-liverpool-901311152>
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=4> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/> from <GET https://www.yell.com//biz/liz-beattie-tutoring-prenton-7618961>
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/askademia-liverpool-8680035> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tution-wallasey-8327361/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/north-west-tutors-liverpool-901511208/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/triple-m-education-prenton-8934754/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/1-2-1-tutoring-liverpool-901224945> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tuition-wallasey-901339881/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/dmw-tuition-ltd-liverpool-6887458> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/askademia-liverpool-8680035>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/wallasey-tuition-wallasey-7390339> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/zenitheducators-liverpool-7791342> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/explore-learning-liverpool-901511688> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/north-west-tutors-liverpool-901511208/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/edes-educational-centre-liverpool-8380869> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/triple-m-education-prenton-8934754/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/love2teach-liverpool-liverpool-9678322> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/activett-liverpool-901311152/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/1-2-1-tutoring-liverpool-901224945>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tuition-wallasey-901339881/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/guaranteed-grades-liverpool-7368523> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/dmw-tuition-ltd-liverpool-6887458>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/wallasey-tuition-wallasey-7390339>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/zenitheducators-liverpool-7791342>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=4)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/explore-learning-liverpool-901511688>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/edes-educational-centre-liverpool-8380869>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/love2teach-liverpool-liverpool-9678322>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/activett-liverpool-901311152/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/guaranteed-grades-liverpool-7368523>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-17 15:50:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 21544,
'downloader/request_count': 45,
'downloader/request_method_count/GET': 45,
'downloader/response_bytes': 257613,
'downloader/response_count': 45,
'downloader/response_status_count/200': 29,
'downloader/response_status_count/301': 16,
'dupefilter/filtered': 51,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 6, 17, 13, 50, 57, 376036),
'item_scraped_count': 25,
'log_count/DEBUG': 71,
'log_count/INFO': 9,
'request_depth_max': 3,
'response_received_count': 29,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 44,
'scheduler/dequeued/memory': 44,
'scheduler/enqueued': 44,
'scheduler/enqueued/memory': 44,
'start_time': datetime.datetime(2019, 6, 17, 13, 50, 34, 114473)}
2019-06-17 15:50:57 [scrapy.core.engine] INFO: Spider closed (finished)

Related

Unsure how to troubleshoot Scrapy terminal output

and thanks in advance.
I'm attempting to use scrapy, which is somewhat new to me. I built (what I thought was) a simple spider which does the following:
class SuperSpider(CrawlSpider):
name = 'KYM_entries'
start_urls = ['https://knowyourmeme.com/memes/all/page/1']
def parse(self, response):
for entry in response.xpath('/html/body/div[3]/div/div[3]/section'):
yield {
# The link to a meme entry page on Know Your Meme
'entry_link': entry.xpath('./td[2]/a/#href').get()
}
Then I run the following in a terminal window:
$ scrapy crawl KYM_entries -O practice.csv
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
2022-12-26 20:08:04 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: KYM_spider)
2022-12-26 20:08:04 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0], pyOpenSSL 21.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 3.4.8, Platform Linux-5.15.0-56-generic-x86_64-with-glibc2.35
2022-12-26 20:08:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'KYM_spider',
'NEWSPIDER_MODULE': 'KYM_spider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['KYM_spider.spiders']}
2022-12-26 20:08:04 [py.warnings] WARNING: /usr/local/lib/python3.10/dist-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2022-12-26 20:08:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-26 20:08:04 [scrapy.extensions.telnet] INFO: Telnet Password: 97ac3d17f1e4cea1
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-26 20:08:04 [scrapy.core.engine] INFO: Spider opened
2022-12-26 20:08:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-26 20:08:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-26 20:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/robots.txt> (referer: None)
2022-12-26 20:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/memes/all/page/1> (referer: None)
2022-12-26 20:08:05 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-26 20:08:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 466,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 11690,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.953839,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 27, 1, 8, 5, 833510),
'httpcompression/response_bytes': 45804,
'httpcompression/response_count': 2,
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 65228800,
'memusage/startup': 65228800,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 12, 27, 1, 8, 4, 879671)}
2022-12-26 20:08:05 [scrapy.core.engine] INFO: Spider closed (finished)
This returns an empty CSV, which I suppose means either something is wrong with the xpath, or there is something wrong with the connection to Know Your Meme. However, beyond the 200 code saying it is connecting to the site, I'm unsure how to troubleshoot what is happening here.
So I have a couple questions, one more direct to my issue, and one more broadly interested in this output:
Is there a way to see at what point my script is failing to retrieve the specified data in the xpath for this particular case?
Is there a simple guide or reference for how to read scrapy output?
I have looked into your code. There are a few issues with the selectors/XPath. I have updated the CSS selector and removed the XPATH. meme URLs are relative URLs so I have added urljoin method to make these URLs absolute URLs. I have added start_request method as my version of scrapy is 2.6.0. if you are using a lower version of scrapy (1.6.0) you can remove this method.
class SuperSpider(CrawlSpider):
name = 'KYM_entries'
start_urls = ['https://knowyourmeme.com/memes/all/page/1']
def start_requests(self):
yield Request(self.start_urls[0], callback=self.parse)
def parse(self, response):
for entry in response.css('.entry-grid-body .photo'):
yield {
# The link to a meme entry page on Know Your Meme
'entry_link': response.urljoin(entry.css('::attr(href)').get())
}
The code is working fine now. Below is the output.
2022-12-27 13:14:52 [scrapy.core.engine] INFO: Spider opened
2022-12-27 13:14:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-27 13:14:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-27 13:14:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/memes/all/page/1> (referer: None)
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/mayinquangcao'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/this-is-x-bitch-we-clown-in-this-muthafucka-betta-take-yo-sensitive-ass-back-to-y'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/choo-choo-charles'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/bug-fables-the-everlasting-sapling'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/onii-holding-a-picture'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/vintage-recipe-videos'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/ytpmv-elf'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/i-just-hit-a-dog-going-70mph-on-my-truck'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/women-dodging-accountability'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/grinchs-ultimatum'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/where-is-idos-black-and-white'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/basilisk-time'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/rankinbass-productions'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/error143'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/whatsapp-university'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/messi-autism-speculation-messi-is-autistic'}

Extractring <li> and <ul> using scrapy

I'm new to Scrapy but I'm running into an issue forming an accurate selector based on scrapy's tutorial code basically I'm trying to list all business,their Address and their website. But when I try to list them only one result comes out (if i set all of them to getall then i'm getting all of them just they are thrown there randomly and i need them in format:
{"address": "mazowieckie, Warszawa", "name": "Dom Development S.A.", "link": "domd.pl"})
Here is code that I use:
class RynekMainSpider(scrapy.Spider):
name = "RynekMain"
start_urls = [
'https://rynekpierwotny.pl/deweloperzy/?page=1',
]
def parse(self, response):
for quote in response.css('ul.rp-1qtpzi4'):
yield {
'address': quote.css('address.rp-o9b83y::text').get(),
'name': quote.css('h2.rp-69f2r4::text').get(),
'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
}
```
Thanks in advance.
You are getting only one output because the element selection/locator strategy ul.rp-1qtpzi4 is incorrect meaning it's not selecting all the lising from the entire page but the correct selection like
.rp-y89gny.eboilu01 ul li select all 24 items
import scrapy
from scrapy.crawler import CrawlerProcess
class RynekMainSpider(scrapy.Spider):
name = "RynekMain"
start_urls = [
'https://rynekpierwotny.pl/deweloperzy/?page=1']
def parse(self, response):
for quote in response.css('.rp-y89gny.eboilu01 ul li'):
yield {
'address': quote.css('address.rp-o9b83y::text').get(),
'name': quote.css('h2.rp-69f2r4::text').get(),
'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
}
if __name__ == "__main__":
process =CrawlerProcess()
process.crawl(RynekMainSpider)
process.start()
Output:
{'address': 'mazowieckie, Warszawa', 'name': 'Dom Development S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/dom-development-sa-955/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Ronson Development Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/ronson-development-sp-z-oo-863/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Echo Investment S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/echo-investment-sa-7478/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław, Psie Pole', 'name': 'INTER-ES Deweloper', 'link': 'https://rynekpierwotny.pl/deweloperzy/inter-es-deweloper-928/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'śląskie, Bielsko-Biała', 'name': 'Murapol S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/murapol-sa-884/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Robyg S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/robyg-sa-888/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'śląskie, cieszyński, Cieszyn', 'name': 'ATAL S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/atal-sa-1084/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'zachodniopomorskie, Szczecin', 'name': 'Assethome – Przedstawiciel Dewelopera', 'link': 'https://rynekpierwotny.pl/deweloperzy/asset-home-przedstawiciel-dewelopera-7429/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Hreit', 'link': 'https://rynekpierwotny.pl/deweloperzy/hreit-7892/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław', 'name': 'Develia', 'link': 'https://rynekpierwotny.pl/deweloperzy/develia-1048/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław, Fabryczna', 'name': 'PROFIT Development', 'link': 'https://rynekpierwotny.pl/deweloperzy/profit-development-940/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Novisa Development Sp. z o.o. Sp. J.', 'link': 'https://rynekpierwotny.pl/deweloperzy/novisa-development-sp-z-oo-sp-j-484/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'pomorskie, Gdańsk', 'name': 'Robyg', 'link': 'https://rynekpierwotny.pl/deweloperzy/robyg-grupa-deweloperska-4251/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Arche S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/arche-sp-z-oo-934/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'warmińsko-mazurskie, ełcki, Ełk', 'name': 'Rutkowski Development Sp. J.', 'link': 'https://rynekpierwotny.pl/deweloperzy/rutkowski-development-sp-j-1846/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Cordia Polska', 'link': 'https://rynekpierwotny.pl/deweloperzy/cordia-polska-3824/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Budlex Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/budlex-sp-z-oo-1684/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'pomorskie, Gdańsk', 'name': 'Euro Styl S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/euro-styl-sa-964/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'łódzkie, Skierniewice', 'name': 'JHM DEVELOPMENT S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/jhm-development-sa-892/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław', 'name': 'Lokum Deweloper', 'link': 'https://rynekpierwotny.pl/deweloperzy/lokum-deweloper-948/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'podlaskie, Łomża', 'name': 'Eldor Bud Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/eldor-bud-sp-z-oo-4355/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Nexity Polska Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/nexity-polska-sp-z-oo-2856/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Spravia', 'link': 'https://rynekpierwotny.pl/deweloperzy/spravia-1236/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'małopolskie, Kraków', 'name': 'Bryksy', 'link': 'https://rynekpierwotny.pl/deweloperzy/bryksy-914/'}
'item_scraped_count': 24,,
response.css('ul.rp-1qtpzi4') will get you the container of the items, and not the items (li tag) themselves. So you're looping over the container (once) and getting just the first item.
Change it to:
import scrapy
class RynekMainSpider(scrapy.Spider):
name = "RynekMain"
start_urls = [
'https://rynekpierwotny.pl/deweloperzy/?page=1',
]
def parse(self, response):
for quote in response.css('ul.rp-1qtpzi4 li'):
yield {
'address': quote.css('address.rp-o9b83y::text').get(),
'name': quote.css('h2.rp-69f2r4::text').get(),
'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
}

Stuck in a loop / bad Xpath when doing scraping of website

I'm trying to scrape data from this website: https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM
I have made the following script for the intial data:
import scrapy
class WaiascrapSpider(scrapy.Spider):
name = 'waiascrap'
allowed_domains = ['clsaa-dc.org']
start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']
def parse(self, response):
rows = response.xpath("//tr")
for row in rows:
day = rows.xpath("(//tr/td[#class='time']/span)[1]/text()").get()
time = rows.xpath("//tr/td[#class='time']/span/time/text()").get()
yield{
'day': day,
'time': time,
}
however the data I'm getting is repeated, like if I'm not navigating the For cycle:
PS C:\Users\gasgu\PycharmProjects\ScrapingProject\projects\waia>
scrapy crawl waiascrap 2021-08-20 15:25:11 [scrapy.utils.log] INFO:
Scrapy 2.5.0 started (bot: waia) 2021-08-20 15:25:11
[scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5,
cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python
3.9.6 (t ags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography
3.4.7, Platform Windows-10-
10.0.19042-SP0 2021-08-20 15:25:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor 2021-08-20
15:25:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME':
'waia', 'NEWSPIDER_MODULE': 'waia.spiders', 'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['waia.spiders']} 2021-08-20 15:25:11
[scrapy.extensions.telnet] INFO: Telnet Password: 9299b6be5840b21c
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats'] 2021-08-20 15:25:11
[scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2021-08-20
15:25:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2021-08-20 15:25:11
[scrapy.middleware] INFO: Enabled item pipelines: [] 2021-08-20
15:25:11 [scrapy.core.engine] INFO: Spider opened 2021-08-20 15:25:11
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min) 2021-08-20 15:25:11
[scrapy.extensions.telnet] INFO: Telnet console listening on
127.0.0.1:6023 2021-08-20 15:25:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://aa-dc.org/robots.txt> (referer: None) 2021-08-20
15:25:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> (referer: None)
2021-08-20 15:25:16 [scrapy.core.scraper] DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:19 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:22 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:26 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:29 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:32 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:35 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'} 2021-08-20 15:25:39 [scrapy.core.scraper]
DEBUG: Scraped from <200
https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> {'day':
'Sunday', 'time': '6:45 am'}
EDIT:
now it's working, there was a combination of the errors marked by #Prophet, and a problem with my Xpath.
I'm putting my code working below:
import scrapy
class WaiascrapSpider(scrapy.Spider):
name = 'waiascrap'
allowed_domains = ['clsaa-dc.org']
start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']
def parse(self, response):
rows = response.xpath("//tr")
for row in rows:
day = row.xpath(".//td[#class='time']/span/text()").get()
time = row.xpath(".//td[#class='time']/span/time/text()").get()
yield {
'day': day,
'time': time,
}
To select element inside element you have to put a dot . in front of the XPath expression saying "from here".
Otherwise it will bring you the first match of (//tr/td[#class='time']/span)[1]/text() on the entire page each time, as you see.
Also, since you are iterating per each row it should be row.xpath..., not rows.xpath since rows is a list of elements while each row is an element.
Also, to apply search on a web element according to XPath locator you should use find_element_by_xpath method, not xpath.
def parse(self, response):
rows = response.xpath("//tr")
for row in rows:
day = row.find_element_by_xpath(".(//tr/td[#class='time']/span)[1]/text()").get()
time = row.find_element_by_xpath("//.tr/td[#class='time']/span/time/text()").get()
yield{
'day': day,
'time': time,
}

Scrapy pagination: Unable to paginate

First of all, thank you if you are reading this.
I have been using Python with scrapy to scrape minor data, however, I want to pull in some additional information but I got stuck on pagination.
The website is https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html
The element is
<span class="jslink pg-btn page-next" data-href="https://home.mobile.de/regional/baden-württemberg/2.html" title="Zur nächsten Seite"> </span>
element
What is the xpath expression I can use in Rule(LinkExtractor(restrict_xpaths="")?
I'm using crawl template.
My code so far:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class Baden1Spider(CrawlSpider):
name = 'baden1'
allowed_domains = ['home.mobile.de']
start_urls = ['https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html?fbclid=IwAR0MpRTx1TrrrBdg2cKr5E08QiP4fE-pjOAwb7_UsEytToJmWFEfpdD6X0w/']
rules = (
Rule(LinkExtractor(restrict_xpaths="//div[#class='box']/div[#class='row ']"), callback='parse_item', follow=True),
# Rule(LinkExtractor(restrict_xpaths="//span[#class='jslink pg-btn page-next']"))
)
def parse_item(self, response):
yield{
'Dealer Name': response.xpath("//address[#class='fullAddress']/strong/text()").get(),
'Street': response.xpath("normalize-space(//div[contains(#class, 'addressData')]/text())").get(),
'ZIP Code': response.xpath("normalize-space(//div[contains(#class, 'addressData')]/text()/following::text()[1])").get().split()[0],
'City': response.xpath("normalize-space(//div[contains(#class, 'addressData')]/text()/following::text()[1])").get().split()[1],
'Phone Number 1': response.xpath("normalize-space(//div[contains(#class, 'dealerContactPhoneNumbers')]/text())").get(),
'Phone Number 2': response.xpath("normalize-space(//div[contains(#class, 'dealerContactPhoneNumbers')]/text()/following::text()[1])").get(),
'Source': response.url
}
N.B. This is my first post here in stackoverflow. If I made any mistake, pardon me.
Here is the pagination:
Your code is working fine. Starting url: " https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html" is same as your mentioned. If you click on the first page then you will get this url, from where data is generating. I make the pagination in start_urls using list comprehension. Now You can increase or decrease range of page numbers at anytime. Here I scrape only five pages and you can scrape total pages or whatever you wish just put the page numbers inside the range. I scrape 5 pages total 160 items.
CODE:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class Baden1Spider(CrawlSpider):
name = 'baden1'
allowed_domains = ['home.mobile.de']
start_urls = ['https://home.mobile.de/regional/baden-w%C3%BCrttemberg/'+ str(x) +'.html' for x in range(0,5)]
rules = (
Rule(LinkExtractor(restrict_xpaths="//div[#class='box']/div[#class='row ']"), callback='parse_item', follow=True),
# Rule(LinkExtractor(restrict_xpaths="//span[#class='jslink pg-btn page-next']"))
)
def parse_item(self, response):
yield{
'Dealer Name': response.xpath("//address[#class='fullAddress']/strong/text()").get(),
'Street': response.xpath("normalize-space(//div[contains(#class, 'addressData')]/text())").get(),
'ZIP Code': response.xpath("normalize-space(//div[contains(#class, 'addressData')]/text()/following::text()[1])").get().split()[0],
'City': response.xpath("normalize-space(//div[contains(#class, 'addressData')]/text()/following::text()[1])").get().split()[1],
'Phone Number 1': response.xpath("normalize-space(//div[contains(#class, 'dealerContactPhoneNumbers')]/text())").get(),
'Phone Number 2': response.xpath("normalize-space(//div[contains(#class, 'dealerContactPhoneNumbers')]/text()/following::text()[1])").get(),
'Source': response.url
}
OUTPUT: A portion of total output.
'Dealer Name': 'Abbas KfZ An- und Verkauf', 'Street': 'schießstattweg 18', 'ZIP Code': '88677', 'City': 'Markdorf', 'Phone Number 1': 'Tel.:\xa0+49 (0)176 56730811', 'Phone Number 2': '', 'Source': 'https://home.mobile.de/ABBASKFZANUNDVERKAUF'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/SCHAIBLEMASCHINENHANDEL>
{'Dealer Name': 'Schaible Maschinenhandel', 'Street': 'In Oberwiesen 7', 'ZIP Code': '88682', 'City': 'Salem', 'Phone Number 1': 'Tel.:\xa0+49 (0)7553 60146', 'Phone Number 2': 'Mobiltelefon:\xa0+49 (0)171 7998515', 'Source': 'https://home.mobile.de/SCHAIBLEMASCHINENHANDEL'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/RUSH-AUTOMOBILE>
{'Dealer Name': 'RUSH Automobile UG (haftungsbeschränkt)', 'Street': 'Hallendorferstrasse 6', 'ZIP Code': '88690', 'City': 'Uhldingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7551 949277', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)171 3608800', 'Source': 'https://home.mobile.de/RUSH-AUTOMOBILE'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/FIRST-CLASS-AUTOMOBILE> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AH-MUTTER> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/LOCFAHRZEUGE>
{'Dealer Name': 'LOC Fahrzeuge OHG', 'Street': 'Meersburger Straße 2', 'ZIP Code': '88690', 'City': 'Uhldingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7556 928597', 'Phone Number 2': 'Fax:\xa0+49 (0)7556 928583', 'Source': 'https://home.mobile.de/LOCFAHRZEUGE'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AH-SCHMID-BERMATINGEN> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/FIRST-CLASS-AUTOMOBILE>
{'Dealer Name': 'First Class Automobile Seit 1989', 'Street': 'Büro: Oberer Höhenweg 29', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)176 20491640', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7544 91111', 'Source': 'https://home.mobile.de/FIRST-CLASS-AUTOMOBILE'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AH-MUTTER>
{'Dealer Name': 'Autohaus Matthias Mutter', 'Street': 'Salemerstrasse 42', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 912100', 'Phone Number 2': 'Fax:\xa0+49 (0)7544 91110', 'Source': 'https://home.mobile.de/AH-MUTTER'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AH-SCHMID-BERMATINGEN>
{'Dealer Name': 'Autohaus Schmid', 'Street': 'Salemer Straße 30', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 7544 2375', 'Phone Number 2': 'Fax:\xa0+49 7544 1355', 'Source': 'https://home.mobile.de/AH-SCHMID-BERMATINGEN'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/YAMAHA-NESENSOHN> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/YAMAHA-NESENSOHN>
{'Dealer Name': 'Yamaha Nesensohn', 'Street': 'Salemerstrasse 51', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 2902', 'Phone Number 2': 'Fax:\xa0+49 (0)7544 73025', 'Source': 'https://home.mobile.de/YAMAHA-NESENSOHN'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUS-KIRCHHOFF> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOMOBILEREHM> (referer:
https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUS-KIRCHHOFF>
{'Dealer Name': 'Autohaus Kirchhoff', 'Street': 'Am Luckengraben 4', 'ZIP Code': '88699', 'City': 'Frickingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7554 8450', 'Phone Number 2': 'Fax:\xa0+49 (0)7554 8252', 'Source': 'https://home.mobile.de/AUTOHAUS-KIRCHHOFF'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUSREICHLEOHG> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOMOBILEREHM>
{'Dealer Name': 'Automobile Rehm', 'Street': 'Heidbühlstr. 9', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 175 2234111', 'Phone Number 2': '', 'Source': 'https://home.mobile.de/AUTOMOBILEREHM'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG>
{'Dealer Name': 'Autohaus Sailer GmbH & Co.KG', 'Street': 'Hofäckerstr. 1', 'ZIP Code': '88697', 'City': 'Bermatingen-Ahausen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 968300', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7544 9683018', 'Source': 'https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG'}
2021-08-06 12:40:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1>
{'Dealer Name': 'Patrick Kayser', 'Street': 'Langbrühl 6', 'ZIP Code': '88709', 'City': 'Hagnau', 'Phone Number 1':
'Tel.:\xa0+49 (0)178 6524858', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7532 4458081', 'Source': 'https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1'}
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUSREICHLEOHG>
{'Dealer Name': 'Autohaus Reichle OHG', 'Street': 'Hauptstraße 57', 'ZIP Code': '88699', 'City': 'Frickingen-Altheim', 'Phone Number 1': 'Tel.:\xa0+49 7554 8337', 'Phone Number 2': 'Mobiltelefon:\xa0+49 151 65828855', 'Source': 'https://home.mobile.de/AUTOHAUSREICHLEOHG'}
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE>
{'Dealer Name': 'Lackiermeisterbetrieb & KFZ Service', 'Street': 'Lippertsreuterstr. 6b', 'ZIP Code': '88699', 'City': 'Frickingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7554 9892115', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)1525 2160629', 'Source': 'https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE'}
2021-08-06 12:40:15 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-06 12:40:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369317,
'downloader/request_count': 165,
'downloader/request_method_count/GET': 165,
'downloader/response_bytes': 2468479,
'downloader/response_count': 165,
'downloader/response_status_count/200': 165,
'elapsed_time_seconds': 17.246198,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 6, 6, 40, 15, 130449),
'httpcompression/response_bytes': 6481573,
'httpcompression/response_count': 165,
'item_scraped_count': 160,

Scrapy: follow links to scrape additional information for each item

I am trying to scrape a website that has some info about 15 articles on each page. For each article, I'd like to get the title, date, and then follow the "Read More" link to get additional information (e.g., the source of the article).
So far, I could successfully scrape the title and date for each article in all pages and store them in a CSV file.
My problem is that I couldn't follow the Read More link to get the additional info (source) for each article. I read a lot of similar questions and their answers, but I could not fix it yet.
Here is my code:
import scrapy
class PoynterFakenewsSpider(scrapy.Spider):
name = 'Poynter_FakeNews'
allowed_domains = ['poynter.org']
start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']
custom_settings={ 'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}
def parse(self, response):
print("procesing:"+response.url)
Title = response.xpath('//h2[#class="entry-title"]/a/text()').extract()
Date = response.xpath('//p[#class="entry-content__text"]/strong/text()').extract()
ReadMore_links = response.xpath('//a[#class="button entry-content__button entry-content__button--smaller"]/#href').extract()
for link in ReadMore_links:
yield scrapy.Request(response.urljoin(links, callback=self.parsepage2)
def parsepage2(self, response):
Source = response.xpath('//p[#class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
return Source
row_data = zip(Title, Date, Source)
for item in row_data:
scraped_info = {
'page':response.url,
'Title': item[0],
'Date': item[1],
'Source': item[2],
}
yield scraped_info
next_page = response.xpath('//a[#class="next page-numbers"]/#href').extract_first()
if next_page:
yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
You need to process each article, get Date, Title and "Read More" and next yield another scrapy.Request passing already collected information using cb_kwargs (or request.meta in old versions):
import scrapy
class PoynterFakenewsSpider(scrapy.Spider):
name = 'Poynter_FakeNews'
allowed_domains = ['poynter.org']
start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']
custom_settings={ 'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}
def parse(self, response):
for article in response.xpath('//article'):
Title = article.xpath('.//h2[#class="entry-title"]/a/text()').get()
Date = article.xpath('.//p[#class="entry-content__text"]/strong/text()').get()
ReadMore_link = article.xpath('.//a[#class="button entry-content__button entry-content__button--smaller"]/#href').get()
yield scrapy.Request(
url=response.urljoin(ReadMore_link),
callback=self.parse_article_details,
cb_kwargs={
'article_title': Title,
'article_date': Date,
}
)
next_page = response.xpath('//a[#class="next page-numbers"]/#href').extract_first()
if next_page:
yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
def parse_article_details(self, response, article_title, article_date):
Source = response.xpath('//p[#class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
scraped_info = {
'page':response.url,
'Title': article_title,
'Date': article_date,
'Source': Source,
}
yield scraped_info
UPDATE
Everything works correctly on my side:
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus>
{'page': 'https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus', 'Title': ' Japanese schools re-opened then were closed again due to a second wave of coronavirus.', 'Date': '2020/05/12 | France', 'Source': "This false claim originated from: CGT Educ'Action", 'files': []}
2020-05-14 00:59:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19>
{'page': 'https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19', 'Title': ' Famous French blue cheese, roquefort, is a “medecine against Covid-19”.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents>
{'page': 'https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents', 'Title': ' Administrative documents French people need to fill to go out are a copy paste from 1940 documents.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: IndignezVous', 'files': []}
2020-05-14 00:59:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable>
{'page': 'https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable', 'Title': ' Spanish and French masks prices are comparable.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown', 'Title': ' French President Macron and its spouse are jetskiing during the lockdown.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air', 'Title': ' French Minister of Justice Nicole Belloubet threathened the famous anchor Jean-Pierre Pernaut after he criticized the government policy about the pandemic on air.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
You might want to have look at the follow_all function, it is a better option than urljoin:
https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns

Resources