Why is the Scrapy image pipeline not downloading images?

I am trying to download all the images from the product gallery. I have tried the script below, but somehow I am not able to download the images. I did manage to download the main image, which has an id; the other images in the gallery do not have any id, and I failed to download them.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class BasicSpider(CrawlSpider):
    name = 'basic'
    allowed_domains = ['www.leebmann24.de']
    start_urls = ['https://www.leebmann24.de/bmw.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='category-products']/ul/li/h2/a"), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'URL': response.url,
            'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
            'image_urls': response.xpath("//div[@class='item']/a/img/@src").getall()
        }

@Raisul Islam, '//*[@id="image-main"]/@src' generates the image URL and I'm not getting any issues. Please check the output below and see whether it's what you expected.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class BasicSpider(CrawlSpider):
    name = 'basic'
    allowed_domains = ['www.leebmann24.de']
    start_urls = ['https://www.leebmann24.de/bmw.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='category-products']/ul/li/h2/a"), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'URL': response.url,
            'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
            'image_urls': response.xpath('//*[@id="image-main"]/@src').get()
        }
Output:
{'URL': 'https://www.leebmann24.de/aruma-antirutschmatte-3er-f30-f31.html', 'Price': '57,29\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/a/r/aruma-antirutschmatte-94452302924-1.jpg'}
2022-09-07 02:35:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html> (referer: https://www.leebmann24.de/bmw.html?p=2)
2022-09-07 02:35:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html>
{'URL': 'https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html', 'Price': '15,64\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/b/m/bmw-erste-hilfe-klarsichtbeutel-51477158433.jpg'}
2022-09-07 02:35:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.leebmann24.de/erste-hilfe-set.html> (failed 1 times): 503 Service Unavailable
2022-09-07 02:35:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html> (referer: https://www.leebmann24.de/bmw.html)
2022-09-07 02:35:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html>
{'URL': 'https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html', 'Price': '71,66\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/a/r/aruma-antirutschmatte-94452347734-1.jpg'}

This expression will get all the product images except the main one (you said you already have it):
'//div[@id="itemslider-zoom"]//a/@href'
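For the downloads themselves, Scrapy's ImagesPipeline also has to be enabled; otherwise the image_urls field is just exported as plain data. A minimal sketch of the wiring, assuming the gallery XPath above and an illustrative IMAGES_STORE directory (the directory name and the combined main/gallery extraction are assumptions, not taken from the posts):

# settings.py -- enable the built-in images pipeline (requires Pillow)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'downloaded_images'  # any writable directory

# in the spider: collect the main image plus the gallery images
def parse_item(self, response):
    main = response.xpath('//*[@id="image-main"]/@src').get()
    gallery = response.xpath('//div[@id="itemslider-zoom"]//a/@href').getall()
    yield {
        'URL': response.url,
        'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
        # ImagesPipeline downloads every URL listed here and reports the results under 'images'
        'image_urls': [u for u in [main, *gallery] if u],
    }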

Related

How to use Scrapy on a website that does not change the URL when changing language

As far as I can see, when the language button is pressed, the website https://www.learnit.nl/ fetches the English version by sending a POST request to https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1, and I don't know how to replicate that with Scrapy. I'll appreciate any help.
The data is in the JSON response of an API call made with the POST method, where the payload is a large JSON object. To replicate it with Scrapy, you can follow the next example:
import json
import scrapy
class CourseSpider(scrapy.Spider):
    name = 'course'
    body = ...  # add payload here (omitted; see the note below about its size)

    def start_requests(self):
        yield scrapy.Request(
            url='https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1',
            callback=self.parse,
            body=json.dumps(self.body),
            method="POST",
            headers={
            }
        )

    def parse(self, response):
        response = json.loads(response.body)
        for resp in response['to_words']:
            yield {
                'course': resp
            }
Output:
{'course': 'Writing clear texts'}
2022-04-28 22:03:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1>
{'course': 'HTML e-mail'}
2022-04-28 22:03:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1>
{'course': 'HTML and CSS Basics'}
2022-04-28 22:03:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1>
{'course': 'HTML and CSS Continued'}
2022-04-28 22:03:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1>
{'course': 'HTML Training E-learning'}
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.879555,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 28, 16, 3, 22, 536326),
'httpcompression/response_bytes': 36269,
'httpcompression/response_count': 1,
'item_scraped_count': 514,
... so on
The payload is a large JSON object and can't be posted here because of the size limit.
Full working code here
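Since the payload is too large to paste inline, one option is to keep it in a JSON file next to the spider and load it at start-up. This is only an illustrative sketch (the payload.json filename is an assumption, not part of the original answer):

import json
from pathlib import Path

# read the large request payload from a file instead of hard-coding it in the spider
body = json.loads(Path(__file__).with_name('payload.json').read_text(encoding='utf-8'))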

Stuck in a loop / bad XPath when scraping a website

I'm trying to scrape data from this website: https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM
I have made the following script for the initial data:
import scrapy
class WaiascrapSpider(scrapy.Spider):
    name = 'waiascrap'
    allowed_domains = ['clsaa-dc.org']
    start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']

    def parse(self, response):
        rows = response.xpath("//tr")
        for row in rows:
            day = rows.xpath("(//tr/td[@class='time']/span)[1]/text()").get()
            time = rows.xpath("//tr/td[@class='time']/span/time/text()").get()
            yield {
                'day': day,
                'time': time,
            }
However, the data I'm getting is repeated, as if I'm not actually iterating through the for loop:
PS C:\Users\gasgu\PycharmProjects\ScrapingProject\projects\waia> scrapy crawl waiascrap
2021-08-20 15:25:11 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: waia)
2021-08-20 15:25:11 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19042-SP0
2021-08-20 15:25:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-08-20 15:25:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'waia', 'NEWSPIDER_MODULE': 'waia.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['waia.spiders']}
2021-08-20 15:25:11 [scrapy.extensions.telnet] INFO: Telnet Password: 9299b6be5840b21c
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled item pipelines: []
2021-08-20 15:25:11 [scrapy.core.engine] INFO: Spider opened
2021-08-20 15:25:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-20 15:25:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-20 15:25:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://aa-dc.org/robots.txt> (referer: None)
2021-08-20 15:25:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> (referer: None)
2021-08-20 15:25:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
EDIT:
It's working now; there was a combination of the errors pointed out by @Prophet and a problem with my XPath.
I'm putting my working code below:
import scrapy
class WaiascrapSpider(scrapy.Spider):
    name = 'waiascrap'
    allowed_domains = ['clsaa-dc.org']
    start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']

    def parse(self, response):
        rows = response.xpath("//tr")
        for row in rows:
            day = row.xpath(".//td[@class='time']/span/text()").get()
            time = row.xpath(".//td[@class='time']/span/time/text()").get()
            yield {
                'day': day,
                'time': time,
            }
To select an element inside another element you have to put a dot . in front of the XPath expression, meaning "from here".
Otherwise it will bring you the first match of (//tr/td[@class='time']/span)[1]/text() on the entire page each time, as you can see.
Also, since you are iterating per row it should be row.xpath(...), not rows.xpath(...), because rows is a list of elements while each row is a single element.
Also, to apply search on a web element according to XPath locator you should use find_element_by_xpath method, not xpath.
def parse(self, response):
    rows = response.xpath("//tr")
    for row in rows:
        day = row.find_element_by_xpath(".(//tr/td[@class='time']/span)[1]/text()").get()
        time = row.find_element_by_xpath("//.tr/td[@class='time']/span/time/text()").get()
        yield {
            'day': day,
            'time': time,
        }
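Note that find_element_by_xpath is Selenium's API; Scrapy selectors expose .xpath() directly, so the relative, dot-prefixed form is applied to the row selector itself. A minimal sketch (equivalent to the working code in the EDIT above):

def parse(self, response):
    for row in response.xpath("//tr"):
        # the leading dot scopes each query to the current <tr> only
        yield {
            'day': row.xpath(".//td[@class='time']/span/text()").get(),
            'time': row.xpath(".//td[@class='time']/span/time/text()").get(),
        }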

Scrapy: follow links to scrape additional information for each item

I am trying to scrape a website that has some info about 15 articles on each page. For each article, I'd like to get the title, date, and then follow the "Read More" link to get additional information (e.g., the source of the article).
So far, I have successfully scraped the title and date for each article on all pages and stored them in a CSV file.
My problem is that I couldn't follow the Read More link to get the additional info (source) for each article. I have read a lot of similar questions and their answers, but I have not been able to fix it yet.
Here is my code:
import scrapy
class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']
    custom_settings = {'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        print("procesing:" + response.url)
        Title = response.xpath('//h2[@class="entry-title"]/a/text()').extract()
        Date = response.xpath('//p[@class="entry-content__text"]/strong/text()').extract()
        ReadMore_links = response.xpath('//a[@class="button entry-content__button entry-content__button--smaller"]/@href').extract()
        for link in ReadMore_links:
            yield scrapy.Request(response.urljoin(links, callback=self.parsepage2)

    def parsepage2(self, response):
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        return Source

        row_data = zip(Title, Date, Source)
        for item in row_data:
            scraped_info = {
                'page': response.url,
                'Title': item[0],
                'Date': item[1],
                'Source': item[2],
            }
            yield scraped_info

        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
You need to process each article, get the Date, Title, and "Read More" link, and then yield another scrapy.Request, passing the already collected information along using cb_kwargs (or request.meta in old versions):
import scrapy
class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']
    custom_settings = {'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        for article in response.xpath('//article'):
            Title = article.xpath('.//h2[@class="entry-title"]/a/text()').get()
            Date = article.xpath('.//p[@class="entry-content__text"]/strong/text()').get()
            ReadMore_link = article.xpath('.//a[@class="button entry-content__button entry-content__button--smaller"]/@href').get()
            yield scrapy.Request(
                url=response.urljoin(ReadMore_link),
                callback=self.parse_article_details,
                cb_kwargs={
                    'article_title': Title,
                    'article_date': Date,
                }
            )

        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_article_details(self, response, article_title, article_date):
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        scraped_info = {
            'page': response.url,
            'Title': article_title,
            'Date': article_date,
            'Source': Source,
        }
        yield scraped_info
UPDATE
Everything works correctly on my side:
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus>
{'page': 'https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus', 'Title': ' Japanese schools re-opened then were closed again due to a second wave of coronavirus.', 'Date': '2020/05/12 | France', 'Source': "This false claim originated from: CGT Educ'Action", 'files': []}
2020-05-14 00:59:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19>
{'page': 'https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19', 'Title': ' Famous French blue cheese, roquefort, is a “medecine against Covid-19”.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents>
{'page': 'https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents', 'Title': ' Administrative documents French people need to fill to go out are a copy paste from 1940 documents.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: IndignezVous', 'files': []}
2020-05-14 00:59:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable>
{'page': 'https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable', 'Title': ' Spanish and French masks prices are comparable.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown', 'Title': ' French President Macron and its spouse are jetskiing during the lockdown.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air', 'Title': ' French Minister of Justice Nicole Belloubet threathened the famous anchor Jean-Pierre Pernaut after he criticized the government policy about the pandemic on air.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
You might want to have a look at the follow_all function; it is a better option than urljoin:
https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns
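For example, the pagination part above could be rewritten roughly like this (a sketch reusing the same next-page XPath, not code from the original answer):

def parse(self, response):
    # ... yield the per-article requests as before ...
    # follow_all builds and yields a Request for every link matched by the XPath
    yield from response.follow_all(
        xpath='//a[@class="next page-numbers"]/@href',
        callback=self.parse,
    )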

How to fix callback inside the spider in python scrapy?

I'm creating a web scraper and want to use a callback to get the sub-pages, but it doesn't seem to work correctly and there is no result. Can anyone help?
Here is my code:
class YellSpider(scrapy.Spider):
    name = "yell"
    start_urls = [url, ]

    def parse(self, response):
        pageNum = 0
        pages = len(response.xpath(".//div[@class='col-sm-14 col-md-16 col-lg-14 text-center']/*"))
        for page in range(pages):
            pageNum = page + 1
            for x in range(5):
                num = random.randint(5, 8)
                time.sleep(num)
            for item in response.xpath(".//a[@href][contains(@href,'/#view=map')][contains(@href,'/biz/')]"):
                subcategory = base_url + item.xpath("./@href").extract_first().replace("/#view=map", "")
                sub_req = scrapy.Request(subcategory, callback=self.parse_details)
                yield sub_req
            next_page = base_url + "ucs/UcsSearchAction.do?&selectedClassification=" + classification + "&keywords=" + keyword + "&location=" + city + "&pageNum=" + str(pageNum + 1)
            if next_page:
                yield scrapy.Request(next_page, self.parse)

    def parse_details(self, sub_req):
        for x in range(5):
            num = random.randint(1, 5)
            # time.sleep(num)
        name = sub_req.xpath(".//h1[@class='text-h1 businessCard--businessName']/text()").extract_first()
        address = " ".join(
            sub_req.xpath(".//span[@class='address'][@itemprop='address']/child::node()/text()").extract())
        telephone = sub_req.xpath(".//span[@class='business--telephoneNumber']/text()").extract_first()
        web = sub_req.xpath(
            ".//div[@class='row flexColumns-sm-order-8 floatedColumns-md-right floatedColumns-lg-right floatedColumns-md-19 floatedColumns-lg-19']//a[@itemprop='url']/@href").extract_first()
        hours = ""
        overview = ""
        yield {
            'Name': name,
            'Address': address,
            'Telephone': telephone,
            'Web Site': web
        }
I want to call back from the response to parse_details.
I expect to loop over all the ads in the sub-requests and scrape the data from each of them.
Here is the log file:
2019-06-17 15:50:33 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Web_Scraper)
2019-06-17 15:50:33 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-10-10.0.16299-SP0
2019-06-17 15:50:33 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Web_Scraper', 'LOG_FILE': 'output.log', 'NEWSPIDER_MODULE': 'Web_Scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Web_Scraper.spiders']}
2019-06-17 15:50:33 [scrapy.extensions.telnet] INFO: Telnet Password: 91c423015a2cd984
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-06-17 15:50:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-06-17 15:50:34 [scrapy.core.engine] INFO: Spider opened
2019-06-17 15:50:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-17 15:50:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-17 15:50:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/robots.txt> (referer: None)
2019-06-17 15:50:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1> (referer: None)
2019-06-17 15:50:49 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.yell.com//biz/pennine-tuition-services-liverpool-9465687> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/> from <GET https://www.yell.com//biz/pennine-tuition-services-liverpool-9465687>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/> from <GET https://www.yell.com//biz/home-maths-tutoring-liverpool-7467622>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/> from <GET https://www.yell.com//biz/kumon-maths-and-english-wallasey-8945913>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tution-wallasey-7574939/> from <GET https://www.yell.com//biz/maths-tution-wallasey-7574939>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/> from <GET https://www.yell.com//biz/patrick-haslam-tutoring-liverpool-8777349>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tution-wallasey-8327361/> from <GET https://www.yell.com//biz/maths-tution-wallasey-8327361>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/> from <GET https://www.yell.com//biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/> from <GET https://www.yell.com//biz/tutor-services-liverpool-liverpool-8134849>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/> from <GET https://www.yell.com//biz/advanced-maths-tutorials-liverpool-3755223>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/> from <GET https://www.yell.com//biz/kumon-maths-and-english-liverpool-8903743>
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/> from <GET https://www.yell.com//biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/north-west-tutors-liverpool-901511208/> from <GET https://www.yell.com//biz/north-west-tutors-liverpool-901511208>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/triple-m-education-prenton-8934754/> from <GET https://www.yell.com//biz/triple-m-education-prenton-8934754>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/pennine-tuition-services-liverpool-9465687/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tution-wallasey-7574939/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/maths-tuition-wallasey-901339881/> from <GET https://www.yell.com//biz/maths-tuition-wallasey-901339881>
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/home-maths-tutoring-liverpool-7467622/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-wallasey-8945913/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tution-wallasey-8327361/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tution-wallasey-7574939/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/activett-liverpool-901311152/> from <GET https://www.yell.com//biz/activett-liverpool-901311152>
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/dr-john-ankers-science-and-maths-tutor-liverpool-8467525/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=4> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/tutor-services-liverpool-liverpool-8134849/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/> from <GET https://www.yell.com//biz/liz-beattie-tutoring-prenton-7618961>
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/askademia-liverpool-8680035> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/patrick-haslam-tutoring-liverpool-8777349/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tution-wallasey-8327361/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/north-west-tutors-liverpool-901511208/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/advanced-maths-tutorials-liverpool-3755223/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/triple-m-education-prenton-8934754/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/1-2-1-tutoring-liverpool-901224945> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-liverpool-8903743/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/maths-tuition-wallasey-901339881/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/dmw-tuition-ltd-liverpool-6887458> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/askademia-liverpool-8680035>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/wallasey-tuition-wallasey-7390339> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/zenitheducators-liverpool-7791342> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/explore-learning-liverpool-901511688> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/kumon-maths-and-english-study-centre-bebington-wirral-7460985/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/north-west-tutors-liverpool-901511208/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/edes-educational-centre-liverpool-8380869> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/triple-m-education-prenton-8934754/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/love2teach-liverpool-liverpool-9678322> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/activett-liverpool-901311152/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/1-2-1-tutoring-liverpool-901224945>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/maths-tuition-wallasey-901339881/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com//biz/guaranteed-grades-liverpool-7368523> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/dmw-tuition-ltd-liverpool-6887458>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/wallasey-tuition-wallasey-7390339>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/zenitheducators-liverpool-7791342>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=1> (referer: https://www.yell.com/ucs/UcsSearchAction.do?&selectedClassification=Tutoring&keywords=Math&location=liverpool&pageNum=4)
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/explore-learning-liverpool-901511688>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/edes-educational-centre-liverpool-8380869>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/love2teach-liverpool-liverpool-9678322>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/activett-liverpool-901311152/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com//biz/guaranteed-grades-liverpool-7368523>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yell.com/biz/liz-beattie-tutoring-prenton-7618961/>
{'Name': None, 'Address': '', 'Telephone': None, 'Web Site': None}
2019-06-17 15:50:57 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-17 15:50:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 21544,
'downloader/request_count': 45,
'downloader/request_method_count/GET': 45,
'downloader/response_bytes': 257613,
'downloader/response_count': 45,
'downloader/response_status_count/200': 29,
'downloader/response_status_count/301': 16,
'dupefilter/filtered': 51,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 6, 17, 13, 50, 57, 376036),
'item_scraped_count': 25,
'log_count/DEBUG': 71,
'log_count/INFO': 9,
'request_depth_max': 3,
'response_received_count': 29,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 44,
'scheduler/dequeued/memory': 44,
'scheduler/enqueued': 44,
'scheduler/enqueued/memory': 44,
'start_time': datetime.datetime(2019, 6, 17, 13, 50, 34, 114473)}
2019-06-17 15:50:57 [scrapy.core.engine] INFO: Spider closed (finished)

Scrapy crawl spider does not follow all the links and does not populate Item loader

I am trying to take the links from this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products) and scrape the title from each one. However, it doesn't work! The spider does not seem to follow the links!
CODE
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from depositsusa.items import DepositsusaItem
from scrapy.loader import ItemLoader
class DepositsSpider(CrawlSpider):
    name = 'deposits'
    allowed_domains = ['web']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
             callback='parse'),
    )

    def parse(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        i.add_value('url', response.url)
        i.add_value('project', self.settings.get('BOT_NAME'))
        i.add_value('spider', self.name)
        i.add_value('server', socket.gethostname())
        i.add_value('date', datetime.datetime.now())
        return i.load_item()
items.py:
import scrapy
from scrapy.item import Item, Field
class DepositsusaItem(Item):
    # main fields
    name = Field()
    # housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
    pass
OUTPUT
(base) C:\Users\User\Documents\Python WebCrawling Learing Projects\depositsusa>scrapy crawl deposits
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: depositsusa)
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2018-11-17 00:29:48 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'depositsusa', 'NEWSPIDER_MODULE': 'depositsusa.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['depositsusa.spiders']}
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-11-17 00:29:48 [scrapy.core.engine] INFO: Spider opened
2018-11-17 00:29:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-17 00:29:48 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://minerals.usgs.gov/robots.txt> (referer: None)
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://minerals.usgs.gov/science/mineral-deposit-database/#products> (referer: None)
2018-11-17 00:29:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://minerals.usgs.gov/science/mineral-deposit-database/>
{'date': [datetime.datetime(2018, 11, 17, 0, 29, 49, 832526)],
'project': ['depositsusa'],
'server': ['DESKTOP-9CUE746'],
'spider': ['deposits'],
'url': ['https://minerals.usgs.gov/science/mineral-deposit-database/']}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-17 00:29:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 475,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 25123,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 16, 23, 29, 49, 848053),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 11, 16, 23, 29, 48, 520273)}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Spider closed (finished)
I am quite new to Python, so what is the problem? Is it something to do with the link extraction or the parse function?
You have to change a couple of things.
First, when you use a CrawlSpider, you can't have a callback named parse as you would override the CrawlSpider's parse: https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
Secondly, you want to have the correct list of allowed_domains.
Try something like this:
class DepositsSpider(CrawlSpider):
    name = 'deposits'
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
             callback='parse_x'),
    )

    def parse_x(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        i.add_value('url', response.url)
        i.add_value('project', self.settings.get('BOT_NAME'))
        i.add_value('spider', self.name)
        i.add_value('server', socket.gethostname())
        i.add_value('date', datetime.datetime.now())
        return i.load_item()
