Why does Scrapy's yield scrapy.Request() not recurse? - python-3.x

Here is my code:
import scrapy
from ..items import QuoteItem  # assuming QuoteItem is defined in the project's items.py

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com/']
    start_urls = ['http://quotes.toscrape.com//']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
I am new to Scrapy. I expected this to keep following the next-page link recursively, but it doesn't. Why not?

The problem here is that Scrapy uses allowed_domains to build a regex for deciding whether the links passing through belong to the specified domain, and the trailing slash prevents the next-page URLs from matching, so they get filtered out.
Just change the string quotes.toscrape.com/ to quotes.toscrape.com if you only want to allow requests from that specific domain.
You can also remove the allowed_domains attribute entirely if you want to allow requests to every domain.
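For reference, a minimal sketch of the corrected attributes (everything else in the spider stays exactly as posted above):

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']     # no trailing slash
    start_urls = ['http://quotes.toscrape.com/']  # a single trailing slash here is fine
    # parse() unchanged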

Related

Scrapy only returning first result of each page

As the question title implies, I'm having trouble with the web scraping library Scrapy. It only returns the first "quote" from each page of the Quotes to Scrape site.
I know this may seem simple to those who have mastered Scrapy, but I'm having trouble with the concept used here. If someone could fix the error and explain the process, that would be great.
This is my current code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow = 'page/', deny = 'tag/'),
                  callback='parse_filter_book', follow=True)]

    def parse_filter_book(self, response):
        title = response.xpath('//div/h1/a/text()').extract_first()
        author = response.xpath(
            '//div[@class = "quote"]/span/small/text()').extract_first()
        author_url = response.xpath(
            '//div[@class = "quote"]/span/a/@href').extract_first()
        final_author_url = self.base_url + author_url.replace('../..', '')
        quote = response.xpath(
            '//div[@class = "quote"]/span[@class= "text"]/text()').extract_first()
        yield {
            'Title': title,
            'Author': author,
            'URL': final_author_url,
            'Quote': quote,
        }
Currently I'm trying something based off this approach. I've seen others do something similar, but I'm failing to pull off the same.
def parse_filter_book(self, response):
    for quote in response.css('div.mw-parser-output > div'):
        title = quote.xpath('//div/h1/a/text()').extract_first()
        author = quote.xpath(
            '//div[@class = "quote"]/span/small/text()').extract_first()
        author_url = quote.xpath(
            '//div[@class = "quote"]/span/a/@href').extract_first()
        final_author_url = self.base_url + author_url.replace('../..', '')
        quotes = quote.xpath(
            '//div[@class = "quote"]/span[@class= "text"]/text()').extract_first()
The current output is just 10 results, one from each of the 10 pages. The new modified version produces no output, just an error.
It's also my goal to scrape only the 10 listing pages of the site, hence why the rules are set up the way they are.
----- Update -----
Wow, thanks. I copy-pasted the corrected function and am getting the desired output. I'm going through the explanation and comparing my old code to this new one right now, so I will respond properly in a while.
Your first code sample will receive a response and will only extract one item, since there is no loop and the selectors are using extract_first():
def parse_filter_book(self, response):
    title = response.xpath('//div/h1/a/text()').extract_first()
    ...
    yield {
        'Title': title,
        ...
    }
This is literally telling the spider to find in the response all elements that match the XPath //div/h1/a/text(), then take the first item that matched with extract_first() and store that value in the title variable.
It will do the same for all the other variables, yield the result, and finish its execution.
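For instance, a quick sketch of the difference between the two extraction methods (the selector here is just an example, not taken from the question):

response.css('.quote .text::text').extract_first()  # -> first matching string only
response.css('.quote .text::text').extract()        # -> list of all matching strings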
The general idea in the second code is right: you select all the elements that are a quote, iterate over them, and extract the values in each iteration. There are a few issues, though.
This will return empty:
response.css('div.mw-parser-output > div')
I don't see any div element with that class in the page. Replacing it with response.css('div.quote') is enough to select the quote elements.
However, we still need to fix your extraction paths. In this loop, quote is already a div[@class="quote"] element, so you should drop that prefix, since you want to look inside the selector.
for quote in response.css('div.quote'):
    title = quote.xpath('//div/h1/a/text()').get()
    author = quote.xpath('span/small/text()').get()
    author_url = quote.xpath('span/a/@href').get()
    final_author_url = response.urljoin(author_url)
    quotes = quote.xpath('span[@class="text"]/text()').get()
    yield {
        'Title': title,
        'Author': author,
        'URL': final_author_url,
        'Quote': quotes,  # I believe you meant quotes not quote; quote is the selector, quotes the text.
    }
Notes
I left title untouched; it will always scrape the same thing, the title of the page. I wasn't sure if that was the intention.
I suggest you use the .get() method instead of .extract_first(). Since Scrapy 1.5.2 they are the same thing, but .get() reads more clearly.
You can call the response.urljoin() method to join the response's URL with the relative URL you scraped. Quite handy.
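For example, a quick sketch of what that does (the href value shown here is only illustrative):

# response.url is e.g. 'http://quotes.toscrape.com/page/2/'
author_url = quote.xpath('span/a/@href').get()    # e.g. '/author/Albert-Einstein'
final_author_url = response.urljoin(author_url)   # -> 'http://quotes.toscrape.com/author/Albert-Einstein'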
The issue is with your quote selector, which returns an empty list:
response.css('div.mw-parser-output > div'). Therefore you never enter the for loop.
To make sure that you are getting all the quotes, you could simply put all the quotes into a variable and then print it to confirm that you are getting what you need.
I also updated the XPaths in your spider, as they were extracting data from the whole page and not from the quote selector. Make sure to prepend . to your XPath when you already have a local selector object.
Example:
This will get the first author in your quote selector:
quote.xpath('.//span/small/text()').extract_first()
This will get you the first author on the webpage:
quote.xpath('//div[@class = "quote"]/span/small/text()').extract_first()
Working spider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow = 'page/', deny = 'tag/'),
                  callback='parse_filter_book', follow=True)]

    def parse_filter_book(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            # I'm not sure where this title is coming from in the quote
            # title = quote.xpath('.//div/h1/a/text()').extract_first()
            author = quote.xpath(
                './/span/small/text()').extract_first()
            author_url = quote.xpath(
                './/span/a/@href').extract_first()
            final_author_url = self.base_url + author_url.replace('../..', '')
            text = quote.xpath(
                './/span[@class= "text"]/text()').extract_first()
            yield {
                'Author': author,
                'URL': final_author_url,
                'Quote': text,
            }

I am trying to join a URL in scrapy but unable to do so

I am trying to fetch a name (which contains an id and a name) from one website and want to append that variable to another link. For example, in the name variable I get /in/en/books/1446502-An-Exciting-Day (there are many records), and then I want to append the name variable to 'https://www.storytel.com' to fetch data specific to the book. I also want to add a condition for a_name: if response.css('span.expandAuthorName::text') is not available, put '-', otherwise fetch the name.
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = 'brickset-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/1-Children?pageNumber=100']

    def parse(self, response):
        # for quote in response.css('div.gridBookTitle'):
        #     item = {
        #         'name': quote.css('a::attr(href)').extract_first()
        #     }
        #     yield item
        urls = response.css('div.gridBookTitle > a::attr(href)').extract()
        for url in urls:
            url = ['https://www.storytel.com'].urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            'a_name': response.css('span.expandAuthorName::text').extract_first()
        }
I am trying to append the URL with "https://www.storytel.com".urljoin(url), but I am getting an error. Being new to Scrapy I have tried many things but have been unable to resolve the issue. The error I get is: in line 15, list object has no attribute urljoin. Any leads on how to overcome this? Thanks in advance.
Check with this solution:
for url in urls:
    url = 'https://www.storytel.com' + url
    yield scrapy.Request(url=url, callback=self.parse_details)
Let me know if it helps.
url = ['https://www.storytel.com'].urljoin(url)
Here you are calling urljoin on a list of strings, which is exactly why Python reports that a list object has no attribute urljoin. If you just want to append the URL you scraped (which is a string) to the base string, plain concatenation does it:
full_url = "https://www.storytel.com" + url
You can check the docs about strings (specifically 'join') here: https://docs.python.org/3.8/library/stdtypes.html#str.join
EDIT: also note that plain Python strings have no urljoin method either, and str.join does something different (it concatenates the elements of an iterable with the string as separator), so neither is what you want here.
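If you would rather let Scrapy resolve the URL for you, the response object does have a urljoin() method; a minimal sketch of the loop using it (same selectors as in the question):

urls = response.css('div.gridBookTitle > a::attr(href)').extract()
for url in urls:
    # response.urljoin resolves the relative href against response.url
    yield scrapy.Request(url=response.urljoin(url), callback=self.parse_details)

As for the a_name condition mentioned in the question, extract_first() accepts a default value, e.g. response.css('span.expandAuthorName::text').extract_first(default='-').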

Where are my mistakes in my scrapy codes?

I want to crawl a website via Scrapy, but my code comes up with an error.
I have tried to use XPath, but it seems I cannot select the div class on the website.
The following code raises an error on ("h2 ::text").extract().
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem

class MySpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["baltictriennial13.org"]
    start_urls = ["https://www.baltictriennial13.org/artist/caroline-achaintre/"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='artist']")
        items = []
        for titles in titles:
            item = ArtistlistItem()
            item["artist"] = titles.select("h2 ::text").extract()
            item["biograpy"] = titles.select("p::text").extract()
            items.append(item)
        return items
I want to crawl the web site and store the data in a .csv file.
The main issue with your code is the use of .select instead of .css. Here is what you need, though I'm not sure about the titles part (maybe you need it on other pages):
def parse(self, response):
    titles = response.xpath("//div[@class='artist']")
    # items = []
    for title in titles:
        item = ArtistlistItem()
        item["artist"] = title.css("h2::text").get()
        item["biograpy"] = title.css("p::text").get()
        # items.append(item)
        yield item
Try removing the space in h2 ::text --> h2::text. If that doesn't work, try h2/text().
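For what it's worth, the two CSS selectors are not equivalent: without the space, ::text matches only text nodes directly inside the h2, while with the space it matches text nodes of the h2 and every element nested inside it. A quick sketch of the difference (assuming title is a selector for one artist div):

title.css('h2::text').getall()    # text directly inside <h2>
title.css('h2 ::text').getall()   # text inside <h2> and inside any of its descendants, e.g. a nested <a>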

How do I follow each item's link from a for loop?

I am using Scrapy to scrape a website. I am in a loop where every item has a link, and I want to follow that link for every item in the loop.
import scrapy

class MyDomainSpider(scrapy.Spider):
    name = 'My_Domain'
    allowed_domains = ['MyDomain.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        Colums = response.xpath('//*[@id="tab-5"]/ul/li')
        for colom in Colums:
            title = colom.xpath('//*[@class="lng_cont_name"]/text()').extract_first()
            address = colom.xpath('//*[@class="adWidth cont_sw_addr"]/text()').extract_first()
            con_address = address[9:-9]
            url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
            print(url)
            print('*********************')
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        print('000000000000000000')
        a = response.xpath('//*[@class="fn"]/text()').extract_first()
        print(a)
I have tried something like this, but the zeros print only once while the stars print 10 times. I want the second function to run every time the loop runs.
You are probably doing something that is not intended. With
url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
inside the loop, url always results in the same value. By default, Scrapy filters duplicate requests (see here). If you really want to scrape the same URL multiple times, you can disable the filtering at the request level with the dont_filter=True argument to the scrapy.Request constructor. However, I think what you really want is to go like this (only the relevant part of the code is shown):
def parse(self, response):
    Colums = response.xpath('//*[@id="tab-5"]/ul/li')
    for colom in Colums:
        url = colom.xpath('./@data-href').extract_first()
        yield scrapy.Request(url, callback=self.parse_dir_contents)
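For completeness, this is what the dont_filter option mentioned above would look like; it is not needed once the relative XPath fix is in place:

# bypass Scrapy's duplicate-request filter for this single request
yield scrapy.Request(url, callback=self.parse_dir_contents, dont_filter=True)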

Scrapy - xpath - extract returns null

My goal is to build a scraper that extracts data from a table on this site.
Initially I followed the Scrapy tutorial, where I succeeded in extracting data from the test site. When I try to replicate it for Bitinfocharts, the first issue is that I need to use XPath, which the tutorial doesn't cover in detail (it uses CSS only). I have been able to scrape the specific data I want through the shell.
My current issue is understanding how I can scrape it all from my code and at the same time write the results to a .csv / .json file.
I'm probably missing something completely obvious. If you can have a look at my code and let me know what I'm doing wrong, I would deeply appreciate it.
Thanks!
First attempt:
import scrapy

class RichlistTestItem(scrapy.Item):
    # overview details
    wallet = scrapy.Field()
    balance = scrapy.Field()
    percentage_of_coins = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domain = ['https://bitinfocharts.com/']
    start_urls = [
        'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
    ]

    def parse(self, response):
        for sel in response.xpath("//*[@id='tblOne']/tbody/tr/"):
            scrapy.Item in RichlistTestItem()
            scrapy.Item['wallet'] = sel.xpath('td[2]/a/text()').extract()[0]
            scrapy.Item['balance'] = sel.xpath('td[3]/a/text').extract()[0]
            scrapy.Item['percentage_of_coins'] = sel.xpath('/td[4]/a/text').extract()[0]
            yield('wallet', 'balance', 'percentage_of_coins')
Second attempt: (probably closer to 50th attempt)
import scrapy

class RichlistTestItem(scrapy.Item):
    # overview details
    wallet = scrapy.Field()
    balance = scrapy.Field()
    percentage_of_coins = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domain = ['https://bitinfocharts.com/']
    start_urls = [
        'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
    ]

    def parse(self, response):
        for sel in response.xpath("//*[@id='tblOne']/tbody/tr/"):
            wallet = sel.xpath('td[2]/a/text()').extract()
            balance = sel.xpath('td[3]/a/text').extract()
            percentage_of_coins = sel.xpath('/td[4]/a/text').extract()
            print(wallet, balance, percentage_of_coins)
I have fixed your second attempt; specifically, the code snippet below:
for sel in response.xpath('//*[@id="tblOne"]/tbody/tr'):
    wallet = sel.xpath('td[2]/a/text()').extract()
    balance = sel.xpath('td[3]/text()').extract()
    percentage_of_coins = sel.xpath('td[4]/text()').extract()
The problems I found are:
- there was a trailing "/" in the table row selector;
- for balance, the value was inside the td, not inside a link inside the td;
- for percentage_of_coins, again the value was inside the td.
Also, there is a data-val attribute on each of the td elements. Scraping those might be a little easier than getting the value from inside the td.
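To address the .csv / .json part of the question: once parse() yields dicts (or items), Scrapy's feed exports can write them out without extra code, e.g. scrapy crawl quotes -o richlist.csv (or richlist.json). And here is a minimal sketch of the data-val idea above (assuming each td really does carry that attribute, as noted):

def parse(self, response):
    for sel in response.xpath('//*[@id="tblOne"]/tbody/tr'):
        yield {
            'wallet': sel.xpath('td[2]/a/text()').get(),
            'balance': sel.xpath('td[3]/@data-val').get(),
            'percentage_of_coins': sel.xpath('td[4]/@data-val').get(),
        }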
