Scrapy - extract information from list of links - python-3.x

I am writing a scraper with Python and Scrapy. My start_urls point to a page that contains a list of products; the scraper collects the links to these products and scrapes the information for each one (I store it in the fields of the item class defined in items.py). Each product can have a list of variations. I need to extract information from every variation, collect it in a list, and save that list in items['variations'].
def parse(self, response):
    links = response.css(css_links).getall()
    links = [self.process_url(link) for link in links]
    for link in links:
        link = urljoin(response.url, link)
        yield scrapy.Request(link, callback=self.parse_product)
def parse_product(self, response):
    items = SellItem()
    shipper = self.get_shipper(response)
    items['shipper'] = shipper
    items['weight'] = self.get_weight(response)
    items['url'] = response.url
    items['category'] = self.get_category(response)
    items['cod'] = response.css(css_cod).get()
    items['price'] = self.get_price(response)
    items['cantidad'] = response.css(css_cantidad).get()
    items['name'] = response.css(css_name).get()
    items['images'] = self.get_images(response)
    variations = self.get_variations(response)
    if variations:
        valid_urls = self.get_valid_urls(variations)
        for link in valid_urls:
            # I need to go to each of these URLs, scrape information, and then
            # store it in items['variations'].

You need to add a second callback method; call it parse_details.
Then pass callback=self.parse_details when you yield the request for each variation URL from parse_product.
You can carry the data collected so far from one callback to the next with Request.meta, which is available in the callback as response.meta.
The Scrapy docs cover this:
https://docs.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
Also read about Request.cb_kwargs, which the docs recommend over meta for passing user data to callbacks.

Related

Where are my mistakes in my scrapy codes?

I want to crawl a website with Scrapy, but my code raises an error.
I tried to use XPath, but I cannot seem to select the div class on the website.
The following code raises an error on the .select("h2 ::text").extract() call.
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem
class MySpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["baltictriennial13.org"]
    start_urls = ["https://www.baltictriennial13.org/artist/caroline-achaintre/"]
    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='artist']")
        items = []
        for titles in titles:
            item = ArtistlistItem()
            item["artist"] = titles.select("h2 ::text").extract()
            item["biograpy"] = titles.select("p::text").extract()
            items.append(item)
        return items
I want to crawl the web site and store the data in a .csv file.
The main issue with your code is that it uses .select instead of .css. Here is what you need, though I'm not sure about the titles part (you may need it on other pages):
def parse(self, response):
    titles = response.xpath("//div[@class='artist']")
    # items = []
    for title in titles:
        item = ArtistlistItem()
        item["artist"] = title.css("h2::text").get()
        item["biograpy"] = title.css("p::text").get()
        # items.append(item)
        yield item
Try removing the space in h2 ::text --> h2::text. If that doesn't work, try the XPath expression h2/text().

Scrapy spider returns no items data

My Scrapy script does not seem to follow links, so it never extracts data from them (to pass the content on as Scrapy items).
I am trying to scrape a lot of data from a news website. I managed to copy and write a spider that, as I understood it, should read links from a file (generated with another script), put them in the start_urls list, follow these links to extract some data, and pass that data on as items. It should also write each item's data to a separate file (that last part is really for another question).
After running scrapy crawl PNS, the script goes through all the links in start_urls but does nothing more: it requests each link from the start_urls list (I see a "GET link" message in bash), but it does not seem to parse the pages, find more links to follow, and extract data from those.
import scrapy
import re
from ProjectName.items import ProjectNameArticle
class ProjectNameSpider(scrapy.Spider):
    name = 'PNS'
    allowed_domains = ['www.project-domain.com']
    start_urls = []
    with open('start_urls.txt', 'r') as file:
        for line in file:
            start_urls.append(line.strip())
    def parse(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)
    def parse_news(self, response):
        data_dic = ProjectNameArticle()
        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
Expected result:
The script opens start_urls.txt, reads its lines (each line contains a single link), and puts these links in the start_urls list.
For each link opened, the spider extracts deeper links to follow (about 50-200 links per start_urls link).
The followed links are the main target, from which I want to extract specific data: article title, date, time, and text.
For now, never mind writing each Scrapy item to a distinct .txt file.
Actual result:
Running my spider triggers a GET for each start_urls link (around 150000 of them), but it does not build a list of deeper links or enter them to extract any data.
Dude, I have been coding in Python with Scrapy for a long time and I hate using start_urls.
You can simply use start_requests, which is easy to read and easy for beginners to learn:
import re
import scrapy
from ProjectName.items import ProjectNameArticle
class ProjectNameSpider(scrapy.Spider):
    name = 'PNS'
    allowed_domains = ['www.project-domain.com']
    def start_requests(self):
        # Read one start URL per line and request each of them.
        with open('start_urls.txt', 'r') as file:
            for line in file:
                yield scrapy.Request(line.strip(),
                                     callback=self.my_callback_func)
    def my_callback_func(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)
    def parse_news(self, response):
        data_dic = ProjectNameArticle()
        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
I have also never used the Item class; I find it unnecessary.
You can simply use data_dic = {} instead of data_dic = ProjectNameArticle().
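One more thing that may be worth checking, offered as an assumption since the real domain is anonymized: the spider allows www.project-domain.com but builds article links on project-domain.com. Scrapy's offsite filtering only lets through hosts that equal an allowed domain or are a subdomain of one, so a mismatch like this would make the deeper requests get dropped. The sketch below uses only the standard library; is_allowed is a hypothetical, simplified stand-in for that check, and page_url is a made-up listing URL.

```python
from urllib.parse import urljoin, urlparse


def is_allowed(url, allowed_domains):
    """Simplified stand-in for the offsite check: the host must equal an
    allowed domain or be a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical listing page and an extracted relative link.
page_url = "https://www.project-domain.com/news/index.html"
href = "/document.html"

# Joining against the page URL keeps the host the page was served from.
joined = urljoin(page_url, href)
# The spider instead hard-codes a host without the "www." prefix.
hardcoded = "https://project-domain.com" + href

allowed = ["www.project-domain.com"]
print(joined, is_allowed(joined, allowed))
print(hardcoded, is_allowed(hardcoded, allowed))

# Allowing the bare domain covers both the bare host and its subdomains.
print(is_allowed(hardcoded, ["project-domain.com"]),
      is_allowed(joined, ["project-domain.com"]))
```

Inside a spider, response.urljoin(link) does the same joining against the response URL, and listing the bare domain in allowed_domains is the usual way to cover both host variants.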

Going through specified items in a page using scrapy

I'm having trouble entering and analyzing several items within a page.
I have a page that contains items; the code looks something like this:
class Spider(CrawlSpider):
    name = 'spider'
    maxId = 20
    allowed_domain = ['www.domain.com']
    start_urls = ['http://www.startDomain.com']
On the start URL, I have some items that all share the following XPath (within the startDomain):
    def start_requests(self):
        for i in range(self.maxId):
            yield Request('//*[@id="result_{0}"]/div/div/div/div[2]/div[1]/div[1]/a/h2'.format(i) , callback = self.parse_item)
I'd like to find a way to visit each of these links (the ones tied to result_{number}) and then scrape the contents of each item.

Scrapy - xpath - extract returns null

My goal is to build a scraper that extracts data from a table on this site.
Initially I followed the Scrapy tutorial, where I succeeded in extracting data from the test site. When I tried to replicate it for Bitinfocharts, the first issue was that I need to use XPath, which the tutorial doesn't cover in detail (it mostly uses CSS). I have been able to scrape the specific data I want through the shell.
My current issue is understanding how I can scrape all the rows from my code and at the same time write the results to a .csv or .json file.
I'm probably missing something completely obvious. If you can have a look at my code and let me know what I'm doing wrong, I would deeply appreciate it.
Thanks!
First attempt:
import scrapy
class RichlistTestItem(scrapy.Item):
    # overview details
    wallet = scrapy.Field()
    balance = scrapy.Field()
    percentage_of_coins = scrapy.Field()
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domain = ['https://bitinfocharts.com/']
    start_urls = [
        'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
    ]
    def parse(self, response):
        for sel in response.xpath("//*[@id='tblOne']/tbody/tr/"):
            scrapy.Item in RichlistTestItem()
            scrapy.Item['wallet'] = sel.xpath('td[2]/a/text()').extract()[0]
            scrapy.Item['balance'] = sel.xpath('td[3]/a/text').extract()[0]
            scrapy.Item['percentage_of_coins'] = sel.xpath('/td[4]/a/text').extract()[0]
            yield('wallet', 'balance', 'percentage_of_coins')
Second attempt: (probably closer to 50th attempt)
import scrapy
class RichlistTestItem(scrapy.Item):
    # overview details
    wallet = scrapy.Field()
    balance = scrapy.Field()
    percentage_of_coins = scrapy.Field()
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domain = ['https://bitinfocharts.com/']
    start_urls = [
        'https://bitinfocharts.com/top-100-richest-vertcoin-addresses.html'
    ]
    def parse(self, response):
        for sel in response.xpath("//*[@id='tblOne']/tbody/tr/"):
            wallet = sel.xpath('td[2]/a/text()').extract()
            balance = sel.xpath('td[3]/a/text').extract()
            percentage_of_coins = sel.xpath('/td[4]/a/text').extract()
            print(wallet, balance, percentage_of_coins)
I have fixed your second attempt, specifically the snippet below:
for sel in response.xpath("//*[@id=\"tblOne\"]/tbody/tr"):
    wallet = sel.xpath('td[2]/a/text()').extract()
    balance = sel.xpath('td[3]/text()').extract()
    percentage_of_coins = sel.xpath('td[4]/text()').extract()
The problems I found are:
There was a trailing "/" on the table row selector.
For balance, the value was inside the td, not inside a link inside the td.
For percentage_of_coins, again, the value was inside the td.
Also, each td has a data-val attribute. Scraping that might be a little easier than getting the value from the text inside the td.

Crawler skipping content of the first page

I've created a crawler that parses certain content from a website.
First, it scrapes the category links from the left sidebar.
Second, it harvests all the links, spread across the pagination, that lead to the profile pages.
Finally, it goes to each profile page and scrapes the name, phone, and web address.
So far it works well. The only problem is that it always starts scraping from the second page, skipping the first. I suppose there is some way to get around this. Here is the complete code I am using:
import requests
from lxml import html
url="https://www.houzz.com/professionals/"
def category_links(mainurl):
    req=requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles) # links to the category from left-sided bar
def next_pagelink(process_links):
    req=requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link) # the whole links spread through pagination connected to the profile page
def profile_pagelink(procured_links):
    req=requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links) # profile page of each link
def target_pagelink(main_links):
    req=requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)
    def if_exist(titles,xpath):
        info=titles.xpath(xpath)
        if info:
            return info[0]
        return ""
    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles,".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
        print(name,phone,web)
category_links(url)
The problem with the first page is that it has no 'pagination' class, so the expression tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and profile_pagelink is never executed for it.
As a quick fix, you can handle this case separately in category_links:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings because if_exist returns "". You can skip those cases by adding a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles,".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used to persist cookies and headers across requests, which this script does not need. You can just use requests.get and get the same results.
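A more general variant of the quick fix, sketched here under assumptions, is to always treat the page being processed as the first page and then add whatever pagination links it exposes. The helper name pages_to_visit and the example hrefs are hypothetical; the real pagination URLs on the site may look different.

```python
from urllib.parse import urljoin


def pages_to_visit(category_url, pagination_hrefs):
    """Return the category page itself followed by its pagination links,
    made absolute and with duplicates removed (order preserved)."""
    pages = [category_url]
    for href in pagination_hrefs:
        absolute = urljoin(category_url, href)
        if absolute not in pages:
            pages.append(absolute)
    return pages


# Hypothetical hrefs as the pagination XPath might return them.
hrefs = [
    "/professionals/architect/p/15",
    "/professionals/architect/p/30",
    "/professionals/architect/p/15",
]
pages = pages_to_visit("https://www.houzz.com/professionals/architect", hrefs)
print(pages)
```

In next_pagelink, looping over pages_to_visit(process_links, hrefs) and calling profile_pagelink on each entry would then cover the first page of every category, including pages with no pagination block at all.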
