Going through specified items in a page using scrapy - python-3.x

I'm running into some trouble trying to enter and analyze several items within a page.
I have a page that contains several items; the spider looks something like this:
from scrapy import Request
from scrapy.spiders import CrawlSpider

class Spider(CrawlSpider):
    name = 'spider'
    maxId = 20
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.startDomain.com']
In the start URL, I have some items that all follow, in XPath, this pattern (within the startDomain):
def start_requests(self):
    for i in range(self.maxId):
        yield Request('//*[@id="result_{0}"]/div/div/div/div[2]/div[1]/div[1]/a/h2'.format(i), callback=self.parse_item)
I'd like to find a way to access each of these links (the ones tied to result_{number}) and then scrape the contents of each item.
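(For reference: an XPath expression is a locator within an already-downloaded page, not a URL, so it can't be passed to Request directly. A minimal sketch of the usual pattern, where the start page is fetched first and the hrefs are extracted in its callback, might look like the following; the exact path to each result's <a> element is an assumption:)

import scrapy

class Spider(scrapy.Spider):
    name = 'spider'
    maxId = 20
    start_urls = ['http://www.startDomain.com']

    def parse(self, response):
        # locate each result block by id, pull the href out of its <a>,
        # and request that URL
        for i in range(self.maxId):
            href = response.xpath('//*[@id="result_{0}"]//a/@href'.format(i)).get()
            if href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        # scrape the contents of the individual item page here
        pass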

Related

Scrapy - extract information from list of links

I am programming a scraper with Python and Scrapy. I have as start_urls a page that contains a list of products; my scraper gets the links of these products and scrapes the information of each product (I save the information in the fields of the class in items.py). Each product can contain a list of variations; I need to extract information from all the variations, save it in a list field, and then store it in item['variations'].
def parse(self, response):
    links = response.css(css_links).getall()
    links = [self.process_url(link) for link in links]
    for link in links:
        link = urljoin(response.url, link)
        yield scrapy.Request(link, callback=self.parse_product)
def parse_product(self, response):
    items = SellItem()
    shipper = self.get_shipper(response)
    items['shipper'] = shipper
    items['weight'] = self.get_weight(response)
    items['url'] = response.url
    items['category'] = self.get_category(response)
    items['cod'] = response.css(css_cod).get()
    items['price'] = self.get_price(response)
    items['cantidad'] = response.css(css_cantidad).get()
    items['name'] = response.css(css_name).get()
    items['images'] = self.get_images(response)
    variations = self.get_variations(response)
    if variations:
        valid_urls = self.get_valid_urls(variations)
        for link in valid_urls:
            # I need to go to each of these URLs, scrape information, and then
            # store it in items['variations'].
            pass
You need to add a second method; call it parse_details.
Then pass callback=self.parse_details when you yield the request from your first method, i.e. parse_product.
You can transfer the collected data between methods using response.meta.
Scrapy covers this in the docs:
see https://docs.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
Also read about Request.cb_kwargs.
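A minimal sketch of that pattern using cb_kwargs (it assumes SellItem declares a variations field in items.py; the variation fields themselves are placeholders):

def parse_product(self, response):
    items = SellItem()
    # ... populate the fields shown above ...
    variations = self.get_variations(response)
    if variations:
        items['variations'] = []
        for link in self.get_valid_urls(variations):
            # hand the partially filled item over to the next callback
            yield scrapy.Request(link, callback=self.parse_details,
                                 cb_kwargs={'items': items})
    else:
        yield items

def parse_details(self, response, items):
    # record this variation; replace the URL with whatever fields you need
    items['variations'].append({'url': response.url})
    yield items

Note that this yields the item once per variation page; if you need exactly one item per product, track how many variation requests are still pending and yield only after the last one arrives.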

Scrapy crawl saved links out of csv or array

import scrapy

class LinkSpider(scrapy.Spider):
    name = "articlelink"
    allowed_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    BASE_URL = 'https://www.topart-online.com/de/'

    # scraping cards of a specific category
    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            yield {
                'links': a.xpath('@href').get()
            }
        next_page_url = response.xpath('//a[@class="page-link"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
This is my spider, which crawls all pages of that category and saves all the product links into a CSV file when I run scrapy crawl articlelink -o filename.csv on my server.
Now I have to crawl all the links in my CSV file for specific information that isn't contained in the product-link card.
How do I start?
So I'm glad you've gone and had a look at how to scrape the next pages.
Now with regard to the product EAN number, this is where yielding a dictionary is cumbersome. In almost all Scrapy scripts you will be better off using the items mechanism; it's suited to grabbing elements from different pages, which is exactly what you want to do.
Scrapy extracts data from HTML, and the mechanism it suggests for this is what it calls items. Scrapy accepts a few different ways to put data into some form of object. You can use a bog-standard dictionary, but for data that requires modification, or isn't clean, or is anything other than a very structured data set from the website, you should use items at the least. Items provide a dictionary-like object.
To use the items mechanism in our spider script, we instantiate the items class to create an items object. We then populate that items dictionary with the data we want, and in your particular case we share this items dictionary across functions to continue adding data from a different page.
In addition, we have to declare the item field names; they are the keys of the items dictionary. We do this in items.py, located in the project folder.
Code Example
items.py
import scrapy

class TopartItem(scrapy.Item):
    title = scrapy.Field()
    links = scrapy.Field()
    ItemSKU = scrapy.Field()
    Delivery_Status = scrapy.Field()
    ItemEAN = scrapy.Field()
spider script
import scrapy
from ..items import TopartItem

class LinkSpider(scrapy.Spider):
    name = "link"
    allowed_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    custom_settings = {'FEED_EXPORT_FIELDS': ['title', 'links', 'ItemSKU', 'ItemEAN', 'Delivery_Status']}

    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            items = TopartItem()
            link = a.xpath('@href')
            items['title'] = a.xpath('.//div[@class="sn_p01_desc h4 col-12 pl-0 pl-sm-3 pull-left"]/text()').get().strip()
            items['links'] = link.get()
            items['ItemSKU'] = a.xpath('.//span[@class="sn_p01_pno"]/text()').get().strip()
            items['Delivery_Status'] = a.xpath('.//div[@class="availabilitydeliverytime"]/text()').get().strip().replace('/', '')
            yield response.follow(url=link.get(), callback=self.parse_item, meta={'items': items})
        last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
        last_page_number = int(last_pagination_link.split('=')[-1])
        for i in range(2, last_page_number + 1):
            url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
            yield response.follow(url=url, callback=self.parse)

    def parse_item(self, response):
        items = response.meta['items']
        items['ItemEAN'] = response.xpath('//div[@class="productean"]/text()').get().strip()
        yield items
Explanation
First, within items.py, we create a class called TopartItem. Because it inherits from scrapy.Item, we can create a field object for each field of our items object: any field we want, we give a name and create with scrapy.Field().
Within the spider script we have to import this class. from ..items is a relative import, meaning: from the parent directory of the spider script, take something from items.py.
Now the code will look slightly familiar. First, to create our specific items dictionary within the for loop, we use items = TopartItem().
Adding to the items dictionary works like any other Python dictionary; the keys are the fields we created in items.py.
The variable link is the link to the specific page. We then grab the data we want, which you've seen before.
Once we've populated our items dictionary, we still need to grab the product EAN number from the individual pages. We do this by following the link; the callback is the function we want to send the HTML of the individual page to.
meta={'items': items} is how we transfer our items dictionary to the function parse_item: we create a meta dictionary with a key called items whose value is the items dictionary we just created.
We then create the function parse_item. To access that items dictionary we go through response.meta, which holds the meta dictionary we created when we made the request in the previous function. response.meta['items'] is how we access our items dictionary, which we again call items, as before.
Now we can populate the items dictionary, which already holds data from the previous function, and add the product EAN number to it. We then finally yield that items dictionary to tell Scrapy we are done adding data for this particular card.
To summarise the workflow: in the parse function we loop over the cards, and on every iteration we extract the four pieces of data first, then make Scrapy follow the specific link and add the fifth piece of data before moving on to the next card in the original HTML document.
Additional Information
Note: use get() if you want to grab only one piece of data, and getall() for more than one. These replace extract_first() and extract(); if you look at the Scrapy docs, they recommend this. get() is a bit more concise, and with extract() you weren't always sure whether you'd get a string or a list as the extracted data; getall() will always give you a list.
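A quick illustration of the difference (the selector is arbitrary):

links = response.css('a::attr(href)')
links.get()      # first match as a str, or None if nothing matched
links.getall()   # every match, always a list of str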
I recommend looking up other examples of items in other Scrapy scripts by searching GitHub or other websites. Once you understand the workflow, read the items page in the docs carefully; it's clear but not example-friendly, and I think it becomes more understandable once you've written scripts with items a few times.
Updated next page links
I've replaced the code you had for next_page with a more robust way of getting all the pages.
last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
last_page_number = int(last_pagination_link.split('=')[-1])
for i in range(2, last_page_number + 1):
    url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
    yield response.follow(url=url, callback=self.parse)
Here we take the last pagination link and parse the page number out of it. I did it this way in case some categories have more than three pages.
We then run a for loop, creating the URL for each iteration. The URL contains what's called an f-string (f''), which lets us plant a variable inside the string, so the loop can insert a number (or anything else) into the URL. We plant the number 2 first, which gives us the link to the second page, and go up through the last page; the end of the range is last_page_number + 1 because range() stops one short of its end value, as the short illustration below shows. Each of those page URLs is then followed too.
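As a quick illustration of the range arithmetic (with a made-up last page number):

last_page_number = 3  # example value parsed from the last pagination link
urls = [f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
        for i in range(2, last_page_number + 1)]
# range(2, 4) produces 2 and 3, so this builds the URLs for pages 2 and 3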

Scrapy spider returns no items data

My Scrapy script seems not to follow links, which means it doesn't extract data from each of them (to pass some content on as Scrapy items).
I am trying to scrape a lot of data from a news website. I managed to copy/write a spider that, as I assumed, should read links from a file (I've generated it with another script), put them in the start_urls list, start following these links to extract some data, pass it on as items, and also write each item's data to a separate file (that last part is actually for another question).
After running scrapy crawl PNS, the script goes through all the links from start_urls but does nothing more: it fetches the links read from the start_urls list (I'm getting "GET link" messages in bash) but seems not to enter them, read more links to follow, or extract data.
import scrapy
import re
from ProjectName.items import ProjectNameArticle

class ProjectNameSpider(scrapy.Spider):
    name = 'PNS'
    allowed_domains = ['www.project-domain.com']
    start_urls = []
    with open('start_urls.txt', 'r') as file:
        for line in file:
            start_urls.append(line.strip())

    def parse(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)

    def parse_news(self, response):
        data_dic = ProjectNameArticle()
        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
Expected result:
The script opens the start_urls.txt file, reads its lines (every line contains a single link), and puts these links into the start_urls list;
for each link opened, the spider extracts deeper links to be followed (about 50-200 links for each start_urls link);
the followed links are the main target, from which I want to extract specific data: article title, date, time, and text.
For now, never mind writing each Scrapy item to a distinct .txt file.
Actual result:
Running my spider triggers a GET for each start_urls link (it goes through around 150,000 of them) but doesn't build a list of deeper links, nor enter them to extract any data.
Dude, I have been coding in Python Scrapy for a long time and I hate using start_urls.
You can simply use start_requests, which is very easy to read and also very easy for beginners to learn:
import re
import scrapy
from scrapy import Request
from ProjectName.items import ProjectNameArticle

class ProjectNameSpider(scrapy.Spider):
    name = 'PNS'
    allowed_domains = ['www.project-domain.com']

    def start_requests(self):
        with open('start_urls.txt', 'r') as file:
            for line in file:
                yield Request(line.strip(), callback=self.my_callback_func)

    def my_callback_func(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)

    def parse_news(self, response):
        data_dic = ProjectNameArticle()
        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
I have also never used the Item class and find it useless too.
You can simply have data_dic = {} instead of data_dic = ProjectNameArticle().
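For instance, a minimal sketch of parse_news yielding a plain dict (get(default='') keeps .strip() from failing when an element is missing):

def parse_news(self, response):
    # same extraction as above, but yielding a plain dict instead of an Item
    yield {
        'article_date': response.css('div.article__date::text').get(default='').strip(),
        'article_time': response.css('span.article__time::text').get(default='').strip(),
        'article_title': response.css('h3.article__title::text').get(default='').strip(),
        'article_url': response.url,
    }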

Crawler skipping content of the first page

I've created a crawler which parses certain content from a website.
Firstly, it scrapes links to the categories from the left-hand sidebar.
Secondly, it harvests all the links spread through the pagination that lead to the profile pages.
And finally, going to each profile page, it scrapes name, phone and web address.
So far it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first. I suppose there might be a way to get around this. Here is the complete code I am trying with:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # links to the categories from the left-hand sidebar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # the links spread through the pagination, leading to the profile pages

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so the expression tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases by adding a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used for persisting cookies and other headers across requests, which is not necessary for your script. You can just use requests.get and have the same results.

Using scrapy to find specific text from multiple websites

I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't work out how to add the specific keyword to be searched for. The script needs to find the keyword and report in which link it was found. Could anyone point me to where I could read more about this?
I have been reading Scrapy's documentation, but I can't seem to find this.
Thank you.
import scrapy
from scrapy import Request

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # ... parsing data from the webpage ...
        pass
You'll need to use a parser or a regex to find the text you are looking for inside the response body.
Every Scrapy callback method receives the response body inside the response object, which you can check with response.body (for example inside the parse method). Then you'll have to use a regex or, better, XPath or CSS selectors to navigate to your text, knowing the structure of the page you crawled.
Scrapy lets you use the response object as a Selector, so you can get the title of the page with response.xpath('//head/title/text()'), for example.
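For instance, a minimal sketch of a spider that reports which pages contain a keyword (the keyword, domain and start URL are placeholders):

import scrapy

class KeywordSpider(scrapy.Spider):
    name = 'keyword'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    keyword = 'some keyword'  # placeholder: the text to look for

    def parse(self, response):
        # join the page's visible text nodes and search them for the keyword,
        # reporting the URL where it was found
        page_text = ' '.join(response.xpath('//body//text()').getall())
        if self.keyword in page_text:
            yield {'url': response.url, 'keyword': self.keyword}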
Hope it helped.
