Scrapy crawl saved links out of csv or array - python-3.x

import scrapy

class LinkSpider(scrapy.Spider):
    name = "articlelink"
    allowed_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    BASE_URL = 'https://www.topart-online.com/de/'

    # scraping the product cards of this specific category
    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            yield {
                'links': a.xpath('@href').get()
            }
        # follow the pagination link, if there is one
        next_page_url = response.xpath('//a[@class="page-link"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
This is my spider, which crawls all pages of that category and saves all the product links into a CSV file when I run "scrapy crawl articlelink -o filename.csv" on my server.
Now I have to crawl each link in that CSV file for specific information that isn't contained in the product-link card.
How do I start?

So I'm glad you've gone and had a look at how to scrape the next pages.
Now, with regard to the productean number, this is where yielding a plain dictionary becomes cumbersome. In most Scrapy scripts you are better off using items; they are suited to grabbing elements from different pages, which is exactly what you want to do here.
Scrapy extracts data from HTML, and the mechanism it recommends for holding that data is what it calls items. Scrapy accepts a few different ways of putting data into some form of object. You can use a bog-standard dictionary, but for data that needs modification or isn't clean (which is almost anything other than a very structured dataset) you should at least use items. An item provides a dictionary-like object.
To use the items mechanism in the spider script, we instantiate the item class to create an item object. We then populate that item dictionary with the data we want and, in your particular case, share it across callbacks so we can keep adding data from a different page.
In addition, we have to declare the item fields, which act as the keys of the item dictionary. We do this in items.py, located in the project folder.
Code Example
items.py
import scrapy

class TopartItem(scrapy.Item):
    title = scrapy.Field()
    links = scrapy.Field()
    ItemSKU = scrapy.Field()
    Delivery_Status = scrapy.Field()
    ItemEAN = scrapy.Field()
spider script
import scrapy
from ..items import TopartItem

class LinkSpider(scrapy.Spider):
    name = "link"
    allowed_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    custom_settings = {'FEED_EXPORT_FIELDS': ['title', 'links', 'ItemSKU', 'ItemEAN', 'Delivery_Status']}

    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            items = TopartItem()
            link = a.xpath('@href')
            items['title'] = a.xpath('.//div[@class="sn_p01_desc h4 col-12 pl-0 pl-sm-3 pull-left"]/text()').get().strip()
            items['links'] = link.get()
            items['ItemSKU'] = a.xpath('.//span[@class="sn_p01_pno"]/text()').get().strip()
            items['Delivery_Status'] = a.xpath('.//div[@class="availabilitydeliverytime"]/text()').get().strip().replace('/', '')
            # follow the product link and finish populating the item there
            yield response.follow(url=link.get(), callback=self.parse_item, meta={'items': items})
        # build the remaining pagination URLs from the last page number
        last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
        last_page_number = int(last_pagination_link.split('=')[-1])
        for i in range(2, last_page_number + 1):
            url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
            yield response.follow(url=url, callback=self.parse)

    def parse_item(self, response):
        items = response.meta['items']
        items['ItemEAN'] = response.xpath('//div[@class="productean"]/text()').get().strip()
        yield items
Explanation
First, within items.py, we create a class called TopartItem. Because it inherits from scrapy.Item, we can declare fields for our item object: for every field we want, we give it a name and create it with scrapy.Field().
Within the spider script we have to import this TopartItem class. from ..items is a relative import, meaning we take something from items.py in the parent package of the spider script.
Now the code should look slightly familiar. To create our item dictionary, called items, we use items = TopartItem() inside the for loop.
Adding to the items dictionary works like any other Python dictionary; its keys are the fields we declared in items.py.
The variable link holds the link to the specific product page. We then grab the data we want, which you have seen before.
After populating the items dictionary we still need to grab the productean number from the individual pages. We do this by following the link; the callback is the function we want the HTML of the individual page sent to.
meta={'items': items} is how we transfer our items dictionary to that parse_item function: we create a meta dictionary with a key called items whose value is the items dictionary we just created.
We then create the function parse_item. To get to the items dictionary we go through response.meta, which holds the meta dictionary we attached to the request in the previous function. response.meta['items'] gives us our items dictionary, which we again call items.
Now we can add the productean number to the items dictionary, which already contains the data from the previous function. Finally we yield that items dictionary to tell Scrapy we are done adding data for this particular card.
To summarise the workflow: in the parse function we have a loop, and on every iteration we first extract four pieces of data from the card, then make Scrapy follow that card's link and add the fifth piece of data before moving on to the next card in the original HTML document.
Additional Information
Note: use get() when you want a single piece of data and getall() when you want more than one, instead of extract_first() and extract(). The Scrapy docs recommend this. get() is a bit more concise, and with extract() you were never quite sure whether you would get a string or a list back; getall() will always give you a list.
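For example, inside a parse callback, using the page-link selector from the spider above (just an illustrative snippet):

# get() returns a single string, or None if nothing matches
first_page_link = response.xpath('//a[@class="page-link"]/@href').get()
# getall() always returns a list of strings
all_page_links = response.xpath('//a[@class="page-link"]/@href').getall()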
I recommend looking up other examples of items in other Scrapy scripts, by searching GitHub or other websites. Once you understand the workflow, read the items page in the docs carefully. It's clear but not example-friendly; I think it becomes more understandable once you've written scripts with items a few times.
Updated next page links
I've replaced the code you had for next_page with a more robust way of getting all the data.
last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
last_page_number = int(last_pagination_link.split('=')[-1])
for i in range(2, last_page_number + 1):
    url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
    yield response.follow(url=url, callback=self.parse)
Here we grab the last pagination link and extract the last page number from it. I did it this way in case a category has more than three pages.
We then loop and build the URL for each iteration. The URL is an f-string, f'', which lets us plant a variable inside the string, so the loop can insert each page number into the URL. We plant the number 2 first, which gives us the link to the second page, and so on up to the last page; the end of the range is last_page_number + 1 because range() stops one short of its end value. Each generated URL is then followed with the same parse callback.

Related

Problem exporting Web Url results into CSV using beautifulsoup3

Problem: I tried to export the results (Name, Address, Phone) to CSV, but the CSV code is not returning the expected results.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import json
import re
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be viewed in a pretty format by using BeautifulSoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
#Find all Companies Name under h2tag
company_name_list_heading = soup.findAll("h2")
#Find all Address on page Name under a tag
company_name_list_items = soup.findAll("a",{"class":"address"})
#Find all Phone numbers on page Name under ul
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
# Create for loop to print out all company Data
for company_address in company_name_list_items:
    print(company_address.prettify())

# Create for loop to print out all company Names
for company_name in company_name_list_heading:
    print(company_name.prettify())

# Create for loop to print out all company Numbers
for company_numbers in company_name_list_numbers:
    print(company_numbers.prettify())
Below is the code to export the results (name, address & phone number) to CSV:
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "Address", "Phone"])
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
company_name_list_heading = soup.findAll("h2")
company_name_list_items = soup.findAll("a",{"class":"address"})
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
Here is the for loop to loop over data.
for company_name in company_name_list_heading:
    names = company_name.contents[0]

for company_numbers in company_name_list_numbers:
    names = company_numbers.contents[1]

for company_address in company_name_list_items:
    address = company_address.contents[1]

writer.writerow([name, Address, Phone])
outfile.close()
You need to work on understanding how for loops work, and also the difference between strings, variables and other data types. You also need to work on applying what you have seen in other Stack Overflow questions. This is essentially the same as your other two questions, just a different site to scrape (I didn't flag it as a duplicate, since you're new to Stack Overflow and web scraping, and I remember what it was like to learn). I'll still answer your questions, but eventually you need to be able to find answers on your own and adapt and apply them; coding isn't painting by numbers. That said, I can see you are adapting some of it. Good job finding the "div", {"class": "CompanyInfo"} tag to get the company info.
The data you are pulling (name, address, phone) needs to be collected inside a single loop over the div class="CompanyInfo" elements. You could in theory keep it the way you have it now, collect everything into lists and then write to the CSV from those lists, but if any piece of data is missing the rows can get out of sync and end up paired with the wrong company.
Here's what the full code looks like. Notice that the variables are captured within the loop and then written; it then moves on to the next CompanyInfo block and continues.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be viewed in a pretty format by using BeautifulSoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
# Now loop through those elements
for element in product_name_list:
    # Takes 1 block of the "div",{"class":"CompanyInfo"} tag and finds/stores name, address, phone
    name = element.find('h2').text
    address = element.find('address').text.strip()
    phone = element.find("ul", {"class": "submenu"}).text.strip()
    # writes the name, address, phone to csv
    writer.writerow([name, address, phone])
    # now will go to the next "div",{"class":"CompanyInfo"} tag and repeats

outfile.close()

Scrapy: using multiple image fields, store results

I am using Scrapy to capture several images and save them in separate fields. Because of dependencies on other systems, I do not want all the image results (url, path, checksum) stored in one field.
image_url1
image_url2
The results (url, path, checksum) are stored in:
images1
images2
I finally have it working so that it downloads the two pictures.
The results for image_url1 are stored in images1, but the results for image_url2 are not stored in images2, and I don't know how to make clear that they should be. If I run the following code, it puts the two sets of results (url, path, checksum) together (the results for image_url1 and image_url2 one after the other, separated by a space). That field cannot be inserted into MySQL, so it fails.
class GooutImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_url1']:
            yield scrapy.Request(image_url)
        for image_url in item['image_url2']:
            yield scrapy.Request(image_url)
I already made the following edit in settings;
IMAGES_URLS_FIELD = 'image_url1'
IMAGES_RESULT_FIELD = 'images1'
I can't find anything on how to work with multiple image fields.
*** Edit/solution after feedback
After the suggestion to do this in item_completed, I came up with the following:
class GooutImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        if item['image_url1']:
            for image_url in item['image_url1']:
                yield scrapy.Request(image_url)
        if item['image_url2']:
            for image_url in item['image_url2']:
                yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        for download_status, result in results:
            if result['url'] == item['image_url1'][0]:
                item['images1'] = result['path']
            if result['url'] == item['image_url2'][0]:
                item['images2'] = result['path']
        return item
Don't know if this is the best way to do it, but it works. Feedback is appreciated. Thanks!
The item_completed() method is what does this storing, so that's what you'll need to override.
You will need to know where to store the data though, so you'll probably want to add that information to the meta dict of the request you create in get_media_requests().
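To make that idea a little more general, here is a minimal sketch; the class name and the FIELD_MAP mapping are only illustrative, it assumes the item declares the image_url1/image_url2 and images1/images2 fields shown above, and it sorts each downloaded result into the right field by matching the URL, much like the edit above does:

import scrapy
from scrapy.pipelines.images import ImagesPipeline

# hypothetical mapping of URL fields to their result fields
FIELD_MAP = {'image_url1': 'images1', 'image_url2': 'images2'}

class MultiFieldImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # request every image URL from every mapped field
        for url_field in FIELD_MAP:
            for image_url in item.get(url_field) or []:
                yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # index the successful downloads by their source URL
        downloaded = {res['url']: res for ok, res in results if ok}
        for url_field, result_field in FIELD_MAP.items():
            # store just the file path(s) for this field, as the edit above does
            item[result_field] = [downloaded[u]['path'] for u in (item.get(url_field) or []) if u in downloaded]
        return item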

Going through specified items in a page using scrapy

I'm running into some trouble trying to enter and analyze several items within a page.
I have a certain page which contains items; the code looks something like this:
class Spider(CrawlSpider):
    name = 'spider'
    maxId = 20
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.startDomain.com']
In the start URL I have some items that all follow, in XPath, the path below (within startDomain):
def start_requests(self):
    for i in range(self.maxId):
        yield Request('//*[@id="result_{0}"]/div/div/div/div[2]/div[1]/div[1]/a/h2'.format(i), callback=self.parse_item)
I'd like to find a way to access each of these links (the ones tied to result_{number}) and then scrape the contents of that particular item.

Crawler skipping content of the first page

I've created a crawler which is parsing certain content from a website.
First, it scrapes the category links from the left sidebar.
Second, it harvests all the links spread across the pagination that lead to the profile pages.
Finally, going to each profile page, it scrapes the name, phone and web address.
So far it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there must be some way to get around this. Here is the complete code I am trying with:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # links to the category from left-sided bar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # the whole links spread through pagination connected to the profile page

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so the expression tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases by adding a condition to the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used for persisting cookies and other headers across calls, which isn't necessary for your script. You can just use requests.get and get the same results.

Pass values into scrapy callback

I'm trying to get started crawling and scraping a website to disk, but I'm having trouble getting the callback function to work the way I would like.
The code below visits the start_url and finds all the "a" tags on the site. For each of them it makes a callback, which saves the text response to disk and uses the crawlerItem to store some metadata about the page.
I was hoping someone could help me figure out how to:
- pass a unique id to each callback so it can be used as the filename when saving the file
- pass the URL of the originating page so it can be added to the metadata via the items
- follow the links on the child pages to go another level deeper into the site
Below is my code thus far:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup
from mycrawler.items import crawlerItem

class CrawlSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True),
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url
        yield item
In Settings.py
ITEM_PIPELINES = [
    'librarycrawler.files.FilesPipeline',
]
FILES_STORE = r'C:\Documents\Spider\crawler\ExtractedText'
In items.py
import scrapy
class LibrarycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    Files = scrapy.Field()
I'm not 100% sure, but I think you can't rename the files Scrapy downloads however you want; Scrapy handles the naming itself.
What you want to do looks like a job for CrawlSpider instead of Spider.
CrawlSpider by itself recursively follows every link it finds on every page, and you can set rules for which pages you want to scrape. Here are the docs.
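A minimal sketch of what that could look like (the spider name, domain and callback name are placeholders; it keeps the title-extraction idea from your scrape_page):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LibraryCrawlSpider(CrawlSpider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    # follow every link found on every page and send each response to parse_page
    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # response.url already identifies the originating page, so nothing extra needs passing
        yield {
            "title": response.xpath("//title/text()").get(),
            "url": response.url,
        }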
If you are stubborn enough to keep Spider, you can use the meta attribute on requests to pass the items along and save the links in them.
for link in soup.find_all("a"):
    item = crawlerItem()
    item['url'] = response.urljoin(link.get('href'))
    request = scrapy.Request(item['url'], callback=self.scrape_page)
    request.meta['item'] = item
    yield request
To get the item just go look for it in the response:
def scrape_page(self, response):
    item = response.meta['item']
In this specific example the item['url'] that is passed along is redundant, since you can get the current URL with response.url.
Also, it's a bad idea to use BeautifulSoup inside Scrapy, as it just slows you down; Scrapy's own selectors are well developed enough that you don't need anything else to extract data!
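For instance, the title grab in scrape_page could be done with Scrapy's own selectors instead of BeautifulSoup; a sketch, keeping the same field names as above:

def scrape_page(self, response):
    item = response.meta['item']
    # equivalent of page_soup.title.get_text(), using Scrapy's XPath selectors
    item['title'] = response.xpath('//title/text()').get()
    # file_urls is expected to be a list when the FilesPipeline is used
    item['file_urls'] = [response.url]
    yield item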
