I've created a crawler that parses certain content from a website.
First, it scrapes the category links from the left sidebar.
Second, it collects all the links spread across the pagination that lead to the profile pages.
Finally, it visits each profile page and scrapes the name, phone and web address.
So far it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there might be some way to work around this. Here is the complete code I am trying with:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # category links from the left sidebar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # paginated links leading to the profile pages

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so this expression: tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix, you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases if you add a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used for persisting cookies and other headers across requests, which isn't necessary for your script. You can just use requests.get and get the same results.
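For illustration, here is a minimal sketch of one of the functions with the Session object dropped; the rest of the script is assumed to stay the same:

def next_pagelink(process_links):
    # a plain GET is enough here; no cookies or shared headers are needed
    response = requests.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)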
Related
I am programming a scraper with Python and Scrapy. As start_urls I have a page that contains a list of products; my scraper gets the links of these products and scrapes the information of each product (I save the information in the fields of the item class in items.py). Each of these products can contain a list of variations. I need to extract the information from all the variations, save them in a list field, and then store that information in item['variations'].
def parse(self, response):
    links = response.css(css_links).getall()
    links = [self.process_url(link) for link in links]
    for link in links:
        link = urljoin(response.url, link)
        yield scrapy.Request(link, callback=self.parse_product)

def parse_product(self, response):
    items = SellItem()
    shipper = self.get_shipper(response)
    items['shipper'] = shipper
    items['weight'] = self.get_weight(response)
    items['url'] = response.url
    items['category'] = self.get_category(response)
    items['cod'] = response.css(css_cod).get()
    items['price'] = self.get_price(response)
    items['cantidad'] = response.css(css_cantidad).get()
    items['name'] = response.css(css_name).get()
    items['images'] = self.get_images(response)
    variations = self.get_variations(response)
    if variations:
        valid_urls = self.get_valid_urls(variations)
        for link in valid_urls:
            # I need to go to each of these urls, scrape information and then
            # store it in items['variations'].
You need to add a second method; call it parse_details.
Then add callback=self.parse_details when you yield the request from your first method, i.e. parse_product.
You can transfer the collected data between methods using response.meta.
Scrapy covers it in the docs:
see: https://docs.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
Also read about Request.cb_kwargs.
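For illustration, a minimal sketch of what that could look like, using the parse_details name suggested above and the cb_kwargs approach; the variation selector is a placeholder assumption, not taken from the original code:

def parse_product(self, response):
    items = SellItem()
    # ... fill the other fields exactly as in the original parse_product ...
    items['variations'] = []
    variations = self.get_variations(response)
    if variations:
        for link in self.get_valid_urls(variations):
            # hand the partially filled item to the next callback via cb_kwargs
            yield scrapy.Request(link, callback=self.parse_details,
                                 cb_kwargs={'items': items})
    else:
        yield items

def parse_details(self, response, items):
    # hypothetical selector -- replace it with the real one for the variation data
    variation = response.css('div.variation::text').getall()
    items['variations'].append(variation)
    # note: this yields the shared item once per variation page; if you need a
    # single item per product, chain the requests and yield only after the last one
    yield items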
I'm trying to make a spider that scrapes products from a page and, when finished, scrapes the next page of the catalog, then the one after that, and so on.
I got all the products from a page (I'm scraping Amazon) with
rules = {
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[contains(@class, "a-link-normal") and contains(@class, "a-text-normal")]')),
         callback='parse_item', follow=False)
}
And that works just fine. The problem is that I should go to the 'next' page and keep scraping.
What I tried to do is a rule like this:
rules = {
    # Next Button
    Rule(LinkExtractor(allow=(), restrict_xpaths=('(//li[@class="a-normal"]/a/@href)[2]'))),
}
The problem is that the XPath returns (for example, from this page: https://www.amazon.com/s?k=mac+makeup&lo=grid&page=2&crid=2JQQNTWC87ZPV&qid=1559841911&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_2)
/s?k=mac+makeup&lo=grid&page=3&crid=2JQQNTWC87ZPV&qid=1559841947&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_3
which would be the URL for the next page, but without the www.amazon.com part.
I think my code is not working because I'm missing the www.amazon.com before the URL above.
Any idea how to make this work? Maybe the way I went about this is not the right one.
Try using urljoin.
link = "/s?k=mac+makeup&lo=grid&page=3&crid=2JQQNTWC87ZPV&qid=1559841947&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_3"
new_link = response.urljoin(link)
The following spider is a possible solution. The main idea is to use the parse_links function to get the links to the individual pages and yield those requests to the parse function; you can also yield the next-page request back to the same function until you've crawled through all the pages.
class AmazonSpider(scrapy.Spider):
    name = 'amazon'  # any unique spider name works here
    start_urls = ['https://www.amazon.com/s?k=mac+makeup&lo=grid&crid=2JQQNTWC87ZPV&qid=1559870748&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_1']

    wrapper_xpath = '//*[@id="search"]/div[1]/div[2]/div/span[3]/div[1]/div'  # Product wrapper
    link_xpath = './/div/div/div/div[2]/div[2]/div/div[1]/h2/a/@href'  # Link xpath
    np_xpath = '(//li[@class="a-normal"]/a/@href)[2]'  # Next page xpath

    def parse_links(self, response):
        for li in response.xpath(self.wrapper_xpath):
            link = li.xpath(self.link_xpath).extract_first()
            link = response.urljoin(link)
            yield scrapy.Request(link, callback=self.parse)

        next_page = response.xpath(self.np_xpath).extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse_links)
        else:
            print("next_page is none")
Problem: I tried to export the results (Name, Address, Phone) into a CSV, but the CSV code is not returning the expected results.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import json
import re
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be viewed in a pretty format by using beautifulsoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
#Find all Companies Name under h2tag
company_name_list_heading = soup.findAll("h2")
#Find all Address on page Name under a tag
company_name_list_items = soup.findAll("a",{"class":"address"})
#Find all Phone numbers on page Name under ul
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
# Create for loop to print out all company Data
for company_address in company_name_list_items:
    print(company_address.prettify())

# Create for loop to print out all company Names
for company_name in company_name_list_heading:
    print(company_name.prettify())

# Create for loop to print out all company Numbers
for company_numbers in company_name_list_numbers:
    print(company_numbers.prettify())
Below is the code to export the results (name, address & phone number) into CSV:
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "Address", "Phone"])
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
company_name_list_heading = soup.findAll("h2")
company_name_list_items = soup.findAll("a",{"class":"address"})
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
Here is the for loop to loop over the data:
for company_name in company_name_list_heading:
    names = company_name.contents[0]

for company_numbers in company_name_list_numbers:
    names = company_numbers.contents[1]

for company_address in company_name_list_items:
    address = company_address.contents[1]

writer.writerow([name, Address, Phone])
outfile.close()
You need to work on understanding how for loops work, and also the difference between strings, variables and other data types. You also need to work on using what you have seen in other Stack Overflow questions and learn to apply it. This is essentially the same as your other 2 questions you already posted, just a different site you're scraping from (but I didn't flag it as a duplicate, as you're new to Stack Overflow and web scraping, and I remember what it was like to try to learn). I'll still answer your questions, but eventually you need to be able to find the answers on your own and learn how to adapt and apply them (coding isn't paint-by-numbers). I do see you are adapting some of it; good job in finding the "div",{"class":"CompanyInfo"} tag to get the company info.
The data you are pulling (name, address, phone) needs to be collected within a loop over the div class=CompanyInfo elements. You could theoretically keep it the way you have it now, by putting those values into lists and then writing to the CSV file from the lists, but there's a risk of missing data, and then your info could end up misaligned and not matched with the correct company.
Here's what the full code looks like. Notice that the variables are stored within the loop and then written. It then goes on to the next block of CompanyInfo and continues.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be viewed in a pretty format by using beautifulsoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
# Now loop through those elements
for element in product_name_list:
    # Takes one block of the "div",{"class":"CompanyInfo"} tag and finds/stores name, address, phone
    name = element.find('h2').text
    address = element.find('address').text.strip()
    phone = element.find("ul",{"class":"submenu"}).text.strip()

    # writes the name, address, phone to csv
    writer.writerow([name, address, phone])
    # now goes to the next "div",{"class":"CompanyInfo"} tag and repeats

outfile.close()
I'm running into some trouble trying to enter and analyze several items within a page.
I have a certain page which contains items in it; the code looks something like this:
class Spider(CrawlSpider):
    name = 'spider'
    maxId = 20
    allowed_domain = ['www.domain.com']
    start_urls = ['http://www.startDomain.com']
In the start url, I have some items that all follow, in XPath, the following path (within the startDomain):
def start_requests(self):
    for i in range(self.maxId):
        yield Request('//*[@id="result_{0}"]/div/div/div/div[2]/div[1]/div[1]/a/h2'.format(i),
                      callback=self.parse_item)
I'd like to find a way to access each one of these links (the ones tied to result_{number}) and then scrape the contents of that particular item.
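One possible approach, sketched under assumptions rather than taken from the original post: an XPath is not a URL, so instead of yielding it as a Request, fetch the start page, extract the href from each result_{number} block, and follow it. The href XPath and the parse_item callback below are placeholders based on the question.

def start_requests(self):
    yield Request(self.start_urls[0], callback=self.parse_results)

def parse_results(self, response):
    for i in range(self.maxId):
        # pull the actual link out of each result block, then follow it
        href = response.xpath(
            '//*[@id="result_{0}"]//a/@href'.format(i)
        ).extract_first()
        if href:
            yield Request(response.urljoin(href), callback=self.parse_item)

def parse_item(self, response):
    # scrape the contents of the individual item page here
    pass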
I'm trying to run the code below to take a list of URLs (in the file out.txt) and pull the text content from each page using XPath. The code finds the domain from the URL, then looks that domain up in a JSON file I've created that holds both the domain and its XPath. The XPath is then used to find the content.
However, right now if I run the code outside of the loop it works fine (page = 200), but if I do it inside the loop I get page = 404.
I'm sure this is a simple mistake in the loop. What am I doing wrong?
import json
import requests
from urllib.parse import urlparse
from lxml import html

URLList = open("out.txt").readlines()

for item in URLList:
    inputurl = item
    print(inputurl)
    type(inputurl)

    # this takes a URL and finds the xpath - it uses an external
    # domainlookup.json that is manually created
    # inputurl = input("PLEASE PROVIDE A URL FROM AN APPROVED DOMAIN: ")
    t = urlparse(inputurl).netloc
    domain = ('.'.join(t.split('.')[1:]))
    with open('domainlookup.json') as json_data:
        domainlookup = json.load(json_data)
        for i in domainlookup:
            if i['DOMAIN'] == domain:
                xpath = (i['XPATH'])

    # this requests the URL and scrapes the text content with the xpath
    page = requests.get(inputurl)
    tree = html.fromstring(page.content)
    content = tree.xpath(xpath)
You'll find what's wrong with your code using this code:
URLList = open("out.txt").readlines()

for item in URLList:
    inputurl = item
    print("[{0}]".format(inputurl))
As you can see from the output, you're not removing the newline character from the URL, which is why requests can't load it later. Just strip() it before using:
inputurl = item.strip()
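For example, a minimal sketch of the corrected loop (the domain lookup and scraping code from your script are assumed to stay the same):

URLList = open("out.txt").readlines()

for item in URLList:
    # remove the trailing newline (and any surrounding whitespace) before using the URL
    inputurl = item.strip()
    page = requests.get(inputurl)
    print(inputurl, page.status_code)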