Scrapy XPath: Loop over two different blocks and combine results - python-3.x

I am scraping this website: https://www.mrlodge.de/wohnungen/
For that I wrote a crawler which works well and extracts all the information from each offer. Each offer also has a details page; its URL is returned by this XPath expression:
x = listings.xpath('//form/input[@name="name_url"]/@value').getall()
The problem I am facing is that this input is inside the offer block but not inside the div tags I am already looping over. I have tried the following, but I only get the first element for detail_url. There has to be a way to include it inside the for loop, but I just don't see how.
Please help
def parse(self, response):
    json_response = json.loads(response.body)
    listings = Selector(text=json_response.get('list'))
    x = listings.xpath('//form/input[@name="name_url"]/@value').get()
    for listing in listings.xpath("//div[@class='mrlobject-list__item__content']"):
        yield {
            'title': listing.xpath('.//div[@class="obj-smallinfo"]/text()').get(),
            'rent': listing.xpath(".//span[@class='obj-rent']/text()").get(),
            'room': listing.xpath(".//span[@class='obj-room']/text()").get(),
            'area': listing.xpath(".//span[@class='obj-area']/text()").get(),
            'info': listing.xpath(".//div[@class='object-title']/text()").get(),
            'detail_url': x
        }
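One possible way to combine the two blocks (a sketch, not from the original question, assuming each offer contains exactly one name_url form and that the forms appear in the same order as the content divs) is to collect both node lists and pair them up with zip:

def parse(self, response):
    json_response = json.loads(response.body)
    listings = Selector(text=json_response.get('list'))
    # Both lists come from the same HTML snippet, so pair them by position
    detail_urls = listings.xpath('//form/input[@name="name_url"]/@value').getall()
    items = listings.xpath("//div[@class='mrlobject-list__item__content']")
    for listing, detail_url in zip(items, detail_urls):
        yield {
            'title': listing.xpath('.//div[@class="obj-smallinfo"]/text()').get(),
            # ... other fields as in the original ...
            'detail_url': detail_url,
        }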

Related

I am trying to join a URL in scrapy but unable to do so

I am trying to fetch the name (i.e. the id and name) from one website and want to append that variable to another link. For example, in the name variable I get /in/en/books/1446502-An-Exciting-Day (there are many records), and I then want to append the name variable to 'https://www.storytel.com' to fetch the data specific to that book. I also want to put a condition on a_name: if response.css('span.expandAuthorName::text') is not available, put '-'; otherwise fetch the name.
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = 'brickset-spider'
    start_urls = ['https://www.storytel.com/in/en/categories/1-Children?pageNumber=100']

    def parse(self, response):
        # for quote in response.css('div.gridBookTitle'):
        #     item = {
        #         'name': quote.css('a::attr(href)').extract_first()
        #     }
        #     yield item
        urls = response.css('div.gridBookTitle > a::attr(href)').extract()
        for url in urls:
            url = ['https://www.storytel.com'].urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            'a_name': response.css('span.expandAuthorName::text').extract_first()
        }
I am trying to append with "https://www.storytel.com".urljoin(url) but I am getting an error for it. Being new to Scrapy I have tried many things but am unable to resolve the issue. The error I get is: in line 15, list object has no attribute 'urljoin'. Any leads on how to overcome this? Thanks in advance.
Check with this solution.
for url in urls:
    url = 'https://www.storytel.com' + url
    yield scrapy.Request(url=url, callback=self.parse_details)
Let me know if it helps.
url = ['https://www.storytel.com'].urljoin(url)
Here you are trying to call urljoin on a list of strings. If you want to append a given url (which is a string) to the base string (https://etc...), you can simply concatenate them:
full_url = "https://www.storytel.com" + url
Note that "https://www.storytel.com".join(url) would not do what you want: str.join inserts the base string between every character of url. You can check the docs about strings (specifically 'join') here: https://docs.python.org/3.8/library/stdtypes.html#str.join
EDIT: also, strings have no urljoin method; urljoin lives in urllib.parse (and Scrapy offers response.urljoin).
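For completeness, here is a minimal sketch (not from the original answers) of the loop using Scrapy's response.urljoin, together with the '-' fallback for a_name that the asker mentioned; the selector strings are taken from the question:

def parse(self, response):
    # Each href is relative, e.g. /in/en/books/1446502-An-Exciting-Day
    for href in response.css('div.gridBookTitle > a::attr(href)').extract():
        # response.urljoin resolves the relative href against the page URL
        yield scrapy.Request(url=response.urljoin(href), callback=self.parse_details)

def parse_details(self, response):
    author = response.css('span.expandAuthorName::text').extract_first()
    # Fall back to '-' when the author span is missing
    yield {'a_name': author if author else '-'}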

Scraping 'next' page after finishing in the main one using Rules

I'm trying to make a spider that scrapes products from a page and, when finished, scrapes the next page in the catalog, and the next one after that, and so on.
I got all the products from a page (I'm scraping Amazon) with
rules = {
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[contains(@class, "a-link-normal") and contains(@class, "a-text-normal")]')),
         callback='parse_item', follow=False)
}
And that works just fine. The problem is that I should go to the 'next' page and keep scraping.
What I tried to do is a rule like this
rules = {
    # Next Button
    Rule(LinkExtractor(allow=(), restrict_xpaths=('(//li[@class="a-normal"]/a/@href)[2]'))),
}
The problem is that the XPath returns (for example, from this page: https://www.amazon.com/s?k=mac+makeup&lo=grid&page=2&crid=2JQQNTWC87ZPV&qid=1559841911&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_2)
/s?k=mac+makeup&lo=grid&page=3&crid=2JQQNTWC87ZPV&qid=1559841947&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_3
Which would be the URL for the next page but without the www.amazon.com.
I think that my code is not working because I'm missing the www.amazon.com before the url above.
Any idea how to make this work? Maybe the way I went in doing this is not the right one.
Try using urljoin.
link = "/s?k=mac+makeup&lo=grid&page=3&crid=2JQQNTWC87ZPV&qid=1559841947&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_3"
new_link = response.urljoin(link)
The following spider is a possible solution. The main idea is to use the parse_links function to get the links to the individual pages (whose responses are handled by the parse function), and to also yield the next-page request back to parse_links until you've crawled through all the pages.
class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = ['https://www.amazon.com/s?k=mac+makeup&lo=grid&crid=2JQQNTWC87ZPV&qid=1559870748&sprefix=MAC+mak%2Caps%2C312&ref=sr_pg_1']

    wrapper_xpath = '//*[@id="search"]/div[1]/div[2]/div/span[3]/div[1]/div'  # Product wrapper
    link_xpath = './/div/div/div/div[2]/div[2]/div/div[1]/h2/a/@href'  # Link xpath
    np_xpath = '(//li[@class="a-normal"]/a/@href)[2]'  # Next page xpath

    def parse_links(self, response):
        for li in response.xpath(self.wrapper_xpath):
            link = li.xpath(self.link_xpath).extract_first()
            link = response.urljoin(link)
            yield scrapy.Request(link, callback=self.parse)

        next_page = response.xpath(self.np_xpath).extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse_links)
        else:
            print("next_page is none")

Scrapy Extract method yields a Cannot mix str and non-str arguments error

I am in the middle of learning Scrapy right now and am building a simple scraper for a real estate site. With this code I am trying to scrape all of the URLs for the real estate listings of a specific city. I have run into the following error with my code: "Cannot mix str and non-str arguments".
I believe I have isolated my problem to the following part of my code
props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()
If I use the extract_first() method instead of extract() in the props assignment, the code kind of works: it grabs the first property link on each page. However, that is ultimately not what I want. I believe the XPath itself is correct, since the code runs when I use extract_first().
Can someone explain what I am doing wrong here? I have listed my full code below
import scrapy
from scrapy.http import Request

class AdvancedSpider(scrapy.Spider):
    name = 'advanced'
    allowed_domains = ['www.realtor.com']
    start_urls = ['http://www.realtor.com/realestateandhomes-search/Houston_TX/']

    def parse(self, response):
        props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()
        for prop in props:
            absolute_url = response.urljoin(props)
            yield Request(absolute_url, callback=self.parse_props)
        next_page_url = response.xpath('//a[@class = "next"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

    def parse_props(self, response):
        pass
Please let me know if I can clarify anything.
You are passing props list of strings to response.urljoin() but meant prop instead:
for prop in props:
    absolute_url = response.urljoin(prop)
Alecxe is right; it was a simple oversight in the spelling of the iterator in your loop. You can use the following notation:
for prop in response.xpath('//div[@class = "address ellipsis"]/a/@href').extract():
    yield scrapy.Request(response.urljoin(prop), callback=self.parse_props)
It's cleaner, and you're not instantiating absolute_url on every loop iteration. On a larger scale, this would help you save some memory.

Using multiple parsers while scraping a page

I have searched some of the questions regarding this topic but I couldn't find a solution to my problem.
I'm currently trying to use multiple parsers on a site, depending on the product I want to search for. After trying some methods, I ended up with this:
With this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
That gets into my normal parse_item.
What I want to do with this parse_item is branch on the item's category (laptop, tablet, etc.):
def parse_item(self, response):
    # I get the item's category for the if/else
    category = re.sub('Back to search results for |"', '', response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # Get the product link, for example (https://www.amazon.com/Lenovo-T430s-Performance-Professional-Refurbished/dp/B07L4FR92R/ref=sr_1_7?s=pc&ie=UTF8&qid=1545829464&sr=1-7&keywords=laptop)
    urlProducto = response.request.url
    # This can be done in a nicer way, just trying out if it works atm
    if category == 'Laptop':
        yield response.follow(urlProducto, callback = parse_laptop)
With:
def parse_laptop(self, response):
    # Parse things
Any suggestions? The error I get when running this code is "'parse_laptop' is not defined". I have already tried putting parse_laptop above parse_item and I still get the same error.
You need to refer to a method and not a function, so just change it like this:
yield response.follow(urlProducto, callback = self.parse_laptop)
yield response.follow(urlProducto, callback = parse_laptop)
This is your request, and here is your function: def parse_laptop(self, response). You have probably noticed that your parse_laptop function requires the self object,
so please modify your request to:
yield response.follow(urlProducto, callback = self.parse_laptop)
This should do the work.
Thanks.
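As a side note (a sketch, not from the original answers): since urlProducto is the URL of the page already being parsed, you could also dispatch to the category-specific parser directly instead of issuing a second request for the same page, for example via a dict of parsers:

def parse_item(self, response):
    category = ...  # extracted as in the question
    # Hypothetical mapping from category name to parser method
    parsers = {'Laptop': self.parse_laptop}
    parser = parsers.get(category)
    if parser:
        # Delegate to the category-specific parser on the same response
        yield from parser(response)

def parse_laptop(self, response):
    # Parse laptop-specific fields here
    yield {'url': response.url}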

Crawler skipping content of the first page

I've created a crawler which parses certain content from a website.
Firstly, it scrapes links to the categories from the left-side bar.
Secondly, it harvests all the links spread through the pagination connected to the profile pages.
And finally, going to each profile page, it scrapes name, phone and web address.
So far, it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there might be some way to get around this. Here is the complete code I am trying with:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # links to the categories from the left-side bar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # the links spread through pagination connected to the profile pages

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so this expression: tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases if you add a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly useful for persisting cookies and other headers across requests, which is not necessary for your script. You can just use requests.get and get the same results.
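If the first page of each category is also being skipped because the pagination list only links to the following pages, a similar tweak (again a sketch, not part of the original answer) is to scrape the category page itself in next_pagelink before following its pagination links:

def next_pagelink(process_links):
    response = requests.get(process_links).text
    tree = html.fromstring(response)
    # Harvest profiles from the current (first) page before paginating
    profile_pagelink(process_links)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)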
