Using multiple parsers while scraping a page - python-3.x

I have searched some of the questions regarding this topic, but I couldn't find a solution to my problem.
I'm currently trying to use multiple parsers on a site, depending on the product I want to search for. After trying some methods, I ended up with this:
With this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
That gets into my normal parse_item.
What I want to do, with this parse_item (by checking the item's category, e.g. laptop, tablet, etc.), is this:
def parse_item(self, response):
    # I get the item's category for the if/else
    category = re.sub('Back to search results for |"', '', response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # Get the product link, for example (https://www.amazon.com/Lenovo-T430s-Performance-Professional-Refurbished/dp/B07L4FR92R/ref=sr_1_7?s=pc&ie=UTF8&qid=1545829464&sr=1-7&keywords=laptop)
    urlProducto = response.request.url
    # This can be done in a nicer way, just trying out if it works atm
    if category == 'Laptop':
        yield response.follow(urlProducto, callback = parse_laptop)
With:
def parse_laptop(self, response):
    # Parse things
Any suggestions? The error I get when running this code is 'parse_laptop' is not defined. I have already tried putting parse_laptop above parse_item and I still get the same error.

You need to refer to a method and not a function, so just change it like this:
yield response.follow(urlProducto, callback = self.parse_laptop)
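For context, here is a minimal sketch of how the two callbacks might sit together in a spider. The spider name, search_url and keyword handling are assumptions for illustration, not from the original question:

import re
import scrapy
from scrapy import Request

class ProductSpider(scrapy.Spider):  # hypothetical spider name
    name = 'productos'
    search_url = 'https://www.example.com/s?k={}'  # placeholder search URL

    def start_requests(self):
        with open('productosABuscar.txt', 'r') as txtfile:
            keywords = [line.strip() for line in txtfile if line.strip()]
        for keyword in keywords:
            yield Request(self.search_url.format(keyword), callback=self.parse_item)

    def parse_item(self, response):
        category = re.sub(
            'Back to search results for |"', '',
            response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first() or '')
        if category == 'Laptop':
            # self.parse_laptop is a bound method, so Scrapy can call it with the response.
            # Re-requesting the same URL may be dropped by the dupefilter unless dont_filter=True.
            yield response.follow(response.request.url, callback=self.parse_laptop)

    def parse_laptop(self, response):
        # Parse laptop-specific fields here
        pass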

yield response.follow(urlProducto, callback = parse_laptop)
This is the request, and here is your function: def parse_laptop(self, response). As you have probably noticed, your parse_laptop function requires the self object, i.e. it is a method of the spider.
So please modify your request to:
yield response.follow(urlProducto, callback = self.parse_laptop)
This should do the work.
Thanks.

Related

RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

I have always resisted using asyncio within my code, but using it might help with some performance issues that I'm having.
Here is my scenario:
An end user provides a list of news sites to scrape
Each element is passed to an Article Class
A valid article is passed to an Extraction Class
The Extraction Class passes data to a NewsExtraction Class
90% of the time this flow is flawless, but on occasion one of the 12 functions in the NewsExtraction Class fails to extract data that exists in the HTML provided. It seems that my code is "stepping on itself," which causes the data element not to be parsed. When I rerun the code, all the elements are parsed correctly.
The NewsExtraction Class has this function get_article_data_elements, which is called from the Extraction Class.
The function get_article_data_elements calls these methods:
published_date = self._extract_article_published_date()
modified_date = self._extract_article_modified_date()
title = self._extract_article_title()
description = self._extract_article_description()
keywords = self._extract_article_key_words()
tags = self._extract_article_tags()
authors = self._extract_article_author()
top_image = self._extract_top_image()
language = self._extract_article_language()
categories = self._extract_article_category()
text = self._extract_textual_content()
url = self._extract_article_url()
Each of these data elements is used to populate a Python dictionary, which is eventually passed back to the end user.
I have been trying to add asyncio code to the NewsExtraction Class, but I kept getting this error message:
RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited
I have spent the last 3 days trying to figure this issue out. I have looked at dozens of questions on Stack Overflow on this error RuntimeWarning: coroutine never awaited. I have also looked at numerous articles on using asyncio, but I cannot figure out how to use asyncio with my NewsExtraction Class, which is called from the Extraction Class.
Can someone provide me some pointers to solve my issue?
class NewsExtraction(object):
    """
    This class is used to extract common data elements from a news article
    on xyz
    """
    def __init__(self, url, soup):
        self._url = url
        self._raw_soup = soup

    truncated...

    async def _extract_article_published_date(self):
        """
        This function is designed to extract the publish date for the article being parsed.
        :return: date article was originally published
        :rtype: string
        """
        json_date_published = JSONExtraction(self._url, self._raw_soup).extract_article_published_date()
        if json_date_published is not None:
            if len(json_date_published) != 0:
                return json_date_published
            else:
                return None
        elif json_date_published is None:
            if self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")}):
                date_published = self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")})
                if len(date_published) != 0:
                    return date_published.text
                else:
                    logger.info('The HTML tag to extract the publish date for the following article was not found.')
                    logger.info(f'Article URL -- {self._url}')
                    return None

    truncated...

    async def get_article_data_elements(self):
        """
        This function is designed to extract all the common data elements from a
        news article on xyz.
        :return: dictionary of data elements related to the article
        :rtype: dict
        """
        article_data_elements = {}

        # I have tried this:
        published_date = self._extract_article_published_date().__await__()

        # and this
        published_date = self.task(self._extract_article_published_date())
        await published_date

    truncated...
I have also tried to use:
if __name__ == "__main__":
    asyncio.run(NewsExtraction.get_article_data_elements())
    # asyncio.run(self.get_article_data_elements())
I'm really banging my head on the wall with using asyncio in my news extraction code.
If this question is off base, I will be happy to delete it and keep reading about how to use asyncio correctly.
Can someone provide me some pointers to solve my issue?
Thanks in advance for any guidance on using asyncio
You are defining _extract_article_published_date and get_article_data_elements as coroutines, and these coroutines must be awaited in your code to get the result of their execution in an asynchronous way.
You can do this by creating an instance of NewsExtraction and calling these methods with the keyword await in front. The await hands execution over to other tasks in the event loop until the awaited task completes. Note that there are no threads or processes involved in this task execution; execution is only handed over while a task is not using CPU time (awaiting I/O operations or sleeping).
if __name__ == '__main__':
    extractor = NewsExtraction(...)
    # this creates the event loop and runs the coroutine
    asyncio.run(extractor.get_article_data_elements())
Inside your _extract_article_published_date you must also await the coroutines that perform requests over the network. If you are using a library for the scraping, make sure it uses async/await behind the scenes, otherwise you will not get a real performance gain from asyncio.
async def get_article_data_elements(self):
    article_data_elements = {}
    # note here that the instance is self
    published_date = await self._extract_article_published_date()
    truncated...
You must dive into the asyncio documentation to get a better understanding of these features of Python 3.7+.
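As an additional sketch (my illustration, not part of the answer above): once the extractors are genuinely asynchronous, they can also be awaited concurrently with asyncio.gather. The class and method names below mirror the question, but the bodies are placeholders, and real gains only appear if the underlying I/O is async, since .find() calls on the soup are synchronous.

import asyncio

class NewsExtractionSketch:
    """Minimal stand-in for the question's NewsExtraction class."""

    async def _extract_article_published_date(self):
        await asyncio.sleep(0.1)  # placeholder for real async I/O
        return '2021-01-01'

    async def _extract_article_title(self):
        await asyncio.sleep(0.1)  # placeholder for real async I/O
        return 'Example title'

    async def get_article_data_elements(self):
        # Await the two extractors concurrently instead of one after the other.
        published_date, title = await asyncio.gather(
            self._extract_article_published_date(),
            self._extract_article_title(),
        )
        return {'published_date': published_date, 'title': title}

if __name__ == '__main__':
    sketch = NewsExtractionSketch()
    print(asyncio.run(sketch.get_article_data_elements()))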

Scrapy x path: Loop over two different blocks and combine results

I am scraping this website: https://www.mrlodge.de/wohnungen/
Therefore I wrote a crawler which works great and extracts all the information from each offer. Each offer also has detailed information, which I can reach with the URL returned by this XPath expression:
x = listings.xpath('//form/input[@name="name_url"]/@value').getall()
The problem I am facing is that this input sits inside the offer block, but not inside the div tags that I am already looping over. I have tried the following, but I only get the first element for detail_url. There has to be a way to include it inside the for loop, but I just don't get how.
Please help
def parse(self, response):
    json_response = json.loads(response.body)
    listings = Selector(text=json_response.get('list'))
    x = listings.xpath('//form/input[@name="name_url"]/@value').get()
    for listing in listings.xpath("//div[@class='mrlobject-list__item__content']"):
        yield {
            'title': listing.xpath('.//div[@class="obj-smallinfo"]/text()').get(),
            'rent': listing.xpath(".//span[@class='obj-rent']/text()").get(),
            'room': listing.xpath(".//span[@class='obj-room']/text()").get(),
            'area': listing.xpath(".//span[@class='obj-area']/text()").get(),
            'info': listing.xpath(".//div[@class='object-title']/text()").get(),
            'detail_url': x
        }
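One possible approach, offered only as a sketch: if each name_url input appears in the rendered list in the same order as the content divs (an unverified assumption about the page markup), the two node lists can simply be zipped together inside parse:

import json
from scrapy import Selector

def parse(self, response):  # spider method, shown standalone for brevity
    json_response = json.loads(response.body)
    listings = Selector(text=json_response.get('list'))
    detail_urls = listings.xpath('//form/input[@name="name_url"]/@value').getall()
    items = listings.xpath("//div[@class='mrlobject-list__item__content']")
    # Pair each listing block with the detail URL at the same position
    # (relies on the order of the two lists matching).
    for listing, detail_url in zip(items, detail_urls):
        yield {
            'title': listing.xpath('.//div[@class="obj-smallinfo"]/text()').get(),
            'rent': listing.xpath(".//span[@class='obj-rent']/text()").get(),
            'detail_url': detail_url,
        }

Alternatively, a relative XPath from each listing up to its enclosing offer block (e.g. via the ancestor:: axis) could fetch the matching input directly, which avoids the ordering assumption.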

Scrapy Extract method yields a Cannot mix str and non-str arguments error

I am in the middle of learning Scrapy right now and am building a simple scraper for a real estate site. With this code I am trying to scrape all of the URLs for the real estate listings of a specific city. I have run into the following error with my code: "Cannot mix str and non-str arguments".
I believe I have isolated my problem to the following part of my code:
props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()
If I use the extract_first() method instead of extract() in the props assignment, the code kind of works: it grabs the first link for the property on each page. However, this ultimately is not what I want. I believe I have the XPath call correct, as the code runs when I use the extract_first() method.
Can someone explain what I am doing wrong here? I have listed my full code below.
import scrapy
from scrapy.http import Request

class AdvancedSpider(scrapy.Spider):
    name = 'advanced'
    allowed_domains = ['www.realtor.com']
    start_urls = ['http://www.realtor.com/realestateandhomes-search/Houston_TX/']

    def parse(self, response):
        props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()
        for prop in props:
            absolute_url = response.urljoin(props)
            yield Request(absolute_url, callback=self.parse_props)
        next_page_url = response.xpath('//a[@class = "next"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

    def parse_props(self, response):
        pass
Please let me know if I can clarify anything.
You are passing the props list of strings to response.urljoin(), but you meant prop instead:
for prop in props:
    absolute_url = response.urljoin(prop)
Alecxe is right; it was a simple oversight in the name of the loop variable. You can also use the following notation:
for prop in response.xpath('//div[@class = "address ellipsis"]/a/@href').extract():
    yield scrapy.Request(response.urljoin(prop), callback=self.parse_props)
It's cleaner, and you're not instantiating "absolute_url" on every loop iteration; on a larger scale this would help you save some memory.

Is it possible to do (kind of) Polymorphism with scrapy

I'm currently having some issues trying to adapt my Scrapy program. What I'm trying to do is make a different parser work depending on the "site" I'm on.
Currently I have this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
I want to find a way to call different parsers for extracting the data from the page, depending on which keyword I get from the txt file.
Is there a way to accomplish this?
What about choosing the callback based on the keyword you get inside start_requests? Something like:
def start_requests(self):
    keyword_callback = {
        'keyword1': self.parse_keyword1,
        'keyword2': self.parse_keyword2,
    }
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword), callback=keyword_callback[keyword])
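One small caveat worth adding (my note, not part of the answer above): readlines() keeps the trailing newline on each line, so the dictionary lookup can miss. Stripping each keyword and using .get() with a fallback guards against that; parse_default below is a hypothetical fallback method:

def start_requests(self):
    # Inside the spider class; Request comes from scrapy (from scrapy import Request).
    keyword_callback = {
        'keyword1': self.parse_keyword1,
        'keyword2': self.parse_keyword2,
    }
    with open('productosABuscar.txt', 'r') as txtfile:
        keywords = [line.strip() for line in txtfile if line.strip()]
    for keyword in keywords:
        # Fall back to a generic parser when the keyword has no dedicated callback.
        callback = keyword_callback.get(keyword, self.parse_default)
        yield Request(self.search_url.format(keyword), callback=callback)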

Scrapy / Python - executing several yields

In my parse method, I'd like to call 3 methods from a SpiderClass that I inherit from.
At first, I'd like to parse the XPaths, then clean the data, then assign the data to an item instance and hand it over to the pipeline.
I'll try it with little code and just ask for the principles: cleanData and assignProductValues are never called - why?
def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        yield scrapy.Request(url, callback=super(MyclassSpider, self).scrapeProduct)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).cleanData)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).assignProductValues)
I understand that I create a generator when using yield, but I don't understand why the 2nd and 3rd yields are not being called after the first one, or how I can get them to be called.
--
Then I tried another way: I don't want to make three requests to the website - just one, and then work with the data.
def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        item = MyItem()
        response = scrapy.Request(url, meta={'item': item}, callback=super(MyclassSpider, self).scrapeProduct)
        super(MyclassSpider, self).cleanData(response)
        super(MyclassSpider, self).assignProductValues(response)
        yield response
What happens here is that scrapeProduct gets called, which might take a while (I've got a 5 second delay).
But then cleanData and assignProductValues are called right away, about 30 times (as often as the for loop iterates).
How can I execute the three methods one by one with only one request towards the website?
I guess that after you yield the first request, the other two are getting filtered by the dupefilter. Check your log. If you don't want them to be filtered, pass dont_filter=True to the Request objects.
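Beyond the dupefilter point, here is a sketch of the single-request pattern the question is aiming for. It assumes (my assumption, not confirmed by the question) that scrapeProduct, cleanData and assignProductValues can be refactored to take and return an item rather than acting as request callbacks; handle_product is a hypothetical method name:

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.handle_product)

def handle_product(self, response):
    # One download, then the three processing steps run in order on that response.
    item = self.scrapeProduct(response)    # build the item from the page
    item = self.cleanData(item)            # clean the extracted fields
    item = self.assignProductValues(item)  # finalize the item
    yield item

Since MyclassSpider inherits those methods, plain self.scrapeProduct(...) already resolves to the parent implementation; super() is only needed if the subclass overrides them.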
