Scrapy / Python - executing several yields - python-3.x

In my parse method, I'd like to call 3 methods from a SpiderClass that I inherit from.
At first, I'd like to parse the XPaths, then clean the data, then assign the data to an item instance and hand it over to the pipeline.
I'll keep the example small and just ask about the principle: cleanData and assignProductValues are never called - why?
def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        yield scrapy.Request(url, callback=super(MyclassSpider, self).scrapeProduct)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).cleanData)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).assignProductValues)
I understand that I create a generator when using yield, but I don't understand why the 2nd and 3rd yield are not executed after the first one, or how I can get them to be called.
--
Then I tried another way: I don't want to make 3 requests to the website - just one, and then work with the data.
def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        item = MyItem()
        response = scrapy.Request(url, meta={'item': item}, callback=super(MyclassSpider, self).scrapeProduct)
        super(MyclassSpider, self).cleanData(response)
        super(MyclassSpider, self).assignProductValues(response)
        yield response
What happens here is that scrapeProduct is called, and that might take a while (I've got a 5-second delay).
But then cleanData and assignProductValues are called right away, about 30 times (as often as the for loop iterates).
How can I execute the three methods one after another with only one request to the website?

I guess that after you yield the first request, the other two are filtered out by the dupefilter. Check your log. If you don't want them to be filtered, pass dont_filter=True to the Request object.
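For reference, a minimal sketch of that suggestion applied to the first snippet (assuming the inherited callbacks each accept a Response, as in the question) might look like this; note the same page is still downloaded three times:
def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        # dont_filter=True stops the dupefilter from dropping the second and
        # third requests for the same URL.
        yield scrapy.Request(url, callback=super(MyclassSpider, self).scrapeProduct, dont_filter=True)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).cleanData, dont_filter=True)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).assignProductValues, dont_filter=True)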

Related

How to get the return value of an async coroutine in Python

I am new to Python and trying to learn the asyncio module. I am frustrated with getting return values from async tasks. There is a post here that talks about this topic, but it can't tell which value is returned by which task (assuming one web page responds faster than another).
The code below is trying to fetch three web pages concurrently instead of doing it one by one.
import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            assert resp.status == 200
            return await resp.text()

def compile_all(urls):
    tasks = []
    for url in urls:
        tasks.append(asyncio.ensure_future(fetch(url)))
    return tasks

urls = ['https://python.org', 'https://google.com', 'https://amazon.com']
tasks = compile_all(urls)
loop = asyncio.get_event_loop()
a, b, c = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
print(a)
print(b)
print(c)
First, it hits a RuntimeError, although it does print some HTML documents: RuntimeError: Event loop is closed.
Second question: does this really guarantee that a, b, c will correspond to the urls list in the order urls[0], urls[1], urls[2]? (I assume that async task execution won't guarantee that.)
Third, is there any better approach, or should I use a Queue in this case? If yes, how?
Any help will be greatly appreciated.
The order of the results will correspond to the order of the urls. Take a look at the docs for asyncio.gather:
If all awaitables are completed successfully, the result is an
aggregate list of returned values. The order of result values
corresponds to the order of awaitables in aws.
To process tasks as they complete you can use asyncio.as_completed. This post has more information on how it can be used.
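To make that concrete, here is a rough sketch of the as_completed route (not from the linked post; it assumes Python 3.7+ and has each coroutine return its URL so you can tell which result belongs to which page):
import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return url, await resp.text()

async def main(urls):
    tasks = [asyncio.ensure_future(fetch(u)) for u in urls]
    # as_completed yields awaitables in completion order, not input order,
    # so the URL returned alongside the body identifies the source page.
    for completed in asyncio.as_completed(tasks):
        url, body = await completed
        print(url, len(body))

asyncio.run(main(['https://python.org', 'https://google.com', 'https://amazon.com']))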

Asyncio: the tasks are not finished properly because of sentinel issues

I'm trying to do some web scraping, as a learning exercise, using a predefined number of workers.
I'm using None as a sentinel to break out of the while loop and stop the worker.
The speed of each worker varies, and all workers are closed before the last
url is passed to gather_search_links to get the links.
I tried to use asyncio.Queue, but I had less control than with deque.
async def gather_search_links(html_sources, detail_urls):
    while True:
        if not html_sources:
            await asyncio.sleep(0)
            continue
        data = html_sources.pop()
        if data is None:
            html_sources.appendleft(None)
            break
        data = BeautifulSoup(data, "html.parser")
        result = data.find_all("div", {"data-component": "search-result"})
        for record in result:
            atag = record.h2.a
            url = f'{domain_url}{atag.get("href")}'
            detail_urls.appendleft(url)
            print("appended data", len(detail_urls))
        await asyncio.sleep(0)

async def get_page_source(urls, html_sources):
    client = httpx.AsyncClient()
    while True:
        if not urls:
            await asyncio.sleep(0)
            continue
        url = urls.pop()
        print("url", url)
        if url is None:
            urls.appendleft(None)
            break
        response = await client.get(url)
        html_sources.appendleft(response.text)
        await asyncio.sleep(8)
    html_sources.appendleft(None)

async def navigate(urls):
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
        await asyncio.sleep(0)
    nav_urls.appendleft(None)

loop = asyncio.get_event_loop()
nav_html = deque()
nav_urls = deque()
products_url = deque()
navigate_workers = [asyncio.ensure_future(navigate(nav_urls)) for _ in range(1)]
page_source_workers = [asyncio.ensure_future(get_page_source(nav_urls, nav_html)) for _ in range(2)]
product_urls_workers = [asyncio.ensure_future(gather_search_links(nav_html, products_url)) for _ in range(1)]
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
loop.run_until_complete(workers)
I'm a bit of a newbie, so this could be wrong, but I believe the issue is that all three functions (navigate(), gather_search_links(), and get_page_source()) are asynchronous tasks that can be completed in any order. However, your checks for empty deques and your use of appendleft to ensure None is the leftmost item in your deques look like they would appropriately prevent this. For all intents and purposes, the code looks like it should run correctly.
I think the issue arises at this line:
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
According to this post, the asyncio.wait function does not order these tasks according to the order they're written above; instead it fires them according to IO as coroutines. Again, your checks at the beginning of gather_search_links and get_page_source ensure that one function runs after the other, so this code should work if there is only a single worker for each function. If there are multiple workers per function, I can see issues arising where None doesn't wind up being the leftmost item in your deques. Perhaps a print statement at the end of each function showing the contents of your deques would be useful in troubleshooting this.
I guess my major question would be: why do these tasks asynchronously if you're going to write extra code because the steps must be completed synchronously? To get the HTML, you must first have the URL. To scrape the HTML, you must first have the HTML. What benefit does asyncio provide here? All three of these make more sense to me as synchronous tasks: get URL, get HTML, scrape HTML, in that order.
EDIT: It occurred to me that the main benefit of asynchronous code here is that you don't want to wait on each individual URL to respond when you fetch its HTML. What I would do in this situation is gather my URLs synchronously first, then combine the get and scrape functions into a single asynchronous function, which would be your only asynchronous function. Then you don't need a sentinel, a check for a None value, or any of that extra code, and you get the full value of the asynchronous fetch. You could then store your scraped data in a list (or deque, or whatever) of futures. This would simplify your code and give you the fastest possible scrape time.
LAST EDIT:
Here's my quick and dirty rewrite. I liked your code so I decided to do my own spin. I have no idea if it works, I'm not a Python person.
import asyncio
from collections import deque

import httpx as httpx
from bs4 import BeautifulSoup

# Get or build URLs from config
def navigate():
    urls = deque()
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
    return urls

# Asynchronously fetch and parse data for a single URL
async def fetchHTMLandParse(url):
    client = httpx.AsyncClient()
    response = await client.get(url)
    data = BeautifulSoup(response.text, "html.parser")
    result = data.find_all("div", {"data-component": "search-result"})
    for record in result:
        atag = record.h2.a
        # Domain URL was defined elsewhere
        url = f'{domain_url}{atag.get("href")}'
        products_urls.appendleft(url)

loop = asyncio.get_event_loop()
products_urls = deque()
nav_urls = navigate()
fetch_and_parse_workers = [asyncio.ensure_future(fetchHTMLandParse(url)) for url in nav_urls]
workers = asyncio.wait([*fetch_and_parse_workers])
loop.run_until_complete(workers)
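If you would rather collect the results than push them onto a shared deque, a hedged variant of the same idea (still assuming domain_url is defined elsewhere and the markup is the same) could have each call return its page's product URLs and gather them:
async def fetch_and_parse(client, url):
    # Fetch one search page and return the product URLs it contains.
    response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    results = soup.find_all("div", {"data-component": "search-result"})
    return [f'{domain_url}{record.h2.a.get("href")}' for record in results]

async def main():
    async with httpx.AsyncClient() as client:
        pages = await asyncio.gather(*(fetch_and_parse(client, url) for url in navigate()))
    # Flatten the per-page lists into one list of product URLs.
    return [url for page in pages for url in page]

product_urls = asyncio.run(main())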

Scrapy does not respect LIFO

I use Scrapy 1.5.1
My goal is to go through the entire chain of requests for each variable before moving on to the next variable. For some reason, Scrapy takes 2 variables, sends 2 requests, then takes another 2 variables, and so on.
CONCURRENT_REQUESTS = 1
Here is my code sample:
def parsed(self, response):
    # inspect_response(response, self)
    search = response.meta['search']
    for idx, i in enumerate(response.xpath("//table[@id='ctl00_ContentPlaceHolder1_GridView1']/tr")[1:]):
        __EVENTARGUMENT = 'Select${}'.format(idx)
        data = {
            '__EVENTARGUMENT': __EVENTARGUMENT,
        }
        yield scrapy.Request(response.url, method='POST', headers=self.headers, body=urlencode(data), callback=self.res_before_get, meta={'search': search}, dont_filter=True)

def res_before_get(self, response):
    # inspect_response(response, self)
    url = 'http://www.moj-yemen.net/Search_detels.aspx'
    yield scrapy.Request(url, callback=self.results, dont_filter=True)
My desired behavior is:
One value from parsed is sent to res_before_get, and then I do something with it.
Then another value from parsed is sent to res_before_get, and so on.
Post
Get
Post
Get
But currently Scrapy takes 2 values from parsed and adds them to the queue, then sends 2 requests from res_before_get. Thus I'm getting duplicate results.
Post
Post
Get
Get
What am I missing?
P.S.
This is an ASP.NET site. Its logic is as follows:
Make a POST request with the search payload.
Make a GET request to get the actual data.
Both requests share the same session ID.
That's why it is important to preserve the order.
At the moment I'm getting POST1 and then POST2, and since the session ID is associated with POST2, both GET1 and GET2 return the same page.
Scrapy works asynchronously, so you cannot expect it to respect the order of your loops or anything.
If you need it to work sequentially, you'll have to chain the callbacks so that each one schedules the next, for example:
def parse1(self, response):
    ...
    yield Request(..., callback=self.parse2, meta={...(necessary information)...})

def parse2(self, response):
    ...
    if (necessary information):
        yield Request(...,
                      callback=self.parse2,
                      meta={...(remaining necessary information)...},
                      )
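Applied to your spider, a hedged sketch of that pattern might look like this (next_post is a hypothetical helper and the body of results is left to you); the remaining __EVENTARGUMENT values travel through meta, so each POST/GET pair finishes before the next POST is scheduled:
def parsed(self, response):
    search = response.meta['search']
    rows = response.xpath("//table[@id='ctl00_ContentPlaceHolder1_GridView1']/tr")[1:]
    arguments = ['Select${}'.format(idx) for idx in range(len(rows))]
    if arguments:
        yield self.next_post(response.url, search, arguments)

def next_post(self, url, search, arguments):
    # Build the POST for the first pending row and remember the rest in meta.
    data = {'__EVENTARGUMENT': arguments[0]}
    return scrapy.Request(url, method='POST', headers=self.headers,
                          body=urlencode(data), callback=self.res_before_get,
                          meta={'search': search, 'url': url, 'remaining': arguments[1:]},
                          dont_filter=True)

def res_before_get(self, response):
    yield scrapy.Request('http://www.moj-yemen.net/Search_detels.aspx',
                         callback=self.results, meta=response.meta, dont_filter=True)

def results(self, response):
    # ... process the detail page here ...
    remaining = response.meta['remaining']
    if remaining:
        # Only now schedule the next POST, so POST and GET always alternate.
        yield self.next_post(response.meta['url'], response.meta['search'], remaining)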

How to go to following link from a for loop?

I am using Scrapy to scrape a website. I am in a loop where every item has a link, and I want to follow that link for every item in the loop.
import scrapy

class MyDomainSpider(scrapy.Spider):
    name = 'My_Domain'
    allowed_domains = ['MyDomain.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        Colums = response.xpath('//*[@id="tab-5"]/ul/li')
        for colom in Colums:
            title = colom.xpath('//*[@class="lng_cont_name"]/text()').extract_first()
            address = colom.xpath('//*[@class="adWidth cont_sw_addr"]/text()').extract_first()
            con_address = address[9:-9]
            url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
            print(url)
            print('*********************')
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        print('000000000000000000')
        a = response.xpath('//*[@class="fn"]/text()').extract_first()
        print(a)
I have tried something like this, but the zeros print only once while the stars print 10 times. I want the second function to run every time the loop runs.
You are probably doing something that is not intended. With
url = colom.xpath('//*[@id="tab-5"]/ul/li/@data-href').extract_first()
inside the loop, url always results in the same value, because the expression starts with // and therefore searches the whole document instead of just the current colom element. By default, Scrapy filters duplicate requests (see here). If you really want to scrape the same URL multiple times, you can disable the filtering at the request level with the dont_filter=True argument to the scrapy.Request constructor. However, I think what you really want is to go like this (only the relevant part of the code left):
def parse(self, response):
    Colums = response.xpath('//*[@id="tab-5"]/ul/li')
    for colom in Colums:
        url = colom.xpath('./@data-href').extract_first()
        yield scrapy.Request(url, callback=self.parse_dir_contents)

Using multiple parsers while scraping a page

I have searched some of the questions regarding this topic, but I couldn't find a solution to my problem.
I'm currently trying to use multiple parsers on a site depending on the product I want to search. After trying some methods, I ended up with this:
With this start request:
def start_requests(self):
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()
    for keyword in keywords:
        yield Request(self.search_url.format(keyword))
That gets into my normal parse_item.
What I want to do with this parse_item is dispatch by the item's category (laptop, tablet, etc.):
def parse_item(self, response):
    # I get the item's category for the if/else
    category = re.sub('Back to search results for |"', '', response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # Get the product link, for example (https://www.amazon.com/Lenovo-T430s-Performance-Professional-Refurbished/dp/B07L4FR92R/ref=sr_1_7?s=pc&ie=UTF8&qid=1545829464&sr=1-7&keywords=laptop)
    urlProducto = response.request.url
    # This can be done in a nicer way, just trying out if it works atm
    if category == 'Laptop':
        yield response.follow(urlProducto, callback=parse_laptop)
With:
def parse_laptop(self, response):
    # Parse things
Any suggestions? The error I get when running this code is 'parse_laptop' is not defined. I have already tried putting parse_laptop above parse_item and I still get the same error.
You need to refer to a method and not a function, so just change it like this:
yield response.follow(urlProducto, callback = self.parse_laptop)
yield response.follow(urlProducto, callback = parse_laptop)
This is the request, and here's your function: def parse_laptop(self, response). You have probably noticed that your parse_laptop function requires the self object.
So please modify your request to:
yield response.follow(urlProducto, callback = self.parse_laptop)
This should do the work.
Thanks.
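As a side note on the "this can be done in a nicer way" comment in the question, a hedged sketch of one option (parse_tablet and the Tablet category are hypothetical) is to map categories to parser methods instead of stacking if/else branches:
def parse_item(self, response):
    category = re.sub('Back to search results for |"', '',
                      response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # One entry per supported category; unknown categories are simply skipped.
    parsers = {
        'Laptop': self.parse_laptop,
        'Tablet': self.parse_tablet,  # hypothetical second parser
    }
    callback = parsers.get(category)
    if callback is not None:
        # Re-requesting the current URL, so bypass the dupefilter.
        yield response.follow(response.request.url, callback=callback, dont_filter=True)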
