Asyncio tasks are not finished properly because of sentinel issues

I'm trying to do some web scraping, as a learning exercise, using a predefined number of workers.
I'm using None as a sentinel to break out of the while loop and stop the worker.
The speed of each worker varies, and all workers are closed before the last
URL is passed to gather_search_links to extract the links.
I tried to use asyncio.Queue, but I had less control than with deque.
import asyncio
from collections import deque

import httpx
from bs4 import BeautifulSoup

# domain_url is defined elsewhere in the original code

async def gather_search_links(html_sources, detail_urls):
    while True:
        if not html_sources:
            await asyncio.sleep(0)
            continue
        data = html_sources.pop()
        if data is None:
            html_sources.appendleft(None)
            break
        data = BeautifulSoup(data, "html.parser")
        result = data.find_all("div", {"data-component": "search-result"})
        for record in result:
            atag = record.h2.a
            url = f'{domain_url}{atag.get("href")}'
            detail_urls.appendleft(url)
        print("appended data", len(detail_urls))
        await asyncio.sleep(0)

async def get_page_source(urls, html_sources):
    client = httpx.AsyncClient()
    while True:
        if not urls:
            await asyncio.sleep(0)
            continue
        url = urls.pop()
        print("url", url)
        if url is None:
            urls.appendleft(None)
            break
        response = await client.get(url)
        html_sources.appendleft(response.text)
        await asyncio.sleep(8)
    html_sources.appendleft(None)

async def navigate(urls):
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
        await asyncio.sleep(0)
    nav_urls.appendleft(None)

loop = asyncio.get_event_loop()
nav_html = deque()
nav_urls = deque()
products_url = deque()

navigate_workers = [asyncio.ensure_future(navigate(nav_urls)) for _ in range(1)]
page_source_workers = [asyncio.ensure_future(get_page_source(nav_urls, nav_html)) for _ in range(2)]
product_urls_workers = [asyncio.ensure_future(gather_search_links(nav_html, products_url)) for _ in range(1)]

workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
loop.run_until_complete(workers)

I'm a bit of a newbie, so this could be completely wrong, but I believe the issue is that all three of the functions (navigate(), gather_search_links(), and get_page_source()) are asynchronous tasks that can be completed in any order. However, your checks for empty deques and your use of appendleft to ensure None is the leftmost item in your deques look like they would appropriately prevent this. For all intents and purposes, the code looks like it should run correctly.
I think the issue arises at this line:
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
According to this post, asyncio.wait does not run these tasks in the order they are written above; it schedules them all as coroutines and lets them interleave as their I/O completes. Again, your checks at the beginning of gather_search_links and get_page_source ensure that one function runs after the other, so this code should work if there is only a single worker for each function. If there are multiple workers for each function, I can see issues arising where None doesn't end up being the leftmost item in your deques. A print statement at the end of each function showing the contents of your deques would be useful in troubleshooting this; a sketch of one common way to harden the sentinel is shown below.
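For illustration only (this is not the poster's code, and the producer count is an assumption), a common pattern is to have the consumer count one sentinel per upstream producer before it stops, so the output of the slowest worker is still drained:

import asyncio

NUM_PRODUCERS = 2  # assumed: the two get_page_source workers, each pushing one None when it finishes

async def gather_search_links(html_sources, detail_urls):
    seen_sentinels = 0
    while True:
        if not html_sources:
            await asyncio.sleep(0)
            continue
        data = html_sources.pop()
        if data is None:
            seen_sentinels += 1
            if seen_sentinels == NUM_PRODUCERS:
                break      # every producer has finished; nothing more will arrive
            continue       # keep draining output from the slower producers
        # ... parse `data` and appendleft() the extracted URLs as before ...
        await asyncio.sleep(0)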
I guess my major question would be: why do these tasks asynchronously if you're going to write extra code because the steps must be completed synchronously? In order to get the HTML you must first have the URL. In order to scrape the HTML you must first have the HTML. What benefit does asyncio provide here? All three of these make more sense to me as synchronous tasks: get URL, get HTML, scrape HTML, in that order.
EDIT: It occurred to me that the main benefit of asynchronous code here is that you don't want to have to wait on each individual URL to respond back synchronously when you fetch the HTML from them. What I would do in this situation is gather my URLs synchronously first, and then combine the get and scrape functions into a single asynchronous function, which would be your only asynchronous function. Then you don't need a sentinel or a check for a "None" value or any of that extra code and you get the full value of the asynchronous fetch. You could then store your scraped data in a list (or deque or whatever) of futures. This would simplify your code and provide you with the fastest possible scrape time.
LAST EDIT:
Here's my quick and dirty rewrite. I liked your code, so I decided to take my own spin on it. I have no idea if it works; I'm not a Python person.
import asyncio
from collections import deque

import httpx
from bs4 import BeautifulSoup

# Get or build URLs from config
def navigate():
    urls = deque()
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
    return urls

# Asynchronously fetch and parse data for a single URL
async def fetchHTMLandParse(url):
    # close the client once the request is done
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
    data = BeautifulSoup(response.text, "html.parser")
    result = data.find_all("div", {"data-component": "search-result"})
    for record in result:
        atag = record.h2.a
        # domain_url was defined elsewhere
        url = f'{domain_url}{atag.get("href")}'
        products_urls.appendleft(url)

loop = asyncio.get_event_loop()
products_urls = deque()
nav_urls = navigate()
fetch_and_parse_workers = [asyncio.ensure_future(fetchHTMLandParse(url)) for url in nav_urls]
workers = asyncio.wait([*fetch_and_parse_workers])
loop.run_until_complete(workers)

Related

RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

I have always resisted using asyncio within my code, but using it might help with some performance issues that I'm having.
Here is my scenario:
An end user provides a list of news sites to scrape
Each element is passed to an Article Class
A valid article is passed to an Extraction Class
The Extraction Class passes data to a NewsExtraction Class
90% of the time this flow is flawless, but on occasion one of the 12 functions in the NewsExtraction Class fails to extract data that exists in the HTML provided. It seems that my code is "stepping on itself," which causes the data element not to be parsed. When I rerun the code, all the elements are parsed correctly.
The NewsExtraction Class has this function get_article_data_elements, which is called from the Extraction Class.
The function get_article_data_elements calls these methods:
published_date = self._extract_article_published_date()
modified_date = self._extract_article_modified_date()
title = self._extract_article_title()
description = self._extract_article_description()
keywords = self._extract_article_key_words()
tags = self._extract_article_tags()
authors = self._extract_article_author()
top_image = self._extract_top_image()
language = self._extract_article_language()
categories = self._extract_article_category()
text = self._extract_textual_content()
url = self._extract_article_url()
Each of these data elements is used to populate a Python dictionary, which is eventually passed back to the end user.
I have been trying to add asyncio code to the NewsExtraction Class, but I kept getting this error message:
RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited
I have spent the last 3 days trying to figure this issue out. I have looked at dozens of questions on Stack Overflow on this error RuntimeWarning: coroutine never awaited. I have also looked at numerous articles on using asyncio, but I cannot figure out how to use asyncio with my NewsExtraction Class, which is called from the Extraction Class.
Can someone provide me some pointers to solve my issue?
class NewsExtraction(object):
    """
    This class is used to extract common data elements from a news article
    on xyz
    """
    def __init__(self, url, soup):
        self._url = url
        self._raw_soup = soup

    truncated...

    async def _extract_article_published_date(self):
        """
        This function is designed to extract the publish date for the article being parsed.

        :return: date article was originally published
        :rtype: string
        """
        json_date_published = JSONExtraction(self._url, self._raw_soup).extract_article_published_date()
        if json_date_published is not None:
            if len(json_date_published) != 0:
                return json_date_published
            else:
                return None
        elif json_date_published is None:
            if self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")}):
                date_published = self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")})
                if len(date_published) != 0:
                    return date_published.text
                else:
                    logger.info('The HTML tag to extract the publish date for the following article was not found.')
                    logger.info(f'Article URL -- {self._url}')
                    return None

    truncated...

    async def get_article_data_elements(self):
        """
        This function is designed to extract all the common data elements from a
        news article on xyz.

        :return: dictionary of data elements related to the article
        :rtype: dict
        """
        article_data_elements = {}

        # I have tried this:
        published_date = self._extract_article_published_date().__await__()

        # and this
        published_date = self.task(self._extract_article_published_date())
        await published_date

        truncated...
I have also tried to use:
if __name__ == "__main__":
    asyncio.run(NewsExtraction.get_article_data_elements())
    # asyncio.run(self.get_article_data_elements())
I'm really banging my head on the wall with using asyncio in my news extraction code.
If this question is off base, I will be happy to delete it and keep reading about how to use asyncio correctly.
Can someone provide me some pointers to solve my issue?
Thanks in advance for any guidance on using asyncio
You are defining _extract_article_published_date and get_article_data_elements as coroutines, and these coroutines must be awaited in your code to get the result of their execution in an asynchronous way.
You can do this by creating an instance of NewsExtraction and calling these methods with the keyword await in front; await passes execution to another task in the loop until the awaited task completes. Note that there are no threads or processes involved in this task execution: execution is handed over only when a task is not using CPU time (while awaiting I/O operations or sleeping).
if __name__ == '__main__':
    extractor = NewsExtraction(...)
    # this creates the event loop and runs the coroutine
    asyncio.run(extractor.get_article_data_elements())
Inside your _extract_article_published_date you must also await the coroutines that perform requests over the network. If you are using a library for the scraping, make sure it uses async/await behind the scenes to get real performance benefits from asyncio.
async def get_article_data_elements(self):
    article_data_elements = {}
    # note here that the instance is self
    published_date = await self._extract_article_published_date()
    truncated...
You must dive into the asyncio documentation to get a better understanding of these features of Python 3.7+.

how to get return value of async coroutine python

I am new to Python and trying to learn the asyncio module. I am frustrated with getting return values from async tasks. There is a post here that talked about this topic, but it doesn't tell which value is returned by which task (assuming some web pages respond faster than others).
The code below is trying to fetch three web pages concurrently instead of doing it one by one.
import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            assert resp.status == 200
            return await resp.text()

def compile_all(urls):
    tasks = []
    for url in urls:
        tasks.append(asyncio.ensure_future(fetch(url)))
    return tasks

urls = ['https://python.org', 'https://google.com', 'https://amazon.com']
tasks = compile_all(urls)

loop = asyncio.get_event_loop()
a, b, c = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

print(a)
print(b)
print(c)
First, it hit a RuntimeError, though it did print some HTML documents: RuntimeError: Event loop is closed.
Second, the question is: does this really guarantee that a, b, c will correspond to the urls list in the order urls[0], urls[1], urls[2]? (I assume that async task execution won't guarantee that.)
Third, is there any better approach, or should I use a Queue in this case? If yes, how?
Any help will be greatly appreciated.
The order of the results will correspond to the order of the urls. Take a look at the docs for asyncio.gather:
If all awaitables are completed successfully, the result is an aggregate list of returned values. The order of result values corresponds to the order of awaitables in aws.
To process tasks as they complete you can use asyncio.as_completed. This post has more information on how it can be used.
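A minimal sketch of that pattern (untested; the fetch coroutine mirrors the one in the question, and the URL list is just an example) could look like this. Returning the URL together with the body keeps the pairing that as_completed otherwise loses:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        resp.raise_for_status()
        # return the url together with the body, since completion order is arbitrary
        return url, await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        coros = [fetch(session, url) for url in urls]
        for finished in asyncio.as_completed(coros):
            url, body = await finished
            print(url, len(body))

asyncio.run(main(['https://python.org', 'https://google.com', 'https://amazon.com']))

As a side note, asyncio.run creates and closes the event loop for you, so there is no get_event_loop()/close() bookkeeping to get wrong.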

Is there a workaround for the blocking that happens with Firebase Python SDK? Like adding a completion callback?

Recently, I have moved my REST server code from Express.js to FastAPI. The transition had been going smoothly until recently: I've noticed, based on the Firebase Python Admin SDK documentation, that unlike Node.js, the Python SDK is blocking. The documentation says here:
In Python and Go Admin SDKs, all write methods are blocking. That is, the write methods do not return until the writes are committed to the database.
I think this behavior is having an effect on my code, though it could also be how I've structured my code. Some code from one of my files is below:
from app.services.new_service import nService
from firebase_admin import db
import json
import redis

class TryNewService:
    async def tryNew_func(self, request):
        # I've already initialized everything in another file for firebase
        ref = db.reference()
        r = redis.Redis()
        holdingData = await nService().dialogflow_session(request)
        fulfillmentText = json.dumps(holdingData[-1])
        body = await request.json()
        if ("user_prelimInfo_address" in holdingData):
            holdingData.append("session")
            holdingData.append(body["session"])
            print(holdingData)
            return(holdingData)
        else:
            if (("Default Welcome Intent" in holdingData)):
                pass
            else:
                UserVal = r.hget(name='{}'.format(body["session"]), key="userId").decode("utf-8")
                ref.child("users/{}".format(UserVal)).child("c_data").set({holdingData[0]:holdingData[1]})
                print(holdingData)
                return(fulfillmentText)
Is there any workaround for the blocking effect of the ref.set() line in my code, kind of like adding a completion callback in Node.js? I'm new to the asyncio world of Python 3.
Update as of 06/13/2020: I added the following code and am now getting a RuntimeError: Task attached to a different loop. In my second else statement I do the following:
loop = asyncio.new_event_loop()
UserVal = r.hget(name='{}'.format(body["session"]), key="userId").decode("utf-8")
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    result = await loop.run_in_executor(pool, ref.child("users/{}".format(UserVal)).child("c_data").set({holdingData[0]:holdingData[1]}))
    print("custom thread pool:{}".format(result))
With this new RuntimeError, I would appreciate some help in figuring it out.
If you want to run synchronous code inside an async coroutine, then the steps are:
loop = get_event_loop()
Note: get and not new. get_event_loop() provides the current event loop, while new_event_loop() returns a new one.
await loop.run_in_executor(None, sync_method)
First parameter = None -> use default executor instance
Second parameter (sync_method) is the synchronous code to be called.
Remember that resources used by sync_method need to be properly synchronized:
a) either using asyncio.Lock
b) or using the asyncio.run_coroutine_threadsafe function (see an example below)
Forget about ThreadPoolExecutor for this case (it provides a way to achieve I/O parallelism, versus the concurrency provided by asyncio).
You can try the following code:
loop = asyncio.get_event_loop()
UserVal = r.hget(name='{}'.format(body["session"]), key="userId").decode("utf-8")
result = await loop.run_in_executor(None, sync_method, ref, UserVal, holdingData)
print("custom thread pool:{}".format(result))
With a new function:
def sync_method(ref, UserVal, holdingData):
    result = ref.child("users/{}".format(UserVal)).child("c_data").set({holdingData[0]: holdingData[1]})
    return result
Please let me know your feedback
Note: the previous code is untested. I have only tested the next minimal example (using pytest & pytest-asyncio):
import asyncio
import time

import pytest

@pytest.mark.asyncio
async def test_1():
    loop = asyncio.get_event_loop()
    delay = 3.0
    result = await loop.run_in_executor(None, sync_method, delay)
    print(f"Result = {result}")

def sync_method(delay):
    time.sleep(delay)
    print(f"dddd {delay}")
    return "OK"
Answer to @jeff-ridgeway's comment:
Let's change the previous answer to clarify how to use run_coroutine_threadsafe to execute, from a sync worker thread, a coroutine that gathers these shared resources:
Add loop as an additional parameter in run_in_executor
Move all shared resources from sync_method to a new async_method, which is executed with run_coroutine_threadsafe
loop = asyncio.get_event_loop()
UserVal = r.hget(name='{}'.format(body["session"]), key="userId").decode("utf-8")
result = await loop.run_in_executor(None, sync_method, ref, UserVal, holdingData, loop)
print("custom thread pool:{}".format(result))

def sync_method(ref, UserVal, holdingData, loop):
    coro = async_method(ref, UserVal, holdingData)
    future = asyncio.run_coroutine_threadsafe(coro, loop)
    return future.result()

async def async_method(ref, UserVal, holdingData):
    result = ref.child("users/{}".format(UserVal)).child("c_data").set({holdingData[0]: holdingData[1]})
    return result
Note: the previous code is untested. And now my tested minimal example, updated:
@pytest.mark.asyncio
async def test_1():
    loop = asyncio.get_event_loop()
    delay = 3.0
    result = await loop.run_in_executor(None, sync_method, delay, loop)
    print(f"Result = {result}")

def sync_method(delay, loop):
    coro = async_method(delay)
    future = asyncio.run_coroutine_threadsafe(coro, loop)
    return future.result()

async def async_method(delay):
    time.sleep(delay)
    print(f"dddd {delay}")
    return "OK"
I hope this can be helpful
Run blocking database calls through the event loop using a ThreadPoolExecutor. See https://medium.com/@hiranya911/firebase-python-admin-sdk-with-asyncio-d65f39463916
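A minimal sketch of that suggestion (untested; ref, UserVal, and holdingData are the objects from the question above, and the worker count is arbitrary) could look like this:

import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def write_user_data(ref, user_val, holding_data):
    # The blocking Admin SDK write runs in a worker thread, off the event loop.
    ref.child("users/{}".format(user_val)).child("c_data").set(
        {holding_data[0]: holding_data[1]}
    )

async def save_user_data(ref, user_val, holding_data):
    loop = asyncio.get_event_loop()
    # run_in_executor returns an awaitable, so the coroutine yields while the write happens.
    await loop.run_in_executor(executor, write_user_data, ref, user_val, holding_data)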

Scrapy does not respect LIFO

I use Scrapy 1.5.1
My goal is to go through the entire chain of requests for each variable before moving to the next variable. For some reason, Scrapy takes 2 variables, sends 2 requests, then takes another 2 variables, and so on.
CONCURRENT_REQUESTS = 1
Here is my code sample:
def parsed(self, response):
    # inspect_response(response, self)
    search = response.meta['search']
    for idx, i in enumerate(response.xpath("//table[@id='ctl00_ContentPlaceHolder1_GridView1']/tr")[1:]):
        __EVENTARGUMENT = 'Select${}'.format(idx)
        data = {
            '__EVENTARGUMENT': __EVENTARGUMENT,
        }
        yield scrapy.Request(response.url, method='POST', headers=self.headers, body=urlencode(data),
                             callback=self.res_before_get, meta={'search': search}, dont_filter=True)

def res_before_get(self, response):
    # inspect_response(response, self)
    url = 'http://www.moj-yemen.net/Search_detels.aspx'
    yield scrapy.Request(url, callback=self.results, dont_filter=True)
My desired behavior is:
One value from parsed is sent to res_before_get, and then I do something with it.
Then another value from parsed is sent to res_before_get, and so on.
Post
Get
Post
Get
But currently Scrapy takes 2 values from parsed and adds them to the queue, then sends 2 requests from res_before_get. Thus I'm getting duplicate results.
Post
Post
Get
Get
What am I missing?
P.S.
This is an ASP.NET site. Its logic is as follows:
Make a POST request with the search payload.
Make a GET request to get the actual data.
Both requests share the same session ID.
That's why it is important to preserve the order.
At the moment I'm getting POST1 and POST2, and since the session ID is associated with POST2, both GET1 and GET2 return the same page.
Scrapy works asynchronously, so you cannot expect it to respect the order of your loops or anything.
If you need it to work sequentially, you'll have to accommodate the callbacks to work like that, for example:
def parse1(self, response):
    ...
    yield Request(..., callback=self.parse2, meta={...(necessary information)...})

def parse2(self, response):
    ...
    if (necessary information):
        yield Request(...,
                      callback=self.parse2,
                      meta={...(remaining necessary information)...},
                      )
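As a rough, untested sketch of how that pattern could be applied to the spider from the question (parsed, res_before_get, results, self.headers, and urlencode come from the original code; next_post is a hypothetical helper), you could carry the remaining __EVENTARGUMENT values in meta and only issue the next POST after the matching GET has been processed:

def parsed(self, response):
    search = response.meta['search']
    rows = response.xpath("//table[@id='ctl00_ContentPlaceHolder1_GridView1']/tr")[1:]
    event_args = ['Select${}'.format(idx) for idx, _ in enumerate(rows)]
    # Start only the first POST; the rest wait in meta until their turn.
    yield from self.next_post(response.url, search, event_args)

def next_post(self, url, search, remaining):
    if not remaining:
        return
    data = {'__EVENTARGUMENT': remaining[0]}
    yield scrapy.Request(url, method='POST', headers=self.headers, body=urlencode(data),
                         callback=self.res_before_get,
                         meta={'search': search, 'post_url': url, 'remaining': remaining[1:]},
                         dont_filter=True)

def res_before_get(self, response):
    url = 'http://www.moj-yemen.net/Search_detels.aspx'
    yield scrapy.Request(url, callback=self.results, meta=response.meta, dont_filter=True)

def results(self, response):
    # ... extract the item for this POST/GET pair ...
    # Only now schedule the next POST, so each chain finishes before the next starts.
    yield from self.next_post(response.meta['post_url'],
                              response.meta['search'],
                              response.meta['remaining'])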

Python 3 - Example of Multithreading request using requests, asyncio and concurrent

So, I wrote the following code based on an example of making multiple requests using the asyncio and concurrent libs that I found here.
Basically, for a given list of names, I make some parallel requests to get each user's ID. (That's just an example of usage.)
The values returned from the getUserId function are stored in the 'futures' list, and you can get these values using the result() method of each element in the list.
Is this currently a good approach?
import asyncio
import concurrent.futures

import requests

async def makeMultiRequests(names):
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
                executor,
                getUserId,
                name
            )
            for name in names
        ]
        return futures

def getUserId(username):
    # requests.get() user Id
    return userId

names = ['Name 1', ..., 'Name N']

loop = asyncio.get_event_loop()
resultList = loop.run_until_complete(makeMultiRequests(names))
loop.close()

for result in resultList:
    print(result.result())  # Users Id printed
Best regards.
