python grequests to get time for each http response individually - python-3.x

I have written a Python script using grequests to send HTTP requests to a server. The problem is that I need to get the response time of each request. I have used hooks, but I still can't find a single method to display the exact response time. I used time.time(), but I can't keep track of each request.
Below is the code.
def do_something(response, *args, **kwargs):
    print('Response: ', response.text)
    roundtrip = time.time() - start
    print(roundtrip)

urls = ["http://192.168.40.122:35357/v2.0/tokens"] * 100

while True:
    payload = {some_payload}
    start = time.time()
    unsent_request = (grequests.post(u, hooks={'response': do_something}, json=payload) for u in urls)
    print(unsent_request)
    print(grequests.map(unsent_request, size=100))

grequests is just a wrapper around the requests library. Just use the .elapsed attribute of each response from the underlying requests library, this way:
response_list = grequests.map(unsent_request, size=100)
for response in response_list:
    print(response.elapsed and response.elapsed.total_seconds() or "failed")
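If you want each request to report its own round-trip time as soon as it completes, you can also read .elapsed from inside the response hook. A minimal sketch, assuming the same urls list and payload as in the question:
def print_timing(response, *args, **kwargs):
    # .elapsed is measured per request by the underlying requests library
    print(response.url, response.elapsed.total_seconds(), 'seconds')

unsent_request = (grequests.post(u, hooks={'response': print_timing}, json=payload) for u in urls)
grequests.map(unsent_request, size=100)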

Related

How to make multiple REST calls asynchronous in python3

I have the following code to make multiple REST calls. Basically, I have a dictionary where the key is a string and the value is JSON data that I need to use as the payload to pass to a REST API POST method.
At the moment, the dictionary contains 10 entries, so I need to make 10 REST calls.
I have implemented this using the requests package in Python 3, which is synchronous in nature. So after one REST call it waits for its response, and for 10 REST calls it waits 10 times for the response from the API.
def createCategories(BACKEND_URL, token, category):
    url = os.path.join(BACKEND_URL, 'api/v1/category-creation')
    category_dict = read_payloads(category)
    headers = {
        "token": f'{token}',
        "Content-Type": "application/json",
        "accept": "application/json"
    }
    for name, category_payload in category_dict.items():
        json_payload = json.dumps(category_payload)
        response = requests.request("POST", url, headers=headers, data=json_payload)
        ##########################
        ## Load as string and parsing
        response_data = json.loads(response.text)
        print(response_data)
        category_id = response_data['id']
        message = 'The entity with id: ' + str(category_id) + ' is created successfully. '
        logging.info(message)
    return "categories created successfully."
I read that we need to use asyncio to make these asynchronous. What code changes do I need to make?
You can continue using the requests library. You need to use the threading or concurrent.futures modules to make several requests simultaneously.
Another option is to use an async library like aiohttp or some others.
import requests
from threading import current_thread
from concurrent.futures import ThreadPoolExecutor, Future
from time import sleep, monotonic

URL = "https://api.github.com/events"

def make_request(url: str) -> int:
    r = requests.get(url)
    sleep(2.0)  # wait n seconds
    return r.status_code

def done_callback(fut: Future):
    if fut.exception():
        res = fut.exception()
        print(f"{current_thread().name}. Error: {res}")
    elif fut.cancelled():
        print("Task was canceled")
    else:
        print(f"{current_thread().name}. Result: {fut.result()}")

if __name__ == '__main__':
    urls = [URL for i in range(20)]  # 20 tasks
    start = monotonic()
    with ThreadPoolExecutor(5) as pool:
        for i in urls:
            future_obj = pool.submit(make_request, i)
            future_obj.add_done_callback(done_callback)
    print(f"Time passed: {monotonic() - start}")

How to get a certain number of words from a website in python

I want to fetch data from cheat.sh using the requests lib and the discord.py lib, but since Discord only allows sending 2000 characters at a time, I want to fetch only a certain number of words/digits/newlines, like 1800. How can I do so?
A small bit of example code showing my idea:
import requests
url = "https://cheat.sh/python/string+annotations" #this gets the docs of string annotation in python
response = requests.get(url)
data = response.text # This gives approximately 2403 words...but i want to get only 1809 words
print(data)
import requests
url = "https://cheat.sh/python/string+annotations"  # this gets the docs of string annotations in python
response = requests.get(url)
data = response.text[:1800]
print(data)
This will be the correct code; it simply slices the response text to its first 1800 characters.
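If you want to avoid cutting a word in half at the 1800-character boundary, one small variation (a sketch, assuming plain-text responses) is to trim the slice back to its last space:
data = response.text[:1800]
# Drop the possibly truncated last word so the message ends on a word boundary
data = data.rsplit(" ", 1)[0]
print(data)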

Asyncio, the tasks are not finished properly, because of sentinel issues

I'm trying to do some web scraping, as a learning exercise, using a predefined number of workers.
I'm using None as a sentinel to break out of the while loop and stop the worker.
The speed of each worker varies, and all workers are closed before the last URL is passed to gather_search_links to get the links.
I tried to use asyncio.Queue, but I had less control than with deque.
async def gather_search_links(html_sources, detail_urls):
    while True:
        if not html_sources:
            await asyncio.sleep(0)
            continue
        data = html_sources.pop()
        if data is None:
            # Sentinel reached: put it back for the other workers and stop
            html_sources.appendleft(None)
            break
        data = BeautifulSoup(data, "html.parser")
        result = data.find_all("div", {"data-component": "search-result"})
        for record in result:
            atag = record.h2.a
            url = f'{domain_url}{atag.get("href")}'
            detail_urls.appendleft(url)
        print("appended data", len(detail_urls))
        await asyncio.sleep(0)

async def get_page_source(urls, html_sources):
    client = httpx.AsyncClient()
    while True:
        if not urls:
            await asyncio.sleep(0)
            continue
        url = urls.pop()
        print("url", url)
        if url is None:
            # Sentinel reached: put it back for the other workers and stop
            urls.appendleft(None)
            break
        response = await client.get(url)
        html_sources.appendleft(response.text)
        await asyncio.sleep(8)
    # Signal the downstream workers that no more HTML is coming
    html_sources.appendleft(None)

async def navigate(urls):
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
        await asyncio.sleep(0)
    nav_urls.appendleft(None)

loop = asyncio.get_event_loop()
nav_html = deque()
nav_urls = deque()
products_url = deque()
navigate_workers = [asyncio.ensure_future(navigate(nav_urls)) for _ in range(1)]
page_source_workers = [asyncio.ensure_future(get_page_source(nav_urls, nav_html)) for _ in range(2)]
product_urls_workers = [asyncio.ensure_future(gather_search_links(nav_html, products_url)) for _ in range(1)]
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
loop.run_until_complete(workers)
I'm a bit of a newbie, so this could be as wrong as can be, but I believe the issue is that all three functions (navigate(), gather_search_links(), and get_page_source()) are asynchronous tasks that can be completed in any order. However, your checks for empty deques and your use of appendleft to ensure None is the leftmost item in your deques look like they would appropriately prevent this. For all intents and purposes, the code looks like it should run correctly.
I think the issue arises at this line:
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
According to this post, the asyncio.wait function does not order these tasks according to the order they're written above; instead, it fires them according to IO as coroutines. Again, your checks at the beginning of gather_search_links and get_page_source ensure that one function runs after the other, so this code should work if there is only a single worker for each function. If there are multiple workers for each function, I can see issues arising where None doesn't wind up being the leftmost item in your deques. Perhaps a print statement at the end of each function to show the contents of your deques would be useful in troubleshooting this.
I guess my major question would be: why do these tasks asynchronously if you're going to write extra code because the steps must be completed synchronously? In order to get the HTML you must first have the URL. In order to scrape the HTML you must first have the HTML. What benefit does asyncio provide here? All three of these make more sense to me as synchronous tasks: get URL, get HTML, scrape HTML, in that order.
EDIT: It occurred to me that the main benefit of asynchronous code here is that you don't want to have to wait on each individual URL to respond back synchronously when you fetch the HTML from them. What I would do in this situation is gather my URLs synchronously first, and then combine the get and scrape functions into a single asynchronous function, which would be your only asynchronous function. Then you don't need a sentinel or a check for a "None" value or any of that extra code and you get the full value of the asynchronous fetch. You could then store your scraped data in a list (or deque or whatever) of futures. This would simplify your code and provide you with the fastest possible scrape time.
LAST EDIT:
Here's my quick and dirty rewrite. I liked your code, so I decided to do my own spin on it. I have no idea if it works; I'm not a Python person.
import asyncio
from collections import deque

import httpx as httpx
from bs4 import BeautifulSoup

# Get or build URLs from config
def navigate():
    urls = deque()
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
    return urls

# Asynchronously fetch and parse data for a single URL
async def fetchHTMLandParse(url):
    client = httpx.AsyncClient()
    response = await client.get(url)
    data = BeautifulSoup(response.text, "html.parser")
    result = data.find_all("div", {"data-component": "search-result"})
    for record in result:
        atag = record.h2.a
        # Domain URL was defined elsewhere
        url = f'{domain_url}{atag.get("href")}'
        products_urls.appendleft(url)

loop = asyncio.get_event_loop()
products_urls = deque()
nav_urls = navigate()
fetch_and_parse_workers = [asyncio.ensure_future(fetchHTMLandParse(url)) for url in nav_urls]
workers = asyncio.wait([*fetch_and_parse_workers])
loop.run_until_complete(workers)
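One small refinement to the rewrite above (a sketch, assuming httpx and Python 3.7+): share a single AsyncClient across all requests instead of creating one per coroutine, and drive everything with asyncio.run:
import asyncio
from collections import deque

import httpx
from bs4 import BeautifulSoup

domain_url = "https://www.example.com"  # assumption: same domain as in the question
products_urls = deque()

async def fetch_and_parse(client, url):
    # Reuse the shared client (and its connection pool) for every request
    response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for record in soup.find_all("div", {"data-component": "search-result"}):
        products_urls.appendleft(f'{domain_url}{record.h2.a.get("href")}')

async def main(urls):
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(fetch_and_parse(client, url) for url in urls))

asyncio.run(main([f"https://www.example.com/?page={i}" for i in range(2, 7)]))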

Ensuring unique timestamps generation in asyncio/aiohttp coroutines

I'm rewriting a web scraper with aiohttp. At some point, it has to make a POST request with a payload that notably includes a 'CURRENT_TIMESTAMP_ID'. These requests seem to always succeed, but they are sometimes redirected (302 status code) to another location, as additional details need to be fetched to be displayed on the page. Those redirections often fail ("A system error occurred" or "not authorized" error message is displayed on the page), and I don't know why.
I guess it's because they sometimes share the same value for 'CURRENT_TIMESTAMP_ID' (because headers and cookies are the same). Thus, I'd like to generate a different timestamp in each request, but I have had no success doing that. I tried adding some randomness with things like asyncio.sleep(1 + (randint(500, 2000) / 1000)). Also, note that doing the scraping with task_limit=1 succeeds (see code below).
Here is the relevant part of my code:
async def search(Number, session):
    data = None
    loop = asyncio.get_running_loop()
    while data is None:
        t = int(round(time() * 1000))  # often got the same value here
        payload = {'Number': Number,
                   'CURRENT_TIMESTAMP_ID': t}
        params = {'CURRENT_TIMESTAMP_ID': t}
        try:
            async with session.post(SEARCH_URL, data=payload, params=params) as resp:
                resp.raise_for_status()
                data = await resp.text()
                return data
        except aiohttp.ClientError as e:
            print(f'Error with number {Number}: {e}')
It's called via:
async def main(username, password):
    headers = {'User-Agent': UserAgent().random}
    async with aiohttp.ClientSession(headers=headers) as session:
        await login(session, username, password)
        """Perform the following operations:
        1. Fetch a bunch of urls concurrently, with a limit of x tasks
        2. Gather the results into chunks of size y
        3. Process the chunks in parallel using z different processes
        """
        partial_search = async_(partial(search, session=session))  # I'm using Python 3.7
        urls = ['B5530'] * 3  # trying to scrape the same URL 3 times
        results = await (  # I'm using aiostream cause I got a huge list of urls. Problem also occurs with gather.
            stream.iterate(urls)
            | pipe.map(partial_search, ordered=False, task_limit=100)
            | pipe.chunks(100 // cpu_count())
            | pipe.map(process_in_executor, ordered=False, task_limit=cpu_count() + 1)
        )
Hope someone will see what I'm missing!
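For what it's worth, int(round(time() * 1000)) only has millisecond resolution, so coroutines scheduled within the same millisecond all compute the same ID. One way to guarantee distinct values (a sketch, assuming the server only needs a strictly increasing integer rather than a true timestamp) is to draw IDs from a shared counter seeded with the current time:
import itertools
from time import time

# Strictly increasing IDs, even when several coroutines ask within the same millisecond
_timestamp_ids = itertools.count(int(time() * 1000))

def next_timestamp_id() -> int:
    return next(_timestamp_ids)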

TypeError('not a valid non-string sequence or mapping object',)

I am using an aiohttp GET request to download some content from another web API,
but I am receiving:
exception = TypeError('not a valid non-string sequence or mapping object',)
Following is the data which I am trying to send:
data = "symbols=LGND-US&exprs=CS_EVENT_TYPE_CD_R(%27%27,%27now%27,%271D%27)"
How do I resolve it?
I tried it in 2 ways:
r = yield from aiohttp.get(url, params=data) # and
r = yield from aiohttp.post(url, data=data)
At the same time, I am able to fetch data using:
r = requests.get(url, params=data) # and
r = requests.post(url, data=data)
But I need an async implementation.
Also, please suggest a way to use the requests library instead of aiohttp for making async HTTP requests, because in many cases aiohttp POST and GET requests are not working while the same requests work with requests.get and requests.post.
The docs use bytes (i.e. the 'b' prefix) for the data argument.
r = await aiohttp.post('http://httpbin.org/post', data=b'data')
Also, the params argument should be a dict or a list of tuples.
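Applied to the data in the question, that would look something like this (a sketch; the ClientSession-based calls are the current aiohttp style, and the decoded dict form of the query is an assumption about what the server expects):
import aiohttp

params = {
    "symbols": "LGND-US",
    "exprs": "CS_EVENT_TYPE_CD_R('','now','1D')",
}

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        # aiohttp builds and encodes the query string from the dict
        async with session.get(url, params=params) as resp:
            return await resp.text()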
