Using asyncio with Azure Computer Vision SDK read_in_stream - python-3.x

I have an application where I need to process several pdfs using Azure Computer Vision SDK. I am following this example from the official documentation.
From what I have understood, we can submit pdfs by
# Async SDK call that "reads" the file; raw=True is needed so the response headers are exposed
response = client.read_in_stream(open(filepath, "rb"), raw=True)
and get the results using operation_location
# Get ID from returned headers
operation_location = response.headers["Operation-Location"]
operation_id = operation_location.split("/")[-1]

# SDK call that gets what is read
while True:
    result = client.get_read_result(operation_id)
    if result.status.lower() not in ['notstarted', 'running']:
        break
    print('Waiting for result...')
    time.sleep(10)
return result
I have noticed that read_in_stream takes around 10 to 30 seconds (depending on the number of pages, images in the PDF, quality, etc.), and I would like to use asyncio to move on to other tasks concurrently instead of just waiting for the PDFs to be submitted. I tried using joblib backends to speed this up (multithreading, multiprocessing, and a combination of the two), but the speedup was only about 2-2.5x even after tweaking other joblib parameters.
I want to know the correct way of using asyncio for this problem, and I would like to keep using the Azure SDK objects instead of resorting to the requests library and dealing with raw JSON responses. I think that with asyncio and aiohttp this could be achieved as below, but how do I proceed with the Azure SDK?
timeout = aiohttp.ClientTimeout(total=3600)
async with aiohttp.ClientSession(timeout=timeout) as session:
    async with session.get(url) as resp:
        # Submit pdfs and get their results
        ...
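For reference, a rough sketch of what the same idea might look like while keeping the SDK objects, by pushing the blocking read_in_stream call onto a thread with loop.run_in_executor (submit_pdf and submit_all are hypothetical helper names, not part of the Azure SDK):
import asyncio

async def submit_pdf(client, filepath):
    # Run the blocking SDK call in the default thread pool so other
    # submissions can proceed while this one waits on the service.
    loop = asyncio.get_event_loop()
    with open(filepath, "rb") as f:
        response = await loop.run_in_executor(
            None, lambda: client.read_in_stream(f, raw=True))
    return response.headers["Operation-Location"].split("/")[-1]

async def submit_all(client, filepaths):
    # Submit every PDF concurrently and collect the operation IDs.
    return await asyncio.gather(*(submit_pdf(client, fp) for fp in filepaths))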

Related

How to limit requests per second with httpx [Python 3.6]

My project consists of consuming an API that is built on top of the AWS Lambda service. Technically, the lead who built the API tells me that there is no fixed request limit since the service is elastic, but it is important to take into account the number of requests per second that the API can support.
To control the number of requests per second (i.e. how many run concurrently), the Python script I am developing uses asyncio and httpx to consume the API concurrently, and by adjusting the max_connections parameter of httpx.Limits I am trying to find the optimal value so that the API does not freeze.
My problem is that I don't know if I am misinterpreting the max_connections parameter: when testing with a value of 1000, my understanding is that I am making up to 1000 concurrent requests to the API, but even so, the API freezes after a certain time.
I would like to be able to control the limit of requests per second without the need to use third-party libraries.
How could I do it?
Here is my MWE
async def consume(client, reg, endpoint: str = '/create'):
    data = {"param1": reg[1]}
    response = await client.post(url=endpoint, data=json.dumps(data))
    return response.json()

async def run(self, regs):
    # Empty list to consolidate all responses
    results = []
    # httpx limits configuration
    limits = httpx.Limits(max_keepalive_connections=None, max_connections=1000)
    timeout = httpx.Timeout(connect=60.0, read=30.0, write=30.0, pool=60.0)
    # httpx client context
    async with httpx.AsyncClient(base_url='https://apiexample',
                                 headers={'Content-Type': 'application/json'},
                                 limits=limits, timeout=timeout) as client:
        # regs is a list of more than 1000000 tuples
        tasks = [asyncio.ensure_future(consume(client=client, reg=reg))
                 for reg in regs]
        result = await asyncio.gather(*tasks)
        results += result
    return results
Thanks in advance.
Your leader is wrong - there is a request limit for AWS Lambda (it's 1000 concurrent executions by default).
The AWS API is highly unlikely to "freeze" (there are many layers of protection), so I would look for a problem on your side.
Start debugging by lowering the concurrent connections setting (e.g. to 100), and explore other settings if that doesn't fix the issue.
More info: https://www.bluematador.com/blog/why-aws-lambda-throttles-functions
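One standard-library way to cap how many requests are in flight at once is an asyncio.Semaphore around the question's consume coroutine; a minimal sketch (consume_limited, run_limited and the limit of 100 are illustrative, to be tuned against the API):
import asyncio

async def consume_limited(client, reg, semaphore):
    # Wait for a free slot before calling the API, so no more than
    # max_in_flight requests are ever active at the same time.
    async with semaphore:
        return await consume(client=client, reg=reg)

async def run_limited(client, regs, max_in_flight=100):
    semaphore = asyncio.Semaphore(max_in_flight)
    tasks = [consume_limited(client, reg, semaphore) for reg in regs]
    return await asyncio.gather(*tasks)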

Multi Thread Requests Python3

I have researched this topic a lot, but the problem is I am not able to figure out how to send multithreaded POST requests using Python 3.
names = ["dfg","dddfg","qwed"]
for name in names :
res = requests.post(url,data=name)
res.text
Here I want to send all these names, and I want to use multithreading to make it faster.
Solution 1 - concurrent.futures.ThreadPoolExecutor with a fixed number of threads
Using a custom function (request_post) you can do almost anything.
import concurrent.futures
import requests

def request_post(url, data):
    return requests.post(url, data=data)

with concurrent.futures.ThreadPoolExecutor() as executor:  # optimally defined number of threads
    res = [executor.submit(request_post, url, data) for data in names]
    concurrent.futures.wait(res)
res will be a list of Future instances, one for each request made, each wrapping a requests.Response. To access a requests.Response you need to use res[index].result(), where index runs from 0 to len(names) - 1.
Future objects give you better control over the responses received, such as whether the call completed correctly, raised an exception, timed out, etc. More about that here.
You also don't risk the problems related to a high number of threads (see solution 2).
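For example, one way to collect the responses and surface any exception raised inside a worker thread is to iterate the futures as they complete (a small sketch building on the snippet above):
for future in concurrent.futures.as_completed(res):
    try:
        response = future.result()  # re-raises any exception from the worker thread
        print(response.status_code, response.url)
    except requests.RequestException as exc:
        print("request failed:", exc)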
Solution 2 - multiprocessing.dummy.Pool and spawn one thread for each request
Might be useful if you are not requesting a lot of pages, or if the response times are quite slow.
from multiprocessing.dummy import Pool as ThreadPool
import itertools
import requests
with ThreadPool(len(names)) as pool:  # creates a Pool of 3 threads (one per name)
    res = pool.starmap(requests.post, zip(itertools.repeat(url), names))
pool.starmap is used to pass (map) multiple arguments to one function (requests.post) that is going to be called by a pool of threads (ThreadPool). It returns a list of requests.Response objects, one for each request made.
itertools.repeat(url) is needed so that the first argument is repeated once for each thread being created.
names supplies the second positional argument of requests.post (data), so it works without explicitly naming the optional data parameter. Its length must match the number of threads being created.
This code will not work if you need to pass another parameter, such as an optional keyword argument.
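If you do need extra parameters (a timeout, headers, etc.), one option is to freeze them with functools.partial and map only over the payloads; a sketch under that assumption (url and names as in the question, timeout=10 purely as an example):
import functools
from multiprocessing.dummy import Pool as ThreadPool
import requests

# Freeze url and the optional keyword arguments once, then map over the payloads.
post = functools.partial(requests.post, url, timeout=10)

with ThreadPool(len(names)) as pool:
    res = pool.map(lambda name: post(data=name), names)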

Threads or asyncio gather?

Which is the best method to do concurrent I/O operations?
threads, or
asyncio?
There will be a list of files.
I open each file, generate a graph from the .txt data, and store the graph on disk.
I have tried using threads, but it is time consuming and sometimes it does not generate a graph for some files.
Is there any other method?
I tried the code below with async on the load_instantel_ascii function, but it raises an exception:
for fl in self.finallist:
    k = randint(0, 9)
    try:
        task2.append(*[load_instantel_ascii(fleName=fl, columns=None,
                                            out=self.outdir,
                                            separator=',')])
    except:
        print("Error on Graph Generation")

event_loop.run_until_complete(asyncio.gather(yl1
                                             for kl1 in task2)
                              )
If I understood everything correctly and you want asynchronous file I/O, then asyncio itself doesn't support it out of the box. In the end, everything asyncio-related that provides async file I/O does it using a thread pool.
But that probably doesn't mean you shouldn't use asyncio: this lib is cool as a way to write asynchronous code in the first place, even if it is a wrapper above threads. I would give something like aiofiles a try.
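As a rough illustration of that combination (not a tested implementation: aiofiles handles the reads, and generate_graph stands in for the question's load_instantel_ascii / plotting code):
import asyncio
import aiofiles

async def process_file(filename, outdir):
    # aiofiles delegates the blocking read to a thread pool under the hood,
    # so many files can be "in flight" without blocking the event loop.
    async with aiofiles.open(filename) as f:
        text = await f.read()
    # Graph generation is CPU/disk-bound, so push it to a worker thread too.
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(None, generate_graph, text, outdir)

async def main(file_list, outdir):
    await asyncio.gather(*(process_file(fl, outdir) for fl in file_list))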

Wrapping synchronous requests into asyncio (async/await)?

I am writing a tool in Python 3.6 that sends requests to several APIs (with various endpoints) and collects their responses to parse and save them in a database.
The API clients that I use have a synchronous version of requesting a URL, for instance they use
urllib.request.Request('...
Or they use Kenneth Reitz' Requests library.
Since my API calls rely on synchronous versions of requesting a URL, the whole process takes several minutes to complete.
Now I'd like to wrap my API calls in async/await (asyncio). I'm using python 3.6.
All the examples / tutorials that I found want me to change the synchronous URL calls / requests to an async version of it (for instance aiohttp). Since my code relies on API clients that I haven't written (and I can't change) I need to leave that code untouched.
So is there a way to wrap my synchronous requests (blocking code) in async/await to make them run in an event loop?
I'm new to asyncio in Python. This would be a no-brainer in NodeJS. But I can't wrap my head around this in Python.
The solution is to wrap your synchronous code in a thread and run it that way. I used that exact approach to make my asyncio code run boto3 (note: remove the inline type hints if running on Python < 3.6):
async def get(self, key: str) -> bytes:
    s3 = boto3.client("s3")
    loop = asyncio.get_event_loop()
    try:
        response: typing.Mapping = \
            await loop.run_in_executor(  # type: ignore
                None, functools.partial(
                    s3.get_object,
                    Bucket=self.bucket_name,
                    Key=key))
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            raise base.KeyNotFoundException(self, key) from e
        elif e.response["Error"]["Code"] == "AccessDenied":
            raise base.AccessDeniedException(self, key) from e
        else:
            raise
    return response["Body"].read()
Note that this works because the vast majority of the time in the s3.get_object() call is spent waiting for I/O, and while waiting for I/O Python (generally) releases the GIL (the GIL is the reason that threads in Python are generally not a good idea for CPU-bound work).
The first argument None in run_in_executor means that we run in the default executor. This is a thread pool executor, but it can make things clearer to assign a thread pool executor there explicitly.
Note that, whereas with pure async I/O you could easily have thousands of connections open concurrently, using a thread pool executor means that each concurrent call to the API needs a separate thread. Once you run out of threads in your pool, the thread pool will not schedule your new call until a thread becomes available. You can obviously raise the number of threads, but this will eat up memory; don't expect to be able to go over a couple of thousand.
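For example, passing a dedicated, deliberately bounded executor might look like this sketch (the names and the max_workers value are illustrative, not taken from the answer above):
import asyncio
import concurrent.futures
import functools

# One shared, explicitly sized pool instead of the default executor.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=20)

async def get_object_async(s3, bucket, key):
    loop = asyncio.get_event_loop()
    # Passing the executor (instead of None) makes the thread budget explicit
    # and lets several wrappers share the same bounded pool.
    return await loop.run_in_executor(
        executor, functools.partial(s3.get_object, Bucket=bucket, Key=key))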
Also see the Python ThreadPoolExecutor docs for an explanation and some slightly different code on how to wrap your sync call in async code.

Multi requests to Youtube API V2 using Groovy

I have a list of YouTube videos from different playlists and I need to check whether these videos are still valid (there are around 1000). What I am doing at the moment is hitting YouTube using its API v2 and Groovy with this simple script:
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET

http = new HTTPBuilder('http://gdata.youtube.com')

myVideoIds.each { id ->
    if (!isValidYoutubeUrl(id)) {
        // do stuff
    }
}

boolean isValidYoutubeUrl(id) {
    boolean valid = true
    http.request(GET) {
        uri.path = "feeds/api/videos/${id}"
        headers.'User-Agent' = 'Mozilla/5.0 Ubuntu/8.10 Firefox/3.0.4'
        response.failure = { resp ->
            valid = false
        }
    }
    valid
}
but after a few seconds it starts to return 403 for every single id (this may be because it is issuing too many requests in quick succession). The problem is reduced if I insert something like Thread.sleep(3000). Is there a better solution than just delaying the requests?
In V2 of the API, there are time-based limits on how many requests you can make, but they aren't a hard and fast limit (that is, it depends somewhat on many under-the-hood factors and may not always be the same limit). Here's what the documentation says:
The YouTube API enforces quotas to prevent problems associated with irregular API usage. Specifically, a too_many_recent_calls error indicates that the API servers have received too many calls from the same caller in a short amount of time. If you receive this type of error, then we recommend that you wait a few minutes and then try your request again.
You can avoid this by putting in a sleep like you do, but you'd want it to be 10-15 seconds or so.
It's more important, though, to implement batch processing. With this, you can make up to 50 requests at once (this counts as 50 requests against your overall request per day quota, but only as one against your per time quota). Batch processing with v2 of the API is a little involved, as you make a POST request to a batch endpoint first, and then based on those results you can send in the multiple requests. Here's the documentation:
https://developers.google.com/youtube/2.0/developers_guide_protocol?hl=en#Batch_processing
If you use v3 of the API, batch processing becomes quite a bit easier, as you just send 50 IDs at a time in the request. Change:
http = new HTTPBuilder('http://gdata.youtube.com')
to:
http = new HTTPBuilder('https://www.googleapis.com')
Then set your uri.path to
youtube/v3/videos?part=id&maxResults=50&key={your API key}&id={variable here that represents 50 YouTube IDs, comma separated}
For 1000 videos, then, you'll only need to make 20 calls. Any video that doesn't come back in the list doesn't exist anymore (if you need to get video details, change the part parameter to id,snippet,contentDetails or something appropriate for your needs).
Here's the documentation:
https://developers.google.com/youtube/v3/docs/videos/list#id
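For reference, here is a rough sketch of that videos.list check in Python (for illustration only; the same request can be made from Groovy's HTTPBuilder, and the API key is a placeholder):
import requests

def missing_videos(video_ids, api_key):
    # videos.list accepts up to 50 comma-separated IDs per call; any ID that
    # does not come back in items[] no longer exists (or is not accessible).
    missing = []
    for i in range(0, len(video_ids), 50):
        chunk = video_ids[i:i + 50]
        resp = requests.get(
            "https://www.googleapis.com/youtube/v3/videos",
            params={"part": "id", "id": ",".join(chunk), "key": api_key})
        returned = {item["id"] for item in resp.json().get("items", [])}
        missing += [vid for vid in chunk if vid not in returned]
    return missing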
