How to limit execution of python flask function to single instance - python-3.x

I have a Python Flask page which is extremely slow to generate. It takes about 1 minute to pull all the data from external APIs and process it before returning the page. Fortunately, the data is valid for up to 1 hour, so I can cache the result and return cached results quickly for most of the requests.
This works well except for the minute after the cache expires. If 10 requests are made within that single minute, there will be 10 calls to the veryslowpage() function; this eats up the HTTPS connection pool due to the external API calls and eats up memory due to the data processing, affecting other pages on the site. Is there a method to limit this function to a single instance, so 10 requests will result in only 1 call to veryslowpage() while the rest wait until the cached result is ready?
from flask import Flask, request, abort, render_template
from flask_caching import Cache

app = Flask(__name__)
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})  # cache backend is just an example

@app.route('/veryslowpage', methods=['GET'])
@cache.cached(timeout=3600, query_string=True)
def veryslowpage():
    data = callexternalAPIs()
    result = heavydataprocessing(data)
    return render_template("./index.html", content=result)

You could simply create a function that periodically fetches the data from the API (every hour) and stores it in your database. Then, in your route function, read the data from your database instead of the external API.
A better approach is to create a very simple script and call it (in another thread) from your app/__init__.py so that it fetches the data every hour and updates the database.
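A minimal sketch of that background-refresh idea, reusing callexternalAPIs/heavydataprocessing from the question; save_to_database is a placeholder you would implement against your own database:
import threading
import time

def refresh_data_loop():
    # callexternalAPIs/heavydataprocessing come from the question;
    # save_to_database is an assumed helper for your own persistence layer.
    while True:
        data = callexternalAPIs()
        result = heavydataprocessing(data)
        save_to_database(result)
        time.sleep(3600)  # refresh once per hour

# Started once at startup, e.g. from app/__init__.py
threading.Thread(target=refresh_data_loop, daemon=True).start()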

You could create a file or a database entry that records that you are already calculating the response in a different thread. Your method would then check whether such a file exists and, if it does, wait and poll periodically until the response is ready.
You could also proactively create the data once every hour (or every 59 minutes if that matters) so you always have a fresh response available. You could use something like APScheduler for this.
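A rough sketch of that proactive approach with APScheduler, reusing the helpers and the cache object from the question; the cache key and timeout here are assumptions:
from apscheduler.schedulers.background import BackgroundScheduler

def refresh_cache():
    # Recompute the expensive result ahead of time; "veryslowpage_result" is a made-up key.
    data = callexternalAPIs()
    result = heavydataprocessing(data)
    cache.set("veryslowpage_result", result, timeout=3600)

scheduler = BackgroundScheduler()
scheduler.add_job(refresh_cache, "interval", minutes=59)
scheduler.start()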

Related

How to prevent azure function from executing n-time simultaneously?

I have an external API which invokes my HTTP-triggered Azure function with the same query parameters 5 times at the same moment. So 5 requests are processed concurrently, each request adds a record to my Google Sheet, and this causes unwanted duplicated records. My function checks for a duplicate in that sheet before pushing a new record, but when 5 instances are called at the same time, the duplicate does not exist yet. Is there any simple solution to process those 5 requests one by one, without concurrency?

How to cache and execute a call in NodeJS

I have an application that gathers data from another application, both in NodeJS.
I was wondering, how can I trigger sending the data to a third application under certain conditions? For example, every 10 mins if there's data in a bucket, or when I have 20 elements to send?
And if the call to the third party fails, how can I repeat it after 10-15 mins?
EDIT:
The behaviour should be something like:
if you have 1 item posted (axios.post) AND [10 mins have passed OR 10 more items have been posted], SUBMIT to App n.3
What can help me do this? Can I keep the values saved until those requirements are satisfied?
Thank you <3
You can use a package like node-schedule, which is a popular way to schedule tasks. When the callback runs, check whether there is enough data (posts) to send.

Python get paginated data from REST API in parallel

I want to get 1 million records from a single REST API URL with pagination. Each page can return up to 100k records in approx 1 min. I want to make 10 POST requests concurrently so that I can get all 1 million records in 1 min. I used the requests library and ThreadPoolExecutor to make concurrent connections and get the data, but it's taking a very long time.
My understanding of aiohttp or grequests is that they call the REST API asynchronously from a single thread and get the data while waiting for others to make network connections.
Sample code
import concurrent.futures
import requests

def post_request(page_number):
    # page parameter placement follows the question's apparent intent
    response = requests.post(url, data={**data, "fetchPageNumber": page_number})
    # logic to convert response.text to a pandas df and write to RDBMS

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    parallel_response = executor.map(post_request, range(1, 11))
Please let me know what is the best way to get the paginated data in parallel.
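For reference, a minimal sketch of the aiohttp approach mentioned above, assuming the page number goes in the JSON body as fetchPageNumber (taken from the question's sample) and that url is defined elsewhere:
import asyncio
import aiohttp

async def fetch_page(session, page_number):
    # fetchPageNumber comes from the question; adjust to the real API's pagination parameter
    async with session.post(url, json={"fetchPageNumber": page_number}) as response:
        return await response.text()

async def fetch_all_pages():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, n) for n in range(1, 11)]
        return await asyncio.gather(*tasks)

pages = asyncio.run(fetch_all_pages())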

python timeout of stream method on gcp firestore

I am using GCP firestore. For some reason, I am querying all the documents present in a collection. I am using the python API.
Code I am using
from google.cloud import firestore
from tqdm import tqdm

db = firestore.Client()
documents = db.collection(collection_name).stream()
for doc in tqdm(documents):
    pass  # some time consuming operation (2-3 seconds)
Everything runs fine but after 1 minute, the for loop ends.
I thought maybe the connection was getting timed out. I found this on the documentation page.
The underlying stream of responses will time out after the max_rpc_timeout_millis value set in
the GAPIC client configuration for the RunQuery API. Snapshots not consumed from the iterator
before that point will be lost.
My question is how can I modify this timeout value, to suit my needs. Thank you.
In my case, the "503 The datastore operation timed out, or the data was temporarily unavailable." response from Firestore has also been causing AttributeError: '_UnaryStreamMultiCallable' object has no attribute '_retry'.
It looks like the retry policy is not set, even though Python's firebase_admin package is capable of retrying timeout errors too. So I just configured a basic Retry object explicitly, and this solved my issue:
from google.api_core.retry import Retry
documents = db.collection(collection_name).stream(retry=Retry())
A collection of 190K items is exported in 5 minutes in my case. Originally, the iteration had also been getting interrupted after 60 seconds.
Counterintuitively, as mentioned in the docs, .stream() has a cumulative timeout for an entire collection consumption, and not a single item or chunk retrieval.
So, if your collection has 1000 items and every item processing takes 0.5 seconds, total consumption time will sum up to 500 seconds which is greater than the default (undocumented) timeout of 60 seconds.
Also counterintuitively, the timeout argument of the CollectionReference.stream method does not override the max_rpc_timeout_millis mentioned in the documentation. In fact, it behaves like a client-side timeout, and the operation is effectively timed out after min(max_rpc_timeout_millis / 1000, timeout) seconds.
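For illustration, a sketch of passing both a longer Retry deadline and the client-side timeout discussed above; the 600-second values are arbitrary examples and worth checking against your client library version:
from google.api_core.retry import Retry
from google.cloud import firestore

db = firestore.Client()
documents = db.collection(collection_name).stream(
    retry=Retry(deadline=600),  # overall retry deadline in seconds (arbitrary example value)
    timeout=600,                # client-side timeout discussed above (arbitrary example value)
)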

Spark and 100000k of sequential HTTP calls: driver vs workers

I have to do 100000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50KB of data, and I have to keep a 1 second delay between requests in order not to exceed API rate limits.
Where should I make the HTTP calls: from the Spark job's code (executed on the driver/master node) or from a dataset transformation (executed on a worker node)?
Workarounds
1. Make HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5000 JSON items) and save each dataset to S3 with the help of Spark. You do not need to keep a dataset after you have saved it.
2. Create a dataset from all 100000 URLs (moving all further computation to workers), make HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computations: they're sequential, because of the 1 second delay. But:
* Is it bad to make 100_000 HTTP calls from the driver/master node?
* Is it more efficient to create/save one 100_000 * 5_000 dataset than to create/save 100_000 small datasets of size 5_000?
* Each time I create a dataset from an HTTP response, I'll move the response to a worker and then save it to S3, right? Double shuffling then...
The second option
Actually it won't benefit from parallel processing, since you have to keep an interval of 1 second between requests. The only bonus is moving the computations (even if they aren't too heavy) off the driver. But:
* Is it worth moving the computations to workers?
* Is it a good idea to make API calls inside a transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files: a POST to initiate the multipart upload after that first block, one POST per 32MB block, and a final POST of a JSON file to complete. So: slightly more efficient.
Where small S3 file sizes matter is in the bills from AWS and in follow-up Spark queries: in anything you use in Spark, PySpark, SQL etc., many small files are slower. There's a high cost in listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete costs.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't replicable then task failures & retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or another object store from workers: first the RDD of copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration info, in case someone ever wanted to try and aggregate the stats (though there you'd probably need a timestamp for a time-series view).
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how you could do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there.
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be:
* a first RDD takes the list of queries and some summary info about any existing checkpointed data, and calculates the next 15 minutes of work
* build up a list of GET calls to delegate to 1+ workers: either 1 URL per row, or multiple URLs in a single row
* run that job, save the results
* test that recovery works with a smaller window and by killing things
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. sleeping in the worker
2. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune the sleep window for subsequent tasks.
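To make the shape of that concrete, here's a rough PySpark sketch of one checkpointed batch, with pacing and throttle counting done per task; the URLs, bucket path and batch size are all made up, and this does not enforce a global 1 request/second limit across workers:
import time
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("http-batch").getOrCreate()

# Hypothetical input: the next 15-20 minute window of URLs, computed by the driver
# from whatever checkpointed results already exist in S3.
next_batch_of_urls = ["https://api.example.com/items?page=%d" % p for p in range(1, 901)]

def fetch_partition(rows):
    # Runs on workers: one GET per row, roughly 1 request/second per task,
    # reporting duration plus a throttle flag so the driver can aggregate stats.
    for row in rows:
        start = time.time()
        resp = requests.get(row.url)
        throttled = 1 if resp.status_code == 429 else 0
        if throttled:
            time.sleep(30)                                 # back off on a throttle event
        time.sleep(max(0.0, 1.0 - (time.time() - start)))  # keep ~1 request/second
        yield (row.url, resp.status_code, resp.text, time.time() - start, throttled)

batch_df = spark.createDataFrame([(u,) for u in next_batch_of_urls], ["url"])
results_rdd = batch_df.rdd.mapPartitions(fetch_partition)
(spark.createDataFrame(results_rdd, ["url", "status", "body", "seconds", "throttled"])
      .write.mode("append").parquet("s3a://some-bucket/checkpointed-results/"))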
