python time out of stream method on gcp firestore - python-3.x

I am using GCP Firestore with the Python API, and I need to query all the documents in a collection.
This is the code I am using:
from google.cloud import firestore
from tqdm import tqdm

db = firestore.Client()
documents = db.collection(collection_name).stream()
for doc in tqdm(documents):
    ...  # some time-consuming operation (2-3 seconds per document)
Everything runs fine but after 1 minute, the for loop ends.
I thought maybe the connection was getting timed out. I found this on the documentation page.
The underlying stream of responses will time out after the max_rpc_timeout_millis value set in
the GAPIC client configuration for the RunQuery API. Snapshots not consumed from the iterator
before that point will be lost.
My question is how can I modify this timeout value, to suit my needs. Thank you.

In my case, the "503 The datastore operation timed out, or the data was temporarily unavailable." response from Firestore was also causing AttributeError: '_UnaryStreamMultiCallable' object has no attribute '_retry'.
It looks like no retry policy is set, even though Python's firebase_admin package is capable of retrying timeout errors. So I configured a basic Retry object explicitly, and this solved my issue:
from google.api_core.retry import Retry
documents = db.collection(collection_name).stream(retry=Retry())
With this change, a collection of 190K items is exported in 5 minutes in my case. Before it, the iteration was also interrupted after 60 seconds.
Counterintuitively, as mentioned in the docs, .stream() has a cumulative timeout for consuming the entire collection, not a timeout per item or per chunk.
So, if your collection has 1000 items and processing each item takes 0.5 seconds, total consumption time adds up to 500 seconds, which is greater than the default (undocumented) timeout of 60 seconds.
Also counterintuitively, a timeout argument of the CollectionReference.stream method does not override the max_rpc_timeout_millis mentioned in the documentation. In fact, it behaves like a client-side timeout, and the operation is effectively timed out after min(max_rpc_timeout_millis / 1000, timeout) seconds.
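The arithmetic in this answer can be checked with a small sketch. The function names are made up for illustration, and the 60-second figure is the behaviour observed above, not a documented constant.

```python
def effective_timeout_s(max_rpc_timeout_millis, timeout=None):
    """Seconds until the stream is cut off, per the min() rule described above."""
    server_side = max_rpc_timeout_millis / 1000
    return server_side if timeout is None else min(server_side, timeout)


def docs_processed_before_cutoff(total_docs, seconds_per_doc, cutoff_s):
    """Rough count of documents the loop gets through before the cutoff."""
    return min(total_docs, int(cutoff_s // seconds_per_doc))


# A larger client-side timeout does not extend the 60-second stream limit.
cutoff = effective_timeout_s(60_000, timeout=300)
print(cutoff)                                           # 60.0
print(docs_processed_before_cutoff(1000, 0.5, cutoff))  # 120
```

With 0.5 seconds of work per document, only about 120 of the 1000 documents would be reached before the stream ends.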

Related

How to limit execution of python flask function to single instance

I have a Python Flask page which is extremely slow to generate. It takes about 1 minute to pull all the data from external APIs, process the data before returning the page. Fortunately, the data is valid for up to 1 hour so I can cache the result and return cached results quickly for most of the requests.
This works well except for the minute after the cache expires. If 10 requests were made within that single minute, there will be 10 calls to veryslowpage() function, this eats up the HTTPS connection pool due to the external API calls and eats up memory due to the data processing, affecting other pages on the site. Is there a method to limit this function to a single instance, so 10 requests will result in only 1 call to veryslowpage() while the rest wait until the cached result is ready?
from flask import Flask, request, abort, render_template
from flask_caching import Cache
# app (Flask) and cache (flask_caching.Cache) are assumed to be created elsewhere.
@app.route('/veryslowpage', methods=['GET'])
@cache.cached(timeout=3600, query_string=True)
def veryslowpage():
    data = callexternalAPIs()
    result = heavydataprocessing(data)
    return render_template("./index.html", content=result)
You could simply create a function that periodically fetches the data from the API (every hour) and stores it in your database. Then, in your route function, read the data from your database instead of from the external API.
A better approach is to create a very simple script and call it (in another thread) from your app/__init__.py, so that it fetches the data every hour and updates the database.
You could create a file or a database entry that contains the information that you are calculating the response in a different thread. Then, your method would check if such a file exists and if it does, let it wait and check for the response periodically.
You could also proactively create the data once every hour (or every 59 minutes if that matters) so you always have a fresh response available. You could use something like APScheduler for this.
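The "let the other requests wait" idea can also be sketched in-process with a lock instead of a marker file. This is a minimal sketch using only the standard library; get_or_compute and its parameters are hypothetical names, and it only coordinates threads inside one process. With several worker processes, a shared marker (file or database entry) as described above would still be needed.

```python
import threading
import time

_lock = threading.Lock()
_cache = {"value": None, "expires_at": float("-inf")}


def get_or_compute(compute, ttl_seconds=3600, now=time.monotonic):
    """Return the cached value, letting only one thread recompute it."""
    if _cache["expires_at"] > now():
        return _cache["value"]        # fast path: cache is still fresh
    with _lock:                        # concurrent callers wait here
        if _cache["expires_at"] > now():
            return _cache["value"]    # another thread refreshed while we waited
        value = compute()
        _cache["value"] = value
        _cache["expires_at"] = now() + ttl_seconds
        return value
```

In the route, the slow work would be wrapped as something like get_or_compute(lambda: heavydataprocessing(callexternalAPIs())), so ten simultaneous requests after expiry trigger a single recomputation.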

Elasticsearch 429 Too Many Requests _bulk with synchronous requests

I am using the AWS Elasticsearch service. The dev environment runs on a t3.small instance.
I have approximately 15,000 records that I want to index in bulk. I split them into chunks of 250 items each (or smaller than 10 MiB) and run _bulk requests with the refresh="wait_for" option one by one, waiting until each request has finished before sending the next one.
At some point, at approximately the 25th iteration, the request immediately fails with the message
429 Too Many Requests /_bulk
For reference, if the chunk size is 500, it fails at around request 25/2 (about the 12th).
It doesn't say anything more than this. I cannot understand why this happens when nothing else is sending bulk requests in parallel with me. I checked that the data size is less than 10 MB.
What I already have:
I send the requests sequentially, awaiting the previous one each time
Bulk request size is less than 10 MiB
Each bulk request contains no more than 250 records (plus 250 action lines indicating that each is an index operation)
I am using refresh="wait_for"
I even have a 2-second delay before sending a new request (which I strongly want to remove)
Adding new instances or increasing storage space doesn't help at all
What could be the reason for this error? How can I guarantee that my requests will not fail if I send everything sequentially? Is there any additional option I can pass to be sure that a bulk request has completely finished?
A 429 error message as a write rejection indicates a bulk queue error. The es_rejected_execution_exception[bulk] indicates that your queue is full and that any new requests are rejected. When the number of requests to the Elasticsearch cluster exceeds the bulk queue size (threadpool.bulk.queue_size), this bulk queue error occurs. A bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you are using.
You can consult this link https://aws.amazon.com/premiumsupport/knowledge-center/resolve-429-error-es/ and check the write rejection best practices
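Write rejections can also show up per item inside a _bulk response that itself returned 200. A minimal sketch for collecting only the rejected items so they can be re-sent later, assuming the standard _bulk response shape (an "errors" flag and an "items" list in request order) and one logical action per document; rejected_actions is a hypothetical helper, not a client-library function.

```python
def rejected_actions(actions, bulk_response):
    """Return the actions whose per-item status in a _bulk response is 429."""
    if not bulk_response.get("errors"):
        return []
    retry = []
    for action, item in zip(actions, bulk_response["items"]):
        # Each item is keyed by its operation type: index, create, update or delete.
        result = next(iter(item.values()))
        if result.get("status") == 429:
            retry.append(action)
    return retry
```

Re-sending only these items after a pause keeps the retry from adding already-indexed documents back onto the bulk queue.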

Overcoming Azure Vision Read API Transactions-Per-Second (TPS) limit

I am working on a system where we are calling Vision Read API for extracting the contents from raster PDF. Files are of different sizes, ranging from one page to several hundred pages.
Files are stored in Azure Blob storage, and a function will push the files to the Read API once all files have been uploaded. There could be hundreds of files.
Therefore, when the process starts, a large number of documents are expected to be sent for text extraction per second. But the Vision API has a limit of 10 transactions per second, including Read.
I am wondering what would be best approach? Some type of throttling or queue?
Is there any integration available (say with queue) from where the Read API will pull documents and is there any type of push notification available to notify about completion of read operation? How can I prevent timeouts due to exceeding 10 TPS limit?
As I understand it, there are two key points you want to know:
How to overcome the 10 TPS limit when you have a lot of files to read.
What the best approach is for getting the Read operation status and result.
Your question is a bit broad, but I can offer some suggestions:
For Q1: generally, if you reach the TPS limit you will get an HTTP 429 response. You must wait for some time before calling the API again, or the next call will also be refused. Usually we retry the operation using something like an exponential backoff retry policy to handle the 429 error:
2.1) Check the HTTP response code in your code.
2.2) When the HTTP response code is 429, retry the operation after N seconds, where N is a value you define yourself, such as 10 seconds.
For example, the following is a 429 response. You can set your wait time to (26 + n) seconds, where n is a margin you choose, such as n = 5.
{
"error":{
"statusCode": 429,
"message": "Rate limit is exceeded. Try again in 26 seconds."
}
}
2.3) If step 2.2 succeeds, continue with the next operation.
2.4) If step 2.2 also fails with 429, retry the operation after N*N seconds (again, a value you can define yourself). This is the exponential backoff retry policy.
2.5) If step 2.4 also fails with 429, retry the operation after N*N*N seconds, and so on.
2.6) Always wait for the current operation to succeed before starting the next one; the waiting time grows exponentially.
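The retry steps above could be sketched roughly as follows. Assumptions: send is a hypothetical callable returning (status_code, body_dict), the 429 body has the shape shown above, and the fallback delay doubles on each attempt (base * 2**attempt) instead of the N, N*N, N*N*N progression in the steps.

```python
import re
import time


def wait_seconds(message, attempt, base=10, extra=5):
    """Pick a delay: the server's hint plus a margin, else base * 2**attempt."""
    match = re.search(r"Try again in (\d+) seconds", message or "")
    if match:
        return int(match.group(1)) + extra
    return base * 2 ** attempt


def call_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            message = body.get("error", {}).get("message", "")
            sleep(wait_seconds(message, attempt))
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```

Preferring the "Try again in N seconds" hint when it is present avoids retrying before the service says it is ready.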
For Q2: as we know, we can use this API to get the Read operation status and result.
If you want to get the completion notification or result, you should build a polling request for each of your operations that sends a check request at intervals, e.g. every 10 seconds. You can use an Azure Function or an Azure Automation runbook to create asynchronous tasks that check the Read operation status and, once it is done, handle the result based on your requirements.
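A polling loop of this kind might look like the sketch below. get_status is a hypothetical callable that performs the status request and returns the operation's status string; the "succeeded" and "failed" values are assumed here to be the terminal states.

```python
import time


def poll_until_done(get_status, interval=10, max_polls=60, sleep=time.sleep):
    """Check an operation every `interval` seconds until it reaches a final state."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        sleep(interval)
    raise TimeoutError("operation did not finish in time")
```

Each status check is itself a request to the service, so a longer interval reduces the number of calls made while many documents are in flight.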
Hope it helps. If you have any further concerns , please feel free to let me know.

How kinesis keep the offset and push the record again when an event fails in lambda

I am new to AWS lambda and Kinesis. Please help with the following question
I have a Kinesis stream as the source for a Lambda function, and the target is another Kinesis stream. I have the following questions.
The system must not lose any record.
If any of the records fails processing in Lambda, how is it pulled into the Lambda again? How are the unprocessed records kept? How does Kinesis track the offset to process the next record?
Please update.
From the AWS Lambda docs about using Lambda with Kinesis:
If your function returns an error, Lambda retries the batch until processing succeeds or the data expires. Until the issue is resolved, no data in the shard is processed. To avoid stalled shards and potential data loss, make sure to handle and record processing errors in your code.
In this context, also consider the Retention Period of Kinesis:
The retention period is the length of time that data records are accessible after they are added to the stream. A stream’s retention period is set to a default of 24 hours after creation. You can increase the retention period up to 168 hours (7 days)
As mentioned in the first quote, AWS will drop the event after the retention period is due. This means for you:
a) Take care that your Lambda function handles errors correctly.
b) If it's important to keep all records, also store them in a persistent storage, e.g. DynamoDB.
In addition to that, you should read about duplicate Lambda executions as well. There is a great blog post available explaining how you can achieve an idempotent implementation. And read here on another StackOverflow question & answer.
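Points a) and b), together with idempotency, could be sketched as a handler like the one below. This is an illustration only: process is hypothetical business logic, and the seen set and failed list stand in for persistent storage such as a DynamoDB table. The event layout (Records, kinesis.sequenceNumber, base64-encoded kinesis.data) is the one Lambda passes for Kinesis sources.

```python
import base64
import json


def process(payload):
    """Hypothetical business logic; raises on bad input."""
    if "id" not in payload:
        raise ValueError("missing id")
    return payload["id"]


def handler(event, context=None, seen=None, failed=None):
    """Process each record once and record failures instead of raising."""
    seen = set() if seen is None else seen        # stand-in for a persistent store
    failed = [] if failed is None else failed     # stand-in for an error table
    for record in event["Records"]:
        seq = record["kinesis"]["sequenceNumber"]
        if seq in seen:
            continue                              # duplicate delivery: skip it
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
            seen.add(seq)
        except Exception as exc:
            # Catching the error keeps the shard moving; the record must be
            # stored somewhere durable or it is effectively lost.
            failed.append({"sequenceNumber": seq, "error": str(exc)})
    return {"processed": len(seen), "failed": len(failed)}
```

The trade-off is that swallowing an exception means Lambda will not retry the batch, so nothing is lost only if the failed records really are written to durable storage.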

Why does querying Azure table storage with ExecuteQuery return fewer results than ExecuteQuerySegmented?

I'm curious if Azure can timeout on ExecuteQuery or silently error if there is a memory limit that is causing ExecuteQuery to return fewer records than ExecuteQuerySegmented.
When I run ExecuteQuery, I get a total of 1,223,749 records.
When I run ExecuteQuerySegmented, I get a total of 1,482,504 records.
The two queries are:
(the ExecuteQuerySegmented is inside of a do/while that handles the token)
var queryResult = table.ExecuteQuerySegmented(new TableQuery<RecordType>().Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, PartitionValue)), token);
var query = new TableQuery<RecordType>().Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, PartitionValue));
results.AddRange(table.ExecuteQuery(query));
A call to a Table service API can include a server timeout interval, specified in the timeout parameter of the request URI. If the server timeout interval elapses before the service has finished processing the request, the service returns an error.
The maximum timeout interval for Table service operations is 30 seconds. The Table service automatically reduces any timeouts larger than 30 seconds to the 30-second maximum.
The Table service enforces server timeouts as follows:
Query operations: During the timeout interval, a query may execute for up to a maximum of five seconds. If the query does not complete within the five-second interval, the response includes continuation tokens for retrieving remaining items on a subsequent request. See Query Timeout and Pagination for more information.
I'd recommend checking the following documentation.
If the query takes more than 30 seconds, the smaller record count returned by ExecuteQuery may be caused by the timeout.
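The continuation-token loop that the segmented query relies on can be sketched generically in Python. fetch_segment is a hypothetical callable standing in for one segmented request; it takes the previous token (None for the first call) and returns (rows, next_token).

```python
def fetch_all(fetch_segment):
    """Collect rows from every segment until no continuation token is returned."""
    results = []
    token = None
    while True:
        rows, token = fetch_segment(token)
        results.extend(rows)
        if token is None:
            return results
```

The loop stops on a missing token, not on an empty segment, since a segment that hits the query time limit may come back with few or no rows and still carry a token for the rest.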
