Elasticsearch 429 Too Many Requests _bulk with synchronous requests - node.js

I am using the AWS Elasticsearch service. The dev environment runs on a t3.small instance.
I have approximately 15,000 records that I want to index in bulk. I split them into chunks of 250 items each (or fewer, to stay under 10 MiB) and run a _bulk request with the refresh="wait_for" option for each chunk, one by one, waiting until each request has finished before sending the next one.
At some point, around the 25th iteration, the request fails immediately with the message
429 Too Many Requests /_bulk
For reference, with a chunk size of 500 it fails around request 12 (25/2).
The response says nothing more than this. I cannot understand why it happens when nothing else is sending bulk requests in parallel with mine. I checked that the data size is less than 10 MB.
What I already have
I send the requests sequentially, awaiting the previous one
Bulk request size is less than 10 MiB
Each bulk request contains no more than 250 records (+ 250 action lines to indicate that this is indexing)
I am using refresh="wait_for"
I even have a 2-second delay before sending each new request (which I strongly want to remove)
Adding new instances or increasing storage space doesn't help at all
What could be the reason for this error? How can I be sure that my requests will not fail if I send everything sequentially? Is there any additional option I can pass to be sure that a bulk request has completely finished?

A 429 error message as a write rejection indicates a bulk queue error. The es_rejected_execution_exception[bulk] indicates that your queue is full and that any new requests are rejected. When the number of requests to the Elasticsearch cluster exceeds the bulk queue size (threadpool.bulk.queue_size), this bulk queue error occurs. A bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you are using.
You can consult this link https://aws.amazon.com/premiumsupport/knowledge-center/resolve-429-error-es/ and check the write rejection best practices
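Since a full bulk queue is a temporary condition, one common way to handle the rejection is to retry the same chunk after a growing delay instead of failing the whole import. The following is only a minimal sketch of that idea: `send` is a hypothetical callable that performs the POST to /_bulk and returns the HTTP status code, and `max_retries` and `base_delay` are illustrative values, not recommendations from AWS.

```python
import random
import time


def send_with_backoff(send, body, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(body) and retry with exponential backoff while it returns 429.

    `send` is a hypothetical callable (not part of any client library) that
    posts one bulk chunk and returns the numeric HTTP status code.
    """
    for attempt in range(max_retries + 1):
        status = send(body)
        if status != 429:
            return status
        if attempt == max_retries:
            break
        # Wait 1x, 2x, 4x ... the base delay, plus a little jitter so that
        # retries from several clients do not line up.
        delay = base_delay * (2 ** attempt)
        sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError(f"still rejected with 429 after {max_retries} retries")
```

With a wrapper like this the fixed 2-second pause between chunks may become unnecessary, because the client only waits when the cluster actually pushes back. A bulk response can also come back with HTTP 200 while individual items were rejected, so it is worth checking the `errors` flag in the response body as well.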

Related

azure logic apps iteration control

I am using a Logic App to process batches of records. Let's say I start processing 1000 records in batches of 500. I put a condition on an Until loop so that it keeps running until all 1000 records are processed. It first picks up 500 records and starts processing them. If a network issue or any other issue occurs while the first 500 records are being processed, the flow exits the Until loop and leaves the other batch of 500 records unprocessed.
My question is: how can I continue with the next batch of 500 records even if the first batch of 500 fails?
One workaround that works is to replace the Until loop with the batch messages trigger and set its message count. The specified number of messages is then batched and released at once; in your case that would be 500.

Encountered a retryable error. Will Retry with exponential backoff 413

Logstash keeps encountering the following error message, and logs cannot be sent to AWS Elasticsearch.
[2021-04-28T16:01:28,253][ERROR][logstash.outputs.amazonelasticsearch]
Encountered a retryable error. Will Retry with exponential backoff
{:code=>413,
:url=>"https://search-xxxx.ap-southeast-1.es.amazonaws.com:443/_bulk"}
Because of this I always need to restart Logstash, and I cannot figure out what causes the issue. Following the Logstash documentation I reduced pipeline.batch.size to 100, but it didn't help. Please let me know how to resolve this issue. Thanks.
pipeline.batch.size: 125
pipeline.batch.delay: 50
A 413 response is "payload too large". It does not make much sense to retry this, since it will probably recur forever and shut down the flow of events through the pipeline. If there is a proxy or load balancer between logstash and elasticsearch then it is possible that that is returning the error, not elasticsearch, in which case you may need to reconfigure the proxy.
The maximum payload size accepted by amazonelasticsearch will depend on what type of instance you are running on (scroll down to Network Limits). For some instance types it is 10 MB.
In logstash, a batch of events is divided into 20 MB chunks as it is sent to elasticsearch. The 20 MB limit cannot be adjusted (unless you want to edit the source and build your own plugin). However, if there is a single large event it has to be sent in one request, so it is still possible for a request larger than that to be sent.
Since 20 MB is bigger than 10 MB this is going to be a problem if your batch size is over 10 MB. I do not think you have any visibility into the batch size other than the 413 error. You will have to keep reducing the pipeline.batch.size until the error goes away.
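If you can sample the size of your events (for example from a file output), a rough starting value for pipeline.batch.size can be estimated instead of found purely by trial and error. This is only a back-of-the-envelope sketch: the function name, the 80% headroom and the idea of budgeting by the largest sampled event are assumptions made for illustration, not something Logstash provides.

```python
def max_safe_batch_size(event_sizes_bytes, limit_bytes=10 * 1024 * 1024,
                        headroom_percent=80):
    """Estimate how many events fit in one request under the size limit.

    Budgets by the largest sampled event and keeps some headroom for the
    bulk action lines that are sent along with every event.
    Raises ValueError if the sample is empty.
    """
    largest = max(event_sizes_bytes)
    budget = limit_bytes * headroom_percent // 100
    # A single event larger than the budget still has to go out in one
    # request, so the result never drops below 1.
    return max(1, budget // largest)
```

With a largest sampled event of 500,000 bytes and a 10 MB limit, this estimate gives a batch size of 16. Treat the result as a first guess and still watch for 413 responses.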
I fixed the issue: we needed to choose the correct ES instance size based on max_content_length.
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-limits.html

Would SQS batch size max limit result in slower processing through Lambdas?

I'm aware that AWS has allowed SQS to be one of the event source mappings for Lambdas. I'm glad that this is possible now as I would then not have to poll from the queue every few seconds through a cron job. However, it appears that the maximum possible value for batchSize is limited to 10. From my understanding, the batchSize is the number of messages a single Lambda invocation will receive from the queue.
This sounds like it could be an issue for me because, in my case, I may have a few hundreds of thousands of messages at a time in the queue. Those messages don't need any heavy processing; they just need to be parsed and saved to the database as a record. It's pretty simple.
If the batchSize is limited to only 10 messages per retrieval, I foresee a few issues that I may have:
It may actually take a long time to finish processing the messages on the queue.
Not only is 10 messages per retrieval slow; since the messages are very simple to process, handling only 10 of them in a single Lambda invocation also sounds a little wasteful. Given how little work each message needs, I'm pretty sure a single invocation could process at least a few thousand messages.
Having only 10 messages per retrieval may also mean that I need to make more write operations to my database, because each of these messages needs to be inserted as a record in the database.
Are my concerns valid in this case? If so, is there anything else I can do with SQS and Lambdas to overcome those concerns?
Your assumption about a limit of 10 is correct.
Lambda will spin up more instances to run in parallel, if there are more messages available. See Scaling and Processing. This means that if there are 1000 messages available, Lambda might spin up 100 concurrent executions to quickly process all the messages.
Once a lambda function has processed the 10 messages of a batch, it continues with processing other batches. As lambda bills in 100ms intervals, the wasted time is minimal.
As for the database writes you could pre-process the messages before inserting them into the queue.
In that case you need to let your Lambda function fetch the messages from the queue and process them itself, rather than having Lambda triggered via SQS. You could have a CloudWatch event trigger the Lambda for you, depending on your use case.
Please note that SQS has a limit of max 10 messages in one go but you could write the code to make it much more efficient.
One package that is very efficient at this is squiss-ts
In this case you could let your Lambda function run for 15 minutes (the maximum) and let it process as many messages as possible. Idempotency is the key when you are designing this kind of application, so that if a message wasn't processed in this run, it will be processed in the next run.
The downside of this approach is that you need to scale your Lambdas manually depending on how many messages you are anticipating.
You're right that a larger batch size seems appropriate for your use case.
As of late 2020, if you specify a batch window in seconds, you can then specify a batch size of up to 10,000 messages.
So with this new option you can now configure your lambda to wait and receive much larger batches per invocation.
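Whatever batch size is configured, the handler can reduce database round trips by writing the whole batch at once. Below is a minimal sketch of such a handler, assuming the message bodies are JSON; `save_records` is a hypothetical placeholder for a single multi-row insert into your database and is not a real library call.

```python
import json


def save_records(rows):
    """Hypothetical placeholder: replace with one multi-row database insert."""
    return len(rows)


def handler(event, context):
    # An SQS-triggered invocation receives its batch under event["Records"];
    # each record carries the message text in its "body" field.
    rows = [json.loads(record["body"]) for record in event["Records"]]
    # One write for the whole batch instead of one write per message.
    saved = save_records(rows)
    return {"received": len(rows), "saved": saved}
```

The same handler works unchanged for a batch of 10 or of several thousand messages, so raising the batch size later only requires a change to the event source mapping.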

Overcoming Azure Vision Read API Transactions-Per-Second (TPS) limit

I am working on a system where we are calling Vision Read API for extracting the contents from raster PDF. Files are of different sizes, ranging from one page to several hundred pages.
Files are stored in Azure Blob and there will be a function to push files to Read API once when all files are uploaded to blob. There could be hundreds of files.
Therefore, when the process starts, a large number of documents are expected to be sent for text extraction per second. But Vision API has limit of 10 transactions per second including read.
I am wondering what would be best approach? Some type of throttling or queue?
Is there any integration available (say with queue) from where the Read API will pull documents and is there any type of push notification available to notify about completion of read operation? How can I prevent timeouts due to exceeding 10 TPS limit?
Per my understanding, there are 2 key points you want to know:
How to overcome the 10 TPS limit when you have a lot of files to read.
The best approach to get the Read operation status and result.
Your question is a bit broad, but maybe I can provide some suggestions:
For Q1: generally, if you reach the TPS limit you will get an HTTP 429 response. You must wait for some time before calling the API again, or else the next call will be refused. Usually we retry the operation using something like an exponential backoff retry policy to handle the 429 error:
2.1) Check the HTTP response code in your code.
2.2) When the HTTP response code is 429, retry the operation after N seconds, where N is a value you define yourself, such as 10 seconds…
For example, the following is a 429 response. You can set your wait time to (26 + n) seconds. (PS: you can define n yourself here, such as n = 5…)
{
"error":{
"statusCode": 429,
"message": "Rate limit is exceeded. Try again in 26 seconds."
}
}
2.3) If the retry in step 2.2 succeeds, continue with the next operation.
2.4) If the retry in step 2.2 also fails with 429, retry the operation after N*N seconds (you can define this yourself too); this is an exponential backoff retry policy.
2.5) If step 2.4 also fails with 429, retry the operation after N*N*N seconds…
2.6) You should always wait for the current operation to succeed, and the waiting time grows exponentially.
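The retry steps above can be sketched roughly as follows. This is an illustration under assumptions, not the official client behaviour: `call` is a hypothetical function that performs one API request and returns the status code together with the parsed JSON body, the wait hint is parsed from the message format shown in the example response, and the delay is doubled after each consecutive 429 rather than raised to a power.

```python
import re
import time


def retry_after_seconds(message, default=10):
    """Extract the wait hint from a 'Try again in N seconds.' message."""
    match = re.search(r"Try again in (\d+) seconds?", message)
    return int(match.group(1)) if match else default


def call_with_retry(call, max_attempts=5, extra=5, sleep=time.sleep):
    """Retry `call` while it answers 429.

    `call` is a hypothetical function returning (status_code, json_body).
    """
    failures = 0
    while True:
        status, body = call()
        if status != 429:
            return status, body
        failures += 1
        if failures >= max_attempts:
            raise RuntimeError("rate limit still exceeded, giving up")
        hinted = retry_after_seconds(body.get("error", {}).get("message", ""))
        # Honour the server hint plus a margin, doubling after each 429.
        sleep((hinted + extra) * 2 ** (failures - 1))
```

For the example response, the first wait would be 26 + 5 = 31 seconds and the second 62 seconds.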
For Q2: as we know, we can use this API to get the Read operation status/result.
If you want to get the completion notification/result, you should poll each of your operations at intervals, e.g. send a check request every 10 seconds. You can use an Azure Function or an Azure Automation runbook to create asynchronous tasks that check the Read operation status and, once it is done, handle the result based on your requirements.
Hope it helps. If you have any further concerns, please feel free to let me know.
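A polling loop of this kind could look like the sketch below. It assumes a hypothetical `get_status` callable that requests the operation's status URL and returns the parsed JSON, and it assumes that the terminal values of the `status` field are "succeeded" and "failed"; check the API reference for the exact values used by your API version.

```python
import time


def wait_for_read_result(get_status, interval=10, timeout=600, sleep=time.sleep):
    """Poll until the Read operation reaches a terminal state.

    `get_status` is a hypothetical callable that fetches the operation
    status and returns a dict with a "status" field.
    """
    waited = 0
    while True:
        result = get_status()
        if result.get("status") in ("succeeded", "failed"):
            return result
        if waited >= timeout:
            raise TimeoutError(f"operation not finished after {waited} seconds")
        sleep(interval)
        waited += interval
```

A longer interval keeps the number of status requests down when many documents are in flight at the same time.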

Spark and 100000 sequential HTTP calls: driver vs workers

I have to do 100000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50 KB of data, and I have to wait 1 second between requests in order not to exceed the API rate limits.
Where to make HTTP calls: from Spark Job's code (executed on driver/master node) or from dataset transformation (executed on worker node)?
Workarounds
Make the HTTP requests from my Spark job (on the Driver/Master node), create a dataset from each HTTP response (each contains 5000 JSON items) and save each dataset to S3 with the help of Spark. There is no need to keep a dataset after it has been saved.
Create a dataset from all 100000 URLs (moving all further computation to the workers), make the HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computations: they're sequential because of the 1-second delay. But:
Is it bad to make 100_000 HTTP calls from the Driver/Master node?
Is it more efficient to create/save one 100_000 * 5_000 dataset than to create/save 100_000 small datasets of size 5_000?
Each time I create a dataset from an HTTP response, I'll move the response to a worker and then save it to S3, right? Double shuffling then...
Second option
Actually it won't benefit from parallel processing, since I have to keep an interval of 1 second between requests. The only bonus is moving the computations (even if they aren't too heavy) off the driver. But:
Is it worth moving the computations to workers?
Is it a good idea to make API call inside transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, a POST to initiate multipart upload after that first block, one POST per 32 MB (of 32MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient
Where small file sizes matter is in the bills from AWS and in follow-up Spark queries: whatever you use (Spark, PySpark, SQL, etc.), many small files are slower. There's a high cost in listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete costs.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't repeatable, then task failures and retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or other object store in workers; first the RDD of the copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration length info, if someone ever wanted to try and aggregate the stats (though there you'd probably need timestamp for some time series view)
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could you do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there?
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be
* first RDD takes list of queries and some summary info about any existing checkpointed data, calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ worker. Either 1 URL/row, or have multiple URLs in a single row
* run that job, save the results
* test recovery works with a smaller window and killing things.
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. sleeping in the worker
2. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune the sleep window for subsequent tasks.
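The checkpoint-and-resume idea can be illustrated without Spark at all. The sketch below runs on a single machine and makes several assumptions: `fetch` is a hypothetical function that requests one URL and persists the response (for example to S3) before returning, the checkpoint is a small local JSON file, and `save_every=900` corresponds to roughly 15 minutes of work at one request per second.

```python
import json
import os
import time


def run_throttled(urls, fetch, checkpoint_path, min_interval=1.0,
                  save_every=900, sleep=time.sleep, clock=time.monotonic):
    """Fetch urls one by one, at most one per min_interval seconds.

    The index of the next URL is written to checkpoint_path every
    save_every requests, so a restarted run continues from there.
    Returns the number of URLs processed in this run.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(urls)):
        began = clock()
        fetch(urls[i])  # hypothetical: request the URL and store the response
        done = i + 1
        if done % save_every == 0 or done == len(urls):
            with open(checkpoint_path, "w") as f:
                json.dump({"next_index": done}, f)
        # Sleep only for the part of the interval the request did not use.
        remaining = min_interval - (clock() - began)
        if remaining > 0:
            sleep(remaining)
    return len(urls) - start
```

After a crash, up to `save_every - 1` URLs are fetched a second time, so `fetch` should be safe to repeat. At one request per second, 100000 requests take about 27.8 hours, which is why resuming from a checkpoint matters more here than parallelism.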

Resources