Encountered a retryable error. Will Retry with exponential backoff 413

Logstash keeps hitting the following error, and logs cannot be sent to AWS Elasticsearch.
[2021-04-28T16:01:28,253][ERROR][logstash.outputs.amazonelasticsearch]
Encountered a retryable error. Will Retry with exponential backoff
{:code=>413,
:url=>"https://search-xxxx.ap-southeast-1.es.amazonaws.com:443/_bulk"}
Because of this I always need to restart Logstash, and I cannot figure out what causes the issue. Following the Logstash documentation, I reduced pipeline.batch.size to 100, but it didn't help. Please let me know how to resolve this issue. Thanks.
pipeline.batch.size: 125
pipeline.batch.delay: 50

A 413 response is "payload too large". It does not make much sense to retry this, since it will probably recur forever and shut down the flow of events through the pipeline. If there is a proxy or load balancer between logstash and elasticsearch, then it may be the proxy, not elasticsearch, that is returning the error, in which case you may need to reconfigure the proxy.
The maximum payload size accepted by amazonelasticsearch will depend on what type of instance you are running on (scroll down to Network Limits). For some instance types it is 10 MB.
In logstash, a batch of events is divided into 20 MB chunks as it is sent to elasticsearch. The 20 MB limit cannot be adjusted (unless you want to edit the source and build your own plugin). However, if there is a single large event it has to be sent in one request, so it is still possible for a request larger than that to be sent.
Since 20 MB is bigger than 10 MB this is going to be a problem if your batch size is over 10 MB. I do not think you have any visibility into the batch size other than the 413 error. You will have to keep reducing the pipeline.batch.size until the error goes away.
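As a rough way to pick a starting value instead of guessing, you could measure some sample events offline and see how many fit under the limit. The sketch below is only an estimate: the function name, the 10 MB default and the per-event overhead are assumptions for illustration, and the bytes Logstash actually sends may differ from plain JSON serialization.

```python
import json


def estimate_batch_size(sample_events, limit_bytes=10 * 1024 * 1024, action_overhead=100):
    """Estimate how many events fit in one bulk request under limit_bytes.

    sample_events: representative events as dicts (hypothetical samples).
    action_overhead: assumed bytes per event for the bulk action line and newlines.
    """
    if not sample_events:
        raise ValueError("need at least one sample event")
    # Size each event by its UTF-8 JSON length and use the largest one,
    # so the estimate stays on the conservative side.
    largest = max(len(json.dumps(e).encode("utf-8")) for e in sample_events) + action_overhead
    # A single event bigger than the limit still has to go in one request,
    # so never return less than 1.
    return max(1, limit_bytes // largest)


if __name__ == "__main__":
    samples = [{"message": "x" * 900}, {"message": "y" * 4000}]
    print(estimate_batch_size(samples))
```

If the result is well above your current pipeline.batch.size, the oversized requests are more likely caused by a few very large events than by the batch size itself.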

I've fixed the issue: we needed to choose an ES instance size whose max_content_length is large enough for our requests.
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-limits.html

Related

Logstash(7.10) circuit breaking exception

Does anyone know how to solve this circuit breaking exception in Logstash (7.10)?
[2022-09-23T14:38:22,920][INFO ][logstash.outputs.elasticsearch][main][299ec4f1e5994d0fe7b59d4e4d29f50e734f0d6401d909dc198ecbc402ca3983] retrying failed action with response code: 429 ({"type"=>"circuit_breaking_exception", "reason"=>"[parent] Data too large, data for [indices:data/write/bulk[s]] would be [30050216410/27.9gb], which is larger than the limit of [29581587251/27.5gb], real usage: [30050158840/27.9gb], new bytes reserved: [57570/56.2kb], usages [request=0/0b, fielddata=2602183/2.4mb, in_flight_requests=57570/56.2kb, model_inference=0/0b, accounting=1486405036/1.3gb]", "bytes_wanted"=>30050216410, "bytes_limit"=>29581587251, "durability"=>"PERMANENT"})
[2022-09-23T14:38:22,920][INFO ][logstash.outputs.elasticsearch][main][299ec4f1e5994d0fe7b59d4e4d29f50e734f0d6401d909dc198ecbc402ca3983] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>14}
^C
I have tried multiple options, such as:
Increasing the JVM heap to 16g in /etc/logstash/jvm.options, but the issue is still the same.
Restarting Logstash and the Elasticsearch nodes.
Is there a way to discard this 27.9gb of data, or is there a better way to resolve this issue?
Thank you!

Hazelcast causing heavy traffic between nodes

NOTE: I found the root cause in the application code that uses Hazelcast. A piece of code that started executing after 15 minutes retrieved almost the entire data set, so the issue is NOT in Hazelcast. I am leaving the question here in case anyone sees the same side effect or has similar faulty code.
What can cause heavy traffic between 2 Hazelcast nodes (v3.12.12, also tried 4.1.1)?
They hold maps with a lot of data. No entries are added or removed during that time; only map values are updated.
Java 11, Memory usage 1.5GB out of 12GB, no full GCs identified.
According to JFR, the high IO comes from:
com.hazelcast.internal.networking.nio.NioThread.processTaskQueue()
Below is a chart of network IO: 15 minutes after start, traffic jumps from 15 to 60 MB. From the application's perspective, nothing changed after these 15 minutes.

This smells like garbage collection; you are most likely running into long GC pauses. Check your GC logs, which you can enable with verbose GC settings on all members. If there are back-to-back GCs, there are a few things to try:
increase the heap size
tune your GC; I'd look into G1 (with -XX:MaxGCPauseMillis set to a reasonable number) and/or ZGC.

Elasticsearch 429 Too Many Requests _bulk with synchronous requests

I am using the AWS Elasticsearch service. The dev environment runs a t3.small instance.
I have approximately 15,000 records that I want to index in bulk. I split them into chunks of 250 items each (or less than 10 MiB) and run a _bulk request with the refresh="wait_for" option for each chunk, one by one, waiting until each request has finished before sending the next one.
At some point, at approximately the 25th iteration, the request immediately fails with the message
429 Too Many Requests /_bulk
For comparison, if the chunk size is 500, it fails about half as far in (around the 12th request).
It doesn't say anything more than this. I cannot understand why this happens when nothing else is sending bulk requests in parallel with mine. I checked that the data size is less than 10 MB.
What I already have:
I send the requests sequentially, awaiting the previous one
Bulk request size is less than 10 MiB
Each bulk request contains no more than 250 records (+ 250 action lines to indicate that this is indexing)
I am using refresh="wait_for"
And I even have a 2-second delay before sending a new request (which I strongly want to remove)
Adding new instances or increasing storage space doesn't help at all
What could be the reason for this error? How can I be sure that my requests will not fail if I send everything sequentially? Is there any additional option I can pass to make sure that a bulk request has completely finished?

A 429 error message as a write rejection indicates a bulk queue error. The es_rejected_execution_exception[bulk] indicates that your queue is full and that any new requests are rejected. When the number of requests to the Elasticsearch cluster exceeds the bulk queue size (threadpool.bulk.queue_size), this bulk queue error occurs. A bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you are using.
You can consult this link https://aws.amazon.com/premiumsupport/knowledge-center/resolve-429-error-es/ and check the write rejection best practices
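If the rejections do come from a full queue, one common client-side mitigation is to retry a rejected chunk after a growing delay rather than a fixed pause. A minimal sketch, assuming a hypothetical send_bulk callable that raises TooManyRequests when the cluster answers 429 (neither name comes from a real client library):

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical error raised by send_bulk when the cluster answers 429."""


def send_with_backoff(send_bulk, chunk, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Send one chunk, retrying on 429 with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return send_bulk(chunk)
        except TooManyRequests:
            if attempt == max_retries:
                # Out of retries: let the caller decide what to do with the chunk.
                raise
            # Wait base_delay * 1, 2, 4, ... plus a random jitter of up to base_delay.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The sleep parameter is injectable only so the function can be exercised without really waiting; in normal use the default time.sleep applies.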

Elasticsearch cluster size/architecture

I've been trying to set up an Elasticsearch cluster to process log data from some 3D printers.
We have more than 850K documents generated each day for 20 machines, each of which has its own index.
Right now we have 16 months of data, which makes about 410M records to index in each Elasticsearch index.
We process the data from CSV files with Spark and write to an Elasticsearch cluster of 3 machines, each with 16GB of RAM and 16 CPU cores.
But each time we reach about 10-14M docs per index, we get a network error:
Job aborted due to stage failure: Task 173 in stage 9.0 failed 4 times, most recent failure: Lost task 173.3 in stage 9.0 (TID 17160, wn21-xxxxxxx.ax.internal.cloudapp.net, executor 3): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[X.X.X.X:9200]]
I'm sure this is not a network error; Elasticsearch simply cannot handle more indexing requests.
To solve this, I've tried tweaking many Elasticsearch parameters, such as refresh_interval, to speed up indexing and get rid of the error, but nothing worked. After monitoring the cluster, we think we should scale it up.
We also tried tuning the Elasticsearch Spark connector, but with no result.
So I'm looking for the right way to choose the cluster size. Are there any guidelines on how to choose a cluster size? Any pointers would be helpful.
NB: we are mainly interested in indexing, since we only have one or two queries to run on the data to get some metrics.

I would start by trying to split up the indices by month (or even day) and then search across an index pattern. Example: sensor_data_2018.01, sensor_data_2018.02, sensor_data_2018.03 etc. And search with an index pattern of sensor_data_*
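The naming scheme above could be generated like this. It is only a small sketch: the helper names are made up, and the sensor_data prefix is taken from the example.

```python
from datetime import date


def monthly_index(prefix, day):
    """Return the index name for the month that `day` falls in."""
    return f"{prefix}_{day:%Y.%m}"


def index_pattern(prefix):
    """Return the wildcard pattern that matches every monthly index."""
    return f"{prefix}_*"


if __name__ == "__main__":
    print(monthly_index("sensor_data", date(2018, 1, 15)))
    print(index_pattern("sensor_data"))
```

Each document is then written to the index for its own timestamp, and queries go against the pattern.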
Some things which will impact what cluster size you need will be:
How many documents
Average size of each document
How many messages/second are being indexed
Disk IO speed
I think your cluster should be good enough to handle that amount of data. We have a cluster with 3 nodes (8CPU / 61GB RAM each), ~670 indices, ~3 billion documents, ~3TB data and have only had indexing problems when the indexing rate exceeds 30,000 documents/second. Even then only the indexing of a few documents will fail and can be successfully retried after a short delay. Our implementation is also very indexing heavy with minimal actual searching.
I would check the Elasticsearch server logs and see if you can find a more detailed error message. Possibly look for RejectedExecutionExceptions. Also check the cluster health and node stats when you start to receive the failures, which might shed some more light on what's occurring. If possible, implement a retry with backoff when failures start to occur, to give ES time to catch up with the load.
Hope that helps a bit!

This is a network error, saying the data node is ... lost. Maybe it crashed; you can check the Elasticsearch logs to see what's going on.
The most important thing to understand with Elasticsearch for Hadoop is how work is parallelized:
1 Spark partition per Elasticsearch shard
The important thing is sharding; this is how you load-balance the work with Elasticsearch. Also, refresh_interval should be > 30 seconds, and you should disable replication while indexing. This is very basic configuration tuning; I am sure you can find plenty of advice about it in the documentation.
With Spark, you can check in the web UI (port 4040) how the work is split into tasks and partitions; this helps a lot. You can also monitor the network bandwidth between Spark and ES, and the ES node stats.
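The two index settings mentioned above can be changed through the index settings API (PUT /<index>/_settings). Below is a sketch of the request bodies only, assuming a 60s refresh interval and no replicas during the load, then 1s and one replica afterwards. These values and the function name are assumptions; adjust them to your setup.

```python
import json


def bulk_load_settings(refresh_interval="60s", replicas=0):
    """Build a body for PUT /<index>/_settings around a heavy indexing job."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


if __name__ == "__main__":
    # Before the load: refresh rarely, no replicas.
    print(json.dumps(bulk_load_settings()))
    # After the load: restore the assumed normal values.
    print(json.dumps(bulk_load_settings(refresh_interval="1s", replicas=1)))
```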

How to tune "spark.rpc.askTimeout"?

We have a Spark 1.6.1 application which takes input from two Kafka topics and writes the result to another Kafka topic. The application receives some large (approximately 1MB) files in the first input topic and some simple conditions from the second input topic. If the condition is satisfied, the file is written to the output topic; otherwise it is held in state (we use mapWithState).
The logic works fine for a small number (a few hundred) of input files, but fails with org.apache.spark.rpc.RpcTimeoutException, and the recommendation is to increase spark.rpc.askTimeout. After increasing it from the default (120s) to 300s, the job ran fine for longer but crashed with the same error after 1 hour. After changing the value to 500s, the job ran fine for more than 2 hours.
Note: We are running the Spark job in local mode, and Kafka is also running locally on the same machine. Also, sometimes I see the warning "[2016-09-06 17:36:05,491] [WARN] - [org.apache.spark.storage.MemoryStore] - Not enough space to cache rdd_2123_0 in memory! (computed 2.6 GB so far)"
Now, 300s seemed a large enough timeout considering the all-local configuration. Is there any way to arrive at an ideal timeout value instead of just using 500s or higher based on testing? I have seen crashes reported with 800s and suggestions to use 60000s.

I was facing the same problem. I found this page saying that under heavy workloads it is wise to set spark.network.timeout (which controls all the timeouts, including the RPC one) to 800. For now, it has solved my problem.
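For reference, the setting can be passed on the command line with --conf. This small sketch only builds the argument list; my_job.py is a placeholder, the function name is made up, and I am assuming the 800 above is in seconds.

```python
def spark_submit_args(app, network_timeout="800s"):
    """Build a spark-submit command line that sets spark.network.timeout."""
    return [
        "spark-submit",
        "--conf", f"spark.network.timeout={network_timeout}",
        app,  # placeholder application script or jar
    ]


if __name__ == "__main__":
    print(" ".join(spark_submit_args("my_job.py")))
```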
