Logstash(7.10) circuit breaking exception - logstash

Do anyone know how to solve this circuit breaking exception in logstash (7.10).
[2022-09-23T14:38:22,920][INFO ][logstash.outputs.elasticsearch][main][299ec4f1e5994d0fe7b59d4e4d29f50e734f0d6401d909dc198ecbc402ca3983] retrying failed action with response code: 429 ({"type"=>"circuit_breaking_exception", "reason"=>"[parent] Data too large, data for [indices:data/write/bulk[s]] would be [30050216410/27.9gb], which is larger than the limit of [29581587251/27.5gb], real usage: [30050158840/27.9gb], new bytes reserved: [57570/56.2kb], usages [request=0/0b, fielddata=2602183/2.4mb, in_flight_requests=57570/56.2kb, model_inference=0/0b, accounting=1486405036/1.3gb]", "bytes_wanted"=>30050216410, "bytes_limit"=>29581587251, "durability"=>"PERMANENT"})
[2022-09-23T14:38:22,920][INFO ][logstash.outputs.elasticsearch][main][299ec4f1e5994d0fe7b59d4e4d29f50e734f0d6401d909dc198ecbc402ca3983] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>14}
^C
I have tried multiple options like
To increase the jvm to 16g under /etc/logstash/jvm.options but still the same issue.
Restart logstash and Elastic nodes
Is there a way where I can discard this 27.9gb data or any other better way to resolve this issue.
Thank you!

Related

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repair on 18 nodes of Cassandra cluster, I ran a full repair of each node to fix the failed repair issue, after the full repair, Reaper executed successfully, but after a few days again the Reaper failed to run, I can see the following error in system.log
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
in nodetool tpstats I can see some pending tasks
Pool Name Active Pending
ReadStage 0 0
Repair#18 3 90
ValidationExecutor 3 3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is why even after a full repair, reaper is still failing? and what is the root cause of pending repair?
PS: version of Reaper is 2.2.3, not sure if it is a bug in Reaper!
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rule:
If you have less than 20% of segments failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml.
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
Assuming you're not running Cassandra 4.0 yet, now that you ran repair through nodetool, you have sstables which are marked as repaired like incremental repair would. This will create a problem as Reaper's repairs don't mark sstables as repaired and you now have two different sstables pools (repaired and unrepaired), which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
There could be a number of things taking place such as Reaper can't connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

Encountered a retryable error. Will Retry with exponential backoff 413

Logstash keep encountering following error message that logs cannot be sent to AWS ElasticSearch.
[2021-04-28T16:01:28,253][ERROR][logstash.outputs.amazonelasticsearch]
Encountered a retryable error. Will Retry with exponential backoff
{:code=>413,
:url=>"https://search-xxxx.ap-southeast-1.es.amazonaws.com:443/_bulk"}
That's why I always need to restart logstash and cannot configure why it causes that issue. Regarding Logstash documentation I reduce pipeline.batch.size size to 100 but it didn't help. Please let me know how to resolve that issue. Thanks.
pipeline.batch.size: 125
pipeline.batch.delay: 50
A 413 response is "payload too large". It does not make much sense to retry this, since it will probably recur forever and shut down the flow of events through the pipeline. If there is a proxy or load balancer between logstash and elasticsearch then it is possible that that is returning the error, not elasticsearch, in which case you may need to reconfigure the proxy.
The maximum payload size accepted by amazonelasticsearch will depend on what type of instance you are running on (scroll down to Network Limits). For some instance types it is 10 MB.
In logstash, a batch of events is divided into 20 MB chunks as it is sent to elasticsearch. The 20 MB limit cannot be adjusted (unless you want to edit the source and build your own plugin). However, if there is a single large event it has to be sent in one request, so it is still possible for a request larger than that to be sent.
Since 20 MB is bigger than 10 MB this is going to be a problem if your batch size is over 10 MB. I do not think you have any visibility into the batch size other than the 413 error. You will have to keep reducing the pipeline.batch.size until the error goes away.
I've fixed issue that we need to adjust as to choose correct ES instance size based on max_content_length.
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-limits.html

Elasticsearch cluster size/architecture

I've been trying to setup an elasticsearch cluster for processing some log data from some 3D printers .
we are having more than 850K documents generated each day for 20 machines . each of them has it own index .
Right now we have the data of 16 months with make it about 410M records to index in each of the elasticsearch index .
we are processing the data from CSV files with spark and writing to an elasticsearch cluster with 3 machines each one of them has 16GB of RAM and 16 CPU cores .
but each time we reach about 10-14M doc/index we are getting a network error .
Job aborted due to stage failure: Task 173 in stage 9.0 failed 4 times, most recent failure: Lost task 173.3 in stage 9.0 (TID 17160, wn21-xxxxxxx.ax.internal.cloudapp.net, executor 3): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[X.X.X.X:9200]]
I'm sure this is not a network error it's just elasticsearch cannot handle more indexing requests .
To solve this , I've tried to tweak many elasticsearch parameters such as : refresh_interval to speed up the indexation and get rid of the error but nothing worked . after monitoring the cluster we think that we should scale it up.
we also tried to tune the elasticsearch spark connector but with no result .
So I'm looking for a right way to choose the cluster size ? is there any guidelines on how to choose your cluster size ? any highlights will be helpful .
NB : we are interested mainly in indexing data since we have only one query or two to run on data to get some metrics .
I would start by trying to split up the indices by month (or even day) and then search across an index pattern. Example: sensor_data_2018.01, sensor_data_2018.02, sensor_data_2018.03 etc. And search with an index pattern of sensor_data_*
Some things which will impact what cluster size you need will be:
How many documents
Average size of each document
How many messages/second are being indexed
Disk IO speed
I think your cluster should be good enough to handle that amount of data. We have a cluster with 3 nodes (8CPU / 61GB RAM each), ~670 indices, ~3 billion documents, ~3TB data and have only had indexing problems when the indexing rate exceeds 30,000 documents/second. Even then only the indexing of a few documents will fail and can be successfully retried after a short delay. Our implementation is also very indexing heavy with minimal actual searching.
I would check the elastic search server logs and see if you can find a more detailed error message. Possible look for RejectedExecutionException's. Also check the cluster health and node stats when you start to receive the failures which might shed some more light on whats occurring. If possible implement a re-try and backoff when failures start to occur to give ES time to catch up to the load.
Hope that helps a bit!
This is a network error, saying the data node is ... lost. Maybe a crash, you can check the elasticsearch logs to see whats going on.
The most important thing to understand with elasticsearch4Hadoop is how work is parallelized:
1 Spark partition by 1 elasticsearch shard
The important thing is sharding, this is how you load-balance the work with elasticsearch. Also, refresh_interval must be > 30 secondes, and, you should disable replication when indexing, this is very basic configuration tuning, I am sure you can find many advises about that on documentation.
With Spark, you can check on web UI (port 4040) how the work is split into tasks and partitions, this help a lot. Also, you can monitor the network bandwidth between Spark and ES, and es node stats.

Cassandra throwing NoHostAvailableException after 5 minutes of high IOPS run

I'm using datastax cassandra 2.1 driver and performing read/write operations at the rate of ~8000 IOPS. I've used pooling options to configure my session and am using separate session for read and write each of which connect to a different node in the cluster as contact point.
This works fine for say 5 mins but after that I get a lot of exceptions like :
Failed with: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.1.123:9042 (com.datastax.driver.core.TransportException: [/10.0.1.123:9042] Connection has been closed), /10.0.1.56:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
Can anyone help me out here on what could be the problem?
The exception asks me to increase number of connections per host but how high a value can I set for this parameter ?
Also I'm not able to set CoreConnectionsPerHost beyond 2 as it throws me exception saying 2 is the max.
This is how I'm creating each read / write session.
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 200);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
cluster = Cluster
.builder()
.withPoolingOptions( poolingOpts )
.addContactPoint(ip)
.withRetryPolicy( DowngradingConsistencyRetryPolicy.INSTANCE )
.withReconnectionPolicy( new ConstantReconnectionPolicy( 100L ) ).build();
Session s = cluster.connect(keySpace);
Your problem might not actually be in your code or the way you are connecting. If you say the problem is happening after a few minutes then it could simply be that your cluster is becoming overloaded trying to process the ingestion of data and cannot keep up. The typical sign of this is when you start seeing JVM garbage collection "GC" messages in the cassandra system.log file, too many small ones batched together of large ones on their own can mean that incoming clients are not responded to causing this kind of scenario. Verify that you do not have too many of these event showing up in your logs first before you start to look at your code. Here's a good example of a large GC event:
INFO [ScheduledTasks:1] 2014-05-15 23:19:49,678 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2896 ms for 2 collections, 310563800 used; max is 8375238656
When connecting to a cluster there are some recommendations, one of which is only have one Cluster object per real cluster. As per the article I've linked below (apologies if you already studied this):
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a prepared statement
You can reduce the number of network roundtrips and also have atomic operations by using batches
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
As you are doing a high number of reads I'd most definitely recommend using setFetchSize also if its applicable to your code
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/cqlStatements.html
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/queryBuilderOverview.html
For reference heres the connection options in case you find it useful
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/connectionsOptions_c.html
Hope this helps.

Cassandra resets connection under heavy insert workload

I have a cassandra cluster of 8 nodes, cassandra 1.0.8.
I'm trying to do lots of small inserts in a loop with batch_mutate(). After some time (~200K inserts) server resets connection with following exception:
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:157)
at org.apache.cassandra.thrift.Cassandra$Client.send_batch_mutate(Cassandra.java:998)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:986)
at org.scale7.cassandra.pelops.Mutator$1.execute(Mutator.java:46)
at org.scale7.cassandra.pelops.Mutator$1.execute(Mutator.java:42)
at org.scale7.cassandra.pelops.Operand.tryOperation(Operand.java:56)
at org.scale7.cassandra.pelops.Mutator.execute(Mutator.java:51)
...
Caused by: java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
... 25 more
Otherwise than it, the cluster works fine. Server logs are clean.
What might be causing this problem?
Thanks!
I've found the cause: batch mutation size was exceeding the frame size of TFramedTransport.
There might two possible solutions: either to increase "thrift_framed_transport_size_in_mb" configuration property in cassandra.yaml, or to use smaller batch size.

Resources