I would be thankful if an experienced user could name all possible solutions (best practices) for fixing Hector client timeouts like this:
Caused by: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
Caused by: TimedOutException()
at org.apache.cassandra.thrift.Cassandra$multiget_slice_result.read(Cassandra.java:9628)
at org.apache.cassandra.thrift.Cassandra$Client.recv_multiget_slice(Cassandra.java:636)
at org.apache.cassandra.thrift.Cassandra$Client.multiget_slice(Cassandra.java:608)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl$10.execute(KeyspaceServiceImpl.java:388)
... 21 more
HECTOR:
Taken from the Hector documentation: https://github.com/rantav/hector/wiki/User-Guide
I found the following related to timeouts:
1.) cassandraThriftSocketTimeout (see the sketch below)
CASSANDRA:
1.) rpc_timeout_in_ms: 10000 (in cassandra.yaml)
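For context, cassandraThriftSocketTimeout is set on the CassandraHostConfigurator when the Hector cluster is created, roughly like this (host and cluster names are placeholders):
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

// Thrift socket timeout in milliseconds, applied to every pooled connection
CassandraHostConfigurator hostConfig = new CassandraHostConfigurator("cassandra-host:9160");
hostConfig.setCassandraThriftSocketTimeout(30000);
Cluster cluster = HFactory.getOrCreateCluster("MyCluster", hostConfig);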
What other settings related to timeouts are available, both on the Hector and on the Cassandra side? I have time, so I simply want to wait longer, but I have not found the settings that would let me do that.
Thanks
Markus
From the cassandra.thrift API in the Apache Cassandra source tree regarding TimeoutException:
"RPC timeout was exceeded. either a node failed mid-operation, or load was too high, or the requested op was too large."
In short, you were asking for too much data. What sort of query were you sending? Can you post a code snippet of it?
Does anyone know how to solve this circuit-breaking exception in Logstash (7.10)?
[2022-09-23T14:38:22,920][INFO ][logstash.outputs.elasticsearch][main][299ec4f1e5994d0fe7b59d4e4d29f50e734f0d6401d909dc198ecbc402ca3983] retrying failed action with response code: 429 ({"type"=>"circuit_breaking_exception", "reason"=>"[parent] Data too large, data for [indices:data/write/bulk[s]] would be [30050216410/27.9gb], which is larger than the limit of [29581587251/27.5gb], real usage: [30050158840/27.9gb], new bytes reserved: [57570/56.2kb], usages [request=0/0b, fielddata=2602183/2.4mb, in_flight_requests=57570/56.2kb, model_inference=0/0b, accounting=1486405036/1.3gb]", "bytes_wanted"=>30050216410, "bytes_limit"=>29581587251, "durability"=>"PERMANENT"})
[2022-09-23T14:38:22,920][INFO ][logstash.outputs.elasticsearch][main][299ec4f1e5994d0fe7b59d4e4d29f50e734f0d6401d909dc198ecbc402ca3983] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>14}
I have tried multiple options, such as:
Increasing the JVM heap to 16g under /etc/logstash/jvm.options (roughly as shown below), but the issue persists.
Restarting the Logstash and Elasticsearch nodes.
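For reference, the heap increase in /etc/logstash/jvm.options amounts to lines like these (16g matches what I set; adjust to your memory):
# /etc/logstash/jvm.options -- initial and maximum JVM heap size
-Xms16g
-Xmx16g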
Is there a way I can discard this 27.9gb of data, or is there any better way to resolve this issue?
Thank you!
We are using Cassandra 1.2.9 + BAM 2.5 for API analysis.
We have scheduled a job to do Cassandra data purging. This purge job is divided into three steps.
The 1st step is to query the rows from the original column family and insert them into the temporary columnFamily_purge.
The 2nd step is to delete from the original column family by adding tombstones, and then insert the data from columnFamily_purge back into the original column family (a rough sketch of the deletion is shown after this list).
The 3rd step is to drop the temporary columnFamily_purge.
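For illustration only, the per-row deletion in step 2 boils down to a Hector mutation roughly like this (keyspace, column family, and key names here are placeholders; the real code runs inside the Hadoop mapper shown in the stack trace below):
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

Cluster cluster = HFactory.getOrCreateCluster("BamCluster", "cassandra-host:9160");
Keyspace keyspace = HFactory.createKeyspace("EVENT_KS", cluster);
Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
// addDeletion writes a row-level tombstone for the given key in the original column family
mutator.addDeletion("someRowKey", "originalColumnFamily");
mutator.execute();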
The 1st step works well, but the 2nd step frequently crashes the Cassandra servers during the Hadoop map tasks, which makes Cassandra unavailable. The exception stack trace is as follows:
2016-08-23 10:27:43,718 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hadoop for UID 47338 from the native implementation
2016-08-23 10:27:43,720 WARN org.apache.hadoop.mapred.Child: Error running child
me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:390)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:244)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.deleteRow(AbstractColumnFamilyTemplate.java:173)
at org.wso2.carbon.bam.cassandra.data.archive.mapred.CassandraMapReduceRowDeletion$RowKeyMapper.map(CassandraMapReduceRowDeletion.java:246)
at org.wso2.carbon.bam.cassandra.data.archive.mapred.CassandraMapReduceRowDeletion$RowKeyMapper.map(CassandraMapReduceRowDeletion.java:139)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Could someone help with what may lead to this problem? Thanks!
This can happen due to 3 reasons:
1) The Cassandra servers are down. I don't think this is the case in your setup.
2) Network issues.
3) The load is higher than what the cluster can handle.
How do you delete the data? Using a Hive script?
After I increased the number of open files and the max thread number, the problem was gone.
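For anyone hitting the same thing, the OS-level change amounts to something like this in /etc/security/limits.conf (user name and values are just an example; adjust for your installation):
# /etc/security/limits.conf -- raise file-handle and thread limits for the Cassandra user
cassandra  -  nofile  100000
cassandra  -  nproc   32768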
In Cassandra 2.2.4 (cql 5.0.1):
I got an error OperationTimedOut: errors={}, last_host=127.0.0.1.
How can I edit ~/.cassandra/cqlshrc to set the value client_timeout = 20?
Thanks
You will probably need to increase the timeouts in your cassandra.yaml (sketched below), or allowing the client to wait longer won't make a difference. The default read timeout is 5s and the default write timeout is 2s. The cqlsh default of 10s already exceeds those, so your timeout is probably coming from the coordinator (unless the coordinator is hosed). That said, just add
[connection]
client_timeout = 20
in your ~/.cassandra/cqlshrc file.
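On the server side, the cassandra.yaml timeouts mentioned above look roughly like this (20 seconds is just an example value):
# cassandra.yaml -- coordinator-side timeouts (defaults are 5000 and 2000 ms)
read_request_timeout_in_ms: 20000
write_request_timeout_in_ms: 20000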
You should probably try to address the slowness of your request as well; either your C* node is under distress or your query has issues. A timeout shouldn't happen in normal use.
I'm using the DataStax Cassandra 2.1 driver and performing read/write operations at a rate of ~8000 IOPS. I've used pooling options to configure my session, and I am using separate sessions for reads and writes, each of which connects to a different node in the cluster as its contact point.
This works fine for, say, 5 minutes, but after that I get a lot of exceptions like:
Failed with: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.1.123:9042 (com.datastax.driver.core.TransportException: [/10.0.1.123:9042] Connection has been closed), /10.0.1.56:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
Can anyone help me out here with what the problem could be?
The exception asks me to increase the number of connections per host, but how high a value can I set for this parameter?
Also, I'm not able to set CoreConnectionsPerHost beyond 2, as it throws an exception saying 2 is the max.
This is how I'm creating each read/write session:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.ConstantReconnectionPolicy;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

// Pool sizing, all configured for hosts at REMOTE distance
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 200);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);

cluster = Cluster
    .builder()
    .withPoolingOptions(poolingOpts)
    .addContactPoint(ip)
    .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
    // reconnect to a downed host every 100 ms
    .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
    .build();
Session s = cluster.connect(keySpace);
Your problem might not actually be in your code or the way you are connecting. If the problem is happening after a few minutes, it could simply be that your cluster is becoming overloaded trying to process the ingestion of data and cannot keep up. The typical sign of this is when you start seeing JVM garbage collection "GC" messages in the Cassandra system.log file; too many small ones close together, or large ones on their own, can mean that incoming clients are not being responded to, causing this kind of scenario. Verify that you do not have too many of these events showing up in your logs before you start to look at your code. Here's a good example of a large GC event:
INFO [ScheduledTasks:1] 2014-05-15 23:19:49,678 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2896 ms for 2 collections, 310563800 used; max is 8375238656
When connecting to a cluster there are some recommendations, one of which is to have only one Cluster object per real cluster. As per the article I've linked below (apologies if you have already studied this):
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a prepared statement (see the sketch after the link below)
You can reduce the number of network roundtrips and also have atomic operations by using batches
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
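For instance, a prepared statement reused across many writes could look roughly like this (keyspace, table, and values are made up):
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;

// Prepare once, then bind and execute many times
PreparedStatement ps = session.prepare(
    "INSERT INTO my_keyspace.events (id, payload) VALUES (?, ?)");
BoundStatement bound = ps.bind(someId, somePayload);  // someId/somePayload are placeholders
session.execute(bound);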
As you are doing a high number of reads, I'd definitely recommend looking at setFetchSize as well, if it's applicable to your code (example after the links below):
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/cqlStatements.html
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/queryBuilderOverview.html
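A rough sketch of using setFetchSize to page through a large read with the 2.1 driver (table name and page size are illustrative):
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Pull 500 rows per page instead of the whole result set at once
Statement stmt = new SimpleStatement("SELECT * FROM my_keyspace.events").setFetchSize(500);
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    // the driver fetches the next page transparently as you iterate
    handleRow(row);  // handleRow is a placeholder for your own processing
}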
For reference, here are the connection options in case you find them useful:
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/connectionsOptions_c.html
Hope this helps.
I have 5 nodes in my ring with SimpleStrategy and replication_factor=3. I inserted 1M rows using the stress tool. When I try to read the row count in cqlsh using
SELECT count(*) FROM Keyspace1.Standard1 limit 1000000;
It fails with error:
Request did not complete within rpc_timeout.
It works for limit 100000, but fails even for 500000.
All my nodes are up. Do I need to increase the rpc_timeout?
Please help.
You get this error because the request is timing out on the server side. One should know that this is a very expensive operation in Cassandra as others have pointed out.
Still, if you really want to do this you should update your /etc/cassandra/cassandra.yaml file and change the range_request_timeout_in_ms parameter. This will be valid for all your range queries.
Example to set a 40 second timeout:
range_request_timeout_in_ms: 40000
You will probably have to adjust the client side as well. When using cqlsh as a client, this is accomplished by creating/updating the cqlsh configuration file under ~/.cassandra/cqlshrc and adding the client_timeout parameter to the connection section.
Example to set a 40 second timeout:
[connection]
client_timeout=40
It takes a long time to read in 1M rows, so that is probably why it is timing out. You shouldn't use count(*) like this; it is very expensive since it has to read all the data. Use Cassandra counters if you need to count lots of items (a minimal sketch is below).
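A minimal sketch of the counter approach (keyspace, table, and column names are made up):
-- Maintain a running count instead of scanning the whole table
CREATE TABLE keyspace1.row_counts (
    table_name text PRIMARY KEY,
    row_count counter
);

-- Increment on every insert into the data table, then read it back cheaply
UPDATE keyspace1.row_counts SET row_count = row_count + 1 WHERE table_name = 'standard1';
SELECT row_count FROM keyspace1.row_counts WHERE table_name = 'standard1';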
You should also check your Cassandra logs to confirm there aren't any other issues - sometimes exceptions in Cassandra lead to timeouts on the client.
If you can live with an approximate row count, take a look at this answer to Row count of a column family in Cassandra.