Issue with dsbulk unload - Cassandra

I am getting the messages below while unloading using dsbulk, and I am not able to figure out what they mean:
[s0|347101951|0] Error sending cancel request. This is not critical (the request will eventually time out server-side). (HeartbeatException: null)
Not sending heartbeat because a previous one is still in progress. Check that advanced.heartbeat.interval is not lower than advanced.heartbeat.timeout.
Thanks

"Error sending cancel request" is typical of continuous paging queries. It seems the coordinator is in trouble for some reason, which is why you are also seeing heartbeat failures. Dsbulk may be putting too much load on the cluster.
You didn't mention which version of dsbulk exactly, but assuming 1.4+ I would recommend trying the following actions (individually or combined):
Disable continuous paging with dsbulk.executor.continuousPaging.enabled = false (this is likely to slow down dsbulk).
Use smaller page sizes, e.g. 1000 rows:
If not using continuous paging: datastax-java-driver.basic.request.page-size = 1000.
If using continuous paging: datastax-java-driver.advanced.continuous-paging.page-size = 1000.
Throttle DSBulk to reduce the load on the cluster:
Either "soft" throttle by limiting the number of concurrent requests, e.g. 128:
DSBulk < 1.6: dsbulk.executor.maxInFlight = 128.
DSBulk >= 1.6: dsbulk.engine.maxConcurrentQueries = 128.
Or "hard" throttle by limiting the number of requests per second, e.g. 500: dsbulk.executor.maxPerSecond = 500.

Related

Why does DynamoDB performance decrease with parallel reads?

With AWS X-Ray tracing enabled on my Lambda function, I've found that as the number of parallel requests to DynamoDB increases, read performance decreases.
Here is an example of the X-Ray traces:
Above you can see that the first set of GetItem requests executes in under 300 ms. This set only has 6 async read requests running in parallel. The next set of read requests all execute in, on average, at least 1.5 seconds, with 57 async read requests running in parallel.
Thoughts on what this could be due to:
This may be due to a "cold start" effect as DynamoDB adds capacity to deal with the parallel reads? (This DynamoDB table is pay-per-request, not provisioned.)
Additionally, I recognize that this may not be related to parallel requests at all, but it seems like a good place to start asking questions. Does anyone know what could be causing such a dramatic performance decrease?

Elasticsearch could not write all entries: Maybe ES was overloaded

I have an application where I read CSV files, do some transformations, and then push them to Elasticsearch from Spark itself, like this:
input.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + type)
  .save()
I have several nodes, and on each node I run 5-6 spark-submit commands that push to Elasticsearch.
I am frequently getting errors like:
Could not write all entries [13/128] (Maybe ES was overloaded?). Error sample (first [5] error messages):
rejected execution of org.elasticsearch.transport.TransportService$7#32e6f8f8 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#4448a084[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 451515]]
My Elasticsearch cluster has the following specs:
Nodes: 9 (1 TB disk space and at least 15 GB of RAM each, more than 8 cores per node)
I have modified the following parameters for Elasticsearch:
spark.es.batch.size.bytes=5000000
spark.es.batch.size.entries=5000
spark.es.batch.write.refresh=false
Could anyone suggest what I can fix to get rid of these errors?
This occurs because bulk requests are coming in at a rate greater than the Elasticsearch cluster can process, so the bulk request queue fills up. The default bulk queue size is 200.
Ideally, you should handle this on the client side:
1) Reduce the number of spark-submit commands running concurrently.
2) Retry in case of rejections by tweaking es.batch.write.retry.count and es.batch.write.retry.wait.
Example:
es.batch.write.retry.wait = "60s"
es.batch.write.retry.count = 6
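If you are passing elasticsearch-hadoop settings through the Spark configuration, as your spark.es.batch.* settings above suggest, the retry settings take the same spark. prefix. A sketch, with values you would tune to your workload:
spark.es.batch.write.retry.count=6
spark.es.batch.write.retry.wait=60s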
On the Elasticsearch cluster side:
1) Check whether there are too many shards per index and try reducing the count.
This blog has a good discussion on criteria for tuning the number of shards.
2) As a last resort, increase the bulk thread pool queue size (thread_pool.bulk.queue_size).
Check this blog with an extensive discussion on bulk rejections.
The bulk queue in your ES cluster is hitting its capacity (200). Try increasing it. See this page for how to change the bulk queue capacity:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
Also check this other SO answer, where the OP had a very similar issue that was fixed by increasing the bulk pool size:
Rejected Execution of org.elasticsearch.transport.TransportService Error
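As a minimal sketch of raising the bulk queue capacity, assuming an Elasticsearch 5.x cluster (where the rejected pool in your error is named bulk): the queue size is a static node-level setting, so it goes into elasticsearch.yml on each data node and requires a restart; the value 500 here is only illustrative.
# elasticsearch.yml on each data node (restart required)
thread_pool.bulk.queue_size: 500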

com.datastax.driver.core.exceptions.BusyPoolException

Whenever I insert more than 1000 rows of data into a table in Cassandra and then fetch the data by id, it throws the following exception:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [localhost/127.0.0.1] Pool is busy (no available connection and the queue has reached its max size 256)))
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:213)
at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:49)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:277)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution$1.onFailure(RequestHandler.java:340)
at com.google.common.util.concurrent.Futures$6.run(Futures.java:1764)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
at com.google.common.util.concurrent.Futures$ImmediateFuture.addListener(Futures.java:153)
at com.google.common.util.concurrent.Futures.addCallback(Futures.java:1776)
at com.google.common.util.concurrent.Futures.addCallback(Futures.java:1713)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.query(RequestHandler.java:299)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:274)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:117)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:97)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:132)
at com.outworkers.phantom.builder.query.CassandraOperations$class.scalaQueryStringToPromise(CassandraOperations.scala:67)
at com.outworkers.phantom.builder.query.InsertQuery.scalaQueryStringToPromise(InsertQuery.scala:31)
at com.outworkers.phantom.builder.query.CassandraOperations$class.scalaQueryStringExecuteToFuture(CassandraOperations.scala:31)
at com.outworkers.phantom.builder.query.InsertQuery.scalaQueryStringExecuteToFuture(InsertQuery.scala:31)
at com.outworkers.phantom.builder.query.ExecutableStatement$class.future(ExecutableQuery.scala:80)
at com.outworkers.phantom.builder.query.InsertQuery.future(InsertQuery.scala:31)
at nd.cluster.data.store.Points.upsert(Models.scala:114)
I have solved the above issue using PoolingOptions:
val poolingOptions = new PoolingOptions()
  .setConnectionsPerHost(HostDistance.LOCAL, 1, 200)
  .setMaxRequestsPerConnection(HostDistance.LOCAL, 256)
  .setNewConnectionThreshold(HostDistance.LOCAL, 100)
  .setCoreConnectionsPerHost(HostDistance.LOCAL, 200)
val builder1 = ContactPoint.local
  .noHeartbeat()
  .withClusterBuilder(_.withoutJMXReporting()
    .withoutMetrics()
    .withPoolingOptions(poolingOptions))
  .keySpace("nd")
Now it is working even with 1L (100,000) records, but I am not sure about its efficiency.
Could anyone please help me?
This means that you are submitting too many requests and not waiting for the futures to complete before submitting more.
The default maximum number of requests per connection is 1024. If this number is exceeded for all connections, the connection pool will enqueue some requests, up to 256. If the queue gets full, a BusyPoolException is thrown. You can of course increase the maximum number of requests per connection and the maximum number of connections per host, but the real solution is to throttle your application. You could, for example, submit your requests in batches of 1,000 and then wait for the futures to complete before submitting more, or use a semaphore to regulate the total number of pending requests and make sure it never exceeds a certain number (in theory, this number must stay below num_hosts * max_connections_per_host * max_requests_per_connection; in practice, I don't suggest going above 1,000 as it probably won't bring you more throughput).
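A minimal sketch of the semaphore approach, written directly against the Java driver 3.x API and Guava (which the driver already depends on); the class name and the 1000 limit are illustrative, and the same idea applies behind phantom's futures:
import java.util.concurrent.Semaphore;
import com.datastax.driver.core.*;
import com.google.common.util.concurrent.*;

public class ThrottledWriter {
    // Keep total in-flight requests well below
    // num_hosts * max_connections_per_host * max_requests_per_connection.
    private static final int MAX_IN_FLIGHT = 1000;
    private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
    private final Session session;

    public ThrottledWriter(Session session) {
        this.session = session;
    }

    public void executeThrottled(Statement stmt) throws InterruptedException {
        permits.acquire();  // blocks the caller once MAX_IN_FLIGHT requests are pending
        ResultSetFuture future = session.executeAsync(stmt);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override public void onSuccess(ResultSet rs) { permits.release(); }
            @Override public void onFailure(Throwable t) { permits.release(); }
        }, MoreExecutors.directExecutor());
    }
}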
You may find these links useful:
https://github.com/redisson/redisson/issues/438
https://groups.google.com/a/lists.datastax.com/forum/#!topic/java-driver-user/p3CwOL0kNrs
http://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling

High Native-Transport-Requests All time Blocked

After running nodetool tpstats on all nodes, I see a lot of nodes with a high number of ALL TIME BLOCKED Native-Transport-Requests (NTR). We have a 4-node cluster, and the values for NTR ALL TIME BLOCKED are:
NODE 1: 23953
NODE 2: 2935
NODE 3: 15229
NODE 4: 5951
I know ALL TIME BLOCKED is bad, and hence I am worried about what I am doing wrong.
This pool handles CQL requests, so the limit is on the number of active CQL requests allowed. It is capped to prevent too many active requests from OOMing your system (e.g. each returning large blobs). This effectively applies backpressure to your client application, telling it to slow down. Unfortunately, if you have small requests this isn't ideal and hurts your throughput, so in CASSANDRA-11363 a setting was added to make the space tradeoff for small, bursty workloads.
If you upgrade to 2.2.8+, you can set the max queue size of that thread pool with -Dcassandra.max_queued_native_transport_requests=4096.
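For example, on a typical package install this flag would be added to cassandra-env.sh (a sketch; the value 4096 is just the example above, and the node needs a restart for it to take effect):
JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=4096"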

Cassandra throwing NoHostAvailableException after 5 minutes of high IOPS run

I'm using the DataStax Cassandra 2.1 driver and performing read/write operations at a rate of ~8000 IOPS. I've used pooling options to configure my sessions, and I'm using separate sessions for reads and writes, each of which connects to a different node in the cluster as its contact point.
This works fine for, say, 5 minutes, but after that I get a lot of exceptions like:
Failed with: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.1.123:9042 (com.datastax.driver.core.TransportException: [/10.0.1.123:9042] Connection has been closed), /10.0.1.56:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
Can anyone help me figure out what the problem could be?
The exception asks me to increase the number of connections per host, but how high a value can I set for this parameter?
Also, I'm not able to set CoreConnectionsPerHost beyond 2, as it throws an exception saying 2 is the max.
This is how I'm creating each read/write session:
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 200);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
cluster = Cluster
  .builder()
  .withPoolingOptions(poolingOpts)
  .addContactPoint(ip)
  .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
  .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
  .build();
Session s = cluster.connect(keySpace);
Your problem might not actually be in your code or the way you are connecting. If the problem is happening after a few minutes, it could simply be that your cluster is becoming overloaded trying to process the ingested data and cannot keep up. The typical sign of this is when you start seeing JVM garbage collection "GC" messages in the Cassandra system.log file; too many small ones close together, or large ones on their own, can mean that incoming clients are not being responded to, causing this kind of scenario. Verify that you do not have too many of these events showing up in your logs before you start to look at your code. Here's a good example of a large GC event:
INFO [ScheduledTasks:1] 2014-05-15 23:19:49,678 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2896 ms for 2 collections, 310563800 used; max is 8375238656
When connecting to a cluster there are some recommendations, one of which is to have only one Cluster object per physical cluster, as per the article I've linked below (apologies if you have already studied this):
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a prepared statement
You can reduce the number of network roundtrips and also have atomic operations by using batches
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
As you are doing a high number of reads, I'd most definitely recommend using setFetchSize too, if it's applicable to your code; see the sketch after the links below.
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/cqlStatements.html
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/queryBuilderOverview.html
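A rough sketch combining the prepared-statement and setFetchSize suggestions, reusing the Session s from your snippet; the keyspace, table, column and someId value are made up for illustration:
// classes are from com.datastax.driver.core, as in the snippet above
PreparedStatement ps = s.prepare("SELECT payload FROM my_keyspace.my_table WHERE id = ?");
BoundStatement bound = ps.bind(someId);
bound.setFetchSize(500);   // page results 500 rows at a time instead of the default 5000
ResultSet rs = s.execute(bound);
for (Row row : rs) {       // the driver fetches further pages transparently as you iterate
    // process each row here
}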
For reference, here are the connection options in case you find them useful:
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/connectionsOptions_c.html
Hope this helps.
