HBase multithreaded client performance

We use HBase 1.2.4 in pseudo-distributed mode with the Java API to read data. Our client makes 20k requests per second to HBase, but it loads only 30% of the CPU and the computation takes about 5 hours. I tried splitting the data and running 4 clients on the same machine in separate JVMs, and got 80k requests per second with a computation time of approximately one hour. This is not a satisfying solution. Profiling showed frequent blocking of connection threads.
I have also tried the IPC pool options of the HBase client, but they did not improve performance much.
If anyone has had a similar problem, please give me some advice.

Setting the connection pool size much larger than the number of reading threads did the trick.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

Configuration conf = HBaseConfiguration.create();
// Size the client IPC connection pool well above the number of reader threads
conf.set(HConstants.HBASE_CLIENT_IPC_POOL_SIZE, "128");
conf.set(HConstants.HBASE_CLIENT_IPC_POOL_TYPE, "RoundRobin");
Connection conn = ConnectionFactory.createConnection(conf);
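For context, a sketch of how the reading threads can then share the single Connection created above (the table name, row key and thread count are made up for illustration):
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

ExecutorService readers = Executors.newFixedThreadPool(32);
for (int i = 0; i < 32; i++) {
    readers.submit(() -> {
        // Table instances are lightweight and not thread-safe, so each task gets its own
        // from the shared Connection; the enlarged IPC pool lets their RPCs fan out over
        // many sockets instead of serializing on one.
        try (Table table = conn.getTable(TableName.valueOf("my_table"))) {
            Result r = table.get(new Get(Bytes.toBytes("some-row-key")));
            // process r ...
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}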

Related

Why does DynamoDB performance decrease with parallel reads?

With AWS X-Ray tracing enabled on my Lambda function, I've found that as the number of parallel requests to DynamoDB increases, read performance decreases.
Here is an example from the X-Ray traces: the first set of GetItem requests executes in under 300 ms, with only 6 async read requests running in parallel. The next set of read requests takes at least 1.5 seconds on average, with 57 async read requests running in parallel.
Thoughts on what this could be due to:
this may be due to a "cold start" effect as DynamoDB adds capacity to deal with the parallel reads (this DynamoDB table is pay-per-request, not provisioned)
Additionally, I recognize that this may not be related to parallel requests at all, but it seemed like a good place to start asking questions. Does anyone know what could be causing such a dramatic performance decrease?

dask processes tasks twice

I noticed that tasks of a Dask graph can be executed several times by different workers.
I also see this log message in the scheduler console (I don't know if it is related to resilience):
"WARNING - Lost connection to ... while sending result: Stream is
closed"
Is there a way to prevent Dask from executing the same task twice on different workers?
Note that I'm using:
dask 0.15.0
distributed 1.15.1
Thx
Bertrand
The short answer is "no".
Dask reserves the right to call your function many times. This might occur if a worker goes down, or if Dask does some load balancing and moves tasks around the cluster just as they have started running.
However you can significantly reduce the likelihood of a task running multiple times by turning off work stealing:
def turn_off_stealing(dask_scheduler):
    dask_scheduler.extensions['stealing']._pc.stop()

client.run_on_scheduler(turn_off_stealing)  # the function receives the scheduler as dask_scheduler
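Note that this also disables the scheduler's work-stealing load balancing, so unevenly sized tasks may stay queued on busy workers instead of being redistributed.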

Single multithreaded Java client with DataStax Java Driver for Apache Cassandra not utilizing system resources

I'd appreciate any guidance on the optimal setup for a multi-threaded, high-throughput, low-latency Java client using the DataStax Java Driver for Apache Cassandra. I appreciate that 'roll-your-own' benchmarking is not recommended, but this task is also aimed at a proof of concept for a real-world application that needs to achieve high TPS.
Setup:
Client side: Java 8 client, configurable number of multi-threaded executor threads (facilitated by the LMAX Disruptor), cassandra-driver-core-3.0.0.jar, running on Red Hat 6.3, 24-core DL360 machine
Server side: 3-node Cassandra cluster (apache-cassandra-2.2.4 on Red Hat 6 with Java 8), replication factor = 3, running on Red Hat 6.3, 24-core DL360 machines
Testing
With CL=LOCAL_QUORUM, tests have been in the region of 3.5K INSERTs and 6.5K READs per second against a relatively simple schema, with latencies of roughly 6 and 2 milliseconds respectively and CPU usage of around 20% across the box.
Problem
However, the problem I cannot solve is this: when I create multiple separate instances of my load client application, I can achieve significantly higher TPS summed across instances, and greater CPU usage, than I can ever achieve within a single JVM. This suggests that my Java client application is neither IO- nor CPU-bound, and that the server-side Cassandra cluster is not the bottleneck either. Likewise, when I stub out the Cassandra call I achieve much higher TPS, which gives me confidence that the application itself is not suffering from any contention.
So my question is: is this a common problem, i.e. that a single Java client using the DataStax Java Driver for Apache Cassandra is somehow limited in its throughput? And if not, can anyone point me in the right direction to investigate?
I have tested multiple sequences (READs and WRITEs), and both execute and executeAsync, with a variable number of concurrent threads. As you'd expect I see higher numbers with executeAsync, but still the same limitation within my app.
I have tested multiple connection pooling settings, tried creating one Cluster instance per client application as well as multiple Cluster instances per application, and varied the CoreConnections, maxRequestsPerConnection and newConnectionThreshold values, but thus far with no success.
My current best results were with 50 executor threads and 5 instances; MaxRequestsPerConnection(L) = 1024, NewConnectionThreshold(L) = 800, CoreConnectionsPerHost(L) = 20.
This yielded ~4K TPS but only using 18% of the CPU, and when I start a separate application instance I achieve 7.5K TPS across both using 30% CPU; I cannot achieve this 7.5K within the same JVM.
Code: Create Cluster
LoadBalancingPolicy tokenAwarePolicy =
new TokenAwarePolicy(new RoundRobinPolicy());
Cluster cluster = Cluster.builder()
.addContactPoints(node)
.withLoadBalancingPolicy(tokenAwarePolicy)
.withPoolingOptions(new PoolingOptions()) // Have tried various options here
.build();
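For concreteness, a sketch of the pooling options from my best run above (assuming the (L) values are HostDistance.LOCAL; the numbers are the ones quoted, not a recommendation):
PoolingOptions poolingOptions = new PoolingOptions()
    .setCoreConnectionsPerHost(HostDistance.LOCAL, 20)
    .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024)
    .setNewConnectionThreshold(HostDistance.LOCAL, 800);
// ...then passed to the builder via .withPoolingOptions(poolingOptions)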
Code: Prepare Statement (once)
String insertSqlString = "INSERT INTO " + keySpaceName + ".test_three ("
+ "user_id, field_a, field_b, field_c, field_d) values "
+ "( ?, ?, ?, ?);";
statementInsertDataTablePS = session.prepare(insertSqlString);
statementInsertDataTablePS.setConsistencyLevel(configuredConsistencyLevel); //2
Code: Execute
BoundStatement boundStatement = new BoundStatement(statementInsertDataTablePS);
session.executeAsync(boundStatement.bind(
sequence, // userID
sequence + "value_for_field_a",
sequence + "value_for_field_b",
sequence + "value_for_field_c",
sequence + "value_for_field_d") );

Cassandra throwing NoHostAvailableException after 5 minutes of high IOPS run

I'm using the DataStax Cassandra 2.1 driver and performing read/write operations at a rate of ~8000 IOPS. I've used pooling options to configure my sessions, and I use a separate session for reads and for writes, each of which connects to a different node in the cluster as its contact point.
This works fine for about 5 minutes, but after that I get a lot of exceptions like:
Failed with: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.1.123:9042 (com.datastax.driver.core.TransportException: [/10.0.1.123:9042] Connection has been closed), /10.0.1.56:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
Can anyone help me out here on what the problem could be?
The exception asks me to increase the number of connections per host, but how high a value can I set for this parameter?
Also, I'm not able to set CoreConnectionsPerHost beyond 2, as it throws an exception saying 2 is the max.
This is how I'm creating each read / write session.
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 200);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
cluster = Cluster
.builder()
.withPoolingOptions( poolingOpts )
.addContactPoint(ip)
.withRetryPolicy( DowngradingConsistencyRetryPolicy.INSTANCE )
.withReconnectionPolicy( new ConstantReconnectionPolicy( 100L ) ).build();
Session s = cluster.connect(keySpace);
Your problem might not actually be in your code or the way you are connecting. If the problem only appears after a few minutes, it could simply be that your cluster is becoming overloaded trying to process the ingestion of data and cannot keep up. The typical sign of this is when you start seeing JVM garbage collection ("GC") messages in the Cassandra system.log file: too many small collections close together, or large ones on their own, can mean that incoming clients are not responded to, causing this kind of scenario. Verify that you do not have too many of these events showing up in your logs before you start to look at your code. Here's a good example of a large GC event:
INFO [ScheduledTasks:1] 2014-05-15 23:19:49,678 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2896 ms for 2 collections, 310563800 used; max is 8375238656
When connecting to a cluster there are some recommendations, one of which is to have only one Cluster object per real cluster. As per the article I've linked below (apologies if you already studied this), with a short sketch after the list:
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a prepared statement
You can reduce the number of network roundtrips and also have atomic operations by using batches
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
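As an illustration of the first three rules (driver 2.1 API; the contact point, keyspace, table and column names below are made up):
// One Cluster per physical cluster, one Session reused for the application lifetime.
Cluster cluster = Cluster.builder().addContactPoint("10.0.1.123").build();
Session session = cluster.connect("my_keyspace");

// Prepare once, then bind and execute many times.
PreparedStatement ps = session.prepare(
        "INSERT INTO my_table (id, value) VALUES (?, ?)");
for (int i = 0; i < 1000; i++) {
    session.execute(ps.bind(i, "value-" + i));
}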
As you are doing a high number of reads I'd definitely recommend using setFetchSize as well, if it's applicable to your code; see the sketch after the links below.
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/cqlStatements.html
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/queryBuilderOverview.html
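A minimal sketch of setting the fetch size on a statement (the query is illustrative):
Statement stmt = new SimpleStatement("SELECT * FROM my_keyspace.my_table");
stmt.setFetchSize(500); // page through results 500 rows at a time
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    // process row; the driver fetches further pages transparently while iterating
}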
For reference, here are the connection options, in case you find them useful:
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/connectionsOptions_c.html
Hope this helps.

How do I control transactions in the DataStax Java driver

We are planning to use the DataStax 2.0 driver in our application. We have the following scenario: there are two different transactions, one increasing the consumption and the other decreasing it, which can be executed at the same time.
For Example:
Let us assume Repair_Qty = 10. From Machine 1 I am doing a new repair, so the new Repair_Qty should be 10 + 1 = 11. From Machine 2, at the same time, someone else is cancelling a repair, so the Repair_Qty should be 11 - 1 = 10. However, because the transactions happened at the same time and there is no transaction lock, the new Repair_Qty will be 10 - 1 = 9, which is wrong.
I want to know if there is some mechanism for WRITE-READ-WRITE lock support in the DataStax Java driver.
Please help.
Regards,
Arun
I would suggest you handle this at the application level somehow. As Cassandra is eventually consistent, these kinds of operations tend to fail.
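For completeness, one technique often used for this kind of read-modify-write in Cassandra, not mentioned in the answer above, is a lightweight transaction with an IF condition, retried when the condition fails. A sketch against a made-up table, using the DataStax Java driver:
// Read the current value, then update only if nobody changed it in the meantime.
Row row = session.execute("SELECT repair_qty FROM repairs WHERE machine_id = 1").one();
int current = row.getInt("repair_qty");
ResultSet rs = session.execute(
        "UPDATE repairs SET repair_qty = " + (current + 1)
        + " WHERE machine_id = 1 IF repair_qty = " + current);
if (!rs.one().getBool("[applied]")) {
    // Another client won the race; re-read and retry.
}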
