Scaling in Cassandra

I tested the throughput of a Cassandra cluster with 2, 3, and 4 nodes. There was a significant improvement when I went from 2 to 3 nodes; however, the improvement was much smaller when I went from 3 to 4 nodes.
Given below are the specs of the 4 nodes:
N = number of physical CPU cores, Ra = total RAM, Rf = free RAM
Node 1: N=16, Ra=189 GB, Rf=165 GB
Node 2: N=16, Ra=62 GB, Rf=44 GB
Node 3: N=12, Ra=24 GB, Rf=38 GB
Node 4: N=16, Ra=189 GB, Rf=24 GB
All nodes are on RHEL 6.5
Case 1 (2 nodes in the cluster: Node 1 and Node 2)
Throughput: 12K ops/second
Case 2 (3 nodes in the cluster: Node 1, Node 2 and Node 3)
Throughput: 20K ops/second
Case 3 (all 4 nodes in the cluster)
Throughput: 23K ops/second
Each operation involved 1 read + 1 write (the read and write take place on the same row, so the row cache can't be used). In all cases, read consistency = 2 and write consistency = 1. Both reads and writes were asynchronous. The client application used DataStax's C++ driver and was run with 10 threads.
Given below are the keyspace and table details:
CREATE KEYSPACE cass WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2'} AND durable_writes = true;
CREATE TABLE cass.test_table (
pk text PRIMARY KEY,
data1_upd int,
id1 int,
portid blob,
im text,
isflag int,
ms text,
data2 int,
rtdata blob,
rtdynamic blob,
rtloc blob,
rttdd blob,
rtaddress blob,
status int,
time bigint
) WITH bloom_filter_fp_chance = 0.001
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Some parameters from cassandra.yaml are given below (all 4 nodes used similar YAML files):
commitlog_segment_size_in_mb: 32
concurrent_reads: 64
concurrent_writes: 256
concurrent_counter_writes: 32
memtable_offheap_space_in_mb: 20480
memtable_allocation_type: offheap_objects
memtable_flush_writers: 1
concurrent_compactors: 2
Some parameters from jvm.options are given below (all nodes used the same values):
### CMS Settings
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=4
-XX:MaxTenuringThreshold=6
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
-XX:+CMSClassUnloadingEnabled
Given below are some of the client's connection-specific parameters:
cass_cluster_set_max_connections_per_host ( ms_cluster, 20 );
cass_cluster_set_queue_size_io ( ms_cluster, 102400*1024 );
cass_cluster_set_pending_requests_low_water_mark(ms_cluster, 50000);
cass_cluster_set_pending_requests_high_water_mark(ms_cluster, 100000);
cass_cluster_set_write_bytes_low_water_mark(ms_cluster, 100000 * 2024);
cass_cluster_set_write_bytes_high_water_mark(ms_cluster, 100000 * 2024);
cass_cluster_set_max_requests_per_flush(ms_cluster, 10000);
cass_cluster_set_request_timeout ( ms_cluster, 12000 );
cass_cluster_set_connect_timeout (ms_cluster, 60000);
cass_cluster_set_core_connections_per_host(ms_cluster,1);
cass_cluster_set_num_threads_io(ms_cluster,10);
cass_cluster_set_connection_heartbeat_interval(ms_cluster, 60);
cass_cluster_set_connection_idle_timeout(ms_cluster, 120);
Is there anything wrong with these configurations that would explain why Cassandra didn't scale much when the number of nodes was increased from 3 to 4?

During a test, you may check ThreadPools using nodetool tpstats.
You will be able to see if some stages have too many pending (or blocked) tasks.
If there are no issues with the thread pools, maybe you could launch a benchmark using cassandra-stress in order to see whether the limitation comes from your client.
I don't know if it is only for test purposes, but as far as I know, read-before-write is an antipattern with Cassandra.
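For example, you could watch the thread pools during a run and then drive the cluster with cassandra-stress to take your own client out of the picture (a rough sketch; the node list, operation count and thread count are placeholders, and cassandra-stress writes to its own keyspace1.standard1 schema by default):
# on each node, while the test is running
nodetool tpstats
# synthetic 50/50 read/write load, run from a machine outside the cluster
cassandra-stress mixed ratio\(write=1,read=1\) n=1000000 cl=ONE -node node1,node2,node3,node4 -rate threads=50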

Related

Cassandra unable to query sum of rows from a table

I am using a Cassandra database for capturing and saving simple network sniffer data, but because the table has more than 20M rows, I am unable to run any aggregate function such as sum or count.
Following is my table schema:
CREATE TABLE db.uinfo (
id timeuuid,
created timestamp,
dst_ip text,
dst_mac text,
dst_port int,
protocol int,
src_ip text,
src_mac text,
src_port int,
PRIMARY KEY (id, created)
) WITH CLUSTERING ORDER BY (created ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Now when I run the query (with or without limit):
select src_ip, sum(data) as total from db.uinfo;
It throws me the following error:
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
Could any of you good people help me with this? I have tried changing the timeouts in cqlshrc and cassandra.yaml respectively. I have even tried starting cqlsh using:
cqlsh --connect-timeout=120 --request-timeout=120.
I am using [cqlsh 5.0.1 | Cassandra 3.11.4 | CQL spec 3.4.4 | Native protocol v4]
Queries like this won't work with Cassandra when you have relatively big data in it - they require scanning the whole table and reading all of the data in it. Cassandra shines when you know the partition you want to hit, because the query is then sent only to the individual servers that own it and can be processed very efficiently. So aggregation functions work best within a single partition.
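For instance, the same kind of aggregate restricted to a single partition usually completes quickly, because only one partition has to be read (a sketch against the schema above; the timeuuid literal is just a placeholder, and count(*) is used since the table has no data column to sum):
-- aggregation limited to one partition
SELECT count(*) FROM db.uinfo WHERE id = 00000000-0000-1000-8000-000000000000;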
If you need this kind of query, the common suggestion is to use Spark to read the data in parallel and perform the aggregation there, using the Spark Cassandra Connector. It will still be slower than usual queries - maybe dozens of seconds, or even minutes, depending on the size of the data, the hardware running the Spark jobs, etc.
If you need such queries performed very often, then you need to look at other technologies, but it's hard to say what will perform well in such a situation.
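As a starting point, the connector can be pulled into a Spark shell like this (a sketch; the connector version must match your Spark and Scala versions, and the host is a placeholder):
# launch a Spark shell with the Spark Cassandra Connector on the classpath
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
            --conf spark.cassandra.connection.host=127.0.0.1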

Can we add new Datacenter in existing cluster With higher Replication Factor

In Cassandra, adding a new datacenter with a higher replication factor to the same cluster throws an error saying that some range with replication factor 1 is not found in any source datacenter.
I have datacenters X (RF = 2) and Y (RF = 1). I want to add datacenter Z (RF = 3).
I have added the nodes in datacenter Z.
But on running
nodetool rebuild -- X
it fails with the error:
java.lang.IllegalStateException: unable to find sufficient sources for streaming range (-3685074324747697686,-3680615207285604279] in keyspace with replication factor 1
Basic details of all column families:
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Your current configuration, as shown in your keyspace definition, is for a single DC. You need to ALTER KEYSPACE to include the other DC in it. This starts the replication process: keys that are read or written from then on will be replicated to the new DC. To fully copy the existing data, you additionally need to use the nodetool rebuild -- DC2 command.
What you need to do is ALTER KEYSPACE and then nodetool rebuild. Copy the text you get from DESCRIBE KEYSPACE keyspace_name, but with ALTER instead of CREATE at the beginning, and add the new datacenter to the replication settings.
ALTER KEYSPACE keyspace_name WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1, 'datacenter2' : 3 };
Then do:
nodetool rebuild -- datacenter1
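With the datacenter names from the question, that would look roughly like this (a sketch; adjust the replication factors to what you actually want per DC):
ALTER KEYSPACE keyspace_name WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'X' : 2, 'Y' : 1, 'Z' : 3 };
Then, on each node in datacenter Z:
nodetool rebuild -- X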
Based on your comments -
"I have alter my keyspaces but for the default keyspace like (system_distributed) it throws error"
Other than the user-specific keyspaces, make sure that the default system keyspaces like "system_distributed" are also on "NetworkTopologyStrategy" (no keyspaces on SimpleStrategy in a multi-DC cluster, except those using LocalStrategy).
Reference: Point-2 https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html
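For example, the distributed system keyspaces can be switched over like this (a sketch; the DC names and replication factors mirror the question and should match your topology; system_auth and system_traces typically need the same change):
ALTER KEYSPACE system_distributed WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'X' : 2, 'Y' : 1, 'Z' : 3 };
ALTER KEYSPACE system_auth WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'X' : 2, 'Y' : 1, 'Z' : 3 };
ALTER KEYSPACE system_traces WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'X' : 2, 'Y' : 1, 'Z' : 3 };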

Connection timed out while dropping keyspace

I'm having trouble deleting a keyspace.
The keyspace in question has 4 tables similar to this one:
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = false;
CREATE TABLE demo.t1 (
t11 int,
t12 timeuuid,
t13 int,
t14 text,
t15 text,
t16 boolean,
t17 boolean,
t18 int,
t19 timeuuid,
t110 text,
PRIMARY KEY (t11, t12)
) WITH CLUSTERING ORDER BY (t13 DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX t1_idx ON demo.t1 (t14);
CREATE INDEX t1_deleted_idx ON demo.t2 (t15);
When I want to delete the keyspace using the command:
Session session = cluster.connect();
PreparedStatement prepared = session.prepare("drop keyspace if exists " + schemaName);
BoundStatement bound = prepared.bind();
session.execute(bound);
Then the query times out (or takes over 10 seconds to execute), even when the tables are empty:
com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.0.1:9042] Timed out waiting for server response
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:39)
I tried this on multiple machines and the result was the same. I'm using Cassandra 3.9. A similar thing happens using cqlsh. I know that I can increase the read timeout in the cassandra.yaml file, but how can I make the drop itself faster? Another thing is that if I do two consecutive requests, the first one times out and the second one goes through quickly.
Try to run it with an increased timeout:
cqlsh --request-timeout=3600 (in seconds, default 10 seconds)
There should also be the same setting at the driver level. Review the timeout section at this link:
http://docs.datastax.com/en/developer/java-driver/3.1/manual/socket_options/
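If you don't want to pass the flag on every invocation, the same client-side timeout can also be set in cqlshrc (a sketch; the value is an example, and older cqlsh versions named this option client_timeout):
# ~/.cassandra/cqlshrc
[connection]
request_timeout = 3600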
Increasing the timeout just hides the issue away and is usually a bad idea. Have a look at this answer: https://stackoverflow.com/a/16618888/7413631

Cassandra: Fetching data but not populating cache as query does not query from the start of the partition

I'm running Cassandra version 3.3 on a fairly beefy machine. I would like to try out the row cache, so I have allocated 2 GB of RAM for row caching and configured the target tables to cache a number of their rows.
If I run a query on a very small table (under 1 MB) twice with tracing on, on the 2nd query I see a cache hit. However, when I run a query on a large table (34 GB) I only get cache misses and see this message after every cache miss:
Fetching data but not populating cache as query does not query from the start of the partition
What does this mean? Do I need a bigger row cache to be able to handle a 34 GB table with 90 million keys?
Taking a look at the row cache source code on github, I see that clusteringIndexFilter().isHeadFilter() must be evaluating to false in this case. Is this a function of my partitions being too big?
My schema is:
CREATE TABLE ap.account (
email text PRIMARY KEY,
added_at timestamp,
data map< int, int >
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': '100000'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
The query is simply SELECT * FROM account WHERE email='sample#test.com'

Cassandra: Data loss after adding new node

We had a two-node Cassandra cluster which we wanted to expand to four nodes.
We followed the procedure described there: http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html
But after adding the two new nodes (at the same time, with a 2-minute interval as recommended in the documentation), we experienced some data loss. In some column families, there were missing elements.
Here is the nodetool status output:
[centos#ip-10-11-11-187 ~]$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.11.11.187 5.63 MB 256 ? 0e596912-d649-4eed-82a4-df800c422634 2c
UN 10.11.1.104 748.79 MB 256 ? d8b96739-0858-4926-9eb2-27c96ca0a1c4 2c
UN 10.11.11.24 7.11 MB 256 ? e3e76dcf-2c39-42e5-a34e-9e986d4a9f7c 2c
UN 10.11.1.231 878.91 MB 256 ? cc1b5cfd-c9d0-4ca9-bbb1-bce4b2deffc1 2c
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
I don't quite understand if the "Note" is bad or not.
When we added the nodes, we put the first two servers - the ones already in the cluster - in the seed list of the first added node. For the second added node, we also put the newly added node in the seeds.
We are using EC2Snitch, and the listen_address has been set to the above addresses on each server.
We haven't run a cleanup yet, but we tried to run a repair, and it reported that there was nothing to repair in our keyspace.
Here is how our cluster was created:
CREATE KEYSPACE keyspace_name WITH replication = {'class': 'NetworkTopologyStrategy', 'us-west-2': '1'} AND durable_writes = true;
And the options of all of our tables:
CREATE TABLE keyspace_name."CFName" (
// ...
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
The data reappears if I decommission the new nodes.
EDIT: It was actually a mistake when reading the documentation... A colleague had set auto_bootstrap to false instead of setting it to true...
You should perform nodetool rebuild on the new nodes after you add them with auto_bootstrap: false
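For example (a sketch; us-west-2 is the datacenter name from the keyspace definition above):
# run on each newly added node, after it has joined with auto_bootstrap: false
nodetool rebuild -- us-west-2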
http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsRebuild.html
HTH
Also, you can specify the keyspace name to get rid of that "Note" about effective ownership, which in this case is:
nodetool status keyspace_name
