Having performance issues using NetworkTopologyStrategy on a production keyspace with replication factor 4 across multiple datacenters (DCs located in 4 worldwide locations). Each DC has 3 nodes with pretty good hardware (70GB RAM, 5TB SSDs, etc.).
Same keyspace performs well in a SimpleStrategy using 4 node cluster in AWS, but running same queries in Production environment results in poor query times (select * from my_table is 6ms in AWS and 271ms in production).
Table "my_table" (name changed for privacy) is defined as:
CREATE TABLE my_table (
rec_type text,
group_id int,
rec_id timeuuid,
user_id int,
content text,
created_on timestamp,
PRIMARY KEY ((rec_type, group_id), rec_id)
) WITH bloom_filter_fp_chance = 0.1
AND comment = ''
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'sstable_compression' : 'LZ4Compressor'
}
AND compaction = {
'class' : 'LeveledCompactionStrategy'
};
This is a newly-created table in Production with occasional updates and low tombstone count.
Query trace is below:
Looks like range requests are taking the most amount of time. What would be the cause of the delay?
Network latency between nodes in the same DC is <1ms and latency between DCs is around 50-60ms.
EDIT:
Below is the query trace for a select * from my_table where rec_type = 'abc' and group = 1 LIMIT 300 query (2 screenshots because trace is so long):
Related
Our Java Application doing a batch inserts on 1 of the table,
That table schema is something like..
CREATE TABLE "My_KeySpace"."my_table" (
key text,
column1 varint,
column2 bigint,
column3 text,
column4 boolean,
value blob,
PRIMARY KEY (key, column1, column2, column3, column4)
) WITH CLUSTERING ORDER BY ( column1 DESC, column2 DESC, column3 ASC, column4 ASC )
AND COMPACT STORAGE
AND bloom_filter_fp_chance = 0.1
AND comment = ''
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.1
AND speculative_retry = 'NONE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'chunk_length_in_kb' : 64,
'class' : 'LZ4Compressor',
'enabled' : true
}
AND compaction = {
'class' : 'LeveledCompactionStrategy',
'sstable_size_in_mb' : 5
};
gc_grace_seconds = 0 in above schema. Because of this I am getting following warning:
2019-02-05 01:59:53.087 WARN [SharedPool-Worker-5 - org.apache.cassandra.cql3.statements.BatchStatement:97] Executing a LOGGED BATCH on table [My_KeySpace.my_table], configured with a gc_grace_seconds of 0. The gc_grace_seconds is used to TTL batchlog entries, so setting gc_grace_seconds too low on tables involved in an atomic batch might cause batchlog entries to expire before being replayed.
I have seen Cassandra code, this warning is there for obvious reasons at: this line
Any solution without changing batch code in application??
Should I increase gc_grace_seconds?
In Cassandra, batches aren't the way to optimize inserts into database - they are usually used mostly for coordinating writing into multiple tables, etc. If you're using the batches for insertion into multiple partitions, you're even get worse performance.
The better throughput for inserts you can get from using asynchronous commands execution (via executeAsync), and/or by using batches but only for inserts that are targeting the the same partition.
I have a cassandra instance running in production in a project where I have connected hundreds of thousands of sensors to dump data to the cassandra server using c# cassandradriver.
The Server has a 2 TB SSD.
I am not experiencing any performance issues as of Right now, but I plan to add more sensors,
I have only one KeySpace and only one Table in that KeySpace. The structure of the Table is as follows
CREATE TABLE xxxxkeyspace.sensorreadings (
signalid int,
monthyear int,
fromtime bigint,
totime bigint,
avg decimal,
insertdate bigint,
max decimal,
min decimal,
readings text,
PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
) WITH bloom_filter_fp_chance = 0.01
AND comment = ''
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'chunk_length_in_kb' : 64,
'class' : 'LZ4Compressor',
'crc_check_chance' : 1.0,
'enabled' : true
}
AND compaction = {
'bucket_high' : 1.5,
'bucket_low' : 0.5,
'class' : 'SizeTieredCompactionStrategy',
'enabled' : true,
'max_threshold' : 32,
'min_sstable_size' : 50,
'min_threshold' : 4,
'tombstone_compaction_interval' : 86400,
'tombstone_threshold' : 0.2,
'unchecked_tombstone_compaction' : false
};
when i run the nodetool status it says that the keyspace has occupied more than 550GB of data on the SSD.
Will there be any issues during production If I add more nodes to the cluster during run-time. Note I can not tolerate a downtime greater than an hour.
You can add new node without any downtime, once a new node is added in existing ring, range-movement starts, once all data corresponding to acquired tokens are replicated on new node bootstrap process is completed. After bootstrap process finish for new node, all responsible write request are forwarded to this node as per its tokens.
Now only thing remains is cleanup the data from old nodes for tokens which have been moved to new node, this could be done anytime later using nodetool cleanup command.
I am trying to run a stress test using the cassandra-stress tool with profiles on a 6 node cluster with a replication factor=3.
./cassandra-stress user profile=/path/to/cassandra_stress.yaml duration=2h ops\(insert=20,select=10\) **cl=local_quorum** no-warmup -node nodeaddress -transport truststore=/path/to/tls/truststore.jks truststore-password=***** -rate threads=5 -log level=verbose file=/path/to/log -graph file=graph_.html title='Graph' & 2>1
The execution stops at some point with a ReadTimeout and the logs show the following:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_QUORUM (2 replica were required but only 1 acknowledged the write)
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ALL (3 replica were required but only 2 acknowledged the read)
I am not sure why it is taking cl=local_quorum for writes but not for reads. Any insights would be helpful.
Profile
# Keyspace Name keyspace: d3 keyspace_definition: | CREATE KEYSPACE d3 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND DURABLE_WRITES = true;
# Table name table: stress_offheap_long table_definition: | CREATE TABLE d3.stress_offheap_long (
dart_id timeuuid,
dart_version_id timeuuid,
account_id timeuuid,
amount double,
data text,
state text, PRIMARY KEY (dart_id, dart_version_id) ) WITH CLUSTERING ORDER BY (dart_version_id DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys':'ALL', 'rows_per_partition':'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
columnspec:
- name: dart_id
size: gaussian(36..64)
population: uniform(1..10M)
- name: art_version_id
size: gaussian(36..64)
- name: account_id
size: gaussian(36..64)
population: uniform(1..10M)
- name: amount
size: fixed(1)
- name: data
size: gaussian(5000..20000)
- name: state
size: gaussian(1..2)
population: fixed(1)
### Batch Ratio Distribution Specifications ###
insert:
partitions: fixed(1)
select: fixed(1)/1000
batchtype: UNLOGGED # Unlogged batches
#
# A list of queries you wish to run against the schema
#
queries:
select:
cql: select * from stress_offheap_long where dart_id = ? and dart_version_id=? LIMIT 1
fields: samerow
In my spark job I am reading data from cassandra using java cassandra util. My query reads like-
JavaRDD<CassandraRow> cassandraRDD = functions.cassandraTable("keyspace","column_family").
select("timeline_id","shopper_id","product_id").where("action=?", "Viewed")
My row key level is set on action column. When I am running my spark job its causing the over utilisation of cpu but when I remove the filter on the action column its working fine.
Please find below the create table script for the column family-
CREATE TABLE keyspace.column_family (
action text,
timeline_id timeuuid,
shopper_id text,
product_id text,
publisher_id text,
referer text,
remote_ip text,
seed_product text,
strategy text,
user_agent text,
PRIMARY KEY (action, timeline_id, shopper_id)
) WITH CLUSTERING ORDER BY (timeline_id DESC, shopper_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
What I am suspecting is as action_item is the row key, all data is getting served from single node (hot spot) and thats why that nodes CPU might be shooting up. Also while reading there is only a single partition of RDD getting created in the spark job. Any help will be appreciated.
Ok you're having a data model issue here. action = partition key so all similar actions are stored in a single partition = (one node + replicas).
How many distinct actions do you have in total ? Your intuition about having hotspot is justified.
You probably need a different partition key OR need to add an extra column to the partition key to let Cassandra distributes the data evenly on the cluster.
Read this blog post : http://www.planetcassandra.org/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key/
I am receiving a OperationTimedOut error while running an alter table command in cqlsh. How is that possible? Since this is just a table metadata update, shouldn't this operation run almost instantaneously?
Specifically, this is an excerpt from my cqlsh session
cqlsh:metric> alter table metric with gc_grace_seconds = 86400;
OperationTimedOut: errors={}, last_host=sandbox73vm230
The metric table currently has a gc_grace_seconds of 864000. I am seeing this behavior in a 2-node cluster and in a 6-node 2-datacenter cluster. My nodes seem to be communicating fine in general (e.g. I can insert in one and read from the other). Here is the full table definition (a cyanite 0.1.3 schema with DateTieredCompactionStrategy, clustering and caching changes):
CREATE TABLE metric.metric (
tenant text,
period int,
rollup int,
path text,
time bigint,
data list<double>,
PRIMARY KEY ((tenant, period, rollup, path), time)
) WITH CLUSTERING ORDER BY (time ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'timestamp_resolution': 'SECONDS', 'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = 'NONE';
I realize at this point the question is pretty old, and you may have either figured out the answer or otherwise moved on, but wanted to post this in case others stumbled upon it.
The default cqlsh request timeout is 10 seconds. You can adjust this by starting up cqlsh with the --request-timeout option set to some value that allows your ALTER TABLE to run to completion, e.g.:
cqlsh --request-timeout=1000000