I have a single-node Cassandra cluster with a 32-core CPU, 32 GB of memory, and a RAID of 3 SSDs totalling around 2.5 TB. I also have another host with 32 cores and 32 GB of memory, on which I run Apache Spark.
I have a large amount of historical data in Cassandra, about 600 GB. More than 1 million new records arrive every day from Kafka, and I need to query these new rows every day. But the Cassandra queries fail, and I'm confused.
My Cassandra table schema is:
CREATE TABLE rainbow.activate (
rowkey text,
qualifier text,
act_date text,
info text,
log_time text,
PRIMARY KEY (rowkey, qualifier)
) WITH CLUSTERING ORDER BY (qualifier ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX activate_act_date_idx ON rainbow.activate (act_date);
CREATE INDEX activate_log_time_idx ON rainbow.activate (log_time);
Because the source data may contain duplicates, I need a primary key to drop the duplicate records. There are two indexes on this table: act_date is a date string like '20151211', and log_time is a datetime string like '201512111452', so log_time separates records at a finer granularity.
If I select records using log_time, Cassandra works, but it fails using act_date.
At first, the Spark job exited with an error:
java.io.IOException: Exception during execution of SELECT "rowkey", "qualifier", "info" FROM "rainbow"."activate" WHERE token("rowkey") > ? AND token("rowkey") <= ? AND log_time = ? ALLOW FILTERING: All host(s) tried for query failed (tried: noah-cass01/192.168.1.124:9042 (com.datastax.driver.core.OperationTimedOutException: [noah-cass01/192.168.1.124:9042] Operation timed out))
I tried increasing spark.cassandra.read.timeout_ms to 60000, but the job then failed with another error:
java.io.IOException: Exception during execution of SELECT "rowkey", "qualifier", "info" FROM "rainbow"."activate" WHERE token("rowkey") > ? AND token("rowkey") <= ? AND act_date = ? ALLOW FILTERING: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
I don't know how to solve this problem. I read the spark-cassandra-connector docs but didn't find any tips.
Could you give me some advice to help solve this problem?
Thanks very much!
Sounds like an unusual setup. If you have two machines it would be more efficient to configure Cassandra as two nodes and run Spark on both nodes. That would spread the data load and you'd generate a lot less traffic between the two machines.
Ingesting so much data every day and then querying arbitrary ranges of it sounds like a ticking time bomb. When you start getting frequent time out errors it is usually a sign of an inefficient schema where Cassandra cannot do what you are asking in an efficient way.
I don't see the specific cause of the problem, but I'd consider adding another field to the partition key, such as the day so that you could restrict your queries to a smaller subset of your data.
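To make the partition-key suggestion a bit more concrete, here is a small in-memory sketch in Python. It is not Cassandra code: `DayPartitionedTable` and its methods are hypothetical names, and a dict stands in for the partitions. It assumes one possible reading of the suggestion, with act_date as the partition key and (rowkey, qualifier) as clustering columns, so that rewriting the same key still overwrites duplicates while a query for one day only touches that day's partition. In practice you might also add a bucket number to the partition key to keep each day's partition from growing too large.

```python
from collections import defaultdict


class DayPartitionedTable:
    """Toy model: partition key = act_date, clustering key = (rowkey, qualifier)."""

    def __init__(self):
        # act_date -> {(rowkey, qualifier): info}
        self.partitions = defaultdict(dict)

    def upsert(self, act_date, rowkey, qualifier, info):
        # Writing the same primary key again overwrites the old row,
        # which is how duplicate source records get dropped.
        self.partitions[act_date][(rowkey, qualifier)] = info

    def rows_for_day(self, act_date):
        # Only one partition is read; other days are never touched.
        return dict(self.partitions.get(act_date, {}))


table = DayPartitionedTable()
table.upsert("20151211", "user1", "q1", "first")
table.upsert("20151211", "user1", "q1", "duplicate")  # same key, overwrites
table.upsert("20151210", "user2", "q1", "older day")
rows = table.rows_for_day("20151211")
```

Note that with this layout a duplicate only collapses if it arrives with the same act_date, since the date is now part of the primary key.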
I'm trying to get cassandra setup, and having some issues where google and other questions here are not helpful.
From cqlsh, I get NoHostAvailable: when I try to query tables after creating them:
Connected to DS Cluster at 10.101.49.129:9042.
[cqlsh 5.0.1 | Cassandra 3.0.9 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh> use test;
cqlsh:test> describe kv;
CREATE TABLE test.kv (
key text PRIMARY KEY,
value int
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
cqlsh:test> select * from kv;
NoHostAvailable:
All of the nodes are up and running according to nodetool.
When I try to connect from Spark, I get something similar: everything works fine and I can connect to and manipulate tables until I try to access any data, and then it fails.
val df = sql.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "test2", "table" -> "words")).load
df: org.apache.spark.sql.DataFrame = [word: string, count: int]
df.show
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 25, HOSTNAME): java.io.IOException: Failed to open native connection to Cassandra at {10.101.49.129, 10.101.50.24, 10.101.61.251, 10.101.49.141, 10.101.60.94, 10.101.63.27, 10.101.49.5}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.compute(CassandraTableScanRDD.scala:325)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
...
Caused by: java.lang.NoSuchMethodError: com.google.common.util.concurrent.Futures.withFallback(Lcom/google/common/util/concurrent/ListenableFuture;Lcom/google/common/util/concurrent/FutureFallback;Ljava/util/concurrent/Executor;)Lcom/google/common/util/concurrent/ListenableFuture;
at com.datastax.driver.core.Connection.initAsync(Connection.java:177)
at com.datastax.driver.core.Connection$Factory.open(Connection.java:731)
at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:251)
I apologize if this is a naive question, and thank you in advance.
NoHostAvailable
Replication in Cassandra is done via one of two strategies which can be specified on a particular keyspace.
SimpleStrategy :
Represents a naive approach and spreads data globally among nodes based on the token ranges owned by each node. There is no differentiation between nodes that are in different datacenters.
There is one parameter for SimpleStrategy which chooses how many replicas for any partition will exist within the entire cluster.
NetworkTopologyStrategy :
Represents a per Datacenter replication strategy. With this strategy data is replicated based on the token ranges owned by nodes but only within a datacenter.
This means that if you have two datacenters with nodes [Token] and a full range of [0-20]
Datacenter A : [1], [11]
Datacenter B : [2], [12]
Then with SimpleStrategy the range would be viewed as being split like this
[1] [2-10] [11] [12 -20]
Which means we would end up with two very unbalanced nodes which only own a single token.
If instead we use NetworkTopologyStrategy the responsibilities look like
Datacenter A : [1-10], [11-20]
Datacenter B : [2-11], [12-01]
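The two layouts above can be reproduced with a few lines of Python. This is only a sketch of the simplified picture used in this answer: `owned_ranges` is a hypothetical helper, and it assumes a node owns everything from its own token up to just before the next token on the ring. (As far as I know, real Cassandra assigns ranges the other way round, with each node owning the range that ends at its token, but the balance argument is the same.) Token 0 wraps around to the last node, so its range ends at 0 rather than 20 as drawn above.

```python
def owned_ranges(tokens, ring_size):
    """Map each node token to the inclusive (start, end) range it owns.

    Simplified convention: a node owns everything from its own token up to
    just before the next token, wrapping around the ring.
    """
    ordered = sorted(tokens)
    ranges = {}
    for i, token in enumerate(ordered):
        next_token = ordered[(i + 1) % len(ordered)]
        ranges[token] = (token, (next_token - 1) % ring_size)
    return ranges


datacenters = {"A": [1, 11], "B": [2, 12]}
RING_SIZE = 21  # tokens 0..20

# SimpleStrategy: one ring containing every node, regardless of datacenter.
all_tokens = [t for tokens in datacenters.values() for t in tokens]
simple = owned_ranges(all_tokens, RING_SIZE)

# NetworkTopologyStrategy: one ring per datacenter.
per_dc = {dc: owned_ranges(tokens, RING_SIZE) for dc, tokens in datacenters.items()}
```

In `simple`, the nodes with tokens 1 and 11 each own a single token, while in `per_dc` every node owns roughly half of its datacenter's ring.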
The strategy itself can be described with a dictionary as a parameter which lists each datacenter and how many replicas should exist in that datacenter.
For example you can set the replication as
'A' : '1'
'B' : '2'
Which would create 3 replicas for the data, 2 replicas in B but only 1 in A.
This is where a lot of users run into trouble since you could specify
a_mispelled : '4'
Which would mean that a datacenter which doesn't exist should have replicas for that particular keyspace. Cassandra would then respond whenever doing requests to that keyspace that it could not obtain replicas because it can't find the datacenter.
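A quick way to catch this kind of mistake is to compare the datacenter names in the replication map against the names the cluster actually reports. The following Python sketch assumes you have already collected the real datacenter names yourself (for example by reading them off nodetool status); `unknown_datacenters` is a hypothetical helper, not part of any driver.

```python
def unknown_datacenters(replication, known_datacenters):
    """Return datacenter names in a replication map that the cluster does not have."""
    return sorted(
        name
        for name in replication
        if name != "class" and name not in known_datacenters
    )


known = {"A", "B"}  # assumed to be read from the cluster beforehand

good = {"class": "NetworkTopologyStrategy", "A": 1, "B": 2}
bad = {"class": "NetworkTopologyStrategy", "a_mispelled": 4}

good_problems = unknown_datacenters(good, known)
bad_problems = unknown_datacenters(bad, known)
```

An empty result means every replica is assigned to a datacenter that exists; anything else is a name that no node will ever satisfy.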
With VNodes you can get skewed replication (if required) by giving different nodes different numbers of VNodes. Without VNodes it just requires shrinking the ranges covered by nodes which have less capacity.
How data gets read
Regardless of the replication, data can be read from any node because the mapping is completely deterministic. Given a keyspace, table and partition key, Cassandra can determine on which nodes any particular token should exist and obtain that information as long as the Consistency Level for the query can be met.
Guava Error
The error you are seeing most commonly comes from using a bad package of the Spark Cassandra Connector (SCC). The Java Cassandra Driver and Hadoop are difficult to use together because they require different (incompatible) versions of Guava. To get around this, the SCC provides builds with its Guava version shaded, but re-including the Java Driver as a dependency or using an old build can break things.
To me it looks like two issues:
First, for cqlsh, you seem to have misconfigured the replication factor of your keyspace. What RF did you use there?
See also the DataStax documentation.
For the Spark issue, it seems that the Google Guava dependency isn't compatible with your driver.
In the latest Guava release there was an API change. See
java.lang.NoClassDefFoundError: com/google/common/util/concurrent/FutureFallback
I have a couple of Cassandra tables on which tombstone compaction is constantly being run and I believe this is the reason behind high CPU usage by the Cassandra process.
Settings I have include:
compaction = {'tombstone_threshold': '0.01',
'tombstone_compaction_interval': '1', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
default_time_to_live = 1728000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
I write data to one of the tables every minute. Because of the TTL that is set, a whole set of rows expires every minute too.
Is the constant compaction due to the low tombstone_threshold and tombstone_compaction_interval?
Can someone give a detailed explanation of tombstone_threshold and tombstone_compaction_interval? The Cassandra documentation doesn't explain them very well.
Tombstone compaction can fire once an SSTable is at least as old as tombstone_compaction_interval. (SSTables are created as things are compacted.) tombstone_threshold is the fraction of an SSTable that must be tombstones before that SSTable is compacted on its own just to purge tombstones, instead of being merged with other SSTables.
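The way the two settings combine can be sketched as a single check. This is a simplified Python model, not Cassandra's actual code: `eligible_for_tombstone_compaction` is a hypothetical name, and as far as I understand the real check also estimates whether the tombstones can actually be dropped. The defaults shown (0.2 and 86400 seconds) are the documented defaults; the other values are the ones from the question.

```python
def eligible_for_tombstone_compaction(
    age_seconds,
    droppable_ratio,
    tombstone_threshold=0.2,
    tombstone_compaction_interval=86400,
):
    """Simplified rule: the SSTable must be old enough AND contain enough tombstones."""
    old_enough = age_seconds >= tombstone_compaction_interval
    enough_tombstones = droppable_ratio >= tombstone_threshold
    return old_enough and enough_tombstones


# A one-minute-old SSTable in which 2% of the data is droppable tombstones.
with_question_settings = eligible_for_tombstone_compaction(
    age_seconds=60,
    droppable_ratio=0.02,
    tombstone_threshold=0.01,
    tombstone_compaction_interval=1,
)
with_defaults = eligible_for_tombstone_compaction(age_seconds=60, droppable_ratio=0.02)
```

With a threshold of 0.01 and an interval of 1 second, almost any SSTable that contains a few expired rows qualifies right away, which fits the constant compaction described in the question.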
You are using leveled compaction and, it looks like, a 20-day TTL. You will be doing a ton of regular compactions as well as tombstone compactions just to keep up. Of the default compaction strategies, leveled is the best at making sure old tombstones don't eat up disk space.
If this data is time series, which it sounds like it is, you may want to consider using TWCS (TimeWindowCompactionStrategy) instead. It creates "buckets", each of which is a single SSTable once compacted, so once the TTL for the data in that SSTable expires, the compactor can drop the whole SSTable, which is much more efficient.
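The bucket idea can be illustrated with a toy model in Python. This is only a sketch of the concept, not how TWCS is implemented: the function names, the one-hour window, the two-hour TTL, and the timestamps are all made up for illustration.

```python
from collections import defaultdict


def window_start(timestamp, window_seconds):
    """Start of the time window that a write timestamp falls into."""
    return timestamp - (timestamp % window_seconds)


def bucket_by_window(write_timestamps, window_seconds):
    """Group write timestamps into one bucket per time window."""
    buckets = defaultdict(list)
    for ts in write_timestamps:
        buckets[window_start(ts, window_seconds)].append(ts)
    return dict(buckets)


def fully_expired_windows(buckets, ttl_seconds, now):
    """A window can be dropped whole once even its newest row has expired."""
    return sorted(
        start for start, stamps in buckets.items() if max(stamps) + ttl_seconds <= now
    )


# Hypothetical numbers: one-hour windows, two-hour TTL.
buckets = bucket_by_window([0, 1800, 3600, 7300], window_seconds=3600)
droppable = fully_expired_windows(buckets, ttl_seconds=7200, now=9500)
```

Because rows written close together expire close together, a whole window becomes droppable at once and nothing has to be rewritten to get rid of it.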
TWCS is available as a jar that you need to add to the classpath for 2.1, and we currently use it in production. It has also been added to the 3.x series of Cassandra.
I’d like to know if there is a way to have a Cassandra node join the ring only after it has finished streaming and compaction. The issue I’m experiencing is that when I add a node to my cluster, it streams data from the other nodes then joins the ring, at this point it begins a lot of compactions, and the compactions take a very long time to complete (greater than a day), during this time CPU utilization on that node is nearly 100%, and bloom filter false positive ratio is very high as well which happens to be relevant to my use case. This causes the whole cluster to experience an increase in read latency, with the newly joined node in particular having 10x the typical latency for reads.
I read this post http://www.datastax.com/dev/blog/bootstrapping-performance-improvements-for-leveled-compaction which has this snippet about one way to possibly improve read latency when adding a node.
“Operators usually avoid this issue by passing -Dcassandra.join_ring=false when starting the new node and wait for the bootstrap to finish along with the followup compactions before putting the node into the ring manually with nodetool join”
The documentation on the join_ring option is pretty limited but after experimenting with it it seems that streaming data and the later compaction can’t be initiated until after I run nodetool join for the new host, so I’d like to know how or if this can be achieved.
Right now my use case is just deduping records being processed by a kafka consumer application. The table in cassandra is very simple, just a primary key, and the queries are just inserting new keys with a ttl of several days and checking existence of a key. The cluster needs to perform 50k reads and 50k writes per second at peak traffic.
I’m running Cassandra 3.7. My cluster is in EC2, originally on 18 m3.2xlarge hosts. Those hosts were running at very high (~90%) CPU utilization during compactions, which was the impetus for trying to add new nodes to the cluster. I’ve since switched to c3.4xlarge to get more CPU without having to actually add hosts, but it’d be helpful to know at what CPU threshold I should be adding new hosts, since waiting until 90% is clearly not safe and adding new hosts exacerbates the CPU issue on the newly added host.
CREATE TABLE dedupe_hashes (
app int,
hash_value blob,
PRIMARY KEY ((app, hash_value))
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '90PERCENTILE';
One thing to check that may help avoid the 100% CPU due to compactions is the setting for "compaction_throughput_mb_per_sec".
By default it should be 16 MB/s, but check whether you have throttling disabled (set to 0).
nodetool getcompactionthroughput
You can also set the value dynamically
nodetool setcompactionthroughput 16
(set in cassandra.yaml to persist after restart).
What is the ideal way to query Cassandra by partition key using the Spark Connector? I am using where to pass in the key, but that causes Cassandra to add ALLOW FILTERING under the hood, which in turn causes timeouts.
Current setup:
csc.cassandraTable[DATA]("schema", "table").where("id =?", "xyz").map( x=> print(x))
Here id is the partition key (not the full primary key).
I have a composite primary key and am querying with only the partition key.
Update :
Yes, I am getting an exception with this:
Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
None of my partitions has more than 1,000 records, and I am running a single Cassandra node.
ALLOW FILTERING is not going to affect your query if you use a where clause on the entire partition key. If the query is timing out, it may mean your partition is just very large or that the full partition key was not specified.
EDIT:
Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
This means that your queries are being sent to machines which do not have a replica of the data you are looking for. Usually this means that the replication of the keyspace is not set correctly or that the connection host is incorrect. The LOCAL part of LOCAL_ONE means that the query is only allowed to succeed if the data is available in the local datacenter.
With this in mind, you have 3 options:
Change the initial connection target of your queries
Change the replication of your keyspace
Change the consistency level of your queries
Since you only have 1 machine, changing the replication of your keyspace is probably the right thing to do.
I am building an application which processes a very large amount of data (more than 3 million). I am new to Cassandra and am using a 5-node Cassandra cluster to store data. I have two column families:
Table 1 : CREATE TABLE keyspace.table1 (
partkey1 text,
partkey2 text,
clusterKey1 text,
attributes text,
PRIMARY KEY ((partkey1, partkey2), clusterKey1)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Table 2 : CREATE TABLE keyspace.table2 (
partkey1 text,
partkey2 text,
clusterKey2 text,
attributes text,
PRIMARY KEY ((partkey1, partkey2), clusterKey2)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Note: clusterKey1 and clusterKey2 are randomly generated UUIDs.
My concern is with the nodetool cfstats output.
I am getting good throughput on Table1, with these stats:
SSTable count: 2
Space used (total): 365189326
Space used by snapshots (total): 435017220
SSTable Compression Ratio: 0.2578485727722293
Memtable cell count: 18590
Memtable data size: 3552535
Memtable switch count: 171
Local read count: 0
Local read latency: NaN ms
Local write count: 2683167
Local write latency: 1.969 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 352
whereas for Table2 I am getting very bad read performance, with these stats:
SSTable count: 33
Space used (live): 212702420
Space used (total): 212702420
Space used by snapshots (total): 262252347
SSTable Compression Ratio: 0.1686948750752438
Memtable cell count: 40240
Memtable data size: 24047027
Memtable switch count: 89
Local read count: 24027
Local read latency: 0.580 ms
Local write count: 1075147
Local write latency: 0.046 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 688
I was wondering why table2 is creating 33 SSTables and why is the read performance very low in it. Can anyone help me figure out what I am doing wrong here?
This is how I query the table :
BoundStatement selectStamt;
if (selectStamt == null) {
PreparedStatement prprdStmnt = session
.prepare("select * from table2 where clusterKey1 = ? and partkey1=? and partkey2=?");
selectStamt = new BoundStatement(prprdStmnt);
}
synchronized (selectStamt) {
res = session.execute(selectStamt.bind("clusterKey", "partkey1", "partkey2"));
}
In another thread, I am doing some update operations on this table, on different data, in the same way.
To measure throughput, I count the number of records processed per second, and it is processing only 50-80 records per second.
When you have a lot of SSTables, the distribution of your data among those SSTables is very important. Since you are using SizeTieredCompactionStrategy, SSTables get compacted and merged approximately when there are 4 SSTables the same size.
If you are updating data within the same partition frequently and at different times, it's likely your data is spread across many SSTables which is going to degrade performance as there will be multiple reads of your SSTables.
In my opinion, the best way to confirm this is to execute cfhistograms on your table:
nodetool -h localhost cfhistograms keyspace table2
Depending on the version of cassandra you have installed, the output will be different, but it will include a histogram of number of SSTables read for a given read operation.
If you are updating data within the same partition frequently and at different times, you could consider using LeveledCompactionStrategy (When to use Leveled Compaction). LCS will keep data from the same partition together in the same SSTable within a level which greatly improves read performance, at the cost of more Disk I/O doing compaction. In my experience, the extra compaction disk I/O more than pays off in read performance if you have a high ratio of reads to writes.
EDIT: With regards to your question about your throughput concerns, there are a number of things that are limiting your throughput.
A possible big issue is that unless you have many threads making that same query at a time, you are making your requests serially (one at a time). This severely limits your throughput, because another request cannot be sent until you get a response from Cassandra. Also, since you are synchronizing on selectStamt, even if this code were executed by multiple threads, only one request could be in flight at a time anyway. You can dramatically improve throughput by having multiple worker threads make the requests for you (if you aren't already doing this), or even better, use executeAsync to execute many requests asynchronously. See Asynchronous queries with the Java driver for an explanation of how the request flow works in the driver and how to use the driver effectively to make many queries.
If you are executing this same code each time you make a query, you are creating an extra round trip by calling session.prepare each time to create your PreparedStatement. session.prepare sends a request to Cassandra to prepare your statement. You only need to do this once, and you can reuse the PreparedStatement each time you make a query. You may be doing this already, given your null check on the statement (can't tell without more code).
Instead of reusing selectStamt and synchronizing on it, just create a new BoundStatement from the single PreparedStatement each time you make a query. This way no synchronization is needed at all.
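The shape of that advice (prepare once, bind a fresh statement per request, keep many requests in flight) can be shown without a cluster. The sketch below is Python rather than Java, and everything in it is a stand-in: `FakePreparedStatement` and `fake_execute` are hypothetical substitutes for the driver's prepared statement and session.execute, the sleep only mimics a network round trip, and the query text is illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakePreparedStatement:
    """Hypothetical stand-in for a statement that is prepared once at startup."""

    def __init__(self, query):
        self.query = query

    def bind(self, *values):
        # A fresh, independent bound object per request: nothing shared, no lock.
        return (self.query, values)


def fake_execute(bound):
    """Hypothetical stand-in for session.execute; sleeps to mimic a round trip."""
    time.sleep(0.01)
    _query, values = bound
    return values


# Prepared exactly once and then reused by every request.
PREPARED = FakePreparedStatement(
    "select * from table2 where partkey1 = ? and partkey2 = ?"
)


def run_sequential(keys):
    # One request at a time: total time is roughly len(keys) round trips.
    return [fake_execute(PREPARED.bind(*key)) for key in keys]


def run_concurrent(keys, workers=8):
    # Many requests in flight at once; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda key: fake_execute(PREPARED.bind(*key)), keys))


keys = [("p1-%d" % i, "p2-%d" % i) for i in range(16)]
sequential_results = run_sequential(keys)
concurrent_results = run_concurrent(keys)
```

Both functions return the same results; the concurrent version simply overlaps the waiting, which is the same effect executeAsync gives you in the Java driver without needing a thread per request.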
Aside from switching compaction strategies (this is expensive, you will compact hard for a while after the change) which as Andy suggests will certainly help your read performance, you can also tune your current compaction strategy to try to get rid of some of the fragmentation:
1. If you have pending compactions (nodetool compactionstats), try to catch up by raising the compaction throughput limit. Keep concurrent compactors to 1/2 of your CPU cores to keep compaction from hogging all your cores.
2. Increase the bucket size (increase bucket_high, decrease bucket_low). This dictates how similar SSTables have to be in size to be compacted together.
3. Lower the compaction threshold. This dictates how many SSTables must fit in a bucket before compaction occurs.
For details on 2 and 3, check out the compaction subproperties.
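To see how the bucket bounds and the threshold interact, here is a rough Python model of size-tiered bucketing. It is a simplification of what SizeTieredCompactionStrategy does (for instance it ignores min_sstable_size), the function names are hypothetical, and the SSTable sizes are invented. The defaults used (bucket_low 0.5, bucket_high 1.5, min_threshold 4) are the documented ones.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes so each one lies within [low, high] x its bucket's average."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            # No existing bucket is similar enough in size: start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4, **bucket_options):
    """Buckets holding at least min_threshold SSTables are eligible to compact."""
    return [
        bucket
        for bucket in size_tiered_buckets(sizes, **bucket_options)
        if len(bucket) >= min_threshold
    ]


sstable_sizes_mb = [10, 11, 12, 20, 22, 40, 45, 100]  # hypothetical sizes

with_defaults = compaction_candidates(sstable_sizes_mb)
lower_threshold = compaction_candidates(sstable_sizes_mb, min_threshold=2)
wider_buckets = compaction_candidates(
    sstable_sizes_mb, bucket_low=0.25, bucket_high=2.5
)
```

With the defaults none of these SSTables would be compacted, because no bucket reaches four members; either lowering the threshold or widening the buckets makes some of them eligible.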
Note: do not use nodetool compact. This will put the whole table in one huge SSTable and you'll lose the benefits of compacting slices at a time.
In case of emergencies use JMX --> force user defined compaction to force minor compactions
You have many SSTables and slow reads. The first thing you should do is find out how many SSTables are read per SELECT.
The easiest way is to inspect the corresponding MBean: in the MBean domain "org.apache.cassandra.metrics" you will find your keyspace, below it your table, and then the SSTablesPerReadHistogram MBean. Cassandra records min, max, mean and also percentiles.
A very good value for the 99th percentile in SSTablesPerReadHistogram is 1, which means you normally read from only a single SSTable. If the number is about as high as the number of SSTables, Cassandra is inspecting all SSTables. In the latter case you should double-check your SELECT to see whether you are selecting on the whole primary key or not.
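If you export the per-read SSTable counts yourself, the comparison described above is easy to automate. This Python sketch uses a plain nearest-rank percentile; the function names, the sample data, and the 0.9 cut-off for "about as high as the number of SSTables" are all assumptions made for illustration, not values Cassandra defines.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def reads_look_fragmented(sstables_per_read, sstable_count, pct=99, cutoff=0.9):
    """Flag the table when the chosen percentile is close to the SSTable count."""
    return percentile(sstables_per_read, pct) >= cutoff * sstable_count


# Hypothetical samples of "SSTables touched per read" for a table with 33 SSTables.
healthy_reads = [1] * 99 + [2]
fragmented_reads = [30] * 50 + [33] * 50

healthy_flag = reads_look_fragmented(healthy_reads, sstable_count=33)
fragmented_flag = reads_look_fragmented(fragmented_reads, sstable_count=33)
```

Here the first sample set has a 99th percentile of 1 and is not flagged, while the second has a 99th percentile equal to the SSTable count and is.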