Slow range queries in Cassandra

I am working on a single node. I have the following table to store a list of documents:
CREATE TABLE my_keyspace.document (
status text,
date timestamp,
doc_id text,
raw_content text,
title text,
PRIMARY KEY (status, date, doc_id)
) WITH CLUSTERING ORDER BY (date ASC, doc_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX doc_id_idx ON my_keyspace.document (doc_id);
I am doing a lot of queries like:
SELECT * FROM my_keyspace.document WHERE status='PROCESSED' AND date>=start_date AND date<=end_date;
For some reason it is very slow. At first, the warnings I got were these:
[2016-07-26 18:10:46] {cassandra.protocol:378} WARNING - Server warning: Read 5000 live rows and 19999 tombstone cells for query SELECT * FROM my_keyspace.document WHERE token(status) >= token(PROCESSED) AND token(status) <= token(PROCESSED) AND date >= 2016-07-08 02:00+0200 AND date <= 2016-07-23 01:59+0200 LIMIT 5000 (see tombstone_warn_threshold)
[2016-07-26 18:10:52] {cassandra.protocol:378} WARNING - Server warning: Read 5000 live rows and 19999 tombstone cells for query SELECT * FROM my_keyspace.document WHERE token(status) >= token(PROCESSED) AND token(status) <= token(PROCESSED) AND date >= 2016-07-08 02:00+0200 AND date <= 2016-07-23 01:59+0200 LIMIT 5000 (see tombstone_warn_threshold)
Thinking the issue was linked to having too many tombstones, I did:
ALTER TABLE my_keyspace.document WITH gc_grace_seconds = 0;
and then:
nodetool compact my_keyspace document
Now I don't get any warnings, but the queries are still very slow and often time out. No message concerning the timeout is displayed in any log. I have roughly 200k documents, distributed over a 20-day period, with about 4500 documents having status='PROCESSED' each day. Query response time varies with the date range: about 3 seconds for a one-day range, 15 seconds for 4 days, and a timeout for 2 weeks. I have also disabled swap. The Cassandra version I am using is 3.5.
Recently I've noticed that selecting specific columns instead of * improves the response time a bit, but the system is still too slow.
EDIT: Computing partition size as proposed by Reveka
So, following the formula:
Number of rows = 20 * 4500 = 90,000
Number of columns = 19
Number of primary keys = 3
Number of static column = 0
So the number of values is 90,000 * (19 - 3) = 1,440,000.
For the size of the partition on disk, I arrived at an estimate of about 1.2 GB.
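For reference, the general sizing formula from the course (written the same way as in the cql-calculator answer further down this page) is:
Number of Values = Nr * (Nc - Npk - Ns) + Ns
which with the numbers above gives 90,000 * (19 - 3 - 0) + 0 = 1,440,000.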
This might be a bit big. But how can I modify my partition key so that I can still run the same range queries while having smaller partitions? I could use a composite partition key containing the status and the day extracted from the date, but wouldn't I then have to specify the day before being able to query by range:
SELECT * FROM my_keyspace.document WHERE status='PROCESSED' AND day='someday' AND date>='start_date' AND date<='end_date';
Which forces me to do one query per day.
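Something like the following is what I have in mind (the new table name and the day column are made up for illustration):
CREATE TABLE my_keyspace.document_by_day (
status text,
day date,
date timestamp,
doc_id text,
raw_content text,
title text,
PRIMARY KEY ((status, day), date, doc_id)
) WITH CLUSTERING ORDER BY (date ASC, doc_id ASC);
At roughly 4500 PROCESSED documents per day, each (status, day) partition would then hold one day's worth of documents instead of the whole 20-day history.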

I see that your primary key consists of status, date and doc_id, and you only use status as your partition key. That means that all the documents with the same status, regardless of date, will be put in the same partition. I guess that is a lot of information for one partition. Cassandra works well with partitions that are around 100 MB (or a couple of hundred MB in later versions), see here. The DataStax DS220 course (it is free, you just need to create an account) has a video that shows you how to calculate your partition size. You can post the results of your analysis so we can further help you. :)
EDIT: After the size analysis
You will have to partition by date in order to get smaller partitions. That means that you will no longer be able to query a date range in a single query. A workaround is to do multiple queries based on the range you want. For example: if you want to query the range 12 August to 14 August, you split by day and do three queries, one for 12 August, one for 13 and one for 14. Again though, if your range is big you will end up retrieving GBs of data. I do not know your use case, but I am going to guess that you don't need GBs' worth of documents every time you do a date range query. Can you give me more info on your use case (a.k.a. what do you want to do)?
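For instance, with a (status, day) composite partition key like the one sketched in the question's edit, a 12-14 August range becomes three queries (a sketch; the day column and the year are assumed):
SELECT * FROM my_keyspace.document_by_day WHERE status='PROCESSED' AND day='2016-08-12';
SELECT * FROM my_keyspace.document_by_day WHERE status='PROCESSED' AND day='2016-08-13';
SELECT * FROM my_keyspace.document_by_day WHERE status='PROCESSED' AND day='2016-08-14';
You can issue these in parallel from the client and merge the results.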
PS: I can't write comments yet, so I can only advise you through this answer.

Related

Why do I sometimes have 10,000+ tombstones when I don't do DELETEs?

When doing a repair on a Cassandra node, I sometimes see a lot of tombstone logs. The error looks like this:
org.apache.cassandra.db.filter.TombstoneOverwhelmingException: Scanned over 100001 tombstone rows during query 'SELECT * FROM my_keyspace.table_foo WHERE token(<my params>) >= token(<my params>) AND token(<my params>) <= 2988334221698479200 LIMIT 2147385647' (last scanned row partition key was ((<my params>), 7c650d21-797e-4476-93d5-b1248e187f22)); query aborted
I have read here that tombstones are inserted as a way to mark a record as deleted. However, I don't see any code in this project that runs a delete on this table - just a read and an insert. What am I missing - how can I prevent these TombstoneOverwhelmingExceptions?
Here is the table definition:
CREATE TABLE my_keyspace.table_foo(
foo1 text,
year int,
month int,
foo2 text,
PRIMARY KEY ((foo1, year, month), foo2)
) WITH CLUSTERING ORDER BY (foo2 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 6912000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99PERCENTILE';
However, I don't see any code in this project that runs a delete on this table - just a read and an insert.
The code might not be running DELETEs, but the table definition tells Cassandra to delete anything >= 80 days old. TTLs create tombstones.
AND default_time_to_live = 6912000
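In other words, with that setting every write to the table behaves roughly as if it carried an explicit TTL (illustrative values, using the columns from the table above):
INSERT INTO my_keyspace.table_foo (foo1, year, month, foo2)
VALUES ('some-foo1', 2017, 2, 'some-foo2')
USING TTL 6912000; -- 80 days; once the TTL expires, the cell becomes a tombstone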
So the thought behind TTLs in a time-series model is that rows are typically ordered by timestamp in descending order. What ends up happening is that most use cases tend to care only about recent data, and the descending order by timestamp causes the tombstones to end up at the "bottom" of the partition, where they are rarely (if ever) queried.
To create that effect, you'd need to create a new table with a definition something like this:
PRIMARY KEY ((foo1, year, month), created_time, foo2)
) WITH CLUSTERING ORDER BY (created_time DESC, foo2 ASC)
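Spelled out as a full table, that could look roughly like this (a sketch; the table name and the created_time column name are placeholders, and you would keep whatever other columns and options you need):
CREATE TABLE my_keyspace.table_foo_by_time (
foo1 text,
year int,
month int,
created_time timestamp,
foo2 text,
PRIMARY KEY ((foo1, year, month), created_time, foo2)
) WITH CLUSTERING ORDER BY (created_time DESC, foo2 ASC)
AND default_time_to_live = 6912000;
With created_time DESC, recent rows sit at the top of each partition and expired (tombstoned) rows age out toward the bottom.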
@anthony, here is my point of view.
As a first step, don't let tombstones get inserted into the table in the first place.
Use the full primary key during the read path so that reads skip the tombstones. Data modeling is key: design the tables around the access patterns required on the read side.
We could adjust min_threshold and set it to 2 to do more aggressive tombstone eviction.
Similarly, we could tweak common compaction options (e.g. set unchecked_tombstone_compaction to true, or other properties/options) to evict them faster, as in the sketch below.
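As a concrete illustration of those two compaction tweaks (a sketch; aggressive settings trade extra compaction I/O for faster tombstone eviction):
ALTER TABLE my_keyspace.table_foo
WITH compaction = {
'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'min_threshold': '2',
'unchecked_tombstone_compaction': 'true'
};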
I would encourage you to view a similar question and the answers that are documented here

Cassandra UUID partition key and partition size

Given a table
CREATE TABLE sensors_by_id (
id uuid,
time timeuuid,
some_text text,
PRIMARY KEY (id, time)
)
Will this scale when there are a lot of entries? I'm not sure if a UUID field is sufficient as a good partition key, or if there is a need to create some artificial key like week_first_day or something similar.
It really depends on how you insert your data - if you generate the UUID randomly for every insert, then the chance of duplicates is very low, and you'll get so-called "skinny rows" (a lot of partitions with 1 row inside). Even if you start to get duplicates, there will not be many rows per partition...
Partition size could be a problem, because Cassandra has a practical limit on the disk size of a single partition.
Good rule of thumb is to keep the maximum number of rows below 100,000 items and the disk size under 100 MB.
It is easy to calculate partition size by using that formula
You can read more about data modeling here.
So in your case, with the current schema, for 1,000,000 rows per partition and an average size of 100 bytes for the some_text column, it will be:
Number of Values: (1000000 * (3 - 2 - 0) + 0) = 1000000
Partition Size on Disk: (16 + 0 + (1000000 * 116) + (8 * 1000000))
= 124,000,016 bytes (118.26 MB)
So as you can see, you are over the limit with 118.26 MB per partition, so you need to optimize your partition key.
I calculated it using my open source project - cql-calculator.
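If you do end up with that many rows per id, one way to get back under the limit is the artificial bucket the question already mentions, e.g. a week_first_day column added to the partition key (a sketch, not the only possible bucketing):
CREATE TABLE sensors_by_id_and_week (
id uuid,
week_first_day date,
time timeuuid,
some_text text,
PRIMARY KEY ((id, week_first_day), time)
);
Queries then need to supply both id and week_first_day, one bucket at a time.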

Insert query replaces rows having same data field in Cassandra clustering column

I'm learning Cassandra, starting off with v3.8. My sample keyspace/table looks like this:
CREATE TABLE digital.usage (
provider decimal,
deviceid text,
date text,
hours varint,
app text,
flat text,
usage decimal,
PRIMARY KEY ((provider, deviceid), date, hours)
) WITH CLUSTERING ORDER BY (date ASC, hours ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I am using a composite PRIMARY KEY with provider and deviceid as the partition key, so that uniqueness and distribution are handled across the cluster nodes. The clustering keys are date and hours.
I have a few observations:
1) For PRIMARY KEY((provider, deviceid), date, hours), when inserting multiple entries with the same hours value, only the latest is kept and the previous ones disappear.
2) For PRIMARY KEY((provider, deviceid), date), when inserting multiple entries with the same date value, only the latest is kept and the previous ones disappear.
Though I'm happy with the above behaviour (point 1), I want to know what's happening in the background. Do I need to understand more about the clustering keys?
PRIMARY KEY is meant to be unique.
Most RDBMSs throw an error if you insert a duplicate value into a PRIMARY KEY.
Cassandra does not do a read before write. It simply creates a new version of the record with the latest timestamp. When you insert data with the same values for the primary key columns, a new record is created with the latest timestamp, and when querying (SELECT), only the record with the latest timestamp is returned.
Example:
PRIMARY KEY((provider, deviceid), date, hours)
INSERT INTO digital.usage(provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test', 'test');
---- This will create a new record with, let's say, timestamp 1
INSERT INTO digital.usage(provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test1', 'test1');
---- This will create a new record with, let's say, timestamp 2
SELECT app,flat FROM digital.usage WHERE provider=1.0 AND deviceid='a' AND date='2017-07-27' AND hours=1
Will give:
--------------
| app | flat |
|-----|------|
|test1|test1 |
--------------
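If you want to see which write "won", you can also ask for the write timestamp of a non-key column with the standard WRITETIME function (just a quick check, not required for normal use):
SELECT app, flat, WRITETIME(app) FROM digital.usage WHERE provider=1.0 AND deviceid='a' AND date='2017-07-27' AND hours=1;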

Getting Cassandra tombstone_warn_threshold error

We have a Cassandra setup in production. There are a couple of tables with around 20M records in them. To reduce the number of records we deleted the unwanted records and have also set up a TTL to remove data after some time. We have now set the grace period to 1 day. We have also run nodetool repair on each Cassandra node (one at a time). We have a total of 5 nodes in the cluster with a replication_factor of 3. The Cassandra version is 2.1.14.
In Cassandra log I constantly see the below error:
WARN [SharedPool-Worker-33] 2017-02-23 06:09:02,617 SliceQueryFilter.java:320 - Read 207 live and 3059 tombstone cells in event for key: 101:10001Njh:22017 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
I ran the command nodetool cfhistograms mykeyspace event and below is the output of the same.
I am not able to fully analyze the above output, but I do know the sstable count is too high.
Any idea as to what we can do to fix this or optimize our Cassandra?
Java heap size is set to 8 GB and we are using CMS garbage collection.
Output of nodetool cfstats mykeyspace.event
Table Structure
@chris-lohfink - Updated the question with the cfstats details and the table structure.
CREATE TABLE mykeyspace.event (
v_id text,
c_id text,
e_month int,
sid text,
e_id timeuuid,
cr_p_id text,
e_bucket text,
e_media map<text, text>,
e_meta map<text, text>,
e_met map<text, double>,
tag set<text>,
etime timestamp,
etype text,
isfin boolean,
r_mode text,
state text,
PRIMARY KEY ((v_id, c_id, e_month), sid, e_id)
) WITH CLUSTERING ORDER BY (sid ASC, e_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX events_id_idx ON mykeyspace.event (e_id);
CREATE INDEX events_type_idx ON mykeyspace.event (etype);
CREATE INDEX events_finalized_idx ON mykeyspace.event (isfin);
CREATE INDEX idx_state ON mykeyspace.event (state);
When you delete data in Cassandra, the data is not removed immediately; instead, Cassandra creates tombstones indicating the row/column is deleted. Tombstones are kept until gc_grace_seconds has passed.
In your case you have 300K records deleted daily, which indicates that many tombstones are created, affecting your performance. You should work on your data model to avoid these errors.
See the slides from 34 to 42 about deletes and TTL in http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252
Also see the impact of data models on tombstones from below Cassandra docs:
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets

How to perform query with cassandra's timestamp column as WHERE condition

I have the following Cassandra table
cqlsh:mydb> describe table events;
CREATE TABLE mydb.events (
id uuid PRIMARY KEY,
country text,
insert_timestamp timestamp
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX country_index ON mydb.events (country);
CREATE INDEX insert_timestamp_index ON mydb.events (insert_timestamp);
As you can see, an index has already been created on the insert_timestamp column.
I had gone through https://stackoverflow.com/a/18698386/3238864
I thought the following was the correct query:
cqlsh:mydb> select * from events where insert_timestamp >= '2016-03-01 08:27:22+0000';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_timestamp >= <value>'"
cqlsh:mydb> select * from events where insert_timestamp >= '2016-03-01 08:27:22+0000' ALLOW FILTERING;
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_timestamp >= <value>'"
But a query with the country column as the WHERE condition does work:
cqlsh:mydb> select * from events where country = 'my';
id | country | insert_timestamp
--------------------------------------+---------+--------------------------
53167d6a-e125-46ff-bacf-f5b267de0258 | my | 2016-03-01 08:27:22+0000
Any idea why query with timestamp as condition doesn't work? Is there anything wrong with my query syntax?
Any idea why query with timestamp as condition doesn't work? Is there anything wrong with my query syntax?
The native Cassandra secondary index is limited to the = predicate. To enable inequality predicates you need to add ALLOW FILTERING, but it will perform a full cluster scan :-(
If you can afford to wait for a couple of weeks, Cassandra 3.4 will be released with the new SASI secondary index which is much more efficient for range queries: https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
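For reference, creating a SASI index looks roughly like this (check the linked SASI documentation for the exact options; SPARSE mode is the one aimed at columns like timestamps where every value matches only a few rows):
CREATE CUSTOM INDEX events_insert_timestamp_sasi ON mydb.events (insert_timestamp)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'SPARSE'};
With such an index, range predicates on insert_timestamp are served by the index instead of requiring ALLOW FILTERING.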
Indexes in Cassandra are quite different from indexes in a relational DB. One of the differences is that range queries on a Cassandra secondary index are not allowed at all. Usually, range queries only work with clustering keys (they could also work with partition keys if ByteOrderPartitioner is used, but that is not common), meaning you have to design your column families carefully for your potential query patterns. There are already many discussions on StackOverflow about this topic.
To understand when to use Cassandra's index (it is designed for quite specific cases) and its limitations, this is a good post:
Direct queries on secondary indices support only =, CONTAINS or CONTAINS KEY restrictions.
Secondary index queries allow you to restrict the returned results using the =, >, >=, <= and <, CONTAINS and CONTAINS KEY restrictions on non-indexed columns using filtering.
So your query will work once you add ALLOW FILTERING to it.
select * from events where insert_timestamp >= '2016-03-01 08:27:22+0000' ALLOW FILTERING;
The link that you mentioned in your question has the timestamp column as a clustering key. Hence it works there.
As per the comments, a range query on a secondary index is not allowed up to version 2.2.x.
FYI:
When Cassandra must perform a secondary index query, it will contact all the nodes to check the part of the secondary index located on each node.
Hence it is considered an anti-pattern in Cassandra to have an index on a high-cardinality column like a timestamp.
You should consider changing your data model to suit your queries.
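For example, one way to remodel for this access pattern (a sketch, assuming you mostly filter by country; in practice you would likely also add a time bucket such as a day or month to the partition key so partitions stay bounded):
CREATE TABLE mydb.events_by_country (
country text,
insert_timestamp timestamp,
id uuid,
PRIMARY KEY (country, insert_timestamp, id)
) WITH CLUSTERING ORDER BY (insert_timestamp DESC, id ASC);
SELECT * FROM mydb.events_by_country
WHERE country = 'my'
AND insert_timestamp >= '2016-03-01 08:27:22+0000';
Here insert_timestamp is a clustering column, so the range predicate is served within the partition without any secondary index or ALLOW FILTERING.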
Using the Cequel ORM:
now = DateTime.now
today = DateTime.new(now.year, now.month, now.day, 0, 0, 0, now.zone)
tomorrow = today + 1  # DateTime arithmetic counts days, so + 1 adds one day
MyObject.allow_filtering!.where("done_date" => today..tomorrow).select("*")
This has worked for me.
