Cassandra clustering columns and performance

So I have a materialized view defined as follows (changed the name a bit):
CREATE MATERIALIZED VIEW MYVIEW AS
SELECT *
FROM XXXX
WHERE id IS NOT NULL AND process_on_date_time IS NOT NULL AND poller_processing_status IS NOT NULL
PRIMARY KEY (poller_processing_status, process_on_date_time, id)
WITH CLUSTERING ORDER BY (process_on_date_time ASC, id ASC)
...
Based on the definition, the data should be sorted by the PROCESS_ON_DATE_TIME column in ASC order (oldest first).
Now I have a query that runs as follows:
SELECT JSON * FROM MYVIEW
WHERE poller_processing_status='PENDING'
AND process_on_date_time<=1548775105000 LIMIT 250 ;
There are over 250 rows that match, so 250 are returned. Running this from CQLSH, it fetches 3 times: the first two fetches return 100 rows each and the last one returns 50. I enabled tracing via CQLSH with consistency LOCAL_ONE. The first two fetches do EXACTLY the same steps, in the same order, against the same sstables, yet one of them always takes 2 SECONDS and the other 8 ms. I can't for the life of me figure out why one takes over 2 seconds. The slow one stalls on "Merged data from memtables and 3 sstables" - but again, the two fetches do exactly the same thing and merge the same sstables, yet one is slow and the next is fast.
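For what it's worth, the per-page timings can also be pulled programmatically. Here is a minimal sketch with the Python driver - the contact point and keyspace are my assumptions, not from the post - that prints the trace duration of each 100-row page:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Assumed contact point and keyspace, for illustration only.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

stmt = SimpleStatement(
    "SELECT JSON * FROM MYVIEW "
    "WHERE poller_processing_status='PENDING' "
    "AND process_on_date_time <= 1548775105000 LIMIT 250",
    fetch_size=100,                           # mirror cqlsh's 100-row pages
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)

rs = session.execute(stmt, trace=True)
page = 1
while True:
    trace = rs.get_query_trace()              # trace for the page just fetched
    if trace:
        print("page", page, "took", trace.duration)
    if not rs.has_more_pages:
        break
    rs.fetch_next_page()                      # same fetch boundary cqlsh shows
    page += 1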
Move on to step 2. I thought, ok, we have a clustering column so it's sorted. What if I add an ORDER BY clause to sort the results? So I ran this:
SELECT JSON * FROM MYVIEW
WHERE poller_processing_status='PENDING'
AND process_on_date_time<=1548775105000
ORDER BY process_on_date_time ASC LIMIT 250 ;
So it's basically the exact same query, just explicitly requesting the same order the clustering columns already impose. Results? Same - over 2 seconds to complete. No improvement. Bummer.
Now one last test. I changed the sort in my query from ASC to DESC and gave it a try. Results? Every time, the query responds in less than 10 milliseconds. Huh???? This is what I'm trying to achieve - but why does the REVERSE sort run well?
I would figure that since I'm asking for records OLDER than something, an ASC sort would be best because the oldest records would be first and immediately grabbed. With DESC, the newest records come first, so I would have to skip over a bunch of rows to find the first record older than my timestamp and then grab the next 250. Can anyone help me with this logic? Why does a DESC ORDER BY respond well while the ASC does not?
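A rough timing harness for the two variants might look like this (Python driver; the contact point and keyspace are assumptions):

import time
from cassandra.cluster import Cluster

# Assumed contact point and keyspace, for illustration only.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

query = ("SELECT JSON * FROM MYVIEW "
         "WHERE poller_processing_status='PENDING' "
         "AND process_on_date_time <= 1548775105000 "
         "ORDER BY process_on_date_time {} LIMIT 250")

for direction in ("ASC", "DESC"):
    start = time.perf_counter()
    rows = list(session.execute(query.format(direction)))
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(direction, len(rows), "rows in", round(elapsed_ms, 1), "ms")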
Using DSE 5.1.9
Thanks in advance.
-Jim

Related

Cassandra data model for high ingestion rate and delete operation

I am using the following Cassandra data model
ruleid - bigint
patternid - bigint
key - string
value - string
time - timestamp
event_uuid - time-based uuid
partition key - ruleid, patternid
clustering key - event_uuid, order by descending
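For concreteness, that model might be rendered in CQL roughly like this (the table and keyspace names are assumptions, not given in the question):

from cassandra.cluster import Cluster

# Assumed contact point and keyspace; the table name is hypothetical.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_pattern (
        ruleid     bigint,
        patternid  bigint,
        "key"      text,
        "value"    text,
        time       timestamp,
        event_uuid timeuuid,
        PRIMARY KEY ((ruleid, patternid), event_uuid)
    ) WITH CLUSTERING ORDER BY (event_uuid DESC)
""")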
Our ingestion rate is around 100 records per second per pattern id, and there might be 10,000+ pattern ids.
Our query is fairly straightforward: we query the last 100,000 records, ordered by the descending uuid and filtered by the partition key.
Also, for our use case we would need to perform around 5 deletes per second per pattern id on this table.
However, this leads to so-called tombstones and causes read timeouts when querying the datastore again.
How to overcome the above issue?
It sounds like you are storing records into the table, doing some transformation/processing on the records, then deleting them.
But since you're deleting rows within partitions (instead of the partitions themselves), you have to iterate over the deleted rows (tombstones) to get to the live records.
The real problem, though, is reading too many rows, which won't perform well. Retrieving 100K rows is going to be slow, so consider paging through the result set.
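A minimal paging sketch with the Python driver (the contact point, keyspace, and the events_by_pattern table name are assumptions carried over from the hypothetical CQL above) - pull the rows in moderate pages instead of asking for 100K at once:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Assumed contact point/keyspace; events_by_pattern is the hypothetical table above.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

stmt = SimpleStatement(
    "SELECT * FROM events_by_pattern WHERE ruleid = %s AND patternid = %s",
    fetch_size=5000,                  # pages of 5000 rows instead of one 100K read
)

processed = 0
for row in session.execute(stmt, (42, 7)):    # iteration fetches pages lazily
    processed += 1                            # transform/process the row here
    if processed >= 100000:
        break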
With the limited information you've provided, this is not an easy problem to solve. Cheers!

Most efficient way to get first N rows matching some criterion on ordinary (not clustering) columns

I want to return the first N rows from a Cassandra database filtering on some criterion, where the filtering is done on ordinary (not clustering) columns.
Let's assume a simple table like this:
CREATE TABLE test(
id UUID,
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC)
Option 1
SELECT timestamp, value FROM test WHERE id=? AND value<? LIMIT ? ALLOW FILTERING
This is allowed, but ALLOW FILTERING is generally to be avoided. Having said that, is it really that bad if the query touches only the one partition?
Option 2
Set a very small paging size, e.g. N*10 (say) and then:
SELECT timestamp, value FROM test WHERE id=?
Read the results a page at a time, and stop reading as soon as sufficient suitable rows have been read. Is there any cost associated with the pages that have not yet been fetched? If not, I'd guess this is the clear winner.
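A sketch of what Option 2 looks like with the Python driver (the id and threshold values are placeholders): pages are requested on demand as iteration exhausts the current one, so breaking early means the remaining pages are never fetched.

import uuid
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Assumed contact point/keyspace; id and threshold are placeholder values.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")
N = 50
some_id = uuid.uuid4()
threshold = 10.0

stmt = SimpleStatement(
    "SELECT timestamp, value FROM test WHERE id = %s",
    fetch_size=N * 10,
)

matches = []
for row in session.execute(stmt, (some_id,)):   # next page fetched only when needed
    if row.value < threshold:
        matches.append(row)
        if len(matches) >= N:
            break                               # later pages are never requested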
Option 3
Default paging, LIMIT the number of results to N*10, and issue a new query if insufficient suitable rows are returned:
SELECT timestamp, value FROM test WHERE id=? AND timestamp>? LIMIT ?
If there are insufficient suitable rows in the results, issue a new query starting just after the previous query result's last timestamp.
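And a sketch of Option 3 under the same placeholder assumptions: bounded queries, re-issued from just after the last timestamp seen, until enough suitable rows have been collected.

import datetime
import uuid
from cassandra.cluster import Cluster

# Assumed contact point/keyspace; id and threshold are placeholder values.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")
N = 50
some_id = uuid.uuid4()
threshold = 10.0

matches = []
last_ts = datetime.datetime(1970, 1, 1)        # start of the partition

while len(matches) < N:
    rows = list(session.execute(
        "SELECT timestamp, value FROM test "
        "WHERE id = %s AND timestamp > %s LIMIT %s",
        (some_id, last_ts, N * 10),
    ))
    if not rows:
        break                                   # partition exhausted
    last_ts = rows[-1].timestamp                # resume just after this point
    matches.extend(r for r in rows if r.value < threshold)

matches = matches[:N]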
I'd like to know what is likely to be the best option.
I did some rough-and-ready benchmarking. To my surprise, I found that the ALLOW FILTERING option was orders of magnitude faster, at least in my test scenario. The other two options were heavily dependent on the LIMIT or page size, with a smaller LIMIT/page performing very much worse.
If the first suitable row is found in the first page/first query result then the three options are not far off comparable, but ALLOW FILTERING is still fastest.
The biggest surprise to me was that paging through results of a single large query performs little better than serial execution (i.e. non-concurrent) of multiple small queries. Could it be that each time the driver requests the next page of results, Cassandra in effect executes a new query for that page?
Clearly, these conclusions are heavily biased by the dataset being queried. However, the superiority of ALLOW FILTERING was so stark that I'd make the working assumption that this will be applicable in almost all cases.

Total row count in Cassandra

I totally understand that count(*) from table where partitionId = 'test' will return the count of the rows. I can see that it takes about the same time as select * from table where partitionId = 'test'.
Is there any other alternative in Cassandra to retrieve the count of the rows in an efficient way?
You can compare the results of select * and select count(*) if you run cqlsh and enable tracing there with the tracing on command - it will print the time required to execute the corresponding command. The difference between the two queries is only in how much data is returned.
But either way, to find the number of rows Cassandra needs to hit the SSTable(s) and scan the entries - performance can differ if the partition is spread across multiple SSTables; this may depend on the compaction strategy of the table, which is selected based on your read/write patterns.
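A small sketch of the comparison described above, using the Python driver's tracing instead of cqlsh (keyspace, table and partition value are placeholders):

import datetime
from cassandra.cluster import Cluster

# Assumed contact point, keyspace, table and partition key value.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

for query in (
    "SELECT count(*) FROM my_table WHERE partitionId = 'test'",
    "SELECT * FROM my_table WHERE partitionId = 'test'",
):
    rs = session.execute(query, trace=True)
    list(rs)                                            # drain all pages
    traces = rs.get_all_query_traces()
    total = sum((t.duration for t in traces if t.duration),
                datetime.timedelta())
    print(query, "->", total)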
As Alex Ott mentioned, the COUNT(*) needs to go through the entire partition to know that total.
The fact is that Cassandra wants to avoid locks, and as a result it does not maintain a row count in its sstables; each time you do an INSERT, UPDATE, or DELETE, you may actually overwrite another entry, which just gets marked with a tombstone (i.e. it's not an in-place overwrite - it saves the new data at the end of the sstable and marks the old data as dead).
The COUNT(*) will go through the sstables and count all the entries not marked as tombstones. That's very costly. We're used to SQL keeping the total number of rows of a table or an index, so COUNT(*) on those is instantaneous... not here.
One solution I've used is to have Elasticsearch installed on your Cassandra cluster. One of the statistics Elasticsearch keeps is the number of rows in a table. I don't remember the exact query, but more or less you can just issue a count request and you get a result in about 100ms, always, whatever the number is - even in the tens of millions of rows. Just as with SELECT COUNT(*), the result will always be an approximation if you have many writes happening in parallel. It will stabilize if the writes stop for long enough (possibly about 1 or 2 seconds).
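The exact query isn't given above; as an illustration, the standard Elasticsearch _count API does this kind of lookup (the index name here is hypothetical):

import requests

# Hypothetical index name; _count returns {"count": <n>, ...} almost instantly.
resp = requests.get("http://localhost:9200/my_cassandra_index/_count")
print(resp.json()["count"])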

Why does aggregating paginated queries take less time than fetching the entire table

I have a table in my database, indexed over three columns: PropertyId, ConceptId and Sequence. This particular table has about 90,000 rows in it.
Now, when I run this query, the total time required is greater than 2 minutes:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
However, if I paginate the query like so:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the aggregate time (x goes from 0 to 8) required is only around 20 seconds.
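Roughly, the paginated loop being timed looks like this (pyodbc and the connection string are my assumptions; the table and ORDER BY are as above):

import time
import pyodbc

# Assumed ODBC connection details, for illustration only.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
cursor = conn.cursor()

start = time.perf_counter()
for x in range(9):                               # x goes from 0 to 8
    cursor.execute(
        "SELECT * FROM MSC_NPV "
        "ORDER BY PropertyId, ConceptId, Sequence "
        "OFFSET ? ROWS FETCH NEXT 10000 ROWS ONLY",
        x * 10000,
    )
    cursor.fetchall()
print("aggregate time:", round(time.perf_counter() - start, 1), "s")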
This seems counterintuitive to me because the pagination requires additional operations over and beyond simpler queries, and we're adding the extra latency of sequential network calls because I haven't parallelized this query at all. And I know it's not a caching issue, because running these queries one after the other does not affect the latencies very much.
So, my question is this: why is one so much faster than the other?
This seems counterintuitive to me because the pagination requires additional operations over and beyond simpler queries
Pagination queries sometimes work very fast, if you have the right index...
For example, with the below query
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the maximum number of rows you might read is 20,000. Below is an example which shows the same:
RunTimeCountersPerThread Thread="0" ActualRows="60" ActualRowsRead="60"
but with the select * query you are reading all the rows.
After a prolonged search into what's going on here, I discovered that the reason behind this difference in performance (> 2 minutes) was hosting the database on Azure. Since Azure partitions any tables you host on it across multiple partitions (i.e. multiple machines), running a query like:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
would run more slowly because the query pulls data in from all the partitions before ordering it, which could result in multiple queries across multiple partitions of the same table. By paginating the query over indexed properties I was looking at a particular partition and querying the table stored there, which is why it performed significantly better than the un-paginated query.
To prove this, I ran another query:
SELECT *
FROM MSC_NPV
ORDER BY Narrative
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
This query ran anemically compared to the first paginated query because Narrative is not a primary key and is therefore not used by Azure to build a partition key. So ordering on Narrative required the same operation as the first query, plus additional operations on top, because the entire table had to be fetched beforehand.

High number of tombstones with TTL columns in Cassandra

I have a cassandra Column Family, or CQL table with the following schema:
CREATE TABLE user_actions (
company_id varchar,
employee_id varchar,
inserted_at timeuuid,
action_type varchar,
PRIMARY KEY ((company_id, employee_id), inserted_at)
) WITH CLUSTERING ORDER BY (inserted_at DESC);
Basically a composite partition key that is made up of a company ID and an employee ID, and a clustering column, representing the insertion time, that is used to order the columns in reverse chronological order (newest actions are at the beginning of the row).
Here's what an insert looks like:
INSERT INTO user_actions (company_id, employee_id, inserted_at, action_type)
VALUES ('acme', 'xyz', now(), 'started_project')
USING TTL 1209600; // two weeks
Nothing special here, except the TTL which is set to expire in two weeks.
The read path is also quite simple - we always want the latest 100 actions, so it looks like this:
SELECT action_type FROM user_actions
WHERE company_id = 'acme' and employee_id = 'xyz'
LIMIT 100;
The issue: I would expect that since we order in reverse chronological order, and the TTL is always the same number of seconds at insertion, such a query should not scan through any tombstones - all "dead" columns are at the tail of the row, not the head. But in practice we see many warnings in the log in the following format:
WARN [ReadStage:60452] 2014-09-08 09:48:51,259 SliceQueryFilter.java (line 225) Read 40 live and 1164 tombstoned cells in profiles.user_actions (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=1410169639669000, localDeletion=1410169639}
and on rare occasions the tombstone number is large enough to abort the query completely.
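For reference, the same live/tombstone counts can be surfaced per query from a trace rather than the server log; a minimal sketch (contact point assumed, keyspace taken from the warning above):

from cassandra.cluster import Cluster

# Assumed contact point; "profiles" is the keyspace named in the warning above.
session = Cluster(["127.0.0.1"]).connect("profiles")

rs = session.execute(
    "SELECT action_type FROM user_actions "
    "WHERE company_id = 'acme' AND employee_id = 'xyz' LIMIT 100",
    trace=True,
)
list(rs)                                          # make sure the read completes
trace = rs.get_query_trace(max_wait_sec=5)
for event in trace.events:
    if event.description and "tombstone" in event.description.lower():
        print(event.source_elapsed, event.description)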
Since I see this type of schema design being advocated quite often, I wonder if I'm doing something wrong here?
Your SELECT statement is not giving an explicit sort order and is hence defaulting to ASC (even though your clustering order is DESC).
So if you change your query to:
SELECT action_type FROM user_actions
WHERE company_id = 'acme' and employee_id = 'xyz'
ORDER BY inserted_at DESC
LIMIT 100;
you should be fine.
Perhaps data is reappearing because a node fails, gc_grace_seconds expires while it is down, the node comes back into the cluster, and Cassandra can't replay/repair the deletes because the tombstones were already dropped after gc_grace_seconds: http://www.datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html
The 2.1 incremental repair sounds like it might be right for you: http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html
