Secondary indexes for low-cardinality columns in Cassandra

We have a table with 15 million records on a 10-node Cassandra cluster. One column has close to 20 distinct, repeating values. Is it advisable to build a secondary index on this column?

Assuming a completely uniform distribution on that column, each column value would map to 750,000 rows. The DataStax doc on When To Use An Index states that...
built-in indexes are best on a table having many rows that contain the indexed value.
750,000 rows certainly qualifies as "many." But even so, remember that you're also talking about 14,250,000 rows that Cassandra has to ignore when fulfilling your query.
Also, unless you have an RF of 10 (and I doubt that you would with 10 nodes), you are going to incur network time as Cassandra works across all of the different nodes required to fulfill your query. For 750,000 rows, that's probably going to time out.
The only way I think this could be efficient would be to first restrict your query by a partition key. Using the secondary index while also restricting by a partition key will help Cassandra find your rows more quickly. Even so, with a dataset that big, I would re-evaluate your data model and try to design a different table to fulfill that query without requiring a secondary index.
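The numbers in this answer can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch under the same assumption of a perfectly uniform distribution; the function name is made up for illustration.

```python
def rows_per_value(total_rows: int, distinct_values: int) -> int:
    """Rows matching one indexed value, assuming a perfectly uniform distribution."""
    return total_rows // distinct_values


total_rows = 15_000_000   # table size from the question
distinct_values = 20      # approximate cardinality of the indexed column

matching = rows_per_value(total_rows, distinct_values)
skipped = total_rows - matching

print(matching)  # 750000
print(skipped)   # 14250000
```

Real data is rarely uniform, so the hottest value may match far more than 750,000 rows.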

Related

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?
I have a table definition like this
CREATE TABLE domain(
domain_name text,
status int,
last_scanned_date bigint,
PRIMARY KEY(domain_name, last_scanned_date)
)
My requirement is to get all the domains which have not been scanned in the last 24 hours. I wrote the following query, but it is not efficient, as Cassandra tries to fetch the entire dataset because of ALLOW FILTERING
SELECT * FROM domain where last_scanned_date<=<last24hourstimeinmillis> ALLOW FILTERING;
Then I decided to do it in two queries
1st query:
SELECT DISTINCT domain_name from domain;
2nd query:
Use the IN operator to query domains which have not been scanned in the last 24 hours
SELECT * FROM domain where
domain_name IN('domain1','domain2')
AND
last_scanned_date<=<last24hourstimeinmillis>
My second approach works, but comes with an extra overhead of querying first for distinct values.
Is there any better approach than this?
You should update your table definition. Currently you are using the domain name as your partition key, and a single Cassandra partition cannot hold more than 2 billion cells.
I would suggest using time as part of your partition key. If you are not going to receive more than 2 billion requests per day, try using the day since epoch as the partition key. You can use composite partition keys, but they won't be helpful for your query.
When querying, you have to scan at most two partitions, with an additional filter in the query or in your application to drop results that do not belong to the range you have specified.
Go over the following concepts before finalizing your design.
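As a rough sketch of the day-since-epoch idea: the helpers below compute the day bucket for a millisecond timestamp and list the buckets that a time range touches. The function names are hypothetical, and the sketch assumes timestamps are epoch milliseconds in UTC.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def day_bucket(epoch_millis: int) -> int:
    """Day since epoch (UTC), usable as a partition key value."""
    return epoch_millis // MS_PER_DAY


def buckets_for_range(start_millis: int, end_millis: int) -> list:
    """All day buckets (partitions) that the inclusive time range touches."""
    return list(range(day_bucket(start_millis), day_bucket(end_millis) + 1))


now = 1_700_000_000_000            # example timestamp, not taken from the question
day_ago = now - MS_PER_DAY

print(buckets_for_range(day_ago, now))  # [19674, 19675]
```

A 24-hour window touches two day buckets, so the application issues one query per bucket and filters out rows that fall outside the exact range.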
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html
https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
Cassandra can efficiently perform range queries only inside one partition. The same applies to aggregations, such as DISTINCT. So in your case you would need a single partition containing all the data. But that is bad design.
You may try to split this big partition into smaller ones by using TLDs as separate partition keys, and fetch from every partition in parallel - but this will also lead to imbalance, as some TLDs will have more sites than others.
Another issue with your schema is that you have last_scanned_date as a clustering column, which means that when you update last_scanned_date you're effectively inserting a new row into the database - you'll need to explicitly remove the row for the previous last_scanned_date, otherwise the query last_scanned_date<=<last24hourstimeinmillis> will always fetch old rows that you already scanned.
Your problem with the current design could partially be solved by using Spark, which is able to perform an effective scan of the full table via a token range scan plus a range scan for every individual row - this will return only data in the given time range. Or if you don't want to use Spark, you can perform a token range scan in your code, something like this.
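A minimal sketch of what a token range scan in application code could look like (this is not the code the answer originally linked to). It assumes the default Murmur3Partitioner, whose tokens are signed 64-bit integers, and only builds the CQL strings; running them through a driver, in parallel if desired, is left out. The function names and the number of splits are illustrative.

```python
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_token_ring(n):
    """Split the full token ring into n contiguous, inclusive [start, end] ranges."""
    total = MAX_TOKEN - MIN_TOKEN + 1
    ranges, start = [], MIN_TOKEN
    for i in range(n):
        # The last range absorbs any remainder so the whole ring is covered.
        end = MAX_TOKEN if i == n - 1 else start + total // n - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_query(start, end, cutoff_millis):
    """CQL for one token sub-range, filtered on the clustering column."""
    return (
        "SELECT * FROM domain "
        f"WHERE token(domain_name) >= {start} AND token(domain_name) <= {end} "
        f"AND last_scanned_date <= {cutoff_millis} ALLOW FILTERING"
    )


for start, end in split_token_ring(4):
    print(range_query(start, end, 1_700_000_000_000))
```

Each sub-range query touches only part of the ring, so the scan can be spread over several workers instead of one large ALLOW FILTERING query.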

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, eg:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value in my page and retrieve the next query with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
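The concern about missing rows can be reproduced with a small in-memory emulation. This is only a sketch: the table contents and token values are made up, and page_after stands in for the token(partitionKey) > token(lastKey) LIMIT n query.

```python
# Each tuple is (token, partition_key, row_value); rows come back in token order.
TABLE = sorted([
    (10, "a", "a1"), (10, "a", "a2"), (10, "a", "a3"),
    (20, "b", "b1"),
    (30, "c", "c1"), (30, "c", "c2"),
])


def page_after(last_token, limit):
    """Emulates: SELECT ... WHERE token(pk) > last_token LIMIT limit."""
    return [row for row in TABLE if row[0] > last_token][:limit]


def page_all(limit):
    """Page through the table using the token of the last key on each page."""
    seen, last_token = [], float("-inf")
    while True:
        page = page_after(last_token, limit)
        if not page:
            return seen
        seen.extend(row[2] for row in page)
        last_token = page[-1][0]


print(page_all(2))  # ['a1', 'a2', 'b1', 'c1']
print(page_all(3))  # ['a1', 'a2', 'a3', 'b1', 'c1', 'c2']
```

With a page size of 2, rows a3 and c2 are never returned, because the next page starts strictly after the token of a partition that was cut off by the LIMIT.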
But I'm not certain if it's good practice to have a unique partition key for a complex table.
How you should choose your partition key depends on your requirements and data model. If the partition key is your entire primary key, it has to be unique, otherwise data will be upserted (overwritten with new data). If you have wide rows (a clustering key), then making your partition key unique (a key that appears only once in the table) defeats the purpose of a wide row. In CQL “wide rows” just means that there can be more than one row per partition, but here there would be one row per partition. It would be better if you could provide the schema.
Please see the links below on pagination in Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+.
Cassandra 2.0 has auto paging. Instead of using token function to
create paging, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use the pagingState object, which represents where you were in the result set when the last page was fetched.
EDITED:
Please check the below link:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem. Maybe adding this here quickly.
First there is a table with two fields. Just for illustration we use only few fields.
1.Say we insert a million rows with this
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI. Assuming that there are hundred entries 10 pages each.
For this we update the table with a column called page_no.
Create a secondary index for this column.
Then do a one time update for this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no as value 10.
Since we can query on a secondary index each page can be queried independently.
Code is self explanatory and here - https://github.com/alexcpn/testgo
Note that cautions on how to use secondary indexes properly abound. Please check them. In this use case I am hoping that I am using it properly. I have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

Difference between creating a secondary index vs creating an index CF manually in Cassandra

Can anyone tell me the difference between creating a secondary index and creating an index CF manually in Cassandra?
Secondary indexes in Cassandra are stored and maintained on each node. Thus, when you filter by a secondary index, Cassandra will need to do the search on every node, and then return the combined results. Therefore, filtering by secondary indexes can be significantly slower than filtering by partition key (according to my tests it can be 10 times slower, depending on your data and topology).
Maintaining your own index table is more efficient for most use cases, but you need to deal with updating the index table on your own. Also, you will need to do two queries for retrieving your data: one that queries the index table, and another one for retrieving the actual data.
Another solution would be to duplicate your data completely, and create two tables with the same structure, but different keys.
If performance is your key concern, then go for an index table or a duplicated table. If you need simplicity and can afford some performance penalty, use secondary indexes, but I recommend doing some performance testing beforehand.
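The bookkeeping of a manually maintained index table can be sketched with two in-memory dictionaries standing in for the two tables. Everything here (users_by_id, user_ids_by_city, the columns) is hypothetical and only illustrates the write path and the two-step read path described above.

```python
from collections import defaultdict

users_by_id = {}                     # base table: partition key = user_id
user_ids_by_city = defaultdict(set)  # index table: partition key = city


def insert_user(user_id, name, city):
    """Write path: the application must keep both tables in sync itself."""
    old = users_by_id.get(user_id)
    if old is not None and old["city"] != city:
        # Remove the stale index entry when the indexed value changes.
        user_ids_by_city[old["city"]].discard(user_id)
    users_by_id[user_id] = {"name": name, "city": city}
    user_ids_by_city[city].add(user_id)


def find_by_city(city):
    """Read path: query 1 hits the index table, query 2 hits the base table."""
    ids = user_ids_by_city.get(city, set())
    return [users_by_id[i] for i in sorted(ids)]


insert_user(1, "ann", "Oslo")
insert_user(2, "bob", "Oslo")
insert_user(1, "ann", "Rome")  # update: the index entry moves from Oslo to Rome

print(find_by_city("Oslo"))  # [{'name': 'bob', 'city': 'Oslo'}]
print(find_by_city("Rome"))  # [{'name': 'ann', 'city': 'Rome'}]
```

In a real cluster the two writes are not atomic unless you batch them, which is part of the maintenance cost mentioned in the answer.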

Cassandra Performance : Less rows with more columns vs more rows with less columns

We are evaluating whether we can migrate from SQL Server to Cassandra for OLAP. As per the internal storage structure, we can have wide rows. We almost always need to access data by date, and we often need to access data within a date range, as we have financial data. If we use the date as the partition key to support filtering by date, we end up having fewer rows with a huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in the future, as we process millions of transactions every day?
Do we need to change the access pattern to have more rows with fewer columns per row?
We need some performance insight to proceed in either direction.
Using wide rows is typically fine with Cassandra. There are, however, a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case.
The whole wide row is stored on the same node: it needs to fit on the disk. Also, if some dates are accessed more frequently than others (e.g. today), you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance, however. Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision one needs to know all typical filter conditions. If you have any other fields you typically filter for as an exact match, you could add them to the partition key as well.
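To get a feel for how close a date-partitioned table would come to the 2 billion limit, a rough estimate helps. The transaction and column counts below are invented for illustration, and the split across extra partition key values assumes a uniform distribution.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000


def cells_per_partition(transactions_per_day, columns_per_transaction,
                        partitions_per_day=1):
    """Estimated cells in one partition when a day is spread over
    partitions_per_day partition key values (uniform spread assumed)."""
    total = transactions_per_day * columns_per_transaction
    return -(-total // partitions_per_day)  # ceiling division


# Date alone as the partition key: 5 million transactions, 20 columns each.
single = cells_per_partition(5_000_000, 20)
print(single, single / MAX_CELLS_PER_PARTITION)  # 100000000 0.05

# Date plus an exact-match field with 50 distinct values in the partition key.
print(cells_per_partition(5_000_000, 20, partitions_per_day=50))  # 2000000
```

Staying under the hard limit is only the minimum requirement; partitions far below it can still be slow to read and can still create hotspots.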

Cassandra schema design: should more columns go into partition vs. cluster?

In my case I have a table structure like this:
table_1 {
entity_uuid text
,fk1_uuid text
,fk2_uuid text
,int_timestamp bigint
,cnt counter
,primary key (entity_uuid, fk1_uuid, fk2_uuid, int_timestamp)
}
The text columns are made up of random strings. However, only entity_uuid is truly random and evenly distributed. fk1_uuid and fk2_uuid have much lower cardinality and may be sparse (sometimes fk1_uuid=null or fk2_uuid=null).
In this case, I can either define only entity_uuid as the partition key or entity_uuid, fk1_uuid, fk2_uuid combination as the partition key.
And this is a LOOKUP-type of table, meaning we don't plan to do any aggregations/slice-dice based on this table. And the rows will be rotated out since we will be inserting with TTL defined for each row.
Can someone enlighten me:
What is the downside of having too many partition keys with very few rows in each? Is there a hit/cost at the storage engine level?
My understanding is that the cluster keys are ALWAYS sorted. Does that mean having text columns in a cluster will always incur a tree balancing cost?
Well, you can tell where my heart lies by now. However, when all rows in a partition have TTL-ed out, does that partition still live on, or is there a way it will be removed by the DB engine as well?
Thanks,
Bing
The major and possibly most significant difference between having big partitions and small partitions is the ability to do range scans. If you want to be able to do scan queries like
SELECT * FROM table_1 where entity_uuid = x and fk1_uuid > something
Then you'll need to have the clustering column for performance, otherwise this query would be difficult (a multi-get at best, full table scan at worst.) I've never heard of any cases where having too many partitions is a drag on performance but having too wide a partition (ie lots of clustering column values) can cause issues when you get into the 1B+ cell range.
In terms of the cost of clustering, it is basically free at write time (in-memory sort is very, very fast), but you can incur costs at read time as partitions become spread amongst various SSTables. Small partitions which are written once will not incur the merge penalty, since they will most likely exist in only 1 SSTable.
TTL'd partitions will be removed but be sure to read up on GC_GRACE_SECONDS to see how Cassandra actually deals with removing data.
TL;DR
Everything is dependent on your read/write pattern
No Range Scans? No need for clustering keys
Yes Range Scans? Clustering keys a must
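Why a clustering column makes the range scan cheap can be illustrated with a sorted list standing in for one partition. This is only an emulation of the idea, with made-up fk1_uuid values: because the rows are kept sorted, the scan is a binary search followed by a contiguous slice.

```python
from bisect import bisect_right

# One partition (a single entity_uuid): rows kept sorted by the clustering column.
partition = sorted(["fk-03", "fk-07", "fk-11", "fk-15", "fk-20"])


def range_scan(sorted_keys, lower_exclusive):
    """Emulates: WHERE entity_uuid = x AND fk1_uuid > lower_exclusive."""
    start = bisect_right(sorted_keys, lower_exclusive)
    return sorted_keys[start:]


print(range_scan(partition, "fk-07"))  # ['fk-11', 'fk-15', 'fk-20']
```

If fk1_uuid were part of the partition key instead, the matching rows would be scattered across the cluster and no such contiguous slice would exist.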

Resources