Why secondary indexes are less efficient in Cassandra? - cassandra

I read in Cassandra documentation that creating secondary index is less efficient as because in worst case it need to touch all nodes in order to find out the data of that non-key column.
But my doubt is even if we do not create secondary index, then also it will have to touch all nodes (in worst case) and find out where that particular row with this non-key column value resides.
Note: Yeah, I understand that it is possible that if the cardinality is high then the secondary index will contain(store) index for mostly all rows and in this way it is bad in terms of storage. But I want to know how not creating secondary index is efficient than creating secondary index?

Secondary indexes should be used only in specific cases, like, when you use them together with condition on partition key column, you have correct cardinality for data, etc.
For example, if we have following table:
create table test.test (
pk int,
c1 int,
val1 int,
val2 int,
primary key(pk, c1));
and you created a secondary index on the column val2, then following query will be very effective:
select * from test.test where pk = 123 and val2 = 10
because you restricted the execution of query only to the nodes that are replicas for pk with value 123.
But if you do
select * from test.test where val2 = 10
then Cassandra will need to go to the every node, and ask for data there - it will be much slower, and put a pressure to coordinating node.
Standard secondary indexes have other limitations, such as, search only for specific values, problems when column has very low or very high cardinality, etc. SASI indexes are better from design standpoint, although they are still experimental, and have problems with implementation.
You can find technical details about implementation of secondary indexes in the following blog post.
DataStax has other implementations in the commercial offering:
DSE Search that is based on the Apache Solr, so you get a lot of flexibility (full text search, range queries, etc.)
new implementation called SSTable Attached Indexes (SAI) - they are currently marked as beta, but they provide more flexibility than standard secondary indexes, with less overhead than DSE Search

Related

Defining primary/partition key for a table with small records

I have a table which contains market information at the current instance. The rows in this table will not exceed more than 700. Does it make sense to create primary keys in this instance?
I will be querying this table either to get all the data or search for 'symbol' or 'sector'.
CREATE TABLE current(
symbol text,
sector text,
open double,
high double,
low double,
close double,
volume bigint,
market_cap double,
shares double,
last_update timestamp,
PRIMARY KEY (symbol, sector)
)
What would be the best approach to ensure quick retrievals?
Yes, it makes sense to create a primary key on this table, as cassandra does not allow tables without a primary key. If you have no more than around 700 records in total, I would question the wisdom of using Cassandra for that. Cassandra is designed to efficiently handle billions of records, petabytes of data.
Assuming that you did do a little research before deciding on Cassandra, and still came to the conclusion that it is a good fit for your use case, I also assume that you understood that Cassandra is not a relational database, and that you know about partitioning keys and clustering keys. The partitioning key is hash-based, decides which node(s) in the Cassandra cluster are responsible for handling each row, and will only allow equality lookups. The clustering key is tree-based, decides on the ordering of the rows on a node, and will allow for equality and range lookups.
Also, Cassandra only allows you to do either full table scans or querying on at least the entire partitioning key. So querying on symbol or sector is not feasible without creating a secondary index, which is actually an antipattern, but may not be that bad since you only have 700 rows.

Trying to visual how wide and skinny rows are layed out

Can someone give and show me how the data is layed out when you design your tables for wide vs. skinny rows.
I'm not sure I fully grasp how the data is spread out with a "wide" row.
Is there a difference in how you can fetch the data or will it be the same i.e. if it is ordered it doesn't matter if the data is vertical (skinny) or horizontally (wide) organized.
Update
Is a table considered with if the primary key consists of more than one column?
Or table will have wide rows only if the partition key is a composite partition key?
Wide... Skinny... Terms that make your head explode... I prefer to oversimplify the thing as such:
All the tables have wide rows
You simply need to take care of how wide the rows gets
This allows me to think this as follow (mangling a bit the C* terminology):
Number of RECORDS in a partition
1 <--------------------------------------- ... 2Billion
^ ^
Skinny rows wide rows
The lesser records in a partition, the skinner is the "partition", and vice-versa.
When designing for C* I always keep in mind a couple of things:
I want to use "skinny partitions" when my data can be fetched with one query and it is fully contained in one record of one partition. Typical example is something along SELECT * FROM table WHERE username = 'xmas79'; where the table has a primary key in the form of PRIMARY KEY (username)that let me get all the data belonging to a particular username.
I want to use "wide rows" when my data can be fetched with one query and it is fully contained on multiple records of one partition. Typical examples are range queries like SELECT * FROM table WHERE sensor = 'pressure' AND time >= '2016-09-22';, where the table has a primary key in the form of PRIMARY KEY (sensor, time).
So, first approach for one shot queries, second approach for range queries. Beware that this second approach have the (major) drawback that you can keep adding data to the partition, and it will get wider and wider, hurting performances.
In order to control how wide your partitions are, you need to add something to the partition key. In the sensor example above, if your don't violate your requirements of course, you can "group" some measurements by date, eg you split the measures in a day-by-day groups, making the primary key like PRIMARY KEY ((sensor, day), time), where the partition key was transformed to (sensor, day). By this approach, you have full (well, let's say good at least) control on the wideness of your partitions.
You only need to find a good compromise between your query capabilities and the desired performance.
I suggest these three readings for further investigation on the details:
Wide Rows in Cassandra CQL
Does CQL support dynamic columns / wide rows?
CQL3 for Cassandra experts
Beware that in the 1. there's a mistake in the second to last picture: the primary key should be
PRIMARY KEY ((user_id, tweet_id))
with double parenthesis around the columns instead of one.

Where and Order By Clauses in Cassandra CQL

I am new to NoSQL database and have just started using apache Cassandra. I created a simple table "emp" with primary key on "empno" column. This is a simple table as we always get in Oracle's default scott schema.
Now I loaded data using the COPY command and issued query Select * from emp order by empno but I was surprised that CQL did not allow Order by on empno column (which is PK). Also when I used Where condition, it did not allow any inequality operations on empno column (it said only EQ or IN conditions are allowed). It also did not allowed Where and Order by on any other column, as they were not used in PK, and did not have an index.
Can someone please help me what should I do if I want to keep empno unique in the table and want a query results in Sorted order of empno?
(My version is:
cqlsh:demodb> show version
[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]
)
There are two parts to a PRIMARY KEY in Cassandra:
partition key(s)
clustering key(s)
PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)
or
PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)
The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within your partition key.
In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined (and in that exact order...no skipping) in your CLUSTERING ORDER BY clause at table creation time. So you cannot pick arbitrary columns to order your result set at query-time.
Cassandra achieves performance by using the clustering keys to sort your data on-disk, thereby only returning ordered rows in a single read (no random reads). This is why you must take a query-based modeling approach (often duplicating your data into multiple query tables) with Cassandra. Know your queries ahead of time, and build your tables to serve them.
Select * from emp order by empno;
First of all, you need a WHERE clause. It's ok to query without it, if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want, anyway.
Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:
CREATE TABLE emp_by_dept (
empno text,
dept text,
name text,
PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);
Now, I can query employees by department, and they will be returned to me ordered by empno:
SELECT * FROM emp_by_dept WHERE dept='IT';
But to be clear, you will not be able to query every row in your table, and have it ordered by a single column. The only way to get meaningful order into your result sets, is first partition your data in a way that makes sense to your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time-out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.
Edit for additional questions:
From your answer I concluded 2 things about Cassandra:
(1) There is no
way of getting a result set which is only order by a column that has
been defined as Unique.
(2) When we define a PK
(partition-key+clustering-key), then the results will always be order
by Clustering columns within any fixed partition key (we must restrict
to one partition-key value), that means there is no need of ORDER BY
clause, since it cannot ever change the order of rows (the order in
which rows are actually stored), i.e. Order By is useless.
1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept). – Aaron 1 hour ago
2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.
I created an index on "empno" column of "emp" table, it is still not
allowing ORDER BY empno. So, what Indexes are for? are they only for
searching records for specific value of index key?
You cannot order a result set by an indexed column. Secondary indexes are (not the same as their relational counterparts) really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
Ok, that simply means that one table cannot be used for getting
different result sets with different conditions and different sorting
order.
Correct.
Hence for each new requirement, we need to create a new table.
IT means if we have a billion rows in a table (say Sales table), and
we need sum of sales (1) Product-wise, (2) Region-wise, then we will
duplicate all those billion rows in 2 tables with one in clustering
order of Product, the other in clustering order of Region,. and even
if we need to sum sales per Salesman_id, then we build a 3rd table,
again putting all those billion rows? is it sensible?
It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (I.E., trading disk for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
EDIT 20160321
Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."
But i found even that is not correct. Cassandra only allows ORDER by in the same direction as we define in the "CLUSTERING ORDER BY" caluse of CREATE TABLE. If in that clause we define ASC, it allows only order by ASC, and vice versa.
Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.
ORDER BY also functions a little odd if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column indiscriminately. But as soon as I add the second column to the ORDER BY clause, my query only works if I specify both sort directions the same (as the CLUSTERING ORDER BY definition) or both different. If I mix and match, I get this:
InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"
I think that has to do with how the data is stored on-disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either to match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
Adding a redux answer as the accepted one is quite long.
Order by is currently only supported on the clustered columns of the PRIMARY KEY
and when the partition key is restricted by an Equality or an IN operator in where clause.
That is if you have your primary key defined like this :
PRIMARY KEY ((a,b),c,d)
Then you will be able to use the ORDER BY when & only when your query has :
a where clause with all the primary key restricted either by an equality operator (=) or an IN operator such as :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c,d;
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c;
These two query are the only valid ones.
Also this query would not work :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY d,c;
because order by currently only support the ordering of columns following their declared order in the PRIMARY KEY that is in primary key definition c has been declared before d and the query violates the ordering by placing d first.

How do secondary indexes work in Cassandra?

Suppose I have a column family:
CREATE TABLE update_audit (
scopeid bigint,
formid bigint,
time timestamp,
record_link_id bigint,
ipaddress text,
user_zuid bigint,
value text,
PRIMARY KEY ((scopeid, formid), time)
) WITH CLUSTERING ORDER BY (time DESC)
With two secondary indexes, where record_link_id is a high-cardinality column:
CREATE INDEX update_audit_id_idx ON update_audit (record_link_id);
CREATE INDEX update_audit_user_zuid_idx ON update_audit (user_zuid);
According to my knowledge Cassandra will create two hidden column families like so:
CREATE TABLE update_audit_id_idx(
record_link_id bigint,
scopeid bigint,
formid bigint,
time timestamp
PRIMARY KEY ((record_link_id), scopeid, formid, time)
);
CREATE TABLE update_audit_user_zuid_idx(
user_zuid bigint,
scopeid bigint,
formid bigint,
time timestamp
PRIMARY KEY ((user_zuid), scopeid, formid, time)
);
Cassandra secondary indexes are implemented as local indexes rather than being distributed like normal tables. Each node only stores an index for the data it stores.
Consider the following query:
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
How will this query execute 'under the hood' in Cassandra?
How will a high-cardinality column index (record_link_id) affect its performance?
Will Cassandra touch all nodes for the above query? Why?
Which criteria will be executed first, base table partition_key or secondary index partition_key? How will Cassandra intersect these two results?
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
How the above query will work internally in cassandra?
Essentially, all data for partition scopeid=35 and formid=78005 will be returned, and then filtered by the record_link_id index. It will look for the record_link_id entry for 9897, and attempt to match-up entries that match the rows returned where scopeid=35 and formid=78005. The intersection of the rows for the partition keys and the index keys will be returned.
How high-cardinality column (record_link_id)index will affect the query performance for the above query?
High-cardinality indexes essentially create a row for (almost) each entry in the main table. Performance is affected, because Cassandra is designed to perform sequential reads for query results. An index query essentially forces Cassandra to perform random reads. As cardinality of your indexed value increases, so does the time it takes to find the queried value.
Does cassandra will touch all nodes for the above query? WHY?
No. It should only touch a node that is responsible for the scopeid=35 and formid=78005 partition. Indexes likewise are stored locally, only contain entries that are valid for the local node.
creating index over high-cardinality columns will be the fastest and best data model
The problem here is that approach does not scale, and will be slow if update_audit is a large dataset. MVP Richard Low has a great article on secondary indexes(The Sweet Spot For Cassandra Secondary Indexing), and particularly on this point:
If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.
...
In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.
Now, your approach of first restricting by a specific partition will help (as your partition should certainly fit into memory). But I feel the better-performing choice here would be to make record_link_id a clustering key, instead of relying on a secondary index.
Edit
How does having index on low cardinality index when there are millions of users scale even when we provide the primary key
It will depend on how wide your rows are. The tricky thing about extremely low cardinality indexes, is that the % of rows returned is usually greater. For instance, consider a wide-row users table. You restrict by the partition key in your query, but there are still 10,000 rows returned. If your index is on something like gender, your query will have to filter-out about half of those rows, which won't perform well.
Secondary indexes tend to work best on (for lack of a better description) "middle of the road" cardinality. Using the above example of a wide-row users table, an index on country or state should perform much better than an index on gender (assuming that most of those users don't all live in the same country or state).
Edit 20180913
For your answer to 1st question "How the above query will work internally in cassandra?", do you know what's the behavior when query with pagination?
Consider the following diagram, taken from the Java Driver documentation (v3.6):
Basically, paging will cause the query to break itself up and return to the cluster for the next iteration of results. It'd be less likely to timeout, but performance will trend downward, proportional to the size of the total result set and the number of nodes in the cluster.
TL;DR; The more requested results spread over more nodes, the longer it will take.
Query with only secondary index is also possible in Cassandra 2.x
select * from update_audit where record_link_id=9897;
But this has a large impact on fetching data, because it reads all partitions on distributed environment. The data fetched by this query is also not consistent and could not relay on it.
Suggestion:
Use of Secondary index is considered to be a DIRT query from NoSQL Data Model view.
To avoid secondary index, we could create a new table and copy data to it. Since this is a query of the application, Tables are derived from queries.

What is the difference between a clustering column and secondary index in cassandra

I'm trying to understand the difference between these two and the scenarios in which you would prefer to use one over the other.
My specific use case is using cassandra as an event ingestion system backed by an analytics engine that interprets the event.
My model includes
event id (the partition key)
event time (a clustering column)
event type (i'm not sure whether to use clustering column or secondary index)
I figure the most common read scenario will be to get the events over a time range hence event time is the clustering column. A less frequent read scenario might involve further filtering the event query by event type.
A secondary index is pretty similar to what we know from regular relational databases. If you have a query with a where clause that uses column values that are not part of the primary key, lookup would be slow because a full row search has to be performed. Secondary indexes make it possible to service such queries efficiently. Secondary indexes are stored as extra tables, and just store extra data to make it easy to find your way in the main table.
So that's a good ol' index, which we already know about. So far, there's nothing new to cassandra and its distributed nature.
Partitioning and clustering is all about deciding how rows from the main table are spread among the nodes. This is unique to cassandara since it determines the distribution of data. So, the primary key consists of at least one column. The first column in the primary key is used as the partition key. The partition key is used to decide which node to store a row. If the primary key has additional columns, the columns are used to cluster the data on a given node - the data is stored in lexicographic order on a node by clustering columns.
This question has more specifics on clustering columns: Clustering Keys in Cassandra
So an index on a given column X makes the lookup X --> primary key efficient. The partition key (first column in the primary key) determines which node a row is stored on. Clustering columns (additional columns in the primary key) determine which order rows are stored in on their assigned node.
So your intuition sounds about right - the event ID is presumably guaranteed unique, so is great for building a primary key. Event time is a great way to order rows on disk on a given node.
If you never needed to lookup data by event type, eg, never had a query like SELECT * FROM Events WHERE Type = Warning, then you have no need for your additional indexes, but your demands for partitioning don't change. Indexes make it easy to serve queries with different predicates. Since you mentioned that you indeed were planning on performing queries like that, you do in fact likely want an index on your EventType column.
Check out the cassandra documentation: http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_compound_keys_c.html
Cassandra uses the first column name in the primary key definition as the partition key.
...
In the case of the playlists table, the song_order is the clustering column. The data for each partition is clustered by the remaining column or columns of the primary key definition. On a physical node, when rows for a partition key are stored in order based on the clustering columns

Resources