Is a SASI index node level or cluster level? - cassandra

Can anyone tell me whether creating a SASI index creates the index at the node level or at the cluster level? How does SASI differ from a normal secondary index? Is it worth using or not?
I went through the documentation but did not understand much about the pros and cons.

When I studied the documentation, I found "SSTable attached secondary indexes (SASI) can be created on a non-collection column defined in a table", so I assume the index should be distributed across the cluster.

Related

Cassandra secondary indices v. Lucene

I understand that Cassandra is a NoSQL database and that patching it with many indices is not the way to go, but here I'm looking at a solution for my analytics cluster, not for the production/real-time one.
So I think it makes sense to add indices to reduce the amount of data filtered by Spark.
How do native Cassandra secondary indices compare to Lucene's indices?
Many functionalities are not available with Cassandra alone, but what about things that you can do with both?
Is it better / does it make sense to only use Lucene?
Another advantage that I see is that I can install Lucene only on my analytics cluster, without overloading the real-time one with indices (and therefore improving the write performance on that side).
Don't bother with Lucene integration.
Since Cassandra 3.4, we have a new secondary index called SASI that offers full text search and is quite performant.
Read this: https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
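For intuition about what a prefix search such as `LIKE 'cas%'` has to do, here is a minimal sketch that assumes a sorted in-memory list of terms. It is not SASI's actual on-disk structure; `prefix_matches` and the sample terms are hypothetical names chosen for illustration.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return every term in sorted_terms that starts with prefix."""
    # Binary-search to the first term >= prefix, then walk forward
    # while the terms still share the prefix.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


# Hypothetical indexed terms, kept sorted so the binary search is valid.
terms = sorted(["cassandra", "cast", "castle", "cat", "lucene", "sasi"])
print(prefix_matches(terms, "cas"))
```

Keeping the terms sorted is what lets the lookup skip straight to the first candidate instead of comparing the prefix against every term.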

How do multiple Cassandra secondary indices work?

As Cassandra does not have an execution plan, we were wondering how multiple secondary indices would work. That is, if a query filtered on the columns in a different order, which secondary index would get preference, and why?
We do know they are a bad practice and should be used for low-cardinality sets or many duplicates, but we are trying to leverage existing legacy Cassandra tables and cannot use both Cassandra secondary indices and SOLR indices at the same time, so we don't have an option here.
Not much is discussed here either: http://www.datastax.com/docs/1.1/ddl/indexes
Secondary indexes are like lookup tables you would create yourself, except that Cassandra manages them. A node stores index info only for the rows it contains. The update to an index on a node and the update of the data on that node are atomic. If multiple indexes are used in your query, only one will actually be used. I hope somebody can correct me on this, but from what I can tell, the first filter in your predicate is the one that'll be used.
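A rough sketch of "only one index is used" follows. It assumes, as guessed above and not confirmed, that the first predicate's index drives the lookup and that the remaining predicates are checked row by row. The table contents and the names `build_index` and `query` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows held by a single node: row id -> column values.
rows = {
    1: {"state": "TX", "year": 1973},
    2: {"state": "TX", "year": 1980},
    3: {"state": "CA", "year": 1973},
}


def build_index(rows, column):
    """Map each value of `column` to the set of row ids that hold it."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        index[row[column]].add(row_id)
    return index


indexes = {
    "state": build_index(rows, "state"),
    "year": build_index(rows, "year"),
}


def query(predicates):
    """predicates: ordered list of (column, value) equality filters."""
    # Assumption: the index of the first predicate drives the lookup.
    first_col, first_val = predicates[0]
    candidates = indexes[first_col].get(first_val, set())
    # The remaining predicates are applied as plain filters on the
    # candidate rows; their indexes are never consulted.
    return sorted(
        row_id
        for row_id in candidates
        if all(rows[row_id][col] == val for col, val in predicates[1:])
    )


print(query([("state", "TX"), ("year", 1973)]))
```

Either predicate order returns the same rows; what changes is how many candidate rows have to be filtered after the index lookup.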
Don't think of indexes as global lookups (in the general case). This will lead to annoying performance problems, etc. Think of indexes as a way of quickly getting to some columns inside of a partition where the column you want an equality filter on isn't the clustering key (or you want to be able to filter on the second clustering key without specifying the first one). If you hit a partition, then index performance is usually not bad. The information about low cardinality is correct - the higher the cardinality, the worse your index will perform.
Here's a short faq on indexes:
http://wiki.apache.org/cassandra/SecondaryIndexes

LIKE operator in Cassandra

In SQL, we have an option to specify the LIKE operator in the where clause. Is there something like that in Cassandra? I am building a search feature for my site. All the data resides on Cassandra. So, it would be easier to search for keywords with LIKE operator.
No. Cassandra does not have such a feature. You would probably have to build a search engine on top of the data stored in Cassandra to index the entries. Cassandra serves as a container for your data and does not provide features like full-text search yet (I doubt it ever will, since the storage is spread across SSTables).
If you need search capabilities on cassandra data, look no further than DSE:
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/srch/srchIntro.html

Why do we need secondary indexes in cassandra and how do they really work?

I was trying to understand why secondary indexes were even necessary on Cassandra.
I know that secondary indexes are used because:
"Secondary indexes allow for efficient querying by specific values using equality predicates (where column x = value y). Also, queries on indexed values can apply additional filters to perform operations such as range queries."
from: http://www.datastax.com/docs/0.7/data_model/secondary_indexes
But what I did not understand is why a query like:
get users where birth_date = 1973;
required that birth_date had a secondary index. Why is it necessary for secondary indexes to even exist? Can't Cassandra just go through the table and return the values when the constraint is matched? Why do the things we might want to query that way need any special treatment?
I am assuming that the fact that cassandra is distributed and going through the whole table might not be easy due to each row key being allocated to a different node making it a little complicated. But I didn't really understand how making it distributed complicated the problem and how secondary indices resolved it (i.e. how does cassandra resolve this issue?).
Related to this question, is it true that secondary indexes and primary keys are the only things that can be queried in the form of SELECT * FROM column_family_table WHERE col_x = constraint? Why is the primary key special?
With the amount of data these NoSQL databases are meant to handle, a full table scan or region scan is not an option. That is why Cassandra restricts queries on non-row-key columns and allows them only when a secondary index exists. That way, the index and the data it covers are co-located on the same data node.
Hope it helps.
-Vivek
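To make the co-location point concrete, here is a toy sketch that assumes three nodes, a modulo placement standing in for the partitioner, and a per-node dictionary as the index. None of this is Cassandra code; `Node`, `node_for` and `query` are hypothetical.

```python
from collections import defaultdict


class Node:
    """A toy node holding its own rows plus an index over those rows only."""

    def __init__(self):
        self.rows = {}                  # row key -> birth_date
        self.index = defaultdict(list)  # birth_date -> row keys on this node

    def write(self, key, birth_date):
        # Data and index entry are updated together on the same node.
        old = self.rows.get(key)
        if old is not None:
            self.index[old].remove(key)
        self.rows[key] = birth_date
        self.index[birth_date].append(key)

    def scan(self, birth_date):
        """Without an index: examine every row stored on this node."""
        hits = [k for k, v in self.rows.items() if v == birth_date]
        return hits, len(self.rows)

    def lookup(self, birth_date):
        """With an index: touch only the matching entries."""
        hits = list(self.index.get(birth_date, []))
        return hits, len(hits)


nodes = [Node() for _ in range(3)]


def node_for(key):
    # Hypothetical placement rule standing in for partitioner hashing.
    return nodes[key % len(nodes)]


for key in range(30):
    node_for(key).write(key, 1970 + key % 10)


def query(birth_date, use_index):
    """Ask every node, since any of them may hold matching rows."""
    hits, examined = [], 0
    for node in nodes:
        found, cost = node.lookup(birth_date) if use_index else node.scan(birth_date)
        hits += found
        examined += cost
    return sorted(hits), examined


print(query(1973, use_index=True))
print(query(1973, use_index=False))
```

Both calls return the same keys, but the scan examines every stored row while the indexed lookup touches only the matches on each node. Each node still has to be asked, because the index covers only that node's own rows.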

Storing a Lucene index in a Cassandra DB

Is there any way to use Apache Lucene and have it store values and retrieve values from a Cassandra cluster?
The hard way: implement a custom index type on top of Lucene and teach Cassandra to query it. There is also a two year old ticket open for this that you could watch.
The expensive way: buy a DataStax Enterprise license.
You should try out https://code.google.com/p/lucene-on-cassandra/ . It takes a different approach from the DataStax one.
