Search / Filter on primary key - Cassandra

I need to filter on a column, something like "SELECT * FROM codes WHERE code = 'a';", to get all codes that start with "a", that is: "aa", "ab", "ac".
CREATE TABLE codes (
code text,
PRIMARY KEY (code)
);
Do you know how?

A LIKE search (%...% in SQL) is not possible in Cassandra.
The only way to do this efficiently is to use a full-text search engine such as https://github.com/tjake/Solandra (Solr-on-Cassandra).

DataStax Enterprise Edition has an integrated Solr feature for this kind of query, but it still takes a read performance hit:
Step 1) Solr searches and returns the list of matching keys.
Step 2) Those keys are then fetched from across the cluster, which again depends on the CONSISTENCY LEVEL.
My recommendation is to avoid such queries; Cassandra is not built for that.
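If you control the data model, the usual Cassandra-style alternative is to model around the query instead. A minimal sketch (the codes_by_prefix table and its prefix column are illustrative, not part of the original schema, and the prefix has to be written by the application):
CREATE TABLE codes_by_prefix (
prefix text,   // first character of the code, populated by the application
code text,
PRIMARY KEY (prefix, code)
);
// All codes starting with 'a' ("aa", "ab", "ac", ...) then live in a single partition:
SELECT code FROM codes_by_prefix WHERE prefix = 'a';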


How does ALLOW FILTERING work when we provide all of the partition keys?

I've read at least 50 articles on this and still don't know the answer ...
I know how partitioning, clustering and ALLOW FILTERING work, but I can't figure out what happens when ALLOW FILTERING is used in a query that already provides all the partition keys.
I have a table like this:
CREATE TABLE IF NOT EXISTS keyspace.events (
date_string varchar,
starting_timestamp bigint,
event_name varchar,
sport_id varchar,
id varchar,
PRIMARY KEY ((date_string), starting_timestamp, id)
);
How does a query like this work?
SELECT * FROM keyspace.events
WHERE
date_string IN ('', '', '') AND
starting_timestamp < '' AND
sport_id = 1 /* not in partitioning nor clustering key */
ALLOW FILTERING;
Is the 'sport_id' filtering done on the records already retrieved via the correctly specified keys? Is ALLOW FILTERING still discouraged in this kind of query?
How should I perform filtering in this particular situation?
Thanks in advance
Yes, it first narrows the query down to the specified partitions and only then filters on the non-key column, as shown by the experiment described here: https://dzone.com/articles/apache-cassandra-and-allow-filtering
I think it is safe to use ALLOW FILTERING after all the keys in most cases.
It also depends heavily on how much data you are filtering out: if the last condition, sport_id = 1, discards most of the rows, it becomes a bad idea because it puts a lot of pressure on the database, so you need to weigh the trade-offs here.
It is also not a good idea to use an IN clause on the partition key; the query above looks particularly problematic because it combines an IN clause on the partition key with ALLOW FILTERING.
Suggestion: Cassandra is very good at handling as many requests per second as you need, and the design goal should be to send many light queries at once rather than one query that does a lot of work. So my suggestion is to fire N calls to Cassandra, each with an = condition on the partition key and without filtering on the last column, then combine the results and do the final filter in your application code (whichever language you are using can presumably send these calls to the database in parallel; see the sketch below). Doing so will pay off in performance in the long term as the data grows.
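A minimal sketch of that approach, assuming the Python DataStax driver (cassandra-driver); the contact point, keyspace name, example dates and cutoff value are placeholders:
from cassandra.cluster import Cluster

# Contact point and keyspace name are placeholders.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

stmt = session.prepare(
    "SELECT * FROM events WHERE date_string = ? AND starting_timestamp < ?")

dates = ["2023-06-01", "2023-06-02", "2023-06-03"]   # one partition per call
cutoff = 1685836800000                               # example upper bound

# Fire one fully-keyed, lightweight query per partition, in parallel ...
futures = [session.execute_async(stmt, (d, cutoff)) for d in dates]

# ... then merge the results and apply the sport_id filter client-side.
rows = [r for f in futures for r in f.result() if r.sport_id == "1"]
Each individual call hits exactly one partition, so the cluster does no filtering work, and the cheap sport_id check happens in the application.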

Cassandra query collection while specifying a partition key

I've been reading about indexes in Cassandra but I'm a little confused when it comes to creating an index on a collection like a set, list or map.
Let's say I have the following table and index on users:
CREATE TABLE chatter.channels (
id text PRIMARY KEY,
users set<text>
);
CREATE INDEX channels_users_idx ON chatter.channels (values(users));
INSERT INTO chatter.channels (id, users) VALUES ('ch1', {'jeff', 'jenny'});
The docs, at least what I've found so far, say that this can have a huge performance hit because the indexes are created locally on the nodes. All the examples given query the table like below:
SELECT * FROM chatter.channels WHERE users CONTAINS 'jeff';
From my understanding this would take the performance hit because the partition key is not specified and all nodes must be queried. However, if I were to issue a query like the one below
SELECT * FROM chatter.channels WHERE id = 'ch1' AND users CONTAINS 'jeff';
(giving the partition key) then would I still have the performance hit?
How would I be able to check this for myself? In SQL I can run EXPLAIN and get some useful information. Is there something similar in Cassandra?
Cassandra provides a tracing capability; it helps you follow the progression of reads and writes of queries through the cluster.
To view traces, open cqlsh on one of your Cassandra nodes and run the following commands:
cqlsh> TRACING ON;
Now tracing is enabled for requests.
cqlsh> USE [KEYSPACE];
Every query you run after that prints its trace, which is the closest thing Cassandra has to EXPLAIN. I hope this helps in checking the performance of the query.
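For example, with tracing enabled you can run the query from the question directly (a sketch using the chatter.channels table defined above):
cqlsh> TRACING ON;
cqlsh> SELECT * FROM chatter.channels WHERE id = 'ch1' AND users CONTAINS 'jeff';
After the result rows, cqlsh prints a "Tracing session" table listing each internal step (which node handled it, the index lookup, the partitions read) along with the elapsed time in microseconds, so you can compare the cost of the query with and without the partition key.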

How to create full / advanced text search in ScyllaDB and Cassandra?

I have installed the latest versions of ScyllaDB and Cassandra on my CentOS machine. I have tried ALLOW FILTERING in a SELECT query, but that is not what I need; I want advanced search or full-text search. I have googled it but couldn't find any solution, and when I create indexes and try to run the SELECT query it gives the error "server error: not implemented: indexes".
Can anyone help me, please?
Scylla is actively working on enabling secondary indexes and expects to have a working solution in the 2.2 release:
http://www.scylladb.com/product/technology/scylla-roadmap/
To support full-text search with Scylla today, an auxiliary solution such as Solr or Elasticsearch is needed; the following link explains how to combine Scylla and Elasticsearch:
http://www.scylladb.com/2017/08/03/data-analytics-elastic-scylla/
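A rough sketch of that combination, assuming the Python cassandra-driver and a 7.x-style elasticsearch client (the products table, index name and fields are placeholders, not taken from the linked post): the authoritative row is written to Scylla, a searchable copy is indexed in Elasticsearch, and a search queries Elasticsearch first and then fetches the full rows back from Scylla by primary key.
from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

session = Cluster(["127.0.0.1"]).connect("catalog")     # Scylla (or Cassandra)
es = Elasticsearch(["http://127.0.0.1:9200"])

def save_product(pid, name, description):
    # The source of truth lives in Scylla ...
    session.execute(
        "INSERT INTO products (id, name, description) VALUES (%s, %s, %s)",
        (pid, name, description))
    # ... and a searchable copy lives in Elasticsearch, keyed by the same id.
    es.index(index="products", id=pid, body={"name": name, "description": description})

def search_products(text):
    # Full-text matching happens in Elasticsearch ...
    hits = es.search(index="products", body={"query": {"match": {"description": text}}})
    ids = [h["_id"] for h in hits["hits"]["hits"]]
    # ... and the matching rows are fetched from Scylla by primary key.
    return [session.execute("SELECT * FROM products WHERE id = %s", (pid,)).one()
            for pid in ids]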
If you are using Cassandra version 3.4 or above, you can use SSTable Attached Secondary Indexes (SASI).
Using CQL, SSTable attached secondary indexes (SASI) can be created on a non-collection column defined in a table. Secondary indexes are used to query a table using a column that is not normally queryable, such as a non-primary-key column. SASI implements three types of indexes: PREFIX, CONTAINS, and SPARSE.
Learn more at: https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useSASIIndex.html
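For example (a sketch against a hypothetical users table; the table, column and index names are placeholders), a CONTAINS-mode SASI index makes LIKE queries with a leading wildcard possible:
CREATE TABLE users (
id uuid PRIMARY KEY,
userid text
);
CREATE CUSTOM INDEX users_userid_idx ON users (userid)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'CONTAINS'};
// Matches 'jon', 'jony', 'jonathan', ... (matching is case-sensitive by default)
SELECT * FROM users WHERE userid LIKE '%jon%';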
Or you could use Apache Solr or Elasticsearch: whenever searchable data is created, updated or deleted, you index or delete the corresponding data in Solr or Elasticsearch.

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation it is possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use those.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can confirm that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that CassandraRDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val (start, end) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
val cassRDD = sc.cassandraTable("keyspace name", "table name").filter(row => !row.getDate("timestamp").before(start) && row.getDate("timestamp").before(end))
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the plain Cassandra driver, which is not integrated with Spark, or you can have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know whether it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option, though the least efficient, is to bring the full data set into your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!

Regular expression search or LIKE-type feature in Cassandra

I am using DataStax Cassandra version 2.0.
How do we search for a value in a Cassandra column using a regular expression? Is there a way to achieve 'LIKE' (as in SQL) functionality?
I have created a table with the schema below.
CREATE TABLE Mapping (
id timeuuid,
userid text,
createdDate timestamp,
createdBy text,
lastUpdateDate timestamp,
lastUpdateBy text,
PRIMARY KEY (id,userid)
);
I inserted a few test records as below.
id | userid | createdby
-------------------------------------+----------+-----------
30c78710-c00c-11e3-bb06-1553ee5e40dd | Jon | admin
3e673aa0-c00c-11e3-bb06-1553ee5e40dd | Jony | admin
441c4210-c00c-11e3-bb06-1553ee5e40dd | Jonathan | admin
I need to search for records where userid contains the word 'jon', so that the results include all records containing jon, jony, jonathan.
I know there is no SQL LIKE functionality in Cassandra.
But is there any way to achieve it in Cassandra?
(NOTE: I am using the datastax-java driver as the client API.)
Are you using DSE or the community version? In the case of DSE, consider having a Solr node for these types of queries. If not, maybe use something like Lucene / Solr as an inverted index outside of Cassandra for that particular functionality. That may be a hassle if all you have is a Cassandra setup, in which case keep a manual inverted index as Ananth suggested. One option is to keep rows of 2-3 character prefixes that hold indices to the partitions: you query those, find the appropriate partitions client-side, and then issue another query against the target data (see the sketch below).
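A minimal sketch of such a prefix lookup table, against the Mapping table above (the table name, the prefix length and the lowercasing are illustrative choices, and the prefix rows have to be maintained by the application on every write):
CREATE TABLE mapping_by_prefix (
prefix text,   // first 3 characters of the lowercased userid
id timeuuid,
userid text,
PRIMARY KEY (prefix, id)
);
// Step 1: find candidate rows from the prefix partition.
SELECT id, userid FROM mapping_by_prefix WHERE prefix = 'jon';
// Step 2: fetch the target data from Mapping using the ids returned above, e.g.
SELECT * FROM Mapping WHERE id = 30c78710-c00c-11e3-bb06-1553ee5e40dd AND userid = 'Jon';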
There is a Lucene index for Cassandra. You can use it on the community edition too and perform regex searches.
You don't have a regular-expression check in CQL for now. The basic use of Cassandra is to have it function as big data storage; the kind of functionality you asked for can be handled in your application code in an optimised manner. If you still insist on this usage, my suggestion would be this:
Column family 1:
Id - a unique id for your userid
Name - jonny (or any name you would like to use)
Combinations - j, jo, jon, etc., and all the prefix combinations you want
Query this and get the appropriate id for your search term.
Use that id in your column family instead of the name directly, and query using that id.
Try to normalise such operations as much as possible. Cassandra is your base to build on: it provides availability of crucial data, not the flexibility of SQL.
