Querying data using partial data match? - cassandra

Given a table like:
CREATE TABLE customers (
id    bigint,
email text,
fullname text,
PRIMARY KEY (id)
);
I'd like to have the capability to occasionally search using a partial data match on email address or full name.
Apache Cassandra has support for SASI which I think enables this kind of querying.
How would this be done best when using CosmosDB and its Cassandra API?

Cosmos DB does not have SASI support but there is limited support for secondary indexes with the Cassandra API (see Wire protocol support for details).
You can create an index on email with:
CREATE INDEX customers_email_idx ON customers (email);
And you can query the table with:
SELECT id, fullname FROM customers WHERE email = ?
But partial matches on fullname are a challenge, since you would need SASI for them. Cheers!

Related

How to model data using Cassandra and Ignite together?

I'm researching how to model data when using Cassandra and Ignite together. So far the basic recommendation for data modeling in Cassandra (coming from this article) is clear: "model data around your queries". The author gives an example of "user lookup": we want to look up users by their username or their email, and according to him the best approach is to have two tables:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int
);
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
age int
);
However, things get confusing with Ignite on top of Cassandra. Unfortunately I could not find any helpful examples or answers to the following questions:
Does having multiple tables that store user information mean having an Ignite cache for each of these tables?
Does having a compound primary key mean introducing a new type for each key and using it as the Ignite cache key?
Having Ignite means not having direct reads from Cassandra. Does it even make sense to bother modeling data following NoSQL best practices? Would it be OK to just have one user table and let Ignite take care of queries by username or email?
CREATE TABLE users (
id uuid PRIMARY KEY,
username text,
email text,
age int
);
You should probably have one cache per Cassandra table.
If your original key is compound, the Ignite key should be compound too.
You will need to use secondary indexes in Ignite to query by more than one field, and this means you will have to hold all data in Ignite (which is NOT necessary for a pure caching scenario). This means enabling readThrough and writeThrough, doing loadCache and always doing all updates through Ignite. You will have to choose between "Ignite as cache for Cassandra" (stick to Cassandra's data layout, can hold partial data) and "Ignite as DB backed by Cassandra" (you can use a layout optimal for Ignite, with secondary indexes).
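To make the readThrough / writeThrough distinction concrete, here is a minimal Python sketch. It is not the Ignite API: `BackingStore`, `ThroughCache` and `find_by` are hypothetical names that only imitate the behaviour described above, with a dictionary standing in for the Cassandra table.

```python
class BackingStore:
    """Stand-in for a Cassandra table keyed by its primary key (hypothetical)."""

    def __init__(self):
        self.rows = {}
        self.reads = 0  # counts how often the store is actually hit

    def load(self, key):
        self.reads += 1
        return self.rows.get(key)

    def write(self, key, row):
        self.rows[key] = row


class ThroughCache:
    """Toy cache with read-through and write-through behaviour."""

    def __init__(self, store):
        self.store = store
        self.entries = {}

    def get(self, key):
        # Read-through: on a miss, load the row from the store and keep it.
        if key not in self.entries:
            row = self.store.load(key)
            if row is None:
                return None
            self.entries[key] = row
        return self.entries[key]

    def put(self, key, row):
        # Write-through: update the store and the cache together.
        self.store.write(key, row)
        self.entries[key] = row

    def find_by(self, field, value):
        # A query on a non-key field only sees rows already held in the cache.
        return [r for r in self.entries.values() if r.get(field) == value]
```

In this toy model, `find_by` misses any row that has not been loaded yet, which mirrors why querying by a secondary field requires holding the whole data set in the cache.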

Azure Table Storage data modeling considerations

I have a list of users. A user can log in using either a username or an e-mail address.
As a beginner in Azure Table Storage, this is the data model I use for a fast index scan.
PartitionKey     RowKey            Property
users:email      jacky#email.com   nickname:jack123
users:username   jack123           email:jacky#email.com
So when a user logs in via email, I would supply PartitionKey eq users:email in the Azure table query. If it is a username, PartitionKey eq users:username.
Since it doesn't seem possible to simulate contains or like in an Azure table query, I'm wondering whether it is normal practice to store multiple rows of data for one user?
This is a perfectly valid practice and in fact is a recommended practice. Essentially you will have to identify the attributes on which you could potentially query your table storage and somehow use them as a combination of PartitionKey and RowKey.
Please see Guidelines for table design for more information. From this link:
Consider storing duplicate copies of entities. Table storage is cheap so consider storing the same entity multiple times (with different keys) to enable more efficient queries.
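As a rough illustration of the duplicate-entity pattern, the Python sketch below builds the two rows from the question and the filter string for an exact lookup. The helper names `login_entities` and `point_filter` are made up for this example; entities are shown as plain dictionaries rather than through any particular SDK.

```python
def login_entities(email, username):
    """Return one entity per login attribute for the same user."""
    return [
        {"PartitionKey": "users:email", "RowKey": email, "nickname": username},
        {"PartitionKey": "users:username", "RowKey": username, "email": email},
    ]


def point_filter(partition_key, row_key):
    """Build an OData-style filter string for an exact key lookup."""
    # Single quotes inside a value are escaped by doubling them.
    pk = partition_key.replace("'", "''")
    rk = row_key.replace("'", "''")
    return f"PartitionKey eq '{pk}' and RowKey eq '{rk}'"
```

Both entities have to be written (and later updated or deleted) together, which is the price paid for being able to look the user up by either attribute.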

Search / Filter on primary key

I need to filter on a column, with something like "SELECT * FROM codes WHERE code='a';", to get all codes that start with "a", that is: "aa", "ab", "ac".
CREATE TABLE codes (
code text,
PRIMARY KEY (CODE)
);
Do you know how?
A LIKE search (% wildcards in SQL) is not possible in Cassandra.
The only way to do this efficiently is to use a full-text search engine like https://github.com/tjake/Solandra (Solr-on-Cassandra).
DataStax Enterprise has an integrated Solr feature for such queries, but it still carries a read performance hit:
Step 1) Solr searches and returns the list of matching keys.
Step 2) Those keys must then be looked up across the entire cluster to fetch the data, which again depends on the consistency level.
My recommendation is to avoid such queries; Cassandra is not designed for them.

Regular expression search or LIKE type feature in cassandra

I am using DataStax Cassandra version 2.0.
How do we search a Cassandra column for a value using a regular expression? Is there a way to achieve 'LIKE' (as in SQL) functionality?
I have created table with below schema.
CREATE TABLE Mapping (
id timeuuid,
userid text,
createdDate timestamp,
createdBy text,
lastUpdateDate timestamp,
lastUpdateBy text,
PRIMARY KEY (id,userid)
);
I inserted few test records as below.
id | userid | createdby
-------------------------------------+----------+-----------
30c78710-c00c-11e3-bb06-1553ee5e40dd | Jon | admin
3e673aa0-c00c-11e3-bb06-1553ee5e40dd | Jony | admin
441c4210-c00c-11e3-bb06-1553ee5e40dd | Jonathan | admin
I need to search for records where userid contains the word 'jon', so that the results include all records containing Jon, Jony and Jonathan.
I know there is no SQL LIKE functionality in Cassandra.
But is there any way to achieve it in Cassandra?
(NOTE: I am using datastax-java driver as client api).
Are you using DSE or the community version? In case of DSE, consider having a Solr node for these types of queries. If not, maybe use something like lucene / solr as an inverted index outside of cassandra for that particular functionality. That may be a hassle if all you have is cassandra set up, in which case, have a manual inverted index as Ananth suggested. One option is to keep rows of 2-3 character prefixes that hold indices to partitions. You could query those, find the appropriate partitions client side and then issue another query against the target data.
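The prefix-row idea from this answer can be sketched in plain Python as follows. This is only a model of the two-step lookup, with dictionaries standing in for the Cassandra tables; the function names are hypothetical, and it covers "starts with" matches only, not arbitrary "contains" matches.

```python
from collections import defaultdict


def prefixes(value, min_len=2, max_len=3):
    """Lower-cased leading prefixes of the given lengths."""
    v = value.lower()
    return [v[:n] for n in range(min_len, max_len + 1) if len(v) >= n]


def build_prefix_index(rows):
    """Map each 2-3 character prefix to the ids of rows whose userid starts with it."""
    index = defaultdict(set)
    for row_id, userid in rows:
        for p in prefixes(userid):
            index[p].add(row_id)
    return index


def search(index, userid_by_id, term):
    """Step 1: read the prefix row. Step 2: fetch candidates and filter client side."""
    t = term.lower()
    if len(t) < 2:
        return []  # shorter than the smallest indexed prefix
    candidates = index.get(t[:3], set())
    return sorted(
        userid_by_id[i] for i in candidates
        if userid_by_id[i].lower().startswith(t)
    )
```

With the sample data from the question, a search for "jon" would read the single "jon" prefix row and then fetch the three matching records by id.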
There is a Lucene index for Cassandra. You can use it on the community edition too and perform regex searches.
CQL does not have a regular expression check for now. Cassandra is basically used as a big data store. The kind of functionality you asked for can be done in your own code in an optimised manner. If you still insist on this usage, my suggestion would be this:
Column family 1:
Id - a unique id for your userid
Name - jonny (or any name you would like to use)
Combinations - j, jon, etc., and all the combinations you want to match
Query this and get the appropriate id for your query.
Use that id in your column family instead of the name directly. Query using that id.
Try to normalise such operations as much as possible. Cassandra is like your base to control. It provides availability of crucial data, not the flexibility of SQL.
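One way to generate the "combinations" for such a lookup column family is sketched below in Python. It assumes that "combinations" means every contiguous substring of the lower-cased name, which is an interpretation rather than something the answer spells out; `combinations` and `lookup_rows` are hypothetical helper names.

```python
def combinations(name, min_len=1):
    """All distinct lower-cased substrings of name with at least min_len characters."""
    n = name.lower()
    return sorted({
        n[i:j]
        for i in range(len(n))
        for j in range(i + min_len, len(n) + 1)
    })


def lookup_rows(user_key, name):
    """Rows for the lookup column family, as (combination, user_key) pairs."""
    return [(c, user_key) for c in combinations(name)]
```

The number of substrings grows quadratically with the length of the name, so raising `min_len` or indexing only prefixes may be needed to keep the lookup table small.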

Hector support for CQL3 specific features (Partition & Clustering keys) and Compact Storage option

I'm trying to leverage a specific feature of Apache Cassandra CQL3: partition and clustering keys for tables that are created with the compact storage option.
For example:
CREATE TABLE EMPLOYEE(id uuid, name text, field text, value text, primary key(id, name , field )) with compact storage;
I've created the table via CQL3 and I'm able to insert rows successfully using the Hector API.
But I couldn't find the right set of options in the Hector API to create the table itself as I require.
To elaborate a little bit more:
In ColumnFamilyDefinition.java I couldn't see an option for setting the storage option (compact storage), and in ColumnDefinition.java I couldn't find an option to say that a column is part of the partition and clustering keys.
Could you please tell me whether I can use Hector for this (i.e. creating the table) and, if I can, which options I need to provide?
If you are not tied to Hector, you could look into the DataStax Java Driver which was created to use CQL3 and Cassandra's binary protocol.
