Cassandra supports CONTAINS on collections:
CREATE TABLE contacts (
id int PRIMARY KEY,
firstName text,
lastName text,
phones map<text, text>,
emails set<text>
);
CREATE INDEX ON contacts (firstName);
CREATE INDEX ON contacts (keys(phones)); // Using the keys function to index the map keys
CREATE INDEX ON contacts (emails);
It is then possible to query the emails set and check for a specific email address:
SELECT * FROM contacts WHERE emails CONTAINS 'Benjamin#oops.com';
What would be the solution if one wants to check for the absence of an element, something like DOES NOT CONTAIN? I couldn't find such functionality in the CQL docs. Is there any solution for that?
No, Cassandra does not support such a feature. I guess you have already gone through the article below:
A deep look at the CQL WHERE clause
You have to fetch the whole collection and filter it in your application.
Cassandra also doesn't support the "is not null" operator or the "not equals" operator.
These restrictions come from how C* stores data and how it locates rows and scans columns quickly. C* stores data as key-value pairs (a map of maps). It can fetch data quickly by indexing the keys, and it can find a particular item very fast by jumping to the location where the data resides. If you index collections:
Sets and lists can index all values found by indexing the collection column. Maps can index a map key, map value, or map entry.
To support a 'not' condition, C* would have to read every row, or every item in a collection, and then filter the results, which is not very efficient. So C* does not support this. If you need it, you can handle it in your application, knowing all the facts and considerations.
Note: Using a C* index has its own performance impact. Make sure you know all the considerations and use cases.
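A minimal sketch of that application-side filter, in Python. The rows are made-up, in-memory stand-ins for what a driver might return from "SELECT id, emails FROM contacts"; the helper name does_not_contain is hypothetical, and the None value assumes an empty collection is read back as null.

```python
# Made-up rows standing in for the result of: SELECT id, emails FROM contacts
rows = [
    {"id": 1, "emails": {"benjamin@example.com", "ben@example.org"}},
    {"id": 2, "emails": {"alice@example.com"}},
    {"id": 3, "emails": None},  # assumption: an empty set comes back as null
]

def does_not_contain(rows, column, value):
    """Return the rows whose collection column lacks the given value."""
    return [row for row in rows if value not in (row[column] or ())]

missing = does_not_contain(rows, "emails", "benjamin@example.com")
print([row["id"] for row in missing])  # prints [2, 3]
```

Note that this reads every row, which is exactly the cost Cassandra refuses to hide behind a query operator.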
All the cautions about using secondary indexes apply to indexing collections.
Indexing a collection
When to use an index
Cassandra CQL mapping -> How C* stores collections.
Related
I want to use the IN clause on a non-primary-key column in Cassandra. Is it possible? If it is not, is there any alternative or suggestion?
Three possible solutions
Create a secondary index. This is not recommended due to performance problems.
See if you can designate that column in the existing table as part of the primary key.
Create another denormalised table that is optimised for your query, i.e. model the data by query pattern.
Update:
Even after you move that column into the primary key, operations with the IN clause can be optimised further. I found this very useful: cassandra lookup by list of primary keys in java
I have a table in my Cassandra DB with columns userid, city1, city2 and city3. What would my query be if I wanted to retrieve all users that have "Paris" as a city? I understand Cassandra doesn't have OR so I'm not sure how to structure the query.
First, it depends heavily on the structure of the table. If you have userid as the partition key, you can of course use a secondary index to search for users by city, but that is not optimal because it is a fan-out call: the request is sent to all nodes in the cluster. You can redesign to use a materialized view with city as the partition key, but you may have problems if some cities have a lot of users.
In general, if you need to select several values in the same column, you can use the IN operator, but it's better not to use it on partition keys (parallel queries are better). If you need OR across different columns, you need to run parallel queries and collect the results on the application side.
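The "parallel queries, merge in the application" approach could be sketched like this in Python. Everything here is an illustration under assumptions: USERS is made-up in-memory data, and query_by_column is a hypothetical stand-in for one real query per city column (which in practice would still need an index or a table keyed by city).

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up, in-memory stand-in for the users table.
USERS = [
    {"userid": 1, "city1": "Paris", "city2": "Rome", "city3": "Oslo"},
    {"userid": 2, "city1": "Lima", "city2": "Paris", "city3": "Kyiv"},
    {"userid": 3, "city1": "Bern", "city2": "Riga", "city3": "Cairo"},
    {"userid": 4, "city1": "Paris", "city2": "Paris", "city3": "Nice"},
]

def query_by_column(column, city):
    """Hypothetical stand-in for one query: SELECT ... WHERE <column> = ?"""
    return [row for row in USERS if row[column] == city]

def users_in_city(city, columns=("city1", "city2", "city3")):
    """Run one query per column in parallel and union the results by userid."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(columns)) as pool:
        for rows in pool.map(lambda col: query_by_column(col, city), columns):
            for row in rows:
                merged[row["userid"]] = row  # de-duplicate users matched twice
    return sorted(merged)

print(users_in_city("Paris"))  # prints [1, 2, 4]
```

Keying the merged results by userid matters because a user can match in more than one column, as user 4 does here.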
I have a question about querying a Cassandra collection.
I want to write a query that searches within a collection.
CREATE TABLE rd_db.test1 (
testcol3 frozen<set<text>> PRIMARY KEY,
testcol1 text,
testcol2 int
)
That is the table structure...
and
this is the table contents.
In this situation, I want to write a CQL query that matches any of several alternative values in the set column.
If it were SQL and testcol3 weren't a collection, I would write:
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
But it is CQL and a collection, so I tried:
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING ;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING ;
But these two queries didn't work...
Please help...
This won't work for you, for multiple reasons:
there is no OR operation in CQL
you can only do a full match on the value of the partition key (testcol3)
although you may create secondary indexes on fields with a collection type, it's impossible to create an index on the values of the partition key
You need to change the data model, but to do that you need to know in advance the queries you will be executing. From a brief look at your data model, I would suggest unrolling the set field into multiple rows, with the individual values corresponding to individual partitions.
I also suggest taking the DS201 and DS220 courses on the DataStax Academy site for a better understanding of how Cassandra works and how to model data for it.
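As a rough illustration of what unrolling the set means, here is a Python sketch over made-up in-memory rows. The column name element and the helpers unroll and lookup are hypothetical; in a real schema, element would be the partition key of the new table and lookup would be a single-partition SELECT.

```python
# Made-up rows of the original table: the whole set is the partition key.
original_rows = [
    {"testcol3": frozenset({"1", "4"}), "testcol1": "a", "testcol2": 10},
    {"testcol3": frozenset({"5"}), "testcol1": "b", "testcol2": 20},
    {"testcol3": frozenset({"2", "3"}), "testcol1": "c", "testcol2": 30},
]

def unroll(rows):
    """Emit one row per set element; 'element' plays the role of the new partition key."""
    for row in rows:
        for element in sorted(row["testcol3"]):
            yield {
                "element": element,
                "testcol1": row["testcol1"],
                "testcol2": row["testcol2"],
            }

unrolled = list(unroll(original_rows))

def lookup(element):
    """Stand-in for: SELECT ... WHERE element = ? (a single-partition read)."""
    return [r for r in unrolled if r["element"] == element]

# "contains '4' OR contains '5'" becomes two single-partition lookups.
matches = lookup("4") + lookup("5")
print([m["testcol1"] for m in matches])  # prints ['a', 'b']
```

If one original row contained both '4' and '5', it would appear twice in matches, so the application would have to de-duplicate.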
I'm trying to learn Cassandra but I'm confused by the terminology.
In many places it says that a row stores key/value pairs.
But when I define a table, it's more like declaring a SQL table, i.e. you create a table and specify the column names and data types.
Can someone clarify this?
Cassandra is a column-based NoSQL database. While at its lowest level it does store simple key-value pairs, it stores these key-value pairs in collections. This grouping of keys and collections is analogous to rows and columns in a traditional relational model. Cassandra tables have a schema and can be queried (with restrictions) using a SQL-like language called CQL.
In your comment you ask about apples being stored in a different table from oranges. The answer to that specific question is no, they will be in the same table. However, Cassandra tables have an additional concept called the Partition Key that doesn't really have an analogous concept in the relational world. Take for example the following table definition:
CREATE TABLE fruit_types (
fruit text,
location text,
cost float,
PRIMARY KEY ((fruit), location)
);
In this table definition you will notice that we are defining the schema for the table. You will also notice that we are defining a PRIMARY KEY. This primary key is similar to, but not exactly like, the relational concept. In Cassandra the PRIMARY KEY is made up of two parts: the PARTITION KEY and the CLUSTERING COLUMNS. The PARTITION KEY is the first element specified in the PRIMARY KEY and can contain one or more fields enclosed in parentheses. The PARTITION KEY is hashed to determine the node that owns the data, and it is also used to physically divide the information on disk into files. The CLUSTERING COLUMNS are the other columns listed in the PRIMARY KEY and, among other things, define how the data is physically ordered on disk within each partition. I suggest you do some additional reading on the PRIMARY KEY here if you're interested in more detail:
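To make the two roles concrete, here is a toy Python model of the fruit_types table. It is only a sketch under loud assumptions: the node names are hypothetical, MD5 with a simple modulo stands in for Cassandra's real partitioner and token ring, and the rows are made up.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

def owner(partition_key, nodes=NODES):
    """Hash the partition key and map it to a node (simple modulo, not a real token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]

# Made-up rows: (fruit, location, cost)
rows = [
    ("apple", "Boston", 1.25),
    ("apple", "Austin", 1.10),
    ("orange", "Boston", 0.80),
]

# Group rows by partition key, then order each partition by the clustering column.
partitions = {}
for fruit, location, cost in rows:
    partitions.setdefault(fruit, []).append((location, cost))
for clustered in partitions.values():
    clustered.sort()

for fruit, clustered in partitions.items():
    print(fruit, owner(fruit), clustered)
```

All rows for 'apple' land on the same node and are kept sorted by location, which is why a query that names the partition key is cheap and one that does not has to visit every node.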
https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_compound_keys_c.html
Basically, Cassandra storage is like a sparse matrix. Earlier versions had a command-line tool called cassandra-cli which could show the exact storage footprint of your column family (a.k.a. table in later versions). Later the community decided to keep an RDBMS-like syntax for easier understanding, which is why the query language (CQL) syntax is similar to SQL.
The main storage unit is the key (partition), which is the hash-function result of the chosen partition column in your table; the rest of the columns are tagged to it like a sparse matrix.
I have a table like this
CREATE TABLE my_table(
category text,
name text,
PRIMARY KEY((category), name)
) WITH CLUSTERING ORDER BY (name ASC);
I want to write a query that will sort by name through the entire table, not just each partition.
Is that possible? What would be the "Cassandra way" of writing that query?
I've read other answers on StackOverflow, and some examples created a single partition with one id (bucket) as the primary key, but I don't want that because I want my data spread across the nodes by category.
Cassandra doesn't support sorting across partitions; it only supports sorting within partitions.
So what you could do is query each category separately and it would return the sorted names for each partition. Then you could do a merge of those sorted results in your client (which is much faster than a full sort).
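The client-side merge could look roughly like this in Python, using heapq.merge from the standard library. The per-category lists are made-up stand-ins for the results of one query per partition, each already sorted by name.

```python
import heapq

# Made-up per-partition results, each already sorted by name.
results_by_category = {
    "fruit": ["apple", "cherry", "mango"],
    "tools": ["anvil", "drill"],
    "toys": ["ball", "kite", "yo-yo"],
}

def merged_names(results):
    """Lazily merge already-sorted lists into one globally sorted stream."""
    return heapq.merge(*results.values())

print(list(merged_names(results_by_category)))
# prints ['anvil', 'apple', 'ball', 'cherry', 'drill', 'kite', 'mango', 'yo-yo']
```

Because heapq.merge is lazy, the client only needs to hold one pending item per partition rather than the whole table.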
Another way would be to use Spark to read the table into an RDD and sort it inside Spark.
Always model Cassandra tables around the access patterns (relational databases and Cassandra fill different needs).
Up to Cassandra 2.x, one had to model a new column family (table) for each access pattern. So if your access pattern needs a specific column to be sorted, model a table with that column in the partition/clustering key. The code then has to insert into both the master table and the projection table. Note that, depending on your business logic, this may be difficult to synchronise if there are concurrent updates, especially if an update has to be performed after a read on the projections.
With Cassandra 3.x, there are now materialized views, which give you a similar feature but are handled internally by Cassandra. I'm not sure whether they fit your problem, as I haven't played much with 3.x, but they may be worth investigating.
More on materialized views on their blog.