Range Queries in Cassandra (CQL 3.0)

One main part of Cassandra that I don't fully understand is its range queries. I know that Cassandra emphasizes distributed environments and focuses on performance, but probably because of that, it currently supports only the few types of range queries that it can execute efficiently. What I would like to know is: which types of range queries are supported by Cassandra?
As far as I know, Cassandra supports the following range queries:
1: Range queries on the partition key with the TOKEN keyword, for example:
CREATE TABLE only_int (int_key int PRIMARY KEY);
...
select * from only_int where token(int_key) > 500;
2: Range queries with one equality condition on a secondary index, using the keyword ALLOW FILTERING, for example:
CREATE TABLE example (
int_key int PRIMARY KEY,
int_non_key int,
str_2nd_idx ascii
);
CREATE INDEX so_example_str_2nd_idx ON example (str_2nd_idx);
...
select * from example where str_2nd_idx = 'hello' and int_non_key < 5 allow filtering;
But I am wondering if I have missed something, and I am looking for a canonical answer that lists all types of range queries supported by current CQL (or some workaround that allows more types of range queries).

You should look into clustering keys.
A primary key is formed by a partition key followed by optional clustering keys.
For example, a definition like this one
CREATE TABLE example (
int_key int,
int_non_key int,
str_2nd_idx ascii,
PRIMARY KEY((int_key), str_2nd_idx)
);
will allow you to make queries like this without using TOKEN:
select * from example where str_2nd_idx < 'hello' allow filtering;
Before creating a table in Cassandra, you should start by thinking about the queries you want to ask of your data model.
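If you also restrict the partition key with an equality condition, the range on the clustering column is served from a contiguous slice and ALLOW FILTERING is no longer needed. A minimal sketch against the example table above (the key value 1 is just a placeholder):
select * from example where int_key = 1 and str_2nd_idx < 'hello';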

Apart from the queries you mentioned, you can also run queries on "composite key" column families (you need to design your DB using composite keys, if that fits your constraints). For an example/discussion of this, take a look at Query using composite keys, other than Row Key in Cassandra. When using composite keys you can perform other types of queries, namely "range" queries that do not use the "partition key" (the first element of the composite key) - normally you need to add ALLOW FILTERING to permit these queries - and you can also perform ORDER BY operations on those elements, which can be very interesting in many situations. I do think that composite key column families allow you to overcome several (necessary) "limitations" (there to guarantee performance) of the Cassandra data model when compared with the "extremely flexible" (but slow) model of an RDBMS...
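As a hedged illustration (the table and column names here are invented for the sketch), a composite primary key lets you range over and ORDER BY a clustering column within a single partition:
CREATE TABLE events_by_day (
    day text,
    ts timestamp,
    event_id int,
    payload text,
    PRIMARY KEY ((day), ts, event_id)
);
-- range on the first clustering column, ordered, within one partition
SELECT * FROM events_by_day
WHERE day = '2013-05-01'
  AND ts >= '2013-05-01 00:00+0000'
  AND ts < '2013-05-01 12:00+0000'
ORDER BY ts DESC;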

1) Create table:
create table test3 (name text, id int, address text, num int, primary key (name, id, address)) with compact storage;
2) Inserting data into table:
insert into test3 (name, id, address, num) values ('prasad', 1, 'bangalore', 1);
insert into test3 (name, id, address, num) values ('prasad', 2, 'bangalore', 2);
insert into test3 (name, id, address, num) values ('prasad', 3, 'bangalore', 3);
insert into test3 (name, id, address, num) values ('prasad', 4, 'bangalore', 4);
3) Query:
select * from test3 where name = 'prasad' and id < 3;
4) Result:
 name   | id | address   | num
--------+----+-----------+-----
 prasad |  1 | bangalore |   1
 prasad |  2 | bangalore |   2

Related

Is there any way to use LIKE in a NoSQL command on a non-primary key?

I am selecting from a Cassandra database using the LIKE operator on a non-primary key:
select * from "TABLE_NAME" where "Column_name" LIKE '%SpO%' ALLOW FILTERING;
Error from server: code=2200 [Invalid query] message="LIKE restriction is only
supported on properly indexed columns. parameter LIKE '%SpO%' is not valid."
Simply put, "yes," there is a way to query with LIKE on a non-primary-key component. You can do this with a SASI (SSTable Attached Secondary Index) index. Here is a quick example:
CREATE TABLE testLike (key TEXT PRIMARY KEY, value TEXT);
CREATE CUSTOM INDEX valueIdx ON testLike (value)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS={'mode':'CONTAINS'};
As your query requires matching a string within a column, and not just a prefix or suffix, you'll want to pass the CONTAINS option on index creation.
After writing some data, your query works for me:
> SELECT * FROM testlike WHERE value LIKE '%SpO%';
key | value
-----+--------------
C | CSpOblahblah
D | DSpOblahblah
(2 rows)
WARNING!!!
This query is extremely inefficient, and will probably time out in a large cluster, unless you also filter by a partition key in your WHERE clause. It's important to understand that while this functionality works similarly to how a relational database would, Cassandra is definitely not a relational database. It is simply not designed to handle queries which incur a large amount of network time polling multiple nodes for data.
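A hedged sketch of the safer variant (reusing the testLike table above; the key value 'C' is just a placeholder):
-- restricting to one partition keeps the SASI scan local instead of cluster-wide
SELECT * FROM testlike WHERE key = 'C' AND value LIKE '%SpO%';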

How does ALLOW FILTERING work when we provide all of the partition keys?

I've read at least 50 articles on this and still don't know the answer ...
I know how partitioning, clustering and ALLOW FILTERING work, but I can't figure out what happens when ALLOW FILTERING is used with all partition keys provided in the query.
I have a table like this:
CREATE TABLE IF NOT EXISTS keyspace.events (
    date_string varchar,
    starting_timestamp bigint,
    event_name varchar,
    sport_id varchar,
    id timeuuid,
    PRIMARY KEY ((date_string), starting_timestamp, id)
);
How does a query like this work?
SELECT * FROM keyspace.events
WHERE
date_string IN ('', '', '') AND
starting_timestamp < '' AND
sport_id = 1 /* not in partitioning nor clustering key */
ALLOW FILTERING;
Is the sport_id filtering done on the records already retrieved via the correctly specified keys? Is ALLOW FILTERING still discouraged in this kind of query?
How should I perform filtering in this particular situation?
Thanks in advance
Yes, it should first narrow the query down to the partitions and only then do the filtering on the non-key value, as per the experiment described here: https://dzone.com/articles/apache-cassandra-and-allow-filtering
I think it's safe to use ALLOW FILTERING after all the keys in most cases.
It will also depend heavily on how much data you are filtering out - if the last condition, sport_id = 1, throws away most of the data, then it is a bad idea because it puts a lot of pressure on the database, so you need to consider the trade-offs here.
It's not a good idea to use an IN clause with the partition key - the above query especially doesn't look good because it uses both an IN clause on the partition key and ALLOW FILTERING.
Suggestion - Cassandra is very good at processing as many requests as you need in a second, and the design idea should be to send more, lighter queries at once rather than one query which does a lot of work. So my suggestion would be to fire N calls to Cassandra, each with an = condition on the partition key and without filtering the last column, then combine the results and do the final filter in the code (whichever language you are using, I assume it can send these calls to the database in parallel). By doing so you will gain performance in the long term as the data grows.
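As a hedged sketch (placeholder date and timestamp values, reusing the asker's keyspace.events table), each of the N lighter calls would look like this, with the sport_id filter moved into application code:
-- equality on the partition key, range on the first clustering key, no ALLOW FILTERING;
-- run one of these per date, in parallel, and merge the results client-side
SELECT * FROM keyspace.events
WHERE date_string = '2020-01-01'
  AND starting_timestamp < 1577923200000;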

An Approach to Cassandra Data Model

Please note that I am using NoSQL for the first time; pretty much every concept in this NoSQL world is new to me, having come from an RDBMS background for a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move it out of MySQL, where the transactional/relational model doesn't make sense. What I would get is the AP side of CAP [Availability and Partition tolerance].
The present data model is as simple as this:
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (string) | ENTITY_DATA (text) | CREATED_ON (date) | VERSION (integer)
We can safely assume that this part of the application is similar to activity logging!
I would like to move this to NoSQL as per my requirements, separate from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking at the map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data in the values!
After reading through User Defined Types in Cassandra: can I use a user-defined type as the value, which essentially gives me one key and multiple values? Otherwise, should I use normal columns without a user-defined type? One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to the same table, since the key varies from application to application and within an application each entity is unique!
No application/business function accesses this data without the key; in simple terms, there is no requirement to get data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp, ...)
);
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the WHERE clause). Any further filters in your query are then applied to the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with equality matches on subsequent clustering keys in order, plus a range query on the last clustering key specified in your query. Anyway, that's all about querying.
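To make that concrete, a minimal sketch against the event table above (the UUID literal is just a placeholder):
-- equality on the partition key hits a single partition; fast
select * from event where id = 11111111-1111-1111-1111-111111111111;
-- within that partition, a range on the first clustering key reads a contiguous slice
select * from event
where id = 11111111-1111-1111-1111-111111111111
  and timestamp > maxTimeuuid('2015-01-01 00:00+0000');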
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections - sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you need to query on it, or index it, or you know each row will have those columns. You're also free to use a map column, which lets you store "arbitrary" key-value data; which is what it seems you're looking to do.
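A hedged sketch of that shape (the type and column names are invented for illustration; UDTs inside collections must be frozen and require Cassandra 2.1+):
create type address (
    street text,
    city text,
    zip text
);
create table person (
    id uuid primary key,
    name text,
    addresses map<text, frozen<address>>,  -- e.g. 'home' -> { ... }
    attributes map<text, text>             -- "arbitrary" key-value data
);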
One thing to watch out for... your primary key is unique per record. If you do another insert with the same PK, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
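A tiny self-contained sketch of the upsert behaviour:
create table kv (k text primary key, v text);
insert into kv (k, v) values ('a', 'first');
insert into kv (k, v) values ('a', 'second');  -- no error: same key, so this overwrites
select * from kv where k = 'a';                -- one row, v = 'second'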
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources... so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very good choice.

How to get a range of data from Cassandra

[cqlsh 5.0.1 | Cassandra 2.1.0 | CQL spec 3.2.0 | Native protocol v3]
table:
CREATE TABLE dc.event (
id timeuuid PRIMARY KEY,
name text
) WITH bloom_filter_fp_chance = 0.01;
How do I get a time range of data from Cassandra?
For example, when I try select * from event where id > maxTimeuuid('2014-11-01 00:05+0000') and id < minTimeuuid('2014-11-02 10:00+0000'), as seen here: http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/timeuuid_functions_r.html
I get the following error: 'code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"'
Can I keep timeuuid as primary key and meet the requirement?
Thanks
Can I keep timeuuid as primary key and meet the requirement?
Not really, no. From http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html
WHERE clauses can include greater-than and less-than comparisons,
but for a given partition key, the conditions on the clustering column
are restricted to the filters that allow Cassandra to select a
contiguous ordering of rows.
You could try adding "ALLOW FILTERING" to your query... but I doubt that would work. And I don't know of a good way (nor do I believe there is one) to tokenize the timeuuids. I'm about 99% sure the ordering from the partitioner would yield unexpected, bad results, even though the query itself would execute and appear correct until you dug into it.
As an aside, you should really check out a similar question that was asked about a year ago: time series data, selecting range with maxTimeuuid/minTimeuuid in cassandra
Short answer: no. Long answer: you can do something similar, e.g.:
CREATE TABLE dc.event (
event_time timestamp,
id timeuuid,
name text,
PRIMARY KEY(event_time, id)
) WITH bloom_filter_fp_chance = 0.01;
The timestamp would presumably be truncated so that it reflects only a whole day (or hour, or minute, depending on the velocity of your data). Your WHERE clause would then include an IN clause for the timestamps covered by your timeuuid range.
If you use an appropriate chunking factor (how much you truncate your timestamp), you may even answer some of the questions you're looking for without using a range of timeuuids, just a simple where clause.
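A hedged sketch of such a query against the re-modelled table, assuming day-truncated event_time buckets (the dates are placeholders):
-- partition keys enumerated with IN, timeuuid range on the clustering key
SELECT * FROM dc.event
WHERE event_time IN ('2014-11-01', '2014-11-02')
  AND id > maxTimeuuid('2014-11-01 00:05+0000')
  AND id < minTimeuuid('2014-11-02 10:00+0000');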
Essentially this gives you the leeway to make the kind of query you're looking for while respecting Cassandra's restrictions. As Raedwald pointed out, you can't use the partition key in contiguous ranges because of the underpinning nature of Cassandra as a large hash map. That said, Cassandra is well known to do some incredibly powerful things with time-series data.
Take a look at how Newts is doing time series for ranges. The author has a great set of slides and a talk describing the data model to get precisely what you seem to be looking for. https://github.com/OpenNMS/newts/
Cassandra cannot do this kind of query because Cassandra is a key-value store implemented as a giant hash map, not a relational database. Just like with an in-memory hash map, the only way to find the keys within a sub-range is to iterate through all the keys. That can be expensive enough for an in-memory hash map, but for Cassandra it would be crippling.
Yes, you can do it by using Spark with Scala and the spark-cassandra-connector!
I think you should keep your partitions small by bucketing the partition key as 'YYYY-MM-dd hh:00+0000' and filtering on dates and hours only.
Then you could use something like:
import com.datastax.spark.connector._

case class TableKey(id: String)
val dates = Array("2014-11-02 10:00+0000", "2014-11-02 11:00+0000", "2014-11-02 12:00+0000")
val selected_data = sc.parallelize(dates).map(x => TableKey(x)).joinWithCassandraTable("dc", "event")
And there you have your selected-data RDD, which you can collect:
val data = selected_data.collect
I had a similar problem...

Cassandra 1.2: Updating a type in the primary key (CQL 3)

We currently have a table defined as below
create table tableA(id int,
seqno int,
data text,
PRIMARY KEY((id), seqno))
WITH CLUSTERING ORDER BY (seqno DESC);
We need to update the type of the id column from int to text. We are wondering which of the two approaches would be the most advisable:
ALTER TABLE tableA ALTER id TYPE varchar; (the command succeeds, but then we have issues reading the data. Is this because ALTER TABLE doesn't update the underlying storage of the id column?)
COPY to/from an old table/new table. This works, but we have issues with the RPC timeout (which we can change). Is this a bad idea on a table spread across a cluster?
We have checked the online docs and these are the only two options we can find. Are there other options?
Thanks
Paul
I would say option 1 isn't really supported. If your integers don't map to actual strings you're going to have problems; you're probably seeing key validation errors.
For option 2, you probably just need to copy smaller chunks of data for each read/write.
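A hedged sketch of the COPY route in cqlsh (the new table name tableA_new is invented for illustration; chunking would be done by exporting/importing subsets of rows):
-- export the old table, create the new schema with id as text, then re-import
COPY tableA (id, seqno, data) TO 'tableA.csv';
CREATE TABLE tableA_new (
    id text,
    seqno int,
    data text,
    PRIMARY KEY ((id), seqno)
) WITH CLUSTERING ORDER BY (seqno DESC);
COPY tableA_new (id, seqno, data) FROM 'tableA.csv';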
