Cassandra query table based on row range

I am new to Cassandra. I am using Cassandra 3.0 and the DataStax Java driver for development. I would like to know whether Cassandra provides any option to fetch data based on a rowkey range?
Something like:
select * from <table-name> where rowkey > ? and rowkey < ?;
If not, is there any other option in Cassandra (Java/CQL) to fetch data based on row ranges?

Unfortunately, there really isn't a mechanism in Cassandra that works in the way you are asking. The only way to run a range query on your partition keys (rowkey) is with the token function. This is because Cassandra orders its rows in the cluster by the hashed token value of the partition key. That value would not really have any meaning for you, but it would allow you to "page" through a large table without encountering timeouts.
SELECT * FROM <table-name>
WHERE token(rowkey) > -9223372036854775807
AND token(rowkey) < -5534023222112865485;
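If you keep track of the last partition key you received, you can page forward by feeding it back into the token function. A minimal sketch, assuming rowkey is a text column and <table-name> and 'last-rowkey-seen' are placeholders:
SELECT * FROM <table-name>
WHERE token(rowkey) > token('last-rowkey-seen')
LIMIT 1000;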
The way to go about range querying on meaningful values is to find a value to partition your rows by, and then cluster by a numeric or time value. For example, I can query a table of events by date range if I partition my data by month (PRIMARY KEY(month,eventdate)):
aploetz#cqlsh:stackoverflow> SELECT * FROM events
WHERE monthbucket='201509'
AND eventdate > '2015-09-19' AND eventdate < '2015-09-26';
monthbucket | eventdate | beginend | eventid | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
201509 | 2015-09-25 06:00:00+0000 | B | a223ad16-2afd-4213-bee3-08a2c4dd63e6 | Hobbit Day
201509 | 2015-09-25 05:59:59+0000 | E | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-22 06:00:00+0000 | B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-20 05:59:59+0000 | E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
201509 | 2015-09-19 06:00:00+0000 | B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
(5 rows)
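For reference, a table definition that would support that query might look something like this (a sketch only; the beginend and eventid clustering columns are assumed from the output above to keep rows unique):
CREATE TABLE events (
    monthbucket text,
    eventdate timestamp,
    beginend text,
    eventid uuid,
    eventname text,
    PRIMARY KEY ((monthbucket), eventdate, beginend, eventid))
WITH CLUSTERING ORDER BY (eventdate DESC, beginend ASC, eventid ASC);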

Related

Cassandra CLUSTERING ORDER does not order data properly

I created a table that has timestamps in it, but when I try to use CLUSTERING ORDER BY on the timestamp column, it is not ordered properly.
To create the table I wrote:
CREATE TABLE videos_by_tag (
    tag text,
    video_id uuid,
    added_date timestamp,
    title text,
    PRIMARY KEY ((tag), added_date, video_id))
WITH CLUSTERING ORDER BY (added_date ASC);
And the output I got when doing a SELECT * FROM videos_by_tag is:
tag | added_date | video_id | title
-----------+---------------------------------+--------------------------------------+------------------------------
datastax | 2013-04-16 00:00:00.000000+0000 | 5645f8bd-14bd-11e5-af1a-8638355b8e3a | What is DataStax Enterprise?
datastax | 2013-10-16 00:00:00.000000+0000 | 4845ed97-14bd-11e5-8a40-8338255b7e33 | DataStax Studio
cassandra | 2012-04-03 00:00:00.000000+0000 | 245e8024-14bd-11e5-9743-8238356b7e32 | Cassandra & SSDs
cassandra | 2013-03-17 00:00:00.000000+0000 | 3452f7de-14bd-11e5-855e-8738355b7e3a | Cassandra Intro
cassandra | 2014-01-29 00:00:00.000000+0000 | 1645ea59-14bd-11e5-a993-8138354b7e31 | Cassandra History
(5 rows)
As you can see the dates are out of order. There is a 2012 year value in the middle of the output.
You can fine-tune the display order using the ORDER BY clause. The partition key must be defined in the WHERE clause and the ORDER BY clause defines the clustering column to use for ordering.
Example:
SELECT * FROM videos_by_tag
WHERE tag = 'datastax' ORDER BY added_date ASC;
This is a very common misconception in Cassandra. The data is in fact ordered correctly in the sample data you posted.
The CLUSTERING ORDER applies to the sort order of the rows within a partition -- NOT across ALL partitions.
Using the example you posted, the clustering column added_date is correctly sorted in ascending order for the partition tag = 'datastax':
tag | added_date
-----------+---------------------------------
datastax | 2013-04-16 00:00:00.000000+0000
datastax | 2013-10-16 00:00:00.000000+0000
Similarly, added_date is sorted in ascending order for tag = 'cassandra':
tag | added_date
-----------+---------------------------------
cassandra | 2012-04-03 00:00:00.000000+0000
cassandra | 2013-03-17 00:00:00.000000+0000
cassandra | 2014-01-29 00:00:00.000000+0000
Like I said, the sort order only applies to rows within a partition.
It would be impossible to sort all rows across all partitions because such a task does not scale. Imagine if you had billions of partitions in the table across hundreds of nodes. Every time you inserted a new row into any partition, Cassandra would have to do a full table scan to sort the data, and it just wouldn't make sense to do so. Cheers!
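To see that ordering directly, restrict the query to a single partition, for example:
SELECT * FROM videos_by_tag WHERE tag = 'cassandra';
The rows of that one partition come back sorted by added_date ascending, exactly as the CLUSTERING ORDER defines.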

Order by in materialized view doesn't sort the results

I have a table with a structure like this:
CREATE TABLE kaefko.se_vi_f55dfeebae00d2b3 (
    value text PRIMARY KEY,
    id text,
    popularity bigint);
With data that looks like this:
value | id | popularity
--------+------------------+------------
rally | 4eff16cb91f96cd6 | 2
reddit | 11aa39686ed66ba5 | 3
red | 552d7e95af481415 | 1
really | 756bfa499965863c | 1
right | c5850c6b08f7966b | 1
redis | 7f1d251f399442d7 | 1
And I've created a materialized view that should sort these values by popularity, from the biggest to the smallest:
CREATE MATERIALIZED VIEW kaefko.se_vi_f55dfeebae00d2b3_by_popularity AS
    SELECT *
    FROM kaefko.se_vi_f55dfeebae00d2b3
    WHERE popularity IS NOT null
    PRIMARY KEY (value, popularity)
    WITH CLUSTERING ORDER BY (popularity DESC);
But the data in the materialized view looks like this:
value | popularity | id
--------+------------+------------------
rally | 2 | 4eff16cb91f96cd6
reddit | 3 | 11aa39686ed66ba5
really | 1 | 756bfa499965863c
right | 1 | c5850c6b08f7966b
redis | 1 | 7f1d251f399442d7
As you can see there are two main issues:
Data is not sorted as defined in the materialized view
There is just a part of all data in the materialized view
I'm not very experienced with Cassandra, and I've already spent hours trying to find the reason why this happens, to no avail. Could somebody please help me? Thank you <3
I'm using ScyllaDB 4.1.9-0 and cqlsh shows this:
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Alex's comment is 100% correct, the order is within the partition.
PRIMARY KEY (value, popularity)
WITH CLUSTERING ORDER BY (popularity DESC);
This means that the ordering of popularity is descending only among rows where the 'value' field is the same. If I alter the data you posted to show what this would look like, you would get the following:
value | popularity | id
--------+------------+------------------
rally | 3 | 4eff16cb91f96cd6
rally | 2 | 11aa39686ed66ba5
really | 3 | 756bfa499965863c
really | 2 | c5850c6b08f7966b
really | 1 | 7f1d251f399442d7
The order is on a per partition key basis, not globally ordered.
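You can verify this by restricting a query to a single partition of the view; within that partition the CLUSTERING ORDER applies (a sketch using the altered data above):
SELECT * FROM kaefko.se_vi_f55dfeebae00d2b3_by_popularity
WHERE value = 'really';
This would return the three 'really' rows with popularity 3, 2, 1, in that order.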

Cassandra - CQL - Order by desc on partition key

I created a table in Cassandra for monitoring inserts from an application.
My partition key is an int composed of year+month+day, my clustering key is a timestamp, and after that come my username and some other fields.
I would like to display the last 5 inserts, but it seems that the partition key takes precedence over the "order by desc".
How can I get the correct result? Normally the clustering key determines the order, so why do I get this result? (Thanks in advance)
Information:
Query: select tsp_insert, txt_name from ks_myKeyspace.myTable limit 5;
Result:
idt_day | tsp_insert | txt_name
----------+--------------------------+----------
20161028 | 2016-10-28 15:21:09+0000 | Jean
20161028 | 2016-10-28 15:21:01+0000 | Michel
20161028 | 2016-10-28 15:20:44+0000 | Quentin
20161031 | 2016-10-31 09:24:32+0000 | Jacquie
20161031 | 2016-10-31 09:23:32+0000 | Gabriel
Wanted:
idt_day | tsp_insert | txt_name
----------+--------------------------+----------
20161031 | 2016-10-31 09:24:32+0000 | Jacquie
20161031 | 2016-10-31 09:23:32+0000 | Gabriel
20161028 | 2016-10-28 15:21:09+0000 | Jean
20161028 | 2016-10-28 15:21:01+0000 | Michel
20161028 | 2016-10-28 15:20:44+0000 | Quentin
My table :
CREATE TABLE ks_myKeyspace.myTable (
    idt_day int,
    tsp_insert timestamp,
    txt_name text, ...
    PRIMARY KEY (idt_day, tsp_insert))
WITH CLUSTERING ORDER BY (tsp_insert DESC);
Ultimately, you are seeing the current order because you are not using a WHERE clause. You can see what's going on if you use the token function on your partition key:
aploetz#cqlsh:stackoverflow> SELECT idt_day,tsp_insert,token(idt_day),txt_name FROM mytable ;
idt_day | tsp_insert | system.token(idt_day) | txt_name
----------+---------------------------------+-----------------------+----------
20161028 | 2016-10-28 15:21:09.000000+0000 | 810871225231161248 | Jean
20161028 | 2016-10-28 15:21:01.000000+0000 | 810871225231161248 | Michel
20161028 | 2016-10-28 15:20:44.000000+0000 | 810871225231161248 | Quentin
20161031 | 2016-10-31 09:24:32.000000+0000 | 5928478420752051351 | Jacquie
20161031 | 2016-10-31 09:23:32.000000+0000 | 5928478420752051351 | Gabriel
(5 rows)
Results in Cassandra CQL will always come back in order of the hashed token value of the partition key (which you can see by using token). Within each partition, your CLUSTERING ORDER will be enforced.
That's key to understand... Result set ordering in Cassandra can only be enforced within a partition key. You have no control over the order that the partition keys come back in.
In short, use a WHERE clause on your idt_day and you'll see the order you expect.
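For example, a sketch using your column names:
SELECT tsp_insert, txt_name
FROM ks_myKeyspace.myTable
WHERE idt_day = 20161031
LIMIT 5;
Within that single idt_day partition, the rows come back newest-first, per your CLUSTERING ORDER.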
It seems to me that you are getting the whole thing wrong. Partition keys are not used for ordering data; they are only used to determine where your data lives in the cluster, specifically on which node. Moreover, the order really only matters inside a partition...
Your query results really are unpredictable. Depending on which node is faster to answer (assuming a cluster and not a single node), you can get a different result every time. You should try to avoid selecting without partition restrictions; they don't scale.
You can, however, change your queries and perform one select per day; then you'd query for ordered data (your clustering key) in an ordered manner (you manually choose the order of the days in your queries). As a side note, it would be faster because you could query multiple partitions in parallel.

Can Cassandra partition tables?

I'm inserting ~8 rows per second, and I would like to have one big table with all rows, which I want to partition into many tables, one per week.
e.g.
select * from keyspace.rootTable; -> returns all rows from all tables
select * from keyspace.27-2016Table; -> returns all rows from week 27
At 86400 seconds per day and 604800 seconds per week, you'll be storing 691200 rows per day and 4838400 rows each week. Even without knowing how wide your rows are, that's too many to return in a single query. Cassandra is great for storing lots of data like this. But querying lots of data like this...not so much.
You would probably want to partition by hour, but even that would give you 28800 rows. That's at least semi-manageable, so let's go with that.
I'd build a table that looks like this, partitioning on week and hourBucket while clustering on writeTime:
CREATE TABLE youAreAskingCassandraForTooManyRows (
    week text,
    hourBucket text,
    writeTime timestamp,
    value text,
    PRIMARY KEY ((week,hourBucket),writeTime))
WITH CLUSTERING ORDER BY (writeTime DESC);
Then I could query by a specific week and hour, just by the partition keys:
aploetz#cqlsh:stackoverflow> SELECT *
FROM youareaskingcassandrafortoomanyrows
WHERE week='201607-3' AND hourBucket ='20160713-14';
week | hourBucket | writetime | value
----------+--------------+--------------------------+--------
201607-3 | 20160713-14 | 2016-07-13 14:01:18+0000 | value6
201607-3 | 20160713-14 | 2016-07-13 14:01:14+0000 | value5
201607-3 | 20160713-14 | 2016-07-13 14:01:12+0000 | value4
201607-3 | 20160713-14 | 2016-07-13 14:01:10+0000 | value3
201607-3 | 20160713-14 | 2016-07-13 14:01:07+0000 | value2
201607-3 | 20160713-14 | 2016-07-13 14:01:04+0000 | value1
(6 rows)
Or even for a specific range, based on the clustering key writetime.
aploetz#cqlsh:stackoverflow> SELECT *
FROM youareaskingcassandrafortoomanyrows
WHERE week='201607-3' AND hourBucket ='20160713-14'
AND writetime > '2016-07-13 14:01:05+0000'
AND writetime < '2016-07-13 14:01:18+0000';
week | hourBucket | writetime | value
----------+--------------+--------------------------+--------
201607-3 | 20160713-14 | 2016-07-13 14:01:14+0000 | value5
201607-3 | 20160713-14 | 2016-07-13 14:01:12+0000 | value4
201607-3 | 20160713-14 | 2016-07-13 14:01:10+0000 | value3
201607-3 | 20160713-14 | 2016-07-13 14:01:07+0000 | value2
(4 rows)
select * from keyspace.rootTable; -> returns all rows from all tables
It should go without saying that if querying an entire week's worth of 4-million-plus rows is so huge that it will time out, then querying your entire table is a monumentally bad idea.
It's important to note that Cassandra is not a relational database. It is a distributed system, and thus running unbound queries (queries without a WHERE clause) introduces LOTS of network time into your equation. That's why you always want to specify at least the partition key(s) in all SELECT queries, because then you can guarantee that you'll be satisfying that query from a single node.
You should take a look at Patrick McFadin's article on Getting Started with Time Series Data Modeling. That should help you to understand how to partition data like this, and get you on the right path.

Range query - Data modeling for time series in CQL Cassandra

I have a table like this:
CREATE TABLE test (
    partitionkey text,
    rowkey text,
    date timestamp,
    policyid text,
    policyname text,
    PRIMARY KEY (partitionkey, rowkey));
with some data:
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
p2 | r3 | pl3 | plicy3 | 2008-01-03 00:00:00+0000
I want to be able to find:
1/ data from a particular partition key
2/ data from a particular partition key & rowkey
3/ Range query on date given a partitionkey
1/ and 2/ are trivial:
select * from test where partitionkey='p1';
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
but what about 3/?
Even with an index it doesn't work:
create index i1 on test (date);
select * from test where partitionkey='p1' and date = '2007-01-02';
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
but
select * from test where partitionkey='p1' and date > '2007-01-02';
Bad Request: No indexed columns present in by-columns clause with Equal operator
Any idea?
thanks,
Matt
CREATE TABLE test ( partitionkey text, rowkey text, date timestamp,
policyid text, policyname text, primary key (partitionkey, rowkey));
First of all, you really should use more descriptive column names instead of partitionkey and rowkey (and even date, for that matter). By looking at those column names, I really can't tell what kind of data this table is supposed to be indexed by.
select * from test where partitionkey='p1' and date > '2007-01-02';
Bad Request: No indexed columns present in by-columns clause with Equal operator
As for this issue, try making your "date" column a part of your primary key.
primary key (partitionkey, rowkey, date)
Once you do that, I think your date range queries will function appropriately.
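For illustration, a sketch of the reworked table and a date range query against it (assuming the same columns; note that with date placed after rowkey in the clustering order, the range query also needs an equality restriction on rowkey, whereas putting date before rowkey would let you range on date with only the partition key):
CREATE TABLE test (
    partitionkey text,
    rowkey text,
    date timestamp,
    policyid text,
    policyname text,
    PRIMARY KEY (partitionkey, rowkey, date));

SELECT * FROM test
WHERE partitionkey='p1' AND rowkey='r1' AND date > '2007-01-02';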
For more information on this, check out DataStax Academy's (free) course called Java Development With Apache Cassandra. Session 5, Module 104 discusses how to model time series data and that should help you out.
