I created a table in Cassandra for monitoring inserts from an application.
My partition key is an int composed of year+month+day, my clustering key is a timestamp, followed by my username and some other fields.
I would like to display the last 5 inserts, but it seems that the partition key takes precedence over the "ORDER BY DESC".
How can I get the correct result? Normally the clustering key dictates the order, so why am I getting this result? (Thanks in advance)
Information:
Query: select tsp_insert, txt_name from ks_myKeyspace.myTable limit 5;
Result:
idt_day | tsp_insert | txt_name
----------+--------------------------+----------
20161028 | 2016-10-28 15:21:09+0000 | Jean
20161028 | 2016-10-28 15:21:01+0000 | Michel
20161028 | 2016-10-28 15:20:44+0000 | Quentin
20161031 | 2016-10-31 09:24:32+0000 | Jacquie
20161031 | 2016-10-31 09:23:32+0000 | Gabriel
Wanted:
idt_day | tsp_insert | txt_name
----------+--------------------------+----------
20161031 | 2016-10-31 09:24:32+0000 | Jacquie
20161031 | 2016-10-31 09:23:32+0000 | Gabriel
20161028 | 2016-10-28 15:21:09+0000 | Jean
20161028 | 2016-10-28 15:21:01+0000 | Michel
20161028 | 2016-10-28 15:20:44+0000 | Quentin
My table:
CREATE TABLE ks_myKeyspace.myTable(
idt_day int,
tsp_insert timestamp,
txt_name text, ...
PRIMARY KEY (idt_day, tsp_insert)) WITH CLUSTERING ORDER BY (tsp_insert DESC);
Ultimately, you are seeing the current order because you are not using a WHERE clause. You can see what's going on if you use the token function on your partition key:
aploetz#cqlsh:stackoverflow> SELECT idt_day,tsp_insert,token(idt_day),txt_name FROM mytable ;
idt_day | tsp_insert | system.token(idt_day) | txt_name
----------+---------------------------------+-----------------------+----------
20161028 | 2016-10-28 15:21:09.000000+0000 | 810871225231161248 | Jean
20161028 | 2016-10-28 15:21:01.000000+0000 | 810871225231161248 | Michel
20161028 | 2016-10-28 15:20:44.000000+0000 | 810871225231161248 | Quentin
20161031 | 2016-10-31 09:24:32.000000+0000 | 5928478420752051351 | Jacquie
20161031 | 2016-10-31 09:23:32.000000+0000 | 5928478420752051351 | Gabriel
(5 rows)
Results in CQL will always come back in order of the hashed token value of the partition key (which you can see by using the token function). Within each partition, your CLUSTERING ORDER is enforced.
That's key to understand: result set ordering in Cassandra can only be enforced within a partition. You have no control over the order in which the partitions come back.
In short, use a WHERE clause on your idt_day and you'll see the order you expect.
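For example, restricting the query to a single partition returns rows in the declared clustering order (a minimal sketch against the schema above; the day literal is just illustrative):
SELECT tsp_insert, txt_name
FROM ks_myKeyspace.myTable
WHERE idt_day = 20161031
LIMIT 5;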
It seems to me that you are getting the whole thing wrong. Partition keys are not used for ordering data; they are only used to determine where your data lives in the cluster, specifically on which node. Moreover, order really only matters inside a partition...
Your query results really are unpredictable. Depending on which node is faster to answer (assuming a cluster and not a single node), you can get a different result every time. You should avoid selects without partition restrictions; they don't scale.
You can, however, change your queries and perform one SELECT per day. You would then be reading ordered data (by your clustering key) in an ordered manner, since you manually choose the order of the days in your queries, as sketched below. As a side note, this would also be faster, because you could query multiple partitions in parallel.
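A minimal sketch of that approach, assuming you already know which days you want and issue them newest first:
SELECT tsp_insert, txt_name FROM ks_myKeyspace.myTable WHERE idt_day = 20161031 LIMIT 5;
SELECT tsp_insert, txt_name FROM ks_myKeyspace.myTable WHERE idt_day = 20161028 LIMIT 5;
Each query hits a single partition and returns its rows ordered by tsp_insert DESC; you then concatenate the per-day result sets client-side.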
Related
I created a table that has timestamps in it, but when I try to order by the timestamp column using CLUSTERING ORDER BY, it is not ordered properly.
To create the table I wrote:
CREATE TABLE videos_by_tag (
tag text,
video_id uuid,
added_date timestamp,
title text,
PRIMARY KEY ((tag), added_date, video_id))
WITH CLUSTERING ORDER BY (added_date ASC);
And the output I got when doing a SELECT * FROM videos_by_tag is:
tag | added_date | video_id | title
-----------+---------------------------------+--------------------------------------+------------------------------
datastax | 2013-04-16 00:00:00.000000+0000 | 5645f8bd-14bd-11e5-af1a-8638355b8e3a | What is DataStax Enterprise?
datastax | 2013-10-16 00:00:00.000000+0000 | 4845ed97-14bd-11e5-8a40-8338255b7e33 | DataStax Studio
cassandra | 2012-04-03 00:00:00.000000+0000 | 245e8024-14bd-11e5-9743-8238356b7e32 | Cassandra & SSDs
cassandra | 2013-03-17 00:00:00.000000+0000 | 3452f7de-14bd-11e5-855e-8738355b7e3a | Cassandra Intro
cassandra | 2014-01-29 00:00:00.000000+0000 | 1645ea59-14bd-11e5-a993-8138354b7e31 | Cassandra History
(5 rows)
As you can see the dates are out of order. There is a 2012 year value in the middle of the output.
You can fine-tune the display order using the ORDER BY clause. The partition key must be defined in the WHERE clause and the ORDER BY clause defines the clustering column to use for ordering.
Example:
SELECT * FROM videos_by_tag
WHERE tag = 'datastax' ORDER BY added_date ASC;
This is a very common misconception in Cassandra. The data is in fact ordered correctly in the sample data you posted.
The CLUSTERING ORDER applies to the sort order of the rows within a partition -- NOT across ALL partitions.
Using the example you posted, the clustering column added_date is correctly sorted in ascending order for the partition tag = 'datastax':
tag | added_date
-----------+---------------------------------
datastax | 2013-04-16 00:00:00.000000+0000
datastax | 2013-10-16 00:00:00.000000+0000
Similarly, added_date is sorted in ascending order for tag = 'cassandra':
tag | added_date
-----------+---------------------------------
cassandra | 2012-04-03 00:00:00.000000+0000
cassandra | 2013-03-17 00:00:00.000000+0000
cassandra | 2014-01-29 00:00:00.000000+0000
Like I said, the sort order only applies to rows within a partition.
It would be impossible to sort all rows across all partitions, because such a task does not scale. Imagine if you had billions of partitions in the table spread across hundreds of nodes. Every time you inserted a new row into any partition, Cassandra would have to do a full table scan to re-sort the data, and it just wouldn't make sense to do so. Cheers!
I have a table with a structure like this:
CREATE TABLE kaefko.se_vi_f55dfeebae00d2b3 (
value text PRIMARY KEY,
id text,
popularity bigint);
With data that looks like this:
value | id | popularity
--------+------------------+------------
rally | 4eff16cb91f96cd6 | 2
reddit | 11aa39686ed66ba5 | 3
red | 552d7e95af481415 | 1
really | 756bfa499965863c | 1
right | c5850c6b08f7966b | 1
redis | 7f1d251f399442d7 | 1
And I've created a materialized view that should sort these values by popularity, from largest to smallest:
CREATE MATERIALIZED VIEW kaefko.se_vi_f55dfeebae00d2b3_by_popularity AS
SELECT *
FROM kaefko.se_vi_f55dfeebae00d2b3
WHERE popularity IS NOT null
PRIMARY KEY (value, popularity)
WITH CLUSTERING ORDER BY (popularity DESC);
But the data in the materialized view looks like this:
value | popularity | id
--------+------------+------------------
rally | 2 | 4eff16cb91f96cd6
reddit | 3 | 11aa39686ed66ba5
really | 1 | 756bfa499965863c
right | 1 | c5850c6b08f7966b
redis | 1 | 7f1d251f399442d7
As you can see, there are two main issues:
Data is not sorted as defined in the materialized view
Only part of the data is present in the materialized view
I'm not very experienced with Cassandra and I've already spent hours trying to find the reason why this happens, to no avail. Could somebody please help me? Thank you <3
I'm using ScyllaDB 4.1.9-0 and cqlsh shows this:
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Alex's comment is 100% correct: the order applies within a partition.
PRIMARY KEY (value, popularity)
WITH CLUSTERING ORDER BY (popularity DESC);
This means that the ordering of popularity is descending only among rows where the value field is the same. If I were to alter the data you posted, as an example of what this would look like, you would get the following:
value | popularity | id
--------+------------+------------------
rally | 3 | 4eff16cb91f96cd6
rally | 2 | 11aa39686ed66ba5
really | 3 | 756bfa499965863c
really | 2 | c5850c6b08f7966b
really | 1 | 7f1d251f399442d7
The ordering is per partition key, not global.
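For instance (a hypothetical sketch against the view above, using the altered data), restricting to a single partition returns it in descending popularity order:
SELECT * FROM kaefko.se_vi_f55dfeebae00d2b3_by_popularity WHERE value = 'really';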
I am new to Cassandra. I am using Cassandra 3.0 and the DataStax Java driver for development. I would like to know whether Cassandra provides any option to fetch data based on a rowkey range?
Something like:
select * from <table-name> where rowkey > ? and rowkey < ?;
If not, is there any other option in Cassandra (Java/CQL) to fetch data based on row ranges?
Unfortunately, there really isn't a mechanism in Cassandra that works in the way you are asking. The only way to run a range query on your partition keys (rowkey) is with the token function. This is because Cassandra orders its rows in the cluster by the hashed token value of the partition key. That value won't have any real meaning for you, but it does allow you to "page" through a large table without encountering timeouts.
SELECT * FROM <table-name>
WHERE token(rowkey) > -9223372036854775807
AND token(rowkey) < -5534023222112865485;
The way to go about range querying on meaningful values is to find a value to partition your rows by, and then cluster by a numeric or time value. For example, I can query a table of events by date range if I partition my data by month (PRIMARY KEY (monthbucket, eventdate)).
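A plausible definition of that events table (a sketch inferred from the query and result columns below; the trailing clustering columns and types are assumptions):
CREATE TABLE events (
    monthbucket TEXT,
    eventdate TIMESTAMP,
    beginend TEXT,
    eventid UUID,
    eventname TEXT,
    PRIMARY KEY (monthbucket, eventdate, beginend, eventid)
) WITH CLUSTERING ORDER BY (eventdate DESC, beginend DESC, eventid DESC);
With that in place, a date-range query within the month bucket works: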
aploetz#cqlsh:stackoverflow> SELECT * FROM events
WHERE monthbucket='201509'
AND eventdate > '2015-09-19' AND eventdate < '2015-09-26';
monthbucket | eventdate | beginend | eventid | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
201509 | 2015-09-25 06:00:00+0000 | B | a223ad16-2afd-4213-bee3-08a2c4dd63e6 | Hobbit Day
201509 | 2015-09-25 05:59:59+0000 | E | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-22 06:00:00+0000 | B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-20 05:59:59+0000 | E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
201509 | 2015-09-19 06:00:00+0000 | B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
(5 rows)
Let's say I have a table, something like this:
CREATE TABLE Users (
user UUID,
seq INT,
group TEXT,
time BIGINT,
PRIMARY KEY ((user), seq)
);
This follows the desired pattern of Cassandra, with good distribution across partitions (assuming the default Murmur3 hash partitioner).
However, I also need to (rarely) perform range queries on time, in time order. That doesn't seem possible in Cassandra. In reality, I do need to access the data by group, so (group, time) is acceptable. Since there doesn't seem to be a way for a secondary index to span multiple columns, I guess the right thing to do is to denormalize, into something like this:
CREATE TABLE UsersByGroupTime (
user UUID,
seq INT,
group TEXT,
time BIGINT,
PRIMARY KEY ((group), time)
) WITH CLUSTERING ORDER BY (time ASC);
This works entirely as it should, except that group has really low cardinality, let's say ('A','B','C'), and is unevenly distributed across users. Since queries on that table are rare I'm not worried about hot nodes, but I am worried about uneven distribution, perhaps even a single node getting all of it.
Is this a common scenario and is there any way to mitigate this or are there alternative solutions?
One technique to help avoid hot spots in Cassandra time-series models is to make use of a "time bucket." Essentially, you determine the "happy medium" level of time precision that provides adequate data distribution, while also being known and semi-convenient to query by.
For the purposes of this example, I'll choose year and month ("yyyyMM"). Note: I have no idea whether year and month will work for you... it's just an example. Once you determine your time bucket, you add it as an additional partition key, like this:
CREATE TABLE UsersByGroupTime (
user UUID,
seq INT,
group TEXT,
time TIMEUUID,
yearmonth BIGINT,
PRIMARY KEY ((group, yearmonth), time)
) WITH CLUSTERING ORDER BY (time DESC);
After inserting some rows, queries like this will work:
aploetz#cqlsh:stackoverflow2> SELECT group, yearmonth, dateof(time), time, seq, user
FROM usersbygrouptime WHERE group='B' AND yearmonth=201505;
group | yearmonth | dateof(time) | time | seq | user
-------+-----------+--------------------------+--------------------------------------+-----+--------------------------------------
B | 201505 | 2015-05-16 10:04:10-0500 | ceda56f0-fbdc-11e4-bd43-21b264d4c94d | 1 | d57ba8a4-db24-440c-a983-b1dd6b0d2e27
B | 201505 | 2015-05-16 10:04:09-0500 | ce1cac40-fbdc-11e4-bd43-21b264d4c94d | 1 | 66d07cbb-a2ff-4d56-8fa1-14dfaf684474
B | 201505 | 2015-05-16 10:04:08-0500 | cd525760-fbdc-11e4-bd43-21b264d4c94d | 1 | 07b589ac-4d5f-401e-a34f-e3479e269e01
B | 201505 | 2015-05-16 10:04:06-0500 | cc76c470-fbdc-11e4-bd43-21b264d4c94d | 1 | 984f85b5-ea58-4cf8-b512-43abacb227c9
(4 rows)
Now that may or may not help you query-wise, so you will need to spend some time ensuring that you pick an appropriate time bucket. But, this does help in terms of data distribution in the ring, which you can see with the token function:
aploetz#cqlsh:stackoverflow2> SELECT group, yearmonth, token(group,yearmonth)
FROM usersbygrouptime ;
group | yearmonth | token(group, yearmonth)
-------+-----------+-------------------------
A | 201503 | -3784784210711042553
A | 201504 | -610775546464185720
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
(12 rows)
Notice how different tokens are generated for each group/yearmonth pair, even though some of them have the same group ("A").
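And because time is a TIMEUUID clustering column, the original (rare) need for time-range queries can be served within a bucket. A hypothetical sketch using the built-in minTimeuuid/maxTimeuuid functions (the timestamps are just illustrative):
SELECT group, yearmonth, dateof(time), seq, user
FROM usersbygrouptime
WHERE group='B'
  AND yearmonth=201505
  AND time > maxTimeuuid('2015-05-16 10:04:05')
  AND time < minTimeuuid('2015-05-16 10:04:09');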
I need to model and store financial data in Apache Cassandra.
Data is accessed by date and business unit, so currently my model uses the date and business unit id as a compound row key.
I want to use wide-rows so I can pull the figures for a whole day (and unit) in one query.
For any given day, for a particular business unit, I need to store a series of increasingly granular breakdowns, like so (ignore the figures, they're purely illustrative):
| rowkey | USD | GBP | JPY | etc ....
|-------------|-------|------|------|----------
| 31122014-1 | 112 | 3006 | 234 |
| 31122014-2 | 3378 | -12.4| 998 |
| 31122014-3 | -456 | 2034 | 127 |
And then a more detailed breakdown, using compound columns:
| rowkey | USD-D1 | USD-D2 | GBP-D1 | GBP-D2 | etc ....
|-------------|--------|--------|--------|------------------
| 31122014-1 | 65 | 54 | 175 | 29 |
| 31122014-2 | 2003 | -6.4 | 603 | 349 |
| 31122014-3 | -230 | -198 | -53 | 217 |
And then an even more detailed breakdown:
| rowkey | USD-D1-X1 | USD-D1-X2 | USD-D1-X3 | USD-D2-X1 | etc ....
|-------------|-----------|-----------|-----------|-----------|-------
| 31122014-1 | 23 | 16 | 98 | 29 |
| 31122014-2 | 389 | -3.2 | 237 | 119 |
| 31122014-3 | -105 | -67 | -28 | 178 |
Is this the best way to model these breakdowns using three separate column families (as shown here)?
Or does it make more sense to store only the most granular breakdown and then use some form of column aggregation (if it exists) to extract the less granular data-sets?
I know Cassandra's aggregation capability is limited/non-existent, and I haven't found anything in the API to suggest how I might aggregate across columns like this.
I know I could do the aggregation in the application tier, but then the question is about the trade-offs between retrieving unnecessary data, moving computational overhead, and maintaining additional column families. I'm hoping Cassandra provides some way of solving this at the data tier.
Depending on how you want the data to be modeled, you can:
Use your solution. In this approach, you create a column family for each level of detail.
If you feel that there are far too many column families, or that you will always use the next column family anyway, I would suggest making the extra dimension part of the primary key, either as a clustering key or directly as part of the partition key.
For example:
If, according to your data model, rowkey access is always going to include a currency, you could model it like this:
| rowkey         | amount |
|----------------|--------|
| 31122014-1,GBP | 112    |
Obviously this will spread the data that used to sit under a single rowkey much better, but it will increase the number of row keys.
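In CQL terms, that might look something like the following (a hypothetical sketch; the table and column names are mine, and DECIMAL is used since some of the illustrative figures are fractional):
CREATE TABLE figures_by_currency (
    day TEXT,
    unit INT,
    currency TEXT,
    amount DECIMAL,
    PRIMARY KEY ((day, unit, currency))
);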
You could use aggregation, as well as the user-defined types that Cassandra allows.
Consider the following before you choose one of these strategies:
a. Distribution of the rows across nodes
b. Sparse columns vs wide columns
c. Effects on row cache (if you are going to turn it on) and key cache
d. And the most important, your selection queries
I think your solution is likely to be effective. In Cassandra, it's generally better to store data in multiple places based on what queries you're expecting to run against it.
If you see each of these as three separate use cases that will be queried at different times, then you've got a solid data model.
For what it's worth, this use case plays very well to the strengths of CQL which would model it as follows:
CREATE TABLE finance0 (
day DATE,
unit INT,
currency TEXT,
amount BIGINT,
PRIMARY KEY ((day, unit), currency)
);
CREATE TABLE finance1 (
day DATE,
unit INT,
currency TEXT,
sorter1 TEXT,
amount BIGINT,
PRIMARY KEY ((day, unit), currency, sorter1)
);
CREATE TABLE finance2 (
day DATE,
unit INT,
currency TEXT,
sorter1 TEXT,
sorter2 TEXT,
amount BIGINT,
PRIMARY KEY ((day, unit), currency, sorter1, sorter2)
);
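Pulling the most granular breakdown for a whole day and unit then remains a single-partition read (a sketch; the literal values are illustrative):
SELECT currency, sorter1, sorter2, amount
FROM finance2
WHERE day = '2014-12-31' AND unit = 1;
Any coarser rollup can be computed client-side from that one result set, which is exactly the trade-off described in the question.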