I prepared the following table "keyspaceB.memobox"
DROP TABLE IF EXISTS keyspaceB.memobox;
CREATE TABLE IF NOT EXISTS keyspaceB.memobox (
pkey1 text,
pkey2 text,
id timeuuid,
name text,
memo text,
date timestamp,
PRIMARY KEY ((pkey1, pkey2),id,name)
) WITH CLUSTERING ORDER BY (id DESC,name DESC);
And I registered the following data.
INSERT INTO memobox (pkey1,pkey2,id,name,memo,date) VALUES ('a','b',now(),'tanaka','greet message1','2016-12-13');
INSERT INTO memobox (pkey1,pkey2,id,name,memo,date) VALUES ('a','b',now(),'yamamoto','greet message2','2016-12-13');
The following will succeed
SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY id;
However, the following will fail. I would like to ask your professor what is wrong.
SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY name;
■error
cqlsh:keyspaceb> SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY name;
InvalidRequest: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"
cqlsh:keyspaceb>
There are two different types of keys in cassandra, partition key and clustering key.
The partition key determines which node the data gets stored, while the clusterning key determines the order in which the data gets stored in that parition(node).
In your case the partition key is pkey1 and pkey2. and the clustering key is id and name.
so the data in a partition will be stored based on the id and then name.
e.g if we have the following data
id |name
1 | abc
1 | xyz
2 | aaa
In this case the row with id 1 is stored first, also if two rows have the same id then the order is decided by name column.
So when you query the data like this
SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY id;
cassandra finds the partitoin using pkey1 and pkey2 (aka the partition key) and then just return the data how it is stored on the disk.
However in the second case
SELECT * FROM memobox where pkey1='a' and pkey2='b' ORDER BY name;
since the data is not ordered by name alone,( it is first ordered by id and then by name). cassandra can not just blindly return the results , it has to do a lot more in order to correctly sort the results. Hence due to performance reasons this is not allowed.
That is why in the order by clause you have to specify the clustering columns in the order in which you specify them while creating the table(id and then name).
This is from another answer by #aaron
Where and Order By Clauses in Cassandra CQL
Cassandra achieves performance by using the clustering keys to sort
your data on-disk, thereby only returning ordered rows in a single
read (no random reads). This is why you must take a query-based
modeling approach (often duplicating your data into multiple query
tables) with Cassandra. Know your queries ahead of time, and build
your tables to serve them.
Related
i have problem with ordering data in cassandra Database.
this is my table structure:
CREATE TABLE posts (
id uuid,
created_at timestamp,
comment_enabled boolean,
content text,
enabled boolean,
meta map<text, text>,
post_type tinyint,
summary text,
title text,
updated_at timestamp,
url text,
user_id uuid,
PRIMARY KEY (id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
and when i run this query, i got the following message:
Query:
select * from posts order by created_at desc;
message:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Or this query return data without sorting:
select * from posts
There are couple of things you need to understand,
In your case the partition key is "id" and the clustering key is "created_at".
what that essentially means is any row will be stored in a partition based on the hash of "id"(depending on your hashing scheme by default it is Murmur3), now inside that partition the data is sorted based on your clustering key, in your case "created_at".
So if you query some data from that table by default the results which come are sorted based on your clustering order and the default sort order is the one which you specify while creating the table. However there is a gotcha there.
If yo do not specify the partition key in the WHERE clause, the actual order of the result set then becomes dependent on the hashed values of partition key(in your case id).
So in order to get the posts by that specific order. you have to specify the partition key like this
select * from posts WHERE id=1 order by created_at desc;
Note:
It is not necessary to specify the ORDER BY clause on a query if your desired sort direction (“ASCending/DESCending”) already matches the CLUSTERING ORDER in the table definition.
So essentially the above query is same as
select * from posts WHERE id=1
You can read more about this here http://www.datastax.com/dev/blog/we-shall-have-order
The error message is pretty clear: you cannot ORDER BY without restricting the query with a WHERE clause. This is by design.
The data you get when running without a WHERE clause actually are ordered, not with your clustering key, but by applying the token function to your partition key. You can verify the order by issuing:
SELECT token(id), id, created_at, user_id FROM posts;
where the token function arguments exactly match your PARTITION KEY.
I suggest you to read this and this to understand what you can/can't do.
I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of them have some business entity id, timestamp and some value. I need to get only latest value in terms of in-event timestamp for each business key, but events may come unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC )
Now if I insert some data in this table I can get latest value for some given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require to issue such query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using a version after 3.6, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017) which you can set to 1. This won't auto complete in cqlsh until 3.10 with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can work essentially any amount of data: It decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for any partition of the data, then perform a complex operation on each of them. Relational databases allow this, Cassandra does not. All it allows are full table scans (SELECT * from table1), or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.
In Cassandra, I understand that by default, given PRIMARY KEY(id1, id2), id1 will be partition key and id2 will be clustering key.
I want to know if can I define two partition keys without any clustering key as follows:
PRIMARY KEY ((id1, id2));
Your understanding is correct.
Your PRIMARY KEY ((id1, id2)) is correct and you are specifying one partition key consisting of two columns.
In the second case, you can query the data only by specifying both columns values. EG:
SELECT * FROM mytable WHERE id1=1 AND id2=3;
and queries like:
SELECT * FROM mytable WHERE id1=1;
will fail because id2 is part of your primary key.
I'm starting to using Cassandra but I'm getting some problems on "ordering" or "selecting".
CREATE TABLE functions (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_function, sort, id_subfunction)
);
This is my table.
If I execute this query
SELECT * FROM functions WHERE id_subfunction = 0 ORDER BY sort;
this is what I get.
Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Where I'm doing wrong?
Thanks
PRIMARY KEY (id_function, sort, id_subfunction)
In Cassandra CQL the columns in a compound PRIMARY KEY are either partitioning keys or clustering keys. In your case, id_function (the first key listed) is the partitioning key. This is the key value that is hashed so that your data for that key can be evenly distributed on your cluster.
The remaining columns (sort and id_subfunction) are known as clustering columns, which determine the sort order of your data within a partition. This essentially means that your data will only be sorted by your clustering key(s) when a partitioning key is first designated in your WHERE clause.
You have two options:
1) Query this table by id_function instead:
SELECT * FROM functions WHERE id_function= 0 ORDER BY sort;
This will technically work, although I'm guessing that it won't give you the results that you are looking for.
2) The better option, is to create a "query table." This is a table designed to specifically handle your query by id_subfunction. It only differs from the original functions table in that the PRIMARY KEY is defined with id_subfunction as the partitioning key:
CREATE TABLE functionsbysubfunction (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_subfunction, sort, id_function)
);
This query table will allow this query to function as expected:
SELECT * FROM functionsbysubfunction WHERE id_subfunction = 0;
And you shouldn't need to indicateORDER BY, unless you want to specify either ASCending or DESCending order.
Remember with Cassandra, it is important to design your data model according to how you want to query your data. And that may not necessarily be the way that it originally makes sense to store it.
I get the above error when I try to use following cql statement, not sure whats wrong with it.
CREATE TABLE Stocks(
id uuid,
market text,
symbol text,
value text,
time timestamp,
PRIMARY KEY(id)
) WITH CLUSTERING ORDER BY (time DESC);
Bad Request: Only clustering key columns can be defined in CLUSTERING ORDER directive
But this works fine, can't I use some column which is not part of primary key to arrange my rows ?
CREATE TABLE timeseries (
... event_type text,
... insertion_time timestamp,
... event blob,
... PRIMARY KEY (event_type, insertion_time)
... )
... WITH CLUSTERING ORDER BY (insertion_time DESC);
"can't I use some column which is not part of primary key to arrange my rows?"
No, you cannot. From the DataStax documentation on the SELECT command:
ORDER BY clauses can select a single column only. That column has to be the second column in a compound PRIMARY KEY. This also applies to tables with more than two column components in the primary key.
Therefore, for your first CREATE to work, you will need to adjust your PRIMARY KEY to this:
PRIMARY KEY(id,time)
The second column of in a compound primary key is known as the "clustering column." This is the column that determines the on-disk sort order of data within a partitioning key. Note that last part in italics, because it is important. When you query your Stocks column family (table) by id, all "rows" of column values for that id will be returned, sorted by time. In Cassandra you can only specify order within a partitioning key (and not for your entire table), and your partitioning key is the first key listed in a compound primary key.
Of course the problem with this, is that you probably want id to be unique (which means that CQL will only ever return one "row" of column values per partitioning key). Requiring time to be part of the primary key negates that, and makes it possible to store multiple values for the same id. This is the problem with partitioning your data by a unique id. It might be a good idea in the RDBMS world, but it can make querying in Cassandra more difficult.
Essentially, you are going to need to revisit your data model here. For instance, if you wanted to query prices over time, you could name the table something like "StockPriceEvents" with a primary key of (id,time) or (symbol,time). Querying that table would give you the prices recorded for each id or symbol, sorted by time. Now that may or may not be of any value to your use case. Just trying to explain how primary keys and sort order work in Cassandra.
Note: You should really use column names that have more meaning. Things like "id," "time," and "timeseries" are pretty vague don't really describe anything about the context in which they are used.
While creating a Table in Cassandra with "CLUSTERING ORDER BY" option, make sure the clustering column is Primary column.
Below table created with clustering column ,but the clustering column "Datetime" is not a Primary key column. Hence below error.
ERROR_SCRIPT
cqlsh> CREATE TABLE IF NOT EXISTS cpdl3_spark_cassandra.log_data (
... IP text,
... URL text,
... Status text,
... UserAgent text,
... Datetime timestamp,
... PRIMARY KEY (IP)
... ) WITH CLUSTERING ORDER BY (Datetime DESC);
ERROR:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
CORRECTED_SCRIPT (Where the "Datetime" is added into the Primary Key columns)
cqlsh> CREATE TABLE IF NOT EXISTS cpdl3_spark_cassandra.log_data (
... IP text,
... URL text,
... Status text,
... UserAgent text,
... Datetime timestamp,
... PRIMARY KEY (IP,Datetime)
... ) WITH CLUSTERING ORDER BY (Datetime DESC);