I want to use two fields as primary key (without clustering key).
Does PRIMARY KEY ((a, b)) mean that a + b is the primary key? Or is it just the partition key?
I'm confused
A primary key definition of:
PRIMARY KEY ((a, b))
...sets both a and b as a composite partition key. In this scenario, there is no clustering key.
This definition:
PRIMARY KEY (a, b)
...uses a as the partition key and b as the clustering key.
For more info, I recommend Carlo's famous answer to this question:
Difference between partition key, composite key and clustering key in Cassandra?
To add to Aaron's response, the brackets (( and )) combine the 2 columns into one partition key. This means that you need to provide both columns in your filter in order to query the table:
SELECT ... FROM ... WHERE a = ? AND b = ?
Neither of these queries is valid, because each filters on only 1 of the 2 columns:
SELECT ... FROM ... WHERE a = ?
SELECT ... FROM ... WHERE b = ?
For what it's worth, I've explained the terms "composite partition key" and "compound primary key" with some real examples to illustrate the differences in this post -- https://community.datastax.com/questions/6171/. Cheers!
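To make the "both columns are required" rule concrete, here is a minimal Python sketch of a table declared with PRIMARY KEY ((a, b)). It is only a toy model: the class name, the value column and the sample data are made up for illustration, and a plain dict stands in for Cassandra's real partitioner.

```python
# Toy model of a table declared with PRIMARY KEY ((a, b)).
# The dict key stands in for the partition token; real Cassandra hashes the
# serialized partition key with a partitioner, which is not modeled here.

class CompositePartitionTable:
    def __init__(self):
        self._partitions = {}  # (a, b) -> row

    def insert(self, a, b, **columns):
        # Both key columns together identify the partition.
        self._partitions[(a, b)] = {"a": a, "b": b, **columns}

    def select(self, a=None, b=None):
        # Mirrors the rule above: the lookup needs the whole partition key.
        if a is None or b is None:
            raise ValueError("both a and b must be restricted")
        row = self._partitions.get((a, b))
        return [row] if row is not None else []


table = CompositePartitionTable()
table.insert("a1", "b1", value=1)
table.insert("a1", "b2", value=2)
print(table.select(a="a1", b="b2"))
```

Calling table.select(a="a1") raises an error in this sketch, which is the toy equivalent of the two invalid queries above: with only one of the two columns there is no way to work out which partition to read.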
Related
I specify 2 unique data types, but when one of them is different, it keeps adding records.
The table schema has a compound primary key, i.e. it is composed of a partition key (username) and clustering key (email). This means that each partition has one or more rows of emails.
It is a completely different schema from a table with just a simple primary key (only a partition key, no clustering key) like this:
CREATE TABLE users_by_username (
username text,
...
PRIMARY KEY (username)
)
This table would only ever have one row in each partition. Cheers!
[UPDATE] If you want your table to be partitioned by BOTH username + email, you need to create a new table which has a composite partition key (partition key has two or more columns):
CREATE TABLE users_by_username_email (
username text,
email text,
...
PRIMARY KEY ( (username, email) )
)
Note the difference: BOTH columns are enclosed in an extra pair of parentheses, so together they are treated as one partition key.
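As a rough illustration of why records "keep being added", the Python sketch below treats a table as a dict keyed by its primary key columns. The helper name and the sample rows are hypothetical. It only models row identity: whether email is a clustering key or part of the partition key changes how data is partitioned, not which rows count as distinct.

```python
def upsert(table, key_columns, row):
    """Write row into table (a dict), keyed by the primary key columns."""
    key = tuple(row[c] for c in key_columns)
    table[key] = row


rows = [
    {"username": "alice", "email": "alice@example.com"},
    {"username": "alice", "email": "alice@work.example"},
]

# PRIMARY KEY (username): the second write overwrites the first.
simple = {}
# PRIMARY KEY (username, email) or PRIMARY KEY ((username, email)):
# each distinct (username, email) pair is its own row.
two_column = {}

for row in rows:
    upsert(simple, ["username"], row)
    upsert(two_column, ["username", "email"], row)

print(len(simple), len(two_column))  # 1 2
```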
I want to be able to use IN in any column in any order in Cassandra
So I have the following table:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
and this data:
INSERT INTO test (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO test (a, b, c) VALUES ('a2', 'b2', 'c2');
This query works:
SELECT * FROM test WHERE c IN ('c1', 'c2') AND b IN ('b1') ALLOW FILTERING;
But if you remove the b IN it gives this error:
SELECT * FROM test WHERE c IN ('c1', 'c2') ALLOW FILTERING;
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN restrictions are not supported on indexed columns"
It seems that if I want to use IN on a column, I also have to use IN on the preceding columns?
Is there a way to avoid this?
Modifying the schema is acceptable, but I need to use Cassandra and allow filtering on any column (if there's no need to filter on a column, there would be no IN clause for that column).
Thanks for reading.
P.S.: I know you are not supposed to use ALLOW FILTERING; please assume there's no other way.
Edit: Seems like they may have fixed this?: https://issues.apache.org/jira/browse/CASSANDRA-14344
There is a lot of confusion about Cassandra's primary keys.
To answer your question, I think you need to understand how Cassandra primary keys work internally.
When you create a primary key with multiple fields, as in your case:
CREATE TABLE test (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b, c));
"a" will be the partition key. You can think of it as a hash that chooses the partition on which the data is stored.
b and c will be the clustering keys. They keep your data sorted within the partition, and c is nested under each b value, which means you have to provide b in order to put a constraint on c.
The Cassandra documentation states that you can only use an IN clause on the last column of the partition key and the last column of the clustering key, and even then you have to provide all the preceding clustering keys.
So basically there is no way to do that in one table.
You have to weigh query flexibility against data duplication.
One solution is to denormalize your data into 2 tables like this:
CREATE TABLE test1 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b));
CREATE TABLE test2 (a TEXT, b TEXT, c TEXT, PRIMARY KEY (c, a, b));
By doing so, you will be able to query each table depending on your use case.
The following queries will work:
SELECT * FROM test2 WHERE c IN ('c1', 'c2');
SELECT * FROM test1 WHERE a IN ('a1', 'a2');
SELECT * FROM test1 WHERE b IN ('b1', 'b2') ALLOW FILTERING;
And so on; I think you get the point.
But really try to find the best tradeoff, in order to minimize the use of ALLOW FILTERING, and remember that queries that hit the partition key directly will be the fastest.
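The nesting of clustering keys described in this answer can be pictured with a small Python sketch of the original table with PRIMARY KEY (a, b, c). The function names are invented for illustration, and plain dicts stand in for Cassandra's sorted on-disk structures, so ordering is not modeled.

```python
from collections import defaultdict

# partition key a -> clustering key b -> clustering key c -> row
partitions = defaultdict(lambda: defaultdict(dict))

def insert(a, b, c):
    partitions[a][b][c] = {"a": a, "b": b, "c": c}

def select_by_prefix(a, b, c_values):
    """a and b fixed, c IN (...): walks straight down the nesting."""
    by_c = partitions.get(a, {}).get(b, {})
    return [by_c[c] for c in sorted(c_values) if c in by_c]

def select_by_c_only(c_values):
    """c IN (...) alone: there is no path to follow, so every row is scanned."""
    wanted = set(c_values)
    return [row
            for by_b in partitions.values()
            for by_c in by_b.values()
            for c, row in by_c.items()
            if c in wanted]

insert("a1", "b1", "c1")
insert("a2", "b2", "c2")
print(select_by_prefix("a1", "b1", ["c1", "c2"]))
print(select_by_c_only(["c1", "c2"]))
```

The second function is the toy counterpart of ALLOW FILTERING: it returns the right rows, but only by visiting every partition.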
I have the following DDL:
CREATE TABLE mykeyspace.mytable (
a text,
b text,
c text,
d text,
e text,
starttime timestamp,
endtime timestamp,
PRIMARY KEY ((a, b, c), d, e, starttime, endtime)
) WITH CLUSTERING ORDER BY (d ASC, e ASC, starttime ASC, endtime ASC)
and I only have the following SELECT/DELETE query:
SELECT */DELETE FROM mytable WHERE a = ? AND b = ? AND C = ? AND d = ?;
I wonder whether column d can be included in the composite partition key, so that a row lookup is enough instead of a row lookup + clustering column lookup. Would that also improve performance?
Including column d in the composite partition key will absolutely improve performance:
Your data will be distributed well across the cluster.
Your SELECT query will be faster, since no clustering-level filtering is required.
Your DELETE query will mark that partition as markedForDeleteAt, instead of inserting a range tombstone.
I feel that the more columns I have in the PARTITION KEY the better.
So my suggestion is to include as many columns as possible in the PARTITION KEY, as long as every query supplies all of them. It will improve SELECT query performance in general, and will avoid some tombstone problems as well (because you will delete at partition level, unless you recreate the partitions of course).
I have a table in Cassandra:
CREATE TABLE pica_pictures (
p int,
g text,
id text,
a int,
PRIMARY KEY ((p), g, id)
)
Then I try to select data with this query:
cqlsh> select * from picapica_realty.pica_pictures where p = 1 and g in ('1', '2');
Bad Request: Clustering column "g" cannot be restricted by an IN relation
I can't find the cause of this behavior.
This may be a restriction due to your version of Cassandra. As Cedric noted, it works for him in 2.2 (or rather, didn't error-out).
However, as I read your question I recalled a slide from a presentation that I gave at Cassandra Day Chicago 2015. From CQL: This is not the SQL you are looking for, slide #15:
IN
Can only operate on the last partition key and/or the last clustering key.
At the time (April 2015) the most-current version of Cassandra was either 2.1.4 or 2.1.5.
As it stands (with Cassandra 2.1) you'll either need to adjust your primary key definition to PRIMARY KEY ((p), g), or adjust your WHERE clause to something like where p = 1 and g = '1' and id in ('id1', 'id2');
This does work with Cassandra 2.2.
cqlsh:ks> CREATE TABLE pica_pictures (
... p int,
... g text,
... id text,
... a int,
... PRIMARY KEY ((p), g, id)
... );
cqlsh:ks> select * from pica_pictures where p = 1 and g in ('1', '2');
p | g | id | a
---+---+----+---
(0 rows)
As your link describes, this works because the preceding columns are restricted by equality and none of the queried columns is of a collection type.
After reading this blog at planetcassandra, I'm wondering how a CQL3 composite index with 3 fields maps to the thrift column family world. For example:
CREATE TABLE comments (
article_id uuid,
posted_at timestamp,
author text,
karma int,
content text,
PRIMARY KEY (article_id, posted_at)
)
Here the column article_id will be mapped to the internal row key and posted_at will be mapped to (the first part of) the cell name.
What if the table design is:
CREATE TABLE comments (
author_id varchar,
posted_at timestamp,
article_id uuid,
author text,
karma int,
content text,
PRIMARY KEY (author_id, posted_at, article_id)
)
Will the internal row key then be mapped to the first 2 fields of the composite index, with article_id mapped to the cell name, essentially slicing across as many articles as needed (up to 2 billion entries), so that any query on an author_id and posted_at combination is one seek on the disk?
Is the behavior the same for any number of fields in a composite key?
Your answers are much appreciated.
The above observation is incorrect and the correct one is here
I've personally verified:
In the first case:
article_id = partition key, posted_at = cluster key
In the second case:
author_id = partition key, posted_at:article_id = cluster key
First part of composite key (author_id) is called "Partition Key",
rest (posted_at,article_id) are remaining keys.
Cassandra stores columns differently when composite keys are used. Partition key
becomes row key. Remaining keys are concatenated with each column
name (":" as separator) to form column names. Column values remain
unchanged.
Remaining keys (other than partition keys) are ordered,
and it's not allowed to search on any random column, you have to
start with the first one and then you can move to the second one and
so on. This is evident from "Bad Request" error.
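The layout described in this quote can be sketched in a few lines of Python. This is only an illustration of the description above, not Cassandra's actual encoding: the function name and the sample values are made up, and the ":" separator is taken from the quoted text.

```python
def to_internal_cells(row, partition_key, clustering_keys):
    """Return (row_key, {cell_name: value}) for one CQL row."""
    # The partition key becomes the internal row key.
    row_key = row[partition_key]
    # The remaining key columns are joined into a prefix for every cell name.
    prefix = ":".join(str(row[k]) for k in clustering_keys)
    key_columns = {partition_key, *clustering_keys}
    cells = {
        f"{prefix}:{name}": value
        for name, value in row.items()
        if name not in key_columns
    }
    return row_key, cells


comment = {
    "author_id": "u42",
    "posted_at": "2013-01-01",
    "article_id": "art-7",
    "author": "Ann",
    "karma": 3,
    "content": "Nice post",
}
row_key, cells = to_internal_cells(comment, "author_id", ["posted_at", "article_id"])
print(row_key)
for name, value in cells.items():
    print(name, "=", value)
```

Because every cell name starts with posted_at and then article_id, cells can only be sliced in that order, which matches the point above about having to start with the first remaining key.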
There's an excellent explanation by Aaron Morton at his site, thelastpickle.
In the first case:
article_id = partition key, posted_at = cluster key
In the second case, as declared, only author_id is the partition key, and posted_at and article_id are both cluster keys. To get:
author_id + posted_at = partition key, article_id = cluster key
the key has to be declared as PRIMARY KEY ((author_id, posted_at), article_id).
If you go with that layout, be mindful of disk seeks, and check that the row is not getting too wide and that it gives a real benefit compared to the first case.
If you are well within the 2 billion limit, don't overdo it by adopting the second method, because records are dispersed on the combined key.