Can a Cassandra / CQL3 column family have a composite partition key? - cassandra

CQL 3 allows for a "compound" primary key using a definition like this:
CREATE TABLE timeline (
user_id varchar,
tweet_id uuid,
author varchar,
body varchar,
PRIMARY KEY (user_id, tweet_id)
);
With a schema like this, the partition key (storage engine row key) will consist of the user_id value, while the tweet_id will be compounded into the column name. What I am looking for, instead, is for the partition key (storage engine row key) to have a composite value like user_id:tweet_id. Obviously I could do something like key = user_id + ':' + tweet_id in my application, but is there any way to have CQL 3 do this for me?

Actually, yes you can. That functionality was added in this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-4179
The format for you would be:
CREATE TABLE timeline (
user_id varchar,
tweet_id uuid,
author varchar,
body varchar,
PRIMARY KEY ((user_id, tweet_id))
);

Until 1.2 comes out, the answer is no. The partition key will always be the first component. As you said, the way to do this would be to create the composite key yourself. You shouldn't shy away from this as it's actually quite common.

Related

Not able to run multiple where clause without Cassandra allow filtering

Hi I am new to Cassandra.
We are working on IOT project where car sensor data will be stored in cassandra.
Here is the example of one table where I am going to store one of the sensor data.
This is some sample data.
The way I want to partition the data is based on the organization_id so that different organization data is partitioned.
Here is the create table command:
CREATE TABLE IF NOT EXISTS engine_speed (
id UUID,
engine_speed_rpm text,
position int,
vin_number text,
last_updated timestamp,
organization_id int,
odometer int,
PRIMARY KEY ((id, organization_id), vin_number)
);
This works fine. However all my queries will be as bellow:
select * from engine_speed
where vin_number='xyz'
and organization_id = 1
and last_updated >='from time stamp' and last_updated <='to timestamp'
Almost all queries in all the table will have similar / same where clause.
I am getting error and it is asking to add "Allow filtering".
Kindly let me know how do I partition the table and define right primary key and indexs so that I don't have to add "allow filtering" in the query.
Apologies for this basic question but I'm just starting using cassandra.(using apache cassandra:3.11.12 )
The order of where clause should match with the order of partition and clustering keys you have defined in your DDL and you cannot skip any part of primary key while applying the WHERE clause before using the next key. So as per the query pattern u have defined, you can try the below DDL:
CREATE TABLE IF NOT EXISTS autonostix360.engine_speed (
vin_number text,
organization_id int,
last_updated timestamp,
id UUID,
engine_speed_rpm text,
position int,
odometer int,
PRIMARY KEY ((vin_number, organization_id), last_updated)
);
But remember,
PRIMARY KEY ((vin_number, organization_id), last_updated)
PRIMARY KEY ((vin_number), organization_id, last_updated)
above two are different in Cassandra, In case 1 your data will be partitioned by combination of vin_number and organization_id while last_updated will act as ordering key. In case 2, your data will be partitioned only by vin_number while organization_id and last_updated will act as ordering key. So you need to figure out which case suits your use case.

Is cassandra suitable for analytics storing?

I'm willing to develop an open-source analytics project which will store visits, referers, devices (by kind, family etc.).
I'm fairly new to the cassandra world so I'm asking a lot of questions about modeling with it.
I have read a lot of documentation about it, here is a part of my datamodel:
create table visits(
id UUID,
remote_addr VARCHAR,
method VARCHAR,
user_agent VARCHAR,
status_code INT,
host VARCHAR,
protocol VARCHAR,
path VARCHAR,
data VARCHAR,
headers VARCHAR,
query_string VARCHAR,
referer_id UUID,
device_id UUID,
browser_id UUID,
platform_id UUID,
created_at TIMEUUID,
PRIMARY KEY (id, created_at) ) WITH CLUSTERING ORDER BY (created_at DESC);
create table referers(
id UUID PRIMARY KEY,
host VARCHAR,
path VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
create table browsers(
id UUID PRIMARY KEY,
key VARCHAR,
version VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
create table platforms(
id UUID PRIMARY KEY,
key VARCHAR,
version VARCHAR,
first_seen TIMESTAMP,
last_seen TIMESTAMP,
seen_count INT );
With this model, if I want for example "all visits from status_code 200" I will have to create a secondary index, same for referers, devices, etc.
So do I need to create individual tables "visits_by_referers", "visits_by_devices" like so:
create table visits_by_referers(
visit_id UUID,
device_id UUID,
PRIMARY KEY (visit_id, device_id)
);
or am I completely wrong and cassandra is not suitable for this?
Thank you :)
Until 3.0 comes out with Materialized Views (https://issues.apache.org/jira/browse/CASSANDRA-6477), which will be HUGE for this type of use case, you need to create individual tables for things like 'visits by referrer' if you plan on doing direct querying.
What a lot of people tend to do is use a single large table, and then overlay something like Spark to actually read the data into memory and do much more complicated querying.

ORDER BY reloaded, cassandra

A given column family I would like to sort and to this I am trying to create a table with the option CLUSTERING ORDER BY. I always encounter the following errors:
1.) Variant A resulting in
Bad Request: Missing CLUSTERING ORDER for column userid
Statement:
CREATE TABLE test.user (
userID timeuuid,
firstname varchar,
lastname varchar,
PRIMARY KEY (lastname, userID)
)WITH CLUSTERING ORDER BY (lastname desc);
2.) Variant B resulting in
Bad Request: Only clustering key columns can be defined in CLUSTERING ORDER directive
Statement:
CREATE TABLE test.user (
userID timeuuid,
firstname varchar,
lastname varchar,
PRIMARY KEY (lastname, userID)
)WITH CLUSTERING ORDER BY (lastname desc, userID asc);
As far as I can see in the manual this is the correct syntax for creating a table for which I would like to run queries as "SELECT .... FROM user WHERE ... ORDER BY lastname". How could I achieve this? (The column 'lastname' I would like to keep as the first part of the primary key, so that I could use it in delete statements with the WHERE-clause.)
Thanks a lot, Tamas
You can only specify clustering order on your clustering keys.
PRIMARY KEY (lastname, userID)
)WITH CLUSTERING ORDER BY (lastname desc);
In your first example, your only clustering key is userID. Thus, it is the only valid entry for CLUSTERING ORDER BY.
PRIMARY KEY (lastname, userID)
)WITH CLUSTERING ORDER BY (lastname desc, userID asc);
The second example fails because you are specifying your partition key in CLUSTERING ORDER BY, and that's not going to work either.
Cassandra works by ordering CQL rows according to clustering keys, but only when a partition key is specified. This is because the whole idea of Cassandra wide-row modeling is to query by partition key, and read a series of ordered rows in one query operation.
I would like to run queries as "SELECT .... FROM user WHERE ... ORDER BY lastname".
Given this statement, I am going to suggest that you need another column in this model before it will work the way you want. What you need is an appropriate partition key for your users table. Say...like group. With your users partitioned by group, and clustered by lastname, your definition would look something like this:
CREATE TABLE test.usersbygroup (
userID timeuuid,
firstname varchar,
lastname varchar,
group text,
PRIMARY KEY (group,lastname)
)WITH CLUSTERING ORDER BY (lastname desc);
Then, this query will work, returning users (in this case) who are fans of the show "Firefly," ordered by lastname (descending):
SELECT * FROM usersbygroup WHERE group='Firefly Fans';
Read through this DataStax doc on Compound Keys and Clustering to get a better understanding.
NOTE: You don't need to specify ORDER BY in your SELECT. The rows will come back ordered by their clustering key(s), and ORDER BY cannot change that. All ORDER BY can really do, is alter the sort direction (DESCending vs. ASCending).
Clustering would be limited to whats defined in partitioning key, in your case (lastName + userId). So cassandra would store result in sorted order whose (lastName+userId) combination. Thats why u nned to give both for retrieval purpose. Its still not useful schema if you want to sort all data in table as last name as userId is unique(timeuuid) so clustering key would be of no use.
CREATE TABLE test.user (
userID timeuuid,
firstname varchar,
lastname varchar,
bucket int,
PRIMARY KEY (bucket)
)WITH CLUSTERING ORDER BY (lastname desc);
Here if u provide buket value say 1 for all user records then , all user would go in same bucket and hense it would retrieve all rows in sorted order of last name. (By no mean this is a good design, just to give you an idea).
Revised :
CREATE TABLE user1 (
userID uuid,
firstname varchar,
lastname varchar,
bucket int,
PRIMARY KEY ((bucket), lastname,userID)
)WITH CLUSTERING ORDER BY (lastname desc);

Is there a performance hit with simplifying a partition key?

Let's say I have the following unsimplified column family:
CREATE TABLE emp (
empID int,
deptID int,
first_name varchar,
last_name varchar,
PRIMARY KEY ((empID, deptID)));
The partition key is both empID and deptID.
Under the assumption I will only search this table using both of these fields, can I simplify the table and rewrite is as following?
CREATE TABLE emp2 (
empID_deptID text
first_name varchar,
last_name varchar,
PRIMARY KEY (empID_deptID));
Yes you can, but I don't see any added value in doing it. In your first code example, Cassandra concatenates empID and deptID for you.
In the precise example that you provided, there will be no difference. As a matter of fact, that is how it was done before composite partition keys became allowed in the previous versions.

Cassandra range slicing on composite key

I have columnfamily with composite key like this
CREATE TABLE sometable(
keya varchar,
keyb varchar,
keyc varchar,
keyd varchar,
value int,
date timestamp,
PRIMARY KEY (keya,keyb,keyc,keyd,date)
);
What I need to do is to
SELECT * FROM sometable
WHERE
keya = 'abc' AND
keyb = 'def' AND
date < '2014-01-01'
And that is giving me this error
Bad Request: PRIMARY KEY part date cannot be restricted (preceding part keyd is either not restricted or by a non-EQ relation)
What's the best way to solve this? Do I need to alter my columnfamily?
I also need to query those table with all keya, keyb, keyc, and date.
You cannot do it in cassandra. Moreover, such a range slicing is costlier too. You are trying to slice through a set of equalities that have the lower priority according to your schema.
I also need to query those table with all keya, keyb, keyc, and date.
If you are considering to solve this problem, considering having this schema. What i would suggest is to have the keys in a separate schema
create table (
timeuuid id,
keyType text,
primary key (timeuuid,keyType))
Use the timeuuid to store the values and do a range scan based on that.
create table(
timeuuid prevTableId,
value int,
date timestamp,
primary key(prevTableId,date))
Guess , in this way, your table is normalized for better scalability in your use case and may save a lot of disk space if keys are repetitive too.

Resources