How can I keep a partition from growing too large - Cassandra

It's recommended that a partition not be greater than 10 MB. How can I avoid this?
I could use 2 partition keys, but then I can't query them separately, and that does not fit my query.
If you have 2 partition keys and you execute a query, you have to supply both of them, like this:
user_id = partition key
post_id = partition key
SELECT * FROM posts WHERE user_id = 1 AND post_id = 1;
If you run the following instead, you get an error because post_id is not supplied:
SELECT * FROM posts WHERE user_id = 1;
But I never need both user_id and post_id in the same query; I always need only one of them, not both. So how do I keep the partition size down?

You can have a table design like the one below:
CREATE TABLE user_post_by_month (
    userid int,
    year int,
    month int,
    postid int,
    post text,
    PRIMARY KEY ((userid, year, month), postid)
);
This design will not create a wide row. You can fetch the posts of a user for a particular month. If you are expecting too many posts for a user in a month, you can add another bucket, for example a week:
CREATE TABLE user_post_by_week (
    userid int,
    year int,
    month int,
    week int,
    postid int,
    post text,
    PRIMARY KEY ((userid, year, month, week), postid)
);
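For example, to fetch a user's posts for a particular month or week, the queries might look like this (a sketch with hypothetical values):
SELECT post FROM user_post_by_month
WHERE userid = 1 AND year = 2023 AND month = 6;
SELECT post FROM user_post_by_week
WHERE userid = 1 AND year = 2023 AND month = 6 AND week = 23;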

Related

Not able to run a query with multiple WHERE clauses without Cassandra ALLOW FILTERING

Hi, I am new to Cassandra.
We are working on an IoT project where car sensor data will be stored in Cassandra.
Here is an example of one table where I am going to store data from one of the sensors.
The way I want to partition the data is based on the organization_id, so that each organization's data is partitioned separately.
Here is the create table command:
CREATE TABLE IF NOT EXISTS engine_speed (
    id UUID,
    engine_speed_rpm text,
    position int,
    vin_number text,
    last_updated timestamp,
    organization_id int,
    odometer int,
    PRIMARY KEY ((id, organization_id), vin_number)
);
This works fine. However, all my queries will be as below:
select * from engine_speed
where vin_number='xyz'
and organization_id = 1
and last_updated >='from time stamp' and last_updated <='to timestamp'
Almost all queries on all the tables will have a similar WHERE clause.
I am getting an error asking me to add ALLOW FILTERING.
Kindly let me know how I should partition the table and define the right primary key and indexes so that I don't have to add ALLOW FILTERING to the query.
Apologies for this basic question, but I'm just starting to use Cassandra (Apache Cassandra 3.11.12).
The order of the WHERE clause should match the order of the partition and clustering keys you have defined in your DDL, and you cannot skip any part of the primary key in the WHERE clause before using the next key. So as per the query pattern you have defined, you can try the below DDL:
CREATE TABLE IF NOT EXISTS autonostix360.engine_speed (
    vin_number text,
    organization_id int,
    last_updated timestamp,
    id UUID,
    engine_speed_rpm text,
    position int,
    odometer int,
    PRIMARY KEY ((vin_number, organization_id), last_updated)
);
But remember,
PRIMARY KEY ((vin_number, organization_id), last_updated)
PRIMARY KEY ((vin_number), organization_id, last_updated)
the two above are different in Cassandra. In case 1, your data will be partitioned by the combination of vin_number and organization_id, while last_updated will act as the ordering key. In case 2, your data will be partitioned only by vin_number, while organization_id and last_updated will act as ordering keys. So you need to figure out which case suits your use case.
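With the first layout, the query pattern from the question should run without ALLOW FILTERING, since the full partition key is supplied with equality and last_updated is a clustering column (a sketch with hypothetical timestamps):
SELECT * FROM autonostix360.engine_speed
WHERE vin_number = 'xyz'
  AND organization_id = 1
  AND last_updated >= '2022-01-01 00:00:00'
  AND last_updated <= '2022-01-31 23:59:59';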

How is it ordered if I query by a secondary index?

I created this table in Cassandra.
CREATE TABLE user_event (
    userId bigint,
    type varchar,
    createdAt timestamp,
    PRIMARY KEY ((userId), createdAt)
) WITH CLUSTERING ORDER BY (createdAt DESC);

CREATE INDEX user_event_type ON user_event(type);
If I query by userId, the result will be ordered by the createdAt column.
SELECT * FROM user_event WHERE userId = 1;
But how is it ordered if I query by type? Can I get the last SIGN_IN event?
SELECT * FROM user_event WHERE userId = 1 AND type = 'SIGN_IN' LIMIT 1;
Is there any guarantee that result is ordered by createdAt?
The key to understanding this scenario is to remember that result set order can only be enforced within a partition. As you are still querying by partition key (userId), all data within each partition will still be ordered by createdAt (DESCending).
"Guarantee" is a strong word, and one that I am hesitant to use. The results queried in this way should maintain their on-disk sort order. I would definitely test it out. But as long as you provide userId as a part of the query, the results should be returned sorted by createdAt.

How should I design the schema to get the last 2 records of each clustering key in Cassandra?

Each row in my table has 4 values: product_id, user_id, updated_at, rating.
I'd like to create a table to find out how many users changed rating during a given period.
Currently my schema looks like:
CREATE TABLE IF NOT EXISTS ratings_by_product (
    product_id int,
    updated_at timestamp,
    user_id int,
    rating int,
    PRIMARY KEY ((product_id), updated_at, user_id)
) WITH CLUSTERING ORDER BY (updated_at DESC, user_id ASC);
but I couldn't figure out a way to get only the last 2 rows for each user in a given time window.
Any advice on the query or on changing the schema would be appreciated.
Cassandra requires a query-based approach to table design, which means that typically one table serves one query. So to serve the query you are talking about (the last two updated rows per user), you should build a table specifically designed for it:
CREATE TABLE ratings_by_user_by_time (
    product_id int,
    updated_at timestamp,
    user_id int,
    rating int,
    PRIMARY KEY ((user_id), updated_at, product_id)
) WITH CLUSTERING ORDER BY (updated_at DESC, product_id ASC);
Then you will be able to get the last two updated ratings for a user by doing the following:
SELECT * FROM ratings_by_user_by_time
WHERE user_id = 101 LIMIT 2;
Note that you'll need to keep the two ratings tables in sync yourself, and using a batch statement is a good way to accomplish that.
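A logged batch that writes both tables at once might look like this (a sketch with hypothetical values; both inserts must carry the same data):
BEGIN BATCH
  INSERT INTO ratings_by_product (product_id, updated_at, user_id, rating)
  VALUES (42, '2023-06-01 12:00:00', 101, 5);
  INSERT INTO ratings_by_user_by_time (product_id, updated_at, user_id, rating)
  VALUES (42, '2023-06-01 12:00:00', 101, 5);
APPLY BATCH;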

Cassandra - Is there a way to update a column value for an entire table

I have a Cassandra table:
CREATE TABLE test (
    network_id int,
    date date,
    score float,
    id uuid,
    user_id int,
    user_name text,
    PRIMARY KEY ((network_id, date), score, id)
) WITH CLUSTERING ORDER BY (score DESC);
The query I need to satisfy is:
"Give me all users who belong to a specific network on a specific day, sorted by score."
The problem is that when a user changes their name (today) and I then execute a query for some day in the past, my report will show the old version of the name.
Making the user_name column STATIC doesn't work, because my table has to be partitioned by day.
Any ideas how to solve this?
Thank You.
Since you have denormalized user_name for faster access, if user_name is updated you have to update every copy of it.
You need to maintain another table:
CREATE TABLE network_by_user_id (
    user_id int,
    network_id int,
    date date,
    score float,
    id uuid,
    PRIMARY KEY (user_id, network_id, date, score, id)
);
So now, whenever a user updates their name, you have to select all of that user's records from the network_by_user_id table and, for each record, update user_name in the base table:
UPDATE test SET user_name = 'New Name' WHERE network_id = ? AND date = ? AND score = ? AND id = ?;
If the number of records for a user grows quickly over time, then the cost of updating user_name will also grow quickly.
Another approach is to normalize the base table, like below:
CREATE TABLE test (
    network_id int,
    date date,
    score float,
    id uuid,
    user_id int,
    PRIMARY KEY ((network_id, date), score, id)
);

CREATE TABLE users (
    user_id int,
    user_name text,
    PRIMARY KEY (user_id)
);
For each user_id found in the base table, you can query the users table with executeAsync to get the user_name.
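In CQL terms, the two-step read might look like this (a sketch with hypothetical values; the second query is the one you would issue per user with executeAsync):
-- Step 1: fetch the rows for the network and day
SELECT user_id, score, id FROM test
WHERE network_id = 1 AND date = '2023-06-01';
-- Step 2: resolve the current name for each user_id returned
SELECT user_name FROM users WHERE user_id = 101;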
Learn More about executeAsync
You can use the SELECT command if you want to read data from your table.

Cassandra data model with obsolete data removal possibility

I'm new to Cassandra and would like to ask what the correct modelling pattern would be for the following task.
I would like to model the data so that it can be removed later.
I have 100,000,000 records per day of this structure:
transaction_id <- this is unique
transaction_time
transaction_type
user_name
... some other information
I will need to fetch data by user_name (I have about 5,000,000 users).
Also I will need to find a transaction's details by its id.
All the data becomes irrelevant after about 30 days, so I need a way to delete outdated rows.
As far as I have found, TTLs expire column values, not rows.
So far I have come up with this model, which as I understand it will imply really wide rows:
CREATE TABLE user_transactions (
    transaction_date timestamp, // date part of transaction
    user_name text,
    transaction_id text,
    transaction_time timestamp, // original transaction time
    transaction_type int,
    PRIMARY KEY ((transaction_date, user_name), transaction_id)
);

CREATE INDEX idx_user_transactions_uname ON user_transactions(user_name);
CREATE INDEX idx_user_transactions_tid ON user_transactions(transaction_id);
But this model does not allow deletions by transaction_date.
It also builds high-cardinality indexes, which the Cassandra docs strongly discourage.
So what will be the correct model for this task?
EDIT:
The ugly workaround I have come up with so far is to create a single table per date partition. Mind you, I call this a workaround and not a solution. I'm still looking for the right data model.
CREATE TABLE user_transactions_YYYYMMDD (
    user_name text,
    transaction_id text,
    transaction_time timestamp,
    transaction_type int,
    PRIMARY KEY (user_name)
);
YYYYMMDD is the date part of the transaction. We can create a similar table keyed by transaction_id for transaction lookup. Obsolete tables can be dropped or truncated.
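For example, once a day's data becomes obsolete (a sketch with a hypothetical date suffix):
DROP TABLE user_transactions_20230101;
-- or, to keep the table definition and just clear its data:
TRUNCATE user_transactions_20230101;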
Maybe you should denormalize your data model. For example, to query by user_name you can use a column family (table) like this:
CREATE TABLE user_transactions (
    transaction_date timestamp, // date part of transaction
    user_name text,
    transaction_id text,
    transaction_time timestamp, // original transaction time
    transaction_type int,
    PRIMARY KEY (user_name, transaction_id)
);
So you can query using the partition key directly like this:
SELECT * FROM user_transactions WHERE user_name = 'USER_NAME';
And for the id you can use a column family like this (it needs its own name, since a table called user_transactions already exists):
CREATE TABLE transactions_by_id (
    transaction_date timestamp, // date part of transaction
    user_name text,
    transaction_id text,
    transaction_time timestamp, // original transaction time
    transaction_type int,
    PRIMARY KEY (transaction_id)
);
So the query could be something like this:
SELECT * FROM transactions_by_id WHERE transaction_id = 'ID';
This way you don't need indexes.
About the TTL: maybe you could programmatically ensure that you update all the columns in the row at the same time (in the same CQL statement).
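For instance, refreshing the TTL on a row means rewriting every non-key column in one statement (a sketch with hypothetical values; TTL 2592000 is 30 days):
UPDATE user_transactions USING TTL 2592000
SET transaction_date = '2023-06-01',
    transaction_time = '2023-06-01 12:00:00',
    transaction_type = 2
WHERE user_name = 'alice' AND transaction_id = 'tx-0001';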
Perhaps my answer will be a little useful.
I would have done it like this:
CREATE TABLE user_transactions (
    date timestamp,
    user_name text,
    id text,
    type int,
    PRIMARY KEY (id)
);

CREATE INDEX idx_user_transactions_uname ON user_transactions (user_name);
There is no need for a 'transaction_time timestamp' column, because this time will be set by Cassandra for each column and can be fetched with the WRITETIME(column_name) function. Because you write all the columns simultaneously, you can call this function on any column.
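For example, reading a transaction together with its write time might look like this (a sketch with a hypothetical id; WRITETIME returns microseconds since the epoch):
SELECT user_name, type, WRITETIME(type) FROM user_transactions WHERE id = 'tx-0001';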
INSERT INTO user_transactions ... USING TTL 86400;
will expire all columns simultaneously. So do not worry about deleting rows. See here: Expiring columns.
But as far as I know, you cannot delete an entire row this way: the key column still remains, and the other columns will read as NULL.
If you want to delete the rows manually, or just want an estimate of the rows to be deleted by a TTL, then I recommend the Astyanax driver: AllRowsReader All rows query.
And in general, as a driver for working with Cassandra, I recommend Astyanax.
