Cassandra table structure suggestion and how to query it - cassandra

I am trying to create the following hierarchy:
UserId as the row key, an hourly time series as columns, and inside each hourly column I want to have user-specific information such as hourly activity.
{
    UserId: long {
        Timestamp: datetime {
            pageview: integer,
            clicks: integer
        }
    }
}
I've read that it is possible to achieve this using super columns, but at the same time it was mentioned that super columns are deprecated now. If that is true, what alternatives can I use?
Could you please provide a CQL / Java Thrift example of how I should create and insert such a structure in Cassandra?
Thanks!

You can use a composite primary key for this; I've added a table-creation CQL query below. You can also use a counter column for clicks (note that a counter column would have to live in a separate table, since counter and non-counter columns cannot be mixed).
CREATE TABLE user_click_by_hour (
    userid bigint,
    time_stamp timestamp,
    clicks int,
    pageview int,
    PRIMARY KEY (userid, time_stamp)
);
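For example, a minimal sketch of how this table could be used (the user id and values below are made up): one row per user per hour, read back with a range on the clustering column.
-- insert one hourly bucket for a user
INSERT INTO user_click_by_hour (userid, time_stamp, clicks, pageview)
VALUES (42, '2015-06-05 13:00:00', 10, 120);
-- read that user's hourly series for a day
SELECT time_stamp, clicks, pageview FROM user_click_by_hour
WHERE userid = 42
  AND time_stamp >= '2015-06-05 00:00:00'
  AND time_stamp < '2015-06-06 00:00:00';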

If your information always belongs to a particular user and is accessed together, for example if at any time you require both clicks and pageview, I would suggest using it as a JSON store:
CREATE TABLE user_click_by_hour (
    userid bigint,
    time_stamp timestamp,
    val text,
    PRIMARY KEY (userid, time_stamp)
);
val is a JSON object containing clicks, pageview, etc.
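For example (a sketch with made-up values), each hourly record is written as a single JSON string:
INSERT INTO user_click_by_hour (userid, time_stamp, val)
VALUES (42, '2015-06-05 13:00:00', '{"clicks": 10, "pageview": 120}');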
Advantages:
1. You need not worry about altering the table to add an extra column, which would add a null value for each and every previous entry.
2. If this data is expected to grow, you are bound to save a lot of space, as there is one less column's worth of metadata on each node.

Related

Data modelling to facilitate pruning/bulk update/delete in ScyllaDB/Cassandra

Let's say I have a table like the one below, with a composite partition key.
CREATE TABLE heartrate (
    pet_chip_id uuid,
    date text,
    time timestamp,
    heart_rate int,
    PRIMARY KEY ((pet_chip_id, date), time)
);
Let's say there is a batch job to prune all the data older than X. I can't run the query below, since it's missing the other partition key column:
DELETE FROM heartrate WHERE date < '2020-01-01';
How do you model your data in such a way that this can be achieved in Scylla? I understand that internally Scylla creates a partition based on the partition keys, but in this case it's impossible to query the full list of pet_chip_id values and do N queries to delete.
Just wanted to know how people do this outside the RDBMS world.
The recommended way to delete old data automatically in Scylla is using the Time-to-live (TTL) feature:
When you write a row, you add "USING TTL 864000" if you want that data to be deleted automatically in 10 days. You can also specify a default TTL for a given table, so that every piece of data written to the table will expire after (say) 10 days.
Scylla's TTL feature is separate from the data itself, so it doesn't matter which columns you used as partition keys or clustering keys - in particular the "date" column no longer needs to be a clustering key (or exist at all, for that matter) - unless you also need it for something else.
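As an illustration, a sketch against the heartrate table from the question above (the values are made up; 864000 seconds is the 10-day TTL mentioned):
-- expire this row automatically after 10 days
INSERT INTO heartrate (pet_chip_id, date, time, heart_rate)
VALUES (uuid(), '2020-01-01', '2020-01-01 10:00:00', 72)
USING TTL 864000;
-- or give the whole table a default TTL
ALTER TABLE heartrate WITH default_time_to_live = 864000;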
As @nadav-harel said in his answer, if you can define a TTL that's always the best solution, but if you can't, a possible solution is to create a materialized view to be able to list the primary keys of the main table based on the field that you need to use in the delete query. In the prune job you can first do a select from the MV and then delete from the main table using the values you got from the MV.
Example:
CREATE TABLE my_table (
    a uuid,
    b text,
    c text,
    d int,
    e timestamp,
    PRIMARY KEY ((a, b), c)
);
CREATE MATERIALIZED VIEW my_mv AS
SELECT a, b, c
FROM my_table
WHERE a IS NOT NULL AND b IS NOT NULL AND c IS NOT NULL
PRIMARY KEY (b, a, c);
Then in your prune job you could select from my_mv based on b and then delete from my_table based on the values returned from the select query.
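For example, a sketch of what the prune job would run (the b value is illustrative):
-- list the base-table keys for the b value you want to prune
SELECT a, b, c FROM my_mv WHERE b = '2020-01-01';
-- then delete the matching partitions from the base table,
-- binding a from each row returned above
DELETE FROM my_table WHERE a = ? AND b = '2020-01-01';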
Note that this solution might not be effective depending on your model and the amount of data you have, but keep in mind that deleting data is also a way of querying your data and your model should be defined based on your queries needs, i.e. before defining your model, you need to think about every way you will query it (including how you will prune your data).

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am still confused about how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as that is the part I don't understand):
CREATE TABLE documents (
    uid text,
    created text,
    data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
The first one would look like:
CREATE TABLE documents (
    uid text,
    created text,
    data text,
    PRIMARY KEY (uid)
);
and you retrieve your data with: SELECT * FROM documents WHERE uid = 'xxxx-yyyy-zzzzz'; Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
- make sure a single partition won't be too large (max 100 MB, otherwise you will run into trouble)
- satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
    year int,
    month int,
    day int,
    uid text,
    data text,
    PRIMARY KEY ((year, month), day, uid)
);
This works fine if, within a day, you don't have too many documents (so your partitions don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year = 2018 AND month = 12 AND day >= 6 AND day <= 24; If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date, and use the documents table to retrieve the data given the uid you retrieved from documents_by_date.
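That two-step lookup would look roughly like this (a sketch; the dates and uid are illustrative):
-- step 1: find the document ids in the date range
SELECT uid FROM documents_by_date WHERE year = 2018 AND month = 12 AND day >= 6;
-- step 2: fetch each document by id from the documents table
SELECT * FROM documents WHERE uid = 'xxxx-yyyy-zzzzz';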
If your partition is still too large, you will need to add the hour to the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify how your data is distributed with respect to user and create time, but since it is a document, I am assuming that one user creates one document at a given "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
    uid text,
    created text,
    data text,
    PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) helps you get the data ordered by created for a given user.
For your first requirement, you can query as shown below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement, you can query as shown below.
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used diligently, as it scans all partitions. If we were to create a separate table with date as the primary key, it would become tricky if many documents are inserted at the very same second. Clustering order works best for requirements where the documents for a given user need to be sorted by time.

Cassandra DB Java: for each row returned by a query, fetch data stored in a separate table (e.g. counter table)?

Does anyone know of an efficient way to fetch counter data stored in a separate table for each row returned by a query?
The tables are defined as follows:
TABLE person (
    id timeuuid,
    name text,
    ... many other attributes ...
);

TABLE person_counts (
    id timeuuid, // same id as person
    count1 counter,
    PRIMARY KEY (id)
);
The goal is that when persons (or a single person) are fetched, the count is added before returning. Is iterating over each person and querying person_counts the only way to achieve this? It needs to be a counter; however, since I need a certain primary key for the person table, it seems I cannot have a counter directly there.
I am using the DataStax Cassandra driver, if it makes a difference.
For this insertion, you can use the batch operation, which is an atomic operation in the Cassandra DataStax driver. Whenever you insert a record into the persons table, you create a prepared query for the persons table and one for the persons_count table, add the two prepared queries to a single batch, and carry out the insertion. The advantage of batches is that they are atomic, i.e. either both records are inserted or none at all.
In the same way, whenever you want to delete from the persons table, create a batch and delete from both the persons and persons_count tables. You can read more about them here: https://datastax.github.io/cpp-driver/topics/basics/batches/
Note: the two tables are independent and you can read entries of the two tables separately. Inserting via a batch operation does not make them interlinked.
Now, for the fetching requirement, you have to query the count from one table and then go to the persons table; there is probably no other way, as Cassandra doesn't support joins. Moreover, you have to specify the primary key for the persons table and the other attributes, which helps in deciding whether the count should be in another table or can be kept in the same table. If you are fine with this implementation, you can use this:
CREATE TABLE persons (
    id uuid,
    name text,
    count counter,
    PRIMARY KEY (id, name)
);
and an update statement for the counter column. Then there is no need for the other table.
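For instance, a sketch of such an update (the id and name values are illustrative; counter columns can only be modified with UPDATE, not INSERT):
UPDATE persons SET count = count + 1
WHERE id = 123e4567-e89b-12d3-a456-426614174000 AND name = 'Alice';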

Load data into Cassandra denormalized table

I understand that, as Cassandra does not support joins, we sometimes need to create denormalized tables.
Given that I need to get the item name for each item within an order, given an order id, I create a table using:
CREATE TABLE order (
order_id int,
item_id int,
item_name,
primary key ((id), item_id)
);
I have two csv files to load data from, order.csv and item.csv, where order.csv contains order_id and item_id and item.csv contains item_id and item_name.
The question is how to load data from the csv files into the table I created. I insert data from the order file first and it works fine. When I do an insertion of the item data, it throws an error saying the primary key is missing.
Any idea how I can insert data from different input files into the denormalized table? Thanks.
There is a typo in the definition of the primary key; it should be:
CREATE TABLE order (
    order_id int,
    item_id int,
    item_name text,
    PRIMARY KEY (order_id, item_id)
);
Are you using COPY to upload the data?
Regarding the denormalization, that depends on your use case. Usually, in a normalized schema, you will have one table for orders and another for customers, and do a join with SQL to display information about the order and the customer at the same time; in a denormalized table you will have the order and the customer information in the same table, and the fields will depend on how you are going to use the query.
As a rule of thumb, before creating the table, you first need to define which queries you are going to run.
Using a secondary index on your item_id should do the trick:
CREATE INDEX idx_item_id ON order (item_id);
Now you should be able to query like:
SELECT * FROM order WHERE item_id = ?;
Beware that indexes usually have performance impacts, so you can use them to import your data, and drop them when finished.
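Dropping it once the import is finished would look like this (using the index name from the example above):
DROP INDEX idx_item_id;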
Please refer to the Cassandra Index Documentation for further information.

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example, if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add 'birthday' to the primary key (let's say 'ID', 'birthday'), I cannot perform this query because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This was not done by accident, but because such a feature (e.g. update users set status=1 where id>10) would allow a user to update all the data in a table at once, which can be extremely expensive on large databases. Cassandra explicitly forbids all operations requiring data scans across multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
    id timeuuid PRIMARY KEY,
    dob timestamp,
    status text
);
and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside an IN clause may cause performance issues on C*.
But how can you have an efficient range query like select id from some_table where dob > '2000-01-01 00:00:01'? There are two options available, and neither of them is really acceptable:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
    year int,
    dob timestamp,
    ids list<timeuuid>,
    PRIMARY KEY (year, dob)
);
with a compound partition+clustering primary key, and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to get a reasonably even partition distribution in the cluster. The general idea is that you really shouldn't have a small number of very large partitions; prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you revisit your application logic so as to avoid updating/deleting data at all:
instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
    id timeuuid,
    updated_at timestamp,
    status text,
    PRIMARY KEY (id, updated_at)
);
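A sketch of how this table would be used (the uuid and status values are illustrative): append a new status row instead of updating in place, then read the latest one per user.
-- append the new status instead of overwriting the old one
INSERT INTO user_statuses (id, updated_at, status)
VALUES (123e4567-e89b-12d3-a456-426614174000, toTimestamp(now()), 'inactive');
-- read the most recent status for a user
SELECT status FROM user_statuses
WHERE id = 123e4567-e89b-12d3-a456-426614174000
ORDER BY updated_at DESC LIMIT 1;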
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.
