Load data into a Cassandra denormalized table

I understand that since Cassandra does not support joins, we sometimes need to create a denormalized table.
Given that I need to get the item names for each item within an order, given an order ID, I create a table using:
CREATE TABLE order (
order_id int,
item_id int,
item_name text,
primary key ((id), item_id)
);
I have two CSV files to load data from, order.csv and item.csv, where order.csv contains order_id and item_id, and item.csv contains item_id and item_name.
The question is how to load data from the CSV files into the table I created. I insert data from the order file first and it works fine. When I then insert the item data, it throws an error saying the primary key is missing.
Any idea how I can insert data from different input files into the denormalized table? Thanks.

There is a typo in the definition of the primary key; it should be:
CREATE TABLE order (
order_id int,
item_id int,
item_name text,
primary key (order_id, item_id)
);
Are you using COPY to upload the data?
Regarding the denormalization, that depends on your use case. In a normalized schema you usually have one table for orders and another for customers, and you join them with SQL to display order and customer information at the same time; in a denormalized table, the order and the customer information live in the same table, and the fields depend on how you are going to query it.
As a rule of thumb, before creating the table, you first need to define the queries you are going to run.
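If you are using COPY, here is a minimal sketch, assuming a single pre-joined CSV (the file name order_items.csv is hypothetical; COPY cannot join order.csv and item.csv for you, so you would have to merge them beforehand, e.g. with a script). Note that order is a reserved word in CQL, so the table name has to be double-quoted, or better, renamed to something like orders:
COPY "order" (order_id, item_id, item_name) FROM 'order_items.csv' WITH HEADER = TRUE;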

Using a secondary index on your item_id should do the trick:
CREATE INDEX idx_item_id ON order (item_id);
Now you should be able to query like:
SELECT * FROM order WHERE item_id = ?;
Beware that indexes usually have a performance impact, so you can create one just to import your data and drop it when finished.
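When the import is done, the index from the example above can be dropped with:
DROP INDEX idx_item_id;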
Please refer to the Cassandra Index Documentation for further information.

Related

Data modelling conflicts in Cassandra

The schema I am using is the following:
CREATE TABLE mytable(
id varchar,
date date,
name varchar,
PRIMARY KEY ((date), name, id)
) WITH CLUSTERING ORDER BY (name desc);
I have two queries for my use case:
Fetching all records for a given name.
Deleting all records for a given date.
As we can't delete records without the partition key being specified, my partition key is fixed to date alone, and no other column can be added to it because I won't have anything except the date at deletion time.
But to fetch records by name, I need to use ALLOW FILTERING, since the whole table has to be scanned, which causes performance issues.
Can you suggest a better way to skip ALLOW FILTERING that is still compatible with deleting by date?
You could use indexes:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSecondaryIndex.html
But you have to be careful: there can be issues depending on the size of the table. You should read this post for more information:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
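If you try an index, a minimal sketch against the schema from the question (the index name is arbitrary):
CREATE INDEX mytable_name_idx ON mytable (name);
SELECT * FROM mytable WHERE name='bob';
With the index in place, the query by name no longer needs ALLOW FILTERING, but keep the caveats about table size from the post above in mind.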
You need an additional table to support your requirements.
Your main query is to retrieve the records for a given name. For this, you should define mytable as follows (note the primary key):
CREATE TABLE mytable(
id varchar,
date date,
name varchar,
PRIMARY KEY ((name), date, id)
) WITH CLUSTERING ORDER BY (date desc);
This table will let you retrieve your data for a given name with (query 1):
SELECT * FROM mytable WHERE name='bob';
Now, you want to be able to delete by date. For this you would need the following additional table:
CREATE TABLE mytable_by_date(
id varchar,
date date,
name varchar,
PRIMARY KEY ((date), name, id)
) WITH CLUSTERING ORDER BY (name ASC);
This table will let you find the name (and id) for a given date with:
SELECT * from mytable_by_date WHERE date='your-date';
I don't know your business requirements, so this query might return 0, 1, or more results. Once you have them, you can issue the deletes against the first and second table (maybe using a logged batch for atomicity?):
DELETE FROM mytable_by_date WHERE date='your-date' AND name='the-name' AND id='the-id';
DELETE FROM mytable WHERE name='the-name' AND ...
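Wrapped in the logged batch mentioned above, the pair would look like this (same placeholder values):
BEGIN BATCH
DELETE FROM mytable_by_date WHERE date='your-date' AND name='the-name' AND id='the-id';
DELETE FROM mytable WHERE name='the-name' AND date='your-date' AND id='the-id';
APPLY BATCH;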
Overall, you might need to adjust according to your business requirements (is name unique, is uniqueness enforced by id, etc.).
Hope it helps!

Delete whole row based on a clustering column value in Cassandra

The schema I am using is as follows:
CREATE TABLE mytable(
id int,
name varchar,
PRIMARY KEY ((id), name)
) WITH CLUSTERING ORDER BY (name desc);
I wanted to delete records with the following command:
DELETE FROM mytable WHERE name = 'Jhon';
But it gives the error:
[Invalid query] message="Some partition key parts are missing: name"
As I looked for the reason, I came to know that deletes are not possible using only clustering columns.
Then I tried
DELETE FROM mytable WHERE id IN (SELECT id FROM mytable WHERE name='Jhon') AND name = 'Jhon';
But obviously it did not work.
I then tried setting a TTL of 0 to delete the row. But a TTL can be set only on particular columns, not on the entire row.
What are feasible alternates to perform this operation?
In Cassandra, you need to design your data model to support your queries. When you query your data, you always have to provide the partition key (otherwise the query would be inefficient).
The problem is that you want to query your data without a partition key. You would need to denormalize your data to support this kind of request. For example, you could add an additional table, such as:
CREATE TABLE id_by_name(
name varchar,
id int,
PRIMARY KEY (name, id)
) WITH CLUSTERING ORDER BY (id desc);
Then, you would be able to do your delete with a few queries:
SELECT id FROM id_by_name WHERE name='Jhon';
Let's assume this returns 4.
DELETE FROM mytable WHERE id=4;
DELETE FROM id_by_name WHERE name='Jhon' AND id=4;
You could try to leverage a materialized view (instead of maintaining id_by_name yourself), but materialized views are currently marked as unstable.
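For reference, a sketch of such a view over mytable from the question, so you would not maintain id_by_name by hand:
CREATE MATERIALIZED VIEW id_by_name AS
SELECT name, id FROM mytable
WHERE name IS NOT NULL AND id IS NOT NULL
PRIMARY KEY (name, id);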
Now, there are still a few issues you need to address in your data model, in particular how you handle multiple users with the same name, etc.
You cannot delete by an incomplete primary key. Primary key decisions drive sharding and load balancing. Cassandra can get complex if you are not used to thinking in columns.
I don't like the above answer, which, though good, complicates your solution. If you are thinking relationally but getting lost in Cassandra, I suggest using something that simplifies and maps your thinking to relational views.

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am still more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as this is my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz'; Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
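A sketch of that variant, with a placeholder uuid literal in the query:
CREATE TABLE documents (
uid uuid PRIMARY KEY,
created text,
data text
);
SELECT * FROM documents WHERE uid = 123e4567-e89b-12d3-a456-426614174000;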
Second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100 MB, otherwise you will run into trouble);
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid)
);
This works fine if, within a day, you don't have too many documents (so your partitions don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date, and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
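A hedged sketch of that last variant, with hour moved into the partition key (range queries then have to enumerate the hours of a day, and the data field is assumed to have been moved out as described above):
CREATE TABLE documents_by_date (
year int,
month int,
day int,
hour int,
uid text,
PRIMARY KEY ((year, month, day, hour), uid)
);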
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it's a document, I am assuming that one user will create one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data ordered by created for a given user.
For your first requirement you can query as given below:
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query as given below:
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. If we had to create a separate table with date as the primary key, it would become tricky when many documents are inserted at the very same second. Clustering order works best for requirements where documents for a given user need to be sorted by time.

Cassandra DB Java: for each row returned by a query, fetch data stored in a separate table (e.g. a counter table)?

Does anyone know of an efficient way to fetch counter data stored in a separate table for each row returned by a query?
The tables are defined as follows:
CREATE TABLE person (
id timeuuid,
name text,
many other attributes... );
CREATE TABLE person_counts (
id timeuuid, //same id as person
count1 counter,
PRIMARY KEY (id));
The goal is that when persons (or a single person) are fetched, the count is added before returning. Is iterating over each person and querying person_counts the only way to achieve this? It needs to be a counter; however, since I need a certain primary key for the person table, it seems I cannot have a counter directly there.
I am using DataStax Cassandra, if it makes a difference.
For this insertion, you can use a batch operation, which is atomic in the Cassandra DataStax driver. Whenever you insert a record into the person table, create prepared queries for the person and person_counts tables, add the two prepared queries to a single batch, and execute it. The advantage of batches is that they are atomic: either both records are inserted or none at all.
In the same way, whenever you want to delete from the person table, create a batch and delete from both the person and person_counts tables. You can read more about them here: https://datastax.github.io/cpp-driver/topics/basics/batches/
Note: the two tables are independent, and you can read entries of the two tables separately. Inserting via a batch operation does not make them interlinked.
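One caveat: counter columns can only be modified with UPDATE, and Cassandra does not allow counter and non-counter statements in the same batch, so with a counter table such as person_counts the two writes stay separate (bind markers as placeholders):
INSERT INTO person (id, name) VALUES (?, ?);
UPDATE person_counts SET count1 = count1 + 1 WHERE id = ?;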
Now, for the fetching requirement, you have to first query the count from the counter table and then go to the person table. There is probably no other way, as Cassandra doesn't support joins. Moreover, the primary key and other attributes of the person table determine whether the count must be in another table or can be kept in the same table. If you are fine with this implementation, you can use this:
CREATE TABLE persons (
id uuid,
name text,
count counter,
PRIMARY KEY (id, name)
);
and an UPDATE statement for the counter column. Then there is no need for the other table.
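The counter would then be maintained with an update along these lines (bind markers as placeholders):
UPDATE persons SET count = count + 1 WHERE id = ? AND name = ?;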

Cassandra Update query | timestamp column as clustering key

I have a table in cassandra with following schema:
CREATE TABLE user_album_entity (
userId text,
albumId text,
updateDateTimestamp timestamp,
albumName text,
description text,
PRIMARY KEY ((userId), updateDateTimestamp)
);
The query required to get data would have WHERE userId = xxx ORDER BY updateDateTimestamp. Hence the schema has updateDateTimestamp as a clustering key.
The problem comes in updating a column of the table. The query is: update the album information for the user where user id = xxx. But as per the spec, for an update query I would need the exact value of updateDateTimestamp, which in a real-world scenario an application would never send.
What should be the answer to such problems, since I believe this is a very common use case where the select query requires ordering by timestamp? Any help is much appreciated.
The problem is that your table structure allows the same album to have multiple records with the only difference being the timestamp (the clustering key).
Three possible solutions:
Remove the clustering key and sort your data at application level.
Remove the clustering key and add a Secondary Index to the timestamp field.
Remove the clustering key and create a Materialized View to perform the query.
If your use case is such that each partition will contain exactly one row,
then you can model your table like:
CREATE TABLE user_album_entity (
userId text,
albumId text static,
updateDateTimestamp timestamp,
albumName text static,
description text static,
PRIMARY KEY ((userId), updateDateTimestamp)
);
Modelling the table this way enables the update query to be done in the following way:
UPDATE user_album_entity SET albumId = 'updatedAlbumId' WHERE userId = 'xyz';
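The read query from the original requirement then still works against the same table:
SELECT * FROM user_album_entity WHERE userId = 'xyz' ORDER BY updateDateTimestamp DESC;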
Hope this helps.
