Cassandra CQL schema best practice

Here I am again, asking a similar question after getting a really great explanation in
How do secondary indexes work in Cassandra?
CREATE TABLE update_audit (
scopeid bigint,
formid bigint,
time timestamp,
operation int,
record_id bigint,
ipaddress text,
user_id bigint,
value text,
PRIMARY KEY ((scopeid), formid, time)
) WITH CLUSTERING ORDER BY (formid ASC, time DESC);
FYI:
The operation column's possible values are 1, 2 and 3: low cardinality.
record_id is high-cardinality; every entry can be unique.
user_id is the best candidate for an index according to How do secondary indexes work in Cassandra? and The sweet spot for cassandra secondary indexing.
Search should work based on:
- time with limit 100.
- operation and time with limit 100.
- user_id and time with limit 100.
- record_id and time with limit 100.
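For reference, the plain time-based search against update_audit would look roughly like this (the key values are placeholders, not from the question); the clustering order already returns the newest entries first:
SELECT * FROM update_audit
WHERE scopeid = 1 AND formid = 2
LIMIT 100;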
Problems
Total records: more than 10,000M (10 billion).
Which one is best?
- creating an index over operation, user_id and record_id and applying limit 100.
1) Will the hidden column family for the index return only 100 results?
2) Will more seeks slow down the fetch operation?
OR creating a new column family with a definition like:
CREATE TABLE audit_operation_idx (
scopeid bigint,
formid bigint,
operation int,
time timeuuid,
PRIMARY KEY ((scopeid), formid, operation, time)
) WITH CLUSTERING ORDER BY (formid ASC, operation ASC, time DESC);
This requires two SELECT queries for a single read (see the sketch below).
So, if I create new column families for operation, user_id and record_id, I have to make a batch query to insert into these four column families.
3) Will TCP problems come up while executing the batch query, because the writes will be huge?
4) What else should I cover to avoid unnecessary problems?
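For clarity, the two selects mentioned above would look roughly like this against the proposed tables (bind markers stand in for real values):
-- 1) Look up the matching keys in the index-style table:
SELECT time FROM audit_operation_idx
WHERE scopeid = ? AND formid = ? AND operation = ?
LIMIT 100;
-- 2) Fetch the full rows from the base table by those keys
--    (note: audit_operation_idx declares time as timeuuid while
--    update_audit declares it as timestamp, so the types would need aligning):
SELECT * FROM update_audit
WHERE scopeid = ? AND formid = ? AND time = ?;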

There are three options.
- Create a new table and use bulk inserts. If the size of the insert batch becomes huge you'll have to configure the related parameters (e.g. the batch size thresholds in cassandra.yaml). Don't worry about write volume in Cassandra.
- Create a materialized view with the required columns of the WHERE clause.
- Create a secondary index if cardinality is low. (Not recommended.)
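As a rough illustration of the materialized-view option, something like the following could serve the "operation and time" search; the view name and key layout are assumptions, not part of the original question:
CREATE MATERIALIZED VIEW update_audit_by_operation AS
SELECT * FROM update_audit
WHERE scopeid IS NOT NULL AND formid IS NOT NULL
AND time IS NOT NULL AND operation IS NOT NULL
PRIMARY KEY ((scopeid), operation, time, formid)
WITH CLUSTERING ORDER BY (operation ASC, time DESC, formid ASC);
-- The "operation and time with limit 100" search then becomes:
SELECT * FROM update_audit_by_operation
WHERE scopeid = ? AND operation = ?
LIMIT 100;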


Cassandra Data modeling avoiding tombstone

I am starting with an initial idea of rewriting a mammoth spark-kafka-hbase application as spark-kafka-cassandra (on Kubernetes).
I have the following two data models: one supports all-time inserts and the other supports upserts.
Approach 1:
create table test.inv_positions(
location_id int,
item bigint,
time_id timestamp,
sales_floor_qty int,
backroom_qty int,
in_backroom boolean,
transit_qty int,
primary key ((location_id), item, time_id)
) with clustering order by (item asc, time_id DESC);
This table keeps inserting because time_id is part of the clustering columns. I am thinking of reading the latest record (time_id is DESC) with a fetch of 1, and somehow deleting the old records by either setting a TTL on the key columns or deleting them overnight.
Concerns: TTL or deleting the old records creates tombstones.
Approach 2:
create table test.inv_positions(
location_id int,
item bigint,
time_id timestamp,
sales_floor_qty int,
backroom_qty int,
in_backroom boolean,
transit_qty int,
primary key ((location_id), item)
) with clustering order by (item asc);
With this table, if a new record comes in for the same location and item, it is upserted. It's easy to read, and there's no need to worry about purging old records.
Concerns: I have another application on Cassandra that updates different columns at different times, and we still have read issues. That said, upserts also create tombstones, but how much worse is that compared to approach 1? Or is there a better way to model it?
The first approach seems good. TTL and delete both create tombstones. You can choose the compaction strategy to suit TTL-based deletes: TWCS is better for TTL-based deletes, else you can use STCS for simple deletes. Also, configure gc_grace_seconds accordingly to clear tombstones smoothly, because heavy tombstone buildup leads to read latency.
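A minimal sketch of what that could look like for approach 1; the TTL, window size and gc_grace_seconds values are illustrative assumptions to be tuned to your workload and repair cadence:
ALTER TABLE test.inv_positions
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 1}
AND default_time_to_live = 604800   -- expire rows after 7 days (assumed)
AND gc_grace_seconds = 86400;       -- 1 day; keep it >= your repair interval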

What do nested parentheses indicate in a PRIMARY KEY definition

What is difference between these two kinds of tables in Cassandra?
First :
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
volts2 float,
PRIMARY KEY (sensor_id, collected_at, volts)
);
and Second:
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
volts2 float,
PRIMARY KEY ((sensor_id, collected_at), volts)
);
My questions:
What is difference between these two tables?
When would we use first table, and when would we use the second table?
The difference is the primary key. A Cassandra primary key is divided into (Partition Key, Clustering Key).
The Partition Key decides where a row goes within the ring, and the Clustering Key determines how rows with the same partition key are stored, to make use of the on-disk sorting of columns in your queries.
First table:
sensor_id is your Partition Key, so you know every row with the same sensor_id will go to the same node.
You have two clustering keys, the collected_at and volts fields, so data with the same sensor_id will be stored ordered by collected_at in ascending order, and data with the same sensor_id and collected_at will be stored ordered by volts in ascending order.
Second table:
You have a compound Partition Key (sensor_id, collected_at), so you know every row with the same sensor_id and collected_at will go to the same node.
Your clustering key is volts, so data with the same (sensor_id, collected_at) will be stored ordered by volts in ascending order.
Imagine you have billions of rows for the same sensor_id. Using the first approach you will store them all on the same node, so you will probably run out of space. If you use the second approach you will have to query using an exact sensor_id and collected_at timestamp, so it probably doesn't make sense. Because of this, in Cassandra modeling you must know what queries you are going to execute before creating the model.
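A couple of sketch queries illustrating that difference (the literal values are placeholders):
-- First table: the partition key is sensor_id alone, so this works,
-- and you can optionally narrow by the clustering columns:
SELECT * FROM data WHERE sensor_id = 42;
SELECT * FROM data WHERE sensor_id = 42 AND collected_at = '2015-04-22 14:53:00';
-- Second table: both partition key components are required:
SELECT * FROM data WHERE sensor_id = 42 AND collected_at = '2015-04-22 14:53:00';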
The first table partitions data on sensor_id only. Meaning, that all data underneath each sensor_id is stored in the same data partition. The hashed token value of sensor_id also determines which node(s) in the cluster the data partition is stored on. Data within each partition is sorted by collected_at and volts.
The second table uses a composite key on both sensor_id and collected_at to determine data partitioning. Data in each partition is sorted by volts.
When would we use the first table, and when would we use the second table?
As you have to pass all of your partition keys in a query, the first table offers more query flexibility. That is, you can decide to query only on sensor_id, and then you can choose whether or not to also query by collected_at and then volts. In the second table, you have to query by both sensor_id and collected_at. So you have less query flexibility, but you get better data distribution out of the second model.
And actually, partitioning on a timestamp (second table) value is typically not very useful, because you would have to have that exact timestamp before executing your query. Typically what you see when timestamp components are used in a partition key, is in a technique called "date bucketing," in which you would use something with less precision like month or day. That way, you could still query for an entire month/day or whatever your bucket was.
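A minimal sketch of date bucketing for the sensor data above; the day column, its granularity and the table name are assumptions for illustration:
CREATE TABLE data_by_day (
sensor_id int,
day date,                -- the bucket: one partition per sensor per day
collected_at timestamp,
volts float,
volts2 float,
PRIMARY KEY ((sensor_id, day), collected_at)
) WITH CLUSTERING ORDER BY (collected_at DESC);
-- All readings for one sensor on one day, newest first:
SELECT * FROM data_by_day WHERE sensor_id = 42 AND day = '2015-04-22';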

Cassandra ordering or sorting

I need to fetch the latest results in a table without specifying the partition key. For example, I need the latest tweets. The problems I am facing are as follows:
create table test2.threads(
thread text ,
created_date timestamp,
forum_name text,
subject text,
posted_by text,
last_reply_timestamp timestamp,
PRIMARY KEY (thread,last_reply_timestamp)
)
WITH CLUSTERING ORDER BY (last_reply_timestamp DESC);
I can retrieve data only if I know the partition key:
select * from test2.threads where thread='one' order by last_reply_timestamp DESC;
How can I get the latest threads sorted descending without specifying a where condition?
Your data model is not suited for that purpose. The partitions are not ordered. You'd have to loop over the partition keys, fetch a few and then see which ones are the most recent at the application level.
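One common alternative, not part of the answer above but a sketch of how this is often modeled, is to maintain a separate table whose partition key is a coarse time bucket, so the latest threads can always be read from a known partition (the table and column layout here are assumptions):
create table test2.latest_threads(
bucket text,
last_reply_timestamp timestamp,
thread text,
PRIMARY KEY (bucket, last_reply_timestamp, thread)
) WITH CLUSTERING ORDER BY (last_reply_timestamp DESC, thread ASC);
-- Latest 100 threads for today's bucket (e.g. one partition per day):
SELECT * FROM test2.latest_threads WHERE bucket = '2015-04-22' LIMIT 100;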

Using secondary indexes to update rows in Cassandra 2.1

I'm using Cassandra 2.1 and have a model that roughly looks as follows:
CREATE TABLE events (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((client_id, bucket), timestamp)
);
CREATE INDEX events_ticket ON events(ticket_id);
As you can see, I've created a secondary index on ticket_id. This index works OK. events contains around 100 million rows; only 5 million of these rows have a ticket_id, spread over around 50,000 distinct tickets. So a ticket has, on average, 100 events.
Querying the secondary index works without supplying the partition key, which is convenient in our situation, as the bucket column is sometimes hard to determine beforehand (i.e. you would need to know the date of the events; bucket is currently the date).
cqlsh> select * from events where ticket_id = 123;
client_id | bucket | timestamp | ... | ticket_id
-----------+--------+-----------+-----+-----------
(0 rows)
How do I solve the problem when all events of a ticket should be moved to another ticket? I.e. the following query won't work:
cqlsh> UPDATE events SET ticket_id = 321 WHERE ticket_id = 123;
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY ticket_id found in where clause"
Does this imply secondary indexes cannot be used in UPDATE queries?
What model should I use to support these changes?
First of all, UPDATE and INSERT operations are treated the same in Cassandra. They are colloquially known as "UPSERTs."
Does this imply secondary indexes cannot be used in UPDATE queries?
Correct. You cannot perform an UPSERT in Cassandra without specifying the complete PRIMARY KEY. Even UPSERTs with a partial PRIMARY KEY will not work. And (as you have discovered) UPSERTing by an indexed value does not work, either.
How do I solve the problem when all events of a ticket should be moved to another ticket?
Unfortunately, the only way to accomplish this is to query the keys of each row in events (with a particular ticket_id) and UPSERT ticket_id by those keys. The nice thing is that you don't have to DELETE them first, because ticket_id is not part of the PRIMARY KEY.
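A sketch of what that looks like in practice; the literal key values below are placeholders:
-- 1) Use the secondary index to find the primary keys of the affected rows:
SELECT client_id, bucket, timestamp FROM events WHERE ticket_id = 123;
-- 2) For each row returned, upsert the new ticket_id by its full primary key:
UPDATE events SET ticket_id = 321
WHERE client_id = 2112 AND bucket = 20150422
AND timestamp = 4a7e2730-e929-11e4-88c8-21b264d4c94d;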
How do I solve the problem when all events of a ticket should be moved to another ticket?
I think your best plan here would be to forego a secondary index altogether, and create a query table to work alongside your events table:
CREATE TABLE eventsbyticketid (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((ticket_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
This would allow you to query by ticket_id quickly (to obtain your client_id, bucket, and timestamp). That gives you the information you need to UPSERT the new ticket_id on your events table.
You could also then perform a DELETE by ticket_id (on the eventsbyticketid table). Cassandra does allow a DELETE operation with a partial PRIMARY KEY, as long as you have the full partition key (ticket_id). So removing old ticket_ids from the query table would be easy. And to ensure write atomicity, you could batch the UPSERTs together:
BEGIN BATCH
UPDATE events SET ticket_id = 321 WHERE client_id=2112 AND bucket=20150422 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
UPDATE eventsbyticketid SET client_id=2112, bucket=20150422 WHERE ticket_id=321 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
APPLY BATCH;
Which is actually the same as performing:
BEGIN BATCH
INSERT INTO events (client_id,bucket,timestamp,ticket_id) VALUES(2112,20150422,4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
INSERT INTO eventsbyticketid (client_id,bucket,timestamp,ticket_id) VALUES(2112,20150422,4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
APPLY BATCH;
Side note: timestamp is actually a (reserved word) data type in Cassandra. This makes it a pretty lousy name for a timeuuid column.
You can use the secondary index to query the events for the old ticket, and then use the primary key from those retrieved events to update the events.
I'm not sure why you need to do this manually; it seems like something Cassandra should be able to do under the hood.

Is Cassandra secondary index optimized if the partition key specified?

For secondary index queries where the partition key is specified in the WHERE clause, does the secondary index lookup hit all cluster nodes, or just the node of the specified partition key?
If the latter is correct, then a secondary index would also be a good fit for high-cardinality fields (but only for queries that specify the partition key).
EDIT: For example, for the following feed schema, a query for a specific feed (feed_id specified) that retrieves existing or deleted feed items should be very efficient:
CREATE TABLE my_feed (
feed_id int,
item_id timeuuid,
is_deleted boolean,
data text,
PRIMARY KEY (feed_id, item_id)
) WITH CLUSTERING ORDER BY (item_id DESC);
CREATE INDEX my_feed_is_deleted_idx ON my_feed (is_deleted);
==> SELECT * FROM my_feed WHERE feed_id=1 AND is_deleted=false; --efficient?
If you hit a partition key first, it won't be a cluster-wide operation; only the target partition will be hit. If you have wide partitions with many rows, a secondary index is an efficient way to filter rows down once a partition is hit.
When and when not to use a secondary index and why is covered here: https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useWhenIndex.html
