How to find the delta difference for a table in cassandra using uuid column type - cassandra

I have the following table on my Cassandra db, I want to find the delta difference in terms of cassandra query. For example, if I operate any insert,update,delete operation to the table I should be able to show which row/rows are getting impacted as my final result.
Let's say on first instance I have perform some 10 rows insertions so if I take the delta difference the output should only show that 10 rows are inserted. Same if we modify any number of rows or delete some rows then those changes should be captured.
Next time if we run the query it should idealy give 0 as we have not insert/modify/delete any row/rows
Here is the following table
CREATE TABLE datainv (
datainv_account_id uuid,
datainv_run_id uuid,
id uuid,
datainv_summary text,
json text,
number text,
PRIMARY KEY (datainv_account_id, datainv_run_id));
many things I have searched on internet but most of the solution are based on timeuuid,but in this case I have uuid columns only. So I'm not getting any solution that the same use-case can be achieved using uuid

It's not so easy to generate a diff between 2 table states in Cassandra, because you can't easily detect if you have inserted new partitions or not. You can implement something based on the timeuuid or on the timestamp as clustering column - in this case you'll able to filter out the data since latest change, as you have ordering of values that you don't have with uuid that is completely random. But it still requires that you perform the full scan of all the table. Plus it won't detect deletions...
Theoretically you can implement this with Spark as following:
read all primary key values & store this data in some other table/on disk;
next time, read all primary key values & find difference between original set of primary keys & new set - for example, do full outer join & use presence of None on left as addition, and presence of None on right as deletion;
store new set of the primary keys in a separate table/on disk, but previous version should be truncated.
but it will consume quite a lot of resources.

Related

Data modelling to faciliate pruning/bulk update/delete in scylladb/cassandra

Lets say I have a table like below with a composite partition key.
CREATE TABLE heartrate (
pet_chip_id uuid,
date text,
time timestamp,
heart_rate int,
PRIMARY KEY ((pet_chip_id, date), time)
);
Lets say there is a batch job to prune all the data older than X. I can't do below query since its missing other partition key in the query.
DELETE FROM heartrate WHERE date < '2020-01-01';
How do you model your data such a way that this can be achieved in Scylla? I understand that internally scylla creates a partition based on partition keys but in this case its impossible to query all the list of pet_chip_id and do N queries to delete.
Just wanted to know how people do this outside RDBMS world.
The recommended way to delete old data automatically in Scylla is using the Time-to-live (TTL) feature:
When you write a row, you add "USING TTL 864000" is you want that data to be deleted automatically in 10 days. You can also specify a default TTL for a given table, so that every piece of data written to the table will get expired after (say) 10 days.
Scylla's TTL feature is separate from the data itself, so it doesn't matter which columns you used as partition keys or clustering keys - in particular the "date" column no longer needs to be a clustering key (or exist at all, for that matter) - unless you also need it for something else.
As #nadav-harel said in his answer if you can define a TTL that's always the best solution but if you can't, a possible solution is to create a materialized view to be able to list the primary keys of the main table based on the field that you need to use in the delete query. In the prune job you can first do a select from the MV and then delete from the main table using the values that you got from the MV.
Example:
CREATE TABLE my_table (
a uuid,
b text,
c text,
d int,
e timestamp
PRIMARY KEY ((a, b), c)
);
CREATE MATERIALIZED VIEW my_mv AS
SELECT a,
b,
c
FROM my_table
WHERE time IS NOT NULL
PRIMARY KEY (b, a, c);
Then in your prune job you could select from my_mv based on b and then delete from my_table based on the values returned from the select query.
Note that this solution might not be effective depending on your model and the amount of data you have, but keep in mind that deleting data is also a way of querying your data and your model should be defined based on your queries needs, i.e. before defining your model, you need to think about every way you will query it (including how you will prune your data).

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select an event with the highest start_time. It is possible? I've tried to create a secondary index, but no success.
This is a query I've created
select * from active_call order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all data and find the highest value. And this will require scanning of data on multiple nodes, and will be very long.
Materialized view also won't help much as order for data only exists inside an individual partition, so you will need to put all your data into a single partition that could be huge and data would be imbalanced.
I can only think of following workaround:
Have an additional table that will have all columns of the original table, but with a fake partition key and no clustering columns
You do inserts into that table in parallel to normal inserts, but use a fixed value for that fake partition key, and explicitly setting a timestamp for a record equal to start_time (don't forget to multiple by 1000 as timestamp uses microseconds). In this case it will guaranteed to be the value with the highest timestamp as Cassandra won't override it with other data with lower timestamp.
But this doesn't solve a problem with data skew, and all traffic will be handled by fixed number of nodes equal to RF.
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions each with thousands of rows spread across hundreds of nodes. A full table scan in a large cluster will take a very long time if it was allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

YCQL Secondary indexes on tables with TTL in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I have a table with TTL and a secondary index, using YugabyteDB 2.9.0 and I’m getting the following error when I try to insert a row:
SyntaxException: Feature Not Supported
Below is my schema:
CREATE TABLE lists.list_table (
item_value text,
list_id uuid,
created_at timestamp,
updated_at timestamp,
is_deleted boolean,
valid_from timestamp,
valid_till timestamp,
metadata jsonb,
PRIMARY KEY ((item_value, list_id))
) WITH default_time_to_live = 0
AND transactions = {'enabled': 'true'};
CREATE INDEX list_created_at_idx ON lists.list_table (list_id, created_at)
WITH transactions = {'enabled': 'true'};
We have two types of queries (80% & 20% distribution):
select * from list_table where list_id= <id> and item_value = <value>
select * from list_table where list_id= <id> and created_at>= <created_at>
We expect per list_id there would be around 1000-10000 entries.
The TTL would be around 1 month.
It is a restriction, it’s currently not supported to transactionally expire rows using TTL out of a table which are indexed (i.e. atomic expiry of TTL entries in both table and index). There are several workarounds to this:
a) In YCQL, we also support an index with a weaker consistency. This is not well documented today, but you can see the details here: https://github.com/YugaByte/yugabyte-db/issues/1696
The main issue to call out when using this variant of index is that error handling (on INSERT failure), is that it is an application side responsibility to retry the INSERT on failure. As noted in the above issue << If an insert/update or batch of such operations fails, it is the app's responsibility to retry the operation so that the index is consistent. Much like in a 2-table case, it would have been the apps responsibility to retry (in case of a failure between the update to the two tables) to make sure both tables are in sync again. >>
This type of index supports a TTL at the table & index level. (which is recommended to keep the same): https://github.com/yugabyte/yugabyte-db/issues/2481#issuecomment-537177471
b)Another workaround is to use a background cleanup job to periodically delete stale records (instead of using TTL).
c)Avoid using indexes and store data in two tables. one organized by the original primary key and one organized by the index columns you wanted (as the primary key). Both tables can have TTL. But it is an application side responsibility to INSERT to both tables when data is added to the database.
The first table's PK would be ((list_id, item_value)), identical to the current main table. nstead of an index you'll have a second table; the second table's PK would be ((list_id), created_at) and both tables would have a TTL. The application must insert the data into both tables. In the 2nd table you have a choice:
(option 1) Duplicate all the columns from the main table here including your JSON columns etc. This makes Q2 lookup fast, the row has everything it needs; but increases your storage requirements.
(option 2): In addition to the PK, just store the item_value column in the second table. For Q2, you must first lookup the 2nd table and get the item_value, and then use list_id and item_value and retrieve the data from the main table (much like an index would do under the covers).
d)Another workaround, is if we could avoid the index and pick the PK to be ((list_id, item_value), created_at).
This would not affect the performance of Q1 because with (where list_id and item_value) provided it can use the PK to find the rows. But it would be slower for Q2 where list_id and created_at are provided because while it can still use list_id, it must filter out the data using the created_at value without the help of an index. So if Q2 is really 20% of your queries, you probably do not want to scan 1 to 10k items to find your matching row.
To clarify option (c), with the example in mind:
The first table's PK would be ((list_id, item_value)); it is the same as your current main table. Instead of an index you'll have a second table; the second table's PK would be ((list_id), created_at).
both tables would have a TTL
The application would have to insert entries into both tables.
In the 2nd table you have a choice:
(option 1) duplicate all the columns from the main table, including your JSON columns etc. This makes Q2 lookup fast, the row has everything it needs; but increases your storage requirements.
(option 2): in addition to the Primary Key, just store the item_value column in the second table. For Q2, you must first lookup the 2nd table and get the item_value, and then use list_id and item_value and retrieve the data from the main table (much like an index would do under the covers)

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples in the internet, I am still more and more confused how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primayr key, as this is my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
}
Now my goal is to have to different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = ‘xxxx-yyyyy-zzzz’
3) Select by a date limit
SELECT * FROM documents
WHERE created >= ‘2015-06-05’
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz' Of course, uid must be unique. You might want to consider the uuid data type (instead of text)
Second one is more delicate. If you set your partition to the full date, you won't be able to do a range query, as range query is only available on the clustering column. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB,
otherwise you will run into trouble)
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid);
This works fine if within a day, you don't have too many documents (so your partition don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date. And use documents table to retrieve the data, given the uid you retreived from documents_by_date.
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the stratio lucene cassandra plugin, and index your date.
Question does not specify how your data is going to be with respect user and create time. But since its a document, I am assuming that one user will be creating one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data order by created for a given user.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
Use of Allow Filtering should be used diligently as it scans all partitions. If we have to create a separate table with date as primary key, it becomes tricky if there are many documents being inserted at very same second. Clustering order works best for the requirements where documents for a given user need to be sorted by time.

Filter on the partition and the clustering key with an additional criteria

I want to filter on a table that has a partition and a clustering key with another criteria on a regular column. I got the following warning.
InvalidQueryException: Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability,
use ALLOW FILTERING
I understand the problem if the partition and the clustering key are not used. In my case, is it a relevant error or can I ignore it?
Here is an example of the table and query.
CREATE TABLE mytable(
name text,
id uuid,
deleted boolean
PRIMARY KEY((name),id)
)
SELECT id FROM mytable WHERE name='myname' AND id='myid' AND deleted=false;
In Cassandra you can't filter data with non-primary key column unless you create index in it.
Cassandra 3.0 or up it is allowed to filter data with non primary key but in unpredictable performance
Cassandra 3.0 or up, If you provide all the primary key (as your given query) then you can use the query with ALLOW FILTERING, ignoring the warning
Otherwise filter from the client side or remove the field deleted and create another table :
Instead of updating the field to deleted true move your data to another table let's say mytable_deleted
CREATE TABLE mytable_deleted (
name text,
id uuid
PRIMARY KEY (name, id)
);
Now if you only have the non deleted data on mytable and deleted data on mytable_deleted table
or
Create index on it :
The column deleted is a low cardinality column. So remember
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines are added to the cluster. You can avoid a performance hit when looking for a row in a large partition by narrowing the search.
Read More : When not to use an index

Resources