I want to fetch last n, say last 5 updated rows i.e. order by updated_time desc in cassandra. Is there any good way of doing it?
Exact use case is like, I want to update the count of event whenever it occurs in the event table and fetch the last five events by updated time along with the count.
table structure:-
event_name text, updated_time timestamp, count counter
In Cassandra you can retrieve the editing time with writetime (cell_name). But as you have multiple columns and Cassandra is fast-reads only you may consider doing another view providing exactly the data needed in an ordered manner. On that new table you want to limit read results and periodically trim it down.
It may be possible doing it with writetime() -- but this was not the Cassandra way as it is too slow in production. Another table with just your data is the denormalized Cassandra way of solving it.
Related
I have Kafka streams containing interactions of users with a website, so every event has a timestamp and information about the event. For each user I want to store the last K events in Cassandra (e.g. 100 events).
Our website is constantly experiencing bot / heavy users that is why we want to cap events, just to consider "normal" users.
I currently have the current data model in Cassandra:
user_id, event_type, timestamp, event_blob
where
<user_id, event_type> = partition key, timestamp = clustering key
For now we write a new record in Cassandra as soon as a new event happens and later on we go and clean up "heavier" partitions (i.e. count of events > 100).
This doesn't happen in real time and until we don't clean up the heavy partitions we sometimes get bad latencies when reading.
Do you have any suggestions of a better table design for such case?
Is there a way to tell Cassandra to store only at most K elements for partition and expire the old ones in a FIFO way? Or is there a better table design that I can opt for?
Do you have any suggestions of a better table design for such case?
When data modeling for scenarios like this, I recommend a pattern that makes use of three things:
Default TTL set on the table.
Clustering on a time component in descending order.
Adjust query to use a range on the timestamp, never querying data past the TTL.
TTL:
later on we go and clean up "heavier" partitions
How long (on average) before the cleanup happens? One thing I would do, is to use a TTL on that table set to somewhere around the maximum amount of time before your team usually has to clean them up.
Clustering Key, Descending Order:
So your PRIMARY KEY definition looks like this:
PRIMARY KEY ((user_id,event_type),timestamp)
Make sure that you're also clustering in a descending order on timestamp.
WITH CLUSTERING ORDER BY (timestamp DESC)
This is important to use in conjunction with your TTL. Here, your tombstones are on the "bottom" of the partition (when sorting on timestamp descinding) and the recent data (the data you care about) is at the "top" of the partition.
Range Query:
Finally, make sure your query has a range component on the timestamp.
For example: if today is the 11th, and my TTL is 5 days, I can then query the last 4 days of data without pulling back tombstones:
SELECT * FROM events
WHERE user_id = 11111 AND event_type = 'B'
AND timestamp > '2020-03-07 00:00:00';
Problem with your existing implementation is that deletes create tombstones which eventually cause latencies in the read. Creating too many tombstones is not recommended.
FIFO implementation based on count (number of rows per partition) is not possible. The better approach for your use case is not to delete records in the same table. Use Spark to migrate the table into a new temp table and remove the extra records in the migration process. Something like:
1) Create a new table
2) Using Spark , read from the orignal table , migrate all required records (filter extra records) and write to new temp table.
3) Truncate the orignal table. Note that truncate operation do not create Tombstones.
4) Migrate everything from the temp table back to orignal table using Spark.
5) Truncate the temp table.
You can do this in maintenance window of your application ( something like once in a month) until then you can restrict reads with Limit 100 per partition.
We store events in multiple tables depending on category.
Each event have an id but contains multiple subelements.
We have a lookup table to find events using the subelement_id.
Each subelement can participate at max in 7 events.
Hence the partition will hold max 7 rows.
We will have 30-50 BILLIONS of rows in eventlookup over a period of 5 years.
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
Problem: How do we delete old data once we reach the 5 (or some other number) year mark.
We want to purge the "tail" at some specific intervals, say every week or month.
Approaches investigated so far:
TTL of X years (performs well, but TTL needs to be known before hand, 8 extra bytes for each column)
NO delete - simply ignore the problem (somebody else's problem :0)
Rate limited single row delete (do complete table scan and potentially billions of delete statements)
Split the table to multiple tables -> "CREATE TABLE eventlookupYYYY". Once a year is not needed, simply drop it. (Problem is every read should potentially query all tables)
Is there any other approaches we can consider ?
Is there a design decision we can make now ( we are not in production yet) that will mitigate the future problem?
If it's worth the extra space, track for ranges of recordtimes your subelement_id in a seperate table / columnfamiliy.
Then you can easily get the ids to delete for records having a specific age if you do not want to set a ttl a priori.
But keep in mind to make this tracking distribute well, just a single date will generate hotspots in your cluster and very wide rows, so think about some partition key like (date,chunk) where I uses a random number from 0-10 in the past for chunk. Also you might look at TimeWindowCompactionStrategy - here is a blog post about it: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Your partition key is only set to subelement_id, so all tuples of 7 events for all recordtimes will be in one partition.
Given your table structure, you need to know all the subelement_id of all your data just to fetch a single row. So, with this assumption, your table structure can be improved a bit by sorting your data by recordtime DESC:
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
eventtype int,
parentid text,
partition bigint,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
WITH CLUSTERING ORDER BY (recordtime DESC);
Now all of your data is in descending order and this will give you a big advantage.
Suppose that you have multiple years of data (eg from 2000 to 2018). Assuming you need to keep only the last 5 years, you'd need to fetch data by something like:
SELECT * FROM eventlookup WHERE subelement_id = 'mysub_id' AND recordtime >= '2013-01-01';
This query is efficient because C* will retrieve your data and will stop scanning the partition exactly where you wanted to: 5 years ago. The big plus is that if you have tombstones after that point, well, they won't impact your reads at all. That means you can "safely" trim after that point safely by issuing a delete with
WHERE subelement_id = 'mysub_id' AND recordtime < '2013-01-01';
Beware that this delete will create tombstones that will be skipped by your reads, BUT they will be read during compactions, so keep it in mind.
Alternatively, you can simply skip the delete part if you don't need to reclaim your storage space, your system will always run smooth because you will always retrieve your data efficiently.
I am struggling with data order of Cassandra data. I have a table like this
tbl_data
- yymmddhh (text)
- data (text)
parting key is 'yymmddhh'
I am adding data like this
'16-11-17-01', 'a'
'16-11-17-01', 'b'
'16-11-17-02', 'c'
'16-11-17-03', 'xyz'
'16-11-17-03', 'e'
'16-11-17-03', 'f'
select * from tbl_data limit 10;
I am expecting data in the order in which I added data. But it is giving data like this
'16-11-17-03', 'f'
'16-11-17-03', 'e'
'16-11-17-01', 'a'
i.e. latest record first or some random order. I need data in the same order in which I added. I am not able to figure out the default order of the data in my case. Also I don't want to pass partition key in where condition because its overhead to remember that value for me. Kindly suggest me the solution.
I'm afraid you will struggle forever on this.
As per comments, you can't decide the order "outside" a partition, unless you really understand what you're doing by changing the partitioner.
Please have a read at the suggested link, and at this and this SO answers to understand why you are getting your records in this specific order (yes, they ARE ordered...).
A possible solution, however, is to add a timestamp clustering key, and change the partition key to a simpler "yymmdd":
tbl_data
- yymmdd (timestamp)
- hhmmssMMM (timestamp)
- data (text)
Now you'd store data on day by day basis (that is you need to know the day you are querying data for), and the order of your data inside each partition (that is each day) is sorted by the timestamp column, so for your requirements you'd store there the insertion time of the record.
Now, if you don't insert data every day, you really need to keep track the insertion dates into another (very simple) table:
CREATE TABLE inserted_days (
yymmdd timestamp PRIMARY KEY
);
Issuing a
SELECT * FROM inserted_days
would scan all this partition, returning records in random order (from you app point of view, so you need to sort it), but here we are talking of 365 records in year, something you don't need to worry about. It's easy to do and you'd not incur into unmanageable overheads.
HTH.
I have a log table in cassandra, and now I want to search the rows count of the table.
First, I use the select count(*) from log,but it's very, very slow.
Then I want to use the counter type, and then the problem is coming. My table is a TTL table, all rows keep an hour, use the counter type become very difficult.
Cassandra isn't efficient for doing table scan operations. It is good at ingesting high volumes of data and then accessing small slices of that data rather than the whole table.
So if you want to count keys without using a counter, you need to break the table into chunks of data that are small enough to be processed quickly. For example if you want to use count(*), you should only use it on a single partition, and keep the partition size below about 100,000 rows.
In your case you might want to partition your data by hour (or something small like 5 minute intervals if you insert a lot of log lines per second).
Be careful with using a TTL of an hour if you are inserting a lot of data continuously since it could cause a lot of tombstones. To avoid building up tombstones you should delete each hour partition after the hour has passed.
Im planning a few ETLs that eventually will "fill" the same row in Cassandra, i.e. if a table is defined as:
CREATE TABLE MyTable (
key text,
column1 text,
column2 text,
column3 text,
column4 text,
PRIMARY KEY (key)
)
Than few ETLs will fill out the appropriate values in the columns 1-4 in a different times.
How well cassandra handles such operations? Should I read the row first, update in code and then write back or would an UPDATE call will do the trick?
I know that Cassandra is highly optimized for write throughput in that it never modifies data on disk, it only appends to existing files or creates new ones. knowing that, and without diving deeper into the implementation, It worries me that if an ETL will write column4 and 20 minutes later a different ETL will write column2, I will lose a lot of performance comparing to waiting for all the ETLs to finish and than save all the data in a bulk (which is not an easy implementation by itself).
Ideas?
All inserts/updates in Cassandra are Upserts, and Cassandra uses last-write-wins for conflict resolution. If your ETLs are updating different columns, there would be no issue. If they update the same column, the last update for a column will win. If this is an issue, you can add a timestamp column as a clustering key (allowing multiple values of the data), and during read, read the latest one. You could also add a TTL so older irrelevant values get cleared out.
If certain columns are updated, and others aren't, you'll effectively get null for those columns when querying.
I couldn't really understand your last paragraph. Could you please explain your concern?