Selecting records in Cassandra based on time range at frequent intervals

I have a table in Cassandra where I store events as they arrive; different processing is done on the events at different stages. Events are inserted into the table with their occurrence time. I need to get all events whose event time is less than a certain time and do some processing on them. Since this is a select range query, it will invariably use scatter-gather. Can someone suggest the best way to do this? The process runs every 5 seconds, and frequent scatter-gather in Cassandra is not a good idea, as the overhead on Cassandra itself will degrade my overall application performance.
The table is as below:
PAS_REQ_STAGE (partition key = (EndpointID, category); clustering key = (Automation_flag, AlertID))
AlertID
BatchPickTime: Timestamp
Automation_Threshold
ResourceID
ConditionID
category
Automation_time: Timestamp
Automation_flag
FilterValue
The event time I referred to above is BatchPickTime.
A scheduler wakes up at a regular interval, gets all records whose BatchPickTime is less than the current scheduler wake-up time, and sweeps them off the table to process them.
Because of this use case I cannot provide any specific partition key for the query, as it has to fetch all the data that has expired, i.e. everything older than the current scheduler wake-up time.
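In other words, the query the scheduler needs is essentially an unrestricted range scan, presumably something like the following sketch, which has no partition key and therefore needs ALLOW FILTERING and touches every node:
SELECT * FROM PAS_REQ_STAGE
WHERE BatchPickTime < '2017-01-01 10:00:05'  -- current wake-up time (placeholder value)
ALLOW FILTERING;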

Hi and welcome to Stack Overflow.
Please post your schema and maybe some example code with your question - you can edit it :)
The Cassandra way of doing this is to denormalize data if necessary and build your schema around your queries. In your case I would suggest putting your events into a table together with a time bucket:
CREATE TABLE events (
    event_source int,
    bucket timestamp,
    event_time timestamp,
    event_text text,
    PRIMARY KEY ((event_source, bucket), event_time));
The reason for this is that it is very efficient in Cassandra to select a row by its so-called partition key (in this example (event_source, bucket)), as such a query hits only one node. The remainder of the primary key is called the clustering columns and defines the order of the data; here all events for a day inside the bucket are sorted by event_time.
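For example, reading one source and one (daily) bucket then stays on a single partition; the literal values here are just placeholders:
SELECT event_time, event_text
FROM events
WHERE event_source = 1
  AND bucket = '2017-01-01'       -- the day bucket
  AND event_time < '2017-01-01 10:00:05';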
Try to model your event table in a way that you do not need to make multiple queries. There is a good and free data modeling course from DataStax available: https://academy.datastax.com/resources/ds220-data-modeling
One note - be careful when using Cassandra as a queue - this may be an antipattern, and you might be better off with a message queue such as ActiveMQ or RabbitMQ.

Related

Reduce cassandra tombstones

I have a table to store messages that failed to process, and I am retrying those messages every 5 minutes through a scheduler.
When a message is processed successfully, the respective row is deleted from the table, so that the same message does not get processed again.
The query to fetch rows from the table is SELECT * FROM <table_name>, due to which we are facing tombstone issues when a large number of rows gets deleted.
The table has a timestamp as the partition key and message_name (TEXT) as the clustering key, a TTL of 7 days, and gc_grace_seconds of 2 days.
As per my requirement, I need to delete records, otherwise duplicate records will get processed. Is there any solution to avoid the tombstone issues?
So I see two problems here:
1. Cassandra is being used as a queuing mechanism, which is an established anti-pattern.
2. All partitions are being queried with SELECT * FROM <table_name>, because there isn't a WHERE clause.
So with Cassandra, some data models and use cases will generate tombstones. At that point, there's not a whole lot to be done, except to design the data model so as to not query them.
So my thought here, would be to partition the table differently.
CREATE TABLE messages (
day TEXT,
message_time TIMESTAMP,
message_text TEXT,
PRIMARY KEY ((day),message_time))
WITH CLUSTERING ORDER BY (message_time DESC);
With this model, you can query all messages for a particular day. You can also run a range query on day and message_time. Ex:
SELECT * FROM messages
WHERE day='20210827'
AND message_time > '2021-08-27 04:00';
This will build a result set of all messages since 2021-08-27 04:00. Any tombstones generated outside of the requested time range (in this case, before 04:00) will not be queried.
Note that (based on the delete pattern) you could still have tombstones within the given time range. But the idea here, is that the WHERE clause limits the "blast radius," so querying a smaller number of tombstones shouldn't be a problem.
Unfortunately, there isn't a quick fix to your problem.
The challenge is that you're using Cassandra as a queue, which isn't a good idea because you run exactly into that tombstone hell. I'm sure you've seen by now the blog post that talks about queues and queue-like datasets being an anti-pattern for Cassandra.
It is possible to avoid generating lots of tombstones if you model your data differently, in buckets, with each bucket mapping to its own table. When you're done processing all the items in a bucket, TRUNCATE the table. This idea comes from Ryan Svihla's blog post Understanding Deletes, where he goes through the idea of "partitioning tables".
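A minimal sketch of that idea, assuming two rotating bucket tables (the table names are made up):
-- writers fill messages_a while the scheduler drains messages_b, then the roles swap
CREATE TABLE messages_a (
    day TEXT,
    message_time TIMESTAMP,
    message_text TEXT,
    PRIMARY KEY ((day), message_time));
-- once a bucket is fully processed, reclaim it without writing a single tombstone
TRUNCATE messages_a;
Cheers!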

Prevent race condition while writing to Cassandra

I have a realtime streaming solution with Kafka, Spark (as the aggregation engine) and Cassandra (as the store). User defines the aggregates that are needed and the engine creates the aggregate and writes them to the store. Here is an example of how the aggregates are created
CREATE AGGR COUNT FROM input_data WHERE type,event,id
This creates a count aggregate for the 3 columns and writes to C*.
We have a requirement to process historical data as well. That means if an aggregate was created today, we need to go back and fix history for it. To cater to this use case, we have created an hvalue column in Cassandra. Here is the schema for reference:
CREATE TABLE tbl (
    key blob,
    key2 blob,
    key3 blob,
    ...
    key15 blob,
    column1 blob,
    column2 blob,
    ...
    column20 blob,
    hvalue blob,
    value blob,
    PRIMARY KEY ((key, key2, key3 ... key15), column1 ... column20)
) WITH CLUSTERING ORDER BY (column1 ASC, column2 ASC ... column20 ASC)
value stores the facts computed during online processing; hvalue stores the value for historical processing. While querying, both columns are retrieved, merged, and returned to the user.
We are using the DataStax Spark Cassandra Connector's leftJoinWithCassandraTable API to join with Cassandra.
RDD.leftJoinWithCassandraTable(keyspace, tableName)
  .on(SomeColumns(...))
  .map { case (ip, row) => row match {
      case None       => ip
      case Some(data) => CASSANDRA_MAP_SCHEMA(...)
    }
  }
  .saveToCassandra(keyspace, tableName)
In short, we create a schema for the RDD, and write the row to Cassandra.
Now, here is the problem. During the historical process, we need to create a row to write to Cassandra, which means we need to provide some data for the "value" column. If it is a new row that is not present in Cassandra, we create a null object and write it back. If the row is present, we take the existing value and write it back.
The online and historical processes will run at the same time. This means that when the historical process reads a row and writes it back, the online process may have created the same row in the meantime. This will result in corrupt data, since the historical process may read stale data and overwrite the value that was written by the online process.
I am not sure how to resolve this problem. I'd appreciate any other solutions to prevent this.
I have tried to explain this as best I can; let me know if further clarification is needed and I'll try to add more input.
Thanks in advance for the help.
There are a few ways to work around this, but none are really simple. Fundamentally, write-after-write problems are hard.
The first is to introduce a shared external locking mechanism, where you obtain a lock for the row and either release it when done or give it a short TTL. You can use something like Redis for this.
A second option is to funnel all changes to Cassandra through a Kafka queue so that only one source is allowed to write. There is a chance, though, that this will make your problem worse. If you are going to do this, make sure you partition the queue by key so that the same key always routes to the same partition.
A third option is that the services are only allowed to operate on data for a given time range. If your online process is only allowed to work on data from the last day (or X hours, etc.) and your historical process only on data older than that, there is virtually no chance of running into conflicts.
The fourth option is to accept that it is a possibility, and that the chance of it happening is small enough not to be an issue. If the datacenter where your code runs is very close to (ideally colocated with) your DB, and you aren't doing significant processing on the row between read and write, this may be a reasonable option.

Cassandra data modeling for real time data

I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
account_id text,
value text,
PRIMARY KEY (account_id)
)
Data is sent every X seconds per account_id, so we overwrite the existing row every time we receive an event. This data contains current real-time information, and we only care about the most recent event (there is no use for older data, which is why we insert over an already existing key).
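For reference, the write in this model is a plain upsert - a new event for the same account simply replaces the previous one (values are placeholders):
INSERT INTO current_data (account_id, value)
VALUES ('acct-1', '42.7');   -- same primary key as before, so this overwrites the old row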
From the application user end - we query a select by account_id statement.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
Thought about something like this:
CREATE TABLE current_data_2 (
    account_id text,
    time timeuuid,
    value text,
    PRIMARY KEY (account_id, time)
) WITH CLUSTERING ORDER BY (time DESC)
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
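For example, an insert with a 5-minute TTL would look like this (values are placeholders):
INSERT INTO current_data_2 (account_id, time, value)
VALUES ('acct-1', now(), '42.7')
USING TTL 300;   -- the row expires 300 seconds after the write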
The question is HOW much better, if at all, the second data model is than the first one. From what I understand, the main advantage will be in the READS - since the data is ordered by time, all I need to do is a simple
SELECT * FROM current_data_2 WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL the rows that were overwritten for the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).
Thanks.
First of all, I encourage you to examine the official documentation about the read path.
data is ordered by time
This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).
Cassandra actually reads ALL rows that where overwritten the same key
and then chooses the last one by its write timestamp
This happens at the "Merge Cells by Timestamp" step in the documentation (again, check the flow diagram). Notice that in your first case, the number of rows in each SSTable will be one.
In both of your cases the main driving factor is how many SSTables you have to check during a read. That is somewhat independent of how many records each SSTable contains.
But in the second case you have much bigger SSTables, which leads to longer SSTable compaction. TTL expiration also performs additional writes. So the first case is somewhat preferable.

Cassandra data model for interval and event based time series

I have to collect time series data from various IoT sensors. Based on my research there are two different types of time series data streams.
Case 1 : Fixed interval
This type of data stream has a fixed interval, and it's very easy to select data points in a given range. A typical use case would be a counter.
Case 2 : Event based
This type of data stream arrives at irregular points in time and only occurs when something changes. Typical use cases would be a power switch, or a sensor going offline or online.
Requirements
Selecting all affected data points between a given time window
Data model
This is my Cassandra data model. Any point in the stream can be modeled by:
CREATE TABLE sensor_raw (
sensor_id text,
bucket_id date,
sensor_time timestamp,
sensor_value double,
PRIMARY KEY ((sensor_id, bucket_id), sensor_time )
) WITH CLUSTERING ORDER BY (sensor_time DESC);
Solution for case 1
This is very easy and needs no further discussion
SELECT * FROM sensor_raw where
sensor_id = '1' AND
bucket_id = '2017-01-01' AND
sensor_time >= '2017-01-01 10:00'
AND sensor_time < '2017-01-01 10:14'
Solution for case 2
Here I have the problem that events from outside the window can overlap into the selected range, for example E1.
Another problem is the last event, E3, which has not yet finished.
I need:
1. The partial duration from window start to E1. To get this info I would have to look back from the first event in the stream to get the previous one, then calculate the difference from the window start to E2 (see the query sketch after this list).
2. The duration from E2 to E3. This is easy.
3. The duration from E2 to the window end (not yet finished). I would have to check whether the last event has the same timestamp as the window end; if not, the last event is still running.
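The look-back for item 1 can at least be a single-partition query: because rows are clustered by sensor_time DESC, a LIMIT 1 returns the most recent event before the window start (literal values are placeholders):
SELECT sensor_time, sensor_value
FROM sensor_raw
WHERE sensor_id = '1'
  AND bucket_id = '2017-01-01'
  AND sensor_time < '2017-01-01 10:00'
LIMIT 1;   -- newest-first clustering makes this the event in force at window start
Note this only works if the previous event falls in the same bucket; otherwise you have to step back one bucket_id at a time.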
Result
Question
Is there a better data model for case 2?
Is there any way to avoid an additional query to get the solution I need?
I think you pretty much covered all the scenarios. One thing that could help is to create an events table where data with the "event" type and an end time would go. Something along the lines of:
CREATE TABLE sensor_raw_events (
    sensor_id text,
    bucket_id date,
    event_end_time timestamp,
    event_begin_time timestamp,
    event_type text,
    PRIMARY KEY ((sensor_id, bucket_id), event_end_time)
) WITH CLUSTERING ORDER BY (event_end_time DESC);
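With the end time as the clustering key, events that began before the window but ended inside it (like E1) are caught by a single range query, e.g. (values are placeholders):
SELECT event_begin_time, event_end_time, event_type
FROM sensor_raw_events
WHERE sensor_id = '1'
  AND bucket_id = '2017-01-01'
  AND event_end_time >= '2017-01-01 10:00';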
The prerequisite is that you actually have some layer that can track the event switching at the application level. A project I worked on had to keep sessions when connecting to devices due to protocol requirements, so this wasn't really a problem for us.
We basically had a small in-memory grid keeping the current state of every sensor, with periodic flushing to Cassandra - this was only for recovery should the whole application go down, which never happened.
This approach would probably require a lot of memory, so if you have millions of sensors it might get too expensive without adding much value; basically it all depends on the scale you actually have.
Also, one downside of the idea is that you wouldn't catch the event that is currently ongoing, because it hasn't been written to the table yet. But this would actually be OK for an analytical workload, because you wouldn't have to make an additional query to fetch the beginning of E1; it would already be there for you.
Some approaches with one table holding both begin_time and end_time might also be possible, but then again that just wastes space (and with sensors it gets packed pretty quickly).
Your model, as you described it, is very similar to things I have done before, and with Cassandra alone there simply isn't much more that I know of that you can do :(

Is it okay to directly read from Cassandra to surface information from a web application?

I'm using Cassandra as my primary data store for a time series logging application. I receive a high volume of writes to this database, so Cassandra was a natural choice.
However, when I try showing statistics about the data on a web application, I make costly reads to this database and things start to slow down.
My initial idea is to run periodic cron jobs that pre-compute these statistics every hour. This would ensure no slow reads. I'm wondering if there's another way to read from a Cassandra database and what is the best solution?
You are on the right track with your initial thinking.
How you store data in C*, and specifically how you select your primary key fields, has a direct influence on how you can read data out. If you are hitting a single partition of a table, reading data out of a C* cluster is very efficient, and it's an excellent choice for showing data on a website.
In your case, if you want to show some level of aggregated data (e.g. by hour), I would suggest creating your partition key in such a way that all the data you want to aggregate is contained in the same partition. Here is an example schema of what I mean:
CREATE TABLE data_by_hour (
day text,
hour int,
minute int,
data float,
PRIMARY KEY((day, hour), minute)
);
You can then use a cron job or some other mechanism to run a query and aggregate the data into another table to show on the website.
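As a sketch of that aggregation step (the rollup table and its names are hypothetical): within a single partition, CQL's built-in aggregates can do the work, and the cron job writes the result into a table keyed for the website's reads:
-- per-hour average, computed inside one partition of data_by_hour
SELECT avg(data) AS avg_data
FROM data_by_hour
WHERE day = '20210827' AND hour = 4;
-- hypothetical rollup table the cron job inserts that result into
CREATE TABLE data_hourly (
    day text,
    hour int,
    avg_data float,
    PRIMARY KEY ((day), hour)
);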
