How to perform operations inside Cassandra Trigger? - cassandra

My application collects per-second data from devices and inserts it into a Cassandra table. My idea is to write a trigger on the per-second data table that automatically rolls the per-second data up into hourly and daily data. I'll also store the hourly and daily data in the same table under a different key. To achieve this, I need to perform the following operations inside my trigger code:
How can I insert data into the same table, which will invoke the trigger again? (This will be used to convert per-hour data to per-day data.)
How can I insert data into a different table? (To store accumulated data in a temp table.)
How can I select data from a different table? (To fetch the last data for accumulation.)
If I know the above, my application will just insert per-second data and the rest (the per-second to hourly to daily conversion) will be taken care of automatically by my trigger code.
Can you please help me with the above?
It would be great if you could give a code snippet for the same.

Unless you're comfortable with Cassandra internals, you should do this in a data abstraction layer instead of a trigger.
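As a rough illustration of what doing the rollup in a data abstraction layer could look like, here is a minimal in-memory sketch. The class name RollupWriter, the use of a sum as the aggregate, and the in-memory dictionaries standing in for Cassandra tables are all assumptions made for illustration; a real layer would issue the corresponding writes through a Cassandra driver.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def day_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its day."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


class RollupWriter:
    """Hypothetical data-access layer: every per-second insert also
    updates the hourly and daily aggregates, so no trigger is needed.
    The containers below stand in for Cassandra tables."""

    def __init__(self):
        self.per_second = []              # raw (device_id, ts, value) rows
        self.hourly = defaultdict(float)  # (device_id, hour) -> running sum
        self.daily = defaultdict(float)   # (device_id, day) -> running sum

    def insert(self, device_id: str, ts: datetime, value: float) -> None:
        # Store the raw reading, then fold it into both rollups.
        self.per_second.append((device_id, ts, value))
        self.hourly[(device_id, hour_bucket(ts))] += value
        self.daily[(device_id, day_bucket(ts))] += value


writer = RollupWriter()
writer.insert("dev-1", datetime(2020, 3, 7, 10, 15, 30, tzinfo=timezone.utc), 2.0)
```

The application only ever calls insert; the hourly and daily keys are derived from the timestamp, which mirrors the "same data, different key" idea from the question.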

Related

How to import all the operations done on an Oracle table into Hive for processing? (Not just the actual data in the table, but the operations too)

I have a table in an Oracle database that receives a large number of transactions (say around 100 million inserts, updates, or deletes a day). I want to bring every transaction that happens on that table into Hive for processing through Spark or Hive.
For example:
Say a record in that Oracle table goes through an initial insert followed by 5 updates to the same or different columns and is finally deleted. I want to capture all such operations for all the records in that table and import them into Hive.
We want to find records whose number of operations on specific columns exceeds a threshold and pull a report on them.
Has anyone come across such a use case? I'd appreciate any help in achieving this.

Best practices to store every minute and select from database latest 24h data only?

The task is to permanently record new data to a database every minute and then, occasionally, to read only the latest 24h of data, using Python.
The only approach I know:
create a script A that will be inserting into a MariaDB database table, one new line per minute, with a timestamp as a field value
create a script B that will be reading from the database table, using WHERE and timestamp values
The problem is, there are 2 restrictions:
it is not allowed to have more than 10,000 lines in one database table
it is not allowed to delete any lines
How to fulfill the task and meet both restrictions? Are there best practices?
Thanks!
You can create a new table every X days, whenever the current one is full. Name each table after the first timestamp value it contains.
With this solution you need to write your B script this way:
List all tables
Find the tables you are looking for
Write your SQL query on all these tables using UNION ALL
You can do it in a single SQL query for optimisation or in a script using multiple queries for simplicity.
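The steps above can be sketched in Python as follows. The table prefix samples_, the column names ts and value, the integer timestamps, and the %s placeholder style are all assumptions; adjust them to your schema and driver. At one row per minute a 10,000-row table covers roughly 6.9 days, so a 24h window normally touches at most two tables.

```python
from bisect import bisect_right


def tables_for_window(first_timestamps, window_start):
    """Pick the tables that can hold rows with ts >= window_start.

    first_timestamps is a sorted list of the first timestamp stored in
    each table (the value each table is named after).
    """
    # The table containing window_start is the last one whose first
    # timestamp is <= window_start; every later table is also needed.
    idx = bisect_right(first_timestamps, window_start) - 1
    return first_timestamps[max(idx, 0):]


def build_query(first_timestamps, window_start, prefix="samples_"):
    """Build one UNION ALL query over the selected tables."""
    selected = tables_for_window(first_timestamps, window_start)
    if not selected:
        raise ValueError("no tables to query")
    # Table names are built from integers only, never from user input.
    parts = [
        f"SELECT ts, value FROM {prefix}{first} WHERE ts >= %s"
        for first in selected
    ]
    sql = " UNION ALL ".join(parts) + " ORDER BY ts"
    params = [window_start] * len(selected)
    return sql, params


sql, params = build_query([1000, 2000, 3000], 2500)
```

The returned sql and params would then be passed to the cursor's execute method of whichever MariaDB connector script B uses.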

Incremental load without date or primary key column using azure data factory

I have a source, say a SQL database or an Oracle database, and I want to pull the table data into an Azure SQL database. The problem is that the table has neither a date column recording when rows were inserted nor a primary key column. Is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; then you can use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, HASHBYTES, etc.). On each load you store the comparison output in partition metadata somewhere so that you can compare against it the next time you load. That way you reload only the partitions that have changed since your last load.
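The comparison logic itself is independent of Azure Data Factory and can be sketched in plain Python. This is only an illustration of the idea, not a mapping data flow: the function names, the row-as-dict representation, and the choice of SHA-256 over a canonical row string are assumptions.

```python
import hashlib
from collections import defaultdict


def partition_fingerprints(rows, partition_col):
    """Return {partition_value: (row_count, sha256_hex)} for the rows.

    rows is an iterable of dicts; partition_col is the fairly stable
    column used to split the table.
    """
    groups = defaultdict(list)
    for row in rows:
        # Canonical text form of a row: columns in sorted order.
        canonical = "|".join(f"{col}={row[col]!r}" for col in sorted(row))
        groups[row[partition_col]].append(canonical)
    fingerprints = {}
    for key, lines in groups.items():
        # Sort the rows so the hash does not depend on read order.
        payload = "\n".join(sorted(lines)).encode("utf-8")
        fingerprints[key] = (len(lines), hashlib.sha256(payload).hexdigest())
    return fingerprints


def changed_partitions(previous, current):
    """Partitions that are new, modified, or gone since the last load."""
    changed = {key for key, fp in current.items() if previous.get(key) != fp}
    removed = set(previous) - set(current)
    return changed | removed
```

The dictionary returned by partition_fingerprints plays the role of the stored partition metadata: persist it after each load and pass it as previous on the next run.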

Best Cassandra data model for maintaining bounded lists per user

I have Kafka streams containing interactions of users with a website, so every event has a timestamp and information about the event. For each user I want to store the last K events in Cassandra (e.g. 100 events).
Our website constantly experiences bot and heavy-user traffic, which is why we want to cap events and consider only "normal" users.
I currently have the following data model in Cassandra:
user_id, event_type, timestamp, event_blob
where
<user_id, event_type> = partition key, timestamp = clustering key
For now we write a new record in Cassandra as soon as a new event happens and later on we go and clean up "heavier" partitions (i.e. count of events > 100).
This doesn't happen in real time, and until we clean up the heavy partitions we sometimes get bad read latencies.
Do you have any suggestions of a better table design for such case?
Is there a way to tell Cassandra to store only at most K elements for partition and expire the old ones in a FIFO way? Or is there a better table design that I can opt for?
Do you have any suggestions of a better table design for such case?
When data modeling for scenarios like this, I recommend a pattern that makes use of three things:
Default TTL set on the table.
Clustering on a time component in descending order.
Adjust query to use a range on the timestamp, never querying data past the TTL.
TTL:
later on we go and clean up "heavier" partitions
How long (on average) is it before the cleanup happens? One thing I would do is set a TTL on that table to somewhere around the maximum amount of time that usually passes before your team has to clean them up.
Clustering Key, Descending Order:
So your PRIMARY KEY definition looks like this:
PRIMARY KEY ((user_id,event_type),timestamp)
Make sure that you're also clustering in a descending order on timestamp.
WITH CLUSTERING ORDER BY (timestamp DESC)
This is important to use in conjunction with your TTL. Here, your tombstones are on the "bottom" of the partition (when sorting on timestamp descending) and the recent data (the data you care about) is at the "top" of the partition.
Range Query:
Finally, make sure your query has a range component on the timestamp.
For example: if today is the 11th, and my TTL is 5 days, I can then query the last 4 days of data without pulling back tombstones:
SELECT * FROM events
WHERE user_id = 11111 AND event_type = 'B'
AND timestamp > '2020-03-07 00:00:00';
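The lower bound for that range can be derived from the TTL rather than hard-coded. A minimal sketch, assuming a one-day safety margin inside the TTL (the function name and the margin are illustrative choices, not part of the original answer):

```python
from datetime import datetime, timedelta, timezone


def query_lower_bound(now, ttl, safety_margin=timedelta(days=1)):
    """Earliest timestamp to query so the range stays inside the TTL.

    Rows older than the TTL have expired into tombstones, so the range
    deliberately stops safety_margin short of the TTL boundary.
    """
    if safety_margin >= ttl:
        raise ValueError("safety_margin must be smaller than ttl")
    return now - (ttl - safety_margin)


# Today is the 11th and the TTL is 5 days: query the last 4 days.
now = datetime(2020, 3, 11, tzinfo=timezone.utc)
bound = query_lower_bound(now, timedelta(days=5))
cql_literal = bound.strftime("%Y-%m-%d %H:%M:%S")
```

With these inputs cql_literal is the same '2020-03-07 00:00:00' value used in the query above.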
The problem with your existing implementation is that deletes create tombstones, which eventually cause read latencies. Creating too many tombstones is not recommended.
A FIFO implementation based on count (number of rows per partition) is not possible. The better approach for your use case is not to delete records in the same table. Use Spark to migrate the table into a new temp table and remove the extra records in the migration process. Something like:
1) Create a new table.
2) Using Spark, read from the original table, migrate all required records (filtering out the extra records), and write them to the new temp table.
3) Truncate the original table. Note that the truncate operation does not create tombstones.
4) Migrate everything from the temp table back to the original table using Spark.
5) Truncate the temp table.
You can do this in a maintenance window of your application (something like once a month); until then you can restrict reads with LIMIT 100 per partition.
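The filtering in step 2 amounts to keeping the K most recent rows per partition. Here is a plain-Python stand-in for that logic (not Spark code); the function name and the tuple layout (user_id, event_type, timestamp, event_blob) are assumptions based on the data model in the question.

```python
import heapq
from collections import defaultdict


def keep_latest_k(rows, k):
    """Keep only the k most recent rows of each partition.

    rows is an iterable of (user_id, event_type, timestamp, event_blob)
    tuples; a partition is identified by (user_id, event_type).
    """
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[(row[0], row[1])].append(row)
    kept = []
    for partition_rows in by_partition.values():
        # nlargest on the timestamp returns the newest k rows.
        kept.extend(heapq.nlargest(k, partition_rows, key=lambda r: r[2]))
    return kept


events = [("u1", "B", ts, f"event-{ts}") for ts in range(5)]
latest = keep_latest_k(events, 2)  # keeps the rows with ts 4 and 3
```

In an actual Spark job the same result is usually obtained with a window function that ranks rows by timestamp within each partition key and filters on the rank.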

Cassandra - Data Modeling Time Series - Avoiding "Hot Spots"?

I'm working on a Cassandra data model to store records uploaded by users.
The potential problem is that some users may upload 50-100k rows in a 5-minute period, which can result in a "hot spot" for the partition key (user_id). (The DataStax recommendation is to rethink the data model if there are more than 10k rows per partition.)
How can I avoid having too many records on a partition key in a short amount of time?
I've tried using the Time Series suggestions from Datastax, but even if I had year, month, day, hour columns, a hot spot may still occur.
CREATE TABLE uploads (
user_id text
,rec_id timeuuid
,rec_key text
,rec_value text
,PRIMARY KEY (user_id, rec_id)
);
The use cases are:
Get all upload records by user_id
Search for upload records by date range
A few possible ideas:
Use a compound partition key instead of just user_id. The second part of the partition key could be a random number from 1 to n. For example if n were 5, then your uploads would be spread out over five partitions per user instead of just one. The downside is when you do reads, you have to repeat them n times to read all the partitions.
Have a separate table to handle incoming uploads using the rec_id as the partition key. This would spread the load of uploads equally across all the available nodes. Then, to get that data into the table with user_id as the partition key, periodically run a Spark job to extract new uploads and add them to the user_id-based table at a rate that the single partitions can handle.
Modify your front end to throttle the rate at which an individual user can upload records. If only a few users are uploading at a high enough rate to cause a problem, it may be easier to limit them rather than modify your whole architecture.
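The first idea, a compound partition key with a random bucket, can be sketched like this. The names write_key and read_keys and the default of 5 buckets are illustrative; in the table itself this would mean adding a bucket column and using PRIMARY KEY ((user_id, bucket), rec_id).

```python
import random

N_BUCKETS = 5  # assumed value of n; tune to the expected upload rate


def write_key(user_id, n_buckets=N_BUCKETS, rng=random):
    """Partition key for a new upload: (user_id, random bucket in 1..n)."""
    return (user_id, rng.randint(1, n_buckets))


def read_keys(user_id, n_buckets=N_BUCKETS):
    """All partition keys that must be read to see every upload of a user."""
    return [(user_id, bucket) for bucket in range(1, n_buckets + 1)]


# A write lands in one of n partitions; a read fans out over all n.
key_for_write = write_key("alice")
keys_for_read = read_keys("alice")
```

The read side issues one query per key in keys_for_read and merges the results, which is the n-fold read cost mentioned above.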
