Cassandra aggregation to second table - cassandra

I have a Cassandra table like:
CREATE TABLE sensor_data (
sensor VARCHAR,
timestamp timestamp,
value float,
PRIMARY KEY ((sensor), timestamp)
)
And an aggregation table:
CREATE TABLE sensor_data_aggregated (
sensor VARCHAR,
aggregation VARCHAR, /* hour or day */
timestamp timestamp,
min_timestamp timestamp,
min_value float,
max_timestamp timestamp,
max_value float,
avg_value float,
PRIMARY KEY ((sensor, aggregation), timestamp)
)
Is there any possibility of a trigger to fill the "sensor_data_aggregated" table automatically on insert, update, or delete of the "sensor_data" table?
My current solution would be to write a custom trigger with a second commit log, and an application that reads and truncates this log periodically to generate the aggregated data.
But I also found information that DataStax OpsCenter can do this, though no instructions on how. What would be the best solution for this?

You can implement your own C* trigger for that, which will execute additional queries for your aggregation table after each row insert into sensor_data.
Also, for maintaining min/max values, you can use CAS via C* lightweight transactions, like:
update sensor_data_aggregated
set min_value=123
where
sensor='foo'
and aggregation='bar'
and ts='2015-01-01 00:00:00'
if min_value>123;
using a slightly updated schema ('timestamp' is a reserved keyword in CQL3; you cannot use it unescaped):
CREATE TABLE sensor_data_aggregated (
sensor text,
aggregation text,
ts timestamp,
min_timestamp timestamp,
min_value float,
max_timestamp timestamp,
max_value float,
avg_value float,
PRIMARY KEY ((sensor, aggregation), ts)
)
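Whichever mechanism feeds the aggregation table (a trigger or a log-reading application), the per-bucket rollup itself is simple; a minimal Python sketch of the hour-bucket aggregation the question describes, with illustrative names and no driver code:

```python
from datetime import datetime

def aggregate_hour(readings):
    """readings: iterable of (timestamp, value) pairs for one sensor.
    Returns one aggregate row per hour bucket, mirroring the
    sensor_data_aggregated columns (min/max with timestamps, avg)."""
    buckets = {}
    for ts, value in readings:
        # Truncate the timestamp to hour resolution to get the bucket key.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(bucket, []).append((ts, value))
    rows = {}
    for bucket, points in buckets.items():
        min_ts, min_value = min(points, key=lambda p: p[1])
        max_ts, max_value = max(points, key=lambda p: p[1])
        rows[bucket] = {
            "min_timestamp": min_ts, "min_value": min_value,
            "max_timestamp": max_ts, "max_value": max_value,
            "avg_value": sum(v for _, v in points) / len(points),
        }
    return rows

rows = aggregate_hour([
    (datetime(2015, 1, 1, 0, 5), 1.0),
    (datetime(2015, 1, 1, 0, 20), 3.0),
])
print(rows[datetime(2015, 1, 1)]["avg_value"])  # 2.0
```

A day bucket works the same way, truncating hour as well; the resulting rows map directly onto the INSERT/UPDATE statements against sensor_data_aggregated.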

Related

Not able to run multiple where clause without Cassandra allow filtering

Hi, I am new to Cassandra.
We are working on an IoT project where car sensor data will be stored in Cassandra.
Here is an example of one table where I am going to store data from one of the sensors.
This is some sample data.
The way I want to partition the data is by organization_id, so that different organizations' data is kept in separate partitions.
Here is the create table command:
CREATE TABLE IF NOT EXISTS engine_speed (
id UUID,
engine_speed_rpm text,
position int,
vin_number text,
last_updated timestamp,
organization_id int,
odometer int,
PRIMARY KEY ((id, organization_id), vin_number)
);
This works fine. However, all my queries will be as below:
select * from engine_speed
where vin_number='xyz'
and organization_id = 1
and last_updated >='from time stamp' and last_updated <='to timestamp'
Almost all queries on all the tables will have a similar / same WHERE clause.
I am getting an error asking me to add "ALLOW FILTERING".
Kindly let me know how to partition the table and define the right primary key and indexes so that I don't have to add "allow filtering" to the query.
Apologies for this basic question, but I'm just starting with Cassandra (using Apache Cassandra 3.11.12).
The order of the WHERE clause should match the order of the partition and clustering keys you have defined in your DDL, and you cannot skip any part of the primary key in the WHERE clause before using the next key. So, per the query pattern you have defined, you can try the below DDL:
CREATE TABLE IF NOT EXISTS autonostix360.engine_speed (
vin_number text,
organization_id int,
last_updated timestamp,
id UUID,
engine_speed_rpm text,
position int,
odometer int,
PRIMARY KEY ((vin_number, organization_id), last_updated)
);
But remember:
PRIMARY KEY ((vin_number, organization_id), last_updated)
PRIMARY KEY ((vin_number), organization_id, last_updated)
The two above are different in Cassandra. In case 1, your data will be partitioned by the combination of vin_number and organization_id, while last_updated will act as the ordering key. In case 2, your data will be partitioned only by vin_number, while organization_id and last_updated will act as ordering keys. You need to figure out which case suits your use case.
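The reason clustering on last_updated removes the need for ALLOW FILTERING: within one (vin_number, organization_id) partition, rows are stored sorted by the clustering column, so a range predicate becomes a contiguous slice rather than a scan. A toy model of that lookup (illustrative values, not driver code):

```python
from bisect import bisect_left, bisect_right

def time_range(values, lo, hi):
    """Rows with lo <= last_updated <= hi, found by binary search on the
    sorted clustering values instead of scanning every row."""
    return values[bisect_left(values, lo):bisect_right(values, hi)]

# last_updated clustering values within one partition, already sorted.
last_updated = [10, 20, 30, 40, 50]
print(time_range(last_updated, 15, 45))  # [20, 30, 40]
```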

Cassandra: can you use a derived value from a column as part of the partition key?

Let's say I have a table schema that has a timestamp for the event:
CREATE TABLE event_bucket_1 (
event_source text,
event_year int,
event_month int,
event_id text,
event_time timestamp,
...
PRIMARY KEY ((event_source, event_year, event_month), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC)
My question is: can I skip adding the event_year and event_month columns and replace them with some kind of function like year(event_time) and month(event_time)? The thinking is that event_year and event_month both duplicate information from event_time.
No, it is not possible. But, from my understanding, you want to query based on year and month, right? You can accomplish this by replacing event_year and event_month with event_time in your compound key and using time ranges in the query:
SELECT * FROM event_bucket_1 where event_source='source' and event_time > '2018-06-01 00:00:00' and event_time < '2018-07-01 00:00:00';
No, the partition key needs to be provided explicitly and, AFAIK, can't be evaluated by a function at insert time.
You can try opening a ticket as an improvement for future versions at https://issues.apache.org/jira/secure/Dashboard.jspa
It seems like a good use case and would fit more scenarios.
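Since the server can't derive the bucket, the usual pattern is for the client to compute event_year and event_month from event_time before inserting, so the duplication is confined to one helper. A minimal sketch (names follow the schema above):

```python
from datetime import datetime, timezone

def bucket_for(event_time):
    """Derive the (event_year, event_month) partition columns from
    event_time client-side before binding the INSERT."""
    return event_time.year, event_time.month

print(bucket_for(datetime(2018, 6, 15, 12, 30, tzinfo=timezone.utc)))  # (2018, 6)
```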

Normalization in Cassandra for the specified use case?

I have a device table (say, 'device') which has static fields plus the current statistics, and another table (say, 'devicestat') which has the statistics of that device collected every minute, sorted by timestamp, like below.
Example :
CREATE TABLE device(
"partitionId" text,
"deviceId" text,
"name" text,
"totalMemoryInMB" bigint,
"totalCpu" int,
"currentUsedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"currentUsedCpu" int,
"ipAddress" text,
primary key ("partitionId","deviceId"));
CREATE TABLE devicestat(
"deviceId" text,
"timestamp" timestamp,
"totalMemoryInMB" bigint,
"totalCpu" int,
"usedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"usedCpu" int,
primary key ("deviceId","timestamp"));
where,
currentUsedMemoryInMB & currentUsedCpu => hold the most recent statistics
usedMemoryInMB & usedCpu => hold both recent and older statistics, keyed by timestamp.
Could somebody suggest the correct approach for the following setup?
Whenever I need static data with the most recent statistics, I read from the device table; whenever I need the history of a device's statistical data, I read from the devicestat table.
This looks fine to me, but the only problem is that I need to write the statistics to both tables. In the devicestat table it will be a new entry per timestamp, but in the device table we just update the statistics. What is your thought on this? Should this be maintained only in the single stat table, or is it fine to update the most recent stat in the device table too?
In Cassandra the common approach is to have a table (column family) per query, and denormalization is also good practice, so it's OK to keep two column families in this case.
Another way to get the latest stat from the devicestat table is to have the data sorted DESC by timestamp:
CREATE TABLE devicestat(
"deviceId" text,
"timestamp" timestamp,
"totalMemoryInMB" bigint,
"totalCpu" int,
"usedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"usedCpu" int,
primary key ("deviceId","timestamp"))
WITH CLUSTERING ORDER BY (timestamp DESC);
so you can query with LIMIT 1 when you know the deviceId:
select * from devicestat where deviceId = 'someId' limit 1;
But if you want to list the last stat of devices by partitionId, then your approach of updating the device table with the latest stat is correct.
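If you do update both tables on each write, a logged batch keeps the two statements atomic; a sketch against the schemas above (values are placeholders, and the quoted identifiers match the case-sensitive column names):

```cql
BEGIN BATCH
  UPDATE device
     SET "currentUsedMemoryInMB" = 2048, "currentUsedCpu" = 75
   WHERE "partitionId" = 'p1' AND "deviceId" = 'd1';
  INSERT INTO devicestat ("deviceId", "timestamp", "usedMemoryInMB", "usedCpu")
  VALUES ('d1', '2015-06-01 12:00:00', 2048, 75);
APPLY BATCH;
```

Note that logged batches across tables cost extra coordination, so this is a consistency convenience rather than a performance optimization.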

Cassandra - aggregate timestamp by hour

I have a table with timestamps at a 15-minute interval. Is it possible to aggregate or group by hour, with the load field being the average?
There's a post on materialized views. You can use one to create a copy of the data bucketed by hour, then use the avg aggregate function on load. I think CASSANDRA-11871 is what you're looking for, though: its dependencies (GROUP BY) have recently been completed, but the ticket itself hasn't been worked on yet.
Kinda just guessing at your schema, but something like this (disclaimer: not really tested):
CREATE TABLE load (
ref_equip text,
ptd_assoc text,
date timestamp,
date_hour bigint,
load float,
PRIMARY KEY ((ref_equip, ptd_assoc), date)
);
CREATE MATERIALIZED VIEW load_by_hour AS
SELECT * FROM load
WHERE ref_equip IS NOT NULL AND ptd_assoc IS NOT NULL
PRIMARY KEY ((ref_equip, ptd_assoc), date_hour, date);
where date_hour is just the timestamp at hour resolution, meaning the epoch value divided by 1000*60*60 (epoch is in ms) when doing the insert. You can then select the average:
SELECT avg(load) FROM load_by_hour WHERE ref_equip='blarg' AND ptd_assoc='blargy' AND date_hour = 410632;
Alternatively, something that may simply be better to begin with is to store your data partitioned by hour:
CREATE TABLE load (
ref_equip text,
ptd_assoc text,
date timestamp,
date_hour bigint,
load float,
PRIMARY KEY ((ref_equip, ptd_assoc, date_hour), date)
);
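The date_hour computation described above (epoch milliseconds floor-divided by the milliseconds in an hour) is done client-side at insert time; a minimal sketch, assuming timestamps are stored as epoch ms:

```python
from datetime import datetime, timezone

MS_PER_HOUR = 1000 * 60 * 60

def date_hour(ts):
    """Hour-resolution bucket: epoch milliseconds floor-divided by
    the number of milliseconds in an hour."""
    epoch_ms = int(ts.timestamp() * 1000)
    return epoch_ms // MS_PER_HOUR

# One day after the epoch falls in hour bucket 24.
print(date_hour(datetime(1970, 1, 2, tzinfo=timezone.utc)))  # 24
```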

Cassandra data model with obsolete data removal possibility

I'm new to Cassandra and would like to ask what the correct model design pattern would be for such a task.
I would like to model data with future removal possibility.
I have 100,000,000 records per day of this structure:
transaction_id <- this is unique
transaction_time
transaction_type
user_name
... some other information
I will need to fetch data by user_name (I have about 5,000,000 users).
Also I will need to find transaction details by its id.
All the data will be irrelevant after, say, about 30 days, so I need to find a way to delete outdated rows.
As far as I have found, TTLs expire column values, not rows.
So far I have come up with this model, and as I understand it, it will imply really wide rows:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY ((transaction_date, user_name), transaction_id)
);
CREATE INDEX idx_user_transactions_uname ON USER_TRANSACTIONS(user_name);
CREATE INDEX idx_user_transactions_tid ON USER_TRANSACTIONS(transaction_id);
but this model does not allow deletions by transaction_date.
It also builds indexes with high cardinality, which the Cassandra docs strongly discourage.
So what would be the correct model for this task?
EDIT:
The ugly workaround I have come up with so far is to create a single table per date partition. Mind you, I call this a workaround and not a solution; I'm still looking for the right data model.
CREATE TABLE user_transactions_YYYYMMDD (
user_name text,
transaction_id text,
transaction_time timestamp,
transaction_type int,
PRIMARY KEY (user_name)
);
YYYYMMDD is the date part of the transaction. We can create a similar table keyed by transaction_id for transaction lookup. Obsolete tables can be dropped or truncated.
Maybe you should denormalize your data model. For example, to query by user_name you can use a CF like this:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (user_name, transaction_id)
);
So you can query using the partition key directly like this:
SELECT * FROM user_transactions WHERE user_name = 'USER_NAME';
And for the id you can use a cf like this:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transaction
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (transaction_id)
);
so the query could be something like this:
SELECT * FROM user_transactions WHERE transaction_id = 'ID';
This way you don't need indexes.
As for the TTL, maybe you could programmatically ensure that you update all the columns in the row at the same time (same CQL statement).
Perhaps my answer will be a little useful.
I would have done it like this:
CREATE TABLE user_transactions (
date timestamp,
user_name text,
id text,
type int,
PRIMARY KEY (id)
);
CREATE INDEX idx_user_transactions_uname ON user_transactions (user_name);
There is no need for 'transaction_time timestamp', because this time will be set by Cassandra for each column and can be fetched with the WRITETIME(column_name) function. Because you write all the columns simultaneously, you can call this function on any column.
INSERT INTO user_transactions ... USING TTL 86400;
will expire all columns simultaneously, so do not worry about deleting rows. See here: Expiring columns.
But as far as I know, you cannot delete an entire row this way: the key column still remains, and NULL will be written to the other columns.
If you want to delete the rows manually, or just want an estimate of the rows to be deleted by a TTL, then I recommend the Astyanax driver: AllRowsReader all-rows query.
And indeed, as a driver for working with Cassandra, I recommend Astyanax.
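A note on the TTL value: USING TTL takes seconds, so the 86400 in the example above is one day, while the 30-day window from the question would need:

```python
SECONDS_PER_DAY = 86400   # the TTL used in the USING TTL example is one day
RETENTION_DAYS = 30       # retention window stated in the question

ttl_seconds = RETENTION_DAYS * SECONDS_PER_DAY
print(ttl_seconds)  # 2592000
```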
