Cassandra Timestamp - cassandra

I want to ask about the timestamp format in the INSERT command.
In the following sample, when I insert any number like "12" or "15" into "message_sent_at",
all the values of the timestamp field show the same value: 1970-01-01 02:00 Egypt Standard Time.
Sample:
CREATE TABLE chat (
id1 int,
id2 int,
message_sent_at timestamp,
message text,
primary key ((id1, id2), message_sent_at)
)

The units of the timestamp type are milliseconds since the epoch (1970-01-01 00:00:00 UTC). Entering 12 means 12 ms after midnight UTC. The display format drops the milliseconds and converts to your timezone (UTC+2), so every such value prints as 1970-01-01 02:00.
You can create timestamps from dates here: http://www.epochconverter.com/.
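The same conversion can be sketched in Python. This is only an illustration and not part of the original answer; the helper names `to_epoch_millis` and `from_epoch_millis` are made up here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(dt):
    """Milliseconds since the epoch for a timezone-aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)

def from_epoch_millis(ms):
    """Timezone-aware UTC datetime for a millisecond count since the epoch."""
    return EPOCH + timedelta(milliseconds=ms)

# 12 is only 12 ms after the epoch, which is why it displays as 1970-01-01.
print(from_epoch_millis(12))
# A value suitable for message_sent_at.
print(to_epoch_millis(datetime(2022, 8, 17, 8, 30, tzinfo=timezone.utc)))
```

The resulting integer can be passed as the timestamp value in the INSERT statement.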

Related

Cassandra: Data Modeling for event based time series

I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
sensor_id text,
some_timing_bucket date,
measured_at time,
value double,
PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
sensor_id text,
year int,
month int,
day int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
If a sensor measures a data point every second, then there are at most 86,400 values for a single day (60 secs x 60 mins x 24 hrs). 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
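The bind values for that query could be computed on the application side roughly as follows. This is a sketch in Python, assuming the day buckets are defined in UTC; `last_minutes_params` is a hypothetical helper name, not something from the answer.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_minutes_params(now, minutes):
    """Return (year, month, day, cutoff_ms) for a 'last N minutes' query.

    Assumes the window does not cross midnight UTC; if it does, the
    previous day's partition has to be queried as well.
    """
    now = now.astimezone(timezone.utc)
    cutoff = now - timedelta(minutes=minutes)
    cutoff_ms = (cutoff - EPOCH) // timedelta(milliseconds=1)
    return now.year, now.month, now.day, cutoff_ms

now = datetime(2022, 8, 17, 8, 30, tzinfo=timezone.utc)
print(last_minutes_params(now, 15))  # (2022, 8, 17, 1660724100000)
```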
I think you'll find that it is easier to work with timestamps because it provides a bit more flexibility when it comes to doing range queries. Cheers!
You can do this with a simpler table like this:
CREATE TABLE sensor_data (
sensor_id text,
day_number_from_1970 int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
And you can query the data like this:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
AND day_number_from_1970 = day_number
AND measured_at > start_time
AND measured_at < end_time
With a single int column for the bucket, you should store less data on disk and still get results efficiently.
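A minimal Python sketch of how the `day_number_from_1970` value might be derived from a measurement time, assuming the bucket is counted in UTC days (the answer does not say which timezone it has in mind):

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_number_from_1970(dt):
    """Whole days between the epoch and dt, counted in UTC."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return (dt.astimezone(timezone.utc) - EPOCH).days

print(day_number_from_1970(datetime(2022, 8, 17, 8, 30, tzinfo=timezone.utc)))  # 19221
```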

How can we filter rows based on timestamp column?

I have a Cassandra column of type date whose values are in timestamp format, like below. How can we filter for rows where this column holds a date greater than today's date?
Example:
Type: date
Timestamp: 2021-06-29 11:53:52 +00:00
TTL: null
Value: 2021-03-16T00:00:00.000+0000
I was able to filter rows using columname <= '2021-09-25', which gives ten rows, some of them having dates on Sep 23 and 24. When I filter using columname < '2021-09-24', I get an error like below:
An error occurred on line 1 (use Ctrl-L to toggle line numbers):
Cassandra failure during read query at consistency ONE (1 responses were required but only 0 replica responded, 1 failed)
The CQL timestamp data type is encoded as the number of milliseconds since Unix epoch (Jan 1, 1970 00:00 GMT) so you need to be precise when you're working with timestamps.
Depending on where you're running the query, the filter could be translated in the local timezone. Let me illustrate with this example table:
CREATE TABLE community.tstamptbl (
id int,
tstamp timestamp,
PRIMARY KEY (id, tstamp)
)
These 2 statements may appear similar but translate to 2 different entries:
INSERT INTO tstamptbl (id, tstamp) VALUES (5, '2021-08-09');
INSERT INTO tstamptbl (id, tstamp) VALUES (5, '2021-08-09 +0000');
The first statement creates an entry with a timestamp in my local timezone (Melbourne, Australia) while the second statement creates an entry with a timestamp in UTC (+0000):
cqlsh:community> SELECT * FROM tstamptbl WHERE id = 5;
id | tstamp
----+---------------------------------
5 | 2021-08-08 14:00:00.000000+0000
5 | 2021-08-09 00:00:00.000000+0000
Similarly, you need to be precise when reading the data. You need to specify the timezone to remove ambiguity. Here are some examples:
SELECT * FROM tstamptbl WHERE id = 5 AND tstamp < '2021-08-09 +0000';
SELECT * FROM tstamptbl WHERE id = 1 AND tstamp < '2021-08-10 12:00+0000';
SELECT * FROM tstamptbl WHERE id = 1 AND tstamp < '2021-08-10 12:34:56+0000';
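The ten-hour gap between the two rows above can be reproduced outside of cqlsh. The Python sketch below uses a fixed UTC+10 offset as a stand-in for Melbourne time in August, which is an assumption that ignores daylight saving:

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+10 offset (AEST), no daylight-saving handling.
melbourne = timezone(timedelta(hours=10))

local_midnight = datetime(2021, 8, 9, tzinfo=melbourne)    # '2021-08-09' read as local time
utc_midnight = datetime(2021, 8, 9, tzinfo=timezone.utc)   # '2021-08-09 +0000'

print(local_midnight.astimezone(timezone.utc))  # 2021-08-08 14:00:00+00:00
print(utc_midnight)                             # 2021-08-09 00:00:00+00:00
print(utc_midnight - local_midnight)            # 10:00:00
```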
In the second part of your question, the error isn't directly related to your filter. The problem is that the replica(s) failed to respond for whatever reason (e.g. unresponsive/overloaded, down, etc). You need to investigate that issue separately. Cheers!

Cassandra - using "date" vs "text" types for a partition date key

We have a schema where the partition key will be a date (yyyy-MM-dd) and we are thinking about choosing the data type between text and date for this partition key.
Does one data type provide benefits over the other and how would they differ in querying/storage?
Here is an example schema.
CREATE TABLE test.user_sessions (
sess_date date, -- or text
sess_starttime timestamp,
event_type text,
total_req int,
ended_at timestamp,
PRIMARY KEY (sess_date, sess_starttime)
);
Cassandra date type:
A date with no corresponding time value; Cassandra encodes date as a 32-bit integer representing days since the epoch (January 1, 1970).
Cassandra text type:
UTF-8 encoded string; 8 bits for each ASCII character, and up to 32 bits for other characters.
If you store the date (yyyy-MM-dd) as the date data type, each entry takes only 32 bits. If you store it as text, the 10 ASCII characters take 10*8 = 80 bits of storage.
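A rough check of the raw value sizes in Python. A yyyy-MM-dd string is plain ASCII, so UTF-8 uses one byte per character. This only compares the encoded values and ignores any per-cell overhead or compression Cassandra applies on disk; packing the day count as a signed 32-bit integer is an illustration, not a claim about Cassandra's exact wire format.

```python
import struct
from datetime import date

d = date(2021, 3, 16)

# The date as a UTF-8 string, as it would be sent for a text column.
as_text = d.isoformat().encode("utf-8")

# The date as a day count since 1970-01-01, packed into 32 bits.
days = (d - date(1970, 1, 1)).days
as_int32 = struct.pack(">i", days)

print(len(as_text), len(as_int32))  # 10 4
```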
As per your comments, if you need maximum portability, simply store the information as a timestamp (a 64-bit number) corresponding to something like yyyy-MM-dd 00:00:00 (a truncated timestamp). You can't go wrong with a "universal" number...

Is it possible to insert ddmmyyhh to text column based on now() value of timeuuid column

I'm referring to one of the presentation slides from eBay - http://www.slideshare.net/jaykumarpatel/cassandra-data-modeling-best-practices
I want to try out the same thing. Hence, I created the following table.
CREATE TABLE ebay_event (
date text,
eventtype text,
time timeuuid,
payload text,
PRIMARY KEY((date, eventtype), time));
Then, in my PHP script, I perform the insert using the following statement.
insert into ebay_event(date, eventtype, time, payload) values('03031611', 'view', now(), 'additional data');
Instead of hard-coding the value '03031611', is there a way to tell Cassandra to generate ddmmyyhh based on the now() value of the timeuuid column?
No. There are no such functions available in Cassandra. You will have to create it in the language you are using.
Values for the timestamp type are encoded as 64-bit signed integers
representing a number of milliseconds since the standard base time
known as the epoch: January 1 1970 at 00:00:00 GMT.
There are some functions available that can create a date in YYYY-mm-dd format; see "Date from timeuuid".
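Building the key on the client is a one-liner in most languages. The asker uses PHP; the sketch below shows the same idea in Python, with `ddmmyyhh` as a made-up helper name and UTC as an assumed timezone for the bucket:

```python
from datetime import datetime, timezone

def ddmmyyhh(dt):
    """Format a datetime as day, month, two-digit year, hour (e.g. '03031611')."""
    return dt.strftime("%d%m%y%H")

# Bucket for the current hour; bind this to the `date` column of the INSERT.
bucket = ddmmyyhh(datetime.now(timezone.utc))
print(bucket)

print(ddmmyyhh(datetime(2016, 3, 3, 11)))  # 03031611
```

The client clock and the coordinator's now() can disagree slightly around the top of the hour, so a row may occasionally land in the neighbouring bucket.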

Cassandra : Making an appropriate Data Model

I have a table called Price in MySQL which looks like this:
+---------+-------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+-------------------+-----------------------------+
| Current | float(20,3) | YES | | NULL | |
| Time | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+---------+-------------+------+-----+-------------------+-----------------------------+
My application requires me to sum and retrieve results from the last 1 hour, 2 hours, and so on up to the last week from now. I am trying to move to Cassandra and want to make a suitable model for my data. Currently I have built a table in Cassandra which looks something like this:
CREATE TABLE IF NOT EXISTS HAS.Price (
ID INT,
Current float,
Time timestamp,
Time_uuid timeuuid,
PRIMARY KEY (ID, Time_uuid)
);
This is not logical, as it just creates one big table and I don't think this will distribute data to other nodes. I am using a fixed ID of 1 here. I believe the logical partition key in my case would be "hour", so that, for example, I can sum all the current values from the last hour, the last 2 hours, and so on. Here I am referring to this post. If I use the hour as the partition key, all the data for, say, the 15th hour of the day will go into this row:
2015-08-06 15:00:00
and the data for the next hour will go to 2015-08-06 16:00:00. However, say the current time is 2015-08-06 16:12:43 and I want to select the records from the last hour: what will my query look like, given that part of the data is in 2015-08-06 15:00:00, which has a different partition key?
Try the following option. (I have corrected the answer.)
Design for your queries. Here are the possible queries I can see, other than up-to-the-minute ones:
Get sum for day
Get sum for hour
Get sum for last hour (any time on the hour)
CREATE TABLE mykeyspace.price (
day text,
hour text,
inserttime timeuuid,
current float,
PRIMARY KEY ((day, hour), inserttime)
) WITH CLUSTERING ORDER BY (inserttime DESC)
Make 2 inserts for every transaction, like below:
insert into price (day, hour, inserttime, current) VALUES ('20150813', '', now(), 2.00);
insert into price (day, hour, inserttime, current) VALUES ('', '2015081317', now(), 2.00);
Where
day is YYYYMMDD (e.g. 20150813)
hour is YYYYMMDDhh (e.g. 2015081317)
Select query to get the last hour at any minute, using minTimeuuid and maxTimeuuid:
select day, hour, dateOf(inserttime) from price where day = '' and hour IN ('2015081317', '2015081316') and inserttime > maxTimeuuid('2015-08-13 16:20:00-0500') and inserttime < minTimeuuid('2015-08-13 17:20:00-0500');
Note: A range query is not allowed on a partition key. The documentation says you can use the token function, but the results are not predictable.
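The list of hour partitions for the IN clause has to be worked out by the application. A possible Python sketch, with `hour_buckets` as a hypothetical helper and the times taken in whatever timezone the buckets were written in:

```python
from datetime import datetime, timedelta

def hour_buckets(start, end):
    """Return the YYYYMMDDhh key of every hour touched by [start, end]."""
    current = start.replace(minute=0, second=0, microsecond=0)
    buckets = []
    while current <= end:
        buckets.append(current.strftime("%Y%m%d%H"))
        current += timedelta(hours=1)
    return buckets

end = datetime(2015, 8, 13, 17, 20)
print(hour_buckets(end - timedelta(hours=1), end))  # ['2015081316', '2015081317']
```

A one-hour window that does not start exactly on the hour always touches two buckets, which is why the query above lists two values.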
This is not logical as it just creates one big table and i dont think this will distribute data to other nodes.
Yes, this won't distribute data across your nodes.
Here is what I think the solution should be:
CREATE TABLE IF NOT EXISTS HAS.Price (
Time_uuid timeuuid,
Current float,
PRIMARY KEY (Time_uuid)
);
Then simply find the time_uuid for the start of the hour and the time_uuid for the end of the hour, and write a query like:
`SELECT * FROM HAS.Price WHERE time_uuid>=cdb36860-4444-11e5-8080-808080808080 AND time_uuid<=f784b8ef-450d-11e5-7f7f-7f7f7f7f7f7f`
