Cassandra CQL - get records since 1 before timestamp

I have the following table:
CREATE TABLE records (
    device_id text,
    read_time timestamp,
    data_row text,
    PRIMARY KEY (device_id, read_time)
) WITH CLUSTERING ORDER BY (read_time DESC);
I want to get all the records starting with the one before a specific read_time.
Is there a way to do that?
I thought about adding another field, previous_read_time, but it would be hard to
maintain since I sometimes get out-of-order reads.

I don't think there's any CQL statement that does this (filter on a timestamp column PLUS the first record not matching the time filter). But depending on your exact case, maybe something like the following would work for you?
For example, to find all records with read_time after 2020-05-14 00:00:01, plus the first one on or before 2020-05-14 00:00:01:
1. Select all records after the chosen time (2020-05-14 00:00:01):
SELECT * FROM records WHERE device_id=? AND read_time > '2020-05-14 00:00:01';
2. From the results of the first query, take the record with the read_time closest to 2020-05-14 00:00:01.
// Let's say you find records with the following times.
// The earliest (closest to the filter's time) is 2020-05-14 00:00:55
2020-05-14 00:00:55
2020-05-14 00:00:56
2020-05-14 00:30:55
2020-05-14 13:30:55
3. Select again to find the first record which comes before the "closest time to the filter time" you found in step 2:
SELECT * FROM records WHERE device_id=? AND read_time < '2020-05-14 00:00:55' LIMIT 1;
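Alternatively, since read_time is clustered in DESC order, you may be able to skip steps 2 and 3 and fetch the boundary row directly; a minimal sketch against the same table:
-- With DESC clustering, LIMIT 1 on an upper-bound slice returns the newest
-- row at or before the bound, i.e. the record just before the filter time.
SELECT * FROM records WHERE device_id=? AND read_time <= '2020-05-14 00:00:01' LIMIT 1;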

Related

Cassandra clustering columns and performance

So I have a materialized view defined as follows (changed the name a bit):
CREATE MATERIALIZED VIEW MYVIEW AS
SELECT *
FROM XXXX
WHERE id IS NOT NULL AND process_on_date_time IS NOT NULL AND poller_processing_status IS NOT NULL
PRIMARY KEY (poller_processing_status, process_on_date_time, id)
WITH CLUSTERING ORDER BY (process_on_date_time ASC, id ASC)
...
Based on the definition, the data should be sorted by the PROCESS_ON_DATE_TIME column in ASC order (oldest first).
Now I have a query that runs as follows:
SELECT JSON * FROM MYVIEW
WHERE poller_processing_status='PENDING'
AND process_on_date_time<=1548775105000 LIMIT 250 ;
There are over 250 rows that match, so 250 are returned. Running this from CQLSH, it fetches 3 times: the first two fetches return 100 rows each and the last one returns 50. I enabled tracing via CQLSH with consistency LOCAL_ONE. The first two fetches do EXACTLY the same steps, in the same order, against the same sstables, etc. Yet one of them always takes 2 SECONDS, the other 8 ms. I can't for the life of me figure out why one takes over 2 seconds. The slow one stalls on "Merged data from memtables and 3 sstables" (but again, the first two fetches do exactly the same thing; one is slow and the next is fast, with the same merged sstables).
On to step 2. I thought: OK, we have a clustering column so it's sorted; what if I add an ORDER BY clause to sort the results? So I ran this:
SELECT JSON * FROM MYVIEW
WHERE poller_processing_status='PENDING'
AND process_on_date_time<=1548775105000
ORDER BY process_on_date_time ASC LIMIT 250;
So basically the exact same query, specifying an order matching the sorted data (the clustering column). The result? The same: over 2 seconds to complete. No improvement. Bummer.
Now one last test: I changed the sort in my query from ASC to DESC and gave it a try. The result? Every time, the query responds in less than 10 milliseconds. Huh???? This is what I'm trying to achieve, but why does the REVERSE sort run well?
I would figure that as I'm asking for records OLDER than something, an ASC sort would be best, because the oldest records would be first and immediately grabbed. The other route, DESC, would put newer records first, so it would have to skip over a bunch to find the first record older than my timestamp and then grab the next 250. Can anyone help me with this logic? Why does a DESC ORDER BY respond well while the ASC does not? (The DESC variant is shown below for reference.)
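For reference, the fast variant is the same query with the sort reversed:
SELECT JSON * FROM MYVIEW
WHERE poller_processing_status='PENDING'
AND process_on_date_time<=1548775105000
ORDER BY process_on_date_time DESC LIMIT 250;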
Using DSE 5.1.9
Thanks in advance.
-Jim

How can I search a table for a timestamp x hours old in Cassandra?

I am trying to search for timestamps in a Cassandra table that are within a given length of time, for example "all timestamps that are 4 hours old or less".
I have tried using DATESUB(), TIMEDIFF() and other ways to compute a time delta, but just haven't had any luck. I am not sure if I am just looking at this from a relational DB mindset.
EDIT: Adding an example below
SELECT *
FROM events
WHERE event_timestamp < (now() - 4 hours) -- This part is giving issues
ORDER BY region DESC;
Thanks!
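CQL has no DATESUB() or TIMEDIFF(); the portable approach is to compute the cutoff client-side and bind it as a parameter. On Cassandra 4.0+, which supports arithmetic between timestamps and durations, something like the following sketch should also work (the region partition key and an event_timestamp clustering column are assumptions here; ORDER BY is dropped because it only applies to clustering columns):
-- Cassandra 4.0+: subtract a duration literal (4h) from the current time
SELECT * FROM events
WHERE region = 'us-east'
AND event_timestamp >= toTimestamp(now()) - 4h;
-- Any version: compute the cutoff in the client and bind it
SELECT * FROM events WHERE region = ? AND event_timestamp >= ?;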

Timeseries data modelling in cassandra

I am trying to store & retrieve data in Cassandra in the following way:
Storing Data:
I created the table in the following way:
CREATE TABLE mydata (
myKey TEXT,
datetime TIMESTAMP,
value TEXT,
PRIMARY KEY (myKey,datetime)
);
where I would store a value for every minute for the last 5 years. So it stores 1440 * 365 * 5 = 2,628,000 records/columns per row (myKey as the row key).
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:01:00','72F');
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:02:00','72F');
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:03:00','72F');
.................
I am able to store data and all is fine. However, I would like to know if this is an efficient way of storing data horizontally (2,628,000 values for each key, with 1 million such keys altogether)?
Retrieving Data:
After storing the data in the above format, I am able to select data for a period using a simple SELECT query.
Ex:
SELECT *
FROM mydata
WHERE myKey='1234ABCD' AND datetime > '2013-04-03 07:01:00' AND datetime < '2013-04-03 07:04:00';
The query works fine and I get the result as expected.
However, my question is:
How can I select only the values at certain intervals? For example, if I query data for a day, I get 1440 values (one for every minute). I would like to get the value at every 10th minute, limiting the number of values to 144.
Is there a way to query the table like that if we use the above storage strategy?
If not, what are the possible options to meet my requirement of querying data at a specific interval like 1 minute, 10 minutes, 1 hour, 1 day, etc.?
Appreciate any other suggestions.
No, it is not right; in the future you will face problems, because per row key (partition) we can only store 2 billion records/columns. Past that point it will not give an error and will still store data, but you will have problems.
For your problem, split the timestamp column into year, month, day and time,
like 2016, 04, 04 and 15:03:00, and put year, month and day into the partition key as well. A sketch of that layout is below.
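A minimal sketch of that layout (table and column names here are illustrative, not from the original post):
CREATE TABLE mydata_by_day (
    myKey text,
    year int,
    month int,
    day int,
    datetime timestamp,
    value text,
    PRIMARY KEY ((myKey, year, month, day), datetime)
);
-- Each partition now holds at most one day of minutes (1440 rows):
SELECT * FROM mydata_by_day
WHERE myKey = '1234ABCD' AND year = 2013 AND month = 4 AND day = 3;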
You definitely need to bound your partition with a modular version of the timestamp, but the right granularity really depends on your reads.
If you are mainly going to read per day, then use something like PK((myKey, yyyymmdd), time).
If mainly by week, PK((myKey, yyyyww), time), or by month...
The problem is then that if you want to read values for a whole year, you are better off with a partition per week or per month (or even per year would do, I think, if you don't do any deletes); your partition size needs to stay smaller than 100 MB. A sketch of the per-day variant follows.
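A sketch of the per-day variant, assuming an int bucket like 20130403 (the bucket encoding is my assumption; swap it for yyyyww or yyyymm as needed):
CREATE TABLE mydata_bucketed (
    myKey text,
    yyyymmdd int,
    datetime timestamp,
    value text,
    PRIMARY KEY ((myKey, yyyymmdd), datetime)
);
-- Reading one day is then a single-partition slice:
SELECT * FROM mydata_bucketed
WHERE myKey = '1234ABCD' AND yyyymmdd = 20130403
AND datetime >= '2013-04-03 07:00:00' AND datetime < '2013-04-03 08:00:00';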

Nullable timestamp with order by

We've recently decided to migrate an application to Cassandra (from Oracle) because it may help with performance, and as I have a decent Oracle background, I've got to admit I struggle with the Cassandra "way of thinking".
Basically, I have a table with ~15 fields, among them dates. One of these dates is used for "ordering", so I need to be able to do an "order by" on it. At the same time, though, this field can be nullable.
Now I've figured out that putting that field in the primary key lets me actually do the order-by part, but then I can't assign a null value to it anymore...
Any ideas?
You are correct in that you cannot query by NULL values in Cassandra. There's a really good reason for that, which is that NULL values don't really exist. That row simply does not contain a value for the "NULL" column. So the CQL interface abstracts that with the "NULL" output, because that's easier to explain to people.
Cassandra also does not allow NULLs (or an absence of a column value) in its key fields. So the best you can do in this case is to come up with a timestamp constant that you (and your application) recognize to be NULL without breaking anything. So consider this example table structure:
aploetz@cqlsh:stackoverflow> CREATE TABLE eventsByMonth (
    monthBucket text,
    eventTime timestamp,
    event text,
    PRIMARY KEY (monthBucket, eventTime))
    WITH CLUSTERING ORDER BY (eventTime DESC);
Next I'll insert some values to test with:
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201509','2015-09-19 00:00:00','Talk Like A Pirate Day');
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201509','2015-09-25 00:00:00','Hobbit Day');
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201509','2015-09-19 21:00:00','dentist appt');
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201503','2015-03-14 00:00:00','Pi Day');
Let's say that I have two events that I want to keep track of, but I don't know the eventTimes, so instead of INSERTing a NULL, I'll just specify a zero. For the sake of the example, I'll put one in September 2015 and the other in October 2015:
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201510',0,'Some random day I want to keep track of');
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201509',0,'Some other random day I want to keep track of');
Now when I query for September of 2015, I'll get the following output:
aploetz@cqlsh:stackoverflow> SELECT * FROM eventsbymonth WHERE monthbucket = '201509';
monthbucket | eventtime | event
-------------+--------------------------+-----------------------------------------------
201509 | 2015-09-25 00:00:00-0500 | Hobbit Day
201509 | 2015-09-19 21:00:00-0500 | dentist appt
201509 | 2015-09-19 00:00:00-0500 | Talk Like A Pirate Day
201509 | 1969-12-31 18:00:00-0600 | Some other random day I want to keep track of
(4 rows)
Notes:
This is probably something you want to avoid doing, if possible.
INSERT/UPDATE (Upsert) with a "NULL" value is the same as a DELETE operation, and creates tombstone(s).
Upserting a zero (0) as a TIMESTAMP defaults to 1970-01-01 00:00:00 UTC. My current timezone offset is -0600, which is why the value of 1969-12-31 18:00:00 appears.
I don't need to specify an ORDER BY clause in my query, because the defined clustering order is already what I want. It is a good idea to configure this per your query requirements, because all ORDER BY can really do is enforce ASCending or DESCending order; you cannot specify a column in your ORDER BY that differs from your table's defined clustering order. (See the example after these notes.)
An advantage of using a zero TIMESTAMP is that all rows containing that key are ordered at the bottom of the result set (with DESCending order), so you'll always know where to look for them.
Not sure what your partitioning key is, but I used monthBucket for mine. FYI- "bucketing" is a Cassandra modeling technique used when working with time series data, to evenly distribute data in your cluster.
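For instance, reversing the defined DESC clustering order is the one thing ORDER BY can do here:
aploetz@cqlsh:stackoverflow> SELECT * FROM eventsbymonth
    WHERE monthbucket = '201509' ORDER BY eventtime ASC;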

Duplicate timestamps in timeseries - Cassandra

I am going to use Cassandra to store activity logs. I have something like this:
CREATE TABLE general_actionlog (
date text,
time text,
date_added timestamp,
action text,
PRIMARY KEY ((date,time),date_added)
);
I want to store all the activity in an hour in a single row (= a time series; "time" is only the hour of the day in the format H:00:00, ignoring minutes and seconds, so I have a row for each Y-m-d H:00:00).
The problem appears when two actions happen at the same timestamp (e.g. two page views in the same second), so the second one overwrites the first one.
How can I solve this in a way that I still can query using slices?
Thanks
marc
You want to use timeuuid instead of timestamp for the date_added column. A timeuuid is a v1 UUID. It has a timestamp component (and is sorted by the timestamp), so it effectively provides a conflict-free timestamp.
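A sketch of the adjusted table and a slice query (the sample values are illustrative):
CREATE TABLE general_actionlog (
    date text,
    time text,
    date_added timeuuid,
    action text,
    PRIMARY KEY ((date, time), date_added)
);
-- now() yields a unique v1 UUID per call, so two actions in the same
-- second no longer collide:
INSERT INTO general_actionlog (date, time, date_added, action)
VALUES ('2014-05-05', '13:00:00', now(), 'page_view');
-- Slice queries still work, via minTimeuuid()/maxTimeuuid():
SELECT * FROM general_actionlog
WHERE date = '2014-05-05' AND time = '13:00:00'
AND date_added >= minTimeuuid('2014-05-05 13:00:00')
AND date_added < maxTimeuuid('2014-05-05 13:30:00');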
