Cassandra is unexpectedly prepending zeros to timestamp milliseconds

My code reads data from Kafka and writes it to Cassandra using Spark, but in some cases a zero is being prepended to the milliseconds.
For example:
Kafka Data: 2022-10-11T08:46:12.220Z
Cassandra Data: 2022-10-11 14:16:12.022000+0000
Another example: we expect 2022-07-31 23:28:46.960000+0000 but in Cassandra it is stored as 2022-07-31 23:28:46.096000+0000.
How is a zero getting prepended to the milliseconds, and how do I resolve it? It only happens in some cases; most of the timestamps come through correctly.
Note: the difference in hours and minutes is due to the timezone change.

I suspect you're looking at different records because the timestamps you posted are not the same as what you think they should be.
I happen to have a table that contains timestamps, so I inserted the same values you posted above and can confirm that Cassandra is not adding leading zeros.
Here is my table schema:
CREATE TABLE community.tstamp_table (
    id int,
    tstamp timestamp,
    name text,
    PRIMARY KEY (id, tstamp)
)
And here are the table's contents with the timestamps you posted:
id | tstamp | name
----+---------------------------------+-------
1 | 2022-10-11 14:16:12.022000+0000 | alice
1 | 2022-10-11 14:16:12.220000+0000 | alice
2 | 2022-07-31 23:28:46.096000+0000 | bob
2 | 2022-07-31 23:28:46.960000+0000 | bob
The CQL timestamp data type is encoded as the number of milliseconds since Unix epoch (Jan 1, 1970 00:00 GMT). Knowing this, we can display the timestamp as an integer value using some native CQL functions:
system.blobasbigint(system.timestampasblob(tstamp)) | tstamp
-----------------------------------------------------+---------------------------------
1665497772022 | 2022-10-11 14:16:12.022000+0000
1665497772220 | 2022-10-11 14:16:12.220000+0000
1659310126096 | 2022-07-31 23:28:46.096000+0000
1659310126960 | 2022-07-31 23:28:46.960000+0000
You should be able to see from the sample data above that the encoded values of the timestamps are correct. For example, the first two rows are encoded correctly as 022 ms vs 220 ms, and the last two rows as 096 ms vs 960 ms. Cheers!
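For anyone who wants to double-check the arithmetic, here is a minimal Python sketch (not part of the original answer) that converts both timestamps to epoch milliseconds; the results match the encoded integers above, showing that .022 and .220 are genuinely different instants rather than a prepended zero:
from datetime import datetime, timezone

# 2022-10-11 14:16:12.022 UTC and 2022-10-11 14:16:12.220 UTC
a = datetime(2022, 10, 11, 14, 16, 12, 22000, tzinfo=timezone.utc)
b = datetime(2022, 10, 11, 14, 16, 12, 220000, tzinfo=timezone.utc)

print(round(a.timestamp() * 1000))  # 1665497772022
print(round(b.timestamp() * 1000))  # 1665497772220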

Related

How can I filter for a specific date on a CQL timestamp column?

I have a table defined as:
CREATE TABLE downtime(
asset_code text,
down_start timestamp,
down_end timestamp,
down_duration duration,
down_type text,
down_reason text,
PRIMARY KEY ((asset_code, down_start), down_end)
);
I'd like to get downtime on a particular day, such as:
SELECT * FROM downtime \
WHERE asset_code = 'CA-PU-03-LB' \
AND todate(down_start) = '2022-12-11';
I got a syntax error:
SyntaxException: line 1:66 no viable alternative at input '(' (...where asset_code = 'CA-PU-03-LB' and [todate](...)
If a function is not allowed on a partition key in the WHERE clause, how can I get the data with a down_start on a particular day?
You don't need to use the TODATE() function to filter for a specific date. You can simply specify the date as '2022-12-11' when applying a filter on a CQL timestamp column.
The difference, though, is that you cannot use the equality operator (=): the CQL timestamp data type is encoded as the number of milliseconds since Unix epoch (Jan 1, 1970 00:00 GMT), so you need to be precise when you're working with timestamps.
Let me illustrate using this example table:
CREATE TABLE tstamps (
    id int,
    tstamp timestamp,
    colour text,
    PRIMARY KEY (id, tstamp)
)
My table contains the following sample data:
cqlsh> SELECT * FROM tstamps ;
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-05 11:25:01.000000+0000 | red
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
1 | 2022-12-07 01:48:07.870000+0000 | blue
1 | 2022-12-07 03:13:27.313000+0000 | indigo
The cqlsh client formats the tstamp column into a human-readable date in UTC. But really, the tstamp values are stored as integers:
cqlsh> SELECT tstamp, TOUNIXTIMESTAMP(tstamp) FROM tstamps ;
tstamp | system.tounixtimestamp(tstamp)
---------------------------------+--------------------------------
2022-12-05 11:25:01.000000+0000 | 1670239501000
2022-12-06 02:45:04.564000+0000 | 1670294704564
2022-12-06 11:06:48.119000+0000 | 1670324808119
2022-12-06 19:02:52.192000+0000 | 1670353372192
2022-12-07 01:48:07.870000+0000 | 1670377687870
2022-12-07 03:13:27.313000+0000 | 1670382807313
To retrieve the rows for a specific date, you need to specify the range of timestamps which fall on that date. For example, the timestamps for 6 Dec 2022 UTC range from 1670284800000 (2022-12-06 00:00:00.000 UTC) to 1670371199999 (2022-12-06 23:59:59.999 UTC).
This means if we want to query for December 6, we need to filter using a range query:
SELECT * FROM tstamps \
WHERE id = 1 \
AND tstamp >= '2022-12-06' \
AND tstamp < '2022-12-07';
and we get:
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
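The day-boundary values mentioned above can be double-checked with a minimal Python sketch (an illustration, not part of the original answer):
from datetime import datetime, timezone

day_start = datetime(2022, 12, 6, tzinfo=timezone.utc)   # 2022-12-06 00:00:00.000 UTC
next_day  = datetime(2022, 12, 7, tzinfo=timezone.utc)   # exclusive upper bound

print(int(day_start.timestamp() * 1000))      # 1670284800000
print(int(next_day.timestamp() * 1000) - 1)   # 1670371199999 (2022-12-06 23:59:59.999 UTC)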
WARNING - In your case where the timestamp column is part of the partition key, performing a range query is dangerous because it results in a multi-partition query -- there are 86M possible values between 1670284800000 and 1670371199999. For this reason, timestamps are not a good choice for partition keys. Cheers!

Get the last 100 rows from cassandra table

I have a table in Cassandra and I cannot select the last 200 rows in it.
I expected the CLUSTERING ORDER BY clause to enforce sorting on disk.
CREATE TABLE t1 (
    id int,
    event text,
    receivetime timestamp,
    PRIMARY KEY (event, id)
) WITH CLUSTERING ORDER BY (id DESC);
The output is unsorted by id:
event | id | receivetime
---------+----+---------------------------------
event1 | 1 | 2021-07-12 08:11:57.702000+0000
event7 | 7 | 2021-05-22 05:30:00.000000+0000
event5 | 5 | 2021-05-25 05:30:00.000000+0000
event9 | 9 | 2021-05-22 05:30:00.000000+0000
event2 | 2 | 2021-05-21 05:30:00.000000+0000
event10 | 10 | 2021-05-23 05:30:00.000000+0000
event4 | 4 | 2021-05-24 05:30:00.000000+0000
event6 | 6 | 2021-05-27 05:30:00.000000+0000
event3 | 3 | 2021-05-22 05:30:00.000000+0000
event8 | 8 | 2021-05-21 05:30:00.000000+0000
How do I overcome this problem?
Thanks
The same question was asked on https://community.datastax.com/questions/11983/ so I'm re-posting my answer here.
The rows within a partition are sorted based on the order of the clustering column, not the partition key.
In your case, the table's primary key is defined as:
PRIMARY KEY (event, id)
This means that each partition key can have one or more rows, with each row identified by the id column. Since there is only one row in each partition, the sorting order is not evident. But if you had multiple rows in each partition, you'd be able to see that they would be sorted. For example:
event | id | receivetime
---------+----+---------------------------------
event1 | 7 | 2021-05-22 05:30:00.000000+0000
event1 | 5 | 2021-05-25 05:30:00.000000+0000
event1 | 1 | 2021-07-12 08:11:57.702000+0000
In the example above, the partition event1 has 3 rows sorted by the ID column in reverse order.
In addition, running unbounded queries (no WHERE clause filter) is an anti-pattern in Cassandra because it requires a full table scan. If you consider a cluster which has 500 nodes, an unbounded query has to request all the partitions (records) from all 500 nodes to return the result. It will not perform well and does not scale. Cheers!
The ordering from a clustering order applies within a single partition key value, e.g. all of the rows for event1 would be in order within event1. It is not a global ordering.
From your results we can see that you are selecting multiple partitions, which is why you are not seeing the order you expect.
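To illustrate both answers (this sketch is mine, not the original posters'): within a single partition the clustering order does apply, so the highest id values for one event can be read with a LIMIT. A minimal example with the Python driver, where the contact point and keyspace name are assumptions:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Within the 'event1' partition, rows come back ordered by id DESC
# (the table's clustering order), so LIMIT returns the highest ids.
rows = session.execute(
    "SELECT event, id, receivetime FROM t1 WHERE event = %s LIMIT 100",
    ["event1"],
)
for row in rows:
    print(row.event, row.id, row.receivetime)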

Spark Structured Streaming ignore old records

I am new to Spark; please help me arrive at a solution for this problem. I receive an input file that contains information about events that occurred, and each record carries a timestamp value. Event_Id is the primary column for this input. Refer to the sample input below (the actual file has many other columns).
Event_Id | Event_Timestamp
1 | 2018-10-11 12:23:01
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
When we get the above input, we need to keep the latest record per event id based on the timestamp, and the expected output would be:
Event_Id | Event_Timestamp
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
From then on, whenever I receive event information with a timestamp older than the value stored above, I need to ignore it. For example, consider the second input:
Event_Id | Event_Timestamp
2 | 2018-10-11 10:25:01
1 | 2018-10-11 08:23:01
3 | 2018-10-11 21:12:01
Now I need to ignore event_id 1 and 2 since they carry timestamps older than the state we currently hold. Only event 3 would be passed through, and the expected output here is:
3 | 2018-10-11 21:12:01
Assuming we have a large number of unique event ids (say 10 billion), how would this state be stored in Spark memory, and is there anything that needs to be taken care of?
Thanks in advance
We can take the max timestamp and use the persist() method with the DISK_ONLY or DISK_ONLY_2 storage levels; in that case, I think we can achieve this.
Since it is streaming data, we can also try the MEMORY_ONLY or MEMORY_ONLY_2 storage levels.
Please try it and update.
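As a rough illustration of the "take the max timestamp" idea (my own sketch, using a streaming aggregation rather than the persist() approach described above): a running MAX(Event_Timestamp) per Event_Id means records older than the stored maximum never change the state. The rate source and console sink below are placeholders; in a real job the source would be your file/Kafka stream with Event_Id and Event_Timestamp columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("latest-event-per-id").getOrCreate()

# Placeholder streaming source; substitute your real input that yields
# Event_Id and Event_Timestamp columns.
events = (spark.readStream
               .format("rate")
               .load()
               .selectExpr("CAST(value % 3 AS STRING) AS Event_Id",
                           "timestamp AS Event_Timestamp"))

# Running maximum per key: older records do not change the state, so they
# are effectively ignored. "update" mode emits only keys whose max changed.
latest = events.groupBy("Event_Id").agg(
    F.max("Event_Timestamp").alias("Event_Timestamp"))

query = (latest.writeStream
               .outputMode("update")
               .format("console")
               .start())

# Caveat: with billions of distinct Event_Ids the aggregation state becomes
# very large, so state-store sizing and checkpointing need to be considered.
query.awaitTermination()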

Time varies in postgres server and excel

I am trying a query which groups the data by months.
test_db=# select date_trunc('month', install_ts) AS month, count(id) AS count from api_booking group by month order by month asc;
month | count
------------------------+-------
2016-08-01 00:00:00+00 | 297
2016-09-01 00:00:00+00 | 2409
2016-10-01 00:00:00+00 | 2429
2016-11-01 00:00:00+00 | 3512
(4 rows)
This is the output in my postgres db shell.
However, when I try this query in Excel, this is the output:
month | count
------------------------+-------
2016-07-31 17:00:00+00 | 297
2016-08-31 17:00:00+00 | 2409
2016-09-30 17:00:00+00 | 2429
2016-10-31 17:00:00+00 | 3512
(4 rows)
I think the problem is that Excel is interpreting the dates in a different timezone.
How can I tell Excel to read them correctly, or is there any other solution to this problem?
Try...
select date(date_trunc('month', install_ts)) AS month, count(id) AS count from api_booking
The date() function strips the time component out of a timestamp.

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 using cqlsh with the following statement:
CREATE TABLE test (
    locationid int,
    pulseid int,
    name text,
    PRIMARY KEY (locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data and ran a select, I noticed that locationid's ascending sort seems to be based on the string value rather than the integer value.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the 0 10 5. Is there a way to make it sort via its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster. It does this in a random fashion to achieve an even distribution. This means that you cannot order by the first part of your primary key.
What version of Cassandra are you on? In the most recent 1.2 release (1.2.2), the CREATE statement you used as an example is invalid: CLUSTERING ORDER BY can only reference clustering columns (pulseid here), not the partition key (locationid).
