My Cassandra table has a column of type timeuuid. How do I see the value of a timeuuid in nanoseconds?
timeuuid:
49cbda60-961b-11e8-9854-134d5b3f9cf8
49cbda60-961b-11e8-9854-134d5b3f9cf9
How do I convert this timeuuid to nanoseconds?
I need a select statement like:
select dateOf(timeuuid_col) from table_a;
There is a utility method in the Java driver, UUIDs.unixTimestamp(UUID id), that returns a normal epoch timestamp (in milliseconds) which can be converted into a Date object.
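For example, a minimal sketch assuming the DataStax Java driver 3.x (in driver 4.x the equivalent class is com.datastax.oss.driver.api.core.uuid.Uuids):

import com.datastax.driver.core.utils.UUIDs;

import java.util.Date;
import java.util.UUID;

public class TimeuuidToDate {
    public static void main(String[] args) {
        UUID id = UUID.fromString("49cbda60-961b-11e8-9854-134d5b3f9cf8");
        long millis = UUIDs.unixTimestamp(id); // milliseconds since the Unix epoch
        System.out.println(new Date(millis));
    }
}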
Worth noting that nanosecond precision from the time UUID will not necessarily be meaningful. A type 1 UUID includes a timestamp which is the number of 100-nanosecond intervals since the Gregorian calendar was first adopted at midnight, October 15, 1582 UTC. But the driver takes a 1 ms timestamp (real precision depends on the OS and can even be 10 or 40 ms) and keeps a monotonic counter to fill in the 10,000 unused sub-millisecond values; if more than 10,000 values are requested within one millisecond it can even end up counting into the future (in practice, performance limitations prevent this). This is much more performant and guarantees no duplicates, especially as sub-millisecond time accuracy in computers is pretty meaningless in a distributed system.
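A quick sketch of that counter behaviour, again assuming the 3.x driver's utility class:

import com.datastax.driver.core.utils.UUIDs;

import java.util.UUID;

public class TimeuuidCounterDemo {
    public static void main(String[] args) {
        // Two timeuuids generated back to back, almost certainly within the same millisecond:
        UUID a = UUIDs.timeBased();
        UUID b = UUIDs.timeBased();
        // Their 100 ns timestamps still differ, because the driver fills the
        // sub-millisecond bits from an internal monotonic counter.
        System.out.println(a.timestamp() + " vs " + b.timestamp());
    }
}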
So if you're looking at it from a purely CQL perspective, there's no way to do it without a UDF, and there isn't much value in going beyond millisecond precision anyway, so dateOf should be sufficient. If you REALLY want it though:
CREATE OR REPLACE FUNCTION uuidToNS (id timeuuid)
CALLED ON NULL INPUT RETURNS bigint
LANGUAGE java AS '
return id.timestamp();
';
Will give you the count of 100 ns intervals since October 15, 1582. To translate that to nanoseconds from the Unix epoch, multiply it by 100 to convert to nanoseconds and add the offset between the two epochs (-12219292800L * 1_000_000_000 in nanoseconds). This might overflow a long, so you may need to use something wider.
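A minimal client-side sketch of that conversion (plain Java, no driver needed), using BigInteger to avoid the overflow:

import java.math.BigInteger;
import java.util.UUID;

public class TimeuuidToNanos {
    // Offset in seconds to add when converting from the UUID epoch (1582-10-15) to the Unix epoch (1970-01-01).
    private static final long EPOCH_OFFSET_SECONDS = -12219292800L;

    static BigInteger unixNanos(UUID timeuuid) {
        // timeuuid.timestamp() = count of 100 ns intervals since 1582-10-15
        BigInteger nanosSince1582 = BigInteger.valueOf(timeuuid.timestamp())
                .multiply(BigInteger.valueOf(100));
        BigInteger offsetNanos = BigInteger.valueOf(EPOCH_OFFSET_SECONDS)
                .multiply(BigInteger.valueOf(1_000_000_000L));
        return nanosSince1582.add(offsetNanos);
    }

    public static void main(String[] args) {
        System.out.println(unixNanos(UUID.fromString("49cbda60-961b-11e8-9854-134d5b3f9cf8")));
    }
}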
I want to get a day-accurate (not hour, minutes, seconds) Epoch timestamp that remains the same throughout the day.
This is accurate to the second (and therefore still too precise):
from datetime import date, datetime
timestamp = datetime.today().strftime("%s")
Is there any simple way to make it less precise?
A UNIX timestamp is by definition accurate to the second (or millisecond), because it is a count of seconds. The only thing you can do is choose a specific time which "stays constant" throughout the day, for which midnight probably makes the most sense:
from datetime import datetime, timezone
timestamp = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0).timestamp()
It depends on what you want.
If you just want something quick, use either time.time_ns() or time.time(). Epoch time is what the system uses (on many OSes), so there is no conversion. The _ns() version avoids floating-point maths, so it is faster.
If you want to store it in a more efficient, day-accurate way, you can just do:
int(time.time()) - int(time.time()) % (24*60*60)
which gives you the epoch timestamp at the start of the (UTC) day (or int(time.time()) // (24*60*60) if you only need the day number). Epoch time, contrary to most other time scales (and GPS time), makes every day exactly 24*60*60 = 86,400 seconds long (leap seconds are discarded).
I notice that if my model has a field expirationTime of type DateTime then I cannot store it in a timestamp field in Cassandra.
QueryBuilder.set("expiration_time",model.expirationTime) //data gets corrupted
But if I store the time as milliseconds, then it works.
QueryBuilder.set("expiration_time",model.expirationTime.getMillis()) //WORKS
Question 1 - Does that mean that the timestamp field in Cassandra is of type long?
Question 2 - Is it cqlsh which converts the time into a readable format like 2018-05-18 03:21+0530?
From DataStax documentation on CQL types:
Date and time with millisecond precision, encoded as 8 bytes since epoch. Can be represented as a string, such as 2015-05-03 13:30:54.234.
In Java, as input you can use either a long with milliseconds, a string literal supported in CQL, or a java.util.Date (see the code). When reading, results are mapped to java.util.Date in driver 3.x/1.x (see the full table of CQL<->Java type mappings), or to java.time.Instant in driver 4.x/2.x (see its CQL<->Java type mapping).
In Python/cqlsh, yes - the data is read as an 8-byte long, which is then converted into a string representation.
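For illustration, a minimal read sketch assuming the Java driver 3.x; the contact point, keyspace, table and column names are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.Date;

public class ReadTimestamp {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            Row row = session.execute(
                    "SELECT expiration_time FROM my_keyspace.my_table LIMIT 1").one();
            Date expiration = row.getTimestamp("expiration_time"); // timestamp -> java.util.Date
            System.out.println(expiration.getTime() + " ms since epoch = " + expiration);
        }
    }
}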
I have a problem, but I have no idea how to resolve it.
I've got a PointValues collection in MongoDB.
The PointValue schema has 3 parameters:
dataPoint (ref to DataPoint schema)
value (Number)
time (Date)
There is one pointValue for every hour (24 per day).
I have an API method to get PointValues for a specified DataPoint and time range. The problem is that I need to limit the result to at most 1000 points. A plain limit(1000) isn't the right way, because I need points for the whole specified time range, with a time step that depends on the specified time range and the number of point values.
So... for example:
Request data for 1 year = 1 * 365 * 24 = 8760
It should return 1000 values, i.e. approximately 1 value per (24 / (1000 / 365)) = ~9 hours.
I have no idea which method I should use to filter that data in MongoDB.
Thanks for help.
Sampling exactly like that in the database would be quite hard to do and likely not very performant. An option that gives you a similar result is an aggregation pipeline which $groups by $year, $dayOfYear, and $hour (and $minute and $second if you need smaller intervals) and keeps the $first value in each group. That way you can sample values by time step, but your choice of step length is limited to what you have date operators for: "hourly" samples are easy, "9-hourly" samples get complicated. When this query is performance-critical and frequent, you might want to consider creating additional collections with daily, hourly, minutely etc. DataPoints so you don't need to perform that aggregation on every request.
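A minimal sketch of such a pipeline, shown with the MongoDB Java driver purely to illustrate its shape (the question uses Mongoose; the collection and field names follow the schema above, while db, dataPointId, from and to are assumed to come from your API method):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Arrays;
import java.util.Date;

public class HourlySamples {
    static void printHourlySamples(MongoDatabase db, Object dataPointId, Date from, Date to) {
        MongoCollection<Document> pointValues = db.getCollection("pointvalues");

        Bson match = Aggregates.match(Filters.and(
                Filters.eq("dataPoint", dataPointId),
                Filters.gte("time", from),
                Filters.lt("time", to)));

        // Sort by time first so $first picks the earliest value of each hour.
        Bson sortByTime = Aggregates.sort(Sorts.ascending("time"));

        // One group per year / day-of-year / hour.
        Bson group = Aggregates.group(
                new Document("year", new Document("$year", "$time"))
                        .append("dayOfYear", new Document("$dayOfYear", "$time"))
                        .append("hour", new Document("$hour", "$time")),
                Accumulators.first("time", "$time"),
                Accumulators.first("value", "$value"));

        for (Document sample : pointValues.aggregate(
                Arrays.asList(match, sortByTime, group, sortByTime))) {
            System.out.println(sample.toJson());
        }
    }
}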
But your documents are quite lightweight, since the actual payload is in a different collection. So you might consider fetching all the results in the requested time range and doing the skipping on the application layer. You could combine this with the aggregation described above to pre-reduce the dataset: first use an aggregation pipeline to get hourly results into the application, then skip through that result set in steps of 9 documents. Whether or not this makes sense depends on how many documents you expect.
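And a sketch of the application-layer down-sampling described above, assuming the hourly results have already been fetched and sorted by time (maxPoints would be 1000 here):

import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class DownSample {
    // Keep roughly maxPoints documents by taking every n-th one from a time-sorted list.
    static List<Document> sample(List<Document> sortedHourly, int maxPoints) {
        int step = Math.max(1, (int) Math.ceil(sortedHourly.size() / (double) maxPoints));
        List<Document> result = new ArrayList<>();
        for (int i = 0; i < sortedHourly.size(); i += step) {
            result.add(sortedHourly.get(i));
        }
        return result;
    }
}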
Also remember to create a sorted index on the time-field.
I was reading the Datastax CQL reference:
Collisions that would potentially overwrite data that was not intended
to be overwritten cannot occur.
Can someone explain to me why a collision will never occur? Is it impossible or "highly" unlikely?
Cassandra's timeuuid is a Version 1 UUID which is based on the time and the MAC address of the machine generating the UUID.
The time used is accurate down to 100 ns, so the chance of a collision is incredibly slim (a nanosecond is a millionth of a millisecond).
Cassandra's timeuuid is a Version 1 UUID (Type 1 UUID), which is based on:
A timestamp consisting of a count of 100-nanosecond intervals since
15 October 1582 (the date of Gregorian reform to the Christian
calendar).
A version (which should have a value of 1).
A variant (which should have a value of 2).
A sequence number, which can be a counter or a pseudo-random number.
A "node" which will be the machine's MAC address (which should make the UUID unique across machines).
Using a pseudo-random number for the 14-bit sequence number gives 16,384 possible values, so two UUIDs generated with the same timestamp on the same node still have only a 1 in 16,384 chance of colliding.
If you generate more than 10,000 UUIDs per millisecond, they may collide.
1 msec = 10^6 ns
So you could have 10^6 distinct timestamps per millisecond if the timestamp were at nanosecond resolution, but since the timestamp is a count of 100 ns intervals,
we have at most 10,000 unique timestamps in one millisecond.
When generating more than that on a single machine (which has the same MAC address), there is a chance of collision, as only the sequence number is left to disambiguate.
If your application generates more than 10,000 timeuuids per millisecond, use another column to make a compound key, which helps to avoid collisions.
I am trying to come up with a partition key strategy based on a DateTime that doesn't result in the Append-Only write bottleneck often described in best practices guidelines.
Basically, if you partition by something like YYYY-MM-DD, all your writes for a particular day will end up in the same partition, which will reduce write performance.
Ideally, a partition key should distribute writes evenly across as many partitions as possible.
To accomplish this while still basing the key on a DateTime value, I need a way to assign DateTime values to buckets, where the number of buckets is a predetermined number per time interval - say 50 a day. The assignment of a DateTime to a bucket should be as random as possible - but always the same for a given value. The reason for this is that I need to be able to always get the correct partition given the original DateTime value. In other words, this is like a hash.
Lastly, and critically, I need the partition key to be sequential at some aggregate level. So while DateTime values for a given interval, say 1 day, would be randomly distributed across X partition keys, all the partition keys for that day would fall within a queryable range. This would allow me to query all rows for my aggregate interval and then sort them by the DateTime value to get the correct order.
Thoughts? This must be a fairly well known problem that has been solved already.
How about using the millisecond component of the datetime stamp, mod 50? That would give you your random distribution throughout the day, the value itself would be sequential, and you could easily calculate the PartitionKey in future given the original timestamp.
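A minimal Java sketch of that idea (the key format and the bucket math are illustrative assumptions, not part of Eoin's answer): the day prefix keeps the keys range-queryable per day, while the bucket suffix is deterministic for a given timestamp.

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BucketedPartitionKey {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // e.g. "20180723_37": sortable by day, spread across 50 buckets within the day.
    static String partitionKeyFor(Instant timestamp) {
        String day = DAY.withZone(ZoneOffset.UTC).format(timestamp);
        // epochMilli % 50 equals the millisecond component % 50, since 1000 is divisible by 50.
        int bucket = (int) (timestamp.toEpochMilli() % 50);
        return day + "_" + String.format("%02d", bucket);
    }

    public static void main(String[] args) {
        System.out.println(partitionKeyFor(Instant.now()));
    }
}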
To add to Eoin's answer, below is the code I used to simulate his solution:
// Requires: using System; using System.Collections.Generic; using System.Threading;
var buckets = new SortedDictionary<int, List<DateTime>>();
var rand = new Random();
for (int i = 0; i < 1000; i++)
{
    var dateTime = DateTime.Now;
    // 53 buckets, keyed off the tick count so a given timestamp always maps to the same bucket
    var bucket = (int)(dateTime.Ticks % 53);
    if (!buckets.ContainsKey(bucket))
        buckets.Add(bucket, new List<DateTime>());
    buckets[bucket].Add(dateTime);
    // Simulate requests arriving up to ~20 ms apart
    Thread.Sleep(rand.Next(0, 20));
}
So this should simulate roughly 1000 requests coming in, each anywhere between 0 and 20 milliseconds apart.
This resulted in a fairly good/even distribution across the 53 "buckets". It also, as expected, avoided the append-only or prepend-only anti-pattern.