Why will timeuuid not have any collisions? - cassandra

I was reading the Datastax CQL reference:
Collisions that would potentially overwrite data that was not intended
to be overwritten cannot occur.
Can someone explain to me why a collision will never occur? Is it impossible or "highly" unlikely?

Cassandra's timeuuid is a Version 1 UUID which is based on the time and the MAC address of the machine generating the UUID.
The time used is accurate down to 100 ns, so the chance of a collision is incredibly slim (a nanosecond is a millionth of a millisecond).

Cassandra timeuuid is a Version 1 UUID (Type 1 UUID), which is based on:
A timestamp consisting of a count of 100-nanosecond intervals since
15 October 1582 (the date of Gregorian reform to the Christian
calendar).
A version (which should have a value of 1).
A variant (which should have a value of 2).
A sequence number, which can be a counter or a pseudo-random number.
A "node" which will be the machine's MAC address (which should make the UUID unique across machines).
Using a pseudo-random number for the sequence number means there are 16,384 possible values, so two generators that happen to share the same timestamp and node still have only a 1 in 16,384 chance of picking the same value.
If you generate more than 10,000 UUIDs per millisecond, they may collide.
1 ms = 10^6 ns, so a nanosecond-resolution timestamp could distinguish 10^6 UUIDs per millisecond, but
since the timestamp is a count of 100 ns intervals,
we have at most 10,000 unique timestamps in one millisecond.
Generate more than that on a single machine (which has the same MAC address) and there is a chance of collision, as only the sequence number is left to disambiguate.
If your application generates more than 10,000 UUIDs per millisecond, use another column to make a compound key, which helps avoid collisions.
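To see these fields concretely: java.util.UUID exposes each of them for a version 1 UUID. A minimal sketch, assuming the DataStax Java driver 3.x is on the classpath for UUIDs.timeBased() (the class name TimeUuidParts is just for illustration):

import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class TimeUuidParts {
    public static void main(String[] args) {
        UUID u = UUIDs.timeBased();                               // version 1 (time-based) UUID
        System.out.println("version   : " + u.version());        // 1
        System.out.println("variant   : " + u.variant());        // 2 (IETF variant)
        System.out.println("timestamp : " + u.timestamp());      // 100 ns ticks since 1582-10-15
        System.out.println("clock seq : " + u.clockSequence());  // 14 bits, 16,384 possible values
        System.out.println("node      : " + Long.toHexString(u.node())); // the machine's MAC address, per the spec
    }
}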


Cassandra timeuuid column to nanoseconds precision

My Cassandra table has a timeuuid column; how do I see the value of the timeuuid in nanoseconds?
timeuuid:
49cbda60-961b-11e8-9854-134d5b3f9cf8
49cbda60-961b-11e8-9854-134d5b3f9cf9
How do I convert this timeuuid to nanoseconds? I need a select statement like:
select Dateof(timeuuid) from the table a;
There is a utility method in the driver, UUIDs.unixTimestamp(UUID id), that returns a normal epoch timestamp which can be converted into a Date object.
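For example (a small sketch, assuming the DataStax Java driver 3.x; the UUID literal is just one of the values from the question, and the class name is for illustration):

import com.datastax.driver.core.utils.UUIDs;
import java.util.Date;
import java.util.UUID;

public class TimeUuidToDate {
    public static void main(String[] args) {
        UUID id = UUID.fromString("49cbda60-961b-11e8-9854-134d5b3f9cf8");
        long millis = UUIDs.unixTimestamp(id);   // milliseconds since the Unix epoch
        System.out.println(new Date(millis));    // millisecond precision only
    }
}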
Worth noting that ns precision from the time UUID will not necessarily be meaningful. A type 1 UUID includes a timestamp which is the number of 100-nanosecond intervals since the Gregorian calendar was first adopted at midnight, October 15, 1582 UTC. But the driver takes a 1 ms timestamp (the real precision depends on the OS; it can even be 10 or 40 ms) and keeps a monotonic counter to fill the 10,000 sub-millisecond values, which can end up counting into the future if more than 10,000 values are generated within one millisecond (in practice, performance limitations prevent this). This is much more performant and guarantees no duplicates, especially as sub-millisecond time accuracy in computers is pretty meaningless in a distributed system.
So if you're looking at it from a purely CQL perspective, there's no way to do it without a UDF; not that there is much value in going beyond millisecond precision anyway, so dateOf should be sufficient. If you really want it though:
CREATE OR REPLACE FUNCTION uuidToNS (id timeuuid)
CALLED ON NULL INPUT RETURNS bigint
LANGUAGE java AS '
return id.timestamp();
';
That will give you the number of 100 ns intervals since October 15, 1582. To translate it to nanoseconds since the Unix epoch, multiply it by 100 to convert to nanos and add the offset between the two epochs (-12219292800L * 1_000_000_000 in nanos). This might overflow longs, so you might need to use something different or reorder the arithmetic.
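A small sketch of that conversion in plain Java, doing the subtraction while still in 100 ns units so the intermediate value does not overflow a long (the 12219292800 s constant is the gap between 1582-10-15 and the Unix epoch; the class name is for illustration):

import java.util.UUID;

public class TimeUuidToEpochNanos {
    // 100 ns ticks between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch):
    // 12219292800 seconds * 10,000,000 ticks per second.
    private static final long UUID_EPOCH_OFFSET_TICKS = 12219292800L * 10_000_000L;

    public static long toEpochNanos(UUID timeuuid) {
        long ticksSince1582 = timeuuid.timestamp();               // 100 ns units
        // Subtract the offset while still in 100 ns units, then scale to ns;
        // multiplying by 100 first would overflow a long for current dates.
        return (ticksSince1582 - UUID_EPOCH_OFFSET_TICKS) * 100L;
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("49cbda60-961b-11e8-9854-134d5b3f9cf8");
        System.out.println(toEpochNanos(id) + " ns since the Unix epoch");
    }
}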

SPARK parallelization of algorithm - non-typical, how to

I have a processing requirement that does not seem to fit the nice SPARK parallelization use cases. On the other hand, I may not see how it can be done in SPARK easily.
I am seeking the easiest way to parallelize the following situation:
1. Given a set of N records of record type A,
2. perform some processing on the A records that generates a not yet existing set of initial results, say, J records of record type B. Record type B has a date-range aspect to it.
3. Then repeat the process for the A records not yet processed - the leftovers - for any records generated as part of B, but looking to the left and to the right of the A records.
4. Repeat 3 until no new records are generated.
This may sound odd, but it is nothing more than taking a set of trading records, and deciding for a given computed period Pn, if there is a bull or bear spread evident during this period. Once that initial period is found, then date-wise before Pn and after Pn, one can attempt to look for a bull or bear spread period that precedes or follows the initial Pn period. And so on. It all works correctly.
The algorithm I designed works on inserting records using SQL and some looping. The records generated do not exist initially and get created on the fly. I looked at dataframes and RDDs, but it is not so evident (to me) how one would do this.
Using SQL it is not such a difficult algorithm, but you need to work through the records of a given logical key set sequentially. Thus not a typical SPARK use case.
My questions are then:
How can I achieve at least some parallelization?
Should we use mapPartitions in some way so as to at least get ranges of logical key sets to process, or is this simply not possible given the use case I am attempting to present? I am going to try this, but I feel I may be barking up the wrong tree here. It may just need to be a loop / while in the driver running single-threaded.
Some examples record A's shown in tabular format - as per how this algorithm works:
        Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep
key X    -5    1    0   10    9  -20    0    5    7
would result in record B's being generated as follows:
key X Jan - Feb --> Bear
key X Apr - Jun --> Bull
This falls into the category of non-typical Spark. I solved it via looping within a loop in Spark Scala, but with JDBC usage; it could as well have been a plain Scala JDBC program. There is also a variation with foreachPartition.
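For what it's worth, one way to get at least per-key parallelism is to group the A records by their logical key and run the sequential left/right sweep independently inside each group. The sketch below uses the Spark 2.x Java API; the SpreadJob, Trade and Period names and the empty findSpreads body are hypothetical stand-ins (the actual sweep is the asker's existing SQL logic), so treat it as a shape, not a drop-in solution.

import org.apache.spark.api.java.JavaRDD;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SpreadJob implements Serializable {

    // Minimal stand-in for a record of type A: a logical key, a month index and a value.
    public static class Trade implements Serializable {
        public final String key; public final int month; public final double value;
        public Trade(String key, int month, double value) { this.key = key; this.month = month; this.value = value; }
    }

    // Stand-in for a record of type B: a period classified as Bull or Bear.
    public static class Period implements Serializable {
        public final String key; public final int from; public final int to; public final String kind;
        public Period(String key, int from, int to, String kind) { this.key = key; this.from = from; this.to = to; this.kind = kind; }
    }

    // The sequential part: scan one key's records in order, find an initial period,
    // then expand left and right until no new periods appear (steps 2-4 above).
    static List<Period> findSpreads(String key, List<Trade> sorted) {
        List<Period> out = new ArrayList<>();
        // ... the asker's existing per-key sweep would go here ...
        return out;
    }

    public static JavaRDD<Period> run(JavaRDD<Trade> records) {
        // Parallelism comes from distributing distinct logical keys across the cluster;
        // the iterative expansion stays single-threaded within each key's group.
        return records
                .groupBy(t -> t.key)
                .flatMap(kv -> {
                    List<Trade> sorted = new ArrayList<>();
                    kv._2().forEach(sorted::add);
                    sorted.sort(Comparator.comparingInt((Trade t) -> t.month));
                    return findSpreads(kv._1(), sorted).iterator();
                });
    }
}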

Cassandra TTL not working

I am using the DataStax driver with Cassandra. I want a row to be automatically deleted 15 minutes after its insertion, but the row still remains.
My code is below:
Insert insertStatement = QueryBuilder.insertInto(keySpace, "device_activity");
insertStatement.using(QueryBuilder.ttl(15* 60));
insertStatement.value("device", UUID.fromString(persistData.getSourceId()));
insertStatement.value("lastupdatedtime", persistData.getLastUpdatedTime());
insertStatement.value("devicename", persistData.getDeviceName());
insertStatement.value("datasourcename", persistData.getDatasourceName());
The table consist of 4 columns : device (uuid), datasourcename(text), devicename(text), lastupdatedtime (timestamp).
If I query the TTL of some field it shows me 4126 seconds which is wrong.
//Select TTL(devicename) from device_activity; // Gives me 4126 seconds
In the below link, the explanation of TTL is provided.
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
"TTL data has a precision of one second, as calculated on the server. Therefore, a very small TTL probably does not make much sense. Moreover, the clocks on the servers should be synchronized; otherwise reduced precision could be observed because the expiration time is computed on the primary host that receives the initial insertion but is then interpreted by other hosts on the cluster."
After reading this, I was able to resolve the issue by setting the correct time on the corresponding node (machine).
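Once the clocks are in sync, the TTL read back right after the insert should never exceed the 900 s that was set. A quick way to check from the application (a sketch only, assuming the DataStax Java driver 3.x, the session and persistData objects from the question, and that device is the partition key):

// Execute the insert built above, then read the remaining TTL straight back.
session.execute(insertStatement);

Row row = session.execute(
        "SELECT TTL(devicename) FROM device_activity WHERE device = ?",
        UUID.fromString(persistData.getSourceId())).one();
System.out.println("Remaining TTL: " + row.getInt(0) + " s");   // expected: <= 900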

Cassandra Stress Test results evaluation

I have been using the cassandra-stress tool to evaluate my cassandra cluster for quite some time now.
My problem is that I am not able to comprehend the results generated for my specific use case.
My schema looks something like this:
CREATE TABLE Table_test (
    ID uuid,
    Time timestamp,
    Value double,
    Date timestamp,
    PRIMARY KEY ((ID, Date), Time)
) WITH COMPACT STORAGE;
I have put this schema in a custom YAML profile and used the parameters n=10000 and threads=100, with the rest left as default options (cl=one, mode=native cql3, etc.). The Cassandra cluster is a 3-node CentOS VM setup.
A few specifics of the custom yaml file are as follows:
insert:
  partitions: fixed(100)
  select: fixed(1)/2
  batchtype: UNLOGGED
columnspecs:
  - name: Time
    size: fixed(1000)
  - name: ID
    size: uniform(1..100)
  - name: Date
    size: uniform(1..10)
  - name: Value
    size: uniform(-100..100)
My observations so far are as follows:
With n=10000 and Time: fixed(1000), the number of rows getting inserted is 10 million (10000 * 1000 = 10,000,000).
The number of row keys/partitions is 10,000 (i.e., n), of which 100 partitions are taken at a time (which means 100 * 1000 = 100,000 key-value pairs), out of which 50,000 key-value pairs are processed at a time (because of select: fixed(1)/2, i.e., ~50%).
The output message also confirms the same:
Generating batches with [100..100] partitions and [50000..50000] rows (of[100000..100000] total rows in the partitions)
The results that I get are the following for consecutive runs with the same configuration as above:
Run Total_ops Op_rate Partition_rate Row_Rate Time
1 56 19 1885 943246 3.0
2 46 46 4648 2325498 1.0
3 27 30 2982 1489870 0.9
4 59 19 1932 966034 3.1
5 100 17 1730 865182 5.8
Now what I need to understand are as follows:
Which among these metrics is the throughput i.e, No. of records inserted per second? Is it the Row_rate, Op_rate or Partition_rate? If it’s the Row_rate, can I safely conclude here that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
Why is it that the Total_ops vary so drastically in every run ? Has the number of threads got anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
How do I determine the batch size per thread here? In my example, is the batch size 50000?
Thanks in advance.
Row Rate is the number of CQL Rows that you have inserted into your database. For your table a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).
The Partition Rate is the number of Partitions C* had to construct. A Partition is the data-structure which holds and orders data in Cassandra, data with the same partition key ends up located on the same node. This Partition rate is equal to the number of unique values in the Partition Key that were inserted in the time window. For your table this would be unique values for (ID,Date)
Op Rate is the number of actual CQL operations that had to be done. From your settings it is running unlogged batches to insert the data. Each insert contains approximately 100 partitions (unique combinations of ID and Date), which is why Op Rate * 100 ~= Partition Rate.
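As a rough sanity check against run 1 above (treating these as approximations from the reported figures): 19 ops/s * ~100 partitions per batch ≈ 1,900 partitions/s, close to the reported 1,885; and 1,885 partitions/s * ~500 rows per partition (select: fixed(1)/2 of the 1,000 Time values) ≈ 942,500 rows/s, which matches the reported row rate of 943,246.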
Total OP should include all operations, read and write. So if you have any read operations those would also be included.
I would suggest changing your batch size to match your workload, or keeping it at 1, depending on your actual database usage. This should provide a more realistic scenario. It is also important to run much longer than just 100 total operations to really get a sense of your system's capabilities; some of the biggest difficulties appear when the size of the dataset grows beyond the amount of RAM in the machine.

Azure Table Partitioning Strategy

I am trying to come up with a partition key strategy based on a DateTime that doesn't result in the Append-Only write bottleneck often described in best practices guidelines.
Basically, if you partition by something like YYYY-MM-DD, all your writes for a particular day will end up the same partition, which will reduce write performance.
Ideally, a partition key should even distribute writes across as many partitions as possible.
To accomplish this while still basing the key on a DateTime value, I need to come up with a way to assign what amounts to buckets of datetime values, where the number of buckets is a predetermined number per time interval - say 50 a day. The assignment of a datetime to a bucket should be as random as possible, but always the same for a given value. The reason for this is that I need to be able to always get the correct partition given the original DateTime value. In other words, this is like a hash.
Lastly, and critically, I need the partition key to be sequential at some aggregate level. So while DateTime values for a given interval, say 1 day, would be randomly distributed across X partition keys, all the partition keys for that day would be between a queryable range. This would allow me to query all rows for my aggregate interval and then sort them by the DateTime value to get the correct order.
Thoughts? This must be a fairly well known problem that has been solved already.
How about using the millisecond component of the date time stamp, mod 50? That would give you your random distribution throughout the day, the value itself would be sequential, and you could easily calculate the PartitionKey in the future given the original timestamp.
To add to Eoin's answer, below is the code I used to simulate his solution:
var buckets = new SortedDictionary<int, List<DateTime>>();
var rand = new Random();
for (int i = 0; i < 1000; i++)
{
    var dateTime = DateTime.Now;
    // Assign each timestamp to one of 53 buckets based on its Ticks value.
    var bucket = (int)(dateTime.Ticks % 53);
    if (!buckets.ContainsKey(bucket))
        buckets.Add(bucket, new List<DateTime>());
    buckets[bucket].Add(dateTime);
    Thread.Sleep(rand.Next(0, 20));
}
So this should simulate roughly 1000 requests coming in, each anywhere between 0 and 20 milliseconds apart.
This resulted in a fairly good/even distribution between the 53 "buckets". It also resulted, as expected, in avoiding the append-only or prepend-only anti-pattern.
