How to generate UUID(Long) using cassandra timestamp in cluster environment? - cassandra

I have the requirement where we need to generate UUID as Long value using Java based on Cassandra timestamp which is in cluster. Can anyone help how to geranate it using java and cassandra cluster timestamp combination?

Use TimeUUID cql3 data type:
A value of the timeuuid type is a Type 1 UUID. A type 1 UUID includes the time of its generation and are sorted by timestamp, making them ideal for use in applications requiring conflict-free timestamps. For example, you can use this type to identify a column (such as a blog entry) by its timestamp and allow multiple clients to write to the same partition key simultaneously. Collisions that would potentially overwrite data that was not intended to be overwritten cannot occur.
In Java you can use UUIDs helper class from com.datastax.driver.core.utils.UUIDs:
UUIDs.timeBased()

Related

Cassandra: Does timeuuid preserve order?

I was using timestamp as primary key for my data by calling toTimestamp(now()), but unfortunately this creates collision.
I understand that timeuuid guarantees uniqueness, but if I do ORDER BY timeuuid, does timeuuid also guarantee the original order?
From the docs:
Timeuuid types can be entered as integers for CQL input. A value of the timeuuid type is a Version 1 UUID. A Version 1 UUID includes the time of its generation and are sorted by timestamp, making them ideal for use in applications requiring conflict-free timestamps. For example, you can use this type to identify a column (such as a blog entry) by its timestamp and allow multiple clients to write to the same partition key simultaneously. Collisions that would potentially overwrite data that was not intended to be overwritten cannot occur.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/uuid_type_r.html
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html

How to store microsecond level timestamps in cassandra?

I'm trying to store data in cassandra which contains microsecond level timestamps.
Cassandra's docs say that the 'timestamp' data type can store milliseconds since epoch but several messages on the internet seem to imply that cassandra can natively store microsecond timestamps.
What is the best way practice for storing microsecond level times in cassandra? Should I just leave out the date part and store a long?
I'm trying to sore a columsn which look like this:
2015-11-18 07:30:46.700824
I get the following error:
ErrorMessage code=2200 [Invalid query] message="unable to coerce '2015-11-18 07:30:18.261543' to a formatted date (long)"
Aborting import at record #1. Previously inserted records are still present, and some records after that may be present as well.
My cassandra version:
[cqlsh 5.0.1 | Cassandra 2.1.11 | CQL spec 3.2.1 | Native protocol v3]
EDIT:
Here is an example of microsecond confusion in Cassandra's own docs:
"CAS and new features in CQL such as DROP COLUMN assume that cell timestamps are microseconds-since-epoch"
https://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeChangesC_c.html
Another: https://issues.apache.org/jira/browse/CASSANDRA-8297
EDIT2:
I should mention that I intend to query this using spark. From what I understand, spark parses its own flavor of sql and translates it to cassandra (although I'm using CassandraContext in zeppelin). Anything which might help or hinder my search for microsecond level timestmaps?
You can use bigint or a timeuuid. Type 1 uuid's have 100ns precision so it can cover you. Some utilities, libraries, convenience functions may not give you what you need though so be prepared to write some uuid functions.

Cassandra: Data Type for Partition Key - Decimal or UUID

I want to describe the problem I am working on first:
Currently I try to find a strategy that would allow me to migrate data from an existing PostgreSQL database into a Cassandra cluster. The primary key in the PostgreSQL is a decimal value with 25 digits. When I migrate the data, it would be nice if I could keep the value of the current primary key in one way or another and use it to uniquely identify the data in Cassandra. This key should be used as the partition key in Cassandra (no other columns are involved in the table I am talking about). After doing some research, I found out that a good practise is to use UUIDs in Cassandra. So now I have two possible solutions to solve my problem:
I can either create a transformation rule, that would transfer my current decimal primary keys from the PostgrSQL database into UUIDs for Cassandra. Everytime someone requests to access some of the old data, I would have to reapply the transformation rule to the key and use the UUID to search for the data in Cassandra. The transformation would happen in an application server, that manages all communication with Cassandra (so no client will talk to Cassandra directly) New data added to Cassandra would of course be stored with an UUID.
The other solution, which I already have implemented in Java at the moment, is to use a decimal value as the partition key in Cassandra. Since it is possible, that multiple application servers will talk to Cassandra concurrently, my current approach is to generate a UUID in my application and transform it into a decimal value. Using this approach, I could simply reuse all the existing primary keys form PostgreSQL.
I cannot simply create new keys for the existing data, since other applications have stored their own references to the old primary key values and will therefore try to request data with those keys.
Now here is my question: Both approaches seem to work and end up with unique keys to identify my data. The distribution of data across all node should also be fine. But I wonder, if there is any benefit in using a UUID over a decimal value as partition key or visa versa. I don't know exactly what Cassandra does to determine the hash value of the partition key and therefore cannot determine if any data type is to be preferred. I am using the Murmur3Partitioner for Cassandra if that is relevant.
Does anyone have any experience with this issue?
Thanks in advance for answers.
There are two benefits of UUID's that I know of.
First, they can be generated independently with little chance of collisions. This is very useful in distributed systems since you often have multiple clients wanting to insert data with unique keys. In RDBMS we had the luxury of auto-incrementing fields to give uniqueness since that could easily be done atomically, but in a distributed database we don't have efficient global atomic locks to do that.
The second advantage is that UUID's are fairly efficient in terms of storage, and only require eight bytes.
As long as your old decimal values are unique, you should be able to use them as partition keys.

Generating timeuuids based on old timestamps cassandra

I am porting in data from an old postgres system onto cassandra. How do I create TimeUUIDs based on existing timestamps? I want to use existing timestamps as the sorting key. Or am I better off storing the exact timestamp instead of generating the TimeUUID?
CQL3 provides functions for converting a timestamp into a UUID, these are called minTimeuuid and maxTimeuuid
SELECT * FROM myTable
WHERE t > maxTimeuuid('2013-01-01 00:05+0000')
AND t < minTimeuuid('2013-02-02 10:00+0000')
If you are using a datastax driver like the java-driver there should be a utility class like UUIDs which provides capabilities for converting between TimeUUIDs and timestamps.

An Approach to Cassandra Data Model

Please note that I am first time using NoSQL and pretty much every concept is new in this NoSQL world, being from RDBMS for long time!!
In one of my heavy used applications, I want to use NoSQL for some part of the data and move out from MySQL where transactions/Relational model doesn't make sense. What I would get is, CAP [Availability and Partition Tolerance].
The present data model is simple as this
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (interger)|
We can safely assume that this part of application is similar to Logging of the Activity!
I would like to move this to NoSQL as per my requirements and separate from Performance Oriented MySQL DB.
Cassandra says, everything in it is simple Map<Key,Value> type! Thinking in terms of Map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as key and store the rest of the data in values!
After reading through User Defined Types in Cassandra, can I use UserDefinedType as value which essentially leverage as One Key and multiple values! Otherwise, Use it as normal column level without UserDefinedType! One idea is to use the same model for different applications across systems where it would be simple logging/activity data can be pushed to the same, since the key varies from application to application and within application each entity will be unique!
No application/business function to access this data without Key, or in simple terms no requirement to get data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp .... );
Note the primary key. There's multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster wide query, which may be slow or most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections - sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections. e.g. A Person may have a map of addresses: map. You would typically store info in columns if you needed to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column which would let you store "arbitrary" key-value data; which is what it seems you're looking to do.
One thing to watch out for... your primary key is unique per records. If you do another insert with the same pk, you won't get an error, it'll simply overwrite the existing data. Everything in cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources....so you should be able to aggregate data across mysql and cassandra for analytics).
Lastly, if your data is time series log data, cassandra is a very very good choice.

Resources