Using timeuuid as a unique userid or token - Cassandra

I have read some older answers regarding generating userids. Do you know if it is safe to use timeuuid as a unique identifier? I am planning on using it both for userids and for tokens.
Thanks
Regards

The UUID and timeUUID are safe to use as unique keys. Documentation on the UUID types describes them as follows:
The UUID (universally unique id) comparator type is used to avoid
collisions in column names. Alternatively, you can use the timeuuid.
For more information on how UUIDs work, the Wikipedia article outlines the way UUIDs are generated and why they work. This question is strange, since the odds of a duplicate type 1 UUID are very low. From the Wikipedia entry:
...after generating 1 billion UUIDs every second for the next 100 years,
the probability of creating just one duplicate would be about 50%.
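Wikipedia gives this figure for random (version 4) UUIDs. TimeUUIDs (version 1) get their uniqueness from the timestamp, clock sequence, and node identifier instead, and are similarly unlikely to collide when generated correctly.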

Related

Is UUID or Integer a good choice as partition key?

Two simple questions:
Is a UUID a good choice as a partition key? Will this distribute data evenly among all nodes in the cluster?
Is a (unique) integer a good choice?
Will any of these options create "hot" partitions?
Thanks!
A UUID is a good choice for a partition key: it should be well distributed across the cluster nodes. A "unique" integer is trickier: some node needs to be the authority for generating this number, and this is hard to do in a distributed environment.
Regarding hot partitions, this will depend on your data model. If you have other primary key components besides the partition key, then yes, you may have this problem. For example, you generate a random UUID for a sensor and start writing a lot of data under it.
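The two points above can be illustrated with a small sketch. It assumes a hypothetical four-node cluster and uses MD5 as a stand-in hash; this is not Cassandra's actual partitioner, only a demonstration that many distinct UUIDs spread evenly once hashed, while every row written under a single UUID lands on the same node.

```python
import hashlib
import random
import uuid
from collections import Counter

NODE_COUNT = 4  # hypothetical cluster size


def node_for(partition_key: uuid.UUID) -> int:
    """Map a key to a node with a stand-in hash (not Cassandra's partitioner)."""
    digest = hashlib.md5(partition_key.bytes).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT


def random_uuid(rng: random.Random) -> uuid.UUID:
    """Build a version 4 UUID from a seeded generator so the run is repeatable."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)


rng = random.Random(42)

# Many distinct UUID partition keys: rows spread across all nodes.
spread = Counter(node_for(random_uuid(rng)) for _ in range(10_000))

# One UUID (for example a single sensor) receiving every write:
# the same node takes all of the rows.
sensor_id = random_uuid(rng)
hot = Counter(node_for(sensor_id) for _ in range(10_000))

print("distinct keys:", dict(sorted(spread.items())))
print("single key:   ", dict(hot))
```

Whether the second case is a problem depends on how many rows end up under that one partition key, which is the data-model question raised above.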
I usually tell folks not to use a UUID as a partition key for two simple reasons.
UUIDs are designed to be unique, and thus have a high potential cardinality.
While it does depend on your data model, think about how many rows you're going to have under each UUID, and then ask yourself if you really want to have to supply a full UUID on each and every query.
Again, it's all about the data model. From a DBA's perspective, they'll distribute well. But from a developer's perspective, it can really clamp down on your potential query patterns.
Ultimately, you want your primary key components to allow your model to A) distribute well and B) match your query patterns. If partitioning on a UUID gives you that, then great!

Auto increment primary key when using saveToCassandra()

Is it possible to create an auto-increment primary key in a Cassandra table?
Basically you cannot generate an auto-increment key in Cassandra. It doesn't really make sense in a distributed db, since some central point would need to be responsible for keeping the sequence.
A common way to make keys is to generate a UUID, which is random and makes a collision almost impossible.
From Wikipedia:
for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
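The quoted figure can be checked with the birthday-bound approximation. The sketch below assumes 122 random bits per version 4 UUID (the other 6 bits are fixed by the format) and uses p ≈ 1 - exp(-n² / (2 · 2^122)); the function names are made up for this illustration.

```python
import math

RANDOM_BITS = 122  # a version 4 UUID carries 122 random bits


def collision_probability(count: float, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    return -math.expm1(-(count * count) / (2.0 * 2.0 ** bits))


def count_for_probability(p: float, bits: int = RANDOM_BITS) -> float:
    """Invert the approximation: how many UUIDs give collision probability p."""
    return math.sqrt(-2.0 * 2.0 ** bits * math.log1p(-p))


# Number of UUIDs needed for a one-in-a-billion chance of any duplicate.
print(f"{count_for_probability(1e-9):.3e}")

# Collision probability after generating 103 trillion UUIDs.
print(f"{collision_probability(103e12):.3e}")
```

Both results should land close to the numbers in the quote, under the assumption that the random source is unbiased.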
If you are really, really worried about collisions, then you can insert the new UUID using an "IF NOT EXISTS" lightweight transaction
e.g.
INSERT INTO mytable (uuid, text) VALUES (123e4567-e89b-12d3-a456-426655440000, 'hello') IF NOT EXISTS;

Cassandra and IDs concept

Could you please clarify how ids work in Cassandra?
Relational databases use an id with auto-increment generation.
The id field is tied to table mapping and locking.
As far as I know, Cassandra uses a UUID instead of an id.
Could you please explain the main concept behind UUIDs? Why does Cassandra exclude ids?
Thanks!
The advantage of UUIDs over auto-incrementing integers is that you can generate them in a distributed fashion. When using incrementing integers, there must be a single counter somewhere that always has to be consulted when generating a new ID. With UUIDs you can just generate a new ID anywhere in your cluster and use it right away.
Basically you can think of UUIDs as big random numbers. So it's highly unlikely that two nodes will generate the same ID, even if they are not coordinated.
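As a minimal sketch of that idea, the snippet below lets several workers create random UUIDs with no shared counter and then checks for duplicates. The threads are only stand-ins for independent nodes, and the worker and batch sizes are arbitrary.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

WORKERS = 8              # stand-ins for independent nodes
IDS_PER_WORKER = 10_000  # arbitrary batch size


def generate_ids(_worker: int) -> list:
    # No shared counter and no lock: each worker creates its IDs on its own.
    return [uuid.uuid4() for _ in range(IDS_PER_WORKER)]


with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    batches = list(pool.map(generate_ids, range(WORKERS)))

all_ids = [value for batch in batches for value in batch]
print(len(all_ids), "generated,", len(set(all_ids)), "distinct")
```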
Still, it seems you should familiarize yourself with the concept of keys in Cassandra. Unlike in relational databases, keys in Cassandra are not just there to uniquely identify a record but also to support the way you query your data. Therefore keys in Cassandra are often not a UUID … or not a UUID alone.

Cassandra UUID vs TimeUUID benefits and disadvantages

Given that TimeUUID handily allows you to use now() in CQL, are there any reasons you wouldn't just go ahead and always use TimeUUID instead of plain old UUID?
UUID and TIMEUUID are stored the same way in Cassandra, and they only really represent two different sorting implementations.
TIMEUUID columns are sorted by their time components first, and then by their raw bytes, whereas UUID columns are sorted by their version first, then if both are version 1 by their time component, and finally by their raw bytes. Curiously, the time component sorting implementations are duplicated between UUIDType and TimeUUIDType in the Cassandra code, except for different formatting.
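A short sketch of why a time-aware comparison matters: in a version 1 UUID the low 32 bits of the timestamp come first in the byte layout, so raw byte order does not follow chronological order. This does not reproduce Cassandra's comparators; `version1_from_timestamp` is a hypothetical helper, and the clock sequence and node values are arbitrary.

```python
import uuid


def version1_from_timestamp(timestamp: int) -> uuid.UUID:
    """Build a version 1 UUID from a 60-bit timestamp (hypothetical helper).

    The clock sequence and node are fixed, arbitrary values for illustration.
    """
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    clock_seq_hi_variant = 0x80  # RFC 4122 variant bits
    clock_seq_low = 0x01
    node = 0x0123456789AB
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


# Two timestamps one tick apart, straddling a rollover of the low 32 bits.
earlier = version1_from_timestamp(0x0000_0000_FFFF_FFFF)
later = version1_from_timestamp(0x0000_0001_0000_0000)

by_raw_bytes = sorted([earlier, later], key=lambda u: u.bytes)
by_time = sorted([earlier, later], key=lambda u: u.time)

print("raw byte order:", [str(u) for u in by_raw_bytes])
print("time order:    ", [str(u) for u in by_time])
```

Sorting by raw bytes puts the later value first here, while sorting by the embedded timestamp keeps the chronological order.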
I think of the UUID vs. TIMEUUID question primarily as documentation: if you choose TIMEUUID you're saying that you're storing things in chronological order, and that these things can occur at the same time, so a simple timestamp isn't enough. Using UUID says that you don't care about order (even if in practice the columns will be ordered by time if you put version 1 UUIDs in them), you just want to make sure that things have unique IDs.
Even if using NOW() to generate UUID values is convenient, it's also very surprising to other people reading your code.
It probably does not matter much in the grand scheme of things, but sorting non-version 1 UUIDs is a bit faster than version 1, so if you have a UUID column and generate the UUIDs yourself, go for another version.
A TimeUUID is a plain old UUID according to the documentation.
A UUID is simply a 128-bit value. Think of it as an unimaginably large number.
The particular bits may be determined by any of several methods. The original method involved taking the MAC address of the computer's networking hardware, combining the current date and time, plus an arbitrary number and a random number. Squish all that together to get a virtually unique number.
Later, for various reasons (security, privacy), other methods were invented to assemble the bits when generating a UUID value. These other methods omit date-time and/or MAC address as an ingredient. The point being: Not all UUID values have an embedded date-time value.
The Cassandra doc incorrectly refers to its TimeUUID being a "Type 1 UUID". The correct term is Version 1 UUID. This version is sometimes called the "time-based version".
A Bit Of Advice
Cassandra seems to identify this specific version of UUID for the purpose of extracting the date and time portion of the 128-bits. Extracting the date-time from a UUID is a bad idea.
For one thing, UUID was never intended to be used for such history tracking. Indeed, the spec for UUID specifically recognizes that (a) computer clocks can be reset and therefore (b) UUIDs generated later may actually record an earlier date-time than previous UUIDs. Another reason not to extract the date-time from a UUID is that you may well have UUIDs that were not generated by the time method, so you would be building a date-time value from bits that do not in fact represent the date-time of creation. A third reason is that when programming code is later refactored, the UUID may be generated at a different time than the database record, so using the UUID's date-time would be misleading.
If you need to track date-time history, do so explicitly. Create a date-time field in your data. By the way, track that date-time in UTC, but that’s another topic.
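To make the second caveat concrete, here is a sketch of what extracting the date-time involves and why the version must be checked first. The function name `embedded_datetime` is made up for this example; the offset constant is the number of 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch.

```python
import datetime
import uuid

# 100-ns intervals between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000

UNIX_EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def embedded_datetime(value: uuid.UUID):
    """Return the UTC date-time embedded in a version 1 UUID, else None."""
    if value.version != 1:
        # Other versions carry no timestamp; decoding those bits would
        # produce a meaningless date.
        return None
    unix_100ns = value.time - GREGORIAN_TO_UNIX_100NS
    return UNIX_EPOCH + datetime.timedelta(microseconds=unix_100ns // 10)


time_based = uuid.UUID("49cbda60-961b-11e8-9854-134d5b3f9cf8")
random_based = uuid.uuid4()

print(embedded_datetime(time_based))
print(embedded_datetime(random_based))
```

Even when the result is a valid date, it only reflects the generating machine's clock at the moment the UUID was created, which is why an explicit date-time field remains the safer record.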
All that said, you need to generate some to see how they behave. Timeuuids are version 1 UUIDs, and consecutive values seem to differ only in the first 8 characters, as you can see below, so there is some chance of a conflict; even so, a timeuuid is better than using a timestamp by itself. If randomness matters, a version 4 UUID is a better choice, with a vanishingly small chance of collision.
So if you don't care about uniqueness across partitions, and your partitions hold wide-row time-series data with heavy writes where each event needs a unique identifier, a timeuuid is a good choice that also brings the benefits of clustering, pagination, and so on.
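-- column names (id, event_id) are hypothetical; the original omits them
INSERT INTO test_tuuid (id, event_id) VALUES (1, now());
INSERT INTO test_tuuid (id, event_id) VALUES (1, now());
INSERT INTO test_tuuid (id, event_id) VALUES (1, now());
INSERT INTO test_tuuid (id, event_id) VALUES (1, now());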
49cbda60-961b-11e8-9854-134d5b3f9cf8
49d1a6c1-961b-11e8-9854-134d5b3f9cf8
49d59e61-961b-11e8-9854-134d5b3f9cf8
49d8d2b1-961b-11e8-9854-134d5b3f9cf8
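The four values above can be taken apart with a short sketch to see which fields change. The varying first group is the low 32 bits of the timestamp rather than random data, while the clock sequence and node stay fixed for the generating process.

```python
import uuid

samples = [
    uuid.UUID(text)
    for text in (
        "49cbda60-961b-11e8-9854-134d5b3f9cf8",
        "49d1a6c1-961b-11e8-9854-134d5b3f9cf8",
        "49d59e61-961b-11e8-9854-134d5b3f9cf8",
        "49d8d2b1-961b-11e8-9854-134d5b3f9cf8",
    )
]

for value in samples:
    print(
        f"version={value.version} "
        f"time_low={value.time_low:08x} "
        f"time_mid={value.time_mid:04x} "
        f"clock_seq={value.clock_seq:04x} "
        f"node={value.node:012x}"
    )

# Gaps between consecutive embedded timestamps, in milliseconds
# (the timestamp counts 100-nanosecond intervals).
gaps_ms = [(b.time - a.time) / 10_000 for a, b in zip(samples, samples[1:])]
print("gaps (ms):", gaps_ms)
```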

Cassandra: Generate a unique ID?

I'm working on a distributed database. I'm trying to generate a unique ID that will serve as a column family primary key in Cassandra.
I read some articles about doing this with Java using UUID but it seems like there is a probability for collision (even if it's very low).
I wonder if there is a way to generate a unique ID based on time maybe?
You can use the TimeUUID type in Cassandra, which holds a Type 1 UUID. This is built from the current time, the creator's MAC address, and a sequence number. If the TimeUUID is generated correctly, this can be done with zero collisions (you can use the CQL now() function or insert your own value; the Java SDKs provide some thread-safe implementations). The main advantage of TimeUUIDs is that the IDs can be time ordered. See http://wiki.apache.org/cassandra/TimeBaseUUIDNotes for more info.
However, the time ordering is unlikely to be useful for row primary keys: the ordering is lost when using a hash partitioner, though it can be exploited through a clustering key. The complexity of generating a unique ID could also be a source of bugs if you roll your own. Cassandra also supports Type 4 UUIDs through the UUID type. These are just random bits. There is a collision probability, but (assuming uncorrelated random number sources, which they will be if you generate in Java) it is extremely low: if you created 1 billion a second for 100 years, the probability of one collision is about 50%. (See http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates for more details.)
You should investigate using Twitter Snowflake. From the project readme:
As we at Twitter move away from Mysql towards Cassandra, we've needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.
Snowflake uses an intuitive algorithm that generates longs which are both time-ordered and unique. Since your database is distributed, this service should suit your needs well.
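The general idea can be sketched as follows. This is not Twitter's implementation: the bit layout (41 bits of milliseconds, 10 bits of worker id, 12 bits of sequence) is the commonly described one and is treated here as an assumption, the epoch is an arbitrary choice, and the class name is made up.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # hypothetical custom epoch: 2020-01-01 UTC
WORKER_BITS = 10              # assumed layout, see the note above
SEQUENCE_BITS = 12
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1


class SnowflakeLikeGenerator:
    """Time-ordered 64-bit IDs: milliseconds | worker id | sequence."""

    def __init__(self, worker_id: int, clock=None):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeLikeGenerator(worker_id=7)
ids = [generator.next_id() for _ in range(5)]
print(ids)
```

Uniqueness across machines relies on every generator being given a distinct worker id, which is the one piece of coordination this scheme still needs.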
As Richard said, you can use TimeUUID, and generating a TimeUUID value is not a big deal. Just follow the Cassandra FAQ entry on timeuuid.
Use the Cassandra function now() to generate a timeuuid, and the uuid() function to generate a value of type uuid.
