We are investigating migrating a system from an RDBMS to Cassandra and are having trouble finding a way to convert an auto-increment column to Cassandra. We actually have no need for this to be sequential at all; it can even contain characters, but it must be short (ideally under 8 chars) and globally unique. An ideal value would look something like
AB123456
First part of the question is should we be generating this key in application code or in Cassandra?
Second part:
If Cassandra, how?
If application code, is it an acceptable pattern to generate a candidate key and attempt an insert, and if a collision occurs, regenerate the candidate and retry?
The common way to do this in Cassandra is to use a uuid (or timeuuid if the IDs should be time ordered). But these must be long to get uniqueness - they are 16 bytes long. (uuids are unique because the probability of a collision is so low; timeuuids are guaranteed unique since they contain information about the generating host and include time.)
If you need a shorter key, you can't reliably avoid collisions by checking before inserting. There will always be race conditions where this fails without external coordination. Coming in Cassandra 2.0 is compare-and-set (INSERT ... IF NOT EXISTS), which will let you do this safely, but at a performance cost.
If you use a random 8-character string containing only numbers and letters, there are 36^8 possible keys, with collisions becoming likely after about sqrt(36^8) ≈ 1.7 million insertions (the birthday bound). You can improve this by allowing any byte value, giving 256^8 possible keys, with collisions likely after about sqrt(256^8) ≈ 4 billion insertions. That is probably still too low, so it would be better to use longer IDs.
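If you do go the application route, the generate-and-retry pattern from the question becomes safe once compare-and-set is available, because the existence check and the insert are then one atomic operation. Here is a rough sketch using the DataStax Java driver (3.x API) against a hypothetical short_codes table with a single text primary key column:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import java.security.SecureRandom;

public class ShortCodeGenerator {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    // Build a random 8-character candidate key.
    static String candidate() {
        StringBuilder sb = new StringBuilder(8);
        for (int i = 0; i < 8; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Retry until our conditional insert wins; IF NOT EXISTS makes the
    // check-and-insert atomic, so there is no race with other writers.
    static String insertUnique(Session session) {
        while (true) {
            String code = candidate();
            ResultSet rs = session.execute(
                "INSERT INTO short_codes (code) VALUES (?) IF NOT EXISTS", code);
            if (rs.wasApplied()) {
                return code;
            }
        }
    }
}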
Related
I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether. By dummy, I mean a column whose only purpose is to contain the same value for all rows, thereby putting all data on 1 node and giving the lowest possible cardinality. For example:
 dummy | id | name
-------+----+------------
     0 | 01 | 'Oliver'
     0 | 02 | 'James'
     0 | 03 | 'Nicholls'
The two main points in regards to why you should avoid dummy partition keys are:
1) You end up with data "hot-spots". There is a lot of data stored on 1 node so there's more traffic around that node and you have poor distribution around the cluster.
2) Partition space is finite. If you put all data on one partition, it will eventually be incapable of storing any more data.
I can understand these points and I agree that you definitely want to avoid those situations, so I put this idea out of my mind and tried to think of a good partition key for my table. The table in question stores sites and there are two common ways that table gets queried in our system. Either a single site is requested or all sites are requested.
This puts me in a bit of an awkward situation, because the table is either queried on nothing or the site ID, and making a unique field the partition key would give me very high cardinality and high latency on queries that request all sites.
So I decided that I'd just choose an arbitrary field that would give relatively low cardinality, even though it doesn't reflect how the data will actually be queried, just because it's better than having a cardinality that is either excessively high or excessively low. This approach also has problems though.
I could partition my data on column x, but we have numerous clients, all of whom use our system differently, so x for 1 client could give the results I'm after, but could give awful results for another.
At this point I'm running out of options. I need a field in my table that will be consistent for all clients; however, this field doesn't exist, so I'm now considering having a new field that will contain a random number from 1-3 and then partitioning on that field, which is essentially just a dummy field. The only difference is that I want to randomise the values a little bit so as to avoid hot-spots and unbounded row growth.
I know this is a data-modelling question and it varies from system to system, and of course there are going to be situations where you have to choose the lesser of two evils (there is no perfect solution), but what I'm really focussed on with this question is:
Are dummy partition keys something that should outright never be a consideration in Cassandra, or are there situations in which they're seen as acceptable? If you think the former, then how would you approach this situation?
I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether.
I'm going to go out on a limb and guess that your search has yielded my article We Shall Have Order!, where I made my position on the use of "dummy" partition keys quite clear. Bearing that in mind, I'll try to provide some alternate solutions.
I see two potential problems to solve here. The first:
I need a field in my table that will be consistent for all clients; however, this field doesn't exist
Typically this is solved by duplicating your data into another query table. That's the best way to serve multiple, varying query patterns. If you have one client (service?) that needs to query that table by site id, then you could have that table duplicated into a table called sites_by_id.
CREATE TABLE sites_by_id (
    id BIGINT,
    name TEXT,
    PRIMARY KEY (id));
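Reading from that table is then a single-partition lookup. A minimal sketch with the DataStax Java driver (3.x API), assuming an existing Session:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SiteLookup {
    // Fetch one site by its partition key; this touches exactly one partition.
    static String siteName(Session session, long siteId) {
        PreparedStatement ps =
            session.prepare("SELECT name FROM sites_by_id WHERE id = ?");
        Row row = session.execute(ps.bind(siteId)).one();
        return row == null ? null : row.getString("name");
    }
}

(In real code you would prepare the statement once and reuse it.)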
The other problem is this query pattern:
all sites are requested
Another common Cassandra anti-pattern is that of unbound SELECTs (SELECT query without a WHERE clause). I am sure you understand why these are bad, as they require all nodes/partitions to be read for completion (which is probably why you are looking into a "dummy" key). But as the table supporting these types of queries increases in size, they will only get slower and slower over time...regardless of whether you execute an unbound SELECT or use a "dummy" key.
The solution here is to re-examine your data model and business requirements. Perhaps your data can be split up into sites by region or country? Maybe your client really only needs the sites that have been updated this year? Obtaining some more details on the client's query requirements may help you find a good partitioning key for them to use. Otherwise, if they really do need all of them all of the time, then doanduyhai's suggestion of using Spark will better fit your use case.
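For illustration, here is what a region-based query table might look like. This is a sketch only; the table name, columns, and grouping attribute are assumptions, and your real grouping attribute may differ:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class SitesByRegion {
    // Hypothetical query table: one partition per region, sites clustered by id.
    static void createTable(Session session) {
        session.execute(
            "CREATE TABLE IF NOT EXISTS sites_by_region ("
          + "  region TEXT,"
          + "  id BIGINT,"
          + "  name TEXT,"
          + "  PRIMARY KEY ((region), id))");
    }

    // "All sites in a region" becomes a single-partition query
    // instead of a cluster-wide scan.
    static ResultSet sitesIn(Session session, String region) {
        return session.execute(
            "SELECT id, name FROM sites_by_region WHERE region = ?", region);
    }
}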
or all sites are requested
So basically you have a full table scan scenario. Isn't Apache Spark over Cassandra a better fit for this use case? I suspect it's an analytics use case, isn't it?
As far as I understand, you want to access a single site by its id, in which case lookup by partition key is ideal. The other use case, which requires fetching all the sites, is better suited to Spark.
We have some entity uniquely identified by a generated UUID. We need to support a find-by-name query, and sorting should also be by name.
We know that there will be no more than 1000 entities of that type, which can fit perfectly in one row. Is it a viable idea to hardcode the partition key, use name as a clustering column, and add id as a second clustering column to guarantee uniqueness? Let's say we need a school entity. Here is an example:
CREATE TABLE school (
    constant text,
    name text,
    id uuid,
    description text,
    location text,
    PRIMARY KEY ((constant), name, id)
);
The initial query would be "give me all schools", and then filtering by exact name would happen. Our reasoning behind this was to place all schools in a single row for fast access, have name as a clustering column for filtering, and have id as a clustering column to guarantee uniqueness. We can use constant = 'school' as a known hardcoded value to access this row.
What I like about this solution is that all values are in one row and we get fast reads. Also, we can solve sorting easily with the clustering column. What I do not like is the hardcoded value for constant, which seems odd. We could use name as the partition key, but then we would have 1000 records spread across a number of partitions; find-all without a name would probably be slower, and the results would not be sorted.
Question 1
Is this a viable solution, and are there any problems with it which we do not see? I have not seen any examples of Cassandra data modelling with a hardcoded partition key, which is probably why we are doubting this solution.
Question 2
Name is an editable field. It will probably be changed rarely (someone can make a typo or a school can change its name), but it can change. What is the best way to achieve this? A delete and insert inside a batch (an LWT conditional clause can be applied, since it is the same partition)?
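Here is a rough sketch of the delete-plus-insert we have in mind, using the DataStax Java driver (3.x API). Both statements hit the same partition, so a single logged batch keeps them together:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import java.util.UUID;

public class SchoolRename {
    // name is part of the clustering key, so it cannot be updated in place;
    // delete the old row and insert the new one in a logged batch.
    static void rename(Session session, UUID id, String oldName, String newName,
                       String description, String location) {
        BatchStatement batch = new BatchStatement(); // LOGGED by default
        batch.add(new SimpleStatement(
            "DELETE FROM school WHERE constant = 'school' AND name = ? AND id = ?",
            oldName, id));
        batch.add(new SimpleStatement(
            "INSERT INTO school (constant, name, id, description, location) "
          + "VALUES ('school', ?, ?, ?, ?)",
            newName, id, description, location));
        session.execute(batch);
    }
}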
Yes this is a good approach for such a small dataset. Just because Cassandra can partition large datasets across multiple nodes does not mean that you need to use that ability for every table. By using a constant for the partition key, you are telling Cassandra that you want the data to be stored on one node where you can access it quickly and in sorted order. Relational databases act on data in a single node all the time, so this is really not such an unusual thing to do.
For safety you will probably want to use a replication factor higher than one, so that there are at least two copies of the single partition. That way you will not lose access to the data if the one node where it is stored goes down.
This approach could cause problems if you expect to have a lot of clients (i.e. thousands of clients) frequently reading and writing to this table, since it could become a hot spot. With only 1000 records you can probably keep all the rows cached in memory by setting the table to cache all keys and rows.
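For example, a sketch of the caching change (Cassandra 2.1+ syntax; row caching also needs row_cache_size_in_mb set greater than zero in cassandra.yaml):

import com.datastax.driver.core.Session;

public class SchoolCaching {
    // Cache all keys and all rows of this small table in memory.
    static void enableCaching(Session session) {
        session.execute("ALTER TABLE school WITH caching = "
            + "{'keys': 'ALL', 'rows_per_partition': 'ALL'}");
    }
}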
You probably won't find a lot of examples where this is done because people move to Cassandra for the support of large datasets where they want the scalability that comes from using multiple partitions. So examples are geared towards that.
Is this a viable solution, and are there any problems with it which we do not see? I have not seen any examples of Cassandra data modelling with a hardcoded partition key, which is probably why we are doubting this solution.
I briefly addressed this type of modeling solution earlier this year in my article: We Shall Have Order! This is what is known as a "dummy key," where each row has the same partition key. This is a shortcut that allows you to easily order all of your rows (on an unbound SELECT *) by clustering column(s).
Problems with this solution:
Cassandra allows a maximum of 2 billion column values per partition key. When using a dummy partition key, you will approach this limit with each value that you add.
Your data will all be stored in the same partition, which will create a "hot spot" (large groupings of data) in your cluster. This means that your data model will immediately void one of Cassandra's main benefits...data distribution. This will also complicate load balancing (the same nodes and ranges will keep serving all of your requests).
I can see that your model is designed around a SELECT * query. Cassandra works best when you can give it specific keys to query by. Unbound SELECT * queries (queries without WHERE clauses) are not a good idea to be doing with Cassandra, as they can lead to timeouts (as your data grows).
From reading through your question, I know that you're going to say that you're only using it for 1000 rows. That your dataset won't ever grow much beyond those 1000 rows, so you won't hit any of the roadblocks that I have mentioned.
So then I have to wonder, why are you using Cassandra? As a Cassandra MVP, that's a question I don't ask often. But you don't have an especially large data set (which is what Cassandra is designed to work with). Relying on that fact as a reason to use a product incorrectly is not really the best solution.
Honestly, I am going to recommend that you save yourself some complexity, and use a RDBMS instead. That will fit your use case significantly better than Cassandra will. Then you can update and order by whatever fields you wish.
Given that TimeUUID handily allows you to use now() in CQL, are there any reasons you wouldn't just go ahead and always use TimeUUID instead of plain old UUID?
UUID and TIMEUUID are stored the same way in Cassandra, and they only really represent two different sorting implementations.
TIMEUUID columns are sorted by their time components first, and then by their raw bytes, whereas UUID columns are sorted by their version first, then, if both are version 1, by their time component, and finally by their raw bytes. Curiously, the time component sorting implementations are duplicated between UUIDType and TimeUUIDType in the Cassandra code, except for different formatting.
I think of the UUID vs. TIMEUUID question primarily as documentation: if you choose TIMEUUID you're saying that you're storing things in chronological order, and that these things can occur at the same time, so a simple timestamp isn't enough. Using UUID says that you don't care about order (even if in practice the columns will be ordered by time if you put version 1 UUIDs in them), you just want to make sure that things have unique IDs.
Even though using NOW() to generate UUID values is convenient, it can be very surprising to other people reading your code.
It probably does not matter much in the grand scheme of things, but sorting non-version 1 UUIDs is a bit faster than version 1, so if you have a UUID column and generate the UUIDs yourself, go for another version.
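To illustrate (a small sketch; UUIDs.timeBased() is the DataStax Java driver's version 1 generator, while java.util.UUID.randomUUID() produces version 4):

import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class UuidVersions {
    public static void main(String[] args) {
        UUID random = UUID.randomUUID();    // version 4: random bits only
        UUID timeBased = UUIDs.timeBased(); // version 1: time + node, chronological in a timeuuid column
        System.out.println(random + " -> version " + random.version());       // 4
        System.out.println(timeBased + " -> version " + timeBased.version()); // 1
    }
}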
A TimeUUID is a plain old UUID according to the documentation.
A UUID is simply a 128-bit value. Think of it as an unimaginably large number.
The particular bits may be determined by any of several methods. The original method involved taking the MAC address of the computer's networking hardware, combining the current date and time, plus an arbitrary number and a random number. Squish all that together to get a virtually unique number.
Later, for various reasons (security, privacy), other methods were invented to assemble the bits when generating a UUID value. These other methods omit date-time and/or MAC address as an ingredient. The point being: Not all UUID values have an embedded date-time value.
The Cassandra doc incorrectly refers to its TimeUUID being a "Type 1 UUID". The correct term is Version 1 UUID. This version is sometimes called the "time-based version".
A Bit Of Advice
Cassandra seems to identify this specific version of UUID for the purpose of extracting the date and time portion of the 128-bits. Extracting the date-time from a UUID is a bad idea.
For one thing, UUID was never intended to be used for such history tracking. Indeed, the spec for UUID specifically recognizes that (a) computer clocks can be reset, and therefore (b) UUIDs generated later may actually record an earlier date-time than previous UUIDs. Another reason not to extract a date-time from a UUID is that you may well have UUIDs that were not generated by the time method, so you would be building a date-time value from bits that do not in fact represent the date-time of creation. A third reason is that when programming code is later refactored, the UUID may be generated at a different time than the database record, so using the UUID's date-time would be misleading.
If you need to track date-time history, do so explicitly. Create a date-time field in your data. By the way, track that date-time in UTC, but that’s another topic.
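For example, a sketch against a hypothetical events table (id uuid PRIMARY KEY, created_at timestamp, payload text); the DataStax 3.x driver stores java.util.Date values as UTC instants in a timestamp column:

import com.datastax.driver.core.Session;
import java.util.Date;
import java.util.UUID;

public class EventWriter {
    // Record the creation time in its own column instead of
    // decoding it from the UUID later.
    static void insertEvent(Session session, String payload) {
        session.execute(
            "INSERT INTO events (id, created_at, payload) VALUES (?, ?, ?)",
            UUID.randomUUID(), new Date(), payload);
    }
}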
All said, you need to generate some to believe it. Timeuuids (Version 1 UUIDs) only seem to vary in their first 8 characters when generated in quick succession, as you can see below, so there is some chance of conflict, but a timeuuid is still better than using a timestamp itself. If randomness is important, a Version 4 UUID is a better choice, with collisions being almost improbable.
So it feels like, if you don't care about uniqueness across partitions and your partitions are wide-row time series data with high write rates that need a unique identifier for each event (time), a timeuuid is a good choice that also has the benefit of clustering, pagination, etc.
-- assuming a table like: CREATE TABLE test_tuuid (id int, t timeuuid, PRIMARY KEY (id, t))
INSERT INTO test_tuuid (id, t) VALUES (1, now());
INSERT INTO test_tuuid (id, t) VALUES (1, now());
INSERT INTO test_tuuid (id, t) VALUES (1, now());
INSERT INTO test_tuuid (id, t) VALUES (1, now());
49cbda60-961b-11e8-9854-134d5b3f9cf8
49d1a6c1-961b-11e8-9854-134d5b3f9cf8
49d59e61-961b-11e8-9854-134d5b3f9cf8
49d8d2b1-961b-11e8-9854-134d5b3f9cf8
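You can verify the structure of those values programmatically: java.util.UUID exposes the version, timestamp, and node fields of version 1 UUIDs.

import java.util.UUID;

public class InspectTimeuuid {
    public static void main(String[] args) {
        UUID u = UUID.fromString("49cbda60-961b-11e8-9854-134d5b3f9cf8");
        System.out.println(u.version());   // 1 (time-based)
        System.out.println(u.timestamp()); // 100-ns intervals since 1582-10-15
        System.out.println(Long.toHexString(u.node())); // 134d5b3f9cf8, shared by all four IDs above
    }
}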
I'm working on a distributed database. I'm trying to generate a unique ID that will serve as a column family primary key in Cassandra.
I read some articles about doing this with Java using UUID, but it seems like there is a probability of collision (even if it's very low).
I wonder if there is a way to generate a unique ID based on time maybe?
You can use the TimeUUID type in Cassandra, which backs a Type 1 UUID. This uses the current time, the creator's MAC address, and a sequence number. If the TimeUUID is generated correctly, this can be done with zero collisions (you can use the CQL now() method or insert your own; the Java SDKs provide some thread-safe implementations). The main advantage of TimeUUIDs is that the IDs can be time ordered. See http://wiki.apache.org/cassandra/TimeBaseUUIDNotes for more info.
However, the time ordering is unlikely to be useful for row primary keys, since the ordering is useless when using a hash partitioner, though possible using a clustering key. And also the complexity of generating a unique ID could be a source of bugs if you roll your own. Cassandra also supports Type 4 UUIDs by using the UUID type. These are just random bits. There is a collision probability, but the collision probability (assuming uncorrelated random number sources, which it will be if you generate in Java) is extremely low - if you created 1 billion a second for 100 years the probability of one collision is about 50%. (See http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates for more details.)
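You can sanity-check that figure with the birthday approximation p ≈ 1 - exp(-n^2 / (2 * 2^122)), where 122 is the number of random bits in a version 4 UUID:

public class UuidCollision {
    public static void main(String[] args) {
        // 1 billion UUIDs per second for 100 years
        double n = 1e9 * 60 * 60 * 24 * 365.25 * 100;
        double space = Math.pow(2, 122); // random bits in a version 4 UUID
        double p = 1 - Math.exp(-(n * n) / (2 * space));
        System.out.printf("collision probability ~ %.2f%n", p); // about 0.6, same order as the ~50% above
    }
}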
You should investigate using Twitter Snowflake. From the project readme:
As we at Twitter move away from Mysql towards Cassandra, we've needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.
Snowflake uses an intuitive algorithm that generates longs which are both time-ordered and unique. Since your database is distributed, this service should suit your needs well.
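For illustration, here is a minimal Snowflake-style generator: 41 bits of milliseconds since a custom epoch, 10 bits of worker id, and 12 bits of per-millisecond sequence. This is a sketch of the bit layout, not Twitter's actual implementation:

public class SnowflakeSketch {
    private static final long EPOCH = 1288834974657L; // custom epoch (here, Twitter's)
    private final long workerId; // 0..1023, must be unique per generator
    private long lastMillis = -1L;
    private long sequence = 0L;

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence = (sequence + 1) & 0xFFF; // 12-bit sequence within one millisecond
            if (sequence == 0) {
                // sequence exhausted: spin until the next millisecond
                while ((now = System.currentTimeMillis()) <= lastMillis) { }
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}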
As Richard said, you can use TimeUUID, and generating a TimeUUID value is not a big deal; just follow the Cassandra FAQ on timeuuid.
You can use the Cassandra function now() to generate a timeuuid, and the uuid() function to generate a random (version 4) uuid.
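For example (the table and column names here are hypothetical):

import com.datastax.driver.core.Session;

public class ServerSideIds {
    // Let Cassandra generate the IDs server-side: uuid() yields a random
    // version 4 uuid and now() a timeuuid.
    static void insert(Session session) {
        session.execute(
            "INSERT INTO sensor_events (id, event_time) VALUES (uuid(), now())");
    }
}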
I've got a million objects. What is the fastest way to look up a particular object with name as the key, and also the fastest way to perform an insertion? Would hashing be sufficient?
Probably a hash table, assuming you don't need anything other than key-based access. Make sure that the hashing of the key is good enough (so as to minimise collisions) and the table is large enough (for the same reason).
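In Java, for example, pre-sizing a HashMap for the expected million entries avoids rehashing during the load (a sketch):

import java.util.HashMap;
import java.util.Map;

public class NameIndex {
    public static void main(String[] args) {
        // With the default 0.75 load factor, a capacity of 2,000,000
        // comfortably holds a million entries without a rehash.
        Map<String, Object> byName = new HashMap<>(2_000_000);
        byName.put("Oliver", new Object());               // amortized O(1) insert
        System.out.println(byName.get("Oliver") != null); // expected O(1) lookup
    }
}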
It will depend on how often your need to do a lookup and how often you need to insert elements.
If you often have to insert elements then a linked list would perform better.
If you often have to search for elements, a hash table is more efficient. Perhaps you can have both: your main data as a linked list, and a hash table which will serve as an index into the list.
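In Java this combination already exists as LinkedHashMap: a hash table whose entries are also threaded on a linked list. A small sketch:

import java.util.LinkedHashMap;
import java.util.Map;

public class ListPlusIndex {
    public static void main(String[] args) {
        // Hash lookups plus a linked list threaded through the entries,
        // so iteration follows insertion order.
        Map<String, String> index = new LinkedHashMap<>();
        index.put("Oliver", "object-1");
        index.put("James", "object-2");
        System.out.println(index.get("James"));         // expected O(1) lookup
        index.forEach((k, v) -> System.out.println(k)); // Oliver, then James
    }
}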
You can also use a binary search tree. A BST has the advantage of fast search and fast insertion too. Use the key to route your way through the tree, and store the value in the tree node.
Use a BST in favor of hash tables if you are not sure about the balance of the operations (i.e. lookups of key-value pairs, insertions, etc.) and if you know (based on your analysis) that keys may collide frequently in the hash table (which will cause bad performance for the hash table).
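In Java, for instance, TreeMap is such a balanced BST (a red-black tree):

import java.util.TreeMap;

public class BstIndex {
    public static void main(String[] args) {
        // Red-black tree: guaranteed O(log n) put/get, with keys kept in
        // sorted order regardless of hash quality.
        TreeMap<String, String> byName = new TreeMap<>();
        byName.put("Oliver", "object-1");
        byName.put("James", "object-2");
        byName.put("Nicholls", "object-3");
        System.out.println(byName.get("Oliver")); // O(log n) lookup
        System.out.println(byName.firstKey());    // "James": smallest key, sorted order
    }
}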
Several structures exist here that you can use. Each has its advantages and disadvantages.
A hash table will have great lookup and insertion times, provided you have a table that minimizes collisions. If not, lookup and insertion can take a lot more time.
A binary search tree has O(log n) insertion and lookup, provided that it's balanced. Sometimes the balancing can cause an insertion to take a bit longer than O(log n), depending on the BST you go with.
You can also go with a B+ tree; it guarantees lower search complexity, since you reach leaf nodes quickly (height = log n to base k, where k is the degree of the nodes). Databases have similar requirements, and they use B+ trees to maintain and retrieve data.