Could you please clarify how ids work in Cassandra?
Relational databases typically use an id column with auto-increment generation.
The id field is tied to table mappings and locking.
As far as I know, Cassandra uses UUIDs instead of such ids.
Could you please explain the main concept behind UUIDs? Why does Cassandra avoid auto-increment ids?
Thanks!
The advantage of UUIDs over auto-incrementing integers is that you can generate them in a distributed fashion. With incrementing integers there must be a single counter somewhere that always has to be consulted when generating a new ID. With UUIDs you can generate a new ID anywhere in your cluster and use it right away.
Basically you can think of UUIDs as big random numbers. So it's highly unlikely that two nodes are generating the same ID even if they are not coordinated.
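As a minimal sketch of generating IDs without coordination, the following uses only Python's standard uuid module. The two "nodes" are just two independent function calls standing in for machines that never talk to each other, and the helper name is hypothetical.

```python
import uuid


def new_id_on_any_node() -> uuid.UUID:
    # uuid4() draws its random bits locally; no shared counter is consulted.
    return uuid.uuid4()


# Two "nodes" generate IDs independently of each other.
id_from_node_a = new_id_on_any_node()
id_from_node_b = new_id_on_any_node()

print(id_from_node_a)
print(id_from_node_b)
print(id_from_node_a != id_from_node_b)
```

An auto-increment scheme would need both calls to ask the same counter for the next value, which is exactly the coordination this avoids.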
That said, you should familiarize yourself with the concept of keys in Cassandra. Unlike in relational databases, keys in Cassandra are not just there to uniquely identify a record but also to prepare the data for your queries. Therefore keys in Cassandra are often not a UUID … or not a UUID alone.
I am new to Cassandra and come from a relational background. I learned that Cassandra does not support JOINs and hence has no concept of foreign keys. Suppose I have two tables:
Users
id
name
Cities
id
name
In the RDBMS world I would store city_id in the users table. Since there is no concept of joins and you are allowed to duplicate data, does it still make sense to store city_id in the users table when I can create a users_by_cities table instead?
The main Cassandra concept is that you design tables based on your queries (writes to the table have no such restrictions). The design is driven by the query filters. An application that queries a table by some ID is somewhat unnatural, as the CITY_ID could be any value and typically is unknown (unless you ran a prior query to get it). Something more natural may be CITY_NAME. Anyway, assuming there are no indexes on the table (which are mere tables themselves), there are rules in Cassandra regarding the filters you provide and the table design, mainly that, at a minimum, one of the filters MUST be the partition key. The partition key directs Cassandra to the correct node for the data (which is how reads are optimized). If none of your filters is the partition key, you'll get an error (unless you use ALLOW FILTERING, which is a no-no). The other filters, if there are any, must be clustering columns (you can't have a filter that is neither the partition key nor a clustering column, again unless you use ALLOW FILTERING).
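The filter rules above can be modelled with a small Python check. This is a deliberately simplified sketch, not Cassandra's actual query planner: the function name, the table layout (partition key city_name, clustering column user_id) and the column names are all hypothetical. It also assumes the additional rule that clustering columns have to be restricted in their declared order without gaps.

```python
from typing import Iterable, Sequence


def query_allowed(partition_key: Sequence[str],
                  clustering_columns: Sequence[str],
                  filters: Iterable[str]) -> bool:
    """Return True if the filters fit the table design without ALLOW FILTERING.

    Simplified model: ignores secondary indexes, range operators and IN.
    """
    remaining = set(filters)
    # Every partition key column must be among the filters.
    if not set(partition_key) <= remaining:
        return False
    remaining -= set(partition_key)
    # Clustering columns may follow, in declared order, with no gaps.
    for column in clustering_columns:
        if column not in remaining:
            break
        remaining.discard(column)
    # Anything left is neither partition key nor a usable clustering column.
    return not remaining


# Hypothetical users_by_city table.
pk = ["city_name"]
cc = ["user_id"]

print(query_allowed(pk, cc, ["city_name"]))               # partition key only
print(query_allowed(pk, cc, ["city_name", "user_id"]))    # plus clustering column
print(query_allowed(pk, cc, ["user_id"]))                 # partition key missing
print(query_allowed(pk, cc, ["city_name", "user_name"]))  # plain column filter
```

The first two calls describe queries the table can serve directly. The last two would need ALLOW FILTERING or a second table keyed differently.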
These restrictions, coming from the RDBMS world, are unnatural and hard to adjust to, and because of them, you may have to duplicate data into very similar structures (maybe the only difference is the partition keys and clustering columns). For the most part, it is up to the application to manipulate each structure when changes occur, and the application must know which table to query based off of the filters provided. All of these are considered painful coming from a relational world (where you can do whatever you want to one structure). These "constraints" need to be weighed against the reasons why you chose Cassandra for your storage engine.
Hope this helps.
-Jim
Two simple questions:
Is a UUID a good choice as a partition key? Will this distribute data evenly among all nodes in the cluster?
Is a (unique) integer a good choice?
Will any of these options create "hot" partitions?
Thanks!
A UUID is a good choice for a partition key: it should be well distributed across the cluster nodes. A "unique" integer is trickier: some node needs to be the authority that generates this number, and that is hard to do in a distributed environment.
Regarding hot partitions, this will depend on your data model. If you have other primary key components besides the partition key, then yes, you may have this problem. For example, you generate a random UUID for a sensor and then start writing a lot of data under it.
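Both points can be illustrated with a toy simulation. This is only a sketch under stated assumptions: a SHA-256 hash modulo a node count stands in for Cassandra's partitioner and token ring, NODE_COUNT and the helper names are hypothetical, and the random generator is seeded so the run is repeatable.

```python
import hashlib
import random
import uuid
from collections import Counter

NODE_COUNT = 4  # hypothetical cluster size


def node_for(partition_key: uuid.UUID) -> int:
    # Stand-in for the partitioner: hash the key, map the hash to a node.
    digest = hashlib.sha256(partition_key.bytes).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT


rng = random.Random(42)


def random_uuid() -> uuid.UUID:
    # Seeded random version 4 UUIDs, so the simulation is deterministic.
    return uuid.UUID(int=rng.getrandbits(128), version=4)


# Case 1: every row gets its own UUID partition key.
many_keys = Counter(node_for(random_uuid()) for _ in range(10_000))

# Case 2: one sensor UUID is the partition key for every row.
sensor_id = random_uuid()
one_key = Counter(node_for(sensor_id) for _ in range(10_000))

print("rows per node, unique keys:", dict(many_keys))
print("rows per node, single key: ", dict(one_key))
```

In the first case the rows spread over all nodes in roughly equal shares. In the second case every row lands on the same node, which is the hot partition described above.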
I usually tell folks not to use a UUID as a partition key for two simple reasons.
UUIDs are designed to be unique, and thus have a high potential cardinality.
While it does depend on your data model, think about how many rows you're going to have under each UUID, and then ask yourself if you really want to have to supply a full UUID on each and every query.
Again, it's all about the data model. From a DBA's perspective, UUIDs will distribute well. But from a developer's perspective, they can really clamp down on your potential query patterns.
Ultimately, you want your primary key components to allow your model to A) distribute well and B) match your query patterns. If partitioning on a UUID gives you that, then great!
I am new to Cassandra, and my cluster is giving a lot of read timeout errors. I tweaked the timeout, but the problem persists, so it may be a problem with my design (for my application, Cassandra is expected to store trillions of records):
Question 1: In all my Cassandra tables I use a UUID as the row key, but for a few tables I break that rule for ease of maintenance. For example, in the user table I use the email id as the row key, so that I can understand the stored data just by looking at the table. Is using a UUID the right approach at this scale, and is the second approach right for the user table?
Question 2: I have a relations table with startNodeId, relationTypeId and endNodeId. Its row key is a UUID, the relationId. I defined secondary indexes on startNode, relationType and endNode, because the business case requires lookups by any of them. Because of that, for each new row I have to do a get to check whether the relation already exists. One approach to avoid this check: take startNodeId, relationTypeId and endNodeId, sort them, compute a hash code and use that as the row key, so the explicit existence check is no longer needed. Is this the right approach?
Please guide me, I am stuck on these questions. Any guidance will really help me.
To answer your first question: if you are comfortable handling a non-UUID row key, that is fine and also easier to track; otherwise go for the UUID.
Regarding your second question, why not try a compound key? You don't have to maintain things like hash codes yourself; leave that to Cassandra.
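The effect of a compound key can be sketched with a plain Python dict standing in for the relations table. This is an illustration only, not driver code. It relies on the fact that a Cassandra write to an existing primary key overwrites the row rather than creating a duplicate. The function and field names are hypothetical.

```python
# A dict stands in for a table whose primary key is the compound
# (start_node_id, relation_type_id, end_node_id).
relations = {}


def save_relation(start_node_id, relation_type_id, end_node_id, attributes):
    key = (start_node_id, relation_type_id, end_node_id)
    # Writing to an existing key overwrites it, so no prior read is needed.
    relations[key] = attributes


save_relation("n1", "follows", "n2", {"since": 2013})
save_relation("n1", "follows", "n2", {"since": 2014})  # same relation again
save_relation("n2", "follows", "n1", {"since": 2014})  # reverse direction

print(len(relations))
print(relations[("n1", "follows", "n2")])
```

Because the key itself identifies the relation, the second write replaces the first instead of adding a duplicate. Keeping the three parts unsorted also keeps the relation's direction, which a hash of the sorted values would lose.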
1) Better to use natural keys, not UUIDs: email, timestamp, composite primary keys, and so on. Using a UUID is an approach from the RDBMS world; you should avoid it in Cassandra.
2) Read-modify-update is the wrong pattern for Cassandra. Try rewriting the data, if your business case allows this. Or just use a timestamp and get the row with the latest timestamp (don't forget about TTL).
I'm working on a distributed database. I'm trying to generate a unique ID that will serve as a column family primary key in Cassandra.
I read some articles about doing this with Java using UUID but it seems like there is a probability for collision (even if it's very low).
I wonder if there is a way to generate a unique ID based on time maybe?
You can use the TimeUUID type in Cassandra, which is a Type 1 UUID. It is built from the current time, the creator's MAC address and a sequence number. If the TimeUUID is generated correctly, this can be done with zero collisions (you can use the CQL now() function or insert your own value; the Java SDKs provide some thread-safe implementations). The main advantage of TimeUUIDs is that the IDs can be time ordered. See http://wiki.apache.org/cassandra/TimeBaseUUIDNotes for more info.
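To show what "time-based" means here, the sketch below creates a Type 1 UUID with Python's standard uuid.uuid1() (not Cassandra's now() function) and reads the embedded timestamp back out. The helper name is hypothetical. The constant is the number of 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch.

```python
import datetime
import uuid

# 100 ns intervals between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000


def timestamp_of(time_uuid: uuid.UUID) -> datetime.datetime:
    """Extract the creation time embedded in a version 1 UUID."""
    if time_uuid.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    unix_100ns = time_uuid.time - GREGORIAN_TO_UNIX_100NS
    return datetime.datetime.fromtimestamp(unix_100ns / 1e7,
                                           tz=datetime.timezone.utc)


first = uuid.uuid1()
created_at = timestamp_of(first)

print(first)
print(created_at.isoformat())
```

Because the timestamp is part of the value, such IDs can be ordered by creation time, which a random UUID cannot offer.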
However, the time ordering is unlikely to be useful for row primary keys, since a hash partitioner discards the ordering, though it can be exploited in a clustering key. Also, the complexity of generating a unique ID could be a source of bugs if you roll your own. Cassandra also supports Type 4 UUIDs through the UUID type. These are just random bits. There is a collision probability, but (assuming uncorrelated random number sources, which they will be if you generate in Java) it is extremely low: if you created 1 billion a second for 100 years, the probability of one collision is about 50%. (See http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates for more details.)
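The order of magnitude can be checked with the standard birthday approximation, p ≈ 1 - exp(-n² / 2N), where n is the number of IDs generated and N = 2^122 is the number of possible random Type 4 UUIDs (122 of the 128 bits are random). This is a back-of-the-envelope sketch with hypothetical names, not an exact calculation.

```python
import math

RANDOM_BITS = 122  # random bits in a version 4 UUID


def collision_probability(generated: float,
                          random_bits: int = RANDOM_BITS) -> float:
    """Birthday approximation for at least one collision among random IDs."""
    space = 2.0 ** random_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-(generated ** 2) / (2.0 * space))


PER_SECOND = 1e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

for years in (1, 10, 100):
    total = PER_SECOND * SECONDS_PER_YEAR * years
    print(f"{years:>3} years: {collision_probability(total):.3g}")
```

Under this approximation the 100-year figure comes out somewhat above one half, the same order as the "about 50%" quoted above, while at realistic volumes the probability is vanishingly small.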
You should investigate using Twitter Snowflake. From the project readme:
As we at Twitter move away from Mysql towards Cassandra, we've needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.
Snowflake uses an intuitive algorithm that generates longs which are both time-ordered and unique. Since your database is distributed, this service should suit your needs well.
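A Snowflake-style generator can be sketched as follows. This is not Twitter's code: it assumes the commonly described 64-bit layout (41 bits of milliseconds since a custom epoch, 10 bits of worker id, 12 bits of per-millisecond sequence). The class name and the epoch value are hypothetical choices for illustration.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # assumed custom epoch: 2020-01-01 UTC
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeLikeGenerator:
    """Time-ordered 64-bit ids that are unique per worker id."""

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    @staticmethod
    def _now_ms() -> int:
        return int(time.time() * 1000)

    def next_id(self) -> int:
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                    | (self.worker_id << SEQUENCE_BITS)
                    | self._sequence)


generator = SnowflakeLikeGenerator(worker_id=7)
ids = [generator.next_id() for _ in range(5)]
print(ids)
```

Each worker needs a distinct worker_id, assigned once up front. After that, ids are generated without any coordination between workers, and sorting by id sorts by creation time to millisecond resolution.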
As Richard said, you can use TimeUUID, and generating a TimeUUID value is not a big deal. Just follow the timeuuid entry in the Cassandra FAQ.
Use the Cassandra function now() to generate a timeuuid value, and the uuid() function to generate a value of type uuid.
I need to get the maximal key of a column family in a Cassandra database for further use. How can I get it using the Cassandra Query Language or the Hector API?
Unless you are using an ordered partitioner, which is usually a bad idea, getting the maximal key in a column family is very expensive. See this article for more details about random versus ordered partitioner.
Generally you want to structure your Cassandra data model so that you do gets on a single key rather than on a range of keys. Often you have to denormalize your data to do so.