I need to get the maximal key of a column family in a Cassandra database for further use. How can I get it using Cassandra Query Language or the Hector API?
Unless you are using an ordered partitioner, which is usually a bad idea, getting the maximal key in a column family is very expensive. See this article for more details about random versus ordered partitioner.
Generally you want to structure your cassandra data model so that you do gets on a single key, rather than gets on a range of keys. Often you have to denormalize your data to do so.
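One common workaround, if the maximum really is needed, is to track it yourself in a table you can read with a single-key get. A sketch in CQL (the table name and key type are assumptions, not from the question):

CREATE TABLE max_key_tracker (
    cf_name text PRIMARY KEY,   -- constant name of the column family being tracked
    max_key bigint              -- largest key written so far
);

-- the application updates this only when it writes a key larger than the stored one
UPDATE max_key_tracker SET max_key = 12345 WHERE cf_name = 'my_cf';

-- reading the maximum is now a cheap single-key get
SELECT max_key FROM max_key_tracker WHERE cf_name = 'my_cf';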
I am new to Cassandra and coming from a relational background. I have learned that Cassandra does not support JOINs, hence there is no concept of foreign keys. Suppose I have two tables:
Users
  id
  name

Cities
  id
  name
In the RDBMS world I would put a city_id into the users table. Since there is no concept of joins and you are allowed to duplicate data, does it still make sense to put city_id into the users table, or should I create a users_by_cities table instead?
The main Cassandra concept is that you design tables based on your queries (writes to a table have essentially no restrictions). The design is driven by the query filters. An application that queries a table by some ID is somewhat unnatural, because a CITY_ID could be any value and is typically unknown (unless you ran a prior query to get it); something like CITY_NAME is more natural. In any case, assuming there are no indexes on the table (which are essentially tables themselves), Cassandra has rules about the filters you provide relative to the table design. At a minimum, one of the filters MUST be the partition key. The partition key directs Cassandra to the correct node for the data (which is how reads are optimized). If none of your filters is the partition key, you will get an error (unless you use ALLOW FILTERING, which is a no-no). Any other filters must be on clustering columns; you cannot filter on a column that is neither the partition key nor a clustering column (again, unless you use ALLOW FILTERING).
These restrictions feel unnatural and hard to adjust to when you come from the RDBMS world, and because of them you may have to duplicate data into very similar structures (perhaps differing only in their partition keys and clustering columns). For the most part, it is up to the application to keep each structure updated when changes occur, and the application must know which table to query based on the filters it has. All of this is painful coming from a relational world (where you can do whatever you want to one structure). These "constraints" need to be weighed against the reasons why you chose Cassandra as your storage engine.
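For example, to serve a "users in a given city" query, the users_by_cities table mentioned in the question could be keyed by the filter the application actually knows. A minimal sketch (the column names are assumptions):

CREATE TABLE users_by_cities (
    city_name text,   -- partition key: the filter the application knows up front
    user_id   uuid,   -- clustering column: keeps each user unique within a city
    user_name text,
    PRIMARY KEY ((city_name), user_id)
);

-- the single query this table is designed for
SELECT user_id, user_name FROM users_by_cities WHERE city_name = 'Boston';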
Hope this helps.
-Jim
Could you please clarify how IDs work in Cassandra?
In relational databases, IDs are generated with auto-increment.
The id field is tied to table mappings and locking.
As far as I know, Cassandra uses UUIDs instead of IDs.
Could you please explain the main concept behind UUIDs? Why does Cassandra avoid auto-increment IDs?
Thanks!
The advantage of UUIDs over auto-incrementing integers is that you can generate them in a distributed fashion. With incrementing integers there must be a single counter somewhere that always has to be consulted when generating a new ID. With UUIDs you can just generate a new ID anywhere in your cluster and use it right away.
Basically you can think of UUIDs as big random numbers. So it's highly unlikely that two nodes are generating the same ID even if they are not coordinated.
Still, it seems you should make yourself familiar with the concept of keys in Cassandra. Unlike relational databases, keys in Cassandra are not just there to uniquely identify a record but to support the queries you need to run. Therefore keys in Cassandra are often not a UUID … or not a UUID alone.
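As a rough illustration of that last point, here is a sketch (the table and column names are assumptions) where the partition key is the value the application queries by and the UUID is only a clustering column used for uniqueness:

CREATE TABLE orders_by_customer (
    customer_email text,     -- partition key: what the application looks up by
    order_id       uuid,     -- clustering column: uniqueness within the partition
    total          decimal,
    PRIMARY KEY ((customer_email), order_id)
);

-- uuid() generates a random UUID on the server, no central counter required
INSERT INTO orders_by_customer (customer_email, order_id, total)
VALUES ('jane@example.com', uuid(), 42.50);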
We have an entity uniquely identified by a generated UUID. We need to support a find-by-name query. We also need results sorted by name.
We know that there will be no more than 1000 entities of this type, which can perfectly fit in one row. Is it a viable idea to hardcode the partition key, use name as a clustering key, and add id as another clustering key to satisfy uniqueness? Let's say we need a school entity. Here is an example:
CREATE TABLE school (
constant text,
name text,
id uuid,
description text,
location text,
PRIMARY KEY ((constant), name, id)
);
The initial query would be "give me all schools", and then filtering by an exact name would happen. Our reasoning behind this was to place all schools in a single row for fast access, have name as a clustering column for filtering, and have id as a clustering column to guarantee uniqueness. We can use constant = 'school' as a known hardcoded value to access this row.
What I like about this solution is that all values are in one row and we get fast reads. We can also solve sorting easily with the clustering column. What I do not like is the hardcoded value for constant, which seems odd. We could use name as the partition key, but then we would have 1000 records spread across a couple of partitions; "find all" without a name would probably be slower and the results would not be sorted.
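For reference, the access pattern described above corresponds to queries like these (the name value is only an illustration):

-- "give me all schools", served from a single partition, sorted by name
SELECT * FROM school WHERE constant = 'school';

-- filtering by exact name
SELECT * FROM school WHERE constant = 'school' AND name = 'Springfield High';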
Question 1
Is this a viable solution, and are there any problems with it that we do not see? I did not see any examples of Cassandra data modelling with a hardcoded partition key, probably for a reason, so we are doubting this solution.
Question 2
Name is an editable field. It will probably change rarely (someone makes a typo, or a school changes its name), but it can change. What is the best way to handle this? A delete and insert inside a batch (LWT can be applied to the same row with a conditional clause)?
Yes, this is a good approach for such a small dataset. Just because Cassandra can partition large datasets across multiple nodes does not mean that you need to use that ability for every table. By using a constant for the partition key, you are telling Cassandra that you want the data to be stored on one node, where you can access it quickly and in sorted order. Relational databases act on data in a single node all the time, so this is really not such an unusual thing to do.
For safety you will probably want to use a replication factor higher than one so that there are at least two copies of the single partition. That way you will not lose access to the data if the one node where it is stored goes down.
This approach could cause problems if you expect to have a lot of clients (i.e. thousands of clients) frequently reading and writing to this table, since it could become a hot spot. With only 1000 records you can probably keep all the rows cached in memory by setting the table to cache all keys and rows.
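A sketch of what that could look like in CQL (the keyspace name is an assumption, and the caching map syntax is the Cassandra 2.1+ form):

-- keep more than one replica of the single partition
CREATE KEYSPACE school_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- cache all keys and all rows of this small table
ALTER TABLE school
  WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};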
You probably won't find a lot of examples where this is done because people move to Cassandra for the support of large datasets where they want the scalability that comes from using multiple partitions. So examples are geared towards that.
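Regarding question 2: because name is a clustering column it cannot be updated in place, so a rename means deleting the old row and inserting the new one. A sketch of doing both in a batch (the UUID and values are placeholders); since both statements hit the same partition, the batch is applied as a single mutation:

BEGIN BATCH
  DELETE FROM school
   WHERE constant = 'school' AND name = 'Old Name'
     AND id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;
  INSERT INTO school (constant, name, id, description, location)
  VALUES ('school', 'New Name', 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47,
          'same description', 'same location');
APPLY BATCH;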
Is this a viable solution, and are there any problems with it that we do not see? I did not see any examples of Cassandra data modelling with a hardcoded partition key, probably for a reason, so we are doubting this solution.
I briefly addressed this type of modeling solution earlier this year in my article: We Shall Have Order! This is what is known as a "dummy key," where each row has the same partition key. This is a shortcut that allows you to easily order all of your rows (on an unbound SELECT *) by clustering column(s).
Problems with this solution:
Cassandra allows a maximum of 2 billion column values per partition key. When using a dummy partition key, you will approach this limit with each value that you add.
Your data will all be stored in the same partition, which will create a "hot spot" (large groupings of data) in your cluster. This means that your data model will immediately void one of Cassandra's main benefits...data distribution. This will also complicate load balancing (the same nodes and ranges will keep serving all of your requests).
I can see that your model is designed around a SELECT * query. Cassandra works best when you can give it specific keys to query by. Unbound SELECT * queries (queries without WHERE clauses) are not a good idea with Cassandra, as they can lead to timeouts (as your data grows).
From reading through your question, I know that you're going to say that you're only using it for 1000 rows. That your dataset won't ever grow much beyond those 1000 rows, so you won't hit any of the roadblocks that I have mentioned.
So then I have to wonder, why are you using Cassandra? As a Cassandra MVP, that's a question I don't ask often. But you don't have an especially large data set (which is what Cassandra is designed to work with). Relying on that fact as a reason to use a product incorrectly is not really the best solution.
Honestly, I am going to recommend that you save yourself some complexity, and use a RDBMS instead. That will fit your use case significantly better than Cassandra will. Then you can update and order by whatever fields you wish.
I started reading Cassandra: The Definitive Guide, which is based on Cassandra 0.7. Now I'm trying to experiment with Cassandra 2.1.5, and it seems there are a lot of differences, which is really confusing.
For example, I see that in version 0.7 CQL did not exist. The data model also seems quite different: you can now define a schema with CQL, while in version 0.7 there was no schema.
Can anyone shortly explain the differences, especially about the data model?
I understand that in version 0.7 the idea was rows of different lengths, that is, rows with different numbers of columns. But now I understand that each column is actually a field that contains a number of parameters, so you can have as many fields as you want within the same row (same key).
Can someone summarize the differences? Maybe I did not understand correctly.
An important point to consider is that the underlying storage model remains the same. CQL is simply an abstraction layer on top of that model, to make it easier to work with and to model your data. DataStax MVP John Berryman has a great article on this: Understanding How CQL3 Maps to Cassandra’s Internal Data Structure
In this article, Berryman observes that:
The value of the CQL primary key is used internally as the row key (which in the new CQL paradigm is being called a “partition key”).
The names of the non-primary key CQL fields are used internally as column names. The values of the non-primary key CQL fields are then stored internally as the corresponding column values.
Additionally, he outlines the benefits of using the CQL-based approach:
It provides fast look-up by partition key and efficient scans and slices by cluster key.
It groups together related data as CQL rows. This means that you can do in one query what would otherwise take multiple queries into different column families.
It allows for individual fields to be added, modified, and deleted independently.
It is strictly better than the old Cassandra paradigm. Proof: you can coerce CQL Tables to behave exactly like old-style Cassandra ColumnFamilies. (See the examples here.)
It extends easily to implementation of sets lists and maps (which are super ugly if you’re working directly in old cassandra) — but that’s for another blog post.
The CQL protocol allows for asynchronous communication as compared with the synchronous, call-response communication required by Thrift. As a result, CQL is capable of being much faster and less resource intensive than Thrift – especially when using single threaded clients.
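As a rough illustration of that mapping (a sketch; the table and values are made up):

CREATE TABLE tweets (
    user_name text,
    tweet_id  timeuuid,
    body      text,
    PRIMARY KEY ((user_name), tweet_id)
);

-- Internally (pre-3.0 storage engine), all CQL rows sharing a partition key are
-- flattened into one wide row. The partition key value becomes the row key, and
-- each non-key field becomes a column whose name combines the clustering value
-- with the CQL field name, roughly:
--
--   RowKey: 'alice'
--     (tweet_id_1 : 'body') -> 'first tweet'
--     (tweet_id_2 : 'body') -> 'second tweet'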
can have as many fields as you want within the same row (same key).
Actually, there is a hard limit of about 2 billion columns per partition (rowkey).
I'm working on a distributed database. I'm trying to generate a unique ID that will serve as a column family primary key in Cassandra.
I read some articles about doing this with Java using UUID but it seems like there is a probability for collision (even if it's very low).
I wonder if there is a way to generate a unique ID based on time maybe?
You can use the TimeUUID type in Cassandra, which is backed by a Type 1 UUID. This uses the current time, the creator's MAC address, and a sequence number. If the TimeUUID is generated correctly, this can be done with zero collisions (you can use the CQL now() function or insert your own; the Java drivers provide some thread-safe implementations). The main advantage of TimeUUIDs is that the IDs can be time ordered. See http://wiki.apache.org/cassandra/TimeBaseUUIDNotes for more info.
However, the time ordering is unlikely to be useful for row primary keys, since the ordering is useless when using a hash partitioner, though it is possible with a clustering key. Also, the complexity of generating a unique ID could be a source of bugs if you roll your own. Cassandra also supports Type 4 UUIDs via the UUID type. These are just random bits. There is a collision probability, but it is extremely low (assuming uncorrelated random number sources, which is the case if you generate in Java): if you created 1 billion per second for 100 years, the probability of a single collision is about 50%. (See http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates for more details.)
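Where time ordering is actually wanted, a timeuuid tends to be used as a clustering column rather than as the partition key. A sketch (the table and column names are assumptions):

CREATE TABLE events_by_source (
    source_id uuid,        -- partition key: located by hash, order irrelevant
    event_id  timeuuid,    -- clustering column: unique and time-ordered
    payload   text,
    PRIMARY KEY ((source_id), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);

-- newest events first, within a single partition
SELECT * FROM events_by_source
 WHERE source_id = 00000000-0000-0000-0000-000000000001
 LIMIT 10;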
You should investigate using Twitter Snowflake. From the project readme:
As we at Twitter move away from Mysql towards Cassandra, we've needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.
Snowflake uses an intuitive algorithm that generates longs which are both time-ordered and unique. Since your database is distributed, this service should suit your needs well.
As Richard said, you can use TimeUUID, and generating a TimeUUID value is not a big deal. Just follow the Cassandra FAQ entry on timeuuid.
You can use the Cassandra function now() to generate a timeuuid, and the uuid() function to generate a random uuid value.
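A minimal sketch showing both functions in an insert (the table and column names are assumptions):

CREATE TABLE example_ids (
    id         uuid,       -- random (type 4) UUID
    created_at timeuuid,   -- time-based (type 1) UUID
    note       text,
    PRIMARY KEY (id)
);

-- uuid() returns a random type 4 UUID; now() returns a type 1 timeuuid
INSERT INTO example_ids (id, created_at, note)
VALUES (uuid(), now(), 'generated server-side, no central counter needed');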