Auto increment primary key when using saveToCassandra() - apache-spark

Is it possible to create auto increment primary key in table Cassandra?

Basically, you cannot generate an auto-increment key in Cassandra. It doesn't really make sense in a distributed database, since some central point would need to be responsible for keeping the sequence.
A common way to make keys is to generate a UUID, which is random but so unlikely to collide that duplicates are not a practical concern.
From Wikipedia:
for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
If you are really, really worried about collisions, you can insert the new UUID using an "IF NOT EXISTS" lightweight transaction,
e.g.
INSERT INTO mytable (uuid, text) VALUES (123e4567-e89b-12d3-a456-426655440000, 'hello') IF NOT EXISTS;
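The Wikipedia figure quoted above can be checked with the usual birthday-bound approximation. This is only a sketch: it assumes the 122 random bits of a version 4 UUID are uniformly distributed, and the function name is made up for illustration.

```python
import math

RANDOM_BITS = 122  # a version 4 UUID has 128 bits, 6 of which are fixed


def collision_probability(count: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = -count * (count - 1) / (2 * 2**bits)
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(exponent)


p = collision_probability(103 * 10**12)  # 103 trillion UUIDs
print(f"{p:.3e}")
```

The result comes out at roughly one in a billion, which matches the quoted claim.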

Related

Cassandra and IDs concept

Could you please clarify how ids work with Cassandra?
Relational databases use an id with auto-increment generation.
The id field is tied to table mapping and locking.
As far as I know, Cassandra uses UUIDs instead of ids.
Could you please explain the main concept behind UUIDs? Why does Cassandra exclude auto-increment ids?
Thanks!
The advantage of UUIDs over auto-incrementing integers is that you can generate them in a distributed fashion. With incrementing integers there must be a single counter somewhere that always has to be consulted when generating a new ID. With UUIDs you can just generate a new ID anywhere in your cluster and use it right away.
Basically, you can think of UUIDs as big random numbers, so it is highly unlikely that two nodes generate the same ID even if they are not coordinated.
Still, it seems you should familiarize yourself with the concept of keys in Cassandra. Unlike in relational databases, keys in Cassandra are not just there to uniquely identify a record; they also determine how you can query the data. Therefore keys in Cassandra are often not a UUID … or not a UUID alone.
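To make the "no shared counter" point concrete, here is a small sketch in which three stand-in "nodes" create keys on their own. The function name and the node count are invented for illustration; nothing here talks to Cassandra.

```python
import uuid


def node_generates_ids(how_many: int) -> list:
    """Stand-in for one node creating keys locally, without any coordination."""
    return [uuid.uuid4() for _ in range(how_many)]


# Three hypothetical nodes, each generating keys independently.
ids_per_node = [node_generates_ids(10_000) for _ in range(3)]
all_ids = [one_id for ids in ids_per_node for one_id in ids]

# Both numbers are expected to match, since collisions are astronomically unlikely.
print(len(all_ids), len(set(all_ids)))
```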

Cassandra data modelling less than 1000 records to fit in one row

We have some entity uniquely identified by a generated UUID. We need to support a find-by-name query. We also need to support sorting by name.
We know that there will be no more than 1000 entities of that type, which can perfectly fit in one row. Is it a viable idea to hardcode the partition key, and use name and id as clustering keys to satisfy uniqueness? Let's say we need a school entity. Here is an example:
CREATE TABLE school (
constant text,
name text,
id uuid,
description text,
location text,
PRIMARY KEY ((constant), name, id)
);
The initial state would be "give me all schools", and then filtering by exact name will happen. Our reasoning behind this was to place all schools in a single row for fast access, have name as a clustering column for filtering, and have id as a clustering column to guarantee uniqueness. We can use constant = school as a known hardcoded value to access this row.
What I like about this solution is that all values are in one row and we get fast reads. We can also solve sorting easily with the clustering column. What I do not like is the hardcoded value for constant, which seems odd. We could use name as the PK, but then we would have 1000 records spread across a couple of partitions, and finding all schools without a name would probably be slower and would not be sorted.
Question 1
Is this a viable solution, and are there any problems with it that we do not see? I did not see any example of Cassandra data modelling with a hardcoded primary key, probably for a reason, so we are doubting this solution.
Question 2
Name is an editable field. It will probably be changed rarely (someone can make a typo, or a school can change its name), but it can change. What is the best way to achieve this? A delete and insert inside a batch (an LWT can be applied to the same row with a conditional clause)?
Yes, this is a good approach for such a small dataset. Just because Cassandra can partition large datasets across multiple nodes does not mean that you need to use that ability for every table. By using a constant for the partition key, you are telling Cassandra that you want the data to be stored on one node where you can access it quickly and in sorted order. Relational databases act on data in a single node all the time, so this is really not such an unusual thing to do.
For safety you will probably want to use a replication factor higher than one so that there are at least two copies of the single partition. That way you will not lose access to the data if the one node where it is stored goes down.
This approach could cause problems if you expect to have a lot of clients (i.e. thousands of clients) frequently reading and writing to this table, since it could become a hot spot. With only 1000 records you can probably keep all the rows cached in memory by setting the table to cache all keys and rows.
You probably won't find a lot of examples where this is done because people move to Cassandra for the support of large datasets where they want the scalability that comes from using multiple partitions. So examples are geared towards that.
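As a rough mental model of the single-partition table from the question, the rows can be pictured as one list kept sorted by the clustering key (name, id). The toy class below is an illustrative assumption, not how Cassandra stores data. It also shows why renaming a school amounts to a delete plus an insert: name is part of the primary key.

```python
import bisect
import uuid


class SchoolPartition:
    """Toy model of the single 'school' partition, ordered by (name, id)."""

    def __init__(self):
        self._keys = []   # clustering keys (name, id), kept sorted
        self._rows = {}   # clustering key -> remaining columns

    def insert(self, name, school_id, **columns):
        key = (name, school_id)
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = columns

    def delete(self, name, school_id):
        key = (name, school_id)
        del self._rows[key]
        self._keys.remove(key)

    def rename(self, old_name, new_name, school_id):
        # name is part of the primary key, so a rename is delete + insert
        columns = self._rows[(old_name, school_id)]
        self.delete(old_name, school_id)
        self.insert(new_name, school_id, **columns)

    def names(self):
        """All school names, already in sorted order."""
        return [name for name, _ in self._keys]

    def find_by_name(self, name):
        """Rows whose name matches exactly, found by seeking into the sort order."""
        start = bisect.bisect_left(self._keys, (name,))
        found = []
        for key in self._keys[start:]:
            if key[0] != name:
                break
            found.append((key, self._rows[key]))
        return found


partition = SchoolPartition()
school_id = uuid.uuid4()
partition.insert("Zeta High", school_id, location="North")
partition.insert("Alpha School", uuid.uuid4(), location="South")
partition.rename("Zeta High", "Beta High", school_id)
print(partition.names())
```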
Is this a viable solution, and are there any problems with it that we do not see? I did not see any example of Cassandra data modelling with a hardcoded primary key, probably for a reason, so we are doubting this solution.
I briefly addressed this type of modeling solution earlier this year in my article: We Shall Have Order! This is what is known as a "dummy key," where each row has the same partition key. This is a shortcut that allows you to easily order all of your rows (on an unbound SELECT *) by clustering column(s).
Problems with this solution:
Cassandra allows a maximum of 2 billion column values per partition key. When using a dummy partition key, you will approach this limit with each value that you add.
Your data will all be stored in the same partition, which will create a "hot spot" (large groupings of data) in your cluster. This means that your data model will immediately void one of Cassandra's main benefits...data distribution. This will also complicate load balancing (the same nodes and ranges will keep serving all of your requests).
I can see that your model is designed around a SELECT * query. Cassandra works best when you can give it specific keys to query by. Unbound SELECT * queries (queries without WHERE clauses) are not a good idea to be doing with Cassandra, as they can lead to timeouts (as your data grows).
From reading through your question, I know that you're going to say that you're only using it for 1000 rows. That your dataset won't ever grow much beyond those 1000 rows, so you won't hit any of the roadblocks that I have mentioned.
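For a sense of scale, here is a back-of-the-envelope estimate of how far 1000 school rows are from the 2 billion limit mentioned above. Counting one cell per non-key column per row is a simplifying assumption, and the helper name is invented.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000  # the limit quoted above


def estimated_cells(rows: int, non_key_columns: int) -> int:
    """Simplified estimate: one cell per non-key column in each row."""
    return rows * non_key_columns


# The school table has two non-key columns: description and location.
cells = estimated_cells(rows=1000, non_key_columns=2)
fraction = cells / MAX_CELLS_PER_PARTITION
print(cells, fraction)
```

Under this assumption the partition uses about a millionth of the limit.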
So then I have to wonder, why are you using Cassandra? As a Cassandra MVP, that's a question I don't ask often. But you don't have an especially large data set (which is what Cassandra is designed to work with). Relying on that fact as a reason to use a product incorrectly is not really the best solution.
Honestly, I am going to recommend that you save yourself some complexity and use an RDBMS instead. That will fit your use case significantly better than Cassandra will. Then you can update and order by whatever fields you wish.

Cassandra Data Modelling and designing the Clustering

I am a little confused about designing the data model for Cassandra, coming from a SQL background! I have gone through the DataStax documentation several times to understand many things about Cassandra! This seems to be a problem, and I am not sure how I can overcome it or which type of data model I should opt for!
The primary key along with clustering is explained really well here!
The documentation says that the primary key (partition key, clustering keys) is the most important thing in the data model.
My use-case is pretty simple:
ITEM_ID CREATED_ON MOVED_FROM MOVED_TO COMMENT
ITEM_ID will be unique (partition_key) and each item might have 10-20 movement records! I want to get the movement records of an item sorted by the time they were created. So I decided to go with CREATED_ON as the clustering key.
According to the documentation, the clustering key comes under secondary indexes, which should have values that repeat as much as possible, unlike the partition key. My data model fails exactly here! How do I preserve order using clustering to achieve the same?
Obviously I can't create some ID generation logic in the application, since it runs on many instances, and if I have to rely on such logic, the purpose of using Cassandra is defeated.
You actually do not need a secondary index for this particular example, and secondary indexes are not created by default. Your clustering key all by itself will allow you to do queries that look like
SELECT * from TABLE where ITEM_ID = SOMETHING;
Which will automatically give you back results sorted on your clustering key CREATED_ON.
The reason for this is that your key will basically make partitions internally that look like
ITEM_ID => [Row with first Created_ON], [Row with second Created_ON] ...
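The layout sketched above can be mimicked with a dictionary of sorted lists. This is only an analogy with made-up sample data: the dictionary key plays the role of the partition key ITEM_ID, and each list is kept in CREATED_ON order.

```python
from collections import defaultdict
from datetime import datetime

# (item_id, created_on, moved_from, moved_to), hypothetical sample data
movements = [
    ("item-1", datetime(2015, 3, 2), "B", "C"),
    ("item-2", datetime(2015, 1, 5), "X", "Y"),
    ("item-1", datetime(2015, 1, 9), "A", "B"),
]

# Group rows by partition key, then order each partition by the clustering key.
partitions = defaultdict(list)
for row in movements:
    partitions[row[0]].append(row)
for rows in partitions.values():
    rows.sort(key=lambda r: r[1])


def movements_for(item_id):
    """Equivalent of SELECT * ... WHERE ITEM_ID = item_id: rows come back sorted."""
    return partitions.get(item_id, [])


for row in movements_for("item-1"):
    print(row)
```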

Efficient modeling of versioned hierarchies in Cassandra

Disclaimer:
This is quite a long post. I first explain the data I am dealing with, and what I want to do with it.
Then I detail three possible solutions I have considered, because I've tried to do my homework (I swear :]). I end up with a "best guess" which is a variation of the first solution.
My ultimate question is: what's the most sensible way to solve my problem using Cassandra? Is it one of my attempts, or is it something else?
I am looking for advice/feedback from experienced Cassandra users...
My data:
I have many SuperDocuments that own Documents in a tree structure (headings, subheadings, sections, …).
Each SuperDocument structure can change (renaming of headings mostly) over time, thus giving me multiple versions of the structure as shown below.
What I'm looking for:
For each SuperDocument I need to timestamp those structures by date as above and I'd like, for a given date, to find the closest earlier version of the SuperDocument structure. (ie. the most recent version for which version_date < given_date)
These considerations might help solving the problem more easily:
Versions are immutable: changes are rare enough, I can create a new representation of the whole structure each time it changes.
I do not need to access a subtree of the structure.
I'd say it is OK to say that I do not need to find all the ancestors of a given leaf, nor do I need to access a specific node/leaf inside the tree. I can work all of this out in my client code once I have the whole tree.
OK let's do it
Please keep in mind I am really just starting using Cassandra. I've read/watched a lot of resources about data modeling, but haven't got much (any!) experience in the field!
Which also means everything will be written in CQL3... sorry Thrift lovers!
My first attempt at solving this was to create the following table:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
pre_pos int,
post_pos int,
title text,
PRIMARY KEY ((doc_id, version_date), pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (pre_pos ASC, post_pos ASC);
That would give me the following structure:
I'm using a Nested Sets model for my trees here; I figured it would work well to keep the structure ordered, but I am open to other suggestions.
I like this solution: each version has its own row, in which each column represents a level of the hierarchy.
The problem though is that I (candidly) intended to query my data as follows:
SELECT * FROM superdoc_structures
WHERE doc_id='3399c35...14e1' AND version_date < '2014-03-11' LIMIT 1;
Cassandra quickly reminded me I was not allowed to do that! (because the partitioner does not preserve row order on the cluster nodes, so it is not possible to scan through partition keys)
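The reason given above, that the partitioner does not preserve key order, can be illustrated with any hash function. The sketch below uses MD5 purely as a stand-in for the partitioner's hash, and the token function is a made-up helper. Sorting the same keys by token gives an order unrelated to the date order, so a range of dates does not map to a contiguous range of tokens.

```python
import hashlib
from datetime import date, timedelta


def token(partition_key: str) -> int:
    """Stand-in for the partitioner's hash (MD5 here, for illustration only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Thirty consecutive dates, already in chronological order.
keys = [(date(2014, 3, 1) + timedelta(days=d)).isoformat() for d in range(30)]

# The order in which a token-ordered scan would visit the same keys.
by_token = sorted(keys, key=token)
print(by_token[:5])
```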
What then...?
Well, because Cassandra won't let me use inequalities on partition keys, so be it!
I'll make version_date a clustering key and all my problems will be gone. Yeah, not really...
First try:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
pre_pos int,
post_pos int,
title text,
PRIMARY KEY (doc_id, version_date, pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (version_date DESC, pre_pos ASC, post_pos ASC);
I find this one less elegant: all versions and structure levels are made into columns of a now very wide row (compared to my previous solution):
Problem: with the same request, using LIMIT 1 will only return the first heading. And using no LIMIT would return the structure levels of all versions, which I would have to filter to keep only the most recent ones.
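The client-side filtering mentioned here could look roughly like the sketch below. It assumes rows arrive ordered by version_date descending and then pre_pos ascending, as in the table definition. The sample rows and the function name are invented.

```python
from datetime import datetime
from itertools import groupby

# (version_date, pre_pos, post_pos, title), hypothetical rows in clustering order
rows = [
    (datetime(2014, 3, 1), 1, 6, "Title v2"),
    (datetime(2014, 3, 1), 2, 3, "Heading A v2"),
    (datetime(2014, 3, 1), 4, 5, "Heading B v2"),
    (datetime(2014, 1, 1), 1, 4, "Title v1"),
    (datetime(2014, 1, 1), 2, 3, "Heading A v1"),
]


def latest_version_before(rows, given_date):
    """Keep only the rows of the most recent version strictly before given_date."""
    candidates = (r for r in rows if r[0] < given_date)
    # Rows are sorted newest first, so the first group is the wanted version.
    for _, group in groupby(candidates, key=lambda r: r[0]):
        return list(group)
    return []


for row in latest_version_before(rows, datetime(2014, 3, 11)):
    print(row)
```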
Second try:
there's no second try yet... I have an idea though, but I feel it's not using Cassandra wisely.
The idea would be to cluster by version_date only, and somehow store whole hierarchies in each column values. Sounds bad doesn't it?
I would do something like this:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
nested_sets map<int, int>,
titles list<text>,
PRIMARY KEY (doc_id, version_date)
) WITH CLUSTERING ORDER BY (version_date DESC);
The resulting row structure would then be:
It looks kind of all right to me in fact, but I will probably have more data than the level title to de-normalize into my columns. If it's only two attributes, I could go with another map (associating titles with ids for instance), but more data would lead to more lists, and I have the feeling it would quickly become an anti-pattern.
Plus, I'd have to merge all lists together in my client app when the data comes in!
ALTERNATIVE & BEST GUESS
After giving it some more thought, there's a "hybrid" solution that might work and may be efficient and elegant:
I could use another table that would list only the version dates of a SuperDocument & cache these dates into a Memcache instance (or Redis or whatever) for real quick access.
That would allow me to quickly find the version I need to fetch, and then request it using the composite key of my first solution.
That's two queries, plus a memory cache store to manage. But I may end up with one anyway, so maybe that'd be the best compromise?
Maybe I don't even need a cache store?
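The first of the two lookups in this hybrid idea, finding the closest earlier version date, is a binary search over a sorted list of dates. Below is a minimal sketch with an in-memory dictionary standing in for the separate table or cache; the dictionary, the document id, and the function name are all hypothetical.

```python
import bisect
from datetime import datetime

# Hypothetical index: doc_id -> version dates, kept sorted in ascending order.
version_dates = {
    "doc-1": [datetime(2013, 11, 5), datetime(2014, 1, 20), datetime(2014, 3, 1)],
}


def closest_earlier_version(doc_id, given_date):
    """Most recent version_date strictly earlier than given_date, or None."""
    dates = version_dates.get(doc_id, [])
    # bisect_left returns how many dates are strictly earlier than given_date
    position = bisect.bisect_left(dates, given_date)
    return dates[position - 1] if position else None


print(closest_earlier_version("doc-1", datetime(2014, 3, 11)))
```

The returned date, together with doc_id, would then form the composite key for the second lookup.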
All in all, I really feel the first solution is the most elegant one to model my data. What about you?!
First, you don't need to use memcache or redis. Cassandra will give you very fast access to that information. You could certainly have a table that was something like:
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
/* stuff */
primary key (doc_id, version_date)
) with clustering order by (version_date desc);
which would give you a quick way to access a given version (this query may look familiar ;-):
select * from superdoc_structures
where doc_id='3399c35...14e1' and
version_date < '2014-03-11'
order by version_date desc
limit 1;
Since nothing about the document tree structure seems to be relevant from the schema's point of view, and you are happy as a clam to create the document in its entirety every time there is a new version, I don't see why you'd even bother breaking out the tree into separate rows. Why not just have the entire document in the table as a text or blob field?
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
contents text,
primary key (doc_id, version_date)
) with clustering order by (version_date desc);
So to get the contents of the document as it existed at the new year, you'd do:
select contents from superdoc_structures
where doc_id='....' and
version_date < '2014-01-01'
order by version_date desc
limit 1;
Now, if you did want to maintain some kind of hierarchy of the document components, I'd recommend doing something like a closure table to represent it. Alternatively, since you are willing to copy the entire document on each write anyway, why not copy the entire section info on each write as well and have a schema like:
create table superdoc_structures (
doc_id varchar,
version_date timestamp,
section_path varchar,
contents text,
primary key (doc_id, version_date, section_path)
) with clustering order by (version_date desc, section_path asc);
Then give section_path a syntax like "first_level next_level sub_level leaf_name". As a side benefit, when you have the version_date of the document (or if you create a secondary index on section_path), because a space is lexically "lower" than any other valid character, you can actually grab a subsection very cleanly:
select section_path, contents from superdoc_structures
where doc_id = '....' and
version_date = '2013-12-22' and
section_path >= 'chapter4 subsection2' and
section_path < 'chapter4 subsection2!';
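The range trick in this query can be tried out on a plain sorted list of strings. The sketch below uses made-up section paths and relies on the same ordering argument: a space (0x20) sorts just below "!" (0x21), so the half-open range covers the section and everything beneath it, but not a sibling such as "chapter4 subsection20".

```python
import bisect

# Hypothetical section paths, sorted as the clustering column would order them.
paths = sorted([
    "chapter4 subsection1",
    "chapter4 subsection2",
    "chapter4 subsection2 leaf_a",
    "chapter4 subsection2 part1 leaf_b",
    "chapter4 subsection20",
    "chapter4 subsection3",
])


def subsection(sorted_paths, prefix):
    """Paths p with prefix <= p < prefix + '!', as in the query above."""
    low = bisect.bisect_left(sorted_paths, prefix)
    high = bisect.bisect_left(sorted_paths, prefix + "!")
    return sorted_paths[low:high]


for path in subsection(paths, "chapter4 subsection2"):
    print(path)
```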
Alternatively, you can store the sections using Cassandra's support for collections, but again... I'm not sure why you'd even bother breaking them out as doing them as one big chunk works just great.

Cassandra: Generate a unique ID?

I'm working on a distributed database. I'm trying to generate a unique ID that will serve as a column family primary key in Cassandra.
I read some articles about doing this in Java using UUIDs, but it seems there is a probability of collision (even if it's very low).
I wonder if there is a way to generate a unique ID based on time, maybe?
You can use the TimeUUID type in Cassandra, which is backed by a Type 1 (time-based) UUID. This uses the current time, the creator's MAC address, and a sequence number. If the TimeUUID is generated correctly, this can be done with zero collisions (you can use the CQL now() function or insert your own; the Java SDKs provide some thread-safe implementations). The main advantage of TimeUUIDs is that the IDs can be time ordered. See http://wiki.apache.org/cassandra/TimeBaseUUIDNotes for more info.
However, the time ordering is unlikely to be useful for row primary keys, since the ordering is lost when using a hash partitioner, though it can be exploited in a clustering key. The complexity of generating a unique ID could also be a source of bugs if you roll your own. Cassandra also supports Type 4 UUIDs through the UUID type. These are just random bits. There is a collision probability, but (assuming uncorrelated random number sources, which they will be if you generate in Java) it is extremely low: if you created 1 billion a second, it would take about 86 years for the probability of one collision to reach about 50%. (See http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates for more details.)
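To see what "time-based" means in practice, the sketch below creates a version 1 UUID with Python's standard uuid module and decodes the timestamp embedded in it. It only illustrates the UUID format, not how Cassandra or its drivers generate TimeUUIDs; the helper name is made up.

```python
import uuid
from datetime import datetime, timezone

# 100-nanosecond intervals between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch)
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Decode the creation time stored in a version 1 UUID."""
    if value.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    unix_seconds = (value.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


time_based = uuid.uuid1()    # time + node + sequence, like a TimeUUID
random_based = uuid.uuid4()  # random bits only, like the plain UUID type

print(time_based, timeuuid_to_datetime(time_based))
print(random_based)
```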
You should investigate using Twitter Snowflake. From the project readme:
As we at Twitter move away from Mysql towards Cassandra, we've needed a new way to generate id numbers. There is no sequential id generation facility in Cassandra, nor should there be.
Snowflake uses an intuitive algorithm that generates longs which are both time-ordered and unique. Since your database is distributed, this service should suit your needs well.
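The general idea behind such time-ordered 64-bit ids can be sketched as follows. This is a simplified illustration, not Twitter's implementation: it assumes a layout of 41 timestamp bits, 10 worker bits, and 12 sequence bits, and the class name, the epoch default, and the handling of a clock that moves backwards are choices made for this example.

```python
import threading
import time


class SnowflakeLikeGenerator:
    """Simplified Snowflake-style ids: timestamp | worker id | sequence."""

    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id: int, epoch_ms: int = 1288834974657):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._epoch_ms = epoch_ms  # custom epoch; the value is illustrative
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now_ms = int(time.time() * 1000)
            if now_ms < self._last_ms:
                # Assumption for this sketch: never go back in time.
                now_ms = self._last_ms
            if now_ms == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence used up for this millisecond: wait for the next one.
                    while now_ms <= self._last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self._sequence = 0
            self._last_ms = now_ms
            timestamp_part = (now_ms - self._epoch_ms) << (
                self.WORKER_BITS + self.SEQUENCE_BITS
            )
            worker_part = self._worker_id << self.SEQUENCE_BITS
            return timestamp_part | worker_part | self._sequence


generator = SnowflakeLikeGenerator(worker_id=1)
print(generator.next_id(), generator.next_id())
```

Each worker needs its own worker_id, which is the only coordination required: ids from one generator are strictly increasing, and ids from different workers differ in the worker bits.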
As Richard said, you can use TimeUUID, and generating a TimeUUID value is not a big deal. Just follow the Cassandra FAQ on timeuuid.
Use the Cassandra function now() to generate a timeuuid value, and the uuid() function to generate a uuid value.
