What type should I use to store binary UUID values? - cassandra

I have a binary 16 UUID like this:
<Buffer 11 ed a7 28 25 a2 57 40 80 61 81 ed 9d 30 12 8e>
That comes from using the binary-uuid library in Node.js.
What type should I use while creating my table schema in Cassandra?

To answer your question directly, binary data can be stored in a CQL blob type.
However, it's not clear to me whether (a) your use case really requires storing binary UUIDs, or (b) you are using artificial IDs as the partition key.
If it's (b), then I would recommend using natural IDs instead. For example, if you have a table of users, then use usernames or email addresses. They are much better than artificial IDs. Cheers!

Cassandra works with UUIDs rather than sequences or integer auto-increments. The reason comes from the distributed nature of Cassandra: you want a node to be able to generate a new randomized identifier without having to lock the other nodes.
So uuid is a native CQL type, and it is the one you want in your design.
CREATE TABLE IF NOT EXISTS something_by_id (
    uid uuid,
    name text,
    PRIMARY KEY ((uid))
);
How to use it with Node.js?
Install the Cassandra driver: npm install cassandra-driver
Import the type in your file: const Uuid = require('cassandra-driver').types.Uuid;
Use it: const id = Uuid.random();
More details can be found here. Note the double parentheses in the table definition. It is important to understand how a primary key is built in Cassandra, with its two parts, partition keys and clustering columns, which do not serve the same purpose.
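For comparison outside Node, the same idea is available in most languages. Here is a minimal Python sketch (standard library only, no Cassandra driver involved) generating the kind of random version 4 UUID that the CQL uuid type stores:

```python
import uuid

# A random (version 4) UUID -- the kind of value the CQL `uuid` type stores.
new_id = uuid.uuid4()
print(new_id)
assert new_id.version == 4
```

If you later use the DataStax Python driver, it maps the CQL uuid type to Python's uuid.UUID, so values like this can be bound to statements directly.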

Related

How to copy data from a Cassandra table to another structure for better performance

In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:
The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.
[...]
If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.
I have understood this, and now my question is: provided that I have an existing table, say
CREATE TABLE invoices (
id_invoice int PRIMARY KEY,
year int,
id_client int,
type_invoice text
)
But I want to query by year and type instead, so I'd like to have something like
CREATE TABLE invoices_yr (
id_invoice int,
year int,
id_client int,
type_invoice text,
PRIMARY KEY (type_invoice, year)
)
With type_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to another to perform optimized queries later on?
My Cassandra version:
user#cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
You can use the cqlsh COPY command:
To export your invoices data to a CSV file, use:
COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';
And to load from the CSV file into the new table, in your case invoices_yr, use:
COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';
If you have a huge amount of data, you can use the SSTable writer to write, and sstableloader to load the data faster.
http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
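Conceptually, the two COPY commands are just a CSV round trip. The sketch below simulates that round trip with Python's csv module (the rows are invented for illustration):

```python
import csv
import io

# Rows as they might come out of the invoices table.
invoices = [
    (1, 2017, 101, "credit"),
    (2, 2018, 102, "debit"),
]

# COPY invoices(...) TO 'invoices.csv' -- dump the rows as CSV.
buf = io.StringIO()
csv.writer(buf).writerows(invoices)

# COPY invoices_yr(...) FROM 'invoices.csv' -- read the CSV back for the
# new table (everything comes back as strings, so re-cast the integers).
buf.seek(0)
restored = [(int(i), int(y), int(c), t) for i, y, c, t in csv.reader(buf)]
assert restored == invoices
```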
To echo what was said about the COPY command, it is a great solution for something like this.
However, I will disagree with what was said about the bulk loader, as it is much harder to use. Specifically, you need to run it on every node, whereas COPY only needs to be run on a single node.
To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters:
COPY invoices(id_invoice, year, id_client, type_invoice)
TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.
For more info, check out this article titled: New options and better performance in cqlsh copy.
An alternative to using COPY command (see other answers for examples) or Spark to migrate data is to create a materialized view to do the denormalization for you.
CREATE MATERIALIZED VIEW invoices_yr AS
SELECT * FROM invoices
WHERE type_invoice IS NOT NULL AND year IS NOT NULL AND id_invoice IS NOT NULL
PRIMARY KEY ((type_invoice), year, id_invoice)
WITH CLUSTERING ORDER BY (year DESC);
Note that every column in the view's primary key must carry an IS NOT NULL filter, and the view's key must include every primary-key column of the base table, which is why id_invoice appears here. Cassandra will then fill the view for you, so you won't have to migrate the data yourself. With 3.5, be aware that repairs don't work well with materialized views (see CASSANDRA-12888).
Note: materialized views are probably not the best idea to use; they have since been downgraded to "experimental" status.

How to generate UUID(Long) using cassandra timestamp in cluster environment?

I have a requirement where we need to generate a UUID as a Long value using Java, based on the timestamp of a Cassandra cluster. Can anyone help with how to generate it using a combination of Java and the Cassandra cluster timestamp?
Use the timeuuid CQL3 data type:
A value of the timeuuid type is a Type 1 UUID. A type 1 UUID includes the time of its generation and are sorted by timestamp, making them ideal for use in applications requiring conflict-free timestamps. For example, you can use this type to identify a column (such as a blog entry) by its timestamp and allow multiple clients to write to the same partition key simultaneously. Collisions that would potentially overwrite data that was not intended to be overwritten cannot occur.
In Java you can use UUIDs helper class from com.datastax.driver.core.utils.UUIDs:
UUIDs.timeBased()
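The same kind of version 1 (time-based) UUID exists outside the DataStax driver too. As a sketch, Python's standard uuid module can generate one and recover the embedded timestamp (the epoch-offset constant below is the standard RFC 4122 value, not something Cassandra-specific):

```python
import uuid

# Version 1 UUIDs embed a 60-bit timestamp counted in 100 ns
# intervals since the Gregorian epoch, 1582-10-15.
tid = uuid.uuid1()
assert tid.version == 1

# Number of 100 ns ticks between 1582-10-15 and the Unix epoch, 1970-01-01.
GREGORIAN_TO_UNIX = 0x01B21DD213814000

# Recover the wall-clock time (Unix seconds) the UUID was generated at.
unix_seconds = (tid.time - GREGORIAN_TO_UNIX) / 1e7
print(unix_seconds)
```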

An Approach to Cassandra Data Model

Please note that I am using NoSQL for the first time, and pretty much every concept in this NoSQL world is new to me, having come from the RDBMS world for a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move away from MySQL where transactions/the relational model don't make sense. What I would get is the AP of CAP [Availability and Partition Tolerance].
The present data model is simple as this
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (string) | ENTITY_DATA (text) | CREATED_ON (date) | VERSION (integer)
We can safely assume that this part of application is similar to Logging of the Activity!
I would like to move this to NoSQL as per my requirements and separate from Performance Oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key,Value> type! Thinking in Map terms,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data as values!
After reading about User Defined Types in Cassandra: can I use a user-defined type as the value, which essentially gives one key and multiple values? Otherwise, use normal columns without a user-defined type. One idea is to use the same model for different applications across systems, since it is simple logging/activity data that can be pushed to the same store; the key varies from application to application, and within an application each entity will be unique.
There is no application/business function that accesses this data without the key; in simple terms, there is no requirement to fetch data randomly.
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
    id uuid,
    timestamp timeuuid,
    some_column text,
    some_column2 list<text>,
    some_column3 map<text, text>,
    some_column4 map<text, text>,
    primary key (id, timestamp)  -- further clustering keys could follow here
);
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then applied within the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, more likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query allowed on the last clustering key specified in your query. Anyway, that's all about querying.
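A toy model may help make that layout concrete. This hypothetical Python sketch keeps one sorted row list per partition key, which is roughly the ordering guarantee described above:

```python
import bisect
from collections import defaultdict

# partition key -> rows kept sorted by clustering key
table = defaultdict(list)

def insert(pk, ck, value):
    # Rows inside one partition stay ordered by the clustering key.
    bisect.insort(table[pk], (ck, value))

insert("sensor-1", 30, "c")
insert("sensor-1", 10, "a")
insert("sensor-1", 20, "b")
insert("sensor-2", 5, "x")

# "Hitting a partition" is a single key lookup; a range filter on the
# clustering key is then a scan over already-sorted rows.
in_range = [v for ck, v in table["sensor-1"] if 10 <= ck <= 20]
assert in_range == ["a", "b"]
```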
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you needed to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which would let you store "arbitrary" key-value data; which is what it seems you're looking to do.
One thing to watch out for... your primary key is unique per record. If you do another insert with the same pk, you won't get an error; it'll simply overwrite the existing data. Everything in cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
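The upsert behaviour is easy to picture with the same map analogy: the full primary key is the map key, so a second insert with the same key silently replaces the first (hypothetical sketch, names invented):

```python
# The full primary key (partition key, clustering key) acts as the dict key,
# so an INSERT with the same key overwrites instead of raising an error.
events = {}

def upsert(pk, ck, data):
    events[(pk, ck)] = data

upsert("user-1", "2024-01-01", {"action": "login"})
upsert("user-1", "2024-01-01", {"action": "logout"})  # same key: overwritten

assert events[("user-1", "2024-01-01")] == {"action": "logout"}
assert len(events) == 1  # still a single "row"
```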
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources....so you should be able to aggregate data across mysql and cassandra for analytics).
Lastly, if your data is time series log data, cassandra is a very very good choice.

Regular expression search or LIKE type feature in cassandra

I am using DataStax Cassandra ver 2.0.
How do we search a Cassandra column for a value using a regular expression? Is there a way to achieve 'LIKE' (as in SQL) functionality?
I have created table with below schema.
CREATE TABLE Mapping (
id timeuuid,
userid text,
createdDate timestamp,
createdBy text,
lastUpdateDate timestamp,
lastUpdateBy text,
PRIMARY KEY (id,userid)
);
I inserted a few test records, as below.
id | userid | createdby
-------------------------------------+----------+-----------
30c78710-c00c-11e3-bb06-1553ee5e40dd | Jon | admin
3e673aa0-c00c-11e3-bb06-1553ee5e40dd | Jony | admin
441c4210-c00c-11e3-bb06-1553ee5e40dd | Jonathan | admin
I need to search for records where userid contains the word 'jon', so that the results include all matching records: Jon, Jony, Jonathan.
I know there is no SQL LIKE functionality in Cassandra.
But is there any way to achieve it in Cassandra?
(NOTE: I am using the DataStax Java driver as the client API.)
Are you using DSE or the community version? In the case of DSE, consider having a Solr node for these types of queries. If not, maybe use something like Lucene/Solr as an inverted index outside of Cassandra for that particular functionality. That may be a hassle if all you have is a Cassandra setup, in which case, build a manual inverted index as Ananth suggested. One option is to keep rows of 2-3 character prefixes that hold indices to partitions. You could query those, find the appropriate partitions client-side, and then issue another query against the target data.
There is a Lucene index for Cassandra. You can use it on the community edition too and perform regex searches.
You don't have regular-expression checks in CQL for now. The basic usage of Cassandra is as big-data storage, and the kind of functionality you asked for can be done in your own code in an optimised manner. If you still want to do it in Cassandra, my suggestion would be this:
Column family 1:
Id - a unique id for your userid
Name - jonny (or any name you would like to use)
Combinations - j, jo, jon, etc., and all the prefix combinations you want to match on
Query this and get the appropriate id for your query.
Use that id in your column family instead of the name directly, and query using that id.
Try to normalise such operations as much as possible. Cassandra is your base to build on: it provides availability of crucial data, not the flexibility of SQL.
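The manual scheme described above amounts to a client-maintained inverted index of name prefixes. A hypothetical Python sketch (ids and names invented for illustration):

```python
from collections import defaultdict

# First "column family": id -> name.
users = {1: "Jon", 2: "Jony", 3: "Jonathan", 4: "Alice"}

# Second one: every lowercase prefix of a name -> set of matching ids.
prefix_index = defaultdict(set)
for uid, name in users.items():
    lowered = name.lower()
    for i in range(1, len(lowered) + 1):
        prefix_index[lowered[:i]].add(uid)

# LIKE 'jon%' becomes a single lookup on the prefix row.
matches = sorted(prefix_index["jon"])
assert [users[i] for i in matches] == ["Jon", "Jony", "Jonathan"]
```

The index rows have to be rewritten in application code whenever a name changes, which is the normalisation cost the answer alludes to.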

Migration of Oracle database to Cassandra with auto increment ids

I need to migrate an Oracle database to Cassandra.
All the Oracle tables have auto-increment integer primary keys.
If we use an integer-like UUID in Cassandra to serve the same role as the auto-increment primary keys, can we set a start value, so that we can migrate the Oracle data seamlessly?
If there is any other better option available, please suggest it.
Usually you'd just use a timeuuid, so there is never a need to set a start value, even across restarts. Another option is something like PlayOrm's unique keys, which are a very short hostname (like b1, b2, b3) plus a unique id within that host machine. That is very much like a timeuuid but a lot shorter and a bit easier to read. PlayOrm is just one of many clients for Cassandra (an ORM-layer one).
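The PlayOrm-style key described here is just a short host prefix plus a per-host counter. A hypothetical sketch (the real PlayOrm format may differ):

```python
import itertools

def make_id_generator(host_prefix, start=1):
    # Each host hands out ids from its own sequence: no cross-node locking,
    # and no collisions because the host prefix differs per node.
    counter = itertools.count(start)
    return lambda: f"{host_prefix}-{next(counter)}"

next_id = make_id_generator("b1")
assert next_id() == "b1-1"
assert next_id() == "b1-2"

# A different host generates from its own independent sequence.
other = make_id_generator("b2")
assert other() == "b2-1"
```

For the Oracle migration, each host's counter could be seeded above the largest existing auto-increment value (the start parameter above), which gives the "start value" behaviour asked about.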
