Cassandra: Does timeuuid preserve order?

I was using a timestamp as the primary key for my data by calling toTimestamp(now()), but unfortunately this causes collisions.
I understand that timeuuid guarantees uniqueness, but if I do ORDER BY timeuuid, does timeuuid also guarantee the original order?

From the docs:
Timeuuid types can be entered as integers for CQL input. A value of the timeuuid type is a Version 1 UUID. A Version 1 UUID includes the time of its generation and are sorted by timestamp, making them ideal for use in applications requiring conflict-free timestamps. For example, you can use this type to identify a column (such as a blog entry) by its timestamp and allow multiple clients to write to the same partition key simultaneously. Collisions that would potentially overwrite data that was not intended to be overwritten cannot occur.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/uuid_type_r.html
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html
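To make "sorted by timestamp" concrete, here is a minimal Python sketch. It assumes, as the quoted docs state, that timeuuid values are ordered by their embedded timestamp rather than by their textual form; `make_v1` is a hypothetical helper written for this illustration and is not part of Cassandra or any driver. Keep in mind that the order reflects the generation timestamps recorded by whichever client or node created each value, so if their clocks disagree, the sorted order can differ from the true arrival order.

```python
import uuid

def make_v1(timestamp_100ns, clock_seq=0, node=0x0123456789AB):
    """Build a version 1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = timestamp_100ns & 0xFFFFFFFF
    time_mid = (timestamp_100ns >> 32) & 0xFFFF
    time_hi = (timestamp_100ns >> 48) & 0x0FFF
    fields = (time_low, time_mid, time_hi,
              (clock_seq >> 8) & 0x3F, clock_seq & 0xFF, node)
    # version=1 makes the constructor set the version and variant bits
    return uuid.UUID(fields=fields, version=1)

# Two timestamps one tick apart, chosen so the low 32 bits wrap around
earlier = make_v1(0xFFFFFFFF)
later = make_v1(0x100000000)

# Sorting by text puts the later value first, because the low bits come first
by_text = sorted([later, earlier], key=str)
# Sorting by the embedded timestamp restores generation order
by_time = sorted([later, earlier], key=lambda u: u.time)

print([str(u) for u in by_text])
print([str(u) for u in by_time])
```

The point of the contrast is that ordering a timeuuid column is not the same as ordering UUID strings: only a comparison on the embedded time field follows generation time.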

Related

Partition key with UDT field in CQL

I have a little question concerning the partition key in Cassandra.
Suppose I create a table that contains a field called flxB whose type is a UDT like this:
CREATE TYPE fluxes (
flux float,
flux_prec smallint,
flux_error float,
flux_error_prec smallint,
flux_bibcode text,
system text
);
Can I put the field flxB.flux in my partition key?
No, you can't put flxB.flux in any part of the primary key.
In Cassandra versions lower than 3.0, a UDT field must even be defined as frozen:
When using the frozen keyword, you cannot update parts of a user-defined type value. The entire value must be overwritten. Cassandra treats the value of a frozen, user-defined type like a blob.
In Cassandra, every part of the primary key must be present when inserting or updating. If Cassandra allowed you to put flxB.flux in the partition key, how would it make sure that every part of the primary key is present in the insert/update query?

How to generate UUID(Long) using cassandra timestamp in cluster environment?

I have a requirement to generate a UUID as a Long value in Java, based on the Cassandra timestamp of a cluster. Can anyone help me generate it using Java and the Cassandra cluster timestamp?
Use the timeuuid CQL3 data type:
A value of the timeuuid type is a Type 1 UUID. A type 1 UUID includes the time of its generation and are sorted by timestamp, making them ideal for use in applications requiring conflict-free timestamps. For example, you can use this type to identify a column (such as a blog entry) by its timestamp and allow multiple clients to write to the same partition key simultaneously. Collisions that would potentially overwrite data that was not intended to be overwritten cannot occur.
In Java you can use the UUIDs helper class (com.datastax.driver.core.utils.UUIDs):
UUIDs.timeBased()
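As for the "Long" part of the question: a UUID is 128 bits wide, so the whole value does not fit into a 64-bit integer. What does fit is the timestamp embedded in a time-based UUID, although that number on its own is no longer guaranteed to be unique. The following Python sketch shows the arithmetic under that assumption; `unix_millis` and `UUID_EPOCH_OFFSET` are names made up for this example.

```python
import uuid

# Distance between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01),
# expressed in 100-nanosecond units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def unix_millis(u):
    """Return the embedded timestamp of a version 1 UUID as Unix milliseconds."""
    if u.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    # u.time counts 100 ns ticks since the UUID epoch; 10,000 ticks per ms
    return (u.time - UUID_EPOCH_OFFSET) // 10_000

generated = uuid.uuid1()
print(generated, unix_millis(generated))
```

If uniqueness matters, keep the full timeuuid as the key and derive the integer timestamp only where a number is needed.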

An Approach to Cassandra Data Model

Please note that this is my first time using NoSQL, and pretty much every concept in the NoSQL world is new to me after a long time with RDBMSs!
In one of my heavily used applications, I want to use NoSQL for part of the data and move it out of MySQL, where transactions and the relational model don't make sense for it. What I would gain is the A and P of CAP [Availability and Partition Tolerance].
The present data model is as simple as this:
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (integer)
We can safely assume that this part of application is similar to Logging of the Activity!
I would like to move this to NoSQL as per my requirements and separate it from the performance-oriented MySQL DB.
Cassandra says that everything in it is a simple Map<Key,Value> type! Thinking at the map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data in the values!
After reading about User Defined Types in Cassandra: can I use a user-defined type as the value, which would essentially give me one key with multiple values? Or should I use normal columns without a user-defined type? One idea is to use the same model for different applications across systems, so that simple logging/activity data can be pushed to the same table, since the key varies from application to application and within an application each entity will be unique!
No application or business function accesses this data without the key; in simple terms, there is no requirement to fetch data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp .... );
Note the primary key: multiple columns are specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then applied to the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with equality matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
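The partition and clustering behaviour described here can be pictured with a small in-memory toy model in Python. This is only a sketch for intuition, not how Cassandra is implemented; `ToyTable` and its method names are invented for this example.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict

class ToyTable:
    """Toy model: each partition keeps its rows ordered by clustering key."""

    def __init__(self):
        self._keys = defaultdict(list)  # partition key -> sorted clustering keys
        self._rows = {}                 # (partition key, clustering key) -> value

    def insert(self, pk, ck, value):
        if (pk, ck) not in self._rows:
            insort(self._keys[pk], ck)  # keep clustering keys in order
        self._rows[(pk, ck)] = value    # same primary key: value is overwritten

    def slice(self, pk, start, end):
        """Rows of one partition with start <= clustering key <= end."""
        keys = self._keys.get(pk, [])
        lo, hi = bisect_left(keys, start), bisect_right(keys, end)
        return [(ck, self._rows[(pk, ck)]) for ck in keys[lo:hi]]

table = ToyTable()
table.insert("sensor-1", 30, "c")
table.insert("sensor-1", 10, "a")
table.insert("sensor-1", 20, "b")
table.insert("sensor-2", 15, "x")

# Equality on the partition key, range on the clustering key
print(table.slice("sensor-1", 10, 20))
```

A range read is cheap here only because the partition is known first; there is no equivalent shortcut for a range over partition keys.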
In terms of structure, you have a few column types: some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses (a map whose values are an address UDT). You would typically store info in columns if you needed to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which would let you store "arbitrary" key-value data, which is what it seems you're looking to do.
One thing to watch out for: your primary key is unique per record. If you do another insert with the same primary key, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources....so you should be able to aggregate data across mysql and cassandra for analytics).
Lastly, if your data is time series log data, cassandra is a very very good choice.

How to get a range of data from Cassandra

[cqlsh 5.0.1 | Cassandra 2.1.0 | CQL spec 3.2.0 | Native protocol v3]
table:
CREATE TABLE dc.event (
id timeuuid PRIMARY KEY,
name text
) WITH bloom_filter_fp_chance = 0.01;
How do I get a time range of data from Cassandra?
For example, when I try 'select * from event where id > maxTimeuuid('2014-11-01 00:05+0000') and id < minTimeuuid('2014-11-02 10:00+0000')', as seen here http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/timeuuid_functions_r.html
I get the following error: 'code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"'
Can I keep timeuuid as primary key and meet the requirement?
Thanks
Can I keep timeuuid as primary key and meet the requirement?
Not really, no. From http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html
WHERE clauses can include a greater-than and less-than comparisons,
but for a given partition key, the conditions on the clustering column
are restricted to the filters that allow Cassandra to select a
contiguous ordering of rows.
You could try adding "ALLOW FILTERING" to your query... but I doubt that would work. And I don't know of a good way (and neither do I believe there is a good way) to tokenize the timeuuids. I'm about 99% sure the ordering from the partitioner would yield unexpected, bad results, even though the query itself would execute and appear correct until you dug into it.
As an aside, you should really check out a similar question that was asked about a year ago: time series data, selecting range with maxTimeuuid/minTimeuuid in cassandra
Short answer: no. Long answer: you can do something similar, e.g.:
CREATE TABLE dc.event (
event_time timestamp,
id timeuuid,
name text,
PRIMARY KEY(event_time, id)
) WITH bloom_filter_fp_chance = 0.01;
The timestamp would presumably be truncated so that it only reflected a whole day (or hour or minute depending on the velocity of your data). Your where clause would have to include the "IN" parameter for the timestamps that are included in your timeuuid range.
If you use an appropriate chunking factor (how much you truncate your timestamp), you may even answer some of the questions you're looking for without using a range of timeuuids, just a simple where clause.
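The bucketing idea above can be sketched in Python: truncate each end of the range to its day and list every day in between, which gives the values for the IN clause. This assumes daily buckets stored as midnight UTC in the event_time column of the table above; `day_buckets` is a hypothetical helper, and the generated query text is only an illustration that has not been run against a cluster.

```python
from datetime import datetime, timedelta, timezone

def day_buckets(start, end):
    """Return midnight-UTC datetimes for every day touched by [start, end]."""
    day = start.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)
    last = end.astimezone(timezone.utc)
    buckets = []
    while day <= last:
        buckets.append(day)
        day += timedelta(days=1)
    return buckets

start = datetime(2014, 11, 1, 0, 5, tzinfo=timezone.utc)
end = datetime(2014, 11, 2, 10, 0, tzinfo=timezone.utc)

buckets = day_buckets(start, end)
in_list = ", ".join("'" + b.strftime("%Y-%m-%d %H:%M+0000") + "'" for b in buckets)

# Partition keys are enumerated; the timeuuid range applies inside each one
query = (
    "SELECT * FROM dc.event WHERE event_time IN (" + in_list + ") "
    "AND id > maxTimeuuid('2014-11-01 00:05+0000') "
    "AND id < minTimeuuid('2014-11-02 10:00+0000')"
)
print(query)
```

The coarser the bucket, the shorter the IN list but the larger each partition, so the chunking factor is a trade-off that depends on the write rate.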
Essentially this allows you the leeway to make the kind of query you're looking for while respecting the restrictions in Cassandra. As Raedwald pointed out, you can't use the partition key in continuous ranges because of the underlying nature of Cassandra as a large hash. That being said, Cassandra is well known to do some incredibly powerful things with time-series data.
Take a look at how Newts is doing time series for ranges. The author has a great set of slides and a talk describing the data model to get precisely what you seem to be looking for. https://github.com/OpenNMS/newts/
Cassandra cannot do this kind of query because Cassandra is a key-value store implemented using a giant hash map, not a relational database. Just like an in-memory hash map, the only way to find the key values within a sub-range is to iterate through all the keys. That can be expensive enough for an in-memory hash map, but for Cassandra it would be crippling.
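A short Python sketch of why hashing rules out range scans over partition keys: once keys are placed by their hash, neighbouring keys no longer sit next to each other. SHA-256 is used here purely as a stand-in hash; Cassandra's default partitioner is Murmur3Partitioner, and `token` below is a made-up function, not the real one.

```python
import hashlib

def token(key):
    """Stand-in for a partitioner: hash the key to a large integer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

# Twenty consecutive days, already in date order
keys = ["2014-11-%02d" % day for day in range(1, 21)]

# Placement follows the hash, so date order is lost
by_token = sorted(keys, key=token)
print(by_token[:5])
```

Finding every key between two dates would therefore mean visiting all tokens, which is the full scan the answer warns about.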
Yes, you can do it by using spark with scala and spark-cassandra-connector!
I think you should keep the number of partition keys small by setting them to 'YYYY-MM-dd hh:00+0000' and filter on dates and hours only.
Then you could use something like:
import com.datastax.spark.connector._
// Assumes the partition key column is named id and holds the hour bucket as text
case class TableKey(id: String)
val dates = Array("2014-11-02 10:00+0000","2014-11-02 11:00+0000","2014-11-02 12:00+0000")
val selected_data = sc.parallelize(dates).map(TableKey(_)).joinWithCassandraTable("dc", "event")
And there you have your selected data rdd that you could collect:
val data = selected_data.collect
I had similar problem...

Cassandra Compound primary key CQL3

If I want to partition my primary key by time window, would it be better (for storage and retrieval efficiency) to use a textual representation of the time or a truncated native timestamp, i.e.
CREATE TABLE user_data (
user_id TEXT,
log_day TEXT, -- store as 'yyyymmdd' string
log_timestamp TIMESTAMP,
data_item TEXT,
PRIMARY KEY ((user_id, log_day), log_timestamp));
or
CREATE TABLE user_data (
user_id TEXT,
log_day TIMESTAMP, -- store as (timestamp-in-millis - (timestamp-in-millis mod 86400000))
log_timestamp TIMESTAMP,
data_item TEXT,
PRIMARY KEY ((user_id, log_day), log_timestamp));
Regarding your column key "log_timestamp":
If you are working with multiple writing clients (which I suggest, since otherwise you probably won't get near the possible throughput of a distributed, write-optimized database like C*), you should consider using TimeUUIDs instead of timestamps, as they are conflict-free (assuming MAC addresses are unique). Otherwise you would have to guarantee that no two inserts happen at the same time, or you will lose data. You can do column slice queries on TimeUUIDs and other time-based operations.
I'd use unix time (i.e. 1234567890) over either of those formats - to point to an entire day, you'd just use the timestamp for 00:00.
However, I very much recommend reading Advanced Time Series with Cassandra on the DataStax dev blog. It covers some important things to consider in your model, with regards to bucketing/splitting.
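For comparison, the two day-bucket representations from the question can be computed side by side. This Python sketch assumes timestamps in milliseconds since the Unix epoch and day boundaries in UTC; the function names are invented for the example. One day is 86,400,000 milliseconds, which is the modulus to use when truncating.

```python
from datetime import datetime, timezone

MILLIS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000

def day_as_text(epoch_millis):
    """Day bucket as a 'yyyymmdd' string (UTC)."""
    moment = datetime.fromtimestamp(epoch_millis // 1000, tz=timezone.utc)
    return moment.strftime("%Y%m%d")

def day_as_millis(epoch_millis):
    """Day bucket as the timestamp of 00:00 UTC on the same day."""
    return epoch_millis - (epoch_millis % MILLIS_PER_DAY)

ts = 1414886700000  # 2014-11-02 00:05:00 UTC
print(day_as_text(ts))    # 20141102
print(day_as_millis(ts))  # 1414886400000
```

Both forms identify the same partition; the text form is easier to read in cqlsh, while the numeric form keeps arithmetic on bucket boundaries simple.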
