Generating timeuuids based on old timestamps cassandra - cassandra

I am porting in data from an old postgres system onto cassandra. How do I create TimeUUIDs based on existing timestamps? I want to use existing timestamps as the sorting key. Or am I better off storing the exact timestamp instead of generating the TimeUUID?

CQL3 provides functions for converting a timestamp into a UUID, these are called minTimeuuid and maxTimeuuid
SELECT * FROM myTable
WHERE t > maxTimeuuid('2013-01-01 00:05+0000')
AND t < minTimeuuid('2013-02-02 10:00+0000')
If you are using a datastax driver like the java-driver there should be a utility class like UUIDs which provides capabilities for converting between TimeUUIDs and timestamps.

Related

GeoMesa: Cassandra table with composite key

Is it possible to create a Cassandra table with GeoMesa specifying keys (ie - a composite key)? I have a spark job that writes to Cassandra and a composite key is necessary for the output table. I would now like to create/write that same table somehow through the GeoMesa api instead of directly to Cassandra. The format is like this:
CREATE TABLE IF NOT EXISTS mykeyspace.testcompkey (pkey1 text, ckey1 int, attr1 int, attr2 int, minlat decimal, minlong decimal, maxlat decimal, maxlong decimal, updatetime text, PRIMARY KEY((pkey1), ckey1) )
Is this possible? You can see also in the create table statement that I have a partition key and a clustering key. From what I have read, I believe Geoserver does support both Simple and Complex features. I am just wondering if that support also maps over into the realm of Cassandra with GeoMesa?
Thank you
GeoMesa does use composite partition and clustering keys for Cassandra tables, but the keys are not configurable by the user - they are designed to facilitate spatial/temporal/attribute CQL queries.
Keys can be seen in the index table implementations here. The columns field (for example here) defines the primary keys. Columns with partition = true are used for partitioning, the rest are used for clustering.

How to copy data from a Cassandra table to another structure for better performance

In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:
The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.
[...]
If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.
I have understood this, and now my question is: provided that I have an existing table, say
CREATE TABLE invoices (
id_invoice int PRIMARY KEY,
year int,
id_client int,
type_invoice text
)
But I want to query by year and type instead, so I'd like to have something like
CREATE TABLE invoices_yr (
id_invoice int,
year int,
id_client int,
type_invoice text,
PRIMARY KEY (type_invoice, year)
)
With id_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to another to perform optimized queries later on?
My Cassandra version:
user#cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
You can use cqlsh COPY command :
To copy your invoices data into csv file use :
COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';
And Copy back from csv file to table in your case invoices_yr use :
COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';
If you have huge data you can use sstable writer to write and sstableloader to load data faster.
http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
To echo what was said about the COPY command, it is a great solution for something like this.
However, I will disagree with what was said about the Bulk Loader, as it is infinitely harder to use. Specifically, because you need to run it on every node (whereas COPY needs to only be run on a single node).
To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.
COPY invoices(id_invoice, year, id_client, type_invoice)
TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.
For more info, check out this article titled: New options and better performance in cqlsh copy.
An alternative to using COPY command (see other answers for examples) or Spark to migrate data is to create a materialized view to do the denormalization for you.
CREATE MATERIALIZED VIEW invoices_yr AS
SELECT * FROM invoices
WHERE id_client IS NOT NULL AND type_invoice IS NOT NULL AND id_client IS NOT NULL
PRIMARY KEY ((type_invoice), year, id_client)
WITH CLUSTERING ORDER BY (year DESC)
Cassandra will fill the table for you then so you wont have to migrate yourself. With 3.5 be aware that repairs don't work well (see CASSANDRA-12888).
Note: that Materialized Views are probably not best idea to use and has been changed to "experimental" status

How to generate UUID(Long) using cassandra timestamp in cluster environment?

I have the requirement where we need to generate UUID as Long value using Java based on Cassandra timestamp which is in cluster. Can anyone help how to geranate it using java and cassandra cluster timestamp combination?
Use TimeUUID cql3 data type:
A value of the timeuuid type is a Type 1 UUID. A type 1 UUID includes the time of its generation and are sorted by timestamp, making them ideal for use in applications requiring conflict-free timestamps. For example, you can use this type to identify a column (such as a blog entry) by its timestamp and allow multiple clients to write to the same partition key simultaneously. Collisions that would potentially overwrite data that was not intended to be overwritten cannot occur.
In Java you can use UUIDs helper class from com.datastax.driver.core.utils.UUIDs:
UUIDs.timeBased()

Dynamic Column Family in Datastax Cassandra

As there is a way to create dynamic column family in Cassandra through CQL 3 i.e using a composite primary key with COMPACT STORAGE.
For inserting data in dynamic column family (wide rows), which will be efficient way, datastax java driver or Thrift API’s.
As I'm using Datastax, and Datastax highly suggest using non-compact tables for new developments, inspite non-compact table are less "compact" internally, then how should I need to create dynamic column family, with COMPACT STORAGE or without COMPACT STORAGE.
Please suggest.
To answer your other question. You should stay away from the Thrift API for new development. If you use the Trift API you will miss out on all the great features of CQL 3 including:
Compound Keys
Collections & other types
Usability
Use the latest Datastax Java driver!
(2.1)
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/whatsNew2.html
To do similar things to what you previously did with a "wide row" you will want to use a compound primary key that includes clustering columns.
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
PRIMARY KEY (sensor_id, collected_at)
);
See the following blog posts for more information on using wide partitions in CQL3 and how old thrift terminology relates to what you can do in CQL3:
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts
http://www.datastax.com/dev/blog/thrift-to-cql3

Hector support for CQL3 specific features (Partition & Clustering keys) and Compact Storage option

I'm trying to leverage a specific feature of Apache Cassandra CQL3, which is partition and clustering keys for tables which are created with compact storage option.
For Eg.
CREATE TABLE EMPLOYEE(id uuid, name text, field text, value text, primary key(id, name , field )) with compact storage;
I've created the table via CQL3 and i;m able to insert rows successfully using the Hector API.
But I couldn't find right set of options in the hector api to create the table itself as i require.
To elaborate a little bit more:
In ColumnFamilyDefinition.java i couldnt see an option for setting storage option (as compact storage) and In ColumnDefinition.java, i couldnt find the option to say that this column is part of the Partition and Clustering Keys
Could you please give me an idea of whether i can use Hector for this (i.e. Creating table) or not and if i can do that, what are the options that i need to provide?
If you are not tied to Hector, you could look into the DataStax Java Driver which was created to use CQL3 and Cassandra's binary protocol.

Resources