Cassandra time sliced data model for unknown data

I caveat this question by stating: I am somewhat new to NoSQL and very new to Cassandra, but it seems like it might be a good fit for what I'm trying to do.
Say I have a list of sensors giving input at reasonable intervals. My proposed data model is to partition by the name of the sensor, its location (area), and the date (written as yyyyMMdd), and to cluster the readings for that day by the actual time the reading occurred. The thinking is that the query "get all readings from sensor A on date B" should be extremely quick. So far so good, I think. The table / CF looks like this in CQL:
CREATE TABLE data (
area_id int,
sensor varchar,
date ascii,
event_time timeuuid,
PRIMARY KEY ((area_id, sensor, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
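For reference, the "all readings from sensor A on date B" query against this table would then look something like the following (values illustrative); since the full partition key is specified, it reads a single partition, already sorted by event_time:
SELECT * FROM data WHERE area_id = 1 AND sensor = 'A' AND date = '20131106';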
The table doesn't, however, actually include any data, and I'm not sure how to add it to the model. Each reading (from the same sensor) can have a different set of arbitrary data, and I won't know ahead of time what this will be. E.g. I could get temperature data, I could get humidity, I could get both, or I could get something I haven't seen before. It's up to the person who actually recorded the data what they want to submit (it's not read from automated sensors).
Given that I want to run query operations on this data (which is basically UGC), what are my options? Queries will normally consist of counts on the data (e.g. count readings from sensor A on date B where some_ugc_valueX = C and some_ugc_valueY = D). It is worth noting that there will be more data points than would normally be queried at once: a reading could have 20 data values, but maybe only 2 or 3 would be queried; it's just unknown which ones ahead of time.
Currently I have thought of:
Store the data for each sensor reading as a Map type (see the sketch after these options). This would certainly make the model simple, but my understanding is that querying would then be difficult: I think I would need to pull the entire map back for each sensor reading, then check the values and count them outside of Cassandra in Storm/Hadoop/whatever.
Store each of the user values as another column (composite column with the event_time uuid). This would mean not using CQL, as that doesn't support adding arbitrary new columns at insert time. The Thrift API does, however, allow this. This means I can get Cassandra to do the counting itself.
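For illustration, the Map approach from option 1 might look something like this (the readings column name is hypothetical):
CREATE TABLE data_with_map (
area_id int,
sensor varchar,
date ascii,
event_time timeuuid,
readings map<text, text>, -- arbitrary user-submitted key/value pairs
PRIMARY KEY ((area_id, sensor, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Counting against the map contents would still have to happen client-side, as CQL (at least in the versions discussed here) cannot filter on map values.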
Maybe I'm going about this the wrong way? Maybe Cassandra isn't even the best choice for this kind of data?

tl;dr: you can't choose both speed and absolute flexibility ;-)
Queries based on user-generated content are going to be complex: you aren't going to be able to produce a one-size-fits-all table definition that allows quick responses for queries on UGC. Even if you choose to use Maps, Cassandra has to deserialize the entire data structure on every query, so it's not really an option for big Maps - which, as you suggest in your question, is likely to be the case.
An alternative might be to store the sensor data in a serialised form, e.g. JSON. This would give maximum flexibility in what is being stored, at the expense of being unable to make complex queries. The serialization/deserialization burden is pushed to the client, and all the data is sent over the wire. Here's a simple example:
Table creation (slightly simpler than your example - I've dropped date):
create table data(
area_id int,
sensor varchar,
event_time timeuuid,
data varchar,
primary key(area_id,sensor,event_time)
);
Insertion:
insert into data(area_id,sensor,event_time,data) VALUES (1,'sensor1',now(),'{"datapoint1":"value1"}');
insert into data(area_id,sensor,event_time,data) VALUES (1,'sensor2',now(),'{"datapoint1":"value1","count":"7"}');
Querying by area_id and sensor:
> select area_id,sensor,dateof(event_time),data from data where area_id=1 and sensor='sensor1';
 area_id | sensor  | dateof(event_time)       | data
---------+---------+--------------------------+-------------------------
       1 | sensor1 | 2013-11-06 17:37:02+0000 | {"datapoint1":"value1"}
(1 rows)
Querying by area_id:
> select area_id,sensor,dateof(event_time),data from data where area_id=1;
 area_id | sensor  | dateof(event_time)       | data
---------+---------+--------------------------+-------------------------------------
       1 | sensor1 | 2013-11-06 17:37:02+0000 | {"datapoint1":"value1"}
       1 | sensor2 | 2013-11-06 17:40:49+0000 | {"datapoint1":"value1","count":"7"}
(2 rows)
(Tested using [cqlsh 4.0.1 | Cassandra 2.0.1 | CQL spec 3.1.1 | Thrift protocol 19.37.0].)

Related

How to copy data from a Cassandra table to another structure for better performance

In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:
The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.
[...]
If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.
I have understood this, and now my question is: provided that I have an existing table, say
CREATE TABLE invoices (
id_invoice int PRIMARY KEY,
year int,
id_client int,
type_invoice text
)
But I want to query by year and type instead, so I'd like to have something like
CREATE TABLE invoices_yr (
id_invoice int,
year int,
id_client int,
type_invoice text,
PRIMARY KEY (type_invoice, year)
)
With type_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to the other so that I can run optimized queries later on?
My Cassandra version:
user#cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
You can use the cqlsh COPY command:
To export your invoices data to a CSV file, use:
COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';
And to load it back from the CSV file into the table, in your case invoices_yr, use:
COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';
If you have huge data, you can use the sstable writer to write, and sstableloader to load, the data faster.
http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
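For reference, once the sstables have been generated, a typical sstableloader invocation looks something like this (host and path are illustrative):
sstableloader -d 127.0.0.1 /path/to/generated/sstables/mykeyspace/invoices_yr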
To echo what was said about the COPY command, it is a great solution for something like this.
However, I will disagree with what was said about the bulk loader, as it is infinitely harder to use. Specifically, you need to run it on every node, whereas COPY needs to be run on only a single node.
To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.
COPY invoices(id_invoice, year, id_client, type_invoice)
TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.
For more info, check out this article titled: New options and better performance in cqlsh copy.
An alternative to using COPY command (see other answers for examples) or Spark to migrate data is to create a materialized view to do the denormalization for you.
CREATE MATERIALIZED VIEW invoices_yr AS
SELECT * FROM invoices
WHERE type_invoice IS NOT NULL AND year IS NOT NULL AND id_client IS NOT NULL AND id_invoice IS NOT NULL
PRIMARY KEY ((type_invoice), year, id_client, id_invoice)
WITH CLUSTERING ORDER BY (year DESC);
Cassandra will then fill the view for you, so you won't have to migrate the data yourself. With 3.5, be aware that repairs don't work well with materialized views (see CASSANDRA-12888).
Note: materialized views are probably not the best idea to use; they have since been demoted to "experimental" status.

Cassandra data modeling - Do I choose hotspots to make the query easier?

Is it ever okay to build a data model that makes the fetch query easier, even though it will likely create hotspots within the cluster?
While reading, please keep in mind I am not working with Solr right now and given the frequency this data will be accessed I didn’t think using spark-sql would be appropriate. I would like to keep this as pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses user_id and created_date (the day the transaction was created, always set to midnight) as the partition key:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date timestamp,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using created_date as part of the PK allows users to be spread around the cluster more evenly to prevent hotspots. However, from an access perspective it makes the data access layer do a bit more work than we would like. It ends up having to create an IN statement with all days in the provided range instead of giving a date and a greater-than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_ids for the user and filter on a specific date range within Cassandra. However, this creates a chance of hotspots within the cluster. Users with longevity and/or high volume will create quite a few more columns in the row. We intend to supply a TTL on the data so anything older than 60 days drops off. Additionally, I've analyzed the size of the data, and 60 days' worth of data for our most high-volume user is under 2 MB. Doing the math, if we assume that all 40,000 users (this number won't grow significantly) are spread evenly over a 3-node cluster at 2 MB of data per user, you end up with a max of just over 26 GB per node ((13333.33 * 2) / 1024). In reality, you aren't going to end up with 1/3 of your users doing that much volume, and you'd have to get really unlucky to have Cassandra, using vnodes, put all of those users on a single node. From a resources perspective, I don't think 26 GB is going to make or break anything either.
Thanks for your thoughts.
Data Model 1: Something else you could do would be to change your data access layer to do a query for each ID individually, instead of using the IN clause. Check out this page to understand why that would be better.
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
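For example, the IN query above becomes one single-partition query per day in the range, which the driver can issue concurrently (dates illustrative):
select * from transactions_by_user_and_day where user_id = ? and created_date = '2016-01-01' and created_date_time > ?;
select * from transactions_by_user_and_day where user_id = ? and created_date = '2016-01-02' and created_date_time > ?;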
Data Model 2: 26 GB of data per node doesn't seem like much, but a 2 MB fetch seems a bit large. Of course, if that's an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2 MB, that should be fine.
One other solution would be to use Data Model 2 with bucketing. This would give you more overhead on writes, though, as you'd have to maintain a bucket lookup table as well. Let me know if you need me to elaborate more on this approach.
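To sketch the bucketing idea (the month_bucket column is hypothetical; any coarse time bucket that caps partition growth works):
CREATE TABLE transactions_by_user_and_bucket (
user_id int,
month_bucket text, -- e.g. '2016-01'
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, month_bucket), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
The data access layer computes the buckets covering the requested range and queries each one, so reads still touch only a handful of bounded partitions.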

Kudu auto generated key column

I am trying to make a custom auto-generated/incremented key in Kudu which will keep increasing its value, from a starting seed that is zero by default.
It's pretty inefficient to go through all records and increment a counter to get a row count.
Does Kudu provide the rows count out of the box?
If not, what are the best way to get it?
Apache Kudu does not support AUTO_INCREMENT columns at this time. There is a FAQ entry on the Kudu web site that mentions this.
Kudu is a distributed storage engine that is focused on being a good analytical store (OLAP) as opposed to being a good transactional store (OLTP) and it shows in the features we've prioritized so far. This is a good example of that.
Because we're not trying to be an OLTP store, Kudu doesn't yet implement multi-row or multi-node transactions, and so a simple incrementing primary key counter would be difficult to implement correctly at this time -- especially for example when the table is hash-partitioned on the primary key. We'd need a central transaction coordinator that doesn't currently exist.
To answer your second question, getting a row count is currently a little expensive in Kudu as it involves scanning the index column on each tablet and summing up the total count. Apache Impala / Apache Spark SQL will do this transparently for you if you do a SELECT COUNT(*) from kudu_table but I wouldn't currently rely on that for the purposes of assigning a new ID, since Impala currently allows scanning from a slightly stale Kudu replica thus potentially being off on the row count.
The best thing to do right now is rely on some external mechanism to assign row IDs.
Source: I am a PMC member on Apache Kudu.
In addition to @JoeyVanHalen's answer, there is another option, which is also explained here on SO. You can use row_number() to create an ID which resembles a counter but does not force you into cumbersome nesting or anything else if you only want a counter-like column.
Straightforward, it looks like this:
SELECT
row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") AS incremented_id
FROM some_table;
row_number() creates an incremented number over a partition. Unlike rank(), row_number() guarantees an increment even if your partition contains duplicates.
PARTITION BY "dummy" treats the temporary "dummy" value at runtime as one partition spanning the entire table. Thus, the increment happens across all records.
ORDER BY follows the same "dummy" logic.
Of course, you can also replace "dummy" with whatever columns are appropriate for your table logic (see the variant after the result table below).
The result looks like:
-- ID = incremented_id
| ID | some_content |
|----|--------------|
| 1  | "a"          |
| 2  | "b"          |
| 3  | "c"          |
| 4  | "d"          |
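For example, replacing "dummy" with real columns (region and created_at are illustrative names):
SELECT
row_number() OVER (PARTITION BY region ORDER BY created_at) AS id_within_region
FROM some_table;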
There are several ways to get around this.
Use Impala's uuid() function to generate a unique ID.
Convert the uuid() to a BIGINT (via hashing, etc.).
Use Impala's unix_timestamp() to generate a BIGINT value representing the current date and time as a delta from the Unix epoch (this might cause some collisions, so better to add another column if you're going to use this as a primary key).
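To sketch the BIGINT variants in Impala (assuming Impala 2.5+ for uuid(); fnv_hash() is Impala's built-in 64-bit hash, and collision handling is left to the caller):
SELECT
fnv_hash(uuid()) AS id_from_uuid,
unix_timestamp(now()) AS id_from_time
FROM some_table;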

An Approach to Cassandra Data Model

Please note that I am using NoSQL for the first time, so pretty much every concept in this NoSQL world is new to me, having come from RDBMS for a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move away from MySQL where the transactional/relational model doesn't make sense. What I would get is the AP side of CAP [availability and partition tolerance].
The present data model is simple as this
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (integer)
We can safely assume that this part of the application is similar to activity logging!
I would like to move this to NoSQL per my requirements, separate from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking at the map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data in the values.
After reading about user-defined types in Cassandra, could I use a UDT as the value, which would essentially give me one key with multiple values? Or should I use normal columns without a UDT? One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to the same table, since the key varies from application to application and, within an application, each entity will be unique.
No application/business function accesses this data without the key; in simple terms, there is no requirement to get data randomly.
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the Cassandra data model a bit (or at least a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp, ...)
);
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with matches on subsequent clustering keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
In terms of structure, you have a few column types: some primitives like text, int, etc., but also three collections - sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you need to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which would let you store "arbitrary" key-value data; this is what it seems you're looking to do.
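For instance, the addresses example might be modeled like this (type and column names are illustrative; note that a UDT inside a collection must be frozen):
CREATE TYPE address (
street text,
city text
);
CREATE TABLE person (
id uuid PRIMARY KEY,
name text,
addresses map<text, frozen<address>> -- e.g. 'home' mapped to an address value
);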
One thing to watch out for: your primary key is unique per record. If you do another insert with the same PK, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
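A quick illustration of upsert semantics, assuming for simplicity that the primary key is just (id, timestamp); the literal values are arbitrary:
-- Both statements target the same primary key; the second silently overwrites the first.
INSERT INTO event (id, timestamp, some_column)
VALUES (5132b130-ae79-11e4-ab27-0800200c9a66, 50554d6e-29bb-11e5-b345-feff819cdc9f, 'first');
INSERT INTO event (id, timestamp, some_column)
VALUES (5132b130-ae79-11e4-ab27-0800200c9a66, 50554d6e-29bb-11e5-b345-feff819cdc9f, 'second');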
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources, so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very, very good choice.

Cassandra data model for application logs (billions of operations!)

Say I want to collect logs from a huge application cluster which produces 1000-5000 records per second. In the future this number might reach 100,000 records per second, aggregated from a 10,000-strong datacenter.
CREATE TABLE operation_log (
-- Seconds will be used as row keys, thus each row will
-- contain 1000-5000 log messages.
time_s bigint,
time_ms int, -- Milliseconds (to sort data within one row).
uuid uuid, -- Monotonic UUID (NOT time-based UUID1)
host text,
username text,
accountno bigint,
remoteaddr inet,
op_type text,
-- For future filters: renaming a column must be faster
-- than adding a column?
reserved1 text,
reserved2 text,
reserved3 text,
reserved4 text,
reserved5 text,
-- 16*n bytes of UUIDs of connected messages, usually 0,
-- sometimes up to 100.
submessages blob,
request text,
PRIMARY KEY ((time_s), time_ms, uuid) -- Partition on time_s
-- Because queries will be "from current time into the past"
) WITH CLUSTERING ORDER BY (time_ms DESC);
CREATE INDEX oplog_remoteaddr ON operation_log (remoteaddr);
...
(secondary indices on host, username, accountno, op_type);
...
CREATE TABLE uuid_lookup (
uuid uuid,
time_s bigint,
time_ms int,
PRIMARY KEY (uuid));
I want to use OrderedPartitioner, which will spread data all over the cluster by its time_s (seconds). It must also scale to dozens of concurrent data writers as more application log aggregators are added to the application cluster (uniqueness and consistency are guaranteed by the uuid part of the PK).
Analysts will have to look at this data by performing these sorts of queries:
range query over time_s, filtering on any of the data fields (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
pagination query from the results of the previous one (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND token(uuid) < token($uuid) AND $filters),
count messages filtered by any data fields within a time range (SELECT COUNT(*) FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
group all data by any of the data fields within some range (will be performed by application code),
request dozens or hundreds of log messages by their uuid (hundreds of SELECT * FROM uuid_lookup WHERE uuid IN [00000005-3ecd-0c92-fae3-1f48, ...]).
My questions are:
Is this a sane data model?
Is using OrderedPartitioner the way to go here?
Does provisioning a few columns for potential filter make sense? Or is adding a column every once in a while cheap enough to run on a Cassandra cluster with some reserved headroom?
Is there anything that prevents it from scaling to 100,000 inserted rows per second from hundreds of aggregators and storing a petabyte or two of queryable data, provided that the number of concurrent queries will never exceed 10?
This data model is close to a sane model, with several important modifications/caveats:
Do not use ByteOrderedPartitioner, especially not with time as the key. Doing this will result in severe hotspots on your cluster, as you'll do most of your reads and all your writes to only part of the data range (and therefore a small subset of your cluster). Use Murmur3Partitioner.
To enable your range queries, you'll need a sentinel key: a key you can know in advance. For log data, this is probably a time bucket plus some other known value that's not time-based, so your writes are evenly distributed (see the sketch after this list).
Your indices might be ok, but it's hard to tell without knowing your data. Make sure your values are low in cardinality, or the index won't scale well.
Make sure any potential filter columns adhere to the low-cardinality rule. Better yet, if you don't need real-time queries, use Spark to do your analysis. You should create new columns as needed; this is not a big deal, since Cassandra stores them sparsely. If you use Spark, you can even store these values in a map.
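As a sketch of the sentinel-key idea from point 2 (time_bucket and shard are hypothetical; shard is a small integer, say 0-9, chosen round-robin per write so one second's writes land on several nodes):
CREATE TABLE operation_log_bucketed (
time_bucket bigint, -- e.g. time_s rounded down to the hour
shard int, -- 0-9 spread, picked by the writer
time_s bigint,
time_ms int,
uuid uuid,
host text,
request text,
PRIMARY KEY ((time_bucket, shard), time_s, time_ms, uuid)
) WITH CLUSTERING ORDER BY (time_s DESC, time_ms DESC);
A range query then fans out over the buckets and shards covering the requested window and merges the results client-side.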
If you follow these guidelines, you can scale as big as you want. If not, you will have very poor performance and will likely get performance equivalent to a single node.
