Data modelling of raw data for further transformation in Cassandra

I am working on a system for storing and processing time series data from a couple of plants. Every plant has a different number of raw measurement values, each of them represented as a key-value pair.
The raw data needs to be preprocessed to obtain semantics. I also need to keep the raw data, because the transformation process should be configurable. As I am new to NoSQL databases and Cassandra, I searched for resources on the web and found the weather station example (described similarly in other resources, too).
My requirements are similar to this example, but as an extension I need a way to store a variable number of measurement values (key-value pairs) per plant. I also know that my table model depends heavily on the queries I want to run against it. The most common queries will be:
Get all values per key for a specific time (range) and plant.
Get all values per multiple keys for a specific time (range) and plant.
My question now is: what would a table structure that best fits these requirements look like?
I thought about something like this, but I don't know whether it has drawbacks:
CREATE TABLE values_per_day (
plant_id text,
date text,
event_time timestamp,
key text,
value text,
PRIMARY KEY ((plant_id, date), event_time, key)
);

The recommendation for Cassandra is to start with the queries you want to perform. For each query, consider its inputs, which indicate what data you want it to return. For each query you should have a table that has the inputs to the query as its primary key. If you want to query for a range of values, that value should be a clustering key (not the partition key), with the other inputs forming the partition key. If you want to query over very long value ranges, consider slicing that value into buckets.
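Applied to the plant example above, a sketch might look like this (keeping the asker's day bucket in the partition key, but clustering on key before event_time so both common queries stay within one partition; the bucket size is an assumption that depends on write volume):
CREATE TABLE values_per_day (
plant_id text,
date text, -- day bucket, e.g. '2016-03-01', keeps partitions bounded
key text,
event_time timestamp,
value text,
PRIMARY KEY ((plant_id, date), key, event_time)
);
-- all values of one key for a plant within a time range on that day:
SELECT * FROM values_per_day
WHERE plant_id = 'plant-1' AND date = '2016-03-01'
AND key = 'temperature'
AND event_time >= '2016-03-01 00:00:00' AND event_time < '2016-03-01 12:00:00';
For multiple keys, either run one query per key, or use key IN (...) where your Cassandra version supports it.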

Related

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select the event with the highest start_time. Is it possible? I've tried to create a secondary index, but with no success.
This is the query I've created:
select * from active_events order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you would need to read all of it and find the highest value, which requires scanning data on multiple nodes and would be very slow.
A materialized view also won't help much, because ordering only exists inside an individual partition, so you would need to put all your data into a single partition, which could be huge and would leave the data imbalanced.
I can only think of the following workaround:
Have an additional table with all the columns of the original table, but with a fake partition key and no clustering columns.
Do inserts into that table in parallel to the normal inserts, but use a fixed value for the fake partition key, and explicitly set the write timestamp of the record equal to start_time (don't forget to multiply by 1000, as the write timestamp uses microseconds). That row is then guaranteed to hold the values with the highest start_time, because Cassandra won't overwrite them with data carrying a lower timestamp.
But this doesn't solve the problem of data skew, and all traffic for that table will be handled by a fixed number of nodes equal to the RF.
Another alternative - use another database.
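A minimal sketch of that workaround (table and column names are hypothetical; the USING TIMESTAMP value is start_time converted to microseconds):
CREATE TABLE latest_event (
fake_key int, -- always the same fixed value, e.g. 0
event_id text,
number text,
start_time timestamp,
PRIMARY KEY (fake_key)
);
-- written alongside every normal insert; the single row keeps whichever
-- values carry the highest write timestamp (last-write-wins)
INSERT INTO latest_event (fake_key, event_id, number, start_time)
VALUES (0, 'evt-42', '555-0100', '2019-06-01 10:00:00+0000')
USING TIMESTAMP 1559383200000000; -- start_time in microseconds
SELECT * FROM latest_event WHERE fake_key = 0;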
This type of query isn't viable in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions, each with thousands of rows, spread across hundreds of nodes. A full table scan in a large cluster would take a very long time if it were allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort results when (a) the query is restricted to a single partition, and (b) the rows are ordered by a clustering column. You cannot sort results on a column that is not part of the clustering key. Cheers!
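For contrast, a hypothetical sketch of the pattern the error message points at: bucket events into partitions and cluster on start_time, so the newest event per bucket is a cheap single-partition query (the day bucket is an assumption):
CREATE TABLE events_by_day (
day text,
start_time timestamp,
event_id text,
number text,
PRIMARY KEY (day, start_time, event_id)
) WITH CLUSTERING ORDER BY (start_time DESC, event_id ASC);
-- newest event for a known day: partition restricted by EQ, rows pre-sorted
SELECT * FROM events_by_day WHERE day = '2019-06-01' LIMIT 1;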

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am still more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without the primary key defined, as that is exactly my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz'. Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't grow too large (max 100 MB, otherwise you will run into trouble);
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine if, within a day, you don't have too many documents (so your partitions don't grow too much). And it allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24; If you need a range query across multiple months, you will need to issue multiple queries.
If your partitions are too large due to the data field, remove it from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partitions are still too large, you will need to add the hour to the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
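A sketch of that two-step lookup (the values are placeholders):
-- step 1: collect the uids for the date range from the lookup table
SELECT uid FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24;
-- step 2: fetch the payload for each uid from the main table
SELECT data FROM documents WHERE uid='xxxx-yyyy-zzzzz';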
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify how your data behaves with respect to user and creation time, but since it's a document, I am assuming that one user will create one document at any given "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data order by created for a given user.
For your first requirement, you can query as shown below:
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement, you can query as shown below:
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. Creating a separate table with the date as the primary key becomes tricky if many documents are inserted in the very same second. The clustering order works best when documents for a given user need to be sorted by time.

How to model repeated information on many records in Cassandra

I have a massive table with hundreds of millions of records, and I want to add a field whose value would be repeated across millions of records. I don't know how to model this efficiently in Cassandra. Allow me to elaborate:
I have a generic table:
CREATE TABLE readings (
key int,
key2 int,
time timestamp,
name text,
PRIMARY KEY ((key, key2), time)
)
This table has 700,000,000+ records.
I want to add a field to this table, named source. This field indicates where the record came from (the software has many ways of receiving information into the readings table). Possible values are "XML: path\to\file.xml", "Direct import from the X database", or even "Manually added". I want this to be a descriptive field, used exclusively to allow later maintenance of the database where we want to manipulate only records from a given source.
The queries I want to run that I can't now are:
Which records on the readings table were gotten from a given source?
What is the source of a given record?
A solution would be for me to create a table such as:
CREATE TABLE readings_per_source(
source text,
key int,
key2 int,
time timestamp,
PRIMARY KEY (source, key, key2, time)
)
which would allow me to execute the first query, but it would also mean creating 700,000,000+ new records in my database, which would take a lot of unnecessary storage space since tens of millions of these records would share the same value for source.
If this were a relational environment, I would create a source_id field on the readings table and a source table with id (PK) and name fields. That would mean storing only an additional integer for each row on the readings table, plus a new table with as many records as there are distinct sources.
How does one go about modelling this in cassandra?
Your schema
CREATE TABLE readings_per_source(
source text,
key int,
key2 int,
time timestamp,
PRIMARY KEY (source, key, key2, time)
)
is a very bad idea, because source is the partition key and you can have millions of records sharing the same source, i.e. a very, very wide partition --> hot spots.
For your second query, What is the source of a given record?, it is quite trivial if you access the data using the record's primary key (key, key2). The source column can just be added as a regular column to the table, as sketched below.
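A minimal sketch of that, assuming the existing readings table:
-- add source as a plain column; existing rows simply have it unset
ALTER TABLE readings ADD source text;
-- the source of a given record, fetched by its partition key:
SELECT source FROM readings WHERE key = 1 AND key2 = 2;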
For the first query Which records on the readings table were gotten from a given source? it is trickier. The idea here is to fetch all the records having the same source.
Do you realize that this query can potentially return tens of millions of records?
If that's what you want to do, there is a solution: use the new SASI secondary index (read my blog post for all the details) and create an index on the source column.
CREATE TABLE readings (
key int,
key2 int,
time timestamp,
name text,
source text,
PRIMARY KEY ((key, key2), time)
)
CREATE CUSTOM INDEX source_idx ON readings(source)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'PREFIX',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};
Then, to fetch all records having the same source, use the paging feature of the Java driver (or any other DataStax driver).
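With the index in place, the query is a plain predicate on source (the value here is one of the examples from the question):
SELECT * FROM readings WHERE source = 'Manually added';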
http://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise is a pretty good article on how to go about joining tables in Cassandra.
Normalized data will always take up less storage than de-normalized (flat) data (provided the related data is larger than the key used to join the tables), but it requires joins, which take more horsepower to compute during queries.
There's always a trade-off. There's also a tradeoff concerning state with fully normalized data, one example being the customer who changes addresses. In a fully normalized schema, once the address change is made, all invoices for the customer, past and present show the new address. This isn't always desirable.
Often it's desirable to partially normalize to provide historic state on records where it's important to show the state of the data at a given time, such as on invoices. In that case you'd store a copy of the customer address data on the invoice at the time of invoice creation.
This is especially important for pricing and taxes as well. You want the price/tax stored with the invoice so you can show what the customer paid at the time the invoice was created, and so that when accounting runs monthly, yearly, and longer-term numbers, the prices on a given invoice are correct for the date on the invoice, even though the prices of the products may have changed since. Otherwise you have an accounting nightmare!
There is a lot more to consider than simply storage space when deciding how to normalize/de-normalize a schema.
Sorry for rambling...

Cassandra Schema for standard SELECT/FROM/WHERE/IN query

Pretty new to Cassandra - I have data that looks like this:
<geohash text, category int, payload text>
The only query I want to run is:
SELECT category, payload FROM table WHERE geohash IN (list of 9 geohashes)
What would be the best schema in this case?
I know I could simply make my geohash the primary key and be done with it, but is there a better approach?
What are the benefits for defining PRIMARY KEY (geohash, category, payload)?
It depends on the size of your data for each row (geohash text, category int, payload text). If your payload size does not reach tens of MB, you may want to put more geohash values into the same partition by using an artificial bucketId int, so your query can be served from a single server. The schema would look like this:
geohash text, bucketId int, category int, payload text, where the partition key is (geohash, bucketId).
The recommendation is to keep partitions sizeable but <= 100 MB, so you don't have to look up too many partitions.
If you have a primary key on (geohash, category, payload), then you can have your data sorted on category and payload in the ascending order.
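A sketch of that bucketed schema (column names follow the answer; clustering on category and payload follows its last remark):
CREATE TABLE geohash_data_bucketed (
geohash text,
bucketId int,
category int,
payload text,
PRIMARY KEY ((geohash, bucketId), category, payload)
);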
So based on the query, it sounds like you're considering a CQL schema that looks like this:
CREATE TABLE geohash_data (
geohash text,
category int,
data text,
PRIMARY KEY (geohash)
);
In Cassandra, the first (and in this case only) column in your PRIMARY KEY is the Partition Key. The Partition Key is what's used to distribute data around the cluster. So when you do your SELECT ... IN () query, you're basically querying for the data in 9 different partitions which, depending on how large your cluster is, the replication factor, and the consistency level you use to do the query, could end up querying at least 9 servers (and maybe more). Why does that matter?
Latency: The more partitions (and thus replicas/servers) involved in our query, the more potential for a slow server being able to negatively impact how quickly the data is returned.
Availability: The more partitions (and thus replicas/servers) involved in our query, the more potential that a single server going down could make it impossible for the query to be satisfied at all.
Both of those are bad scenarios so (as Toan rightly points out in his answer and the link he provided), we try to data model in Cassandra so that our queries will hit as few partitions (and thus replicas/servers) as possible. What does that mean for your scenario? Without knowing all the details, it's hard to say for sure, but let me make a couple guesses about your scenario and give you an example of how I'd try to solve it.
It sounds like maybe you already know the list of possible geohash values ahead of time (maybe they're at some regularly spaced interval of a predefined grid). It also sounds like maybe you're querying for 9 geohash values because you're doing some kind of "proximity" search where you're trying to get the data for the 9 geohashes in each direction around a given point.
If that's the case, the trick could be to denormalize the data at write time into a data model optimized for reading. For example, a schema like this:
CREATE TABLE geohash_data (
geohash text,
data_geohash text,
category int,
data text,
PRIMARY KEY (geohash, data_geohash)
);
When you INSERT a data point, you'd calculate the geohashes for the surrounding areas where you expect that data should show up in the results. You'd then INSERT the data multiple times for each geohash you calculated. So the value for geohash is the calculated value where you expect it to show up in the query results and the value for data_geohash is the actual value from the data you're inserting. Thus you'd have multiple (up to 9?) rows in your partition for a given geohash which represent the data of the surrounding geohashes.
This means your SELECT query now doesn't have to do an IN and hit multiple partitions. You just query WHERE geohash = ? for the point you want to search around.
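A hypothetical example of that write-time fan-out (the geohash values are made up; in practice you'd compute the neighbouring cells around each data point):
-- a reading at geohash 'u4pruyd' is written to its own partition and to
-- each neighbouring partition where a proximity search should find it
INSERT INTO geohash_data (geohash, data_geohash, category, data)
VALUES ('u4pruyd', 'u4pruyd', 1, 'payload');
INSERT INTO geohash_data (geohash, data_geohash, category, data)
VALUES ('u4pruyb', 'u4pruyd', 1, 'payload');
-- ...and so on for the remaining neighbours
-- read side: one single-partition query instead of an IN over 9 partitions
SELECT category, data FROM geohash_data WHERE geohash = 'u4pruyd';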

Need recommendation on appropriate primary key structure

I have a lot of time series data that I would like to store in a Cassandra database. Since I can only do WHERE clauses on fields in the primary key, I need some recommendations on how to lay this out based on the way that I will need to query it.
My data is in this format:
SYSTEM_SERIAL_NUMBER,DEVICE_ID,TIMESTAMP,...OTHER COLUMNS
Each serial number has multiple devices, and I will have thousands of timestamps for every device, so my primary key to uniquely identify each set of data has to include all three.
There are basically two types of queries I will do on this data.
SELECT * FROM TABLE WHERE system_serial_number = 'X' and device_id = 'x' and timestamp (is in a range)
or
SELECT * FROM TABLE WHERE system_serial_number = 'X' and timestamp (is in a range)
The second one is the more likely query, because I am typically going to input a time range in the application and I want to see data from every single device for a given serial number. But I can't leave the device name out of the key because you need serial/device/timestamp to be able to uniquely identify an entire row.
I've tried to create my tables as follows:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
...,
PRIMARY KEY ((system_serial_number,device_id),time_stamp)
);
And also as:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
...,
PRIMARY KEY (system_serial_number,device_id,time_stamp)
);
The first one I think would keep me from hitting column limitations, but it always requires me to enter a Device ID along with the Serial every time I query. The second one is less column efficient (based on my understanding), and it allows me to search by serial only. Neither one of them lets me search by just serial/timestamp, which is actually the most common search that I am going to do, but isn't unique enough to be a primary key.
The only way I've even been able to get a query to work is by using the first one with the compound key and then adding a secondary index for just serial number, which then allows me to search by serial/timestamp, but I have to use the inefficient ALLOW FILTERING.
Any suggestions on the best way to get what I need?
The simplest answer is:
PRIMARY KEY (system_serial_number, time_stamp, device_id)
system_serial_number will be the partition key that identifies which replicas (nodes) will contain the data. All data for a single serial number will need to fit in the same partition. For efficient access, all queries will be required to specify a serial number. If partition size is a concern, there may be ways to further subdivide if the use case allows.
time_stamp will be the clustering key used to sort the rows within the partition. That is, all logical rows for the same serial number will be ordered by the timestamp, irrespective of the device. The first PK column that is not a part of the partition key determines the sort order.
device_id is an additional PK column to distinguish your logical rows, but does not help you sort or do other range scans.
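Spelled out as a full definition (the non-key columns are elided as in the question), together with the common query it enables:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
-- ...other columns
PRIMARY KEY (system_serial_number, time_stamp, device_id)
);
-- all devices for a serial number within a time range:
SELECT * FROM devices
WHERE system_serial_number = 'X'
AND time_stamp >= '2016-01-01' AND time_stamp < '2016-02-01';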
Since you mentioned that each device would generate thousands of timestamps, and each serial number will have many devices, you may also need to be concerned about the size of your partitions if you take the above approach. A common approach is to break the data for a single serial number across multiple partitions, but that can make querying your data either more efficient or more troublesome, depending on how you decide to subdivide the data.
You will have to use some imagination and knowledge of your specific use cases to decide on the proper partitioning layout. Off the top of my head, I can think of some ideas:
PRIMARY KEY ((system_serial_number, device_hash_modulus), time_stamp, device_id)
Idea: hash your device IDs and apply a modulus to split the data across a fixed number of "buckets"
Advantage: with an even hash distribution, spreads data evenly across a known number of nodes
Disadvantage: querying across "all devices" for a given serial number requires making N queries, one for each "bucket" based on the number chosen for the modulo operation
Disadvantage: may need to adjust bucketing scheme (and migrate data) if initial choice is too small for eventual data size
PRIMARY KEY ((system_serial_number, coarse_time_stamp), time_stamp, device_id)
Idea: split the data over time into different partitions, size determined by how coarse you make the partitioning timestamp (year? year+month?, year+day?, etc.). The decision should be made based on how many unique records are expected within a given time period.
Advantage: assuming the cluster is configured with a random partitioner, the data will be evenly distributed around the cluster as time moves forward.
Disadvantage: querying for records across a range of time may involve making separate queries to different partitions, making the program logic more complex. If the partition timestamp isn't coarse enough, or the timestamp range to be searched is too wide, performance will be impacted.
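A sketch of this second idea with a month-granularity bucket (the bucket column and its format are assumptions):
CREATE TABLE devices_by_month (
system_serial_number text,
month text, -- coarse bucket, e.g. '2016-01'
time_stamp timestamp,
device_id int,
-- ...other columns
PRIMARY KEY ((system_serial_number, month), time_stamp, device_id)
);
-- one bucket's worth of the common query; a wider range spans several buckets
SELECT * FROM devices_by_month
WHERE system_serial_number = 'X' AND month = '2016-01'
AND time_stamp >= '2016-01-10' AND time_stamp < '2016-01-20';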
There may be other options available to you, but it will all depend on how well you understand your current use cases (and how well you can predict the future behavior of your data set).
