Is there a difference between storing a list of floats vs. denormalising into multiple rows? - Cassandra

I need to store multiple floating point numbers per record in Cassandra. My current schema looks like:
CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    vectors LIST<FLOAT>,
    PRIMARY KEY ((account, groupkey), productid)
) WITH CLUSTERING ORDER BY (productid ASC);
Each record has 1280 floats. These rows, once inserted, are never updated or deleted. While this works, I've been wondering whether it would be better to store these as 1280 separate rows.
CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    vector FLOAT,
    PRIMARY KEY ((account, groupkey), productid)
) WITH CLUSTERING ORDER BY (productid ASC);
The DataStax docs read:
Collections are meant for storing/denormalizing relatively small amount of data.
...but I'm unsure what counts as a small or large amount. The ordering of the list is not relevant. The rows are never read individually. All reads come from Spark and use token ranges to read large swathes of data.

If the data never changes, then use the frozen version of the list, so all points will be stored as one binary object:
vectors frozen<LIST<FLOAT>>
Using separate rows makes sense only if you need to read individual values, or something like that. If you always read the whole dataset, use a frozen list.
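For reference, a sketch of the first table with the frozen collection applied (otherwise identical to the schema in the question):

CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    vectors FROZEN<LIST<FLOAT>>,
    PRIMARY KEY ((account, groupkey), productid)
) WITH CLUSTERING ORDER BY (productid ASC);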

I would echo Alex's advice: a frozen list would suit your use case better than the non-frozen one above. However, there are also some points I would add.
In the 2nd table example, there is no additional column to denote the different list items when normalized - the primary key remains the same, so in essence it would store just 1 value per primary key and not the 1280 you intended. There would have to be an additional column within the key to make each list entry a unique row; a sketch of one way to do this follows below.
For the 1st table, while you can use a frozen list - if there is no actual order to the items within the list and no duplication, you could opt for a set, which would be simpler since no ordinal position is stored or considered. (The lack of any ordering in the 2nd table design is the catalyst for this consideration.)
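A minimal sketch of that correction, assuming a hypothetical idx column to carry each value's position (0 to 1279):

CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    idx SMALLINT,   -- hypothetical: position of the value in the original list
    vector FLOAT,
    PRIMARY KEY ((account, groupkey), productid, idx)
) WITH CLUSTERING ORDER BY (productid ASC, idx ASC);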

Related

Trying to visualize how wide and skinny rows are laid out

Can someone show me how the data is laid out when you design your tables for wide vs. skinny rows?
I'm not sure I fully grasp how the data is spread out with a "wide" row.
Is there a difference in how you can fetch the data, or will it be the same? I.e., if it is ordered, does it matter whether the data is organized vertically (skinny) or horizontally (wide)?
Update
Is a table considered wide if the primary key consists of more than one column?
Or will a table have wide rows only if the partition key is a composite partition key?
Wide... Skinny... Terms that make your head explode... I prefer to oversimplify the thing as such:
All the tables have wide rows
You simply need to take care of how wide the rows get
This allows me to think of it as follows (mangling the C* terminology a bit):

Number of RECORDS in a partition
1 <------------------------------------------> 2 billion
^                                              ^
skinny rows                                    wide rows

The fewer records in a partition, the skinnier the "partition", and vice versa.
When designing for C* I always keep in mind a couple of things:
I want to use "skinny partitions" when my data can be fetched with one query and it is fully contained in one record of one partition. Typical example is something along SELECT * FROM table WHERE username = 'xmas79'; where the table has a primary key in the form of PRIMARY KEY (username)that let me get all the data belonging to a particular username.
I want to use "wide rows" when my data can be fetched with one query and it is fully contained on multiple records of one partition. Typical examples are range queries like SELECT * FROM table WHERE sensor = 'pressure' AND time >= '2016-09-22';, where the table has a primary key in the form of PRIMARY KEY (sensor, time).
So: first approach for one-shot queries, second approach for range queries. Beware that this second approach has the (major) drawback that you can keep adding data to the partition, and it will get wider and wider, hurting performance.
In order to control how wide your partitions are, you need to add something to the partition key. In the sensor example above, provided it doesn't violate your requirements, you can "group" some measurements by date, e.g. split the measures into day-by-day groups, making the primary key PRIMARY KEY ((sensor, day), time), where the partition key becomes (sensor, day). With this approach you have full (well, let's say good at least) control over the wideness of your partitions, as sketched below.
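A minimal sketch of that day-bucketed table (the value column and types are assumptions):

CREATE TABLE measurements (
    sensor text,
    day date,          -- day bucket that bounds partition width
    time timestamp,
    value double,      -- hypothetical measurement column
    PRIMARY KEY ((sensor, day), time)
);

-- The range query now targets a single (sensor, day) partition:
SELECT * FROM measurements
WHERE sensor = 'pressure' AND day = '2016-09-22'
  AND time >= '2016-09-22 06:00:00';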
You only need to find a good compromise between your query capabilities and the desired performance.
I suggest these three readings for further investigation on the details:
Wide Rows in Cassandra CQL
Does CQL support dynamic columns / wide rows?
CQL3 for Cassandra experts
Beware that in the first link there's a mistake in the second-to-last picture: the primary key should be
PRIMARY KEY ((user_id, tweet_id))
with double parentheses around the columns instead of one.

Cassandra: secondary index inside a single partition (per partition indexing)?

This question is, I hope, not answered by the usual "secondary index vs. clustering key" questions.
Here is a simple model I have:
CREATE TABLE ks.cycles (
    md_name text,
    timestamp int,
    device text,
    value int,
    PRIMARY KEY (md_name, timestamp, device)
);
Basically I view my data as datasets identified by a name (md_name); each dataset is a kind of sparse 2D matrix (rows = timestamps, columns = devices) containing value.
As the problem and the queries can be pretty symmetric (i.e. is my "matrix" the best representation, or should I use the transposed "matrix"?), I couldn't easily decide which clustering key to put first. The way I did it makes slightly more sense: for each timestamp I have a set of data (values for each device present at that timestamp).
The usual query is then
select * from cycles where md_name = 'xyz';
It targets a single partition, so it will be super fast; easy enough. If there's a large amount of data, my users could do something like this instead:
select * from cycles where md_name = 'xyz' and timestamp < n;
However I'd like to be able to "transpose" the problem and do this:
select * from cycles where md_name = 'xyz' and device='uvw';
That means I have to create a secondary index on device.
But (and that's where the question starts), this index is a bit different from usual indexes, as it is used for queries inside a single partition. Creating the index also allows the same query across multiple partitions:
select * from cycles where device='uvw'
Which is not necessary in my case.
Can I improve my model to support such queries without too much duplication?
Is there something like a "per-partition index"?
The index would allow you to do queries like this:
select * from cycles where md_name='xyz' and device='uvw'
But that would return all timestamps for that device in the xyz partition.
So it sounds like maybe you want two views of the data: one based on name and time range, and one based on name, device, and time range.
If that's what you're asking, then you probably need two tables. If you're using C* 3.0, then you could use the materialized views feature to create the second view. If you're on an earlier version, then you'd have to create the two tables and do a write to each table in your application.
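For the 3.0 route, a hedged sketch of what that materialized view might look like (the view name is an assumption):

CREATE MATERIALIZED VIEW ks.cycles_by_device AS
    SELECT * FROM ks.cycles
    WHERE md_name IS NOT NULL AND device IS NOT NULL AND timestamp IS NOT NULL
    PRIMARY KEY (md_name, device, timestamp);

-- The "transposed" query then hits the view:
SELECT * FROM ks.cycles_by_device WHERE md_name = 'xyz' AND device = 'uvw';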

Data modelling of raw data for further transformation in Cassandra

I am working on a system for storing and processing time series data from a couple of plants. Every plant has a different number of raw measurement values, each of them represented as a key-value pair.
The raw data needs to be preprocessed to obtain semantics. I also need to keep the raw data, because the transformation process should be configurable. While I am new to NoSQL databases and Cassandra, I searched for resources on the web and found the weather station example (described similarly in other resources, too).
My requirements are similar to this example, but as an extension I need a way to store a variable number of measurement values (key-value pairs) per plant. I also know that my table model highly depends on the queries I want to run against it. The most common queries will be:
Get all values per key for a specific time (range) and plant.
Get all values per multiple keys for a specific time (range) and plant.
My question now is: what would a table structure look like that best fits these requirements?
I thought about something like the following, but don't know whether it has drawbacks:
CREATE TABLE values_per_day (
    plant_id text,
    date text,
    event_time timestamp,
    key text,
    value text,
    PRIMARY KEY ((plant_id, date), event_time, key)
);
The recommendation for Cassandra is to start with the queries you want to perform. For each query, consider its inputs, which indicate what data you want it to return. For each query you should have a table that has the inputs to the query as its primary key. If you want to query for a range of values, that value should be a clustering key (not the partition key), with the other inputs forming the partition key. If you want to query over very long value ranges, consider slicing that value into buckets, as sketched below.
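A hedged sketch applying that advice to the first query ("all values per key for a specific time range and plant"); the table name, day bucket, and types are assumptions:

CREATE TABLE values_per_key_day (
    plant_id text,
    key text,
    date text,            -- day bucket, e.g. '2016-09-22'
    event_time timestamp,
    value text,
    PRIMARY KEY ((plant_id, key, date), event_time)
);

-- The query inputs (plant, key, day) form the partition key and the
-- time range becomes a clustering slice:
SELECT * FROM values_per_key_day
WHERE plant_id = 'plant1' AND key = 'temperature'
  AND date = '2016-09-22' AND event_time >= '2016-09-22 06:00:00';

The second query (multiple keys) would then fan out over several partitions, one query per key from the client.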

Need recommendation on appropriate primary key structure

I have a lot of time series data that I would like to store in a Cassandra database. Since I can only do WHERE clauses on fields in the primary key, I need some recommendations on how to lay this out based on the way that I will need to query it.
My data is in this format:
SYSTEM_SERIAL_NUMBER,DEVICE_ID,TIMESTAMP,...OTHER COLUMNS
Each serial number has multiple devices, and I will have thousands of timestamps for every device, so my primary key to uniquely identify each set of data has to include all three.
There are basically two types of queries I will do on this data.
SELECT * FROM TABLE WHERE system_serial_number = 'X' and device_id = 'x' and timestamp (is in a range)
or
SELECT * FROM TABLE WHERE system_serial_number = 'X' and timestamp (is in a range)
The second one is the more likely query, because I am typically going to input a time range in the application and I want to see data from every single device for a given serial number. But I can't leave the device name out of the key because you need serial/device/timestamp to be able to uniquely identify an entire row.
I've tried to create my tables as follows:
CREATE TABLE devices (
    system_serial_number text,
    device_id int,
    time_stamp timestamp,
    ...,
    PRIMARY KEY ((system_serial_number, device_id), time_stamp)
);
And also as:
CREATE TABLE devices (
    system_serial_number text,
    device_id int,
    time_stamp timestamp,
    ...,
    PRIMARY KEY (system_serial_number, device_id, time_stamp)
);
The first one, I think, would keep me from hitting column limitations, but it always requires me to enter a device ID along with the serial every time I query. The second one is less column-efficient (based on my understanding), and it allows me to search by serial only. Neither of them lets me search by just serial/timestamp, which is actually the most common search I'm going to do, but that combination isn't unique enough to be a primary key.
The only way I've been able to get a query to work is by using the first one with the compound key and then adding a secondary index on just the serial number, which then allows me to search by serial/timestamp, but I have to use the inefficient ALLOW FILTERING.
Any suggestions on the best way to get what I need?
The simplest answer is:
PRIMARY KEY (system_serial_number, time_stamp, device_id)
system_serial_number will be the partition key that identifies which replicas (nodes) will contain the data. All data for a single serial number will need to fit in the same partition. For efficient access, all queries will be required to specify a serial number. If partition size is a concern, there may be ways to further subdivide if the use case allows.
time_stamp will be the clustering key used to sort the rows within the partition. That is, all logical rows for the same serial number will be ordered by the timestamp, irrespective of the device. The first PK column that is not a part of the partition key determines the sort order.
device_id is an additional PK column to distinguish your logical rows, but does not help you sort or do other range scans.
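Applied to the table from the question, the suggested key would look like this (a sketch; the extra data column is a stand-in):

CREATE TABLE devices (
    system_serial_number text,
    time_stamp timestamp,
    device_id int,
    value text,   -- stand-in for the other data columns
    PRIMARY KEY (system_serial_number, time_stamp, device_id)
);

-- The common serial + time-range query now works without ALLOW FILTERING:
SELECT * FROM devices
WHERE system_serial_number = 'X'
  AND time_stamp >= '2016-01-01' AND time_stamp < '2016-02-01';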
Since you mentioned that each device would generate thousands of timestamps, and each serial number will have many devices, you may also need to be concerned about the size of your partitions if you take the above approach. A common approach is to break the data for a single serial number across multiple partitions, but that can make querying your data either more efficient or more troublesome, depending on how you decide to subdivide the data.
You will have to use some imagination and knowledge of your specific use cases to decide on the proper partitioning layout. Off the top of my head, I can think of some ideas:
PRIMARY KEY ((system_serial_number, device_hash_modulus), time_stamp, device_id)
Idea: hash your device IDs and apply a modulus to split the data across a fixed number of "buckets"
Advantage: with an even hash distribution, spreads data evenly across a known number of nodes
Disadvantage: querying across "all devices" for a given serial number requires making N queries, one for each "bucket" based on the number chosen for the modulo operation
Disadvantage: may need to adjust bucketing scheme (and migrate data) if initial choice is too small for eventual data size
PRIMARY KEY ((system_serial_number, coarse_time_stamp), time_stamp, device_id)
Idea: split the data over time into different partitions, with partition size determined by how coarse you make the partitioning timestamp (year? year+month? year+day? etc.). The decision should be made based on how many unique records are expected within a given time period.
Advantage: assuming the cluster is configured with a random partitioner, the data will be evenly distributed around the cluster as time moves forward.
Disadvantage: querying for records across a range of time may involve making separate queries to different partitions, making the program logic more complex. If the partition timestamp isn't coarse enough, or the timestamp range to be searched is too wide, performance will be impacted.
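A sketch of this second option with a month-level bucket (the granularity and extra column are assumptions):

CREATE TABLE devices_by_month (
    system_serial_number text,
    month text,           -- coarse partition bucket, e.g. '2016-09'
    time_stamp timestamp,
    device_id int,
    value text,           -- stand-in for the other data columns
    PRIMARY KEY ((system_serial_number, month), time_stamp, device_id)
);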
There may be other options available to you, but it will all depend on how well you understand your current use cases (and how well you can predict the future behavior of your data set).

Why do many refer to Cassandra as a column-oriented database?

Reading several papers and documents on the internet, I found much contradictory information about the Cassandra data model. Many identify it as a column-oriented database, others as row-oriented, and still others define it as a hybrid of the two.
According to what I know about how Cassandra stores files, it uses the *-Index.db file to seek to the right position in the *-Data.db file, where the bloom filter, column index, and then the columns of the required row are stored.
In my opinion, this is strictly row-oriented. Is there something I'm missing?
If you take a look at the Readme file at Apache Cassandra git repo, it says that,
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent matter. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
Column-oriented or columnar databases are stored on disk column-wise.
e.g., a Bonuses table:

ID  Last   First  Bonus
1   Doe    John   8000
2   Smith  Jane   4000
3   Beck   Sam    1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
Cassandra is basically a column-family store
Cassandra would store the above data as,
"Bonuses" : {
row1 : { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000},
row2 : { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000}
...
}
Also, the number of columns in each row doesn't have to be the same. One row can have 100 columns and the next row can have only 1 column.
Yes, the "column-oriented" terminology is a bit confusing.
The model in Cassandra is that rows contain columns. To access the smallest unit of data (a column) you have to specify first the row name (key), then the column name.
So in a column family called Fruit you could have a structure like the following example (with 2 rows), where the fruit types are the row keys, and the columns each have a name and value.
apple  -> colour   weight  price  variety
          "red"    100     40     "Cox"

orange -> colour   weight  price  origin
          "orange" 120     50     "Spain"
One difference from a table-based relational database is that one can omit columns (orange has no variety), or add arbitrary columns (orange has origin) at any time. You can still imagine the data above as a table, albeit a sparse one where many values might be empty.
However, a "column-oriented" model can also be used for lists and time series, where every column name is unique (and here we have just one row, but we could have thousands or millions of columns):
temperature -> 2012-09-01  2012-09-02  2012-09-03  ...
               40          41          39          ...
which is quite different from a relational model, where one would have to model the entries of a time series as rows not columns. This type of usage is often referred to as "wide rows".
You both make good points, and it can be confusing. In the example where

apple -> colour  weight  price  variety
         "red"   100     40     "Cox"

apple is the key value and the column is the data, which contains all 4 data items. From what was described, it sounds like all 4 data items are stored together as a single object and then parsed by the application to pull just the value required. Therefore, from an IO perspective, I need to read the entire object. IMHO this is inherently row (or object) based, not column based.
Column-based storage became popular for warehousing because it offers extreme compression and reduced IO for full table scans (DW), but at the cost of increased IO for OLTP when you need to pull every column (select *). Most queries don't need every column, and due to compression the IO can be greatly reduced for full table scans over just a few columns. Let me provide an example:
apple -> colour  weight  price  variety
         "red"   100     40     "Cox"

grape -> colour  weight  price  variety
         "red"   100     40     "Cox"
We have two different fruits, but both have colour = red. If we store colour in a separate disk page (block) from weight, price, and variety, so that the only thing stored is colour, then when we compress the page we can achieve extreme compression due to heavy de-duplication. Instead of storing 100 rows (hypothetically) in a page, we can store 10,000 colours. Now, reading everything with colour red might take 1 IO instead of thousands of IOs, which is really good for warehousing and analytics, but bad for OLTP if I need to update the entire row, since the row might have hundreds of columns and a single update (or insert) could require hundreds of IOs.
Unless I'm missing something, I wouldn't call this columnar based; I'd call it object based. It's still not clear how objects are arranged on disk. Are multiple objects placed in the same disk page? Is there any way of ensuring objects with the same metadata go together? Given that one fruit might contain different data than another fruit, since it's just metadata or XML or whatever you want to store in the object itself, is there a way to ensure that matching fruit types are stored together to increase efficiency?
Larry
The most unambiguous term I have come across is wide-column store.
It is a kind of two-dimensional key-value store, where you use a row key and a column key to access data.
The main difference between this model and the relational ones (both row-oriented and column-oriented) is that the column information is part of the data.
This implies data can be sparse. That means different rows don't need to share the same column names or the same number of columns. This enables semi-structured data or schema-free tables.
You can think of wide-column stores as tables that can hold an unlimited number of columns, and thus are wide.
Here's a couple of links to back this up:
This MongoDB article
This DataStax article mentions it too, although it classifies Cassandra as a key-value store.
This db-engines article
This 2013 article
Wikipedia
Column family does not mean column-oriented. Cassandra is a column-family store but is not column-oriented: it stores each row with all its column families together.
HBase is also a column-family store, but it stores column families in a column-oriented fashion: different column families are stored separately, and they can even reside on different nodes.
IMO that's the wrong term to use for Cassandra. Instead, it is more appropriate to call it a row-partition store. Let me provide some details:
Primary Key, Partitioning Key, Clustering Columns, and Data Columns:
Every table must have a primary key, which carries a unique constraint.
Primary Key = Partition Key + Clustering Columns
# Example
Primary Key: ((col1, col2), col3, col4)
    # The primary key uniquely identifies a row. We choose its
    # components (partition key and clustering columns) so that
    # each row can be uniquely identified.

Partition Key: (col1, col2)
    # Decides on which node to store the data. The partition key
    # is mandatory and can be made up of one or multiple columns.

Clustering Columns: col3, col4
    # Decide the arrangement within a partition. Clustering
    # columns are optional.
The partition key is the first component of the primary key. Its hashed value is used to determine the node that stores the data. The partition key can be a compound key consisting of multiple columns. We want an almost equal spread of data across nodes, and we keep this in mind while choosing the primary key.
Any fields listed after the partition key in the primary key are called clustering columns. These store the data in ascending order within the partition. The clustering columns also help ensure the primary key of each row is unique.
You can use as many clustering columns as you like, but you cannot use them out of order in a SELECT statement. You may choose to omit a clustering column from your SELECT statement; that's OK, just remember to use them in order when you do use them. In your CQL query, you cannot restrict a column or clustering column if you have not restricted the clustering columns that precede it. For example, if the primary key is (year, artist_name, album_name) and you want to use the city column in your query's WHERE clause, you can do so only if the WHERE clause also makes use of all of the columns that are part of the primary key. A sketch of the ordering rule follows below.
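A minimal sketch of that ordering rule, using a hypothetical table built around the (year, artist_name, album_name) key mentioned above:

CREATE TABLE albums (
    year int,
    artist_name text,
    album_name text,
    city text,
    PRIMARY KEY (year, artist_name, album_name)
);

-- OK: clustering columns restricted in order
SELECT * FROM albums WHERE year = 2016;
SELECT * FROM albums WHERE year = 2016 AND artist_name = 'Adele';

-- Not OK: skips artist_name, so Cassandra rejects it
-- SELECT * FROM albums WHERE year = 2016 AND album_name = '25';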
Tokens:
Cassandra uses tokens to determine which node holds what data. A token is a 64-bit integer, and Cassandra assigns ranges of these tokens to nodes so that each possible token is owned by a node. Adding nodes to the cluster or removing old ones leads to redistributing these tokens among the nodes.
A row's partition key is used to calculate a token using a given partitioner (a hash function for computing the token of a partition key) to determine which node owns that row.
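As an aside, you can inspect the token computed for a partition key with CQL's built-in token() function; the albums table here is the hypothetical one sketched above:

-- Shows which token (and hence which node's range) each row falls into
SELECT token(year), year, artist_name FROM albums LIMIT 3;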
Cassandra is Row-partition store:
Row is the smallest unit that stores related data in Cassandra.
Don't think of Cassandra's column family (that is, a table) as an RDBMS table; think of it as a dict of dicts (here dict is a data structure similar to Python's OrderedDict):
the outer dict is keyed by a row key (primary key): this determines which partition, and which row in that partition
the inner dict is keyed by a column key (data columns): this holds the data in a dict with column names as keys
both dicts are ordered and sorted by key: the outer dict is sorted by primary key
This model allows you to omit columns or add arbitrary columns at any time, as it allows you to have different data columns for different rows.
Cassandra has a concept of column families (tables), which originally comes from Bigtable. Though, as you mentioned, it is really misleading to call them column-oriented: within each column family, all the columns of a row are stored together along with the row key, and column compression is not used. Thus, the Bigtable model is still mostly row-oriented.
