Querying Cassandra for multiple columns

I am using Cassandra to store stock information. Each 'row' has some base fields like time, price, close, open, low, high, etc. On top of these fields I have a list of float-typed values which contain some internal system calculations.
Example of an object:
class StockEntry
    time timestamp;
    price float;
    close float;
    open float;
    low float;
    high float;
    x float;
    y float;
    z float;
    xx2 float;
    xx3 float;
    xx... yy... z...
    a lot more...
Creating a lot of columns in a column family and storing all this data is no problem with Cassandra. The problem is querying it.
I would like to query on fields like x, y, xx2, etc., and these fields contain highly distinct values (floats with 4 decimal places).
Adding all of these columns (100-150) as secondary indexes is unlikely to be a good solution and is not recommended by the Cassandra docs.
What is the recommended data modeling, considering the requirements, when working with Cassandra?

Cassandra data modeling follows a query-driven design pattern. This means that instead of building a model to naturally represent the data (as we might in an RDBMS), we design schemas to accommodate the data access patterns.
So, for example, if you knew that the majority of your queries would involve a WHERE clause on the column x, ordered by the values in column y, you might want to create an additional table in which the partition key is x and the clustering column is y. For example:
CREATE TABLE <tablename> (
    x float,
    y float,
    price float,
    ...
    <rest of columns>
    ...
    PRIMARY KEY (x, y)
);
Now, querying on column x becomes very efficient, as the data for a particular value of x is stored together.
For queries in which a range of values is required (e.g., a price range), you would be wise to store that column as a clustering column, since Cassandra only supports range predicates on clustering columns.
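For instance, against a table like the one above (here given the hypothetical name stock_by_x; the literal values are purely illustrative), a query such as the following is served from a single partition:
// Equality on the partition key x, range scan on the clustering column y
SELECT * FROM stock_by_x WHERE x = 1.2345 AND y > 0.5 AND y < 0.9;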
Admittedly, this leads to multiple writes, as the values in columns x and y must be written to both tables. Cassandra encourages this trade-off, as storage is cheap these days; essentially, in Cassandra you trade additional writes for blazing-fast reads.
Therefore, before designing your data model, think about what kind of queries you would most likely be doing and design accordingly.

CREATE TABLE pricing (
    id blob,
    price_tag text,  // open, close, high, low, ...
    time timestamp,
    value float,     // I would suggest blob with custom/Thrift serialization
    PRIMARY KEY (id, price_tag, time)
);
This will give very efficient queries for the different price types over time.
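As a sketch (the blob id and the date literals are illustrative, not from the answer), a read of one price type over a time window touches a single contiguous slice of one partition:
SELECT time, value FROM pricing
WHERE id = 0x01 AND price_tag = 'close'
  AND time >= '2016-01-01' AND time < '2016-02-01';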
You can find more in this great presentation: http://www.slideshare.net/carlyeks/nyc-big-tech-day-2013?ref=http://techblog.bluemountaincapital.com/

Related

Is there a difference in storing a list of floats vs. denormalising into multiple rows?

I need to store multiple floating point numbers per record in Cassandra. My current schema looks like:
CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    vectors LIST<FLOAT>,
    PRIMARY KEY ((account, groupkey), productid)
) WITH CLUSTERING ORDER BY (productid ASC);
Each record has 1280 floats. These rows, once inserted, are never updated or deleted. While this works, I've been wondering whether it would be better to store these as 1280 separate rows:
CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    vector FLOAT,
    PRIMARY KEY ((account, groupkey), productid)
) WITH CLUSTERING ORDER BY (productid ASC);
The DataStax docs read:
Collections are meant for storing/denormalizing relatively small amount of data.
...but I'm unsure what counts as a little or a lot. The ordering of the list is not relevant, and the rows are never read individually. All reads come from Spark and use token ranges to read large swathes of data.
If the data never changes, then use the frozen version of the list, so all points will be stored as one binary object:
vectors frozen<LIST<FLOAT>>
Using separate rows only makes sense if you need to read individual values. If you always read the whole dataset, use the frozen list.
I would echo Alex's advice that a frozen list would suit your use case better than the non-frozen version above; however, there are also some points I would add.
On the 2nd table example, there is no additional column to denote the different list items when denormalized: the primary key remains the same, so in essence it would store just one value per primary key, not the 1280 you intended. There would still have to be an additional column within the key to make each list entry a unique row, as sketched below.
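For illustration, a corrected per-row variant could add an ordinal column (idx is a hypothetical name, not from the original schema):
CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    idx SMALLINT,   // hypothetical ordinal, 0..1279, making each list entry a unique row
    vector FLOAT,
    PRIMARY KEY ((account, groupkey), productid, idx)
) WITH CLUSTERING ORDER BY (productid ASC, idx ASC);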
For the 1st table, while you can use a frozen list, if there is no actual order to the items within the list and no duplication, you could opt for a set instead, which would be simpler since no ordinal position is stored or considered. (The lack of any ordering in the 2nd table design is the catalyst for this consideration.)

Timeseries with Spark/Cassandra - How to find timestamps when values satisfy a condition?

I have time series from several sensors stored in a Cassandra table. Here is the schema I use for storing the data:
CREATE TABLE data_sensors (
    sensor_id int,
    time timestamp,
    value float,
    PRIMARY KEY ((sensor_id), time)
);
Values can be temperature or pressure, for instance, depending on the sensor they come from.
My objective is to be able to find basic statistics (min, max, avg, std) on pressure, but only when temperature is higher than a certain value.
Here is a schema of the whole process I'd like to get.
I think it could be better if I changed the Cassandra model, at least for temperature data, to be able to filter on value. Is there another way, after importing the data into a Spark RDD, that avoids altering the Cassandra table?
Then, once the filtering on temperature is done, how do I get the sequence of timestamps to use for filtering the pressure data? Note that I don't necessarily have the same timestamps for temperature and pressure, which is why I think I need periods of time instead of a list of precise timestamps.
Thanks for your help!
It's not really a Cassandra-specific answer, but maybe you want to look at time series databases that provide an SQL layer on top of NoSQL stores, with support for JOINs and aggregations.
Here's an example of ATSD SQL syntax that supports period aggregations and joins:
SELECT t1.entity, t1.datetime, min(t1.value), max(t1.value), avg(t2.value)
FROM mpstat.cpu_busy t1
JOIN meminfo.memfree t2
WHERE t1.datetime >= '2016-09-20T15:00:00Z' AND t1.datetime < '2016-09-20T15:15:00Z'
GROUP BY entity, t1.PERIOD(1 MINUTE)
HAVING max(t1.value) > 30
The query joins two metrics, filters out the 1-minute periods where the first metric was below the threshold, and then returns statistics for the second series.
If the two series are unevenly spaced, you can regularize them using linear interpolation.
Disclosure: I work for Axibase that develops ATSD.

Cassandra Schema for standard SELECT/FROM/WHERE/IN query

Pretty new to Cassandra - I have data that looks like this:
<geohash text, category int, payload text>
The only query I want to run is:
SELECT category, payload FROM table WHERE geohash IN (list of 9 geohashes)
What would be the best schema in this case?
I know I could simply make my geohash the primary key and be done with it, but is there a better approach?
What are the benefits of defining PRIMARY KEY (geohash, category, payload)?
It depends on the size of your data for each row (geohash text, category int, payload text). If your payload size does not reach tens of MB, then you may want to put more geohash values into the same partition by using an artificial bucketId int, so your query can be served by a single server. The schema would look like this:
geohash text, bucketId int, category int, payload text, where the partition key is (geohash, bucketId).
The recommendation is to keep partitions to a sizeable <= 100 MB, so you don't have to look up too many partitions. More is available here.
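A minimal sketch of that bucketed layout (the table name and the choice of category as the clustering column are assumptions, not from the answer):
CREATE TABLE geohash_data_bucketed (
    geohash text,
    bucketId int,
    category int,
    payload text,
    PRIMARY KEY ((geohash, bucketId), category)
);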
If you have a primary key of (geohash, category, payload), then your data will be sorted by category and payload in ascending order.
So based on the query, it sounds like you're considering a CQL schema that looks like this:
CREATE TABLE geohash_data (
    geohash text,
    category int,
    data text,
    PRIMARY KEY (geohash)
);
In Cassandra, the first (and in this case only) column in your PRIMARY KEY is the partition key. The partition key is what's used to distribute data around the cluster. So when you do your SELECT ... IN () query, you're basically querying for data in 9 different partitions, which, depending on the size of your cluster, the replication factor, and the consistency level you query at, could end up hitting at least 9 servers (and maybe more). Why does that matter?
Latency: The more partitions (and thus replicas/servers) involved in our query, the more potential for a slow server being able to negatively impact how quickly the data is returned.
Availability: The more partitions (and thus replicas/servers) involved in our query, the more potential that a single server going down could make it impossible for the query to be satisfied at all.
Both of those are bad scenarios so (as Toan rightly points out in his answer and the link he provided), we try to data model in Cassandra so that our queries will hit as few partitions (and thus replicas/servers) as possible. What does that mean for your scenario? Without knowing all the details, it's hard to say for sure, but let me make a couple guesses about your scenario and give you an example of how I'd try to solve it.
It sounds like maybe you already know the list of possible geohash values ahead of time (maybe they're at some regularly spaced interval of a predefined grid). It also sounds like maybe you're querying for 9 geohash values because you're doing some kind of "proximity" search where you're trying to get the data for the 9 geohashes in each direction around a given point.
If that's the case, the trick could be to denormalize the data at write time into a data model optimized for reading. For example, a schema like this:
CREATE TABLE geohash_data (
    geohash text,
    data_geohash text,
    category int,
    data text,
    PRIMARY KEY (geohash, data_geohash)
);
When you INSERT a data point, you'd calculate the geohashes for the surrounding areas where you expect that data to show up in the results. You'd then INSERT the data multiple times, once for each geohash you calculated. So the value of geohash is the calculated value where you expect it to show up in query results, and the value of data_geohash is the actual value from the data you're inserting. Thus you'd have multiple (up to 9?) rows in your partition for a given geohash, representing the data of the surrounding geohashes.
This means your SELECT query now doesn't have to do an IN and hit multiple partitions. You just query WHERE geohash = ? for the point you want to search around.
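A sketch of the write and read paths (the geohash and payload literals are illustrative):
// At write time: insert the same data point once per surrounding geohash
INSERT INTO geohash_data (geohash, data_geohash, category, data)
VALUES ('gbsuv', 'gbsuv', 1, 'payload...');  // the point's own cell
INSERT INTO geohash_data (geohash, data_geohash, category, data)
VALUES ('gbsuu', 'gbsuv', 1, 'payload...');  // one neighboring cell
// ...repeat for the remaining neighbors...

// At read time: a single-partition query replaces the IN () over 9 partitions
SELECT category, data FROM geohash_data WHERE geohash = 'gbsuv';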

Cassandra: secondary index inside a single partition (per partition indexing)?

This question, I hope, is not answered by the usual "secondary index vs. clustering key" questions.
Here is a simple model I have:
CREATE TABLE ks.cycles (
    md_name text,
    timestamp int,
    device text,
    value int,
    PRIMARY KEY (md_name, timestamp, device)
);
Basically I view my data as datasets identified by md_name; each dataset is a kind of sparse 2D matrix (rows = timestamps, columns = devices) containing value.
As the problem and the queries can be pretty symmetric (i.e., is my "matrix" the best representation, or should I use the transposed "matrix"?), I couldn't easily decide which clustering key to put first. The way I did it makes a bit more sense: for each timestamp I have a set of data (a value for each device present at that timestamp).
The usual query is then
select * from cycles where md_name = 'xyz';
It targets a single partition, so it will be super fast, easy enough. If there's a large amount of data, my users could do something like this instead:
select * from cycles where md_name = 'xyz' and timestamp < n;
However I'd like to be able to "transpose" the problem and do this:
select * from cycles where md_name = 'xyz' and device='uvw';
That means I have to create a secondary index on device.
But (and this is where the question starts), this index is a bit different from usual indexes, as it is used for queries inside a single partition. Creating the index also allows the same query across multiple partitions:
select * from cycles where device='uvw'
Which is not necessary in my case.
Can I improve my model to support such queries without too much duplication?
Is there something like a "per-partition index"?
The index would allow you to do queries like this:
select * from cycles where md_name='xyz' and device='uvw'
But that would return all timestamps for that device in the xyz partition.
So it sounds like maybe you want two views of the data: one based on name and time range, and one based on name, device, and time range.
If that's what you're asking, then you probably need two tables. If you're using C* 3.0, then you could use the materialized views feature to create the second view. If you're on an earlier version, then you'd have to create the two tables and do a write to each table in your application.
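For illustration, on C* 3.0+ the second view might be expressed as a materialized view along these lines (the view name is hypothetical):
CREATE MATERIALIZED VIEW ks.cycles_by_device AS
    SELECT * FROM ks.cycles
    WHERE md_name IS NOT NULL AND device IS NOT NULL AND timestamp IS NOT NULL
    PRIMARY KEY ((md_name), device, timestamp);

// The "transposed" query then becomes a plain clustering-column lookup:
SELECT * FROM ks.cycles_by_device WHERE md_name = 'xyz' AND device = 'uvw';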

Need recommendation on appropriate primary key structure

I have a lot of time series data that I would like to store in a Cassandra database. Since I can only use WHERE clauses on fields in the primary key, I need some recommendations on how to lay this out based on the way I will need to query it.
My data is in this format:
SYSTEM_SERIAL_NUMBER,DEVICE_ID,TIMESTAMP,...OTHER COLUMNS
Each serial number has multiple devices, and I will have thousands of timestamps for every device, so my primary key to uniquely identify each set of data has to include all three.
There are basically two types of queries I will do on this data.
SELECT * FROM TABLE WHERE system_serial_number = 'X' and device_id = 'x' and timestamp (is in a range)
or
SELECT * FROM TABLE WHERE system_serial_number = 'X' and timestamp (is in a range)
The second one is the more likely query, because I am typically going to input a time range in the application, and I want to see data from every single device for a given serial number. But I can't leave the device ID out of the key, because you need serial/device/timestamp to uniquely identify an entire row.
I've tried to create my tables as follows:
CREATE TABLE devices (
    system_serial_number text,
    device_id int,
    time_stamp timestamp,
    ...,
    PRIMARY KEY ((system_serial_number, device_id), time_stamp)
);
And also as:
CREATE TABLE devices (
    system_serial_number text,
    device_id int,
    time_stamp timestamp,
    ...,
    PRIMARY KEY (system_serial_number, device_id, time_stamp)
);
The first one, I think, would keep me from hitting column limitations, but it always requires me to supply a device ID along with the serial every time I query. The second one is less column-efficient (based on my understanding), and it allows me to search by serial only. Neither one lets me search by just serial/timestamp, which is actually the most common search I am going to do, but that combination isn't unique enough to be a primary key.
The only way I've been able to get a query to work is by using the first one with the compound key and then adding a secondary index on just the serial number, which allows me to search by serial/timestamp, but only with the inefficient ALLOW FILTERING.
Any suggestions on the best way to get what I need?
The simplest answer is:
PRIMARY KEY (system_serial_number, time_stamp, device_id)
system_serial_number will be the partition key that identifies which replicas (nodes) will contain the data. All data for a single serial number will need to fit in the same partition. For efficient access, all queries will be required to specify a serial number. If partition size is a concern, there may be ways to further subdivide if the use case allows.
time_stamp will be the clustering key used to sort the rows within the partition. That is, all logical rows for the same serial number will be ordered by the timestamp, irrespective of the device. The first PK column that is not a part of the partition key determines the sort order.
device_id is an additional PK column that distinguishes your logical rows, but it does not help you sort or do other range scans.
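A minimal sketch of that layout and the common query (the other data columns are elided as in the question; the literals are illustrative):
CREATE TABLE devices (
    system_serial_number text,
    time_stamp timestamp,
    device_id int,
    // ...other data columns from the original schema...
    PRIMARY KEY (system_serial_number, time_stamp, device_id)
);

// The common query: one partition, range scan on the first clustering column
SELECT * FROM devices
WHERE system_serial_number = 'X'
  AND time_stamp >= '2016-01-01' AND time_stamp < '2016-02-01';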
Since you mentioned that each device would generate thousands of timestamps, and each serial number will have many devices, you may also need to be concerned about the size of your partitions if you take the above approach. A common approach is to break the data for a single serial number across multiple partitions, but that can make querying your data either more efficient or more troublesome, depending on how you decide to subdivide the data.
You will have to use some imagination and knowledge of your specific use cases to decide on the proper partitioning layout. Off the top of my head, I can think of some ideas:
PRIMARY KEY ((system_serial_number, device_hash_modulus), time_stamp, device_id)
Idea: hash your device IDs and apply a modulus to split the data across a fixed number of "buckets"
Advantage: with an even hash distribution, spreads data evenly across a known number of nodes
Disadvantage: querying across "all devices" for a given serial number requires making N queries, one for each "bucket" based on the number chosen for the modulo operation (see the sketch after this list)
Disadvantage: may need to adjust bucketing scheme (and migrate data) if initial choice is too small for eventual data size
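To illustrate the multi-query disadvantage, with a modulus of 4 the application would issue one query per bucket and merge the result sets (a sketch; names and literals follow the examples above):
SELECT * FROM devices
WHERE system_serial_number = 'X' AND device_hash_modulus = 0
  AND time_stamp >= '2016-01-01' AND time_stamp < '2016-02-01';
// ...repeated for device_hash_modulus = 1, 2, 3...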
PRIMARY KEY ((system_serial_number, coarse_time_stamp), time_stamp, device_id)
Idea: split the data over time into different partitions, with partition size determined by how coarse you make the partitioning timestamp (year? year+month? year+day? etc.). The decision should be based on how many unique records are expected within a given time period.
Advantage: assuming the cluster is configured with a random partitioner, the data will be evenly distributed around the cluster as time moves forward.
Disadvantage: querying for records across a range of time may involve making separate queries to different partitions, making the program logic more complex. If the partition timestamp isn't coarse enough, or the timestamp range to be searched is too wide, performance will be impacted.
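For example, with month-level buckets a query within January stays in a single partition (a sketch; the text encoding of coarse_time_stamp is an assumption):
SELECT * FROM devices
WHERE system_serial_number = 'X' AND coarse_time_stamp = '2016-01'
  AND time_stamp >= '2016-01-10' AND time_stamp < '2016-01-20';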
There may be other options available to you, but it will all depend on how well you understand your current use cases (and how well you can predict the future behavior of your data set).
