Cassandra: secondary index inside a single partition (per partition indexing)?

I hope this question is not already answered by the usual "secondary index vs. clustering key" questions.
Here is a simple model I have:
CREATE TABLE cycles (
md_name text,
timestamp int,
device text,
value int,
PRIMARY KEY (md_name, timestamp, device)
);
Basically I view my data as datasets identified by md_name; each dataset is a kind of sparse 2D matrix (rows = timestamps, columns = devices) containing value.
As the problem and the queries can be pretty symmetric (i.e. is my "matrix" the best representation, or should I use the transposed "matrix"?), I couldn't easily decide which clustering key to put first. The way I did it makes a bit more sense: for each timestamp I have a set of data (values for each device present at that timestamp).
The usual query is then
select * from cycles where md_name = 'xyz';
It targets a single partition, that will be super fast, easy enough. If there's a large amount of data my users could do something like this instead:
select * from cycles where md_name = 'xyz' and timestamp < n;
However I'd like to be able to "transpose" the problem and do this:
select * from cycles where md_name = 'xyz' and device='uvw';
That means I have to create a secondary index on device.
But (and that's where the question starts), this index is a bit different from usual indexes, as it is used for queries inside a single partition. Creating the index also allows the same query across multiple partitions:
select * from cycles where device='uvw'
Which is not necessary in my case.
Can I improve my model to support such queries without too much duplication?
Is there something like a "per-partition index"?

The index would allow you to do queries like this:
select * from cycles where md_name='xyz' and device='uvw'
But that would return all timestamps for that device in the xyz partition.
So it sounds like maybe you want two views of the data: one based on name and time range, and one based on name, device, and time range.
If that's what you're asking, then you probably need two tables. If you're using C* 3.0, then you could use the materialized views feature to create the second view. If you're on an earlier version, then you'd have to create the two tables and do a write to each table in your application.
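For the pre-3.0 case, the application-side double write could look roughly like the sketch below. The second table name cycles_by_device, its primary key (md_name, device, timestamp), and the %s placeholder style are assumptions made for illustration; none of them come from the question.

```python
# Hypothetical second table, clustered by device first:
#   PRIMARY KEY (md_name, device, timestamp)
INSERT_BY_TIME = (
    "INSERT INTO cycles (md_name, timestamp, device, value) "
    "VALUES (%s, %s, %s, %s)"
)
INSERT_BY_DEVICE = (
    "INSERT INTO cycles_by_device (md_name, device, timestamp, value) "
    "VALUES (%s, %s, %s, %s)"
)


def dual_write(md_name, timestamp, device, value):
    """Return the (statement, parameters) pairs that keep both tables in sync."""
    return [
        (INSERT_BY_TIME, (md_name, timestamp, device, value)),
        (INSERT_BY_DEVICE, (md_name, device, timestamp, value)),
    ]
```

Each pair would be handed to whatever driver call executes a statement with bound parameters. Whether the two writes should be grouped in a batch is a separate decision that depends on your consistency needs.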


How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples in the internet, I am still more and more confused how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as this is the part I don't understand):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid)
);
and you retrieve your data with:
SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz';
Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
Second one is more delicate. If you set your partition to the full date, you won't be able to do a range query, as range query is only available on the clustering column. So you need to find the sweet spot for your partition key in order to:
- make sure a single partition won't be too large (max 100MB, otherwise you will run into trouble)
- satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid)
);
This works fine if you don't have too many documents within a month (so your partitions don't grow too large), since each (year, month) pair is one partition. It allows queries such as:
SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24;
If you need a range query across multiple months, you will need to issue multiple queries.
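A range that crosses month boundaries has to be cut into one query per (year, month) partition. The helper below is a minimal sketch of that splitting; the function name month_slices and the choice of 31 as the upper day bound for non-final months are assumptions for illustration.

```python
from datetime import date


def month_slices(start: date, end: date):
    """Split [start, end] into (year, month, first_day, last_day) tuples,
    one per (year, month) partition of documents_by_date."""
    if start > end:
        raise ValueError("start must not be after end")
    slices = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        is_first = (year, month) == (start.year, start.month)
        is_last = (year, month) == (end.year, end.month)
        first_day = start.day if is_first else 1
        # 31 is only an upper bound on the day column, not a real calendar day.
        last_day = end.day if is_last else 31
        slices.append((year, month, first_day, last_day))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return slices


# One statement per slice; parameters are (year, month, first_day, last_day).
QUERY = (
    "SELECT * FROM documents_by_date "
    "WHERE year = %s AND month = %s AND day >= %s AND day <= %s"
)
```

The application would run QUERY once per tuple and concatenate the results, which come back already ordered by day within each partition.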
If your partition is too large because of the data field, you will need to remove it from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partition is still too large, you will need to add a finer-grained component, such as day or hour, to the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the stratio lucene cassandra plugin, and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it's a document, I am assuming that one user will be creating one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
);
Adding WITH CLUSTERING ORDER BY (created DESC) to the table definition can help you get the data ordered by created for a given user.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. If we have to create a separate table with the date as primary key, it becomes tricky when many documents are inserted within the very same second. Clustering order works best when documents for a given user need to be sorted by time.

Cassandra Schema for standard SELECT/FROM/WHERE/IN query

Pretty new to Cassandra - I have data that looks like this:
<geohash text, category int, payload text>
The only query I want to run is:
SELECT category, payload FROM table WHERE geohash IN (list of 9 geohashes)
What would be the best schema in this case?
I know I could simply make my geohash the primary key and be done with it, but is there a better approach?
What are the benefits for defining PRIMARY KEY (geohash, category, payload)?
It depends on the size of the data in each row (geohash text, category int, payload text). If your payload size does not reach tens of MB, you may want to put more geohash values into the same partition by using an artificial bucketId int, so that your query can be served by a single server. The schema would look like this:
geohash text, bucketId int, category int, payload text, where the partition key is geohash and bucketId.
The recommendation is to keep partitions sizeable but <= 100 MB, so you don't have to look up too many partitions. More is available here.
If you have a primary key on (geohash, category, payload), then you can have your data sorted on category and payload in the ascending order.
So based on the query, it sounds like you're considering a CQL schema that looks like this:
CREATE TABLE geohash_data (
geohash text,
category int,
data text,
PRIMARY KEY (geohash)
);
In Cassandra, the first (and in this case only) column in your PRIMARY KEY is the Partition Key. The Partition Key is what's used to distribute data around the cluster. So when you do your SELECT ... IN () query, you're basically querying for the data in 9 different partitions which, depending on how large your cluster is, the replication factor, and the consistency level you use to do the query, could end up querying at least 9 servers (and maybe more). Why does that matter?
Latency: The more partitions (and thus replicas/servers) involved in our query, the more potential for a slow server being able to negatively impact how quickly the data is returned.
Availability: The more partitions (and thus replicas/servers) involved in our query, the more potential that a single server going down could make it impossible for the query to be satisfied at all.
Both of those are bad scenarios so (as Toan rightly points out in his answer and the link he provided), we try to data model in Cassandra so that our queries will hit as few partitions (and thus replicas/servers) as possible. What does that mean for your scenario? Without knowing all the details, it's hard to say for sure, but let me make a couple guesses about your scenario and give you an example of how I'd try to solve it.
It sounds like maybe you already know the list of possible geohash values ahead of time (maybe they're at some regularly spaced interval of a predefined grid). It also sounds like maybe you're querying for 9 geohash values because you're doing some kind of "proximity" search where you're trying to get the data for the 9 geohashes in each direction around a given point.
If that's the case, the trick could be to denormalize the data at write time into a data model optimized for reading. For example, a schema like this:
CREATE TABLE geohash_data (
geohash text,
data_geohash text,
category int,
data text,
PRIMARY KEY (geohash, data_geohash)
);
When you INSERT a data point, you'd calculate the geohashes for the surrounding areas where you expect that data should show up in the results. You'd then INSERT the data multiple times for each geohash you calculated. So the value for geohash is the calculated value where you expect it to show up in the query results and the value for data_geohash is the actual value from the data you're inserting. Thus you'd have multiple (up to 9?) rows in your partition for a given geohash which represent the data of the surrounding geohashes.
This means your SELECT query now doesn't have to do an IN and hit multiple partitions. You just query WHERE geohash = ? for the point you want to search around.
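The write-time fan-out described above can be sketched as follows. The neighbors argument is a hypothetical callable that returns the surrounding geohash cells for a given cell; how you compute them (a geohash library, a lookup table for a predefined grid) is left open here.

```python
def fanout_rows(data_geohash, category, data, neighbors):
    """Build one row per partition the data point should appear in.

    Each row is (geohash, data_geohash, category, data), matching the
    columns of the denormalized geohash_data table.
    `neighbors` is a hypothetical function: geohash -> iterable of geohashes.
    """
    targets = [data_geohash, *neighbors(data_geohash)]
    rows = []
    seen = set()
    for target in targets:
        if target in seen:
            continue  # guard against a neighbor function that repeats cells
        seen.add(target)
        rows.append((target, data_geohash, category, data))
    return rows
```

Each returned tuple corresponds to one INSERT into geohash_data, so a single logical data point costs up to nine writes in exchange for a single-partition read.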

Data modelling of raw data for further transformation in cassandra

I am working on a system for storing and processing time series data from a couple of plants. Every plant has a different number of raw measurement values, each of them represented as a key-value pair.
The raw data needs to be preprocessed to obtain semantics. I also need to save the raw data, because the transformation process should be configurable. As I am new to NoSQL databases and Cassandra, I searched for resources on the web and found the weather station example (described similarly in other resources, too).
My requirements are similar to this example, but as an extension I need a way to store a variable number of measurement values (key-value pairs) per plant. I also know that my table model depends heavily on the queries I want to run against it. The most common queries will be:
Get all values per key for a specific time (range) and plant.
Get all values per multiple keys for a specific time (range) and plant.
My question now is: what would a table structure look like that best fits these requirements?
I thought about something like that, but don't know if it contains some drawbacks:
CREATE TABLE values_per_day (
plant_id text,
date text,
event_time timestamp,
key text,
value text,
PRIMARY KEY ((plant_id, date), event_time, key)
);
The recommendation for Cassandra is to start with the queries you want to perform. For each query, consider the inputs to the query, which indicate what data you want it to return. For each query you should have a table that has the inputs to the query as its primary key. If you want to query for a range of values, that value should be the clustering key (not the partition key) of the primary key, with the other inputs forming the partition key. If you want to query for very long value ranges, consider slicing that value into buckets.

Need recommendation on appropriate primary key structure

I have a lot of time series data that I would like to store in a Cassandra database. Since I can only do WHERE clauses on fields in the primary key, I need some recommendations on how to lay this out based on the way that I will need to query it.
My data is in this format:
Each serial number has multiple devices, and I will have thousands of timestamps for every device, so my primary key to uniquely identify each set of data has to include all three.
There are basically two types of queries I will do on this data.
SELECT * FROM TABLE WHERE system_serial_number = 'X' and device_id = 'x' and timestamp (is in a range)
SELECT * FROM TABLE WHERE system_serial_number = 'X' and timestamp (is in a range)
The second one is the more likely query, because I am typically going to input a time range in the application and I want to see data from every single device for a given serial number. But I can't leave the device name out of the key because you need serial/device/timestamp to be able to uniquely identify an entire row.
I've tried to create my tables as follows:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
PRIMARY KEY ((system_serial_number,device_id),time_stamp)
);
And also as:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
PRIMARY KEY (system_serial_number,device_id,time_stamp)
);
The first one I think would keep me from hitting column limitations, but it always requires me to enter a Device ID along with the Serial every time I query. The second one is less column efficient (based on my understanding), and it allows me to search by serial only. Neither one of them lets me search by just serial/timestamp, which is actually the most common search that I am going to do, but isn't unique enough to be a primary key.
The only way I've even been able to get a query to work is by using the first one with the compound key and then adding a secondary index for just serial number, which then allows me to search by serial/timestamp, but I have to use the inefficient ALLOW FILTERING.
Any suggestions on the best way to get what I need?
The simplest answer is:
PRIMARY KEY (system_serial_number, time_stamp, device_id)
system_serial_number will be the partition key that identifies which replicas (nodes) will contain the data. All data for a single serial number will need to fit in the same partition. For efficient access, all queries will be required to specify a serial number. If partition size is a concern, there may be ways to further subdivide if the use case allows.
time_stamp will be the clustering key used to sort the rows within the partition. That is, all logical rows for the same serial number will be ordered by the timestamp, irrespective of the device. The first PK column that is not a part of the partition key determines the sort order.
device_id is an additional PK column to distinguish your logical rows, but does not help you sort or do other range scans.
Since you mentioned that each device would generate thousands of timestamps, and each serial number will have many devices, you may also need to be concerned about the size of your partitions if you take the above approach. A common approach is to break the data for a single serial number across multiple partitions, but that can make querying your data either more efficient or more troublesome, depending on how you decide to subdivide the data.
You will have to use some imagination and knowledge of your specific use cases to decide on the proper partitioning layout. Off the top of my head, I can think of some ideas:
PRIMARY KEY ((system_serial_number, device_hash_modulus), time_stamp, device_id)
Idea: hash your device IDs and apply a modulus to split the data across a fixed number of "buckets"
Advantage: with an even hash distribution, spreads data evenly across a known number of nodes
Disadvantage: querying across "all devices" for a given serial number requires making N queries, one for each "bucket" based on the number chosen for the modulo operation
Disadvantage: may need to adjust bucketing scheme (and migrate data) if initial choice is too small for eventual data size
PRIMARY KEY ((system_serial_number, coarse_time_stamp), time_stamp, device_id)
Idea: split the data over time into different partitions, size determined by how coarse you make the partitioning timestamp (year? year+month?, year+day?, etc.). The decision should be made based on how many unique records are expected within a given time period.
Advantage: assuming the cluster is configured with a random partitioner, the data will be evenly distributed around the cluster as time moves forward.
Disadvantage: querying for records across a range of time may involve making separate queries to different partitions, making the program logic more complex. If the partition timestamp isn't coarse enough, or the timestamp range to be searched is too wide, performance will be impacted.
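Both bucketing ideas come down to deriving an extra partition key component on the client. The sketch below shows one way to compute each; the bucket count of 16, the use of CRC32 as the hash, and the year+month format are all assumptions chosen for illustration.

```python
import zlib
from datetime import datetime, timezone

NUM_BUCKETS = 16  # hypothetical; pick based on expected data volume


def device_hash_modulus(device_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable hash of the device id reduced to a bucket number.

    CRC32 is used because it gives the same result across processes and
    languages; any stable hash would do.
    """
    return zlib.crc32(str(device_id).encode("utf-8")) % num_buckets


def all_device_partitions(serial: str, num_buckets: int = NUM_BUCKETS):
    """Partition keys to query when reading "all devices" for one serial."""
    return [(serial, bucket) for bucket in range(num_buckets)]


def coarse_time_stamp(ts: datetime) -> str:
    """Year+month bucket for the time-based layout."""
    return ts.strftime("%Y-%m")
```

With the hash layout, a read across all devices issues one query per entry of all_device_partitions; with the time layout, the application computes coarse_time_stamp for every month the requested range touches and queries each of those partitions.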
There may be other options available to you, but it will all depend on how well you understand your current use cases (and how well you can predict the future behavior of your data set).

get_range_slices and CQL query handling, need for ALLOW FILTERING

I have a following CQL table (a bit simplified for clarity):
CREATE TABLE test_table (
user uuid,
app_id ascii,
domain_id ascii,
props map<ascii,blob>,
PRIMARY KEY ((user), app_id, domain_id)
);
The idea is that this table would contain many users (i.e. rows, say, dozens of millions). For each user there would be a few domains of interest and there would be a few apps per domain. And for each user/domain/app there would be a small set of properties.
I need to scan this entire table and load its contents in chunks for given app_id and domain_id. My idea was to use the TOKEN function to be able to read the whole data set in several iterations. So, something like this:
SELECT props FROM test_table WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
I was assuming that this query would be efficient because I specify the range of the row keys and by specifying the values of the clustering keys I effectively specify the column range. But when I try to run this query I get the error message "Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING".
I have limited experience with Cassandra and I assumed that this sort of query would map into get_range_slices() call, which accepts the slice predicate (i.e. the range of columns defined by my app_id/domain_id values) and the key range defined by my token range. It seems either I misunderstand how this sort of query is handled or maybe I misunderstand about the efficiency of get_range_slices() call.
To be more specific, my questions are:
- if this data model does make sense for the kind of query I have in mind
- if this query is expected to be efficient
- if it is efficient, then why am I getting this error message asking me to ALLOW FILTERING
My only guess about the last one was that the rows that do not have the given combination of app_id/domain_id would need to be skipped from the result.
--- update ----
Thanks for all the comments. I have been doing more research on this and there is still something that I do not fully understand.
In the given structure what I am trying to get is like a rectangular area from my data set (assuming that all rows have the same columns). Where top and the bottom of the rectangle is determined by the token range (range) and the left/right sides are defined as column range (slice). So, this should naturally transform into get_range_slices request. My understanding (correct me if I am wrong) that the reason why CQL requires me to put ALLOW FILTERING clause is because there will be rows that do not contain the columns I am looking for, so they will have to be skipped. And since nobody knows if it will have to skip every second row or first million rows before finding one that fits my criteria (in the given range) - this is what causes the unpredictable latency and possibly even timeout. Am I right? I have tried to write a test that does the same kind of query but using low-level Astyanax API (over the same table, I had to read the data generated with CQL, it turned out to be quite simple) and this test does work - except that it returns keys with no columns where the row does not contain the slice of columns I am asking for. Of course I had to implement some kind of simple paging based on the starting token and limit to fetch the data in small chunks.
Now I am wondering - again, considering that I would need to deal with dozens of millions of users: would it be better to partially "rotate" this table and organize it in something like this:
Row key: domain_id + app_id + partition no (something like hash(user) mod X)
Clustering key: column partition no (something like hash(user) >> 16 mod Y) + user
For the "column partition no"... I am not sure if it is really needed. I assume that if I go with this model I will have a relatively small number of rows (X=1000..10000) for each domain + app combination. This will allow me to query the individual partitions, even in parallel if I want to. But (assuming the user is a random UUID) for 100M users it will result in dozens or hundreds of thousands of columns per row. Is it a good idea to read one such row in one request? It should create some memory pressure for Cassandra, I am sure. So maybe reading them in groups (say, Y=10..100) would be better?
I realize that what I am trying to do is not what Cassandra does well - reading "all" or large subset of CF data in chunks that can be pre-calculated (like token range or partition keys) for parallel fetching from different hosts. But I am trying to find a pattern that is the most efficient for such a use case.
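Pre-calculating token ranges for chunked or parallel reads can be sketched as below. It assumes the Murmur3Partitioner token space of -2^63 to 2^63-1 (the bounds used in the query above); the function name and the choice of half-open (lo, hi] ranges are illustrative.

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_ranges: int):
    """Split the full token space into contiguous (lo, hi] ranges.

    Each pair is meant for a query of the form
    ... WHERE TOKEN(user) > lo AND TOKEN(user) <= hi
    """
    if num_ranges < 1:
        raise ValueError("num_ranges must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_ranges for i in range(num_ranges + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

Because the ranges are contiguous and do not overlap, they can be fetched independently, for example one per worker, and the union covers the ring without duplicates.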
By the way, the query like "select * from ... where TOKEN(user)>X and TOKEN(user)
Short answer
This warning means that Cassandra would have to read non-indexed data and filter out the rows that don't satisfy the criteria. If you add ALLOW FILTERING to the end of the query, it will work, but it will scan a lot of data:
SELECT props FROM test_table
WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807
ALLOW FILTERING;
Longer explanation
In your example the primary key consists of two parts: user is used as the partition key, and <app_id, domain_id> form the remaining part. Rows for different users are distributed across the cluster, with each node responsible for a specific range of the token ring.
Rows on a single node are sorted by the hash of the partition key (token(user) in your example). Different rows for a single user are stored on a single node, sorted by the <app_id, domain_id> tuple.
So, the primary key forms a tree-like structure. The partition key adds one level of hierarchy, and each remaining field of the primary key adds another one. By default, Cassandra processes only queries that return all rows from a continuous range of the tree (or several ranges if you use the key IN (...) construct). If Cassandra has to filter out some rows, ALLOW FILTERING must be specified.
Example queries that don't require ALLOW FILTERING:
SELECT * FROM test_table
WHERE user = 'user1';
//OK, returns all rows for a single partition key
SELECT * FROM test_table
WHERE TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
//OK, returns all rows for a continuous range of the token ring
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id='myapp1';
//OK, the rows for specific user/app combination
//are stored together, sorted by domain_id field
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id > 'abc' AND app_id < 'xyz';
//OK, since rows for a single user are sorted by app
Example queries that do require ALLOW FILTERING:
SELECT props FROM test_table
WHERE app_id='myapp1';
//Must scan all the cluster for rows,
//but return only those with specific app_id
SELECT props FROM test_table
WHERE user='user1'
AND domain_id='mydomain1';
//Must scan all rows having user='user1' (all app_ids),
//but return only those having specific domain
SELECT props FROM test_table
WHERE user='user1'
AND app_id > 'abc' AND app_id < 'xyz'
AND domain_id='mydomain1';
//Must scan the range of rows satisfying <user, app_id> condition,
//but return only those having specific domain
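The rule behind these examples can be written down as a small check. This is a simplified model of the explanation given here, not Cassandra's actual planner logic: it ignores IN restrictions, secondary indexes and non-key columns, and the function and argument names are made up for illustration.

```python
CLUSTERING_COLUMNS = ["app_id", "domain_id"]  # order taken from the PRIMARY KEY


def needs_allow_filtering(partition, clustering, columns=CLUSTERING_COLUMNS):
    """Decide whether a query on test_table would need ALLOW FILTERING.

    partition:  "eq" (user = ?), "token_range" (TOKEN(user) bounds) or None
    clustering: dict mapping a clustering column to "eq" or "range"
    """
    if partition != "eq":
        # Several partitions are scanned; any clustering restriction means
        # rows inside each partition must be skipped.
        return bool(clustering)
    prefix_closed = False
    for column in columns:
        kind = clustering.get(column)
        if kind is None:
            prefix_closed = True   # gap: later columns cannot be restricted
        elif prefix_closed:
            return True            # restriction after a gap or a range
        elif kind == "range":
            prefix_closed = True   # a range must be the last restriction
    return False
```

Applied to the examples, needs_allow_filtering("eq", {"app_id": "range"}) is False, while needs_allow_filtering("token_range", {"app_id": "eq", "domain_id": "eq"}) is True, which matches the error reported in the question.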
What to do?
In Cassandra it's not possible to create a secondary index on part of the primary key. There are a few options, each having its pros and cons:
Add a separate table that has primary key ((app_id), domain_id, user) and duplicate the necessary data in the two tables. It will allow you to query the necessary data for a specific app_id or <app_id, domain_id> combination. If you need to query a specific domain across all apps, a third table is necessary. This approach is called materialized views.
Use some sort of parallel processing (Hadoop, Spark, etc.) to perform the necessary calculations for all app/domain combinations. Since Cassandra needs to read all the data anyway, there probably won't be much difference from processing a single pair. If the results for other pairs can be cached for later use, it will probably save some time.
Just use ALLOW FILTERING if query performance is acceptable for your needs. Tens of millions of partition keys is probably not too much for Cassandra.
Presuming you are using the Murmur3Partitioner (which is the right choice), you do not want to run range queries on the row key. This key is hashed to determine which node holds the row, and is therefore not stored in sorted order. Doing this kind of range query would therefore require a full scan.
If you want to do this query, you should store some known value as a sentinel for your row key, such that you can query for equality rather than range. From your data it appears that either app_id or domain_id would be a good choice, since it sounds like you always know these values when performing your query.
