Cassandra filtering on an indexed column isn't working

I'm using (the latest version of) the Cassandra NoSQL DBMS to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra upserts: a write with a matching primary key overwrites the existing row, so the following inserts produce only two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
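A quick check confirms this; the upsert semantics mean the last write for (customer2, account2) wins:
select * from active_accounts;
-- two rows: ('customer1', 'account1', 1418377413000) and ('customer2', 'account2', 1418377415000)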
The query I'd like to make is: how many distinct customer accounts were active within the last month, say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing? Isn't this the purpose of the index I created?

Table design in Cassandra is extremely important, and it must match the kinds of queries that you are trying to perform. The reason that Cassandra is trying to keep you from performing queries on the date column is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most (99%) of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response times. Unfortunately, Cassandra isn't able to determine automatically what the best way to partition data is. The end user must design a schema that fits the C* data model and supports the queries they want at high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like:
SELECT date FROM active_accounts WHERE customer_name = ? AND account_name = ?
This is because on the cluster the data is actually going to be stored like:
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of customer_name and account_name. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job, since it requires hitting multiple nodes and filtering over all the data in the table.
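As a side note, you can see this placement hash from CQL itself; the token() function applied to the full partition key returns the value Cassandra uses to pick a node:
SELECT token(customer_name, account_name), date FROM active_accounts;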
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999, Tom, 32)] ]
This places the account dates in order within each partition, making range slices between particular dates very fast. Unfortunately, you would have to add code on the application side to pull down the multiple months a query might span. This schema takes a lot of dev work, so if these queries are very infrequent, you should use a distributed analytics platform instead.
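For example (a sketch against the schema above, with placeholder dates), a window from mid-December to mid-January becomes one query per month bucket, merged on the application side:
SELECT customer_name, account_name, date
FROM active_accounts
WHERE start_month = '2014-12-01' AND date >= '2014-12-15';
SELECT customer_name, account_name, date
FROM active_accounts
WHERE start_month = '2015-01-01' AND date < '2015-01-15';
-- merge (and de-duplicate) the two result sets client-side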
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
ALLOW FILTERING was added to the CQL syntax to discourage users from accidentally designing queries that will not scale. Secondary indexes are really only for analytics jobs, or for C* users who fully understand the implications. In Cassandra, a secondary index lives on every node in your cluster. This means that any query that requires a secondary index will necessarily contact every node in the cluster. This becomes less and less performant as the cluster grows, and is definitely not something you want for a frequent query.
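The error message in the question reflects this rule: a query that uses a secondary index must include at least one equality predicate on an indexed column. So an equality lookup would be accepted (though it still fans out across the cluster), while the bare range from the question is rejected:
-- accepted, but still a scatter-gather across every node:
SELECT count(*) FROM active_accounts WHERE date = 1418377413000;
-- rejected: no Equal operator on an indexed column
SELECT count(*) FROM active_accounts WHERE date >= 1418377411000 AND date <= 1418397411000 ALLOW FILTERING;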

Related

Cassandra Data modelling: Timestamp as partition keys

I need to be able to return all users that performed an action during a specified interval. The table definition in Cassandra is just below:
create table t ( "from" timestamp, "to" timestamp, user text, PRIMARY KEY (("from", "to"), user))
I'm trying to implement the following query in Cassandra:
select * from t WHERE "from" > :startInterval and "to" < :toInterval
However, this query will obviously not work because it represents a range query on the partition key, forcing Cassandra to search all nodes in the cluster, defeating its purpose as an efficient database.
Is there an efficient way to model this query in Cassandra?
My solution would be to split both timestamps into their corresponding years and months and use those as the partition key. The table would look like this:
create table t_updated ( yearFrom int, monthFrom int, yearTo int, monthTo int, "from" timestamp, "to" timestamp, user text, PRIMARY KEY ((yearFrom, monthFrom, yearTo, monthTo), user) )
If I wanted the users that performed the action between Jan 2017 and July 2017, the query would look like the following:
select user from t_updated where yearFrom IN (2017) and monthFrom IN (1,2,3,4,5,6,7) and yearTo IN (2017) and monthTo IN (1,2,3,4,5,6,7)
Would there be a better way to model this query in Cassandra? How would you approach this issue?
First, the partition key only supports the equals operator in query predicates. It is better to use PRIMARY KEY (bucket, time_stamp) here, where the bucket can be a combination of year and month (or include days, hours, etc., depending on how big your data set is).
It is better to execute multiple queries and combine the results on the client side.
The answer depends on the expected number of entries. A rule of thumb is that a partition should not exceed 100 MB. So if you expect a moderate number of entries, it would be enough to go with year as the partition key.
We use Week-First-Date as the partition key in an IoT scenario, where values get written at most once a minute.
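Sketching that bucket idea in CQL (bucket format and column names are assumptions, not a definitive design):
CREATE TABLE t_by_bucket (
    bucket text,          -- e.g. '2017-01' for month granularity
    from_ts timestamp,
    to_ts timestamp,
    user text,
    PRIMARY KEY (bucket, from_ts, user)
);
-- one query per bucket in the interval, results combined client-side:
SELECT user FROM t_by_bucket WHERE bucket = '2017-01';
SELECT user FROM t_by_bucket WHERE bucket = '2017-02';
-- ... and so on through '2017-07'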

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am still more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without the primary key defined, as that is exactly my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
)
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz'. Of course, uid must be unique. You might want to consider the uuid data type instead of text.
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100 MB, otherwise you will run into trouble);
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine if, within a day, you don't have too many documents (so your partitions don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24;. If you need to issue a range query across multiple months, you will need to issue multiple queries.
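For example, a range from 2018-11-20 through 2018-12-06 would be split into two queries, one per month partition:
SELECT * FROM documents_by_date WHERE year=2018 AND month=11 AND day>=20;
SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day<=6;
-- merge the result sets client-side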
If your partition is too large due to the data field, you will need to remove it from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partition is still too large, you will need to add hour to the partition key of documents_by_date.
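One possible reading of that suggestion (a sketch; moving day into the partition key as well keeps partitions small, at the cost of more queries per range):
CREATE TABLE documents_by_date (
    year int,
    month int,
    day int,
    hour int,
    uid text,
    PRIMARY KEY ((year, month, day), hour, uid)
);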
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it's a document, I am assuming that one user will create one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data ordered by created for a given user.
For your first requirement, you can query as given below:
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement, you can query as given below:
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. Creating a separate table with date as the primary key becomes tricky if many documents are inserted at the very same second. Clustering order works best for requirements where documents for a given user need to be sorted by time.

Get first row for each partition key in Cassandra

I am considering Cassandra as intermediate storage during my ETL job, to perform data deduplication.
Let's imagine I have a stream of events, each of which has a business entity id, a timestamp, and some value. I need to get only the latest value (in terms of the in-event timestamp) for each business key, but events may arrive unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC );
Now if I insert some data into this table, I can get the latest value for a given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require issuing such a query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using Cassandra 3.6 or later, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017), which you can set to 1. This won't autocomplete in cqlsh until 3.10, with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can work with essentially any amount of data: it decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for every partition of the data, then perform a complex operation on each of them. Relational databases allow this; Cassandra does not. All it allows are full table scans (SELECT * FROM table1) or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

How to model for repeated information on many records on cassandra

I have a massively huge table with hundreds of millions of records, and I mean to add a field to this table in which the same value would be repeated for millions of records. I don't know how to model this efficiently in Cassandra. Allow me to elaborate:
I have a generic table:
CREATE TABLE readings (
key int,
key2 int,
time timestamp,
name text,
PRIMARY KEY ((key, key2), time)
)
This table has 700,000,000+ records.
I want to create a field in this table named source. This field indicates where the record came from (since the software has many ways of receiving the information in the readings table). Possible values for this field are "XML: path\to\file.xml", "Direct import from the X database", or even "Manually added". I want this to be a descriptive field, used exclusively to allow later maintenance in the database where we want to manipulate only records from a given source.
The queries I want to run that I can't now are:
Which records in the readings table came from a given source?
What is the source of a given record?
A solution would be for me to create a table such as:
CREATE TABLE readings_per_source(
source text,
key int,
key2 int,
time timestamp,
PRIMARY KEY (source, key, key2, time)
)
which would allow me to execute the first query, but would also mean creating 700,000,000+ new records in my database with a lot of information, taking a lot of unnecessary storage space, since tens of millions of these records would share the same value for source.
If this were a relational environment, I would create a source_id field on the readings table and a source table with id (PK) and name fields; that would mean storing only an additional integer for each row in the readings table, plus a new table with as many records as there are distinct sources.
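Translated literally into CQL, that relational idea might look like this (a sketch; Cassandra has no joins, so resolving the source name would take a second query):
CREATE TABLE sources (
    source_id int PRIMARY KEY,
    name text
);
ALTER TABLE readings ADD source_id int;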
How does one go about modelling this in Cassandra?
Your schema
CREATE TABLE readings_per_source(
source text,
key int,
key2 int,
time timestamp,
PRIMARY KEY (source, key, key2, time)
)
is a very bad idea, because source is the partition key and you can have millions of records sharing the same source, i.e. a very, very wide partition, which leads to hot spots.
For your second query, What is the source of a given record?, it is quite trivial if you access the data using the record's primary keys (key, key2). The source column can be added as just a regular column on the table.
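For instance (names taken from the question; a minimal sketch):
ALTER TABLE readings ADD source text;
-- then, given a record's partition key:
SELECT time, source FROM readings WHERE key = ? AND key2 = ?;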
For the first query, Which records in the readings table came from a given source?, it is trickier. The idea here is to fetch all the records having the same source.
Do you realize that this query can potentially return tens of millions of records?
If that's what you want to do, there is a solution: use the new SASI secondary index (read my blog post for all the details) and create an index on the source column.
CREATE TABLE readings (
key int,
key2 int,
time timestamp,
name text,
source text,
PRIMARY KEY ((key, key2), time)
)
CREATE CUSTOM INDEX source_idx ON readings(source)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'PREFIX',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};
Then, to fetch all records having the same source, use the server-side paging feature of the Java driver (or any other DataStax driver).
http://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise is a pretty good article on how to go about joining tables in Cassandra.
Normalized data will always take up less storage than denormalized (flat) data (provided the related data is larger than the key used to join the tables together), but it requires joins, which take more horsepower to compute during queries.
There's always a trade-off. There's also a trade-off concerning state with fully normalized data, one example being the customer who changes addresses. In a fully normalized schema, once the address change is made, all invoices for the customer, past and present, show the new address. This isn't always desirable.
Often it's desirable to partially normalize to provide historic state on records where it's important to show the state of the data at a given time, such as on invoices. In that case you'd store a copy of the customer address data on the invoice at the time of invoice creation.
This is especially important for pricing and taxes as well. You want the price and tax stored with the invoice so you can show what the customer paid at the time the invoice was created; then, when accounting runs monthly, yearly, and longer-term numbers, the prices on a given invoice are correct for the date on the invoice, even though the prices of the products may have changed since. Otherwise you have an accounting nightmare!
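A sketch of that partial denormalization in CQL (all names hypothetical):
CREATE TABLE invoices (
    invoice_id timeuuid PRIMARY KEY,
    customer_id uuid,
    billing_address text,  -- point-in-time copy of the customer's address
    unit_price decimal,    -- price as charged when the invoice was created
    tax decimal            -- tax as applied at creation time
);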
There is a lot more to consider than simply storage space when deciding how to normalize/de-normalize a schema.
Sorry for rambling...

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add 'birthday' to the primary key (let's say ('ID', 'birthday')), I cannot perform this query, because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This was done not by accident but deliberately, because such a feature (e.g. update users set status=1 where id>10) would allow a user to update all the data in a table at once, which can be very, very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans within multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside an IN statement may cause performance issues on C*.
But how can you get an efficient range query like select id from some_table where dob > '2000-01-01 00:00:01'? There are two options available, and neither of them is really acceptable:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound partition+clustering primary key, and use multiple queries like select * from dob_index where year=2014 and dob < '2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to get a reasonably even partition distribution across the cluster. The general idea is that you really shouldn't have a small number of very large partitions; prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you revisit your application logic in a way that avoids updating/deleting data at all:
Instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
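Appending a status and reading the latest one back would then look like this (a sketch; the ? marks are driver-side bind parameters):
-- every status change is a new row; nothing is updated in place:
INSERT INTO user_statuses (id, updated_at, status) VALUES (?, toTimestamp(now()), 'active');
-- latest status for a given user:
SELECT status FROM user_statuses WHERE id = ? ORDER BY updated_at DESC LIMIT 1;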
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.
