Cassandra Data Model design for vnodes enabled cluster? - cassandra

I have recently started working with Cassandra. We have cassandra cluster which is using DSE 4.0 version and has VNODES enabled. We have a tables like this -
Below is my first table -
CREATE TABLE customers (
customer_id int PRIMARY KEY,
last_modified_date timeuuid,
customer_value text
)
Read query pattern is like this on above table as of now since we need to get everything from above table and load it into our application memory every x minutes.
select customer_id, customer_value from datakeyspace.customers;
We have second table like this -
CREATE TABLE client_data (
client_name text PRIMARY KEY,
client_id text,
creation_date timestamp,
is_valid int,
last_modified_date timestamp
)
CREATE INDEX idx_is_valid_clnt_data ON client_data (is_valid);
Right now in the above table, we have 500 records and all those records has "is_valid" column value set as 1. And the read query pattern is like this on above table as of now since we need to get everything from above table and load it into our application memory every x minutes so the below query will return me all 500 records since everything has is_valid set to 1.
select client_name, client_id from datakeyspace.client_data where is_valid=1;
Since our cluster is VNODES enabled so my above query pattern is not efficient at all and it is taking lot of time to get the data from Cassandra. It takes around 50 seconds to get the data from cqlsh client. We are reading from these table with consistency level QUORUM.
Is there any possibility of improving our data model by using wide rows concept or anything else?
Any suggestions will be greatly appreciated.

Related

Cassandra data model for time series data

For monitoring some distributed software I insert their monitoring data into Cassandra table. The columns are metric_type, metric_value, host_name, component_type and time_stamp. The scenario is I collect all the metrics for all the nodes in every second. The time in uniform for all nodes and their metrics. The keys(that differentiate rows) are host_name, component_type, metric_type and time_stamp. I design my table like below:
CREATE TABLE metrics (
component_type text,
host_name text,
metric_type text,
time_stamp bigint,
metric_value text,
PRIMARY KEY ((component_type, host_name, metric_type), general_timestamp)
) WITH CLUSTERING ORDER BY (time_stamp DESC)
where component_type, host_name and metric_type are partitions key and time_stamp is clustering key.
The metrics table is suitable for the queries that gets some data according to their timestamp just for a host_name or a metric_type or a component_type, as using partition keys Cassandra will find the partition that data are stored and using clustering key will fetch data from that partition and this is the optimal case for Cassandra queries.
Besides that, I need a query that fetches all data just using time_stamp. For example :
SELECT * from metrics WHERE time_stamp >= 1529632009872 and time_stamp < 1539632009872 ;
I know the metric table is not optimal for the above query, because it should search every partition to fetch data. I guess in this situation we should design another table with the time_stamp as partition key, so data will be fetched from one or some limited number of partitions. But I am not certain about some aspects:
Is it optimal to set time_stamp as partition key? because of I insert data into the database every second and the partition key numbers will be a lot!
I need my queries to be interval on time_stamp and I know interval conditions are not allowed in partition keys, just allowed on clustering keys!
So what is the best Cassandra data model for such time series data and query?
Using time_stamp as partition key is not optimal in my opinion, as it would create a lot of partitions.
I would propose 2 solutions:
1) Go with a "week_first_day" as partition key. You would have to compute the correct week_first_day keys on your application side and then emit multiple select queries.
2) You could use ElasticSearch on top of cassandra. Cassandra remains the primary data source, but you have the freedom, to do complex selects. If you are interested, I would recommend to take a look at Elassandra .

Cassandra select query giving timeout error with Gocql driver

I am getting timeout error when executing more than 2000 SELECT queries simultaneously. I am using gocql client for Cassandra 3.7 (JAVA version 8).
"error":"gocql: no response received from cassandra within timeout period"...
I am having following table as schema,
CREATE TABLE my_db.my_message (
id text,
message_id uuid,
message text,
version text,
status tinyint,
PRIMARY KEY (id, message_id)
)
CREATE INDEX IF NOT EXISTS ON my_db.my_message(status);
Below is my query that gives timeout error when executing more than 2000 queries simultaneously.
"SELECT * FROM my_db.my_message WHERE id=? AND status = ?"
'id' is primary key and 'status' is secondary index in where clause.
'message_id' is also primary key but not used in this select query.
Any help would be appreciated. Thanks in advance.
Do not use index on frequently updated or deleted column
Remember when not to use an index
On high-cardinality columns for a query of a huge volume of records for a small number of results
In tables that use a counter column.
On a frequently updated or deleted column
To look for a row in a large partition unless narrowly queried
I think your column status has low cardinality and frequently updated. Since you are narrowing your search by providing partition key id, So low cardinality is not a problem for you. The main problem is you frequently update indexed column status. Every time you update cassandra store a tombstone.
Cassandra stores tombstones in the index until the tombstone limit reaches 100K cells. After exceeding the tombstone limit, the query that uses the indexed value will fail.
So you should filter data by status column value in the application layer.

Cassandra sorting the results by non-clustering key

Our use case with Cassandra is to show top 10 recent visitors of a blogpost. Following is the Cassandra table definition
CREATE TABLE blogs_by_visitor (
blogposturl text,
visitor text,
visited_ts timestamp,
PRIMARY KEY (blogposturl, visitor)
);
Now in order to show top 10 recent visitors for a given blogpost, there needs to be an explicit "order by" clause on timestamp desc. Since visted_ts isn't part of the clustering column in Cassandra, we aren't able to get this done. The reason for visited_ts not being part of clustering column is to avoid recording repeat (read as duplicate) visitors. The primary key is designed in such a way to upsert the latest timestamp for a repeat visitor.
In RDBMS world the query would look like the following and a secondary index could be created with blogposturl and timestamp columns.
Select visitor from blog_table
where
blogposturl = ?
and rownum <= 10
order by timestamp desc
An alternative currently being followed in our Cassandra application, is to obtain the results and then sort based on timestamp on the app side. But what if a particular blogpost becomes so popular and it had more than 100,000 visitors. The query becomes really slow for those blogs.
I'm thinking secondary index wouldn't be useful here, as I don't worry about filtering on it (rather just for sorting - which isn't possible).
Any idea on how we could model the table differently?
The actual table has additional columns, reduced it here for simplicity
These type of job are done by Apache Spark or Hadoop. A schedule job which compute the unique visitor order by timestamp for each url and store the result into cassandra.
Or you can create a Materialized View on top of the blogs_by_visitor. This table will make sure of unique visitor and the materialized view will oder the result based on visited_ts timestamp.
Let's create the Materialized View :
CREATE MATERIALIZED VIEW unique_visitor AS
SELECT *
FROM blogs_by_visitor
WHERE blogposturl IS NOT NULL AND visitor IS NOT NULL AND visited_ts IS NOT NULL
PRIMARY KEY (blogposturl, visited_ts, visitor)
WITH CLUSTERING ORDER BY (visited_ts DESC, visitor ASC);
Now you can just select the 10 recent unique visitor of a blogpost.
SELECT * FROM unique_visitor WHERE blogposturl = ? LIMIT 10;
you can see that i haven't specify the sort order in select query. Because in the materialized view schema a have specified default sort order visited_ts DESC
Note That : The above schema will result huge amount of unexpected tombstone generation in the Materialized Views
Or You could change your table schmea like below :
CREATE TABLE blogs_by_visitor (
blogposturl text,
year int,
month int,
day int,
visitor text,
visited_ts timestamp,
PRIMARY KEY ((blogposturl, year, month, day), visitor)
);
Now you have only a small amount of data in a single partition.So you can sort all the visitor based on visited_ts in that single partition from the client side. If you think number of visitor in a day can be huge then add hour to the partition key also.

Get first row for each partition key in Cassandra

I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of them have some business entity id, timestamp and some value. I need to get only latest value in terms of in-event timestamp for each business key, but events may come unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC )
Now if I insert some data in this table I can get latest value for some given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require to issue such query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using a version after 3.6, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017) which you can set to 1. This won't auto complete in cqlsh until 3.10 with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can work essentially any amount of data: It decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for any partition of the data, then perform a complex operation on each of them. Relational databases allow this, Cassandra does not. All it allows are full table scans (SELECT * from table1), or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) Cassandra nosql dbms to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make - is how many distinct customer accounts are active within the last month say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and it must match the kind of queries that you are trying to preform. The reason that Cassandra is trying to keep you from performing queries on the date column, is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most( 99%)
of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response time. Unfortunately Cassandra isn't able to determine automatically what the best way to partition data. The end user must determine a schema which fits into the C* datamodel and allows the queries they want at a high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT timestamp FROM active_accounts where customer_name = ? and account_name = ?
This is because on the the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of CustomerName and AccountName. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999,Tom,32)]
This places the account dates in order within each partition making it very fast for doing range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might be spanning. This schema takes a lot of (dev) work so if these queries are very infrequent you should use a distributed analytics platform instead.
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
Allow filtering was added to the cql syntax to prevent users from accidentally designing queries that will not scale. The secondary indexes are really only for use by those do analytics jobs or those C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index necessarily will require contacting every node in the cluster. This will become less and less performant as the cluster grows and is definitely not something you want for a frequent query.

Resources