different queries for the same table in Cassandra - cassandra

In my Cassandra query-based model, I designed a table with the following primary key: ((timestamp, fraction_in_time), sensor_id, big_sensor_id)
As you can guess, timestamp and fraction_in_time form the partition key, and sensor_id and big_sensor_id are the clustering columns.
The domain is storing data from sensors, and we have two kinds of sensors: a big sensor that includes several small sensors.
The primary key is designed for this main query: get all (or a subset specified by id) sensor data for a given period of time.
On the other hand, I also want to run another query: get all (or a subset specified by time) data for a given sensor id.
I have created a materialized view for the second query using the primary key ((sensor_id, big_sensor_id), timestamp, fraction_in_time), but it duplicates all the data and needs much more storage! Is there another standard way to handle this situation?
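For reference, a minimal sketch of what the base table and the materialized view might look like; only the key layout comes from the question, while the table names and the value column are assumptions for illustration:
CREATE TABLE sensor_data (
timestamp timestamp,
fraction_in_time int,
sensor_id int,
big_sensor_id int,
value double,
PRIMARY KEY ((timestamp, fraction_in_time), sensor_id, big_sensor_id)
);

-- the view duplicates the data on disk, which is the storage cost in question
CREATE MATERIALIZED VIEW sensor_data_by_sensor AS
SELECT * FROM sensor_data
WHERE timestamp IS NOT NULL AND fraction_in_time IS NOT NULL
AND sensor_id IS NOT NULL AND big_sensor_id IS NOT NULL
PRIMARY KEY ((sensor_id, big_sensor_id), timestamp, fraction_in_time);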

Related

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key in your queries (ALLOW FILTERING will badly impact performance in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the work in your application: query all (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
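A rough sketch of that approach; the tracking table is hypothetical and not part of the original schema:
-- hypothetical table tracking which secondary_ids exist for each id
CREATE TABLE time_data_secondary_ids (
id int,
secondary_id int,
PRIMARY KEY (id, secondary_id)
);

-- step 1: look up the known secondary_ids for the id
SELECT secondary_id FROM time_data_secondary_ids WHERE id = ?;

-- step 2: one query per (id, secondary_id) pair, issued asynchronously
-- and in parallel from the application, then merged there
SELECT * FROM time_data WHERE id = ? AND secondary_id = ?;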

Design data model for messaging system with Cassandra

I am new to Cassandra and trying to build a data model for a messaging system. I found a few solutions, but none of them exactly matches my requirements. There are two main requirements:
Get a list of last messages for a particular user, from all other users, sorted by time.
Get a list of messages for one-to-one message history, sorted by time as well.
I thought of something like this,
CREATE TABLE chat (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY((to_user,from_user),time)
) WITH CLUSTERING ORDER BY (time DESC);
But this design has a few issues. I won't be able to satisfy the first requirement, since this design requires passing from_user as well. It would also be inefficient as the number of (to_user, from_user) pairs increases.
You are right. That one table won't satisfy both queries, so you will need two tables, one for each query. This is a core concept of Cassandra data modeling: query-driven design.
So, for the query looking for messages to a user:
CREATE TABLE chat (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY((to_user),time)
) WITH CLUSTERING ORDER BY (time DESC);
For messages from one specific user to another (this second table needs its own name, e.g. chat_by_from_user):
CREATE TABLE chat_by_from_user (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY((to_user),from_user,time)
) WITH CLUSTERING ORDER BY (from_user ASC, time DESC);
A slight difference from yours: from_user is a clustering column and not part of the partition key. This is to minimize the number of SELECT queries needed in application code.
It's possible to use the second table to satisfy both queries, but you will have to supply from_user in order to use a range query on time.
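For illustration, the corresponding queries could look like this, assuming the two table names used above:
-- requirement 1: latest messages to a user, already sorted newest first
SELECT * FROM chat WHERE to_user = ? LIMIT 20;

-- requirement 2: one-to-one history, optionally restricted by time
SELECT * FROM chat_by_from_user WHERE to_user = ? AND from_user = ? AND time > ?;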

Write performance of Cassandra with Kundera ORM

I am designing an application which will accept data/events from customer-facing systems, persist them for audit, and act as a source to replay messages in case downstream systems need a correction in any data feed.
I don't plan to do much analytics on this data (that will be done in a downstream system), but I am expected to persist this data and allow ad hoc queries against it.
A few characteristics of my system:
(1) 99% writes, 1% reads
(2) High write throughput (roughly 30,000 events a second, each event having roughly 100 attributes)
(3) Dynamic nature of the data; it can't conform to a fixed schema.
These characteristics make me think of Apache Cassandra as an option, either with the wide-row feature or with a map to store my attributes.
I did a few samples with a single node and the Kundera ORM writing events to a map, and got a maximum write throughput of 1500 events a second per thread. I can scale it out with more threads and Cassandra nodes.
But is that close to what I should be getting, in your experience? A few of the benchmarks available on the net look confusing. (I am on Cassandra 2.0, with Kundera ORM 2.13.)
It seems that your Cassandra data model is "overusing" the map collection type. If that is meant to address your concern about the "dynamic nature of data; can't conform to a fixed schema", there are other ways.
CREATE TABLE user_events (
event_time timeuuid PRIMARY KEY,
attributes map<text, text>,
session_token text,
state text,
system text,
user text
);
It looks like the key-value pairs stored in the attributes column are the actual payload of your event. Therefore they should be rows in partitions, using the keys of your map as the clustering key.
CREATE TABLE user_events(
event_time TIMEUUID,
session_token TEXT STATIC,
state TEXT STATIC,
system TEXT STATIC,
user TEXT STATIC,
attribute TEXT,
value TEXT,
PRIMARY KEY(event_time, attribute)
);
This makes event_time and attribute part of the primary key: event_time is the partition key and attribute is the clustering key.
The STATIC keyword makes these columns "properties" of the event that are stored only once per partition.
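To make the layout concrete, writes and reads against this table might look roughly like the following (the literal values are made up for illustration):
-- the static columns are written once per event_time partition
INSERT INTO user_events (event_time, session_token, state, system, user)
VALUES (?, 'token-123', 'ACTIVE', 'billing', 'alice');

-- each key-value pair of the former map becomes its own clustered row
INSERT INTO user_events (event_time, attribute, value)
VALUES (?, 'ip', '10.0.0.1');

-- reading the whole event returns one row per attribute, with the
-- static columns repeated in every result row
SELECT * FROM user_events WHERE event_time = ?;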
Have you tried going through cassandra.yaml and cassandra-env.sh? Tuning the cluster nodes is very important for optimizing performance. You might also want to take a look at the OS parameters, and you need to make sure swap is set to 0. That helped me increase my cluster's performance.

Cassandra: selecting first entry for each value of an indexed column

I have a table of events and would like to extract the first timestamp (column unixtime) for each user.
Is there a way to do this with a single Cassandra query?
The schema is the following:
CREATE TABLE events (
id VARCHAR,
unixtime bigint,
u bigint,
type VARCHAR,
payload map<text, text>,
PRIMARY KEY(id)
);
CREATE INDEX events_u
ON events (u);
CREATE INDEX events_unixtime
ON events (unixtime);
CREATE INDEX events_type
ON events (type);
According to your schema, each user will have a single time stamp. If you want one event per entry, consider:
PRIMARY KEY (id, unixtime).
Assuming that is your schema, the entries for a user will be stored in ascending unixtime order. Be careful though...if it's an unbounded event stream and users have lots of events, the partition for the id will grow and grow. It's recommended to keep partition sizes to tens or hundreds of megs. If you anticipate larger, you'll need to start some form of bucketing.
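A minimal sketch of what such bucketing could look like, assuming a bucket column (e.g. a day or month number derived from unixtime) that is not part of the original schema:
CREATE TABLE events_bucketed (
id VARCHAR,
bucket int,
unixtime bigint,
u bigint,
type VARCHAR,
payload map<text, text>,
PRIMARY KEY ((id, bucket), unixtime)
);
-- each (id, bucket) partition stays bounded in size; queries supply both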
Now, on to your query. In a word, no. If you don't hit a partition (by specifying the partition key), your query becomes a cluster-wide operation. With little data it'll work, but with lots of data you'll get timeouts. If you do have the data in its current form, then I recommend you use the Cassandra Spark connector and Apache Spark to do your query. An added benefit of the Spark connector is that if your Cassandra nodes are also Spark worker nodes, then due to locality you can efficiently hit a secondary index without specifying the partition key (which would normally cause a cluster-wide query with timeout issues, etc.). You could even use Spark to get the required data and store it in another Cassandra table for fast querying.
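For that last idea, a sketch of what such a query-specific table could look like; the table name and layout are assumptions, not something given in the question:
-- hypothetical table populated (e.g. by a Spark job) from the events table
CREATE TABLE events_by_user (
u bigint,
unixtime bigint,
id VARCHAR,
PRIMARY KEY (u, unixtime)
) WITH CLUSTERING ORDER BY (unixtime ASC);

-- earliest timestamp for a given user
SELECT unixtime FROM events_by_user WHERE u = ? LIMIT 1;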

An Approach to Cassandra Data Model

Please note that this is my first time using NoSQL, and pretty much every concept in this NoSQL world is new to me, having come from the RDBMS world for a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move it out of MySQL where the transactional/relational model doesn't make sense. What I would get is the A and P of CAP [availability and partition tolerance].
The present data model is as simple as this:
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (integer)
We can safely assume that this part of the application is similar to activity logging.
I would like to move this to NoSQL, as per my requirements, and keep it separate from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type. Thinking in those map terms,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data as values.
After reading about user-defined types in Cassandra, can I use a UDT as the value, which essentially gives one key and multiple values? Or should I use normal columns without a UDT? One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to the same table, since the key varies from application to application and within an application each entity will be unique.
No application/business function will access this data without the key; in simple terms, there is no requirement to get data randomly.
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the Cassandra data model a bit (or at least a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp, ...)
);
Note the primary key. There's multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster wide query, which may be slow or most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you need to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which would let you store "arbitrary" key-value data; that seems to be what you're looking to do.
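As an illustration of that map-column approach, here is a sketch of how the original MySQL row could map onto a Cassandra table; the table name and exact column types are assumptions based on the question:
CREATE TABLE entity_log (
entity_id int,
entity_type text,
created_on timeuuid,
version int,
entity_data map<text, text>,
PRIMARY KEY ((entity_id, entity_type), created_on)
);
-- (entity_id, entity_type) is the lookup key from the question;
-- created_on clusters the log entries per entity in time order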
One thing to watch out for: your primary key is unique per record. If you do another insert with the same primary key, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's part of the primary key for any row.
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources, so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very good choice.
