Cassandra: selecting first entry for each value of an indexed column

I have a table of events and would like to extract the first timestamp (column unixtime) for each user.
Is there a way to do this with a single Cassandra query?
The schema is the following:
CREATE TABLE events (
id VARCHAR,
unixtime bigint,
u bigint,
type VARCHAR,
payload map<text, text>,
PRIMARY KEY(id)
);
CREATE INDEX events_u
ON events (u);
CREATE INDEX events_unixtime
ON events (unixtime);
CREATE INDEX events_type
ON events (type);

According to your schema, each user will have a single timestamp. If you want one event per entry, consider:
PRIMARY KEY (id, unixtime).
Assuming that is your schema, the entries for a user will be stored in ascending unixtime order. Be careful, though: if it's an unbounded event stream and users have lots of events, the partition for the id will grow and grow. It's recommended to keep partition sizes to tens or hundreds of megabytes. If you anticipate larger, you'll need some form of bucketing.
Now, on to your query. In a word, no. If you don't hit a partition (by specifying the partition key), your query becomes a cluster-wide operation. With little data it'll work, but with lots of data you'll get timeouts. If you do have the data in its current form, then I recommend you use the Cassandra Spark connector and Apache Spark to run your query. An added benefit of the Spark connector is that if your Cassandra nodes double as Spark worker nodes, then thanks to data locality you can efficiently hit a secondary index without specifying the partition key (which would normally cause a cluster-wide query with timeout issues, etc.). You could even use Spark to extract the required data and store it in another Cassandra table for fast querying.
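By way of illustration, a bucketed per-user layout could look something like this (the table name, bucket column, and month-level granularity are assumptions for the sketch, not part of your schema):
CREATE TABLE events_by_user (
u bigint,
bucket text, -- e.g. '201801'; bounds partition growth per user
unixtime bigint,
id varchar,
type varchar,
payload map<text, text>,
PRIMARY KEY ((u, bucket), unixtime)
) WITH CLUSTERING ORDER BY (unixtime ASC);
-- Earliest event for one user within one bucket:
SELECT unixtime, id, type FROM events_by_user
WHERE u = 42 AND bucket = '201801'
LIMIT 1;
With ascending clustering order, LIMIT 1 returns the first timestamp in the bucket without scanning the rest of the partition.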

Related

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have the following table in Cassandra:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key in your queries (ALLOW FILTERING will badly impact performance in most cases).
One way to go is, if you know all the secondary_id values (you could add a table to track them if necessary), to do the work in your application: query all (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
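A minimal sketch of that approach, with a hypothetical tracking table (the table and column names are illustrative):
CREATE TABLE secondary_ids_by_id (
id int,
secondary_id int,
PRIMARY KEY (id, secondary_id)
);
-- 1. Enumerate the known secondary_ids for the id:
SELECT secondary_id FROM secondary_ids_by_id WHERE id = ?;
-- 2. Fan out one fully-keyed query per (id, secondary_id) pair,
--    issued asynchronously from the application:
SELECT * FROM time_data WHERE id = ? AND secondary_id = ?;
Each fan-out query hits exactly one partition, so no filtering is needed and the work spreads across the replicas that own those partitions.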

Data Modelling in Cassandra for job queues

I am trying to store all the scheduler jobs in Cassandra.
I have designed all the locking tables and they seem fine. I am having difficulty creating a job queue table.
My Requirement is
1) I need to query all the jobs that are not completed.
CREATE TABLE jobs(
jobId text,
startTime timestamp,
endTime timestamp,
status text,
state text,
jobDetails text,
primary key (X,X))
with clustering order by (X desc);
where, state - on / off
status - running / failed / completed
I am not sure which one to keep as the primary key (since it must be unique). Also, I need to query all the jobs in the 'on' state. Could somebody help me design this in Cassandra? Even if you propose something with a composite partition key, I am fine with it.
Edited:
I came up with a data model like this:
CREATE TABLE job(
jobId text,
startTime timestamp,
endTime timestamp,
state text,
status text,
jobDetails text,
primary key (state, jobId, startTime)
) with clustering order by (startTime desc);
I am able to insert like this,
INSERT INTO job (jobId, startTime, endTime, status,state, jobDetails) VALUES('nodestat',toTimestamp(now()), 0,'running','on','{
"jobID": "job_0002",
"jobName": "Job 2",
"description": "This does job 2",
"taskHandler": require("./jobs/job2").runTask,
"intervalInMs": 1000
}');
Query like this,
SELECT * FROM job WHERE state = 'on';
Will this create any performance impact?
You may be implementing a Cassandra antipattern.
See https://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets for a blog post discussing the problems you may run into when using Cassandra as a message queue.
Apart from that, there is some information on how to do it the "right way" in Cassandra on Slideshare: https://de.slideshare.net/alimenkou/high-performance-queues-with-cassandra
There are many projects out there which fit scheduling and/or messaging better, for example http://www.quartz-scheduler.org/overview/features.html.
Update for your edit above:
primary key (state,jobId, startTime)
This will create one partition per state, resulting in huge partitions and hotspots. Transitioning a job's state will move it to a different partition, so you will have deleted entries and possible compaction and performance issues (depending on your number of jobs).
All jobs with state='on' will be on one node (and its replicas), and all jobs with state='off' on another node; you will have only two partitions in your design.
Since you are open to changes to the model, see if the model below works for you:
CREATE TABLE job(
partition_key text, -- type assumed here; a time bucket such as '20180105'
jobId text,
startTime timestamp,
endTime timestamp,
state text,
status text,
jobDetails text,
primary key (partition_key, state, jobId, startTime)
) with clustering order by (startTime desc);
Here the partition_key column value can be calculated based on your volume of jobs.
For example:
If your job count is less than 100K jobs in a single day, you can keep the partition at the single-day level, i.e. YYYYMMDD (20180105); if it is 100K per hour, you can change it to YYYYMMDDHH (2018010518). Change the clustering columns depending on your filtering order.
This way you can query by state, provided you know which time window you want to query.
It avoids creating too many partitions, and avoids exploding a partition with too many rows.
It also distributes the load evenly across partitions.
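As a quick sketch, a query against this bucketed model (assuming the day-level text bucket above) would look like:
SELECT * FROM job
WHERE partition_key = '20180105' -- the day bucket you are interested in
AND state = 'on';
This hits a single, bounded partition instead of one giant per-state partition.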
It would be easier to help design a better model if you could specify what adjustments/additions you can make to your query pattern.
You need to include your equality columns in the partition key, and here the equality columns are status and state. Check whether these two make a good partition key; if not, use either a custom column or another existing column as part of the partition key. Since jobId makes the record unique, you can keep it as a clustering column. I am assuming you are not querying the table by jobId.

Write performance of Cassandra with Kundera ORM

I am designing an application which will accept data/events from customer-facing systems, persist them for audit, and act as a source to replay messages in case downstream systems need a correction in any data feed.
I don't plan to do much analytics on this data (that will be done in a downstream system), but I am expected to persist it and allow ad-hoc queries.
Few characteristics of my system
(1) 99% write / 1% read
(2) High write throughput (roughly 30,000 events per second, each event having roughly 100 attributes)
(3) Dynamic nature of data; can't conform to a fixed schema.
These characteristics make me think of Apache Cassandra as an option, either with the wide-row feature or with a map to store my attributes.
I ran a few samples with a single node and the Kundera ORM, writing events to a map, and got a maximum write throughput of 1,500 events per second per thread. I can scale that out with more threads and Cassandra nodes.
But is that close to what I should be getting, in your experience? The few benchmarks available on the net look confusing. (I am on Cassandra 2.0, with Kundera ORM 2.13.)
It seems that your Cassandra data model is "overusing" the map collection type. If that is how you are addressing your concern about the "dynamic nature of data; can't conform to a fixed schema", there are other ways.
CREATE TABLE user_events (
event_time timeuuid PRIMARY KEY,
attributes map<text, text>,
session_token text,
state text,
system text,
user text
);
It looks like the key-value pairs stored in the attributes column are the actual payload of your event. Therefore they should be rows in partitions, using the keys of your map as the clustering key.
CREATE TABLE user_events(
event_time TIMEUUID,
session_token TEXT STATIC,
state TEXT STATIC,
system TEXT STATIC,
USER TEXT STATIC,
attribute TEXT,
value TEXT,
PRIMARY KEY(event_time, attribute)
);
This makes event_time and attribute part of the primary key; event_time is the partition key and attribute is the clustering key.
The STATIC modifier makes these columns "properties" of the event, stored only once per partition.
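A sketch of how writes might look under this layout (the literal values, including the timeuuid, are made up for illustration):
-- The application generates one timeuuid per event and reuses it for
-- every attribute row of that event; STATIC columns are stored once:
INSERT INTO user_events (event_time, session_token, state, system, user, attribute, value)
VALUES (13814000-1dd2-11b2-8080-808080808080, 'tok-123', 'active', 'billing', 'alice', 'amount', '42.50');
INSERT INTO user_events (event_time, attribute, value)
VALUES (13814000-1dd2-11b2-8080-808080808080, 'currency', 'EUR');
-- All attributes of the event come back as rows from its partition:
SELECT attribute, value FROM user_events
WHERE event_time = 13814000-1dd2-11b2-8080-808080808080;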
Have you tried going through cassandra.yaml and cassandra-env.sh? Tuning the cluster nodes is very important for optimizing performance. You might also want to take a look at OS parameters, and you need to make sure swap is disabled. That helped me increase my cluster's performance.

An Approach to Cassandra Data Model

Please note that this is my first time using NoSQL, and pretty much every concept in this NoSQL world is new to me, coming from the RDBMS world for a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move it out of MySQL, where the transactions/relational model doesn't make sense. What I would get is availability and partition tolerance (the A and P of CAP).
The present data model is simple as this
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (interger)|
We can safely assume that this part of the application is similar to logging of activity!
I would like to move this to NoSQL per my requirements, and separate it from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking at the map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data in the values!
After reading through user-defined types in Cassandra: can I use a UDT as the value, which essentially gives one key and multiple values? Or should I use normal columns without a UDT? One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to the same table, since the key varies from application to application and, within an application, each entity is unique!
There is no application/business function that accesses this data without the key; in simple terms, there is no requirement to get data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp .... );
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys.

To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done within the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with matches on subsequent clustering keys in order, with a range query allowed on the last clustering key specified in your query. Anyway, that's all about querying.
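As a short illustration of that pattern against the event table above (the literal values are placeholders):
-- Equality on the partition key, then a range on the first clustering key:
SELECT * FROM event
WHERE id = 123e4567-e89b-12d3-a456-426614174000
AND timestamp >= minTimeuuid('2018-01-01')
AND timestamp < maxTimeuuid('2018-02-01');
minTimeuuid and maxTimeuuid are the standard CQL functions for turning timestamps into timeuuid range bounds.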
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in dedicated columns if you need to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which lets you store "arbitrary" key-value data; this is what it seems you're looking to do.
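A hedged sketch of the UDT-in-a-collection idea (the type and table names are illustrative; note that UDTs inside collections must be frozen):
CREATE TYPE address (
street text,
city text,
zip text
);
CREATE TABLE person (
id uuid PRIMARY KEY,
name text,
addresses map<text, frozen<address>>, -- e.g. 'home' -> { street, city, zip }
attributes map<text, text> -- "arbitrary" key-value data
);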
One thing to watch out for: your primary key is unique per record. If you do another insert with the same primary key, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's part of the primary key for any row.
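For example, assuming a table whose full primary key is just (id, timestamp), these two inserts leave only the second value:
INSERT INTO event (id, timestamp, some_column)
VALUES (123e4567-e89b-12d3-a456-426614174000, 13814000-1dd2-11b2-8080-808080808080, 'first');
-- Same primary key: no error, the previous value is silently overwritten.
INSERT INTO event (id, timestamp, some_column)
VALUES (123e4567-e89b-12d3-a456-426614174000, 13814000-1dd2-11b2-8080-808080808080, 'second');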
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources... so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very, very good choice.

Cassandra data model for application logs (billions of operations!)

Say I want to collect logs from a huge application cluster which produces 1,000-5,000 records per second. In the future this number might reach 100,000 records per second, aggregated from a 10,000-strong datacenter.
CREATE TABLE operation_log (
-- Seconds will be used as row keys, thus each row will
-- contain 1000-5000 log messages.
time_s bigint,
time_ms int, -- Microseconds (to sort data within one row).
uuid uuid, -- Monotonous UUID (NOT time-based UUID1)
host text,
username text,
accountno bigint,
remoteaddr inet,
op_type text,
-- For future filters — renaming a column must be faster
-- than adding a column?
reserved1 text,
reserved2 text,
reserved3 text,
reserved4 text,
reserved5 text,
-- 16*n bytes of UUIDs of connected messages, usually 0,
-- sometimes up to 100.
submessages blob,
request text,
PRIMARY KEY ((time_s), time_ms, uuid)) -- Partition on time_s
-- Because queries will be "from current time into the past"
WITH CLUSTERING ORDER BY (time_ms DESC);
CREATE INDEX oplog_remoteaddr ON operation_log (remoteaddr);
...
(secondary indices on host, username, accountno, op_type);
...
CREATE TABLE uuid_lookup (
uuid uuid,
time_s bigint,
time_ms int,
PRIMARY KEY (uuid));
I want to use the OrderedPartitioner, which will spread data all over the cluster by time_s (seconds). It must also scale to dozens of concurrent data writers as more application log aggregators are added to the application cluster (uniqueness and consistency are guaranteed by the uuid part of the PK).
Analysts will have to look at this data by performing these sorts of queries:
range query over time_s, filtering on any of the data fields (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
pagination query from the results of the previous one (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND token(uuid) < token($uuid) AND $filters),
count messages filtered by any data fields within a time range (SELECT COUNT(*) FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
group all data by any of the data fields within some range (will be performed by application code),
request dozens or hundreds of log messages by their uuid (hundreds of SELECT * FROM uuid_lookup WHERE uuid IN [00000005-3ecd-0c92-fae3-1f48, ...]).
My questions are:
Is this a sane data model?
Is using OrderedPartitioner the way to go here?
Does provisioning a few columns for potential filters make sense? Or is adding a column every once in a while cheap enough to run on a Cassandra cluster with some reserved headroom?
Is there anything that prevents it from scaling to 100,000 inserted rows per second from hundreds of aggregators while storing a petabyte or two of queryable data, provided that the number of concurrent queriers never exceeds 10?
This data model is close to a sane model, with several important modifications/caveats:
Do not use ByteOrderedPartitioner, especially not with time as the key. Doing this will result in severe hotspots on your cluster, as you'll do most of your reads and all your writes to only part of the data range (and therefore a small subset of your cluster). Use Murmur3Partitioner.
To enable your range queries, you'll need a sentinel key: a key you can know in advance. For log data, this is probably a time bucket plus some other known value that's not time-based (so your writes are evenly distributed); see the sketch after this list.
Your indices might be ok, but it's hard to tell without knowing your data. Make sure your values are low in cardinality, or the index won't scale well.
Make sure any potential filter columns adhere to the low-cardinality rule. Better yet, if you don't need real-time queries, use Spark to do your analysis. You should create new columns as needed; this is not a big deal, since Cassandra stores them sparsely. And if you use Spark, you can store these values in a map.
If you follow these guidelines, you can scale as big as you want. If not, you will have very poor performance and will likely get performance equivalent to a single node.
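A minimal sketch of the sentinel-key idea from the caveats above (the bucket granularity, shard count, and table name are assumptions to be tuned):
CREATE TABLE operation_log_v2 (
time_bucket bigint, -- e.g. time_s / 60: one partition per minute...
shard int, -- ...per shard; e.g. hash(host) % 16, known in advance
time_s bigint,
time_ms int,
uuid uuid,
host text,
request text, -- remaining payload columns as in the original table
PRIMARY KEY ((time_bucket, shard), time_s, time_ms, uuid)
) WITH CLUSTERING ORDER BY (time_s DESC, time_ms DESC);
-- A range query then fans out over the buckets and shards in the window:
SELECT * FROM operation_log_v2
WHERE time_bucket = 25434567 AND shard = 3
AND time_s < 1526074080 AND time_s > 1526074020;
With Murmur3Partitioner, the (time_bucket, shard) partitions spread evenly across the cluster, while each one stays readable as a contiguous time slice.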
