If I want to partition my primary key by time window, would it be better (for storage and retrieval efficiency) to use a textual representation of the time or a truncated native timestamp? I.e.:
CREATE TABLE user_data (
    user_id TEXT,
    log_day TEXT, -- store as a 'yyyymmdd' string
    log_timestamp TIMESTAMP,
    data_item TEXT,
    PRIMARY KEY ((user_id, log_day), log_timestamp));
or
CREATE TABLE user_data (
    user_id TEXT,
    log_day TIMESTAMP, -- store as timestamp-in-millis - (timestamp-in-millis mod 86400000)
    log_timestamp TIMESTAMP,
    data_item TEXT,
    PRIMARY KEY ((user_id, log_day), log_timestamp));
Regarding your column key "log_timestamp":
If you are working with multiple writing clients (which I suggest, since otherwise you probably won't get near the possible throughput of a distributed, write-optimized database like C*), you should consider using TimeUUIDs instead of timestamps, as they are conflict-free (assuming MAC addresses are unique). Otherwise you would have to guarantee that no two inserts happen at the same timestamp, or you will lose data. You can do column slice queries on TimeUUIDs and other time-based operations.
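For illustration, here is a minimal sketch of the TimeUUID variant (I've renamed the clustering column to log_time and made it a TIMEUUID; minTimeuuid/maxTimeuuid are the standard CQL functions for slicing a time range):

CREATE TABLE user_data (
    user_id TEXT,
    log_day TEXT,
    log_time TIMEUUID, -- conflict-free: embeds the timestamp plus node identity
    data_item TEXT,
    PRIMARY KEY ((user_id, log_day), log_time));

-- slice one day's partition by time range:
SELECT data_item FROM user_data
WHERE user_id = 'u1' AND log_day = '20240101'
  AND log_time >= minTimeuuid('2024-01-01 00:00+0000')
  AND log_time < maxTimeuuid('2024-01-01 12:00+0000');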
I'd use unix time (i.e. 1234567890) over either of those formats; to point to an entire day, you'd just use the timestamp for 00:00 of that day.
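As a sketch, assuming millisecond epoch timestamps and the second schema above (CQL has no arithmetic for this, so the truncation happens client-side; the literals below are just example values for 2024-01-01):

-- client side: day = ts - (ts % 86400000), i.e. 00:00 UTC of that day
INSERT INTO user_data (user_id, log_day, log_timestamp, data_item)
VALUES ('u1', 1704067200000, 1704097800000, 'some payload');

CQL accepts integers (epoch milliseconds) as timestamp literals, so these land as 2024-01-01 00:00 and 08:30 UTC.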
However, I very much recommend reading Advanced Time Series with Cassandra on the DataStax dev blog. It covers some important things to consider in your model, with regards to bucketing/splitting.
For my chat table design in Cassandra I have the following schema:
USE zwoop_chat;
CREATE TABLE IF NOT EXISTS public_messages (
    chatRoomId text,
    date timestamp,
    fromUserId text,
    fromUserNickName text,
    message text,
    PRIMARY KEY ((chatRoomId, fromUserId), date)
) WITH CLUSTERING ORDER BY (date ASC);
The following query:
SELECT * FROM public_messages WHERE chatroomid=? LIMIT 20
Results in the typical message:
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance. If you want to execute this query
despite the performance unpredictability, use ALLOW FILTERING;
Obviously I'm doing something wrong with the partitioning here.
I'm not experienced with Cassandra and a bit confused by online suggestions that Cassandra would have to do an entire table scan, which I don't really get; realistically, why would I ever want to fetch an entire table?
Another suggestion I read about is to partition the data, e.g. to fetch the latest messages per day. But this doesn't work for me: you don't know when the latest chat message occurred.
It could have been in the last hour, the last day, or the last week or month for that matter.
I'm pretty much used to SQL, or NoSQL stores like Mongo, but this simple use case seems to be a problem for Cassandra. So what is the recommended approach here?
Edit:
It seems that it is common practice to add a bucket integer (see the sketch below).
Let's say I create a bucket per 50 messages; is there a way to auto-increment it when the bucket is full?
I would prefer not having to fetch the MAX bucket and calculate when the bucket is full; that seems like bad performance on inserts.
It also seems like a bad idea to manage the buckets in Java: things like app restarts or load balancing would require extra logic.
(I currently use Java Spring JPA for Cassandra).
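For reference, the bucket-integer pattern mentioned above usually looks like the sketch below; bucket is a hypothetical integer the application derives (a message-counter window, a coarse time window, etc.), and reads walk buckets from newest to oldest until enough messages are collected:

CREATE TABLE IF NOT EXISTS public_messages (
    chatRoomId text,
    bucket int,
    date timestamp,
    fromUserId text,
    fromUserNickName text,
    message text,
    PRIMARY KEY ((chatRoomId, bucket), date)
) WITH CLUSTERING ORDER BY (date DESC);

-- read the newest bucket first, then bucket - 1, and so on:
SELECT * FROM public_messages WHERE chatRoomId = ? AND bucket = ? LIMIT 20;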
It works without bucketing using the following table design:
USE zwoop_chat;
CREATE TABLE IF NOT EXISTS public_messages (
    chatRoomId text,
    date timestamp,
    fromUserId text,
    fromUserNickName text,
    message text,
    PRIMARY KEY ((chatRoomId), date)
) WITH CLUSTERING ORDER BY (date DESC);
I had to remove fromUserId from the partition key; I assume that if it is part of the partition key, it has to be included in the WHERE clause, which is what caused the error.
The JPA query:
publicMessageRepository.findFirst20ByPkChatRoomIdOrderByPkDateDesc(chatRoomId);
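For reference, the CQL that Spring Data derives from this method name should be roughly the following; the descending order is already implied by the table's CLUSTERING ORDER, so the newest 20 messages come back from a single partition:

SELECT * FROM public_messages WHERE chatRoomId = ? LIMIT 20;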
I was using a timestamp as the primary key for my data, generated by calling toTimestamp(now()), but unfortunately this creates collisions.
I understand that timeuuid guarantees uniqueness, but if I ORDER BY timeuuid, does timeuuid also preserve the original insertion order?
From the docs:
Timeuuid types can be entered as integers for CQL input. A value of the timeuuid type is a Version 1 UUID. A Version 1 UUID includes the time of its generation and are sorted by timestamp, making them ideal for use in applications requiring conflict-free timestamps. For example, you can use this type to identify a column (such as a blog entry) by its timestamp and allow multiple clients to write to the same partition key simultaneously. Collisions that would potentially overwrite data that was not intended to be overwritten cannot occur.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/uuid_type_r.html
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html
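To make the ordering guarantee concrete, here is a minimal sketch (table and names are illustrative). now() generates a Version 1 UUID server-side, and rows within a partition are sorted by the timeuuid's embedded timestamp, so ORDER BY timeuuid preserves generation order as long as the generating clocks are sane:

CREATE TABLE posts (
    blog_id text,
    id timeuuid,
    body text,
    PRIMARY KEY (blog_id, id)
) WITH CLUSTERING ORDER BY (id DESC);

INSERT INTO posts (blog_id, id, body) VALUES ('b1', now(), 'first');
INSERT INTO posts (blog_id, id, body) VALUES ('b1', now(), 'second');

-- newest first; dateOf() extracts the embedded timestamp:
SELECT dateOf(id), body FROM posts WHERE blog_id = 'b1';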
I was searching for pagination in Cassandra and found this perfect topic: Results pagination in Cassandra (CQL), with an answer accepted by the majority of people. But I want to do the same thing on multiple computers. I'll provide an example...
The problem
Let's say I have three computers that are connected to the same Cassandra DB. Each computer wants to take a few rows from the following table:
CREATE TABLE IF NOT EXISTS lp_webmap.page (
domain_name1st text,
domain_name2nd text,
domain_name3rd text,
location text,
title text,
rank float,
updated timestamp,
PRIMARY KEY (
(domain_name1st, domain_name2nd, domain_name3rd), location
)
);
Each computer takes a few rows and performs time-consuming calculations on them. For a fixed partition key (domain_name1st, domain_name2nd, domain_name3rd) and varying clustering key (location), there can still be thousands of results.
And now the problem comes: how can computer1 quickly lock the rows it is working on, so the other computers skip them?
Unusable solution
In standard SQL I would use something like this:
CREATE TABLE IF NOT EXISTS lp_registry.page_lock (
    domain_name1st text,
    domain_name2nd text,
    domain_name3rd text,
    page_from int,
    page_count int,
    locked timestamp,
    PRIMARY KEY (
        (domain_name1st, domain_name2nd, domain_name3rd), locked, page_from
    )
) WITH CLUSTERING ORDER BY (locked DESC);
This would allow me to do the following:
Select the first 10 pages on computer1 and lock them (page_from=1, page_count=10)
Quickly check the locks on the other two machines and take unused pages for calculation
Take and lock a bigger number of pages on the faster computers
Delete all locks for a given partition key after all pages are processed
Question
However, I can't do LIMIT 20,10 in Cassandra, and the accepted approach alone doesn't help, since I want to paginate from different computers. Is there any way I can paginate through these pages quickly?
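A hedged sketch of the fast part, following the linked answer: each worker pages within a partition by clustering key, remembering the last location it saw (deciding which worker claims which rows still has to be coordinated somewhere else):

SELECT location, title, rank FROM lp_webmap.page
WHERE domain_name1st = ? AND domain_name2nd = ? AND domain_name3rd = ?
  AND location > ?  -- last location this worker processed; omit for the first page
LIMIT 10;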
I am designing an application which will accept data/events from customer-facing systems, persist them for audit, and act as a source to replay messages in case downstream systems need a correction in any data feed.
I don't plan to do much analytics on this data (that will be done in a downstream system), but I am expected to persist it and let ad-hoc queries run against it.
A few characteristics of my system:
(1) 99% write / 1% read
(2) High write throughput (roughly 30,000 events a second, each event having roughly 100 attributes)
(3) Dynamic nature of data; can't conform to a fixed schema.
These characteristics make me think of Apache Cassandra as an option, either with its wide-row layout or with a map to store my attributes.
I did a few samples with a single node and the Kundera ORM writing events to a map, and got a maximum write throughput of 1,500 events a second per thread. I can scale it out with more threads and Cassandra nodes.
But is that close to what I should be getting, in your experience? Some of the benchmarks available on the net look confusing. (I am on Cassandra 2.0, with Kundera ORM 2.13.)
It seems that your Cassandra data model is "overusing" the map collection type. If that is how you are addressing "Dynamic nature of data; can't conform to a fixed schema", there are other ways.
CREATE TABLE user_events (
    event_time timeuuid PRIMARY KEY,
    attributes map<text, text>,
    session_token text,
    state text,
    system text,
    user text
);
It looks like the key-value pairs stored in the attributes column are the actual payload of your event. Therefore they should be rows in partitions, using the keys of your map as the clustering key.
CREATE TABLE user_events (
    event_time TIMEUUID,
    session_token TEXT STATIC,
    state TEXT STATIC,
    system TEXT STATIC,
    user TEXT STATIC,
    attribute TEXT,
    value TEXT,
    PRIMARY KEY (event_time, attribute)
);
This makes event_time and attribute part of the primary key; event_time is the partition key and attribute is the clustering key.
The STATIC keyword makes these columns "properties" of the event that are stored only once per partition.
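A quick sketch of how writes and reads look against that table (literal values are made up; :evt stands for one timeuuid generated per event by the client):

INSERT INTO user_events (event_time, session_token, state, system, user, attribute, value)
VALUES (:evt, 'tok-1', 'active', 'web', 'alice', 'color', 'red');

INSERT INTO user_events (event_time, attribute, value)
VALUES (:evt, 'size', 'large');

-- returns two rows, one per attribute, each carrying the shared static columns:
SELECT user, attribute, value FROM user_events WHERE event_time = :evt;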
Have you tried going through cassandra.yaml and cassandra-env.sh? Tuning the cluster nodes is very important for optimizing performance. You might also want to take a look at OS parameters, and you need to make sure swap is disabled. That helped me increase my cluster's performance.
Please note that I am using NoSQL for the first time, and pretty much every concept in this NoSQL world is new to me, having come from RDBMS for a long time!
In one of my heavily used applications, I want to use NoSQL for part of the data and move it out of MySQL, where the transactional/relational model doesn't make sense. What I would get is the A and P of CAP [availability and partition tolerance].
The present data model is as simple as this:
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (integer)
We can safely assume that this part of the application is similar to logging of activity!
I would like to move this to NoSQL per my requirements, separate from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking at the map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data as values.
After reading through User Defined Types in Cassandra: can I use a UserDefinedType as the value, which essentially gives me one key and multiple values? Otherwise I could use normal columns without a UserDefinedType. One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to the same table, since the key varies from application to application and within an application each entity will be unique.
No application/business function will access this data without the key; in simple terms, there is no requirement to fetch data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
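To make the UDT idea concrete, here is a hedged sketch based on the columns in your model (all names are hypothetical; note that a UDT used as a column value must be frozen):

CREATE TYPE activity (
    entity_data text,
    created_on timestamp,
    version int
);

CREATE TABLE activity_log (
    entity_id int,
    entity_type text,
    entity_app text,
    details frozen<activity>,
    PRIMARY KEY ((entity_id, entity_type, entity_app))
);

INSERT INTO activity_log (entity_id, entity_type, entity_app, details)
VALUES (42, 'ORDER', 'billing',
        { entity_data: 'payload', created_on: toTimestamp(now()), version: 1 });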
Let me explain the Cassandra data model a bit (or at least part of it). You create tables like so:
create table event(
    id uuid,
    timestamp timeuuid,
    some_column text,
    some_column2 list<text>,
    some_column3 map<text, text>,
    some_column4 map<text, text>,
    primary key (id, timestamp /* , more clustering columns... */)
);
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key; these are called clustering keys. To query, you almost always hit a partition (by specifying equality in the WHERE clause). Any further filters in your query are then applied within the selected partition. If you don't specify a partition key, you make a cluster-wide query, which will be slow or, more likely, time out. After hitting the partition, you can filter with equality matches on subsequent clustering keys in order, plus a range query on the last clustering key specified in your query. Anyway, that's all about querying.
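To make those rules concrete, here are a couple of example queries against the event table above (the UUID and dates are placeholders):

-- good: pins the partition, then range-filters on the clustering key:
SELECT * FROM event
WHERE id = 123e4567-e89b-12d3-a456-426614174000
  AND timestamp >= minTimeuuid('2016-01-01 00:00+0000')
  AND timestamp < maxTimeuuid('2016-02-01 00:00+0000');

-- bad: no partition key, so this becomes a cluster-wide scan (requires ALLOW FILTERING):
SELECT * FROM event WHERE some_column = 'x' ALLOW FILTERING;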
In terms of structure, you have a few column types: some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used inside collections, e.g. a Person may have a map of addresses: map<text, frozen<address>>. You would typically store info in columns if you need to query on it, index it, or you know each row will have those columns. You're also free to use a map column, which lets you store "arbitrary" key-value data; that is what it seems you're looking to do.
One thing to watch out for: your primary key is unique per record. If you do another insert with the same PK, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's part of the primary key for an existing row.
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources, so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very, very good choice.