When time-series data relates to some natural partition key, such as a sensor id, that key can obviously serve as the partition key. But what should we do if we are interested in a global view and there is no natural candidate for the partition key? Suppose we model the schema like this:
CREATE TABLE my_data (
year smallint,
day smallint,
date timestamp,
value text,
PRIMARY KEY ((year, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
This will (probably) work just fine in most cases, provided we know which year and day to fetch.
What if we don't care which day it is and simply want to see the 50 most recent items? What if we then want to see the next 50 items? Is there a way to do this in Cassandra, and what is the recommended approach?
Keep a second table of the year/day pairs, and read from it first to find out which partitions exist. When adding to my_data, update that table as well, but keep a cache of the days already inserted so that each application only attempts the insert once per day. For example, adding an extra key column lets you hold multiple streams instead of needing a separate table per time series:
CREATE TABLE my_data (
key blob,
year smallint,
day smallint,
date timestamp,
value text,
PRIMARY KEY ((key, year, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
CREATE TABLE my_data_keys (
key blob,
year smallint,
day smallint,
PRIMARY KEY ((key), year, day)
);
For inserts:
INSERT INTO my_data_keys (key, year, day) VALUES (0x01, 1, 2)
INSERT INTO my_data ...
Then keep an in-memory Set somewhere recording each key/year/day you have already stored, so you don't need to insert it every time. To read the most recent items:
SELECT year, day FROM my_data_keys WHERE key = 0x01 ORDER BY year DESC, day DESC;
The driver returns an iterator; for each (year, day) in it, query my_data until 50 records have been collected.
If inserts are frequent enough, you can simply work backwards from "today", issuing queries until you have 50 events. If the data is sparse, though, that can mean a lot of wasted reads, and the second table will work better.
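The write and read paths described above can be sketched roughly as follows. This is a minimal in-memory simulation in Python, not driver code: the dictionaries stand in for the my_data and my_data_keys tables, timestamps are plain integers, and the names FakeStore, insert, latest and skip are made up for illustration.

```python
from collections import defaultdict


class FakeStore:
    """In-memory stand-in for the my_data and my_data_keys tables (hypothetical)."""

    def __init__(self):
        self.my_data = defaultdict(list)      # (key, year, day) -> [(date, value), ...]
        self.my_data_keys = defaultdict(set)  # key -> {(year, day), ...}
        self.seen = set()                     # local cache of buckets already registered
        self.key_writes = 0                   # number of writes sent to my_data_keys

    def insert(self, key, year, day, date, value):
        bucket = (key, year, day)
        if bucket not in self.seen:           # register each bucket only once per process
            self.my_data_keys[key].add((year, day))
            self.key_writes += 1
            self.seen.add(bucket)
        self.my_data[bucket].append((date, value))

    def latest(self, key, limit=50, skip=0):
        """Walk the buckets newest-first, collecting rows until `limit` is reached."""
        rows = []
        for year, day in sorted(self.my_data_keys[key], reverse=True):
            partition = sorted(self.my_data[(key, year, day)], reverse=True)
            for row in partition:
                if skip > 0:                  # rows already shown on earlier pages
                    skip -= 1
                    continue
                rows.append(row)
                if len(rows) == limit:
                    return rows
        return rows


store = FakeStore()
for day in (1, 2, 3):
    for i in range(30):
        store.insert(b"\x01", 2015, day, day * 100 + i, f"v{day}-{i}")

first = store.latest(b"\x01", 50)             # 30 rows from day 3, then 20 from day 2
second = store.latest(b"\x01", 50, skip=50)   # the remaining 40 rows
```

The skip counter is a simplification to keep the sketch short. With a real driver you would more likely remember the last (year, day, date) returned and resume from there rather than re-reading rows you have already shown.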
Related
I'm working with a very large volume of time-series data. I have two Kafka topics:
1) Real-time time-series data from moving vehicles, sent every 5 seconds.
2) Historical time-series data from about 10% of the vehicles: when a vehicle travels through a remote area, its data is sent once it comes back into network coverage, which may be a few hours, days or a week later.
My Cassandra table looks something like this:
CREATE TABLE locationinfo (
imei text,
date text,
entrydt timestamp,
gpsdt timestamp,
lastgpsdt timestamp,
latitude text,
longitude text,
odo int,
speed int,
PRIMARY KEY ((imei, date), gpsdt)
) WITH CLUSTERING ORDER BY (gpsdt ASC)
I'm using Spark Streaming to fetch data from Kafka and insert it into Cassandra; the clustering key here is gpsdt. Whenever historical data arrives from Kafka, a lot of shuffling happens in the table, as far as I understand Cassandra's architecture: data is stored in sorted order within the partition, and historical records have to be placed in between the existing ones. As a result, after a certain period of time the Spark Streaming application hangs. After a lot of searching I concluded that there might be a problem with my table design. Suppose I create the table like this instead:
CREATE TABLE locationinfo (
imei text,
date text,
entrydt timestamp,
gpsdt timestamp,
lastgpsdt timestamp,
latitude text,
longitude text,
odo int,
speed int,
PRIMARY KEY ((imei, date), entrydt)
) WITH CLUSTERING ORDER BY (entrydt ASC)
Here the order is defined by insertion time, so whenever historical data arrives it will always be appended at the end and there will be no shuffling overhead. But in this case I won't be able to run range queries on gpsdt. So I would like to know what the best strategy is for handling this scenario. My load from Kafka is more than 2k/sec.
Currently I have a simple table as follows:
CREATE TABLE datatable (timestamp bigint, value bigint, PRIMARY KEY (timestamp))
This table only grows and is never modified. The key is a unique timestamp. All queries are range queries of the form:
SELECT * from datatable WHERE timestamp > 123456 ALLOW FILTERING
Moreover, the queries request only a small set of the most recently inserted rows. The problem I have right now is that the performance of these queries degrades with table size: as the table grows, it takes significantly longer to get a response, even if the query returns just a few rows.
Could you advise on how I should modify table schema to avoid performance degradation (e.g., create index or set clustering)?
Thanks!
Add some time bucketing, like this:
CREATE TABLE datatable (
bucket timestamp,
time timestamp,
value bigint,
PRIMARY KEY ((bucket), time)
) WITH CLUSTERING ORDER BY (time DESC);
where bucket is the date truncated to the day, week or month (you can work out which granularity to use from your approximate ingestion rate; a decent goal is about 64 MB per partition, but that's very flexible). That way you collect all the rows for a period within a single partition, which can be read very efficiently.
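As a rough illustration of sizing a bucket from the ingestion rate, here is a back-of-the-envelope calculation in Python. The figures (10 rows per second, about 50 bytes per row) are assumptions chosen for the example, and the estimate ignores compression and per-row storage overhead.

```python
def partition_bytes(rows_per_second, bytes_per_row, bucket_seconds):
    """Rough size of one bucket's partition, ignoring compression and overhead."""
    return rows_per_second * bytes_per_row * bucket_seconds


DAY = 24 * 60 * 60
TARGET = 64 * 1024 * 1024  # the ~64 MB goal, in bytes

# Assumed figures for illustration only: 10 rows/s at about 50 bytes each.
per_day = partition_bytes(10, 50, DAY)
per_week = partition_bytes(10, 50, 7 * DAY)

for name, size in (("day", per_day), ("week", per_week)):
    verdict = "within" if size <= TARGET else "over"
    print(f"{name}: {size / 1024 / 1024:.1f} MiB ({verdict} target)")
```

With these assumed numbers a daily bucket stays under the target and a weekly one does not, so truncating to the day would be the natural choice.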
Having billions of partitions per node will slow down repairs and compactions significantly. Also, partitions are ordered by token (the murmur3 hash of the partition key), which is effectively random, so you cannot get the results of a query like yours back in timestamp order.
With the schema above you can iterate from the bucket of your start time to the current bucket without ALLOW FILTERING (which you should never use outside of toy amounts of data or test environments), and the results will be in timestamp order.
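Iterating over the buckets could look something like the sketch below. It assumes daily buckets truncated to midnight UTC; the helper names day_bucket and buckets_between are made up for illustration, and the query in the comment only shows where a per-bucket read would go.

```python
from datetime import datetime, timedelta, timezone


def day_bucket(ts):
    """Truncate a timezone-aware datetime to midnight UTC of the same day."""
    ts = ts.astimezone(timezone.utc)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def buckets_between(start, end):
    """Yield every day bucket from the one containing `start` to the one containing `end`."""
    bucket = day_bucket(start)
    last = day_bucket(end)
    while bucket <= last:
        yield bucket
        bucket += timedelta(days=1)


start = datetime(2015, 5, 4, 22, 30, tzinfo=timezone.utc)
end = datetime(2015, 5, 6, 1, 0, tzinfo=timezone.utc)

for bucket in buckets_between(start, end):
    # One query per bucket would go here, for example:
    #   SELECT * FROM datatable WHERE bucket = ? AND time > ?
    print(bucket.date())
```

Each query then touches exactly one partition, and only as many partitions are read as there are buckets in the requested range.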
I'm researching how to store logs in Cassandra.
The schema for the logs would be something like this.
EDIT: I've changed the schema to clarify things.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
reason text,
item text,
price int,
count int,
PRIMARY KEY ((userid), time)  -- #1
PRIMARY KEY ((userid), time, reason, item, price, count)  -- #2
);
A new table will be created every day.
So a table contains the logs for only one day.
My query pattern is as follows.
Query all logs from a specific user on a specific day (date, not time).
So reason, item, price and count will not be used as conditions in queries at all.
My question is which PRIMARY KEY design suits this better.
EDIT: The key point here is that I want to store the logs in a schematic way.
If I choose #1, many columns would be created per log, and it is very likely that a log will end up with more values. The schema above is just an example; a log can contain values like subreason, friendid and so on.
If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any useful information about the overhead of composite columns.
Which one should I choose? Please help.
My advice is that neither of your two options seems ideal for your time series, and creating a table per day doesn't seem optimal either.
Instead I'd recommend creating a single table partitioned by userid and day, with a timeuuid as the clustering column for the event. An example would look like this:
CREATE TABLE log_per_day (
userid bigint,
date text,
time timeuuid,
value text,
PRIMARY KEY ((userid, date), time)
)
This lets you keep all of a user's events for a day in a single row and run your query per day per user.
Declaring time as a clustering column gives you a wide row into which you can insert as many events as you need in a day.
So the row key is a composite of the userid plus the date as text, e.g.
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID1,'my value')
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID2,'my value2')
The two inserts above will land in the same row, so you will be able to read them back in a single query.
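To make the grouping concrete, here is a small Python sketch of how events map onto (userid, date) partitions. It is an in-memory illustration only: the partition_key helper is hypothetical, the sample events are invented, and formatting the date in UTC is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(userid, event_time):
    """Return the (userid, date) pair used as the partition key (date in UTC)."""
    return (userid, event_time.astimezone(timezone.utc).strftime("%Y-%m-%d"))


# Invented sample events: (userid, event time, value).
events = [
    (1000, datetime(2015, 5, 6, 9, 15, tzinfo=timezone.utc), "my value"),
    (1000, datetime(2015, 5, 6, 17, 40, tzinfo=timezone.utc), "my value2"),
    (1000, datetime(2015, 5, 7, 8, 0, tzinfo=timezone.utc), "next day"),
]

partitions = defaultdict(list)
for userid, event_time, value in events:
    partitions[partition_key(userid, event_time)].append(value)

for key, values in sorted(partitions.items()):
    print(key, values)
```

The first two events share a partition because they have the same user and day, while the third starts a new one.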
Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.
Hope it helps,
José Luis
I am going to use Cassandra to store activity logs. I have something like this:
CREATE TABLE general_actionlog (
date text,
time text,
date_added timestamp,
action text,
PRIMARY KEY ((date,time),date_added)
);
I want to store all the activity in an hour in a single row (= a time series; "time" is only the hour of the day in the format H:00:00, ignoring minutes and seconds, so I have a row for each Y-m-d H:00:00).
The problem appears when two actions happen at the same timestamp (e.g. two page views in the same second): the second one overwrites the first.
How can I solve this in a way that still lets me query using slices?
Thanks
marc
You want to use timeuuid instead of timestamp for the date_added column. A timeuuid is a v1 UUID. It has a timestamp component (and is sorted by the timestamp), so it effectively provides a conflict-free timestamp.
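As a quick illustration of why a v1 UUID avoids the collision, the Python sketch below generates two of them back to back and recovers the embedded timestamp. Python's uuid.uuid1() is used here only as a stand-in for whatever timeuuid generator your driver or CQL's now() provides, and timeuuid_to_datetime is a helper written for this example.

```python
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u):
    """Extract the timestamp embedded in a version 1 UUID as a UTC datetime."""
    seconds = (u.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Two events generated back to back, well within the same second.
a = uuid.uuid1()
b = uuid.uuid1()

print(a != b)                    # the two ids are distinct
print(timeuuid_to_datetime(a))   # approximately the current time
```

Because the two values differ even when generated in the same second, both rows survive as separate clustering keys, and the embedded timestamp is still available for slicing by time.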
Prior to CQL3 one could insert arbitrary columns, such as columns named by a date:
cqlsh:test>CREATE TABLE seen_ships (day text PRIMARY KEY)
WITH comparator=timestamp AND default_validation=text;
cqlsh:test>INSERT INTO seen_ships (day, '2013-02-02 00:08:22')
VALUES ('Tuesday', 'Sunrise');
Per this post, it seems that things are different in CQL3. Is it still somehow possible to insert arbitrary columns? Here's my failed attempt:
cqlsh:test>CREATE TABLE seen_ships (
day text,
time_seen timestamp,
shipname text,
PRIMARY KEY (day, time_seen)
);
cqlsh:test>INSERT INTO seen_ships (day, 'foo') VALUES ('Tuesday', 'bar');
Here I get Bad Request: line 1:29 no viable alternative at input 'foo'
So I try a slightly different table because maybe this is a limitation of compound keys:
cqlsh:test>CREATE TABLE seen_ships ( day text PRIMARY KEY );
cqlsh:test>INSERT INTO seen_ships (day, 'foo') VALUES ('Tuesday', 'bar');
Again with the Bad Request: line 1:29 no viable alternative at input 'foo'
What am I missing here?
There's a good blog post over on the Datastax blog about this: http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
The answer is that yes, CQL3 supports dynamic columns, just not the way they worked in earlier versions of CQL. I don't really understand your example: you mix timestamps with strings in a way that I don't see working in CQL2 either. If I understand you correctly, you want to make a timeline of ship sightings, where the partition key (row key) is the day and each sighting is a time/name pair. Here's a suggestion:
CREATE TABLE ship_sightings (
day TEXT,
time TIMESTAMP,
ship TEXT,
PRIMARY KEY (day, time)
)
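And you insert entries with the statement below (now() returns a timeuuid, so it has to be converted for a TIMESTAMP column; toTimestamp() is available from Cassandra 2.2, and older versions use dateOf(now()) instead):
INSERT INTO ship_sightings (day, time, ship) VALUES ('Tuesday', toTimestamp(now()), 'Titanic');
However, you should probably use a TIMEUUID instead of a TIMESTAMP for the time column (and the day key could be a DATE), since otherwise you might add two sightings with the same timestamp and only one will survive.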
This was an example of wide rows, but then there's the issue of dynamic columns, which isn't exactly the same thing. Here's an example of that in CQL3:
CREATE TABLE ship_sightings_with_properties (
day TEXT,
time TIMEUUID,
ship TEXT,
property TEXT,
value TEXT,
PRIMARY KEY (day, time, ship, property)
)
which you can insert into like this:
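-- Multiple VALUES clauses aren't supported, so the INSERT INTO is repeated for each row.
-- Note that each NOW() call yields a new timeuuid; to attach several properties
-- to one sighting, reuse the same timeuuid value in each INSERT.
INSERT INTO ship_sightings_with_properties (day, time, ship, property, value)
VALUES ('Sunday', NOW(), 'Titanic', 'Color', 'Black');
INSERT INTO ship_sightings_with_properties (day, time, ship, property, value)
VALUES ('Sunday', NOW(), 'Titanic', 'Captain', 'Edward John Smith');
INSERT INTO ship_sightings_with_properties (day, time, ship, property, value)
VALUES ('Sunday', NOW(), 'Titanic', 'Status', 'Steaming on');
INSERT INTO ship_sightings_with_properties (day, time, ship, property, value)
VALUES ('Monday', NOW(), 'Carpathia', 'Status', 'Saving the passengers off the Titanic');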
The downside with this kind of dynamic columns is that the property names will be stored multiple times (so if you have a thousand sightings in a row and each has a property called "Captain", that string is saved a thousand times). On-disk compression takes away most of that overhead, and most of the time it's nothing to worry about.
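On the reading side, rows from a table like this usually need to be folded back into one record per sighting. The Python sketch below shows one way to do that in application code; the rows are hard-coded tuples standing in for query results, and the "t1"/"t2" time values are placeholders rather than real timeuuids.

```python
from collections import defaultdict

# Rows as they might come back from ship_sightings_with_properties:
# (day, time, ship, property, value). The time values are placeholders.
rows = [
    ("Sunday", "t1", "Titanic", "Captain", "Edward John Smith"),
    ("Sunday", "t1", "Titanic", "Color", "Black"),
    ("Sunday", "t1", "Titanic", "Status", "Steaming on"),
    ("Monday", "t2", "Carpathia", "Status", "Saving the passengers off the Titanic"),
]

# Fold the property/value rows into one dict per (day, time, ship) sighting.
sightings = defaultdict(dict)
for day, time, ship, prop, value in rows:
    sightings[(day, time, ship)][prop] = value

for key, props in sightings.items():
    print(key, props)
```

Each sighting ends up with its own dictionary of properties, and a sighting can carry any set of property names without a schema change.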
Finally, a note about collections in CQL3. They're a useful feature, but they are not a way to implement wide rows or dynamic columns. First of all, they have a limit of 65536 items, but Cassandra can't enforce this limit, so if you add too many elements you might not be able to read them back later. Collections are mostly for small multi-valued fields; the canonical example is an address book where each row is an entry, and each entry has a single name but multiple phone numbers, email addresses, etc.
A collection is not a truly dynamic column, but most of the time you can get away with one: using a map column, you can store some dynamic data.