I am going to use Cassandra to store activity logs. I have something like this:
CREATE TABLE general_actionlog (
date text,
time text,
date_added timestamp,
action text,
PRIMARY KEY ((date,time),date_added)
);
I want to store all the activity in an hour in a single row (i.e. a time series). "time" is only the hour of the day in the format H:00:00, ignoring minutes and seconds, so I have a row for each Y-m-d H:00:00.
The problem appears when two actions happen at the same timestamp (e.g. two page views in the same second): the second one overwrites the first one.
How can I solve this in a way that I still can query using slices?
Thanks
marc
You want to use timeuuid instead of timestamp for the date_added column. A timeuuid is a v1 UUID. It has a timestamp component (and is sorted by the timestamp), so it effectively provides a conflict-free timestamp.
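For example, a minimal sketch of the same table with date_added as a timeuuid (the insert uses the built-in now() function to generate a v1 UUID server-side; values are illustrative):

CREATE TABLE general_actionlog (
date text,
time text,
date_added timeuuid,
action text,
PRIMARY KEY ((date, time), date_added)
);

INSERT INTO general_actionlog (date, time, date_added, action)
VALUES ('2013-04-03', '07:00:00', now(), 'page view');

Slice queries still work; you can turn timestamps into timeuuid bounds with minTimeuuid()/maxTimeuuid(), e.g.:

SELECT action FROM general_actionlog
WHERE date = '2013-04-03' AND time = '07:00:00'
AND date_added >= minTimeuuid('2013-04-03 07:15:00')
AND date_added < maxTimeuuid('2013-04-03 07:30:00');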
Obviously, when dealing with time-series data that relates to some natural partition key like a sensor id, that key can be used as the partition key. But what should we do if we are interested in a global view and there is no natural candidate for the partition key? Say we model the schema like this:
CREATE TABLE my_data (
year smallint,
day smallint,
date timestamp,
value text,
PRIMARY KEY ((year, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
It is (probably) going to work just fine for most cases, provided we know which year and days to fetch.
What if we don't care what day it is, but just want the 50 most recent items? What if we then want the next 50 items? Is there a way to do this in Cassandra? What is the recommended way of doing it?
Keep a second table of the year/days. When reading, query it first. When adding to my_data, update that table as well, but keep a cache of days already inserted so each app only tries the insert once per day. For example, add an extra key so you can have multiple streams rather than a single table per time series:
CREATE TABLE my_data (
key blob,
year smallint,
day smallint,
date timestamp,
value text,
PRIMARY KEY ((key, year, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
CREATE TABLE my_data_keys (
key blob,
year smallint,
day smallint,
PRIMARY KEY ((key), year, day)
);
For inserts:
INSERT INTO my_data_keys (key, year, day) VALUES (0x01, 1, 2);
INSERT INTO my_data ...
Then keep an in-memory Set somewhere recording that you have stored that key/year/day, so you don't need to insert it every time. To read the most recent:
SELECT year, day FROM my_data_keys WHERE key = 0x01;
The driver returns an iterator; for each element in it, query my_data until you have 50 records.
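For example, each of those per-day reads might look like the following (the key/year/day values are just placeholders; since the clustering order is DESC, the most recent rows come back first):

SELECT date, value FROM my_data WHERE key = 0x01 AND year = 2017 AND day = 2 LIMIT 50;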
If inserts are frequent enough, you can just work backwards from "today", issuing queries until you get 50 events. If the data is sparse, though, that can be a lot of wasted reads, and the extra table works better.
How to get the last 6 months of data by comparing with a timestamp column using a Cassandra query?
I need to get all account statements belonging to the last 3/6 months by comparing updatedTime (a timestamp column) with the current time.
For example, in SQL we would use the DateAdd() function for this. I don't know how to do this in Cassandra.
If anyone knows, please reply. Thanks in advance.
Cassandra 2.2 and later allows users to define functions (UDFs) that can be applied to data stored in a table as part of a query result.
You can create your own function if you use Cassandra 2.2 or later:
CREATE FUNCTION monthadd(date timestamp, month int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$
java.util.Calendar c = java.util.Calendar.getInstance();
c.setTime(date);
c.add(java.util.Calendar.MONTH, month);
return c.getTime();
$$;
This function takes two parameters:
date timestamp: the date to which you want to add or from which you want to subtract a number of months
month int: the number of months you want to add (+) or subtract (-) from date
It returns a timestamp.
Here is how you can use this :
SELECT * FROM ttest WHERE id = 1 AND updated_time >= monthAdd(dateof(now()), -6) ;
Here the monthAdd function subtracts 6 months from the current timestamp, so this query will return the data of the last 6 months.
Note: by default, user-defined functions are disabled in cassandra.yaml; set enable_user_defined_functions=true to enable them if you are aware of the security risks.
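In cassandra.yaml that setting is:

enable_user_defined_functions: true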
In Cassandra you have to design your tables around the queries you will run, up front.
Also be aware that you will probably have to bucket the data depending on the number of accounts that you have within some period of time.
If your whole database doesn't contain more than, let's say, 100k entries, you are fine with just defining a single generic partition, say with the name 'all'. But usually people have a lot of data, which goes into buckets named after a month, week, or hour. This depends on the number of inserts you get.
The reason for creating buckets is that every node can find a partition by its partition key. This is the first part of the primary key definition. Then, on every node, the data is sorted by the second part of the primary key. Having the data sorted enables you to "scan" over it, i.e. you will be able to retrieve rows by passing in a timestamp parameter.
Let's say you want to retrieve accounts from the last 6 months and that you are saving all the accounts from one month in the same bucket.
The schema might be something on the lines of:
create table accounts (
month text,
created_time timestamp,
account text,
PRIMARY KEY (month, created_time)
);
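Writes then go into the month bucket, for example (the values here are made up):

insert into accounts (month, created_time, account) values ('201701', '2017-01-15 10:23:00', 'account-42');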
Usually you will do this at the application level; merging queries is an anti-pattern, but it's OK for a small number of queries:
select account
from accounts
where month = '201701';
Then repeat the query for '201702', '201703', and so on, merging the results in your application.
If you have something really simple with, let's say, an expected 100,000 entries, then you could use a single predefined bucket and just do something like:
create table accounts (
bucket text,
created_time timestamp,
account text,
PRIMARY KEY (bucket, created_time)
);
select account
from accounts
where bucket = 'some_predefined_name'
and created_time > '2016-10-04 00:00:00';
Once more, as a wrap-up: with Cassandra you always have to prepare the structures for the access pattern you are going to use.
I am trying to store & retrieve data in cassandra in the following way:
Storing Data:
I created the table in the following way:
CREATE TABLE mydata (
myKey TEXT,
datetime TIMESTAMP,
value TEXT,
PRIMARY KEY (myKey,datetime)
);
Where I would store a value for every minute for the last 5 years. So it stores 1440 * 365 * 5 = 2,628,000 records/columns per row (myKey as the row key).
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:01:00','72F');
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:02:00','72F');
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:03:00','72F');
.................
I am able to store data and all is fine. However, I would like to know whether this is an efficient way of storing data horizontally (2,628,000 values for each key, for 1 million such keys altogether)?
Retrieving Data:
After storing the data in above format, i am able to select data by using a simple select query for a period.
Ex:
SELECT *
FROM mydata
WHERE myKey='1234ABCD' AND datetime > '2013-04-03 07:01:00' AND datetime < '2013-04-03 07:04:00';
The query works fine and I get the result as expected.
However my question is:
How can I select only those values at certain intervals? For example, if I query data for a day, I would get 1440 values (1 for every minute). I would like to get values at every 10-minute interval (the value at every 10th minute), limiting the number of values to 144.
Is there a way to query the table if we use the above storage strategy?
If not, what are the possible options to meet my requirement of querying data at a specific interval like 1 min, 10 min, 1 hour, 1 day, etc.?
Appreciate any other suggestions.
No, it is not right; in the future you will face problems, because per row key we can only store 2 billion records/columns. Beyond that limit it will not give an error, but the data will not be stored correctly.
For your problem, split the timestamp column into year, month, day and time, e.g. 2016, 04, 04 and 15:03:00. Also put year, month and day into the partition key.
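A sketch of a schema following that suggestion (the table name and types are illustrative; smallint needs Cassandra 2.2+) might be:

CREATE TABLE mydata_bucketed (
myKey text,
year smallint,
month smallint,
day smallint,
datetime timestamp,
value text,
PRIMARY KEY ((myKey, year, month, day), datetime)
);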
You definitely need to bound your partitions with a bucketed version of the timestamp. But the granularity really depends on your reads.
If you are mainly going to read per day then use something like this PK((myKey, yyyymmdd), time)
If mainly by weeks PK((mykey, yyyyww), time), or month...
The problem then is that if you want to read values for a whole year, you'd better have a partition per week or month (even a partition per year would do, I think, if you don't do any deletes); your partition size needs to stay smaller than about 100MB.
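For instance, the per-day variant could look like this (a sketch; yyyymmdd is a text bucket such as '20161004', and the names are illustrative):

CREATE TABLE mydata_by_day (
myKey text,
yyyymmdd text,
datetime timestamp,
value text,
PRIMARY KEY ((myKey, yyyymmdd), datetime)
) WITH CLUSTERING ORDER BY (datetime DESC);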
I'm doing research on storing logs in Cassandra.
The schema for logs would be something like this.
EDIT: I've changed the schema in order to make some clarification.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
reason text,
item text,
price int,
count int,
PRIMARY KEY ((userid), time) - #1
PRIMARY KEY ((userid), time, reason, item, price, count) - #2
);
A new table will be created each day, so a table contains logs for only one day.
My querying condition is as follows.
Query all logs from a specific user on a specific day(date not time).
So the reason, item, price, count will not be used as hints or conditions for queries at all.
My Question is which PRIMARY KEY design suits better.
EDIT: And the key here is that I want to store the logs in a schematic way.
If I choose #1, many columns would be created per log, and the possibility of having more values per log is very high. The schema above is just an example; the log can contain values like subreason, friendid and so on.
If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any valuable information about the overhead of composite columns.
Which one should I choose? Please help.
My advice is that neither of your two options seems ideal for your time series; the fact that you're creating a table per day doesn't seem optimal either.
Instead, I'd recommend creating a single table, partitioning by userid and date, and using a timeuuid as the clustering column for the event. An example of this would look like:
CREATE TABLE log_per_day (
userid bigint,
date text,
time timeuuid,
value text,
PRIMARY KEY ((userid, date), time)
);
This will allow you to have all of a day's events in a single row and lets you run your per-user, per-day query.
Declaring time as the clustering column gives you a wide row where you can insert as many events as you need in a day.
So the row key is a composite of the userid plus the date as text, e.g.
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID1,'my value')
insert into log_per_day (userid, date, time, value) values (1000,'2015-05-06',aTimeUUID2,'my value2')
The two inserts above will be in the same row and therefore you will be able to read in a single query.
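That per-user, per-day read is then a single-partition query, for example:

SELECT time, value FROM log_per_day WHERE userid = 1000 AND date = '2015-05-06';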
Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.
Hope it helps,
José Luis
My cassandra data model:
CREATE TABLE last_activity_tracker ( id uuid, recent_activity_time timestamp, PRIMARY KEY(id));
CREATE INDEX activity_idx ON last_activity_tracker (recent_activity_time) ;
The idea is to keep track of 'id's and their most recent activity of an event.
I need to find the 'id's whose last activity was a year ago.
So, I tried:
SELECT * from last_activity_tracker WHERE recent_activity_time < '2013-12-31' allow filtering;
I understand that I cannot use anything other than '=' on secondary-indexed columns.
However, I cannot add 'recent_activity_time' to the key as I need to update this column with the most recent activity time of an event if any.
Any ideas in solving my problem are highly appreciated.
I can see an issue with your query. You're not hitting a partition. As such, the performance of your query will be quite bad. It'll need to query across your whole cluster (assuming you took measures to make this work).
If you're looking to query the last activity time for an id, think about storing it in a more query friendly format. You might try this:
create table tracker (dummy int, day timestamp, id uuid, primary key(dummy, day, id));
You can then insert with the day to be the epoch for the date (ignoring the time), and dummy = 0.
That should enable you to do:
select * from tracker where dummy=0 and day > '2013-12-31';
You can set a ttl on insert so that old entries expire (maybe after a year in this case). The idea is that you're storing information in a way that suits your query.
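A sketch of such an insert with a one-year TTL (31,536,000 seconds; the uuid() function and the values are illustrative, and you can just as well generate the UUID client-side):

INSERT INTO tracker (dummy, day, id) VALUES (0, '2013-06-15', uuid()) USING TTL 31536000;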