Cassandra export/forward data only once

I have a requirement to forward data from my system to an external system at certain intervals. To do this, I already store all rows in a table. Data that has already been forwarded should not be exported again.
The idea is to remember the last export time on the client side and export only the following records the next time. Old rows are deleted after a successful export.
CREATE TABLE export(
id int,
import_date_time timestamp,
data text,
PRIMARY KEY (id, import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC);
insert into export(id, import_date_time, data) values (1, toTimestamp(now()), 'content');
select * from export where id = 1 and import_date_time > '2017-03-30 16:22:37';
delete from export where id = 1 and import_date_time <= '2017-03-30 16:22:37';
Has anyone implemented something similar, or do you have a different solution?
If possible, I would like to avoid needing an id in the query, because I want to export all data.

If you use a fixed partition key value (id = 1), then all inserts, selects, and deletes happen over and over on the same partition (and, with RF=1, on the same node). Also, for every delete Cassandra creates a tombstone entry, and when you execute a select query Cassandra has to merge past each of those entries, so your select performance will degrade.
So instead of a fixed value, use a dynamic partition key like the one below:
CREATE TABLE export(
hour int,
day int,
month int,
year int,
import_date_time timestamp,
data text,
PRIMARY KEY ((hour, day, month, year), import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC);
Here you insert the values of hour, day, month, and year extracted from import_date_time, as in the sketch below.
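A minimal example of such an insert, reusing the import_date_time '2017-03-30 16:22:37' from the question (the client extracts the bucket values before writing):
INSERT INTO export (hour, day, month, year, import_date_time, data) VALUES (16, 30, 3, 2017, '2017-03-30 16:22:37', 'content');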
You need to take care of two cases when selecting data:
Previous export time and current export time are both within the same hour.
The two times are not within the same hour.
For case one you need only one query; for case two you have to execute two queries (see the sketch after the example below).
Example query:
SELECT * FROM export WHERE hour = 16 AND day = 30 AND month = 3 AND year = 2017 AND import_date_time > '2017-03-30 16:22:37';
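For case two, a sketch of the two queries, assuming the previous export ran at 15:50 and the current one runs during the 16:00 hour (hypothetical times for illustration):
-- remainder of the previous hour's bucket:
SELECT * FROM export WHERE hour = 15 AND day = 30 AND month = 3 AND year = 2017 AND import_date_time > '2017-03-30 15:50:00';
-- everything so far in the current hour's bucket:
SELECT * FROM export WHERE hour = 16 AND day = 30 AND month = 3 AND year = 2017;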

Related

Retrieve rows from last 24 hours

I have a table with the following (with other fields removed):
CREATE TABLE IF NOT EXISTS request_audit (
user_id text,
request_body text,
lookup_timestamp timestamp,
PRIMARY KEY ((user_id), lookup_timestamp)
) WITH CLUSTERING ORDER BY (lookup_timestamp DESC);
I create a record with the following:
INSERT INTO request_audit (user_id, lookup_timestamp, request_body) VALUES (?, toTimestamp(now()), ?)
I am trying to retrieve all rows within the last 24 hours, but I am having trouble with the timestamp.
I have tried
SELECT * from request_audit WHERE user_id = '1234' AND lookup_timestamp > toTimestamp(now() - "1 day" )
and various other ways of trying to take a day away from the query.
Cassandra has very limited support for date operations. What you need is a custom function to do the date math, inspired by this answer:
How to get Last 6 Month data comparing with timestamp column using cassandra query?
You can write a UDF (user-defined function) to do the date operation:
CREATE FUNCTION dateAdd(date timestamp, day int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$
    java.util.Calendar c = java.util.Calendar.getInstance();
    c.setTime(date);
    c.add(java.util.Calendar.DAY_OF_MONTH, day);
    return c.getTime();
$$;
Remember that you have to enable UDFs in cassandra.yaml:
enable_user_defined_functions: true
Once that is done, this query works:
SELECT * from request_audit WHERE user_id = '1234' AND lookup_timestamp > dateAdd(dateof(now()), -1)
You can't do it directly from CQL, as it doesn't support this kind of expression. If you're running this query from cqlsh, then you can try to substitute the desired date with something like this:
date --date='-1 day' '+%F %T%z'
and execute this query.
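For example, if the shell command prints 2017-03-29 16:22:37+0000 (a made-up output), the substituted query becomes:
SELECT * from request_audit WHERE user_id = '1234' AND lookup_timestamp > '2017-03-29 16:22:37+0000'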
If you're invoking this from your program, just use the corresponding date/time library to get the date for -1 day, but this depends on the language that you're using.

How to determine time stamps for Cassandra queries

One of the values inserted into the table is the current time, which I compute using toTimestamp(now()). Now I want to compute the current time minus 90 days and the current time minus 15 days.
My question is: how do I compute current time minus the nth day?
Query for the current timestamp:
INSERT INTO TABLE_NAME (col_1, col_2, col_3) VALUES ('val_1', toTimestamp(now()), val_3);
In the above query, the value for col_2 is the current timestamp. The current timestamp is determined by
toTimestamp(now())
How do I compute current time - 90 days, or current time - 2 weeks?
This functionality is not built into CQL.
If you are able to use UDFs, you can do the following (building on the example given here: How to get Last 6 Month data comparing with timestamp column using cassandra query?):
Enable UDFs as needed by adding or changing this line to true in cassandra.yaml:
enable_user_defined_functions: true
Then add two user-defined functions like this:
CREATE FUNCTION dateadd(date timestamp, daydiff int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$
    java.util.Calendar c = java.util.Calendar.getInstance();
    c.setTime(date);
    c.add(java.util.Calendar.DATE, daydiff);
    return c.getTime();
$$;
CREATE FUNCTION weekadd(date timestamp, weekdiff int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$
    java.util.Calendar c = java.util.Calendar.getInstance();
    c.setTime(date);
    c.add(java.util.Calendar.DATE, weekdiff * 7);
    return c.getTime();
$$;
Select the data from your table like this:
select dateadd(col_2,-90) from TABLE_NAME;
select weekadd(col_2,-2) from TABLE_NAME;
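Note that these selects return a shifted value for every row. If you instead want to filter rows by a computed cutoff, the UDF can also be used on the right-hand side of a WHERE relation, as in the previous answer; a sketch assuming col_1 is the partition key and col_2 a clustering column (the question doesn't say):
SELECT * FROM TABLE_NAME WHERE col_1 = 'val_1' AND col_2 > dateadd(toTimestamp(now()), -90);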

Storing time specific data in cassandra

I am looking for a good way to store time-specific data in Cassandra.
Each entry looks like (start_time, value). Later, I would like to retrieve the current value.
The logic for retrieving the current value is as follows:
Find all rows with start_time <= current_time.
Then find the value with the maximum start_time among the rows obtained in the first step.
PS: Edited the question to make it clearer.
The exact requirement is not directly achievable, but we can get close to it with one more column.
First, to be able to use the <= operator, your start_time column needs to be the clustering key of your table.
Then you need a partition key. You could choose a fixed value, but that brings problems once the partition holds too many rows, so you are better off using something like the year or the month of start_time:
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you need to know the value of the partition key.
Find all rows with start_time <= current_time:
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
Select the value with the maximum start_time:
SELECT * FROM time_specific_table
WHERE year = :year LIMIT 1;
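On insert, the client duplicates the year of start_time into the partition key; a minimal sketch with made-up values:
INSERT INTO time_specific_table (year, start_time, value) VALUES (2017, '2017-03-30 16:22:37', 'some value');
Near a year boundary you may also have to repeat the queries with year - 1.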
Alternatively, create two separate tables like the ones below:
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now you have to insert data into both tables; to insert into the second table, use a fixed partition key value like 1:
INSERT INTO current_value (partition, value) VALUES (1, 10);
In the current_value table your data will be upserted, so you will always get the latest value whenever you select.
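A sketch of the paired writes and the read, using the fixed partition value 1 from above:
-- every write goes to both tables:
INSERT INTO data (start_time, value) VALUES ('2017-03-30 16:22:37', 10);
INSERT INTO current_value (partition, value) VALUES (1, 10);
-- the current value is then a single-row lookup:
SELECT value FROM current_value WHERE partition = 1;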

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, such that SELECT date FROM myCF; would return the most recently inserted date.
Due to the order of the clustering columns, what I get is an order per user_id, then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date cannot be restricted unless user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
) WITH CLUSTERING ORDER BY (user_id RANDOMPLEASE, date DESC);
But it doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any way of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id, so it should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
Then you need an additional table created to handle those queries; it would order only on date and not on user_id:
CREATE TABLE LookupByDate (... PRIMARY KEY ((website_id, item_id), date));
In addition to your primary query, if all you're trying to get is "return the most recent inserted date", you may not need an additional table. You can use a static column to store the last update time per partition. See CASSANDRA-6561.
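A minimal sketch of the static-column approach (the column name last_date is made up for illustration):
ALTER TABLE myCF ADD last_date timestamp STATIC;
-- refresh the per-partition value on every write:
UPDATE myCF SET last_date = toTimestamp(now()) WHERE website_id = 30 AND item_id = 10;
-- read it back with a single-partition query:
SELECT last_date FROM myCF WHERE website_id = 30 AND item_id = 10 LIMIT 1;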
It probably won't help your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values, then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key, because this creates an inefficient query that hits multiple nodes and puts stress on the coordinator node; instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.

Select 2000 most recent log entries in cassandra table using CQL (Latest version)

How do you query and filter by timeuuid? I.e., assuming you have a table with
create table mystuff(uuid timeuuid primary key, stuff text);
how do you do:
select uuid, unixTimestampOf(uuid), stuff
from mystuff
order by uuid desc
limit 2000
I also want to be able to fetch the next older 2000 and so on, but that's a different problem. The error is:
Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
and just in case it matters, the real table is actually this:
CREATE TABLE audit_event (
uuid timeuuid PRIMARY KEY,
event_time bigint,
ip text,
level text,
message text,
person_uuid timeuuid
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
I would recommend that you design your table a bit differently. It would be rather hard to achieve what you're asking for with your current design.
At the moment each of your entries in the audit_event table receives its own uuid, so internally Cassandra creates many short rows. Querying such rows is inefficient, and additionally they are ordered randomly (unless you use the ByteOrderedPartitioner, which you should avoid for good reasons).
However, Cassandra is pretty good at sorting columns. If (back to your example) you declared your table like this:
CREATE TABLE mystuff(
yymmddhh varchar,
created timeuuid,
stuff text,
PRIMARY KEY(yymmddhh, created)
);
Cassandra internally creates a row whose key is the hour of the day, whose column names are the actual created timestamps, and whose data is the stuff. That makes it efficient to query.
Suppose you have the following data (to keep it short I won't go to 2k records, but the idea is the same):
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '90');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '91');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '92');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '93');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '94');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '95');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '96');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '97');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '98');
Now let's say that we want to select the last two entries (assume for the moment that we know the "latest" row key is '13081616'). You can do it by executing a query like this:
SELECT * FROM mystuff WHERE yymmddhh = '13081616' ORDER BY created DESC LIMIT 2 ;
which should give you something like this:
yymmddhh | created | stuff
----------+--------------------------------------+-------
13081616 | 547fe280-067e-11e3-8751-97db6b0653ce | 98
13081616 | 547f4640-067e-11e3-8751-97db6b0653ce | 97
To get the next 2 rows you have to take the last value from the created column and use it in the next query:
SELECT * FROM mystuff WHERE yymmddhh = '13081616'
AND created < 547f4640-067e-11e3-8751-97db6b0653ce
ORDER BY created DESC LIMIT 2 ;
If you receive fewer rows than expected, you should change your row key to the previous hour, as sketched below.
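For the data above, stepping back one hour bucket looks like this ('13081615' is the previous hour key from the example inserts):
SELECT * FROM mystuff WHERE yymmddhh = '13081615' ORDER BY created DESC LIMIT 2;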
Row key handling / calculation
So far I've assumed that we know the row key with which we want to query the data. If you log a lot of information, I'd say that's not a problem: you can take the current time and issue a query with the hour set to the hour we have now. If we run out of rows, we can subtract one hour and issue another query.
However, if you don't know where your data lies, or if it's not distributed evenly, you can create a metadata table where you store information about the row keys:
CREATE TABLE mystuff_metadata(
yyyy varchar,
yymmddhh varchar,
PRIMARY KEY(yyyy, yymmddhh)
) WITH COMPACT STORAGE;
The row keys are organized by year, so to get the latest row key from the current year you'd issue this query:
SELECT yymmddhh
FROM mystuff_metadata where yyyy = '2013'
ORDER BY yymmddhh DESC LIMIT 1;
Your audit software would have to make an entry in that table at startup and then on each hour change (for example, just before inserting data into mystuff).
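That bookkeeping write is a plain upsert; a sketch matching the example inserts above:
INSERT INTO mystuff_metadata (yyyy, yymmddhh) VALUES ('2013', '13081616');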
