SELECT range from Map in Cassandra - cassandra

I have table in Cassandra and I wish to select 10 last blob from usermgmt.user_history.history
CREATE TABLE usermgmt.user_history (
id uuid,
history Map<timeuuid, blob>,
PRIMARY KEY(id)
);
I feel like it was easy with 5 year old Cassandra design with ordered typed column names. But now I can't find method to select range of 10 last entries in recent Cassandra 3.0

How about this:
CREATE TABLE usermgmt.user_history (
id uuid,
history_time timestamp,
history_blob blob,
PRIMARY KEY(id, history_time)
);
Then,
SELECT * FROM usermgmt.user_history WHERE id = your-uuid ORDER BY history_time limit 10;

Related

Not able to run multiple where clause without Cassandra allow filtering

Hi I am new to Cassandra.
We are working on IOT project where car sensor data will be stored in cassandra.
Here is the example of one table where I am going to store one of the sensor data.
This is some sample data.
The way I want to partition the data is based on the organization_id so that different organization data is partitioned.
Here is the create table command:
CREATE TABLE IF NOT EXISTS engine_speed (
id UUID,
engine_speed_rpm text,
position int,
vin_number text,
last_updated timestamp,
organization_id int,
odometer int,
PRIMARY KEY ((id, organization_id), vin_number)
);
This works fine. However all my queries will be as bellow:
select * from engine_speed
where vin_number='xyz'
and organization_id = 1
and last_updated >='from time stamp' and last_updated <='to timestamp'
Almost all queries in all the table will have similar / same where clause.
I am getting error and it is asking to add "Allow filtering".
Kindly let me know how do I partition the table and define right primary key and indexs so that I don't have to add "allow filtering" in the query.
Apologies for this basic question but I'm just starting using cassandra.(using apache cassandra:3.11.12 )
The order of where clause should match with the order of partition and clustering keys you have defined in your DDL and you cannot skip any part of primary key while applying the WHERE clause before using the next key. So as per the query pattern u have defined, you can try the below DDL:
CREATE TABLE IF NOT EXISTS autonostix360.engine_speed (
vin_number text,
organization_id int,
last_updated timestamp,
id UUID,
engine_speed_rpm text,
position int,
odometer int,
PRIMARY KEY ((vin_number, organization_id), last_updated)
);
But remember,
PRIMARY KEY ((vin_number, organization_id), last_updated)
PRIMARY KEY ((vin_number), organization_id, last_updated)
above two are different in Cassandra, In case 1 your data will be partitioned by combination of vin_number and organization_id while last_updated will act as ordering key. In case 2, your data will be partitioned only by vin_number while organization_id and last_updated will act as ordering key. So you need to figure out which case suits your use case.

Update column value in Cassandra table if value exists

I have a Cassandra table as below
CREATE TABLE inventory(
prodid varchar,
loc varchar,
qty float,
PRIMARY KEY (prodid)
) ;
Requirement :
For the provided primary key, if no record exists in table, we need to insert, which is straight forward. but when the record exists for the primary key, then we need to update the qty column by adding the existing value in the table with new values received.
As per my understanding, I need to query the table first for the provided primary key and get the value of the qty column and add with new value received from the request and execute the update query with light weight transaction.
Ex: table has say qty 10 for the prodid=1 and if I receive from user new qty as 2 (which is delta), then I need to update qty as 12 for the prodid=1.
Is that logic is correct? or any better way to design the table or handle the use case? Will this approach introduce latency issue during the load as we need to do select query first and if data exists update the column value with new value ? Please help.
You can change the qty column to static. This way you do not have to update the table but Insert. Updates are resource intensive so cassandra treats UPDATE statement as insert statement. So, your table definition should be -
CREATE TABLE inventory(
prodid varchar,
loc varchar,
qty float static,
PRIMARY KEY (prodid) ) ;
So you can use your business logic to calculate the new value of QTY column and use INSERT statement, which intern update the same column.
Other way is to use counter column -
CREATE TABLE inventory(
prodid varchar,
loc varchar,
qty counter,
PRIMARY KEY (prodid, loc ) ) ;
Which this design you can just use update query like below -
update inventory set qty = qty + <calculated Quantity> where prodid = 1;
Notice that, in second table design, all other columns have to the part of primary key. In your case, it is easy and convenient.

Cassandra data model with obsolete data removal possibility

I'm new to cassandra and would like to ask what would be correct model design pattern for such tasks.
I would like to model data with future removal possibility.
I have 100,000,000 records per day of this structure:
transaction_id <- this is unique
transaction_time
transaction_type
user_name
... some other information
I will need to fetch data by user_name (I have about 5,000,000 users).
Also I will need to find transaction details by its id.
All the data will be irrelevant after say about 30 days, so need to find a way to delete outdated rows.
As much I have found, TTL-s expire column values, not rows.
So far I came across with this model, and as I understand it will imply really wide rows:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transactiom
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY ((transaction_date, user_name), transaction_id)
);
CREATE INDEX idx_user_transactions_uname ON USER_TRANSACTIONS(user_name);
CREATE INDEX idx_user_transactions_tid ON USER_TRANSACTIONS(transaction_id);
but this model does not allow deletions by transaction_date.
this also builds indexes with high cardinality, what cassandra docs strongly discourages
So what will be the correct model for this task?
EDIT:
Ugly workaround I came with so far is to create single table per date partition. Mind you, I call this workaround and not a solution. I'm still looking for right data model
CREATE TABLE user_transactions_YYYYMMDD (
user_name text,
transaction_id text,
transaction_time timestamp,
transaction_type int,
PRIMARY KEY (user_name)
);
YYYYMMDD is date part of transaction. we can create similar table with transaction_id for transaction lookup. obsolete tables can be dropped or truncated.
Maybe you should denormalized your data model. For example to query by user_name you can use a cf like this:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transactiom
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (user_name, transaction_id)
);
So you can query using the partition key directly like this:
SELECT * FROM user_transactions WHERE user_name = 'USER_NAME';
And for the id you can use a cf like this:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transactiom
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (transaction_id)
);
so the query could be something like this:
SELECT * FROM user_transactions WHERE transaction_id = 'ID';
By this way you dont need indexes.
About the TTL, maybe you could programatically ensure that you update all the columns in the row at the same time (same cql sentence).
Perhaps my answer will be a little useful.
I would have done so:
CREATE TABLE user_transactions (
date timestamp,
user_name text,
id text,
type int,
PRIMARY KEY (id)
);
CREATE INDEX idx_user_transactions_uname ON user_transactions (user_name);
No need in 'transaction_time timestamp', because this time will be set by Cassandra to each column, and can be fetched by WRITETIME(column name) function. Because you write all the columns simultaneously, then you can call this function on any column.
INSERT INTO user_transactions ... USING TTL 86400;
will expire all columns simultaneously. So do not worry about deleting rows. See here: Expiring columns.
But as far as I know, you can not delete an entire row - key column still remains, and in the other columns will be written NULL.
If you want to delete the rows manually, or just want to have an estimate of rows to be deleted by a TTL, then I recommend driver Astyanax: AllRowsReader All rows query.
And indeed as a driver to work with Cassandra I recommend you use Astyanax.

Choosing the right schema for cassandra "table" in CQL3

We are trying to store lots of attributes for a particular profile_id inside a table (using CQL3) and cannot wrap our heads around which approach is the best:
a. create table mytable (profile_id, a1 int, a2 int, a3 int, a4 int ... a3000 int) primary key (profile_id);
OR
b. create MANY tables, eg.
create table mytable_a1(profile_id, value int) primary key (profile_id);
create table mytable_a2(profile_id, value int) primary key (profile_id);
...
create table mytable_a3000(profile_id, value int) primary key (profile_id);
OR
c. create table mytable (profile_id, a_all text) primary key (profile_id);
and just store 3000 "columns" inside a_all, like:
insert into mytable (profile_id, a_all) values (1, "a1:1,a2:5,a3:55, .... a3000:5");
OR
d. none of the above
The type of query we would be running on this table:
select * from mytable where profile_id in (1,2,3,4,5423,44)
We tried the first approach and the queries keep timing out and sometimes even kill cassandra nodes.
The answer would be to use a clustering column. A clustering column allows you to create dynamic columns that you could use to hold the attribute name (col name) and it's value (col value).
The table would be
create table mytable (
profile_id text,
attr_name text,
attr_value int,
PRIMARY KEY(profile_id, attr_name)
)
This allows you to add inserts like
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a1', 3);
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a2', 1031);
.....
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'an', 2);
This would be the optimal solution.
Because you then want to do the following
'The type of query we would be running on this table: select * from mytable where profile_id in (1,2,3,4,5423,44)'
This would require 6 queries under the hood but cassandra should be able to do this in no time especially if you have a multi node cluster.
Also if you use the DataStax Java Driver you can run this requests asynchronously and concurrently on your cluster.
For more on data modelling and the DataStax Java Driver check out DataStax's free online training. Its worth a look
http://www.datastax.com/what-we-offer/products-services/training/virtual-training
Hope it helps.

Select 2000 most recent log entries in cassandra table using CQL (Latest version)

How do you query and filter by timeuuid, ie assuming you have a table with
create table mystuff(uuid timeuuid primary key, stuff text);
ie how do you do:
select uuid, unixTimestampOf(uuid), stuff
from mystuff
order by uuid desc
limit 2000
I also want to be able to fetch the next older 2000 and so on, but thats a different problem. The error is:
Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
and just in case it matters, the real table is actually this:
CREATE TABLE audit_event (
uuid timeuuid PRIMARY KEY,
event_time bigint,
ip text,
level text,
message text,
person_uuid timeuuid
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
I would recommend that you design your table a bit differently. It would be rather hard to achieve what you're asking for with the design you have currently.
At the moment each of your entries in the audit_event table will receive another uuid, internally Cassandra will create many short rows. Querying for such rows is inefficient, and additionally they are ordered randomly (unless using Byte Ordered Partitioner, which you should avoid for good reasons).
However Cassandra is pretty good at sorting columns. If (back to your example) you declared your table like this :
CREATE TABLE mystuff(
yymmddhh varchar,
created timeuuid,
stuff text,
PRIMARY KEY(yymmddhh, created)
);
Cassandra internally would create a row, where the key would be the hour of a day, column names would be the actual created timestamp and data would be the stuff. That would make it efficient to query.
Consider you have following data (to make it easier I won't go to 2k records, but the idea is the same):
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '90');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '91');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '92');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '93');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '94');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '95');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '96');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '97');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '98');
Now lets say that we want to select last two entries (let's a assume for the moment that we know that the "latest" row key to be '13081616'), you can do it by executing query like this:
SELECT * FROM mystuff WHERE yymmddhh = '13081616' ORDER BY created DESC LIMIT 2 ;
which should give you something like this:
yymmddhh | created | stuff
----------+--------------------------------------+-------
13081616 | 547fe280-067e-11e3-8751-97db6b0653ce | 98
13081616 | 547f4640-067e-11e3-8751-97db6b0653ce | 97
to get next 2 rows you have to take the last value from the created column and use it for the next query:
SELECT * FROM mystuff WHERE yymmddhh = '13081616'
AND created < 547f4640-067e-11e3-8751-97db6b0653ce
ORDER BY created DESC LIMIT 2 ;
If you received less rows than expected you should change your row key to another hour.
Row key handling / calculation
For now I've assumed that we know the row key with which we want to query the data. If you log a lot of information I'd say that's not the problem - you can take just current time and issue a query with the hour set to what hour we have now. If we run out of rows we can subtract one hour and issue another query.
However if you don't know where your data lies, or if it's not distributed evenly, you can create metadata table, where you'd store the information about the row keys:
CREATE TABLE mystuff_metadata(
yyyy varchar,
yymmddhh varchar,
PRIMARY KEY(yyyy, yymmddhh)
) WITH COMPACT STORAGE;
The row keys would be organized by a year, so to get the latest row key from the current year you'd have to issue a query:
SELECT yymmddhh
FROM mystuff_metadata where yyyy = '2013'
ORDER BY yymmddhh DESC LIMIT 1;
Your audit software would have to make an entry to that table on start and later on each hour change (for example before inserting data to mystuff).

Resources