Get Date Range for Cassandra - Select timeuuid with IN returning 0 rows

I'm trying to get data from a date range in Cassandra. The table looks like this:
CREATE TABLE test6 (
time timeuuid,
id text,
checked boolean,
email text,
name text,
PRIMARY KEY ((time), id)
)
But when I select a date range I get nothing:
SELECT * FROM test6 WHERE time IN ( minTimeuuid('2013-01-01 00:05+0000'), now() );
(0 rows)
How can I get a date range from a Cassandra Query?

The IN condition specifies a list of exact key values for a SELECT query; it does not describe a range. You're close, but to run a date range query on your table you'll want to use greater-than and less-than conditions instead.
Of course, you can't run a greater-than/less-than query on a partition key, so you'll need to flip your keys for this to work. This means that you'll also need to specify your id in the WHERE clause:
CREATE TABLE teste6 (
time timeuuid,
id text,
checked boolean,
email text,
name text,
PRIMARY KEY ((id), time)
);
INSERT INTO teste6 (time,id,checked,email,name)
VALUES (now(),'B26354',true,'rdeckard#lapd.gov','Rick Deckard');
SELECT * FROM teste6
WHERE id='B26354'
AND time >= minTimeuuid('2013-01-01 00:05+0000')
AND time <= now();
 id     | time                                 | checked | email             | name
--------+--------------------------------------+---------+-------------------+--------------
 B26354 | bf0711f0-b87a-11e4-9dbe-21b264d4c94d | True    | rdeckard#lapd.gov | Rick Deckard
(1 rows)
Now while this will technically work, partitioning your data by id might not work for your application. So you may need to put some more thought behind your data model and come up with a better partition key.
Edit:
Remember with Cassandra, the idea is to get a handle on what kind of queries you need to be able to fulfill. Then build your data model around that. Your original table structure might work well for a relational database, but in Cassandra that type of model actually makes it difficult to query your data in the way that you're asking.
Take a look at the modifications that I have made to your table (basically, I just reversed your partition and clustering keys). If you still need help, Patrick McFadin (DataStax's Chief Evangelist) wrote a really good article called Getting Started with Time Series Data Modeling. He has three examples that are similar to yours. In fact his first one is very similar to what I have suggested for you here.
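To make the range bounds in the query above more concrete, here is a minimal Python sketch of what a "lowest timeuuid for a timestamp" looks like: a version-1 UUID that carries the timestamp and has its clock sequence and node set to the smallest values. The function name lowest_v1_uuid is hypothetical, and the bit layout is an illustration of the idea only; it is not claimed to match the exact bytes that Cassandra's minTimeuuid() produces.

```python
import uuid
from datetime import datetime, timezone

# 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def lowest_v1_uuid(ts: datetime) -> uuid.UUID:
    """Hypothetical helper: a version-1 UUID for ts with clock_seq and node zeroed.

    ts must be timezone-aware. This illustrates the idea behind a lower
    bound; it does not reproduce Cassandra's exact minTimeuuid() bytes.
    """
    delta = ts - UNIX_EPOCH
    # Integer arithmetic avoids float rounding on large timestamps.
    ticks = (delta.days * 86400 + delta.seconds) * 10_000_000
    ticks += delta.microseconds * 10 + UUID_EPOCH_OFFSET

    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000  # version 1
    # 0x80 sets the RFC 4122 variant bits; the remaining bits are zero.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, 0))


lower = lowest_v1_uuid(datetime(2013, 1, 1, 0, 5, tzinfo=timezone.utc))
print(lower, lower.time)
```

Note that Python orders UUID objects by their raw integer value, which is not chronological for version-1 UUIDs, so compare the .time attribute when checking bounds on the client side.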

Related

Cassandra order by timestamp desc

I have just begun studying Cassandra.
I started with the following table and query.
CREATE TABLE finance.tickdata(
id_symbol int,
ts timestamp,
bid double,
ask double,
PRIMARY KEY(id_symbol,ts)
);
This query succeeds:
select ts,ask,bid
from finance.tickdata
where id_symbol=3
order by ts desc;
Next, it was decided to move id_symbol into the table name. The script for the new table(s):
CREATE TABLE IF NOT EXISTS mts_src.ticks_3(
ts timestamp PRIMARY KEY,
bid double,
ask double
);
And now the query fails:
select * from mts_src.ticks_3 order by ts desc;
I read in the docs that I need to filter (WHERE) by the primary key (partition key), but technically my two examples are the same. Why is Cassandra so restrictive in this respect?
One more question: is it a good idea in general to move id_symbol into the table name? There could potentially be 1000 unique id_symbol values, with a lot of data for each. Separating this data into individual tables looks like a good idea, but then I lose the ability to ORDER BY, which I need in order to fetch the freshest data for each id_symbol.
Thanks.
You can't sort on the partition key; you can sort only on clustering columns within a single partition. So you need to model your data accordingly. But you need to be very careful not to create very large partitions (when using ticker_id alone as the partition key, for example). In that case you may need to create a composite partition key, such as ticker_id + year or ticker_id + month, depending on how often you're inserting data.
Regarding a table per ticker, that's not a very good idea: every table has overhead, so it leads to increased resource consumption. 200 tables is already a high number, and 500 is almost a "hard limit".
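As a rough illustration of the composite key idea, here is a minimal Python sketch of how a client could derive a month bucket for each tick and decide which buckets to read, newest first, when looking for fresh data. The names month_bucket and recent_buckets and the YYYYMM integer encoding are assumptions made for this example; they do not come from the question or from any driver API.

```python
from datetime import datetime
from typing import List


def month_bucket(ts: datetime) -> int:
    """Encode the month of ts as an integer such as 202402 (hypothetical scheme)."""
    return ts.year * 100 + ts.month


def recent_buckets(ts: datetime, count: int) -> List[int]:
    """Return `count` month buckets, starting at ts's month and going backwards."""
    year, month = ts.year, ts.month
    buckets = []
    for _ in range(count):
        buckets.append(year * 100 + month)
        month -= 1
        if month == 0:  # roll over from January to the previous December
            year -= 1
            month = 12
    return buckets


# The partition key for a write would then be (id_symbol, month_bucket(ts)).
print(month_bucket(datetime(2024, 2, 10)))
print(recent_buckets(datetime(2024, 2, 10), 3))
```

With a key like this, a "latest ticks" read would query the newest bucket for the symbol first and only fall back to older buckets if it needs more rows.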

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as this is the part I don't understand):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with:
SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz';
Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
Second one is more delicate. If you set your partition to the full date, you won't be able to do a range query, as range query is only available on the clustering column. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB, otherwise you will run into trouble)
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine as long as you don't have too many documents per day (so that each monthly partition doesn't grow too much). It allows you to run queries such as:
SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24;
If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large because of the data field, you will need to remove it from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
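Since a range that spans several months has to be split into one query per (year, month) partition, the client needs to work out the day bounds for each partition. A minimal Python sketch of that splitting step follows, assuming the documents_by_date layout above; the function name month_slices is hypothetical and the sketch only produces the bounds, it does not talk to Cassandra.

```python
import calendar
from datetime import date
from typing import Iterator, Tuple


def month_slices(start: date, end: date) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (year, month, first_day, last_day) for each month touched by [start, end].

    Each tuple maps onto one query of the form:
    WHERE year=? AND month=? AND day>=? AND day<=?
    """
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        is_first = (year, month) == (start.year, start.month)
        is_last = (year, month) == (end.year, end.month)
        first_day = start.day if is_first else 1
        # monthrange returns (weekday of day 1, number of days in the month)
        last_day = end.day if is_last else calendar.monthrange(year, month)[1]
        yield year, month, first_day, last_day
        month += 1
        if month == 13:  # roll over to January of the next year
            year, month = year + 1, 1


for bounds in month_slices(date(2018, 11, 20), date(2019, 1, 5)):
    print(bounds)
```

The queries for the individual slices are independent of each other, so the application could run them sequentially or concurrently and concatenate the results in slice order.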
If latency is not a huge concern, an alternative would be to use the Stratio Lucene plugin for Cassandra and index your date.
The question does not specify how your data relates users to creation times. But since it's a document, I am assuming that one user will create one document at any given "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) helps you get the data for a given user ordered by created.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. If we have to create a separate table with the date as the primary key, it becomes tricky when many documents are inserted in the very same second. Clustering order works best for requirements where the documents for a given user need to be sorted by time.

cassandra - simple/basic data modeling to retrieve all employees

Creating the following employee column family in Cassandra
Case 1:
CREATE TABLE employee (
name text,
designation text,
gender text,
created_by text,
created_date timestamp,
modified_by text,
modified_date timestamp,
PRIMARY KEY (name)
);
From the UI, if I wanted to get all employees, it is not possible to retrieve them. Is that true?
select * from employee; //not possible as it is partitioned by name
Case 2:
I was told to do this way to retrieve all employees.
We need to design this with a static key, to retrieve all the employees.
CREATE TABLE employee (
static_name text,
name text,
designation text,
gender text,
created_by text,
created_date timestamp,
modified_by text,
modified_date timestamp,
PRIMARY KEY (static_name,name)
);
static_name (i.e. "EMPLOYEE") will be the partition key and name will be the clustering key. The primary key is the combination of static_name and name.
static_name -> every time you add an employee, insert it with the static value, i.e. EMPLOYEE
Now you will be able to run a "select all employees" query:
//this will return you all the employees
select * from employee where static_name='EMPLOYEE';
Is this true? Can't we use case 1 to return all the employees?
Both approaches are OK, with some catches.
Approach 1:
When you say UI, I guess you mean running a simple select * ... It's correct that this won't really work out of the box if you want to get every single row out, especially if the data set is big. You could use pagination in the driver (I'm not 100% sure, since I haven't had to use it in a while), but when I needed to walk over all the partitions I would use the token function, i.e.:
select token(name), name from employee limit 1;
 system.token(name)   | name
----------------------+------
 -8839064797231613815 | a
Now you take the resulting token and put it into the next query. This has to be done by your program. After it has fetched all the elements with a greater token, it would also need to cover any tokens lower than -8839064797231613815.
select token(name), name from employee where token(name) > -8839064797231613815 limit 1;
 system.token(name)   | name
----------------------+------
 -8198557465434950441 | c
Then I would wrap this in a loop until all the elements have been fetched. (I think this is also how the Spark Cassandra connector does it when retrieving wide rows from a cluster.)
The disadvantage of this model is that it performs really badly, because it has to go all over the cluster; it is more or less meant for analytical workloads. Since you mentioned a UI, it would take too long for the user to get the result, so I would advise against using approach 1 for UI-related work.
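The token loop described for approach 1 can be sketched in Python as follows. To keep the example self-contained and runnable without a cluster, the query is abstracted behind a hypothetical fetch_page callable that stands in for "select token(name), name from employee where token(name) > ? limit ?", and make_fake_fetch simulates it over an in-memory list. Both names are assumptions for illustration and are not part of any Cassandra driver.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # (token(name), name)


def scan_by_token(
    fetch_page: Callable[[Optional[int], int], List[Row]],
    page_size: int = 100,
) -> Iterator[Row]:
    """Walk every row by repeatedly asking for tokens greater than the last one seen."""
    last_token: Optional[int] = None  # None means "start from the lowest token"
    while True:
        page = fetch_page(last_token, page_size)
        if not page:
            return  # nothing left above last_token
        for row in page:
            yield row
        last_token = page[-1][0]


def make_fake_fetch(rows: List[Row]) -> Callable[[Optional[int], int], List[Row]]:
    """In-memory stand-in for the CQL query; returns rows in token order."""
    ordered = sorted(rows)

    def fetch(last_token: Optional[int], limit: int) -> List[Row]:
        matching = [r for r in ordered if last_token is None or r[0] > last_token]
        return matching[:limit]

    return fetch


rows = [(-8198557465434950441, "c"), (-8839064797231613815, "a"), (42, "b")]
for token, name in scan_by_token(make_fake_fetch(rows), page_size=1):
    print(token, name)
```

This sketch assumes one row per token, which holds for the case 1 table where name is the whole primary key; it ignores the rare case of two partition keys hashing to the same token.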
Approach 2.
The disadvantage of the second approach is that it creates what is called a hot row: every update goes to a single partition, and most of the time this is a bad model.
The advantage is that you could easily paginate over the one partition and get your data out by pagination functions built into the driver.
This would, however, behave just fine if you have a moderate load (tens or low hundreds of updates per second) and a relatively low number of users; for, let's say, 100 000 this would work just fine. If your numbers are greater, you have to somehow split the users across multiple partitions so that the load gets distributed more evenly.
One possibility is to append a letter of the alphabet to "EMPLOYEE", so you would have "EMPLOYEE_A", "EMPLOYEE_B", etc. This would work relatively well, but again it is not ideal: because of the lexicographical distribution of names, some partitions may receive considerably more data than others.
Another approach would be to create artificial buckets. Let's say you decide by design that there are 10 buckets; when you insert, you append a random bucket number to the static prefix ("EMPLOYEE_1" and so on). When retrieving, you go over each bucket partition in turn until you exhaust the results.
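A minimal Python sketch of the bucketing idea follows. Note one deliberate swap: instead of a random bucket, it derives the bucket from a CRC32 hash of the employee name, so the same name always lands in the same partition and can be looked up again without scanning. The names BUCKETS, bucket_key and all_bucket_keys, and the choice of zlib.crc32, are assumptions for illustration only.

```python
import zlib
from typing import List

BUCKETS = 10  # assumed bucket count, fixed by design


def bucket_key(name: str, buckets: int = BUCKETS) -> str:
    """Partition key value for an employee: the static prefix plus a hashed bucket."""
    bucket = zlib.crc32(name.encode("utf-8")) % buckets
    return f"EMPLOYEE_{bucket}"


def all_bucket_keys(buckets: int = BUCKETS) -> List[str]:
    """Every partition key value to visit when listing all employees."""
    return [f"EMPLOYEE_{i}" for i in range(buckets)]


# Writing: insert with static_name = bucket_key(name).
print(bucket_key("alice"))
# Reading everything: run one query per key in all_bucket_keys().
print(all_bucket_keys())
```

A "select all employees" read then becomes one query per bucket, each of which can be paged independently by the driver.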

How to get last inserted row in Cassandra?

I want to get last inserted row in Cassandra table. How to get it? Any idea?
I am developing a project in which I am replacing MySQL with Cassandra. I want to get rid of all the SQL queries and rewrite them for Cassandra.
Just to impart a little understanding...
As with all Cassandra query problems, the query needs to be served by a model specifically designed for it. This is known as query-based modeling. Querying the last inserted row is not an intrinsic capability built into every table. You would need to design your model to support that ahead of time.
For instance, let's say I have a table storing data for users.
CREATE TABLE users (
username TEXT,
email TEXT,
firstname TEXT,
lastname TEXT,
PRIMARY KEY (username));
If I were to run a SELECT * FROM users LIMIT 1 on this table, my result set would contain a single row. That row would be the one containing the lowest hashed value of username (my partition key), because that's how Cassandra stores data in the cluster. I would have no way of knowing if it was the last one added or not, so this wouldn't be terribly useful to you.
On the other hand, let's say I had a table designed to track updates that users had made to their account info.
CREATE TABLE userUpdates (
username TEXT,
lastUpdated TIMEUUID,
email TEXT,
firstname TEXT,
lastname TEXT,
PRIMARY KEY (username,lastUpdated))
WITH CLUSTERING ORDER BY (lastUpdated DESC);
Next I'll upsert 3 rows:
> INSERT INTO userUpdates (username,lastUpdated,email,firstname,lastname)
VALUES ('bkerman',now(),'bkerman#ksp.com','Bob','Kerman');
> INSERT INTO userUpdates (username,lastUpdated,email,firstname,lastname)
VALUES ('jkerman',now(),'jkerman#ksp.com','Jebediah','Kerman');
> INSERT INTO userUpdates (username,lastUpdated,email,firstname,lastname)
VALUES ('bkerman',now(),'bobkerman#ksp.com','Bob','Kerman');
> SELECT username, email, dateof(lastUpdated) FROM userupdates;
 username | email             | system.dateof(lastupdated)
----------+-------------------+----------------------------
 jkerman  | jkerman#ksp.com   | 2016-02-17 15:31:39+0000
 bkerman  | bobkerman#ksp.com | 2016-02-17 15:32:22+0000
 bkerman  | bkerman#ksp.com   | 2016-02-17 15:31:38+0000
(3 rows)
If I just run SELECT username, email, dateof(lastUpdated) FROM userupdates LIMIT 1, I'll get Jebediah Kerman's data, which is not the most-recently updated. However, if I limit my query to the partition username='bkerman', then with a LIMIT 1 I will get the most-recent row for Bob Kerman.
> SELECT username, email, dateof(lastUpdated) FROM userupdates WHERE username='bkerman' LIMIT 1;
 username | email             | system.dateof(lastupdated)
----------+-------------------+----------------------------
 bkerman  | bobkerman#ksp.com | 2016-02-17 15:32:22+0000
(1 rows)
This works, because I specified a clustering order of descending on lastUpdated:
WITH CLUSTERING ORDER BY (lastUpdated DESC);
In this way, results within each partition will be returned with the most-recently upserted row at the top, hence LIMIT 1 becomes the way to query the most-recent row.
In summary, it is important to understand that:
Cassandra orders data in the cluster by the hashed value of a partition key. This helps ensure more-even data distribution.
Cassandra CLUSTERING ORDER enforces on-disk sort order of data within a partition key.
While you won't be able to get the most-recently upserted row for each table, you can design models to return that row to you for each partition.
tl;dr; Querying in Cassandra is MUCH different from that of MySQL or any RDBMS. If querying the last upserted row (for a partition) is something you need to do, there are probably ways in which you can model your table to support it.
I want to get last inserted row in Cassandra table. How to get it? Any idea?
It is not possible. What you are requesting is a queue pattern (give me the last message in), and a queue is a known anti-pattern for Cassandra.

CQL: Search a table in cassandra using '<' on an indexed column

My cassandra data model:
CREATE TABLE last_activity_tracker ( id uuid, recent_activity_time timestamp, PRIMARY KEY(id));
CREATE INDEX activity_idx ON last_activity_tracker (recent_activity_time) ;
The idea is to keep track of 'id's and their most recent activity of an event.
I need to find the 'id's whose last activity was a year ago.
So, I tried:
SELECT * from last_activity_tracker WHERE recent_activity_time < '2013-12-31' allow filtering;
I understand that I cannot use operators other than '=' on secondary indexed columns.
However, I cannot add 'recent_activity_time' to the key as I need to update this column with the most recent activity time of an event if any.
Any ideas in solving my problem are highly appreciated.
I can see an issue with your query. You're not hitting a partition. As such, the performance of your query will be quite bad. It'll need to query across your whole cluster (assuming you took measures to make this work).
If you're looking to query the last activity time for an id, think about storing it in a more query friendly format. You might try this:
create table tracker (dummy int, day timestamp, id uuid, primary key(dummy, day, id));
You can then insert rows with day set to the timestamp of midnight on that date (ignoring the time of day), and dummy = 0.
That should enable you to do:
select * from tracker where dummy=0 and day > '2013-12-31';
You can set a ttl on insert so that old entries expire (maybe after a year in this case). The idea is that you're storing information in a way that suits your query.
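The "day as the epoch for the date" step can be sketched in Python like this: the activity time is truncated to midnight before it is written to the day column. The function name day_bucket is hypothetical, and the sketch assumes timezone-aware datetimes and that days are bucketed in UTC; the answer does not say which time zone to use.

```python
from datetime import datetime, timedelta, timezone


def day_bucket(ts: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to midnight UTC of the same day."""
    utc_ts = ts.astimezone(timezone.utc)
    return utc_ts.replace(hour=0, minute=0, second=0, microsecond=0)


activity = datetime(2014, 1, 15, 13, 45, 12, tzinfo=timezone.utc)
day = day_bucket(activity)
print(day.isoformat())
# Milliseconds since the Unix epoch, the unit a CQL timestamp stores.
print(int(day.timestamp() * 1000))
```

Every activity on the same UTC day then maps to the same day value, so a later activity for the same id on that day overwrites the existing row instead of adding a new one.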
