How it is ordered if I query by secondary index? - cassandra

I created this table on cassandra.
CREATE TABLE user_event(
userId bigint,
type varchar,
createdAt timestamp,
PRIMARY KEY ((userId), createdAt)
) WITH CLUSTERING ORDER BY (createdAt DESC);
CREATE INDEX user_event_type ON user_event(type);
If I query by userId query result will be ordered by createdAt column.
SELECT * FROM user_event WHERE userId = 1;
But how it is ordered if I query by type? Can I get last SIGN_IN event?
SELECT * FROM user_event WHERE userId = 1 AND type = 'SIGN_IN' LIMIT 1;
Is there any guarantee that result is ordered by createdAt?

The key to understanding this scenario, is to remember that result set order can only be enforced within a partition. As you are still querying by partition key (userId) all data within each partition will still be ordered by createdAt (DESCending).
"Guarantee" is a strong word, and one that I am hesitant to use. The results queried in this way should maintain their on-disk sort order. I would definitely test it out. But as long as you provide userId as a part of the query, the results should be returned sorted by createdAt.

Related

Delete records in Cassandra table based on time range

I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records which were inserted between a given time range? I know internally Cassandra must be using some timestamp to track the insertion time of each record, which would be used by features like TTL.
Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?
There is never any update to the records after insertion.
It's an interesting question...
All columns that aren't part of the primary key have so-called WriteTime that could be retrieved using the writetime(column_name) function of CQL (warning: it doesn't work with collection columns, and return null for UDTs!). But because we don't have nested queries in the CQL, you will need to write a program to fetch data, filter out entries by WriteTime, and delete entries where WriteTime is older than your threshold. (note that value of writetime is in microseconds, not milliseconds as in CQL's timestamp type).
The easiest way is to use Spark Cassandra Connector's RDD API, something like this:
val timestamp = someDate.toInstant.getEpochSecond * 1000L
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
.select("prk1", "prk2", "reg_col".writeTime as "writetime")
.filter(row => row.getLong("writetime") < timestamp)
oldData.deleteFromCassandra(srcKeyspace, srcTable,
keyColumns = SomeColumns("prk1", "prk2"))
where: prk1, prk2, ... are all components of the primary key (documentId and sequenceNo in your case), and reg_col - any of the "regular" columns of the table that isn't collection or UDT (for example, clientId). It's important that list of the primary key columns in select and deleteFromCassandra was the same.

Get last row in table of time series?

I am already able to get the last row of time-series table as:
SELECT * from myapp.locations WHERE organization_id=1 and user_id=15 and date='2017-2-22' ORDER BY unix_time DESC LIMIT 1;
That works fine, however, I am wondering about performance and overhead of executing ORDER BY as rows are already sorted, I just use it to get the last row, is it an overhead in my case?
If I don't use ORDER BY, I will always get the first row in the table, so, I though I might be able to use INSERT in another way, ex: insert always in the beginning instead of end of table?
Any advice? shall I use ORDER BY without worries about performance?
Just define your clustering key order to DESC
Like the below schema :
CREATE TABLE locations (
organization_id int,
user_id int,
date text,
unix_time bigint,
lat double,
long double,
PRIMARY KEY ((organization_id, user_id, date), unix_time)
) WITH CLUSTERING ORDER BY (unix_time DESC);
So by default your data will sorted by unix_time desc, you don't need to specify in query
Now you can just use the below query to get the last row :
SELECT * from myapp.locations WHERE organization_id = 1 and user_id = 15 and date = '2017-2-22' LIMIT 1;
If your query pattern for that table is always ORDER BY unix_time DESC then you are in a reverse order time-series scenario, and I can say that your model is inaccurate (not wrong).
There's no reason not to sort the records in reverse order by adding a WITH CLUSTERING ORDER BY unix_time DESC in the table definition, and in my opinion the ORDER BY unix_time DESC will perform at most on par with something explicitly meant for these use cases (well, I think it will perform worse).

Cassandra cql - sorting unique IDs by time

I am writting messaging chat system, similar to FB messaging. I did not find the way, how to effectively store conversation list (each row different user with last sent message most recent on top). If I list conversations from this table:
CREATE TABLE "conversation_list" (
"user_id" int,
"partner_user_id" int,
"last_message_time" time,
"last_message_text" text,
PRIMARY KEY ("user_id", "partner_user_id")
)
I can select from this table conversations for any user_id. When new message is sent, we can simply update the row:
UPDATE conversation_list SET last_message_time = '...', last_message_text='...' WHERE user_id = '...' AND partner_user_id = '...'
But it is sorted by clustering key of course. My question: How to create list of conversations, which is sorted by last_message_time, but partner_user_id will be unique for given user_id?
If last_message_time is clustering key and we delete the row and insert new (to keep partner_user_id unique), I will have many so many thumbstones in the table.
Thank you.
A slight change to your original model should do what you want:
CREATE TABLE conversation_list (
user_id int,
partner_user_id int,
last_message_time timestamp,
last_message_text text,
PRIMARY KEY ((user_id, partner_user_id), last_message_time)
) WITH CLUSTERING ORDER BY (last_message_time DESC);
I combined "user_id" and "partner_user_id" into one partition key. "last_message_time" can be the single clustering column and provide sorting. I reversed the default sort order with the CLUSTERING ORDER BY to make the timestamps descending. Now you should be able to just insert any time there is a message from a user to a partner id.
The select now will give you the ability to look for the last message sent. Like this:
SELECT last_message_time, last_message_text
FROM conversation_list
WHERE user_id= ? AND partner_user_id = ?
LIMIT 1

Cassandra order by on combination of composite keys

I originally wrote a table that tracks feeds that have been assigned to a user for review.
create table user_feed
{
userid uuid,
languageid uuid,
topicid_uuid,
dateinserted timeuuid,
primary key (userid, languageid, topicid, dateinserted)
};
I realized soon after I created this table that I wouldn't be able to sort this table (order by DESC) by dateinserted because for some weird reason, in Cassandra I can only order by the second (and last) column of a composite key table (as in, the table has to have 2 composite keys and order by can only happen on the second column of this key) so I changed my table to this:
create table user_feed
{
userid uuid,
languageid uuid,
topicid_uuid,
dateinserted timeuuid,
primary key (userid, dateinserted)
};
and now I was able to run a query to get the latest feeds for the user, using order by.
However, I have a new requirement that requires me to sort the feeds by a combination of (languageid + userid) or (topicid + userid) or (languageid + topicid + userid).
I had an idea to create three new tables and have the keys combined into one key column. For example, for userid + topic query, I would use:
create table user_feed_by_topic
{
usertopicidkey text,
dateinserted timeuuid,
primary key (usertopicidkey, dateinserted)
};
where usertopididkey = userid.toString() + topicid.toString().
Of course, this solution requires 4 separate inserts whenever I need to insert a new feed row since I have 4 rows, tracking identical data but partitioned differently to allow sorting.
My question is, is there a better way to do this? Is there any way to achieve what I want (query by a combination of columns and order by another column) or am I stuck with my 4 table design approach?
Many thanks,
Cassandra will order all rows based on the PKs clustering columns. In case your PK is primary key (userid, languageid, topicid, dateinserted) all rows will be sorted by languageid, topicid and dateinserted in ascending order. This implies that all rows will only be sorted within a specific language and topic by date. You'd have to use the date as the first clustering key column to change this behaviour.
Its common practice to denormalize your data across multiple tables to implement different ordering strategies.

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, such as SELECT date FROM myCF ; would return the most recent inserted date.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted is user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worst, maybe this is completely stupid and I don't get why.
Is there any ways of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
Then you need an additional table which is created to handle those queries, it would only order on Date and not on user id
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you try to get is "return the most recent inserted date", you may not need an additional table. You can use "static column" to store the last update time per partition. CASSANDRA-6561
It probably won't help your particular case (since I imagine your list of all users is unmanagably large), but if the condition on the first clustering column is matching one of a relatively small set of values then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.

Resources