Cassandra ordered result of a range query

I have the following table:
create table tweets_by_hashtags(
hashtag text,
tweet_id text,
tweet_posted_time timestamp,
retweet_count int,
body text,
primary key(hashtag, tweet_id)
)
I want to perform the following query, and I need the result to be ordered by retweet_count desc:
select
*
from
tweets_by_hashtags
where
hashtag = 'some_hashtag' and
tweet_posted_time >= 'from_time' and
tweet_posted_time < 'to_time'
Please help me with the design of the primary/partition/clustering keys.

Since you need your data partitioned by hashtag and also ordered by time (you are doing range queries, so you need to know between which times certain tweets happened), your table should be created like this:
create table tweets_by_hashtags(
hashtag text,
tweet_id text,
tweet_posted_time timestamp,
retweet_count int,
body text,
primary key((hashtag), tweet_posted_time, tweet_id)
)
Here hashtag is the partition key, and tweets are clustered first by time (ordering by time is what enables the range queries); tweet_id is added for uniqueness (if two tweets happen at the exact same time, you need to differentiate them).
This will enable the SELECT query you proposed, returning tweets for a hashtag between some start and end time.
As for the other part of the question (ordering by retweet_count), I see two possible solutions:
1. Order on the application level
When you pull your list of tweets, you can loop through the list and sort it by retweet count; this way you will have ordered tweets between the times you want.
2. Fixed time buckets
If there is a fixed time resolution you need (i.e. daily tweets, hourly tweets, or similar) and you can drop the range criteria from your query, you can create the table with a composite partition key composed of the hashtag and the time bucket, and use retweet_count as a clustering key:
create table hourly_tweets_by_hashtags(
hashtag text,
tweet_id text,
tweet_posted_time timestamp,
tweet_posted_date text,
tweet_posted_hour int,
retweet_count int,
body text,
primary key((tweet_posted_date, tweet_posted_hour, hashtag), retweet_count, tweet_id)
) WITH CLUSTERING ORDER BY (retweet_count DESC)
Now your composite partition key is composed of the date, the hour of the day, and the hashtag, and tweets are ordered by retweet_count. Again, tweet_id is added for uniqueness.
Now you can run a query like this:
select
*
from
hourly_tweets_by_hashtags
where
hashtag = 'some_hashtag' and
tweet_posted_date = '22/01/2016' and
tweet_posted_hour = 16;
and this query will return all tweets for that hashtag on the given date at 16h, ordered by retweet_count. The clustering order puts the tweets with the most retweets on top.
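For illustration, here is a minimal sketch of a write into this bucketed table (all values are made up; note that since retweet_count is a clustering column, a tweet whose count changes has to be re-inserted as a new row rather than updated in place):
-- Hypothetical write into the hourly bucket table (illustrative values).
INSERT INTO hourly_tweets_by_hashtags (
tweet_posted_date, tweet_posted_hour, hashtag,
retweet_count, tweet_id, tweet_posted_time, body
) VALUES (
'22/01/2016', 16, 'some_hashtag',
42, 'tweet_123', '2016-01-22 16:35:00', 'example tweet body'
);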

Related

How to sum up cassandra counter grouping by only one column in the primary key set?

I am trying to keep track of the number of events of each type that occurred in one-hour time buckets, and then sum the counts per category over arbitrary time ranges. So I created a table like this:
CREATE TABLE IF NOT EXISTS sensor_activity_stats(
sensor_id text,
datetime_hour_bucket timestamp,
activity_type text,
activity_count counter,
PRIMARY KEY ((sensor_id), datetime_hour_bucket, activity_type)
)
WITH CLUSTERING ORDER BY(datetime_hour_bucket DESC, activity_type ASC);
I would like to be able to achieve this kind of query:
SELECT datetime_hour_bucket, activity_type, SUM(activity_count) as count
FROM sensor_activity_stats
WHERE sensor_id=:sensorId
AND datetime_hour_bucket >= :fromDate AND datetime_hour_bucket < :untilDate
GROUP BY activity_type
Cassandra complains about this because grouping must be done in the order of the primary key columns. And if I change the order, I won't be able to query a time range across all activity_type values.
Some notes:
I am grouping by hours because some users could ask me to show the data in different timezones, and I want to be able to perform a decent conversion.
The activity_type has low cardinality; however, I cannot be sure I'll always be able to predict its possible values.
Right now my solution is to query all the data in the range and perform the aggregation myself in code. Have you faced a similar situation, and what was your solution? Would you suggest a different way of querying or arranging the data?
I hope you've already found a solution to your problem; in any case, here is an approach you can try.
First, you can change the CREATE TABLE to reorder the key columns:
CREATE TABLE IF NOT EXISTS sensor_activity_stats(
sensor_id text,
datetime_hour_bucket timestamp,
activity_type text,
activity_count counter,
PRIMARY KEY ((sensor_id), activity_type, datetime_hour_bucket) -- activity_count is a counter and cannot be a key column
)
WITH CLUSTERING ORDER BY(activity_type ASC, datetime_hour_bucket DESC);
Then, in the query, you can add the field "datetime_hour_bucket" to the GROUP BY clause:
SELECT datetime_hour_bucket, activity_type, SUM(activity_count) as count
FROM sensor_activity_stats
WHERE sensor_id=:sensorId
AND datetime_hour_bucket >= :fromDate AND datetime_hour_bucket < :untilDate
GROUP BY activity_type, datetime_hour_bucket;
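For reference, counter tables are written with UPDATE (an increment), not INSERT; a minimal sketch against the reordered table above, with made-up values:
-- Hypothetical counter increment for one sensor/activity/hour combination.
UPDATE sensor_activity_stats
SET activity_count = activity_count + 1
WHERE sensor_id = 'sensor-1'
AND activity_type = 'door_open'
AND datetime_hour_bucket = '2016-01-22 16:00:00';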

How should I design the schema to get the last 2 records of each clustering key in Cassandra?

Each row in my table has 4 values product_id, user_id, updated_at, rating.
I'd like to create a table to find out how many users changed their rating during a given period.
Currently my schema looks like:
CREATE TABLE IF NOT EXISTS ratings_by_product (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((product_id ), updated_at , user_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, user_id ASC);
but I couldn't figure out a way to get only the last 2 rows per user in a given time window.
Any advice on query or changing the schema would be appreciated.
Cassandra requires a query-based approach to table design, which means that typically one table serves one query. So to serve the query you are talking about (the last two updated rows per user), you should build a table specifically designed to serve it:
CREATE TABLE ratings_by_user_by_time (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((user_id ), updated_at, product_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, product_id ASC );
Then you will be able to get the last two updated ratings for a user by doing the following:
SELECT * FROM ratings_by_user_by_time
WHERE user_id = 101 LIMIT 2; -- user_id is an int in this schema, so a numeric id
Note that you'll need to keep the two ratings tables in sync yourself, and using a batch statement is a good way to accomplish that.
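A minimal sketch of such a logged batch, assuming both tables above exist (all values are made up):
BEGIN BATCH
INSERT INTO ratings_by_product (product_id, updated_at, user_id, rating)
VALUES (55, '2020-03-01 12:00:00', 101, 4);
INSERT INTO ratings_by_user_by_time (product_id, updated_at, user_id, rating)
VALUES (55, '2020-03-01 12:00:00', 101, 4);
APPLY BATCH;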

Cassandra cql - sorting unique IDs by time

I am writing a messaging chat system, similar to FB messaging. I have not found a way to effectively store the conversation list (each row a different user, with the most recently sent message on top). If I list conversations from this table:
CREATE TABLE "conversation_list" (
"user_id" int,
"partner_user_id" int,
"last_message_time" time,
"last_message_text" text,
PRIMARY KEY ("user_id", "partner_user_id")
)
I can select conversations for any user_id from this table. When a new message is sent, we can simply update the row:
UPDATE conversation_list SET last_message_time = '...', last_message_text='...' WHERE user_id = '...' AND partner_user_id = '...'
But of course it is sorted by the clustering key. My question: how do I create a list of conversations that is sorted by last_message_time, but where partner_user_id is unique for a given user_id?
If last_message_time is the clustering key and we delete the row and insert a new one (to keep partner_user_id unique), I will end up with many tombstones in the table.
Thank you.
A slight change to your original model should do what you want:
CREATE TABLE conversation_list (
user_id int,
partner_user_id int,
last_message_time timestamp,
last_message_text text,
PRIMARY KEY ((user_id, partner_user_id), last_message_time)
) WITH CLUSTERING ORDER BY (last_message_time DESC);
I combined "user_id" and "partner_user_id" into one partition key. "last_message_time" can be the single clustering column and provides the sorting. I reversed the default sort order with CLUSTERING ORDER BY to make the timestamps descending. Now you can simply insert a row any time there is a message from a user to a partner id.
The select now gives you the ability to look up the last message sent, like this:
SELECT last_message_time, last_message_text
FROM conversation_list
WHERE user_id= ? AND partner_user_id = ?
LIMIT 1
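For completeness, a sketch of the write path under this model (values are made up): each new message simply appends a row, and the descending clustering order keeps the newest message first.
-- Hypothetical write of a new message (illustrative values).
INSERT INTO conversation_list (user_id, partner_user_id, last_message_time, last_message_text)
VALUES (1, 2, '2016-01-22 16:35:00', 'Hello!');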

Apache Cassandra table not sorting by name or title correctly

I have the following Apache Cassandra table:
CREATE TABLE user_songs (
member_id int,
song_id int,
title text,
timestamp timeuuid,
album_id int,
album_title text,
artist_names set<text>,
PRIMARY KEY ((member_id, song_id), title)
) WITH CLUSTERING ORDER BY (title ASC);
CREATE INDEX user_songs_member_id_idx ON music.user_songs (member_id);
When I run select * FROM user_songs WHERE member_id = 1; I thought the CLUSTERING ORDER BY on title would give me the results sorted ascending by title, but it doesn't.
Two questions:
Is there something wrong with the table in terms of ordering or the PK?
Do I need more tables for my needs in order to have titles sorted per member_id?
Note - my Cassandra queries for this table are:
Find all songs with member_id
Remove a song from member_id given song_id
Hence why the PK is composite.
UPDATE
It is similar to: Query results not ordered despite WITH CLUSTERING ORDER BY
However, one of the suggestions in the comments is to use member_id, song_id, title as the primary key instead of the composite one I currently have. When I do that, it seems that I cannot delete with only song_id and member_id, which is all the data I get when deleting (hence title is missing when deleting).

Cassandra data model with obsolete data removal possibility

I'm new to Cassandra and would like to ask what the correct model design pattern would be for such a task.
I would like to model the data with the possibility of removing it later.
I have 100,000,000 records per day of this structure:
transaction_id <- this is unique
transaction_time
transaction_type
user_name
... some other information
I will need to fetch data by user_name (I have about 5,000,000 users).
I will also need to find transaction details by id.
All the data will be irrelevant after about 30 days, so I need to find a way to delete outdated rows.
As far as I have found, TTLs expire column values, not rows.
So far I have come up with this model, and as I understand it, it will imply really wide rows:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transactiom
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY ((transaction_date, user_name), transaction_id)
);
CREATE INDEX idx_user_transactions_uname ON USER_TRANSACTIONS(user_name);
CREATE INDEX idx_user_transactions_tid ON USER_TRANSACTIONS(transaction_id);
but this model does not allow deletions by transaction_date.
It also builds indexes with high cardinality, which the Cassandra docs strongly discourage.
So what will be the correct model for this task?
EDIT:
The ugly workaround I have come up with so far is to create a single table per date partition. Mind you, I call this a workaround and not a solution; I'm still looking for the right data model:
CREATE TABLE user_transactions_YYYYMMDD (
user_name text,
transaction_id text,
transaction_time timestamp,
transaction_type int,
PRIMARY KEY (user_name)
);
YYYYMMDD is the date part of the transaction. We can create a similar table keyed by transaction_id for transaction lookup. Obsolete tables can then be dropped or truncated, as sketched below.
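For example, dropping an expired daily table (the table name here is illustrative):
DROP TABLE IF EXISTS user_transactions_20160122;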
Maybe you should denormalize your data model. For example, to query by user_name you can use a column family like this:
CREATE TABLE user_transactions (
transaction_date timestamp, //date part of transactiom
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (user_name, transaction_id)
);
So you can query using the partition key directly like this:
SELECT * FROM user_transactions WHERE user_name = 'USER_NAME';
And for lookup by id you can use a second column family; it needs its own name, so call it, say, transactions_by_id:
CREATE TABLE transactions_by_id (
transaction_date timestamp, //date part of transactiom
user_name text,
transaction_id text,
transaction_time timestamp, //original transaction time
transaction_type int,
PRIMARY KEY (transaction_id)
);
so the query could be something like this:
SELECT * FROM transactions_by_id WHERE transaction_id = 'ID';
This way you don't need indexes.
As for the TTL, maybe you could programmatically ensure that you update all the columns in the row at the same time (in the same CQL statement).
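A minimal sketch of what that could look like: write both denormalized tables in one logged batch, with each statement carrying the same 30-day TTL (2592000 seconds); all values, and the transactions_by_id name, are illustrative:
BEGIN BATCH
INSERT INTO user_transactions (user_name, transaction_id, transaction_date, transaction_time, transaction_type)
VALUES ('alice', 'tx-001', '2016-01-22', '2016-01-22 16:35:12', 1) USING TTL 2592000;
INSERT INTO transactions_by_id (user_name, transaction_id, transaction_date, transaction_time, transaction_type)
VALUES ('alice', 'tx-001', '2016-01-22', '2016-01-22 16:35:12', 1) USING TTL 2592000;
APPLY BATCH;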
Perhaps my answer will be a little useful.
I would have done so:
CREATE TABLE user_transactions (
date timestamp,
user_name text,
id text,
type int,
PRIMARY KEY (id)
);
CREATE INDEX idx_user_transactions_uname ON user_transactions (user_name);
There is no need for the transaction_time timestamp column, because that time is set by Cassandra on each column and can be fetched with the WRITETIME(column_name) function. Because you write all the columns simultaneously, you can call this function on any column.
INSERT INTO user_transactions ... USING TTL 86400;
will expire all columns simultaneously. So do not worry about deleting rows. See here: Expiring columns.
But as far as I know, you cannot delete an entire row this way: the key column still remains, and the other columns are written as null.
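For reference, both the write time and the remaining TTL can be inspected per non-key column; a minimal sketch against the table above (the id value is made up):
SELECT id, user_name, WRITETIME(type) AS written_at_micros, TTL(type) AS ttl_seconds_left
FROM user_transactions
WHERE id = 'tx-001';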
If you want to delete the rows manually, or just want an estimate of the rows to be deleted by a TTL, then I recommend the Astyanax driver: AllRowsReader All rows query.
And indeed, as a driver for working with Cassandra, I recommend Astyanax.
