I have a table with schema:
CREATE TABLE messages (
chatroom_id,
id,
createdAt,
senderType,
...,
PRIMARY KEY ((chatroom_id), createdAt)
) WITH CLUSTERING ORDER BY (createdAt DESC);
I also have a secondary index on the senderType column of this table.
All queries until now needed results ordered by createdAt DESC,
but now I need a new query like:
select * from messages where chatroom_id = xx
and senderType = yy
order by createdAt ASC;
Is there any way to support this query other than a materialized view?
Thanks.
Unfortunately, Cassandra has an explicit check that prevents secondary indexes from being used in a query with ORDER BY.
IMHO you have the following options:
- create a new table & fill it from your code. Performance-wise this can be much faster than a materialized view, but it requires more coding in your app;
- use a materialized view - it's slower than an explicit table, but doesn't require any additional code. Remember, though, that MVs are still experimental in Cassandra, and you can get inconsistencies;
- perform the sorting in your application - if you don't have much data, you can fetch it into your app and sort it there. In this case, a query like select * from messages where chatroom_id = xx and senderType = yy; will work; it just returns the data in DESC order...
For options 1 & 2 I would recommend changing the table or MV structure to include senderType as part of the primary key, as queries will be much faster to perform, something like:
PRIMARY KEY ((chatroom_id), senderType, createdAt)
) WITH CLUSTERING ORDER BY (senderType ASC, createdAt ASC);
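For example, a full definition of such a query table might look like this (a sketch: the table name and the column types are assumptions, since the question's schema elides them; note that CLUSTERING ORDER BY has to list senderType first, because it precedes createdAt in the clustering key):
CREATE TABLE messages_by_sender (
chatroom_id text,    -- assumed type
id timeuuid,         -- assumed type
createdAt timestamp, -- assumed type
senderType text,     -- assumed type
PRIMARY KEY ((chatroom_id), senderType, createdAt)
) WITH CLUSTERING ORDER BY (senderType ASC, createdAt ASC);
-- Within one (chatroom_id, senderType) pair the rows are already stored
-- in createdAt ASC order, so no ORDER BY clause is needed:
SELECT * FROM messages_by_sender WHERE chatroom_id = 'xx' AND senderType = 'yy';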
I've just started studying Cassandra.
Here is a table and a query.
CREATE TABLE finance.tickdata(
id_symbol int,
ts timestamp,
bid double,
ask double,
PRIMARY KEY(id_symbol,ts)
);
And this query succeeds:
select ts,ask,bid
from finance.tickdata
where id_symbol=3
order by ts desc;
Next, we decided to move id_symbol into the table name; here is the new table script.
CREATE TABLE IF NOT EXISTS mts_src.ticks_3(
ts timestamp PRIMARY KEY,
bid double,
ask double
);
And now this query fails:
select * from mts_src.ticks_3 order by ts desc
I read in the docs that I need to filter (WHERE) by the primary key (partition key),
but technically both my examples are the same. Why is Cassandra so restrictive in this respect?
And one more question: is it a good idea in general to move id_symbol into the table name?
Potentially there can be 1000 unique id_symbol values, with a lot of data for each. Separating this data into individual tables looked like a good idea!? But I lose the ORDER BY capability, which I really need to fetch fresh data for each id_symbol.
Thanks.
You can't sort on the partition key; you can sort only on clustering columns inside a single partition. So you need to model your data accordingly. But you need to be very careful not to create very large partitions (when using id_symbol as the partition key, for example). In that case you may need to create a composite partition key, like id_symbol + year, or month, depending on how often you're inserting data.
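A sketch of that composite partition key, reusing the keyspace and columns from the question (the table name and the yearly bucketing are assumptions):
CREATE TABLE mts_src.ticks (
id_symbol int,
year int,  -- bucket that keeps partitions bounded (assumed granularity)
ts timestamp,
bid double,
ask double,
PRIMARY KEY ((id_symbol, year), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
-- Fresh data for one symbol: the newest rows come back first.
SELECT ts, bid, ask FROM mts_src.ticks WHERE id_symbol = 3 AND year = 2020 LIMIT 100;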
Regarding a table per ticker: that's not a very good idea, because every table has overhead and it leads to increased resource consumption. 200 tables is already a high number, and 500 is almost a "hard limit".
I have a very simple data table, but after reading a lot of examples on the internet I am still more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without the primary key defined, as that is exactly my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz'. Of course, uid must be unique. You might want to consider the uuid data type instead of text.
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
- make sure a single partition won't grow too large (max ~100MB, otherwise you will run into trouble);
- satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine as long as you don't have too many documents within a day (so your partitions don't grow too much). And it allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6 AND day<=24; If you need to issue a range query across multiple months, you will need to issue multiple queries, as shown below.
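For instance, a range from 6 December 2018 through 24 January 2019 splits into one query per month partition:
SELECT * FROM documents_by_date WHERE year=2018 AND month=12 AND day>=6;
SELECT * FROM documents_by_date WHERE year=2019 AND month=1 AND day<=24;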
If your partitions are too large due to the data field, you will need to remove it from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
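A sketch of that two-step lookup:
-- step 1: find the matching document ids by date
SELECT uid FROM documents_by_date WHERE year=2018 AND month=12 AND day=6;
-- step 2: fetch the payload for each returned uid from the base table
SELECT data FROM documents WHERE uid='xxxx-yyyy-zzzzz';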
If your partitions are still too large, you will need to add hour to the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify how your data is distributed with respect to user and creation time, but since it's a document, I am assuming that one user creates one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) helps you get the data ordered by created for a given user.
For your first requirement, you can query as given below:
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement, you can query as given below:
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used diligently, as it scans all partitions. If we were instead to create a separate table with the date as the primary key, it would become tricky when many documents are inserted at the very same second. Clustering order works best for requirements where the documents of a given user need to be sorted by time.
Our use case with Cassandra is to show the top 10 recent visitors of a blogpost. The following is the Cassandra table definition:
CREATE TABLE blogs_by_visitor (
blogposturl text,
visitor text,
visited_ts timestamp,
PRIMARY KEY (blogposturl, visitor)
);
Now, in order to show the top 10 recent visitors for a given blogpost, there needs to be an explicit "order by" clause on the timestamp, descending. Since visited_ts isn't one of the clustering columns, we aren't able to get this done. The reason visited_ts is not a clustering column is to avoid recording repeat (read: duplicate) visitors. The primary key is designed this way to upsert the latest timestamp for a repeat visitor.
In RDBMS world the query would look like the following and a secondary index could be created with blogposturl and timestamp columns.
Select visitor from blog_table
where
blogposturl = ?
and rownum <= 10
order by timestamp desc
An alternative currently followed in our Cassandra application is to fetch the results and then sort them by timestamp on the app side. But what if a particular blogpost becomes so popular that it gets more than 100,000 visitors? The query becomes really slow for those blogs.
I'm thinking a secondary index wouldn't be useful here, as I don't need to filter on it (rather just sort by it, which isn't possible).
Any idea on how we could model the table differently?
The actual table has additional columns; I reduced it here for simplicity.
These types of jobs are done by Apache Spark or Hadoop: a scheduled job which computes the unique visitors ordered by timestamp for each URL and stores the result in Cassandra.
Or you can create a Materialized View on top of blogs_by_visitor. The base table makes sure the visitors are unique, and the materialized view orders the results by the visited_ts timestamp.
Let's create the Materialized View:
CREATE MATERIALIZED VIEW unique_visitor AS
SELECT *
FROM blogs_by_visitor
WHERE blogposturl IS NOT NULL AND visitor IS NOT NULL AND visited_ts IS NOT NULL
PRIMARY KEY (blogposturl, visited_ts, visitor)
WITH CLUSTERING ORDER BY (visited_ts DESC, visitor ASC);
Now you can just select the 10 most recent unique visitors of a blogpost.
SELECT * FROM unique_visitor WHERE blogposturl = ? LIMIT 10;
You can see that I haven't specified a sort order in the select query, because the materialized view schema already specifies the default sort order visited_ts DESC.
Note that the above schema will generate a huge amount of tombstones in the Materialized View: every time a repeat visitor's visited_ts is upserted in the base table, the old view row (keyed by the previous visited_ts) has to be deleted.
Or you could change your table schema as below:
CREATE TABLE blogs_by_visitor (
blogposturl text,
year int,
month int,
day int,
visitor text,
visited_ts timestamp,
PRIMARY KEY ((blogposturl, year, month, day), visitor)
);
Now you have only a small amount of data in a single partition, so you can sort all the visitors by visited_ts in that single partition on the client side, as sketched below. If you think the number of visitors in a day can be huge, add hour to the partition key as well.
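A sketch of the corresponding read (the blogposturl value is a placeholder); the final ordering and the LIMIT 10 happen in the application:
SELECT visitor, visited_ts FROM blogs_by_visitor
WHERE blogposturl = 'http://example.com/post' AND year = 2018 AND month = 12 AND day = 6;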
Very new to Cassandra, so apologies if the question is simple.
I created a table:
create table ApiLog (
LogId uuid,
DateCreated timestamp,
ClientIpAddress varchar,
primary key (LogId, DateCreated));
This works fine:
select * from apilog
If I try to add a where clause with the DateCreated like this:
select * from apilog where datecreated <= '2016-07-14'
I get this:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
From other questions here on SO and from the DataStax tutorials, my understanding is that since the datecreated column is a clustering key, it can be used to filter data.
I also tried to create an index, but I get the same message back. And I tried removing DateCreated from the primary key and having it only as an index, and I still get the same message:
create index ApiLog_DateCreated on dotnetdemo.apilog (datecreated);
The partition key LogId determines on which node each partition is stored. So if you don't specify the partition key, Cassandra has to filter all the partitions of this table on all the nodes to find matching data. That's why you have to say ALLOW FILTERING: that operation is very inefficient and therefore discouraged.
If you specify a specific LogId, then Cassandra can find the partition on a single node and efficiently do a range query by the clustering key.
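For example, against the apilog table from the question, fixing the partition key makes a range on the clustering key legal without ALLOW FILTERING (the uuid value is a placeholder):
SELECT * FROM apilog
WHERE logid = 0eb297a2-49d8-11e6-8aef-0b2d1b5ccd4e
AND datecreated <= '2016-07-14';
(Of course, with a unique LogId per row each partition holds a single entry, which is exactly why the schema itself needs rethinking, as noted next.)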
So you need to plan your schema such that you can do your range queries within a single partition and not have to do a full table scan like you're trying to do.
When your query is rejected by Cassandra because it needs filtering, you should resist the urge to just add ALLOW FILTERING to it. You should think about your data, your model, and what you are trying to do. You always have multiple options:
you can change your data model, add an index, use another table, or use ALLOW FILTERING.
You have to make the right choice for your specific use case.
Anyway, if you just want to make it work:
select * from apilog where datecreated <= '2016-07-14' limit 1000 ALLOW FILTERING;
I'm using Cassandra 2.1 and have a model that roughly looks as follows:
CREATE TABLE events (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((client_id, bucket), timestamp)
);
CREATE INDEX events_ticket ON events(ticket_id);
As you can see, I've created a secondary index on ticket_id. This index works OK. events contains around 100 million rows, while only 5 million of those rows have a ticket_id, spread over around 50,000 distinct tickets. So a ticket has, on average, 100 events.
Querying the secondary index works without supplying the partition key, which is convenient in our situation, as the bucket column is sometimes hard to determine beforehand (i.e. you would have to know the date of the events; bucket is currently the date).
cqlsh> select * from events where ticket_id = 123;
client_id | bucket | timestamp | ... | ticket_id
-----------+--------+-----------+-----+-----------
(0 rows)
How do I solve the problem when all events of a ticket should be moved to another ticket? I.e. the following query won't work:
cqlsh> UPDATE events SET ticket_id = 321 WHERE ticket_id = 123;
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY ticket_id found in where clause"
Does this imply secondary indexes cannot be used in UPDATE queries?
What model should I use to support these changes?
First of all, UPDATE and INSERT operations are treated the same in Cassandra. They are colloquially known as "UPSERTs."
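For example, against the events table above, these two statements result in the same stored row (the key values are placeholders):
INSERT INTO events (client_id, bucket, timestamp, ticket_id)
VALUES (2112, 20150422, 4a7e2730-e929-11e4-88c8-21b264d4c94d, 123);
-- writes the same row as:
UPDATE events SET ticket_id = 123
WHERE client_id = 2112 AND bucket = 20150422
AND timestamp = 4a7e2730-e929-11e4-88c8-21b264d4c94d;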
Does this imply secondary indexes cannot be used in UPDATE queries?
Correct. You cannot perform an UPSERT in Cassandra without specifying the complete PRIMARY KEY. Even UPSERTs with a partial PRIMARY KEY will not work. And (as you have discovered) UPSERTing by an indexed value does not work, either.
How do I solve the problem when all events of a ticket should be moved to another ticket?
Unfortunately, the only way to accomplish this is to query the keys of each row in events (with the particular ticket_id) and UPSERT ticket_id by those keys. The nice thing is that you don't have to DELETE them first, because ticket_id is not part of the PRIMARY KEY.
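A sketch of that flow (the key values are placeholders):
-- find the full primary key of every event on the old ticket, via the index
SELECT client_id, bucket, timestamp FROM events WHERE ticket_id = 123;
-- then, for each row returned, upsert the new ticket_id by its full key
UPDATE events SET ticket_id = 321
WHERE client_id = 2112 AND bucket = 20150422
AND timestamp = 4a7e2730-e929-11e4-88c8-21b264d4c94d;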
How do I solve the problem when all events of a ticket should be moved to another ticket?
I think your best plan here would be to forgo the secondary index altogether, and create a query table to work alongside your events table:
CREATE TABLE eventsbyticketid (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((ticket_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
This would allow you to query by ticket_id quickly (to obtain your client_id, bucket, and timestamp), which gives you the information you need to UPSERT the new ticket_id on your events table.
You could also then perform a DELETE by ticket_id (on the eventsbyticketid table). Cassandra does allow a DELETE operation with a partial PRIMARY KEY, as long as you have the full partition key (ticket_id). So removing old ticket_ids from the query table would be easy. And to ensure write atomicity, you could batch the UPSERTs together:
BEGIN BATCH
UPDATE events SET ticket_id = 321 WHERE client_id=2112 AND bucket=20150422 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
UPDATE eventsbyticketid SET client_id=2112, bucket=20150422 WHERE ticket_id=321 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
APPLY BATCH;
Which is actually the same as performing:
BEGIN BATCH
INSERT INTO events (client_id,bucket,timestamp,ticket_id) VALUES (2112,20150422,4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
INSERT INTO eventsbyticketid (client_id,bucket,timestamp,ticket_id) VALUES (2112,20150422,4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
APPLY BATCH;
Side note: timestamp is actually a data type (and reserved word) in Cassandra, which makes it a pretty lousy name for a timeuuid column.
You can use the secondary index to query the events for the old ticket, and then use the primary keys of those retrieved events to update them.
I'm not sure why you need to do this manually; it seems like something Cassandra should be able to do under the hood.