Cassandra get latest entry for each element contained within IN clause - cassandra

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters lets callers request only the most recent row. In that case I append "LIMIT 1" to the end of the CQL statement, since the table is ordered by the timestamp column in descending order. What I would like to do is allow callers to specify multiple device IDs and get back the latest entry for each. So, my question is, is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);

IN is not recommended when it takes a lot of parameters: under the hood it makes requests to multiple partitions anyway, and it puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not recommended. If you specify LIMIT, it applies to the whole statement, so you can't pick just the first row out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in the IN list becomes one query) and put LIMIT 1 on each of them.
To be honest, this was my solution in a lot of projects and it works pretty much fine. The coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to gather all the responses for you, and it might run into timeouts, etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
This is all in case you can't afford more disk space for your cluster.
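A minimal sketch of the one-query-per-device approach, in Python. `fetch_latest` is a hypothetical callable standing in for whatever runs the single-device "... LIMIT 1" statement through your driver; it is not part of any Cassandra client library. The helper only fans the calls out on a thread pool and collects the results by device id.

```python
from concurrent.futures import ThreadPoolExecutor


def latest_per_device(device_ids, fetch_latest, max_workers=8):
    """Run one "LIMIT 1" lookup per device id and collect the results.

    fetch_latest is a hypothetical callable: it takes a device id and
    returns the newest row for that device, or None if there is none.
    """
    unique_ids = list(dict.fromkeys(device_ids))  # drop duplicates, keep order
    if not unique_ids:
        return {}
    # Each lookup is an independent single-partition read, so they can
    # run concurrently and land on different coordinators.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(fetch_latest, unique_ids))
    # Leave out devices that have no rows at all.
    return {d: r for d, r in zip(unique_ids, rows) if r is not None}
```

In a real service `fetch_latest` would execute the prepared single-device statement; most drivers also offer asynchronous execution, which would serve the same purpose as the thread pool here.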
Usual Cassandra solution
Data in Cassandra should be modeled to be ready for the query (query first). So basically you would need one additional table with the same partition key as you have now, but without the clustering column activity_timestamp, i.e.:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
The double (()) is intentional.
Every time you write to your table, you would also write the data to latest_entry (the table without activity_timestamp). Then you can run the query you need with IN against this table. It contains only the latest entry, so you don't have to use LIMIT 1, because there is only one row per partition key. That would be the usual solution in Cassandra.
If you are afraid of the additional writes, don't worry: they are inexpensive and CPU bound. With Cassandra it's always "bring on the writes" I guess :)
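As a rough illustration of the double write, here is a Python sketch that builds the two statements for one incoming record. The `payload` column and the exact column lists are assumptions for the example (the question does not show the non-key columns), and the function only returns statement/parameter pairs; executing them is left to whatever driver you use.

```python
# Hypothetical column lists: "payload" stands in for the real data columns.
HISTORY_INSERT = (
    "INSERT INTO data (application_id, partner_id, location_id, device_id, "
    "data_schema, activity_timestamp, payload) VALUES (?, ?, ?, ?, ?, ?, ?)"
)
LATEST_UPSERT = (
    "INSERT INTO latest_entry (application_id, partner_id, location_id, "
    "device_id, data_schema, activity_timestamp, payload) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)


def statements_for_write(key, activity_timestamp, payload):
    """Return the two (cql, params) pairs to run for a single record.

    key is the 5-tuple (application_id, partner_id, location_id,
    device_id, data_schema) that forms the partition key of both tables.
    """
    if len(key) != 5:
        raise ValueError("key must hold the five partition key columns")
    params = (*key, activity_timestamp, payload)
    # Same values go to both tables. In latest_entry there is no
    # clustering column, so the insert overwrites the previous row.
    return [(HISTORY_INSERT, params), (LATEST_UPSERT, params)]
```

One caveat with this sketch: the overwrite in latest_entry follows write order, so if records can arrive out of order, the row that ends up there is the last one written, not necessarily the one with the newest activity_timestamp.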
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost

Your table definition is not suitable for this use of the IN clause. IN is supported only on the last column of the partition key or the last column of the clustering key. So you can:
swap the last two columns of your partition key
use one query for each device id

Related

Datamodel for Scylla/Cassandra for table partition key is not known beforehand -> static field?

I am using ScyllaDb, but I think this also applies to Cassandra since ScyllaDb is compatible with Cassandra.
I have the following table (I got ~5 of this kind of tables):
create table batch_job_conversation (
conversation_id uuid,
primary key (conversation_id)
);
This is used by a batch job to make sure some fields are kept in sync. In the application, a lot of concurrent writes/reads can happen. Once in a while, I will correct the values with a batch job.
A lot of writes can happen to the same row, so it will overwrite the rows. A batch job currently picks up rows with this query:
select * from batch_job_conversation
Then the batch job reads the data at that point and makes sure things are in sync. I think this query is bad because it stresses all the partitions and the coordinator node, since it needs to visit ALL partitions.
My question is whether it is better for this kind of table to have a fixed field. Something like this:
create table batch_job_conversation (
always_zero int,
conversation_id uuid,
primary key ((always_zero), conversation_id)
);
And then the query would be this:
select * from batch_job_conversation where always_zero = 0
For each batch job I can use a different partition key. The number of rows in these tables will be roughly the same (a few thousand at most). The same rows will probably be overwritten many times.
Is it better to have a fixed value? Is there another way to handle this? I don't have a logical partition key I can use.
second model would create a LARGE partition and you don't want that, trust me ;-)
(you would do a partition scan on top of large partition, which is worse than original full scan)
(and another advice - keep your partitions small and have a lot of them, then all your cpus will be used rather equally)
first approach is OK - and is called FULL SCAN, BUT
you need to manage it properly
there are several ways, we blogged about it in https://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
and basically it boils down to divide and conquer
also note spark implements full scans too
hth
L
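The divide-and-conquer idea can be sketched as splitting the token ring into contiguous sub-ranges and scanning each one with its own query. This is a rough Python sketch, assuming the default Murmur3 partitioner (tokens from -2^63 to 2^63 - 1). The number of sub-ranges and how many queries run in parallel are left open; the linked blog post discusses how to choose them.

```python
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1

# One bounded scan per sub-range instead of a single unbounded SELECT *.
SCAN_CQL = (
    "SELECT * FROM batch_job_conversation "
    "WHERE token(conversation_id) >= ? AND token(conversation_id) <= ?"
)


def token_ranges(n):
    """Split the full token ring into n contiguous, inclusive ranges."""
    if n < 1:
        raise ValueError("n must be at least 1")
    step = (MAX_TOKEN - MIN_TOKEN + 1) // n
    ranges = []
    start = MIN_TOKEN
    for i in range(n):
        # The last range absorbs the remainder so the ring is fully covered.
        end = MAX_TOKEN if i == n - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Each `(start, end)` pair is bound to the two markers in `SCAN_CQL`. Because the ranges do not overlap, every partition is visited exactly once across all the queries.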

How to search record using ORDER_BY without the partition keys

I'm debugging an issue and the logs should be sitting on a time range between 4/23/19~ 4/25/19
There are hundreds of millions of records on our production.
It's impossible to locate the target records using random sort.
Is there any workaround to search in a time range without partition key?
select * from XXXX.report_summary order by modified_at desc
Schema
...
"modified_at" "TimestampType" "regular"
"record_end_date" "TimestampType" "regular"
"record_entity_type" "UTF8Type" "clustering_key"
"record_frequency" "UTF8Type" "regular"
"record_id" "UUIDType" "partition_key"
First, ORDER BY is really quite superfluous in Cassandra. It can only operate on your clustering columns within a partition, and then only on the exact order of the clustering columns. The reason for this, is that Cassandra reads sequentially from the disk, so it writes all data according to the defined clustering order to begin with.
So IMO, ORDER BY in Cassandra is pretty useless, except for cases where you want to change the sort direction (ascending/descending).
Secondly, due to its distributed nature, you need to take a query-oriented approach to data modeling. In other words, your tables must be designed to support the queries you intend to run. Now you can find ways around this, but then you're basically doing a full table scan on a distributed cluster, which won't end well for anyone.
Therefore, the recommended way to go about that, would be to build a table like this:
CREATE TABLE stackoverflow.report_summary_by_month (
record_id uuid,
record_entity_type text,
modified_at timestamp,
month_bucket bigint,
record_end_date timestamp,
record_frequency text,
PRIMARY KEY (month_bucket, modified_at, record_id)
) WITH CLUSTERING ORDER BY (modified_at DESC, record_id ASC);
Then, this query will work:
SELECT * FROM report_summary_by_month
WHERE month_bucket = 201904
AND modified_at >= '2019-04-23' AND modified_at < '2019-04-26';
The idea here, is that as you care about the order of the results, you need to partition by something else to allow for sorting to work. For this example, I picked month, hence I've "bucketed" your results by month into a partition key called month_bucket. Within each month, I'm clustering on modified_at in DESCending order. This way, the most-recent results are at the "top" of the partition. Then, I threw in record_id as a tie-breaker key to help ensure uniqueness.
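For illustration, here is a small Python sketch of how the application side might compute month_bucket values, both when writing a row and when a time range spans more than one month. The function names are made up for this example; only the YYYYMM format comes from the query above.

```python
from datetime import date


def month_bucket(day):
    """Return the YYYYMM bucket for a date or datetime, e.g. 201904."""
    return day.year * 100 + day.month


def buckets_between(start, end):
    """List every YYYYMM bucket touched by the range start..end (inclusive).

    A range that crosses a month boundary needs one query per bucket,
    because each bucket is a separate partition.
    """
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


print(month_bucket(date(2019, 4, 23)))                         # 201904
print(buckets_between(date(2019, 12, 15), date(2020, 2, 1)))   # [201912, 202001, 202002]
```

For the 4/23/19 to 4/25/19 window in the question, a single bucket (201904) is enough.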
If you're still focused on doing this the wrong way:
You can actually run a range query on your current schema. But with "hundreds of millions of records" across several nodes, I don't have high hopes for that to work. But you can do it with the ALLOW FILTERING directive (which you shouldn't ever really use).
SELECT * FROM report_summary
WHERE modified_at >= '2019-04-23'
AND modified_at < '2019-04-26' ALLOW FILTERING;
This approach has the following caveats:
With many records across many nodes, it will likely time out.
Without being able to identify a single partition for this query, a coordinator node will be chosen, and that node has a high chance of becoming overloaded.
As this is pulling rows from multiple partitions, a sort order cannot be enforced.
ALLOW FILTERING makes Cassandra work in ways that it really wasn't designed to, so I would never use that on a production system.
If you really need to run a query like this, I recommend using an in-memory aggregation tool, like Spark.
Also, as the original question was about ORDER BY, I wrote an article a while back which better explains this topic: https://www.datastax.com/dev/blog/we-shall-have-order

How does ALLOW FILTERING work when we provide all of the partition keys?

I've read at least 50 articles on this and still don't know the answer ...
I know how partitioning, clustering and ALLOW FILTERING work, but can't figure out what is the situation behind using ALLOW FILTERING with all partition keys provided in a query.
I have a table like this:
CREATE TABLE IF NOT EXISTS keyspace.events (
date_string varchar,
starting_timestamp bigint,
event_name varchar,
sport_id varchar,
PRIMARY KEY ((date_string), starting_timestamp, id)
);
How does a query like this work?
SELECT * FROM keyspace.events
WHERE
date_string IN ('', '', '') AND
starting_timestamp < '' AND
sport_id = 1 /* not in partitioning nor clustering key */
ALLOW FILTERING;
Is the 'sport_id' filtering done on records retrieved earlier by the correctly defined keys? Is ALLOW FILTERING still discouraged in this kind of query?
How should I perform filtering in this particular situation ?
Thanks in advance
Yes, it should first narrow down to the requested partitions and only then filter on the non-key value, as per the experiment described here: https://dzone.com/articles/apache-cassandra-and-allow-filtering
I think it's safe to use ALLOW FILTERING after all the keys in most cases.
It will depend heavily on how much data you are filtering out as well. If the last condition, sport_id = 1, filters out most of the data, then it is a bad idea because it puts a lot of pressure on the database, so you need to consider the trade-offs here.
It's not a good idea to use an IN clause on the partition key. The above query in particular doesn't look good, because it uses both an IN clause on the partition key and ALLOW FILTERING.
Suggestion: Cassandra is very good at processing as many requests per second as you need, and the design idea should be to send many lighter queries at once rather than one query that does a lot of work. So my suggestion would be to fire N calls to Cassandra, each with an = condition on the partition key and without filtering on the last column, and then combine the results and do the final filter in code (whichever language you are using, I assume it supports sending all these calls to the database in parallel). By doing so you will gain performance in the long term as the data grows.
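The suggestion can be sketched in Python as follows. `fetch_partition` is a hypothetical callable that runs the query for a single date_string (with the starting_timestamp condition, but without sport_id and without ALLOW FILTERING) and returns its rows; `keep` is the final filter applied in application code. Neither name comes from a driver API.

```python
from concurrent.futures import ThreadPoolExecutor


def query_partitions_and_filter(date_strings, fetch_partition, keep, max_workers=8):
    """Query each partition separately, then filter the combined rows.

    fetch_partition (hypothetical) takes one date_string and returns an
    iterable of rows; keep takes a row and returns True to retain it.
    """
    keys = list(dict.fromkeys(date_strings))  # one call per distinct partition
    if not keys:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_partition = list(pool.map(fetch_partition, keys))
    # The non-key condition is applied here instead of inside Cassandra.
    return [row for rows in per_partition for row in rows if keep(row)]
```

A call might look like `query_partitions_and_filter(dates, fetch, lambda row: row["sport_id"] == "1")`. Whether filtering in the client beats ALLOW FILTERING depends on how many rows the filter discards, as discussed above.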

Is a read with one secondary index faster than a read with multiple in cassandra?

I have this structure where I want a user to see other users' feeds.
One way of doing it is to fan out an action to all interested parties' feeds.
That would result in a query like select from feeds where userid=
Otherwise I could avoid writing so much data and, since I am already doing a read, I could do:
select from feeds where userid IN (list of friends).
Is the second one slower? I don't have the application yet to test this with a lot of data/clustering. As the application is big, writing code to test a single node is not worth it, so I ask for your knowledge.
If your title is correct, and userid is a secondary index, then running a SELECT/WHERE/IN is not even possible. The WHERE/IN clause only works with primary key values. When you use it on a column with a secondary index, you will see something like this:
Bad Request: IN predicates on non-primary-key columns (columnName) is not yet supported
Also, the DataStax CQL3 documentation for SELECT has a section worth reading about using IN:
When not to use IN
The recommendations about when not to use an index apply to using IN
in the WHERE clause. Under most conditions, using IN in the WHERE
clause is not recommended. Using IN can degrade performance because
usually many nodes must be queried. For example, in a single, local
data center cluster with 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes
being queried are most likely even higher, up to 20 nodes depending on
where the keys fall in the token range.
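The arithmetic behind those numbers can be written out as a short Python sketch. The quoted passage does not say how many keys the IN list holds; 10 keys is my assumption, chosen because it reproduces the "up to 20 nodes" figure. The result is only an upper bound, since several keys may share replicas.

```python
def quorum(replication_factor):
    """Replicas that must answer for a quorum read: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def max_replicas_contacted(num_keys, replication_factor, cluster_size):
    """Upper bound on distinct nodes read for an IN over num_keys partitions.

    Worst case: no two keys share a replica, so each key adds its own
    quorum of nodes, capped by the size of the cluster.
    """
    return min(num_keys * quorum(replication_factor), cluster_size)


print(quorum(3))                          # 2 nodes for a single-key query
print(max_replicas_contacted(10, 3, 30))  # 20 nodes for a 10-key IN (assumed size)
```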
As for your first query, it's hard to speculate about performance without knowing about the cardinality of userid in the feeds table. If userid is unique or has a very high number of possible values, then that query will not perform well. On the other hand, if each userid can have several "feeds," then it might do ok.
Remember, Cassandra data modeling is about building your data structures for the expected queries. Sometimes, if you have 3 different queries for the same data, the best plan may be to store that same, redundant data in 3 different tables. And that's ok to do.
I would tackle this problem by writing a table geared toward that specific query. Based on what you have mentioned, I would build it like this:
CREATE TABLE feedsByUserId (
userid UUID,
feedid UUID,
action text,
PRIMARY KEY (userid, feedid));
With a composite primary key that uses userid as the partition key, you will then be able to run your SELECT/WHERE/IN query mentioned above and achieve the expected results. Of course, I am assuming that the addition of feedid will make the entire key unique. If that is not the case, then you may need to add an additional field to the PRIMARY KEY. My example also assumes that userid and feedid are version-4 UUIDs. If that is not the case, adjust their types accordingly.

Cassandra or Hbase?

I have a requirement, where I want to store the following:
Mac Address // PKEY
TimeStamp // PKEY
LocationID
ownerName
Signal Strength
The insertion logic is as follows:
Store the above statistics for each active device (MacAddress) once every hour at each location (LocationID)
The entries are created at end of each hour, so the primary key will always be MAC+TimeStamp
There are no updates, only insertions
The queries which can be performed are as follows:
Give me all the entries for last 'N' hours Where MacAddress = "...."
Give me all the entries for last 'N' hours Where LocationID IN (locID1, locID2, ..);
Needless to say, there are billions of entries, and I want to use either HBASE or Cassandra. I've tried to explore, and it seems that Cassandra may not be correct choice.
The reasons for that is if I have the following in cassandra:
< < RowKey > MacAddress:TimeStamp > >
+ LocationID
+ OwnerName
+ Signal Strength
Both the queries will scan the whole database, right? Even if I add an index on LocationID, that is only going to help in the second query to some extent, because there is no index on timestamp (I believe that searching on timestamp is not fast, as the MacAddress:TimeStamp composite key would not allow us to search only on timestamp, and instead a full scan would happen, is that correct?).
I'm stuck here big time, and any insights on whether we should opt for HBase or Cassandra would really help.
The right way to model this with Cassandra is to use a table partitioned by mac address, ordered by timestamp, and indexed on location id. See the Cassandra data model documentation, especially the section on clustering [predefined sorting]. None of your queries will require a full table scan.
You have to remember that NoSql instances like Cassandra allow horizontal scaling and make it a lot easier to shard the data. By developing a shard strategy (identifying shard key, etc) you could dramatically reduce the size of the data on a single instance and make queries (even when trying to query massive data sets) doable.
Either one would work for this query:
Give me all the entries for last 'N' hours Where MacAddress = "...."
In cassandra you would want to use an ordered partitioner so you can do easy scans. That way you would not have to scan the entire table. (I'm a little rusty on Cassandra).
In hbase it is always ordered by the rowkey so the scan becomes easy. You would just set a start and stop rowkey. Conceptually it would be:
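Scan scan = new Scan();
// Row keys are byte arrays; Bytes is org.apache.hadoop.hbase.util.Bytes
scan.setStartRow(Bytes.toBytes(mac + ":" + timestamp));
scan.setStopRow(Bytes.toBytes(mac + ":" + endtimestamp));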
And then it would only scan over the rows for the given mac address for the given time period--only a small subset of the data.
This query is much harder:
Give me all the entries for last 'N' hours Where LocationID IN
(locID1, locID2, ..);
Cassandra does have secondary indexes so it seems like it would be "easy" but I don't know how much data it would scan through. I haven't looked at Cassandra since it added secondary indexes.
In hbase you'd have to scan the entire table or create a second table. I would recommend creating a second table where the rowkey would be < location:timestamp > and you'd duplicate the data. Then you'd use that table to lookup the data by location using a scan and setting the start and end keys.
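One detail worth sketching for the second-table idea: HBase compares row keys as raw bytes, so the timestamp part of a location:timestamp key needs a fixed width, or a range scan will not follow numeric order. The Python sketch below only builds the keys and scan bounds. The 13-digit millisecond format and the function names are assumptions for the example, not something prescribed by HBase.

```python
def row_key(prefix, timestamp_ms):
    """Build a "prefix:timestamp" row key with a zero-padded timestamp.

    Zero-padding makes byte-wise (lexicographic) order match numeric
    order, which is what a start/stop row scan relies on.
    """
    if timestamp_ms < 0:
        raise ValueError("timestamp_ms must be non-negative")
    return f"{prefix}:{timestamp_ms:013d}".encode("utf-8")


def scan_bounds(prefix, start_ms, end_ms):
    """Return (start_row, stop_row) for one location and a time window.

    The stop row is treated as exclusive, which is how HBase scans
    handle it by default.
    """
    return row_key(prefix, start_ms), row_key(prefix, end_ms)


# Without padding, "loc1:5" would sort after "loc1:10".
assert row_key("loc1", 5) < row_key("loc1", 10)
```

For the LocationID IN (...) query, this gives one scan per location id, each limited to the rows of the last N hours.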

Resources