Cassandra light weight transaction confusion - cassandra

I am a bit confused about the terminology for lightweight transactions. I am not sure why most of the Cassandra literature says that they work only for a single partition.
When I use IF NOT EXISTS or IF EXISTS, the condition should apply to the whole primary key, not just the partition key, as this post also says: How the LWT- Light Weight Transaction is working when we use IF NOT EXIST?
However, in the book Cassandra: The Definitive Guide, I see this example:
INSERT INTO reservation.reservations_by_confirmation
(confirm_number,
hotel_id, start_date, end_date, room_number, guest_id) VALUES (
'RS2G0Z', 'NY456', '2020-06-08', '2020-06-10', 111,
1b4d86f4-ccff-4256-a63d-45c905df2677) IF NOT EXISTS;
This command checks to see if there is a record with the partition key, which for this table consists of the confirm_number. So let’s find out what happens when you execute this command a second time:
INSERT INTO reservation.reservations_by_confirmation
(confirm_number,
hotel_id, start_date, end_date, room_number, guest_id) VALUES (
'RS2G0Z', 'NY456', '2020-06-08', '2020-06-10', 111,
1b4d86f4-ccff-4256-a63d-45c905df2677) IF NOT EXISTS;
In this case, the transaction fails, because there is already a reservation with the number “RS2G0Z,” and cqlsh helpfully echoes back a row containing a failure indication and the values you tried to enter.
Now my question is: what happens if I run another query,
INSERT INTO reservation.reservations_by_confirmation
(confirm_number,
hotel_id, start_date, end_date, room_number, guest_id) VALUES (
'RS2G0Z', 'NY466', '2020-06-08', '2020-06-10', 111,
1b4d86f4-ccff-4256-a63d-45c905df2677) IF NOT EXISTS;
which has a different primary key but the same partition key? It should succeed.
So when the book says
This command checks to see if there is a record with the partition key
Isn't this a wrong statement? Please let me know if I am misinterpreting something

It's an error in the book. Although the diagram shows a compound primary key, the table reservation.reservations_by_confirmation has a very simple primary key, confirm_number. So in this case the queries work as described in the text, and a duplicate primary key cannot be inserted.
When you see the partition key mentioned in the context of LWTs, it usually means that coordination happens between the nodes that hold replicas of the given partition...
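To make the semantics concrete, here is a toy in-memory model in Python. It is not Cassandra code and it says nothing about how the Paxos coordination works; the class name ToyTable and the two-column key variant are hypothetical. It only illustrates that the IF NOT EXISTS check is made against the full primary key, which for reservations_by_confirmation happens to be confirm_number alone.

```python
class ToyTable:
    """Toy model of a table: rows are keyed by the full primary key."""

    def __init__(self, primary_key):
        # Column names that together form the primary key.
        self.primary_key = tuple(primary_key)
        self.rows = {}

    def insert_if_not_exists(self, row):
        """Mimic INSERT ... IF NOT EXISTS: returns (applied, existing_row)."""
        key = tuple(row[col] for col in self.primary_key)
        if key in self.rows:
            return False, self.rows[key]
        self.rows[key] = dict(row)
        return True, None


first = {"confirm_number": "RS2G0Z", "hotel_id": "NY456", "room_number": 111}
second = {"confirm_number": "RS2G0Z", "hotel_id": "NY466", "room_number": 111}

# Schema as described in the answer: the primary key is confirm_number only.
simple = ToyTable(["confirm_number"])
simple_results = [simple.insert_if_not_exists(first)[0],
                  simple.insert_if_not_exists(second)[0]]

# Hypothetical schema in which hotel_id is also part of the primary key.
compound = ToyTable(["confirm_number", "hotel_id"])
compound_results = [compound.insert_if_not_exists(first)[0],
                    compound.insert_if_not_exists(second)[0]]

print(simple_results)    # [True, False]
print(compound_results)  # [True, True]
```

With the simple key the second insert is rejected even though hotel_id differs; it would only succeed if hotel_id were part of the primary key.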

Related

Cloud Spanner complex primary key and queries

I'm playing with Cloud Spanner and I created an imgur clone with the schema as follows:
CREATE TABLE Images (id STRING(36) NOT NULL, createdAt TIMESTAMP, caption STRING(1024), fileType STRING(10)) PRIMARY KEY (id, createdAt DESC)
The id is a version 4 UUID as the GCP documentation specifies so that I avoid hotspots. The createdAt is a timestamp when an image is first created. I have my PRIMARY KEY defined as (id, createdAt DESC) so that I can more easily query by latest added images.
What I don't understand is what happens if I want to get a single image using only SELECT * FROM Images WHERE id = 'some UUID'? Will Spanner still search by key in an efficient way, meaning getting the information from the server that stores the specific key in its key range, even though I only specified a part of the primary key?
In your simple example, yes. It will try to come up with an efficient execution plan, which may include using an index (automatically created for PKs), even though your predicate is on just 1 column of the 2-column composite PK, because it is on the first column. If your predicate were only "...createdAt = ...", it would scan the table. Finding matches for col2 through a composite PK of (col1, col2) would be far more expensive than simply scanning for col2.
This assumes there's enough data to matter. For example, if you have 42 rows, it really won't matter how you execute the query or what predicates were provided; the number of I/O requests (often the most expensive part of a query) will be the same.
In general, Spanner tries to pick the index it thinks will be most efficient. The actual physical steps don't work like that but conceptually, it's a reasonable way to think about it.
Whether an index is helpful depends on a few things, and whether it gets picked also has dependencies: does it have statistics, are the statistics correct and fresh, is it making correct estimates of row counts, etc. Composite indexes/keys are just a bit more interesting, as noted above.
Just make sure you always test with enough data (closely matching your production environment if possible).
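As a rough illustration of why a predicate on the leading key column is cheap while a predicate on only the second column is not, here is a toy Python model of rows kept in primary-key order. It is not how Spanner executes queries; the data, the function names, and the ascending sort on createdAt are assumptions made for the sketch.

```python
import bisect

# Rows stored in primary-key order: (id, createdAt). Values are illustrative.
rows = sorted([
    ("a1", 30), ("a1", 10), ("b2", 20), ("c3", 10), ("c3", 40),
])


def lookup_by_id(rows, image_id):
    """Key-prefix lookup: binary search to the first matching key, then read forward."""
    start = bisect.bisect_left(rows, (image_id,))
    result = []
    for row in rows[start:]:
        if row[0] != image_id:
            break  # keys are sorted, so no later row can match
        result.append(row)
    return result


def lookup_by_created_at(rows, created_at):
    """No usable key prefix: every row has to be examined."""
    return [row for row in rows if row[1] == created_at]


print(lookup_by_id(rows, "c3"))        # [('c3', 10), ('c3', 40)]
print(lookup_by_created_at(rows, 10))  # [('a1', 10), ('c3', 10)]
```

The first function touches only the matching key range, while the second has to walk the whole list, which is the conceptual difference between the two predicates.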

ON CONFLICT operator in Cassandra

I have a table in Cassandra with 2 columns, id and date_proc, and I plan to do a lot of inserts. Is it possible to use something like ON CONFLICT in Postgres to get the previous value when inserting?
Could you tell me another way to avoid 2 requests to Cassandra (select and insert)? Maybe some solution in DataStax?
ddl:
create table test.date_dict (
id text,
date_proc text,
PRIMARY KEY (id));
example of inserting:
INSERT INTO test.date_dict (id, date_proc) VALUES ('1', '2020-01-01'); // return '2020-01-01'
INSERT INTO test.date_dict (id, date_proc) VALUES ('1', '2020-01-05'); // return '2020-01-01'
"Normal" inserts and updates in Cassandra are just appends to the memtable (which is later flushed to SSTables); no read happens during these operations. A write simply overwrites the previous data if that data has a lower timestamp.
You could potentially use lightweight transactions (LWTs) to achieve what you need: they return the previous value if there is a conflict (the row already exists when you use IF NOT EXISTS, or the value differs from the one you specify in the IF condition). But LWTs are very bad for performance, so they should be used carefully.
I would try to reformulate your task so that it fits the "normal" insert/update behavior.

How to search record using ORDER_BY without the partition keys

I'm debugging an issue, and the relevant logs should fall in the time range 4/23/19 to 4/25/19.
There are hundreds of millions of records in our production environment.
It's impossible to locate the target records with a random sort order.
Is there any workaround to search in a time range without the partition key?
select * from XXXX.report_summary order by modified_at desc
Schema
...
"modified_at" "TimestampType" "regular"
"record_end_date" "TimestampType" "regular"
"record_entity_type" "UTF8Type" "clustering_key"
"record_frequency" "UTF8Type" "regular"
"record_id" "UUIDType" "partition_key"
First, ORDER BY is really quite superfluous in Cassandra. It can only operate on your clustering columns within a partition, and then only on the exact order of the clustering columns. The reason is that Cassandra reads sequentially from disk, so it writes all data according to the defined clustering order to begin with.
So IMO, ORDER BY in Cassandra is pretty useless, except for cases where you want to change the sort direction (ascending/descending).
Secondly, due to its distributed nature, you need to take a query-oriented approach to data modeling. In other words, your tables must be designed to support the queries you intend to run. Now you can find ways around this, but then you're basically doing a full table scan on a distributed cluster, which won't end well for anyone.
Therefore, the recommended way to go about this would be to build a table like this:
CREATE TABLE stackoverflow.report_summary_by_month (
record_id uuid,
record_entity_type text,
modified_at timestamp,
month_bucket bigint,
record_end_date timestamp,
record_frequency text,
PRIMARY KEY (month_bucket, modified_at, record_id)
) WITH CLUSTERING ORDER BY (modified_at DESC, record_id ASC);
Then, this query will work:
SELECT * FROM report_summary_by_month
WHERE month_bucket = 201904
AND modified_at >= '2019-04-23' AND modified_at < '2019-04-26';
The idea here, is that as you care about the order of the results, you need to partition by something else to allow for sorting to work. For this example, I picked month, hence I've "bucketed" your results by month into a partition key called month_bucket. Within each month, I'm clustering on modified_at in DESCending order. This way, the most-recent results are at the "top" of the partition. Then, I threw in record_id as a tie-breaker key to help ensure uniqueness.
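The application has to compute month_bucket itself, both when writing a row and when deciding which partitions to query. A minimal Python sketch of that calculation follows; the function names are hypothetical, and it assumes the YYYYMM integer format used in the query above.

```python
from datetime import datetime


def month_bucket(ts):
    """Return the YYYYMM bucket for a timestamp, e.g. 201904 for April 2019."""
    return ts.year * 100 + ts.month


def buckets_between(start, end):
    """List every month bucket from start's month through end's month, inclusive.

    A range query has to be issued once per bucket (partition) in this list.
    """
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


print(month_bucket(datetime(2019, 4, 23)))                            # 201904
print(buckets_between(datetime(2019, 4, 23), datetime(2019, 4, 26)))  # [201904]
print(buckets_between(datetime(2019, 11, 20), datetime(2020, 1, 5)))  # [201911, 201912, 202001]
```

A time range that crosses a month boundary therefore means one query per bucket, with the results combined on the client side.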
If you're still focused on doing this the wrong way:
You can actually run a range query on your current schema. But with "hundreds of millions of records" across several nodes, I don't have high hopes for that to work. But you can do it with the ALLOW FILTERING directive (which you shouldn't ever really use).
SELECT * FROM report_summary
WHERE modified_at >= '2019-04-23'
AND modified_at < '2019-04-26' ALLOW FILTERING;
This approach has the following caveats:
With many records across many nodes, it will likely time out.
Without being able to identify a single partition for this query, a coordinator node will be chosen, and that node has a high chance of becoming overloaded.
As this is pulling rows from multiple partitions, a sort order cannot be enforced.
ALLOW FILTERING makes Cassandra work in ways that it really wasn't designed to, so I would never use that on a production system.
If you really need to run a query like this, I recommend using an in-memory aggregation tool, like Spark.
Also, as the original question was about ORDER BY, I wrote an article a while back which better explains this topic: https://www.datastax.com/dev/blog/we-shall-have-order

Cassandra data modeling for real time data

I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
account_id text,
value text,
PRIMARY KEY (account_id)
)
Data is being sent every X seconds per accountId, so we overwrite an existing row every time we receive an event. This data contains current real time information, and we only care about the most recent event (no use for older data, that is why we insert over an already existing key).
On the application's user-facing side, we run a SELECT by account_id.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
Thought about something like this:
CREATE TABLE current_data_2 (
account_id text,
time timeuuid,
value text,
PRIMARY KEY (account_id, time)
) WITH CLUSTERING ORDER BY (time DESC);
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
The question is how much better, if at all, the second data model is than the first. From what I understand, the main advantage will be in the reads: since the data is ordered by time, all I need to do is a simple
SELECT * FROM metrics WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL rows that were overwritten for the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).
Thanks.
First of all, I encourage you to read the official documentation on the read path.
data is ordered by time
This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).
Cassandra actually reads ALL rows that where overwritten the same key
and then chooses the last one by its write timestamp
This happens at the Merge Cells by Timestamp step in the documentation (again, check the flow diagram). Notice that in your first case the number of rows in each SSTable will be one.
In both of your cases the main driving factor is how many SSTables have to be checked during a read. That is somewhat independent of how many records each SSTable contains.
But in the second case you have much bigger SSTables, which leads to longer SSTable compaction. TTL expiration also performs additional writes. So the first case is somewhat preferable.
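The merge step can be pictured with a small Python model. This is a heavy simplification and not Cassandra's implementation (it ignores the memtable, tombstones, bloom filters and so on); the function name and sample data are made up. Each "SSTable" here maps a key to a (write timestamp, value) pair, and the newest timestamp wins.

```python
def merge_by_timestamp(sstables):
    """Merge per-key versions from several SSTables; the newest write timestamp wins."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Drop the timestamps and keep only the winning values.
    return {key: value for key, (ts, value) in merged.items()}


# First data model: every overwrite of an account ends up as one row
# in whichever SSTable was flushed after that write.
sstables = [
    {"acct-1": (100, "v1"), "acct-2": (105, "a")},
    {"acct-1": (300, "v3")},
    {"acct-1": (200, "v2")},
]

latest = merge_by_timestamp(sstables)
print(latest)  # {'acct-1': 'v3', 'acct-2': 'a'}
```

The cost of the loop grows with the number of SSTables that contain the key, not with how many other rows each SSTable holds, which matches the point made above.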

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters that callers can specify requests the most recent row; in that case I append "LIMIT 1" to the end of the CQL statement, since the table is ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device ids to get back the latest entries for. So my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when it has a lot of parameters: under the hood it makes requests to multiple partitions anyway, and it puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and not recommended. If you specify a limit, it applies to the whole statement, so you can't pick just the first item from each partition. The simplest option would be to issue multiple queries to the cluster (every element in the IN would become one query) and put a LIMIT 1 on each of them.
To be honest, this was my solution in a lot of projects and it works pretty much fine. The coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to gather all the responses for you, and it might run into timeouts, etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
This all applies in case you can't afford more disk space for your cluster.
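The client-side fan-out could be sketched in Python as follows. The helper latest_per_device and the stand-in fake_fetch_latest are hypothetical; in a real application fetch_latest would execute the single-device statement with LIMIT 1 through whatever driver you use, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def latest_per_device(fetch_latest, device_ids, max_workers=8):
    """Run one 'LIMIT 1' lookup per device id and collect the results by device id."""
    device_ids = list(device_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with device_ids.
        results = pool.map(fetch_latest, device_ids)
        return dict(zip(device_ids, results))


# Stand-in data: rows per device, already sorted newest first,
# as the clustering order (activity_timestamp DESC) would return them.
FAKE_DATA = {
    "dev-a": [(300, "a3"), (200, "a2"), (100, "a1")],
    "dev-b": [(250, "b2"), (150, "b1")],
}


def fake_fetch_latest(device_id):
    """Pretend to run the single-device query with LIMIT 1."""
    rows = FAKE_DATA.get(device_id, [])
    return rows[0] if rows else None


print(latest_per_device(fake_fetch_latest, ["dev-a", "dev-b", "dev-x"]))
# {'dev-a': (300, 'a3'), 'dev-b': (250, 'b2'), 'dev-x': None}
```

Each small query can be routed to a different coordinator, so no single node has to gather results for every device.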
Usual Cassandra solution
Data in Cassandra should be modeled so that it is ready for the query (query first). So you would need one additional table with the same partition key as you have now, but without the clustering column activity_timestamp, i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
The double (()) is intentional.
Every time you write to your table, you would also write the data to latest_entry (the table without activity_timestamp). Then you can run the query you need with IN; this table contains only the latest entry, so you don't have to use LIMIT 1 because there is only one entry per partition key. That would be the usual solution in Cassandra.
If you are afraid of the additional writes, don't worry: they are inexpensive and CPU-bound. With Cassandra it's always "bring on the writes", I guess :)
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such use of the IN clause. Indeed, it is supported on the last field of the primary key or the last field of the clustering key. So you can:
swap your two last fields of the primary key
use one query for each device id
