Is it possible to do sequential batch in cassandra.
eg:
insert into table1 and take uuid from this insert operation and pass this to table2 insert statement.
If table 2 insert fails, fail the entire operaion.
If not , whats my best option?
(Its kind of transactional)
You'r best shot is Cassandra Batch statement:
BATCH - Cassandra documentation
Combined with "IF EXISTS" constraints (like here: DELETE - Cassandra documentation) it may be what you need.
However, I don't believe there is a possibility to "insert into table1 and take uuid from this insert operation and pass this to table2 insert statement". You can think of batches in C* as transactions in SQL - it's fully executed or not.
Important things to note:
batches can span multiple tables in C*
although batches are atomic, they are not isolated. Some portion of a batch can be executed, in another query you can read those changes, but it may happen they will be revoked because the batch will fail.
Related
How can I delete a row from Cassandra and get the value it had just before the deletion?
I could execute a SELECT and DELETE query in series, but how can I be sure that the data was not altered concurrently between the execution of those two queries?
I've tried to execute the SELECT and DELETE queries in a batch but that seems to be not allowed.
cqlsh:foo> BEGIN BATCH
... SELECT * FROM data_by_user WHERE user = 'foo';
... DELETE FROM data_by_user WHERE user = 'foo';
... APPLY BATCH;
SyntaxException: line 2:4 mismatched input 'SELECT' expecting K_APPLY (BEGIN BATCH [SELECT]...)
In my use case I have one main table that stores data for items. And I've build several tables that allow to lookup items based on those informations.
If I delete an item from the main table, I must also remove it from the other tables.
CREATE TABLE items (id text PRIMARY KEY, owner text, liking_users set<text>, ...);
CREATE TABLE owned_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
CREATE TABLE liked_items_by_user (user text, item_id tect, PRIMARY KEY ((user), item_id));
...
I'm afraid the tables might contain wrong data if I delete an item and at the same time someone e.g. hits the like button of that same item.
The deleteItem method execute a SELECT query to fetch the current row of the item from the main table
The likeItem method that gets executed at the same times runs an UPDATE query and inserts the item into the owned_items_by_user, liked_items_by_user, ... tables. This happens after the SELECT statement was executed and the UPDATE query is executed before the DELETE query.
The deleteItem method deletes the items from the owned_items_by_user, liked_items_by_user, ... tables based on the data just retrieved via the SELECT statement. This data does not yet contain the just added like. The item is therefore deleted, but the just added like remains in the liked_items_by_user table.
You can do a select beforehand, then do a lightweight transaction on the delete to ensure that the data still looks exactly like it did when you selected. If it does, you know the latest state before you deleted. If it does not, keep retrying the whole procedure until it sticks.
Unfortunately you cannot do a SELECT query inside a batch statement. If you read the docs here, only insert, update, and delete statements can be used.
What you're looking for is atomicity on the execution, but batch statements are not going to be the way forward. If the data has been altered, your worst case situation is zombies, or data that could reappear.
Cassandra uses a grade period mechanism to deal with this, you can find the details here. If for whatever reason, this is critical to your business logic, the "best" thing you can do in this situation is to increase the consistency level, or restructure the read pattern at application level to not rely on perfect atomicity, whichever the right trade off is for you. So either you give up some of the performance, or tune down the requirement.
In practice, QUORUM should be more than enough to satisfy most situations most of the time. Alternatively, you can do an ALL, and you pay the performance penalty, but that means all replicas for the given foo partition key will have to acknowledge the write both in the commitlog and the memtable. Note, this still means a flush from the commitlog will need to happen before the delete is complete, but you can tune the consistency to the level you require.
You don't have atomicity in the SQL sense, but depending on throughput it's unlikely that you will need it(touch wood).
TLDR:
USE CONSISTENCY ALL;
DELETE FROM data_by_user WHERE user = 'foo';
That should do the trick. The error you're seeing now is basically the ANTLR3 Grammar parser for CQL 3, which is not designed to accept to SELECT queries inside batches simply because they are not supported, you can see that here.
As per Datastax documentation a read before a write in Cassandra is an anti pattern.
Whenever we use UPDATE either in CQLSH or using the Datastax drivers to set a few columns (with IFs & collection updates), does it not do a read before write first? Is that not an anti pattern? Am I missing something?
P.S I am not talking about mere UPSERTS but UPDATES on specific columns.
TIA!
No, Update is not an anti-pattern.
In Cassandra update is an upsert operation similar to insert.
UPDATE writes one or more column values to a row in a Cassandra table. Like INSERT, UPDATE is an upsert operation: if the specified row does not exist, the command creates it. All UPDATEs within the same partition key are applied atomically and in isolation.
But Lightweight transactions are read before write operation. Actually at the cost of four round trips.
Example of Lightweight transaction :
#Lightweight transaction Insert
INSERT INTO customer_account (customerID, customer_email)
VALUES (‘LauraS’, ‘lauras#gmail.com’)
IF NOT EXISTS;
#Lightweight transaction Update
UPDATE customer_account
SET customer_email=’laurass#gmail.com’
IF customerID=’LauraS’;
Both of the above statement are Lightweight transactions
Source : http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlUpdate.html#cqlUpdate__description
I have a CQL table (cql 3, cassandra 2.0.*) that looks something like:
CREATE TABLE IF NOT EXISTS user_things (
user_id bigint,
thing_id bigint,
created_at timeuuid,
PRIMARY KEY (user_id, thing_id)
);
I want to do an insert like
INSERT INTO user_things (user_id, thing_id, created_at) VALUES (?, ?, now())
but only if the row doesn't exist.
I could do this in two synchronous statements (first a SELECT, followed by an INSERT if the SELECT didn't return a row) or I could use INSERT ... IF NOT EXISTS.
The CQL docs state "But please note that using IF NOT EXISTS will incur a non negligible performance cost (internally, Paxos will be used) so this should be used sparingly."
I'm wondering if anybody has done benchmarking to see what is more performant if we have lots of these operations happening? (say hundreds a second)
It depends a lot on what topology you are using. The IF NOT EXISTS is pretty fast if you restrict it to a local data center (with LOCAL_SERIAL) and use a small replication factor. If you try to use it across multiple data centers or with higher replication factors, then it slows down dramatically. There is an open ticket to improve its performance, so hopefully that will get completed soon since it is currently an overly expensive operation with lots of round trips.
Another thing that will slow IF NOT EXISTS down is when you use it on clustered rows. It seems to work the fastest when your table only has a compound partition key and no clustering columns.
If you go the read before write route, then you've got other problems to deal with. First off you will have a race condition since if two clients do a read around the same time, and then both decide to do a write, you'll get one overwriting the other, which kind of makes the read pointless (see another approach here: collision detection. If somehow you don't mind the race condition, and use a low consistency like ONE for the read and write, then it will likely outperform IF NOT EXISTS.
Pretty much you'd have to benchmark it for your system and schema to see which one was faster in your situation.
While I have not done the benchmarking myself, I would imagine that the two synchronous statements would be faster to operate because simply, it's not doing as much. It's executing two well-designed CQL queries, whereas the other method involves at least 4 communication 'phases' between the nodes.
But if you do use this method, are you able to guarantee that these queries are executed atomically and that there won't be an INSERT with the same user_id and thing_id in the time between running the SELECT and running the INSERT? The need to avoid this situation is what drives using lightweight transactions in Cassandra and Paxos in general.
I have a Job_Status table with 3 columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
Other few fields containing stats (like memory, CPU utilization)
At a regular interval (say 1 min), entries are inserted in the above table for the Jobs running on each Machines.
I want to design the data model in Cassandra.
My requirement is to get list (pair) of jobs which are running at the same time on 2 or more than 2 machines.
I have created table with Job_Id and Job_Time as primary key for row but in order to achieve the desired result I have to do lots of parsing of data after retrieval of records.
Which is taking a lot of time when the number of records reach around 500 thousand.
This requirement expects the operation like inner join of SQL, but I can’t use SQL due to some business reasons and also SQL query with such huge data set is also taking lots of time as I tried that with dummy data in SQL Server.
So I require your help on below points:
Kindly suggest some efficient data model in Cassandra for this requirement.
How the join operation of SQL can be achieved/implemented in Cassandra database?
Kindly suggest some alternate design/algorithm. I am stuck at this problem for a very long time.
That's a pretty broad question. As a general approach you might want to look at pairing Cassandra with Spark so that you could do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
When you wanted to update your pairing of jobs, you'd run a spark batch job that would load the table data into an RDD, and then do a map/reduce operation on the data, or use spark SQL to do a SQL style join. You'd probably then write the resulting RDD back to a Cassandra table.
When I'm sending batch of inserts to only one table while each row as a unique key with condition if not exists and there is a problem when even if one of the rows exists.
I need to insert the batch per row and not per the whole batch.
Let's say I've a table "users" with only one column "user_name" and contains the row "jhon", Now I'm trying to import new users:
BEGIN BATCH
INSERT INTO "users" ("user_name") VALUES ("jhon") IF NOT EXISTS;
INSERT INTO "users" ("user_name") VALUES ("mandy") IF NOT EXISTS;
APPLY BATCH;
It will not insert "mandy" because that "jhon" exists, What can I do to isolate them?
I've a lot of rows to insert about 100-200K so I need to use batch.
Thanks!
First: what you describe is documented as intended behavior:
In Cassandra 2.0.6 and later, you can batch conditional updates introduced as lightweight transactions in Cassandra 2.0. Only updates made to the same partition can be included in the batch because the underlying Paxos implementation works at the granularity of the partition. You can group updates that have conditions with those that do not, but when a single statement in a batch uses a condition, the entire batch is committed using a single Paxos proposal, as if all of the conditions contained in the batch apply.
That basically confirms: your updates are to different partitions, so only one Paxos proposal is going to be used, which means the entire batch will succeed, or none of it will.
That said, with Cassandra, batches aren't meant to speed up and bulk load - they're meant to create pseudo-atomic logical operations. From http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html :
Batches are often mistakenly used in an attempt to optimize performance. Unlogged batches require the coordinator to manage inserts, which can place a heavy load on the coordinator node. If other nodes own partition keys, the coordinator node needs to deal with a network hop, resulting in inefficient delivery. Use unlogged batches when making updates to the same partition key.
The coordinator node might also need to work hard to process a logged batch while maintaining consistency between tables. For example, upon receiving a batch, the coordinator node sends batch logs to two other nodes. In the event of a coordinator failure, the other nodes retry the batch. The entire cluster is affected. Use a logged batch to synchronize tables, as shown in this example:
In your schema, each INSERT is to a different partition, which is going to add a LOT of load on your coordinator.
You can run your 200k inserts with a client with async executes, and they'll run quite fast - probably as fast (or faster) as you'd see with a batch.