Keep denormalized tables consistent (synchronized) in Cassandra

I am exploring Cassandra and came across a video which explains how to denormalize tables to store comments. The idea is this:
Instead of a single comments table with a user_id column pointing to the user who wrote the comment and a video_id column pointing to the video that was commented on,
we have two comment tables, comments_by_user and comments_by_video. The question is: how do we keep these two tables synchronized?
When a user comments on a video, we insert the comment into both comments_by_video and comments_by_user. However, what if the second insert fails?
We would then have a comment on a video that cannot be found when we select all comments for that user.

You can use a batch statement for this purpose. Be warned, though: a batch statement is slower and adds roughly 30% overhead on the coordinator node compared to the same operations executed individually.
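For illustration, a logged batch covering both inserts might look like the sketch below; the table schemas and the literal values are assumptions, not taken from the question. A logged batch guarantees that if any statement in it succeeds, all of them will eventually be applied, which addresses the "second insert fails" scenario; it does not make the writes isolated, so a reader may briefly see the comment in one table before the other.

BEGIN BATCH
  INSERT INTO comments_by_video (video_id, comment_id, user_id, comment)
  VALUES (873ff3f6-1b73-4f4a-9b0a-4f1f5b6f2d11,
          1b2f0c3d-4e5a-4b7c-8d9e-0f1a2b3c4d5e,
          5f2c9a0e-8d4b-4c6e-9f3a-2b7d8e1c4a55,
          'Nice video!');
  INSERT INTO comments_by_user (user_id, comment_id, video_id, comment)
  VALUES (5f2c9a0e-8d4b-4c6e-9f3a-2b7d8e1c4a55,
          1b2f0c3d-4e5a-4b7c-8d9e-0f1a2b3c4d5e,
          873ff3f6-1b73-4f4a-9b0a-4f1f5b6f2d11,
          'Nice video!');
APPLY BATCH;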

One option is the batch statement mentioned in the previous answer, which has a performance impact.
The other option I can think of is to change the consistency level to ANY when a write fails. With consistency ANY, your data is stored on the coordinator node for the configured amount of time and replayed to the responsible node when it comes back up.
For any failed write, handle the failure in your code: change the consistency to ANY for that insert and execute it again. Of course, you won't be able to read that data until one of the responsible nodes actually receives it.
// A sketch of that retry logic, assuming the DataStax Java driver 3.x API
void retryWrite(Session session, Statement statement) {
    try {
        session.execute(statement);  // the normal insert
    } catch (WriteTimeoutException e) {
        // fall back to consistency ANY and re-execute the failed insert
        statement.setConsistencyLevel(ConsistencyLevel.ANY);
        session.execute(statement);
        // set the consistency back to whatever is required if the statement is reused
    }
}

retryWrite(session, statement1);
retryWrite(session, statement2);

Related

Cassandra delete/update a row and get its previous value

How can I delete a row from Cassandra and get the value it had just before the deletion?
I could execute a SELECT and DELETE query in series, but how can I be sure that the data was not altered concurrently between the execution of those two queries?
I've tried to execute the SELECT and DELETE queries in a batch, but that does not seem to be allowed:
cqlsh:foo> BEGIN BATCH
... SELECT * FROM data_by_user WHERE user = 'foo';
... DELETE FROM data_by_user WHERE user = 'foo';
... APPLY BATCH;
SyntaxException: line 2:4 mismatched input 'SELECT' expecting K_APPLY (BEGIN BATCH [SELECT]...)
In my use case I have one main table that stores data for items, and I've built several tables that allow looking up items based on that information.
If I delete an item from the main table, I must also remove it from the other tables.
CREATE TABLE items (id text PRIMARY KEY, owner text, liking_users set<text>, ...);
CREATE TABLE owned_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
CREATE TABLE liked_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
...
I'm afraid the tables might contain wrong data if I delete an item while, at the same time, someone hits the like button of that very item:
1) The deleteItem method executes a SELECT query to fetch the current row of the item from the main table.
2) The likeItem method, running concurrently, executes an UPDATE query and inserts the item into the owned_items_by_user, liked_items_by_user, ... tables. This happens after the SELECT statement was executed, and the UPDATE query completes before the DELETE query.
3) The deleteItem method deletes the item from the owned_items_by_user, liked_items_by_user, ... tables based on the data just retrieved via the SELECT statement. That data does not yet contain the just-added like, so the item is deleted, but the like remains in the liked_items_by_user table.
You can do a SELECT beforehand, then use a lightweight transaction on the DELETE to ensure the data still looks exactly as it did when you selected it. If it does, you know the latest state before you deleted; if it does not, keep retrying the whole procedure until it sticks.
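A sketch of that pattern in CQL, against the items table from the question (the literal values are placeholders):

SELECT owner, liking_users FROM items WHERE id = 'item42';

-- ...decide which lookup-table rows to delete based on that result...

DELETE FROM items
WHERE id = 'item42'
IF owner = 'alice' AND liking_users = {'bob', 'carol'};

The conditional DELETE returns an [applied] column; if it is false, the row changed between the SELECT and the DELETE, so re-read and retry.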
Unfortunately you cannot put a SELECT query inside a batch statement. If you read the docs here, only INSERT, UPDATE, and DELETE statements can be used.
What you're looking for is atomicity of execution, but batch statements are not the way forward. If the data is altered concurrently, your worst-case outcome is zombies: deleted data that reappears.
Cassandra uses a grace-period mechanism (tombstones with a configurable grace period) to deal with this; you can find the details here. If this is critical to your business logic, the "best" you can do is increase the consistency level, or restructure the read pattern at the application level so it does not rely on perfect atomicity, whichever trade-off is right for you. Either you give up some performance, or you tune down the requirement.
In practice, QUORUM should be more than enough to satisfy most situations most of the time. Alternatively, you can use ALL and pay the performance penalty: all replicas of the given foo partition key then have to acknowledge the write in both the commitlog and the memtable before the delete completes. You can tune the consistency to the level you require.
You don't get atomicity in the SQL sense, but depending on your throughput it's unlikely that you will need it (touch wood).
TLDR:
CONSISTENCY ALL;
DELETE FROM data_by_user WHERE user = 'foo';
That should do the trick. The error you're seeing is from the ANTLR3 grammar parser for CQL 3, which is not designed to accept SELECT queries inside batches simply because they are not supported; you can see that here.

Checking to see if a RethinkDB shard is in use?

I am wondering if there is a way to check whether a RethinkDB shard is in use before performing a ReQL query on it.
I am currently calling two functions back to back: the first creates a RethinkDB table and inserts data, the second reads the data from that newly created table. This works fine when the data being inserted is minimal, but once the size of the inserted data set increases, I start getting:
Unhandled rejection RqlRuntimeError: Cannot perform write: Primary replica for shard ["", +inf) not available
This is because the primary shard is still busy with the write from the previous function. I guess I am wondering if there is some RethinkDB-specific way of avoiding this, or if I need to emit/listen for events or something?
You can probably use the wait command for this. From the docs:
Wait for a table or all the tables in a database to be ready. A table may be temporarily unavailable after creation, rebalancing or reconfiguring. The wait command blocks until the given table (or database) is fully up to date.
http://rethinkdb.com/api/javascript/wait/

Cassandra - multiple counters based on timeframe

I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low-latency reads are my biggest concern here. From my research, the best approach I can think of is a different counter table for each permutation of source, user, and predefined time window. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then a count_by_user table for just the per-user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern (and it should be, if you have already chosen Cassandra), you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for reads and don't worry about redundant storage. Since within every table the data is stored sequentially according to its primary key, you cannot order a table in more than one way (as you would with indexes in a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given at "Cassandra Day" events; you may find it on "Planet Cassandra" or Jon Haddad's blog.
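As a sketch of that table-per-query approach for the counters above, with a per-day bucket as the clustering column (the table names, types, and daily granularity are assumptions; the date type needs Cassandra 2.2+, a text bucket like '2015-10-01' works on older versions):

CREATE TABLE count_by_source_and_user (
  source text,
  user_id uuid,
  day date,
  event_count counter,
  PRIMARY KEY ((source, user_id), day)
);

CREATE TABLE count_by_user (
  user_id uuid,
  day date,
  event_count counter,
  PRIMARY KEY (user_id, day)
);

-- each incoming event increments every table it must be queryable from:
UPDATE count_by_user SET event_count = event_count + 1
WHERE user_id = 5f2c9a0e-8d4b-4c6e-9f3a-2b7d8e1c4a55 AND day = '2015-10-01';

-- "events for user A for the last week" is then a single partition slice:
SELECT day, event_count FROM count_by_user
WHERE user_id = 5f2c9a0e-8d4b-4c6e-9f3a-2b7d8e1c4a55
  AND day >= '2015-09-25' AND day <= '2015-10-01';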

What data structure should I use to mimic "order by counter" in Cassandra?

Let's say I currently have a table like this
CREATE TABLE comment_counters (
  contentid uuid,
  commentid uuid,
  ...
  liked counter,
  PRIMARY KEY (contentid, commentid)
);
The purpose of this table is to track comments and the number of times individual comments have been "liked".
What I would like to do is get the top comments (say, the top 20) for each piece of content, ranked by their number of likes.
I know there's no way to ORDER BY a counter, so what I would like to know is: are there other ways to do this in Cassandra, for instance by restructuring my tables or tracking more or different information, or am I left with no choice but to do this in an RDBMS?
Sorting in the client is not an option I would like to consider at this stage.
Unfortunately there's no way to do this type of aggregation with plain Cassandra queries. The best option for this kind of data analysis is to use an external tool such as Spark.
Using Spark you can run periodic jobs that read and aggregate all counters from the comment_counters table and then write the results (such as the top 20 comments) to a different table that you can query directly afterwards.
See here to get started with Cassandra and Spark.
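For illustration, the Spark job could write its output into a table shaped exactly for the read; the schema below is an assumption, not from the question:

CREATE TABLE top_comments_by_content (
  contentid uuid,
  rank int,
  commentid uuid,
  liked bigint,  -- snapshot of the counter at aggregation time
  PRIMARY KEY (contentid, rank)
);

-- fetching the top 20 comments for a piece of content is then a single
-- partition read in clustering order:
SELECT commentid, liked FROM top_comments_by_content
WHERE contentid = 873ff3f6-1b73-4f4a-9b0a-4f1f5b6f2d11
LIMIT 20;

The trade-off is that the ranking is only as fresh as the last Spark run.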

How can I make queries for this Cassandra data model design?

I ran into a doubt while designing a data model in Cassandra. I have created this CF:
Page-Followers{ "page-id" : { "user-id" : "time" } }
I want to run 2 queries against the above CF:
1) Get all user-ids (as an array, using phpcassa's multiget function) of the users who are following a particular page.
2) Check whether a particular user is following a particular page or not, i.e. whether the user with user-id = 1111 is following the page with page-id = 100.
How can I make those queries based on that CF?
Note: I don't want to create a new CF for this situation, because for this user action (the user clicks the follow button on a page) I already have to insert data into 3 CFs, and if I created another CF for this I would have to insert into 4 CFs in total. That may cause a performance issue.
If you can give an example in phpcassa, that would be great.
Another doubt: in the Cassandra data model for my college social networking site (page-followers, user-followers, notifications, alerts, etc.), each user action requires inserting data into 2, 3, or more CFs. Can that cause a performance issue? Is it a good design?
Thanks in advance.
In general, when data modeling in Cassandra, you first look at your queries and then construct a data model suitable for them.
For your case, you can do the following (I have no experience with phpcassa, so I can only give you the approach; you have to figure out the phpcassa bit, and the CQL sketch after this list shows what the two reads amount to):
1) Do a slice query with an empty start column and an empty end column, and set the count to a very large value. This will return all the columns of the row.
2) Do a single-column get for row key 100 and column user-id 1111. If the value is not null, the user follows the page.
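phpcassa aside, in CQL terms those two reads amount to the following, assuming the CF were modeled as the (hypothetical) table below:

CREATE TABLE page_followers (
  page_id bigint,
  user_id bigint,
  followed_at timestamp,
  PRIMARY KEY (page_id, user_id)
);

-- 1) all users following page 100 (the slice over the whole row):
SELECT user_id FROM page_followers WHERE page_id = 100;

-- 2) is user 1111 following page 100? (the single-column get):
SELECT followed_at FROM page_followers WHERE page_id = 100 AND user_id = 1111;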
Cassandra is highly optimized for writes. The recommended way to model data in Cassandra is to write in a denormalized fashion, even to multiple CFs, so writing to 2 or 3 column families per action should not be an issue. You can always make the writes asynchronous to achieve better performance.
EDIT: http://thobbs.github.com/phpcassa/tutorial.html is a good place to start with phpcassa.
