Minimizing inconsistency between tables in denormalized databases like Cassandra

Cassandra (and BigTable, etc.) recommends a denormalized data model, where tables are designed around the expected queries. The Cassandra docs use this example:
hotels_by_poi:
  poi_name (partition key)
  hotel_id (clustering key)
  name
  phone
  address

hotels:
  hotel_id (partition key)
  name
  phone
  address
So name, phone, and address are denormalized between hotels_by_poi and hotels. What I'm wondering about is how to implement this method:
update_hotel_info(hotel_id, name, phone, address) {
    updateHotel(hotel_id, name, phone, address);
    updatePoisByHotel(hotel_id, name, phone, address);
}
It's possible the first call fails, or that the server running the two updates crashes between the first and second one. The data then gets out of sync, and without doing anything else it's not even eventually consistent.
Is it possible to design multiple tables for eventual consistency of denormalized data shared between them?
Is this something people worry about in practice? (e.g., if the services are all five nines, then the integrity is four nines, which is pretty good.)

The idea is to wrap the related table updates in a CQL BATCH statement, as I've explained here -- https://community.datastax.com/articles/2744/.
Even if you don't use CQL batches, the point is that if either of those methods fails, your error handling should (for example) retry the request to make sure both succeed. Cheers!
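A sketch of what that batch could look like for the hotel tables in the question (the literal values and the POI name are placeholders):

```sql
BEGIN BATCH
    UPDATE hotels
        SET name = 'Grand Hotel', phone = '555-0100', address = '1 Main St'
        WHERE hotel_id = 'h1';
    UPDATE hotels_by_poi
        SET name = 'Grand Hotel', phone = '555-0100', address = '1 Main St'
        WHERE poi_name = 'Old Town' AND hotel_id = 'h1';
APPLY BATCH;
```

A logged batch guarantees that either all statements eventually apply or none do, at the cost of extra coordination; note that it provides atomicity but not isolation.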

As Erick mentioned, either use a batch to maintain consistency, or handle it on the client side by retrying failed inserts/updates. For example:
update_hotel_info(hotel_id, name, phone, address) {
    updateHotel(hotel_id, name, phone, address);
    updatePoisByHotel(hotel_id, name, phone, address);
}
You can retry update_hotel_info if either insert/update fails. This way you get fast writes and can take advantage of Cassandra's cheap writes.
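The retry idea can be sketched like this (update_hotel and update_pois_by_hotel are stand-ins for the real driver calls, and the backoff numbers are arbitrary):

```python
import time

def update_hotel_info(hotel_id, name, phone, address,
                      update_hotel, update_pois_by_hotel,
                      max_attempts=3, backoff_s=0.1):
    """Run both denormalized writes; retry the whole pair until both succeed.

    Cassandra writes are idempotent upserts, so re-running a write that
    already succeeded is safe.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            update_hotel(hotel_id, name, phone, address)
            update_pois_by_hotel(hotel_id, name, phone, address)
            return True
        except Exception:
            if attempt == max_attempts:
                raise  # give up; the tables may be out of sync until repaired
            time.sleep(backoff_s * attempt)  # simple linear backoff
```

Because both writes are upserts of the same values, the retry loop converges to a consistent state as long as it eventually succeeds.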

Related

How can I get an item by attribute in DynamoDB in Node.js

I have 7 columns:
id, firstname, lastname, email, phone, createAt, updatedAt
I am trying to write an API in Node.js to get items by phone.
id is the primary key.
I want to fetch items by phone or email, but I haven't created a sort key or GSI yet.
So far the suggestions I've found say to use Scan with filters in DynamoDB and read all records.
Is there any other way to achieve it?
Your question already contains two good answers:
1. The slow way is to use a Scan with a FilterExpression to find the matching items. This takes the time (and also the cost!) of reading the entire table on every query, so it only makes sense if these queries are very infrequent.
2. If queries by phone are not super-rare, it is better to prepare in advance: add a GSI with phone as its partition key, to allow looking up items by phone value using a Query with IndexName and KeyConditionExpression. These queries will be fast and cheap: you only pay for the items actually retrieved. The downside of this approach is the increased write cost: the cost of every write doubles (DynamoDB writes to both the base table and the index), and the cost of storage increases as well. But unless your workload is write-mostly (items very frequently updated and very rarely read), option 2, using a GSI, is still better than option 1, a full-table Scan.
Finally, another option you have is to reconsider your data model. For example, if you always look up items by phone and never by id, you can make phone the partition key of your data and id the sort key (to allow multiple items with the same phone). But I don't know if this is relevant to your use case. If you need to look up items sometimes by id and sometimes by phone, a GSI is probably exactly what you need.
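The GSI query from option 2 looks roughly like this. It is shown as low-level API parameters (the same shape the Node.js SDK accepts); the index name phone-index, table name users, and phone value are assumptions:

```python
# Parameters for a Query against a hypothetical GSI "phone-index"
# whose partition key is `phone`. Table name "users" is also assumed.
query_params = {
    "TableName": "users",
    "IndexName": "phone-index",
    "KeyConditionExpression": "phone = :p",
    "ExpressionAttributeValues": {":p": {"S": "+15550100"}},
}
# With a live client this would be passed to the Query API
# (e.g. client.query(**query_params) in boto3), and you pay
# only for the items actually returned.
```

Unlike a Scan with a FilterExpression, the key condition here is evaluated by the index itself, so only matching items are read.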

Cassandra Query Performance: Using IN clause for one portion of the composite partition key

I currently have a table set up in Cassandra that has either text, decimal or date type columns with a composite partition key of a business_date and an account_number. For queries to this table, I need to be able to support look-ups for a single account, or for a list of accounts, for a given date.
Example:
select x,y,z from my_table where business_date = '2019-04-10' and account_number IN ('AAA', 'BBB', 'CCC')
//Note: Both partition keys are provided for this query
I've been struggling to resolve performance issues related to accessing this data because I'm noticing latency patterns that I am having trouble trying to understand / explain.
In many scenarios, the same exact query can be run a total of three times in a short period by the client application. For these scenarios, I see that two out of three requests will have really bad response times (800 ms), and one of them will have a really fast one (50 ms). At first I thought this would be due to key or row caches, however, I'm not so sure since I believe that if this were true, the third request out of the three should always be the fastest, which isn't the case.
The second issue I believed I was facing was the actual data model itself. Although the queries are being submitted with all the partition keys being provided, since it's an IN clause, the results would be separate partitions and can be distributed across the cluster and so, this would be a bad access pattern. However, I see these latency problems when even single account queries are run. Additionally, I see queries that come with 15 - 20 accounts performing really well (under 50ms), so I'm not sure if the data model is actually an issue.
Cluster setup:
Datacenters: 2
Number of nodes per data center: 3
Keyspace replication: local_dc = 2, remote_dc = 2
Java Driver set:
Load-balancing: DCAware with LatencyAware
Protocol: v3
Queries are still set up to use "IN" clauses instead of async individual queries
Read_consistency: LOCAL_ONE
Does anyone have any ideas / clues of what I should be focusing on in terms of really identifying the root cause of this issue?
The use of IN on the partition key is always a bad idea, even for composite partition keys. The value of the partition key defines the location of your data in the cluster, and different partition key values will most probably put the data onto different servers. In this case the coordinator node (the one that received the query) needs to contact the nodes that hold the data, wait for those nodes to deliver their results, and only then send the results back to you.
If you need to query several partition keys, it will be faster to issue individual queries asynchronously and collect the results on the client side.
Also, please note that the TokenAware policy works best when you use a PreparedStatement: in that case the driver can extract the partition key value and find which server holds the data for it.
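The fan-out this answer suggests can be sketched like this (query_one_account stands in for executing a single-partition prepared statement, e.g. via the driver's execute_async):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_accounts(query_one_account, business_date, account_numbers):
    """Issue one single-partition query per account in parallel and
    merge the results client-side, instead of one big IN query."""
    with ThreadPoolExecutor(max_workers=max(1, len(account_numbers))) as pool:
        futures = [pool.submit(query_one_account, business_date, acct)
                   for acct in account_numbers]
        rows = []
        for f in futures:
            rows.extend(f.result())  # re-raises if a query failed
        return rows
```

Each query targets exactly one partition, so a token-aware driver can route it straight to a replica, and the coordinator overhead of the IN clause disappears.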

Cassandra: using one letter as a shard key to reduce load on the cluster

I need to implement a functionality to search users by their nickname.
I know that it's possible to create a SASI index on the nickname, and the search will work. However, as far as I understand, the query will be sent to all nodes in the cluster.
I want to modify the table and introduce a shard key, which will be the first letter of the nickname. That way, when a user starts to search, we know we only need to forward the query to a specific node (plus replicas).
P.S. I know that this kind of pattern can create a hotspot. However, I think the trade-offs here are worthwhile, and in practice I should not run into an issue with this hotspot (I don't expect a billion users in my system).
What do you think?
Thank you in advance.
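For concreteness, the proposed model might look something like this (table and column names are hypothetical):

```sql
CREATE TABLE users_by_nickname_prefix (
    first_letter text,      -- shard key: first letter of the nickname
    nickname     text,
    user_id      uuid,
    PRIMARY KEY ((first_letter), nickname, user_id)
);

-- A prefix search then hits a single partition (plus its replicas):
SELECT user_id, nickname FROM users_by_nickname_prefix
WHERE first_letter = 'a' AND nickname >= 'ali' AND nickname < 'alj';
```

The range condition on the nickname clustering column is allowed because the full partition key is specified.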

How to make sure a write happens only if the data does not exist in Cassandra

I have two methods in my server application:
boolean isMessageExist(messageId), which executes this query:
SELECT messageId FROM message WHERE messageId = 1;
insertMessage(int messageId, String data), which executes this query:
INSERT INTO message (messageId, data) VALUES (1, 'xyz');
In my code I do the following to meet the requirement that the insert happens only if the message does not exist:
if (!isMessageExist(1)) {
    insertMessage(1, "xyz");
}
But the above code does not work when requests for the same messageId arrive almost simultaneously.
I.e., at time T0 both Read1(1), Write1(1) and Read2(1), Write2(1) happen at the same time because the two requests were sent from the client at the same time. Is there a way to sequence these requests on the server side? I mean, Read2(1) should always see the result of Write1(1).
I don't want to use a CAS operation (IF NOT EXISTS) due to its performance overhead.
Is there any other way to achieve my requirement? Please suggest.
Using Cassandra's lightweight transactions (LWT) with IF NOT EXISTS should be both less expensive than what you are currently doing and satisfy your requirement for uniqueness.
INSERT INTO message (messageId, data) VALUES (1, 'xyz') IF NOT EXISTS;
You can test and verify the performance, but two round trips (read, write) is almost certainly more expensive than a single INSERT ... IF NOT EXISTS.
Alternatively, if you can redesign your application so it uses upserts, where new values simply overwrite old data, that would be even better and more in keeping with native Cassandra style.
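For reference, a conditional insert reports whether it took effect via an [applied] column in its result, so the losing request can detect the conflict without a separate read:

```sql
INSERT INTO message (messageId, data) VALUES (1, 'xyz') IF NOT EXISTS;
-- First execution returns:            [applied] = true
-- A concurrent/duplicate attempt gets: [applied] = false, plus the
-- existing row, without overwriting it.
```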

Getting rid of confusion regarding NoSQL databases

This question is about NoSQL (take Cassandra, for instance).
Is it true that when you use a NoSQL database without data replication you have no consistency concerns? Not even with concurrent access?
What happens in the case of a network partition where the same row has been written on both sides, possibly multiple times? When the partition heals, which written value is used?
Let's say you use N=5, W=3, R=3. This means you have guaranteed consistency, right? How good is it to use this quorum? Isn't having 3 nodes return the data a big overhead?
Can you specify on a per-query basis in Cassandra whether you want the query to have guaranteed consistency? For instance, for an insert, can you enforce that all replicas complete the insert before the value is returned by a read operation?
If you have employees{PK: employeeID, departmentId, employeeName, birthday} and department{PK: departmentID, departmentName}, and you want to get the birthday of all employees in a department with a specific name, there are two problems:
you can't ask for all the employees with a given birthday (because you can only query on the primary key)
You can't join the employee and the department column families because joins are impossible.
So what you can do is create a column family:
departmentBirthdays{PK:(departmentName, birthday), [employees-whos-birthday-it-is]}
In that case, whenever an employee is fired/hired, they have to be removed from/added to the departmentBirthdays column family. Is this something you have to do manually? So you have to write the queries yourself to update all redundant/denormalized data?
I'll answer this from the perspective of Cassandra, because that's what you seem to be looking at (hardly any two NoSQL stores are the same!).
For a single node, all operations are in sequence. Concurrency issues can be orthogonal, though: your web client may have made one request and then another, but due to network load Cassandra got the second one first. That may or may not be an issue. There are approaches around such problems, like immutable data. You can also leverage lightweight transactions.
Cassandra uses last-write-wins to resolve conflicts. Depending on your replication factor and the consistency level of your query, this can work well.
Quorum for reads AND writes will give you consistency (the rule is R + W > N; here 3 + 3 > 5). There is an edge case: if the coordinator doesn't know a quorum node is down, it sends the write requests, and the write completes when quorum is re-established. The client in this case gets a timeout, not a failure. The subsequent query may get stale data, but any query after that will get the latest data. This is an extreme edge case, and typically N=5, R=3, W=3 will give you full consistency. Reading from three nodes isn't actually that much of an overhead. For a query with R=3, the client makes the request to the node it's connected to (the coordinator). The coordinator queries the replicas in parallel (not sequentially), merges the results with last-write-wins, and issues read repairs etc. if needed. Because the queries happen in parallel, the overhead is greatly reduced.
Yes.
This is a matter of data modelling. You describe one approach (though partitioning on birthday rather than department might be better and give a more even distribution of partitions). Do you need the employee and department tables, i.e. are they needed for other queries? If not, maybe you just need one. If you denormalize, you'll need to maintain the data manually. In Cassandra 3.0, global indexes will allow you to query on an index without it being inefficient (which is the case today when you use a secondary index without specifying the partition key). Yet another option is to partition employees by birthday, do two queries, and do the join in memory on the client. Cassandra queries that hit a single partition are very fast, so doing two won't really be that expensive.
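The denormalized table from the question could be expressed in CQL roughly as follows (column names and types are illustrative):

```sql
CREATE TABLE department_birthdays (
    department_name text,
    birthday        date,
    employee_id     uuid,
    employee_name   text,
    PRIMARY KEY ((department_name), birthday, employee_id)
);

-- "Birthdays of all employees in a given department" is then
-- a single-partition query:
SELECT birthday, employee_name FROM department_birthdays
WHERE department_name = 'Engineering';
```

As the answer notes, the application is responsible for inserting/deleting rows here whenever the employees table changes.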
