Cassandra design pattern for shared record (m:n)

We have two entities, User and Role. One User can have multiple Roles, and a single Role can be shared by many Users -
a typical m:n relation.
Roles are also dynamic, and we expect a large number of them (millions).
It is quite simple to model such data in a relational DB. I would like to find out whether it is possible in Cassandra.
Currently I see two solutions:
A) Use a normalized model and create something similar to an inner join
Create each Role in a separate CF and store foreign keys to the referenced Roles in the User record.
pro: Roles are not replicated and maintenance is simple
contra: In order to get all Roles for a single User, multiple network calls are necessary. The User record contains only FKs; Roles are stored
using the random partitioner, so each Role could live on a different Cassandra node.
B) Denormalize the model and replicate Roles to avoid round trips
In this scenario the User record in Cassandra contains a copy of all the user's Roles.
pro: It is possible to read a User with all Roles in a single query. This guarantees short load times.
contra: Each shared Role is copied multiple times - once onto each related User. Maintaining Roles is very difficult, especially with a
large amount of data. For example: one Role is shared by 1000 Users. A change to this Role requires updates on 1000 User records.
For very large data sets such updates have to be executed as an asynchronous job.
The solutions above are very limited; maybe Cassandra is not the right fit for m:n relations? Do you know any Cassandra design pattern for such a problem?
Thanks,
Maciej

The way you want to design a data store in Cassandra is to start with the queries you plan to execute and make it so you can get all the information you need at once. Denormalization is the name of the game here; if you're not replicating that role information in each user node, you're not going to avoid disk seeks, and your read performance will suffer. Joins do not make sense; if you want a relational database, use a relational database.
At a guess, you're going to ask a lot of questions about what roles a user has and what they should be doing with them, so you definitely want to have role information duplicated in each user entry - probably with each role getting its own column (role-ROLE_KEY => serialized-capability-info instead of roles => [serialized array of capability info]). Your application will need some way to iterate over all those columns itself.
You will probably want to look at what users are in a role, and so you should probably store all the user information you'll need for that view in the role column family as well (though a subset of the full user record will do).
When you run updates, and add/remove users from roles, you will need to make sure that you update both the role's list of users and the user's roles at the same time. Because you're using a column for each relation, instead of a single shared serialized blob, this should work even if you're editing two different roles that share the same user at the same time: Cassandra can merge the updates, including the deletes.
If the update needs to be asynchronous, then make your application handle it. Remember that Cassandra is an eventual-consistency data store, so you shouldn't expect updates to be visible everywhere immediately anyway.
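To make the two-sided denormalized layout concrete, here is a minimal in-memory sketch of the idea: each user row holds one column per role (role-ROLE_KEY => capability info), and each role row holds one column per member user, so each relation edge is updated on both sides. All names and shapes are illustrative, not a real Cassandra API.

```python
users = {}   # user_id -> {f"role-{role_key}": capability_info}
roles = {}   # role_key -> {f"user-{user_id}": user_summary}

def grant_role(user_id, role_key, capability_info, user_summary):
    """Write both sides of the relation, one column per edge."""
    users.setdefault(user_id, {})[f"role-{role_key}"] = capability_info
    roles.setdefault(role_key, {})[f"user-{user_id}"] = user_summary

def revoke_role(user_id, role_key):
    """Remove both sides of the relation together."""
    users.get(user_id, {}).pop(f"role-{role_key}", None)
    roles.get(role_key, {}).pop(f"user-{user_id}", None)

grant_role("maciej", "admin", {"can_delete": True}, {"name": "Maciej"})
grant_role("maciej", "editor", {"can_edit": True}, {"name": "Maciej"})
grant_role("dean", "admin", {"can_delete": True}, {"name": "Dean"})

# Reading one user returns all of their roles in a single lookup,
# and reading one role returns all of its users in a single lookup.
assert set(users["maciej"]) == {"role-admin", "role-editor"}
assert set(roles["admin"]) == {"user-maciej", "user-dean"}
```

Because each relation is its own column rather than part of a shared serialized blob, two concurrent edits touching different roles of the same user do not clobber each other, which is what lets Cassandra merge the updates.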

Another option these days is to use playORM, which can do joins for you ;). You just decide how to partition your data. It uses Scalable JQL, a simple addition on top of JQL, as follows:
#NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS t('account', :partId) select t FROM Trade as t INNER JOIN t.security as s where s.securityType = :type and t.numShares = :shares")
So we can finally normalize our data on a NoSQL system AND scale at the same time. We don't need to give up normalization, which has certain benefits.
Dean

Related

What is the right approach for batch deletions with the Cassandra C# driver?

I am quite new to the Cassandra database. I have a question related to the use of Cassandra.
The table structure looks like this:
Table name: ProductDetails
ProductFamily text,
AccessGroup text,
ProductDetails map,
PRIMARY KEY ((ProductFamily), AccessGroup)
Data relation:
For one product family we have multiple access groups, and each access group has product details in a map. It is quite possible that one product detail is present in all of the access groups, or only in some of them.
Scenario 1:
We receive a delete event with only a ProductId and a product family.
Our implementation:
Fetch all access groups of the product family from the database.
For each access group, hit the database to get the map, then check whether it has the specific ProductId as a map key.
If yes, hold that accessgroup -> productid (key, value) pair in memory.
In the end, prepare a batch statement to delete all the product ids for the access groups, because our partition key is the same.
Note - at most we have 15-20 items in a map and 8-10 access groups per product family.
Questions:
Could you please let me know whether I am following the right approach for batch deletion?
If we receive thousands of such events in a day, will this approach be performant?
Thanks in advance.
In general we don't recommend using Batches if the goal is to improve performance. However, some users have reported performance improvements when all statements within a batch refer to the same partition key (vs sending individual asynchronous requests) so your approach could actually be the one that offers the best performance.
One thing that could hurt performance is the "spiky" nature of that approach. It would probably be better for the Cassandra nodes to do something like this:
Fetch all access groups of the product family from the database.
For each access group, hit the database to get the map, then check whether it has the specific productid as a map key.
If yes, send a DELETE request asynchronously and hold the Task in memory (without awaiting it right away).
In the end, await all the tasks that were held in memory: await Task.WhenAll(tasks).
There is no guarantee that this approach will be better though, performance tests and benchmarks are the only way to determine that.
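The "send the deletes asynchronously, then await them together" approach above can be sketched as follows, in Python rather than C# for brevity. Here simulated_delete is a hypothetical stand-in for a real driver call (such as ExecuteAsync in the C# driver), and asyncio.gather plays the role of Task.WhenAll; the data shapes are illustrative.

```python
import asyncio

# (product_family, access_group) -> map of productid -> details
store = {
    ("familyA", "group1"): {"p1": "d1", "p2": "d2"},
    ("familyA", "group2"): {"p1": "d1", "p3": "d3"},
}

async def simulated_delete(family, group, product_id):
    """Stand-in for an asynchronous DELETE sent to the cluster."""
    await asyncio.sleep(0)  # placeholder for network latency
    store[(family, group)].pop(product_id, None)

async def delete_product(family, product_id):
    tasks = []
    for (fam, group), details in store.items():
        # Check whether the map has the specific product id as a key.
        if fam == family and product_id in details:
            # Send the DELETE without awaiting it right away.
            tasks.append(asyncio.create_task(
                simulated_delete(fam, group, product_id)))
    # Await all the held tasks together (the Task.WhenAll equivalent).
    await asyncio.gather(*tasks)

asyncio.run(delete_product("familyA", "p1"))
assert all("p1" not in details for details in store.values())
```

Spreading the deletes out as individual asynchronous requests smooths the load on the nodes compared with one spiky batch, which is the trade-off the answer describes.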

Implement Paging on Azure Cosmos DB Data coming from two separate documents

We have two separate sets of documents in Cosmos DB, one storing a User and their various roles, and a second set storing the permissions for a particular job.
Now, the job list is unbounded and can grow substantially over a period of time. As GROUP BY is not allowed across multiple documents, we are trying to figure out the best strategy for retrieving all users based on either a role or a particular job.
1) Solution 1 - Keep user data and job data as subdocuments in one big, long document, which helps with querying and even continuation tokens.
2) Solution 2 - Keep user and role data in one document and multiple job documents, query them separately, and combine the results on the client side. In this case continuation-token support is lost, as you have to query the complete data first to provide any meaningful results.
3) Solution 3 - Keep the role data with each job document and query it directly. In this case, we get the users for a job and then make a single query per user to get their information.
Can anyone recommend a better solution or pick from above 3 and suggest a path forward?
It seems that you need extra storage to store the relationships. You could use Azure SQL to store the relationships of user (documentId, user id, role id), role, and job, and then store the unstructured property info, such as the user info, in DocumentDB.

Cassandra - multiple counters based on timeframe

I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low-latency reads are my biggest concern here. From my research, the best way I can think of to implement this is a different counter table for each permutation of source, user, and predefined time. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern, and it should be if you have already chosen Cassandra, you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for reads and don't worry about redundant storage. Since within every table the data is stored sequentially according to the primary key, you cannot index a table in more than one way (as you would with a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given at "Cassandra Day" events. You may find it on "Planet Cassandra" or Jon Haddad's blog.
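The table-per-query counter layout described above can be sketched with plain dicts standing in for Cassandra counter tables; every event write fans out to each table that serves a query. Table and key names are illustrative.

```python
from collections import Counter

# One "table" per query shape; the tuple acts as the partition key.
count_by_source_and_user = Counter()  # key: (source, user, day)
count_by_user = Counter()             # key: (user, day)

def record_event(source, user, day):
    """Fan each event out to every counter table that serves a query."""
    count_by_source_and_user[(source, user, day)] += 1
    count_by_user[(user, day)] += 1

record_event("S", "A", "2015-06-01")
record_event("S", "A", "2015-06-01")
record_event("T", "A", "2015-06-01")

# "All events for user A on a day" is one single-partition read:
assert count_by_user[("A", "2015-06-01")] == 3
# "Events for user A from source S" reads a different table:
assert count_by_source_and_user[("S", "A", "2015-06-01")] == 2
```

Reads stay single-partition because each query has a table whose key matches it exactly; the price is one extra increment per table on every write.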

Couchdb database design options

Is it recommended to have a separate database for each document type in couchdb or place all types of documents in a single database?
Is there any limitation on the number of databases that we can create on couchdb?
Are there any drawbacks in creating large number of databases in couchdb?
There is no firm answer. Here are some guidelines:
If two documents must be visible to different sets of users, they must be in different DBs (read/write privs are per-DB, not per-doc).
If two documents must be included in the same view, they must be in the same DB (views are for a single DB only).
If two types of documents will be numerous and never be included in the same view, they might as well be in different DBs (so that accessing a view over one type won't need to process all of the docs of the other type).
It's cheap to drop a database, but expensive to delete all of the documents out of a database. Keep this in mind when designing your data expiration plan.
Nothing hardcoded, but you will eventually start running into resource constraints, depending on the hardware you have available.
Depends on what you mean by "large numbers." Thousands are fine; billions probably not (though with the Cloudant changes coming in v2.0.0 I'd guess that the reasonable cap on DB count probably goes up).

Do I use Azure Table Storage or SQL Azure for our CQRS Read System?

We are about to implement the Read portion of our CQRS system in-house with the goal being to vastly improve our read performance. Currently our reads are conducted through a web service which runs a Linq-to-SQL query against normalised data, involving some degree of deserialization from an SQL Azure database.
The simplified structure of our data is:
User
Conversation (Grouping of Messages to the same recipients)
Message
Recipients (Set of Users)
I want to move this into a denormalized state, so that when a user requests to see a feed of messages it reads from EITHER:
A denormalized representation held in Azure Table Storage
UserID as the PartitionKey
ConversationID as the RowKey
Any volatile data prone to change stored as entities
The messages serialized as JSON in an entity
The recipients of said messages serialized as JSON in an entity
The main problem with this is the limited size of a row in Table Storage (960KB)
Also any queries on the "volatile data" columns will be slow as they aren't part of the key
A normalized representation held in Azure Table Storage
Different table for Conversation details, Messages and Recipients
Partition keys for message and recipients stored on the Conversation table.
Bar that, this follows the same structure as above
Gets around the maximum row size issue
But will the normalized state reduce the performance gains of a denormalized table?
OR
A denormalized representation held in SQL Azure
UserID & ConversationID held as a composite primary key
Any volatile data prone to change stored in separate columns
The messages serialized as JSON in a column
The recipients of said messages serialized as JSON in a column
Greatest flexibility for indexing and the structure of the denormalized data
Much slower performance than Table Storage queries
What I'm asking is whether anyone has experience implementing a denormalized structure in Table Storage or SQL Azure. Which would you choose? Or is there a better approach I've missed?
My gut says the normalized (at least to some extent) data in Table Storage would be the way to go; however, I am worried that conducting 3 queries to grab all the data for a user will reduce the performance gains.
Your primary driver for considering Azure Tables is to vastly improve read performance, and in your scenario using SQL Azure is "much slower" according to your last point under "A denormalized representation held in SQL Azure". I personally find this very surprising for a few reasons and would ask for detailed analysis on how this claim was made. My default position would be that under most instances, SQL Azure would be much faster.
Here are some reasons for my skepticism of the claim:
SQL Azure uses the native/efficient TDS protocol to return data; Azure Tables use JSON format, which is more verbose
Joins / Filters in SQL Azure will be very fast as long as you are using primary keys or have indexes in SQL Azure; Azure Tables do not have indexes and joins must be performed client side
Limitations in the number of records returned by Azure Tables (1,000 records at a time) means you need to implement multiple roundtrips to fetch many records
Although you can fake indexes in Azure Tables by creating additional tables that hold a custom-built index, you own the responsibility of maintaining that index, which will slow your operations and possibly create orphan scenarios if you are not careful.
Last but not least, using Azure Tables usually makes sense when you are trying to reduce your storage costs (it is cheaper than SQL Azure) and when you need more storage than what SQL Azure can offer (although you can now use Federations to break the single-database maximum storage limitation). For example, if you need to store 1 billion customer records, using Azure Tables may make sense. But using Azure Tables for increased speed alone is rather suspicious in my mind.
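The 1,000-record page limit mentioned above forces a continuation-token loop on the client. Here is a minimal sketch of that roundtrip pattern, where fetch_page is a hypothetical stand-in for a Table Storage query that returns at most one page of entities plus a continuation token.

```python
# Simulated table of 2,500 entities; names are illustrative.
DATA = [{"RowKey": str(i)} for i in range(2500)]

def fetch_page(token, page_size=1000):
    """Return one page and a continuation token (None on the last page)."""
    start = token or 0
    page = DATA[start:start + page_size]
    next_token = start + page_size if start + page_size < len(DATA) else None
    return page, next_token

def fetch_all():
    """Loop until the service stops handing back a continuation token."""
    results, token = [], None
    while True:
        page, token = fetch_page(token)
        results.extend(page)
        if token is None:  # no continuation token means we are done
            break
    return results

rows = fetch_all()
assert len(rows) == 2500  # gathered across three roundtrips
```

Each iteration of the loop is a separate network roundtrip, which is exactly the latency cost the answer is warning about when many records must be fetched.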
If I were in your shoes, I would question that claim very hard and make sure you have expert SQL development skills on staff who can demonstrate that you are reaching performance bottlenecks inherent to SQL Server/SQL Azure before changing your architecture entirely.
In addition, I would define what your performance objectives are. Are you looking at 100x faster access times? Did you consider caching instead? Are you using indexing properly in your database?
My 2 cents... :)
I won't try to argue about the exact definition of CQRS. As we are talking about Azure, I'll use its docs as a reference. From there we can find that:
CQRS doesn't necessarily require that you use a separate read store.
For greater isolation, you can physically separate the read data from the write data.
"you can" doesn't mean "you must".
About denormalization and read optimization:
Although
The read model of a CQRS-based system provides materialized views of the data, typically as highly denormalized views
the key point is
the read database can use its own data schema that is optimized for queries
It can be a different schema, but it can still be normalized or at least not "highly denormalized". Again - you can, but that doesn't mean you must.
More than that, if your performance is poor due to write locks and not because of heavy SQL requests:
The read store can be a read-only replica of the write store
And when we talk about optimizing requests, it's better to talk more about the requests themselves and less about storage types.
About "it reads from either" [...]
The Materialized View pattern describes generating prepopulated views of data in environments where the source data isn't in a suitable format for querying, where generating a suitable query is difficult, or where query performance is poor due to the nature of the data or the data store.
Here the key point is that views are plural.
A materialized view can even be optimized for just a single query.
...
Materialized views tend to be specifically tailored to one, or a small number of queries
So your choice is not between those 3 options; it is actually much wider.
And again, you don't need separate storage to create views. It can all be done inside a single DB.
About
My gut says the normalized (At least to some extent) data in Table Storage would be the way to go; however I am worried it will reduce the performance gains to conduct 3 queries in order to grab all the data for a user.
Yes, of course, performance will suffer! (Also consider the matter of consistency.) But whether it will be OK or not, you can never be sure until you test it, with your data and your requests, because the delay of the extra data transfers can actually be less than the time required for some elaborate SQL request.
So all boils down to:
What features do you need and which of them Table Storage and/or SQL Azure have?
And then, how much will it cost?
These you can only answer yourself. And these choices have little to do with performance. Because if there is a suitable index in either of those, I believe the performance will be virtually indistinguishable.
To sum up:
SQL Azure or Azure Table Storage?
For different requests and data you can, and probably should, use both. But there is too little information in the question to give you an exact answer (we would need an exact request for that). I agree with @HerveRoggero, though: most probably you should stick with SQL Azure.
I am not sure if I can add any value to other answers, but I want to draw your attention toward modeling the data storage based on your query paths. Are you going to query all the mentioned data bits together? Is the user going to ask for some of it as additional information after a click or something? I am assuming that you have thought about this question already, and you are positive that you want to query everything in one go. i.e., the API or something needs to return all this information at once.
In that case, nothing will beat querying a single object by key. If you are talking about Azure's Table Storage specifically, it says right there that it's a key-value store. I am curious whether you have considered the document database (e.g. Cosmos DB) instead? If you are implementing CQRS read models, you could generate a single document per user that has all information that a user sees on a feed. You query that document by user id, which would be the key. This approach would be the optimal CQRS implementation in my mind because, after all, you are aiming to implement read models. Unless I misinterpreted something in your question or you have strong reasons to not go with document databases.
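The per-user read-model document described above can be sketched as a projection from the normalized write-side data into one denormalized "feed" document per user, keyed by user id. All document shapes here are illustrative, not a real Cosmos DB API.

```python
# Write-side data: conversations with their recipients and messages.
conversations = [
    {"id": "c1", "recipients": ["u1", "u2"], "messages": ["hi", "hello"]},
    {"id": "c2", "recipients": ["u1"], "messages": ["note"]},
]

def build_feed_documents(conversations):
    """Project the write model into one feed document per user."""
    feeds = {}
    for conv in conversations:
        for user in conv["recipients"]:
            doc = feeds.setdefault(user, {"id": user, "conversations": []})
            # Copy everything the feed view needs into the user's document.
            doc["conversations"].append({
                "id": conv["id"],
                "recipients": conv["recipients"],
                "messages": conv["messages"],
            })
    return feeds

feeds = build_feed_documents(conversations)

# One key lookup now returns everything the user's feed needs.
assert [c["id"] for c in feeds["u1"]["conversations"]] == ["c1", "c2"]
assert [c["id"] for c in feeds["u2"]["conversations"]] == ["c1"]
```

The projection would be rerun (or incrementally updated) whenever the write side changes, which is the usual cost of a CQRS read model in exchange for single-key reads.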
