How to denormalize deep hierarchies? - cassandra

I’ve read quite a lot about Cassandra and the art of denormalization and materialization while writing the data. I think I understand the concept, and it seems to make sense. However, I am having some trouble implementing it in scenarios where there is a deep hierarchical data structure.
Consider the contrived domain where:
Owner 1:* Company
Company 1:* Team
Team 1:* Player
Player 1:* Equipment
We have tables for each of these entities, but we would also like to query quickly for equipment attributes by owner, so it seems the thing to do is to create a table (OwnerEquipment) with the owner id and the equipment id as the primary key and the owner id as the partition key. This makes sense, but what if the UX scenarios that add and edit equipment do not include the owner’s id as part of the working set?
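For concreteness, a minimal CQL sketch of such a table might look like this (the non-key attribute columns are assumptions for illustration, not from the original post):

create table owner_equipment (
    owner_id uuid,          -- partition key: all of an owner's equipment lives in one partition
    equipment_id uuid,      -- clustering key
    equipment_type text,    -- assumed denormalized equipment attributes
    condition text,
    primary key (owner_id, equipment_id)
);

With this layout, "equipment for owner X" becomes a single-partition read, which is exactly what the denormalization is meant to buy.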
Most of the denormalization examples I’ve encountered in my research are usually a single level parent-child or master-detail type use case. It seems pretty reasonable that an updating client would have enough information about the immediate parent when updating the child to write the denormalized reverse index, but what if the data you would really like to denormalize by is several “joins” away?
This problem is compounded further in our example when we consider a Company is sold to a different Owner. Assume that the desired behavior is for OwnerEquipment to reflect this change. How should the code that writes this updated Company to the database handle the OwnerEquipment table updates? Should it, knowing the ID of the old owner, try to update all the OwnerEquipment records for that owner? This seems like a very un-Cassandra-y thing to do and also fraught with concurrency issues. The problem gets worse as you move down the chain (Team to new Company, Player to new Team). In these cases the “old owner” is not necessarily in the working set and would need to be read in order to be updated.
Are there some better ways to think about this problem?

This makes sense, but what if the UX scenarios that add and edit equipment do not include the owner’s id as part of the working set?
Easy: pass the owner id along with the equipment id to the UX. The owner id can be a hidden value that is not shown on the interface.
but what if the data you would really like to denormalize by is several “joins” away?
Create as many tables as you need, one for each query use case.
For multiple updates and denormalizations, you can look at the new materialized views feature. Read my blog: www.doanduyhai.com/blog/?p=1930
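As a rough sketch of the materialized-view approach (this assumes the base equipment table already carries an owner_id column, which is precisely the part a deep hierarchy makes hard):

-- assumed base table: equipment(equipment_id primary key, owner_id, equipment_type)
create materialized view equipment_by_owner as
    select owner_id, equipment_id, equipment_type
    from equipment
    where owner_id is not null and equipment_id is not null
    primary key (owner_id, equipment_id);

Cassandra then keeps equipment_by_owner in sync with the base table on writes, so the reverse index is maintained for you at that one level of the hierarchy.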

Related

How to structure relationships in Azure Cosmos DB?

I have two sets of data in the same collection in Cosmos: one is 'posts' and the other is 'users', and they are linked by the posts users create.
Currently my structure is as follows:
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is how fragile it is: code has to enforce the link, and if there's a bug, data will very easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As David said, it's a long discussion but it is a very common one, so since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity) which is something that is needed when you decompose a bigger object into its constituent pieces. Also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational databases, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use a JSON document database, you should leverage the fact that it can store the entire document, and then just store the document ALONG WITH all the owner data: name, surname, or any other data you have about the user who created the document. Yes, I'm saying that you may want to evaluate not having posts and users, but just posts, with the user info inside them. This may actually be very correct, as you will be sure to capture the EXACT data for the user as they existed at the moment of post creation. Say for example I create a post and I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and this is just right, as they have exactly captured reality.
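For example, a fully embedded post might look like the following (a sketch only; the author fields shown here are assumptions, not from the original documents):

// post document with the author's data captured at creation time
{
  id: 'id1',
  author: {
    id: 123,
    name: 'Jane Doe',
    biography: 'X'
  }
}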
Of course you may also want to display a biography on an author page. In that case you'll have a problem: which one will you use? Probably the most recent one.
If all authors, in order to exist in your system, MUST have a blog post published, that may well be enough. But maybe you want an author to be able to write their biography and be listed in your system even before they write a blog post.
In such a case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, then you also need to figure out how to handle the situation described before. When an author updates their own biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let's assume that you really want to have posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (and NoSQL databases in general) DOES NOT OFFER any kind of native support for enforcing referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property so that, before deleting an author, for example, you can efficiently check whether there are any blog posts by him/her that would otherwise be left orphaned.
Another option is to manually create and keep updated ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to understand which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
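A sketch of such a per-author index document (the exact shape is an assumption, mirroring the postIds array from the question):

// one document per author, listing the posts they own;
// kept up to date by a trigger or by the application
{
  id: 'user-123-posts',
  ownerId: 123,
  postIds: ['id1', 'id2']
}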
PERFORMANCE
Performance COULD be an issue, but you don't usually model in order to support performance in the first place. You model in order to make sure your model can represent and store the information you need from the real world, and then you optimize it in order to get decent performance with the database you have chosen to use. As different databases have different constraints, the model is then adapted to deal with those constraints. This is nothing more and nothing less than the good old “logical” vs “physical” modeling discussion.
In the Cosmos DB case, you should avoid queries that go cross-partition, as they are more expensive.
Unfortunately, partitioning is something you choose once and for all, so you really need to be clear about the most common use cases you want to support best. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will only be one if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means that all your values will be spread across those 5 partitions. Each partition will have a top limit of 2,000 RU. If all your queries hit just one partition, your real maximum performance is that 2,000 RU and not 10,000 RU.
I really hope this helps you start to figure out the answer. And I really hope it helps to foster and grow a discussion (how to model for a document database) that I think is really due by now.

CQRS read models in NoSQL (MongoDB)

Hi, it's my first time with DDD/CQRS. I've read multiple sources of knowledge and I'm still a bit confused; maybe someone could help :)
Let's assume a simple case where we have products and clients (possibly different bounded contexts).
A client can buy a product, and they want to see all the products they have purchased.
In this case I realize I need a UserPurchasesView view model with:
purchaseId (which is the Mongo primary key)
userId
product: {id, name, image, shortDescription, [maybe some others]}
price
timestamp
Now ... the problem is that my domain produces an event like UserPurchasedProduct(userId, productId). I could enrich the event with a price, a product name, or maybe something else, but not all the fields. I'm getting to the point where enriching seems to be wrong.
At this point I realize I need something like a ProductDetailsView:
productId (primary key)
price
name
shortDescription
logo
This view is maintained by events like: ProductCreated, ProductRenamed, ProductImageChanged
And now we have 2 options ...
Look into the ProductDetailsView when a UserPurchasedProduct event comes in, take all the needed product details, and save them in the UserPurchasesView for faster reads. This solution doesn't look that bad, but it introduces some extra coupling, and it seems to me these views cannot be scaled well when needed. Also, both views must be rebuilt together when replaying all the events from the event store (rebuilding is also more tricky in that case).
Keep only the productId in the UserPurchasesView and read multiple views when the user queries their purchases. This is some extra processing that would have to be done somewhere: in the frontend, in the backend controller, or in some high-level read-model API. UPDATE: I also realized that I would need to keep at least the price, and maybe the name, of the product in the UserPurchasesView (in case it changes), but sometimes you need the value from the time of the purchase and sometimes you need the current value. The scenario depends on the business, but we could imagine both.
None of these solutions looks perfect to me. Am I wrong, am I missing something or is it just the way to do it? Thanks!
You understand well.
So you have to choose between coupling between the read models and coupling between the UI and individual read models.
One of the main advantages of CQRS/ES is the possibility of creating blazing fast read models (views if you like), without any joins: the perfect cache, as I have seen it called. I have personally chosen the first approach every time, with full data denormalisation. The views are very fast and the models very clean and clear. This is the perfect solution if you want to optimize the read side of your application (and I think you should).
By listening to the right events you can keep these read models in sync with the rest of the application.
There is a 3rd option:
The projection responsible for the UserPurchasesView not only listens to UserPurchasedProduct events, but also to ProductCreated, ProductRenamed, ProductImageChanged - any product-related events that affect the UserPurchasesView. Now, as well as the UserPurchasesView collection it is responsible for, it also needs a private collection to maintain the bits of product data it is interested in ({id, name, image, shortDescription, [maybe some others]}), so that when a new purchase event comes in, you have somewhere to get the initial state of those product fields from. Since your UserPurchasesView needs to listen to some of those product events anyway in order to stay up to date when a product changes, this isn't really much extra work, and it avoids any dependency on another projection (ProductDetailsView). The cross-projection dependency also has a potential problem due to eventual consistency - what if the product isn't even in the product details view yet when the UserPurchasedProduct event comes through?
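A sketch of what a document in that private collection might hold (field names assumed, mirroring the product fields listed earlier):

// private product snapshot kept by the UserPurchasesView projection,
// updated from ProductCreated / ProductRenamed / ProductImageChanged
{
  id: 'product-1',
  name: 'Product name',
  image: 'product.png',
  shortDescription: 'Short description',
  price: 9.99
}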
To avoid any concurrency issues, it's simplest to have each projection managed only by a single process and a single thread. That way, as long as the projection can receive events in-order across streams (so that it is guaranteed to see the product creation before the product purchase), you won't have issues with seeing a purchase before the product exists. If you introduce sharding or any other multi-threading to your projection, it gets more complicated.

What is the correct data model for storing user relationships in Cassandra (e.g. Bob follows John)?

I have a system where actions of users need to be sent to other users who subscribe to those updates. There aren't a lot of users/subscribers at the moment, but it could grow rapidly so I want to make sure I get it right. Is it just this simple?
create table subscriptions (
    person_uuid uuid,
    subscribes_person_uuid uuid,
    primary key (person_uuid, subscribes_person_uuid)
);
I need to be able to look up things in both directions, i.e. answer the questions:
Who are Bob's subscribers?
Who does Bob subscribe to?
Any ideas, feedback, suggestions would be useful.
Those two queries represent the start of your model:
you want the user to be the PK or part of the PK.
Depending on the cardinality of subscriptions/subscribers, you could go with one of the following (see the CQL sketches below):
for low numbers: a single table with two sets
for high numbers: two tables similar to the one you describe
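Rough CQL sketches of both options (table and column names here are my own, for illustration):

-- low cardinality: one row per user, relationships held in two sets
create table user_relationships (
    person_uuid uuid primary key,
    subscribers set<uuid>,
    subscriptions set<uuid>
);

-- high cardinality: one table per query direction
create table subscribers_by_person (
    person_uuid uuid,
    subscriber_uuid uuid,
    primary key (person_uuid, subscriber_uuid)
);

create table subscriptions_by_person (
    person_uuid uuid,
    subscribed_to_uuid uuid,
    primary key (person_uuid, subscribed_to_uuid)
);

"Who are Bob's subscribers?" then reads from subscribers_by_person, and "Who does Bob subscribe to?" reads from subscriptions_by_person, each as a single-partition query.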
@Jacob
Your use case looks very similar to the Twitter example; I modeled it here.
If you want to track both sides of the relationship, you'll need a dedicated table to index them.
Last but not least, depending on whether the users are mutable or not, you can decide to denormalize (e.g. duplicate the user content) or just store user ids and then fetch the user content from a separate table.
I've implemented a simple join feature in Achilles. Have a look if you want to go this way.

Understanding Kohana ORM Relationships

I know this question has been asked a million times, but I can't seem to find one that really gives me a good understanding of how relationships work in Kohana's ORM Module.
I have a database with 5 tables:
approved_submissions
-submission_id
-contents
favorites
-user_id
-submission_id
ratings
-user_id
-submission_id
-rating
users
-user_id
votes
-user_id
-submission_id
-vote
Right now, favorites, ratings, and votes have a primary key that consists of every column in the table, so as to prevent a user favoriting the same submission_id multiple times, a user voting on the same submission_id multiple times, etc. I also believe these fields are set up using foreign keys that reference approved_submissions and users, so as to prevent invalid data existing in the respective fields.
Using the DB module, I can access and update these tables no problem. I really feel as though ORM may offer a more powerful and accessible way to accomplish the same things using less code.
Can you demonstrate how I might update a user voting on a submission_id? A user removing a favorite submission_id? A user changing their rating on a particular submission_id?
Also, do I need to make changes to my database structure or is it okay the way it is?
You're probably looking for has_many_through relationships.
So to add a new submission, you'd do something like
$user->add('submissions', $submission);
and to remove
$user->remove('submissions', $submission);
You may want to consider restructuring your database table and key names so you don't end up doing a lot of configuration.

When is the Data Vault model the right model for a data-warehouse?

I recently found a reference to 'Data Vault Modeling' as a model for data warehouses. The models I've seen before are Inmon and Kimball. The author refers to possible performance problems due to the joins needed. It looks like a nice model, but I wonder about the gotchas. Are there any experience reports online?
We have been using a home-grown modification of Data Vault for a number of years, called 'Link Modelling', which only has entities and links; it draws principles from neo4j, but is implemented in a SQL database.
Both Link Modelling and Data Vault are very different ways of thinking to Kimball/Inmon models.
My comments below relate to a system built with the following structure: a temporary staging database, a DWH, then a number of marts built from the DWH. There are other ways to architect a DWH solution, but this is quite typical.
With Kimball/Inmon
Data is cleaned on the way into the DWH, though sometimes the cleaning is applied on the way into the staging database
Business rules and MDM are (generally) applied between the staging db and the DWH
The marts are often subject area specific
With Data Vault/Link Modelling
Data is landed unchanged in staging
These data are passed through to the DWH also uncleaned, but stored in an entity/link form
Data cleansing, MDM and business rules are applied between the DWH and the marts.
Marts are based on subject area specific needs (same as above).
For us, we would often (but not always) build Kimball star-schema style marts, as end users understand these data structures easily.
The occasions where a Link Modelled DWH comes into its own are the following (using Kimball terminology to express the issues):
Upon occasion, there will be queries from users asking 'why does a specific number have this value?'. In traditional Kimball/Inmon, data is cleansed on the way in, so there is no way to know what the original value was. The Link Model has the original data in the DWH.
When no transaction records exist that link a number of dimensions, but it is still required to report on the full set of data, e.g. to ask questions like 'How many insurance policies that were sold by a particular broker have no claim transactions paid?'.
The application of MDM in a type 2 Kimball or Inmon DWH can cause massive numbers of type 2 change records to be written to dimensions, which often contain all the data values, so there is a lot of duplication of data. With a Link Model/Data Vault, a new dimensional value will just cause new type 2 links to be created in a link table, which only has foreign keys to entity tables. This is often overcome in a Kimball DWH by having a slowly changing dimension and a fast changing dimension, which is a fair workaround.
In insurance and other industries where there is a need to produce 'as at date' reports, fact tables will be slowly changing as well, and tracking type 2 dimensions against type 2 fact records is a nightmare.
From a development point of view, adding a new column to a large Kimball dimension needs to be done carefully, and consideration of back-populating is important, but with a Link Model, adding an extra column to an entity is relatively trivial.
There are always ways around these in Kimball methodology, but they require some careful thought and sometimes some jumping through hoops.
From our perspective, there is little downside to Link Modelling.
I am not connected with any of the companies marketing/producing Kimball/Inmon or Data Vault methodologies.
You can find a whole lot more information on my blog: http://danLinstedt.com, and on the forums at datavaultinstitute dot com
But to give you a quick/brief answer to your question:
The gotchas are as follows:
1) You have to accept the concept of loading raw data into the data warehouse
2) Understand that the Data Vault usually doesn't allow "end-users" direct access because of the model.
There may be a few more, but the benefits outweigh the drawbacks.
Feel free to check out the blog, it's free to register/follow.
Cheers,
Dan Linstedt
