DDD correct the identity of an Entity - domain-driven-design

In DDD, Entities have a value that uniquely identifies them, i.e. the identity. Sometimes this identity is generated by the server, sometimes it is obtained from another BC, sometimes it is provided by the user, and so on. Let's assume we are working in the scenario where the user provides the identity.
Let's pretend that there is a business process done exclusively on paper, and not to be migrated to a computer soon, where the Process Owner decides the new name for a thing called a Resource. The name always follows a fixed schema like PROD-<today's date>-<short random string> and is always validated by Very Important Team Members. The chosen and validated name is PROD-2021-01-04-KAH14564YUDO, the last character being an "O" (the letter) and not a "0" (the number).
Let's say that an operator registers this new Resource in the system, providing the given identity but mistakenly entering the last character as a zero, perhaps because of bad handwriting. The Entity is inserted, some other Entities are linked to it via its identity, and then someone detects the error in the identity. What should happen now?
We know that the identity of an Entity should be unique and immutable, but here it seems we need to correct (and so change) it. Introducing a surrogate identity to avoid this bad insert problem is not correct since the identity provided by the PO and validated by the Very Important Team Members is actually unique and is not to be changed, it was only inserted with an error in the management system; moreover there is no concept in the business of this surrogate identity related to the Resource.
Where is the error in this scenario?

Interesting situation. I'm assuming that you cannot add validation to the Identity as it's just a random string entered by the user.
Let's start with the question "Where is the error in this scenario?" The error is an error-prone mechanism or workflow in the Domain that you are in. When you build a system for a specific domain, you will have to deal with nasty stuff from that domain. You just have to catch these nasty rules or mechanisms as early as possible and design the system in a way that handles them.
Let's see how you can handle this scenario.
One thing you can do is to use another ID (an auto-generated GUID, for example) that is used by the system and hidden from the users. You can use it to link other entities to this one. This way, if an error is detected in the Identity entered by the user, you won't have to update the whole system, as the Identity will not be used anywhere else. If you need it in another part of the system, you just run a query by GUID to get the Identity back. This ensures that the ID used for linking is indeed immutable. Depending on the system, this may be a good solution, or it may complicate some parts of it; it's not always viable.
If using another ID only for system use is not an option, then you just have to design the system in a way that handles these situations. You will have to include updating the user-provided Identity as a use case, and add handling of Identity updates to every part of the system that uses this Identity. In some cases these errors will have nasty consequences. One example is if this Identity has been sent to another system, or to a person by e-mail, and is already used somewhere that your system has no control over. In this case it's not the system's fault; the fault is in the Domain and the people who use it. The only way to fix this is to change the rules and mechanisms in the Domain. This isn't possible most of the time, but sometimes you can raise the issue and a more robust mechanism can be implemented. It's a nasty situation to be in, but that's life.
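A minimal sketch of that idea, assuming an in-memory repository. The names Resource, Attachment, ResourceRepository, internal_id and correct_name are hypothetical and only illustrate the shape: other entities link through the hidden id, so correcting the user-provided name touches exactly one record.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Resource:
    # Business identity as written on paper; may need correcting after a typo.
    name: str
    # Hypothetical system-only identifier, never shown to users.
    internal_id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class Attachment:
    # Links to the Resource through the internal id, not through the name.
    resource_id: uuid.UUID
    description: str


class ResourceRepository:
    def __init__(self):
        self._by_id = {}  # internal_id -> Resource

    def _name_taken(self, name):
        return any(r.name == name for r in self._by_id.values())

    def add(self, resource):
        if self._name_taken(resource.name):
            raise ValueError(f"name already registered: {resource.name}")
        self._by_id[resource.internal_id] = resource

    def get(self, internal_id):
        return self._by_id[internal_id]

    def correct_name(self, internal_id, corrected_name):
        # The business name stays unique, but it is allowed to be corrected.
        if self._name_taken(corrected_name):
            raise ValueError(f"name already registered: {corrected_name}")
        self._by_id[internal_id].name = corrected_name
```

With this shape, an Attachment created before the correction still resolves to the same Resource afterwards, because it never stored the mistyped name.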
Example for using natural keys / identity instead of GUID.
If you have a network of systems that operate with natural keys, and the generation of those keys is robust, you can use them. For example, banking systems use the International Bank Account Number (IBAN). These numbers are generated by a dedicated, robust scheme that includes check digits. They are not just some random string entered by a user. Here the Domain has a robust mechanism that ensures that these natural keys are valid. In such a network it is practically impossible to send a GUID to another banking system and have it exchanged for an IBAN.
When sending money to a bank account, the IBAN is validated, so errors are easily detected. This way a person cannot send money to a non-existent bank account and lose it just because of a typo.
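As an illustration of why such typos are caught, here is a sketch of the IBAN check-digit test (the mod-97 check). It only verifies the checksum; country-specific lengths and formats are deliberately left out, and the function name is an assumption.

```python
def iban_checksum_ok(iban: str) -> bool:
    """Return True if the IBAN passes the mod-97 check-digit test."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits (first four characters) to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the scheme requires.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

For example, the commonly cited sample IBAN "GB82 WEST 1234 5698 7654 32" passes, while the same string with its last digit changed fails. A free-form name like the one in the question has no such redundancy, so a wrong character cannot be detected by the system.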

If you cannot fix the database, then fix the paper and make sure it never happens again e.g. use hex characters only.

Related

CQRS to command or not to, that is the question

I am new to CQRS, but can see the value in this, so I am trying to apply this to a financial system that we are busy rebuilding.
Like I mentioned, this is a basic fin system with basic balance, withdraw, deposit like functionality.
I have withdraw & deposit commands. But I am struggling with balance.
According to the domain experts, they want to handle a balance inquiry as a transaction, with no financial implication (yet), on the client's behalf. So, when the client does a balance inquiry via the device, it creates a transaction, but also a balance query at the same time.
In the CQRS world, you distinguish between commands, which mutate state, and queries, which retrieve data in some way.
Apologies if my understanding here is flawed. Can someone point me in the correct direction?
EDIT:
Maybe let me put it this way. I was thinking of creating a CheckBalanceCommand that creates a transaction & inserts a BalanceCheckedEvent into the store. But then I would also need to create a CheckBalanceQuery to retrieve the actual balance from the read db.
I would need to invoke both in order to satisfy the balance request.
This is an interesting issue. Your business case is valid: some commands don't mutate aggregate/entity state, yet handling them and their resultant events is still important (e.g. for audit trails).
In order to support these cases, I'd introduce a base event type named IdentityEvent (inspired by identity values for various mathematical operators and as a justification for the concept; operating them on a certain value doesn't change it). On issuing the corresponding command, derivatives of this event (e.g. BalanceCheckedEvent in your case) will be appended to the aggregate's event stream and view projection may construct views from them as usual; however, their mutate method will not perform any actual mutation while reconstructing entities from event stream.
The actual command processing takes place at the domain layer. One of your application services, at the application layer, receives the query request and processes it as usual. Additionally, before or after the query operation, the same application service may issue the command to the domain layer, on the aggregate root itself. That doesn't violate any principle: your command and query models are still separate, and the application service is just coordinating between the two.
This is not as rare as you would imagine. An additional valid business case is when a service provider runs a credit check on someone. Credit reporting companies actually store queries made against one's credit score, and use them to influence future credit scores. Of course, when I say that this isn't as rare as we imagine, I'm not attempting to normalize such practices (and we should push back to understand the real value something like this is offering to our product).
What I suggest, though, is to model this explicitly and not try to generalize it. This feature is probably driven by some business need, and you should model it as such. By this I mean that you should treat the service serving the reads as a separate service entirely, which can raise its own events for things that have happened, and design the rest of the system in a reactive way (i.e. responding to events generated by another BC/service).
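The approach described above could be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: IdentityEvent and BalanceCheckedEvent come from the discussion, while DepositedEvent, WithdrawnEvent, Account and handle_balance_inquiry are hypothetical names, and the event store is reduced to a plain list.

```python
from dataclasses import dataclass
from decimal import Decimal


class IdentityEvent:
    """Marker base type: events that are recorded but never change state."""


@dataclass(frozen=True)
class DepositedEvent:
    amount: Decimal


@dataclass(frozen=True)
class WithdrawnEvent:
    amount: Decimal


@dataclass(frozen=True)
class BalanceCheckedEvent(IdentityEvent):
    device_id: str


class Account:
    def __init__(self):
        self.balance = Decimal("0")
        self.uncommitted = []  # new events waiting to be appended to the store

    @classmethod
    def from_history(cls, events):
        account = cls()
        for event in events:
            account._mutate(event)
        return account

    def _mutate(self, event):
        if isinstance(event, IdentityEvent):
            return  # audit-only event: nothing to change
        if isinstance(event, DepositedEvent):
            self.balance += event.amount
        elif isinstance(event, WithdrawnEvent):
            self.balance -= event.amount
        else:
            raise TypeError(f"unknown event: {event!r}")

    def check_balance(self, device_id):
        """Command handler: records the inquiry without touching the balance."""
        event = BalanceCheckedEvent(device_id)
        self._mutate(event)
        self.uncommitted.append(event)


def handle_balance_inquiry(account, device_id, read_balance):
    """Application service: issue the command, then run the query."""
    account.check_balance(device_id)
    return read_balance()
```

Replaying a history that contains a BalanceCheckedEvent yields the same balance as replaying it without one, which is the point of treating it as an identity event. In a real system read_balance would go to the read database rather than the aggregate.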
As an example, you could have the service which serves the query fire a BalanceChecked event, which either the same service or another one could store in a stream for subsequent processing.
I would not suggest a command, because if you'll be replying with the data it's not as if someone can reject the command; it has already happened, someone already has the data.

CQRS/Event Sourcing - Does one expect to receive an Aggregate Id from the user/request?

I am currently just trying to learn some new programming patterns and I decided to give event sourcing a shot.
I have decided to model a warehouse as my aggregate root in the domain of shipping/inventory, where the number of warehouses is generally pretty constant (i.e. a company won't be adding warehouses too often).
I have run into the question of how to set my aggregateId, which should correspond to a warehouse, on my server. Most examples I have seen, including this one, show the aggregate ID being generated server side when a new aggregate is being created (in my case a warehouse), and then passed in the command request when referring to that aggregate for subsequent commands.
Would you say this is the correct approach? Can I expect the user to know and pass aggregate Ids when issuing commands? I realize this is probably domain dependent and could also be a UI/UX choice as well, just wondering what others have done. It would make more sense to me if my event sourced aggregates were created more frequently, such as with meal tabs or shopping carts.
Thanks!
Heuristic: aggregate id, in many cases, is analogous to the primary key used to distinguish entities in a database table. Many of the lessons of natural vs surrogate keys apply.
Can I expect the user to know and pass aggregate Ids when issuing commands?
You probably can't depend on the human to know the aggregate ids. But the client that the human operator is using can very well know them.
For instance, if an operator is going to be working in a single warehouse during a session, then we might look up the appropriate identifier, cache it, and use it when constructing messages on behalf of the user.
Analog: when you fill in a web form and submit it, the browser does the work of looking at the form action and using that information to construct the correct URI, and similarly the correct HTTP Request.
The client will normally know what the ID is, because it just got it during a previous query.
Creation patterns are weird. It can, in some circumstances, make sense for the client to choose the identifier to be used when creating a new aggregate. In others, it makes sense for the client to provide an identifier for the command message, and the server decides for itself what the aggregate identifier should be.
It's messaging, so you want to be careful about coupling the client directly to your internal implementation details -- especially if that client is under a different development schedule. If you get the message contract right, then the server and client can evolve in any way consistent with the contract at any time.
You may want to review Greg Young's 10 year retrospective, which includes a discussion of warehouse systems. TL;DR - in many cases the messages coming from the human operators are events, not commands.
Would you say this is the correct approach?
You're asking if one of Greg Young's Event Sourcing samples represents the correct approach... Given that the combination of CQRS and Event Sourcing was essentially (re)invented by Greg, I'd say there's a pretty good chance of that.
In general, letting the code that implements the Command-side generate a GUID for every Command, Event, or other persistent object that it needs to write is by far the simplest implementation, since GUIDs are, for practical purposes, unique. In a distributed system, uniqueness without coordination is a big thing.
Can I expect the user to know and pass aggregate Ids when issuing commands?
No, and you particularly can't expect a user to know the GUID of their assets. What you may be able to do is to present the user with a list of his or her assets. Each item in the list will have the GUID associated, but it may not be necessary to surface that ID in the user interface. It's just data that the underlying UI object carries around internally.
In some cases, users do need to know the ID of some of their assets (e.g. if it involves phone support). In that case, you can add a lookup API to address that concern.

Visible User ID in Address Bar

Currently, to pass a user id to the server on certain views I use the raw user id.
http://example.com/page/12345 //12345 Being the users id
Although there is no real security risk in my specific application in exposing this data, I can't help but feel a little dirty about it. What is the proper solution? Should I somehow be disguising the data?
Maybe a better way to propose my question is to ask what the standard approach is. Is it common for applications to use user id's in plain view if it's not a security risk? If it is a security risk how is it handled? I'm just looking for a point in the right direction here.
There's nothing inherently wrong with that. Lots of sites do it. For instance, Stack Overflow users can be enumerated using URLs of the form:
http://stackoverflow.com/users/123456
Using a normalized form of the user's name in the URL, either in conjunction with the ID or as an alternative to it, may be a nicer solution, though, e.g:
http://example.com/user/yourusername
http://example.com/user/12345/yourusername
If you go with the former, you'll need to ensure that the normalized username is set up as a unique key in your user database.
If you go with the latter, you've got a choice: if the normalized username in the database doesn't match the one in the URL, you can either redirect to the correct URL (like Stack Overflow does), or return a 404 error.
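A small sketch of the "ID plus username" variant with the redirect choice, assuming a lookup table from user id to normalized username. The function name, the table and the returned (status, location) tuple are hypothetical and not tied to any web framework.

```python
def resolve_user_url(user_id, slug_in_url, usernames_by_id):
    """Decide how to answer a request for /user/<user_id>/<slug_in_url>."""
    username = usernames_by_id.get(user_id)
    if username is None:
        return 404, None  # no such user at all
    canonical = f"/user/{user_id}/{username}"
    if slug_in_url != username:
        # The ID is authoritative; a stale or mistyped name is redirected.
        return 301, canonical
    return 200, canonical
```

Returning 404 on a mismatch instead of redirecting is a one-line change; the redirect is friendlier to old links after a username change.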
In addition to duskwuff's great suggestion to use the username instead of the ID itself, you could use UUIDs instead of integers. They are 128-bit in length so infeasible to enumerate, and also avoid disclosing exactly how many users you have. As an added benefit, your site is future proofed against user id limits if it becomes massively popular.
For example, with integer ids, an attacker could find out the largest user_id on day one, and come back in a week's or a month's time and find what the largest user_id is now. They can continually do this to monitor the rate of growth on your site - perhaps not a biggie for your example - but many organisations consider this sort of information commercially sensitive. It also helps avoid social engineering, e.g. it makes it significantly harder for an attacker to email you asking to reset their password "because I've changed email providers and I've forgotten my old password but I remember my user id!". Give an attacker an inch and they'll run a mile.
I prefer to use Version/Type 4 (random) UUIDs; however, you could also use Version/Type 5 (SHA-1-based), so you could call something like UUID.fromName(12345) and get a UUID derived from the integer value, which is useful if you want to migrate existing data and need to update a bunch of foreign key values. Most major languages support UUIDs natively or through popular libraries (C & C++), although some database software might require some tweaking - I've used them with Postgres and MySQL, and both were easy transitions.
The downside is UUIDs are significantly longer and not memorable, but it doesn't sound like you need the ability for the user to type in the URLs manually. You do also need to check if the UUID already exists when creating a user, and if it does, just keep generating until an unused UUID is found - in practice given the size of the numbers, using Version 4 Random UUIDs you will have a better chance at winning the lottery than dealing with a collision, so it's not something that will impact performance etc.
Example URL: http://example.com/page/4586A0F1-2BAD-445F-BFC6-D5667B5A93A9
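For illustration, both variants using Python's standard uuid module. The namespace value and the function names are assumptions made up for this sketch.

```python
import uuid

# Hypothetical application namespace for name-based (version 5) UUIDs.
USER_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "users.example.com")


def new_user_uuid() -> uuid.UUID:
    """Version 4: random, not derivable from anything else."""
    return uuid.uuid4()


def migrated_user_uuid(legacy_id: int) -> uuid.UUID:
    """Version 5: always the same UUID for the same legacy integer id."""
    return uuid.uuid5(USER_NAMESPACE, str(legacy_id))
```

Note that a version 5 UUID is deterministic: anyone who knows the namespace can recompute it from the integer, so it helps with migration but does not make ids any harder to enumerate. For that property the random version 4 is the one to use.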

API design and security: Why hide internal ids?

I've heard a few people say that you should never expose your internal ids to the outside world (for instance an auto_increment'ng primary key).
Some suggest having some sort of uuid column that you use instead for lookups.
I'm wondering really why this would be suggested and if it's truly important.
Using a uuid instead is basically just obfuscating the id. What's the point? The only thing I can think of is that auto_incrementing integers obviously point out the ordering of my db objects. Does it matter if an outside user knows that one thing was created before/after another?
Or is it purely that obfuscating the ids would prevent "guessing" at different operations on specific objects?
Is this even an issue I should be thinking about when designing an externally facing API?
Great answers, I'll add another reason to why you don't want to expose your internal auto incremented ID.
As a competitive company I can easily instrument how many new users/orders/etc you get every week/day/hour. I just need to create a user and/or order and subtract the new ID from what I got last time.
So not only for security reasons, it's business reasons as well.
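The arithmetic involved is trivial, which is the point. A sketch with made-up numbers and a hypothetical function name:

```python
def estimated_daily_growth(earlier_id: int, later_id: int, days_between: float) -> float:
    """Estimate new records per day from two observed auto-increment ids."""
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    return (later_id - earlier_id) / days_between
```

Signing up once and receiving id 10000, then again a week later and receiving id 10700, suggests roughly 100 new records per day. Gaps in the sequence (rollbacks, deletions) make this an upper bound rather than an exact count.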
Any information that you provide a malicious user about your application and its layout can and will be used against your application. One of the problems we face in (web) application security is that seemingly innocuous design decisions taken in the infancy of a project become Achilles' heels when the project scales up. Letting an attacker make informed guesses about the ordering of entities can come back to haunt you in the following, somewhat unrelated ways:
The ID of the entity will inevitably be passed as a parameter at some point in your application. This will result in hackers eventually being able to feed your application arguments they ordinarily should not have access to. I've personally been able to view order details (on a very popular retailer's site) that I had no business viewing, as a URL argument no less. I simply fed the app sequential numbers from my own legitimate order.
Knowing the limits, or at least the progression, of primary key field values is invaluable fodder for SQL injection attacks, the scope of which I can't cover here.
Key values are used not only in RDBMS systems, but also in other key-value mapping systems. Imagine if the JSESSION_ID cookie order could be predetermined or guessed. Everybody with opposable thumbs would be replaying sessions in web apps.
And many more that I'm sure other people here will come up with.
SEAL team 6 doesn't necessarily mean there are 6 seal teams. Just keeps the enemy guessing. And the time spent guessing by a potential attacker is more money in your pocket any way you slice it.
As with many security-related issues, it's a subtle answer - kolossus gives a good overview.
It helps to understand how an attacker might go about compromising your API, and how many security breaches occur.
Most security breaches are caused by bugs or oversights, and attackers look for those. An attacker who is trying to compromise your API will firstly try to collect information about it - as it's an API, presumably you publish detailed usage documentation. An attacker will use this document, and try lots of different ways to make your site crash (and thereby expose more information, if he's lucky), or react in ways you didn't anticipate.
You have to assume the attacker has lots of time, and will script their attack to try every single avenue - like a burglar with infinite time, who goes around your house trying every door and window, with a lock pick that learns from every attempt.
So, if your API exposes a method like getUserInfo(userid), and userid is an integer, the attacker will write a script to iterate from 0 upwards to find out how many users you have. They'll try negative numbers, and max(INT) + 1. Your application could leak information in all those cases, and - if the developer forgot to handle certain errors - may expose more data than you intended.
If your API includes logic to restrict access to certain data - e.g. you're allowed to execute getUserInfo for users in your friend list - the attacker may get lucky with some numbers because of a bug or an oversight, and they'll know that the info they are getting relates to a valid user, so they can build up a model of the way your application is designed. It's the equivalent of a burglar knowing that all your locks come from a single manufacturer, so they only need to bring that lock pick.
By itself, this may be of no advantage to the attacker - but it makes their life a tiny bit easier.
Given the effort of using UUIDs or another meaningless identifier, it's probably worth making things harder for the attacker. It's not the most important consideration, of course - it probably doesn't make the top 5 things you should do to protect your API from attackers - but it helps.

Are MongoDB ids guessable?

If you bind an api call to the object's id, could one simply brute force this api to get all objects? If you think of MySQL, this would be totally possible with incremental integer ids. But what about MongoDB? Are the ids guessable? For example, if you know one id, is it easy to guess other (next, previous) ids?
Thanks!
Update Jan 2019: As mentioned in the comments, the information below is true up until version 3.2. Version 3.4+ changed the spec so that machine ID and process ID were merged into a single random 5 byte value instead. That might make it harder to figure out where a document came from, but it also simplifies the generation and reduces the likelihood of collisions.
Original Answer:
+1 for Sergio's answer, in terms of answering whether they could be guessed or not, they are not hashes, they are predictable, so they can be "brute forced" given enough time. The likelihood depends on how the ObjectIDs were generated and how you go about guessing. To explain, first, read the spec here:
Object ID Spec
Let us then break it down piece by piece:
TimeStamp - completely predictable as long as you have a general idea of when the data was generated
Machine - this is an MD5 hash of one of several options, some of which are more easily determined than others, but highly dependent on the environment
PID - again, not a huge number of values here, and could be sleuthed for data generated from a known source
Increment - if this is a random number rather than an increment (both are allowed), then it is less predictable
To expand a bit on the sources. ObjectIDs can be generated by:
MongoDB itself (but can be migrated, moved, updated)
The driver (on any machine that inserts or updates data)
Your Application (you can manually insert your own ObjectID if you wish)
So, there are things you can do to make them harder to guess individually, but without a lot of forethought and safeguards, for a normal data set, the ranges of valid ObjectIDs should be fairly easy to work out since they are all prefixed with a timestamp (unless you are manipulating this in some way).
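To make the timestamp point concrete: the first 4 bytes of an ObjectId are a big-endian count of seconds since the Unix epoch, so the creation time can be read straight out of the 24-character hex form. A sketch without any MongoDB driver; the function name is an assumption.

```python
from datetime import datetime, timezone


def objectid_created_at(object_id_hex: str) -> datetime:
    """Extract the creation time embedded in an ObjectId's hex string."""
    if len(object_id_hex) != 24:
        raise ValueError("an ObjectId is 24 hex characters (12 bytes)")
    # The first 8 hex characters are the 4-byte timestamp in seconds.
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

For the id 507f1f77bcf86cd799439011 this gives 2012-10-17 21:13:27 UTC. Knowing one id therefore narrows down the prefix of ids created around the same time, which is why the remaining bytes are all that stands between an attacker and a neighbouring document.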
Mongo's ObjectIds were never meant to be a protection from brute force attack (or any attack, for that matter). They simply offer global uniqueness. You should not assume that some object can't be accessed by a user because this user should not know its id.
For an actual protection of your resources, employ other techniques.
To defend against unauthorized access, place some authorization logic in your app (allow access to legitimate users, deny it to everyone else).
If you want to hinder dumping all objects, use some kind of rate limiting. Combine with authorization if applicable.
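Both measures can be sketched together. This is a simplified, in-memory illustration under assumptions: FixedWindowLimiter, get_document, the documents mapping and the (status, body) return shape are all hypothetical, and a production system would typically keep the counters in a shared store.

```python
import time


class FixedWindowLimiter:
    """Allow at most max_requests per client in each time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}  # client id -> (window start, request count)

    def allow(self, client_id):
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the old window expired, start a new one
        if count >= self.max_requests:
            self._windows[client_id] = (start, count)
            return False
        self._windows[client_id] = (start, count + 1)
        return True


def get_document(doc_id, user_id, documents, limiter):
    """Return (status, document) for a lookup by id."""
    if not limiter.allow(user_id):
        return 429, None
    doc = documents.get(doc_id)
    # Same answer for "does not exist" and "not yours", so that probing
    # ids reveals nothing about which ones are valid.
    if doc is None or doc["owner"] != user_id:
        return 404, None
    return 200, doc
```

With checks like these in place, a guessable id only gets an attacker to the front door; whether the id is an integer, an ObjectId or a UUID then matters much less.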
Optional reading: Eric Lippert on GUIDs.
