What strategies exist to find unreachable keys in a key/value database? - domain-driven-design

TL;DR
How can you find "unreachable keys" in a key/value store with a large amount of
data?
Background
In comparison to relational databases that provide ACID guarantees, NoSQL
key/value databases provide fewer guarantees in order to handle "big data".
For example, they only provide atomicity in the context of a single key/value
pair, but they use techniques like distributed hash tables to "shard" the data
across an arbitrarily large cluster of machines.
Keys are often unfriendly for humans. For example, a key for a blob of data
representing an employee might be
Employee:39045e87-6c00-47a4-a683-7aba4354c44a. The employee might also have a
more human-friendly identifier, such as the username jdoe with which the
employee signs in to the system. This username would be stored as a separate
key/value pair, where the key might be EmployeeUsername:jdoe. The value for
key EmployeeUsername:jdoe is typically either an array of strings containing
the main key (think of it like a secondary index, which does not necessarily
contain unique values) or a denormalised version of the employee blob (perhaps
aggregating data from other objects in order to improve query performance).
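For concreteness, a minimal sketch of those two pairs (the dict-like store and the serialization format are purely illustrative):

```python
kv = {}  # stand-in for the key/value store

# Primary pair: opaque key -> serialized employee blob
kv["Employee:39045e87-6c00-47a4-a683-7aba4354c44a"] = (
    '{"username": "jdoe", "name": "Jane Doe"}'
)

# Secondary pair: human-friendly key -> list of primary keys
# (a secondary index, not necessarily unique)
kv["EmployeeUsername:jdoe"] = '["Employee:39045e87-6c00-47a4-a683-7aba4354c44a"]'
```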
Problem
Now, given that key/value databases do not usually provide transactional
guarantees, what happens when a process inserts the key
Employee:39045e87-6c00-47a4-a683-7aba4354c44a (along with the serialized
representation of the employee) but crashes before inserting the
EmployeeUsername:jdoe key? The client does not know the key for the employee
data - he or she only knows the username jdoe - so how do you find the
Employee:39045e87-6c00-47a4-a683-7aba4354c44a key?
The only thing I can think of is to enumerate the keys in the key/value store
and once you find the appropriate key, "resume" the indexing/denormalisation.
I'm well aware of techniques like event sourcing, where an idempotent event
handler could respond to the event (e.g., EmployeeRegistered) in order to
recreate the username-to-employee-uuid secondary index, but using event
sourcing over a key/value store still requires enumeration of keys, which could
degrade performance.
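A minimal sketch of that brute-force recovery, assuming a dict-like store whose keys can be enumerated (a real database would page through keys with some scan/iterate API instead):

```python
import json

def rebuild_username_index(kv):
    """Recreate any missing EmployeeUsername:* entry by scanning every Employee:* key."""
    for key in list(kv.keys()):
        if not key.startswith("Employee:"):
            continue
        employee = json.loads(kv[key])
        index_key = "EmployeeUsername:" + employee["username"]
        if index_key not in kv:
            # The "unreachable" case: the blob exists but the index entry was never written.
            kv[index_key] = json.dumps([key])
```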
Analogy
The more experience I have in IT, the more I see the same problems being
tackled in different scenarios. For example, Linux filesystems store both file
and directory contents in "inodes". You can think of these as key/value pairs,
where the key is an integer and the value is the file/directory contents. When
writing a new file, the system creates an inode and fills it with data THEN
modifies the parent directory to add the "filename-to-inode" mapping. If the
system crashes after creating the file but before referencing it in the parent
directory, your file "exists on disk" but is essentially unreadable. When the
system comes back online, hopefully it will place this file into the
"lost+found" directory (I imagine it does this by scanning the entire disk).
There are plenty of other examples (such as domain name to IP address mappings
in the DNS system), but I specifically want to know how the above problem is
tackled in NoSQL key/value databases.
EDIT
I found this interesting article on manual secondary indexes, but it doesn't address how to detect or repair "broken" or "dated" secondary indexes.

The solution I've come up with is to use a process manager (or "saga"),
whose key contains the username. This also guarantees username uniqueness
across employees during registration. (Note that I'm using a key/value store
with compare-and-swap (CAS) semantics for concurrency control.)
1. Create an EmployeeRegistrationProcess with a key of EmployeeRegistrationProcess:jdoe.
   - If a concurrency error occurs (i.e., the registration process already exists), then this is a duplicate username.
2. When started, the EmployeeRegistrationProcess allocates an employee UUID and attempts to create an Employee object using this UUID (e.g., Employee:39045e87-6c00-47a4-a683-7aba4354c44a).
   - If the system crashes after starting the EmployeeRegistrationProcess but before creating the Employee, we can still locate the "employee" (or, more accurately, the employee registration process) by the username "jdoe" and resume the "transaction".
   - If there is a concurrency error (i.e., an Employee with the generated UUID already exists), the EmployeeRegistrationProcess can flag itself as "in error", "for review", or whatever handling we decide is best.
3. After the Employee has successfully been created, the EmployeeRegistrationProcess creates the secondary index EmployeeUsernameToUuid:jdoe -> 39045e87-6c00-47a4-a683-7aba4354c44a.
   - Again, if this fails, we can still locate the "employee" by the username "jdoe" and resume the transaction.
   - And again, if there is a concurrency error (i.e., the EmployeeUsernameToUuid:jdoe key already exists), the EmployeeRegistrationProcess can take appropriate action.
4. When both commands have succeeded (the creation of the Employee and the creation of the secondary index), the EmployeeRegistrationProcess can be deleted.
At all stages of the process, Employee (or
EmployeeRegistrationProcess) is reachable via its human-friendly
identifier "jdoe". Event sourcing the EmployeeRegistrationProcess is
optional.
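A minimal sketch of those steps, assuming a dict-like store with a CAS-style "insert if absent" primitive; all names are illustrative and the resumption/error-flagging logic is omitted:

```python
import json
import uuid

class DuplicateKey(Exception):
    pass

def put_if_absent(kv, key, value):
    """CAS-style insert: fails if the key already exists."""
    if key in kv:
        raise DuplicateKey(key)
    kv[key] = value

def register_employee(kv, username, name):
    # 1. Create the process manager; its key contains the username, so a
    #    duplicate username fails immediately with a concurrency error.
    employee_id = str(uuid.uuid4())
    process_key = f"EmployeeRegistrationProcess:{username}"
    put_if_absent(kv, process_key, json.dumps({"employee_id": employee_id}))

    # 2. Create the Employee blob under its UUID key.
    put_if_absent(kv, f"Employee:{employee_id}",
                  json.dumps({"username": username, "name": name}))

    # 3. Create the secondary index from username to UUID.
    put_if_absent(kv, f"EmployeeUsernameToUuid:{username}", json.dumps([employee_id]))

    # 4. Both commands succeeded: the process manager can be deleted.
    del kv[process_key]
```

A recovery job (or the next request for "jdoe") that finds a lingering EmployeeRegistrationProcess:jdoe key can read back the stored employee_id and resume from whichever step did not complete.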
Note that using a process manager can also help in enforcing uniqueness
across usernames after registration. That is, we can create an
EmployeeUsernameChangeProcess object with a key containing the new
username. "Enforcing" uniqueness at either registration or username
change hurts scalability, so the value identified by
"EmployeeUsernameToUuid:jdoe" could be an array of employee UUIDs.

If you look at the question from the point of view of event-sourced entities, the event store's responsibility is to guarantee that an event is saved to storage and published to the bus. From that point of view, an event is guaranteed to be written completely, and because the store is append-only there will never be a problem with an invalid event.
At the same time, of course, it is not guaranteed that every command that generates events will be executed successfully - you can only guarantee ordering and protection against repeated execution of the same command, not the whole transaction.
What happens next is this: the saga intercepts the original command and tries to execute the whole transaction. If any part of the transaction ends in an error, or, for example, does not complete within a preset time, the process is rolled back by generating so-called compensating events. Such events cannot delete an entity, but they bring the system back to a consistent state, as if the command had never been executed.
Note: if your particular event-store implementation sits on a key/value database that can only guarantee the write of a single pair, just serialize the event and use the combination of the aggregate root's identifier and version as the key. The aggregate version then acts, in a sense, as an analogue of a CAS operation.
For more on handling concurrency, see this article: http://danielwhittaker.me/2014/09/29/handling-concurrency-issues-cqrs-event-sourced-system/
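A minimal sketch of the keying scheme from that note, where the key combines the aggregate identifier and version and an already-existing key plays the role of a failed CAS (names are illustrative):

```python
import json

class ConcurrencyError(Exception):
    pass

def append_event(kv, aggregate_id, expected_version, event):
    """Write one event under (aggregate_id, next version); if that key already
    exists, another writer got there first - the CAS-analogue failure."""
    key = f"{aggregate_id}:{expected_version + 1}"
    if key in kv:
        raise ConcurrencyError(key)
    kv[key] = json.dumps(event)
```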

Related

How to check if multiple keys exist in an EntityKind without also fetching the data at the same time?

I am using Cloud Firestore in Datastore mode. I have a list of keys of the same Kind, some exist already and some do not. For optimal performance, I want to run a compute-intensive operation only for the keys that do not yet exist. Using the Python client library, I know I can run client.get_multi() which will retrieve the list of keys that exist as needed. The problem is this will also return unneeded Entity data associated with existing keys, increasing the latency and cost of the request.
Is there a better way to check for existence of multiple keys?
You could check whether a key exists using keys-only queries as they return only the keys instead of the entities themselves, at lower latency and cost than retrieving entire entities.
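For example, a sketch with the google-cloud-datastore Python client (the kind and key names are illustrative; for a very large Kind you would add a filter on __key__ rather than scanning everything):

```python
from google.cloud import datastore

client = datastore.Client()

# Keys-only query: only keys come back, at lower latency and cost than full entities.
query = client.query(kind="EntityKind")
query.keys_only()
existing_names = {entity.key.id_or_name for entity in query.fetch()}

wanted = ["key-a", "key-b", "key-c"]
missing = [name for name in wanted if name not in existing_names]
# Run the compute-intensive operation only for `missing`.
```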

Possibility of GUID collision in MS CRM Data migration

We are doing a CRM data migration in order to keep two CRM systems in sync, while removing historical data from the primary CRM. The target CRM was created with the source as its base. While migrating data we keep the GUIDs of records the same in order to maintain data integrity, so the migration expects each GUID to still be available for assignment in the target system. No new records are created directly in the target system except emails, and those in very low numbers. However, the platform also creates GUIDs of its own: for example, when we move a newly created entity to the target via a Solution, the entity and attribute GUIDs are not preserved and new ones are generated, which we cannot control. Some internally created records are likewise created by the platform and assigned new GUIDs. Since we do not control GUID creation in the target system (even though the numbers are small), I fear the situation where the source system holds a GUID that the target has already consumed, which would cause errors during data migration.
My question: is there any possibility that the above can happen? If it happens to us, the whole migration solution will lose its value.
SQL Server's NEWID() generates a 128-bit ID. All IDs generated on the same machine are guaranteed to be unique but because yours have been generated across multiple machines, there's no guarantee.
That being said, from this source on GUIDs:
...for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
So the answer is yes, there is a chance of collision, but it's so astronomically low that most consider the answer to effectively be no.
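A back-of-the-envelope check of that figure, using the birthday-bound approximation for the 122 random bits in a version 4 UUID:

```python
# p ≈ n^2 / (2 * 2^122) for n randomly generated 122-bit values
n = 103e12                                   # 103 trillion UUIDs
p = n ** 2 / (2 * 2 ** 122)
print(f"collision probability ≈ {p:.1e}")    # ≈ 1.0e-09, about one in a billion
```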

disadvantages of generating keys at client side

When inserting documents, if the key is generated client-side, does it slow down writes on a single machine or in a cluster?
I ask because I think server-side generated keys are sure to be unique and don't need to be checked for uniqueness.
However, what are the disadvantages or things to remember when generating keys on the client side (on a single machine, with sharding, or with the master-master replication that is coming)?
Generating keys on the client-side should not have any notable performance impact for ArangoDB. ArangoDB will parse the incoming JSON anyway, and will always look for a _key attribute in it. If it does not exist, it will create one itself. If it exists in the JSON, it will be validated for syntactic correctness (because only some characters are allowed inside document keys). That latter operation only happens when a _key value is specified in the JSON, but its impact is very likely negligible, especially when compared to the other things that happen when documents are inserted, such as network latency, disk writes etc.
Regardless of whether a user-defined _key value was specified or not, ArangoDB will check the primary index of the collection for a document with the same key. If it exists, the insert will fail with a unique key constraint violation. If it does not exist, the insert will proceed. As mentioned, this operation will always happen. Looking for the document in the primary index has an amortized complexity of O(1) and should again be negligible when compared to network latency, disk writes etc. Note that this check will always happen, even if ArangoDB generates the key. This is due to the fact that a collection may contain a mix of client-generated keys and ArangoDB-generated keys, and ArangoDB must still make sure it hasn't generated a key that a client had also generated before.
In a cluster, the same steps will happen, except that the client will send the insert to a coordinator node, which will need to forward it to a dbserver node. This is independent of whether a key is specified or not. The _key attribute will likely be the shard key for the collection, so the coordinator will send the request to exactly one dbserver node. If the _key attribute is not the shard key for the collection because a different shard key was explicitly set, then user-defined keys are disallowed anyway.
Summary so far: in terms of ArangoDB there should not be relevant performance differences between generating the keys on the client side or having ArangoDB generate them.
The advantages and disadvantages of generating keys in the client application are, among others:
+ client application can make sure keys follow some required pattern / syntax that's not guaranteed by ArangoDB-generated keys and has full control over key creation algorithm (e.g. can use tenant-specific keys in multi-tenant application)
- client may need some data store for storing its key generator state (e.g. id of last generated key) to prevent duplicates (also after a restart of the client application)
- usage of client-side keys is disallowed when different shard keys are used in cluster mode
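For illustration, inserting a document with a client-supplied _key using the python-arango driver (connection details and names are placeholders):

```python
from arango import ArangoClient
from arango.exceptions import DocumentInsertError

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("example", username="root", password="secret")
employees = db.collection("employees")

try:
    # Client-generated key; only the characters allowed in _key may be used.
    employees.insert({"_key": "jdoe", "name": "Jane Doe"})
except DocumentInsertError:
    # Unique constraint violation on the primary index: the key already exists.
    pass
```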

CQRS Event Sourcing check username is unique or not from EventStore while sending command

Event sourcing works perfectly when we have a particular unique EntityID, but when I try to get information from the event store by anything other than a particular EntityId, I have a tough time.
I am using CQRS with event sourcing. As part of event sourcing we store the events in a SQL table with columns (EntityID (unique key), EventType, EventObject (e.g. UserAdded)).
While storing the EventObject we just serialize the .NET object and store it in SQL, so all the details related to a UserAdded event are in XML format. My concern is that I want to make sure the userName present in the db is unique.
So, when issuing an AddUser command, I have to query the event store (the SQL db) to check whether that particular userName is already present. To do that I need to deserialize all the UserAdded/UserEdited events in the event store and check whether the requested username is present.
But in CQRS, commands are not allowed to query, maybe because of race conditions.
So I tried, before sending the AddUser command, querying the event store and getting all the usernames by deserializing all the UserAdded events; if the requested username is unique, the command is sent, otherwise an exception is thrown that the userName already exists.
With the above approach we need to query the entire db, and we may have hundreds of thousands of events per day, so the query/deserialization will take so much time that it becomes a performance issue.
I am looking for a better approach/suggestion for keeping usernames unique, either by getting all userNames from the event store or by some other approach.
So, your client (the thing that issues the commands) should have full faith that the command it sends will be executed, and it must do this by ensuring, before it sends the RegisterUserCommand, that no other user is registered with that email address. In other words, your client must perform the validation, not your domain or even the application services that surround the domain.
From http://cqrs.nu/Faq
This is a commonly occurring question since we're explicitly not
performing cross-aggregate operations on the write side. We do,
however, have a number of options:
- Create a read-side of already allocated user names. Make the client query the read-side interactively as the user types in a name.
- Create a reactive saga to flag down and inactivate accounts that were nevertheless created with a duplicate user name. (Whether by extreme coincidence or maliciously or because of a faulty client.)
- If eventual consistency is not fast enough for you, consider adding a table on the write side, a small local read-side as it were, of already allocated names. Make the aggregate transaction include inserting into that table.
Querying different aggregates with a repository in a write operation, as part of your business logic, is not forbidden. You can do that in order to accept the command or reject it due to a duplicate user, by using some domain service (a cross-aggregate operation). Greg Young mentions this here: https://www.youtube.com/watch?v=LDW0QWie21s&t=24m55s
In normal scenarios you would just need to query all the UserCreated + UserEdited events.
If you expect to have thousands of these events per day, maybe your events are bloated and you should design more atomically. For example, instead of having a UserEdited event raised every time something happens to a user, consider having UserPersonalDetailsEdited and UserAccessInfoEdited or similar, where the fields that must be unique are treated differently from the rest of the user fields. That way, querying all the UserCreated + UserAccessInfoEdited events prior to accepting or rejecting a command would be a lighter operation.
Personally I'd go with the following approach:
- More atomicity in events, so that everything that touches fields that should be globally unique is described more explicitly (e.g. UserCreated, UserAccessInfoEdited).
- Projections available on the write side so they can be queried during a write operation. For example, I'd subscribe to all UserCreated and UserAccessInfoEdited events in order to keep a queryable "table" with all the unique fields (e.g. email).
- When a CreateUser command arrives at the domain, a domain service queries this email table and accepts or rejects the command.
This solution relies a bit on eventual consistency: there is a window where the query says the field has not been used and allows the command to succeed, raising a UserCreated event, when in fact the projection had not yet been updated from a previous transaction, leaving two values in the system that are not globally unique.
If you want to completely avoid these uncertain situations because your business can't really deal with eventual consistency, my recommendation is to handle this in your domain by explicitly modeling it as part of your ubiquitous language. For example, you could model your aggregates differently, since it's obvious that your aggregate User is not really your transactional boundary (i.e. it depends on others).
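A minimal sketch of that approach, with an in-memory stand-in for the write-side projection (all names are illustrative):

```python
class UsedEmails:
    """Write-side projection fed by UserCreated / UserAccessInfoEdited events."""

    def __init__(self):
        self._emails = set()

    def apply(self, event):
        if event["type"] in ("UserCreated", "UserAccessInfoEdited"):
            self._emails.add(event["email"].lower())

    def is_taken(self, email):
        return email.lower() in self._emails


def handle_create_user(command, used_emails):
    # Domain-service style check; the eventual-consistency window described
    # above still applies if the projection lags behind the event stream.
    if used_emails.is_taken(command["email"]):
        raise ValueError("email already in use")
    return {"type": "UserCreated", "email": command["email"]}
```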
As often, there's no right answer, only answers that fit your domain.
Are you in an environment that really requires immediate consistency? What would be the odds of an identical user name being created between the moment uniqueness is checked by querying (say, at the client side) and the moment the command is processed? Would your domain experts tolerate, for instance, one user name conflict out of a million (which can be compensated for afterwards)? Will you have a million users in the first place?
Even if immediate consistency is required, "user names should be unique"... in which scope? A Company? An OnlineStore? A GameServerInstance? Can you find the most restricted scope in which the uniqueness constraint must hold and make that scope the Aggregate Root from which to sprout a new user? Why would the "replay all the UserAdded/UserEdited events" solution be bad after all, if the Aggregate Root keeps these events small and simple?
With GetEventStore (from Greg Young) you can use any string as your aggregateId/StreamId. Use the username as the id of the aggregate instead of a GUID, or a combination like "mycompany.users.john" as the key, and... voila! You get user name uniqueness for free!
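A toy sketch of that idea with an in-memory stand-in for the event store (the real GetEventStore client API differs; stream and event names are illustrative):

```python
class StreamAlreadyExists(Exception):
    pass

class InMemoryEventStore:
    """Toy event store keyed by stream id."""

    def __init__(self):
        self._streams = {}

    def create_stream(self, stream_id, events):
        if stream_id in self._streams:
            raise StreamAlreadyExists(stream_id)   # the name is already taken
        self._streams[stream_id] = list(events)

store = InMemoryEventStore()
store.create_stream("mycompany.users.john", [{"type": "UserRegistered"}])
# Registering "john" again raises StreamAlreadyExists: uniqueness for free.
```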

Is it possible to make conditional inserts with Azure Table Storage

Is it possible to make a conditional insert with the Windows Azure Table Storage Service?
Basically, what I'd like to do is to insert a new row/entity into a partition of the Table Storage Service if and only if nothing changed in that partition since I last looked.
In case you are wondering, I have Event Sourcing in mind, but I think that the question is more general than that.
Basically I'd like to read part of, or an entire, partition and make a decision based on the content of the data. In order to ensure that nothing changed in the partition since the data was loaded, an insert should behave like normal optimistic concurrency: the insert should only succeed if nothing changed in the partition - no rows were added, updated or deleted.
Normally in a REST service, I'd expect to use ETags to control concurrency, but as far as I can tell, there's no ETag for a partition.
The best solution I can come up with is to maintain a single row/entity for each partition in the table which contains a timestamp/ETag and then make all inserts part of a batch consisting of the insert as well as a conditional update of this 'timestamp entity'. However, this sounds a little cumbersome and brittle.
Is this possible with the Azure Table Storage Service?
The view from a thousand feet
Might I share a small tale with you...
Once upon a time someone wanted to persist events for an aggregate (from Domain Driven Design fame) in response to a given command. This person wanted to ensure that an aggregate would only be created once and that any form of optimistic concurrency could be detected.
To tackle the first problem - that an aggregate should only be created once - he did an insert into a transactional medium that threw when a duplicate aggregate (or more accurately the primary key thereof) was detected. The thing he inserted was the aggregate identifier as primary key and a unique identifier for a changeset. A collection of events produced by the aggregate while processing the command, is what is meant by changeset here. If someone or something else beat him to it, he would consider the aggregate already created and leave it at that. The changeset would be stored beforehand in a medium of his choice. The only promise this medium must make is to return what has been stored as-is when asked. Any failure to store the changeset would be considered a failure of the whole operation.
To tackle the second problem - detection of optimistic concurrency in the further life-cycle of the aggregate - he would, after having written yet another changeset, update the aggregate record in the transactional medium if and only if nobody had updated it behind his back (i.e. compared to what he last read just before executing the command). The transactional medium would notify him if such a thing happened. This would cause him to restart the whole operation, rereading the aggregate (or changesets thereof) to make the command succeed this time.
Of course, now that he had solved the writing problems, along came the reading problems. How would one be able to read all the changesets of an aggregate that made up its history? After all, he only had the last committed changeset associated with the aggregate identifier in that transactional medium. And so he decided to embed some metadata as part of each changeset. Among the metadata - which is not so uncommon to have as part of a changeset - would be the identifier of the previous last committed changeset. This way he could "walk the line" of changesets of his aggregate, like a linked list so to speak.
As an additional perk, he would also store the command message identifier as part of the metadata of a changeset. This way, when reading changesets, he could know in advance if the command he was about to execute on the aggregate was already part of its history.
All's well that ends well ...
P.S.
1. The transactional medium and changeset storage medium can be the same,
2. The changeset identifier MUST not be the command identifier,
3. Feel free to punch holes in the tale :-),
4. Although not directly related to Azure Table Storage, I've implemented the above tale successfully using AWS DynamoDB and AWS S3.
How about storing each event at a "PartitionKey/RowKey" built from AggregateId/AggregateVersion, where AggregateVersion is a sequential number based on how many events the aggregate already has?
This is very deterministic, so when adding a new event to the aggregate, you make sure that you were using the latest version of it, because otherwise you'll get an error saying that the row already exists for that partition. At that point you can drop the current operation and retry, or try to figure out whether you can merge the operations anyway, if the new updates to the aggregate do not conflict with the operation you just did.
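A sketch of that layout with the azure-data-tables Python package (the connection string and names are placeholders):

```python
from azure.data.tables import TableClient
from azure.core.exceptions import ResourceExistsError

table = TableClient.from_connection_string("<connection-string>", table_name="events")

def append_event(aggregate_id, expected_version, payload):
    entity = {
        "PartitionKey": aggregate_id,
        "RowKey": f"{expected_version + 1:010d}",  # next sequential aggregate version
        "Payload": payload,
    }
    try:
        table.create_entity(entity)
    except ResourceExistsError:
        # Another writer appended this version first: reload the aggregate,
        # decide whether the operations can be merged, or retry from the latest version.
        raise
```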
