node.js + mongo + atomic update of multiple entities = headache

My setup:
Node.js
Mongojs
A simple database containing two collections - inventory and invoices.
Users may concurrently create invoices.
An invoice may refer to several inventory items.
My problem:
Preserving inventory integrity. Imagine a scenario where two users submit two invoices with overlapping item sets.
A naive (and wrong) implementation would do the following:
For each item in the invoice read the respective item from the inventory collection.
Fix the quantity of the inventory items.
If any item quantity goes below zero - abandon the request with the relevant message to the user.
Save the inventory items.
Save the invoice.
Obviously, this implementation is bad, because the actions of the two users are going to interleave and affect each other. In a typical blocking server + relational database this is solved with complex locking/transaction schemes.
What is the nodish + mongoish way to solve this? Does the node.js platform provide any tools for this kind of thing?

You can look at a two phase commit approach with MongoDB, or you can forget about transactions entirely and decouple your processes via a service bus approach. Use Amazon as an example - they will allow you to submit your order, but they will not confirm it until they have been able to secure your inventory item, charged your card, etc. None of this occurs in a single transaction - it is a series of steps that can occur in isolation and can have compensating steps applied where necessary.
A naive bus implementation would do the following (keep in mind that this is just a generic suggestion for you to work from and the exact implementation would depend on your specific needs for concurrency, etc.):
Place the order on the queue. At this point, you can continue to have your client wait, or you can thank them for their order and let them know they will receive an email when it's been processed.
An "inventory worker" grabs the order and locks the inventory items that it needs to reserve. This can be done in many different ways. With Mongo you could create a collection of lock documents for each order, where each document has the inventory item ID as its ID and a reasonable TTL (say 30 seconds). As long as the worker holds the lock, it can manage the inventory levels of the items it has locks for. Once it has made its changes, it can delete the "lock" document.
If another worker comes along that wants to manage the same item while it's locked, you could put the blocked worker to sleep for X seconds and then retry or, better yet, put the request back onto the message bus to be picked up later by another worker.
Once the worker has resolved all the inventory items, it can place another message on the service bus indicating that a card should be charged, that processing should be notified to pull the inventory, that an email should be sent to the person who made the order, etc.
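The lock-then-reserve step can be sketched roughly as follows. This is only an illustration in Python, not the answer's actual implementation: `LockTable`, `try_lock_all`, `release_all` and `reserve` are hypothetical names, and a plain in-memory dict stands in for the Mongo lock collection, so it shows the control flow (lock everything or nothing, expire stale locks, requeue when blocked) rather than real cross-process atomicity.

```python
import time


class LockTable:
    """In-memory stand-in (assumption) for a lock collection keyed by inventory item ID."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._locks = {}  # item_id -> (order_id, expires_at)

    def try_lock_all(self, order_id, item_ids):
        """Lock every item for the order, or none of them."""
        now = self._clock()
        for item_id in item_ids:
            holder = self._locks.get(item_id)
            if holder is not None and holder[1] > now and holder[0] != order_id:
                return False  # held by another order and not yet expired
        for item_id in item_ids:
            self._locks[item_id] = (order_id, now + self._ttl)
        return True

    def release_all(self, order_id):
        """Delete the lock entries owned by this order."""
        owned = [item for item, (owner, _) in self._locks.items() if owner == order_id]
        for item_id in owned:
            del self._locks[item_id]


def reserve(inventory, locks, order_id, wanted):
    """Return 'reserved', 'rejected' (not enough stock) or 'requeue' (items locked)."""
    if not locks.try_lock_all(order_id, wanted.keys()):
        return "requeue"  # the caller puts the order back on the bus
    try:
        if any(inventory.get(item, 0) < qty for item, qty in wanted.items()):
            return "rejected"
        for item, qty in wanted.items():
            inventory[item] -= qty
        return "reserved"
    finally:
        locks.release_all(order_id)
```

With a real database the "check, then write" inside `try_lock_all` would have to be a single atomic operation (for example an insert that fails on a duplicate ID); the dict version is only safe because it runs in one thread.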
Sounds complex, but once you have a message bus set up, it's actually relatively simple. A list of Node Message Bus Implementations can be found here.
Some developers will even skip the formal message bus completely and use a database as their message passing engine which can work in simple implementations. Google Mongo and Queues.
If you don't expect more than 1 server and the message bus implementation is too bulky, node could handle the locking and message passing for you. For example, if you really wanted to lock with node, you could create an array that stored the inventory item IDs. Although, to be frank, I think the message bus is the best way to go. Anyway, here's some code I have used in the past to handle simple external resource locking with Node.
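var async = require('async');
// Waiting callbacks, keyed by resource id. An entry exists only while that id is locked.
var locks = {};
// Attempt to take out a lock. If the lock exists, place the callback into the array.
this.getLock = function (id, cb) {
    if (locks[id]) {
        locks[id].push(cb);
        return false;
    } else {
        locks[id] = [];
        return true;
    }
};
// Call freeLock when done. Every queued callback is invoked with the id.
this.freeLock = function (that, id) {
    // Release the lock before notifying, so a waiter can call getLock again.
    var waiters = locks[id] || [];
    locks[id] = null;
    async.forEach(waiters, function (item, callback) {
        item.apply(that, [id]);
        callback();
    }, function (err) {
        if (err) {
            // do something on error
        }
    });
};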

Related

How to model chat messages in an event-sourced system?

Context: I'm exploring building an event-sourced system / PoC using EventStoreDB (separate event stream per aggregate) with Node.JS/TypeScript. One part of the system is a 1:1 customer support chat. When a chat message is created, a push notification is sent to the user, including an update to the app's badge number (total unread message count). I'm wondering what's the best way to model the aggregates / bounded contexts.
Question 1: where to put the chat messages?
Question 2: how to handle a customer's unread message badge counter?
Since chat messages are by themselves already timed events, they seem like they could easily fit in an event-sourced system. Still, I'm looking for advice on how to best model the aggregates:
Option A: Since each chat message has its own lifecycle (they can be edited, have a read status that gets updated, etc.), ChatMessage could be an aggregate on its own. This would explode the number of aggregates (and thus streams), but that might not really be such an issue for EventStoreDB. However, to send the notification for a message, we'll need to know the total number of unread messages (so info on other aggregates). But how should the push notification sending "saga" / "process manager" (which is the correct term?) know what badge counter to send with the notification? Should it keep its own state / read model with the current counter for each customer based on all the events it has seen?
Option B: Another way might be to have a list of messages under the Customer aggregate root. That way, Customer could have a counter for the number of unread messages and a fold of all the events would give me that number. However, here I'm afraid the large number of chat message events for the Customer aggregate root gets in the way of "simple" Customer behavior. E.g. when processing a Customer command, we'd first get the current state by folding all events (assume no snapshotting is used), which means applying all those chat events, even to just do something with the current name of the customer.
Option C: Or should these be in different bounded contexts? So have the Customer with its contact details in one bounded context, and have a separate bounded context for chat (or communications in general), where both have a Customer aggregate root sharing only the UUID of the customer? Would that be the best of both worlds, or would that bring other challenges?
Is any of the options the way to go? Or is there another, better option? Or am I just missing the point entirely ;) (don't wanna rule that out)
Any advice is much appreciated!
Event Sourcing describes a way to (re)create state, by storing every change as an event. This does not include how those events get persisted or snapshotted, or how they are read and distributed.
I always start from the User Interface. Because that's where you should know which information you want to display and which actions can be executed.
For example there could be the following Commands (or actions executed by the User Interface):
SendMessage(receiverId, content)
MarkMessageAsRead(messageId)
Your server will then check if the provided data is valid and create the related Events:
class SupportChatMessageAggregate {
    MessageId messageId;
    UserId senderId;
    UserId receiverId;
    String content;
    boolean readByReceiver;
    // depending on framework and personal preference, this could
    // also be a method: handle(SendMessage command, CurrentUser currentUser)
    constructor(SendMessage command, CurrentUser currentUser) {
        validate(command); // throws Exception if invalid
                           // for example if content is empty,
                           // or if currentUser is not allowed to send messages to receiverId
        publishEvent(new MessageSentEvent(
            command.getMessageId(),
            currentUser.getUserId(),
            command.getReceiverId(),
            command.getContent()
        ));
    }
    handle(MarkMessageAsRead command, CurrentUser currentUser) {
        validate(command); // throws Exception if invalid
                           // for example check if currentUser == receiver
        publishEvent(new MessageMarkedAsReadEvent(
            command.getMessageId(),
            currentUser.getUserId()
        ));
    }
    ...
}
Now when you want to know the badge counter for a User, you simply add up all the MessageSentEvents where receiver = currentUser, and subtract all the MessageMarkedAsReadEvents of the currentUser.
This could be done for example within the UnreadSupportChatMessageCountAggregate, that is responsible for providing the current unreadMessages value based on the MessageSentEvents and MessageMarkedAsReadEvents for a given User. A pretty boring Aggregate, but it does the job.
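As a rough illustration of that fold, here is a minimal Python sketch. The event classes and `unread_counts` are hypothetical stand-ins for the `MessageSentEvent` / `MessageMarkedAsReadEvent` described above. One deliberate variation: instead of literally adding and subtracting, it keeps a set of unread message IDs per receiver, so a repeated "mark as read" event cannot push the counter below the true value.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageSent:
    message_id: str
    sender_id: str
    receiver_id: str


@dataclass(frozen=True)
class MessageMarkedAsRead:
    message_id: str
    reader_id: str


def unread_counts(events):
    """Fold the event history into {user_id: number of unread messages}."""
    unread = defaultdict(set)  # receiver_id -> ids of messages not yet read
    for event in events:
        if isinstance(event, MessageSent):
            unread[event.receiver_id].add(event.message_id)
        elif isinstance(event, MessageMarkedAsRead):
            # discard() makes a repeated "mark as read" harmless
            unread[event.reader_id].discard(event.message_id)
    return {user: len(ids) for user, ids in unread.items()}
```

The notification sender could run the same fold over the events it has seen to get the badge number for a customer.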
That's Event Sourcing: you simply have a bunch of events, and if you want to query some data, you just fetch all related events, process them, and get your result. Whether you use separate event streams per aggregate or a single stream for all events is an implementation detail (or depends on the event store you use).
Depending on the number of events this can be extremely fast, or very slow. That's where snapshots and/or read models (from CQRS) come in handy. But for plain Event Sourcing this is not required.

Concurrency issue when processing webhooks

Our application creates/updates database entries based on an external service's webhooks. The webhook sends the external id of the object so that we can fetch more data for processing. The processing of a webhook with roundtrips to get more data is 400-1200ms.
Sometimes, multiple hooks for the same object ID are sent within microseconds of each other. Here are timestamps of the most recent occurrence:
2020-11-21 12:42:45.812317+00:00
2020-11-21 20:03:36.881120+00:00 <-
2020-11-21 20:03:36.881119+00:00 <-
There can also be other objects sent for processing around this time as well. The issue is that concurrent processing of the two hooks highlighted above will create two new database entries for the same single object.
Q: What would be the best way to prevent concurrent processing of the two highlighted entries?
What I've Tried:
Currently, at the start of an incoming hook, I create a database entry in a Changes table which stores the object ID. Right before processing, the Changes table is checked for entries that were created for this ID within the last 10 seconds; if one is found, it quits to let the other process do the work.
In the case above, there were two database entries created, and because they were SO close in time, they both hit the detection spot at the same time, found each other, and quit, resulting in nothing being done.
I've thought of adding some jitter'd timeout before the check (increases processing time), or locking the table (again, increases processing time), but it all feels like I'm fighting the wrong battle.
Any suggestions?
Our API is Django 3.1 with a Postgres db
Okay, this might not be a very satisfactory answer, but it sounds to me like the root of your problem isn't necessarily with your own app, but the webhooks service you are receiving from.
Due to inherent possibility for error in network communication, webhooks which guarantee delivery always use at-least-once semantics. A sender that encounters a failure that leaves receipt uncertain needs to try sending the webhook again, even if the webhook may have been received the first time, thus opening the possibility for a duplicate event.
By extension, all webhook sending services should offer some way of deduplicating an individual event. I help run our webhooks at Stripe, and if you're using those, every webhook sent will come with an event ID like evt_1CiPtv2eZvKYlo2CcUZsDcO6, which a receiver can use for deduplication.
So the right answer for your problem is to ask your sender for some kind of deduplication/idempotency key, because without one, their API is incomplete.
Once you have that, everything gets really easy: you'd create a unique index on that key in the database, and then use upsert to guarantee only a single entry. That would look something like:
CREATE UNIQUE INDEX index_object_changes_idempotency_key ON object_changes (idempotency_key);
INSERT INTO object_changes (idempotency_key, ...) VALUES ('received-key', ...)
    ON CONFLICT (idempotency_key) DO NOTHING;
Second best
Absent an idempotency ID for deduping, all your solutions are going to be hacky, but you could still get something workable together. What you've already suggested of trying to round off the receipt time should mostly work, although it'll still have the possibility of losing two events that were different, but generated close together in time.
Alternatively, you could also try using the entire payload of a received webhook, or better yet, a hash of it, as an idempotency ID:
CREATE UNIQUE INDEX index_object_changes_payload_hash ON object_changes (payload_hash);
INSERT INTO object_changes (payload_hash, ...) VALUES ('<hash_of_webhook_payload>', ...)
    ON CONFLICT (payload_hash) DO NOTHING;
This should keep the field relatively small in the database, while still maintaining accurate deduplication, even for unique events sent close together.
You could also do a combination of the two: a rounded timestamp plus a hashed payload, just in case you were to receive a webhook with an identical payload somewhere down the line. The only thing this wouldn't protect against is two different events sending identical payloads close together in time, which should be a very unlikely case.
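To make the payload-hash idea concrete, here is a small self-contained Python sketch. It is an approximation under stated assumptions: SQLite (via the standard `sqlite3` module) stands in for Postgres so the example runs anywhere, `INSERT OR IGNORE` is SQLite's spelling of "do nothing on conflict", and the table layout plus the `open_store` / `record_webhook` names are made up for illustration.

```python
import hashlib
import sqlite3


def open_store():
    """Create an in-memory table with a unique index on the payload hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object_changes ("
        " payload_hash TEXT NOT NULL,"
        " object_id TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX index_object_changes_payload_hash"
        " ON object_changes (payload_hash)"
    )
    return conn


def record_webhook(conn, object_id, payload):
    """Return True if this payload is new, False if it is a duplicate."""
    payload_hash = hashlib.sha256(payload).hexdigest()
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO object_changes (payload_hash, object_id)"
        " VALUES (?, ?)",
        (payload_hash, object_id),
    )
    conn.commit()
    # total_changes only grows when a row was actually inserted
    return conn.total_changes > before
```

Only the caller that gets `True` goes on to do the slow processing; the other one returns immediately. The unique index, not application code, decides the winner, which is what closes the "both checked at the same time" race.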
If you look at the acquity webhook docs, they supply a field called action, which is key to making your webhook idempotent. Here are the quotes I could salvage:
action either scheduled rescheduled canceled changed or order.completed depending on the action that initiated the webhook call
The different actions:
scheduled is called once when an appointment is initially booked
rescheduled is called when the appointment is rescheduled to a new time
canceled is called whenever an appointment is canceled
changed is called when the appointment is changed in any way. This includes when it is initially scheduled, rescheduled, or canceled, as well as when appointment details such as e-mail address or intake forms are updated.
order.completed is called when an order is completed
Based on the wording, I assume that scheduled, canceled, and order.completed are all unique per object_id, which means you can use a unique together constraint for those messages:
from django.db import models
class AcquityAction(models.Model):
    id = models.CharField(max_length=17, primary_key=True)
class AcquityTransaction(models.Model):
    action = models.ForeignKey(AcquityAction, on_delete=models.PROTECT)
    object_id = models.IntegerField()
    class Meta:
        unique_together = [['object_id', 'action_id']]
You can substitute the AcquityAction model for an Enumeration Field if you'd like, but I prefer having them in the DB.
I would ignore the changed event entirely, since it appears to trigger on every event, according to their vague definition. For the rescheduled event, I would create a model that allows you to use a unique constraint on the new date, so something like this:
class Reschedule(models.Model):
    schedule = models.ForeignKey(MyScheduleModel, on_delete=models.CASCADE)
    schedule_date = models.DateTimeField()
    class Meta:
        unique_together = [['schedule', 'schedule_date']]
Alternatively, you could have a task specifically for updating your schedule model with a rescheduled date, that way it remains idempotent.
Now in your view, you will do something like this:
from django.db import IntegrityError
from django.http import HttpResponse
ACQUITY_ACTIONS = {'scheduled', 'canceled', 'order.completed'}
def webhook_view(request):
    validate(request)
    action = get_action(request)
    if action in ACQUITY_ACTIONS:
        try:
            insert_transaction()
        except IntegrityError:
            # Duplicate delivery: acknowledge it without queuing the task again.
            return HttpResponse(status=200)
        webhook_task.delay()
    elif action == 'rescheduled':
        other_webhook_task.delay()
    ...

How to ensure either all things are done, or none is done?

The title might be misleading (I couldn't come up with a better title to be honest) so please read my explanation:
Let's say we are trying to create a user and also update the cache:
Create user and insert to database.
Update the cache with created user.
OR
We are trying to publish an event after user is created (for example in microservices)
Create user and insert to database.
Publish an event with created user.
OR
We are trying to do n things and we want to ensure either all of them get completed or none.
Create user and insert to database.
Update cache.
Send an email.
Send SMS.
Publish an event, ... ( the list goes on )
In a perfect world where there are no failures, we can just write them in order and that's it, but what happens when we have a failure after user creation is complete? (Before adding to cache OR Sending the event, etc)
These examples are made up and are for the cache example:
const data = {
    id: 1
};
const user = database.createUser(data);
// Power goes out here (or any kind of failure)
cache.setCache(user);
Here, we've successfully created the user but failed to update the cache.
Let's give another example using database transactions:
const data = {
    id: 1
};
const transaction = database.startTransaction();
try {
    const user = database.createUser(data);
    cache.setCache(user);
    // Power goes out here (or any kind of failure)
    transaction.commit();
} catch(err) {
    transaction.rollback();
}
Here, we've successfully updated the cache but the user was never created because of the failure.
Thank you for your time.
When working with microservices, the usual ACID transactions that we are used to working with won't apply. Instead you could have a look at BASE transactions.
See here : https://www.johndcook.com/blog/2009/07/06/brewer-cap-theorem-base/
An alternative to ACID is BASE:
Basic Availability
Soft-state
Eventual consistency
Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state. (Accounting systems do this all the time. It’s called “closing out the books.”) It’s OK to use stale data, and it’s OK to give approximate answers.
Technically it means that you're going to have to find a clean way to deal with failure, for example by sending events in case of failure (which means the user you created should be removed from the cache, or you could even send an email saying there's been an error).
We often see examples of this in order or payment systems, where you can receive an email saying that the order could not be processed.
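One way to picture "dealing with failure cleanly" is to pair every step with a compensating step that undoes it. The Python sketch below is only an in-process illustration with a hypothetical helper name (`run_with_compensation`); it does not survive the power-outage case by itself. A durable version would have to record progress (or publish failure events) somewhere persistent so that another process can run the compensations later.

```python
def run_with_compensation(steps):
    """Run (action, compensate) pairs in order; undo completed steps on failure.

    Returns True if every action succeeded, False if the run was compensated.
    """
    done = []  # compensations for the actions that have completed so far
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo in reverse order, newest change first.
            for undo in reversed(done):
                undo()
            return False
        done.append(compensate)
    return True


# Usage with plain dicts standing in for the database and the cache.
database, cache = {}, {}


def send_sms():
    raise RuntimeError("SMS gateway down")


outcome = run_with_compensation([
    (lambda: database.update({1: {"id": 1}}), lambda: database.pop(1, None)),
    (lambda: cache.update({1: {"id": 1}}), lambda: cache.pop(1, None)),
    (send_sms, lambda: None),
])
```

Here the third step fails, so the cache entry and then the user row are removed again and `outcome` is `False`. Some steps (an email that has already been sent) cannot be undone, only followed by a correcting action, which is why eventual consistency is the realistic goal.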

Who and how should handle replaying events?

I am learning about DDD, CQRS and Event-sourcing and there is something I cannot figure out. Commands trigger changes in the aggregates, and once the change is performed an event is fired. The event is subsequently handled by other parts of the system and preserved in the event store. However, I do not understand how replaying events would recreate the aggregate, if changes are triggered by commands.
Example: If we have a online shop.
AddItemToCardCommand -> Card Aggregate adds the item to its card -> ItemAddedToCardEvent -> The event is handled by whoever.
However, if the event is replayed, the aggregate would not add the item to its card.
To sum up, my question is: how should I recreate aggregates based on the events in the event store? Also, any general advice on how to replay events the right way would be appreciated.
For simplicity, let's assume a stateless process - our service doesn't try to keep copies of things in memory, but instead reloads aggregates as needed.
The service receives AddItemToCardCommand:{card:123, ...}. We don't have the current state of card:123 in memory, so we need to create it. We do that by loading the state of card:123 from our durable store. Because we chose to use event sourced storage, the "state" we read from the durable store is a representation of the history of events previously written by the service.
Event histories have within them all of the information you need to remember, but not necessarily in a convenient "shape" - append only lists are a great data structure for writes, but not necessarily good for reads.
What this often means is that we will "replay" the events to create an in memory object which we can then use to answer questions about the events we will write next.
The same pattern is used when answering simple queries: we load the history of events from the store, transform the event history into a more convenient shape, and then use that shape to compute the answer.
In circumstances where query latency is more important than timeliness, we might design our query handler to read the convenient shapes from a cache, rather than trying to compute them fresh every time; a concurrently running background thread would be responsible for waking up periodically to compute new contents for the cache.
Using an async process to pull updates from an event stream is a common pattern; Greg Young discusses some of the advantages of that approach in his Polyglot Data talk.
In an ideal event scenario, you would not have an already constructed aggregate structure available in your database. You repeatedly arrive at the end data structure by running through all events stored so far.
Let me illustrate with some pseudocode of adding items to cart, and then fetching the cart data.
# Create a new cart
POST /cart/new
# Store a series of events related to the cart (in database as records, similar to array items)
POST /cart/add -> CartService.AddItem(item_data) -> ItemAddedToCart
A series of events would look like:
* ItemAddedToCart
* ItemAddedToCart
* ItemAddedToCart
* ItemRemovedFromCart
* ItemAddedToCart
When it's time to fetch cart data from the DB, you construct a new cart instance (or retrieve a cart instance if persisted) and replay the events on it.
cart = Cart(id=ID1)
# Fetch contents of Cart with id ID1
for each event in ID1 cart's events:
    if event is ItemAddedToCart:
        cart.add_item(event.data)
    else if event is ItemRemovedFromCart:
        cart.remove_item(event.data)
return cart
Occasionally, when there are too many events related to the cart, you may want to generate the aggregate structure then and save it in DB. Next time, you can start with the aggregate structure savepoint, and continue applying new events. This optimization helps save time and improve performance when there are too many events to process.
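The replay loop and the savepoint optimization can be sketched in runnable form like this. It is a minimal Python illustration, assuming events are simple `(kind, item)` tuples kept in a list; `Cart`, `apply` and `load_cart` are hypothetical names, and a real event store would supply the events from a given version onward instead of the list slice used here.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    cart_id: str
    items: list = field(default_factory=list)
    version: int = 0  # number of events already applied

    def apply(self, event):
        kind, item = event
        if kind == "ItemAddedToCart":
            self.items.append(item)
        elif kind == "ItemRemovedFromCart":
            self.items.remove(item)
        self.version += 1


def load_cart(cart_id, events, snapshot=None):
    """Rebuild a cart from its events, optionally starting from a snapshot."""
    if snapshot is None:
        cart = Cart(cart_id)
    else:
        # Copy the saved state so the stored snapshot itself is not mutated.
        cart = Cart(cart_id, list(snapshot.items), snapshot.version)
    # Only events newer than the snapshot still need to be applied.
    for event in events[cart.version:]:
        cart.apply(event)
    return cart
```

Loading from a snapshot taken after the third event and loading from scratch give the same cart; the snapshot only shortens the loop.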
What may help is to not think of the command as changing the state but rather the event as changing the state. In fact, I don't quite see how else one would go about doing so. The command handler in your aggregate would apply the invariants and, if all is OK, would immediately create the event and call some method that would apply it ([Apply|On|Do]MyEvent). The fact that you have an event after the fact does not necessarily mean other parts of your system would handle it. It is however required for event sourcing. Once you have an event you can most certainly pass that on to other parts of your system via, say, publishing on a service bus.
When you replay your events you are calling the same methods that the commands were calling to actually mutate the state of your aggregate:
public MyEvent MyCommand(string data)
{
    if (string.IsNullOrWhiteSpace(data))
    {
        throw new ArgumentException($"Argument '{nameof(data)}' may not be empty.");
    }
    return On(new MyEvent
    {
        Data = data
    });
}
private MyEvent On(MyEvent myEvent)
{
    // change the relevant state
    someState = myEvent.Data;
    return myEvent;
}
Your event sourcing infrastructure would call On(MyEvent) for MyEvent when replaying. Since you have an event it means that it was a valid state transition and can simply be applied; else something went wrong in your initial command processing and you probably have a bug.
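The same split between command handling and event application, sketched in Python for comparison. The names (`Aggregate`, `change_name`, `NameChanged`, `replay`) are invented for the example and do not come from any particular framework; the point is only that validation lives in the command method while replay goes straight to the event handler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameChanged:
    data: str


class Aggregate:
    def __init__(self):
        self.some_state = None

    def change_name(self, data):
        """Command handler: enforce invariants, then create and apply the event."""
        if not data or not data.strip():
            raise ValueError("Argument 'data' may not be empty.")
        return self._on(NameChanged(data))

    def _on(self, event):
        """Event handler: no validation, just the state change."""
        self.some_state = event.data
        return event

    @classmethod
    def replay(cls, events):
        """Rebuild an aggregate by applying stored events, bypassing commands."""
        aggregate = cls()
        for event in events:
            aggregate._on(event)
        return aggregate
```

The events returned by `change_name` are what would be appended to the event store; feeding them to `Aggregate.replay` yields an object in the same state as the live one.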
All events in an event store would be in chronological order for an aggregate. In addition to this the events should have a global sequence number to facilitate projection processing.
You could have a generic projection that accepts any/all events and then publishes the event on a service bus for system integration. You could also place that burden on a client of the event store to have it keep track of the position itself and then read events off the store itself. You could combine these and have the client subscribe to service bus events but ensure that it executes them in the same order by keeping track of the position (global sequence number) itself and update it as the events are processed.
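For the last variant (a client that receives events from the bus but tracks the global sequence number itself), a minimal Python sketch might look like the following. It assumes sequence numbers start at 1 and have no gaps, which a given event store may or may not guarantee; `OrderedSubscriber` is a hypothetical name, and buffering early arrivals is one possible policy, not something the answer prescribes.

```python
class OrderedSubscriber:
    """Applies bus-delivered events strictly in global sequence order."""

    def __init__(self, handler, position=0):
        self._handler = handler
        self.position = position  # last global sequence number processed
        self._pending = {}  # events that arrived ahead of their turn

    def receive(self, sequence, event):
        if sequence <= self.position or sequence in self._pending:
            return  # duplicate delivery, already handled or already buffered
        self._pending[sequence] = event
        # Drain every event that is now next in line.
        while self.position + 1 in self._pending:
            self._handler(self._pending.pop(self.position + 1))
            self.position += 1
```

In a real client `position` would be persisted together with the handler's side effects, so that a restart resumes from the right place.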

How to avoid concurrency issues when scaling writes horizontally?

Assume there is a worker service that receives messages from a queue, reads the product with the specified Id from a document database, applies some manipulation logic based on the message, and finally writes the updated product back to the database (a).
This work can be safely done in parallel when dealing with different products, so we can scale horizontally (b). However, if more than one service instance works on the same product, we might end up with concurrency issues, or concurrency exceptions from the database, in which case we should apply some retry logic (and still the retry might fail again and so on).
Question: How do we avoid this? Is there a way I can ensure two instances are not working on the same product?
Example/Use case: An online store has a great sale on productA, productB and productC that ends in an hour and hundreds of customers are buying. For each purchase, a message is enqueued (productId, numberOfItems, price). Goal: How can we run three instances of our worker service and make sure that all messages for productA will end up in instanceA, productB to instanceB and productC to instanceC (resulting in no concurrency issues)?
Notes: My service is written in C#, hosted on Azure as a Worker Role, I use Azure Queues for messaging, and I'm thinking to use Mongo for storage. Also, the Entity IDs are GUID.
It's more about the technique/design, so if you use different tools to solve the problem I'm still interested.
Any solution that attempts to divide the load across different items in the same collection (like orders) is doomed to fail. The reason is that with a high rate of transactions you'll have to start doing one of the following things:
Let nodes talk to each other ("hey guys, is anyone working with this?").
Divide the ID generation into segments (node A creates IDs 1-1000, node B 1001-1999, etc.) and then just let each node deal with its own segment.
Dynamically divide a collection into segments (and let each node handle a segment).
So what's wrong with those approaches?
The first approach simply reimplements database transactions. Unless you can spend a large amount of time optimizing the strategy, it's better to rely on transactions.
The other two options will decrease performance, as you have to dynamically route messages based on IDs and also change the strategy at run-time to include newly inserted messages. It will fail eventually.
Solutions
Here are two solutions that you can also combine.
Retry automatically
Presumably you have an entry point somewhere that reads from the message queue.
In it you have something like this:
while (true)
{
    var message = queue.Read();
    Process(message);
}
What you could do instead to get very simple fault tolerance is to retry upon failure:
while (true)
{
    // read once, so that every attempt retries the same message
    var message = queue.Read();
    for (var i = 0; i < 3; i++)
    {
        try
        {
            Process(message);
            break; //exit for loop
        }
        catch (Exception ex)
        {
            //log
            //no throw = for loop runs the next attempt
        }
    }
}
You could of course just catch db exceptions (or rather transaction failures) to just replay those messages.
Micro services
I know, Micro service is a buzz word. But in this case it's a great solution. Instead of having a monolithic core which processes all messages, divide the application in smaller parts. Or in your case just deactivate the processing of certain types of messages.
If you have five nodes running your application you can make sure that Node A receives messages related to orders, node B receives messages related to shipping etc.
By doing so you can still horizontally scale your application, you get no conflicts and it requires little effort (a few more message queues and reconfigure each node).
For this kind of thing I use blob leases. Basically, I create a blob with the ID of an entity in some known storage account. When worker 1 picks up the entity, it tries to acquire a lease on the blob (and create the blob itself, if it doesn't exist). If it is successful in doing both, then I allow the processing of the message to occur. Always release the lease afterwards.
If I am not successful, I dump the message back onto the queue.
I follow the approach originally described by Steve Marx here http://blog.smarx.com/posts/managing-concurrency-in-windows-azure-with-leases although tweaked to use the new Storage Libraries.
Edit after comments:
If you have a potentially high rate of messages all talking to the same entity (as your comment implies), I would redesign your approach somewhere: either the entity structure or the messaging structure.
For example: consider the CQRS design pattern and store the changes from processing every message independently. The product entity then becomes an aggregate of all changes made to it by the various workers, sequentially re-applied and rehydrated into a single object.
If you want the database to always be up to date and consistent with the units already processed, then you have several updates on the same mutable entity.
To achieve this you need to serialize the updates for the same entity. You can do this by partitioning your data at the producers, by accumulating the events for an entity on the same queue, or by locking the entity in the worker using a distributed lock or a lock at the database level.
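Partitioning at the producer can be as simple as hashing the entity ID to pick a queue, so that all messages for one product land on the same queue and are consumed by one worker. A minimal Python sketch, with hypothetical function names and in-memory lists standing in for real queues:

```python
import hashlib


def partition_for(product_id, partition_count):
    """Map an entity ID to a stable partition index in [0, partition_count)."""
    digest = hashlib.sha256(product_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def route(messages, partition_count):
    """Group (product_id, payload) messages into one ordered list per partition."""
    queues = [[] for _ in range(partition_count)]
    for product_id, payload in messages:
        index = partition_for(product_id, partition_count)
        queues[index].append((product_id, payload))
    return queues
```

`hashlib` is used instead of Python's built-in `hash()` because string hashes from `hash()` differ between processes, and every producer has to agree on the mapping. Note that hashing guarantees one partition per product, not one product per partition: productA and productB may still share a worker, which is harmless for correctness but can leave load unevenly spread.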
You could use an actor model (in java/scala world using akka) that is creating a message queue for each entity or group of entities that process them serially.
UPDATED
You can try an akka port to .net and here.
Here you can find a nice tutorial with samples about using akka in scala.
But for general principles you should read more about the actor model. It has drawbacks nevertheless.
In the end it comes down to partitioning your data and being able to create a unique, specialized worker (one that can be reused and/or restarted in case of failure) for a specific entity.
I assume you have a means to safely access the product queue across all worker services. Given that, one simple way to avoid conflict could be using global queues per product next to the main queue
// Queue[X] is the queue for product X
// QueueMain is the main queue
DoWork(ProductType X)
{
    if (Queue[X].empty())
    {
        product = QueueMain().pop()
        if (product.type != X)
        {
            Queue[product.type].push(product)
            return;
        }
    }
    else
    {
        product = Queue[X].pop()
    }
    //process product...
}
Access to the queues needs to be atomic.
You should use a session-enabled Service Bus queue for ordering and concurrency.
1) Every high-scale data solution that I can think of has something built in to handle precisely this sort of conflict. The details will depend on your final choice of data storage. In the case of a traditional relational database, this comes baked in without any additional work on your part. Refer to your chosen technology's documentation for the appropriate detail.
2) Understand your data model and usage patterns. Design your datastore appropriately. Don't design for scale that you won't have. Optimize for your most common usage patterns.
3) Challenge your assumptions. Do you actually have to mutate the same entity very frequently from multiple roles? Sometimes the answer is yes, but often you can simply create a new, similar entity to reflect the update, i.e., take a journaling/logging approach instead of a single-entity approach. Ultimately, high volumes of updates on a single entity will never scale.

Resources