I built a homebrew data entity repository with a factory that defines retention policy by type (e.g. absolute or sliding expiration). The policy also specifies the cache type: HttpContext request, session, or application. A MemoryCache is maintained by a caching proxy for all three cache types. Anyhow, I have a data entity service tied to the repository which does the load and save for our primary data entity. The idea is that you use the entity repository and don't need to care whether the entity is cached or retrieved from its data source (a DB in this case).
An obvious assumption would be that you need to synchronise the load/save events, as you would need to save the cached entity before loading the entity from its data source.
So I was investigating a data integrity issue in production today... :)
Today I read there can be a good long gap between the entity being removed from the MemoryCache and the CacheItemRemovedCallback event firing (default 20 seconds). The simple lock I had around the load and save data ops was insufficient. Furthermore, the CacheItemRemovedCallback runs in its own context outside of HttpContext, making things interesting. It meant I needed to make the callback function static, as I was potentially assigning a disposed instance to the event.
So once I realised there was the possibility of a gap whereby my data entity no longer existed in cache but might not yet have been saved to its data source, it might explain the 3 corrupt orders out of 5,000. While filling out a long form it would be easy to perform work beyond the policy's 20-minute sliding expiration on the primary data entity. That means if the user happens to submit at the same moment of expiration, an interesting race condition between the load (via request context) and the save (via cache-expired callback) emerges.
With a simple lock it was a roll of the dice: would save or load win? Clearly we need the save to happen before the next load from the data source (DB). Ideally, when an item expires from the cache it would be atomically written to its data source. With the entity gone from the cache but the expired callback not yet fired, a load operation can slip in. In this case the entity will not be found in the cache, so it defaults to loading from the data source. However, the save operation may not even have commenced, so the load returns stale data, corrupting data integrity, and a later save of that stale entity will likely clobber the data the expiry callback eventually writes.
To accomplish synchronisation I needed a named signalling lock, so I settled on EventWaitHandle. A named lock is created per user, of which there are fewer than 5,000. This allows the Load to wait on a signal from the expired event which Saves the entity (and whose thread exists in its own context outside HttpContext). In the save it is then easy to grab the existing named handle and signal the Load to continue once the Save is complete.
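To make the handshake concrete, here is a minimal sketch of the idea with hypothetical names (the real repository and service types aren't shown here); the 10-second wait matches the timeout/logging redundancy described below.

using System;
using System.Threading;

public static class EntitySync
{
    // One named handle per user; created signalled, so a Load with no
    // pending Save passes straight through.
    private static EventWaitHandle GetHandle(string userId) =>
        new EventWaitHandle(true, EventResetMode.ManualReset, "EntitySave_" + userId);

    public static T Load<T>(string userId, Func<T> loadFromDb)
    {
        using (var handle = GetHandle(userId))
        {
            // Block until any in-flight expiry Save signals completion.
            handle.WaitOne(TimeSpan.FromSeconds(10));
            return loadFromDb();
        }
    }

    // Called from the static cache-expired callback, outside HttpContext.
    public static void SaveOnExpiry(string userId, Action saveToDb)
    {
        using (var handle = GetHandle(userId))
        {
            handle.Reset();              // make concurrent Loads wait
            try { saveToDb(); }
            finally { handle.Set(); }    // release any waiting Load
        }
    }
}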
I also have a redundancy whereby the wait times out and logs each 10 seconds it is blocked by the save operation. As I said, the default is meant to be 20 seconds between an entity being removed from MemoryCache and the cache becoming conscious of it and firing the event, which in turn saves the entity.
Thank you to anyone who followed my ramblings through all that. Given the nature of the sync requirements, was the EventWaitHandle lock the best solution?
For completeness I wanted to post what I did to address the issue. I made multiple changes to the design to create a tidier solution which did not require a named sync object and allowed me to use a simple lock instead.
First, the data entity repository is a singleton which was stored in the request cache. This front end of the repository is detached from the caches themselves. I changed it to reside in the session cache instead, which becomes important below.
Second, I changed the event for the expired entity to route through the data entity repository above.
Third, I changed the MemoryCache event from RemovedCallback to UpdateCallback**.
Last, we tie it all together with a regular lock in the data entity repository, which now lives in the user's session; the gap-less expiry event routes through the same repository, allowing that one lock to cover both the load and save (expire) operations.
** These events are funny in that A) you can't subscribe to both, and B) UpdateCallback is called before the item is removed from the cache, but it is not called when you explicitly remove the item (i.e. myCache.Remove(entity) won't fire UpdateCallback, though RemovedCallback would have fired). We made the decision that if the item was being forcefully removed from the cache we didn't care. This happens when the user changes company or clears their shopping list. Those scenarios won't fire the event, so the entity may never be saved to the DB's cache tables. While it might have been nice for debugging purposes, it wasn't worth dealing with the limbo state of an entity's existence to use the RemovedCallback, which had 100% coverage.
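For reference, a minimal sketch of that final shape (hypothetical names, using System.Runtime.Caching): the repository instance lives in the user's session, so one plain lock covers both the load path and the expiry-save path.

using System;
using System.Runtime.Caching;

public class DataEntityRepository            // one instance per user session
{
    private readonly object _sync = new object();
    private readonly MemoryCache _cache = MemoryCache.Default;
    private object _entity;                  // repository's own reference

    public void Put(string key, object entity)   // key assumed user-scoped
    {
        var policy = new CacheItemPolicy
        {
            SlidingExpiration = TimeSpan.FromMinutes(20),
            // UpdateCallback, not RemovedCallback: fires *before* the item
            // leaves the cache (no gap), but never on an explicit Remove().
            // A policy cannot carry both callbacks at once.
            UpdateCallback = OnExpiring
        };
        lock (_sync)
        {
            _entity = entity;
            _cache.Set(key, entity, policy);
        }
    }

    public object Load(string key, Func<object> loadFromDb)
    {
        lock (_sync)
        {
            return _cache.Get(key) ?? loadFromDb();
        }
    }

    private void OnExpiring(CacheEntryUpdateArguments args)
    {
        lock (_sync)                         // same lock as Load: no race window
        {
            SaveToDb(args.Key, _entity);
        }
    }

    private void SaveToDb(string key, object entity) { /* DB write */ }
}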
Spring Session provides a Hazelcast4IndexedSessionRepository that stores the session data (represented as MapSession) in a Hazelcast cache. Instead of returning the MapSession directly from the repository, it uses its own wrapper class HazelcastSession (both MapSession and HazelcastSession implement the same Session interface). One reason is surely so that it can implement the different flush and save modes and support updating the principal-name index. But it also remembers all changes to the session as deltas, and when the save() method on the repository is called, it uses a Hazelcast4SessionUpdateEntryProcessor to update the corresponding map entry of the Hazelcast IMap.
Why does the session repository not just set the MapSession object on the IMap directly via put without using an EntryProcessor? What is the benefit of the current implementation of recording the change deltas?
To my understanding of the Hazelcast EntryProcessor documentation, an entry processor is useful when a map entry should be updated often without the client having to retrieve the existing value first. Instead of first getting the old value (which might require a network round-trip), the entry processor can be executed directly on the Hazelcast member that holds the data.
But in the case of Spring Session, the session data is loaded from the Hazelcast map at the beginning of each incoming web request anyway (or at the latest when the application code wants to read/modify the session content) and then held in local memory. All changes to the session during the processing of such a request are made to the local session object, and it is then saved back to the Hazelcast cache when the request ends (or earlier, depending on the flush/save mode). That means the saving can be done without executing an extra get request on the IMap first. So why not just call map.put(MapSession) instead of using an EntryProcessor to update only the attributes noted in the delta list?
The only explanation I can think of would be an attempt to minimise concurrent modification of the same attributes. By saving only the deltas in the EntryProcessor, instead of storing the whole MapSession that was loaded earlier, the chance of overwriting an attribute value that was modified concurrently in a parallel process is reduced. But it is not zero. Especially if my application code stores and updates only the same couple of attributes in the session all the time, even with the EntryProcessor the update will not be safe, because there is no optimistic locking scheme in place.
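To illustrate that point (plain C# for brevity, not the Hazelcast API): replaying attribute deltas preserves concurrent writes to disjoint attributes, where a whole-object put silently drops one of them.

using System.Collections.Generic;

var store = new Dictionary<string, Dictionary<string, string>>
{
    ["session1"] = new() { ["cart"] = "empty", ["theme"] = "light" }
};

// Two requests load the same session concurrently.
var a = new Dictionary<string, string>(store["session1"]);   // request A
var b = new Dictionary<string, string>(store["session1"]);   // request B
a["cart"] = "2 items";    // A touches only "cart"
b["theme"] = "dark";      // B touches only "theme"

// Whole-object put: last writer wins, silently reverting A's change.
store["session1"] = a;
store["session1"] = b;    // b still holds cart = "empty"

// Delta-based update (the EntryProcessor idea): each request applies only
// the attributes it changed, so disjoint updates both survive. Two writers
// touching the *same* attribute still race, as noted above.
store["session1"] = new() { ["cart"] = "empty", ["theme"] = "light" };
ApplyDelta(store["session1"], new() { ["cart"] = "2 items" });   // from A
ApplyDelta(store["session1"], new() { ["theme"] = "dark" });     // from B

static void ApplyDelta(Dictionary<string, string> entry,
                       Dictionary<string, string> delta)
{
    foreach (var (key, value) in delta) entry[key] = value;
}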
Thanks for the insight!
We want to implement CQRS in our new design. We have some doubts about processing command handlers and read models. We understand that while processing commands we should take an optimistic lock on the aggregateId. But what approach should be taken while processing read models? Should we take a lock on the entire read model, or on the aggregateId, or never take a lock while processing the read model?
Case 1: take a lock on the entire read model. This is safest, but not good in terms of speed.
Case 2: take a lock on the aggregateId. Issues may arise here: if we lock aggregateId-wise, what happens if the read model server restarts? It does not know where to start from again.
Case 3: never take a lock. In this approach, I think data may end up in a corrupted state. For example, say an order-inserted event is generated and, through some workflow/saga, an order-updated event takes place as well. What if the order-updated event comes first and the order-inserted event has not yet been processed?
I hope I have been able to describe my issue.
If you do not process events concurrently in the Readmodel then there is no need for a lock. This is the case when you have a single instance of the Readmodel, possibly in a microservice, that polls for events and processes them sequentially.
If you have a synchronous Readmodel (i.e. in the same process as the Writemodel/Aggregate) then most probably you will need locking.
An important thing to keep in mind is that a Readmodel most probably differs from the Writemodel. There could be a lot of Writemodel types whose events are projected into the same Readmodel. For example, in an ecommerce shop you could have a ListOfProducts that projects events from both Vendor and Product Aggregates. This means that when we speak about a Readmodel we cannot simply refer to "the Aggregate", because there is no single Aggregate involved. In the case of ecommerce, when we say "the Aggregate" we might mean the Product Aggregate or the Vendor Aggregate.
But what to lock? That depends on the database technology. You should lock the smallest affected read entity or collection that can be locked. In a Readmodel that consists of a list of products (read entities, not aggregates!), when an event affects only one product (e.g. ProductTitleRenamed), you should lock only that product.
If an event affects more products, then you should lock the entire collection. For example, VendorWasBlocked affects all the products (it should remove all the products from that vendor).
You need the locking for events that have non-idempotent side effects, for the case where the Readmodel's updater fails during the processing of an event and you want to retry/resume from where it left off. If the event has idempotent side effects then it can be retried safely.
In order to know where to resume from in case of a failed Readmodel, you could store the sequence of the last processed event inside the Readmodel. In this case, if the entity update succeeds then the last processed event's sequence is also saved; if it fails, you know the event was not processed.
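A sketch of that bookkeeping, assuming a hypothetical schema and Dapper-style data access: the read entity and the checkpoint are written in one transaction, so a failure cannot lose the position.

using System.Data;
using Dapper;   // assumed; any transactional data access works

public record ProductTitleRenamed(string ProductId, string NewTitle);

public class ProductProjection
{
    private readonly IDbConnection _db;
    public ProductProjection(IDbConnection db) => _db = db;

    // Update the read entity and the checkpoint atomically.
    public void Apply(ProductTitleRenamed e, long sequence)
    {
        using var tx = _db.BeginTransaction();
        _db.Execute("UPDATE products SET title = @Title WHERE id = @Id",
                    new { Title = e.NewTitle, Id = e.ProductId }, tx);
        _db.Execute("UPDATE projection_checkpoint SET last_sequence = @Seq",
                    new { Seq = sequence }, tx);
        tx.Commit();
    }

    // On restart, resume polling from here.
    public long LastProcessed() =>
        _db.ExecuteScalar<long>("SELECT last_sequence FROM projection_checkpoint");
}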
For example, say an order-inserted event is generated and, through some workflow/saga, an order-updated event takes place as well. What if the order-updated event comes first and the order-inserted event has not yet been processed?
Read models are usually easier to reason about if you think about them polling for ordered sequences of events, rather than reacting to unordered notifications.
A single read model might depend on events from more than one aggregate, so aggregate locking is unlikely to be your most general answer.
That also means, if we are polling, that we need to keep track of the position of multiple streams of data. In other words, our read model probably includes metadata that tells us what version of each source was used.
The locking is likely to depend on the nature of your backing store / cache. But an optimistic approach:
read the current representation
compute the new representation
compare and swap
is, again, usually easy to reason about.
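Sketched against a hypothetical versioned store (all names invented), that loop looks like this:

using System;

public interface IVersionedStore
{
    (string Value, long Version) Read(string key);
    bool CompareAndSwap(string key, long expectedVersion, string newValue);
}

public static class ReadModelUpdater
{
    public static void Update(IVersionedStore store, string key,
                              Func<string, string> apply)
    {
        while (true)
        {
            var (current, version) = store.Read(key);    // 1. read
            var next = apply(current);                   // 2. compute
            if (store.CompareAndSwap(key, version, next))
                return;                                  // 3. swap succeeded
            // Another writer got in between; retry with fresh state.
        }
    }
}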
I am developing a simple DDD + Event sourcing based app for educational purposes.
In order to set the event version before storing to the event store, I would have to query the event store, but my gut tells me this is wrong because it causes concurrency issues.
Am I missing something?
There are different answers to that, depending on what use case you are considering.
Generally, the event store is a dumb, domain agnostic appliance. It's superficially similar to a List abstraction -- it stores what you put in it, but it doesn't actually do any work to satisfy your domain constraints.
In use cases where your event stream is just a durable record of things that have happened (meaning: your domain model does not get a veto; recording the event doesn't depend on previously recorded events), then append semantics are fine, and depending on the kind of appliance you are using, you may not need to know what position in the stream you are writing to.
For instance: the API for GetEventStore understands ExpectedVersion.ANY to mean append these events to the end of the stream wherever it happens to be.
In cases where you do care about previous events (the domain model is expected to ensure an invariant based on its previous state), then you need to do something to ensure that you are appending the event to the same history that you have checked. The most common implementations of this communicate the expected position of the write cursor in the stream, so that the appliance can reject attempts to write to the wrong place (which protects you from concurrent modification).
This doesn't necessarily mean that you need to query the event store to get the position. You are allowed to count the number of events in the stream when you load it, and to remember how many more events you've added, and therefore where the stream "should" be if you are still synchronized with it.
What we're doing here is analogous to a compare-and-swap operation: we get a representation of the original state of the stream, create a new representation, and then compare and swap the reference to the original to point instead to our changes:
oldState = stream.get()
newState = domainLogic(oldState)
stream.compareAndSwap(oldState, newState)
But because a stream is a persistent data structure with append-only semantics, we can use a simplified API that doesn't require duplicating the existing state.
events = stream.get()
changes = domainLogic(events)
stream.appendAt(count(events), changes)
If the API of your appliance doesn't allow you to specify a write position, then yes - there's the danger of a data race when some other writer changes the position of the stream between your query for the position and your attempt to write. Data obtained in a query is always stale; unless you hold a lock you can't be sure that the data hasn't changed at the source while you are reading your local copy.
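Expressed as code against a generic stream interface (invented here for illustration; clients such as GetEventStore's express the same idea with an expected-version argument on append):

using System;
using System.Collections.Generic;

public interface IEventStream
{
    IReadOnlyList<object> Read(string streamId);
    // Throws if the stream has grown past expectedCount since the read.
    void AppendAt(string streamId, int expectedCount, IEnumerable<object> events);
}

public static class CommandHandler
{
    public static void Handle(IEventStream store, string streamId,
        Func<IReadOnlyList<object>, IEnumerable<object>> decide)
    {
        var history = store.Read(streamId);
        var changes = decide(history);    // domain logic checks invariants
        // A concurrent writer makes this throw rather than silently
        // interleave; the caller can reload and retry.
        store.AppendAt(streamId, history.Count, changes);
    }
}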
I guess you shouldn't think about the event version.
If you are talking about the place in the event stream: in general, there's no guaranteed way to determine it at creation time, only at processing time or in the event store.
If it is exactly about the event version (see http://cqrs.nu/Faq, "How do I version/upgrade my events?"), you have it hardcoded in your application. So, I mean the following use case:
First, you have an app generating some events. Next, you update the app and the events change (you add some fields or change the payload structure) but keep their logical meaning. So now you have old events in your event store and new events that differ significantly from the old ones. To distinguish one from another you use an event version, e.g. 0 and 1.
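A minimal sketch of that kind of versioning (the event types and the default are hypothetical): the version stored alongside each event tells the reader how to deserialise and upgrade it.

using System;
using System.Text.Json;

public record CartItemAddedV0(string Sku);                 // version 0
public record CartItemAddedV1(string Sku, int Quantity);   // version 1

public static class Upcaster
{
    // Normalise whatever shape is in the event store to the current version.
    public static CartItemAddedV1 Upcast(int version, string json) =>
        version switch
        {
            0 => new CartItemAddedV1(
                     JsonSerializer.Deserialize<CartItemAddedV0>(json)!.Sku,
                     Quantity: 1),        // chosen default for old events
            1 => JsonSerializer.Deserialize<CartItemAddedV1>(json)!,
            _ => throw new NotSupportedException($"event version {version}")
        };
}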
When you use Node's EventEmitter, you subscribe to a single event. Your callback is only executed when that specific event is fired:
eventBus.on('some-event', function(data){
  // data is specific to 'some-event'
});
In Flux, you register your store with the dispatcher, then your store gets called when every single event is dispatched. It is the job of the store to filter through every event it gets, and determine if the event is important to the store:
eventBus.register(function(data){
  switch(data.type){
    case 'some-event':
      // now data is specific to 'some-event'
      break;
  }
});
In this video, the presenter says:
"Stores subscribe to actions. Actually, all stores receive all actions, and that's what keeps it scalable."
Question
Why and how is sending every action to every store [presumably] more scalable than only sending actions to specific stores?
The scalability referred to here is more about scaling the codebase than scaling in terms of how fast the software is. Data in flux systems is easy to trace because every store is registered to every action, and the actions define every app-wide event that can happen in the system. Each store can determine how it needs to update itself in response to each action, without the programmer needing to decide which stores to wire up to which actions, and in most cases, you can change or read the code for a store without needing to worry about how it affects any other store.
At some point the programmer will need to register the store. The store is very specific to the data it'll receive from the event. How exactly is looking up the data inside the store better than registering for a specific event, and having the store always expect the data it needs/cares about?
The actions in the system represent the things that can happen in a system, along with the relevant data for that event. For example:
A user logged in; comes with user profile
A user added a comment; comes with comment data, item ID it was added to
A user updated a post; comes with the post data
So, you can think about actions as the database of things the stores can know about. Any time an action is dispatched, it's sent to each store. So, at any given time, you only need to think about your data mutations a single store + action at a time.
For instance, when a post is updated, you might have a PostStore that watches for the POST_UPDATED action, and when it sees it, it will update its internal state to store off the new post. This is completely separate from any other store which may also care about the POST_UPDATED event—any other programmer from any other team working on the app can make that decision separately, with the knowledge that they are able to hook into any action in the database of actions that may take place.
Another reason this is useful and scalable in terms of the codebase is inversion of control; each store decides what actions it cares about and how to respond to each action; all the data logic is centralized in that store. This is in contrast to a pattern like MVC, where a controller is explicitly set up to call mutation methods on models, and one or more other controllers may also be calling mutation methods on the same models at the same time (or different times); the data update logic is spread through the system, and understanding the data flow requires understanding each place the model might update.
Finally, another thing to keep in mind is that registering vs. not registering is sort of a matter of semantics; it's trivial to abstract away the fact that the store receives all actions. For example, in Fluxxor, the stores have a method called bindActions that binds specific actions to specific callbacks:
this.bindActions(
  "FIRST_ACTION_TYPE", this.handleFirstActionType,
  "OTHER_ACTION_TYPE", this.handleOtherActionType
);
Even though the store receives all actions, under the hood it looks up the action type in an internal map and calls the appropriate callback on the store.
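That lookup is nothing more than a map from action type to callback; sketched minimally here (names invented, C# for brevity):

using System;
using System.Collections.Generic;

public class Store
{
    private readonly Dictionary<string, Action<object>> _handlers =
        new Dictionary<string, Action<object>>();

    public void BindAction(string type, Action<object> handler) =>
        _handlers[type] = handler;

    // The single callback registered with the dispatcher: it receives
    // every action but only acts on the ones it has bound.
    public void OnDispatch(string type, object payload)
    {
        if (_handlers.TryGetValue(type, out var handler))
            handler(payload);
    }
}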
I've been asking myself the same question, and can't see technically how registering adds much, beyond simplification. I will pose my understanding of the system so that, hopefully, if I am wrong, I can be corrected.
TL;DR: EventEmitter and Dispatcher serve similar purposes (pub/sub) but focus their efforts on different features. Specifically, the 'waitFor' functionality (which allows one event handler to ensure that a different one has already been called) is not available with EventEmitter. Dispatcher has focused its efforts on the 'waitFor' feature.
The final result of the system is to communicate to the stores that an action has happened. Whether the store 'subscribes to all events, then filters' or 'subscribes to a specific event' (filtering at the dispatcher) should not affect the final result. Data is transferred in your application either way. (A handler always switches on the event type before processing, i.e. it doesn't want to operate on ALL events.)
As you said "At some point the programmer will need to register the store.". It is just a question of fidelity of subscription. I don't think that a change in fidelity has any affect on 'inversion of control' for instance.
The added (killer) feature in Facebook's Dispatcher is its ability to 'waitFor' a different store to handle the event first. The question is: does this feature require that each store has only one event handler?
Let's look at the process. When you dispatch an action on the Dispatcher, it (omitting some details):
iterates all registered subscribers (to the dispatcher)
calls the registered callback (one per store)
the callback can call 'waitFor()' and pass a 'dispatchId'. This internally references the callback registered by a different store. It is executed synchronously, causing the other store to receive the action and be updated first. This requires that 'waitFor()' is called before your code which handles the action.
the callback invoked via 'waitFor' switches on the action type to execute the correct code.
the callback can now run its code, knowing that its dependencies (other stores) have already been updated.
the callback then switches on the action 'type' to execute its own correct code.
This seems a very simple way to allow event dependencies.
Basically all callbacks are eventually called, but in a specific order, and then switch to only execute specific code. So, it is as if we only triggered a handler for the 'add-item' event on each store, in the correct order.
If subscriptions were at a callback level (not 'store' level), would this still be possible? It would mean:
Each store would register multiple callbacks to specific events, keeping reference to their 'dispatchTokens' (same as currently)
Each callback would have its own 'dispatchToken'
The user would still 'waitFor' a specific callback, but it would be a specific handler for a specific store
The dispatcher would then only need to dispatch to callbacks of a specific action, in the same order
Possibly the smart people at Facebook have figured out that adding the complexity of individual callbacks would actually be less performant, or possibly it is not a priority.
Right now, whenever I need to access my data set size (and this can happen quite frequently), I perform a countForFetchRequest on the managedObjectContext. Is this a bad thing to do? Should I manage the count locally instead? The reason I went this route is to ensure I am getting a 100% correct answer. With Core Data being accessed from more than one place (for example, through NSFetchedResultsController as well), it's hard to keep an accurate count locally.
-countForFetchRequest: is always evaluated in the persistent store. When using the SQLite store, this will result in IO being performed.
Suggested strategy:
Cache the count returned from -countForFetchRequest:.
Observe NSManagedObjectContextObjectsDidChangeNotification for your own context.
Observe NSManagedObjectContextDidSaveNotification for related contexts.
For the simple case (no fetch predicate) you can update the count from the information contained in the notification without additional IO.
Alternately, you can invalidate your cached count and refresh via -countForFetchRequest: as necessary.
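For illustration, the shape of that strategy sketched generically (the Core Data calls are stand-ins; all names invented):

using System;

public class CountCache
{
    private int? _cached;
    private readonly Func<int> _expensiveCount;   // stands in for -countForFetchRequest:

    public CountCache(Func<int> expensiveCount) => _expensiveCount = expensiveCount;

    public int Count
    {
        get
        {
            if (_cached is null)
                _cached = _expensiveCount();      // store IO only on a miss
            return _cached.Value;
        }
    }

    // Wire to the objects-did-change / did-save notifications: with no
    // fetch predicate, the count can be adjusted from the notification alone.
    public void OnObjectsChanged(int insertedCount, int deletedCount)
    {
        if (_cached.HasValue)
            _cached += insertedCount - deletedCount;
    }

    // With a predicate, invalidate and re-count lazily instead.
    public void Invalidate() => _cached = null;
}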