Locking HttpRuntime.Cache for lazy loading - multithreading

We have a website running .NET 2.0 and have started using the ASP.NET HttpRuntime.Cache to store the results of frequent data lookups to cut down on our database access.
Snippet:
// locker is a dedicated lock object, declared once, e.g.:
// private static readonly object locker = new object();
lock (locker)
{
    if (HttpRuntime.Cache[cacheKey] == null)
    {
        HttpRuntime.Cache.Insert(cacheKey, GetSomeDataToCache(), null,
            DateTime.Today.AddDays(1), Cache.NoSlidingExpiration);
    }
    return ((SomeData)HttpRuntime.Cache[cacheKey]).Copy();
}
We are pessimistically locking whenever we want to look at the cache. However, I've seen various blog posts around the web suggesting you lock only after checking the cache value, so you don't incur the overhead of the lock on every read. That doesn't seem right, as another thread may have written to the cache after the check.
So finally, my question is: what is the "right" way to do this? Are we even using the right thread synchronization object? I am aware of ReaderWriterLockSlim, but we're running .NET 2.0.

As far as I know, the Cache object is thread-safe, so you wouldn't need the lock.

The Cache object in .NET is thread safe, so locking is not necessary. Reference: http://msdn.microsoft.com/en-us/library/system.web.caching.cache.aspx.

Your code probably makes you think you'll have that item cached for one day and that your last line will always hand that data back to you, but that's not the case. As others said, the cache operations are synchronized, so you shouldn't lock at that point.
Take a look here for the proper way of doing it.
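Most likely that link describes the check-lock-check (double-checked locking) pattern; a minimal sketch of it against the question's own code:
SomeData data = (SomeData)HttpRuntime.Cache[cacheKey];
if (data == null)
{
    lock (locker)
    {
        // Re-check inside the lock: another thread may have inserted
        // the value between our first check and acquiring the lock.
        data = (SomeData)HttpRuntime.Cache[cacheKey];
        if (data == null)
        {
            data = GetSomeDataToCache();
            HttpRuntime.Cache.Insert(cacheKey, data, null,
                DateTime.Today.AddDays(1), Cache.NoSlidingExpiration);
        }
    }
}
return data.Copy();
This way the common case (cache hit) pays nothing for the lock, and the expensive GetSomeDataToCache() still runs at most once per expiry.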

Thread-safe. Does it mean all other processes are waiting forever for your code to finish?
Thread-safe means you can be sure that the item you fetch will not be "cut in half" or partially demolished by an update to the cache happening at the same time you are reading the item.
item = cache.Get(key);
But anything you do after that fetch, another thread can operate on the cache (or any other shared resource) in the meantime. If you want to do something to the cache based on your fetched item being null or not, you cannot be 100% sure it hasn't already been handled by another instance of your own code running a few CPU instructions ahead, servicing another reader of the same page of your online motor magazine.
You need some bad luck for this to bite. The risk of other processes touching the same cache object non-atomically, a few lines of code apart, is vanishingly small. But if it happens, you will have a hard time figuring out why the image of the Chevy is sometimes the small suitcase from page two.
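Concretely, that advice amounts to fetching the value once into a local and acting only on the local copy; a small sketch reusing the question's names:
// One atomic fetch; everything afterwards works on the local copy,
// which no other thread can change under us.
SomeData data = (SomeData)HttpRuntime.Cache[cacheKey];
if (data == null)
{
    // Another request may be inserting right now; since Insert simply
    // overwrites the entry, the worst case is some redundant work.
    HttpRuntime.Cache.Insert(cacheKey, GetSomeDataToCache(), null,
        DateTime.Today.AddDays(1), Cache.NoSlidingExpiration);
}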

Related

What multithreading based data structure should I use?

I have recently come across a question based on multi-threading. I was given a situation where a variable number of cars are constantly changing their locations, and multiple users are posting requests to get the location of any car at any moment. What would be a suitable data structure to handle this situation, and why?
You could use a mutex (one per car).
Lock: before changing location of the associated car
Unlock: after changing location of the associated car
Lock: before getting location of the associated car
Unlock: after done doing work that relies on that location being up to date
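A minimal C# sketch of that per-car mutex idea (Car and Location are placeholder types):
class Car
{
    private readonly object gate = new object();
    private Location location;

    public void SetLocation(Location newLocation)
    {
        lock (gate) { location = newLocation; }   // one writer at a time
    }

    public Location GetLocation()
    {
        lock (gate) { return location; }          // reads see a consistent value
    }
}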
I'd answer with:
Try to make threading a concept external to your system, while keeping the system as modular and encapsulated as possible. That allows concurrency to be added at a later phase at low cost, and if the solution happens to work well in a single thread (say, by making it event-loop based), no time will have been wasted.
There are several ways to do this. Which way you choose depends a lot on the number of cars, the frequency of updates and position requests, the expected response time, and how accurate (up to date) you want the position reports to be.
The easiest way to handle this is with a simple mutex (lock) that allows only one thread at a time to access the data structure. Assuming you're using a dictionary or hash map, your code would look something like this (here in C#):
private readonly Dictionary<string, Car> cars = new Dictionary<string, Car>();
private readonly object carsMutex = new object();

public Location GetLocation(string carKey)
{
    lock (carsMutex)   // only one thread at a time may touch the map
    {
        return cars[carKey].Location;
    }
}
You'd do that for Add, Remove, Update, etc. Any method that reads or updates the data structure would require that you acquire the mutex.
If the number of queries far outweighs the number of updates, then you can do better with a reader/writer lock instead of a mutex (in .NET, ReaderWriterLockSlim; call it carsLock). With an RW lock, you can have an unlimited number of readers, OR you can have a single writer. With that, querying the data would be:
carsLock.EnterReadLock();             // many concurrent readers allowed
try { return cars[carKey].Location; }
finally { carsLock.ExitReadLock(); }
And Add, Update, and Remove would be:
carsLock.EnterWriteLock();            // exclusive: no readers, no other writers
try { /* add, update, or remove here */ }
finally { carsLock.ExitWriteLock(); }
Many runtime libraries have a concurrent dictionary data structure already built in. .NET, for example, has ConcurrentDictionary. With those, you don't have to worry about explicitly synchronizing access with a Mutex or RW lock; the data structure handles synchronization for you, either with a technique similar to that shown above, or by implementing lock-free algorithms.
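For instance, a short sketch with ConcurrentDictionary (Car is a placeholder type):
using System.Collections.Concurrent;

var cars = new ConcurrentDictionary<string, Car>();

// All of these calls are safe from any number of threads at once.
cars["car42"] = new Car();                   // add or replace
if (cars.TryGetValue("car42", out Car car))
{
    Location where = car.Location;           // read
}
cars.TryRemove("car42", out _);              // remove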
As mentioned in comments, a relational database can handle this type of thing quite easily and can scale to a very large number of requests. Modern relational databases, properly constructed and with sufficient hardware, are surprisingly fast and can handle huge amounts of data with very high throughput.
There are other, more involved, methods that can increase throughput in some situations, depending on what you're trying to optimize. For example, if you're willing to tolerate some latency in reported positions, you could serve position requests from a snapshot list that's rebuilt once per minute (or once every five minutes). Position requests are then fulfilled immediately, with no lock required, from a static copy of the list. Updates are queued; once per minute a new list is created by applying the queued updates to the old list, and the new list is made available for requests.
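A sketch of that snapshot idea (names are illustrative; the reference swap at the end is what atomically publishes each new list):
private volatile Dictionary<string, Location> snapshot =
    new Dictionary<string, Location>();
private readonly ConcurrentQueue<KeyValuePair<string, Location>> pending =
    new ConcurrentQueue<KeyValuePair<string, Location>>();

// Readers: no lock, just a read from the current snapshot,
// which is never mutated after it has been published.
public Location GetLocation(string carKey)
{
    return snapshot[carKey];
}

// Writers: queue the update; a timer applies updates in batches.
public void ReportLocation(string carKey, Location loc)
{
    pending.Enqueue(new KeyValuePair<string, Location>(carKey, loc));
}

// Called once per minute (e.g. from a System.Threading.Timer).
private void RebuildSnapshot()
{
    var next = new Dictionary<string, Location>(snapshot);
    KeyValuePair<string, Location> update;
    while (pending.TryDequeue(out update))
        next[update.Key] = update.Value;
    snapshot = next;   // atomic reference assignment publishes the new list
}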
There are many different ways to solve your problem.

Using threadsafe initialization in a JRuby gem

Wanting to be sure we're using the correct synchronization (and no more than necessary) when writing thread-safe code in JRuby; specifically, in a Puma-instantiated Rails app.
UPDATE: Extensively re-edited this question to be very clear and to use the latest code we are implementing. This code uses the atomic gem written by @headius (Charles Nutter) for JRuby, but we're not sure it is totally necessary, or in which ways it's necessary, for what we're trying to do here.
Here's what we've got. Is this overkill (meaning, are we over- or uber-engineering this), or perhaps incorrect?
ourgem.rb:
require 'atomic' # gem from @headius

SUPPORTED_SERVICES = %w(serviceABC anotherSvc andSoOnSvc).freeze

module Foo
  def self.included(cls)
    cls.extend(ClassMethods)
    cls.send :__setup
  end

  module ClassMethods
    def get(service_name, method_name, *args)
      __cached_client(service_name).send(method_name.to_sym, *args)
      # we also capture exceptions here, but leaving those out for brevity
    end

    private

    def __client(service_name)
      # obtain and return a client handle for the given service_name
      # we definitely want to cache the value returned from this method
      # **AND**
      # it is a requirement that this method ONLY be called *once PER service_name*.
    end

    def __cached_client(service_name)
      @@_clients.value[service_name]
    end

    def __setup
      @@_clients = Atomic.new({})
      @@_clients.update do |current_services|
        SUPPORTED_SERVICES.inject(Atomic.new({}).value) do |memo, service_name|
          if current_services[service_name]
            current_services[service_name]
          else
            memo.merge({service_name => __client(service_name)})
          end
        end
      end
    end
  end
end
client.rb:
require 'ourgem'

class GetStuffFromServiceABC
  include Foo

  def self.get_some_stuff
    result = get('serviceABC', 'method_bar', 'arg1', 'arg2', 'arg3')
    puts result
  end
end
Summary of the above: we have @@_clients (a mutable class variable holding a Hash of clients) which we only want to populate ONCE for all available services, keyed on service_name.
Since the hash is in a class variable (and hence threadsafe?), are we guaranteed that the call to __client will not get run more than once per service name (even if Puma is instantiating multiple threads with this class to service all the requests from different users)? If the class variable is threadsafe (in that way), then perhaps the Atomic.new({}) is unnecessary?
Also, should we be using an Atomic.new(ThreadSafe::Hash) instead? Or again, is that not necessary?
If not (meaning: you think we do need at least the Atomic.new calls, and perhaps also the ThreadSafe::Hash), then why couldn't a second (or third, etc.) thread interrupt between the Atomic.new(nil) and the @@_clients.update do ..., meaning the Atomic.new calls from EACH thread will EACH create two (separate) objects?
Thanks for any thread-safety advice, we don't see any questions on SO that directly address this issue.
Just a friendly piece of advice, before I attempt to tackle the issues you raise here:
This question, and the accompanying code, strongly suggests that you don't (yet) have a solid grasp of the issues involved in writing multi-threaded code. I encourage you to think twice before deciding to write a multi-threaded app for production use. Why do you actually want to use Puma? Is it for performance? Will your app handle many long-running, I/O-bound requests (like uploading/downloading large files) at the same time? Or (like many apps) will it primarily handle short, CPU-bound requests?
If the answer is "short/CPU-bound", then you have little to gain from using Puma. Multiple single-threaded server processes would be better. Memory consumption will be higher, but you will keep your sanity. Writing correct multi-threaded code is devilishly hard, and even experts make mistakes. If your business success, job security, etc. depends on that multi-threaded code working and working right, you are going to cause yourself a lot of unnecessary pain and mental anguish.
That aside, let me try to unravel some of the issues raised in your question. There is so much to say that it's hard to know where to start. You may want to pour yourself a cold or hot beverage of your choice before sitting down to read this treatise:
When you talk about writing "thread-safe" code, you need to be clear about what you mean. In most cases, "thread-safe" code means code which doesn't concurrently modify mutable data in a way which could cause data corruption. (What a mouthful!) That could mean that the code doesn't allow concurrent modification of mutable data at all (using locks), or that it does allow concurrent modification, but makes sure that it doesn't corrupt data (probably using atomic operations and a touch of black magic).
Note that when your threads are only reading data, not modifying it, or when working with shared stateless objects, there is no question of "thread safety".
Another definition of "thread-safe", which probably applies better to your situation, has to do with operations which affect the outside world (basically I/O). You may want some operations to only happen once, or to happen in a specific order. If the code which performs those operations runs on multiple threads, they could happen more times than desired, or in a different order than desired, unless you do something to prevent that.
It appears that your __setup method is only called when ourgem.rb is first loaded. As far as I know, even if multiple threads require the same file at the same time, MRI will only ever let a single thread load the file. I don't know whether JRuby is the same. But in any case, if your source files are being loaded more than once, that is symptomatic of a deeper problem. They should only be loaded once, on a single thread. If your app handles requests on multiple threads, those threads should be started up after the application has loaded, not before. This is the only sane way to do things.
Assuming that everything is sane, ourgem.rb will be loaded using a single thread. That means __setup will only ever be called by a single thread. In that case, there is no question of thread safety at all to worry about (as far as initialization of your "client cache" goes).
Even if __setup were to be called concurrently by multiple threads, your atomic code won't do what you think it does. First of all, you use Atomic.new({}).value. This wraps a Hash in an atomic reference, then unwraps it, so you just get back the Hash. It's a no-op; you could just write {} instead.
Second, your Atomic#update call will not prevent the initialization code from running more than once. To understand this, you need to know what Atomic actually does.
Let me pull out the old, tired "increment a shared counter" example. Imagine the following code is running on 2 threads:
i += 1
We all know what can go wrong here. You may end up with the following sequence of events:
Thread A reads i and increments it.
Thread B reads i and increments it.
Thread A writes its incremented value back to i.
Thread B writes its incremented value back to i.
So we lose an update, right? But what if we store the counter value in an atomic reference, and use Atomic#update? Then it would be like this:
Thread A reads i and increments it.
Thread B reads i and increments it.
Thread A tries to write its incremented value back to i, and succeeds.
Thread B tries to write its incremented value back to i, and fails, because the value has already changed.
Thread B reads i again and increments it.
Thread B tries to write its incremented value back to i again, and succeeds this time.
Do you get the idea? Atomic never stops 2 threads from running the same code at the same time. What it does do, is force some threads to retry the #update block when necessary, to avoid lost updates.
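To make the retry behavior concrete, here's a tiny counter sketch with the atomic gem: no increments are lost, but the update block itself may run extra times under contention.
require 'atomic'

counter = Atomic.new(0)

threads = 10.times.map do
  Thread.new do
    1_000.times { counter.update { |value| value + 1 } }
  end
end
threads.each(&:join)

puts counter.value  # => 10000, despite all the racing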
If your goal is to ensure that your initialization code will only ever run once, using Atomic is a very inappropriate choice. If anything, it could make it run more times, rather than less (due to retries).
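If you genuinely did need run-once initialization across threads, a Mutex plus a flag would be the straightforward tool; a sketch meant to live inside ClassMethods (these names are mine, not from the question):
require 'thread'  # Mutex (needed on older Rubies)

SETUP_LOCK = Mutex.new

def __setup
  SETUP_LOCK.synchronize do
    unless @setup_done          # class-level instance variable
      # ... build the client cache exactly once ...
      @setup_done = true
    end
  end
end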
So, that is that. But if you're still with me here, I am actually more concerned about whether your "client" objects are themselves thread-safe. Do they have any mutable state? Since you are caching them, it seems that initializing them must be slow. Be that as it may, if you use locks to make them thread-safe, you may not be gaining anything from caching and sharing them between threads. Your "multi-threaded" server may be reduced to what is effectively an unnecessarily complicated, single-threaded server.
If the client objects have no mutable state, good for you. You can be "free and easy" and share them between threads with no problems. If they do have mutable state, but initializing them is slow, then I would recommend caching one object per thread, so they are never shared. Thread[] is your friend there.
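For example, a per-thread version of the cache could look like this (a sketch reusing the question's __client factory):
def __cached_client(service_name)
  # Thread.current[] is thread-local storage: each thread lazily builds
  # and keeps its own clients, so they are never shared across threads.
  Thread.current[:clients] ||= {}
  Thread.current[:clients][service_name] ||= __client(service_name)
end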

Is it required to lock shared variables in perl for read access?

I am using shared variables in Perl with use threads::shared.
These variables are modified from only a single thread; all other threads only 'read' them.
Is it required that the 'reading' threads lock
{
lock $shared_var;
if ($shared_var > 0) .... ;
}
?
Isn't it safe to do a simple check without locking (in the 'reading' thread!), like
if ($shared_var > 0) ....
?
Locking is not required to maintain internal integrity when setting or fetching a scalar.
Whether it's needed or not in your particular case depends on the needs of the reader, the other readers and the writers. It rarely makes sense not to lock, but you haven't provided enough details for us to determine what your needs are.
For example, it might not be acceptable to use an old value after the writer has updated the shared variable. For starters, this can lead to a situation where one thread is still using the old value while another thread is using the new value, which can be undesirable if those two threads interact.
It depends on whether it's meaningful to test the condition at just one arbitrary point in time. The problem is that in the vast majority of cases, that Boolean test stands for other things, which might already have changed by the time you're done reading it; the test then only represents a previous state.
Think about it. If it's an insignificant test, then it means little--and you have to question why you are making it. If it's a significant test, then it is telltale of a coherent state that may or may not exist anymore--you won't know for sure, unless you lock it.
A lot of times, say in real-time reporting, you don't really care which snapshot the database hands you, you just want a relatively current one. But a database, as part of its transaction logic, keeps a complete picture of how things were prior to a commit. You're unlikely to find that in ordinary code, where the current state is simply the current state, and even a provisional state is a definite state.
I guess one of the times this can be different is cyclical access to a queue. If one consumer doesn't get the head record this time around, another one will on the next pass. You can probably save some processing time by checking the queue counter without synchronization. But that's a case where the unlocked test means little in the context of any single iteration.
In that case, you would want locked instructions afterward that expect the queue might actually be empty even though your test suggested it had data. In other words: if it is just a preliminary test, you need logic that treats the test as being exactly as unreliable as it actually is.
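A sketch of that "treat the unlocked test as a hint" pattern for a shared queue (using threads::shared; names are illustrative):
use threads;
use threads::shared;

my @queue :shared;

sub try_consume {
    return unless @queue;        # unlocked peek: cheap, but only a hint

    my $item;
    {
        lock(@queue);
        return unless @queue;    # re-check under the lock; the hint may be stale
        $item = shift @queue;
    }                            # lock released at end of block
    # ... process $item outside the lock ...
}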

Cache Locking for lots of processes?

I've got a script that 1) runs often, 2) is run by lots of different processes, and 3) takes a long time.
update: The stuff that takes a long time is a set of tests whose results will be the same for every process. Totally redundant.
I think it's time to do some caching, but I'm worried about the potential for races, conflicts, corruption, temporal-vortex-instability and chickens.
The complexity comes in because any of the processes could update the cache as well as read the cache, so I have to know how to handle all those combinations.
This smells to me like something that someone smarter and more educated than myself has already probably figured out.
Anyway, to make this question more concrete, here's what I've thought of so far. I'm using flock in my head, not sure if that's a good idea.
if the cache is fresh, read it and go away
if the cache is stale
try to get a write lock
if I get the lock, do the tests and update the cache
If I don't get the lock, does someone else have a write or a read lock?
If it's shared, why are they reading a stale cache? Do I ignore them, do the tests, and update the cache (or maybe this causes them to read a half-written cache... er...)
If it's exclusive, give them a short time to complete the tests and update the cache.
Hope that makes sense...
Here is a scheme which uses flock(2) for file locking in concurrent environments.
It explains how "safe-cache" works.
Every cache file has two companion files (WLock and RLock).
All flock requests are blocking except the first one (the WLock is taken non-blocking, i.e. with LOCK_NB).
holding the WLock ensures the opportunity to generate a fresh cache
holding a shared RLock ensures safe reading from the cache file
and holding an exclusive RLock ensures safe writing to the cache file
There are two companion files for one reason only: while a new cache is being generated, and the old cache is not too old (cache time + N has not expired), clients can still use the old cache instead of waiting for the new one to be generated.
Please comment on this scheme and make it simpler if possible.
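For reference, a minimal Perl sketch of that scheme (the cache_is_stale, regenerate_cache, and read_cache helpers are hypothetical):
use Fcntl qw(:flock);

sub fetch_cache {
    my ($cache_file) = @_;

    # Non-blocking attempt to become the single writer (the LOCK_NB WLock).
    open my $wlock, '>', "$cache_file.wlock" or die $!;
    if (flock($wlock, LOCK_EX | LOCK_NB) && cache_is_stale($cache_file)) {
        # Exclusive RLock: no reader can see a half-written cache file.
        open my $xlock, '>', "$cache_file.rlock" or die $!;
        flock($xlock, LOCK_EX) or die $!;
        regenerate_cache($cache_file);
        close $xlock;
    }
    close $wlock;    # releases the write lock if we held it

    # Shared RLock: many concurrent readers, but never during a write.
    open my $rlock, '>', "$cache_file.rlock" or die $!;
    flock($rlock, LOCK_SH) or die $!;
    my $data = read_cache($cache_file);
    close $rlock;
    return $data;
}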

Core Data: Deleting causes 'NSObjectInaccessibleException' from NSOperation with a reference to a deleted object

My application has NSOperation subclasses that fetch and operate on managed objects. My application also periodically purges rows from the database, which can result in the following race condition:
A background operation fetches a bunch of objects (from a thread-specific context). It will iterate over these objects and do something with their properties.
A bunch of rows are deleted in the main managed object context.
The background operation accesses a property on an object that was deleted from the main context. This results in an 'NSObjectInaccessibleException', reason: 'CoreData could not fulfill a fault'
Ideally, the objects fetched by the NSOperation could still be operated on even if one is deleted in the main context. The best ways I can think of to achieve this are either to:
Call [request setReturnsObjectsAsFaults:NO] to ensure that Core Data won't try to fulfill a fault for an object that no longer exists in the main context. The problem here is I may need to access the object's relationships, which (to my understanding) will still be faulted.
Iterate through the managed objects up front and copy the properties I will need into separate non-managed objects. The problem here is that (I think) I will need to synchronize/lock this part, in case an object is deleted in the main context before I can finish copying.
Am I missing something obvious? It doesn't seem like what I'm trying to accomplish is too out of the ordinary. Thanks for your help.
You said each thread has its own context. That's good. However, they also need to stay synchronized with changes to each other (how depends on their hierarchy).
Are they all assigned to the same persistent store coordinator, or do they have parent/child relationships?
Siblings should monitor NSManagedObjectContextObjectsDidChangeNotification from other siblings. Parents will automatically get notified when a child context saves.
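For example, merging a sibling's saves is usually driven off the did-save notification; a minimal sketch (mainContext and backgroundContext are placeholders for two sibling contexts):
// Merge every save made on mainContext into backgroundContext.
[[NSNotificationCenter defaultCenter]
    addObserverForName:NSManagedObjectContextDidSaveNotification
                object:mainContext
                 queue:nil
            usingBlock:^(NSNotification *note) {
    [backgroundContext performBlock:^{
        [backgroundContext mergeChangesFromContextDidSaveNotification:note];
    }];
}];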
I ended up mitigating this by performing both fetches and deletes on the same queue.
Great question, I can only provide a partial answer and would really like to know this too. Unfortunately your own solution is more of a workaround but not really an answer. What if the background operation is very long and you can't resort to running it on the main thread?
One thing I can say is that you don't have to call [request setReturnsObjectsAsFaults:NO], since the fetch request will load the data into the row cache and will not go back to the database when a fault fires for one of the fetched objects (see Apple's documentation for NSFetchRequest). This doesn't help with relationships, though.
I've tried the following:
On NSManagedObjectContextWillSave notification, wait for the current background task to finish and prevent new tasks from starting with something like
-(void)contextWillSave:(NSNotification *)notification {
    dispatch_sync(self.backgroundQueue, ^{
        self.suspendBackgroundOperation = YES;
    });
}
Unset suspendBackgroundOperation on NSManagedObjectContextDidSave notification
However, the dispatch_sync call introduces possible deadlocks, so this doesn't really work either (see my related question). Plus, it would still block the main thread until a potentially lengthy background operation finishes.
