According to the documentation for the beginWriteTransaction method:
Only one write transaction can be open at a time. Write transactions
cannot be nested, and trying to begin a write transaction on a
RLMRealm which is already in a write transaction will throw an
exception. Calls to beginWriteTransaction from RLMRealm instances in
other threads will block until the current write transaction
completes.
but when I looked at the code, I found the following:
void Realm::begin_transaction()
{
    check_read_write(this);
    verify_thread();

    if (is_in_transaction()) {
        throw InvalidTransactionException("The Realm is already in a write transaction");
    }

    // make sure we have a read transaction
    read_group();

    transaction::begin(*m_shared_group, m_binding_context.get());
}
Could you explain when this condition is met?
Calls to beginWriteTransaction from RLMRealm instances in other
threads will block until the current write transaction completes.
The last call leads to a method that leaves this intermediate cross-platform C++ API level and goes one level deeper, into our internal storage engine, where it uses a mutex to coordinate exclusive access between processes and threads.
Once this mutex is acquired, it is held until the write transaction is committed or cancelled.
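In other words, the blocking the documentation promises happens below the C++ layer quoted above. A minimal hypothetical sketch of the mechanism (not Realm's actual engine code):

#include <mutex>

std::mutex g_write_mutex; // stands in for the storage engine's write lock

void begin_write()
{
    g_write_mutex.lock();   // blocks while another thread's write transaction is open
}

void commit_write()
{
    g_write_mutex.unlock(); // a waiting thread's begin_write() can now return
}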
I feel the core of this question is not related to the specific language and library I am using, so I am using some pseudo-code. We can assume C as the language and a WinAPI COM DLL.
Let's say I am using a dynamically linked external library which exposes some callbacks in response to some events. Say:
function RegisterCallback(ptr *callback);
Which is to be used as:
function OnEvent(type newValue) {
    ...
}
...
RegisterCallback(&OnEvent)
...
The library tells me that the callback should be non-blocking.
Now, suppose that I want to update an internal state in response to this event. This internal state is accessed by other threads and hence is guarded by a mutex. Thus I would like to write:
function OnEvent(type newValue) {
    mutexLock();
    internalState = newValue;
    mutexUnlock();
}
But this would be a potentially blocking operation. How should I proceed? The only solution I see out of this is to use a different thread to update the state, like:
function OnEvent(type newValue) {
    sendChangeStateMessage(newValue);
}
But, again, in order for this call to be non-blocking, this "sending operation" should be buffered (that is, have a queue of messages), since sending (i.e. sharing) data across threads requires syncing, and hence locking.
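For illustration, here is one way such a buffered send can be made genuinely non-blocking: a fixed-size single-producer/single-consumer ring buffer (a C++ sketch; the type and names are my own, not from any particular library) whose push never waits, it merely reports failure when full:

#include <array>
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscQueue {                         // holds at most N - 1 items
    std::array<T, N> buf;
    std::atomic<std::size_t> head{0}, tail{0};
public:
    bool try_push(const T& v) {           // called from the callback: never blocks
        std::size_t t = tail.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire))
            return false;                 // full: drop or count the overflow
        buf[t] = v;
        tail.store(next, std::memory_order_release);
        return true;
    }
    bool try_pop(T& out) {                // called from the state-owning thread
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                 // empty
        out = buf[h];
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};

The thread that owns the internal state then drains the queue with try_pop on its own schedule and applies the changes under its usual locking discipline.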
EDIT: Of course, if the operation is atomic (as it might be for an integer), there is no such problem. /EDIT
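A minimal sketch of that atomic case, assuming the whole shared state fits in a single lock-free variable:

#include <atomic>

std::atomic<int> internalState{0};

void OnEvent(int newValue)
{
    internalState.store(newValue, std::memory_order_relaxed); // never blocks
}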
To wrap it up: how do you transform blocking code into non-blocking code?
Thanks
Locking using a mutex is not always a blocking operation.
If the mutex is only used to protect access to that one variable, and all other threads that acquire the mutex do no blocking operations while holding it, then this is not a blocking operation. A blocking operation is one that blocks until something happens. Here, the use of the mutex is unlikely to block unless, for instance, another thread locks the mutex and then waits on a network read while holding it.
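To make that concrete, a small C++ sketch (the names are made up): the first function can delay other lockers only for the few instructions it holds the mutex, while the second turns the mutex into a genuine blocking hazard by waiting on I/O while holding it:

#include <mutex>

std::mutex m;
int internalState = 0;

void wait_for_network_read(); // hypothetical blocking call

void update_fine(int v)
{
    std::lock_guard<std::mutex> lock(m);
    internalState = v;         // short, CPU-only critical section: effectively non-blocking
}

void update_hazard(int v)
{
    std::lock_guard<std::mutex> lock(m);
    wait_for_network_read();   // blocking while holding m: other lockers may now wait indefinitely
    internalState = v;
}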
I have a system which runs multiple service (long-lived) and worker (short-lived) threads. They all share a state which contains objects. Any thread can request an object at any time, through a singleton-of-sorts class called ObjectManager. If the object is not available, it needs to be loaded.
Here's some pseudo-code of how object loading looks now:
class ObjectManager {
    getLoadingData(path) {
        if (hasLoadingDataFor(path))
            return whatWeHave();
        else {
            loadingData = createNewLoadingData();
            loadingData.path = path;
            pushLoadingTaskToLoadingThread(loadingData);
            return loadingData;
        }
    }

    // loads object and blocks until it's loaded
    loadObjectSync(path) {
        loadingData = getLoadingData(path);
        waitFor(loadingData.conditionVar);
        return loadingData.loadedObject;
    }

    // initiates a load and calls a callback when done
    loadObjectAsync(path, callback) {
        loadingData = getLoadingData(path);
        loadingData.callbacks.add(callback);
    }

    // dedicated loading thread
    loadingThread() {
        while (running) {
            loadingData = waitForLoadingData();
            object = readObjectFromDisk(loadingData.path);
            object.onLoaded(); // !!!!
            loadingData.loadedObject = object;
            // unblock cv waiters
            loadingData.conditionVar.notifyAll();
            // call callbacks
            loadingData.callbacks.callAll(object);
        }
    }
}
The problem is the line object.onLoaded(). I have no control over this function. Some objects might decide that they need other objects to be valid, so in their onLoaded method they might call loadObjectSync. Uh-oh! This (naturally) deadlocks: it blocks the loading loop on a wait that only further iterations of that same loading loop could satisfy.
What I could do to solve this is leave the onLoaded call to the initiating threads. This will change loadObjectSync to something like:
loadObjectSync(path) {
    loadingData = getLoadingData(path);
    waitFor(loadingData.conditionVar);
    if (loadingData.wasCreatedInThisThread()) {
        loadingData.loadedObject.onLoaded();
        loadingData.onLoadedConditionVar.notifyAll();
        loadingData.callbacks.callAll(loadingData.loadedObject);
    }
    else {
        // wait more
        waitFor(loadingData.onLoadedConditionVar);
    }
    return loadingData.loadedObject;
}
... but then the problem is that if I have no calls to loadObjectSync and only calls to loadObjectAsync, or if the loadObjectAsync call was simply the first to create the loading data, there will be no one to finalize the object. So to make this work, I would have to introduce another thread that finalizes objects whose loadingData was created by loadObjectAsync.
It seems that it would work. But I have a simpler idea! What if I change getLoadingData instead? What if it does this:
getLoadingData(path) {
    if (hasLoadingDataFor(path))
        return whatWeHave();
    else {
        loadingData = createNewLoadingData();
        loadingData.path = path;
        ///
        thread = spawnLoadingThread(loadingData);
        thread.detach();
        ///
        return loadingData;
    }
}
Spawn a new thread for every object load. Thus there is no deadlock. Every loading thread can safely block until it's done. The rest of the code remains exactly as it is.
This means potentially tens (or, in certain edge cases, even thousands) of active threads waiting on condition variables. I know that spawning threads has overhead, but I think it would be negligible compared to the cost of the I/O in readObjectFromDisk.
So my question is: Is this terrible? Can this somehow backfire?
The target platform is conventional desktop machines. But this software is supposed to run for a long time without stopping: weeks, maybe months.
Alternatively... even though I have an idea how to solve this if the thread-per-load turns out to be terrible, can this be solved in another way?
Very interesting! This is a problem I have bumped into a couple of times when trying to add a synchronous interface to a fundamentally asynchronous operation (e.g. a file load, or in my case, a network write) that is performed by a service thread.
My own preference would be to not provide the synchronous interface. Why? Because it keeps the code simpler in design & implementation and easier to reason about -- always important for multi-threading.
The benefit of sticking to a single thread and an async-only interface is that you have only one service thread, so resource growth is not a concern; plus, the user callbacks are always invoked on this same thread, which simplifies thread-safety concerns for users of ObjectManager (if you have multiple callback threads, every user callback must be thread-safe, so it's an important choice to make). However, sticking to an async-only interface does mean the user of ObjectManager has more work to do.
But if you do want to keep the synchronous interface, then another approach that I have taken could work for you. You stick to a single service thread, but inside the implementation of loadObjectSync you check the thread ID to determine whether the invoker is the service thread or any other thread. If it is any other thread, you queue the request and safely block. But if it is the service thread, you can immediately load the object by calling a new function, say loadObjectImpl, which is basically just the internal scope of the loadingThread function. You will need to grab the thread ID of the service thread during initialization, store it within the ObjectManager instance, and use that for thread identification.
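Roughly, as a C++ sketch (captureServiceThreadId, loadObjectImpl, and queueAndWait are hypothetical names for the pieces described above):

#include <string>
#include <thread>

struct Object;

class ObjectManager {
public:
    // called once, from the service thread, during initialization
    void captureServiceThreadId() { serviceThreadId = std::this_thread::get_id(); }

    Object* loadObjectSync(const std::string& path)
    {
        if (std::this_thread::get_id() == serviceThreadId) {
            // we ARE the service thread: load inline instead of queueing,
            // otherwise we'd block on work only this very thread can perform
            return loadObjectImpl(path);
        }
        return queueAndWait(path); // any other thread: queue the request and safely block
    }

private:
    Object* loadObjectImpl(const std::string& path); // the factored-out body of loadingThread
    Object* queueAndWait(const std::string& path);   // the existing queue + condition-variable path
    std::thread::id serviceThreadId;
};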
I'm trying to understand the semantics of async/await in an infinitely looping worker thread started inside a Windows service. I'm a newbie at this, so give me some leeway here; I'm trying to understand the concept.
The worker thread will loop forever (until the service is stopped) and it processes an external queue resource (in this case a SQL Server Service Broker queue).
The worker thread uses config data which could be changed while the service is running, by receiving commands on the main service thread via some kind of IPC. Ideally the worker thread should process those config changes while waiting for the external queue messages to be received. Reading from Service Broker is inherently asynchronous; you literally issue a WAITFOR RECEIVE T-SQL statement with a receive timeout.
But I don't quite understand the flow of control I'd need to use to do that.
Let's say I used a ConcurrentQueue to pass config change messages from the main thread to the worker thread. Then, if I did something like...
void ProcessBrokerMessages() {
    foreach (BrokerMessage m in ReadBrokerQueue()) {
        ProcessMessage(m);
    }
}

// ... inside the worker thread:
while (!serviceStopped) {
    // TryDequeue actually consumes items; enumerating a ConcurrentQueue
    // with foreach would only iterate a snapshot without removing anything
    while (configChangeConcurrentQueue.TryDequeue(out var configChange)) {
        processConfigChange(configChange);
    }
    ProcessBrokerMessages();
}
...then the loop that processes config changes and the broker-processing function need to "take turns" to run. Specifically, the config-change-processing loop won't run while the potentially long-running broker receive command is running.
My understanding is that simply turning ProcessBrokerMessages() into an async method doesn't help me in this case (or I don't understand what would happen). To me, with my lack of understanding, the most intuitive interpretation is that when I hit the async call, it would go off and do its thing, and execution would continue with a restart of the outer while loop. But that would mean the loop would execute ProcessBrokerMessages() over and over even though it's already running from the invocation in the previous iteration, which I don't want.
As far as I know this is not what would happen, though I only "know" that because I've read something along those lines. I don't really understand it.
Arguably the existing flow of control (i.e., without the async call) is OK: if config changes affect the ProcessBrokerMessages() function (which they can), then the config can't be changed while the function is running anyway. But that seems like a point specific to this particular example. I can imagine a case where config changes alter something else the thread does, unrelated to the ProcessBrokerMessages() call.
Can someone improve my understanding here? What's the right way to have
a block of code which loops over multiple statements
where one (or some) but not all of those statements are asynchronous
and the async operation should only ever be executing once at a time
but execution should keep looping through the rest of the statements while the single instance of the async operation runs
and the async method should be called again in the loop if the previous invocation has completed
It seems like I could use a BackgroundWorker to run the receive statement, which flips a flag when its job is done, but it also seems weird to me to create a thread specifically for processing the external resource and then, within that thread, create a BackgroundWorker to actually do that job.
You could use a CancellationToken. Most async functions accept one as a parameter, and they cancel the call (the returned Task, actually) if the token is signaled. SqlCommand.ExecuteReaderAsync (which you're likely using to issue the WAITFOR RECEIVE) is no different. So:
Have a cancellation token passed to the 'execution' thread.
The settings monitor (the one responding to IPC) also holds a reference to the token.
When a config change occurs, the monitor applies the change and then signals the token.
The execution thread aborts any pending WAITFOR (or, really, any pending work in the message processing loop; you should use the cancellation token everywhere), and any open transaction is aborted and rolled back.
Restart the execution thread with a new cancellation token. It will pick up the new config.
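Since the pattern itself is language-agnostic, here is a rough C++20 sketch of that signal-and-restart flow, with std::stop_token standing in for the CancellationToken and processBrokerMessages as a hypothetical stand-in for the receive loop:

#include <chrono>
#include <stop_token>
#include <thread>

void processBrokerMessages(std::stop_token cancel)
{
    while (!cancel.stop_requested()) {
        // issue the blocking receive here, checking `cancel` wherever the wait
        // can be aborted cleanly; roll back any open transaction on cancellation
        std::this_thread::sleep_for(std::chrono::milliseconds(100)); // placeholder for the wait
    }
}

int main()
{
    std::jthread worker(processBrokerMessages); // jthread passes its stop_token to the function

    // ...later, when the IPC monitor has applied a config change:
    worker.request_stop(); // signal the "token": the worker aborts its wait
    worker.join();         // let it unwind cleanly

    // restart with a fresh token; the new loop reads the new config
    worker = std::jthread(processBrokerMessages);
}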
So in this particular case I decided to go with a simpler shared state solution. This is of course a less sound solution in principle, but since there's not a lot of shared state involved, and since the overall application isn't very complicated, it seemed forgivable.
My implementation here is to use locking, but to have writes to the config from the service main thread wrapped in a Task.Run(). The reader doesn't bother with a Task, since the reader is already on its own thread.
Is there ever any reason to add blocks to a serial dispatch queue asynchronously as opposed to synchronously?
As I understand it, a serial dispatch queue only starts executing the next task in the queue once the preceding task has completed executing. If this is the case, I can't see what you would gain by submitting some blocks asynchronously: the act of submission may not block the thread (since it returns straight away), but the task won't be executed until the last task finishes, so it seems to me that you don't really gain anything.
This question has been prompted by the following code - taken from a book chapter on design patterns. To prevent the underlying data array from being modified simultaneously by two separate threads, all modification tasks are added to a serial dispatch queue. But note that returnToPool adds tasks to this queue asynchronously, whereas getFromPool adds its tasks synchronously.
class Pool<T> {
    private var data = [T]();

    // Create a serial dispatch queue
    private let arrayQ = dispatch_queue_create("arrayQ", DISPATCH_QUEUE_SERIAL);

    private let semaphore:dispatch_semaphore_t;

    init(items:[T]) {
        data.reserveCapacity(items.count);
        for item in items {
            data.append(item);
        }
        semaphore = dispatch_semaphore_create(items.count);
    }

    func getFromPool() -> T? {
        var result:T?;
        if (dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER) == 0) {
            dispatch_sync(arrayQ, {() in
                result = self.data.removeAtIndex(0);
            })
        }
        return result;
    }

    func returnToPool(item:T) {
        dispatch_async(arrayQ, {() in
            self.data.append(item);
            dispatch_semaphore_signal(self.semaphore);
        });
    }
}
Because there's no need to make the caller of returnToPool() block. It could perhaps continue on doing other useful work.
The thread which called returnToPool() is presumably not just working with this pool. It presumably has other stuff it could be doing. That stuff could be done simultaneously with the work in the asynchronously-submitted task.
Typical modern computers have multiple CPU cores, so a design like this improves the chances that CPU cores are utilized efficiently and useful work is completed sooner. The question isn't whether tasks submitted to the serial queue operate simultaneously — they can't because of the nature of serial queues — it's whether other work can be done simultaneously.
Yes, there are reasons why you'd add tasks to a serial queue asynchronously. It's actually extremely common.
The most common example would be when you're doing something in the background and want to update the UI. You'll often dispatch that UI update asynchronously back to the main queue (which is a serial queue). That way the background thread doesn't have to wait for the main thread to perform its UI update, but rather it can carry on processing in the background.
Another common example is, as you've demonstrated, when using a GCD queue to synchronize interaction with some mutable object. You can dispatch updates asynchronously to this synchronization queue (why make the current thread wait when it could carry on instead?). You'll do reads synchronously (because you're obviously going to wait until you get the synchronized value back), but writes can be done asynchronously.
(You actually see this latter example frequently implemented with the "reader-writer" pattern and a custom concurrent queue, where reads are performed synchronously on concurrent queue with dispatch_sync, but writes are performed asynchronously with barrier with dispatch_barrier_async. But the idea is equally applicable to serial queues, too.)
The choice of synchronous vs. asynchronous dispatch has nothing to do with whether the destination queue is serial or concurrent. It's simply a question of whether you have to block the current thread until the other queue finishes its task or not.
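To see that distinction outside of GCD, here is a minimal C++ sketch of a serial queue (all names are hypothetical, not a real API): both submission styles execute tasks strictly one at a time on a single worker thread, and they differ only in whether the submitting thread waits:

#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

class SerialQueue {
public:
    SerialQueue() : worker([this] { run(); }) {}

    ~SerialQueue()
    {
        { std::lock_guard<std::mutex> l(m); stopping = true; }
        cv.notify_one();
        worker.join();
    }

    // dispatch_async analogue: enqueue and return immediately
    void async(std::function<void()> f)
    {
        { std::lock_guard<std::mutex> l(m); tasks.push_back(std::move(f)); }
        cv.notify_one();
    }

    // dispatch_sync analogue: enqueue and block until the task has run
    // (like dispatch_sync to the current queue, calling this from the
    // worker thread itself would deadlock)
    void sync(std::function<void()> f)
    {
        std::promise<void> done;
        async([&] { f(); done.set_value(); });
        done.get_future().wait();
    }

private:
    void run()
    {
        for (;;) {
            std::function<void()> f;
            {
                std::unique_lock<std::mutex> l(m);
                cv.wait(l, [this] { return stopping || !tasks.empty(); });
                if (tasks.empty()) return; // stopping requested and queue drained
                f = std::move(tasks.front());
                tasks.pop_front();
            }
            f(); // tasks run strictly one at a time, in FIFO order
        }
    }

    std::deque<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;
    std::thread worker; // declared last so run() only sees fully constructed members
};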
Regarding your sample code, that is correct. getFromPool should dispatch synchronously (because you have to wait for the synchronization queue to actually return the value), but returnToPool can safely dispatch asynchronously. Obviously, I'm wary of seeing code wait on semaphores if it might be called from the main thread (so make sure you don't call getFromPool from the main thread!), but with that one caveat, this code should achieve the desired purpose, offering reasonably efficient synchronization of the pool object, with a getFromPool that blocks while the pool is empty until something is added back to it.
I have a multi-threaded batch job reading from a DB, and I am concerned about different threads re-reading records, as ItemReader is not thread-safe in Spring Batch. I went through the Spring Batch FAQ section, which states:
You can synchronize the read() method (e.g. by wrapping it in a delegator that does the synchronization). Remember that you will lose restartability, so best practice is to mark the step as not restartable and to be safe (and efficient) you can also set saveState=false on the reader.
I want to know why I will lose restartability in this case. What has restartability got to do with synchronizing my read operations? It can always try again, right?
Also, will this piece of code be enough for synchronizing the reader?
public class SynchronizedItemReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedItemReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    public synchronized T read() throws Exception {
        return delegate.read();
    }
}
When using an ItemReader with multiple threads, the lack of restartability is not about the read itself. It's about saving the state of the reader, which occurs in the update method. The issue is that there needs to be coordination between the calls to read() (the method providing the data) and update() (the method persisting the state). When you use multiple threads, the internal state of the reader (and therefore the update() call) may or may not reflect the work that has been done. Take, for example, a FlatFileItemReader using a chunk size of 5 and running on multiple threads. Thread 1 could have read 5 items (time to update), yet thread 2 could have read an additional 3. This means the call to update would record that 8 items have been read. If the chunk on thread 2 fails, the saved state would be incorrect, and the restart would miss the three items that were already read.
This is not to say that it is impossible to write a thread-safe ItemReader. However, as your example above illustrates, if the delegate is a stateful ItemReader (one that implements ItemStream as well), its state will not be persisted correctly by calls to update (in fact, your example doesn't even take the ItemStream aspect of stateful readers into account).
If you want to make your job restartable with parallel processing of items, you can save each item the reader has read, plus that item's state, yourself.