How to use IObservable/IObserver with ConcurrentQueue or ConcurrentStack - c#-4.0

I realized that when I try to process items in a concurrent queue with multiple threads while other threads can still be adding items to it, the ideal solution would be to use the Reactive Extensions together with the concurrent data structures.
My original question is at:
While using ConcurrentQueue, trying to dequeue while looping through in parallel
So I am curious if there is any way to have a LINQ (or PLINQ) query that will continuously be dequeuing as items are put into it.
I am trying to get this to work in a way where I can have n producers pushing into the queue and a limited number of threads to process them, so I don't overload the database.
If I could use the Rx framework then I expect that I could just start it, and if 100 items are placed in the queue within 100 ms, then the 20 threads that are part of the PLINQ query would just process through the queue.
There are three technologies I am trying to work together:
Rx Framework (Reactive LINQ)
PLINQ
System.Collections.Concurrent structures

Drew is right, I think - even though ConcurrentQueue sounds perfect for the job, it is actually the underlying data structure that BlockingCollection uses by default. Seems very back to front to me too.
Check out chapter 7 of this book*
http://www.amazon.co.uk/Parallel-Programming-Microsoft-NET-Decomposition/dp/0735651590/ref=sr_1_1?ie=UTF8&qid=1294319704&sr=8-1
and it will explain how to use the BlockingCollection and have multiple producers and multiple consumers each taking off the "queue". You will want to look at the "GetConsumingEnumerable()" method and possibly just call .ToObservable() on that.
*the rest of the book is pretty average.
edit:
Here is a sample program that I think does what you want:
using System;
using System.Collections.Concurrent;
using System.Linq;        // ToObservable lived in System.Linq in early Rx releases (System.Reactive.Linq later)
using System.Concurrency; // Scheduler.TaskPool in early Rx releases (System.Reactive.Concurrency later)
using System.Threading;

class Program
{
    private static ManualResetEvent _mre = new ManualResetEvent(false);

    static void Main(string[] args)
    {
        var theQueue = new BlockingCollection<string>();

        theQueue.GetConsumingEnumerable()
                .ToObservable(Scheduler.TaskPool)
                .Subscribe(x => ProcessNewValue(x, "Consumer 1", 10000000));

        theQueue.GetConsumingEnumerable()
                .ToObservable(Scheduler.TaskPool)
                .Subscribe(x => ProcessNewValue(x, "Consumer 2", 50000000));

        theQueue.GetConsumingEnumerable()
                .ToObservable(Scheduler.TaskPool)
                .Subscribe(x => ProcessNewValue(x, "Consumer 3", 30000000));

        LoadQueue(theQueue, "Producer A");
        LoadQueue(theQueue, "Producer B");
        LoadQueue(theQueue, "Producer C");

        _mre.Set();
        Console.WriteLine("Processing now....");
        Console.ReadLine();
    }

    private static void ProcessNewValue(string value, string consumerName, int delay)
    {
        Thread.SpinWait(delay);
        Console.WriteLine("{1} consuming {0}", value, consumerName);
    }

    private static void LoadQueue(BlockingCollection<string> target, string prefix)
    {
        var thread = new Thread(() =>
        {
            // wait for the starting gun so all producers add at the same time
            _mre.WaitOne();
            for (int i = 0; i < 100; i++)
            {
                target.Add(string.Format("{0} {1}", prefix, i));
            }
        });
        thread.Start();
    }
}

I don't know how best to accomplish this with Rx, but I would recommend just using BlockingCollection<T> and the producer-consumer pattern. Your main thread adds items into the collection, which uses ConcurrentQueue<T> underneath by default. Then you spin up a separate Task, ahead of that, which uses Parallel::ForEach over the BlockingCollection<T> to process as many items from the collection concurrently as makes sense for the system. You will probably also want to look into the GetConsumingPartitioner method of the ParallelExtensions library in order to be most efficient, since the default partitioner will create more overhead than you want in this case. You can read more about this from this blog post.
When the main thread is finished, you call CompleteAdding on the BlockingCollection<T> and Task::Wait on the Task you spun up, to wait for all the consumers to finish processing all the items in the collection.
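As a rough sketch of that shape (not Drew's exact code - the 20-thread cap and the ProcessItem stub are assumptions, and GetConsumingPartitioner is the ParallelExtensionsExtras method mentioned above):
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        var queue = new BlockingCollection<int>();

        // Consumer: drain the collection with a bounded number of workers.
        // Parallel.ForEach over GetConsumingEnumerable() works, but the default
        // chunking partitioner buffers items; GetConsumingPartitioner() avoids that.
        var consumer = Task.Factory.StartNew(() =>
            Parallel.ForEach(
                queue.GetConsumingEnumerable(),
                new ParallelOptions { MaxDegreeOfParallelism = 20 },
                item => ProcessItem(item)));

        // Producer(s): here just the main thread.
        for (int i = 0; i < 100; i++)
            queue.Add(i);

        queue.CompleteAdding(); // no more items will arrive
        consumer.Wait();        // block until the consumers drain the queue
    }

    static void ProcessItem(int item)
    {
        Console.WriteLine("processed {0}", item); // stand-in for the real DB write
    }
}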

Related

Coordinating emission and subscription in Kotlin coroutines with hot flows

I am trying to design an observable task-like entity which would have the following properties:
Reports its current state changes reactively
Shares state and result events: new subscribers will also be notified if the change happens after they've subscribed
Has a lifecycle (backed by CoroutineScope)
Doesn't have suspend functions in the interface (because it has a lifecycle)
The very basic code is something like this:
class Worker {
    enum class State { Running, Idle }

    private val state = MutableStateFlow(State.Idle)
    private val results = MutableSharedFlow<String>()
    private val scope = CoroutineScope(Dispatchers.Default)

    private suspend fun doWork(): String {
        println("doing work")
        return "Result of the work"
    }

    fun start() {
        scope.launch {
            state.value = State.Running
            results.emit(doWork())
            state.value = State.Idle
        }
    }

    fun state(): Flow<State> = state
    fun results(): Flow<String> = results
}
The problems with this arise when I want to "start the work after I'm subscribed". There's no clear way to do that. The simplest thing doesn't work (understandably):
fun main() {
    runBlocking {
        val worker = Worker()
        // subscriber 1
        launch {
            worker.results().collect { println("received result $it") }
        }
        worker.start()
        // subscriber 2 can also be created "later" and watch
        // for state()/result() changes
    }
}
This prints only "doing work" and never prints a result. I understand why this happens (because collect and start run in separate coroutines, not synchronized in any way).
Adding a delay(300) to the coroutine inside doWork "fixes" things and results are printed, but I'd like this to work without artificial delays.
Another "solution" is to create a SharedFlow from results() and use its onSubscription to call start(), but that didn't work either the last time I tried.
My questions are:
Can this be turned into something that works or is this design initially flawed?
If it is flawed, can I take some other approach that would still hit all the goals I specified at the beginning of the post?
Your problem is that your SharedFlow has no buffer set up, so it is emitting results to its (initially zero) current collectors and immediately forgetting them. The MutableSharedFlow() function has a replay parameter you can use to determine how many previous results it should store and replay to new collectors. You will need to decide what replay amount to use based on your use case for this class. For simply displaying latest results in a UI, a common choice is a replay of 1.
Depending on your use case, you may want to give your CoroutineScope a SupervisorJob() in its context so it isn't destroyed by any child job failing.
Side note, your state() and results() functions should be properties by Kotlin convention, since they do nothing but return references. Personally, I would also have them return read-only StateFlow/SharedFlow instead of just Flow to clarify that they are not cold.
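Putting those points together, a minimal sketch (the replay depth of 1 and the SupervisorJob are assumptions to adjust for your use case):
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

class Worker {
    enum class State { Running, Idle }

    private val _state = MutableStateFlow(State.Idle)
    private val _results = MutableSharedFlow<String>(replay = 1) // late subscribers get the last result
    private val scope = CoroutineScope(Dispatchers.Default + SupervisorJob())

    private suspend fun doWork(): String = "Result of the work"

    fun start() {
        scope.launch {
            _state.value = State.Running
            _results.emit(doWork())
            _state.value = State.Idle
        }
    }

    // read-only views, exposed as properties per the convention above
    val state: StateFlow<State> = _state
    val results: SharedFlow<String> = _results
}
With the replay buffer, the main() from the question receives the result even when start() wins the race against collect.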

In Kotlin Native, how to keep an object around in a separate thread, and mutate its state from any other thread without using C pointers?

I'm exploring Kotlin Native and have a program with a bunch of Workers doing concurrent stuff
(running on Windows, but this is a general question).
Now, I wanted to add simple logging: a component that simply logs strings by appending them as new lines to a file that is kept open in 'append' mode.
(Ideally, I'd just have a "global" function...
fun log(text: String) {...}
...that I would be able to call from anywhere, including from "inside" other workers, and that would just work. The implication here is that it's not trivial to do this because of Kotlin Native's rules regarding passing objects between threads (TLDR: you shouldn't pass mutable objects around. See: https://github.com/JetBrains/kotlin-native/blob/master/CONCURRENCY.md#object-transfer-and-freezing ).
Also, my log function would ideally accept any frozen object.)
What I've come up with are solutions using DetachedObjectGraph:
First, I create a detached logger object
val loggerGraph = DetachedObjectGraph { FileLogger("/foo/mylogfile.txt")}
and then use loggerGraph.asCPointer() to get a COpaquePointer to the detached graph:
val myPointer = loggerGraph.asCPointer()
Now I can pass this pointer into the workers (via the producer lambda of the Worker's execute function) and use it there. Or I can store the pointer in a @ThreadLocal global var.
For the code that writes to the file, whenever I want to log a line, I have to create a DetachedObjectGraph object from the pointer again,
and attach() it in order to get a reference to my fileLogger object:
val fileLogger = DetachedObjectGraph(myPointer).attach()
Now I can call a log function on the logger:
fileLogger.log("My log message")
This is what I've come up with looking at the APIs that are available (as of Kotlin 1.3.61) for concurrency in Kotlin Native,
but I'm left wondering what a better approach would be ( using Kotlin, not resorting to C ). Clearly it's bad to create a DetachedObjectGraph object for every line written.
One could pose this question in a more general way: How to keep a mutable resource open in a separate thread ( or worker ), and send messages to it.
Side comment: Having Coroutines that truly use threads would solve this problem, but the question is about how to solve this task with the APIs currently ( Kotlin 1.3.61 ) available.
You definitely shouldn't use DetachedObjectGraph in the way presented in the question. There's nothing to prevent you from trying to attach on multiple threads, or, if you pass the same pointer, trying to attach to an invalid one after another thread has attached to it.
As Dominic mentioned, you can keep the DetachedObjectGraph in an AtomicReference. However, if you're going to keep DetachedObjectGraph in an AtomicReference, make sure the type is AtomicRef<DetachedObjectGraph?> and busy-loop while the DetachedObjectGraph is null. That will prevent the same DetachedObjectGraph from being used by multiple threads. Make sure to set it to null, and repopulate it, in an atomic way.
However, does FileLogger need to be mutable at all? If you're writing to a file, it doesn't seem so. Even if so, I'd isolate the mutable object to a separate worker and send log messages to it rather than doing a DetachedObjectGraph inside an AtomicRef.
In my experience, DetachedObjectGraph is super uncommon in production code. We don't use it anywhere at the moment.
To isolate mutable state to a Worker, something like this:
import kotlin.native.concurrent.*
import kotlinx.cinterop.StableRef

class MutableThing<T : Any>(private val worker: Worker = Worker.start(), producer: () -> T) {
    private val arStable = AtomicReference<StableRef<T>?>(null)

    init {
        worker.execute(TransferMode.SAFE, { Pair(arStable, producer).freeze() }) {
            it.first.value = StableRef.create(it.second()).freeze()
        }
    }

    fun <R> access(block: (T) -> R): R {
        return worker.execute(TransferMode.SAFE, { Pair(arStable, block).freeze() }) {
            it.second(it.first.value!!.get())
        }.result
    }
}

object Log {
    private val fileLogger = MutableThing { FileLogger() }

    fun log(s: String) {
        fileLogger.access { fl -> fl.log(s) }
    }
}

class FileLogger {
    fun log(s: String) {}
}
The MutableThing uses StableRef internally. producer makes the mutable state you want to isolate. To log something, call Log.log, which will wind up calling the mutable FileLogger.
To see a basic example of MutableThing, run the following test:
@Test
fun goIso() {
    val mt = MutableThing { mutableListOf("a", "b") }
    val workers = Array(4) { Worker.start() }
    val futures = mutableListOf<Future<*>>()

    repeat(1000) { rcount ->
        val future = workers[rcount % workers.size].execute(
            TransferMode.SAFE,
            { Pair(mt, rcount).freeze() }
        ) { pair ->
            pair.first.access {
                val element = "ttt ${pair.second}"
                println(element)
                it.add(element)
            }
        }
        futures.add(future)
    }

    futures.forEach { it.result }
    workers.forEach { it.requestTermination() }

    mt.access {
        println("size: ${it.size}")
    }
}
The approach you've taken is pretty much correct and the way it's supposed to be done.
The thing I would add is that, instead of passing a pointer around, you should pass around a frozen FileLogger which internally holds a reference to an AtomicRef<DetachedObjectGraph>; the attaching and detaching should be done internally, especially since a DetachedObjectGraph is invalid once attached.

Modify object from multiple async streams in Dart

Imagine we had an object like this
class Foo {
  List<int> data = [];

  void addAndCheck(int n) {
    for (int number in data) {
      // check something
    }
    data.add(n);
  }
}
and then imagine we spawn a bunch of subscriptions like this
Foo foo = Foo();
for (int i = 0; i < 10; i++) {
  subscriptions.add(api.someRandomStream().listen((response) {
    foo.addAndCheck(response.value);
  }));
}
As it stands, if this code is run it might work, but as soon as the streams start emitting at around the same time we get an exception: Concurrent modification during iteration.
The cause is the for loop, but how can this problem be solved? In a language like Java there are things like ConcurrentHashMap, Collections.synchronizedList(...), etc.
If you get a concurrent modification error during the iteration, then you are doing something asynchronous inside the loop. That is, your function is probably async and there is at least one await inside the loop. That will allow another event to trigger while you are awaiting, and then modify the list.
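For illustration, a sketch of that failure mode inside Foo (checkSomething is a hypothetical asynchronous check, not from the question):
Future<void> addAndCheck(int n) async {
  for (final number in data) {
    // suspension point: the event loop can run other events here,
    // and one of them may call addAndCheck and modify `data` mid-iteration
    await checkSomething(number);
  }
  data.add(n);
}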
There are several ways to avoid the exception, all with different trade-offs:
Don't do anything asynchronous in the loop, and make sure that nothing you do in there will call addAndCheck again. Then there should be no problem because the loop will complete before anyone else has a chance to modify the list. That obviously only works if you don't need to do something asynchronous.
Copy the list. If you do for(int number in [...data]) { ... } (or in data.toList() as it used to be written), then the list that you iterate is a different list than the one which is modified. It also means that you might not have checked all the elements that are actually in the list at the point you reach the add call.
Don't use an iterator. If you do for (int i = 0; i < data.length; i++) { var number = data[i]; ... } instead, you will not get a concurrent modification error from the iterator. If elements are added at the end of the list, then you will eventually reach them, and all is well. If elements are removed from the list, or added in any place other than at the end, then you might be skipping elements or seeing some of them twice, which may be bad for you.
Use a mutex. If you want to be sure that all the tests on existing elements are performed before any other element is added, then you need to prevent anything from happening while you are adding. Assume a Mutex class of some sort, which would allow you to write code like:
class Foo {
  List<int> data = [];
  final _mutex = Mutex();

  void addAndCheck(int n) async {
    await _mutex.acquire();
    for (int number in data) {
      // check something
    }
    data.add(n);
    _mutex.release();
  }
}
(I found package:mutex by searching, I have no experience with it).
This might slow down your code, though, making every operation wait for the previous one to complete entirely.
In the end, only you can say which trade-off is best for the behavior of your code.
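For reference, a minimal sketch of option 2 (iterating over a copy), which needs no extra package:
class Foo {
  List<int> data = [];

  void addAndCheck(int n) {
    // iterate over a snapshot, so concurrent adds can't invalidate this iterator
    for (final number in [...data]) {
      // check something
    }
    data.add(n);
  }
}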

Simple parallelization with shared non-threadsafe resource in Scala

I frequently want to parallelize a task that relies on a non-threadsafe shared resource. Consider the following non-threadsafe class. I want to do a map over a data: Vector[String].
class Processor { def apply: String => String }
Basically, I want to create n threads, each with an instance of Processor and a partition of the data. Scala parallel collections have spoiled me into thinking the parallelization should be dirt simple. However, they don't seem well suited for this problem. Yes, I can use actors but Scala actors might become deprecated and Akka seems like overkill.
The first thing that comes to mind is to have a synchronized map Thread -> Processor and then use parallel collections, looking up my Processor in this thread-safe map. Is there a better way?
Instead of building your own synchronized map, you can use ThreadLocal. That will guarantee a unique Processor per thread.
val processors = new ThreadLocal[Processor] {
  override def initialValue() = new Processor
}

data.par.map(x => processors.get.apply(x))
Alternatively, you can try using an executor service configured to use a specified number of threads explicitly:
import java.util.concurrent.{Executors, TimeUnit}

val processors = new ThreadLocal[Processor] {
  override def initialValue() = new Processor
}

val N = 4

// create an executor with a fixed number of threads
val execSvc = Executors.newFixedThreadPool(N)

// create the tasks
data foreach { loopData =>
  execSvc.submit(new Runnable() {
    def run = processors.get().apply(loopData)
  })
}

// await termination
execSvc.shutdown()
while (!execSvc.awaitTermination(1, TimeUnit.SECONDS)) {}
// processing complete!

.net 4.0 Tasks: Synchronize on one or more objects

I have read a lot about the new Task functionality in .net 4.0, but I haven't found a solution for the following problem:
I am writing a server application that processes requests from many users and I want to use Tasks to distribute these requests across multiple cores. However, these Tasks should be synchronized on objects - for the beginning, users -, so that just one task is processed for each object at a time. This would be simple to achieve with Task.ContinueWith(), but it should also be possible to synchronize a task on multiple objects (e.g. when a user transfers money to another user, a variable should be decremented at user A and incremented at user B without other tasks interfering).
So, my first attempt is a class that receives delegates, creates tasks and stores them in a dictionary with the objects to sync on as keys. If a new task is scheduled, it can be appended to the last task of the given object with Task.ContinueWith(). If it should be synchronized on multiple objects, the new Task is created using TaskFactory.ContinueWhenAll(). The created task is stored in the dictionary for every object it is synchronized on.
Here is my first draft:
public class ActionScheduler : IActionScheduler
{
    private readonly IDictionary<object, Task> mSchedulingDictionary = new Dictionary<object, Task>();
    private readonly TaskFactory mTaskFactory = new TaskFactory();

    /// <summary>
    /// Schedules actions synchronized on one or more objects. Only one action will be processed for each object at any time.
    /// </summary>
    /// <param name="synchronisationObjects">Array of objects the current action is synchronized on</param>
    /// <param name="action">The action that will be scheduled and processed</param>
    public void ScheduleTask(object[] synchronisationObjects, Action action)
    {
        // lock the dictionary in case two actions are scheduled on the same object at the same time;
        // this is necessary since reading and writing to a dictionary cannot be done in an atomic manner
        lock (mSchedulingDictionary)
        {
            // get all current tasks for the given synchronisation objects
            var oldTaskList = new List<Task>();
            foreach (var syncObject in synchronisationObjects)
            {
                Task task;
                mSchedulingDictionary.TryGetValue(syncObject, out task);
                if (task != null)
                    oldTaskList.Add(task);
            }

            // create a new task for the given action
            Task newTask;
            if (oldTaskList.Count > 1)
            {
                // task depends on multiple previous tasks
                newTask = mTaskFactory.ContinueWhenAll(oldTaskList.ToArray(), t => action());
            }
            else if (oldTaskList.Count == 1)
            {
                // task depends on exactly one previous task
                newTask = oldTaskList[0].ContinueWith(t => action());
            }
            else
            {
                // task does not depend on any previous task and can be started immediately
                newTask = new Task(action);
                newTask.Start();
            }

            // store the task in the dictionary
            foreach (var syncObject in synchronisationObjects)
            {
                mSchedulingDictionary[syncObject] = newTask;
            }
        }
    }
}
This even works if a task "multiSyncTask" was created for multiple objects, and afterwards tasks for each of the objects are scheduled. Since they are all created with multiSyncTask.ContinueWith(), they start simultaneously once it completes:
static void Main()
{
    IActionScheduler actionScheduler = new ActionScheduler();

    var syncObj1 = new object();
    var syncObj2 = new object();

    // these two start and complete simultaneously:
    actionScheduler.ScheduleTask(new[] { syncObj1 }, () => PrintTextAfterWait("1"));
    actionScheduler.ScheduleTask(new[] { syncObj2 }, () => PrintTextAfterWait("2"));

    // this task starts after the first two and "locks" both objects:
    actionScheduler.ScheduleTask(new[] { syncObj1, syncObj2 }, () => PrintTextAfterWait("1 and 2"));

    // these two - again - start and complete simultaneously after the task above:
    actionScheduler.ScheduleTask(new[] { syncObj1 }, () => PrintTextAfterWait("1"));
    actionScheduler.ScheduleTask(new[] { syncObj2 }, () => PrintTextAfterWait("2"));
}

static void PrintTextAfterWait(string text)
{
    Thread.Sleep(3000);
    Console.WriteLine(text);
}
What do you think - is this a good solution for my problem? I am a bit sceptical about the big lock on the dictionary, but it is necessary to prevent race conditions in case two tasks are scheduled on the same object at once. Of course, the dictionary is only locked for the time it takes to create a task, not while the task is processed.
Also, I would love to know if there are any existing solutions or coding paradigms that solve my problem better using .NET 4.0 Tasks and that I have failed to track down.
Thank you and with best regards,
Johannes
If I got you right, you would like to have something like a Task.ContinueWith(task1, task2, lambda)?
Something like the Join arbiter in CCR?
http://msdn.microsoft.com/en-us/library/bb648749.aspx
If so, probably the most elegant option is to use the JoinBlock in TPL dataflow (http://www.microsoft.com/download/en/confirmation.aspx?id=14782).
Or, maybe, have you tried using Task.WaitAll() as the first instruction of your dependent task?
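A minimal sketch of that last suggestion (the tasks and console writes are stand-ins for the per-user work above):
using System;
using System.Threading.Tasks;

class WaitAllSketch
{
    static void Main()
    {
        // per-object tasks, stand-ins for the tasks your scheduler creates
        var taskA = Task.Factory.StartNew(() => Console.WriteLine("work on user A"));
        var taskB = Task.Factory.StartNew(() => Console.WriteLine("work on user B"));

        // the dependent task waits for both prerequisites as its first instruction;
        // note this blocks a thread-pool thread while waiting, which ContinueWhenAll avoids
        var transfer = Task.Factory.StartNew(() =>
        {
            Task.WaitAll(taskA, taskB);
            Console.WriteLine("transfer between A and B");
        });

        transfer.Wait();
    }
}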
