Increasing parallelism level of scala .par operations - multithreading

When I call par on collections, it seems to create about 5-10 threads, which is fine for CPU-bound tasks.
But sometimes I have tasks which are IO-bound, in which case I'd like to have 500-1000 threads pulling from IO concurrently; with 10-15 threads everything is very slow and I see my CPUs mostly sitting idle.
How can I achieve this?

You could wrap your blocking IO operations in a blocking block:
import scala.concurrent.blocking

(0 to 1000).par.map { i =>
  blocking {
    Thread.sleep(100)
    Thread.activeCount()
  }
}.max // yields 67 on my PC, while without blocking it's 10
But you should ask yourself whether you should use parallel collections for IO operations at all; their use case is CPU-heavy tasks.
I would suggest considering futures for IO calls.
You should also consider using a custom execution context for that task, because the global execution context is a public singleton and you have no control over what code uses it and for what purpose. You could easily starve parallel computations created by external libraries if you used up all of its threads.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutor, Future, blocking}

// or just use scala.concurrent.ExecutionContext.Implicits.global if you don't care
implicit val blockingIoEc: ExecutionContextExecutor = ExecutionContext.fromExecutor(
  Executors.newCachedThreadPool()
)

def fetchData(index: Int): Future[Int] = Future {
  // if you use the global EC, the computation must be marked as blocking to grow the thread count;
  // with a custom cached thread pool the pool grows even without it
  blocking {
    Thread.sleep(100)
    Thread.activeCount()
  }
}
import scala.util.Success

val futures = (0 to 1000).map(fetchData)
Future.sequence(futures).onComplete {
  case Success(data) => println(data.max) // prints about 1000 on my PC
}
Thread.sleep(1000) // crude wait so the JVM doesn't exit before the callback runs
EDIT
You can also use a custom ForkJoinPool via ForkJoinTaskSupport:
import java.util.concurrent.ForkJoinPool // scala.concurrent.forkjoin.ForkJoinPool is deprecated
import scala.collection.parallel

val fjpool = new ForkJoinPool(2)
val customTaskSupport = new parallel.ForkJoinTaskSupport(fjpool)
val numbers = List(1, 2, 3, 4, 5).par
numbers.tasksupport = customTaskSupport // parallel operations on numbers now use fjpool
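With the task support assigned, subsequent parallel operations on numbers run on that two-thread pool. A quick way to observe it (a sketch; the exact thread names depend on the JVM and pool):

numbers.map { n =>
  println(Thread.currentThread.getName) // at most two distinct worker-thread names from fjpool
  n * 2
}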

Related

How can I solve the problem of script blocking?

I want to give users the ability to customize the behavior of game objects, but I found that Unity is actually single-threaded. If the user writes a script with looping statements in a game object, Unity's main thread will block, as if the game were stuck. How can I make the object's update function appear to execute on a separate thread?
[Image: de facto execution order]
[Image: the logical execution sequence I want to implement]
You can implement threading, but the Unity API is NOT thread-safe, so anything you do outside of the main thread cannot use the Unity API. This means that you can do a calculation in another thread and return a result to the main thread, but you cannot manipulate GameObjects from that thread.
You do have other options, though. For tasks which can take several frames, you can use a coroutine; this will let the method wait without halting the main thread. But it sounds like your best option is the C# Job System, which essentially lets you use multithreading while Unity manages the threads for you.
Example from the Unity Manual:
using Unity.Collections;
using Unity.Jobs;

// Job adding two floating point values together
public struct MyJob : IJob
{
    public float a;
    public float b;
    public NativeArray<float> result;

    public void Execute()
    {
        result[0] = a + b;
    }
}
// Create a native array of a single float to store the result. This example waits for the job to complete for illustration purposes
NativeArray<float> result = new NativeArray<float>(1, Allocator.TempJob);
// Set up the job data
MyJob jobData = new MyJob();
jobData.a = 10;
jobData.b = 10;
jobData.result = result;
// Schedule the job
JobHandle handle = jobData.Schedule();
// Wait for the job to complete
handle.Complete();
float aPlusB = result[0];
// Free the memory allocated by the result array
result.Dispose();

When to use more instances of the same Akka actor?

I've been working with Akka for some time, but now am exploring its actor system in depth. I know there is the thread pool executor, the fork-join executor and the affinity executor. I know how the dispatcher works and all the rest of the details. BTW, this link gives a great explanation:
https://scalac.io/improving-akka-dispatchers
However, when I experimented with a simple call actor and switched execution contexts, I always got roughly the same performance. I ran 60 requests simultaneously and the average execution time was around 800 ms just to return a simple string to the caller.
I'm running on a Mac with an 8-core Intel i7 processor.
So, here are the execution contexts I tried:
thread-poll {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 32
  }
  throughput = 10
}
fork-join {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-factor = 1
    parallelism-max = 8
  }
  throughput = 100
}
pinned {
  type = Dispatcher
  executor = "affinity-pool-executor"
}
So, the questions are:
Is there any chance to get better performance in this example?
What about actor instances? How do they matter, given that the dispatcher schedules a thread (using the execution context) to execute the actor's receive method on the next message from the actor's mailbox? Isn't the actor's receive method then just like a callback? When does the number of actor instances come into play?
I have some code which executes a Future, and if I run that code directly from the main file, it executes around 100-150 ms faster than when I put it in an actor and execute the Future from the actor, piping its result to the sender. What is making it slower?
If you have some real-world example explaining this, it is more than welcome. I've read some articles, but they were all theory. When I try things on a simple example, I get unexpected results in terms of performance.
Here is the code:
object RedisService {
  case class Get(key: String)
  case class GetSC(key: String)
}

class RedisService extends Actor {
  private val host = config.getString("redis.host")
  private val port = config.getInt("redis.port")

  var currentConnection = 0
  val redis = Redis()

  implicit val ec = context.system.dispatchers.lookup("redis.dispatchers.fork-join")

  override def receive: Receive = {
    case GetSC(key) => {
      val sen = sender()
      sen ! ""
    }
  }
}
Caller:
val as = ActorSystem("test")
implicit val ec = as.dispatchers.lookup("redis.dispatchers.fork-join")
val service = as.actorOf(Props(new RedisService()), "redis_service")

var sumTime = 0L
val futures: Seq[Future[Any]] = (0 until 4).flatMap { index =>
  terminalIds.map { terminalId =>
    val future = getRedisSymbolsAsyncSCActor(terminalId)
    val s = System.currentTimeMillis()
    future.onComplete {
      case Success(r) => {
        val duration = System.currentTimeMillis() - s
        logger.info(s"got redis symbols async in ${duration} ms: ${r}")
        sumTime = sumTime + duration
      }
      case Failure(ex) => logger.error(s"Failure on getting Redis symbols: ${ex.getMessage}", ex)
    }
    future
  }
}

val f = Future.sequence(futures)
f.onComplete {
  case Success(r) => logger.info(s"Mean time: ${sumTime / (4 * terminalIds.size)}")
  case Failure(ex) => logger.error(s"error: ${ex.getMessage}")
}
The code is pretty basic, just to test how it behaves.
It's a little unclear to me what you're specifically asking, but I'll take a stab.
If your dispatchers (and, when the actor's work is CPU/memory-bound rather than IO-bound, the actual number of cores available; note that this gets hazier the more virtualization (thank you, oversubscribed host CPU...) and containerization (thank you, share- and quota-based cgroup limits) come into play) allow m actors to be processing simultaneously, and you rarely or never have more than n actors with a message to handle, where m > n, then trying to increase parallelism via dispatcher settings won't gain you anything. (Note that in the foregoing, any task scheduled on the dispatchers, e.g. a Future callback, is effectively the same thing as an actor.)
The n in the previous paragraph is obviously at most the number of actors in the application or dispatcher, depending on what scope we want to look at. (I'll note that every dispatcher beyond two (one for actors and futures that don't block, and one for those that do) is a stronger smell. If you're on Akka 2.5, it's probably a decent idea to adapt some of the 2.6 changes around default dispatcher settings, and to run things like remoting/cluster in their own dispatcher so they don't get starved out; note also that Alpakka Kafka uses its own dispatcher by default. I wouldn't count those against the two.) So in general, more actors implies more parallelism, which implies more core utilization. Actors are comparatively cheap relative to threads, so a profusion of them isn't a huge matter for concern.
Singleton actors (whether at the node or cluster level, or even, in really extreme cases, the entity level) can do a lot to limit overall parallelism and throughput: the one-message-at-a-time restriction can be a very effective throttle (sometimes that's what you want, often it's not). So don't be afraid to create short-lived actors that do one high-level thing (they can definitely process more than one message) and then stop (note that many simple cases of this can be done in a slightly more lightweight way via futures). If they're interacting with some external service, having them be children of a router actor which spawns new children if the existing ones are all busy (etc.) is probably worth doing: this router is a singleton, but as long as it doesn't spend a lot of time processing any message, the chances of it throttling the system are low. Your RedisService might be a good candidate for this sort of thing; a sketch of the router idea follows below.
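For illustration, a minimal sketch of the router idea using Akka's built-in RoundRobinPool (a fixed-size pool rather than the spawn-on-demand variant described above; RedisWorker and its string protocol are invented for this example):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Hypothetical worker: each instance handles one message at a time,
// but the pool lets several lookups proceed in parallel.
class RedisWorker extends Actor {
  override def receive: Receive = {
    case key: String =>
      // ... fetch the value for `key` from Redis here ...
      sender() ! s"value-for-$key"
  }
}

val system = ActorSystem("example")
// The router itself does almost no work per message, so even though it is
// a singleton it is unlikely to throttle the system.
val router = system.actorOf(RoundRobinPool(8).props(Props[RedisWorker]), "redis-router")
router ! "some-key" // messages are distributed round-robin across the 8 workers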
Note also that performance and scalability aren't always one and the same, and improving one can diminish the other. Akka is often somewhat willing to trade performance in the small for reduced degradation in the large.

Reading values from a different thread

I'm writing software in Go that does a lot of parallel computing. I want to collect data from worker threads, and I'm not really sure how to do it in a safe way. I know that I could use channels, but in my scenario they make things more complicated, since I would have to somehow synchronize the messages (wait until every thread has sent something) in the main thread.
Scenario
The main thread creates n Worker instances and launches their work() method in a goroutine so that the workers each run in their own thread. Every 10 seconds the main thread should collect some simple values (e.g. iteration count) from the workers and print a consolidated statistic.
Question
Is it safe to read values from the workers? The main thread will only read values, and each individual thread will write its own values. It would be OK if the values were a few nanoseconds off when read.
Any other ideas on how to implement this in an easy way?
In Go, no value is safe for concurrent access from multiple goroutines without synchronization if at least one of the accesses is a write. Your case meets exactly those conditions, so you must use some kind of synchronization; otherwise the behavior is undefined.
Channels are used when goroutines want to send values to one another. Your case is not exactly that: you don't want your workers to send updates every 10 seconds, you want your main goroutine to fetch their status every 10 seconds.
So in this case I would just protect the data with a sync.RWMutex: when a worker wants to modify the data, it has to acquire a write lock; when the main goroutine wants to read the data, it has to acquire a read lock.
A simple implementation could look like this:
package main

import "sync"

type Worker struct {
    iterMu sync.RWMutex
    iter   int
}

// Iter returns the current iteration count; safe to call from any goroutine.
func (w *Worker) Iter() int {
    w.iterMu.RLock()
    defer w.iterMu.RUnlock()
    return w.iter
}

func (w *Worker) setIter(n int) {
    w.iterMu.Lock()
    w.iter = n
    w.iterMu.Unlock()
}

func (w *Worker) incIter() {
    w.iterMu.Lock()
    w.iter++
    w.iterMu.Unlock()
}
Using this example Worker, the main goroutine can fetch the iteration count using Worker.Iter(), and the worker itself can change / update it using Worker.setIter() or Worker.incIter() at any time, without any additional synchronization. The synchronization is ensured by the proper use of Worker.iterMu.
Alternatively, for the iteration counter you could also use the sync/atomic package. If you choose this, you may only read / modify the iteration counter using functions of the atomic package, like this:
package main

import "sync/atomic"

type Worker struct {
    iter int64
}

func (w *Worker) Iter() int64 {
    return atomic.LoadInt64(&w.iter)
}

func (w *Worker) setIter(n int64) {
    atomic.StoreInt64(&w.iter, n)
}

func (w *Worker) incIter() {
    atomic.AddInt64(&w.iter, 1)
}

State Monad with multiple threads

Let's say I have a stateful client, something like:
trait Client {
  case class S(connection: Connection, failures: Int) // Connection takes 10 seconds to create
  def fetchData: State[S, Data] // Takes 10 seconds to fetch
}
I want to make it stateful because the creation of the connection is expensive (so I want to cache it), plus I keep a failure count to signal when there are too many failures in a row, so that I can recreate the connection.
What I've understood about the State monad is that it should effectively be computed on one thread, chaining immutable states sequentially on that thread. I can't afford that in this case, because the fetch operation takes a huge amount of time, while all I need is to quickly read the state and use the connection from there to start the expensive async call. On the other hand, I can't make it State[S, Task[Data]], because I need to modify failures in S if the fetchData task fails. So I modified the client to be:
import fs2.Task

trait Client {
  case class S(connection: Connection, failures: Int) // Connection takes 10 seconds to create
  def fetchData: StateT[Task, S, Data] // Takes 10 seconds to fetch
}
The thing is, I can come up with some Monoid that can combine the states regardless of what the initial state was. E.g. if I have S1 -> (S2, Data) and S1 -> (S3, Data), I could still combine S2 and S3 to arrive at the final state, regardless of which came before (S2 or S3); a sketch follows below.
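For illustration only, a minimal sketch of what such an instance might look like with scalaz (the combination rule here, keeping the newer connection and summing the failure counts, is an assumption; the real rule depends on the domain, and S must be in scope):

import scalaz.Monoid

// Hypothetical combination rule: take the right-hand (newer) connection and
// sum the failure counts. Whether this is lawful for S is something to check
// against the actual semantics.
implicit val sMonoid: Monoid[S] = new Monoid[S] {
  def zero: S = S(connection = null, failures = 0) // placeholder "empty" state
  def append(a: S, b: => S): S = S(b.connection, a.failures + b.failures)
}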
Now I have a kind of Java server (Thrift) that uses the handler. I haven't dug into the details of how it works precisely yet, but let's assume that it listens for incoming connections and then spawns a thread in which it handles the data retrieval. So in my case I want to block that thread until the Task is finished:
import scala.concurrent.Await
import scala.concurrent.duration.Duration

class Server(client: Client) {
  def handle: Data =
    run(client.fetchData)

  private var state: S = Monoid[S].mempty

  private def run(st: StateT[Task, S, Data]): Data = {
    val resTask: Task[(S, Data)] = st.run(state)
    val (newState, res) = Await.result(resTask.unsafeRunAsyncFuture, Duration.Inf)
    state.synchronized {
      state = Monoid[S].mappend(state, newState)
    }
    res
  }
}
My questions are:
1) Am I doing it the right / best way here (my focus is efficiency / no bugs)?
2) Do I really need the State monad here, or is it better to use an ordinary var and work with that?
3) In the examples of the State monad I always saw the state propagated to the top level of the app and handled there. But maybe it's better to handle it at the Client level and expose a stateless interface to the app?
4) Is this code thread-safe, or should I make state some sort of synchronized var? I know that in Objective-C the line state = Monoid[S].mappend(state, newState) could crash if the state was read from another thread in the middle of the assignment.

Limit number of threads in Groovy

How can I limit number of threads that are being executed at the same time?
Here is sample of my algorithm:
for (i = 0; i < 100000; i++) {
    Thread.start {
        // Do some work
    }
}
I would like to make sure that once the number of threads in my application hits 100, the algorithm will pause/wait until the number of threads in the app goes below 100.
Currently "some work" takes some time to do, and I end up with a few thousand threads in my app. Eventually it runs out of threads and "some work" crashes. I would like to fix this by limiting the number of threads that can run at one time.
Please let me know how to solve my issue.
I believe you are looking for a ThreadPoolExecutor from the Java Concurrency API. The idea here is that you define a maximum number of threads in a pool and then, instead of starting new Threads with a Runnable, let the ThreadPoolExecutor take care of managing the upper limit of Threads.
Start here: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
import java.util.concurrent.*
import java.util.*

def queue = new ArrayBlockingQueue<Runnable>(50000)
def tPool = new ThreadPoolExecutor(5, 500, 20, TimeUnit.SECONDS, queue)

for (i = 0; i < 5000; i++) {
    tPool.execute {
        println "Blah"
    }
}
Parameters for the ThreadPoolExecutor constructor: corePoolSize (5) is the number of threads to create and maintain if the system is idle; maxPoolSize (500) is the maximum number of threads to create; the 3rd and 4th arguments say that the pool should keep idle threads around for at least 20 seconds; and the queue argument is a blocking queue that stores queued tasks.
What you'll want to play around with is the queue size and also how to handle rejected tasks. If you need to execute 100k tasks, you'll either need a queue that can hold 100k tasks, or a strategy for handling rejected tasks; one such strategy is sketched below.
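For example (a sketch in Scala against the same java.util.concurrent API; the pool and queue sizes are arbitrary), CallerRunsPolicy makes the submitting thread execute a task itself whenever the queue is full, which throttles submission instead of throwing RejectedExecutionException:

import java.util.concurrent._

// Deliberately small queue: once it fills up, CallerRunsPolicy makes the
// submitting thread run the task itself, naturally slowing submission.
val queue = new ArrayBlockingQueue[Runnable](1000)
val pool = new ThreadPoolExecutor(
  5, 100, 20, TimeUnit.SECONDS, queue,
  new ThreadPoolExecutor.CallerRunsPolicy()
)

(0 until 100000).foreach { i =>
  pool.execute(new Runnable {
    def run(): Unit = println(s"task $i")
  })
}
pool.shutdown()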
