Let's say I have a stateful client, something like:
trait Client {
  case class S(connection: Connection, failures: Int) // Connection takes 10 seconds to create
  def fetchData: State[S, Data]                        // Takes 10 seconds to fetch
}
I want to make it stateful because creating the connection is expensive (so I want to cache it), and I also keep a failure count so that if there are too many failures in a row I can recreate the connection.
What I've understood about the State monad is that it is effectively evaluated on one thread, chaining immutable states sequentially on that thread. I can't afford that here, because the fetch operation takes a huge amount of time, while all I need is to quickly read the state and use the connection from it to start the expensive async call. On the other hand, I can't make it State[S, Task[Data]], because I need to modify failures in S if the fetchData task fails. So I modified the client to be:
import fs2.Task
import cats.data.StateT

trait Client {
  case class S(connection: Connection, failures: Int) // Connection takes 10 seconds to create
  def fetchData: StateT[Task, S, Data]                 // Takes 10 seconds to fetch
}
The thing is, I can come up with a Monoid that can combine states regardless of the initial state. E.g. if I have S1 -> (S2, Data) and S1 -> (S3, Data), I can still combine S2 and S3 to arrive at the final state, regardless of which of S2 or S3 came first.
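For illustration, here is a minimal sketch of what such a Monoid could look like (assuming cats, with S lifted out of the trait, and a made-up combination rule that sums failures and prefers the newer connection; the real rule would depend on the domain):

import cats.Monoid

// Hypothetical Monoid[S]: sum the failure counts, keep the most recent non-null connection.
implicit val sMonoid: Monoid[S] = new Monoid[S] {
  def empty: S = S(connection = null, failures = 0) // placeholder "no connection yet"
  def combine(a: S, b: S): S = S(
    connection = Option(b.connection).getOrElse(a.connection), // prefer the newer connection
    failures   = a.failures + b.failures                       // accumulate failures
  )
}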
Now I have a kind of Java server (Thrift) that uses the handler. I haven't dug into the details of how it works precisely yet, but let's assume that it listens for incoming connections and then spawns a thread on which it handles the data retrieval. So in my case I want to block that thread until the Task is finished:
import cats.Monoid
import cats.data.StateT
import scala.concurrent.Await
import scala.concurrent.duration.Duration

class Server(client: Client) {
  def handle: Data =
    run(client.fetchData)

  private var state: S = Monoid[S].empty

  private def run(st: StateT[Task, S, Data]): Data = {
    val resTask: Task[(S, Data)] = st.run(state)
    val (newState, res) = Await.result(resTask.unsafeRunAsyncFuture, Duration.Inf)
    state.synchronized {
      state = Monoid[S].combine(state, newState)
    }
    res
  }
}
My questions are:
1) Am I doing this the right / best way here (my focus is efficiency and absence of bugs)?
2) Do I really need the State monad here, or is it better to use an ordinary var and work with that?
3) In the examples of the State monad I have seen, the state is always propagated to the top level of the app and handled there. But maybe it's better to handle it at the Client level and expose a stateless interface to the app?
4) Is this code thread safe, or should I make state some sort of synchronized var (e.g. something like the sketch below)? I know that in Objective-C the line state = Monoid[S].combine(state, newState) could crash if the state were read from another thread in the middle of the assignment.
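By "synchronized var" I mean something along these lines (just a sketch of the idea, assuming S and its Monoid are in scope; I don't know if this is the right approach):

import java.util.concurrent.atomic.AtomicReference
import cats.Monoid

// Sketch: an atomic state holder instead of `private var state`.
class StateHolder(implicit M: Monoid[S]) {
  private val ref = new AtomicReference[S](M.empty)
  def current: S = ref.get()
  def append(newState: S): S =
    ref.updateAndGet(cur => M.combine(cur, newState)) // lock-free read-modify-write
}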
Related
I've been working with Akka for some time, but now I am exploring its actor system in depth. I know there is a thread pool executor, a fork join executor and an affinity executor. I know how the dispatcher works and all the rest of the details. BTW, this link gives a great explanation:
https://scalac.io/improving-akka-dispatchers
However, when I experimented with a simple actor call and switched execution contexts, I always got roughly the same performance. I ran 60 requests simultaneously and the average execution time was around 800 ms just to return a simple string to the caller.
I'm running on a Mac with an 8-core Intel i7 processor.
So, here are the execution contexts I tried:
thread-pool {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 32
  }
  throughput = 10
}
fork-join {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-factor = 1
    parallelism-max = 8
  }
  throughput = 100
}
pinned {
  type = Dispatcher
  executor = "affinity-pool-executor"
}
So, my questions are:
Is there any chance to get better performance in this example?
What about the actor instances? How do they matter, if we know that the dispatcher schedules a thread (using the execution context) to execute the actor's receive method on the next message from the actor's mailbox? Isn't the actor's receive method then just like a callback? When does the number of actor instances come into play?
I have some code which executes a Future, and if I run that code directly from the main file, it executes around 100-150 ms faster than when I put it in an actor and execute the Future from the actor, piping its result to the sender. What is making it slower?
If you have a real-world example that explains this, it is more than welcome. I have read some articles, but they were all theory. When I try something on a simple example, I get unexpected results in terms of performance.
Here is the code:
object RedisService {
  case class Get(key: String)
  case class GetSC(key: String)
}

class RedisService extends Actor {
  private val host = config.getString("redis.host")
  private val port = config.getInt("redis.port")

  var currentConnection = 0
  val redis = Redis()

  implicit val ec = context.system.dispatchers.lookup("redis.dispatchers.fork-join")

  override def receive: Receive = {
    case GetSC(key) => {
      val sen = sender()
      sen ! ""
    }
  }
}
Caller:
val as = ActorSystem("test")
implicit val ec = as.dispatchers.lookup("redis.dispatchers.fork-join")
val service = as.actorOf(Props(new RedisService()), "redis_service")

var sumTime = 0L
val futures: Seq[Future[Any]] = (0 until 4).flatMap { index =>
  terminalIds.map { terminalId =>
    val future = getRedisSymbolsAsyncSCActor(terminalId)
    val s = System.currentTimeMillis()
    future.onComplete {
      case Success(r) => {
        val duration = System.currentTimeMillis() - s
        logger.info(s"got redis symbols async in ${duration} ms: ${r}")
        sumTime = sumTime + duration
      }
      case Failure(ex) => logger.error(s"Failure on getting Redis symbols: ${ex.getMessage}", ex)
    }
    future
  }
}

val f = Future.sequence(futures)
f.onComplete {
  case Success(r) => logger.info(s"Mean time: ${sumTime / (4 * terminalIds.size)}")
  case Failure(ex) => logger.error(s"error: ${ex.getMessage}")
}
The code is pretty basic, just to test how it behaves.
It's a little unclear to me what you're specifically asking, but I'll take a stab.
If your dispatcher(s) allow m actors to be processing simultaneously and you rarely or never have more than n actors with a message to handle (with m > n), trying to increase parallelism via dispatcher settings won't gain you anything. (The effective limit is also bounded by the actual number of cores available when the work is CPU/memory-bound rather than IO-bound, and that gets hazier the more virtualization (thank you, oversubscribed host CPUs...) and containerization (thank you, share- and quota-based cgroup limits) come into play.) Note that in the foregoing, any task scheduled on the dispatcher(s), e.g. a Future callback, is effectively the same thing as an actor.
The n in the previous paragraph is obviously at most the number of actors in the application/dispatcher (depending on what scope we want to look at), so in general more actors implies more parallelism implies more core utilization. I'll note that every dispatcher beyond two (one for actors and futures that don't block and one for those that do) is a stronger smell (if you're on Akka 2.5, it's probably a decent idea to adopt some of the 2.6 changes around default dispatcher settings and to run things like remoting/cluster in their own dispatcher so they don't get starved out; note also that Alpakka Kafka uses its own dispatcher by default: I wouldn't count those against the two). Actors are comparatively cheap relative to threads, so a profusion of them isn't a huge cause for concern.
Singleton actors (whether at node or cluster (or even, in really extreme cases, entity) level) can do a lot to limit overall parallelism and throughput: the one-message-at-a-time restriction can be a very effective throttle (sometimes that's what you want, often it's not). So don't be afraid to create short-lived actors that do one high-level thing (they can definitely process more than one message) and then stop (note that many simple cases of this can be done in a slightly more lightweight way via futures). If they're interacting with some external service, having them be children of a router actor which spawns new children if the existing ones are all busy (etc.) is probably worth doing: this router is a singleton, but as long as it doesn't spend a lot of time processing any message, the chances of it throttling the system are low. Your RedisService might be a good candidate for this sort of thing.
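For example, a rough sketch of that router idea using a classic pool router with a resizer (the RedisWorker child and the pool bounds here are hypothetical; a resizer grows and shrinks the pool under load rather than literally spawning one child per busy routee):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.{DefaultResizer, RoundRobinPool}

// Hypothetical worker: does one high-level thing per message, e.g. talk to Redis.
class RedisWorker extends Actor {
  def receive: Receive = {
    case key: String => sender() ! s"value-for-$key" // placeholder work
  }
}

val system = ActorSystem("example")

// A resizable pool router: starts with 2 workers and grows up to 16 under load.
val router = system.actorOf(
  RoundRobinPool(nrOfInstances = 2, resizer = Some(DefaultResizer(lowerBound = 2, upperBound = 16)))
    .props(Props[RedisWorker]()),
  "redis-router"
)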
Note also that performance and scalability aren't always one and the same, and improving one can diminish the other. Akka is often somewhat willing to trade performance in the small for reduced degradation in the large.
When I call par on collections, it seems to create about 5-10 threads, which is fine for CPU-bound tasks.
But sometimes I have tasks which are IO-bound, in which case I'd like to have 500-1000 threads pulling from IO concurrently - with 10-15 threads it is very slow and I see my CPUs mostly sitting idle.
How can I achieve this?
You could wrap your blocking IO operations in a blocking block:
import scala.concurrent.blocking

(0 to 1000).par.map { i =>
  blocking {
    Thread.sleep(100)
    Thread.activeCount()
  }
}.max // yields 67 on my PC, while without blocking it's 10
But you should ask yourself whether you should be using parallel collections for IO operations at all. Their use case is CPU-heavy tasks.
I would suggest considering futures for IO calls.
You should also consider using a custom execution context for that task, because the global execution context is a public singleton and you have no control over what code uses it and for what purpose. You could easily starve parallel computations created by external libraries if you used up all of its threads.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutor, Future, blocking}
import scala.util.{Failure, Success}

// or just use scala.concurrent.ExecutionContext.Implicits.global if you don't care
implicit val blockingIoEc: ExecutionContextExecutor = ExecutionContext.fromExecutor(
  Executors.newCachedThreadPool()
)

def fetchData(index: Int): Future[Int] = Future {
  // if you use the global ec, you must mark the computation as blocking to get more threads;
  // if you use a custom cached thread pool, the thread count grows even without it
  blocking {
    Thread.sleep(100)
    Thread.activeCount()
  }
}

val futures = (0 to 1000).map(fetchData)

Future.sequence(futures).onComplete {
  case Success(data) => println(data.max) // prints about 1000 on my PC
  case Failure(ex)   => ex.printStackTrace()
}

Thread.sleep(1000) // keep the JVM alive long enough for the futures to finish
Thread.sleep(1000)
EDIT
There is also the possibility of using a custom ForkJoinPool via ForkJoinTaskSupport:
import java.util.concurrent.ForkJoinPool // scala.concurrent.forkjoin.ForkJoinPool is deprecated
import scala.collection.parallel

val fjpool = new ForkJoinPool(2)
val customTaskSupport = new parallel.ForkJoinTaskSupport(fjpool)

val numbers = List(1, 2, 3, 4, 5).par
numbers.tasksupport = customTaskSupport // assign the custom task support
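After the task support is assigned, parallel operations on numbers run on that two-thread pool, for example:

// parallel operations now use fjpool's 2 worker threads
val doubled = numbers.map(_ * 2) // 2, 4, 6, 8, 10
fjpool.shutdown()                // release the pool when done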
I'm writing software in Go that does a lot of parallel computing. I want to collect data from worker threads and I'm not really sure how to do it in a safe way. I know that I could use channels but in my scenario they make it more complicated since I have to somehow synchronize messages (wait until every thread sent something) in the main thread.
Scenario
The main thread creates n Worker instances and launches their work() method in a goroutine so that the workers each run in their own thread. Every 10 seconds the main thread should collect some simple values (e.g. iteration count) from the workers and print a consolidated statistic.
Question
Is it safe to read values from the workers? The main thread will only read values and each individual thread will write its own values. It would be OK if the values are a few nanoseconds off when read.
Any other ideas on how to implement this in an easy way?
In Go no value is safe for concurrent access from multiple goroutines without synchronization if at least one of the accesses is a write. Your case meets the conditions listed, so you must use some kind of synchronization, else the behavior would be undefined.
Channels are used when goroutines want to send values to one another. Your case is not exactly that: you don't want your workers to send updates every 10 seconds, you want your main goroutine to fetch their status every 10 seconds.
So in this example I would just protect the data with a sync.RWMutex: when the workers want to modify this data, they have to acquire a write lock. When the main goroutine wants to read this data, it has to acquire a read lock.
A simple implementation could look like this:
type Worker struct {
iterMu sync.RWMutex
iter int
}
func (w *Worker) Iter() int {
w.iterMu.RLock()
defer w.iterMu.RUnlock()
return w.iter
}
func (w *Worker) setIter(n int) {
w.iterMu.Lock()
w.iter = n
w.iterMu.Unlock()
}
func (w *Worker) incIter() {
w.iterMu.Lock()
w.iter++
w.iterMu.Unlock()
}
Using this example Worker, the main goroutine can fetch the iteration using Worker.Iter(), and the worker itself can change / update the iteration using Worker.setIter() or Worker.incIter() at any time, without any additional synchronization. The synchronization is ensured by the proper use of Worker.iterMu.
Alternatively for the iteration counter you could also use the sync/atomic package. If you choose this, you may only read / modify the iteration counter using functions of the atomic package like this:
type Worker struct {
iter int64
}
func (w *Worker) Iter() int64 {
return atomic.LoadInt64(&w.iter)
}
func (w *Worker) setIter(n int64) {
atomic.StoreInt64(&w.iter, n)
}
func (w *Worker) incIter() {
atomic.AddInt64(&w.iter, 1)
}
I have a function that boils down to:
while(doWork)
{
config = generateConfigurationForTesting();
result = executeWork(config);
doWork = isDone(result);
}
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread safe and independent of previous iterations, and that the work will probably require more iterations than the maximum number of allowable threads?
The problem here is we don't know how many iterations are required in advance so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of the arbitrarily chosen values and the sleeping:
int thread_count = 0;
bool doWork = true;
int max_threads = 20; // arbitrarily chosen number
dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

while(doWork)
{
    if(thread_count < max_threads)
    {
        dispatch_async(queue, ^{
            Config myconfig = generateConfigurationForTesting();
            Result myresult = executeWork(myconfig);
            dispatch_async(queue, ^{ checkResult(myresult); });
        });
        thread_count++;
    }
    else
        usleep(100); // don't consume too much CPU
}

void checkResult(Result value)
{
    if(value == good) doWork = false;
    thread_count--;
}
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique or otherwise a generator which can produce a near-infinite number of configurations (hence your comment that you don't know ahead of time how many iterations you will need). With that as an assumption, you are basically stuck with the model that you've created, since your executor needs to be limited by some reasonable assumptions about the queue and you don't want to over-generate, as that would just extend the length of the run after you have succeeded in finding a value == good measurement.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement will happen in two different queues (main_queue for the main thread and the default queue for the background task) and thus could simultaneously increment and decrement the thread count. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option for making this look a little nicer would be to have checkResult add new elements to the queue if value != good. This way, you load up the initial elements of the queue using dispatch_apply(20, queue, ^{ ... }) and you don't need the thread_count at all. The first 20 will be added using dispatch_apply (or whatever amount dispatch_apply feels is appropriate for your configuration), and then each time checkResult is called you can either set doWork = false or add another operation to the queue.
dispatch_apply() works for this: just pass ncpu as the number of iterations (dispatch_apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e. loop back to generateConfigurationForTesting() until doWork becomes false).
Here is the question:
Each process may be in different states and different events cause a process to transfer from one state to another; this can be represented using a state diagram. Use a state diagram to explain how a suspension-queue semaphore may be implemented. [10 marks]
Is my diagram correct, or have I misunderstood the question?
http://i.imgur.com/dC5RG6o.jpg
It is my understanding that suspended-queue semaphores maintain a list of blocked processes from which to (perhaps randomly) select a process to unblock when the current process has finished its critical section. Hence the waiting state in the state diagram.
Pseudocode of the suspended_queue_semaphore:
struct suspended_queue_semaphore
{
    int count;
    queueType queue;
};

void up(suspended_queue_semaphore s)
{
    if (s.count == 0)
    {
        /* place this process in s.queue */
        /* block this process */
    }
    else
    {
        s.count = s.count - 1;
    }
}

void down(suspended_queue_semaphore s)
{
    if (s.queue is not empty)
    {
        /* remove a process from s.queue using FIFO */
        /* unblock the process */
    }
    else
    {
        s.count = s.count + 1;
    }
}
Is the state diagram for the process or for the semaphore, and which kind of semaphore are you talking about?
Take the simplest semaphore: a binary semaphore (i.e. only one process can run), with the operations wait(), i.e. a request to access the shared resource, and signal(), i.e. finished accessing the resource.
A state diagram for the process has only two states, Queued (Q) and Running (R), in addition to the Start and Terminate states.
The state diagram would be:
START = wait.CAN_RUN
CAN_RUN = suspend.QUEUED + run.RUNNING
QUEUED = run.RUNNING
RUNNING = signal.END
The semaphore has two states, Empty and Full.
A state diagram for the semaphore would be:
START = EMPTY
EMPTY = wait.RUN_PROCESS + RUN_PROCESS
RUN_PROCESS = run.FULL
FULL = signal.EMPTY + wait.SUSPEND_PROCESS
SUSPEND_PROCESS = suspend.FULL
Edit: Fixed the notation of the state diagrams (it was backwards; sorry, my process calculus is rusty) and added the internal processes CAN_RUN, SUSPEND_PROCESS and RUN_PROCESS, and the internal messages run and suspend.
Explanation:
The process calls the 'wait' method (up in your pseudocode) and goes to the CAN_RUN state; from there it can either start RUNNING or become QUEUED, based on whether it gets a 'run' or a 'suspend' message. If QUEUED, it can start RUNNING when it receives a 'run' message. If RUNNING, it uses 'signal' (down in your pseudocode) before finishing.
The semaphore starts EMPTY; if it gets a 'wait' it goes into RUN_PROCESS, issues a 'run' message and becomes FULL. Once FULL, any further 'wait' sends it to the SUSPEND_PROCESS state, where it issues a 'suspend' to the process. When a 'signal' is received it goes back to EMPTY, and it can remain there or go to RUN_PROCESS again based on whether the queue is empty or not (I did not model these internal states, nor did I model the queue as a system).