I've been using multiple threads for a long time, yet I cannot explain such a simple case.
import java.util.concurrent.Executors
import scala.concurrent._
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))
def addOne(x: Int) = Future(x + 1)
def addTwo(x: Int) = Future {addOne(x + 1)}
addTwo(1)
// res5: Future[Future[Int]] = Future(Success(Future(Success(3))))
To my surprise, it works. And I don't know why.
Question:
Why given one thread can it execute two Futures at the same time?
My expectation:
The first Future (addTwo) is occupying the one and only thread (newFixedThreadPool(1)), then it calls another Future (addOne) which again needs another thread.
So the program should end up starved for threads and get stuck.
Your code works because both futures are executed by the same thread, one after the other. The ExecutionContext that you are creating does not dedicate a Thread to each Future; it schedules tasks (Runnable instances) for execution. When no thread in the pool is free, these tasks are put into a BlockingQueue, where they wait to be executed. (See the ThreadPoolExecutor API for details.) Note also that addTwo does not wait for the inner future: it only submits addOne's task and then completes with the resulting Future as its value, which frees the thread to run the queued task.
If you look at the implementation of Executors.newFixedThreadPool(1), you'll see that it creates an executor with an unbounded queue:
new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue[Runnable])
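To see the queuing in action, here is a minimal sketch (the names inner, outer and pool are made up for illustration) in which each future reports the thread that ran its body. Assuming a single-thread pool as in the question, both bodies should report the same thread name:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val pool = Executors.newFixedThreadPool(1)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

// Each future returns the name of the thread that executed its body.
def inner(): Future[String] = Future(Thread.currentThread().getName)

// The outer body only submits the inner task and returns immediately;
// it never waits for the inner future.
def outer(): Future[(String, Future[String])] =
  Future((Thread.currentThread().getName, inner()))

// Waiting happens on the calling thread, not on the pool thread.
val (outerThread, innerFuture) = Await.result(outer(), 5.seconds)
val innerThread = Await.result(innerFuture, 5.seconds)

println(s"outer ran on $outerThread, inner ran on $innerThread")
pool.shutdown()
```

The inner task sits in the executor's queue until the outer body has returned, and only then runs on the same worker thread.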
To get the effect of thread-starvation that you were looking for, you could create an executor with a limited queue yourself:
implicit val ec = ExecutionContext.fromExecutor(new ThreadPoolExecutor(1, 1, 0L,
TimeUnit.MILLISECONDS, new ArrayBlockingQueue[Runnable](1)))
Since the minimum capacity of an ArrayBlockingQueue is 1, you need three futures to reach the limit. You also need to add some code that runs on the result of each future (in the example below I do this by adding .map(identity)). The callback registered by map is itself submitted to the executor as a task once the future completes, and that submission is the one that gets rejected.
The following example
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}
import scala.concurrent._
implicit val ec = ExecutionContext.fromExecutor(new ThreadPoolExecutor(1, 1, 0L,
TimeUnit.MILLISECONDS, new ArrayBlockingQueue[Runnable](1)))
def addOne(x: Int) = Future {
  x + 1
}
def addTwo(x: Int) = Future {
  addOne(x + 1).map(identity)
}
def addThree(x: Int) = Future {
  addTwo(x + 1).map(identity)
}
println(addThree(1))
fails with
java.util.concurrent.RejectedExecutionException: Task scala.concurrent.impl.CallbackRunnable@65a264b6 rejected from java.util.concurrent.ThreadPoolExecutor@10d078f4[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 1]
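The same rejection can be reproduced without futures, directly against the executor. The following is only a sketch under the same assumptions (one thread, queue capacity 1); the latches and the rejected flag are illustrative names:

```scala
import java.util.concurrent.{ArrayBlockingQueue, CountDownLatch, RejectedExecutionException, ThreadPoolExecutor, TimeUnit}

val pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
  new ArrayBlockingQueue[Runnable](1))
val started = new CountDownLatch(1)
val release = new CountDownLatch(1)

// Task 1 occupies the only worker thread until it is released.
pool.execute(new Runnable {
  def run(): Unit = { started.countDown(); release.await() }
})
started.await()

// Task 2 takes the single free slot in the queue.
pool.execute(new Runnable { def run(): Unit = () })

// Task 3 finds neither a free thread nor a free queue slot.
val rejected =
  try { pool.execute(new Runnable { def run(): Unit = () }); false }
  catch { case _: RejectedExecutionException => true }

println(s"third task rejected: $rejected")
release.countDown()
pool.shutdown()
```

With the default rejection policy, the third execute call throws RejectedExecutionException, which is the situation the CallbackRunnable above runs into.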
Expanding it into Promises makes it easier to understand:
import scala.concurrent.Promise
import scala.util.Success
val p1 = Promise[Future[Int]]()
ec.execute(() => {
  // the first task starts running
  val p2 = Promise[Int]()
  // the second task is submitted, but does not run yet
  ec.execute(() => {
    p2.complete(Success(1))
    println(s"task 2 -> p1:${p1},p2:${p2}")
  })
  // p1 is completed here, without waiting for p2.future to finish
  p1.complete(Success(p2.future))
  println(s"task 1 -> p1:${p1},p2:${p2}") // p1 is completed, but p2 is not yet
  // the first task finishes; only now does the second task run
})
val result: Future[Future[Int]] = p1.future
Thread.sleep(1000)
println(result)
I am very new to Scala and following the Scala Book Concurrency section (from docs.scala-lang.org). Based off of the example they give in the book, I wrote a very simple code block to test using Futures:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}
object Main {
  def main(args: Array[String]): Unit = {
    val a = Future{Thread.sleep(10*100); 42}
    a.onComplete {
      case Success(x) => println(a)
      case Failure(e) => e.printStackTrace
    }
    Thread.sleep(5000)
  }
}
When compiled and run, this properly prints out:
Future(Success(42))
to the console. I'm having trouble wrapping my head around why the Thread.sleep() call comes after the onComplete callback method. Intuitively, at least to me, I would call Thread.sleep() before registering the callback, so that by the time the main thread gets to the onComplete method, a has already been assigned a value. If I move the Thread.sleep() call to before a.onComplete, nothing prints to the console. I'm probably overthinking this, but any help clarifying would be greatly appreciated.
When you use the Thread.sleep() after registering the callback
a.onComplete {
  case Success(x) => println(a)
  case Failure(e) => e.printStackTrace
}
Thread.sleep(5000)
then the thread that is executing the body of the future has the time to sleep one second and to set the 42 as the result of successful future execution. By that time (after approx. 1 second), the onComplete callback is already registered, so the thread calls this as well, and you see the output on the console.
The sequence is essentially:
t = 0: Daemon thread begins the computation of 42
t = 0: Main thread creates and registers callback.
t = 1: Daemon thread finishes the computation of 42
t = 1 + eps: Daemon thread finds the registered callback and invokes it with the result Success(42).
t = 5: Main thread terminates
t = 5 + eps: program is stopped.
(I'm using eps informally as a placeholder for some reasonably small time interval; + eps means "almost immediately thereafter".)
If you swap the a.onComplete and the outer Thread.sleep as in
Thread.sleep(5000)
a.onComplete {
  case Success(x) => println(a)
  case Failure(e) => e.printStackTrace
}
then the thread that is executing the body of the future will compute the result 42 after one second, but it will not find any registered callback (the callback is only created and registered on the main thread four seconds later). Once 5 seconds have passed, the main thread registers the callback. Even though the result 42 is already available by then, the main thread does not execute the callback itself, because that's none of its business (that's what the threads in the execution context are for). So, right after registering the callback, the main thread exits. With it, all the daemon threads in the thread pool are killed and the program exits, so you don't see anything in the console.
The usual sequence of events is roughly this:
t = 0: Daemon thread begins the computation of 42
t = 1: Daemon thread finishes the computation of 42, but cannot do anything with it.
t = 5: Main thread creates and registers the callback
t = 5 + eps: Main thread terminates, daemon thread is killed, program is stopped.
so that there is (almost) no time when the daemon thread could wake up, find the callback, and invoke it.
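A callback registered late is not lost; it only needs the program to stay alive long enough to run. Here is a minimal sketch of that, assuming the global execution context; the latch and the seen counter are illustrative additions, not part of the original program:

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val a = Future { Thread.sleep(100); 42 }
Await.ready(a, 5.seconds) // from here on, a is already completed

val callbackRan = new CountDownLatch(1)
val seen = new AtomicInteger(0)

// Registering on a completed future schedules the callback on the execution context.
a.onComplete { result =>
  seen.set(result.getOrElse(-1))
  callbackRan.countDown()
}

// Keep the main thread alive until the callback has run (or 5 seconds pass).
val ran = callbackRan.await(5, java.util.concurrent.TimeUnit.SECONDS)
println(s"callback ran: $ran, value: ${seen.get}")
```

Waiting on a latch (or on the future itself) is usually more robust than a fixed Thread.sleep, because it does not depend on guessing how long the work takes.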
A lot of things in Scala are functions and don't necessarily look like it. The argument to onComplete is one of those things. What you've written is
a.onComplete {
  case Success(x) => println(a)
  case Failure(e) => e.printStackTrace
}
What that translates to after all the Scala magic is effectively (modulo PartialFunction shenanigans)
a.onComplete({ value =>
  value match {
    case Success(x) => println(a)
    case Failure(e) => e.printStackTrace
  }
})
onComplete isn't actually doing any work. It's just setting up a function that will be called at a later date. So we want to do that as soon as we can, and the Scala scheduler (specifically, the ExecutionContext) will invoke our callback at its convenience.
When I call par on collections, it seems to create about 5-10 threads, which is fine for CPU bound tasks.
But sometimes I have tasks which are IO bound, in which case I'd like to have 500-1000 threads pulling from IO concurrently - doing 10-15 threads is very slow and I see my CPUs mostly sitting idle.
How can I achieve this?
You could wrap your blocking IO operations in a blocking block:
import scala.concurrent.blocking
(0 to 1000).par.map { i =>
  blocking {
    Thread.sleep(100)
    Thread.activeCount()
  }
}.max // yields 67 on my PC, while without blocking it's 10
But you should ask yourself whether parallel collections are the right tool for IO operations. Their use case is performing CPU-heavy tasks.
I would suggest that you consider using futures for IO calls.
You should also consider using a custom execution context for that task, because the global execution context is a public singleton and you have no control over which code uses it and for what purpose. You could easily starve parallel computations created by external libraries if you used up all of its threads.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutor, Future, blocking}
import scala.util.{Failure, Success}
// or just use scala.concurrent.ExecutionContext.Implicits.global if you don't care
implicit val blockingIoEc: ExecutionContextExecutor = ExecutionContext.fromExecutor(
  Executors.newCachedThreadPool()
)
def fetchData(index: Int): Future[Int] = Future {
  // if you use the global ec, the computation must be marked as blocking for the pool to add threads;
  // a custom cached thread pool should add threads even without it
  blocking {
    Thread.sleep(100)
    Thread.activeCount()
  }
}
val futures = (0 to 1000).map(fetchData)
Future.sequence(futures).onComplete {
  case Success(data) => println(data.max) // prints about 1000 on my pc
  case Failure(e) => e.printStackTrace()
}
Thread.sleep(1000)
EDIT
It is also possible to use a custom ForkJoinPool via ForkJoinTaskSupport:
import java.util.concurrent.ForkJoinPool //scala.concurrent.forkjoin.ForkJoinPool is deprecated
import scala.util.Random
import scala.collection.parallel
val fjpool = new ForkJoinPool(2)
val customTaskSupport = new parallel.ForkJoinTaskSupport(fjpool)
val numbers = List(1,2,3,4,5).par
numbers.tasksupport = customTaskSupport //assign customTaskSupport
I'm currently using Scala / Play2 framework / MongoDB (reactivemongo)
I have a function in a request that does the following: find the maximum value in a collection, increase that maximum by a random number, save the new value to the collection, and return the new value.
def generateCode():Future[String] = {
  // find maximum
  maximum = Future[].... map { maxValue =>
    // increase maxValue
    newValue = maxValue + random
    // save back to database
  }
}
The problem is that I want only one thread to run this code at a time, because if two threads run it at the same time, the values can conflict.
Example:
thread 1: read max = 100, thread 2 read max = 100
thread 1: increase max = 105, thread 2 increase max = 102
thread 1: save 105 to db, thread 2 save 102 to db
Finally, the maximum in the db is 102, when it should actually be 105.
How can I do this ?
As a rule, the ReactiveMongo API and operations on Future require an implicit ExecutionContext in scope. So what you can do is define a single-threaded execution context and use it both in the class where you define your generateCode() method and in the class where you call the ReactiveMongo API.
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())
You can also pass ec explicitly to the methods that require implicit ExecutionContext. You just need to make sure the whole chain of asynchronous method calls uses the same single threaded execution context.
You can use a Semaphore or a ReentrantLock to implement a lock:
import java.util.concurrent.locks.ReentrantLock
val s = new ReentrantLock()
def generateCode():Future[String] = {
  s.lock() // acquire the lock, blocking other threads from executing the db operation
  // find maximum
  maximum = Future[].... map { maxValue =>
    // increase maxValue
    newValue = maxValue + random
    // save back to database
  }
  s.unlock() // release the lock so that other threads can acquire it and continue
}
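One caveat with locking around a Future: generateCode() returns as soon as the Future has been created, so an unlock at the end of the method runs before the asynchronous database work has finished. An alternative is to chain each operation onto the previous one. The sketch below is only an illustration under that assumption: SerialRunner is a hypothetical helper (not part of ReactiveMongo or the standard library), and an AtomicInteger stands in for the database.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

// Hypothetical helper: runs asynchronous operations strictly one after another.
final class SerialRunner()(implicit ec: ExecutionContext) {
  // The future of the most recently enqueued operation.
  private val tail = new AtomicReference[Future[Any]](Future.successful(()))

  def run[A](op: () => Future[A]): Future[A] = {
    val done = Promise[A]()
    // Atomically append this operation to the end of the chain.
    val previous = tail.getAndSet(done.future)
    // Start op only after the previous operation has finished, successfully or not.
    previous.onComplete(_ => done.completeWith(Future(op()).flatten))
    done.future
  }
}

// Imported after the class so that it does not compete with the class's own implicit ec.
import scala.concurrent.ExecutionContext.Implicits.global

val db = new AtomicInteger(100) // stands in for the maximum stored in the database
val runner = new SerialRunner()

def generateCode(increment: Int): Future[Int] = runner.run { () =>
  Future {
    val max = db.get()       // read the current maximum
    Thread.sleep(20)         // simulated latency between read and write
    db.set(max + increment)  // write the new value back
    max + increment
  }
}

val results = Await.result(Future.sequence(Seq(generateCode(5), generateCode(2))), 5.seconds)
println(results)
```

Because the second operation starts only after the first has completed, it reads 105 rather than 100. Note that this serializes calls only within one JVM; several application instances would still need a safeguard on the database side.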
I am trying to run some same task in parallel in a F# Console project.
The task is as follows:
1. Dequeue a table name from a ConcurrentQueue (this queue contains the table names my program needs to process)
2. Open a SqlDataReader for the table
3. Write each row in the SqlDataReader to a StreamWriter
4. Zip the file created by the StreamWriter
5. Repeat 1 - 4
So basically each task is a while loop (written as a recursion) that continuously processes tables. I would like to start 4 tasks in parallel, and I would also like to stop execution with a user keystroke on the Console, say the Enter key. But execution should only stop once the current task has completed step 4.
I have tried the following
let rec DownloadHelper (cq:ConcurrentQueue<Table>) sqlConn =
    let success, tb = cq.TryDequeue()
    if success then
        printfn "Processing %s %s" tb.DBName tb.TBName
        Table2CSV tb.DBName tb.TBName sqlConn
        DownloadHelper cq sqlConn
let DownloadTable (cq:ConcurrentQueue<Table>) connectionString =
    use con = new SqlConnection(connectionString)
    con.Open()
    DownloadHelper cq con
let asyncDownloadTask = async { return DownloadTable cq connectionString }
let asyncMultiDownload =
    asyncDownloadTask
    |> List.replicate 4
    |> Async.Parallel
asyncMultiDownload
|> Async.RunSynchronously
|> ignore
There are two problems with the above code:
It blocks the main thread, so I don't know how to do the keystroke part.
I am not sure how to stop execution gracefully.
My second attempt, below, uses a CancellationToken:
let tokenSource = new CancellationTokenSource()
let cq = PrepareJobs connectionString
let asyncDownloadTask = async { DownloadTable cq connectionString }
let task = async {
    asyncDownloadTask
    |> List.replicate 4
    |> Async.Parallel
    |> ignore }
let val1 = Async.Start(task, cancellationToken = tokenSource.Token)
Console.ReadLine() |> ignore
tokenSource.Cancel()
Console.ReadLine() |> ignore
0
But it seems I am not even able to start the task at all.
There are three problems with your code.
First, the DownloadHelper should do one table only.
By making it recursive, you are taking too much control and
inhibiting parallelism.
Second, simply placing an operation in an async expression does not
magically make it async. Unless the DownloadTable function itself
is async, the code will block until it is finished.
So when you run four downloads in parallel, once started, they will
all run to completion, regardless of the cancellation token.
Thirdly, in your second example you use Async.Parallel
but then throw the output away, which is why your task does nothing!
I think what you wanted to do was throw away the result of the async, not the async itself.
Here's my version of your code, to demonstrate these points.
First, a dummy function that takes up time:
let longAtomicOperation milliSecs =
    let sw = System.Diagnostics.Stopwatch()
    let r = System.Random()
    let mutable total = 0.0
    sw.Start()
    while sw.ElapsedMilliseconds < int64 milliSecs do
        total <- total + sin (r.NextDouble())
    // return something
    total
// test
#time
longAtomicOperation 2000
#time
// Real: 00:00:02.000, CPU: 00:00:02.000, GC gen0: 0, gen1: 0, gen2: 0
Note that this function is not async -- once started it will run to completion.
Now let's put it in an async:
let asyncTask id = async {
    // note that NONE of the operations are async
    printfn "Started %i" id
    let result = longAtomicOperation 5000 // 5 seconds
    printfn "Finished %i" id
    return result
    }
None of the operations in the async block are async, so we are not getting
any benefit.
Here's the code to create four tasks in parallel:
let fourParallelTasks = async {
    let! results =
        List.init 4 asyncTask
        |> Async.Parallel
    // ignore
    return ()
    }
The result of the Async.Parallel is not ignored, but is assigned to a value,
which forces the tasks to be run. The async expression as a whole returns unit though.
If we test it:
open System.Threading
// start the task
let tokenSource = new CancellationTokenSource()
do Async.Start(fourParallelTasks, cancellationToken = tokenSource.Token)
// wait for a keystroke
System.Console.WriteLine("press a key to cancel")
System.Console.ReadLine() |> ignore
tokenSource.Cancel()
System.Console.ReadLine() |> ignore
We get output that looks like this, even if a key is pressed,
because once started, each task will run to completion:
press a key to cancel
Started 3
Started 1
Started 2
Started 0
Finished 1
Finished 3
Finished 2
Finished 0
On the other hand, if we create a serial version, like this:
let fourSerialTasks = async {
    let! result1 = asyncTask 1
    let! result2 = asyncTask 2
    let! result3 = asyncTask 3
    let! result4 = asyncTask 4
    // ignore
    return ()
    }
Then, even though the tasks are atomic, the cancellation token is tested between
each step, which allows cancellation of the subsequent tasks.
// start the task
let tokenSource = new CancellationTokenSource()
do Async.Start(fourSerialTasks, cancellationToken = tokenSource.Token)
// wait for a keystroke
System.Console.WriteLine("press a key to cancel")
System.Console.ReadLine() |> ignore
tokenSource.Cancel()
System.Console.ReadLine() |> ignore
The above code can be cancelled between each step when a key is pressed.
To process all elements of the queue this way in batches of four,
just convert the parallel version into a loop:
let rec processQueueAsync() = async {
    let! result = processFourElementsAsync()
    if result <> QueueEmpty then
        do! processQueueAsync()
    // ignore
    return ()
    }
Finally, to me, using async is not about running things in parallel so much
as it is to write non-blocking code. So if your library code is blocking, the async
approach is not going to provide too much benefit.
To ensure your code is non-blocking, you need to use the async versions of the SqlDataReader methods
in your helper, such as ReadAsync and NextResultAsync.
How can I limit the number of threads that are being executed at the same time?
Here is sample of my algorithm:
for(i = 0; i < 100000; i++) {
    Thread.start {
        // Do some work
    }
}
I would like to make sure that once the number of threads in my application hits 100, the algorithm will pause/wait until the number of threads in the app goes below 100.
Currently "some work" takes some time to do and I end up with a few thousand threads in my app. Eventually it runs out of threads and "some work" crashes. I would like to fix it by limiting the number of threads that it can use at one time.
Please let me know how to solve my issue.
I believe you are looking for a ThreadPoolExecutor in the Java Concurrency API. The idea here is that you can define a maximum number of threads in a pool and then instead of starting new Threads with a Runnable, just let the ThreadPoolExecutor take care of managing the upper limit for Threads.
Start here: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
import java.util.concurrent.*;
import java.util.*;
def queue = new ArrayBlockingQueue<Runnable>( 50000 )
def tPool = new ThreadPoolExecutor(5, 500, 20, TimeUnit.SECONDS, queue);
for(i = 0; i < 5000; i++) {
    tPool.execute {
        println "Blah"
    }
}
Parameters for the ThreadPoolExecutor constructor: corePoolSize (5) is the number of threads to create and to keep alive even when the system is idle; maximumPoolSize (500) is the maximum number of threads to create; the third and fourth arguments state that idle threads beyond the core size are terminated after 20 seconds; and the queue argument is a blocking queue that stores queued tasks. Note that the pool only creates threads beyond corePoolSize when the queue is full, so with a queue of 50000 slots and 5000 tasks this example never uses more than 5 threads.
What you'll want to play around with is the queue size and how to handle rejected tasks. If you need to execute 100k tasks, you'll either need a queue that can hold 100k tasks, or a strategy for handling rejected tasks.
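One possible strategy for rejected tasks is ThreadPoolExecutor.CallerRunsPolicy, which makes the submitting thread run the task itself when the pool and queue are full, so the submitting loop slows down instead of failing. Below is a minimal sketch of that idea, written in Scala rather than Groovy; the pool size, queue size, task count and the counters are illustrative assumptions (the question would use 100 threads):

```scala
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

val maxThreads = 4   // illustrative; the question would use 100
val totalTasks = 200
val pool = new ThreadPoolExecutor(
  maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
  new ArrayBlockingQueue[Runnable](16),        // small bounded queue
  new ThreadPoolExecutor.CallerRunsPolicy())   // when full, the submitter runs the task itself

val running = new AtomicInteger(0)
val peak = new AtomicInteger(0)
val completed = new AtomicInteger(0)

for (_ <- 0 until totalTasks) pool.execute(new Runnable {
  def run(): Unit = {
    val now = running.incrementAndGet()
    peak.accumulateAndGet(now, (a, b) => math.max(a, b)) // track the highest concurrency seen
    Thread.sleep(5)                                      // stands in for "some work"
    running.decrementAndGet()
    completed.incrementAndGet()
  }
})

pool.shutdown()
pool.awaitTermination(30, TimeUnit.SECONDS)
println(s"completed=${completed.get}, peak concurrency=${peak.get}")
```

Setting corePoolSize equal to maximumPoolSize sidesteps the "threads are only added when the queue is full" behaviour. With this policy at most maxThreads pool threads plus the submitting thread run tasks at any one time.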