Limiting number of threads for actor pool to 1 - multithreading

In one of my actors react method I output the thread pool by doing:
val scalaThreadSet = asScalaSet(Thread.getAllStackTraces().keySet());
scalaThreadSet.foreach(element => Console.println("Thread=" + element + ",state=" +
element.getState()))
I see a bunch of threads:
Thread=Thread[ForkJoinPool-1-worker-6,5,main],state=WAITING
Thread=Thread[Signal Dispatcher,9,system],state=RUNNABLE
Thread=Thread[ForkJoinPool-1-worker-10,5,main],state=RUNNABLE
Thread=Thread[ForkJoinPool-1-worker-13,5,main],state=WAITING
Thread=Thread[ForkJoinPool-1-worker-7,5,main],state=WAITING
Thread=Thread[ForkJoinPool-1-worker-9,5,main],state=WAITING
Thread=Thread[ForkJoinPool-1-worker-14,5,main],state=WAITING
I wish to reduce the size of thread pool to one and only see one thread, so I pass in:
So I pass in:
-Dactors.maxPoolSize=1
as a VM argument.
My expectation is I should now only see one thread but I still see loads. Any ideas?

Short answer
Try running the VM with
-Dactors.corePoolSize=1
Explanation
The ForkJoinScheduler, which is the default scheduler for most OSes running Java 1.6 or later, uses a DrainableForkJoinPool under to covers which, as far as I can tell, ignores the maxPoolSize property. See the makeNewPool method of the ForkJoinScheduler and the constructor for DrainableForkJoinPool.

Related

How to sync Delphi event while running DB operations in a background thread?

Using Delphi 7 & UIB, I'm running database operations in a background thread to eliminate problems like:
Timeout
Priority
Immediate Force-reconnect after network-loss
Non-blocked UI
Keeping an opened DB connection alive
User canceling
I've read ALL related topics here, and realized: using while isMyThreadStillRuning and not UserCanceled do sleep(100); end; isn't the recommended way to do this, but rather using TEvent.WaitFor(3000)....
The solutions here are either about sending signals FROM or TO... the thread, or doing it with messages, but never both ways.
Reading the help file, I've also found TSimpleEvent, which seems to be easier to use.
So what is the recommended way to communicate between Main-UI + DB-Thread in both ways?
Should I simply create 2+2 TSimpleEvent?
to start a new transaction (thread should stop sleeping)
force-STOP execution
to signal back if it's moved to a new stage (transaction started / executed / commited=done)
to signal back if there is any error happened
or should there be only 1 TEvent?
Update 2:
First tests show:
2x TSimpleEvent is enough (1 for Thread + 1 for Gui)
Both created as public properties of the background thread
Force-terminating the thread does not work. (Too many errors impossible to handle..)
Better to set a variable like (Stop_yourself) and let it cancel and free itself, (while creating a new instance from the same class and try again.)
(still work in progress...)
You should move the query to a TThread. Unfortunately, anonymous threads are not available in D7 so you need to write your own TThread derived class. Inside, you need its own DB connection to prevent shared resources. From the caller method, you can wait for the thread to end. The results should be stored somewhere in the caller class. Ensure that the access to parameters of the query and for storing the result of the query is handled thread-safe by using a TMutex or TMonitor.

How to check what dispatcher is configured in akka application

I have following entry in conf file. But I'm not sure if this dispatcher setting is being picked up and what's ultimate parallelism value being used
akka{
actor{
default-dispatcher {
type = Dispatcher
executor = "fork-join-executor"
throughput = 3
fork-join-executor {
parallelism-min = 40
parallelism-factor = 10
parallelism-max = 100
}
}
}
}
I've 8 core machine so I expect 80 parallel threads to be in ready state
40min < 80 (8*10 factor) < 100max. I'd like to see what value is akka using for max parallel thread.
I created 45 child actors and in my logs, I'm printing the thread id [application-akka.actor.default-dispatcher-xx] and I don't see more than 20 threads running in parallel.
In order to max-out the parallelism factor, all the actors needs to be processing some messages at the same time. Are you sure this is the case in your application?
Take for example the following code
object Test extends App {
val system = ActorSystem()
(1 to 80).foreach{ _ =>
val ref = system.actorOf(Props[Sleeper])
ref ! "hello"
}
}
class Sleeper extends Actor {
override def receive: Receive = {
case msg =>
//Thread.sleep(60000)
println(msg)
}
}
If you consider your config and 8 cores, you will see a small amount of threads being spawned (4, 5?) as the processing of the messages is too quick for some real parallelism to build up.
On the contrary, if you keep your actors CPU-busy uncommenting the nasty Thread.sleep you will see the number of threads will bump up to 80. However, this will only last 1 minute, after which the threads will be gradually be retired from the pool.
I guess the main trick is: don't think of each actor being run on a separate thread. It's whenever one or more messages appear on an actor's mailbox that the dispatcher awakes and - indeed - dispatches the message processing task to a designated pool.
Assuming you have an ActorSystem instance you can check the values set in its configuration. This is how you could get your hand on the values you've set in the config file:
val system = ActorSystem()
val config = system.settings.config.getConfig("akka.actor.default-dispatcher")
config.getString("type")
config.getString("executor")
config.getString("throughput")
config.getInt("fork-join-executor.parallelism-min")
config.getInt("fork-join-executor.parallelism-max")
config.getDouble("fork-join-executor.parallelism-factor")
I hope this helps. You can also consult this page for more details on specific configuration settings.
Update
I've dug up a bit more in Akka to find out exactly what it uses for your settings. As you might already expect it uses a ForkJoinPool. The parallelism used to build it is given by:
object ThreadPoolConfig {
...
def scaledPoolSize(floor: Int, multiplier: Double, ceiling: Int): Int =
math.min(math.max((Runtime.getRuntime.availableProcessors * multiplier).ceil.toInt, floor), ceiling)
...
}
This function is used at some point to build a ForkJoinExecutorServiceFactory:
new ForkJoinExecutorServiceFactory(
validate(tf),
ThreadPoolConfig.scaledPoolSize(
config.getInt("parallelism-min"),
config.getDouble("parallelism-factor"),
config.getInt("parallelism-max")),
asyncMode)
Anyway, this is the parallelism that will be used to create the ForkJoinPool, which is actually an instance of java.lang.ForkJoinPool. Now we have to ask how many thread does this pool use? The short answer is that it will use the whole capacity (80 threads in our case) only if it needs it.
To illustrate this scenario, I've ran a couple of tests with various uses of Thread.sleep inside the actor. What I've found out is that it can use from somewhere around 10 threads (if no sleep call is made) to around the max 80 threads (if I call sleep for 1 second). The tests were made on a machine with 8 cores.
Summing it up, you will need to check the implementation used by Akka to see exactly how that parallelism is used, this is why I looked into ForkJoinPool. Other than looking at the config file and then inspecting that particular implementation I don't think you can do unfortunately :(
I hope this clarifies the answer - initially I thought you wanted to see how the actor system's dispatcher is configured.

Intel TBB v4.3 ignores task_scheduler_init parameter, always runs total threads equal to number of cores

I need to use task_scheduler_init to limit the number of threads to a number under of cores, however TBB ignores the number & always uses the number of cores (8 in this case).
This doesn't look like normal behavior to me. Please note that it is not a possibility for me to use a different version of TBB.
Snippet:
task_scheduler_init scheduler(nb_thread);
tbb::parallel_for(
tbb::blocked_range<size_t>(0, size),
[&](const tbb::blocked_range<size_t>& subrange) {
int tid = syscall(SYS_gettid);
dragon_draw_raw(
subrange.begin(),
subrange.end(),
dragon,
dragon_width,
dragon_height,
limits,
id
);
});
If the task_scheduler_init instance was not the first call to TBB on this thread, it explains why you see such a behavior. [Almost] any call to TBB initiates auto-initialization and the number of worker threads is fixed at this time, any further call to task_scheduler_init means basically nothing except additional reference to the internal task scheduler object.
If you are in control of the entire application and can make the call to task_scheduler_init object constructor the first call to TBB, it will fix your problem. But if you are writing a component or a library, look at task_arena feature, it allows to control concurrency limits regardless of the current settings.

Coldfusion limit to total number of threads

I've got some code that is trying to create 100 threaded http calls. It seems to be getting capped at about 40.
When I do threadJoin I'm only getting 38 - 40 sets of results from my http calls, despite the loop being from 1 - 100.
// thread http calls
pages = 100;
for (page="1";page <= pages; page++) {
thread name="req#page#" {
grabber.setURL('http://site.com/search.htm');
// request headers
grabber.addParam(type="url",name="page",value="#page#");
results = grabber.send().getPrefix();
arrayAppend(VARIABLES.arrResults,results.fileContent);
}
}
// rejoin threads
for (page="2";page <= pages; page++) {
threadJoin('req#page#',10000);
}
Is there a limit to the number of threads that CF can create? Is it to do with Java running in the background? Or can it not handle that many http requests?
Is there a just a much better way for me to do this than threaded HTTP calls?
The result you're seeing is likely because your variables aren't thread safe.
grabber.addParam(type="url",name="page",value="#page#");
That line is accessing Variables.Page which is shared by all of the spawned threads. Because threads start at different times, the value of page is often different from the value you think it is. This will lead to multiple threads having the same value for page.
Instead, if you pass page as an attribute to the thread, then each thread will have its own version of the variable, and you will end up with 100 unique values. (1-100).
Additionally you're writing to a shared variable as well.
arrayAppend(VARIABLES.arrResults,results.fileContent);
ArrayAppend is not thread safe and you will be overwriting versions of VARIABLES.arrResults with other versions of itself, instead of appending each bit.
You want to set the result to a thread variable, and then access that once the joins are complete.
thread name="req#page#" page=Variables.page {
grabber.setURL('http://site.com/search.htm');
// request headers
grabber.addParam(type="url",name="page",value="#Attributes.page#");
results = grabber.send().getPrefix();
thread.Result = results.fileContent;
}
And the join:
// rejoin threads
for (page="2";page <= pages; page++) {
threadJoin('req#page#',10000);
arrayAppend(VARIABLES.arrResults, CFThread['req#page#'].Result);
}
In the ColdFusion administrator, there's a setting for how many will run concurrently, mine's defaulted to 10. The rest apparently are queued. An Phantom42 mentions, you can up the number of running CF threads, however, with 100 or more threads, you may run into other problems.
On 32-bit processes, your whole process can only use 2gig of memory. Each thread uses up an amount of Stack memory, which isn't part of the heap. We've had problems with running out of memory with high numbers of threads as your Java Binary+Heap+Non-Heap(PermGen)+(threads*512k) can easily go over the 2-gig limit.
You'd also have to allow enough threads to handle your code above, as well as other requests coming into your app, which may bog down the app as a whole.
I would suggest changing your code to create N threads, each of which does more than 1 request. It's more work, but you break the N requests=N Threads problem. There's a couple of approaches you can take:
If you think that each request is going to take roughly the same time, then you can split up the work and give each thread a portion to work on before you start each one up.
Or each thread picks a URL off a list and processes it, you can then join to all N threads. You'd need to make sure you put locking around whatever counter you used to track progress though.
Check your Maximum number of running JRun threads setting in ColdFusion Administrator under the Request Tuning tab. The default is 50.

What multithreading package for Lua "just works" as shipped?

Coding in Lua, I have a triply nested loop that goes through 6000 iterations. All 6000 iterations are independent and can easily be parallelized. What threads package for Lua compiles out of the box and gets decent parallel speedups on four or more cores?
Here's what I know so far:
luaproc comes from the core Lua team, but the software bundle on luaforge is old, and the mailing list has reports of it segfaulting. Also, it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread.
Lua Lanes makes interesting claims but seems to be a heavyweight, complex solution. Many messages on the mailing list report trouble getting Lua Lanes to build or work for them. I myself have had trouble getting the underlying "Lua rocks" distribution mechanism to work for me.
LuaThread requires explicit locking and requires that communication between threads be mediated by global variables that are protected by locks. I could imagine worse, but I'd be happier with a higher level of abstraction.
Concurrent Lua provides an attractive message-passing model similar to Erlang, but it says that processes do not share memory. It is not clear whether spawn actually works with any Lua function or whether there are restrictions.
Russ Cox proposed an occasional threading model that works only for C threads. Not useful for me.
I will upvote all answers that report on actual experience with these or any other multithreading package, or any answer that provides new information.
For reference, here is the loop I would like to parallelize:
for tid, tests in pairs(tests) do
local results = { }
matrix[tid] = results
for i, test in pairs(tests) do
if test.valid then
results[i] = { }
local results = results[i]
for sid, bin in pairs(binaries) do
local outcome, witness = run_test(test, bin)
results[sid] = { outcome = outcome, witness = witness }
end
end
end
end
The run_test function is passed in as an argument, so a package can be useful to me only if it can run arbitrary functions in parallel. My goal is enough parallelism to get 100% CPU utilization on 6 to 8 cores.
Norman wrote concerning luaproc:
"it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread"
I had the same problem with a use case I was dealing with. I liked lua proc due to its simple and light implementation, but my use case had C code that was calling lua, which was triggering a co-routine that needed to send/receive messages to interact with other luaproc threads.
To achieve my desired functionality I had to add features to luaproc to allow sending and receiving messages from the parent thread or any other thread not running from the luaproc scheduler. Additionally, my changes allow using luaproc send/receive from coroutines created from luaproc.newproc() created lua states.
I added an additional luaproc.addproc() function to the api which is to be called from any lua state running from a context not controlled by the luaproc scheduler in order to set itself up with luaproc for sending/receiving messages.
I am considering posting the source as a new github project or contacting the developers and seeing if they would like to pull my additions. Suggestions as to how I should make it available to others are welcome.
Check the threads library in torch family. It implements a thread pool model: a few true threads (pthread in linux and windows thread in win32) are created first. Each thread has a lua_State object and a blocking job queue that admits jobs added from the main thread.
Lua objects are copied over from main thread to the job thread. However C objects such as Torch tensors or tds data structures can be passed to job threads via pointers -- this is how limited shared memory is achieved.
This is a perfect example of MapReduce
You can use LuaRings to accomplish your parallelization needs.
Concurrent Lua might seem like the way to go, but as I note in my updates below, it doesn't run things in parallel. The approach I tried was to spawn several processes that execute pickled closures received through the message queue.
Update
Concurrent Lua seems to handle first-class functions and closures without a hitch. See the following example program.
require 'concurrent'
local NUM_WORKERS = 4 -- number of worker threads to use
local NUM_WORKITEMS = 100 -- number of work items for processing
-- calls the received function in the local thread context
function worker(pid)
while true do
-- request new work
concurrent.send(pid, { pid = concurrent.self() })
local msg = concurrent.receive()
-- exit when instructed
if msg.exit then return end
-- otherwise, run the provided function
msg.work()
end
end
-- creates workers, produces all the work and performs shutdown
function tasker()
local pid = concurrent.self()
-- create the worker threads
for i = 1, NUM_WORKERS do concurrent.spawn(worker, pid) end
-- provide work to threads as requests are received
for i = 1, NUM_WORKITEMS do
local msg = concurrent.receive()
-- send the work as a closure
concurrent.send(msg.pid, { work = function() print(i) end, pid = pid })
end
-- shutdown the threads as they complete
for i = 1, NUM_WORKERS do
local msg = concurrent.receive()
concurrent.send(msg.pid, { exit = true })
end
end
-- create the task process
local pid = concurrent.spawn(tasker)
-- run the event loop until all threads terminate
concurrent.loop()
Update 2
Scratch all of that stuff above. Something didn't look right when I was testing this. It turns out that Concurrent Lua isn't concurrent at all. The "processes" are implemented with coroutines and all run cooperatively in the same thread context. That's what we get for not reading carefully!
So, at least I eliminated one of the options I guess. :(
I realize that this is not a works-out-of-the-box solution, but, maybe go old-school and play with forks? (Assuming you're on a POSIX system.)
What I would have done:
Right before your loop, put all tests in a queue, accessible between processes. (A file, a Redis LIST or anything else you like most.)
Also before the loop, spawn several forks with lua-posix (same as the number of cores or even more depending on the nature of tests). In parent fork wait until all children will quit.
In each fork in a loop, get a test from the queue, execute it, put results somewhere. (To a file, to a Redis LIST, anywhere else you like.) If there are no more tests in queue, quit.
In the parent fetch and process all test results as you do now.
This assumes that test parameters and results are serializable. But even if they are not, I think that it should be rather easy to cheat around that.
I've now built a parallel application using luaproc. Here are some misconceptions that kept me from adopting it sooner, and how to work around them.
Once the parallel threads are launched, as far as I can tell there is no way for them to communicate back to the parent. This property was the big block for me. Eventually I realized the way forward: when it's done forking threads, the parent stops and waits. The job that would have been done by the parent should instead be done by a child thread, which should be dedicated to that job. Not a great model, but it works.
Communication between parent and children is very limited. The parent can communicate only scalar values: strings, Booleans, and numbers. If the parent wants to communicate more complex values, like tables and functions, it must code them as strings. Such coding can take place inline in the program, or (especially) functions can be parked into the filesystem and loaded into the child using require.
The children inherit nothing of the parent's environment. In particular, they don't inherit package.path or package.cpath. I had to work around this by the way I wrote the code for the children.
The most convenient way to communicate from parent to child is to define the child as a function, and to have the child capture parental information in its free variables, known in Lua parlances as "upvalues." These free variables may not be global variables, and they must be scalars. Still, it's a decent model. Here's an example:
local function spawner(N, workers)
return function()
local luaproc = require 'luaproc'
for i = 1, N do
luaproc.send('source', i)
end
for i = 1, workers do
luaproc.send('source', nil)
end
end
end
This code is used as, e.g.,
assert(luaproc.newproc(spawner(randoms, workers)))
This call is how values randoms and workers are communicated from parent to child.
The assertion is essential here, as if you forget the rules and accidentally capture a table or a local function, luaproc.newproc will fail.
Once I understood these properties, luaproc did indeed work "out of the box", when downloaded from askyrme on github.
ETA: There is an annoying limitation: in some circumstances, calling fread() in one thread can prevent other threads from being scheduled. In particular, if I run the sequence
local file = io.popen(command, 'r')
local result = file:read '*a'
file:close()
return result
the read operation blocks all other threads. I don't know why this is---I assume it is some nonsense going on within glibc. The workaround I used was to call directly to read(2), which required a little glue code, but this works properly with io.popen and file:close().
There's one other limitation worth noting:
Unlike Tony Hoare's original conception of communicating sequential processing, and unlike most mature, serious implementations of synchronous message passing, luaproc does not allow a receiver to block on multiple channels simultaneously. This limitation is serious, and it rules out many of the design patterns that synchronous message-passing is good at, but it's still find for many simple models of parallelism, especially the "parbegin" sort that I needed to solve for my original problem.

Resources