Scala - best API for doing work inside multiple threads - multithreading

In Python, I am using a library called futures, which allows me to do my processing work with a pool of N worker processes, in a succinct and crystal-clear way:
from concurrent import futures  # Python 3 stdlib; the Python 2 backport is imported as plain `futures`
schedulerQ = []
for ... in ...:
    workParam = ...  # arguments for call to processingFunction(workParam)
    schedulerQ.append(workParam)
with futures.ProcessPoolExecutor(max_workers=5) as executor:  # 5 CPUs
    for retValue in executor.map(processingFunction, schedulerQ):
        print("Received result", retValue)
(The processingFunction is CPU bound, so there is no point for async machinery here - this is about plain old arithmetic calculations)
I am now looking for the closest possible way to do the same thing in Scala. Notice that in Python, to avoid the GIL issues, I was using processes (hence the use of ProcessPoolExecutor instead of ThreadPoolExecutor) - and the library automagically marshals the workParam argument to each process instance executing processingFunction(workParam) - and it marshals the result back to the main process, for the executor's map loop to consume.
Does this apply to Scala and the JVM? My processingFunction can, in principle, be executed from threads too (there's no global state at all) - but I'd be interested to see solutions for both multiprocessing and multithreading.
The key part of the question is whether there is anything in the world of the JVM with as clear an API as the Python futures you see above... I think this is one of the best SMP APIs I've ever seen - prepare a list with the function arguments of all invocations, and then just two lines: create the poolExecutor, and map the processing function, getting back your results as soon as they are produced by the workers. Results start coming in as soon as the first invocation of processingFunction returns and keep coming until they are all done - at which point the for loop ends.
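One detail of the description above is worth checking: `executor.map` yields results in the order of the inputs, whereas `concurrent.futures.as_completed` yields each future as soon as it finishes. Below is a minimal sketch of both, assuming Python 3's `concurrent.futures`. The names `processing_function`, `run_in_order` and `run_as_completed` are hypothetical, and a thread pool is used only so that the snippet runs anywhere; `ProcessPoolExecutor` exposes the same `submit`/`map` interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def processing_function(work_param):
    # Hypothetical stand-in for the CPU-bound work in the question.
    return work_param * work_param


def run_in_order(params, max_workers=5):
    # executor.map yields results in the same order as `params`.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(processing_function, params))


def run_as_completed(params, max_workers=5):
    # as_completed yields futures in completion order, so keep a
    # future -> input mapping to know which result belongs to which input.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_param = {
            executor.submit(processing_function, p): p for p in params
        }
        return {
            future_to_param[f]: f.result()
            for f in as_completed(future_to_param)
        }
```

With `run_as_completed`, the caller can react to each result the moment it is ready, at the cost of tracking which input it came from.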

You have way less boilerplate than that using parallel collections in Scala.
myParameters.par.map(x => f(x))
will do the trick if you want the default number of threads (same as number of cores).
If you insist on setting the number of workers, you can like so:
import scala.collection.parallel._
import scala.concurrent.forkjoin._
val temp = myParameters.par
temp.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(5))
temp.map(x => f(x))
The exact details of return timing are different, but you can put as much machinery as you want into f(x) (i.e. both compute and do something with the result), so this may satisfy your needs.
In general, simply having the results appear as completed is not enough; you then need to process them, maybe fork them, collect them, etc. If you want to do this in general, Akka Streams (follow links from here) are nearing 1.0 and will facilitate the production of complex graphs of parallel processing.

There is both a Futures API that allows you to run work units on a thread pool (docs: http://docs.scala-lang.org/overviews/core/futures.html) and a "parallel collections API" that you can use to perform parallel operations on collections: http://docs.scala-lang.org/overviews/parallel-collections/overview.html

Related

What is the best concurrency way of doing 10 000 continuous opencv operations simultaneously in Python3?

I used to have a relatively simple Python3 app that read a video streaming source and did continuous opencv and I/O-heavy (with files and databases) operations:
cap_vid = cv2.VideoCapture(stream_url)
while True:
    # ...
    # openCV operations
    # database I/O operations
    # file I/O operations
    # ...
The app ran smoothly. However, there arose a need to do this not just with 1 channel, but with many, potentially 10 000 channels. Now, let's say I have a list of these channels (stream_urls). If I wrap my usual code inside for stream_url in stream_urls:, it will of course not work, because the iteration will never proceed further than the 0th index. So, the first thing that comes to mind is concurrency.
Now, as far as I know, there are 3 ways of doing concurrent programming in Python3: threading, asyncio, and multiprocessing.
I understand that in the case of multiprocessing the OS creates new instances of the Python interpreter, so there can be at most as many instances as there are cores on the machine, which seldom exceeds 16; however, the number of processes I need can potentially be up to 10 000. Also, the overhead of multiprocessing exceeds the performance gains once the number of processes is more than a certain amount, so this one appears to be useless.
The case of threading seems the easiest in terms of the machinery it uses. I'd just wrap my code in a function and create threads like the following:
from threading import Thread
def work_channel(ch_src):
    cap_vid = cv2.VideoCapture(ch_src)
    while True:
        # ...
        # openCV operations
        # database I/O operations
        # file I/O operations
        # ...
for stream_url in stream_urls:
    Thread(target=work_channel, args=(stream_url,), daemon=True).start()
But there are a few problems with threading: first, using more than 11-17 threads nullifies any of its favourable effects because of the overhead costs. Also, it's not safe to work with file I/O in threading, which is a very important concern for me.
I don't know how to use asyncio and couldn't find how to do what I want with it.
Given the above scenario, which one of the 3 concurrency methods (or others, if there are methods that I am unaware of) should I use for the fastest, most accurate (or at least expected) performance? And how should I use that method correctly?
Any help is appreciated.
Spawning more threads than your computer has cores is very much possible.
Could be as simple as:
import threading
import cv2
all_urls = [...]  # your URL list
def threadfunction(url):
    cap_vid = cv2.VideoCapture(url)
    while True:
        # ...
        # openCV operations
        # file I/O operations
        # ...
for stream_url in all_urls:
    threading.Thread(target=threadfunction, args=(stream_url,)).start()
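To illustrate the point that the thread count is not tied to the core count when the work mostly waits, here is a minimal sketch that starts one thread per stream and shuts them all down cleanly. It makes no claim about how far this scales towards 10 000 streams. Everything in it is hypothetical: `watch_stream` and `run_streams` are illustrative names, and `time.sleep` stands in for a blocking read from a real stream.

```python
import threading
import time


def watch_stream(url, stop_event, frame_counts):
    # Hypothetical worker: the sleep stands in for a blocking read
    # from the stream, during which other threads are free to run.
    frames = 0
    while not stop_event.is_set():
        time.sleep(0.001)
        frames += 1
    frame_counts[url] = frames  # each thread writes only its own key


def run_streams(urls, run_seconds=0.05):
    stop_event = threading.Event()
    frame_counts = {}
    threads = [
        threading.Thread(target=watch_stream, args=(url, stop_event, frame_counts))
        for url in urls
    ]
    for t in threads:
        t.start()
    time.sleep(run_seconds)  # let the workers run for a while
    stop_event.set()         # ask every worker to leave its loop
    for t in threads:
        t.join()
    return frame_counts
```

The shared `threading.Event` gives the otherwise endless `while` loops a way to exit, so the main thread can join them instead of relying on daemon threads.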

Joining threads recursively in Prolog

I would like to create a variable number of threads in Prolog and make the main thread wait for all of them.
I have tried to make a join for each one of them in the predicate but it seems like they are waiting one for the other in a sequential order.
I have also tried storing the ids of the threads in a list and join each one after but it still isn't working.
In the code sample, I have also tried passing the S parameters in thread_join in the recursive call.
thr1(0) :- !.
thr1(N) :-
    thread_create(someFunction(N), Id, []),
    thread_join(Id, S),
    N1 is N-1,
    thr1(N1).
I expect the N predicates to overlap results when doing some print, but they are running in a sequential order.
Most likely the calls to your someFunction/1 predicate succeed faster than the time it takes to create the next thread, which is a relatively heavy process as SWI-Prolog threads are mapped to POSIX threads. Thus, to actually get overlapping results, the computation time of the thread goals must exceed thread creation time. For a toy example of accomplishing that, see:
https://github.com/LogtalkDotOrg/logtalk3/tree/master/examples/threads/sync

Threading in Python 3

I am writing Python 3 code with 2 functions. The first function, insertBlock(), inserts data into MongoDB collection 1; the second function, insertTransactionData(), takes data from collection 1 and inserts it into collection 2. The amount of data is very large, so I use threading to increase performance. But with threading, inserting the data takes more time than without it. I am confused about how exactly threading works in my code and how to increase performance. Here is the main function:
if __name__ == '__main__':
    t1 = threading.Thread(target=insertBlock())
    t1.start()
    t2 = threading.Thread(target=insertTransactionData())
    t2.start()
From the python documentation for threading:
target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
So the correct usage is
threading.Thread(target=insertBlock)
(without the () after insertBlock), because otherwise insertBlock is called, executed normally (blocking the main thread) and target is set to its return value None. This causes t1.start() not to do anything and you don't get any performance improvement.
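The difference can be observed directly by recording which thread actually runs the function. This is a small sketch with hypothetical names (`record_thread_name`, `run_with_callable`, `run_with_call`); it does not touch MongoDB.

```python
import threading


def record_thread_name(names):
    # Remember the name of the thread this function is running in.
    names.append(threading.current_thread().name)


def run_with_callable():
    names = []
    # Correct: pass the function object; the new thread calls it.
    t = threading.Thread(target=record_thread_name, args=(names,), name="worker")
    t.start()
    t.join()
    return names


def run_with_call():
    names = []
    # Mistake: the function runs right here, in the calling thread,
    # and its return value (None) becomes the target.
    t = threading.Thread(target=record_thread_name(names), name="worker")
    t.start()  # starts a thread that has nothing to do
    t.join()
    return names
```

`run_with_callable()` reports the name `"worker"`, while `run_with_call()` reports the name of whichever thread called it.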
Warning:
Be aware that multithreading gives you no guarantee on what the order of execution in different threads will be. You can not rely on the data that insertBlock has inserted into the database inside the insertTransactionData function, because at the time insertTransactionData uses this data, you can not be sure that it was already inserted. So, maybe multithreading does not work at all for this code or you need to restructure your code and only parallelize those parts that do not depend on each other.
I solved this problem by merging these two functionalities into one new function, insertBlockAndTransaction(startrange, endrange). Because the two functionalities depend on each other, I insert the transaction information immediately after the block information is inserted (the block number is common to both and needed by both). Then I did the multithreading by creating 10 threads for this single function:
for i in range(10):
    print('thread:', i)
    t1 = threading.Thread(target=insertBlockAndTransaction, args=(5000000+i*10000, 5000000+(i+1)*10000))
    t1.start()
This helped me deal with the growing execution time for more than 1 lakh (100,000) data items.
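The chunking idea can be sketched on its own: split the block range into equal, non-overlapping slices and hand each slice to one thread through the `args=` keyword. In this sketch `chunk_ranges`, `process_range` and `run_chunks` are hypothetical names, and `process_range` only records its arguments where the real code would call insertBlockAndTransaction.

```python
import threading


def chunk_ranges(start, chunk_size, n_chunks):
    # Half-open (lo, hi) slices that cover the range without overlap.
    return [
        (start + i * chunk_size, start + (i + 1) * chunk_size)
        for i in range(n_chunks)
    ]


def process_range(lo, hi, done, lock):
    # Placeholder for the real work on blocks lo..hi.
    with lock:
        done.append((lo, hi))


def run_chunks(start=5000000, chunk_size=10000, n_chunks=10):
    done, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=process_range, args=(lo, hi, done, lock))
        for lo, hi in chunk_ranges(start, chunk_size, n_chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every slice has been processed
    return sorted(done)
```

Keeping the thread objects in a list and joining them at the end lets the main thread know when all slices are finished.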

I want to know about multithreading with futures in Scala

I know a little about multithreading with futures, such as:
for (i <- 1 to 5) yield future {
  println(i)
}
but here all the threads do the same work.
So, I want to know how to make two threads that do different work concurrently.
Also, is there any way to know when all the threads have completed?
Please give me something simple.
First of all, chances are you might be happy with parallel collections, especially if all you need is to crunch some data in parallel using multiple threads:
val lines = Seq("foo", "bar", "baz")
lines.par.map(line => line.length)
While parallel collections are suitable for finite datasets, futures are more oriented towards event-like processing. In fact, a future defines a task and abstracts away the execution details (one thread, multiple threads, how a particular task is pinned to a thread); all of this is controlled by the execution context. What you can do with futures, though, is add a callback (on success, on failure, or on both), compose one with another future, or await the result. All these concepts are nicely explained in the official documentation, which is worth reading.

What multithreading package for Lua "just works" as shipped?

Coding in Lua, I have a triply nested loop that goes through 6000 iterations. All 6000 iterations are independent and can easily be parallelized. What threads package for Lua compiles out of the box and gets decent parallel speedups on four or more cores?
Here's what I know so far:
luaproc comes from the core Lua team, but the software bundle on luaforge is old, and the mailing list has reports of it segfaulting. Also, it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread.
Lua Lanes makes interesting claims but seems to be a heavyweight, complex solution. Many messages on the mailing list report trouble getting Lua Lanes to build or work for them. I myself have had trouble getting the underlying "Lua rocks" distribution mechanism to work for me.
LuaThread requires explicit locking and requires that communication between threads be mediated by global variables that are protected by locks. I could imagine worse, but I'd be happier with a higher level of abstraction.
Concurrent Lua provides an attractive message-passing model similar to Erlang, but it says that processes do not share memory. It is not clear whether spawn actually works with any Lua function or whether there are restrictions.
Russ Cox proposed an occasional threading model that works only for C threads. Not useful for me.
I will upvote all answers that report on actual experience with these or any other multithreading package, or any answer that provides new information.
For reference, here is the loop I would like to parallelize:
for tid, tests in pairs(tests) do
    local results = { }
    matrix[tid] = results
    for i, test in pairs(tests) do
        if test.valid then
            results[i] = { }
            local results = results[i]
            for sid, bin in pairs(binaries) do
                local outcome, witness = run_test(test, bin)
                results[sid] = { outcome = outcome, witness = witness }
            end
        end
    end
end
The run_test function is passed in as an argument, so a package can be useful to me only if it can run arbitrary functions in parallel. My goal is enough parallelism to get 100% CPU utilization on 6 to 8 cores.
Norman wrote concerning luaproc:
"it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread"
I had the same problem with a use case I was dealing with. I liked luaproc due to its simple and light implementation, but my use case had C code that was calling Lua, which was triggering a coroutine that needed to send/receive messages to interact with other luaproc threads.
To achieve my desired functionality I had to add features to luaproc to allow sending and receiving messages from the parent thread or any other thread not running from the luaproc scheduler. Additionally, my changes allow using luaproc send/receive from coroutines created from luaproc.newproc() created lua states.
I added an additional luaproc.addproc() function to the api which is to be called from any lua state running from a context not controlled by the luaproc scheduler in order to set itself up with luaproc for sending/receiving messages.
I am considering posting the source as a new github project or contacting the developers and seeing if they would like to pull my additions. Suggestions as to how I should make it available to others are welcome.
Check the threads library in torch family. It implements a thread pool model: a few true threads (pthread in linux and windows thread in win32) are created first. Each thread has a lua_State object and a blocking job queue that admits jobs added from the main thread.
Lua objects are copied over from main thread to the job thread. However C objects such as Torch tensors or tds data structures can be passed to job threads via pointers -- this is how limited shared memory is achieved.
This is a perfect example of MapReduce
You can use LuaRings to accomplish your parallelization needs.
Concurrent Lua might seem like the way to go, but as I note in my updates below, it doesn't run things in parallel. The approach I tried was to spawn several processes that execute pickled closures received through the message queue.
Update
Concurrent Lua seems to handle first-class functions and closures without a hitch. See the following example program.
require 'concurrent'
local NUM_WORKERS = 4 -- number of worker threads to use
local NUM_WORKITEMS = 100 -- number of work items for processing
-- calls the received function in the local thread context
function worker(pid)
    while true do
        -- request new work
        concurrent.send(pid, { pid = concurrent.self() })
        local msg = concurrent.receive()
        -- exit when instructed
        if msg.exit then return end
        -- otherwise, run the provided function
        msg.work()
    end
end
-- creates workers, produces all the work and performs shutdown
function tasker()
    local pid = concurrent.self()
    -- create the worker threads
    for i = 1, NUM_WORKERS do concurrent.spawn(worker, pid) end
    -- provide work to threads as requests are received
    for i = 1, NUM_WORKITEMS do
        local msg = concurrent.receive()
        -- send the work as a closure
        concurrent.send(msg.pid, { work = function() print(i) end, pid = pid })
    end
    -- shutdown the threads as they complete
    for i = 1, NUM_WORKERS do
        local msg = concurrent.receive()
        concurrent.send(msg.pid, { exit = true })
    end
end
-- create the task process
local pid = concurrent.spawn(tasker)
-- run the event loop until all threads terminate
concurrent.loop()
Update 2
Scratch all of that stuff above. Something didn't look right when I was testing this. It turns out that Concurrent Lua isn't concurrent at all. The "processes" are implemented with coroutines and all run cooperatively in the same thread context. That's what we get for not reading carefully!
So, at least I eliminated one of the options I guess. :(
I realize that this is not a works-out-of-the-box solution, but, maybe go old-school and play with forks? (Assuming you're on a POSIX system.)
What I would have done:
Right before your loop, put all tests in a queue, accessible between processes. (A file, a Redis LIST or anything else you like most.)
Also before the loop, spawn several forks with lua-posix (as many as there are cores, or even more depending on the nature of the tests). In the parent, wait until all children have quit.
In each fork in a loop, get a test from the queue, execute it, put results somewhere. (To a file, to a Redis LIST, anywhere else you like.) If there are no more tests in queue, quit.
In the parent fetch and process all test results as you do now.
This assumes that test parameters and results are serializable. But even if they are not, I think that it should be rather easy to cheat around that.
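The control flow of these steps (fill a queue, start workers that drain it, wait for them, then collect the results) can be sketched as follows. This is only an illustration under loud assumptions: it is written in Python rather than Lua, and it swaps forked processes and a file or Redis queue for threads and an in-memory `queue.Queue`, so it shows the structure of the scheme and not the multi-core speedup that real forks would give. `run_tests_with_workers` is a hypothetical name.

```python
import queue
import threading


def run_tests_with_workers(tests, run_test, n_workers=4):
    work = queue.Queue()
    results = queue.Queue()

    # Step 1: put all tests in a queue before starting any worker.
    for test_id, test in tests.items():
        work.put((test_id, test))

    def worker():
        # Step 3: take a test, run it, store the result; quit when empty.
        while True:
            try:
                test_id, test = work.get_nowait()
            except queue.Empty:
                return
            results.put((test_id, run_test(test)))

    # Step 2: start the workers, then wait until all of them have quit.
    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    # Step 4: the parent fetches and processes all results.
    collected = {}
    while not results.empty():
        test_id, outcome = results.get()
        collected[test_id] = outcome
    return collected
```

Because the queue is filled completely before any worker starts, an empty queue reliably means "no more work", which is what lets each worker quit without extra signalling.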
I've now built a parallel application using luaproc. Here are some misconceptions that kept me from adopting it sooner, and how to work around them.
Once the parallel threads are launched, as far as I can tell there is no way for them to communicate back to the parent. This property was the big block for me. Eventually I realized the way forward: when it's done forking threads, the parent stops and waits. The job that would have been done by the parent should instead be done by a child thread, which should be dedicated to that job. Not a great model, but it works.
Communication between parent and children is very limited. The parent can communicate only scalar values: strings, Booleans, and numbers. If the parent wants to communicate more complex values, like tables and functions, it must code them as strings. Such coding can take place inline in the program, or (especially) functions can be parked into the filesystem and loaded into the child using require.
The children inherit nothing of the parent's environment. In particular, they don't inherit package.path or package.cpath. I had to work around this by the way I wrote the code for the children.
The most convenient way to communicate from parent to child is to define the child as a function, and to have the child capture parental information in its free variables, known in Lua parlance as "upvalues." These free variables may not be global variables, and they must be scalars. Still, it's a decent model. Here's an example:
local function spawner(N, workers)
    return function()
        local luaproc = require 'luaproc'
        for i = 1, N do
            luaproc.send('source', i)
        end
        for i = 1, workers do
            luaproc.send('source', nil)
        end
    end
end
This code is used as, e.g.,
assert(luaproc.newproc(spawner(randoms, workers)))
This call is how values randoms and workers are communicated from parent to child.
The assertion is essential here, as if you forget the rules and accidentally capture a table or a local function, luaproc.newproc will fail.
Once I understood these properties, luaproc did indeed work "out of the box", when downloaded from askyrme on github.
ETA: There is an annoying limitation: in some circumstances, calling fread() in one thread can prevent other threads from being scheduled. In particular, if I run the sequence
local file = io.popen(command, 'r')
local result = file:read '*a'
file:close()
return result
the read operation blocks all other threads. I don't know why this is---I assume it is some nonsense going on within glibc. The workaround I used was to call directly to read(2), which required a little glue code, but this works properly with io.popen and file:close().
There's one other limitation worth noting:
Unlike Tony Hoare's original conception of communicating sequential processes, and unlike most mature, serious implementations of synchronous message passing, luaproc does not allow a receiver to block on multiple channels simultaneously. This limitation is serious, and it rules out many of the design patterns that synchronous message passing is good at, but it's still fine for many simple models of parallelism, especially the "parbegin" sort that I needed to solve my original problem.
