Pass by reference TCL - threading? - multithreading

I'm using the Snack audio processing kit along with TCL.
I want to cut up part of the sound and give this section to another thread to work with.
My question is how to pass something by reference, between threads in TCL.
proc a {} {
snack::sound snd
thread::send -async $Thread [list B snd]
}
set Thread [thead::create {
proc B{snd} {
... do something with snd
}
}

That's not going to work. Tcl threads are designed to be strongly isolated from each other since it massively reduces the amount of locking required for normal processing. The down-side of this is that passing things between threads is non-trivial (other than for short messages containing commands, which audio data isn't!) But there is a way forward…
If you can send the data as a chunk of bytes (at the script level) then I recommend transferring it between threads using the tsv package, which is parceled up with the thread package so you'll already have it. That will let you transport the data between threads relatively simply. Be aware that the snack package is not thread-aware in its script-level interface, so the data transfers are still going to involve copying, and Tk (like a great many GUI toolkits, FWIW) does not support multi-threaded use (well, not without techniques for another time) so if you're doing waveform visualization you've got some work ahead. (OTOH, modern CPUs have loads of time to spare too.)

Related

Are avcodec_send_packet and avcodec_receive_frame thread safe?

I am trying to implement video decoding application with libav decoder.
Most libav examples are built like this (pseudocode):
while true {
auto packet = receive_packet_from_network();
avcodec_send_packet(packet);
auto frame = alloc_empty_frame();
int r = avcodec_receive_frame(&frame);
if (r==0) {
send_to_render(frame);
}
}
Above is pseudocode.
Anyway, with this traditional cycle, when I wait receive frame complete and then wait rendering complete and then wait next data received from network incoming decoder buffer becomes empty. No HW decoder pipeline, low decode performance.
Additional limitation in my application - I know exactly that one received packet from network directly corresponds to one decoded frame.
Besides that, I would like to make solution faster. For that I want to split this cycle into 2 different threads like this:
//thread one
while true {
auto packet = receive_packet_from_network();
avcodec_send_packet(packet);
}
//thread two
while true {
auto frame = alloc_empty_frame();
int r = avcodec_receive_frame(&frame);
if (r==0) {
send_to_render(frame);
}
Purpose to split cycle into 2 different threads is to keep incoming decoder buffer always feed enough, mostly full. Only in that case I guess HW decoder I expect to use will be happy to work constantly pipelined. Of cause, I need thread synchronization mechanisms, not shown here just for simplicity. Of cause when EGAIN is returned from avcodec_send_packet() or avcodec_receive_frame() I need to wait for other thread makes its job feeding incoming buffer or fetching ready frames. That is another story.
Besides that, this threading solution does not work for me with random segmentation faults. Unfortunately I cannot find any libav documentation saying explicitly if such method is acceptable or not, are avcodec_send_packet() and avcodec_receive_frame() calls thread safe or not?
So, what is best way to load HW decoder pipeline? For me it is obvious that traditional poll cycles shown in any libav examples are not effective.
No, threading like this is not allowed in libavcodec.
But, FFmpeg and libavcodec do support threading and hardware pipelining. But, this is much lower-level and requires you, as the user, to let FFmpeg/libavcodec do its thing and not worry about it:
don't call send_packet() and receive_frame() from different threads;
set AVCodecContext.thread_count for threading;
let hardware wrappers in FFmpeg internally take care of pipelining, they know much better than you what to do. I can ask experts for more info if you're interested, I'm not 100% knowledgeable in this area, but can refer you to people that are.
if send_packet() returns AVERROR(EAGAIN), call receive_frame() first
if receive_frame() returns AVERROR(EAGAIN), please call send_packet() next.
With the correct thread_count, FFmpeg/libavcodec will decode multiple frames in parallel and use multiple cores.

How to speed up Parallel::ForkManager in perl

I am using EC2 amazon server to perform data processing of 63 files,
the server i am using has 16 core but using perl Parallel::ForkManager with number of thread = number of core then it seems like half the core are sleeping and the working core are not at 100% and fluctuate around 25%~50%
I also checked IO and it is mostly iddling.
use Sys::Info;
use Sys::Info::Constants qw( :device_cpu );
my $info = Sys::Info->new;
my $cpu = $info->device( CPU => %options );
use Parallel::ForkManager;
my $manager=new Parallel::ForkManager($cpu->count);
for($i=0;$i<=$#files_l;$i++)
{
$manager->start and next;
do_stuff($files_l[$i]);
$manager->finish;
}
$manager->wait_all_children;
The short answer is - we can't tell you, because it depends entirely on what 'do_stuff' is doing.
The major reasons why parallel code doesn't create linear speed increases are:
Process creation overhead - some 'work' is done to spawn a process, so if the children are trivially small, that 'wastes' effort.
Contented resources - the most common is disk IO, but things like file locks, database handles, sockets, or interprocess communication can also play a part.
something else causing a 'back off' that stalls a process.
And without knowing what 'do_stuff' does, we can't second guess what it might be.
However I'll suggest a couple of steps:
Double the number of processes to twice CPU count. That's often a 'sweet spot' because it means that any non-CPU delay in a process just means one of the others get to go full speed.
Try strace -fTt <yourprogram> (if you're on linux, the commands are slightly different on other Unix variants). Then do it again with strace -fTtc because the c will summarise syscall run times. Look at which ones take the most 'time'.
Profile your code to see where the hot spots are. Devel::NYTProf is one library you can use for this.
And on a couple of minor points:
my $manager=new Parallel::ForkManager($cpu->count);
Would be better off written:
my $manager=Parallel::ForkManager -> new ( $cpu->count);
Rather than using indirect object notation.
If you are just iterating #files then it might be better to not use a loop count variable and instead:
foreach my $file ( #files ) {
$manager -> start and next;
do_stuff($file);
$manager -> finish;
}

Multithreaded Game Loop Rendering/Updating (boost-asio)

So I have a single-threaded game engine class, which has separate functions for input, update and rendering, and I've just started learning to use the wonderful boost library (asio and thread components). And I was thinking of separating my update and render functions into separate threads (and perhaps separate the input and update functions from each other as well). Of course these functions will sometimes access the same locations in memory, so I decided to use boost/thread's strand functionality to prevent them from executing at the same time.
Right now my main game loop looks like this:
void SDLEngine::Start()
{
int update_time=0;
quit=false;
while(!quit)
{
update_time=SDL_GetTicks();
DoInput();//get user input and alter data based on it
DoUpdate();//update game data once per loop
if(!minimized)
DoRender();//render graphics to screen
update_time=SDL_GetTicks()-update_time;
SDL_Delay(max(0,target_time-update_time));//insert delay to run at desired FPS
}
}
If I used separate threads it would look something like this:
void SDLEngine::Start()
{
boost::asio::io_service io;
boost::asio::strand strand_;
boost::asio::deadline_timer input(io,boost::posix_time::milliseconds(16));
boost::asio::deadline_timer update(io,boost::posix_time::milliseconds(16));
boost::asio::deadline_timer render(io,boost::posix_time::milliseconds(16));
//
input.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoInput,this)));
update.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoUpdate,this)));
render.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoRender,this)));
//
io.run();
}
So as you can see, before the loop went: Input->Update->Render->Delay->Repeat
Each one was run one after the other. If I used multithreading I would have to use strands so that updates and rendering wouldn't be run at the same time. So, is it still worth it to use multithreading here? They would still basically be running one at a time in separate cores. I basically have no experience in multithreaded applications so any help is appreciated.
Oh, and another thing: I'm using OpenGL for rendering. Would multithreading like this affect the way OpenGL renders in any way?
You are using same strand for all handlers, so there is no multithreading at all. Also, your deadline_timer is in scope of Start() and you do not pass it anywhere. In this case you will not able to restart it from the handler (note its not "interval" timer, its just a "one-call timer").
I see no point in this "revamp" since you are not getting any benefit from asio and/or threads at all in this example.
These methods (input, update, render) are too big and they do many things, you cannot call them without blocking. Its hard to say precisely because i dont know whats the game and how it works, but I'd prefer to do following steps:
Try to revamp network i/o so its become fully async
Try to use all CPU cores
About what you have tried: i think its possible if you search your code for actions that really can run in parallel right now. For example: if you calculate for each NPC something that is not depending on other characters you can io_service.post() each to make use all threads that running io_service.run() at the moment. So your program stay singlethreaded, but you can use, say, 7 other threads on some "big" operations

What multithreading package for Lua "just works" as shipped?

Coding in Lua, I have a triply nested loop that goes through 6000 iterations. All 6000 iterations are independent and can easily be parallelized. What threads package for Lua compiles out of the box and gets decent parallel speedups on four or more cores?
Here's what I know so far:
luaproc comes from the core Lua team, but the software bundle on luaforge is old, and the mailing list has reports of it segfaulting. Also, it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread.
Lua Lanes makes interesting claims but seems to be a heavyweight, complex solution. Many messages on the mailing list report trouble getting Lua Lanes to build or work for them. I myself have had trouble getting the underlying "Lua rocks" distribution mechanism to work for me.
LuaThread requires explicit locking and requires that communication between threads be mediated by global variables that are protected by locks. I could imagine worse, but I'd be happier with a higher level of abstraction.
Concurrent Lua provides an attractive message-passing model similar to Erlang, but it says that processes do not share memory. It is not clear whether spawn actually works with any Lua function or whether there are restrictions.
Russ Cox proposed an occasional threading model that works only for C threads. Not useful for me.
I will upvote all answers that report on actual experience with these or any other multithreading package, or any answer that provides new information.
For reference, here is the loop I would like to parallelize:
for tid, tests in pairs(tests) do
local results = { }
matrix[tid] = results
for i, test in pairs(tests) do
if test.valid then
results[i] = { }
local results = results[i]
for sid, bin in pairs(binaries) do
local outcome, witness = run_test(test, bin)
results[sid] = { outcome = outcome, witness = witness }
end
end
end
end
The run_test function is passed in as an argument, so a package can be useful to me only if it can run arbitrary functions in parallel. My goal is enough parallelism to get 100% CPU utilization on 6 to 8 cores.
Norman wrote concerning luaproc:
"it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread"
I had the same problem with a use case I was dealing with. I liked lua proc due to its simple and light implementation, but my use case had C code that was calling lua, which was triggering a co-routine that needed to send/receive messages to interact with other luaproc threads.
To achieve my desired functionality I had to add features to luaproc to allow sending and receiving messages from the parent thread or any other thread not running from the luaproc scheduler. Additionally, my changes allow using luaproc send/receive from coroutines created from luaproc.newproc() created lua states.
I added an additional luaproc.addproc() function to the api which is to be called from any lua state running from a context not controlled by the luaproc scheduler in order to set itself up with luaproc for sending/receiving messages.
I am considering posting the source as a new github project or contacting the developers and seeing if they would like to pull my additions. Suggestions as to how I should make it available to others are welcome.
Check the threads library in torch family. It implements a thread pool model: a few true threads (pthread in linux and windows thread in win32) are created first. Each thread has a lua_State object and a blocking job queue that admits jobs added from the main thread.
Lua objects are copied over from main thread to the job thread. However C objects such as Torch tensors or tds data structures can be passed to job threads via pointers -- this is how limited shared memory is achieved.
This is a perfect example of MapReduce
You can use LuaRings to accomplish your parallelization needs.
Concurrent Lua might seem like the way to go, but as I note in my updates below, it doesn't run things in parallel. The approach I tried was to spawn several processes that execute pickled closures received through the message queue.
Update
Concurrent Lua seems to handle first-class functions and closures without a hitch. See the following example program.
require 'concurrent'
local NUM_WORKERS = 4 -- number of worker threads to use
local NUM_WORKITEMS = 100 -- number of work items for processing
-- calls the received function in the local thread context
function worker(pid)
while true do
-- request new work
concurrent.send(pid, { pid = concurrent.self() })
local msg = concurrent.receive()
-- exit when instructed
if msg.exit then return end
-- otherwise, run the provided function
msg.work()
end
end
-- creates workers, produces all the work and performs shutdown
function tasker()
local pid = concurrent.self()
-- create the worker threads
for i = 1, NUM_WORKERS do concurrent.spawn(worker, pid) end
-- provide work to threads as requests are received
for i = 1, NUM_WORKITEMS do
local msg = concurrent.receive()
-- send the work as a closure
concurrent.send(msg.pid, { work = function() print(i) end, pid = pid })
end
-- shutdown the threads as they complete
for i = 1, NUM_WORKERS do
local msg = concurrent.receive()
concurrent.send(msg.pid, { exit = true })
end
end
-- create the task process
local pid = concurrent.spawn(tasker)
-- run the event loop until all threads terminate
concurrent.loop()
Update 2
Scratch all of that stuff above. Something didn't look right when I was testing this. It turns out that Concurrent Lua isn't concurrent at all. The "processes" are implemented with coroutines and all run cooperatively in the same thread context. That's what we get for not reading carefully!
So, at least I eliminated one of the options I guess. :(
I realize that this is not a works-out-of-the-box solution, but, maybe go old-school and play with forks? (Assuming you're on a POSIX system.)
What I would have done:
Right before your loop, put all tests in a queue, accessible between processes. (A file, a Redis LIST or anything else you like most.)
Also before the loop, spawn several forks with lua-posix (same as the number of cores or even more depending on the nature of tests). In parent fork wait until all children will quit.
In each fork in a loop, get a test from the queue, execute it, put results somewhere. (To a file, to a Redis LIST, anywhere else you like.) If there are no more tests in queue, quit.
In the parent fetch and process all test results as you do now.
This assumes that test parameters and results are serializable. But even if they are not, I think that it should be rather easy to cheat around that.
I've now built a parallel application using luaproc. Here are some misconceptions that kept me from adopting it sooner, and how to work around them.
Once the parallel threads are launched, as far as I can tell there is no way for them to communicate back to the parent. This property was the big block for me. Eventually I realized the way forward: when it's done forking threads, the parent stops and waits. The job that would have been done by the parent should instead be done by a child thread, which should be dedicated to that job. Not a great model, but it works.
Communication between parent and children is very limited. The parent can communicate only scalar values: strings, Booleans, and numbers. If the parent wants to communicate more complex values, like tables and functions, it must code them as strings. Such coding can take place inline in the program, or (especially) functions can be parked into the filesystem and loaded into the child using require.
The children inherit nothing of the parent's environment. In particular, they don't inherit package.path or package.cpath. I had to work around this by the way I wrote the code for the children.
The most convenient way to communicate from parent to child is to define the child as a function, and to have the child capture parental information in its free variables, known in Lua parlances as "upvalues." These free variables may not be global variables, and they must be scalars. Still, it's a decent model. Here's an example:
local function spawner(N, workers)
return function()
local luaproc = require 'luaproc'
for i = 1, N do
luaproc.send('source', i)
end
for i = 1, workers do
luaproc.send('source', nil)
end
end
end
This code is used as, e.g.,
assert(luaproc.newproc(spawner(randoms, workers)))
This call is how values randoms and workers are communicated from parent to child.
The assertion is essential here, as if you forget the rules and accidentally capture a table or a local function, luaproc.newproc will fail.
Once I understood these properties, luaproc did indeed work "out of the box", when downloaded from askyrme on github.
ETA: There is an annoying limitation: in some circumstances, calling fread() in one thread can prevent other threads from being scheduled. In particular, if I run the sequence
local file = io.popen(command, 'r')
local result = file:read '*a'
file:close()
return result
the read operation blocks all other threads. I don't know why this is---I assume it is some nonsense going on within glibc. The workaround I used was to call directly to read(2), which required a little glue code, but this works properly with io.popen and file:close().
There's one other limitation worth noting:
Unlike Tony Hoare's original conception of communicating sequential processing, and unlike most mature, serious implementations of synchronous message passing, luaproc does not allow a receiver to block on multiple channels simultaneously. This limitation is serious, and it rules out many of the design patterns that synchronous message-passing is good at, but it's still find for many simple models of parallelism, especially the "parbegin" sort that I needed to solve for my original problem.

Explanation of Text on Threading in "C# 3.0 in a Nutshell"

While reading C# 3.0 in a Nutshell by Joseph and Ben Albahari, I came across the following paragraph (page 673, first paragraph in section titled "Signaling with Wait and Pulse")
"The Monitor class provides another signalling construct via two static methods, Wait and Pulse. The principle is that you write the signalling logic yourself using custom flags and fields (enclosed in lock statements), and then introduce Wait and Pulse commands to mitigate CPU spinning. The advantage of this low-level approach is that with just Wait, Pulse, and the lock statement, you can achieve the functionality of AutoResetEvent, ManualResetEvent, and Semaphore, as well as WaitHandle's static methods WaitAll and WaitAny. Furthermore, Wait and Pulse
can be amenable in situations where
all of the wait handles are
parsimoniously challenged."
My question is, what is the correct interpretation of the last sentence?
A situation with a decent/large number of wait handles where WaitOne() is only occasionally called on any particular wait handle.
A situation with a decent/large number of wait handles where rarely does more than one thread tend to block on any particular wait handle.
Some other interpretation.
Would also appreciate illuminating examples of such situations and perhaps how and/or why they are more efficiently handled via Wait and Pulse rather than by other methods.
Thank you!
Edit: I found the text online here
What this is saying is that there are some situations where Wait and Pulse provides a simpler solution than wait handles. In general, this happens where:
The waiter, rather than the notifier, decides when to unblock
The blocking condition involves more than a simple flag (perhaps several variables)
You can still use wait handles in these situations, but Wait/Pulse tends to be simpler. The great thing about Wait/Pulse is that Wait releases the underlying lock while waiting. For instance, in the following example, we're reading _x and _y within the safety of a lock - and yet that lock is released while waiting so that another thread can update those variables:
lock (_locker)
{
while (_x < 10 && _y < 20) Monitor.Wait (_locker);
}
Another thread can then update _x and _y atomically (by virtue of the lock) and then Pulse to signal the waiter:
lock (_locker)
{
_x = 20;
_y = 30;
Monitor.Pulse (_locker);
}
The disadvantage of Wait/Pulse is that it's easier to get it wrong and make a mistake (for instance, by updating a variable and forgetting to Pulse). In situations where a program with wait handles is equally simple to a program with Wait/Pulse, I'd recommend going with wait handles for that reason.
In terms of efficiency/resource consumption (which I think you were alluding to), Wait/Pulse is usually faster and lighter (as it has a managed implementation). This is rarely a big deal in practice, though. And on that point, Framework 4.0 includes low-overhead managed versions of ManualResetEvent and Semaphore (ManualResetEventSlim and SemaphoreSlim).
Framework 4.0 also provides many more synchronization options that lessen the need for Wait/Pulse:
CountdownEvent
Barrier
PLINQ / Data Parallelism (AsParallel, Parallel.Invoke, Parallel.For, Parallel.ForEach)
Tasks and continuations
All of these are much higher-level than Wait/Pulse and IMO are preferable for writing reliable and maintainable code (assuming they'll solve the task at hand).

Resources