Why is my MATLAB job taking so long to run? - multithreading

I have a function (a convolution) which becomes very slow when it operates on matrices with many columns (function code below). I therefore want to parallelize the code.
Example MATLAB code:
x = zeros(1,100);
x(rand(1,100)>0.8) = 1;
x = x(:);
c = convContinuous(1:100,x,@(t,p)p(1)*exp(-(t-p(2)).*(t-p(2))./(2*p(3).*p(3))),[1,0,3],false)
plot(1:100,x,1:100,c)
If x is a matrix with many columns, the code gets very slow... My first attempt was to change the for statement to parfor, but it went wrong (see Concluding remarks below).
My second attempt was to follow this example, which shows how to schedule tasks in a job and then submit the job to a local server. That example is implemented in my function below by setting the last argument, isParallel, to true.
The example MATLAB code would be:
x = zeros(1,100);
x(rand(1,100)>0.8) = 1;
x = x(:);
c = convContinuous(1:100,x,@(t,p)p(1)*exp(-(t-p(2)).*(t-p(2))./(2*p(3).*p(3))),[1,0,3],true)
Now, MATLAB tells me:
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Warning: This job will remain queued until the Parallel Pool is closed.
The MATLAB terminal then hangs, waiting for something to finish. I open the Job Monitor via Home -> Parallel -> Monitor Jobs and see there are two jobs, one of which has the state running, but neither of them ever finishes.
Questions
Why does it take so long to run, given that it is a really simple task?
What would be the best way to parallelize my function below? (The "heavy" part is in the separate function convolveSeries.)
File convContinuous.m
function res = convContinuous(tData, sData, smoothFun, par, isParallel)
% performs the convolution of a series of deltas with a smooth function of parameters par
% tData = temporal space
% sData = matrix of delta series (each column is a different series that will be convolved with smoothFun)
% smoothFun = function used to convolve with each column of sData
%             must be of the form smoothFun(t, par)
% par = parameters to the smoothing function
if nargin < 5 || isempty(isParallel)
    isParallel = false;
end
if isvector(sData)
    [mm,nn] = size(sData);
    sData = sData(:);
end
res = zeros(size(sData));
[ ~, n ] = size(sData);
if ~isParallel
    %parfor i = 1:n % uncomment this and comment the line below for the strange error
    for i = 1:n
        res(:,i) = convolveSeries(tData, sData(:,i), smoothFun, par);
    end
else
    myPool = gcp; % creates a parallel pool if needed
    sched = parcluster; % creates a scheduler
    job = createJob(sched);
    task = cell(1,n);
    for i = 1:n
        task{i} = createTask(job, @convolveSeries, 1, {tData, sData(:,i), smoothFun, par});
    end
    submit(job);
    wait(job);
    jobRes = fetchOutputs(job);
    for i = 1:n
        res(:,i) = jobRes{i,1}(:);
    end
    delete(job);
end
if isvector(sData)
    res = reshape(res, mm, nn);
end
end

function r = convolveSeries(tData, s, smoothFun, par)
r = zeros(size(s));
tSpk = s == 1;
j = 1;
for t = tData
    for tt = tData(tSpk)
        if (tt > t)
            break;
        end
        r(j) = r(j) + smoothFun(t - tt, par);
    end
    j = j + 1;
end
end
Concluding remarks
As a side note, I was not able to do it using parfor because MATLAB R2015a gave me a strange error:
Error using matlabpool (line 27)
matlabpool has been removed.
To query the size of an already started parallel pool, query the 'NumWorkers' property of the pool.
To check if a pool is already started use 'isempty(gcp('nocreate'))'.
Error in parallel_function (line 317)
Nworkers = matlabpool('size');
Error in convContinuous (line 18)
parfor i = 1:n
My version command outputs
Parallel Computing Toolbox Version 6.6 (R2015a)
which is compatible with my MATLAB version. Almost all the other tests I have run pass, so I am led to believe this is a MATLAB bug.
I tried changing matlabpool to gcp and then retrieving the number of workers via parPoolObj.NumWorkers; after altering this detail in two different built-in functions, I received another error:
Error in convContinuous>makeF%1/F% (line 1)
function res = convContinuous(tData, sData, smoothFun, par)
Output argument "res" (and maybe others) not assigned during call to "convContinuous>makeF%1/F%".
Error in parallel_function>iParFun (line 383)
output.data = processInfo.fun(input.base, input.limit, input.data);
Error in parProcess (line 167)
data = processFunc(processInfo, data);
Error in parallel_function (line 358)
stateInfo = parProcess(@iParFun, @iConsume, @iSupply, ...
Error in convContinuous (line 14)
parfor i = 1:numel(sData(1,:))
I suspect that this last error occurs because the function call inside the parfor loop requires many arguments, but I am not sure.
Solving the errors
Thanks to the skeptical comments of people here (saying they could not reproduce my errors), I kept looking for the source of the error. I realized it was a local problem caused by having pforfun, which I downloaded long ago from the File Exchange, in my pathdef.m.
Once I removed pforfun from my pathdef.m, parfor (line 18 in the convContinuous function) started working well.
Thank you in advance!

The parallel pool you created is blocking your job from running. When you are using the jobs and tasks API, you do not need (and must not have) a pool open. When you looked in Job Monitor, the running job you saw was the job that backs the parallel pool, which only finishes when the pool is deleted.
If you delete the line in convContinuous that says myPool = gcp, then it should work. As an optimization, you can use the vectorised form of createTask, which is much more efficient than creating tasks in a loop, i.e.
inputCell = cell(1, n);
for i = 1:n
    inputCell{i} = {tData, sData(:,i), smoothFun, par};
end
task = createTask(job, @convolveSeries, 1, inputCell);
However, having said all that, you should be able to make this code work using parfor. The first error you encountered was due to matlabpool having been removed; it has been replaced by parpool.
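For reference, the serial branch of convContinuous then becomes a straightforward parfor loop (a sketch; gcp/parpool will create the pool automatically when needed):
res = zeros(size(sData));
parfor i = 1:n
    res(:,i) = convolveSeries(tData, sData(:,i), smoothFun, par);
end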
The second error appears to be caused by your function not returning the correct outputs, but the error message does not appear to correspond to the code you posted, so I'm not sure. Specifically, I don't know what convContinuous>makeF%1/F% (line 1) refers to.


Related

Why is string creation so slow in Julia?

I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.
This function becomes extremely slow (seconds or more) when the string grows longer than 1 million characters. The time seems to increase more than linearly, maybe quadratically. I don't understand why. Can someone explain?
This is some reproducible code:
function chop(s; nc=80)
    nr = ceil(Int64, length(s)/nc)
    l(i) = 1+(nc*(i-1))
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end
s = "A"^500000
chop(s)
It seems that this line is where most of the time is spent: rows = [String(s[l(i):r(i)]) for i in 1:nr]
Does that mean it takes a long time to initialize a new String? That wouldn't really explain the super-linear run time.
I know the canonical fast way to build strings is to use IOBuffer or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl
Can someone help me understand why this code above is so slow nonetheless?
Weirdly, the below is much faster, just by adding s = collect(s):
function chop(s; nc=80)
    s = collect(s) # this line is new
    nr = ceil(Int64, length(s)/nc)
    l(i) = 1+(nc*(i-1))
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end
My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity, not speed):
chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')
The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.
How the solution works
findall(Regex(".{1,$nc}"), s) returns a vector of ranges, each eagerly matching up to nc characters;
next I create a SubString(s, r) for each returned range r, which avoids allocation;
finally everything is joined with \n as the separator.
What is wrong in the OP's solutions
First attempt:
the function name you chose, chop, is not recommended, as it shadows the Base Julia function of the same name;
length(s) is called many times and is an expensive function; it should be called only once and its result stored in a variable;
in general, using length here is incorrect, as Julia strings use byte indexing, not character indexing (see here for an explanation, and the snippet after this list);
String(s[l(i):r(i)]) is inefficient, as it allocates a String twice (the outer String is actually not needed).
Second attempt:
doing s = collect(s) resolves the repeated length calls and the incorrect byte indexing, but it is inefficient, as it unnecessarily allocates a Vector{Char}, and it also makes your code type-unstable (you assign to the variable s a value of a different type than it originally stored);
doing String(s[l(i):r(i)]) first allocates a small Vector{Char} and then allocates a String.
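To illustrate the byte- vs character-indexing point, here is a small sketch (the exact error text may vary by Julia version):
s = "αβγ"   # 3 characters encoded in 6 bytes
length(s)   # 3: the number of characters (an O(n) scan)
sizeof(s)   # 6: the number of bytes
s[1:3]      # "αβ": 3 is a valid byte index (the start of 'β')
s[1:2]      # throws StringIndexError: byte 2 is in the middle of 'α'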
What would be a fast solution
If you want something faster than the regex solution and still correct, you can use this code:
function chop4(s::Union{String, SubString{String}}; nc::Integer=80)
    @assert nc > 0
    isempty(s) && return s
    sz = sizeof(s)
    cu = codeunits(s)
    buf_sz = sz + div(sz, nc)
    buf = Vector{UInt8}(undef, buf_sz)
    start = 1
    buf_loc = 1
    while true
        stop = min(nextind(s, start, nc), sz + 1)
        copyto!(buf, buf_loc, cu, start, stop - start)
        buf_loc += stop - start
        if stop == sz + 1
            resize!(buf, buf_loc - 1)
            break
        else
            start = stop
            buf[buf_loc] = UInt8('\n')
            buf_loc += 1
        end
    end
    return String(buf)
end
String is immutable in Julia. If you need to work with a string in this way, it's much better to make a Vector{Char} first, to avoid repeatedly allocating new, big strings.
You could operate on bytes:
function chop2(s; nc=80)
    b = transcode(UInt8, s)
    nr = ceil(Int64, length(b)/nc)
    l(i) = 1+(nc*(i-1))
    r(i) = min(nc*i, length(b))
    dat = UInt8[]
    for i in 1:nr
        append!(dat, @view(b[l(i):r(i)]))
        i < nr && push!(dat, UInt8('\n'))
    end
    String(dat)
end
and the benchmarks (around 5000x faster):
julia> @btime chop($s);
  1.531 s (6267 allocations: 1.28 MiB)
julia> @btime chop2($s);
  334.100 μs (13 allocations: 1.57 MiB)
Notes:
this code could still be made slightly faster by pre-allocating dat, but I tried to stay similar to the original.
with Unicode characters, neither your approach nor this one will work, as you cannot cut a Unicode character in the middle.
With the help of a colleague we figured out the main reason that makes the provided implementation so slow.
It turns out length(::String) has time complexity O(n) in Julia, and the result is not cached: the longer the string, the more length calls are made (one per chunk) and the more each call costs, which gives the overall quadratic behavior. See this Reddit post for a good discussion of the phenomenon.
Collecting the string into a vector resolves the bottleneck, because length of a vector is O(1) instead of O(n).
This is of course by no means the best way to solve the general problem, but it's a one-line change that speeds up the code as provided.
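A quick way to see the difference (a sketch using BenchmarkTools; actual timings are machine-dependent, so none are quoted):
using BenchmarkTools, Random
s = randstring(10^6)
v = collect(s)
@btime length($s)   # O(n): must scan all bytes to count characters
@btime length($v)   # O(1): a Vector stores its length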
This has similar performance to the version by @PrzemyslawSzufel, but is much simpler.
function chop3(s; nc=80)
    L = length(s)
    join((@view s[i:min(i+nc-1,L)] for i=1:nc:L), '\n')
end
I didn't choose firstindex(s), lastindex(s) as strings may not have arbitrary indices, but it makes no difference anyway.
@btime chop3(s) setup=(s=randstring(10^6)) # 1.625 ms (18 allocations: 1.13 MiB)
@btime chop2(s) setup=(s=randstring(10^6)) # 1.599 ms (14 allocations: 3.19 MiB)
Update: Based on suggestions by @BogumiłKamiński, working with ASCII strings, this version with sizeof is even 60% faster.
function chop3(s; nc=80)
    L = sizeof(s)
    join((@view s[i:min(i+nc-1,L)] for i=1:nc:L), '\n')
end

Torch: collectgarbage() not deallocating memory of torch tensors

I am running code that has the following structure:
network = createNetwork() -- loading a pre-trained network
function train()
    for i=1,#trainingsamples do
        local ip = loadInput()
        local ip_1 = someImageProcessing(ip)
        local ip_2 = someImageProcessing(ip)
        network:forward( ...some manipulation on ip_1,ip_2... )
        network:backward()
        collectgarbage('collect')
        -- print all local variables via debug.getlocal
    end
end
I am expecting that collectgarbage() will release all the memory held by ip_1, ip_2, and ip, but I can see that the memory is not released. This causes a memory leak. I am wondering what's happening. Can someone please help me understand the strange behavior of collectgarbage() and fix the memory leak?
I am really sorry that I could not add the full code. I hope the snippet I have added is sufficient to understand the flow of my code; my network training code is very similar to standard CNN training code.
EDIT:
Sorry for not declaring the variables local, and for using a keyword as a variable name, in the original sample snippet; I have edited it now. The only global variable is the network, which is declared outside of the train function, and I feed ip_1, ip_2 as inputs to the network. I have also added a trimmed version of my actual code below.
network = createNetwork()
function trainNetwork()
    local parameters, gradParameters = network:getParameters()
    network:training() -- set flag for dropout
    local bs = 1
    local lR = params.learning_rate / torch.sqrt(bs)
    local optimConfig = {learningRate = params.learning_rate,
                         momentum = params.momentum,
                         learningRateDecay = params.lr_decay,
                         beta1 = params.optim_beta1,
                         beta2 = params.optim_beta2,
                         epsilon = params.optim_epsilon}
    local nfiles = getNoofFiles('train')
    local weights = torch.Tensor(params.num_classes):fill(1)
    criterion = nn.ClassNLLCriterion(weights)
    for ep=1,params.epochs do
        IMAGE_SEQ = 1
        while (IMAGE_SEQ <= nfiles) do
            xlua.progress(IMAGE_SEQ, nfiles)
            local input, inputd2
            local color_image, depth_image2, target_image
            local nextInput = loadNext('train')
            color_image = nextInput.data.rgb
            depth_image2 = nextInput.data.depth
            target_image = nextInput.data.labels
            input = network0:forward(color_image) -- process RGB
            inputd2 = networkd:forward(depth_image2):squeeze() -- HHA
            local input_concat = torch.cat(input,inputd2,1):squeeze() -- concat RGB, HHA
            collectgarbage('collect')
            target = target_image:reshape(params.imWidth*params.imHeight) -- reshape target as vector
            -- create closure to evaluate f(X) and df/dX
            local loss = 0
            local feval = function(x)
                -- get new parameters
                if x ~= parameters then parameters:copy(x) end
                collectgarbage()
                -- reset gradients
                gradParameters:zero()
                -- f is the average of all criterions
                -- evaluate function for complete mini batch
                local output = network:forward(input_concat) -- run forward pass
                local err = criterion:forward(output, target) -- compute loss
                loss = loss + err
                -- estimate df/dW
                local df_do = criterion:backward(output, target)
                network:backward(input_concat, df_do) -- update parameters
                local _,predicted_labels = torch.max(output,2)
                predicted_labels = torch.reshape(predicted_labels:squeeze():float(),
                                                 params.imHeight, params.imWidth)
                return err, gradParameters
            end -- feval
            pm('Training loss: '.. loss, 3)
            _,current_loss = optim.adam(feval, parameters, optimConfig)
            print ('epoch / current_loss ', ep, current_loss[1])
            os.execute('cat /proc/$PPID/status | grep RSS')
            collectgarbage('collect')
            -- for memory leakage debugging
            print ('locals')
            for x, v in pairs(locals()) do
                if type(v) == 'userdata' then
                    print(x, v:size())
                end
            end
            print ('upvalues')
            for x, v in pairs(upvalues()) do
                if type(v) == 'userdata' then
                    print(x, v:size())
                end
            end
        end -- while
        print(string.format('Loss: %.4f Epoch: %d grad-norm: %.4f',
              current_loss[1], ep, torch.norm(parameters)/torch.norm(gradParameters)))
        if (current_loss[1] ~= current_loss[1] or gradParameters ~= gradParameters) then
            print ('nan loss or gradParams. quitting...')
            abort()
        end
        -- some validation code here
    end -- epochs
    print('Training completed')
end
As @Adam said in the comment, the in_1 and in_2 variables (ip_1 and ip_2 in the edited snippet) continue to be referenced, so their values can't be garbage collected. Even when they are local variables, they won't be collected at that point, because the block in which they are defined has not been closed yet.
What you can do is set the in_1 and in_2 values to nil before calling collectgarbage, which should make the previously assigned values unreachable and eligible for garbage collection. This will only work if no other variable stores the same value.
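A sketch of that suggestion applied to the loop body from the question (using the edited variable names):
local ip = loadInput()
local ip_1 = someImageProcessing(ip)
local ip_2 = someImageProcessing(ip)
network:forward( ...some manipulation on ip_1,ip_2... )
network:backward()
ip, ip_1, ip_2 = nil, nil, nil -- drop the references
collectgarbage('collect')      -- the tensors are now eligible for collection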
+1 to Paul's answer above, but note the word "should". Almost all of the time you will be fine. However, if, for example, your code gets more complicated (and you start passing memory objects around and working on them), you may find that occasionally the Lua GC decides to hold onto a memory object a little longer than expected. Don't worry (or waste time trying to work out why): eventually all unused memory objects will be collected. A garbage collector is a complicated algorithm and can appear a little non-deterministic at times.
You create global variables to store values, so those variables remain reachable the whole time; until you overwrite them, the GC cannot collect their values. Just make the variables local and invoke the GC outside their scope.
Also, the first GC cycle may only call finalizers, with the second one actually freeing the memory. I am not sure about that, but you can try calling the GC twice:
function train()
    do
        local in = loadInput()
        local in_1 = someImageProcessing(in)
        local in_2 = someImageProcessing(in)
        network:forward( ...some manipulation on in_1,in_2... )
        network:backward()
    end
    collectgarbage('collect')
    collectgarbage('collect')
    -- print all local variables via debug.getlocal
end
PS: in is not a valid variable name in Lua (it is a reserved keyword).

Semaphores makeWater() synchronization

This program claims to solve the makeWater() synchronization problem. However, I could not understand how. I am new to semaphores. I would appreciate it if you could help me understand this code.
So you need to make H2O combinations (two Hs and one O) out of a number of simultaneously running H-threads and O-threads.
The constraint is that one 'O' needs two 'H's, and no threads are shared between two different water molecules.
So assume a number of O and H threads start their processes.
No O thread can get beyond P(o_wait), because o_wait starts locked, so they all wait.
One random lucky H thread (say H*-1) can get past P(mutex) (now mutex = 0 and count = 1), enter the if (count % 2 == 1) branch, up-count mutex (now mutex = 1), and block on P(h_wait). (This count actually refers to the H count.)
Because mutex was up-counted, another random H thread (H*-2) can get past P(mutex) (now mutex = 0 and count = 2). But now the count is even, so it goes into the else branch. There it does V(o_wait) (now o_wait = 1) and blocks on P(h_wait).
Now H*-1 is still at its previous position inside the if block. But because o_wait was up-counted to 1, a lucky O thread (O*) can continue its process. It does V(h_wait) twice (now o_wait = 0, h_wait = 2), so the two blocked H threads can continue (and no others; now h_wait = 0). So all three (two Hs and one O) can finish their processes, while H*-2 up-counts mutex (now mutex = 1).
The final values of the global variables after one molecule is completed are mutex = 1, h_wait = 0, and o_wait = 0, which is exactly the initial state. The same process then happens again and again, and H2O molecules keep being created.
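Reconstructing the program from this walkthrough (the question did not include the code itself, so this is a C-style pseudocode reconstruction, not the original source), it presumably looks something like this:
/* P() decrements/blocks, V() increments/wakes */
semaphore mutex  = 1;  /* protects count */
semaphore h_wait = 0;  /* H threads block here until released by an O */
semaphore o_wait = 0;  /* O threads block here until two Hs have arrived */
int count = 0;         /* number of H threads seen so far */

void hydrogen(void) {
    P(mutex);
    count = count + 1;
    if (count % 2 == 1) {   /* first H of a pair */
        V(mutex);
        P(h_wait);
    } else {                /* second H: wake an O, then wait */
        V(o_wait);
        P(h_wait);
        V(mutex);           /* released only after this H wakes up */
    }
    /* bond into a molecule */
}

void oxygen(void) {
    P(o_wait);              /* wait until a pair of Hs is ready */
    V(h_wait);
    V(h_wait);              /* release exactly the two waiting Hs */
    /* bond into a molecule */
}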
I hope this makes it clear. Please raise questions if you have any. :))

complete vs. simple i/o lua

I am trying to write a program to analyze data from a simulation. Since the simulation software I am using is what is running the Lua program, I am not sure if this is the right place to ask this question, but I am probably making a programming error.
I am struggling with the difference between using the simple and complete I/O models. I have a block of code, which works, and looks like this:
io.output([[filename_and_location]])
function segment.other_actions()
    if ion_splat ~= 0 then io.write(ion_px_mm, "\n") end
    io.close()
end
Note: ion_splat and ion_px_mm are pre-determined variables that take on number values. This code is run over and over again throughout the simulation.
Then I decided to try achieving the same thing using the complete I/O model like this:
f = io.open([[file_name_and_location]],"w")
function segment.other_actions()
    if ion_splat ~= 0 then f:write(ion_py_mm, "\n") end
    f:close()
end
This runs, but takes a lot longer than the other way. Why is that?
Example 1:
for i = 1, 1000 do
    io.output("test.txt")
    io.write("some data to be written\n")
    io.close()
end
Example 2:
for i = 1, 1000 do
    local f = io.open("test.txt", "w")
    f:write("some data to be written\n")
    f:close()
end
There is no measurable difference in the execution time.
The latter approach is usually preferable because the file being used is identified explicitly.
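Either way, reopening and closing the file on every call is the expensive part. If the simulator calls segment.other_actions many times, a pattern worth sketching is to open the file once and close it only at the end of the run (segment.terminate here is a hypothetical name; use whatever end-of-run hook your simulator actually provides):
local f = io.open([[file_name_and_location]], "w")
function segment.other_actions()
    if ion_splat ~= 0 then f:write(ion_px_mm, "\n") end
end
-- hypothetical end-of-run callback
function segment.terminate()
    f:close()
end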

Less strict alternative to parfor in matlab?

I have code that is structurally similar to the following in Matlab:
bestConfiguration = 0;
bestConfAwesomeness = 0;
for i=1:X
    % note that providing bestConfAwesomeness to the function helps it stop
    % if it sees the current configuration is getting hopeless anyway
    [configuration, awesomeness] = expensive_function(i, bestConfAwesomeness);
    if awesomeness > bestConfAwesomeness
        bestConfAwesomeness = awesomeness;
        bestConfiguration = configuration;
    end
end
There is a bit more to it but the basic structure is the above. X can get very large. I am trying to make this code run in parallel, since expensive_function() takes a long time to run.
The problem is that Matlab won't let me just change for to parfor because it doesn't like that I'm updating the best configuration in the loop.
So far what I've done is:
[allConfigurations, allAwesomeness] = deal(cell(1, X));
parfor i=1:X
    % note that this is not ideal because I am forced to use 0 as the best
    % awesomeness in all cases
    [allConfigurations{i}, allAwesomeness{i}] = expensive_function(i, 0);
end
for i=1:X
    configuration = allConfigurations{i};
    awesomeness = allAwesomeness{i};
    if awesomeness > bestConfAwesomeness
        bestConfAwesomeness = awesomeness;
        bestConfiguration = configuration;
    end
end
This is better in terms of running time; however, for large inputs it takes huge amounts of memory, because all the configurations are saved. Another problem is that using parfor forces me to always provide 0 as the best awesomeness, even though better values might already be known.
Does Matlab provide a better way of doing this?
Basically, if I didn't have to use Matlab and could manage the threads myself, I'd have one central thread which gives jobs to workers (i.e. make them run expensive_function(i)) and once a worker returns, look at the data it produced and compare it to the best found so far and update it accordingly. There would be no need to save all the configurations which seems to be the only way to make parfor work.
Is there a way to do the above in Matlab?
Using bestConfAwesomeness each time round the loop means that the iterations of your loop are not order-independent, which is why PARFOR is unhappy. One approach you could take is to use SPMD and have each worker perform expensiveFunction in parallel, and then communicate to update bestConfAwesomeness. Something like this:
bestConfiguration = 0;
bestConfAwesomeness = 0;
spmd
    for idx = 1:ceil(X/numlabs)
        myIdx = labindex + ((idx-1) * numlabs);
        % should really guard against myIdx > X here.
        [thisConf, thisAwesome] = expensiveFunction(myIdx, bestConfAwesomeness);
        % Now, we must communicate to see who is best
        [bestConfiguration, bestConfAwesomeness] = reduceAwesomeness(...
            bestConfiguration, bestConfAwesomeness, thisConf, thisAwesome);
    end
end
function [bestConf, bestConfAwesome] = reduceAwesomeness(...
        bestConf, bestConfAwesome, thisConf, thisAwesome)
% slightly lazy way of doing this, could be optimized
% but probably not worth it if conf & awesome are both scalars.
allConfs = gcat(thisConf);
allAwesome = gcat(thisAwesome);
[maxThisTime, maxLoc] = max(allAwesome);
if maxThisTime > bestConfAwesome
    bestConfAwesome = maxThisTime;
    bestConf = allConfs(maxLoc);
end
end
I'm not sure that kind of control over your threads is possible in Matlab. However, since X is very large, it may be worth doing the following, which costs you one extra call to expensive_function:
%# calculate awesomeness
parfor i=1:X
    [~, awesomeness(i)] = expensive_function(i, 0);
end
%# find the most awesome i
[mostAwesome, mostAwesomeIdx] = max(awesomeness);
%# get the corresponding configuration
bestConfiguration = expensive_function(mostAwesomeIdx, 0);
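As a side note, newer releases of the Parallel Computing Toolbox offer parfeval, which comes close to the central-dispatcher pattern described in the question: tasks are scheduled asynchronously and results are collected as each worker finishes, so you never hold all X results at once. A sketch (the early-stopping bound still cannot be updated after a task has been scheduled, so 0 is passed):
futures(1:X) = parallel.FevalFuture;
for i = 1:X
    futures(i) = parfeval(@expensive_function, 2, i, 0);
end
bestConfiguration = 0;
bestConfAwesomeness = 0;
for i = 1:X
    % fetchNext blocks until some future completes, in completion order
    [~, configuration, awesomeness] = fetchNext(futures);
    if awesomeness > bestConfAwesomeness
        bestConfAwesomeness = awesomeness;
        bestConfiguration = configuration;
    end
end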
