Improve loop performance with parallelization - multithreading

So I'm trying to wrap my head around Julia's parallelization options. I'm modelling stochastic processes as Markov chains. Since the chains are independent replicates, the iterations of the outer loop are independent - making the problem embarrassingly parallel.
I tried to implement both a @distributed and a @threads solution, both of which seem to run fine, but aren't any faster than the sequential version.
Here's a simplified version of my code (sequential):
function dummy(steps = 10000, width = 100, chains = 4)
    out_N = zeros(steps, width, chains)
    initial = zeros(width)
    for c = 1:chains
        # print("c=$c\n")
        N = zeros(steps, width)
        state = copy(initial)
        N[1,:] = state
        for i = 1:steps
            state = state + rand(width)
            N[i,:] = state
        end
        out_N[:,:,c] = N
    end
    return out_N
end
What would be the correct way of parallelizing this problem to increase performance?

Here is the correct way to do it (at the time of writing this answer, the other answer does not work - see my comment).
I will use a slightly less complex example than the one in the question (though very similar).
1. Not parallelized version (baseline scenario)
using Random
const m = MersenneTwister(0);

function dothestuff!(out_N, N, ic, m)
    out_N[:, ic] .= rand(m, N)
end

function dummy_base(m = m, N = 100_000, c = 256)
    out_N = Array{Float64}(undef, N, c)
    for ic in 1:c
        dothestuff!(out_N, N, ic, m)
    end
    out_N
end
Testing:
julia> using BenchmarkTools; @btime dummy_base();
106.512 ms (514 allocations: 390.64 MiB)
2. Parallelize with threads
# remember to run before starting Julia:
# set JULIA_NUM_THREADS=4
# OR (Linux)
# export JULIA_NUM_THREADS=4

using Random
const mt = MersenneTwister.(1:Threads.nthreads());
# one RNG per thread - required on older Julia versions, and does no harm on later ones :-)

function dothestuff!(out_N, N, ic, m)
    out_N[:, ic] .= rand(m, N)
end

function dummy_threads(mt = mt, N = 100_000, c = 256)
    out_N = Array{Float64}(undef, N, c)
    Threads.@threads for ic in 1:c
        dothestuff!(out_N, N, ic, mt[Threads.threadid()])
    end
    out_N
end
Let us test the performance:
julia> using BenchmarkTools; @btime dummy_threads();
46.775 ms (535 allocations: 390.65 MiB)
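Note that on recent Julia versions (1.7 and later) the default random number generator is task-local, so each @threads task draws from its own independently seeded stream and the manual per-thread MersenneTwister array above is no longer necessary. A minimal sketch under that assumption (dummy_threads2 is my name, not from the original):

function dummy_threads2(N = 100_000, c = 256)
    out_N = Array{Float64}(undef, N, c)
    Threads.@threads for ic in 1:c
        out_N[:, ic] .= rand(N)   # default RNG is task-local, hence thread-safe here
    end
    out_N
end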
3. Parallelize with processes (on a single machine)
using Distributed
addprocs(4)

using Random, SharedArrays
@everywhere using Random, SharedArrays, Distributed
@everywhere Random.seed!(myid())

@everywhere function dothestuff!(out_N, N, ic)
    out_N[:, ic] .= rand(N)
end

function dummy_distr(N = 100_000, c = 256)
    out_N = SharedArray{Float64}(N, c)
    @sync @distributed for ic in 1:c
        dothestuff!(out_N, N, ic)
    end
    out_N
end
Performance (note that inter-process communication takes some time, hence for small computations threads will usually be better):
julia> using BenchmarkTools; @btime dummy_distr();
62.584 ms (1073 allocations: 45.48 KiB)
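If you prefer not to manage a SharedArray, the same computation can be sketched with pmap, which runs each column's work on a worker and ships the result back (dummy_pmap is a hypothetical name, not part of the original answer):

function dummy_pmap(N = 100_000, c = 256)
    cols = pmap(ic -> rand(N), 1:c)   # one column per worker task
    reduce(hcat, cols)                # assemble the N×c matrix on the master
end

Since pmap returns the results over the wire, for large outputs the SharedArray version avoids that extra copying.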

You can use @spawnat from the Distributed standard library to run the chains in parallel processes:
@everywhere using Distributed, SharedArrays
addprocs(4)

@everywhere function inner_loop!(out_N, chain_number, steps, width)
    N = zeros(steps, width)
    state = zeros(width)
    for i = 1:steps
        state .+= rand(width)
        N[i,:] .= state
    end
    out_N[:,:,chain_number] .= N
    nothing
end

function dummy(steps = 10000, width = 100, chains = 4)
    out_N = SharedArray{Float64}((steps, width, chains); pids = collect(1:4))
    @sync for c = 1:chains
        # print("c=$c\n")
        @spawnat :any inner_loop!(out_N, c, steps, width)
    end
    sdata(out_N)
end

Related

Matrix multiplication is slower when multithreading in Julia

I am working with big matrices (~30k rows by ~100 columns). I am doing some matrix multiplication and the process takes around 20 seconds. This is my code:
@time begin
    result = -1
    data = -1
    for i = 1:size
        first_matrix = @view data[i * split, :]
        for j = 1:size
            second_matrix = @view Qg[j * split, :]
            matrix_multiplication = first_matrix * second_matrix'
            current_sum = sum(matrix_multiplication)
            global result
            if current_sum > result
                result = current_sum
                data = matrix_multiplication[1,1]
            end
        end
    end
end
Trying to optimize this a little more, I tried to use multi-threading (julia --threads 4) to get better performance.
@time begin
    global result = -1
    global data = -1
    lock = ReentrantLock()
    for i = 1:size
        first_matrix = @view data[i * split, :]
        Threads.@threads for j = 1:size
            second_matrix = @view Qg[j * split, :]
            matrix_multiplication = first_matrix * second_matrix'
            current_sum = sum(matrix_multiplication)
            global result
            if current_sum > result
                lock(lock)
                result = current_sum
                data = matrix_multiplication[1,1]
                unlock(lock)
            end
        end
    end
end
By adding multi-threading I thought I would get an increase in performance, but it got worse (~40 seconds). I removed the lock to see if that was the issue, but performance stayed the same. I am running this on a dual-core Intel Core i5 (MacBook Pro). Does anyone know why my multi-threaded code doesn't work?
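For reference, the usual suspects in code like this are the untyped globals (result and data), timing in global scope, threading the inner rather than the outer loop, and contending on the lock for every candidate. A lock-free sketch of the same search, with data, Qg, n (the question's size) and split standing in for the question's variables (illustrative only, not a verified fix for this case):

function best_sum(data, Qg, n, split)
    # one best-so-far slot per thread; :static keeps threadid() stable per iteration
    best = fill(-Inf, Threads.nthreads())
    Threads.@threads :static for i in 1:n
        tid = Threads.threadid()
        su = sum(@view data[i * split, :])
        for j in 1:n
            # sum(u * v') == sum(u) * sum(v), so the outer product never needs to be formed
            s = su * sum(@view Qg[j * split, :])
            s > best[tid] && (best[tid] = s)
        end
    end
    maximum(best)
end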

Julia: how to get a random permutation of a given string s?

I thought about two different ways, but both seem pretty ugly.
1. Transform the string s into an array a by splitting it, then use sample(a, length(s), replace=false) and join the array back into a string.
2. Get a RandomPermutation r of length length(s) and join the single s[i] for i in r.
What's the right way? Unfortunately there is no method matching sample(::String, ::Int64; replace=false).
Perhaps defining a shuffle method for String constitutes type piracy, but, anyway, here's a suggested implementation:
using Random
Base.shuffle(s::String) = isascii(s) ? s[randperm(end)] : join(shuffle!(collect(s)))
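Usage (the permutation is random, so the output will differ between runs):

julia> shuffle("hello")
"loelh"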
If you wanted to squeeze out performance from shuffle then you can consider:
using Random

function shufflefast(s::String)
    ss = sizeof(s)
    l = length(s)
    # all-ASCII fast path: one byte per char, so shuffling bytes shuffles chars
    ss == l && return String(shuffle!(copy(Vector{UInt8}(s))))
    v = Vector{Int}(undef, l)   # byte index where each character starts
    i = firstindex(s)
    for j in 1:l
        v[j] = i
        i = nextind(s, i)
    end
    p = pointer(s)
    u = Vector{UInt8}(undef, ss)
    k = 1
    for i in randperm(l)
        for j in v[i]:(i == l ? ss : v[i+1]-1)
            u[k] = unsafe_load(p, j)
            k += 1
        end
    end
    String(u)
end
For large strings it is over 4x faster for ASCII and 3x faster for UTF-8.
Unfortunately it is messy - so I would rather treat it as an exercise. However, it uses only exported functions so it is not a hack.
Inspired by the optimization tricks in Bogumil Kaminski's answer, the following is a version with almost the same performance, but a bit clearer (in my opinion) and using a second utility function which may be of value in itself:
using Random

function strranges(s) # returns the ranges of bytes spanned by each char
    u = Vector{UnitRange{Int64}}()
    sizehint!(u, sizeof(s))
    i = 1
    while i <= sizeof(s)
        ii = nextind(s, i)
        push!(u, i:ii-1)
        i = ii
    end
    return u
end

function shufflefast(s)
    ss = codeunits(s)                     # read-only byte view of s
    uu = Vector{UInt8}(undef, length(ss))
    i = 1
    @inbounds for r in shuffle!(strranges(s))
        for j in r
            uu[i] = ss[j]
            i += 1
        end
    end
    return String(uu)
end
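For instance, strranges maps each character of a UTF-8 string to the byte range it occupies (here "ď" takes two bytes):

julia> strranges("aďc")
3-element Vector{UnitRange{Int64}}:
 1:1
 2:3
 4:4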
Example timing:
julia> using BenchmarkTools

julia> s = "ďaľšý"

julia> @btime shuffle($s) # shuffle from DNF's answer
  831.200 ns (9 allocations: 416 bytes)
"ýľďša"

julia> @btime shufflefast($s) # shuffle from this answer
  252.224 ns (5 allocations: 432 bytes)
"ľýďaš"

julia> @btime kaminskishufflefast($s) # shuffle from Kaminski's answer
  197.345 ns (4 allocations: 384 bytes)
"ýašďľ"
EDIT: a little better performance - see the code comments.
This builds on Bogumil Kaminski's answer, trying to avoid calculating length (*) when it is not necessary:
using Random

function shufflefast2(s::String)
    ss = sizeof(s)
    local l
    for outer l in 1:ss   # `outer` so l keeps its final value after the loop (Julia 1.0+)
        # if ((codeunit(s,l) & 0xc0) == 0x80)
        if codeunit(s, l) >= 0x80   # edit (see comments below for why)
            break
        end
    end
    ss == l && return String(shuffle!(copy(Vector{UInt8}(s))))
    v = Vector{Int}(undef, ss)
    i = 1
    l = 0
    while i <= ss   # `<=` so the start of a trailing one-byte char is recorded too
        l += 1
        v[l] = i
        i = nextind(s, i)
    end
    v[l+1] = ss + 1   # edit - we can do this because ss > l
    p = pointer(s)
    u = Vector{UInt8}(undef, ss)
    k = 1
    for i in randperm(l)
        # for j in v[i]:(i == l ? ss : v[i+1]-1)
        for j in v[i]:v[i+1]-1   # edit - possible because v[l+1] is defined (see above)
            u[k] = unsafe_load(p, j)
            k += 1
        end
    end
    String(u)
end
Example timing for an ASCII string:

julia> Random.seed!(1234); @btime for i in 1:100 danshufflefast("test") end
  19.783 μs (500 allocations: 34.38 KiB)

julia> Random.seed!(1234); @btime for i in 1:100 bkshufflefast("test") end
  10.408 μs (300 allocations: 18.75 KiB)

julia> Random.seed!(1234); @btime for i in 1:100 shufflefast2("test") end
  10.280 μs (300 allocations: 18.75 KiB)
The difference is too small to call - sometimes bkshufflefast is faster. The performance has to be essentially equal: the whole length has to be counted either way, and the allocations are the same.
Example timing for a Unicode string:

julia> Random.seed!(1234); @btime for i in 1:100 danshufflefast(s) end
  24.964 μs (500 allocations: 42.19 KiB)

julia> Random.seed!(1234); @btime for i in 1:100 bkshufflefast(s) end
  20.882 μs (400 allocations: 37.50 KiB)

julia> Random.seed!(1234); @btime for i in 1:100 shufflefast2(s) end
  19.038 μs (400 allocations: 40.63 KiB)
shufflefast2 is a little, but clearly, faster here. It allocates slightly more than Bogumil's function and slightly less than Dan's solution.
(*) - I hold out some hope that the String implementation in Julia will get faster in the future, so that length could become much quicker than it is now.

Corona SDK / Lua / memory leak

So I have this rain module for a game that I am developing, which is causing a massive system memory leak that leads to lag and ultimately crashes the application.
The function t.start is called by a timer every 50 ms.
Though I've tried, I can't really find the cause! Maybe I am overlooking something, but I can't help it. As you see, I niled out the graphics-related locals... Does anyone notice something?
As a secondary issue: does anyone have tips on preloading the next scene for a smooth scene change? The loading itself causes a short freeze when I put it in scene:show()...
Thanks for your help!
Greetings, Nils
local t = {}
local composer = require("composer")
t.drops = {}

function t.fall(drops, group)
    for i = 1, #drops, 1 do
        local thisDrop = drops[i]
        function thisDrop:enterFrame()
            if aboutToBeDestroyed == true then
                Runtime:removeEventListener("enterFrame", self)
                return true
            end
            local randomY = math.random(32, 64)
            if self.x ~= nil then
                self:translate(0, randomY)
                if self.y > 2000 then
                    self:removeSelf()
                    Runtime:removeEventListener("enterFrame", self)
                    self = nil
                end
            end
        end
        Runtime:addEventListener("enterFrame", drops[i])
        thisDrop = nil
    end
end

t.clean = function()
    for i = 1, #t.drops, 1 do
        if t.drops[i] ~= nil then
            table.remove(t.drops, i)
            t.drops[i] = nil
        end
    end
end

function t.start(group)
    local drops = {}
    local theGroup = group
    for i = 1, 20, 1 do
        local randomWidth = math.random(5, 30)
        local dropV = display.newRect(group, 1, 1, randomWidth, 30)
        local drop1 = display.newSnapshot(dropV.contentWidth, dropV.contentHeight * 3)
        drop1.canvas:insert(dropV)
        drop1.fill.effect = "filter.blurVertical"
        drop1.fill.effect.blurSize = 30
        drop1.fill.effect.sigma = 140
        drop1:invalidate("canvas")
        drop1:scale(0.75, 90)
        drop1:invalidate("canvas")
        drop1:scale(1, 1 / 60)
        drop1:invalidate("canvas")
        local drop = display.newSnapshot(drop1.contentWidth * 1.5, drop1.contentHeight)
        drop.canvas:insert(drop1)
        drop.fill.effect = "filter.blurHorizontal"
        drop.fill.effect.blurSize = 10
        drop:invalidate("canvas")
        drop.alpha = 0.375
        local randomY = math.random(-500, 500)
        drop.y = randomY
        drop.anchorY = 0
        drop.x = (i - 1) * 54
        drops[i] = drop
        table.insert(t.drops, drop)
        local dropV, drop1, drop = nil
    end
    composer.setVariable("drops", t.drops)
    t.fall(drops, group)
    drops = nil
    t.clean()
end

return t
EDIT: I found out that it definitely has something to do with the nested snapshots, which are created for the purpose of applying filter effects. I removed one snapshot, so that I only have a vector object inside a single snapshot, and voilà: memory increases way slower. The question is: why?
Generally, you don't need the enterFrame event at all - you can simply run a transition from the start point (math.random(-500, 500)) to the end point (2000 in your code). Just randomise the speed and use an onComplete handler to remove the object:

local targetY = 2000
local speedPerMs = math.random(32, 64) * 60 / 1000   -- px per frame at 60 fps, converted to px per ms
local timeToTravel = (targetY - randomY) / speedPerMs
transition.to(drop, {
    time = timeToTravel,
    x = xx,
    y = targetY,
    onComplete = function()
        drop:removeSelf()
    end
})
Edit 1: I found that with your code removing drop is not enough. This works for me:

drop:removeSelf()
dropV:removeSelf()
drop1:removeSelf()

Some ideas about memory consumption:
1) You can probably use one enterFrame handler for the whole array of drops - this will reduce memory consumption. Also, don't add methods to local objects as in function thisDrop:enterFrame() - that is not optimal here, because you are creating 20 new functions every 50 ms.
2) Your code creates 400 drop objects every second, and they usually live no more than ~78 frames (about 1.3 s in a 60 fps environment). It is better to keep a pool of objects and reuse existing ones.
3) An enterFrame handler depends on the current fps of the device, so your rain will be slower at low fps. Low fps -> objects fall slower -> more objects on scene -> fps drops further. I suggest calculating the deltaTime between two enterFrame calls and adjusting the falling speed according to deltaTime, as sketched below.
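A minimal sketch combining points 1) and 3) - one shared handler with frame-rate independent speed (drops and the per-drop speed field are assumptions standing in for your structures):

local lastTime = system.getTimer()
local function onFrame()
    local now = system.getTimer()
    local dt = (now - lastTime) / 1000   -- seconds since the previous frame
    lastTime = now
    for i = #drops, 1, -1 do             -- iterate backwards so removal is safe
        local d = drops[i]
        d:translate(0, d.speed * dt)     -- speed in px/s, so motion is fps-independent
        if d.y > 2000 then
            d:removeSelf()
            table.remove(drops, i)
        end
    end
end
Runtime:addEventListener("enterFrame", onFrame)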
Edit 2: It seems like :removeSelf() on a snapshot doesn't remove the child objects. I modified your code and memory consumption dropped a lot:
if self.y > 2000 then
    local drop1 = self.group[1]
    local dropV = drop1.group[1]
    dropV:removeSelf()
    drop1:removeSelf()
    self:removeSelf()
    Runtime:removeEventListener("enterFrame", self)
    self = nil
end

Torch out of memory in thread when using torch.serialize twice

I'm trying to add a parallel dataloader to the torch-dataframe in order to add torchnet compatibility. I've used the tnt.ParallelDatasetIterator and changed it so that:
A basic batch is loaded outside the threads
The batch is serialized and sent to the thread
In the thread the batch is deserialized and the batch data is converted to tensors
The tensors are returned in a table that has the input and target keys in order to match the tnt.Engine setup.
The problem occurs the second time enqueue is called, with the error: .../torch_distro/install/bin/luajit: not enough memory. I'm currently only working with MNIST with an adapted mnist-example. The enqueue loop now looks like this (with debugging memory output):
-- `samplePlaceholder` stands in for samples which have been
-- filtered out by the `filter` function
local samplePlaceholder = {}

-- The enqueue does the main loop
local idx = 1
local function enqueue()
    while idx <= size and threads:acceptsjob() do
        local batch, reset = self.dataset:get_batch(batch_size)
        if (reset) then
            idx = size + 1
        else
            idx = idx + 1
        end
        if (batch) then
            local serialized_batch = torch.serialize(batch)
            -- In the parallel section only the to_tensor is run in parallel;
            -- this should though be the computationally expensive operation
            threads:addjob(
                function(argList)
                    io.stderr:write("\n Start")
                    io.stderr:write("\n 1: " .. tostring(collectgarbage("count")))
                    local origIdx, serialized_batch, samplePlaceholder = unpack(argList)
                    io.stderr:write("\n 2: " .. tostring(collectgarbage("count")))
                    local batch = torch.deserialize(serialized_batch)
                    serialized_batch = nil
                    collectgarbage()
                    collectgarbage()
                    io.stderr:write("\n 3: " .. tostring(collectgarbage("count")))
                    batch = transform(batch)
                    io.stderr:write("\n 4: " .. tostring(collectgarbage("count")))
                    local sample = samplePlaceholder
                    if (filter(batch)) then
                        sample = {}
                        sample.input, sample.target = batch:to_tensor()
                    end
                    io.stderr:write("\n 5: " .. tostring(collectgarbage("count")))
                    collectgarbage()
                    collectgarbage()
                    io.stderr:write("\n 6: " .. tostring(collectgarbage("count")))
                    io.stderr:write("\n End \n")
                    return {
                        sample,
                        origIdx
                    }
                end,
                function(argList)
                    sample, sampleOrigIdx = unpack(argList)
                end,
                {idx, serialized_batch, samplePlaceholder}
            )
        end
    end
end
I've sprinkled collectgarbage calls and also tried to remove any objects not needed. The memory output is rather straightforward:
Start
1: 374840.87695312
2: 374840.94433594
3: 372023.79101562
4: 372023.85839844
5: 372075.41308594
6: 372023.73632812
End
The function that loops over enqueue is the non-ordered iterator function, which is trivial (the memory error is thrown at the second enqueue):
iterFunction = function()
    while threads:hasjob() do
        enqueue()
        threads:dojob()
        if threads:haserror() then
            threads:synchronize()
        end
        enqueue()
        if table.exact_length(sample) > 0 then
            return sample
        end
    end
end
So the problem was the torch.serialize call, where the function in the set-up coupled the entire dataset to the function. When adding:

serialized_batch = nil
collectgarbage()
collectgarbage()

the problem was resolved. I further wanted to know what was taking up so much space, and the culprit turned out to be that I had defined the function in an environment containing a large dataset, which got intertwined with the function and massively increased its size. Here is the original definition of the data:
local mnist = require 'mnist'
local dataset = mnist[mode .. 'dataset']()

-- PROBLEMATIC LINE BELOW --
local ext_resource = dataset.data:reshape(dataset.data:size(1),
    dataset.data:size(2) * dataset.data:size(3)):double()

-- Create a Dataframe with the label. The actual images will be loaded
-- as an external resource
local df = Dataframe(
    Df_Dict{
        label = dataset.label:totable(),
        row_id = torch.range(1, dataset.data:size(1)):totable()
    })

-- Since the mnist package already has taken care of the data
-- splitting we create a single subsetter
df:create_subsets{
    subsets = Df_Dict{core = 1},
    class_args = Df_Tbl({
        batch_args = Df_Tbl({
            label = Df_Array("label"),
            data = function(row)
                return ext_resource[row.row_id]
            end
        })
    })
}
It turns out that removing the highlighted line reduces the memory usage from 358 MB down to 0.0008 MB! The code that I used for testing the performance was:
local mem = {}
table.insert(mem, collectgarbage("count"))

local ser_data = torch.serialize(batch.dataset)
table.insert(mem, collectgarbage("count"))

local ser_retriever = torch.serialize(batch.batchframe_defaults.data)
table.insert(mem, collectgarbage("count"))

local ser_raw_retriever = torch.serialize(function(row)
    return ext_resource[row.row_id]
end)
table.insert(mem, collectgarbage("count"))

local serialized_batch = torch.serialize(batch)
table.insert(mem, collectgarbage("count"))

for i = 2, #mem do
    print(i - 1, (mem[i] - mem[i-1]) / 1024)
end
Which originally produced the output:
1 0.0082607269287109
2 358.23344707489
3 0.0017471313476562
4 358.90182781219
and after the fix:
1 0.0094480514526367
2 0.00080204010009766
3 0.00090408325195312
4 0.010146141052246
I tried using setfenv for the function, but it didn't resolve the issue. There is still a performance penalty for sending the serialized data to the thread, but the main problem is resolved: without the expensive data retriever the function is considerably smaller.
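A tiny sketch of the underlying mechanism: a Lua closure captures its upvalues, so serializing the function drags every captured object along with it (the tensor size here is only illustrative):

local big = torch.rand(10000, 1000)       -- a large tensor
local f = function(i) return big[i] end   -- `big` becomes an upvalue of f
print(#torch.serialize(f))                -- huge: the serialized closure contains all of `big`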

Fortran error: Program received signal SIGSEGV: Segmentation fault - invalid memory reference

I'm trying to run an ocean temperature model for 25 years using the explicit method (a parabolic differential equation).
If I run it for one year (a = 3600) or five years (a = 18000) it works fine.
However, when I run it for 25 years (a = 90000) it crashes.
a is the number of time steps used, and a year is considered to be 360 days. The time step is 4320 seconds (delta_t = 4320.).
Here is my code:
program task
    ! declare the variables
    implicit none
    ! initial conditions
    real, parameter :: initial_temp = 4.
    ! vertical resolution (delta_z) [m], vertical diffusion coefficient (av) [m^2/s], time step delta_t [s]
    real, parameter :: delta_z = 2., av = 2.0E-04, delta_t = 4320.
    ! gamma
    real, parameter :: y = (av * delta_t) / (delta_z**2)
    ! horizontal resolution (time) total points
    integer, parameter :: a = 18000
    ! declaring vertical resolution
    integer, parameter :: k = 101
    ! declaring pi
    real, parameter :: pi = 4.0*atan(1.0)
    ! t = time [s], temp_a = temperature at upper boundary [°C]
    real, dimension(0:a) :: t
    real, dimension(0:a) :: temp_a
    real, dimension(0:a,0:k) :: temp
    integer :: i
    integer :: n
    integer :: j

    t(0) = 0
    do i = 1,a
        t(i) = t(i-1) + delta_t
    end do

    ! temperature of upper boundary
    temp_a = 12. + 6. * sin((2. * t * pi) / 31104000.)
    temp(:,0) = temp_a(:)
    temp(0,1:k) = 4.

    ! vertical resolution
    do j = 1,a
        do n = 1,k
            temp(j,n) = temp(j-1,n) + (y * (temp(j-1,n+1) - (2. * temp(j-1,n)) + temp(j-1,n-1)))
        end do
        temp(:,101) = temp(:,100)
    end do

    print *, temp(:,:)
end program task
The variable a is on line 11 (integer,parameter :: a = 18000).
As said, a = 18000 works; a = 90000 doesn't.
At 90000 I get:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
RUN FAILED (exit value 1, total time: 15s)

I'm using Fortran on Windows 8.1 with NetBeans and Cygwin (which provides gfortran).
I'm not sure whether this problem is caused by a bad compiler or by something else.
Does anybody have any ideas? It would help me a lot!
Regards
Take a look at the following lines from your code:

integer,parameter :: k = 101
real,dimension(0:a,0:k) :: temp
integer :: n

do n = 1,k
    temp(j,n) = temp(j-1,n) + (y * (temp(j-1,n+1) - (2. * temp(j-1,n)) + temp(j-1,n-1)))
end do
Your array temp has second-dimension bounds 0:101, but you loop n from 1 to 101, so in the iteration with n = 101 you access temp(j-1,n+1) = temp(j-1,102), which is out of bounds.
This means you are reading whatever memory happens to lie beyond temp. That makes your program incorrect for every value of a, but it only sometimes causes a crash, depending on whether the address you touch is mapped at all. Increasing a changes the array's size and layout in memory, so the invalid access lands somewhere different - which is why a = 18000 appears to work while a = 90000 crashes.
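Concretely, with real,dimension(0:a,0:k) and column-major ordering, the element temp(j,n) sits at 0-based linear offset j + n*(a+1), and the last valid offset is a + 101*(a+1). The invalid access temp(j-1,102) sits at offset (j-1) + 102*(a+1), i.e. j elements past the end of the array, so later iterations of the j loop reach further and further beyond it. With 4-byte reals and a = 90000 the overrun grows to about 360 KB, which is far more likely to stray into unmapped memory than the at most ~72 KB overrun with a = 18000.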
After your loop you set temp(:,101) = temp(:,100), so there is no need to calculate temp(:,101) in the loop at all; you can change its bounds from

do n = 1,k

to

do n = 1, k-1

which fixes the out-of-bounds access on temp.
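Incidentally, bugs like this are caught immediately by compiling with run-time bounds checking, which gfortran supports (task.f90 is a placeholder for your source file name):

gfortran -fcheck=bounds task.f90 -o task

With that flag the program stops with an explicit runtime error naming the array and the offending index instead of touching stray memory.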
