I am attempting to suppress the academic license message that Gurobi prints. I have to solve an LP several times in my code, and the print statement is slow and annoying to look at. I have created a working example. I would like to execute the LP in parallel in Julia to increase performance. To suppress the output I use the Suppressor package. This works well when running on a single thread. However, if I wrap the LP in a function, insert it into a loop, and attempt to run the model in parallel, I get an error:
TaskFailedException: SystemError: dup: bad file descriptor
Running the code works fine on Windows 7 with an i5 processor.
Running the code on macOS 10.13.6 with a 2.66 GHz Intel Core 2 Duo does not work.
I am able to run the code in parallel on the Mac when the output is not suppressed. Is it just because the hardware is ancient (the cores might not be able to communicate correctly to share information), or is there a fix for this issue?
Additional information:
Julia: 1.5.2
Gurobi License version: 9.0.1
JuMP: 0.21.5
Suppressor: 0.2.0
clearconsole()

using Gurobi
using JuMP
using Suppressor
using BenchmarkTools

# Create LP function
function runlp()
    @suppress begin
        global primal = Model(Gurobi.Optimizer)
        set_optimizer_attributes(primal, "OutputFlag" => 0)
        set_optimizer_attributes(primal, "Threads" => 1)
    end
    # Declare variables with lower bound 0
    @variable(primal, x1 >= 0)
    @variable(primal, x2 >= 0)
    @variable(primal, -Inf <= x3 <= Inf)
    # Declare minimization of costs objective function
    @objective(primal, Min, -5*x1+4*x2-3*x3)
    # Declare constraint for minimum of lubricant 1
    @constraint(primal, Cons_1, 2*x1-3*x2-x3 <= 5)
    @constraint(primal, Cons_2, 4*x1-x2+2*x3 >= 11)
    @constraint(primal, Cons_3, -3*x1+4*x2+2*x3 <= 8)
    @constraint(primal, Cons_4, 6*x1-5*x2+x3 == 1)
    # Optimize model
    optimize!(primal)
end

# Create function to run the LP multiple times
function testing()
    for i in 1:100
        runlp()
    end
end

testing()
@btime testing()

function testing_par()
    @sync Threads.@threads for i in 1:100
        runlp()
    end
end

@btime testing_par()
I'm not sure how Suppressor works, but the way to resolve this with Gurobi.jl is to re-use an environment for multiple solves:
https://github.com/jump-dev/Gurobi.jl#reusing-the-same-gurobi-environment-for-multiple-solves
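For example (a minimal sketch along the lines of that README section; the constant name GRB_ENV is my choice), creating the environment once means the license banner is printed a single time, and every model reuses it:

using JuMP, Gurobi

# Create one Gurobi environment up front; the academic license message
# is printed only here, not on every solve.
const GRB_ENV = Gurobi.Env()

function runlp()
    # Pass a closure so each new model reuses the shared environment.
    primal = Model(() -> Gurobi.Optimizer(GRB_ENV))
    set_optimizer_attribute(primal, "OutputFlag", 0)
    # ... build and optimize the model as before ...
end

Note that if models are created concurrently from multiple threads, it may be safer to create one environment per thread rather than sharing a single one.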
I have a question regarding multithreading in Julia and how to parallelize a for loop effectively.
Suppose you have a nested for loop and a computer with 4 cores. A straightforward way is to add Threads.@threads in front of the outer loop, assuming that the iterations can run without interfering with each other.
As I understand it, only the outermost loop of the nest is parallelized. Assuming N = 15 and M = 14, a computer with 4 cores would then not be a bottleneck.
However, if you have a PC with 32 cores, then 32 - 15 = 17 cores would be doing nothing, even though there are 15 * 14 = 210 combinations in total to compute.
Is this correct? Is this how Threads.@threads works? Is there a way to parallelize over the combination of both i and j, perhaps using FLoops? I have tried to read the documentation, but I need to know whether I am going in a completely wrong direction.
Threads.@threads for i in 1:N
    for j in 1:M
        # Do stuff
    end
end
vs.
using FLoops
@floop for i in 1:N
    for j in 1:M
        # Do stuff
    end
end
Thanks in advance
You could use a single loop variable and split it into the two indices:

Threads.@threads for k in 1:(N*M)
    j = (k - 1) % M + 1
    i = (k - 1) ÷ M + 1
    # Do stuff
end

Alternatively, iterating over Iterators.product will assign both i and j without the two extra lines:

@floop for (i, j) in Iterators.product(1:N, 1:M)
    # Do stuff
end
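A self-contained sketch of both variants (N, M, and the array acc are placeholders for your actual work; note the explicit ThreadedEx() executor, which asks FLoops to run the loop on threads):

using FLoops

N, M = 15, 14
acc = zeros(N, M)

# Flattened index: all N*M iterations are distributed across the threads.
Threads.@threads for k in 1:(N * M)
    j = (k - 1) % M + 1
    i = (k - 1) ÷ M + 1
    acc[i, j] = i + j   # placeholder work
end

# FLoops over the Cartesian product achieves the same distribution.
@floop ThreadedEx() for (i, j) in Iterators.product(1:N, 1:M)
    acc[i, j] = i + j   # placeholder work
end

Each (i, j) pair writes to a distinct cell of acc, so the iterations are independent and safe to run in parallel.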
Recently I have had to use Verilog-A to generate a set of random numbers (sigmaX, sigmaY, sigmaZ). Statistically, each of them has mean = 0 and std = 1, and sigmaX^2 + sigmaY^2 + sigmaZ^2 = 1. The following code in the test_solver.va file is written in Verilog-A to produce such a random number set at each time step:
`include "disciplines.h"
`include "constants.h"

module test_va(p,n,mb,mc,md,me,mf,mg);

    inout p,n;
    output mb,mc,md,me,mf,mg;
    electrical p,n,mb,mc,md,me,mf,mg;

    real randomX,randomY,randomZ;  // Gaussian random variables with mean = 0, stdev = 1
    real sigmaX,sigmaY,sigmaZ;     // Normalized thermal noise vector components
    integer seedX,seedY,seedZ;     // Seed variables for RNG
    integer random_seed;

    //------------------------------------------------------------------//
    // Define mag(x, y, z)
    //------------------------------------------------------------------//
    analog function real mag;
        input x, y, z;
        real x, y, z;
        begin
            mag = sqrt(pow(x,2)+pow(y,2)+pow(z,2));
        end
    endfunction

    analog begin
        random_seed = 1;
        seedX = $random+random_seed;
        seedY = $random+random_seed;
        seedZ = $random+random_seed;

        randomX = $rdist_normal(seedX, 0.0, 1.0);
        randomY = $rdist_normal(seedY, 0.0, 1.0);
        randomZ = $rdist_normal(seedZ, 0.0, 1.0);

        sigmaX = randomX/mag(randomX, randomY, randomZ);
        sigmaY = randomY/mag(randomX, randomY, randomZ);
        sigmaZ = randomZ/mag(randomX, randomY, randomZ);

        V(mb) <+ randomX;
        V(mc) <+ randomY;
        V(md) <+ randomZ;
        V(me) <+ sigmaX;
        V(mf) <+ sigmaY;
        V(mg) <+ sigmaZ;
    end
endmodule
I used HSPICE 2019 to test the random number output at each simulation step by running the following test_solver.sp file:
Title Simple
.option post=1
.option probe=0
*.option runlvl=4
.option ingold=2
*.option accurate=1
*.option method=bdf
*.option bdfrtol=1e-5
*.option bdfatol=1e-5
.option numdgt=4
.option brief
.option measfile=1
.option lis_new=1
.option vaopts=str('-G')
.save
.hdl ./test_solver.va
vin 1 0 PULSE(0 0.5 2NS 1NS 1NS 10NS 20NS)
X 1 0 2 3 4 5 6 7 test_va
.tran 0.01n 20.0n 1E-10 uic
.print tran V(1) V(2) V(3) V(4) V(5) V(6) V(7)
.end
However, I noticed that it always generates an identical random number set (sigmaX, sigmaY, sigmaZ) when I run HSPICE consecutively. My requirement is to get a different random number set each time I run the same code.
I also noticed that if I change random_seed = 1 in the test_solver.va file to, for example, random_seed = 2 (or 3, or 4, ...) and run HSPICE, it generates a different random number set than before. But it still generates the same set when running the same code consecutively.
So I wonder whether there is anything wrong with my test_solver.va code, or whether we have to change random_seed = 1 every time. The latter would not be easy if I integrate this code into other code and run it many times.
First of all, pseudo-random number generators are deterministic. That means if you start with the same seed you will always get the same result.
I'm not aware of any way to do what you want directly in Verilog-A. I think you will need to write your own function in C. One technique that is often used is to call a high-resolution timer and assume that the time in micro- or nanoseconds is essentially random. Alternatively, you can call a function like getrandom().
The next problem is getting the C random value back into your Verilog-A. I'm not familiar with HSPICE, but on some other simulators this can be done with the Verilog PLI.
Alternatively, you could wrap your simulation in a shell script and do something like this (a sketch of such a script follows below):
1. In the script, read /dev/urandom and write a random number to a file.
2. Run HSPICE.
3. In your Verilog-A, use a system task like $fread to read the file that the script produced.
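A minimal sketch of such a wrapper (the file name seed.txt and the exact HSPICE invocation are assumptions):

#!/bin/sh
# Write a fresh 32-bit unsigned integer from /dev/urandom into seed.txt...
od -An -N4 -tu4 /dev/urandom | tr -d ' ' > seed.txt
# ...then run the simulation; the Verilog-A module would read seed.txt
# (e.g. via $fopen/$fscanf) to seed its generators.
hspice test_solver.sp -o test_solver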
In Python, it is usually suggested to vectorize the code to make computation faster. For example, if you want to compute the inner product of two vectors, say a and b, usually
Code A
c = np.dot(a, b)
is better than
Code B
c = 0
for i in range(len(a)):
    c += a[i] * b[i]
But in Julia, it seems vectorization is sometimes not that helpful. I regarded '* and dot as vectorized versions and an explicit for loop as the non-vectorized version, and got the following results.
using Random
using LinearAlgebra

len = 1000000
rng1 = MersenneTwister(1234)
a = rand(rng1, len)
rng2 = MersenneTwister(12345)
b = rand(rng2, len)

function inner_product(a, b)
    result = 0
    for i in 1:length(a)
        result += a[i] * b[i]
    end
    return result
end

@time a' * b
@time dot(a, b)
@time inner_product(a, b);
0.013442 seconds (56.76 k allocations: 3.167 MiB)
0.003484 seconds (106 allocations: 6.688 KiB)
0.008489 seconds (17.52 k allocations: 976.752 KiB)
(I know using BenchmarkTools.jl is a more standard way to measure the performance.)
From the output, dot runs faster than the for loop, which in turn runs faster than '*. This contradicts what I had presumed.
So my questions are:
does Julia need (or sometimes need) vectorization to speed up computation?
If it does, when should vectorization be used, and which is the better way (consider dot and '*)?
If it does not, what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized code?
You are not running the benchmarks correctly, and the implementation of your function is suboptimal.
julia> using BenchmarkTools

julia> @btime $a' * $b
  429.400 μs (0 allocations: 0 bytes)
249985.3680190253

julia> @btime dot($a,$b)
  426.299 μs (0 allocations: 0 bytes)
249985.3680190253

julia> @btime inner_product($a, $b)
  970.500 μs (0 allocations: 0 bytes)
249985.36801903677
The correct implementation:
function inner_product_correct(a, b)
    result = 0.0  # use the same type as the elements in the args
    @simd for i in 1:length(a)
        @inbounds result += a[i] * b[i]
    end
    return result
end
julia> @btime inner_product_correct($a, $b)
530.499 μs (0 allocations: 0 bytes)
249985.36801902478
There is still a difference (though a less significant one) because dot uses the optimized BLAS implementation, which is multi-threaded. You could (following Bogumil's comment) set OPENBLAS_NUM_THREADS=1, and then you will find that the time of the BLAS function is identical to that of the Julia implementation.
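For example (a quick sketch; BLAS.set_num_threads changes the BLAS thread count from within a running session):

using LinearAlgebra, BenchmarkTools

BLAS.set_num_threads(1)  # or set OPENBLAS_NUM_THREADS=1 before starting Julia
@btime dot($a, $b)       # now single-threaded, comparable to the hand-written loop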
Note also that working with floating-point numbers is tricky in many ways:
julia> inner_product_correct(a, b)==dot(a,b)
false
julia> inner_product_correct(a, b) ≈ dot(a,b)
true
Finally, in Julia, deciding whether to use vectorization or write the loop yourself is up to you - there is no performance penalty (as long as you write type-stable code and use @simd and @inbounds where required). However, in your code you were not testing vectorization; you were comparing calling BLAS against writing the loop yourself. Here is the must-read to understand what is going on: https://docs.julialang.org/en/v1/manual/performance-tips/
Let me add my practical experience as an answer (it is too long for a standard comment):
does Julia need (or sometimes need) vectorization to speed up computation?
Julia does not need vectorization the way Python does (see the answer by Przemysław), but in practice, if you have a well-written vectorized function (like dot), then use it: while possible, it can be tricky to write an equally performant function yourself (people have probably spent days optimizing dot, especially its use of multiple threads).
If it does, when should vectorization be used, and which is the better way (consider dot and '*)?
When you use vectorized code, it all depends on the implementation of the function you want to use. In this case dot(a, b) and a' * b are exactly the same, as running @edit a' * b shows; in this case it gives you:
*(u::AdjointAbsVec{<:Number}, v::AbstractVector{<:Number}) = dot(u.parent, v)
and you see it is the same.
If it does not, what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized code?
Julia is a compiled language, while Python is an interpreted language. In some cases the Python interpreter can provide fast execution, but in other cases it currently cannot (which does not mean that it will not improve in the future). In particular, vectorized functions (like dot in your question) are most likely written in some compiled language, so Julia and Python will not differ much in typical cases, as they both just call this compiled function. However, when you use loops (non-vectorized code), Python will currently be slower than Julia.
I am currently attempting to implement a metaheuristic (genetic) algorithm. In this venture I also want to try to write somewhat fast and efficient code. However, my experience with writing efficient code is not very great. I was therefore wondering if some people could give some "quick tips" to increase the efficiency of my code. I have created a small functional example which contains most of the elements that the full code will contain with regard to preallocating arrays, custom mutable structs, random numbers, pushing into arrays, etc.
One option I have already attempted to explore is the StaticArrays package. However, many of my arrays must be mutable (so we would need MArrays), and many of them will become very large (> 100 elements). The StaticArrays documentation specifies that static arrays must remain small for the package to stay efficient.
According to the documentation, Julia 1.5.2 should be thread-safe with regard to rand(). I have therefore attempted to multithread the for loops in my functions to make them run faster, and this results in a slight performance increase.
However, if people can see a more efficient way of allocating arrays or pushing SpotPrices into an array, it would be greatly appreciated! Any other performance tips are also very welcome!
# Packages
clearconsole()
using DataFrames
using Random
using BenchmarkTools

Random.seed!(42)

df = DataFrame( SpotPrice = convert(Array{Float64}, rand(-266:500, 8832)),
                month = repeat([1,2,3,4,5,6,7,8,9,10,11,12]; outer = 736),
                hour = repeat([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]; outer = 368))

# Data structure for the prices per hour
mutable struct SpotPrices
    hour :: Array{Float64,1}
end

# Fill-out data structure
function setup_prices(df::DataFrame)
    prices = []
    for i in 1:length(unique(df[:,3]))
        push!(prices, SpotPrices(filter(row -> row.hour == i, df).SpotPrice))
    end
    return prices
end

prices = setup_prices(df)

# Sampler function
function MC_Sampler(prices::Vector{Any}, sample_size::Int64)
    # Picking the samples
    tmp = zeros(sample_size, 24)
    # Sampling per hour
    for i in 1:24
        tmp[:,i] = rand(prices[i].hour, sample_size)
    end
    return tmp
end

samples = MC_Sampler(prices, 100)

@btime setup_prices(df)
@btime MC_Sampler(prices, 100)

function setup_prices_par(df::DataFrame)
    prices = []
    @sync Threads.@threads for i in 1:length(unique(df[:,3]))
        push!(prices, SpotPrices(filter(row -> row.hour == i, df).SpotPrice))
    end
    return prices
end

# Sampler function
function MC_Sampler_par(prices::Vector{Any}, sample_size::Int64)
    # Picking the samples
    tmp = zeros(sample_size, 24)
    # Sampling per hour
    @sync Threads.@threads for i in 1:24
        tmp[:,i] = rand(prices[i].hour, sample_size)
    end
    return tmp
end

@btime setup_prices_par(df)
@btime MC_Sampler_par(prices, 100)
Have a look at, and read very carefully, https://docs.julialang.org/en/v1/manual/performance-tips/
Basic cleanups start with:
Your SpotPrices struct does not need to be mutable. Anyway, since there is only one field, you could just define it as SpotPrices = Vector{Float64}
You do not want untyped containers - instead of prices = [] do prices = Float64[]
Using DataFrames.groupby will be much faster than finding unique elements and filtering by them
If you do not need to initialize, then do not do it - Vector{Float64}(undef, sample_size) is much faster than zeros(sample_size, 24)
You do not need to synchronize with @sync before a multi-threaded loop
Create random states - one separate one for each thread - and use them whenever calling the rand function (a sketch applying these points follows below)
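A minimal sketch applying those points (the names with a 2 suffix are mine; this assumes the df from the question):

using DataFrames, Random

# Point 1: a plain vector alias instead of a one-field mutable struct
const SpotPrices2 = Vector{Float64}

# Points 2-3: typed container + groupby instead of unique/filter
function setup_prices2(df::DataFrame)
    prices = SpotPrices2[]
    for g in groupby(df, :hour; sort = true)
        push!(prices, Vector{Float64}(g.SpotPrice))
    end
    return prices
end

# Points 4-6: undef allocation, no @sync, one RNG per thread
function MC_Sampler2(prices::Vector{SpotPrices2}, sample_size::Int)
    rngs = [MersenneTwister() for _ in 1:Threads.nthreads()]
    tmp = Matrix{Float64}(undef, sample_size, 24)
    Threads.@threads for i in 1:24
        tmp[:, i] = rand(rngs[Threads.threadid()], prices[i], sample_size)
    end
    return tmp
end

samples = MC_Sampler2(setup_prices2(df), 100)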
Cython starter here. I am trying to speed up the calculation of a certain pairwise statistic (in several bins) by using multiple threads. In particular, I am using prange from cython.parallel, which internally uses OpenMP.
The following minimal example illustrates the problem (compilation via Jupyter notebook Cython magic).
Notebook setup:
%load_ext Cython
import numpy as np
Cython code:
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij, Yij
        double[:] Z = np.zeros(nbins, dtype=np.float64)
        int i, j, b

    with nogil, parallel(num_threads=num_threads):
        for i in prange(N, schedule='static', chunksize=1):
            for j in range(i):
                # some pairwise quantities
                Xij = X[i] - X[j]
                Yij = 0.5 * (X[i] + X[j])
                # check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z[b] += Xij * Yij

    return np.asarray(Z)
Mock data and bins:
X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')
Timing via
%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)
yields
1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop
which is not a perfect scaling, but that is not the main point of the question. (But do let me know if you have suggestions beyond adding the usual decorators or fine-tuning the prange arguments.)
However, this calculation is apparently not thread-safe:
Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)
reveals a significant difference between the two results (up to 20% in this example).
I strongly suspect that the problem is that multiple threads can do
Z[b] += Xij*Yij
at the same time. But what I don't know is how to fix this without sacrificing the speed-up.
In my actual use case, the calculation of Xij and Yij is more expensive, hence I would like to do them only once per pair. Also, pre-computing and storing Xij and Yij for all pairs and then simply looping through bins is not a good option either because N can get very large, and I can't store 100,000 x 100,000 numpy arrays in memory (this was actually the main motivation for rewriting it in Cython!).
System info (added following suggestion in comments):
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB
Yes, Z[b] += Xij*Yij is indeed a race condition.
There are a couple of options for making this atomic or critical. Implementation issues with Cython aside, you would in any case get bad performance due to false sharing of the shared Z vector.
So the better alternative is to reserve a private array for each thread. There are a couple of (non-)options again. One could use a private malloc'd pointer, but I wanted to stick with np. Memoryviews cannot be assigned as private variables. A two-dimensional (num_threads, nbins) array works, but for some reason generates very complicated, inefficient array indexing code; it works but is slower and does not scale.
A flat numpy array with manual "2D" indexing works well. You get a little extra performance by padding each thread's private part of the array to 64 bytes, a typical cache-line size; this avoids false sharing between the cores. The private parts are simply summed up serially outside of the parallel region.
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij, Yij
        # pad the per-thread data to 64 bytes to avoid false sharing of cache lines
        int nbins_padded = (((nbins - 1) // 8) + 1) * 8
        double[:] Z_local = np.zeros(nbins_padded * num_threads, dtype=np.float64)
        double[:] Z = np.zeros(nbins)
        int i, j, b, bb, tid

    with nogil, parallel(num_threads=num_threads):
        tid = openmp.omp_get_thread_num()
        for i in prange(N, schedule='static', chunksize=1):
            for j in range(i):
                # some pairwise quantities
                Xij = X[i] - X[j]
                Yij = 0.5 * (X[i] + X[j])
                # check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z_local[tid * nbins_padded + b] += Xij * Yij

    for tid in range(num_threads):
        for bb in range(nbins):
            Z[bb] += Z_local[tid * nbins_padded + bb]

    return np.asarray(Z)
This performs quite well on my 4-core machine, with 720 ms / 191 ms, a speedup of about 3.8. The remaining gap may be due to turbo mode. I don't have access to a proper machine for testing right now.
You are right that the access to Z is under a race condition.
You might be better off defining num_threads copies of Z, as cdef double[:,:] Z = np.zeros((num_threads, nbins), dtype=np.float64), and performing a sum along axis 0 after the prange loop:
return np.sum(Z, axis=0)
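A sketch of that variant (unverified; note the first answer reports that 2-D indexing like this compiles to slower code than the flat padded array):

%%cython --compile-args=-fopenmp --link-args=-fopenmp
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp

@boundscheck(False)
def my_parallel_statistic_2d(double[:] X, double[:,::1] bins, int num_threads):
    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij, Yij
        # one private row per thread; note the 2-D memoryview type
        double[:,:] Z = np.zeros((num_threads, nbins), dtype=np.float64)
        int i, j, b, tid

    with nogil, parallel(num_threads=num_threads):
        tid = openmp.omp_get_thread_num()
        for i in prange(N, schedule='static', chunksize=1):
            for j in range(i):
                Xij = X[i] - X[j]
                Yij = 0.5 * (X[i] + X[j])
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z[tid, b] += Xij * Yij

    # serial reduction over the per-thread rows
    return np.asarray(Z).sum(axis=0)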
Cython code can have a with gil statement in a parallel region, but it is only documented for error handling. You could have a look at the generated C code to see whether that would trigger an atomic OpenMP operation, but I doubt it.