Cart-Pole Python Performance Comparison - python-3.x

I am comparing a cart and pole simulation between Python 3.7 and Julia 1.2. In Python the simulation is written as a class, as seen below, and in Julia it is just a function. I am getting a consistent 0.2 seconds time to solve using Julia, which is much slower than Python. I do not understand Julia well enough to know why. My guess is that it has something to do with compiling or the way the loop is set up.
import math
import random
from collections import namedtuple

RAD_PER_DEG = 0.0174533
DEG_PER_RAD = 57.2958

State = namedtuple('State', 'x x_dot theta theta_dot')


class CartPole:
    """ Model for the dynamics of an inverted pendulum
    """
    def __init__(self):
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.length = 0.5      # actually half the pole's length
        self.force_mag = 10.0
        self.tau = 0.02        # seconds between state updates

        self.x = 0
        self.x_dot = 0
        self.theta = 0
        self.theta_dot = 0

    @property
    def state(self):
        return State(self.x, self.x_dot, self.theta, self.theta_dot)

    def reset(self, x=0, x_dot=0, theta=0, theta_dot=0):
        """ Reset the model of a cartpole system to its initial conditions
            theta is in radians
        """
        self.x = x
        self.x_dot = x_dot
        self.theta = theta
        self.theta_dot = theta_dot

    def step(self, action):
        """ Move the state of the cartpole simulation forward one time unit
        """
        total_mass = self.masspole + self.masscart
        pole_masslength = self.masspole * self.length
        force = self.force_mag if action else -self.force_mag
        costheta = math.cos(self.theta)
        sintheta = math.sin(self.theta)

        temp = (force + pole_masslength * self.theta_dot ** 2 * sintheta) / total_mass

        # theta acceleration
        theta_dotdot = (
            (self.gravity * sintheta - costheta * temp)
            / (self.length *
               (4.0/3.0 - self.masspole * costheta * costheta /
                total_mass)))

        # x acceleration
        x_dotdot = temp - pole_masslength * theta_dotdot * costheta / total_mass

        self.x += self.tau * self.x_dot
        self.x_dot += self.tau * x_dotdot
        self.theta += self.tau * self.theta_dot
        self.theta_dot += self.tau * theta_dotdot

        return self.state
To run the simulation, the following code was used:
from cartpole import CartPole
import time

cp = CartPole()

start = time.time()
for i in range(100000):
    cp.step(True)
end = time.time()

print(end - start)
The Julia code is:
function cartpole(state, action)
    """Cart and Pole simulation in discrete time
    Inputs: cartpole( state, action )
    state: 1X4 array [cart_position, cart_velocity, pole_angle, pole_velocity]
    action: Boolean True or False where true is a positive force and False is a negative force
    """
    gravity = 9.8
    masscart = 1.0
    masspole = 0.1
    l = 0.5            # actually half the pole's length
    force_mag = 10.0
    tau = 0.02         # seconds between state updates
    # x = 0
    # x_dot = 0
    # theta = 0
    # theta_dot = 0

    x = state[1]
    x_dot = state[2]
    theta = state[3]
    theta_dot = state[4]

    total_mass = masspole + masscart
    pole_massl = masspole * l

    if action == 0
        force = force_mag
    else
        force = -force_mag
    end

    costheta = cos(theta)
    sintheta = sin(theta)

    temp = (force + pole_massl * theta_dot^2 * sintheta) / total_mass
    # theta acceleration
    theta_dotdot = (gravity * sintheta - costheta * temp) / (l * (4.0/3.0 - masspole * costheta * costheta / total_mass))
    # x acceleration
    x_dotdot = temp - pole_massl * theta_dotdot * costheta / total_mass

    x += tau * x_dot
    x_dot += tau * x_dotdot
    theta += tau * theta_dot
    theta_dot += tau * theta_dotdot

    new_state = [x x_dot theta theta_dot]
    return new_state
end
The run code is:
@time include("cartpole.jl")

function run_sim()
    """Runs the cartpole simulation
    No inputs or outputs
    Use with @time run_sim() for timing purposes.
    """
    state = [0 0 0 0]
    for i = 1:100000
        state = cartpole( state, 0)
        #print(state)
        #print("\n")
    end
end

@time run_sim()

Your Python version takes 0.21s on my laptop. Here are timing results for the original Julia version on the same system:
julia> @time run_sim()
  0.222335 seconds (654.98 k allocations: 38.342 MiB)
julia> @time run_sim()
  0.019425 seconds (100.00 k allocations: 10.681 MiB, 37.52% gc time)
julia> @time run_sim()
  0.010103 seconds (100.00 k allocations: 10.681 MiB)
julia> @time run_sim()
  0.012553 seconds (100.00 k allocations: 10.681 MiB)
julia> @time run_sim()
  0.011470 seconds (100.00 k allocations: 10.681 MiB)
julia> @time run_sim()
  0.025003 seconds (100.00 k allocations: 10.681 MiB, 52.82% gc time)
The first run includes JIT compilation and takes ~0.2s, whereas after that each run takes 10-20ms. That breaks down into ~10ms of actual compute time and ~10ms of garbage collection, triggered every four calls or so. That means that Julia is about 10-20x faster than Python, excluding JIT compilation time, which is not bad for a straight port.
Why not count JIT time when benchmarking? Because you don't actually care about how long it takes to run fast programs like benchmarks. You're timing small benchmark problems to extrapolate how long it will take to run larger problems where speed really matters. JIT compilation time is proportional to the amount of code you're compiling, not to problem size. So when solving larger problems that you actually care about, the JIT compilation will still only take 0.2s, which is a negligible fraction of execution time for large problems.
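A simple way to exclude compilation from your own measurements (a minimal sketch, using the run_sim defined above) is to call the function once before timing it:
run_sim()        # first call triggers JIT compilation
@time run_sim()  # later calls measure steady-state performance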
Now, let's see about making the Julia code even faster. This is actually very simple: use a tuple instead of a row vector for your state. So initialize the state as state = (0, 0, 0, 0) and then update the state similarly:
new_state = (x, x_dot, theta, theta_dot)
That's it, otherwise the code is identical. For this version the timings are:
julia> @time run_sim()
  0.132459 seconds (479.53 k allocations: 24.020 MiB)
julia> @time run_sim()
  0.008218 seconds (4 allocations: 160 bytes)
julia> @time run_sim()
  0.007230 seconds (4 allocations: 160 bytes)
julia> @time run_sim()
  0.005379 seconds (4 allocations: 160 bytes)
julia> @time run_sim()
  0.008773 seconds (4 allocations: 160 bytes)
The first run still includes JIT time. Subsequent runs are now 5-10ms, which is about 25-40x faster than the Python version. Note that there are almost no allocations: the few remaining allocations are just for return values and won't trigger GC if this is called from other code in a loop.

Okay, so I've just run your Python and Julia code, and I get different results: 1.41 s for 10 million iterations in Julia and 25.5 s for 10 million iterations in Python. Already, Julia is 18x faster!
I think perhaps the issue is that @time is not accurate when run in global scope; you need multi-second timings for it to be accurate enough. You can use the BenchmarkTools package to get accurate timings of small functions.
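For example, a minimal sketch with BenchmarkTools (assuming run_sim and cartpole are already defined as above):
using BenchmarkTools

@btime run_sim()            # runs the function many times and reports the minimum time

state = [0 0 0 0]
@btime cartpole($state, 0)  # interpolate globals with $ so global-variable overhead is not measured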

Standard performance tips apply: https://docs.julialang.org/en/v1/manual/performance-tips/index.html
In particular, use dots to avoid allocations, and fuse loops. Also, for this kind of small-array computation, consider using https://github.com/JuliaArrays/StaticArrays.jl, which is much faster.
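For instance, the state could be kept in a stack-allocated SVector; this is just a sketch of the idea, and the body of cartpole stays the same:
using StaticArrays

state = SVector(0.0, 0.0, 0.0, 0.0)              # fixed-size, stack-allocated state
# ...and at the end of cartpole, return it the same way:
new_state = SVector(x, x_dot, theta, theta_dot)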

Related

Slower times in multithreading using Julia 1.3.1

I recently started using the new multithreading interface in version 1.3.1. After trying the fibonacci example in this blog post and getting significant speedups, I started experimenting with some old algorithms of mine.
I have a function that uses the trapezoid method to calculate integrals, either below or above a curve:
function trapezoid( x :: AbstractVector ,
                    y :: AbstractVector ;
                    y0 :: Number = 0.0 ,
                    inv :: Number = NaN )
    int = zeros(length(x)-1)
    for i = 2:length(x)
        if isnan(inv) == true
            int[i-1] = (y[i]+y[i-1]-2y0) * (x[i]-x[i-1]) / 2
        else
            int[i-1] = (2inv-(y[i]+y[i-1])-2y0) * (x[i]-x[i-1]) / 2
        end # if
    end # for
    integral = sum(int) ;
    return integral
end
Then I have a very inefficient algorithm that determines the midpoint index of a curve by comparing the areas below and above the curve:
function EAM_without_threads( x :: Vector{Float64} ,
                              y :: Vector{Float64} ,
                              y0 :: Real ,
                              ymean :: Real )
    approx = Vector{Float64}(undef,length(x)-1)
    for i in 1:length(x)-1
        x1 = @view(x[1:i ])
        x2 = @view(x[i:end])
        y1 = @view(y[1:i ])
        y2 = @view(y[i:end])
        Al = trapezoid( x1 , y1 , y0=y0 )
        Au = trapezoid( x2 , y2 , inv=ymean )
        approx[i] = abs(Al-Au)
    end
    minind = findmin(approx)[2]
    return x[minind]
end
And:
function EAM_with_threads( x :: Vector{Float64} ,
                           y :: Vector{Float64} ,
                           y0 :: Real ,
                           ymean :: Real )
    approx = Vector{Float64}(undef,length(x)-1)
    for i in 1:length(x)-1
        x1 = @view(x[1:i ])
        x2 = @view(x[i:end])
        y1 = @view(y[1:i ])
        y2 = @view(y[i:end])
        Al = Threads.@spawn trapezoid( x1 , y1 , y0=y0 )
        Au = Threads.@spawn trapezoid( x2 , y2 , inv=ymean )
        approx[i] = abs(fetch(Al)-fetch(Au))
    end
    minind = findmin(approx)[2]
    return x[minind]
end
This is what I used to try both functions:
using SpecialFunctions
using BenchmarkTools
x = collect(-10.0:5e-4:10.0)
y = erf.(x)
And then got these results:
julia> @btime EAM_without_threads(x,y,-1.0,1.0)
  7.515 s (315905 allocations: 11.94 GiB)
julia> @btime EAM_with_threads(x,y,-1.0,1.0)
  10.295 s (1274131 allocations: 12.00 GiB)
I don't understand... Using htop I can see that all my 8 threads are working almost at full capacity. This is my machine:
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 8
I know about the overhead of dealing with several threads, and in small problems I understand if it's slower, but why in this case?
I'm also searching for multithreading "good practices", because I guess not every piece of code will benefit from parallelism.
Thank you all in advance.
Your code is doing some very redundant work here. It's doing a full trapezoidal integral for each step, instead of just updating Al and Au incrementally. Here I've rewritten the code so that it does zero allocations, and on my computer my version of the EAM is five orders of magnitude faster than the original, without using any threads.
In general: before you start looking into things like threading, consider whether your algorithm is efficient. You can get much bigger speedups from a fast algorithm than from threading.
function trapz(x, y; y0=0.0, inv=NaN)
    length(x) != length(y) && error("Input arrays cannot have different lengths")
    s = zero(eltype(x))
    if isnan(inv)
        @inbounds for i in eachindex(x, y)[1:end-1]
            s += (y[i+1] + y[i] - 2y0) * (x[i+1] - x[i])
        end
    else
        @inbounds for i in eachindex(x, y)[1:end-1]
            s += (2inv - (y[i+1] + y[i]) - 2y0) * (x[i+1] - x[i])
        end
    end
    return s / 2
end
function eam(x, y, y0, ymean)
    length(x) != length(y) && error("Input arrays cannot have different lengths")
    Au = trapz(x, y; inv=ymean)
    Al = zero(Au)
    amin = abs(Al - Au)
    ind = firstindex(x)
    @inbounds for i in eachindex(x, y)[2:end-1]  # 2:length(x)-1
        Al += (y[i] + y[i-1] - 2y0) * (x[i] - x[i-1]) / 2
        Au -= (2ymean - (y[i] + y[i-1])) * (x[i] - x[i-1]) / 2
        aval = abs(Al - Au)
        if aval < amin
            (amin, ind) = (aval, i)
        end
    end
    return x[ind]
end
Benchmarks here (I use @time for your code and @btime for my own, since it would just be too time-consuming to use @btime on really slow code):
julia> x = collect(-10.0:5e-4:10.0);
julia> y = erf.(x);
julia> @time EAM_without_threads(x, y, -1.0, 1.0)
 15.611004 seconds (421.72 k allocations: 11.942 GiB, 11.73% gc time)
0.0
julia> @btime eam($x, $y, -1.0, 1.0)
  181.991 μs (0 allocations: 0 bytes)
0.0
A small extra remark: you should not write if isnan(inv) == true; the comparison with true is redundant. Just write if isnan(inv).
Try this
function EAM_with_threads( x :: Vector{Float64} ,
                           y :: Vector{Float64} ,
                           y0 :: Real ,
                           ymean :: Real )
    approx = Vector{Float64}(undef,length(x)-1)
    Threads.@threads for i in 1:length(x)-1
        x1 = @view(x[1:i ])
        x2 = @view(x[i:end])
        y1 = @view(y[1:i ])
        y2 = @view(y[i:end])
        Al = trapezoid( x1 , y1 , y0=y0 )
        Au = trapezoid( x2 , y2 , inv=ymean )
        approx[i] = abs(Al-Au)
    end
    minind = findmin(approx)[2]
    return x[minind]
end
Your for loop is easy to parallelize, so the lowest-hanging fruit is to run each iteration of the for loop in parallel. It is much easier to reduce the overall time taken this way than by trying to parallelize the work inside each iteration.
I know about the overhead of dealing with several threads, and in
small problems I understand if it's slower, but why in this case?
Well, I think your first problem is that you didn't benchmark how long it takes to do
trapezoid( x1 , y1 , y0=y0 )
If you had, you would have found that it takes hardly any time at all. Anything that does not take a substantial amount of time is not worth running in parallel. If A and B are independent and they both take a long time, then you should do A and B in parallel. Otherwise, find something else to parallelize first.
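For instance, a quick check with BenchmarkTools (a sketch, using the x and y defined above; the exact slice is arbitrary):
using BenchmarkTools

x1 = @view(x[1:20000])
y1 = @view(y[1:20000])
@btime trapezoid($x1, $y1, y0 = -1.0)  # a single call is cheap compared with spawning and fetching a task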
Let's look at what you have:
x = collect(-10.0:5e-4:10.0)
and
for i in 1:length(x)-1
So your for loop has around 40,000 iterations.
Your multithreading method takes
total_time = setup_time * 40000 + ind_work_time/2 * 40000
Whereas parallelizing the for loop takes
total_time = setup_time * 1 + ind_work_time * 40000/8
For comparison, the non-multithreaded method takes
total_time = ind_work_time * 40000
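Plugging made-up numbers into these formulas shows the scaling (these constants are purely illustrative, not measured):
setup_time    = 5.0e-6    # hypothetical cost of spawning and fetching one task
ind_work_time = 2.0e-5    # hypothetical cost of one trapezoid call
n = 40_000

spawn_per_iteration = setup_time * n + ind_work_time/2 * n    # ≈ 0.6 s
threaded_outer_loop = setup_time * 1 + ind_work_time * n / 8  # ≈ 0.1 s
serial              = ind_work_time * n                       # ≈ 0.8 s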

Manually implementing approximation functions

I have a dataset from Kaggle of 45,253 rows and a single column for temperature in Kelvin for the city of Detroit. Its mean = 282.97, std = 11, min = 243.48, max = 308.05.
This is the result when plotted as a histogram of 100 bins with density=True: [histogram image]
I am expected to write the following two approximation functions and see which one approximates the histogram most closely:
Like this one here using scipy.stats.norm.pdf: [fitted PDF image]
I generated the above image using:
x = np.linspace(dataset.Detroit.min(), dataset.Detroit.max(), 1001)
P_norm = norm.pdf(x, dataset.Detroit.mean(), dataset.Detroit.std())
plot_pdf_single(x, P_norm)
However, whenever I try to implement either of the two approximation functions, all of my values for P_norm come out as 0 or inf.
This is what I tried:
P_norm = [(1.0/(np.sqrt(2.0*pi*(std*std))))*np.exp(((-x_i-mu)*(-x_i-mu))/(2.0*(std*std))) for x_i in x]
I also broke it down into parts for a single x_i:
part1 = ((-x[0] - mu)*(-x[0] - mu)) / (2.0*(std * std))
part2 = np.exp(part1)
part3 = 1.0 / (np.sqrt(2.0 * pi * (std*std)))
total = part3*part2
I got the following values:
1145.3913234604413
inf
0.036267480036493875
inf
Since both of the equations use the same formula:
def pdf_approximation(x_i, mu, std):
    return (1.0 / (np.sqrt(2.0 * pi * (std*std)))) * np.exp((-(x_i-mu)*(x_i-mu)) / (2.0 * (std*std)))
The code for the first approximation is:
mu = 283
std = 11
P_norm = np.array([pdf_approximation(x_i, mu, std) for x_i in x])
plot_pdf_single(x, P_norm)
The code for the second approximation is:
mu1 = 276
std1 = 6
mu2 = 293
std2 = 6.5
P_norm = np.array([(pdf_approximation(x_i, mu1, std1) * 0.5) + (pdf_approximation(x_i, mu2, std2) * 0.5) for x_i in x])
plot_pdf_single(x, P_norm)

How do I fix this error when converting Matlab code to Python?

I converted Matlab code into Python by manually typing it out. However, I keep getting an error message which I still have not been able to fix. What am I doing wrong, and how do I get the same plot as in Matlab? Just a little information about the code: this is an explicit finite difference method for solving the pressure distribution in an oil reservoir with production from the middle block only. It is similar to the heat equation, U_t = U_xx.
The error is:
P_new[N] = 4000 #last blocks at all time levels equals 4000
IndexError: index 9 is out of bounds for axis 0 with size 9
The Matlab code, which runs OK, is below; the Python code follows it.
clear
clc
% Solution of P_t = P_{xx}
L = 1000 ; %ft length of reservoir
W = 100 ; %ft reservoir width
h = 50 ;%ft pay thickness
poro = 0.25; % rock porosity
k_o = 5; %md effective perm to oil
P_i = 4000; %psia initial pressure
B_o = 1.25; %oil formation vol fact
mu = 5; %cp oil visc
c_t = 0.0000125; %1/atm total compressibility
Q_o = 10;%stb/day production rate from central well
alpha = c_t*mu*poro/k_o;
T = 1;
N_time = 50;
dt = T/N_time;
% % Number of grid cells
N =9; %number of grid cells
%N =11;%number of grid cells
dx = (L/(N-1)); %distance between grid blocks
x = 0+dx*0.5:dx:L+dx; %points in space
for i=1:N
P_old(i)=P_i;
FPT(i)=0;
end
FPT((N+1)/2)=-Q_o*B_o*mu/1.127/W/dx/h/k_o; %source term at the center block of grid cell
P_new = P_old;
for j = 1:N_time
for k = 1: N
if k<2
P_new(k)=4000;%P_old(k)+dt/alpha*((P_old(k+1)-2*P_old(k)+P_old(k))/dx^2+FPT(k));
elseif k > N-1
P_new(k) = 4000;%P_old(k)+dt/alpha*((P_old(k)-2*P_old(k)+P_old(k-1))/dx^2+FPT(k));
else
P_new(k) = P_old(k)+dt/alpha*((P_old(k+1)-2*P_old(k)+P_old(k-1))/dx^2+FPT(k));
end
end
plot(x,P_new, '-x')
xlabel('X')
ylabel('P(X)')
hold on
grid on
%%update "u_old" before you move forward to the next time level
P_old = P_new;
end
hold off
Python Code:
import numpy as np
import matplotlib.pyplot as plt

# Solution of P_t = P_{xx}
L = 1000         #ft length of reservoir
W = 100          #ft reservoir width
h = 50           #ft pay thickness
poro = 0.25      # rock porosity
k_o = 5          #md effective perm to oil
P_i = 4000       #psia initial pressure
B_o = 1.25       #oil formation vol fact
mu = 5           #cp oil visc
c_t = 0.0000125  #1/atm total compressibility
Q_o = 10         #stb/day production rate from central well

alpha = c_t * mu * poro / k_o
T = 1
N_time = 20
dt = T / N_time

# % Number of grid cells
N = 9                #number of grid cells
dx = (L / (N - 1))   #distance between grid blocks
x = np.arange(0.0, L + dx, dx)

P_old = np.zeros_like(x)  #pressure at previous time level
P_new = np.zeros_like(x)  #pressure at previous time level
FPT = np.zeros_like(x)

for i in range(0, N):
    P_old[i] = P_i

FPT[int((N + 1) / 2)] = -Q_o * B_o * mu / (1.127 * W * dx * h * k_o)  # source term at the center block of grid cell
P_new = P_old
d = np.arange(0, N)

for j in range(0, N_time):
    for k in range(0, N):
        P_new[0] = 4000  #pressure at first block for all time levels equals 4000
        P_new[N] = 4000  #pressure at last block for all time levels equals 4000
        P_new[k] = P_old[k] + dt / alpha * ((P_old[k+1] - 2 * P_old[k] + P_old[k - 1]) / dx ** 2 + FPT[k])
    plt.plot(x, P_new)
    plt.xlabel('X')
    plt.ylabel('P(X)')
    P_old = P_new
Matlab uses 1-based indexing; Python arrays use 0-based indexing. If you define an array of length N in Python, the indices run from 0 to N-1.
So just replace index N with index N-1 in your code as below, and it works.
import numpy as np
import matplotlib.pyplot as plt

# Solution of P_t = P_{xx}
L = 1000         #ft length of reservoir
W = 100          #ft reservoir width
h = 50           #ft pay thickness
poro = 0.25      # rock porosity
k_o = 5          #md effective perm to oil
P_i = 4000       #psia initial pressure
B_o = 1.25       #oil formation vol fact
mu = 5           #cp oil visc
c_t = 0.0000125  #1/atm total compressibility
Q_o = 10         #stb/day production rate from central well

alpha = c_t * mu * poro / k_o
T = 1
N_time = 20
dt = T / N_time

# % Number of grid cells
N = 9                #number of grid cells
dx = (L / (N - 1))   #distance between grid blocks
x = np.arange(0.0, L + dx, dx)

P_old = np.zeros_like(x)  #pressure at previous time level
P_new = np.zeros_like(x)  #pressure at previous time level
FPT = np.zeros_like(x)

for i in range(0, N):
    P_old[i] = P_i

FPT[int((N + 1) / 2)] = -Q_o * B_o * mu / (1.127 * W * dx * h * k_o)  # source term at the center block of grid cell
P_new = P_old
d = np.arange(0, N)

for j in range(0, N_time):
    for k in range(0, N - 1):
        P_new[0] = 4000    #pressure at first block for all time levels equals 4000
        P_new[N-1] = 4000  #pressure at last block for all time levels equals 4000
        P_new[k] = P_old[k] + dt / alpha * ((P_old[k+1] - 2 * P_old[k] + P_old[k - 1]) / dx ** 2 + FPT[k])
    plt.plot(x, P_new)
    plt.xlabel('X')
    plt.ylabel('P(X)')
    P_old = P_new

plt.show()  # added so the figure is displayed when run as a script
Output: [plot of P(X) vs. X for each time level]

Why does a Spark stage's Executor Computing Time take much longer than usual when it runs as the first action?

Here's my code:
import time
import random
NUM_SAMPLES = 1000000
def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Here's the stages page: stage 0 takes about 0.5s longer than the other stages, and I want to know where that 0.5s goes. Additional info: the main difference is in Executor Computing Time.

Efficient way of loading minibatches < gpu memory

I have the following scenario:
My dataset >> GPU memory
My minibatches < GPU memory ... such that, depending on size, I can fit up to 10 in memory at once while still training without a problem.
The size of my dataset means I won't revisit datapoints, so I guess there is no point in making them shared? Or is there? I was thinking that it might be beneficial to have up to 10 shared initialised variables of size = mini-batch, so that I can swap 10 in at once instead of just one at a time. Also, is it possible to preload mini-batches in parallel?
If you're not revisiting datapoints then there probably isn't any value in using shared variables.
The following code could be modified and used to evaluate the different methods of getting data into your specific computation.
The "input" method is the one that will probably be best when you have no need to revisit data. The "shared_all" method may outperform everything else but only if you can fit the entire dataset in GPU memory. The "shared_batched" allows you to evaluate whether hierarchically batching your data could help.
In the "shared_batched" method, the dataset is divided into many macro batches and each macro batch is divided into many micro batches. A single shared variable is used to hold a single macro batch. The code evaluates all the micro batches within the current macro batch. Once a complete macro batch has been processed the next macro batch is loaded into the shared variable and the code iterates over the micro batches within it again.
In general, it might be expected that a small number of large memory transfers will be faster than a larger number of smaller transfers (where the total amount transferred is the same). But this needs to be tested (e.g. with the code below) before it can be known for sure; YMMV.
The use of the "borrow" parameter may also have a significant impact on the performance, but be aware of the implications before using it.
import math
import timeit
import numpy
import theano
import theano.tensor as tt


def test_input(data, batch_size):
    assert data.shape[0] % batch_size == 0
    batch_count = data.shape[0] / batch_size
    x = tt.tensor4()
    f = theano.function([x], outputs=x.sum())
    total = 0.
    start = timeit.default_timer()
    for batch_index in xrange(batch_count):
        total += f(data[batch_index * batch_size: (batch_index + 1) * batch_size])
    print 'IN\tNA\t%s\t%s\t%s\t%s' % (batch_size, batch_size, timeit.default_timer() - start, total)


def test_shared_all(data, batch_size):
    batch_count = data.shape[0] / batch_size
    for borrow in (True, False):
        start = timeit.default_timer()
        all = theano.shared(data, borrow=borrow)
        load_time = timeit.default_timer() - start
        x = tt.tensor4()
        i = tt.lscalar()
        f = theano.function([i], outputs=x.sum(), givens={x: all[i * batch_size:(i + 1) * batch_size]})
        total = 0.
        start = timeit.default_timer()
        for batch_index in xrange(batch_count):
            total += f(batch_index)
        print 'SA\t%s\t%s\t%s\t%s\t%s' % (
            borrow, batch_size, batch_size, load_time + timeit.default_timer() - start, total)


def test_shared_batched(data, macro_batch_size, micro_batch_size):
    assert data.shape[0] % macro_batch_size == 0
    assert macro_batch_size % micro_batch_size == 0
    macro_batch_count = data.shape[0] / macro_batch_size
    micro_batch_count = macro_batch_size / micro_batch_size
    macro_batch = theano.shared(numpy.empty((macro_batch_size,) + data.shape[1:], dtype=theano.config.floatX),
                                borrow=True)
    x = tt.tensor4()
    i = tt.lscalar()
    f = theano.function([i], outputs=x.sum(), givens={x: macro_batch[i * micro_batch_size:(i + 1) * micro_batch_size]})
    for borrow in (True, False):
        total = 0.
        start = timeit.default_timer()
        for macro_batch_index in xrange(macro_batch_count):
            macro_batch.set_value(
                data[macro_batch_index * macro_batch_size: (macro_batch_index + 1) * macro_batch_size], borrow=borrow)
            for micro_batch_index in xrange(micro_batch_count):
                total += f(micro_batch_index)
        print 'SB\t%s\t%s\t%s\t%s\t%s' % (
            borrow, macro_batch_size, micro_batch_size, timeit.default_timer() - start, total)


def main():
    numpy.random.seed(1)
    shape = (20000, 3, 32, 32)
    print 'Creating random data with shape', shape
    data = numpy.random.standard_normal(size=shape).astype(theano.config.floatX)
    print 'Running tests'
    for macro_batch_size in (shape[0] / pow(10, i) for i in xrange(int(math.log(shape[0], 10)))):
        test_shared_all(data, macro_batch_size)
        test_input(data, macro_batch_size)
        for micro_batch_size in (macro_batch_size / pow(10, i) for i in
                                 xrange(int(math.log(macro_batch_size, 10)) + 1)):
            test_shared_batched(data, macro_batch_size, micro_batch_size)


main()
