warp-ctc cannot run on GPU - PyTorch

I have some variables saved from crnn-pytorch, which runs on the GPU and uses warp-ctc.
warp-ctc: https://github.com/SeanNaren/warp-ctc
When I use those variables in my code, I can run it on the CPU and the results are correct. But when I run it on the GPU, I get a Segmentation fault error.
This is my code which runs on gpu:
import torch
from warpctc_pytorch import CTCLoss
import numpy as np
from torch.autograd import Variable
i = 1
criterion = CTCLoss()
criterion = criterion.cuda()
a = np.load('preds_%d.npy' % i)
b = np.load('text_%d.npy' % i)
c = np.load('preds_size_%d.npy' % i)
d = np.load('length_%d.npy' % i)
#a = Variable(torch.from_numpy(a)).cuda()
#b = Variable(torch.from_numpy(b)).cuda()
#c = Variable(torch.from_numpy(c)).cuda()
#d = Variable(torch.from_numpy(d)).cuda()
a = torch.from_numpy(a).cuda()
b = torch.from_numpy(b).cuda()
c = torch.from_numpy(c).cuda()
d = torch.from_numpy(d).cuda()
print(a.dtype)
print(b.dtype)
print(c.dtype)
print(d.dtype)
cost = criterion(a, b, c, d) / 64
print('cost:', cost)
I get a Segmentation fault. When I delete all the .cuda() calls, I get the correct answer.
All of warp-ctc's CPU and GPU tests have passed.
I really hope someone can help.
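One thing worth checking (this is my reading of the usage notes in the SeanNaren warp-ctc repository, not something verified against this exact setup): the binding expects only the activations on the GPU, while the labels and both length tensors are supposed to stay on the CPU as IntTensors. A minimal sketch of that call pattern, with made-up shapes:
import torch
from warpctc_pytorch import CTCLoss

criterion = CTCLoss().cuda()

# hypothetical shapes, just to illustrate the call pattern
T, B, A = 50, 64, 37                             # time steps, batch size, alphabet size
preds = torch.randn(T, B, A).cuda()              # activations live on the GPU
labels = torch.randint(1, A, (B * 10,)).int()    # targets stay on the CPU as an IntTensor
preds_sizes = torch.IntTensor([T] * B)           # output lengths, CPU
label_sizes = torch.IntTensor([10] * B)          # label lengths, CPU

cost = criterion(preds, labels, preds_sizes, label_sizes) / B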

Related

Translating a mixed-integer programming formulation to Scipy

I would like to solve the above formulation in SciPy using milp(). For a given graph (V, E), f_ij and x_ij are the decision variables. f_ij is the flow from i to j (it can be continuous). x_ij is the number of vehicles from i to j. p is the price. X is the available number of vehicles in a region. c is the capacity.
I have difficulty in translating the formulation to Scipy milp code. I would appreciate it if anyone could give me some pointers.
What I have done:
The code for equation (1):
f_obj = [p[i] for i in Edge]
x_obj = [0]*len(Edge)
obj = f_obj + x_obj
Integrality:
f_cont = [0 for i in Edge] # continuous
x_int = [1]*len(Edge) # integer
integrality = f_cont + x_int
Equation (2):
def constraints(self):
    b = []
    A = []
    const = [0]*len(Edge)  # for f_ij
    for i in v:  # for x_ij
        for e in Edge:
            if e[0] == i:
                const.append(1)
            else:
                const.append(0)
        A.append(const)
        b.append(self.accInit[i])
        const = [0]*len(Edge)  # for f_ij
    return A, b
Equation (4):
[(0, demand[e]) for e in Edge]
I'm going to do some wild guessing, given how much you've left open to interpretation. Let's assume that
this is a maximisation problem, since the minimisation problem is trivial
Expression (1) is actually the maximisation objective function, though you failed to write it as such
p and d are floating-point vectors
X is an integer vector
c is a floating-point scalar
the graph edges, since you haven't described them at all, do not matter for problem setup
The variable names are not well-chosen and hide what they actually contain. I demonstrate potential replacements.
import numpy as np
from numpy.random import Generator, default_rng
from scipy.optimize import milp, Bounds, LinearConstraint
import scipy.sparse

rand: Generator = default_rng(seed=0)
N = 20
price = rand.uniform(low=0, high=10, size=N) # p
demand = rand.uniform(low=0, high=10, size=N) # d
availability = rand.integers(low=0, high=10, size=N) # X aka. accInit
capacity = rand.uniform(low=0, high=10) # c
c = np.zeros(2*N) # f and x
c[:N] = -price # (1) f maximized with coefficients of 'p'
# x not optimized
CONTINUOUS = 0
INTEGER = 1
integrality = np.empty_like(c, dtype=int)
integrality[:N] = CONTINUOUS # f
integrality[N:] = INTEGER # x
upper = np.empty_like(c)
upper[:N] = demand # (4) f
upper[N:] = availability # (2) x
eye_N = scipy.sparse.eye(N)
A = scipy.sparse.hstack((-eye_N, capacity*eye_N)) # (3) 0 <= -f + cx
result = milp(
    c=c, integrality=integrality,
    bounds=Bounds(lb=np.zeros_like(c), ub=upper),
    constraints=LinearConstraint(lb=np.zeros(N), A=A),
)
print(result.message)
flow = result.x[:N]
vehicles = result.x[N:].astype(int)
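As a follow-up (my own addition, plain NumPy, no new API), a quick sanity check that the reported solution respects the demand bound (4), the availability bound (2) and the capacity coupling (3):
# sanity check of the returned solution (my addition; the 1e-9 tolerance is arbitrary)
assert result.success
print("flow within demand:      ", np.all(flow <= demand + 1e-9))
print("flow within capacity*x:  ", np.all(flow <= capacity*vehicles + 1e-9))
print("x within availability:   ", np.all(vehicles <= availability))
print("objective (total price): ", float(price @ flow))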

Optimizing asymmetrically reweighted penalized least squares smoothing (from matlab to python)

I'm trying to apply the method for baselining vibrational spectra which is presented as an improvement over asymmetric and iteratively re-weighted least-squares algorithms in the 2015 paper (doi:10.1039/c4an01061b), where the following matlab code was provided:
function z = baseline(y, lambda, ratio)
% Estimate baseline with arPLS in Matlab
N = length(y);
D = diff(speye(N), 2);
H = lambda*D'*D;
w = ones(N, 1);
while true
    W = spdiags(w, 0, N, N);
    % Cholesky decomposition
    C = chol(W + H);
    z = C \ (C' \ (w.*y));
    d = y - z;
    % make d-, and get w^t with m and s
    dn = d(d<0);
    m = mean(dn);
    s = std(dn);
    wt = 1./ (1 + exp(2*(d-(2*s-m))/s));
    % check exit condition and backup
    if norm(w-wt)/norm(w) < ratio, break; end
    w = wt;
end
that I rewrote into python:
def baseline_arPLS(y, lam, ratio):
    # Estimate baseline with arPLS
    N = len(y)
    k = [numpy.ones(N), -2*numpy.ones(N-1), numpy.ones(N-2)]
    offset = [0, 1, 2]
    D = diags(k, offset).toarray()
    H = lam * numpy.matmul(D.T, D)
    w_ = numpy.ones(N)
    while True:
        W = spdiags(w_, 0, N, N, format='csr')
        # Cholesky decomposition
        C = cholesky(W + H)
        z_ = spsolve(C.T, w_ * y)
        z = spsolve(C, z_)
        d = y - z
        # make d- and get w^t with m and s
        dn = d[d < 0]
        m = numpy.mean(dn)
        s = numpy.std(dn)
        wt = 1. / (1 + numpy.exp(2 * (d - (2*s - m)) / s))
        # check exit condition and backup
        norm_wt, norm_w = norm(w_ - wt), norm(w_)
        if (norm_wt / norm_w) < ratio:
            break
        w_ = wt
    return z
Besides the input vector y, the method requires the parameters lam and ratio. It runs OK for values lam < 1.e+07 and ratio > 1.e-01, but outputs poor results. When the values are changed outside this range, for example lam = 1e+07, ratio = 1e-02, the CPU starts heating up and the job never finishes (I interrupted it after 1 min). Also, in both cases the following warning shows up:
/usr/local/lib/python3.9/site-packages/scipy/sparse/linalg/dsolve/linsolve.py:144: SparseEfficiencyWarning: spsolve requires A to be CSC or CSR matrix format
  warn('spsolve requires A to be CSC or CSR format',
although I added the recommended format='csr' option to the spdiags call.
And here's some synthetic data (similar to that in the paper) for testing purposes. The noise was added along with a 3rd-degree polynomial baseline. The method works well for parameters bl_1 and fails to converge for bl_2:
import numpy
from matplotlib import pyplot
from scipy.sparse import spdiags, diags, identity
from scipy.sparse.linalg import spsolve
from numpy.linalg import cholesky, norm
import sys
x = numpy.arange(0, 1000)
noise = numpy.random.uniform(low=0, high = 10, size=len(x))
poly_3rd_degree = numpy.poly1d([1.2e-06, -1.23e-03, .36, -4.e-04])
poly_baseline = poly_3rd_degree(x)
y = 100 * numpy.exp(-((x-300)/15)**2) + \
    200 * numpy.exp(-((x-750)/30)**2) + \
    100 * numpy.exp(-((x-800)/15)**2) + noise + poly_baseline
bl_1 = baseline_arPLS(y, 1e+07, 1e-01)
bl_2 = baseline_arPLS(y, 1e+07, 1e-02)
pyplot.figure(1)
pyplot.plot(x, y, 'C0')
pyplot.plot(x, poly_baseline, 'C1')
pyplot.plot(x, bl_1, 'k')
pyplot.show()
sys.exit(0)
All this is telling me that I'm doing something very non-optimal in my python implementation. Since I'm not knowledgeable enough about the intricacies of scipy computations, I'm kindly asking for suggestions on how to achieve convergence in these calculations.
(I encountered an issue in running the "straight" matlab version of the code because the line D = diff(speye(N), 2); truncates the last two rows of the matrix, creating dimension mismatch later in the function. Following the description of matrix D's appearance I substituted this line by directly creating a tridiagonal matrix using the diags function.)
Guided by the comment @hpaulj made, and suspecting that the loop exit wasn't coded properly, I revisited the paper and found that the authors actually implemented an exit condition that is not featured in their matlab script. Changing the while-loop condition provides an exit for any set of parameters; my understanding is that the algorithm is not guaranteed to converge in all cases, which is why this condition is necessary but was presumably omitted in error. Here's the edited version of my python code:
def baseline_arPLS(y, lam, ratio):
    # Estimate baseline with arPLS
    N = len(y)
    k = [numpy.ones(N), -2*numpy.ones(N-1), numpy.ones(N-2)]
    offset = [0, 1, 2]
    D = diags(k, offset).toarray()
    H = lam * numpy.matmul(D.T, D)
    w_ = numpy.ones(N)
    i = 0
    N_iterations = 100
    while i < N_iterations:
        W = spdiags(w_, 0, N, N, format='csr')
        # Cholesky decomposition
        C = cholesky(W + H)
        z_ = spsolve(C.T, w_ * y)
        z = spsolve(C, z_)
        d = y - z
        # make d- and get w^t with m and s
        dn = d[d < 0]
        m = numpy.mean(dn)
        s = numpy.std(dn)
        wt = 1. / (1 + numpy.exp(2 * (d - (2*s - m)) / s))
        # check exit condition and backup
        norm_wt, norm_w = norm(w_ - wt), norm(w_)
        if (norm_wt / norm_w) < ratio:
            break
        w_ = wt
        i += 1
    return z
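As a side note (my own sketch, not part of the original fix): the SparseEfficiencyWarning appears because numpy.linalg.cholesky returns a dense array, so spsolve ends up being fed dense matrices no matter what format spdiags uses. Since W + H is symmetric positive definite, one option is to skip the explicit Cholesky factor and let a sparse direct solver compute z from (W + H) z = w*y, keeping everything sparse; building D the way the paper's diff(speye(N), 2) does is my assumption about the intended matrix:
import numpy
from numpy.linalg import norm
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_arPLS_sparse(y, lam, ratio, max_iter=100):
    # same algorithm, but W + H stays sparse so spsolve never sees a dense matrix
    N = len(y)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(N - 2, N), format='csc')
    H = lam * (D.T @ D)
    w = numpy.ones(N)
    z = y
    for _ in range(max_iter):
        W = sparse.diags(w, 0, shape=(N, N), format='csc')
        z = spsolve(W + H, w * y)   # direct sparse solve of (W + H) z = w*y
        d = y - z
        dn = d[d < 0]
        m, s = numpy.mean(dn), numpy.std(dn)
        wt = 1.0 / (1.0 + numpy.exp(2 * (d - (2*s - m)) / s))
        if norm(w - wt) / norm(w) < ratio:
            break
        w = wt
    return z
This avoids materialising the dense D.toarray() and keeps the per-iteration cost roughly linear in N, but it is only a sketch of the idea, not a drop-in replacement tested against the data above.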

docplex.cp.model is slower than the exhaustive search

I am working on a combinatorial optimisation problem and realised that CPLEX is taking a significant time to run. Here is a toy example:
I am using the Python API for docplex.
import numpy as np
from docplex.cp.model import CpoModel
N = 5000
S = 10
k = 2
u_i = np.random.rand(N)[:,np.newaxis]
u_ij = np.random.rand(N*S).reshape(N, S)
beta = np.random.rand(N)[:,np.newaxis]
m = CpoModel(name = 'model')
R = range(0, S)
idx = [(j) for j in R]
I = m.binary_var_dict(idx)
m.add_constraint(m.sum(I[j] for j in R)<= k)
total_rev = m.sum(beta[i,0] / ( 1 + u_i[i,0]/sum(I[j] * u_ij[i,j] for j in R) ) for i in range(N) )
m.maximize(total_rev)
sol=m.solve(agent='local')
sol.print_solution()
for i in R:
    if sol[I[i]] == 1:
        print('i : ' + str(i))
Part of the output is as follows:
Model constraints: 1, variables: integer: 10, interval: 0, sequence: 0
Solve status: Optimal
Search status: SearchCompleted, stop cause: SearchHasNotBeenStopped
Solve time: 76.14 sec
-------------------------------------------------------------------------------
Objective values: (1665.58,), bounds: (1665.74,), gaps: (9.27007e-05,)
Variables:
+ 10 anonymous variables
I tried the same with an exhaustive search:
import numpy as np
import pandas as pd
from itertools import combinations,permutations,product
import time
start = time.time()
results = []
for K_i in range(1, k+1):  # K
    comb = list(combinations(range(S), K_i))
    A = len(comb)
    for a in range(A):  # A
        comb_i = comb[a]
        I = np.repeat(0, S).reshape(-1, 1)
        I[comb_i, 0] = 1
        u_j = np.matmul(u_ij, I)
        total_rev = np.sum(beta / (1 + u_i/u_j))
        results.append({'comb_i': comb_i, 'total_rev': total_rev})
end = time.time()
time_elapsed = end - start
print('time_elapsed : ', str(time_elapsed))
results = pd.DataFrame(results)
opt_results = results[results['total_rev'] == max(results['total_rev'].values)]
print(opt_results)
Output:
time_elapsed : 0.012971639633178711
comb_i total_rev
23 (1, 6) 1665.581329
As you can see, CPLEX is 1000 times slower than the exhaustive search. Is there a way to improve the CPLEX algorithm?
If you change
sol=m.solve(agent='local')
to
sol=m.solve(agent='local',SearchType="DepthFirst")
you'll get the optimal solution faster.
NB: proving optimality can sometimes take time with CP Optimizer.
For this particular problem:
sol=m.solve(agent='local', SearchType='DepthFirst', Workers=1)
should help out a lot.
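If proving optimality is the bottleneck, it can also help to bound the search explicitly. TimeLimit and RelativeOptimalityTolerance are standard CP Optimizer parameters passed through solve(), but treating them as appropriate for this model is my own suggestion, not part of the answer above:
sol = m.solve(agent='local', SearchType='DepthFirst', Workers=1,
              TimeLimit=10,                       # stop after 10 s and return the best solution found
              RelativeOptimalityTolerance=1e-3)   # accept a solution within 0.1% of the bound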

Chudnovsky algorithm python incorrect decimals

My goal is to get 100,000 or 200,000 correct decimals of Pi in Python. For this, I have tried using the Chudnovsky algorithm, but I've got some issues along the way.
First, the program only gives me 29 digits, instead of the 50 I want in order to test correctness. I know this is a small issue, but I don't understand what I've done wrong.
Second, only the first 14 decimals are correct. After those, the digits diverge from every reference value of Pi I can find online. How do I get far more correct decimals?
And last, how do I get my code to run on all 4 of the threads I have? I've tried using Pool, but it doesn't seem to work (I checked with the Windows Task Manager).
This is my code:
from math import *
from decimal import Decimal, localcontext
from multiprocessing import Pool
import time
k = 0
s = 0
c = Decimal(426880*sqrt(10005))
if __name__ == '__main__':
    start = time.time()
    pi = 0
    with localcontext() as ctx:
        ctx.prec = 50
        with Pool(None) as pool:
            for k in range(0, 500):
                m = Decimal((factorial(6 * k)) / (factorial(3 * k) * Decimal((factorial(k) ** 3))))
                l = Decimal((545140134 * k) + 13591409)
                x = Decimal((-262537412640768000) ** k)
                subPi = Decimal(((m*l)/x))
                s = s + subPi
            print(c*(s**-1))
    print(time.time() - start)
In addition to the small details discussed in the comments and proposed by @mark-dickinson, I think I've fixed the multithreading, but I haven't had a chance to test it; let me know if it works properly.
UPDATE: the problems after the 28th digit were due to sq and c being assigned before the decimal context change. Reassigning their values after changing the context precision solved the problem.
from math import *
import decimal
from decimal import Decimal, localcontext
from multiprocessing import Pool
import time

k = 0
s = 0
sq = Decimal(10005).sqrt()  # useless here
c = Decimal(426880*sq)      # useless here

def calculate():
    global s, k
    for k in range(0, 500):
        m = Decimal((factorial(6 * k)) / (factorial(3 * k) * Decimal((factorial(k) ** 3))))
        l = Decimal((545140134 * k) + 13591409)
        x = Decimal((-262537412640768000) ** k)
        subPi = Decimal((m*l)/x)
        s = s + subPi
    print(c*(s**-1))

if __name__ == '__main__':
    start = time.time()
    pi = 0
    decimal.getcontext().prec = 100  # change the precision to increase the result digits
    sq = Decimal(10005).sqrt()
    c = Decimal(426880*sq)
    pool = Pool()
    result = pool.apply_async(calculate)
    result.get()
    print(time.time() - start)
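As posted, calculate() is submitted as a single task, so only one worker actually does the summation. If the goal is to use all four cores, one option (my own untested sketch; the chunking scheme, integer-based term construction and the PREC constant are my choices, not part of the answer above) is to give each process a disjoint slice of k values, have it return its partial sum as a string, and combine the partial sums in the parent:
import decimal
from decimal import Decimal
from math import factorial
from multiprocessing import Pool

PREC = 200  # working precision; raise it for more digits

def partial_sum(args):
    start, stop, step = args
    decimal.getcontext().prec = PREC  # each worker process needs its own precision
    s = Decimal(0)
    for k in range(start, stop, step):
        num = Decimal(factorial(6*k)) * (13591409 + 545140134*k)
        den = Decimal(factorial(3*k)) * Decimal(factorial(k))**3 * Decimal(-262537412640768000)**k
        s += num / den
    return str(s)  # return as a string so full precision survives pickling

if __name__ == '__main__':
    n_workers = 4
    terms = 500
    decimal.getcontext().prec = PREC
    with Pool(n_workers) as pool:
        parts = pool.map(partial_sum, [(w, terms, n_workers) for w in range(n_workers)])
    s = sum(Decimal(p) for p in parts)
    pi = Decimal(426880) * Decimal(10005).sqrt() / s
    print(pi)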

Why is PyTorch slower than PyOpenCL, which is slower than Numba on GPU?

I was working on an FDTD program that used the discrete Laplacian, which can be implemented as a convolution operation. From what I have read, the main component of PyTorch is a tensor library optimized to perform operations commonly used for machine learning (such as convolutions). I was interested in comparing it with other frameworks I have used, so I wrote a test program that applies the discrete Laplacian to a 1d array multiple times and compares execution times:
import torch as tr
import time
from numba import jit, cuda
import numpy as np
import pyopencl as cl
from pyopencl import array
#parameters
number_of_timesteps = 1000
number_of_elements = 10000000
#set up the initial conditions
torch_data = tr.rand((1,1,number_of_elements),dtype=tr.double) #torch convolution needs shape (minibatch,in_channels,iW)
numba_data = np.array([0] + list(torch_data[0][0].numpy()) + [0]) #add padding [0] for convolution. handled automatically in torch.
opencl_data = np.array([0] + list(torch_data[0][0].numpy()) + [0])
#test Torch
device = "cuda"
torch_data_a = torch_data.to(device)
torch_data_b = torch_data.to(device)
kernel = tr.tensor([[[1,-2,1]]],dtype=tr.double,device=device)
with tr.no_grad():
    start_time = time.time()
    for t in range(round(number_of_timesteps/2)): # /2 because each loop is two convolutions
        torch_data_b = torch_data_a + 0.1* tr.nn.functional.conv1d(torch_data_a,kernel,padding=1)
        torch_data_a = torch_data_b + 0.1* tr.nn.functional.conv1d(torch_data_b,kernel,padding=1)
print("Torch GPU time:",time.time()-start_time)
torch_data_numpy = np.array([0] + list(torch_data_a[0][0].cpu().numpy()) + [0])
#Numba GPU kernel
@cuda.jit
def numba_conv_cuda(x,x_new):
    gid = cuda.grid(1)
    if 0 < gid < x.size - 1: # Check array boundaries
        x_new[gid] = x[gid] + 0.1*(x[gid+1]+x[gid-1]-2*x[gid])
threadsperblock = 100
blockspergrid = (numba_data.size + (threadsperblock - 1)) // threadsperblock
x_a = cuda.to_device(numba_data)
x_b = cuda.to_device(numba_data)
start_time = time.time()
#actually run the kernel
for t in range(round(number_of_timesteps/2)): #again /2 because each loop is two convolutions
    numba_conv_cuda[blockspergrid, threadsperblock](x_a,x_b)
    numba_conv_cuda[blockspergrid, threadsperblock](x_b,x_a)
print("Numba GPU time:",time.time()-start_time)
numba_data = x_a.copy_to_host()
#test OpenCL
context = cl.create_some_context(interactive=False,answers=[0])
queue = cl.CommandQueue(context)
mem_flags = cl.mem_flags
program = cl.Program(context, """
#pragma OPENCL EXTENSION cl_khr_fp64 : enable //enable double precision calculations
__kernel void update_psi(__global const double *x, __global double *x_new)
{
    int gid = get_global_id(0);
    if(0 < gid && gid < x.size - 1){
        x_new[gid] = x[gid] + 0.1*(x[gid+1]+x[gid-1]-2*x[gid]);
    }
}
""".replace("x.size",str(opencl_data.size))).build()
x_a_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=opencl_data)
x_b_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=opencl_data)
#actually run the OpenCL
start_time = time.time()
for t in range(round(number_of_timesteps/2)): #again /2 because each loop is two convolutions
    event = program.update_psi(queue, [threadsperblock*blockspergrid], [threadsperblock], x_a_buf, x_b_buf)
    event.wait()
    event = program.update_psi(queue, [threadsperblock*blockspergrid], [threadsperblock], x_b_buf, x_a_buf)
    event.wait()
print("OpenCL GPU time:",time.time()-start_time)
event = cl.enqueue_copy(queue, opencl_data, x_a_buf)
event.wait()
print("Results are same?",np.allclose(torch_data_numpy,numba_data) and np.allclose(numba_data,opencl_data))
And these are the results testing on an Nvidia GPU:
Torch GPU time: 13.544365406036377
Numba GPU time: 0.2404193878173828
OpenCL GPU time: 0.9025869369506836
Results are same? True
I am surprised that a library designed for applying operations such as convolutions turns out to be so much slower than Numba or PyOpenCL (which is not even optimized, since I did not use any local memory on the GPU). Is this really the case, or did I do something wrong?
Additionally, why is the kernel written in C more than 3x slower than the kernel written in Python?
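One methodological point worth ruling out before comparing the numbers (my addition, not a claim about where the gap actually comes from): CUDA calls in PyTorch are asynchronous, and the first iterations also pay one-off initialisation costs, so a wall-clock measurement should warm up first and call torch.cuda.synchronize() before reading the clock. A sketch of how the Torch part of the benchmark could be timed that way, reusing the tensors defined above:
import time
import torch as tr

def timed_torch_run(torch_data_a, kernel, steps):
    # warm up so one-off CUDA/cuDNN initialisation is not billed to the timed loop
    with tr.no_grad():
        tr.nn.functional.conv1d(torch_data_a, kernel, padding=1)
    tr.cuda.synchronize()
    start = time.time()
    with tr.no_grad():
        for _ in range(steps // 2):  # /2 because each loop is two convolutions
            torch_data_b = torch_data_a + 0.1*tr.nn.functional.conv1d(torch_data_a, kernel, padding=1)
            torch_data_a = torch_data_b + 0.1*tr.nn.functional.conv1d(torch_data_b, kernel, padding=1)
    tr.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return time.time() - start, torch_data_a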
