Timing parallelized NumPy calculation - python-3.x

I have a simple pi-approximating script like so:
import numpy as np
import matplotlib.pyplot as plt
import time

start = 10
stop = 1000000
step = 100
exactsolution = np.pi

def montecarlopi(N=1000000):
    random_x = np.random.random(size=N)
    random_y = np.random.random(size=N)
    bod = np.array([random_x, random_y]).T
    square_area = N
    quarter_circle_area = np.count_nonzero(np.linalg.norm(bod, axis=1) <= 1)
    pi_approx = 4*quarter_circle_area/square_area
    return pi_approx

if __name__ == '__main__':
    times = []
    results = []
    attemps = np.arange(start=start, stop=stop, step=step)
    for i in attemps:
        start_time = time.time()
        results.append(montecarlopi(i))
        times.append(time.time()-start_time)
    absolute_errors = np.abs(np.array(results)-exactsolution)
and I want to know how long the calculation takes as a function of the number of random attempts I use. As you can see, I use a for loop to get each of the calculation times I need, but this defeats the purpose of NumPy and slows down my code a lot. Effectively I'd like to just call montecarlopi() on the whole attemps array, but then I wouldn't have the individual calculation times.
Is there a way to time each parallelized calculation NumPy does?

I used the timing code from the answer provided here:
https://codereview.stackexchange.com/questions/165245/plot-timings-for-a-range-of-inputs
I only had to change labels to codes in the line:
empty_multi_index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['func', 'result'])
Then you can run your whole timing experiment using
timings.plot_times([montecarlopi], inputs=np.arange(10, 1000000, 1000), repeats=3)
and get an output like this:
(plot: Timing linear)
Or, more clearly, using log spacing:
timings.plot_times([montecarlopi], inputs=np.logspace(1, 8, 8, dtype=int), repeats=3)
(plot: Timing Logspace)
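If you prefer not to depend on that helper module, a rough equivalent of the same experiment (my own sketch, using only the standard library plus matplotlib, and assuming montecarlopi from the question is defined) would be:

import time
import numpy as np
import matplotlib.pyplot as plt

sizes = np.logspace(1, 7, 7, dtype=int)      # attempt counts to test
times = []
for n in sizes:
    t0 = time.perf_counter()                 # perf_counter is better suited than time.time for timing
    montecarlopi(n)
    times.append(time.perf_counter() - t0)

plt.loglog(sizes, times, marker='o')
plt.xlabel('number of attempts N')
plt.ylabel('time (s)')
plt.show()

This still loops over the sizes in Python, but each montecarlopi call itself stays fully vectorized, which is essentially what plot_times does internally as well.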

Related

Multiprocessing pool map for a BIG array computation goes much slower than expected

I've experienced some difficulties when using multiprocessing Pool in Python 3. I want to do a BIG array calculation using pool.map. Basically, I have a 3D array on which I need to run a computation 10 times, generating 10 output files sequentially. This task can be done 3 times, i.e. in the output we get 3*10 = 30 output files (*.txt). To do this, I've prepared the following script for a small array calculation (a sample problem). However, when I use this script for a BIG array calculation, or for arrays coming out of a series of files, then this piece of code (maybe the pool) grabs all the memory and does not save any .txt file to the destination directory. There is no error message when I run the file with the command mpirun python3 sample_prob_func.py
Can anybody suggest what the problem is in the sample script, and how to write the code so that it does not get stuck? I haven't received any error message, so I don't know where the problem occurs. Any help is appreciated. Thanks!
import numpy as np
import multiprocessing as mp
from scipy import signal
import matplotlib.pyplot as plt
import contextlib
import os, glob, re
import random
import cmath, math
import time
import pdb

# File storing path
save_results_to = 'File saving path'

arr_x = [0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0, 8.49, 12.0]
arr_y = [0, 8.49, 12.0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0]
N = len(arr_x)

np.random.seed(12345)
total_rows = 5000
arr = np.reshape(np.random.rand(total_rows*N), (total_rows, N))
arr1 = np.reshape(np.random.rand(total_rows*N), (total_rows, N))
arr2 = np.reshape(np.random.rand(total_rows*N), (total_rows, N))

# Finding cross spectral density (CSD)
def my_func1(data):
    # Do something here
    return array1

t0 = time.time()
my_data1 = my_func1(arr)
my_data2 = my_func1(arr1)
my_data3 = my_func1(arr2)
print('Time required {} seconds to execute CSD--For loop'.format(time.time()-t0))

mydata_list = [my_data1, my_data3, my_data3]

def my_func2(data2):
    # Do something here
    return from_data2

start_freq = 100
stop_freq = 110
freq_range = np.around(np.linspace(start_freq, stop_freq, 11)/10, decimals=2)
no_of_freq = len(freq_range)

list_arr = []

def my_func3(csd):
    list_csd = []
    for fr_count in range(start_freq, stop_freq):
        csd_single = csd[:, :, fr_count]
        list_csd.append(csd_single)
    print('Shape of list is :', np.array(list_csd).shape)
    return list_csd

def parallel_function(BIG_list_data):
    with contextlib.closing(mp.Pool(processes=10)) as pool:
        dft = pool.map(my_func2, BIG_list_data)
        pool.close()
        pool.join()
    data_arr = np.array(dft)
    print('shape of data :', data_arr.shape)
    return data_arr

count_day = 1
count_hour = 0
for count in range(3):
    count_hour += 1
    list_arr = my_func3(mydata_list[count])  # Load NumPy files
    print('Array shape is :', np.array(arr).shape)
    t0 = time.time()
    data_dft = parallel_function(list_arr)
    print('The hour number={} data is processing... '.format(count_hour))
    print('Time in parallel:', time.time() - t0)
    for i in range(no_of_freq-1):  # (11-1=10)
        jj = freq_range[i]
        #print('The hour_number {} and frequency number {} data is processing... '.format(count_hour, jj))
        dft_1hr_complx = data_dft[i, :, :]
        np.savetxt(save_results_to + f'csd_Day_{count_day}_Hour_{count_hour}_f_{jj}_hz.txt', dft_1hr_complx.view(float))
As @JérômeRichard suggested, to make the job scheduler's allocation visible to your script you need to define how many processors will be engaged in this task. The following line could help you: ncpus = int(os.getenv('SLURM_CPUS_PER_TASK', 1))
You need to use this line inside your Python script. Also, inside parallel_function use with contextlib.closing(mp.Pool(processes=ncpus)) as pool: instead of with contextlib.closing(mp.Pool(processes=10)) as pool:. Thanks
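Put together, a minimal sketch of that change (reusing my_func2 from the script above; SLURM_CPUS_PER_TASK is only set when running under SLURM, hence the fallback) could look like this:

import os
import contextlib
import multiprocessing as mp
import numpy as np

# Number of CPUs granted by the scheduler; fall back to 1 outside SLURM
ncpus = int(os.getenv('SLURM_CPUS_PER_TASK', 1))

def parallel_function(BIG_list_data):
    # Size the pool from the scheduler allocation instead of hard-coding 10 processes
    with contextlib.closing(mp.Pool(processes=ncpus)) as pool:
        dft = pool.map(my_func2, BIG_list_data)
    return np.array(dft)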

Full diagonalization NumPy vs SciPy

I have to use diagonalization routines to obtain all eigenpairs of a Hermitian complex matrix. I am a bit limited by performance, since I need to repeat the operation thousands of times and my matrices are roughly 8000x8000. I have created a little comparison between the NumPy and SciPy routines for diagonalization of Hermitian matrices and timed them on a machine with 6 physical cores.
I am observing that for 8000x8000 matrices this scales to ~0.8 minutes, and I need to repeat the process 50000 times. Is there something I am missing here, or are these actually the expected times? Overall, this all looks very slow, especially since it needs to be repeated many times. In fact, on a 30-core machine I observe little performance gain. I am using Python 3.8 from the Anaconda distribution, so NumPy and SciPy are linked against MKL.
Here is the example code
import numpy as np
import scipy.linalg
import matplotlib.pyplot as pyt
from time import time

t_ls = []
d_ls = np.array([100, 500, 1000, 2000, 4000])
for N in d_ls:
    A = np.random.rand(N, N) + 1j*np.random.rand(N, N)
    A = 0.5*(A + np.conj(A.T))
    ts = time()
    evals, evecs = np.linalg.eigh(A)
    t_np = time() - ts
    ts = time()
    evals2, evecs2 = scipy.linalg.eigh(A)
    t_sp = time() - ts
    t_ls.append(np.array([t_np, t_sp]))
t_ls = np.array(t_ls)

pyt.plot(d_ls, t_ls[:, 0], marker='s')
pyt.plot(d_ls, t_ls[:, 1], marker='^')
pyt.xlabel("N")
pyt.ylabel("time(secs)")
pyt.legend(["NumPy", "SciPy"])
pyt.show()
USING SVD AND MP PARALLELIZATION
Going through some of the comments in the post, I have tried SVD of the matrix and multiprocessing. In all cases I still see that the serialized approach with NumPy eigh is the most efficient one; here is the code:
import numpy as np
import scipy.linalg
import matplotlib.pyplot as pyt
from time import time
import psutil

def f_mp_pool(*args):
    N = args[0]
    A = np.random.rand(N, N) + 1j*np.random.rand(N, N)
    A = 0.5*(A + np.conj(A.T))
    evals, evecs = np.linalg.eigh(A)
    return evals, evecs

nreps = 100
N = 700

ts = time()
for n in range(nreps):
    A = np.random.rand(N, N) + 1j*np.random.rand(N, N)
    A = 0.5*(A + np.conj(A.T))
    res = np.linalg.eigh(A)
print("serialized:", time()-ts)

# use SVD
import scipy.linalg
ts = time()
for n in range(nreps):
    res = scipy.linalg.svd(A, full_matrices=True, check_finite=False)
print("SVD:", time()-ts)

import multiprocessing as mp
nproc = psutil.cpu_count(logical=False) - 1
mp_pool = mp.Pool(processes=nproc)
args_ls = [(N,) for n in range(nreps)]
ts = time()
res = mp_pool.starmap(f_mp_pool, args_ls)
print("parallel:", time()-ts)
PyTorch will be faster, and if you have a GPU it can also take advantage of that; however, not by that much, because the QR iteration is not well suited to parallel computation. I have a potential solution to accelerate that part on GPUs, but I never actually implemented it.
import numpy as np
import scipy.linalg
import torch
import matplotlib.pyplot as plt
from time import time

t_ls = []
d_ls = np.array([100, 500, 1000, 2000, 4000])
for N in d_ls:
    A = np.random.rand(N, N) + 1j*np.random.rand(N, N)
    A = 0.5*(A + np.conj(A.T))

    # skipping numpy, it is slow here, you may put it back if you want
    # ts = time()
    # evals, evecs = np.linalg.eigh( A )
    # t_np = time()-ts

    ts = time()
    evals2, evecs2 = scipy.linalg.eigh(A)
    t_sp = time() - ts

    # When using CPU torch will use intra operation
    # parallelism, so if you care about latency
    # this is better than using multiprocessing
    A_cpu = torch.as_tensor(A)
    ts = time()
    evals3, evecs3 = torch.linalg.eigh(A_cpu)
    t_cpu = time() - ts

    if torch.cuda.is_available():
        # Using GPU will give a significant speedup for some
        # operations, I guess the
        A_gpu = A_cpu.to('cuda')
        torch.cuda.synchronize()
        ts = time()
        evals4, evecs4 = torch.linalg.eigh(A_gpu)
        torch.cuda.synchronize()
        t_gpu = time() - ts
    else:
        t_gpu = np.nan  # if you don't have GPU let's skip this part

    t_ls.append(np.array([np.nan, t_sp, t_cpu, t_gpu]))
    print(t_ls[-1])

t_ls = np.array(t_ls)

plt.plot(d_ls, t_ls[:, 0], marker='s')
plt.plot(d_ls, t_ls[:, 1], marker='^')
plt.plot(d_ls, t_ls[:, 2], marker='+')
plt.plot(d_ls, t_ls[:, 3], marker='d')
plt.xlabel("N")
plt.ylabel("time(secs)")
plt.legend(["NumPy", "SciPy", "PyTorch CPU", "PyTorch GPU"])
(plot: timings for SciPy, PyTorch CPU and PyTorch GPU)

How can I interpolate values from two lists (in Python)?

I am relatively new to coding in Python. I have mainly used MATLAB in the past and am used to having vectors that can be referenced explicitly rather than appended lists. I have a script where I generate a list of x- and y- (z-, v-, etc.) values. Later, I want to interpolate and then print a table of the values at specified points. Here is an MWE. The problem is at line 48:
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
I'm not sure I have the correct syntax for the last two lines either:
table[nn] = ('%.2f' %xq, '%.2f' %yq)
print(table)
Here is the full script for the MWE:
#This script was written to test how to interpolate after data was created in a loop and stored as a list. Can a list be accessed explicitly like a vector in matlab?
#
from scipy.interpolate import interp1d
from math import * #for ceil
from astropy.table import Table #for Table
import numpy as np

# define the initial conditions
x = 0 # initial x position
y = 0 # initial y position
Rmax = 10 # maximum range

""" initializing variables for plots"""
x_list = [x]
y_list = [y]

""" define functions"""
# not necessary for this MWE

"""create sample data for MWE"""
# x and y data are calculated using functions and appended to their respective lists
h = 1
t = 0
tf = 10
N = ceil(tf/h)

# Example of interpolation without a loop: https://docs.scipy.org/doc/scipy/tutorial/interpolate.html#d-interpolation-interp1d
#x = np.linspace(0, 10, num=11, endpoint=True)
#y = np.cos(-x**2/9.0)
#f = interp1d(x, y)

for i in range(N):
    x = h*i
    y = cos(-x**2/9.0)
    """ appends selected data for ability to plot"""
    x_list.append(x)
    y_list.append(y)

## Interpolation after x- and y-lists are already created
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
    xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
    yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
    table[nn] = ('%.2f' %xq, '%.2f' %yq)
print(table)
Your help and patience will be greatly appreciated!
Best regards,
Alex
Your code has some glaring issues that made it really difficult to understand. Let's first take a look at some things I needed to fix:
for i in range(N):
    x = h*1
    y = cos(-x**2/9.0)
    """ appends selected data for ability to plot"""
    x_list.append(x)
    y_list.append(y)
You are appending a single value without modifying it. What I presume you wanted is down below.
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
    xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
    yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
    table[nn] = ('%.2f' %xq, '%.2f' %yq)
This is where things get strange. First: use pandas tables; that is the more popular choice. Second: I have no idea what you are trying to loop over. What I presume you wanted was to vary the number of points for the interpolation, which I have done below. Third: you are trying to interpolate a single point, when you probably want to interpolate over a range of points (...interpolation). Lastly, you are using the interp1d function incorrectly: it returns an interpolating function, which you then call with the query points. Please take a look at the code below; let me know what exactly you wanted (specifically: what should xq / xq(nn) be?), because the MRE you provided is quite confusing.
from scipy.interpolate import interp1d
from math import *
import numpy as np

Rmax = 10
h = 1
t = 0
tf = 10
N = ceil(tf/h)

x = np.arange(0, N+1)
y = np.cos(-x**2/9.0)

interval = 0.5
NN = ceil(Rmax/interval) + 1
ip_list = np.arange(1, interval*NN, interval)

xtable = []
ytable = []
for i, nn in enumerate(ip_list):
    f = interp1d(x, y)
    x_i = np.arange(0, nn+interval, interval)
    xtable += [x_i]
    ytable += [f(x_i)]

[print(i) for i in xtable]
[print(i) for i in ytable]
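For completeness, the same fix applied back to the original loop and astropy Table (a sketch; the sample data below stands in for the x_list / y_list built in the MWE) would look roughly like:

from math import ceil
import numpy as np
from astropy.table import Table
from scipy.interpolate import interp1d

# stand-in data for the lists built in the question's loop
x_list = np.arange(0, 11)
y_list = np.cos(-x_list**2/9.0)

Rmax = 10
intervals = 0.5
NN = ceil(Rmax/intervals) + 1

f = interp1d(x_list, y_list)        # build the interpolating function once...
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(NN, dtype=dtype))
for nn in range(NN):
    xq = nn*intervals               # query points 0.0, 0.5, ..., 10.0
    table[nn] = (xq, float(f(xq)))  # ...then evaluate it at each query point
print(table)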

Product of tuples in generator

I have a generator that yields millions of tuples (~100 million), and I need the product (np.prod) of each tuple, then the sum of all those products.
I have the following example code that works fine for a reasonable number of tuples in the generator, but takes a lot of time when the number gets high. I am working on an instance with 64 cores and ~160 GB of RAM, and I am looking for a way to optimize my code if possible.
import random
import numpy as np
import multiprocessing as mp
import time
nprocs = mp.cpu_count()
pool = mp.Pool(processes=nprocs)
x = 1000000
mygen = ((random.randint(0, 100)/100, random.randint(0, 100)/100 ) for k in range(x))
start = time.time()
proba_all = sum(pool.map(np.prod, mygen))
print(proba_all)
end = time.time()
print (end-start)
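Since each tuple has a fixed length of two, one alternative worth sketching (my own rough idea, not a benchmarked answer) is to skip multiprocessing entirely and let NumPy compute the products on chunks of the generator:

import random
from itertools import islice
import numpy as np

def sum_of_products(gen, chunk_size=1000000):
    # Consume the generator in chunks, take the row-wise product, and accumulate the sum
    total = 0.0
    while True:
        chunk = list(islice(gen, chunk_size))
        if not chunk:
            break
        arr = np.asarray(chunk)            # shape (n, 2) for the pair tuples
        total += arr.prod(axis=1).sum()
    return total

x = 1000000
mygen = ((random.randint(0, 100)/100, random.randint(0, 100)/100) for k in range(x))
print(sum_of_products(mygen))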

Simulating 10,000 Coinflips in Python Very Slow

I am writing a simulation that creates 10,000 periods of 25 sets, with each set consisting of 48 coin tosses. Something in this code is making it run very slowly. It has been running for at least 20 minutes and it is still working. A similar simulation in R runs in under 10 seconds.
Here is the python code I am using:
import pandas as pd
from random import choices

threshold = 17
all_periods = pd.DataFrame()
for i in range(10000):
    simulated_period = pd.DataFrame()
    for j in range(25):
        # Data frame with 48 weeks as rows. Each run through loop adds one more year as column until there are 25
        simulated_period = pd.concat([simulated_period, pd.DataFrame(choices([1, -1], k=48))],
                                     ignore_index=True, axis=1)
    positives = simulated_period[simulated_period==1].count(axis=1)
    negatives = simulated_period[simulated_period==-1].count(axis=1)
    # Combine positives and negatives that are more than the threshold into single dataframe
    sig = pd.DataFrame([[sum(positives>=threshold), sum(negatives>=threshold)]], columns=['positive', 'negative'])
    sig['total'] = sig['positive'] + sig['negative']
    # Add summary of individual simulation to the others
    all_periods = pd.concat([all_periods, sig])
If it helps, here is the R script that is running quickly:
flip <- function(threshold=17){
  # threshold is min number of persistent results we want to see. For example, 17/25 positive or 17/25 negative
  outcomes <- c(1, -1)
  trial <- do.call(cbind, lapply(1:25, function (i) sample(outcomes, 48, replace=T)))
  trial <- as.data.frame(t(trial)) # 48 weeks in columns, 25 years in rows.
  summary <- sapply(trial, function(x) c(pos=length(x[x==1]), neg=length(x[x==-1])))
  summary <- as.data.frame(t(summary)) # use data frame so $pos/$neg can be used instead of [1,]/[2,]
  sig.pos <- length(summary$pos[summary$pos>=threshold])
  sig.neg <- length(summary$neg[summary$neg>=threshold])
  significant <- c(pos=sig.pos, neg=sig.neg, total=sig.pos+sig.neg)
  return(significant)
}
results <- do.call(rbind, lapply(1:10000, function(i) flip(threshold)))
results <- as.data.frame(results)
Can anyone tell me what I'm running in python that is slowing the process down? Thank you.
Why don't you generate the whole big set at once?
idx = pd.MultiIndex.from_product((range(10000), range(25)),
                                 names=('period', 'set'))
df = pd.DataFrame(data=np.random.choice([1, -1], (10000*25, 48)), index=idx)
Took about 120ms on my computer. And then the other operations:
positives = df.eq(1).sum(level=0).gt(17).sum(axis=1).to_frame(name='positives')
negatives = df.eq(-1).sum(level=0).gt(17).sum(axis=1).to_frame(name='negatives')
all_periods = pd.concat( (positives, negatives), axis=1 )
all_periods['total'] = all_periods.sum(1)
take about 600ms extra.
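As a side note, recent pandas versions no longer accept the level= keyword in sum; the equivalent spelling with groupby (otherwise the same logic) would be roughly:

positives = df.eq(1).groupby(level=0).sum().gt(17).sum(axis=1).to_frame(name='positives')
negatives = df.eq(-1).groupby(level=0).sum().gt(17).sum(axis=1).to_frame(name='negatives')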
