Reading Files in Parallel mpi4py - python-3.x

I have a series of n files that I'd like to read in parallel using mpi4py. Every file contains a column vector and, as final result, I want to obtain a matrix containing all the single vectors as X = [x1 x2 ... xn].
In the first part of the code I create the list containing all the names of the files and I distribute part of the list to the different cores through the scatter method.
import numpy as np
import pandas as pd
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()
folder = "data/" # Input directory
files = [] # File List
# Create File List -----------------------------------------------------------
if rank == 0:
for i in range(1,2000):
filename = "file_" + str(i) + ".csv"
files = np.append(files,filename)
print("filelist complete!")
# Determine the size of each sub task
ave, res = divmod(files.size, nprocs)
counts = [ave + 1 if p < res else ave for p in range(nprocs)]
# Determine starting and ending indices of each sub-task
starts = [sum(counts[:p]) for p in range(nprocs)]
ends = [sum(counts[:p+1]) for p in range(nprocs)]
# Convert data into list of arrays
fileList = [files[starts[p]:ends[p]] for p in range(nprocs)]
else:
fileList = None
fileList = comm.scatter(fileList, root = 0)
Here I create a matrix X where to store the vectors.
# Variables Initialization ---------------------------------------------------
# Creation Support Vector
vector = pd.read_csv(folder + fileList[0])
vector = vector.values
vectorLength = len(vector)
# Matrix
X = np.ones((vectorLength, len(fileList)))
# ----------------------------------------------------------------------------
Here, I read the different files and I append the column vector to the matrix X. With the gather method I store all the X matrix calculated by the single cores into one single matrix X. The X matrix resulting from the gather method is a list of 2D numpy arrays. As final step, I reorganize the list X into a matrix
# Reading Files -----------------------------------------------------------
for i in range(len(fileList)):
data = pd.read_csv(folder + fileList[i])
data = np.array(data.values)
X[:,i] = data[:,0]
X = comm.gather(X, root = 0)
if rank == 0:
X_tot = np.empty((vectorLength, 1))
for i in range(nprocs):
X_proc = np.array(X[i])
X_tot = np.append(X_tot, X_proc, axis=1)
X_tot = X_tot[:,1:]
X = X_tot
del X_tot
print("printing X", X)
The code works fine. I tested it on a small dataset and did what it is meant to do. However I tried to run it on a large dataset and I got the following error:
X = comm.gather(X[:,1:], root = 0)
File "mpi4py/MPI/Comm.pyx", line 1578, in mpi4py.MPI.Comm.gather
File "mpi4py/MPI/msgpickle.pxi", line 773, in mpi4py.MPI.PyMPI_gather
File "mpi4py/MPI/msgpickle.pxi", line 778, in mpi4py.MPI.PyMPI_gather
File "mpi4py/MPI/msgpickle.pxi", line 191, in mpi4py.MPI.pickle_allocv
File "mpi4py/MPI/msgpickle.pxi", line 182, in mpi4py.MPI.pickle_alloc
SystemError: Negative size passed to PyBytes_FromStringAndSize
It seems a really general error, however I could process the same data in serial mode without problems or in parallel without using all the n files. I also noticed that only the rank 0 core seems to work, while the others seem to do nothing.
This is my first project using mpi4py so I'm sorry if the code is not perfect and if I have committed any conceptual mistake.

This error typically occurs when the data passed between MPI processes exceeds a certain size (I think 2GB). It's supposed to be fixed with future MPI versions, but for now, you'll probably have to resort to a workaround like storing your data on the hard disk and reading it with each process separately...
See for example here: https://github.com/mpi4py/mpi4py/issues/23

Related

How can I interpolate values from two lists (in Python)?

I am relatively new to coding in Python. I have mainly used MatLab in the past and am used to having vectors that can be referenced explicitly rather than appended lists. I have a script where I generate a list of x- and y- (z-, v-, etc) values. Later, I want to interpolate and then print a table of the values at specified points. Here is a MWE. The problem is at line 48:
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
I'm not sure I have the correct syntax for the last two lines either:
table[nn] = ('%.2f' %xq, '%.2f' %yq)
print(table)
Here is the full script for the MWE:
#This script was written to test how to interpolate after data was created in a loop and stored as a list. Can a list be accessed explicitly like a vector in matlab?
#
from scipy.interpolate import interp1d
from math import * #for ceil
from astropy.table import Table #for Table
import numpy as np
# define the initial conditions
x = 0 # initial x position
y = 0 # initial y position
Rmax = 10 # maxium range
""" initializing variables for plots"""
x_list = [x]
y_list = [y]
""" define functions"""
# not necessary for this MWE
"""create sample data for MWE"""
# x and y data are calculated using functions and appended to their respective lists
h = 1
t = 0
tf = 10
N=ceil(tf/h)
# Example of interpolation without a loop: https://docs.scipy.org/doc/scipy/tutorial/interpolate.html#d-interpolation-interp1d
#x = np.linspace(0, 10, num=11, endpoint=True)
#y = np.cos(-x**2/9.0)
#f = interp1d(x, y)
for i in range(N):
x = h*i
y = cos(-x**2/9.0)
""" appends selected data for ability to plot"""
x_list.append(x)
y_list.append(y)
## Interpolation after x- and y-lists are already created
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
table[nn] = ('%.2f' %xq, '%.2f' %yq)
print(table)
Your help and patience will be greatly appreciated!
Best regards,
Alex
Your code has some glaring issues that made it really difficult to understand. Let's first take a look at some things I needed to fix:
for i in range(N):
x = h*1
y = cos(-x**2/9.0)
""" appends selected data for ability to plot"""
x_list.append(x)
y_list.append(y)
You are appending a single value without modifying it. What I presume you wanted is down below.
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
table[nn] = ('%.2f' %xq, '%.2f' %yq)
This is where things get strange. First: use pandas tables, this is the more popular choice. Second: I have no idea what you are trying to loop over. What I presume you wanted was to vary the number of points for the interpolation, which I have done so below. Third: you are trying to interpolate a point, when you probably want to interpolate over a range of points (...interpolation). Lastly, you are using the interp1d function incorrectly. Please take a look at the code below or run it here; let me know what you exactly wanted (specifically: what should xq / xq(nn) be?), because the MRE you provided is quite confusing.
from scipy.interpolate import interp1d
from math import *
import numpy as np
Rmax = 10
h = 1
t = 0
tf = 10
N = ceil(tf/h)
x = np.arange(0,N+1)
y = np.cos(-x**2/9.0)
interval = 0.5
NN = ceil(Rmax/interval) + 1
ip_list = np.arange(1,interval*NN,interval)
xtable = []
ytable = []
for i,nn in enumerate(ip_list):
f = interp1d(x,y)
x_i = np.arange(0,nn+interval,interval)
xtable += [x_i]
ytable += [f(x_i)]
[print(i) for i in xtable]
[print(i) for i in ytable]

perform principal component analysis on moving window of data

This was partly answered by #WhoIsJack but not completely solved given the errors I get. Basically, I'm trying to perform principal component analysis on a rolling window of data. For example, I'd run PCA on the last 200 days in the df, move forward 1 day and do PCA again on the last 200 days. So as you move forward each day, you'd include the next day's measurement and exclude the last measurement.
You have a random df:
data = np.random.random(size=(1000,10))
df = pd.DataFrame(data)
Here's window size:
window = 200
Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame( np.zeros((data.shape[0] - window + 1, data.shape[1])) )
Define PCA fit-transform function. Instead of attempting to return the result, it is written into the previously created output array.
def rolling_pca(window_data):
pca = PCA()
transf = pca.fit_transform(df.iloc[window_data])
df_pca.iloc[int(window_data[0])] = transf[0,:]
return True
Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))
Use rolling to apply the PCA function
_ = df_idx.rolling(window).apply(rolling_pca)
The results should be contained here:
print(df_pca)
However, when I generate the results only the first row of data looks to contain PCAs while the rest of the rows are zero.
I also tried the following function:
def rolling_pca(x, window):
r = x.rolling(window=window)
pca = PCA(3)
y = pca.fit(r)
z = pca.fit_transform(y)
return z
window = 200
Which I thought would generate a new df with rolling PCAs:
data = df.apply(rolling_pca, window=window)
But I got the following error: setting an array element with a sequence.
I've also tried manually calculating with below. I get: "unsupported operand type(s) for /: 'Rolling' and 'int'"
def rolling_pca(x, window):
# create rolling dataframe
r = x.rolling(window=window)
# demand data
X = np.matrix(r)
X_dm = X - np.mean(X, axis = 0)
#Eigenvalue decomposition (of covariance matrix)
Cov_X = np.cov(X_dm, rowvar = False)
eigen = np.linalg.eig(Cov_X)
eig_values_X = np.matrix(eigen[0])
eig_vectors_X = np.matrix(eigen[1])
#transformed data
Y_dm = X_dm * eig_vectors_X
#assign transformed yields
yields_trans = Y_dm.copy()
# get PCs
pc1_yields = x.copy()
pcas = yields_trans[:,0:3]
return pcas
#assign window length
window = 300
rolling_pca(data, window=window)
And tried below. Get error: "LinAlgError: 0-dimensional array given. Array must be at least two-dimensional"
def pca(x):
# demand data
X = np.matrix(x.values)
X_dm = X - np.mean(X, axis = 0)
#Eigenvalue decomposition (of covariance matrix)
Cov_X = np.cov(X_dm, rowvar = False)
eigen = np.linalg.eig(Cov_X)
eig_values_X = np.matrix(eigen[0])
eig_vectors_X = np.matrix(eigen[1])
#transformed data
Y_dm = X_dm * eig_vectors_X
#assign transformed yields
yields_trans = Y_dm.copy()
# get 3 PCs
pcas = yields_trans[:,0:3]
final_pcas = pd.DataFrame(pcas)
return final_pcas
data.rolling(200).apply(pca)
Any thoughts would be appreciated!

I'm getting trouble in using numpy and handling list

This code read CSV file line by line and counts the number on each Unicode but I can't understand two parts of code like below.I've already googled but I could't find the answer. Could you give me advice ?
1) Why should I use numpy here instead of []?
emoji_time = np.zeros(200)
2) What does -1 mean ?
emoji_time[len(emoji_list)-1] = 1 ```
This is the code result:
0x100039, 47,
0x10002D, 121,
0x100029, 30,
0x100078, 6,
unicode_count.py
import codecs
import re
import numpy as np
​
file0 = "./message.tsv"
f0 = codecs.open(file0, "r", "utf-8")
list0 = f0.readlines()
f0.close()
print(len(list0))
​
len_list = len(list0)
emoji_list = []
emoji_time = np.zeros(200)
​
for i in range(len_list):
a = "0x1000[0-9A-F][0-9A-F]"
if "0x1000" in list0[i]: # 0x and 0x1000: same nuumber
b = re.findall(a, list0[i])
# print(b)
for j in range(len(b)):
if b[j] not in emoji_list:
emoji_list.append(b[j])
emoji_time[len(emoji_list)-1] = 1
else:
c = emoji_list.index(b[j])
emoji_time[c] += 1
print(len(emoji_list))
1) If you use a list instead of a numpy array the result should not change in this case. You can try it for yourself running the same code but replacing emoji_time = np.zeros(200) with emoji_time = [0]*200.
2) emoji_time[len(emoji_list)-1] = 1. What this line is doing is the follow: If an emoji appears for the first time, 1 is add to emoji_time, which is the list that contains the amount of times one emoji occurred. len(emoji_list)-1 is used to set the position in emoji_time, and it is based on the length of emoji_list (the minus 1 is only needed because the list indexing in python starts from 0).

Trouble aligning x-axis Matplotlib (Homework)

As stated about I have a homework assignment for a fundamentals of Data Science class. I am filtering out a tower with faulty information and plotting the data of the good tower by amplitude and timing.
The issue is with my mean line for my graph. It is suppose to run through the average of my points. Unfortunately I cannot seem to align across my X-axis.
My output looks like this:
I've tried solution I've found on stack overflow, but the best I could come up was a mean line for the whole graph using:mplot.plot(np.unique(columnOneF),np.poly1d(np.polyfit(columnOneF,columnTwoF,1))(np.unique(columnOneF)))
import csv
import matplotlib.pyplot as mplot
import numpy as np
File = open("WhiteSwordfish_ch1.csv")
csv_file = csv.reader(File)
columnOneF = []
columnTwoF = []
columnThreeF = []
MeanAmp = []
Freq = []
TempFreq = []
last = 0
for row in csv_file: # Loop graps all the rows out of the CSV File stores them by column in List
if float(row[2]) == 21.312057: # If statement check if the frequency if from the good tower if
Freq.append(row) # so it then grabs THE WHOLE ROW and stores in a a List
for row in Freq: # Program loops through only the good tower's data and sorts it into
columnOneF.append(float(row[0])) # Seperate list by type
columnTwoF.append(float(row[1]))
columnThreeF.append(float(row[2]))
# Mean Line Calculation
for i in Freq:
current = float(i[0])
if current == last:
TempFreq.append(float(i[1]))
else:
last = current
MeanAmp.append(np.mean(TempFreq))
# MeanAmp.insert(int(current), np.mean(TempFreq))
TempFreq = []
print(MeanAmp)
print(columnOneF)
# Graph One (Filter Data)
# ****************************************************************************
mplot.title("Filtered Data")
mplot.xlabel("Timing")
mplot.ylabel("Amplitude")
mplot.axis([-100, 800, -1.5, 1.5])
mplot.scatter(columnOneF, columnTwoF, color="red") # Clean Data POINTS
mplot.plot(MeanAmp, color="blue", linestyle="-") # Line
# mplot.plot(np.unique(columnOneF),np.poly1d(np.polyfit(columnOneF,columnTwoF,1))(np.unique(columnOneF)))
mplot.show() # Displays both graphs
You have passed only MeanAmp to the plot() function, which is interpreted as
plot(y) # plot y using x as index array 0..N-1
Source
If you provide x-cordinates, same as for the scatter() function, the lines will be aligned:
mplot.plot(columnOneF, MeanAmp, color="blue", linestyle="-")

parallel write to different groups with h5py

I'm trying to use parallel h5py to create an independent group for each process and fill each group with some data.. what happens is that only one group gets created and filled with data. This is the program:
from mpi4py import MPI
import h5py
rank = MPI.COMM_WORLD.Get_rank()
f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
data = range(1000)
dset = f.create_dataset(str(rank), data=data)
f.close()
Any thoughts on what is going wrong here?
Thanks alot
Ok, so as mentioned in the comments I had to create the datasets for every process then fill them up.. The following code is writing data in parallel as many times as the size of the communicator:
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
data = [random.randint(1, 100) for x in range(4)]
f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=comm)
dset = []
for i in range(size):
dset.append(f.create_dataset('test{0}'.format(i), (len(data),), dtype='i'))
dset[rank][:] = data
f.close()

Resources