Looping through multiple dataframes does not calculate properly - python-3.x

I am attempting to perform calculations, then loop through the same pandas dataframe and perform the same calculation but with an altered variable (one that increases each time it loops). If the loop range is set to just 1, all rows calculate properly and the new dataframe is created. However, attempting to actually loop the program results in NaN values everywhere except the first row.
Omega loop
for i in range(10):
#Determine first and last Julian dates of data
t1 = df.ix[:0,'jd']
t2 = df.ix[n-1:,'jd']
t2 = t2.reset_index(drop=True)
tj = t2-t1
#Iterate over each observation within each star file
jd = df['jd']
dmag = df['dmag']
sinw = np.sin(2*omega*jd)
sum1 = sinw.sum()
cosw = np.cos(2*omega*jd)
sum2 = cosw.sum()
#Calculate tau
tau = ((np.arctan(sum1/sum2))/(2*omega))
avgdmag = dmag.sum()/n
#Calculate sample variance
tot = (df['dmag']-avgdmag)**2
tot2 = tot.sum()
var = tot2/(n-1)
#Calculate sums for power series
sum3 = sum3 + ((dmag - avgdmag)*np.cos(omega*(jd-tau)))
sum4 = sum4 + (np.cos(omega*(jd-tau)))**2
sum5 = sum5 + ((dmag - avgdmag)*np.sin(omega*(jd-tau)))
sum6 = sum6 + (np.sin(omega*(jd-tau)))**2
#Calculate power series and normalized power series
px = (((sum3**2)/sum4)+((sum5**2)/sum6))/2
pn = px/var
#Step through sequential frequencies
omega = omega + (1/tj)
I also received a runtime warning from NumPy caused by the omega term at the end. I disabled "invalid" warnings as it was not causing an issue with the actual calculations. The first dataframe that incorrectly computes is sinw and cosw. And all subsequently calculated dataframes have NaN values.

It is because your tj is a pd.Series of length 1, not scalar as you would expect. After the first loop, omega = omega + 1/tj becomes a Series of length 1 (with 0 as index). Then in the 2nd loop, tau = ((np.arctan(sum1/sum2))/(2*omega)) also becomes such a Series. When updating sum3, jd - tau (a Series of length n minus a Series of length 1) gives you a Series with all NaN except at index 0 where both series match. After that all subsequent Series has lots of NaNs.
The solution is to calculate tj as a scalar, such as
tj = df.loc[n-1,'jd'] - df.loc[0,'jd'] (assuming n = len(df)).
Anyway, your piece of code can be re-written for readability.
tj = df.loc[n-1,'jd'] - df.loc[0,'jd'] #tj is loop invariant
for _ in range(10):
sum1 = np.sin(2*omega*df['jd']).sum()
sum2 = np.cos(2*omega*df['jd']).sum()
tau = np.arctan(sum1/sum2)/(2*omega)
avgdmag = df['dmag'].mean()
var = df['dmag'].var() #unbiased sample variance
sum3 += ((df['dmag'] - avgdmag)*np.cos(omega*(df['jd']-tau)))
sum4 += (np.cos(omega*(df['jd']-tau)))**2
sum5 += ((df['dmag'] - avgdmag)*np.sin(omega*(df['jd']-tau)))
sum6 += (np.sin(omega*(df['jd']-tau)))**2
px = (((sum3**2)/sum4)+((sum5**2)/sum6))/2
pn = px/var
omega += 1/tj

Related

Speed Up a for Loop - Python

I have a code that works perfectly well but I wish to speed up the time it takes to converge. A snippet of the code is shown below:
def myfunction(x, i):
y = x + (min(0, target[i] - data[i, :]x))*data[i]/(norm(data[i])**2))
return y
rows, columns = data.shape
start = time.time()
iterate = 0
iterate_count = []
norm_count = []
res = 5
x_not = np.ones(columns)
norm_count.append(norm(x_not))
iterate_count.append(0)
while res > 1e-8:
for row in range(rows):
y = myfunction(x_not, row)
x_not = y
iterate += 1
iterate_count.append(iterate)
norm_count.append(norm(x_not))
res = abs(norm_count[-1] - norm_count[-2])
print('Converge at {} iterations'.format(iterate))
print('Duration: {:.4f} seconds'.format(time.time() - start))
I am relatively new in Python. I will appreciate any hint/assistance.
Ax=b is the problem we wish to solve. Here, 'A' is the 'data' and 'b' is the 'target'
Ugh! After spending a while on this I don't think it can be done the way you've set up your problem. In each iteration over the row, you modify x_not and then pass the updated result to get the solution for the next row. This kind of setup can't be vectorized easily. You can learn the thought process of vectorization from the failed attempt, so I'm including it in the answer. I'm also including a different iterative method to solve linear systems of equations. I've included a vectorized version -- where the solution is updated using matrix multiplication and vector addition, and a loopy version -- where the solution is updated using a for loop to demonstrate what you can expect to gain.
1. The failed attempt
Let's take a look at what you're doing here.
def myfunction(x, i):
y = x + (min(0, target[i] - data[i, :] # x)) * (data[i] / (norm(data[i])**2))
return y
You subtract
the dot product of (the ith row of data and x_not)
from the ith row of target,
limited at zero.
You multiply this result with the ith row of data divided my the norm of that row squared. Let's call this part2
Then you add this to the ith element of x_not
Now let's look at the shapes of the matrices.
data is (M, N).
target is (M, ).
x_not is (N, )
Instead of doing these operations rowwise, you can operate on the entire matrix!
1.1. Simplifying the dot product.
Instead of doing data[i, :] # x, you can do data # x_not and this gives an array with the ith element giving the dot product of the ith row with x_not. So now we have data # x_not with shape (M, )
Then, you can subtract this from the entire target array, so target - (data # x_not) has shape (M, ).
So far, we have
part1 = target - (data # x_not)
Next, if anything is greater than zero, set it to zero.
part1[part1 > 0] = 0
1.2. Finding rowwise norms.
Finally, you want to multiply this by the row of data, and divide by the square of the L2-norm of that row. To get the norm of each row of a matrix, you do
rownorms = np.linalg.norm(data, axis=1)
This is a (M, ) array, so we need to convert it to a (M, 1) array so we can divide each row. rownorms[:, None] does this. Then divide data by this.
part2 = data / (rownorms[:, None]**2)
1.3. Add to x_not
Finally, we're adding each row of part1 * part2 to the original x_not and returning the result
result = x_not + (part1 * part2).sum(axis=0)
Here's where we get stuck. In your approach, each call to myfunction() gives a value of part1 that depends on target[i], which was changed in the last call to myfunction().
2. Why vectorize?
Using numpy's inbuilt methods instead of looping allows it to offload the calculation to its C backend, so it runs faster. If your numpy is linked to a BLAS backend, you can extract even more speed by using your processor's SIMD registers
The conjugate gradient method is a simple iterative method to solve certain systems of equations. There are other more complex algorithms that can solve general systems well, but this should do for the purposes of our demo. Again, the purpose is not to have an iterative algorithm that will perfectly solve any linear system of equations, but to show what kind of speedup you can expect if you vectorize your code.
Given your system
data # x_not = target
Let's define some variables:
A = data.T # data
b = data.T # target
And we'll solve the system A # x = b
x = np.zeros((columns,)) # Initial guess. Can be anything
resid = b - A # x
p = resid
while (np.abs(resid) > tolerance).any():
Ap = A # p
alpha = (resid.T # resid) / (p.T # Ap)
x = x + alpha * p
resid_new = resid - alpha * Ap
beta = (resid_new.T # resid_new) / (resid.T # resid)
p = resid_new + beta * p
resid = resid_new + 0
To contrast the fully vectorized approach with one that uses iterations to update the rows of x and resid_new, let's define another implementation of the CG solver that does this.
def solve_loopy(data, target, itermax = 100, tolerance = 1e-8):
A = data.T # data
b = data.T # target
rows, columns = data.shape
x = np.zeros((columns,)) # Initial guess. Can be anything
resid = b - A # x
resid_new = b - A # x
p = resid
niter = 0
while (np.abs(resid) > tolerance).any() and niter < itermax:
Ap = A # p
alpha = (resid.T # resid) / (p.T # Ap)
for i in range(len(x)):
x[i] = x[i] + alpha * p[i]
resid_new[i] = resid[i] - alpha * Ap[i]
# resid_new = resid - alpha * A # p
beta = (resid_new.T # resid_new) / (resid.T # resid)
p = resid_new + beta * p
resid = resid_new + 0
niter += 1
return x
And our original vector method:
def solve_vect(data, target, itermax = 100, tolerance = 1e-8):
A = data.T # data
b = data.T # target
rows, columns = data.shape
x = np.zeros((columns,)) # Initial guess. Can be anything
resid = b - A # x
resid_new = b - A # x
p = resid
niter = 0
while (np.abs(resid) > tolerance).any() and niter < itermax:
Ap = A # p
alpha = (resid.T # resid) / (p.T # Ap)
x = x + alpha * p
resid_new = resid - alpha * Ap
beta = (resid_new.T # resid_new) / (resid.T # resid)
p = resid_new + beta * p
resid = resid_new + 0
niter += 1
return x
Let's solve a simple system to see if this works first:
2x1 + x2 = -5
−x1 + x2 = -2
should give a solution of [-1, -3]
data = np.array([[ 2, 1],
[-1, 1]])
target = np.array([-5, -2])
print(solve_loopy(data, target))
print(solve_vect(data, target))
Both give the correct solution [-1, -3], yay! Now on to bigger things:
data = np.random.random((100, 100))
target = np.random.random((100, ))
Let's ensure the solution is still correct:
sol1 = solve_loopy(data, target)
np.allclose(data # sol1, target)
# Output: False
sol2 = solve_vect(data, target)
np.allclose(data # sol2, target)
# Output: False
Hmm, looks like the CG method doesn't work for badly conditioned random matrices we created. Well, at least both give the same result.
np.allclose(sol1, sol2)
# Output: True
But let's not get discouraged! We don't really care if it works perfectly, the point of this is to demonstrate how amazing vectorization is. So let's time this:
import timeit
timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
# Output: 0.25586539999994784
timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
# Output: 0.12008900000000722
Nice! A ~2x speedup simply by avoiding a loop while updating our solution!
For larger systems, this will be even better.
for N in [10, 50, 100, 500, 1000]:
data = np.random.random((N, N))
target = np.random.random((N, ))
t_loopy = timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
t_vect = timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
print(N, t_loopy, t_vect, t_loopy/t_vect)
This gives us:
N t_loopy t_vect speedup
00010 0.002823 0.002099 1.345390
00050 0.051209 0.014486 3.535048
00100 0.260348 0.114601 2.271773
00500 0.980453 0.240151 4.082644
01000 1.769959 0.508197 3.482822

Python: Fastest way to subtract elements of datasets of HDF5 file?

Here is one interesting problem.
Input: Input is two arrays (Nx4, sorted in column-2) stored in datasets-1 and 2 in HDF5 file (input.h5). N is huge (originally belonging to 10 GB of file, hence stored in HDF5 file).
Output: Subtracting each column-2 element of dataset-2 from dataset-1, such that the difference (delta) is between +/-4000. Eventually saving this info in dset of a new HDF5 file. I need to refer to this new file back-and-forth, hence HDF5 not a text file.
Concern: I initially used .append method but that crashed the execution for 10GBs input. So, I am now using dset.resize method (and would like to stick to it preferably). I am also using binary search as I was told in one of my last posts. So now, although the script seems to be working for large (10 GBs) of datasets, it is quite slow! The subtraction (for/while) loop is possibly the culprit! Any suggestions on how I can make this fast? I aim to use the fastest approach (and possibly the simplest, since I am a beginner).
import numpy as np
import time
import h5py
import sys
import csv
f_r = h5py.File('input.h5', 'r+')
dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1,c1 = dset1.shape
r2,c2 = dset2.shape
left, right, count = 0,0,0
W = 4000 # Window half-width
n = 1
# **********************************************
# HDF5 Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
for j in range(r1):
e1 = dset1[j,1]
# move left pointer so that is within -delta of e
while left < r2 and dset2[left,1] - e1 <= -W:
left += 1
# move right pointer so that is outside of +delta
while right < r2 and dset2[right,1] - e1 <= W:
right += 1
for i in range(left, right):
delta = e1 - dset2[i,1]
dset.resize(dset.shape[0] + n, axis=0)
dset[count, 0:4] = [count, dset1[j,1], dset2[i,1], delta]
count += 1
print("\nFinal shape of dataset created: " + str(dset.shape))
f_w.close()
EDIT (Aug 8, chunking HDF5 file as suggested by #kcw78)
#kcw78: So, I tried chunking as well. The following works well for small files (<100MB) but the computation time increases incredibly when I play with GBs of data. Can something be improvised in my code to make it fast?
My suspicion is for j loop is computationally expensive and may be the reason, any suggestions ?
filename = 'file.h5'
with h5py.File(filename, 'r') as fid:
chunks1 = fid["dataset_1"][:, :]
with h5py.File(filename, 'r') as fid:
chunks2 = fid["dataset_2"][:, :]
print(chunks1.shape, chunks2.shape) # shape is (13900,4) and (138676,4)
count = 0
W = 4000 # Window half-width
# **********************************************
# HDF5-Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
# chunk size to read from first/second dataset
size1 = 34850
size2 = 34669
# save "n" no. of subtracted values in dset
n = 10**4
u = 0
fill_index = 0
for c in range(4): # read 4 chunks of dataset-1 one-by-one
h = c * size1
chunk1 = chunks1[h:(h + size1)]
for d in range(4): # read chunks of dataset-2
g = d * size2
chunk2 = chunks2[g:(g + size2)]
r2 = chunk2.shape[0]
left, right = 0, 0
for j in range(chunk1.shape[0]): # grab col.2 values from dataset-1
e1 = chunk1[j, 1]
while left < r2 and chunk2[left, 1] - e1 <= -W:
left += 1
# move right pointer so that is outside of +delta
while right < r2 and chunk2[right, 1] - e1 <= W:
right += 1
for i in range(left, right):
if chunk1[j, 0]<8193 and chunk2[i, 0] <8193:
e2 = chunk2[i, 1]
delta = e1 - e2 # subtract col.2 values
count += 1
if fill_index == (n):
dset.resize(dset.shape[0] + n, axis=0)
dset[u:(u + n), 0:4] = [count, e1, e1, delta]
u = u * n
fill_index = 0
fill_index += 1
del chunk2
del chunk1
f_w.close()
print(count) # these are (no. of) subtracted values such that the difference is between +/- 4000
EDIT (Jul 31)
I tried reading in chunks and even using memory mapping. It is efficient if I do not perform any subtraction and just go through the chunks. The for j in range(m): is the one that is inefficient; probably because I am grabbing each value of the chunk from file-1. This is when I am just subtracting and not saving the difference. Any better logic/implementation you can think of that can be replaced for "for j in range(m):?
size1 = 100_000_0
size2 = 100_000_0
filename = ["file-1.txt", "file-2.txt"]
chunks1 = pd.read_csv(filename[0], chunksize=size1,
names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1,4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2,4))
for chunk1 in chunks1: # grab chunks from file-1
m, _ = chunk1.shape
fp1[0:m,:] = chunk1
chunks2 = pd.read_csv(filename[1], chunksize=size2,
names=['ch', 'tmstp', 'lt', 'rt'])
for chunk2 in chunks2: # grab chunks from file-2
k, _ = chunk2.shape
fp2[0:k, :] = chunk2
for j in range(m): # Grabbing values from file-1's chunk
e1 = fp1[j,1]
delta_mat = e1 - fp2 # just a test, actually e1 should be subtracted from col-2 of fp2, not the whole fp2
count += 1
fp2.flush()
a += k
fp1.flush()
del chunks2
i += m
prog_count += m

Plotting Average Length of Brownian Motion Realization

I have a function for a brownian motion:
mu , sig = 0 , 1 # normal dist
mu_s = 0 # mu in SDE
sig_s = 1 #sig in SDE
S0 = 10 # starting price of stock
n , m = 1000, 20 # paths = n = how many simulations, m for discritization
T = 1 # year
dt = 1 # each dt is one day
def ABM(n,m,S0,mu,sigma,dt):
np.random.seed(999)
mu_s = mu # mu in SDE
sig_s = sigma #sig in SDE
S0 = S0 # starting price of stock
n , m = n, m # paths = n = how many simulations, m for discritization
sig_db = sig_s*np.sqrt(dt)*np.random.normal(mu, sigma, (n,m+1))
mu_dt = mu_s*dt*np.ones([n,m+1])
sig_db[:,0] = 0 # set first column to zero
mu_dt[:,0] = 0
dS = mu_dt + sig_db
S = S0 + np.cumsum(dS,axis=1)
return n,m,S
n,m,S = ABM(1000,20,10,0,1,1)
Which works fine for plotting separate realizations on one plot:
index = np.arange(0,m+1)*np.ones([n,m+1]) # create indices as S_0, S_1, S_2
plt.plot(index.T,S.T)
but now I'd like to plot the average path length of those realizations for each time step and I'm not sure how to go about it. The expectation of arithmetic brownian motion is E(S)=S_0 + \mu*t which leads me to think I should be using np.mean() in some way but I can't seem to get it.
TIA
The matrix S consists of n realizations, you get the E(S(t)) by averaging along the realizations, i.e.
EE = np.mean(S, axis = 0)
Similarly, you can get the variance, also a function of time, via
np.mean((S - EE)**2, axis = 0)

create set of randomized column names in pandas dataframe

I am trying to create a set of columns (within panda dataframe) where the column names are randomized. This is because I want to generate filter data from a larger data-set in a randomized fashion.
How can I generate an N (= 4) * 3 set of column names as per below?
car_speed state_8 state_17 state_19 state_16 wd_8 wd_17 wd_19 wd_16 wu_8 wu_17 wu_19 wu_16
My potential code below, but doesn't really work. I need the blocks'state_' first, then 'wd_', and then 'wd_'. My code below generates 'state_', 'wd_', 'wu_' individually in consecutive order. I have problems further on, when it is in that order, of filling in the data from the larger data-set
def iteration1(data, classes = 50, sigNum = 4):
dataNN = pd.DataFrame(index = [0])
dataNN['car_speed'] = np.zeros(1)
while len(dataNN.columns) < sigNum + 1:
state = np.int(np.random.uniform(0, 50))
dataNN['state_'+str(state)] = np.zeros(1) # this is the state value set-up
dataNN['wd_' + str(state)] = np.zeros(1) # this is the weight direction
dataNN['wu_' + str(state)] = np.zeros(1) # this is the weight magnitude
count = 0 # initialize count row as zero
while count < classes :
dataNN.loc[count] = np.zeros(len(dataNN.columns))
for state in dataNN.columns[1:10]:
dataNN[state].loc[count] = data[state].loc[count]
count = count + 1
if count > classes : break
return dataNN
Assuming the problem you have is lack of grouping of "state_*", "wd_*", and "wu_*" I suggest that you first select sigNum / 3 random ints and then use them to label the columns. Like the following:
states = [np.int(np.random.uniform(0, 50)) for _ in range (sigNum/3)]
i = 0
while len(dataNN.columns) <= sigNum:
state = states[i]
i += 1
dataNN['state_'+str(state)] = np.zeros(1) # this is the state value set-up
dataNN['wd_' + str(state)] = np.zeros(1) # this is the weight direction
dataNN['wu_' + str(state)] = np.zeros(1) # this is the weight magnitude
import random
import pandas as pd
def iteration1(data, classes = 5, subNum = 15):
dataNN = pd.DataFrame(index = [0])
dataNN['car_speed'] = np.zeros(1)
states = random.sample(range(50), sub_sig)
for i in range(0, sub_sig, 1):
dataNN['state_'+str(states[i])] = np.zeros(1) # this is the state value set-up
for i in range(0, subNum, 1):
dataNN['wd_' + str(states[i])] = np.zeros(1) # this is the weight direction
for i in range(0, subNum, 1):
dataNN['wu_' + str(states[i])] = np.zeros(1) # this is the weight magnitude
return dataNN

I am trying to write a function that takes the summation of an array of arrays to produce one final array- Python

I am trying to do a summation of a set of arrays calculated by two different sets of values. The first set are my angles
''' Next are the theta values, starting from Z1 '''
x = np.array([30,-30,0,0,-30,30])
theta = x*np.pi/180
now the second set of values are seven numbers created by this function
'''
This function calculates the lamina spacing and outputs it as a list
'''
def Layers(k,N,H,h):
list1 = []
for k in range(1, N+2):
result = (-H/2) + (k-1)*h
k = k+1
list1.append(result)
return list1
k =1
Zl = Layers(k,N,H,h)
'''Convert the list into an array'''
Z = np.array(Zl)
print(Z)
which gives my second set of values Z
[-0.00045 -0.0003 -0.00015 0. 0.00015 0.0003 0.00045]
The other function used to set up the problem is Qbar function:
def Qbar(theta):
T = np.array([(np.cos(theta)**2, np.sin(theta)**2,2*np.sin(theta)*np.cos(theta)),
(np.sin(theta)**2,np.cos(theta)**2,-2*np.sin(theta)*np.cos(theta)),
(-np.sin(theta)*np.cos(theta),np.sin(theta)*np.cos(theta),
np.cos(theta)**2-np.sin(theta)**2)])
Sbar = np.dot(np.dot(np.transpose(T),Sr),T)
Qbar = np.linalg.inv(Sbar)
return Qbar
The trouble comes in when I try to use this function. Each A matrix is dependent on the Qbar(theta)*Z value. I am trying to sum ALL of the A a matrixes to get a net A matrix. What I tried to do here was write a function that systematically went through and calculated each of the 6 A matrixes for each of the 6 angles, and then sum them up and spit out the result.
def Amatrix(Qbar,theta,Z,i):
list1 = []
for i in range(0,np.prod(theta.shape)):
result = Qbar(theta[i])*(Z[i]-Z[i + 1])
list1.append(result)
i = i + 1
Am = np.array(list1)
A = np.sum(Am)
return A
i = 0
A = Amatrix(Qbar,theta,Z,i)
print(A)
I want to use the size of my theta array as the limit for the counter which is why I put np.shape(theta) in range. The result of running this code gives me
-176734053.14
For anyone who knows mathcad the algorithms, setup of the matricies, and the result can be seen below: m is cos(theta), n is sin(theta)

Resources