Spark: Running Backwards Elimination By P-Value With Linear Regressions - apache-spark

I presently have a Spark Dataframe with 2 columns:
1) a column where each row contains a vector of predictive features
2) a column containing the value to be predicted.
To discern the most predictive features for use in a later model, I am using backwards elimination by P-value, as outlined by this article. Below is my code:
num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
for i in range(0, num_vars):
    model = LinearRegression(featuresCol="filtered_features", labelCol="averageScore")
    model = model.fit(scoresDf)
    p_values = model.summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        drop_max_index_udf = udf(lambda elem, drop_index, var_count:
            Vectors.dense([elem[j] for j in range(var_count) if j not in [drop_index]]), VectorUDT())
        scoresDfs = scoresDf.withColumn("filtered_features", drop_max_index_udf(scoresDf["filtered_features"],
            lit(max_index), lit(num_vars)))
        num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
The code runs, but the only problem is that every iteration takes drastically longer than the last. Based on the answer to this question, it appears that the code is re-evaluating all prior iterations every time.
Ideally, I would like to feed the entire logic into some Pipeline structure that would store it all lazily and then execute sequentially with no repeats when called upon, but I am unsure as to whether that is even possible given that none of Spark's estimator / transformer functions seem to fit this use case.
Any guidance would be appreciated, thanks!

You are creating the model repeatedly inside a loop. It is a time consuming process and needs to be done once per training data set and a set of parameters. Try the following -
num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
modelAlgo = LinearRegression(featuresCol="filtered_features", labelCol="averageScore")
model = modelAlgo.fit(scoresDf)
for i in range(0, num_vars):
    p_values = model.summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        drop_max_index_udf = udf(lambda elem, drop_index, var_count:
            Vectors.dense([elem[j] for j in range(var_count) if j not in [drop_index]]), VectorUDT())
        scoresDfs = scoresDf.withColumn("filtered_features", drop_max_index_udf(scoresDf["filtered_features"],
            lit(max_index), lit(num_vars)))
        num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
Once you are happy with the model you save it. When you need to evaluate your data just read this model and predict with it.

Why are you calling
model = model.fit(scoresDf)
when scoresDfs contains your new DataFrame with one fewer independent variable?
If you change your code with the following:
independent_vars = ['x0', 'x1', 'x2', 'x3', 'x4']
def remove_element(array, index):
    return Vectors.dense(np.delete(array, index, 0))
remove_element_udf = udf(lambda a, i: remove_element(a, i), VectorUDT())
max_p = 1
i = 0
while (max_p > 0.05):
    model = LinearRegression(featuresCol="filtered_features",
                             labelCol="averageScore",
                             fitIntercept=False)
    model = model.fit(scoresDf)
    print('iteration: ', i)
    summary = model.summary
    summary_df = pd.DataFrame({
        'var': independent_vars,
        'coeff': model.coefficients,
        'se': summary.coefficientStandardErrors,
        'p_value': summary.pValues
    })
    print(summary_df)
    print("r2: %f" % summary.r2)
    p_values = summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        max_var = independent_vars[max_index]
        print('-> max_index {max_index}, corresponding to var {var}'.format(max_index=max_index, var=max_var))
        scoresDf = scoresDf.withColumn("filtered_features", remove_element_udf(scoresDf["filtered_features"],
                                                                               lit(max_index)))
        independent_vars = np.delete(independent_vars, max_index, 0)
    print()
    i += 1
you will get
iteration: 0
  var     coeff        se   p_value
0  x0  0.174697  0.207794  0.402616
1  x1 -0.448982  0.203421  0.029712
2  x2 -0.452940  0.233972  0.055856
3  x3 -3.213578  0.209935  0.000000
4  x4  3.790730  0.212917  0.000000
r2: 0.870330
-> max_index 0, corresponding to var x0

iteration: 1
  var     coeff        se   p_value
0  x1 -0.431835  0.202087  0.035150
1  x2 -0.460711  0.233432  0.051297
2  x3 -3.218725  0.209525  0.000000
3  x4  3.768661  0.210970  0.000000
r2: 0.869365
-> max_index 1, corresponding to var x2

iteration: 2
  var     coeff        se   p_value
0  x1 -0.479803  0.203592  0.020449
1  x3 -3.344830  0.202501  0.000000
2  x4  3.669419  0.207925  0.000000
r2: 0.864065
In the first and second iterations, the two independent variables with a p-value greater than 0.05 (x0, then x2) are removed.
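The elimination loop itself can also be sketched outside Spark. The following is only a plain NumPy stand-in, not the PySpark API: the helper names ols_p_values and backward_eliminate are made up for illustration, the synthetic data is invented, and the p-values use a normal approximation to the t distribution (an assumption that is reasonable only for large samples). It shows the two things the loop has to keep in sync: the feature matrix and the list of variable names.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Least-squares fit without intercept; two-sided p-values.

    Assumption: a normal approximation replaces the t distribution,
    which is only reasonable when there are many more rows than columns.
    """
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)            # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)       # coefficient covariance
    t_stats = coef / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

def backward_eliminate(X, y, names, threshold=0.05):
    """Refit after every removal and drop the column with the largest p-value."""
    names = list(names)
    while names:
        p = ols_p_values(X, y)
        worst = int(np.argmax(p))
        if p[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)         # drop the column ...
        del names[worst]                        # ... and its name together
    return names

# Hypothetical data: only x2 and x3 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 2] - 2.0 * X[:, 3] + rng.normal(size=500)
kept = backward_eliminate(X, y, ["x0", "x1", "x2", "x3"])
print(kept)
```

The important detail, as in the answer above, is that the model is refit on the reduced feature set at every step; fitting once and reusing the same p-values would remove variables based on stale statistics.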

Related

Speed Up a for Loop - Python

I have a code that works perfectly well but I wish to speed up the time it takes to converge. A snippet of the code is shown below:
def myfunction(x, i):
    y = x + (min(0, target[i] - data[i, :] @ x)) * data[i] / (norm(data[i])**2)
    return y
rows, columns = data.shape
start = time.time()
iterate = 0
iterate_count = []
norm_count = []
res = 5
x_not = np.ones(columns)
norm_count.append(norm(x_not))
iterate_count.append(0)
while res > 1e-8:
    for row in range(rows):
        y = myfunction(x_not, row)
        x_not = y
    iterate += 1
    iterate_count.append(iterate)
    norm_count.append(norm(x_not))
    res = abs(norm_count[-1] - norm_count[-2])
print('Converge at {} iterations'.format(iterate))
print('Duration: {:.4f} seconds'.format(time.time() - start))
I am relatively new in Python. I will appreciate any hint/assistance.
Ax=b is the problem we wish to solve. Here, 'A' is the 'data' and 'b' is the 'target'
Ugh! After spending a while on this I don't think it can be done the way you've set up your problem. In each iteration over the row, you modify x_not and then pass the updated result to get the solution for the next row. This kind of setup can't be vectorized easily. You can learn the thought process of vectorization from the failed attempt, so I'm including it in the answer. I'm also including a different iterative method to solve linear systems of equations. I've included a vectorized version -- where the solution is updated using matrix multiplication and vector addition, and a loopy version -- where the solution is updated using a for loop to demonstrate what you can expect to gain.
1. The failed attempt
Let's take a look at what you're doing here.
def myfunction(x, i):
    y = x + (min(0, target[i] - data[i, :] @ x)) * (data[i] / (norm(data[i])**2))
    return y
You subtract
the dot product of (the ith row of data and x_not)
from the ith row of target,
limited at zero.
You multiply this result with the ith row of data divided by the norm of that row squared. Let's call this part2
Then you add this to x_not
Now let's look at the shapes of the matrices.
data is (M, N).
target is (M, ).
x_not is (N, )
Instead of doing these operations rowwise, you can operate on the entire matrix!
1.1. Simplifying the dot product.
Instead of doing data[i, :] @ x, you can do data @ x_not and this gives an array with the ith element giving the dot product of the ith row with x_not. So now we have data @ x_not with shape (M, )
Then, you can subtract this from the entire target array, so target - (data @ x_not) has shape (M, ).
So far, we have
part1 = target - (data @ x_not)
Next, if anything is greater than zero, set it to zero.
part1[part1 > 0] = 0
1.2. Finding rowwise norms.
Finally, you want to multiply this by the row of data, and divide by the square of the L2-norm of that row. To get the norm of each row of a matrix, you do
rownorms = np.linalg.norm(data, axis=1)
This is a (M, ) array, so we need to convert it to a (M, 1) array so we can divide each row. rownorms[:, None] does this. Then divide data by this.
part2 = data / (rownorms[:, None]**2)
1.3. Add to x_not
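Finally, we're adding each row of part1 * part2 to the original x_not and returning the result. Since part1 has shape (M, ) and part2 has shape (M, N), part1 needs a trailing axis so that it scales each row:
result = x_not + (part1[:, None] * part2).sum(axis=0)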
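Here's where we get stuck. In your approach, each call to myfunction() gives a value of part1 that depends on x_not, which was changed in the previous call to myfunction(). The one-shot expression above computes every element of part1 from the same x_not, so it is not equivalent to your loop.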
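To make the difference concrete, here is a small sketch comparing one sweep of the row-by-row update with the one-shot matrix expression. The function names sequential_sweep and simultaneous_update and the 2x2 example values are made up for illustration.

```python
import numpy as np

def sequential_sweep(data, target, x):
    """One pass over the rows; each row sees the x updated by the previous row."""
    x = np.asarray(x, dtype=float).copy()
    for i in range(data.shape[0]):
        step = min(0.0, target[i] - data[i] @ x)
        x = x + step * data[i] / (np.linalg.norm(data[i]) ** 2)
    return x

def simultaneous_update(data, target, x):
    """All rows use the same starting x, as in the vectorized attempt."""
    part1 = np.minimum(0.0, target - data @ x)
    part2 = data / (np.linalg.norm(data, axis=1)[:, None] ** 2)
    return x + (part1[:, None] * part2).sum(axis=0)

data = np.array([[1.0, 0.0],
                 [1.0, 1.0]])
target = np.array([0.0, 0.0])
x0 = np.array([1.0, 1.0])

seq = sequential_sweep(data, target, x0)
sim = simultaneous_update(data, target, x0)
print(seq, sim)  # the two results differ
```

The second row's correction in the sequential version is computed after the first row has already moved x, which is exactly the dependency that prevents a straightforward vectorization.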
2. Why vectorize?
Using numpy's inbuilt methods instead of looping allows it to offload the calculation to its C backend, so it runs faster. If your numpy is linked to a BLAS backend, you can extract even more speed by using your processor's SIMD registers.
The conjugate gradient method is a simple iterative method to solve certain systems of equations. There are other more complex algorithms that can solve general systems well, but this should do for the purposes of our demo. Again, the purpose is not to have an iterative algorithm that will perfectly solve any linear system of equations, but to show what kind of speedup you can expect if you vectorize your code.
Given your system
data @ x_not = target
Let's define some variables:
A = data.T @ data
b = data.T @ target
And we'll solve the system A @ x = b
x = np.zeros((columns,)) # Initial guess. Can be anything
resid = b - A @ x
p = resid
while (np.abs(resid) > tolerance).any():
    Ap = A @ p
    alpha = (resid.T @ resid) / (p.T @ Ap)
    x = x + alpha * p
    resid_new = resid - alpha * Ap
    beta = (resid_new.T @ resid_new) / (resid.T @ resid)
    p = resid_new + beta * p
    resid = resid_new + 0
To contrast the fully vectorized approach with one that uses iterations to update the rows of x and resid_new, let's define another implementation of the CG solver that does this.
def solve_loopy(data, target, itermax = 100, tolerance = 1e-8):
    A = data.T @ data
    b = data.T @ target
    rows, columns = data.shape
    x = np.zeros((columns,)) # Initial guess. Can be anything
    resid = b - A @ x
    resid_new = b - A @ x
    p = resid
    niter = 0
    while (np.abs(resid) > tolerance).any() and niter < itermax:
        Ap = A @ p
        alpha = (resid.T @ resid) / (p.T @ Ap)
        for i in range(len(x)):
            x[i] = x[i] + alpha * p[i]
            resid_new[i] = resid[i] - alpha * Ap[i]
        # resid_new = resid - alpha * A @ p
        beta = (resid_new.T @ resid_new) / (resid.T @ resid)
        p = resid_new + beta * p
        resid = resid_new + 0
        niter += 1
    return x
And our original vector method:
def solve_vect(data, target, itermax = 100, tolerance = 1e-8):
    A = data.T @ data
    b = data.T @ target
    rows, columns = data.shape
    x = np.zeros((columns,)) # Initial guess. Can be anything
    resid = b - A @ x
    resid_new = b - A @ x
    p = resid
    niter = 0
    while (np.abs(resid) > tolerance).any() and niter < itermax:
        Ap = A @ p
        alpha = (resid.T @ resid) / (p.T @ Ap)
        x = x + alpha * p
        resid_new = resid - alpha * Ap
        beta = (resid_new.T @ resid_new) / (resid.T @ resid)
        p = resid_new + beta * p
        resid = resid_new + 0
        niter += 1
    return x
Let's solve a simple system to see if this works first:
2x1 + x2 = -5
−x1 + x2 = -2
should give a solution of [-1, -3]
data = np.array([[ 2, 1],
[-1, 1]])
target = np.array([-5, -2])
print(solve_loopy(data, target))
print(solve_vect(data, target))
Both give the correct solution [-1, -3], yay! Now on to bigger things:
data = np.random.random((100, 100))
target = np.random.random((100, ))
Let's ensure the solution is still correct:
sol1 = solve_loopy(data, target)
np.allclose(data @ sol1, target)
# Output: False
sol2 = solve_vect(data, target)
np.allclose(data @ sol2, target)
# Output: False
Hmm, looks like the CG method doesn't work for badly conditioned random matrices we created. Well, at least both give the same result.
np.allclose(sol1, sol2)
# Output: True
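One way to check the conditioning claim is to look at condition numbers directly. The solver above works on A = data.T @ data, and forming that product squares the 2-norm condition number of data. The snippet below is a minimal sketch with a seeded random matrix chosen only for illustration; the exact values will differ from the unseeded runs in this answer.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100, 100))

cond_data = np.linalg.cond(data)             # condition number of the original system
cond_normal = np.linalg.cond(data.T @ data)  # condition number of the system CG actually solves

print(f"cond(data)          = {cond_data:.3e}")
print(f"cond(data.T @ data) = {cond_normal:.3e}")
```

A large value for the second number suggests why 100 iterations are not enough for the residual to drop below the tolerance.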
But let's not get discouraged! We don't really care if it works perfectly, the point of this is to demonstrate how amazing vectorization is. So let's time this:
import timeit
timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
# Output: 0.25586539999994784
timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
# Output: 0.12008900000000722
Nice! A ~2x speedup simply by avoiding a loop while updating our solution!
For larger systems, this will be even better.
for N in [10, 50, 100, 500, 1000]:
    data = np.random.random((N, N))
    target = np.random.random((N, ))
    t_loopy = timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
    t_vect = timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
    print(N, t_loopy, t_vect, t_loopy/t_vect)
This gives us:
N t_loopy t_vect speedup
00010 0.002823 0.002099 1.345390
00050 0.051209 0.014486 3.535048
00100 0.260348 0.114601 2.271773
00500 0.980453 0.240151 4.082644
01000 1.769959 0.508197 3.482822

Stochastic Gradient Descent implementation in Python from scratch. is the implementation correct?

I know this would seem similar to a lot of questions asked previously on the same topic. I have surveyed most of them but they don't quite answer my question. My problem is that my gradient is not converging to optima, it is rather diverging and oscillating at even very low values of alpha.
My data generation function is below
X = [[float(np.random.randn(1)) for i in range(0,100)] for j in range(0,5)]
X = np.array(X).transpose()
Y = [float(0) for i in range(0,100)]
Y = 2*X[:,0] + 3*X[:,1] + 1*X[:,2] + 4*X[:,3] + 1*X[:,4] + 5
fig, ax = plt.subplots(1,5)
fig.set_size_inches(20,5)
k = 0
for j in range(0,5):
    sns.scatterplot(X[:,k],Y,ax=ax[j])
    k += 1
My SGD implementation is as below
def multilinreg(X,Y,epsilon = 0.000001,alpha = 0.01,K = 20):
    Xnot = [[1] for i in range(0,len(X))]
    Xnot = np.array(Xnot)
    X = np.append(Xnot,X, axis = 1)
    vars = X.shape[1]
    W = []
    W = [np.random.normal(1) for i in range(vars)]
    W = np.array(W)
    J = 0
    for i in range(len(X)):
        Yunit = 0
        for j in range(vars):
            Yunit = Yunit + X[i,j] * W[j]
            J = J + (0.5/(len(X)))*((Y[i]-Yunit)**2)
    err = 1
    iter = 0
    Weights = []
    Weights.append(W)
    Costs = []
    while err > epsilon:
        index = [np.random.randint(len(Y)) for i in range(K)]
        Xsample, Ysample = X[index,:], Y[index]
        m =len(Xsample)
        Ypredsample = []
        for i in range(len(Xsample)):
            Yunit = 0
            for j in range(vars):
                Yunit = Yunit + X[i,j] * W[j]
            Ypredsample.append(Yunit)
        Ypredsample = np.array(Ypredsample)
        for i in range(len(Xsample)):
            for j in range(vars):
                gradJunit = (-1)*(Xsample[i,j]*(Ysample[i] - Ypredsample[i]))
                W[j] = W[j] - alpha*gradJunit
        Jnew = 0
        for i in range(len(Xsample)):
            Yunit = 0
            for j in range(vars):
                Yunit = Yunit + Xsample[i,j]*W[j]
                Jnew = Jnew + (0.5/(len(Xsample)))*((Ysample[i]-Yunit)**2)
        Weights.append(W)
        err = abs(float(Jnew - J))
        J = Jnew
        Costs.append(J)
        iter += 1
        if iter % 1000 == 0:
            print(iter)
            print(J)
    Costs = np.array(Costs)
    Ypred = []
    for i in range(len(X)):
        Yunit = 0
        for j in range(vars):
            Yunit = Yunit + X[i,j] * W[j]
        Ypred.append(Yunit)
    Ypred = np.array(Ypred)
    return Ypred, iter, Costs, W
The hyperparameters are as below
epsilon = 1*(10)**(-20)
alpha = 0.0000001
K = 50
I don't think that it is a data issue. I am using a fairly straightforward linear function.
I think it is the equations but I have double checked them as well and they seem to be fine to me.
Several things are to be corrected in your implementation (most of them for efficiency reasons). Of course, you would gain time by simply defining w = np.array([5, 2, 3, 1, 4, 1]), but this does not answer the question as to why your SGD implementation does not work.
First of all, you define X by doing:
X = [[float(np.random.randn(1)) for i in range(0,100)] for j in range(0,5)]
X = np.array(X).transpose()
A faster way of doing this operation is simply by doing:
X = np.random.randn(100, 5)
Then, you define Y:
Y = [float(0) for i in range(0,100)]
Y = 2*X[:,0] + 3*X[:,1] + 1*X[:,2] + 4*X[:,3] + 1*X[:,4] + 5
The first initialisation Y = [float(0) for i in range(0,100)] is useless, since you instantly override Y with the second line. A more condensed way of writing this line could also have been:
Y = X @ np.array([2, 3, 1, 4, 1]) + 5
Now, concerning your SGD implementation. The lines:
Xnot = [[1] for i in range(0,len(X))]
Xnot = np.array(Xnot)
X = np.append(Xnot,X, axis = 1)
can be rewritten more efficiently as:
X = np.hstack((np.ones(len(X)).reshape(-1, 1), X))
Similarly, the lines
W = []
W = [np.random.normal(1) for i in range(vars)]
W = np.array(W)
can be rewritten using numpy functions. Note that the first line W = [] is useless, since you override W immediately after without using it. np.random.normal can directly generate more than 1 sample using the size keyword argument. Plus, note that when using np.random.normal(1), you're sampling from the normal distribution with mean 1 and std 1, while you probably want to sample from the normal distribution with mean 0 and std 1. Hence, you should define:
W = np.random.normal(size=vars)
Yunit is the prediction you make using W. By definition, you can compute it by doing the following:
Yunit = X @ W
which avoids the nested for loops. The way you compute J is strange though. If I'm not mistaken, J corresponds to your loss function. However, the formula for J, assuming an MSE loss, is J = 0.5 * sum from k=1 to len(X) of (y_k - w*x_k) ** 2. Hence, these two nested for loops can be rewritten as:
Yunit = X @ W
J = 0.5 * np.sum((Y - Yunit) ** 2)
As a side remark: naming err that way may be misleading, since the error is generally the cost, while it denotes the progress made at each step here. The lines:
Weights = []
Weights.append(W)
can be rewritten as:
Weights = [W]
It would also be logical to add J to your Costs list, since this is the one which corresponds to W:
Costs = [J]
Since you want to perform a stochastic gradient descent, there is no need to pick at random which samples you want to take from your dataset. You have two choices: either you update your weights at each sample, or you can compute the gradient of J w.r.t. your weights. The latter is a bit simpler to implement and generally converges more gracefully than the former. However, since you chose the former, this is the one I'll be working with. Note that even in this version, you don't have to pick your samples at random, but I'll be using the same method as you since this should also work. Concerning your sampling, I think it is better to ensure that you don't take the same index twice though. Hence, you may want to define index like this:
index = np.random.choice(np.arange(len(Y)), size=K, replace=False)
m is useless, since it will always be equal to K in this case. You should keep it if you perform a sampling without ensuring that you don't have the same index twice though. If you want to perform a sampling without checking you sampled the same index twice, just put replace=True in the choice function.
Once again, you can use matrix multiplication to compute Yunit more efficiently. Hence, you can replace:
Ypredsample = []
for i in range(len(Xsample)):
    Yunit = 0
    for j in range(vars):
        Yunit = Yunit + X[i,j] * W[j]
    Ypredsample.append(Yunit)
by:
Ypredsample = Xsample @ W
Similarly, you can compute your weights update using numpy functions. Hence, you can replace:
for i in range(len(Xsample)):
    for j in range(vars):
        gradJunit = (-1)*(Xsample[i,j]*(Ysample[i] - Ypredsample[i]))
        W[j] = W[j] - alpha*gradJunit
by:
W -= alpha * np.sum((Ypredsample - Ysample).reshape(-1, 1) * Xsample, axis=0)
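As a quick sanity check, the nested-loop update and the vectorized update can be compared on small random inputs. This is a minimal sketch: the array sizes, the seed, and the names W0, W_loop, W_vec and W_mat are chosen here only for illustration. The last form, using Xsample.T @ residual, is an equivalent way of writing the same sum.

```python
import numpy as np

rng = np.random.default_rng(1)
Xsample = rng.normal(size=(6, 3))
Ysample = rng.normal(size=6)
W0 = rng.normal(size=3)
alpha = 0.01

# Predictions are computed once, before any weight is changed.
Ypredsample = Xsample @ W0

# Nested-loop update, as in the question.
W_loop = W0.copy()
for i in range(len(Xsample)):
    for j in range(Xsample.shape[1]):
        gradJunit = -Xsample[i, j] * (Ysample[i] - Ypredsample[i])
        W_loop[j] -= alpha * gradJunit

# Vectorized update from the answer.
W_vec = W0 - alpha * np.sum((Ypredsample - Ysample).reshape(-1, 1) * Xsample, axis=0)

# Same quantity written as a matrix-vector product.
W_mat = W0 - alpha * (Xsample.T @ (Ypredsample - Ysample))

print(np.allclose(W_loop, W_vec), np.allclose(W_vec, W_mat))
```

The forms agree because the predictions are fixed before the loop starts, so the per-sample updates simply add up.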
Like before, computing your cost can be done using matrix multiplication. Note however that you should compute J over your whole dataset. Hence, you should replace:
Jnew = 0
for i in range(len(Xsample)):
    Yunit = 0
    for j in range(vars):
        Yunit = Yunit + Xsample[i,j]*W[j]
        Jnew = Jnew + (0.5/(len(Xsample)))*((Ysample[i]-Yunit)**2)
by:
Jnew = 0.5 * np.sum((Y - X @ W) ** 2)
Finally, you can use matrix multiplication to make your predictions. Hence, your final code should look like this:
import numpy as np
X = np.random.randn(100, 5)
Y = X @ np.array([2, 3, 1, 4, 1]) + 5
def multilinreg(X, Y, epsilon=0.00001, alpha=0.01, K=20):
    X = np.hstack((np.ones(len(X)).reshape(-1, 1), X))
    vars = X.shape[1]
    W = np.random.normal(size=vars)
    Yunit = X @ W
    J = 0.5 * np.sum((Y - Yunit) ** 2)
    err = 1
    Weights = [W.copy()]
    Costs = [J]
    iter = 0
    while err > epsilon:
        index = np.random.choice(np.arange(len(Y)), size=K, replace=False)
        Xsample, Ysample = X[index], Y[index]
        Ypredsample = Xsample @ W
        W -= alpha * np.sum((Ypredsample - Ysample).reshape(-1,1) * Xsample, axis=0)
        Jnew = 0.5 * np.sum((Y - X @ W) ** 2)
        Weights.append(W.copy())  # store a copy, since W is updated in place
        err = abs(Jnew - J)
        J = Jnew
        Costs.append(J)
        iter += 1
        if iter % 10 == 0:
            print(iter)
            print(J)
    Costs = np.array(Costs)
    Ypred = X @ W
    return Ypred, iter, Costs, W
Running it returns W=array([4.99956786, 2.00023614, 3.00000213, 1.00034205, 3.99963732, 1.00063196]) in 61 iterations with a final cost of 3.05e-05.
Now that we know that this code is correct, we can use it to determine where yours went wrong. In this piece of code:
for i in range(len(Xsample)):
    Yunit = 0
    for j in range(vars):
        Yunit = Yunit + X[i,j] * W[j]
    Ypredsample.append(Yunit)
Ypredsample = np.array(Ypredsample)
you use X[i, j] instead of Xsample[i, j], which makes no sense. Plus, if you print W along with J and iter in your loop, you can see that the program finds the correct W quite quickly (once the previous fix has been made), but does not stop, probably because J is not correctly computed. The error is that this line:
Jnew = Jnew + (0.5/(len(Xsample)))*((Ysample[i]-Yunit)**2)
isn't correctly indented. Indeed, it is not supposed to be part of the for j in range(vars) loop, but is supposed to be part of the for i in range(len(Xsample)) loop only, like this:
Jnew = 0
for i in range(len(Xsample)):
    Yunit = 0
    for j in range(vars):
        Yunit = Yunit + Xsample[i,j]*W[j]
    Jnew = Jnew + (0.5/(len(Xsample)))*((Ysample[i]-Yunit)**2)
By correcting this, your code works correctly. This error is also present at the beginning of your program but does not affect it as long as more than two iterations are done.
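The effect of that indentation can be seen on a tiny example. In this sketch the function names cost_buggy and cost_fixed and the data are made up for illustration; the targets are offset by exactly 1 from the predictions, so the correct cost is 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = np.array([1.0, -2.0, 0.5])
Y = X @ W + 1.0  # every residual is exactly 1

def cost_buggy(X, Y, W):
    """Cost accumulated inside the inner loop: partial predictions leak in."""
    J = 0.0
    for i in range(len(X)):
        Yunit = 0.0
        for j in range(X.shape[1]):
            Yunit += X[i, j] * W[j]
            J += (0.5 / len(X)) * (Y[i] - Yunit) ** 2  # runs once per feature
    return J

def cost_fixed(X, Y, W):
    """Cost accumulated once per sample, after the prediction is complete."""
    J = 0.0
    for i in range(len(X)):
        Yunit = 0.0
        for j in range(X.shape[1]):
            Yunit += X[i, j] * W[j]
        J += (0.5 / len(X)) * (Y[i] - Yunit) ** 2
    return J

print(cost_buggy(X, Y, W), cost_fixed(X, Y, W))
```

The buggy version adds a squared error for every partially summed prediction, so it reports a larger cost that does not settle even when W is already correct.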

Python: Fastest way to subtract elements of datasets of HDF5 file?

Here is one interesting problem.
Input: Input is two arrays (Nx4, sorted in column-2) stored in datasets-1 and 2 in HDF5 file (input.h5). N is huge (originally belonging to 10 GB of file, hence stored in HDF5 file).
Output: Subtracting each column-2 element of dataset-2 from dataset-1, such that the difference (delta) is between +/-4000. Eventually saving this info in dset of a new HDF5 file. I need to refer to this new file back-and-forth, hence HDF5 not a text file.
Concern: I initially used .append method but that crashed the execution for 10GBs input. So, I am now using dset.resize method (and would like to stick to it preferably). I am also using binary search as I was told in one of my last posts. So now, although the script seems to be working for large (10 GBs) of datasets, it is quite slow! The subtraction (for/while) loop is possibly the culprit! Any suggestions on how I can make this fast? I aim to use the fastest approach (and possibly the simplest, since I am a beginner).
import numpy as np
import time
import h5py
import sys
import csv
f_r = h5py.File('input.h5', 'r+')
dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1,c1 = dset1.shape
r2,c2 = dset2.shape
left, right, count = 0,0,0
W = 4000 # Window half-width
n = 1
# **********************************************
# HDF5 Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
for j in range(r1):
    e1 = dset1[j,1]
    # move left pointer so that is within -delta of e
    while left < r2 and dset2[left,1] - e1 <= -W:
        left += 1
    # move right pointer so that is outside of +delta
    while right < r2 and dset2[right,1] - e1 <= W:
        right += 1
    for i in range(left, right):
        delta = e1 - dset2[i,1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j,1], dset2[i,1], delta]
        count += 1
print("\nFinal shape of dataset created: " + str(dset.shape))
f_w.close()
EDIT (Aug 8, chunking HDF5 file as suggested by @kcw78)
@kcw78: So, I tried chunking as well. The following works well for small files (<100MB), but the computation time increases enormously when I work with GBs of data. Can something be improved in my code to make it fast?
My suspicion is that the for j loop is computationally expensive and may be the reason. Any suggestions?
filename = 'file.h5'
with h5py.File(filename, 'r') as fid:
    chunks1 = fid["dataset_1"][:, :]
with h5py.File(filename, 'r') as fid:
    chunks2 = fid["dataset_2"][:, :]
print(chunks1.shape, chunks2.shape) # shape is (13900,4) and (138676,4)
count = 0
W = 4000 # Window half-width
# **********************************************
# HDF5-Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
# chunk size to read from first/second dataset
size1 = 34850
size2 = 34669
# save "n" no. of subtracted values in dset
n = 10**4
u = 0
fill_index = 0
for c in range(4): # read 4 chunks of dataset-1 one-by-one
    h = c * size1
    chunk1 = chunks1[h:(h + size1)]
    for d in range(4): # read chunks of dataset-2
        g = d * size2
        chunk2 = chunks2[g:(g + size2)]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(chunk1.shape[0]): # grab col.2 values from dataset-1
            e1 = chunk1[j, 1]
            while left < r2 and chunk2[left, 1] - e1 <= -W:
                left += 1
            # move right pointer so that is outside of +delta
            while right < r2 and chunk2[right, 1] - e1 <= W:
                right += 1
            for i in range(left, right):
                if chunk1[j, 0]<8193 and chunk2[i, 0] <8193:
                    e2 = chunk2[i, 1]
                    delta = e1 - e2 # subtract col.2 values
                    count += 1
                    if fill_index == (n):
                        dset.resize(dset.shape[0] + n, axis=0)
                        dset[u:(u + n), 0:4] = [count, e1, e1, delta]
                        u = u * n
                        fill_index = 0
                    fill_index += 1
        del chunk2
    del chunk1
f_w.close()
print(count) # these are (no. of) subtracted values such that the difference is between +/- 4000
EDIT (Jul 31)
I tried reading in chunks and even using memory mapping. It is efficient if I do not perform any subtraction and just go through the chunks. The for j in range(m): loop is the inefficient part, probably because I am grabbing each value of the chunk from file-1 one at a time. This is when I am just subtracting and not saving the difference. Is there any better logic or implementation that could replace the "for j in range(m):" loop?
size1 = 100_000_0
size2 = 100_000_0
filename = ["file-1.txt", "file-2.txt"]
chunks1 = pd.read_csv(filename[0], chunksize=size1,
                      names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1,4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2,4))
for chunk1 in chunks1: # grab chunks from file-1
    m, _ = chunk1.shape
    fp1[0:m,:] = chunk1
    chunks2 = pd.read_csv(filename[1], chunksize=size2,
                          names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk2 in chunks2: # grab chunks from file-2
        k, _ = chunk2.shape
        fp2[0:k, :] = chunk2
        for j in range(m): # Grabbing values from file-1's chunk
            e1 = fp1[j,1]
            delta_mat = e1 - fp2 # just a test, actually e1 should be subtracted from col-2 of fp2, not the whole fp2
            count += 1
        fp2.flush()
        a += k
    fp1.flush()
    del chunks2
    i += m
    prog_count += m
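The window logic in the loops above (advance left past values at or below e1 - W, advance right past values at or below e1 + W) can also be expressed with binary search over whole arrays. The following is only a sketch on small in-memory arrays, not on HDF5 datasets; the function name window_pairs and the sample values are made up for illustration, and it assumes both columns are sorted in ascending order, as stated in the question.

```python
import numpy as np

def window_pairs(col1, col2, half_width):
    """Return (idx1, idx2, delta) for all pairs with
    -half_width < col2[idx2] - col1[idx1] <= half_width.
    Both inputs must be sorted 1-D arrays."""
    # First index in col2 strictly greater than e1 - W, for every e1 at once.
    left = np.searchsorted(col2, col1 - half_width, side="right")
    # First index in col2 strictly greater than e1 + W.
    right = np.searchsorted(col2, col1 + half_width, side="right")
    counts = right - left
    # Repeat each col1 index once per match in its window.
    idx1 = np.repeat(np.arange(len(col1)), counts)
    # Position of each match inside its own window: 0, 1, 2, ...
    window_start = np.repeat(np.cumsum(counts) - counts, counts)
    within = np.arange(counts.sum()) - window_start
    idx2 = np.repeat(left, counts) + within
    delta = col1[idx1] - col2[idx2]
    return idx1, idx2, delta

col1 = np.array([0, 10, 20])
col2 = np.array([-5, 4, 6, 15, 30])
idx1, idx2, delta = window_pairs(col1, col2, half_width=5)
print(idx1, idx2, delta)
```

Applied chunk by chunk, the resulting arrays could be written in one block per chunk rather than one row per dset.resize call, which is where the original loop spends much of its time.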

Smoothing values (neighbors between 1-9)

Instructions: Compute and store R=1000 random values from 0-1 as x. moving_window_average(x, n_neighbors) is pre-loaded into memory from 3a. Compute the moving window average for x for the range of n_neighbors 1-9. Store x as well as each of these averages as consecutive lists in a list called Y.
My solution:
R = 1000
n_neighbors = 9
x = [random.uniform(0,1) for i in range(R)]
Y = [moving_window_average(x, n_neighbors) for n_neighbors in range(1,n_neighbors)]
where moving_window_average(x, n_neighbors) is a function as follows:
def moving_window_average(x, n_neighbors=1):
    n = len(x)
    width = n_neighbors*2 + 1
    x = [x[0]]*n_neighbors + x + [x[-1]]*n_neighbors
    # To complete the function,
    # return a list of the mean of values from i to i+width for all values i from 0 to n-1.
    mean_values=[]
    for i in range(1,n+1):
        mean_values.append((x[i-1] + x[i] + x[i+1])/width)
    return (mean_values)
This gives me an error, Check your usage of Y again. Even though I've tested for a few values, I did not get yet why there is a problem with this exercise. Did I just misunderstand something?
The instruction tells you to compute moving averages for all neighbors ranging from 1 to 9. So the below code should work:
import random
random.seed(1)
R = 1000
x = []
for i in range(R):
    num = random.uniform(0,1)
    x.append(num)
Y = []
Y.append(x)
for i in range(1,10):
    mov_avg = moving_window_average(x, n_neighbors=i)
    Y.append(mov_avg)
Actually, your moving_window_average(x, n_neighbors) function will not work correctly for n_neighbors greater than one. The interpreter won't raise an error, but the result is not what the exercise asks for, because the function always sums only three values regardless of the window width.
I suggest using something like:
def moving_window_average(x, n_neighbors=1):
    n = len(x)
    width = n_neighbors*2 + 1
    x = [x[0]]*n_neighbors + x + [x[-1]]*n_neighbors
    mean_values = []
    for i in range(n):
        temp = x[i: i+width]
        sum_= 0
        for elm in temp:
            sum_+= elm
        mean_values.append(sum_ / width)
    return mean_values
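A small hand-checkable input makes it easy to confirm that the window really grows with n_neighbors. The version below is a compact sketch that should be equivalent to the function above, using the built-in sum over each slice; the three-element test list is made up for illustration.

```python
def moving_window_average(x, n_neighbors=1):
    n = len(x)
    width = n_neighbors * 2 + 1
    # Pad both ends by repeating the first and last values.
    padded = [x[0]] * n_neighbors + list(x) + [x[-1]] * n_neighbors
    # Mean of the `width` values starting at each position i.
    return [sum(padded[i:i + width]) / width for i in range(n)]

small = [1, 2, 3]
print(moving_window_average(small, 1))  # windows: [1,1,2], [1,2,3], [2,3,3]
print(moving_window_average(small, 2))  # windows: [1,1,1,2,3], [1,1,2,3,3], [1,2,3,3,3]
```

With n_neighbors=2 each mean is taken over five values, which the three-term sum in the question cannot reproduce.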
My solution for +100XP
import random
random.seed(1)
R=1000
Y = list()
x = [random.uniform(0, 1) for num in range(R)]
for n_neighbors in range(10):
    Y.append(moving_window_average(x, n_neighbors))

Looping through multiple dataframes does not calculate properly

I am attempting to perform calculations, then loop through the same pandas dataframe and perform the same calculation but with an altered variable (one that increases each time it loops). If the loop range is set to just 1, all rows calculate properly and the new dataframe is created. However, attempting to actually loop the program results in NaN values everywhere except the first row.
Omega loop
for i in range(10):
    #Determine first and last Julian dates of data
    t1 = df.ix[:0,'jd']
    t2 = df.ix[n-1:,'jd']
    t2 = t2.reset_index(drop=True)
    tj = t2-t1
    #Iterate over each observation within each star file
    jd = df['jd']
    dmag = df['dmag']
    sinw = np.sin(2*omega*jd)
    sum1 = sinw.sum()
    cosw = np.cos(2*omega*jd)
    sum2 = cosw.sum()
    #Calculate tau
    tau = ((np.arctan(sum1/sum2))/(2*omega))
    avgdmag = dmag.sum()/n
    #Calculate sample variance
    tot = (df['dmag']-avgdmag)**2
    tot2 = tot.sum()
    var = tot2/(n-1)
    #Calculate sums for power series
    sum3 = sum3 + ((dmag - avgdmag)*np.cos(omega*(jd-tau)))
    sum4 = sum4 + (np.cos(omega*(jd-tau)))**2
    sum5 = sum5 + ((dmag - avgdmag)*np.sin(omega*(jd-tau)))
    sum6 = sum6 + (np.sin(omega*(jd-tau)))**2
    #Calculate power series and normalized power series
    px = (((sum3**2)/sum4)+((sum5**2)/sum6))/2
    pn = px/var
    #Step through sequential frequencies
    omega = omega + (1/tj)
I also received a runtime warning from NumPy caused by the omega term at the end. I disabled "invalid" warnings as it was not causing an issue with the actual calculations. The first dataframe that incorrectly computes is sinw and cosw. And all subsequently calculated dataframes have NaN values.
It is because your tj is a pd.Series of length 1, not a scalar as you would expect. After the first loop, omega = omega + 1/tj becomes a Series of length 1 (with 0 as index). Then in the 2nd loop, tau = ((np.arctan(sum1/sum2))/(2*omega)) also becomes such a Series. When updating sum3, jd - tau (a Series of length n minus a Series of length 1) gives you a Series that is all NaN except at index 0, the only label both Series share. After that, every subsequent Series contains mostly NaNs.
The solution is to calculate tj as a scalar, such as
tj = df.loc[n-1,'jd'] - df.loc[0,'jd'] (assuming n = len(df)).
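The alignment behaviour is easy to reproduce in isolation. This is a minimal sketch with a made-up three-row DataFrame; it only illustrates the difference between slicing with a range (which returns a Series) and indexing a single label (which returns a scalar).

```python
import pandas as pd

df = pd.DataFrame({"jd": [10.0, 11.0, 12.0]})

first_as_series = df.loc[:0, "jd"]   # slice: a Series of length 1 with index [0]
first_as_scalar = df.loc[0, "jd"]    # single label: a plain float

# Series minus Series aligns on the index; labels 1 and 2 have no partner.
aligned = df["jd"] - first_as_series

# Series minus scalar broadcasts over every row.
broadcast = df["jd"] - first_as_scalar

print(aligned)
print(broadcast)
```

Only the scalar version subtracts the value from every row, which is the behaviour the loop relies on.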
Anyway, your piece of code can be re-written for readability.
tj = df.loc[n-1,'jd'] - df.loc[0,'jd'] #tj is loop invariant
for _ in range(10):
    sum1 = np.sin(2*omega*df['jd']).sum()
    sum2 = np.cos(2*omega*df['jd']).sum()
    tau = np.arctan(sum1/sum2)/(2*omega)
    avgdmag = df['dmag'].mean()
    var = df['dmag'].var() #unbiased sample variance
    sum3 += ((df['dmag'] - avgdmag)*np.cos(omega*(df['jd']-tau)))
    sum4 += (np.cos(omega*(df['jd']-tau)))**2
    sum5 += ((df['dmag'] - avgdmag)*np.sin(omega*(df['jd']-tau)))
    sum6 += (np.sin(omega*(df['jd']-tau)))**2
    px = (((sum3**2)/sum4)+((sum5**2)/sum6))/2
    pn = px/var
    omega += 1/tj
