Python Index and Bounds error using data set

Our class is using Python as a solution tool for models. However, this is my first time with Python, or any programming language since VB in 1997, so I'm struggling. We were given the following code:
from numpy import loadtxt, array, ones, column_stack
from numpy import dot, sqrt
from scipy.linalg import inv
from scipy.stats import norm, t
f = loadtxt('text data.raw')
y = f[:,4]
n = y.size
x = array([f[:,2],f[:,8],f[:,4]])
one = ones(n)
#xa = column_stack([one,f[:,3],f[:,4]])
xa = column_stack([one,x.T])
k = xa.shape[1]
xx = dot(xa.T,xa)
invx = inv(xx)
xy = dot(xa.T,y)
b = dot(invx,xy)
# Compute cov(b)
e = y - dot(xa,b)
s2 = dot(e.T,e)/(n-k)
covb = invx*s2
# Compute t-stat
tstat = b[1]/sqrt(covb[1][1])
#compute p-value
p = 1 - norm.cdf(tstat,0,1)
pt = 1 - t.cdf(tstat,88)
Our data set is a 10x88 matrix. Our goal is to create a linear model and find a few answers. In our data, column 1 is already set to price, which is our desired output, and I need to use columns 3, 4, and 5 as my x1, x2, and x3. I'm not sure what the values in lines 9 and 11 need to be changed to in order to accomplish that task, nor do I currently understand what those two lines are specifically doing in the program. Again, I'm not familiar with programming.
Everything I try generally yields an error similar to
IndexError: index 5 is out of bounds for axis 1 with size 5
Any suggestions?
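A note on indexing, as a minimal sketch rather than a definitive fix: Python and numpy indexing is zero-based, so column 1 of the file is index 0, and columns 3, 4, and 5 are indices 2, 3, and 4. Assuming the file loads as an 88x10 array with observations in rows, lines 9 and 11 might become:
# minimal sketch, assuming f is 88x10 (rows = observations, columns = variables)
y = f[:,0]                          # column 1 (price) is index 0
x = array([f[:,2],f[:,3],f[:,4]])   # columns 3, 4, 5 are indices 2, 3, 4
An IndexError like the one above is raised whenever a column index is greater than or equal to the number of columns on that axis.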

What's a potentially better algorithm to solve this python nested for loop than the one I'm using?

I have a nested loop that has to go through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, where each row is an X,Y location in 2D space. A window of length 10 goes through all 1M rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D (X-Y) plane.
For each window of 10 points/rows and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points, and if that distance is less than the current diameter we add 2 to H. Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
Once the loop has gone through all the diameters in r_test for the first 10 points, we read off the slope of the fitted line and save it to S_wind[ii]. The first data points, which cannot form a full window, have no value calculated for them and keep np.inf so they can be distinguished later.
Then the window moves one point down the rows and this process repeats until S_wind is complete.
What's a potentially better algorithm for this in Python 3.x than the one I'm using?
Many thanks in advance!
import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']
####====creating upper and lower bound for the diameter of the investigation circles
x_range = max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2 = 800
r_test = np.arange(r1, r2, 5)
S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])):   #### maybe the code runs slower because of calling len() here instead of using a fixed number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10   ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii-10, ii):
            for j in range(ii-10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = H/(N**2)
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just dying for using a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format I'd find most useful, then converted it into something less immediately useful for numeric operations!
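For instance, a sketch of a more direct pandas route that builds the same points array as above:
# hedged alternative: slice the frame once and convert both columns together
points = df.iloc[ii-10:ii][['X', 'Y']].to_numpy()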
Note that this code matches the output of your loop code. I'm not sure it's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code), as sketched below.
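A minimal sketch of that variant, reusing within_range and N from the snippet above:
# count matches per radius independently instead of accumulating across r_test
c_10 = within_range.sum(axis=(1, 2)) * 2 / (N**2)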

Discrepancy in histograms generated by plt.hist() [duplicate]

This question already has answers here:
numpy arange: how to make "precise" array of floats?
I need help figuring out why there is a discrepancy between histograms A and B generated in the code below. I'm a physicist, and some colleagues and I noticed this as we were plotting the same data in Python, IDL and Matlab. Python and IDL have the same problem; Matlab does not. Matlab always reproduces histogram B.
import numpy as np
import matplotlib.pyplot as plt
t = np.random.randint(-1000,1000,10**3)
# A
tA = t/1000
binsizeA = 0.05
xminA = -1
xmaxA = 1
binsA = np.arange(xminA, xmaxA+binsizeA, binsizeA)
hA, _ , _ = plt.hist(tA, bins=binsA, histtype="step", label="A")
# B
tB = t
binsizeB = 50
xminB = -1000
xmaxB = 1000
binsB = np.arange(xminB, xmaxB+binsizeB, binsizeB)
hB, _ , _ = plt.hist(tB/1000, bins=binsB/1000, histtype="step", label="B")
plt.legend()
plt.show()
print(hA==hB)
Plot showing the histograms
The original data are time-tagged measurements with microsecond precision saved as integers. The problem seems to arise when the array is divided by 1000 (from microseconds to milliseconds). Is there a way to avoid this?
I start by "recreating" scenario A, but directly by scaling everything (data + bins) from B:
C - binsB / 1000
# C
tC = tB / 1000
xminC = xminB / 1000
xmaxC = xmaxB / 1000
binsC = binsB / 1000
hC, _ , _ = plt.hist(tC, bins=binsC, histtype="step", label="C")
assert((hB == hC).all())
This produces the same histogram as hB, so the problem is in the way binsA is made:
binsA = np.arange(xminA, xmaxA+binsizeA, binsizeA)
From its docstring:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use numpy.linspace for these cases.
So either go route C or use linspace to create the non-integer bins with less rounding errors.
D - np.linspace
Interestingly, using linspace does not yield bins that are floating-point equal to binsB / 1000:
# D
tD = t / 1000
bincountD = 41
xminD = -1
xmaxD = 1
binsD = np.linspace(xminD, xmaxD, bincountD)
hC, _ , _ = plt.hist(tC, bins=binsC, histtype="step", label="C")
hD, _ , _ = plt.hist(tD, bins=binsD, histtype="step", label="D")
plt.legend()
plt.show()
By inspection, binsC and binsD look equal, but they still differ in their least significant digits. I can "clamp" them to yield the same histogram with binsX.round(2).
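A minimal sketch of that clamping, using binsC and binsD from above:
# rounding both bin arrays to 2 decimals makes them (and the histograms) agree
binsC = binsC.round(2)
binsD = binsD.round(2)
assert (binsC == binsD).all()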
All in all, this serves as a reminder of how tricky it is to achieve "exact" results. Note that the effect is amplified here because all your samples were integers to begin with; if your data were floating point as well, bins and samples would not be value-identical anyway.

How do I call a list of numpy functions without a for loop?

I'm doing data analysis that involves minimizing the least-square-error between a set of points and a corresponding set of orthogonal functions. In other words, I'm taking a set of y-values and a set of functions, and trying to zero in on the x-value that gets all of the functions closest to their corresponding y-value. Everything is being done in a 'data_set' class. The functions that I'm comparing to are all stored in one list, and I'm using a class method to calculate the total lsq-error for all of them:
self.fits = [np.poly1d(np.polyfit(self.x_data, self.y_data[n], 10))
             for n in range(self.num_points)]

def error(self, x, y_set):
    arr = [(y_set[n] - self.fits[n](x))**2 for n in range(self.num_points)]
    return np.sum(arr)
This was fine when I had significantly more time than data, but now I'm taking thousands of x-values, each with a thousand y-values, and that for loop is unacceptably slow. I've been trying to use np.vectorize:
#global scope
def func(f, x):
    return f(x)

vfunc = np.vectorize(func, excluded=['x'])
…
…
#within data_set class
def error(self, x, y_set):
    arr = (y_set - vfunc(self.fits, x))**2
    return np.sum(arr)
func(self.fits[n], x) works fine as long as n is valid, and as far as I can tell from the docs, vfunc(self.fits, x) should be equivalent to
[self.fits[n](x) for n in range(self.num_points)]
but instead it throws:
ValueError: cannot copy sequence with size 10 to array axis with dimension 11
10 is the degree of the polynomial fit, and 11 is (by definition) the number of terms in it, but I have no idea why they're showing up here. If I change the fit order, the error message reflects the change. It seems like np.vectorize is taking each element of self.fits as a list rather than a np.poly1d function.
Anyway, if someone could either help me understand np.vectorize better, or suggest another way to eliminate that loop, that would be swell.
As the functions in question all have a very similar structure, we can "manually" vectorize once we've extracted the poly coefficients. In fact, the evaluation is then a quite simple one-liner, eval_many below:
import numpy as np

def poly_vec(list_of_polys):
    O = max(p.order for p in list_of_polys) + 1
    C = np.zeros((len(list_of_polys), O))
    for p, c in zip(list_of_polys, C):
        c[len(c)-p.order-1:] = p.coeffs
    return C

def eval_many(x, C):
    return C @ np.vander(x, C.shape[1]).T   # one row of evaluations per polynomial

# make example
list_of_polys = [np.poly1d(v) for v in np.random.random((1000, 11))]
x = np.random.random((2000,))
# put all coeffs in one master matrix
C = poly_vec(list_of_polys)
# test
assert np.allclose(eval_many(x, C), [p(x) for p in list_of_polys])

from timeit import timeit
print('vectorized', timeit(lambda: eval_many(x, C), number=100)*10)
print('loopy     ', timeit(lambda: [p(x) for p in list_of_polys], number=10)*100)
Sample run:
vectorized 6.817315469961613
loopy 56.35076989419758
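As an aside, a hedged sketch of one way to make the original np.vectorize approach work: hand it an object-dtype array, so that each poly1d is treated as a single element rather than as a sequence of coefficients. Here fits stands in for the question's self.fits:
# hypothetical workaround, not the answer's method
fits = list_of_polys
fits_arr = np.empty(len(fits), dtype=object)   # object array keeps each poly1d intact
fits_arr[:] = fits
vfunc = np.vectorize(lambda f, xx: f(xx))
vals = vfunc(fits_arr, 0.5)   # evaluates every polynomial at x = 0.5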

Iterations over 2d numpy arrays with while and for statements

In the code supplied below I am trying to iterate over a 2D numpy array [i][k].
Originally it is code that was written in Fortran 77, which is older than my grandfather, and I am trying to adapt it to Python.
(For people wondering what it is about: it is a simple hydraulic transients event solver.)
Bear in mind that all variables are introduced in my code; I don't paste them all here.
H = np.zeros((NS,50))
Q = np.zeros((NS,50))
Here I am assigning the first row values:
for i in range(NS):
    H[0][i] = HR - i*R*Q0**2
    Q[0][i] = Q0

CVP = .5*Q0**2/H[N]
T = 0
k = 0
TAU = 1
# Interior points:
HP = np.zeros((NS,50))
QP = np.zeros((NS,50))
while T <= Tmax:
    T += dt
    k += 1
    for i in range(1, N):
        CP = H[k][i-1] + Q[k][i-1]*(B - R*abs(Q[k][i-1]))
        CM = H[k][i+1] - Q[k][i+1]*(B - R*abs(Q[k][i+1]))
        HP[k][i-1] = 0.5*(CP+CM)
        QP[k][i-1] = (HP[k][i-1]-CM)/B
    # Boundary conditions:
    HP[k][0] = HR
    QP[k][0] = Q[k][1] + (HP[k][0] - H[k][1] - R*Q[k][1]*abs(Q[k][1]))/B
    if T == Tc:
        TAU = 0
        CV = 0
    else:
        TAU = (1. - T/Tc)**Em
        CV = CVP*TAU**2
    CP = H[k][N-1] + Q[k][N-1]*(B - R*abs(Q[k][N-1]))
    QP[k][N] = -CV*B + np.sqrt(CV**2*(B**2) + 2*CV*CP)
    HP[k][N] = CP - B*QP[k][N]
    for i in range(NS):
        H[k][i] = HP[k][i]
        Q[k][i] = QP[k][i]
Remember, i is for rows and k is for columns.
What I expect is that for every column k the values are calculated until the T <= Tmax condition is met. I cannot figure out what my mistake is; I am getting the following errors:
RuntimeWarning: divide by zero encountered in true_divide
CVP = .5*Q0**2/H[N]
RuntimeWarning: invalid value encountered in multiply
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
ValueError: setting an array element with a sequence.
Looking at your first iteration:
H = np.zeros((NS,50))
Q = np.zeros((NS,50))
for i in range(NS):
    H[0][i] = HR - i*R*Q0**2
    Q[0][i] = Q0
The shape of H is (NS,50), but when you iterate over range(NS) you apply that index to the 2nd dimension. Why? Shouldn't it apply to the dimension with size NS?
In numpy, arrays have 'C' order by default: the last dimension is the innermost. They can have 'F' (Fortran) order, but let's not go there. Thinking of a 2D array as a table, we typically talk of rows and columns, though they don't have a formal definition in numpy.
Let's assume you want to set the first column to these values:
for i in range(NS):
    H[i, 0] = HR - i*R*Q0**2
    Q[i, 0] = Q0
But we can also do the assignment a whole row or column at a time. I believe newer versions of Fortran have these 'whole-array' operations as well.
Q[:, 0] = Q0
H[:, 0] = HR - np.arange(NS) * R * Q0**2
One point of caution when translating to Python: indexing starts with 0, and so do range and np.arange(...).
H[0][i] is functionally the same as H[0,i]. But when using slices you have to use the H[:,i] format.
I suspect your other iterations have similar problems, but I'll stop here for now.
Regarding the errors:
The first:
RuntimeWarning: divide by zero encountered in true_divide
CVP = .5*Q0**2/H[N]
You initialize H as zeros, so it is normal that it complains about division by zero. Maybe you should add a conditional.
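A minimal sketch of such a guard, assuming H, N and Q0 as in the question:
# hedged sketch: only divide where H[N] is nonzero, leave the rest at 0
with np.errstate(divide='ignore'):
    CVP = np.where(H[N] != 0, .5*Q0**2/H[N], 0.0)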
The third:
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
ValueError: setting an array element with a sequence.
You define CVP = .5*Q0**2/H[N] and then CV = CVP*TAU**2, which is a sequence: H[N] is a whole row, so CVP and CV are arrays. Then you try to assign a value derived from them to QP[N][k], which is a single element. You are trying to insert an array into a scalar slot.
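A quick way to see this, as a sketch with made-up sizes:
import numpy as np
H = np.zeros((20, 50))     # stand-in for the question's H
N = 10
print(H[N].shape)          # (50,) -- a whole row, not a scalar
print(H[N, 0].shape)       # ()   -- a single element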
For the second error, I think it is related to the third. If you could provide more information I would be happy to try to understand what happens.
Hope this has helped.

pairwise subtraction of rows of an array and division in for loop

I have three arrays: flow_stress (4x3), strain_rate (1x4) and T (1x3). I have interpolated the log of flow_stress with respect to Temp = 1000/(T+273), line-spaced into 50 terms, and with respect to srate (the log of strain_rate, line-spaced into 100 terms), so that flow_stress3 is a (100x50) array.
I am trying to create an array m equal to the difference of consecutive rows of flow_stress3 divided by the difference of consecutive terms of srate.
Though the arrays flow_stress3 and srate have correct values, the values of m are wrong.
import numpy as np
import math
from scipy import interpolate as sp
import matplotlib.pyplot as plt

T = np.array([750, 800, 850])              # temperature in degrees
strain_rate = np.array([0.0003, 0.001, 0.01, 0.1])
flow_stress = np.array([[95.96, 49.46, 28.16],
                        [126.62, 80.51, 46.45],
                        [235.14, 151.46, 107.94],
                        [319.15, 228.77, 165.63]])
Temp = 1000/(273+T)
k = int((max(T)-min(T))/2)                 # number of temperature terms
TT = np.linspace(max(Temp), min(Temp), k)  # divide the temp range into k terms
S = np.log10(flow_stress)
flow_stress1 = np.empty(shape=[len(strain_rate), len(TT)])   # empty array of shape (len(strain_rate), len(TT))
SR = np.log10(strain_rate)
n = (max(SR)-min(SR))/0.025                # divide SR by 0.025 to get the number of terms
l = int(n//1)                              # // truncates the fraction to an integer
srate = np.linspace(min(SR), max(SR), l)   # divide SR into l equal parts
len_srate = len(srate)
# first interpolate between temp and log flow stress
for i in range(len(strain_rate)):
    f_linear = sp.interp1d(Temp, S[i, :])
    flow_stress1[i, :] = f_linear(TT)      # interpolate at the values given by TT
flow_stress2 = np.empty(shape=[len(TT), len(srate)])
for i in range(len(TT)):
    f_linear = sp.interp1d(SR, flow_stress1[:, i])
    flow_stress2[i, :] = f_linear(srate)
flow_stress3 = flow_stress2.T
print(len(flow_stress3))
print(len(flow_stress3[0, :]))
print(len(srate))
print(len(TT))
srate = srate.T
m = np.zeros(shape=[len(srate), len(TT)], dtype=np.ndarray)
for i in range(len(srate)-1):
    m[i, :] = (flow_stress3[i+1, :] - flow_stress3[i, :])/(srate[i+1] - srate[i])
m[len(srate)-1, :] = m[len(srate)-2, :]
I get a contour plot of m with respect to srate and T as shown in fig 1. A plot of the same data done in MATLAB is shown in fig 2; we know for sure that the MATLAB data is correct. With Python, as can be seen, the values of many rows and columns are identical, which should not be the case.
fig1
fig2
If I understand correctly, your requirement is simply to calculate the first derivative of flow_stress3 with respect to srate. The code seems rather complex for that. In particular, I don't understand what purpose the last line serves.
Since you're already using scipy, I would suggest using the UnivariateSpline function. Your code will shrink to something like:
from scipy.interpolate import UnivariateSpline
# UnivariateSpline needs 1-D y, so fit one spline per temperature column
# and evaluate its derivative at the sample points
m = np.column_stack([UnivariateSpline(srate, flow_stress3[:, j]).derivative()(srate)
                     for j in range(flow_stress3.shape[1])])
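Alternatively, if plain finite differences are all that is needed, a minimal sketch with np.diff, reusing flow_stress3 and srate from the question:
# forward differences along the srate axis, last row repeated as in the original loop
m = np.diff(flow_stress3, axis=0) / np.diff(srate)[:, None]
m = np.vstack([m, m[-1]])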
