Numpy arrays and comparisons with different values - python-3.x

For reproducibility reasons, I am sharing the data here.
Starting from column 2, I want to read the current row and compare it with the value of the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I divide the current (smaller) value by the previous (larger) value. Accordingly, I wrote the following code:
import numpy as np
import matplotlib.pyplot as plt

protocols = {}
types = {"data_c": "data_c.csv", "data_r": "data_r.csv", "data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]
    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }
data_c is a numpy array with a single unique quotient value, 0.7; data_r likewise has a single unique quotient value, 0.5. However, data_v has two unique quotient values (either 0.5 or 0.8).
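For instance, printing the unique quotients per protocol from the dict built above (a quick check of my own, not part of the original code):

for protname, data in protocols.items():
    print(protname, np.unique(data["quotient"]))
# expected from the description:
# data_c -> [0.7], data_r -> [0.5], data_v -> [0.5 0.8]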
I wanted to loop through the quotient values of these CSV files and categorize them with a simple if-else statement. A StackOverflow contributor helped me with the following approach using numpy.array_equal.
import numpy as np

unique_quotient = np.unique(quotient)
unique_data_c_quotient = np.r_[0.7]
unique_data_r_quotient = np.r_[0.5]

if np.array_equal(unique_quotient, unique_data_c_quotient):
    print('data_c')
elif np.array_equal(unique_quotient, unique_data_r_quotient):
    print('data_r')
This works perfectly for data_c and data_r, whose values are 0.7 and 0.5 respectively; that is, it works only when the quotient value is unique (or fixed). However, it fails when there is more than one quotient value. For example, data_m has quotient values between 0.65 and 0.7 (i.e. 0.65 <= quotient <= 0.7) and data_v has two quotient values (0.5 and 0.8).
How can we solve this issue using numpy arrays?

If you consistently have unique quotients, and consistently have unique quotient bounds, then I would recommend the following:
unique_data_m_bounds = np.r_[0.65, 0.7]
unique_data_v_quotient = np.r_[0.5, 0.8]

uq = unique_quotient
uq_min, uq_max = uq.min(), uq.max()

def is_uq_bounded_by(unique_data_bounds):
    ud_min, ud_max = unique_data_bounds.min(), unique_data_bounds.max()
    left_bounded = ud_min <= uq_min <= ud_max
    right_bounded = ud_min <= uq_max <= ud_max
    bounded = left_bounded & right_bounded
    return bounded

label = 'ERROR -- DATA UNCLASSIFIED'
if len(uq) > 2:
    if is_uq_bounded_by(unique_data_m_bounds):
        label = 'data_m'
elif 0 < len(uq) <= 2:
    if np.array_equal(uq, unique_data_v_quotient):
        label = 'data_v'
    elif np.array_equal(uq, unique_data_c_quotient):
        label = 'data_c'
    elif np.array_equal(uq, unique_data_r_quotient):
        label = 'data_r'
print(label)
Note that the method becomes dubious when the data begin to overlap.
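If overlap is a concern, one alternative (my own sketch, not from the original answer) is to test range membership per protocol explicitly with np.all, so each band is named rather than inferred from min/max:

def all_in_range(values, lo, hi):
    # True only if every unique quotient lies inside [lo, hi]
    return np.all((lo <= values) & (values <= hi))

if all_in_range(uq, 0.65, 0.7):   # the data_m band from the question
    label = 'data_m'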

Related

What's a potentially better algorithm to solve this python nested for loop than the one I'm using?

I have a nested loop that has to go through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, where each row holds an X, Y location in 2D space. A window of length 10 moves through the 1M rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D (X-Y) plane.
For each window of 10 points/rows, and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points; whenever the distance is less than the current r_test value, 2 is added to H (the symmetric double loop counts each pair twice). Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
Once the loop has gone through all the diameters in r_test for these first 10 points, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points have no value calculated for them, so they are set to np.inf to be distinguished later.
Then the window moves one row down and the process repeats until S_wind is complete.
What's a potentially better algorithm than the one I'm using, in Python 3.x? Many thanks in advance!
import numpy as np
import pandas as pd

#### generating the input data frame
df = pd.DataFrame(data=np.random.randint(2000, 6000, (1000000, 2)))
df.columns = ['X', 'Y']

#### creating upper and lower bounds for the diameter of the investigation circles
x_range = max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range, y_range) / 20
d = 2
N = 10   #### number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
## === avoiding generation of an empty r_test
r1 = 80
r2 = 800
r_test = np.arange(r1, r2, 5)

S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])):   #### maybe the code runs slower because of using len() instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10   ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii-10, ii):
            for j in range(ii-10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = H / (N**2)
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just dying for using a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format I'd find most useful, then converted it into something less immediately useful for numeric operations!
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code).
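Two small follow-up sketches (my assumptions, not from the original answer): pandas' to_numpy can build the (10, 2) points array in one step, and skipping the cumsum gives the per-radius variant described above:

# one-step window extraction (assumes pandas >= 0.24 for to_numpy)
points = df[['X', 'Y']].to_numpy()[ii-10:ii]

# per-radius counts with no cumulative sum; this corresponds to a loop
# version with `H = 0` moved inside the `for ind` loop
c_10 = within_range.sum(axis=(1, 2)) * 2 / (N**2)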

Generate numpy matrix with unique range for each element

I'm trying to generate random matrices where each element has a different range, i.e. each element should be a random number drawn from its own interval. So far I've been able to generate matrices with per-column ranges:
c1 = np.random.uniform(low=2, high=1000, size=(15,1))
c2 = np.random.uniform(low=0.001, high=100, size=(15,1))
c3 = np.random.uniform(low=30, high=10000, size=(15,1))
c4 = np.random.uniform(low=1, high=25, size=(15,1))
mtx = np.concatenate((c1,c2,c3,c4), axis=1)
But the low and high for the rows of mtx also differ considerably. How can I generate such a random matrix where each row element also has its own range, not just the columns?
Something like this would probably work:
low = np.array([   2, 0.001,    30,  1])
high = np.array([1000,   100, 10000, 25])
l = 15
# np.random.random takes a shape tuple (np.random.rand takes separate dimension arguments)
mtx = np.random.random((l,) + low.shape) * (high - low)[None, :] + low[None, :]
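A quick sanity check of the result (a usage sketch):

print(mtx.shape)                # (15, 4)
print(mtx.min(axis=0) >= low)   # every column respects its low bound
print(mtx.max(axis=0) <= high)  # ... and its high bound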
I think what you need to do to achieve what you want is the following:
Specify the low and high for each column and each row
Check for each element what range it can be sampled from (that is, the highest low and the lowest high of the two ranges imposed by its row and its column)
Sample each element separately (from a uniform distribution) with the element's specified high and low.
Now each element in each row will certainly be within the row's limits and the same would go for elements in a column.
You should be careful though not to select mutually exclusive ranges in rows and columns.
That said, here is some code that does this (with comments):
import numpy as np
from numpy.random import randint

n_rows = 15
n_cols = 4

# here I make random highs and lows for each row and column
# these are lists of tuples like this: [(39, 620), (83, 123), (67, 243), (77, 901)]
# where each tuple contains the low and high for the column (or row).
ranges_rows = [(randint(0, 100), randint(101, 1001)) for _ in range(n_rows)]
ranges_cols = [(randint(0, 100), randint(101, 1001)) for _ in range(n_cols)]

# make an empty matrix
mtx = np.empty((n_rows, n_cols))

# fill in the matrix
for x in range(n_rows):
    for y in range(n_cols):
        # get the specified low and high for both the column and row of the element
        row_low, row_high = ranges_rows[x]
        col_low, col_high = ranges_cols[y]
        # the low and high for each element should be within range of both the
        # row and column restrictions
        elem_low = max([row_low, col_low])
        elem_high = min([row_high, col_high])
        # get the element within the range
        rand_elem = np.random.uniform(low=elem_low, high=elem_high)
        # put it in its right place in the matrix
        mtx[x, y] = rand_elem
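The same intersection-of-ranges idea can also be vectorized with broadcasting; a sketch under the same assumptions (including the caveat about mutually exclusive ranges):

row_low, row_high = np.array(ranges_rows).T   # each of shape (n_rows,)
col_low, col_high = np.array(ranges_cols).T   # each of shape (n_cols,)
# per-element bounds: highest low and lowest high of the row/column ranges
elem_low = np.maximum(row_low[:, None], col_low[None, :])    # (n_rows, n_cols)
elem_high = np.minimum(row_high[:, None], col_high[None, :])
mtx = elem_low + np.random.uniform(size=(n_rows, n_cols)) * (elem_high - elem_low)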

Iterations over 2d numpy arrays with while and for statements

In the code supplied below I am trying to iterate over a 2D numpy array with indices [i][k].
Originally this was code written in Fortran 77, which is older than my grandfather, and I am trying to adapt it to Python.
(For those interested: it is a simple hydraulic transients event solver.)
Bear in mind that all variables are initialized in my full code, which I don't paste here.
H = np.zeros((NS,50))
Q = np.zeros((NS,50))
Here I am assigning the first row values:
for i in range(NS):
    H[0][i] = HR - i*R*Q0**2
    Q[0][i] = Q0
CVP = .5*Q0**2/H[N]
T = 0
k = 0
TAU = 1
# Interior points:
HP = np.zeros((NS,50))
QP = np.zeros((NS,50))
while T <= Tmax:
    T += dt
    k += 1
    for i in range(1, N):
        CP = H[k][i-1] + Q[k][i-1]*(B - R*abs(Q[k][i-1]))
        CM = H[k][i+1] - Q[k][i+1]*(B - R*abs(Q[k][i+1]))
        HP[k][i-1] = 0.5*(CP + CM)
        QP[k][i-1] = (HP[k][i-1] - CM)/B
    # Boundary conditions:
    HP[k][0] = HR
    QP[k][0] = Q[k][1] + (HP[k][0] - H[k][1] - R*Q[k][1]*abs(Q[k][1]))/B
    if T == Tc:
        TAU = 0
        CV = 0
    else:
        TAU = (1. - T/Tc)**Em
        CV = CVP*TAU**2
    CP = H[k][N-1] + Q[k][N-1]*(B - R*abs(Q[k][N-1]))
    QP[k][N] = -CV*B + np.sqrt(CV**2*(B**2) + 2*CV*CP)
    HP[k][N] = CP - B*QP[k][N]
    for i in range(NS):
        H[k][i] = HP[k][i]
        Q[k][i] = QP[k][i]
Remember i is for rows and k is for columns
What I expect is that for every column k the values are calculated until the T <= Tmax condition is met. I cannot figure out what my mistake is; I am getting the following errors:
RuntimeWarning: divide by zero encountered in true_divide
CVP = .5*Q0**2/H[N]
RuntimeWarning: invalid value encountered in multiply
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
ValueError: setting an array element with a sequence.
Looking at your first iteration:
H = np.zeros((NS,50))
Q = np.zeros((NS,50))
for i in range(NS):
    H[0][i] = HR - i*R*Q0**2
    Q[0][i] = Q0
The shape of H is (NS,50), but when you iterate over a range(NS) you apply that index to the 2nd dimension. Why? Shouldn't it apply to the dimension with size NS?
In numpy, arrays have 'C' order by default: the last dimension is the innermost. (They can have 'F' (Fortran) order, but let's not go there.) Thinking of a 2D array as a table, we typically talk of rows and columns, though they don't have a formal definition in numpy.
Let's assume you want to set the first column to these values:
for i in range(NS):
    H[i, 0] = HR - i*R*Q0**2
    Q[i, 0] = Q0
But we can do the assignment a whole row or column at a time. I believe newer versions of Fortran also have these 'whole-array' operations.
Q[:, 0] = Q0
H[:, 0] = HR - np.arange(NS) * R * Q0**2
One point of caution when translating to Python: indexing starts at 0, and so do ranges and np.arange(...).
H[0][i] is functionally the same as H[0, i], but when using slices you have to use the H[:, i] format.
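A minimal illustration of that difference (a toy example, not from the original post):

A = np.zeros((3, 5))
A[:, 0] = 7.0    # sets the whole first column
A[0][2] = 1.0    # same element as A[0, 2] -- chained indexing works for scalars
# but A[:][0] is NOT column 0: A[:] is just a view of A, so A[:][0] is row 0
A[:][0] = -1.0   # overwrites all of row 0, not column 0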
I suspect your other iterations have similar problems, but I'll stop here for now.
Regarding the errors:
The first:
RuntimeWarning: divide by zero encountered in true_divide
CVP = .5*Q0**2/H[N]
You initialize H as zeros, so it is normal that it complains about division by zero. Maybe you should add a conditional.
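For example, a sketch of such a guard using np.divide's where argument (my suggestion, not from the original answer):

# divide only where the denominator is nonzero; leave zeros elsewhere
denom = H[N]
CVP = np.divide(.5*Q0**2, denom, out=np.zeros_like(denom), where=denom != 0)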
The third:
QP[N][k] = -CV*B+np.sqrt(CV**2*(B**2)+2*CV*CP)
ValueError: setting an array element with a sequence.
You define CVP = .5*Q0**2/H[N] and then CV = CVP*TAU**2, which is an array (a sequence). You then try to assign a value derived from it to QP[N][k], which is a single element: you are trying to store an array into a scalar slot.
For the second error, I think it is related to the third. If you could provide more information, I would like to try to understand what happens.
Hope this has helped.

pairwise subtraction of rows of an array and division in for loop

I have three arrays: flow_stress (4x3), strain_rate (1x4) and T (1x3). I have interpolated the log of flow_stress with respect to Temp = 1000/(T+273), line-spaced into 50 terms, and with respect to srate (the log of strain_rate, line-spaced into 100 terms), so that flow_stress3 is a (100x50) array.
I am trying to create an array m equal to the difference of consecutive rows of flow_stress3 divided by the difference of consecutive terms of srate.
Though flow_stress3 and srate have correct values, the values of m are wrong.
import numpy as np
import math
from scipy import interpolate as sp
import matplotlib.pyplot as plt

T = np.array([750, 800, 850])   ## temperature values in degrees
strain_rate = np.array([0.0003, 0.001, 0.01, 0.1])
flow_stress = np.array([[ 95.96,  49.46,  28.16],
                        [126.62,  80.51,  46.45],
                        [235.14, 151.46, 107.94],
                        [319.15, 228.77, 165.63]])
Temp = 1000/(273 + T)
k = int((max(T) - min(T))/2)    ## number of temperature terms (int for linspace)
TT = np.linspace(max(Temp), min(Temp), k)   ## divide the temp range into k terms
S = np.log10(flow_stress)
flow_stress1 = np.empty(shape=[len(strain_rate), len(TT)])  ## empty array of dim (len(strain_rate), len(TT))
SR = np.log10(strain_rate)
n = (max(SR) - min(SR))/0.025   ## divide the SR range by 0.025 to get the number of terms
l = int(n // 1)                 ## // truncates the fraction; int() makes it usable as a count
srate = np.linspace(min(SR), max(SR), l)    ## divide SR into l equal parts
len_srate = len(srate)

## first interpolate between temp and log flow stress
for i in range(len(strain_rate)):
    f_linear = sp.interp1d(Temp, S[i, :])
    flow_stress1[i, :] = f_linear(TT)   ## interpolate at the values given by TT

flow_stress2 = np.empty(shape=[len(TT), len(srate)])
for i in range(len(TT)):
    f_linear = sp.interp1d(SR, flow_stress1[:, i])
    flow_stress2[i, :] = f_linear(srate)

flow_stress3 = flow_stress2.T
print(len(flow_stress3))
print(len(flow_stress3[0, :]))
print(len(srate))
print(len(TT))

srate = srate.T
m = np.zeros(shape=[len(srate), len(TT)], dtype=np.ndarray)
for i in range(len(srate) - 1):
    m[i, :] = np.array((flow_stress3[i+1, :] - flow_stress3[i, :])/(srate[i+1] - srate[i]))
m[len(srate)-1, :] = m[len(srate)-2, :]
I get a contour plot of m with respect to srate and T as shown in fig. 1. A plot of the same data done in MATLAB is shown in fig. 2; we know for sure that the MATLAB data is correct. With Python, as can be seen, the values of many rows and columns are identical, which should not be the case.
[fig. 1 and fig. 2: contour plots of m vs. srate and T, from Python and MATLAB respectively]
If I understand correctly, your requirement is simply to calculate the first derivative of flow_stress3 wrt srate. The code seems rather complex for that. In particular, I don't understand what purpose the last line serves.
Since you're already using scipy, I would suggest using the UnivariateSpline function. Your code will shrink to something like:
from scipy.interpolate import UnivariateSpline
# UnivariateSpline expects 1-D y, so fit one spline per temperature column
# and evaluate its derivative on the srate grid
m = np.column_stack([UnivariateSpline(srate, flow_stress3[:, j]).derivative()(srate)
                     for j in range(flow_stress3.shape[1])])
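If all you need are finite-difference slopes on the already-interpolated grid, np.gradient is a simpler alternative (a sketch, assuming srate holds the row coordinates of flow_stress3):

m = np.gradient(flow_stress3, srate, axis=0)   # d(flow_stress3)/d(srate), shape (100, 50)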

Return value from DataFrame at maximum of a function

For narrow-band processing I want the complex pressure at the peak frequency bin. To find the peak frequency bin, I use the frequency with the highest absolute value within a small range of frequencies.
I have come up with the following code, borrowing heavily from
Use idxmax for indexing in pandas
This seems bulky to me, and hard to generalize. Ideally I hope to be able to make fBins into an array and return many frequencies at once. It's OK to make maxAbsIndex into a list, but I can't see the next step.
import numpy as np
import pandas as pd
# Construct fake frequency data on multiple channels
np.random.seed(0)
numF = 1000
f = np.arange(numF) / (numF * 2)
y = np.random.randn(numF, 2) + 1j * np.random.randn(numF, 2)
# Put time series into a DataFrame, indexed by frequency
yFrame = pd.DataFrame(y, index = f)
fBins = 0.1
tol = 0.01
# Find the index of the maximum absolute value within a given frequency window
absMaxIndex = yFrame[(fBins - tol):(fBins + tol)].abs().idxmax()
# Return the value at this index (.loc/.items replace the deprecated .ix/.iteritems)
value = [yFrame.loc[i, col] for col, i in absMaxIndex.items()]
print(value)
value should contain the complex values
[(-2.0946030712061448-1.0585718976053677j), (-2.7396771671895563+0.79204149842297422j)]
which have the largest absolute values in yFrame between 0.09 and 0.11 Hz for each channel.
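As a step toward that generalization, a sketch that loops over several centre frequencies (the fBins values here are hypothetical, and the selection logic is unchanged from the code above):

fBins = [0.05, 0.1, 0.2]   # hypothetical list of centre frequencies
values = []
for fb in fBins:
    idx = yFrame.loc[(fb - tol):(fb + tol)].abs().idxmax()   # per-channel peak index
    values.append([yFrame.loc[i, col] for col, i in idx.items()])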
