I created a sample data frame as below:
import pandas as pd
data = {'X': [10, 20, 30, 40, 50, 60, 70, 80, 90],
        'Y': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df_test = pd.DataFrame(data)
Now I need to find the distance between pairs of points in this sample data set. For that, I have already defined a function:
import numpy as np
def dist(X1, Y1, X2, Y2):
    # find the distance between 2 points
    d = np.sqrt((X2 - X1) * (X2 - X1) + (Y2 - Y1) * (Y2 - Y1))
    return d
I need the distance between every pair of points. For example, the 1st row must be compared with all the other rows in the same data frame, then the 2nd row must be compared with all the other rows, and so on until every row has been compared with every other row.
To do this I used a nested for loop as below:
for i in range(len(df_test)):
    X1 = df_test.X[i]
    Y1 = df_test.Y[i]
    for j in range(len(df_test)):
        X2 = df_test.X[j]
        Y2 = df_test.Y[j]
        res = dist(X1, Y1, X2, Y2)
        print(res)
    print()
How can I do this without a nested for loop? The method works for a small number of rows, but with large datasets the process gets killed in the terminal.
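One possible way to avoid the explicit loops is NumPy broadcasting. The sketch below is only an illustration under the assumption that the full n-by-n result fits in memory; the helper name pairwise_dist is hypothetical and not part of the original post.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    'X': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'Y': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

def pairwise_dist(df):
    """Return an (n, n) array whose [i, j] entry is the distance from row i to row j."""
    pts = df[['X', 'Y']].to_numpy(dtype=float)   # shape (n, 2)
    diff = pts[:, None, :] - pts[None, :, :]     # shape (n, n, 2) via broadcasting
    return np.sqrt((diff ** 2).sum(axis=-1))     # Euclidean distance per pair

D = pairwise_dist(df_test)
print(D[0])  # distances from the first row to every row
```

The result grows quadratically with the number of rows, so for very large data frames it may still be necessary to process the rows in chunks rather than all at once.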
Related
In a dataframe I have 4 variables that are the X, Y, Z and W orientations of a robot. Each line represents a measurement with these four values.
x = [-0.75853, -0.75853, -0.75853, -0.75852]
y = [-0.63435, -0.63434, -0.63435, -0.63436]
z = [-0.10488, -0.10490, -0.10492, -0.10495]
w = [-0.10597, -0.10597, -0.10597, -0.10596]
df = pd.DataFrame([x, y, z, w], columns=['x', 'y', 'z', 'w'])
I wrote the function below, which returns three distances between two quaternions:
from pyquaternion import Quaternion
def quaternion_distances(w1, x1, y1, z1, w2, x2, y2, z2):
    """ Create two Quaternion objects and calculate 3 distances between them """
    q1 = Quaternion(w1, x1, y1, z1)
    q2 = Quaternion(w2, x2, y2, z2)
    dist_by_signal = Quaternion.absolute_distance(q1, q2)
    dist_geodesic = Quaternion.distance(q1, q2)
    dist_sim_geodec = Quaternion.sym_distance(q1, q2)
    return dist_by_signal, dist_geodesic, dist_sim_geodec
Each distance is calculated between the values of one row and the values of the following row, so I cannot simply use the pandas apply function on a single row.
I have already added three columns to the dataframe, so that I receive each of the values returned by the function:
df['dist_by_signal'] = 0
df['dist_geodesic'] = 0
df['dist_sim_geodec'] = 0
The problem is: how can I apply the above function to each row and store the results in these new columns? Can you give me a suggestion?
Consider using shift to create adjacent columns w2, x2, y2, z2 that hold the next row's values, then run a row-wise apply, which requires axis='columns' (not 'index'):
df[[col + '2' for col in 'wxyz']] = df[['w', 'x', 'y', 'z']].shift(-1)
def quaternion_distances(row):
    """ Create two Quaternion objects and calculate 3 distances between them """
    q1 = Quaternion(row['w'], row['x'], row['y'], row['z'])
    q2 = Quaternion(row['w2'], row['x2'], row['y2'], row['z2'])
    row['dist_by_signal'] = Quaternion.absolute_distance(q1, q2)
    row['dist_geodesic'] = Quaternion.distance(q1, q2)
    row['dist_sim_geodec'] = Quaternion.sym_distance(q1, q2)
    return row
df = df.apply(quaternion_distances, axis='columns')
print(df)
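To make the role of shift(-1) concrete, here is a minimal sketch on hypothetical toy data. It deliberately uses a plain Euclidean norm as a stand-in metric instead of the pyquaternion distances, so it only illustrates how each row gets paired with its successor.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each row is one measurement.
df = pd.DataFrame({'w': [1.0, 0.0, 0.0], 'x': [0.0, 1.0, 0.0]})

# Next-row values aligned with the current row; the last row has no successor (NaN).
nxt = df[['w', 'x']].shift(-1)

# Stand-in metric: Euclidean norm of (next row - current row), computed column-wise.
# skipna=False keeps NaN for the last row instead of silently summing to 0.
df['dist_next'] = np.sqrt(((nxt - df[['w', 'x']]) ** 2).sum(axis=1, skipna=False))
print(df)
```

The last row ends up with NaN because there is no following measurement to compare against.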
You can use:
df = df.reset_index(drop=True)
Quaternions = df.apply(lambda x: Quaternion(x), axis=1)
df['dist_by_signal'] = 0.0
df['dist_geodesic'] = 0.0
df['dist_sim_geodec'] = 0.0
for i in df.index:
    q1 = Quaternions[i]
    if i + 1 < len(df.index):
        q2 = Quaternions[i + 1]
        df.loc[i, ['dist_by_signal', 'dist_geodesic', 'dist_sim_geodec']] = [Quaternion.absolute_distance(q1, q2), Quaternion.distance(q1, q2), Quaternion.sym_distance(q1, q2)]
print(df)
x y z w dist_by_signal dist_geodesic \
0 -0.75853 -0.75853 -0.75853 -0.75852 0.248355 0.178778
1 -0.63435 -0.63434 -0.63435 -0.63436 1.058875 1.799474
2 -0.10488 -0.10490 -0.10492 -0.10495 0.002111 0.010010
3 -0.10597 -0.10597 -0.10597 -0.10596 0.000000 0.000000
dist_sim_geodec
0 0.178778
1 1.799474
2 0.010010
3 0.000000
I wrote a summation to compute the Y value from my csv file.
I want k to range from 0 up to the time difference. When I ran it, I got this error: ("'str' object cannot be interpreted as an integer", 'occurred at index 0')
My summation equation:
n = time difference between two rows
My code:
def y_convert(X, time):
    Y = 0
    if X == 10:
        for k in range(0, time):
            Y = np.sum(X * k)
    else:
        for k in range(0, time):
            Y = np.sum(X * k)
    return Y
Then I convert the time difference into minutes and apply this code to find Y:
df1['time_diff'] = pd.to_datetime(df1["time"])
df1['delta'] = (df1['time_diff']-df1['time_diff'].shift()).fillna(0)
df1['t'] = df1['delta'].apply(lambda x: x / np.timedelta64(1,'m')).astype('int64') % (24*60)
X = df1['X'].astype(int)
time=df1['t'].astype(int)
Y = df1.apply(lambda x: y_convert(x.X,x.time), axis=1)
Then I tried to plot the graph after getting the correct answer provided by jezrael:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(time, df1['Y'])
ax.set_xlabel('time')
ax.set_ylabel('Y')
plt.show()
Plot graph:
my csv file:
I think you need to pass column t, not column time:
df1['t'] = df1['delta'].dt.total_seconds().div(60).astype(int)
Y = df1.apply(lambda x: y_convert(x.X,x.t), axis=1)
The reason is that the second argument of range must be an integer, not a time string.
Your solution effectively calls:
range(0,'6:15:00')
It also seems your solution can be simplified a lot:
Y = df1['X'] * (df1['t'] - 1)
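As a quick sanity check of that simplification, the sketch below compares the loop with the closed form on plain integers. The function names y_convert_loop and y_convert_closed are hypothetical, and the loop is reduced to its single effective branch since both branches in the question do the same thing.

```python
def y_convert_loop(X, time):
    """The question's loop: Y ends up as X * k for the last k in range(0, time)."""
    Y = 0
    for k in range(0, time):   # time must be an int, e.g. 3, not '6:15:00'
        Y = X * k
    return Y

def y_convert_closed(X, time):
    """Closed form X * (time - 1); matches the loop only when time >= 1."""
    return X * (time - 1)

for X, t in [(10, 1), (10, 5), (7, 3)]:
    assert y_convert_loop(X, t) == y_convert_closed(X, t)

# The two differ when the time difference is 0: the loop body never runs.
print(y_convert_loop(10, 0), y_convert_closed(10, 0))  # 0 -10

try:
    range(0, '6:15:00')
except TypeError as err:
    print(err)  # 'str' object cannot be interpreted as an integer
```

Note the edge case: for a time difference of 0 (such as a first row filled with 0), the loop returns 0 while the closed form returns -X, so that row may need separate handling.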
The code runs without errors, but no results are displayed. Did I miss anything?
# Independent variables
x1 = 'a'
x2 = 'b'
# Dependent variable
y = 'c'
# Pairings
x1y = cor.loc[ x1, y ]
x2y = cor.loc[ x2, y ]
x1x2 = cor.loc[ x1, x2 ]
Rx1x2y = math.sqrt((abs(x1y**2) + abs(x2y**2) - 2*x1y*x2y*x1x2) / (1 - abs(x1x2**2)))
R2 = Rx1x2y**2
# Calculate adjusted R-squared
n = len(data) # Number of rows
k = 2 # Number of independent variables
R2_adj = 1 - ( ((1-R2)*(n-1)) / (n-k-1) )
The code above ran without errors, but no results were displayed. I was expecting the R2 and R2_adj values in the output.
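In a script, assigning a value to a variable does not display it; something has to print it explicitly. The sketch below repeats the calculation with hypothetical correlation values (standing in for the cor.loc[...] lookups) and a hypothetical row count, and adds the print calls.

```python
import math

# Hypothetical correlation values standing in for the cor.loc[...] lookups.
x1y, x2y, x1x2 = 0.6, 0.5, 0.3
n, k = 50, 2   # hypothetical number of rows; two independent variables

# Multiple correlation coefficient squared, then its square root.
R2 = (x1y**2 + x2y**2 - 2 * x1y * x2y * x1x2) / (1 - x1x2**2)
Rx1x2y = math.sqrt(R2)

# Adjusted R-squared.
R2_adj = 1 - ((1 - R2) * (n - 1)) / (n - k - 1)

# Nothing is shown unless the values are printed.
print('R2     =', R2)
print('R2_adj =', R2_adj)
```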
I am running a back-testing program in Python. Even though the maths/logic is simple, Python takes an extremely long time to run the for loop.
Each row takes about 1 second on average, and with thousands to potentially tens of thousands of rows of data, the total time is impractical.
I use a pandas DataFrame as the base and generate forward calculations in a for loop. Is there a more efficient way, or what could I do to reduce the computation time?
def signal_TA1(data, periods):
    columns = ['x1', 'x2', 'x3', .......]
    pd_Append = pd.DataFrame((np.zeros((len(data.index),len(columns)))), columns = columns) # create the needed columns, initialized to zero
    data = data.join(pd_Append)
    data['Size'] = data.bidQ + data.askQ
    data['prx'] = (data.bid * data.askQ + data.ask * data.bidQ)/data.Size
    for i in range(1, len(data.index), 1):
        data.emaX.iloc[i] = data.lambda_.iloc[i] * data.Size.iloc[i] + (1 - data.lambda_.iloc[i]) * data.emaX.iloc[i-1]
        xxxxxx
        xxxxx
        xxxxx
    return data
It seems (and it appears to be fairly well known) that numpy handles looped calculations much more efficiently than pandas, since element-wise assignment into a DataFrame is slow.
Basically, I create a numpy array [x,y] within the function. Then I run the for loop and populate the numpy array row by row. Finally, I convert the finished numpy array into a pandas DataFrame (for easier display and plotting).
The time difference is "forever" versus less than 1 second for about 2,500 rows of data and calculation.
def signal_M2(data, weight, pandas = True):
    bid = np.array(data.bid)
    ask = np.array(data.ask)
    askQ = np.array(data.askQ)
    bidQ = np.array(data.bidQ)
    size = bidQ + askQ
    VWAP = (bid * askQ + ask * bidQ)/(bidQ + askQ)
    columns = [x1, x2, x3, x4, x5, .....]
    datB = np.zeros((len(data.index), len(columns)))
    datA = pd.DataFrame(index=[0], columns = columns)
    lambda_ = 0.5
    weight = 0.3
    x1 = VWAP[0]
    x2 = VWAP[0]
    x3 = VWAP[0]
    x4 = VWAP[0]
    x5 = size[0]
    ....
    ....
    datB[0] = (bid[0], ask[0], bidQ[0], askQ[0], size[0], ..........)
    for row in range(1, len(data.index), 1):
        x1 = lambda_ * size[row] + (1 - lambda_) * emaInertia
        x2 = weight * VWAP[row] + (1 - weight) * emaPrx
        x3 = weight * VWAP[row] + (1 - weight) * emaPrxSlow
        x4 = weight * VWAP[row] + (1 - weight) * emaPrxFast
        x5 = weight * VWAP[row] + (1 - weight) * emaPrxLead
        if pandas == True:
            datB[row] = (bid[row], ask[row], bidQ[row], ...........)
        else:
            print(................)
    if pandas == True:
        datB = pd.DataFrame(datB, columns = columns)
        return datB
    else:
        print('no pandas dataframe was asked to be stored')
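As a minimal, self-contained sketch of the same idea, the block below runs one exponential-moving-average recursion over a plain NumPy array. The function name ema_loop and the random input are hypothetical. It also compares against pandas' built-in ewm with adjust=False, which applies the same recursion, under the assumption that the smoothing factor is a single constant.

```python
import numpy as np
import pandas as pd

def ema_loop(values, alpha):
    """EMA recursion out[i] = alpha * values[i] + (1 - alpha) * out[i-1], seeded with values[0]."""
    out = np.empty(len(values), dtype=float)
    out[0] = values[0]
    for i in range(1, len(values)):
        # Plain array indexing is cheap compared with DataFrame .iloc assignment.
        out[i] = alpha * values[i] + (1 - alpha) * out[i - 1]
    return out

rng = np.random.default_rng(0)
size = rng.uniform(1, 100, 2500)   # hypothetical stand-in for bidQ + askQ

ema_np = ema_loop(size, alpha=0.5)
# With adjust=False, pandas uses the same recursion, also seeded with the first value.
ema_pd = pd.Series(size).ewm(alpha=0.5, adjust=False).mean().to_numpy()
print(np.allclose(ema_np, ema_pd))
```

When the smoothing factor varies per row, as with the lambda_ column in the first version, ewm with a single alpha does not apply and the array loop remains the simpler option.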
Firstly, apologies for the slightly cryptic title of my question. Let me try to explain my need:
I am reading two features, X1 and X2, from a CSV file. The file is a training set of 1000 records, with each line holding the values of X1 and X2. To make my training set fit better with my machine learning code, I want to do feature mapping that takes X1 and X2 and creates polynomial terms up to the power of 4. For example, if X1 = a and X2 = b, I want to add the new features a^2, a*b, b^2, a^3, a^2*b, a*b^2, a^4, and so on.
If I read them as a numpy matrix, I want the data to look like this:
[ [ 1 a b a^2 a*b, b^2 a^3 a^2*b......]
[.... ............ ............ ]
[ ..
..] ]
Note that the number of rows is fixed, but the number of columns is determined by the selected degree. Also, the first three columns need to be
[[1 a b ..]
[1 c d ..]
..
..]
The pseudo code I am thinking of is as follows:-
def poly(X):  # where X is a numpy matrix with X1, X2 columns
    degree = 4
    r = X.shape[0]
    c = 1  # number of columns
    val_matrix = np.ones(shape=(r,c))  # creating a (r,1) matrix init with 1s
    # *start of pseudo code*
    while i <= degree:
        while j <= i:
            val_matrix[:, c+1] = (X1.^(i-j)).*(X2.^j)
I am not sure how to get this working in Python and would appreciate some suggestions. Note that ^ refers to "to the power of".
Starting with two vectors X1 and X2 you could create the monomials:
X1p = X1[:, None]**np.arange(max_deg + 1)
X2p = X2[:, None]**np.arange(max_deg + 1)
and then combine them using mgrid:
i, j = np.mgrid[:max_deg + 1,:max_deg + 1]
m = i+j <= max_deg
result = X1p[:, i[m]]*X2p[:, j[m]]
Alternatively you could apply the indices directly to X1 and X2:
result = X1[:, None]**i[m] * X2[:, None]**j[m]
This requires fewer lines of code but uses more multiplications.
If the number of multiplications is a concern, X1p and X2p could also be computed more cheaply. For X1p:
X1p = np.empty((len(X1), max_deg + 1), X1.dtype)
X1p[:, 0] = 1
X1p[:, 1:] = X1[:, None]
np.multiply.accumulate(X1p[:,1:], axis=-1, out=X1p[:, 1:])
and similarly for X2p.
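Putting the pieces together, here is a minimal runnable sketch of the mgrid approach. The function name poly_features and the sample values are hypothetical. The extra np.lexsort step is an addition not in the answer above: it reorders the columns by total degree so they start with 1, a, b, a^2, a*b, b^2 as the question asks.

```python
import numpy as np

def poly_features(X1, X2, max_deg=4):
    """Columns X1**i * X2**j for all i, j >= 0 with i + j <= max_deg, ordered by total degree."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    powers = np.arange(max_deg + 1)
    X1p = X1[:, None] ** powers              # column p holds X1**p
    X2p = X2[:, None] ** powers              # column p holds X2**p
    i, j = np.mgrid[:max_deg + 1, :max_deg + 1]
    m = i + j <= max_deg                     # keep only terms up to max_deg
    ei, ej = i[m], j[m]                      # exponents of X1 and X2 per column
    order = np.lexsort((ej, ei + ej))        # sort by total degree, then by X2 exponent
    ei, ej = ei[order], ej[order]
    return X1p[:, ei] * X2p[:, ej]

X1 = np.array([2.0, 3.0])
X2 = np.array([5.0, 7.0])
feats = poly_features(X1, X2)
print(feats.shape)    # (2, 15)
print(feats[0, :6])   # 1, a, b, a^2, a*b, b^2 for a=2, b=5
```

For degree 4 there are 15 columns, one per exponent pair (i, j) with i + j <= 4.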