How to calculate difference between rows in Pandas DataFrame? - python-3.x

In a dataframe I have 4 variables that are the X, Y, Z and W orientations of a robot. Each line represents a measurement with these four values.
x = [-0.75853, -0.75853, -0.75853, -0.75852]
y = [-0.63435, -0.63434, -0.63435, -0.63436]
z = [-0.10488, -0.10490, -0.10492, -0.10495]
w = [-0.10597, -0.10597, -0.10597, -0.10596]
df = pd.DataFrame([x, y, z, w], columns=['x', 'y', 'z', 'w'])
I wrote the function below that returns three differences between two quaternions:
from pyquaternion import Quaternion
def quaternion_distances(w1, x1, y1, z1, w2, x2, y2, z2):
    """Create two Quaternion objects and calculate three distances between them."""
    q1 = Quaternion(w1, x1, y1, z1)
    q2 = Quaternion(w2, x2, y2, z2)
    dist_by_signal = Quaternion.absolute_distance(q1, q2)
    dist_geodesic = Quaternion.distance(q1, q2)
    dist_sim_geodec = Quaternion.sym_distance(q1, q2)
    return dist_by_signal, dist_geodesic, dist_sim_geodec
Each difference is calculated between the values of one row and the values of the row before it, so I cannot simply use the Pandas apply function on a single row.
I have already added three columns to the dataframe, so that I receive each of the values returned by the function:
df['dist_by_signal'] = 0
df['dist_geodesic'] = 0
df['dist_sim_geodec'] = 0
The problem is: how to apply the above function to each row and include the result in these new columns? Can you give me a suggestion?

Consider using shift to create adjacent columns w2, x2, y2, z2 that hold the next row's values, then run a row-wise apply, which requires axis='columns' (not 'index'). The last row has no successor, so its shifted columns hold NaN; you may want to drop or skip that row before computing distances:
# Shift the same columns, in the same order, as the new column names
df[[col + '2' for col in 'wxyz']] = df[['w', 'x', 'y', 'z']].shift(-1)
def quaternion_distances(row):
    """Create two Quaternion objects and calculate three distances between them."""
    q1 = Quaternion(row['w'], row['x'], row['y'], row['z'])
    q2 = Quaternion(row['w2'], row['x2'], row['y2'], row['z2'])
    row['dist_by_signal'] = Quaternion.absolute_distance(q1, q2)
    row['dist_geodesic'] = Quaternion.distance(q1, q2)
    row['dist_sim_geodec'] = Quaternion.sym_distance(q1, q2)
    return row
df = df.apply(quaternion_distances, axis='columns')
print(df)
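The pairing of each row with its successor can also be done without extra columns. The sketch below is only an illustration: it uses made-up unit quaternions and a hypothetical stand-in metric, `sign_invariant_distance`, written in plain NumPy so that it runs without pyquaternion. Any function taking two rows could be swapped in.

```python
import numpy as np
import pandas as pd

# Made-up example data: three unit quaternions, one per row.
df = pd.DataFrame({
    "w": [1.0, 0.0, 0.0],
    "x": [0.0, 1.0, 0.0],
    "y": [0.0, 0.0, 1.0],
    "z": [0.0, 0.0, 0.0],
})

def sign_invariant_distance(q1, q2):
    """Hypothetical stand-in metric: min(|q1 - q2|, |q1 + q2|),
    which treats q and -q as the same orientation."""
    return min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2))

rows = df[["w", "x", "y", "z"]].to_numpy()

# Pair row i with row i + 1 by zipping the array against itself, offset by one.
dists = [sign_invariant_distance(a, b) for a, b in zip(rows[:-1], rows[1:])]

# The last row has no successor, so it gets NaN.
df["dist_to_next"] = dists + [np.nan]
print(df)
```

Zipping `rows[:-1]` with `rows[1:]` keeps the "current row, next row" logic in one place and never feeds the missing last pair to the distance function.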

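You can precompute the Quaternion objects and loop over consecutive pairs:
df = df.reset_index(drop=True)  # make the index 0..n-1 so that i+1 is the next row
# Note: Quaternion reads a 4-element sequence as (w, x, y, z), so the column order matters
Quaternions = df.apply(lambda row: Quaternion(row), axis=1)
df['dist_by_signal'] = 0.0
df['dist_geodesic'] = 0.0
df['dist_sim_geodec'] = 0.0
for i in df.index:
    q1 = Quaternions[i]
    if i + 1 < len(df.index):
        q2 = Quaternions[i + 1]
        df.loc[i, ['dist_by_signal', 'dist_geodesic', 'dist_sim_geodec']] = [Quaternion.absolute_distance(q1, q2), Quaternion.distance(q1, q2), Quaternion.sym_distance(q1, q2)]
print(df)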
x y z w dist_by_signal dist_geodesic \
0 -0.75853 -0.75853 -0.75853 -0.75852 0.248355 0.178778
1 -0.63435 -0.63434 -0.63435 -0.63436 1.058875 1.799474
2 -0.10488 -0.10490 -0.10492 -0.10495 0.002111 0.010010
3 -0.10597 -0.10597 -0.10597 -0.10596 0.000000 0.000000
dist_sim_geodec
0 0.178778
1 1.799474
2 0.010010
3 0.000000

Related

Comparison of all the rows inside a column on the same table

I created a sample data frame as below:
data = {'X': [10,20,30,40,50,60,70,80,90],
'Y': [1,2,3,4,5,6,7,8,9],
}
df_test = pd.DataFrame(data)
Now I need to find the distance between two points in the sample data set. For that, I have already defined a function:
import numpy as np
def dist(X1, Y1, X2, Y2):
    # find the distance between 2 points
    d = np.sqrt((X2 - X1) * (X2 - X1) + (Y2 - Y1) * (Y2 - Y1))
    return d
I need to compare every row with every other row in the data frame: take the 1st row and compute its distance to all the other rows, then do the same with the 2nd row, and so on until all rows have been compared with each other.
To do this I used a nested for loop, as below:
for i in range(len(df_test)):
    X1 = df_test.X[i]
    Y1 = df_test.Y[i]
    for j in range(len(df_test)):
        X2 = df_test.X[j]
        Y2 = df_test.Y[j]
        res = dist(X1, Y1, X2, Y2)
        print(res)
    print()
How can I do this without a nested for loop? The loop works for a small number of rows, but with large datasets the process gets killed in the terminal.
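One possible way to avoid the nested loop, offered as a sketch and not taken from the original thread, is NumPy broadcasting: subtract every point from every other point in a single array operation. The helper name `pairwise_distances` is made up for this example; the X and Y values are the ones from the question.

```python
import numpy as np

def pairwise_distances(xs, ys):
    """Return an (n, n) array whose [i, j] entry is the Euclidean
    distance between point i and point j."""
    pts = np.column_stack((np.asarray(xs, dtype=float),
                           np.asarray(ys, dtype=float)))
    # Broadcasting: (n, 1, 2) - (1, n, 2) -> (n, n, 2) coordinate differences.
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = [10, 20, 30, 40, 50, 60, 70, 80, 90]
Y = [1, 2, 3, 4, 5, 6, 7, 8, 9]

distances = pairwise_distances(X, Y)
print(distances.shape)   # one row and one column per point
print(distances[0, 1])   # distance between the first two points
```

With a DataFrame, the same function could be called as `pairwise_distances(df_test['X'], df_test['Y'])`. The result still needs memory proportional to n squared, so for very large inputs it may be necessary to process the rows in chunks, or to look at `scipy.spatial.distance.pdist`, which stores each pair only once.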

Spark: Running Backwards Elimination By P-Value With Linear Regressions

I presently have a Spark Dataframe with 2 columns:
1) a column where each row contains a vector of predictive features
2) a column containing the value to be predicted.
To discern the most predictive features for use in a later model, I am using backwards elimination by P-value, as outlined by this article. Below is my code:
num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
for i in range(0, num_vars):
    model = LinearRegression(featuresCol="filtered_features", labelCol="averageScore")
    model = model.fit(scoresDf)
    p_values = model.summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        drop_max_index_udf = udf(lambda elem, drop_index, var_count:
            Vectors.dense([elem[j] for j in range(var_count) if j not in [drop_index]]), VectorUDT())
        scoresDfs = scoresDf.withColumn("filtered_features", drop_max_index_udf(scoresDf["filtered_features"],
            lit(max_index), lit(num_vars)))
        num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
The code runs, but the only problem is that every iteration takes drastically longer than the last. Based on the answer to this question, it appears that the code is re-evaluating all prior iterations every time.
Ideally, I would like to feed the entire logic into some Pipeline structure that would store it all lazily and then execute sequentially with no repeats when called upon, but I am unsure as to whether that is even possible given that none of Spark's estimator / transformer functions seem to fit this use case.
Any guidance would be appreciated, thanks!
You are creating the model repeatedly inside a loop. Fitting is a time-consuming process and needs to be done only once per training data set and set of parameters. Try the following:
num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
modelAlgo = LinearRegression(featuresCol="filtered_features", labelCol="averageScore")
model = modelAlgo.fit(scoresDf)
for i in range(0, num_vars):
    p_values = model.summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        drop_max_index_udf = udf(lambda elem, drop_index, var_count:
            Vectors.dense([elem[j] for j in range(var_count) if j not in [drop_index]]), VectorUDT())
        scoresDfs = scoresDf.withColumn("filtered_features", drop_max_index_udf(scoresDf["filtered_features"],
            lit(max_index), lit(num_vars)))
        num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
Once you are happy with the model you save it. When you need to evaluate your data just read this model and predict with it.
Why are you calling
model = model.fit(scoresDf)
when scoresDfs contains your new DataFrame with one fewer independent variable?
If you change your code to the following:
independent_vars = ['x0', 'x1', 'x2', 'x3', 'x4']
def remove_element(array, index):
    return Vectors.dense(np.delete(array, index, 0))
remove_element_udf = udf(lambda a, i: remove_element(a, i), VectorUDT())
max_p = 1
i = 0
while (max_p > 0.05):
    model = LinearRegression(featuresCol="filtered_features",
                             labelCol="averageScore",
                             fitIntercept=False)
    model = model.fit(scoresDf)
    print('iteration: ', i)
    summary = model.summary
    summary_df = pd.DataFrame({
        'var': independent_vars,
        'coeff': model.coefficients,
        'se': summary.coefficientStandardErrors,
        'p_value': summary.pValues
    })
    print(summary_df)
    print("r2: %f" % summary.r2)
    p_values = summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        max_var = independent_vars[max_index]
        print('-> max_index {max_index}, corresponding to var {var}'.format(max_index=max_index, var=max_var))
        scoresDf = scoresDf.withColumn("filtered_features", remove_element_udf(scoresDf["filtered_features"],
                                                                               lit(max_index)))
        independent_vars = np.delete(independent_vars, max_index, 0)
    print()
    i += 1
you will get
iteration: 0
var coeff se p_value
0 x0 0.174697 0.207794 0.402616
1 x1 -0.448982 0.203421 0.029712
2 x2 -0.452940 0.233972 0.055856
3 x3 -3.213578 0.209935 0.000000
4 x4 3.790730 0.212917 0.000000
r2: 0.870330
-> max_index 0, corresponding to var x0
iteration: 1
var coeff se p_value
0 x1 -0.431835 0.202087 0.035150
1 x2 -0.460711 0.233432 0.051297
2 x3 -3.218725 0.209525 0.000000
3 x4 3.768661 0.210970 0.000000
r2: 0.869365
-> max_index 1, corresponding to var x2
iteration: 2
var coeff se p_value
0 x1 -0.479803 0.203592 0.020449
1 x3 -3.344830 0.202501 0.000000
2 x4 3.669419 0.207925 0.000000
r2: 0.864065
In the first and second iterations, the two independent variables with a p-value greater than 0.05 (x0 and x2) are removed.

How to find the slope: output format ((A,slope), B)

I'm enrolled in a Data Science course and I'm trying to solve some programming problems. I haven't worked with Python in a long time, but I'm trying to improve my knowledge of the language.
Here is my problem:
def find_slope(x1, y1, x2, y2):
    if x1 == x2:
        return "inf"
    else:
        return float(y2 - y1) / (x2 - x1)
Here is my driver code:
x1 = 1
y1 = 2
x2 = -7
y2 = -2
print(find_slope(x1, y1, x2, y2))
This is my output:
0.5
I'm not sure how to get it in the correct format, such as (((1, 2), .5), (3, 4))
NOTE: I wrote the code for the driver.
You can do this:
def find_slope(input):
    x1 = input[0][0]
    y1 = input[0][1]
    x2 = input[1][0]
    y2 = input[1][1]
    if x1 == x2:
        slope = "inf"
    else:
        slope = float(y2 - y1) / (x2 - x1)
    output = (((x1, y1), slope), (x2, y2))
    return output
I changed the input to match the input format given in the screenshot.
Now the input is a single tuple containing two tuples. Each of the inner tuples contains an x coordinate and a y coordinate.
You can call the function using
input = ((1, 2), (-7, -2))
output = find_slope(input)
The output will be in the format ((A, slope), B), where A and B are tuples containing the x and y coords.
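As a compact illustration of that output format, here is a sketch of the same idea using tuple unpacking. It assumes Python 3, where `/` is always true division, and it renames the parameter to `points` (a name chosen for this example) to avoid shadowing the built-in `input`.

```python
def find_slope(points):
    """Take ((x1, y1), (x2, y2)) and return (((x1, y1), slope), (x2, y2))."""
    (x1, y1), (x2, y2) = points
    # A vertical line has no finite slope; the string "inf" mirrors the code above.
    slope = "inf" if x1 == x2 else (y2 - y1) / (x2 - x1)
    return (((x1, y1), slope), (x2, y2))

result = find_slope(((1, 2), (-7, -2)))
vertical = find_slope(((3, 0), (3, 5)))
print(result)    # (((1, 2), 0.5), (-7, -2))
print(vertical)  # (((3, 0), 'inf'), (3, 5))
```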

Create Image using Matplotlib imshow meshgrid and custom colors

I am trying to create an image where the x axis is the width and the y axis is the height of the image, and where each point can be given a color based on an RGB mapping. From looking at imshow() from Matplotlib, I guess I need to create a meshgrid of the form (N x M x 3), where 3 is a tuple or something similar with the RGB colors.
But so far I have not managed to understand how to do that. Let's say I have this example:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
x_min = 1
x_max = 5
y_min = 1
y_max = 5
Nx = 5 #number of steps for x axis
Ny = 5 #number of steps for y axis
x = np.linspace(x_min, x_max, Nx)
y = np.linspace(y_min, y_max, Ny)
#Can then create a meshgrid using this to get the x and y axis system
xx, yy = np.meshgrid(x, y)
#imagine I have some function that does something based on the x and y values
def somefunc(x_value, y_value):
    #do something and return rgb based on that
    return x_value + y_value
res = somefunc(xx, yy)
cmap = LinearSegmentedColormap.from_list('mycmap', ['white', 'blue', 'black'])
plt.figure(dpi=100)
plt.imshow(res, cmap=cmap, interpolation='bilinear')
plt.show()
This creates a plot, but what would I have to do if my goal was to assign specific RGB values based on the x and y values inside somefunc, and make the resulting numpy array an N x M x 3 array?
I tried to make somefunc return a tuple of RGB values (r, g, b), but that does not seem to work.
It will of course completely depend on what you want to do with the values you supply to the function. So let's assume you just want to put the x values as the red channel and the y values as the blue channel, this could look like
def somefunc(x_value, y_value):
    return np.dstack((x_value/5., np.zeros_like(x_value), y_value/5.))
Complete example:
import numpy as np
import matplotlib.pyplot as plt
x_min = 1
x_max = 5
y_min = 1
y_max = 5
Nx = 5 #number of steps for x axis
Ny = 5 #number of steps for y axis
x = np.linspace(x_min, x_max, Nx)
y = np.linspace(y_min, y_max, Ny)
#Can then create a meshgrid using this to get the x and y axis system
xx, yy = np.meshgrid(x, y)
#imagine I have some function that does something based on the x and y values
def somefunc(x_value, y_value):
    return np.dstack((x_value/5., np.zeros_like(x_value), y_value/5.))
res = somefunc(xx, yy)
plt.figure(dpi=100)
plt.imshow(res)
plt.show()
If you already have a (more complicated) function that returns an RGB tuple, you may loop over the grid to fill an empty array with the values of the function.
#If you already have some function that returns an RGB tuple
def somefunc(x_value, y_value):
    if x_value > 2 and y_value < 3:
        return np.array(((y_value+1)/4., (y_value+2)/5., 0.43))
    elif x_value <= 2:
        return np.array((y_value/5., (x_value+3)/5., 0.0))
    else:
        return np.array((x_value/5., (y_value+5)/10., 0.89))
# you may loop over the grid to fill a new array with those values
res = np.zeros((xx.shape[0], xx.shape[1], 3))
for i in range(xx.shape[0]):
    for j in range(xx.shape[1]):
        res[i,j,:] = somefunc(xx[i,j], yy[i,j])
plt.figure(dpi=100)
plt.imshow(res)
plt.show()
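The scalar version above cannot be called on the whole grid at once, because `if x_value > 2 and ...` is ambiguous for arrays. As a possible alternative to the double loop, the same piecewise rule can be written with nested `np.where` calls. This is only a sketch: the name `somefunc_vectorized` is made up for the example, and the grid is the 5 x 5 one from the question.

```python
import numpy as np

x = np.linspace(1, 5, 5)
y = np.linspace(1, 5, 5)
xx, yy = np.meshgrid(x, y)

def somefunc_vectorized(xv, yv):
    """Array version of the if/elif/else rule: returns an (Ny, Nx, 3) RGB array."""
    first = (xv > 2) & (yv < 3)   # the 'if' branch
    second = xv <= 2              # the 'elif' branch (only used where 'first' is False)
    r = np.where(first, (yv + 1) / 4.0, np.where(second, yv / 5.0, xv / 5.0))
    g = np.where(first, (yv + 2) / 5.0, np.where(second, (xv + 3) / 5.0, (yv + 5) / 10.0))
    b = np.where(first, 0.43, np.where(second, 0.0, 0.89))
    # Stack the three channels along a new last axis.
    return np.dstack((r, g, b))

res = somefunc_vectorized(xx, yy)
print(res.shape)
```

The resulting array can be passed to `plt.imshow(res)` in the same way as the loop-filled one; for float RGB data, imshow expects values between 0 and 1, which holds for this grid.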

matplotlib pcolormesh plot from x,y,z data

I have data in a text file in table form with three columns. I use np.genfromtxt to read the columns as x, y, z.
I want to create a color mesh plot where x and y are the coordinates and z represents the color. I think people refer to such a plot as a heatmap.
My code is as follows:
x = np.genfromtxt('mesh.txt', dtype=float, delimiter=' ', usecols = (0))
y = np.genfromtxt('mesh.txt', dtype=float, delimiter=' ', usecols = (1))
z = np.genfromtxt('mesh.txt', dtype=float, delimiter=' ', usecols = (2))
xmesh, ymesh = np.meshgrid(x,y)
diagram1.pcolormesh(xmesh,ymesh,z)
But I get the following error message:
line 7154, in pcolormesh
C = ma.ravel(C[0:Ny-1, 0:Nx-1]) # data point in each cell is value at
IndexError: too many indices
The textfile is as follows:
1 1 5
2 1 4
3 1 2
4 1 6
1 2 6
2 2 2
3 2 1
4 2 9
1 3 7
2 3 4
3 3 3
4 3 5
1 4 3
2 4 4
3 4 7
4 4 6
How can I solve this?
In the example data provided above, x, y, and z can easily be reshaped into 2D arrays. The answer below is for someone who is looking for a more general answer with scattered x, y, and z arrays.
import matplotlib.pyplot as plt
import numpy
# matplotlib.mlab.griddata was removed in Matplotlib 3.1;
# scipy.interpolate.griddata is used here as a replacement
from scipy.interpolate import griddata
# use your x, y and z arrays here
x = numpy.random.randint(1, 30, 50)
y = numpy.random.randint(1, 30, 50)
z = numpy.random.randint(1, 30, 50)
# interpolate the scattered points onto a regular grid
xi = numpy.linspace(x.min(), x.max(), 100)
yi = numpy.linspace(y.min(), y.max(), 100)
xx, yy = numpy.meshgrid(xi, yi)
zz = griddata((x, y), z, (xx, yy), method='linear')
plt.pcolormesh(xx, yy, zz)
#plt.contourf(xx, yy, zz) # if you want a contour plot
#plt.imshow(zz)
plt.colorbar()
plt.show()
My guess is that x, y and z will be read as one-dimensional vectors of the same length, let's say N. The problem is that when you create your xmesh and ymesh, they are N x N, which your z values should be as well. It's only N, which is why you are getting an error.
What is the layout of your file? I'm guessing each row is a (x,y,z) that you want to create a mesh from. In order to do this, you need to know how the points are ordered as a mesh (either as row-major or column-major). Once you know this, instead of creating xmesh and ymesh, you can do something like this:
N = int(np.sqrt(len(x))) # Only if the grid is square, adjust accordingly
x = x.reshape((N, N))
y = y.reshape((N, N))
z = z.reshape((N, N))
pcolormesh(x, y, z)
Before doing this, I would start by doing this:
scatter(x, y, c=z)
which will give you the points of the mesh, which is a good starting point.
I had the same problem and agree with Gustav Larsson's suggestion to use
scatter(x, y, c=z)
In my particular case, I set the linewidths of the scatter points to zero:
scatter(x, y, c=z, linewidths=0)
Of course, you can play around with other decorations, color schemes, etc.; the matplotlib.pyplot.scatter reference will help you further.
It seems you are plotting X and Y as 2D arrays while Z is still a 1D array. Try something like:
Znew=np.reshape(z,(len(xmesh[:,0]),len(xmesh[0,:])))
diagram1.pcolormesh(xmesh,ymesh,Znew)
Update: You have an X/Y grid of size 4x4:
x = np.genfromtxt('mesh.txt', dtype=float, delimiter=' ', usecols = (0))
y = np.genfromtxt('mesh.txt', dtype=float, delimiter=' ', usecols = (1))
z = np.genfromtxt('mesh.txt', dtype=float, delimiter=' ', usecols = (2))
Reshape the arrays as suggested by @Gustav Larsson and myself, like this:
Xnew=np.reshape(x,(4,4))
Ynew=np.reshape(y,(4,4))
Znew=np.reshape(z,(4,4))
Which gives you three 4x4 arrays to plot using pcolormesh:
diagram1.pcolormesh(Xnew,Ynew,Znew)
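The reshape approach can be checked on the sample file without hard-coding the grid size. The sketch below is an assumption-laden illustration: it embeds the 16 rows from the question as a string instead of reading mesh.txt, and it assumes the rows are ordered with x varying fastest, as they are in the sample.

```python
import io
import numpy as np

# The sample data from the question, embedded so the example is self-contained.
sample = """1 1 5
2 1 4
3 1 2
4 1 6
1 2 6
2 2 2
3 2 1
4 2 9
1 3 7
2 3 4
3 3 3
4 3 5
1 4 3
2 4 4
3 4 7
4 4 6
"""

# Read all three columns in a single pass.
x, y, z = np.loadtxt(io.StringIO(sample), unpack=True)

# Grid dimensions derived from the data itself.
nx = np.unique(x).size
ny = np.unique(y).size

# x varies fastest in the file, so each reshaped row holds one y value.
X = x.reshape(ny, nx)
Y = y.reshape(ny, nx)
Z = z.reshape(ny, nx)
print(Z)
```

The three arrays can then be passed to pcolormesh as `diagram1.pcolormesh(X, Y, Z)`. When X, Y and Z have the same shape, recent Matplotlib versions (3.3 and later, as far as I know) accept `shading='nearest'` to center each cell on its point, whereas older versions drop the last row and column of Z.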

Resources