How to remove element pairs from a numpy array? - python-3.x

I have an array:
coordinates = np.asarray(list(product(seq, seq))) - fieldSize_va/2.0
This coordinates variable is a numpy.ndarray with 1600 elements (pairs), and can be seen as:
>>> array([[-4.5, -4.5], [-4.5, -4.26923077], [-4.5 , -4.03846154], ..., [4.5, 4.03846154], [4.5, 4.26923077], [4.5, 4.5]])
I have another array:
centralLines = np.asarray([(xa, ya),(xa, yb),(xb, ya),(xb, yb)])
which has values as:
>>> array([[ 0.11538462, 0.11538462], [ 0.11538462, -0.11538462], [-0.11538462, 0.11538462], [-0.11538462, -0.11538462]])
The coordinates variable contains all the pairs that are in the centralLines variable. I want to remove the centralLines pairs from coordinates. How can I do this?
The coordinates variable is computed using the following code:
import math
import numpy as np
from itertools import product
from numpy import linspace,degrees,random
N = 40 * 40
fieldSize_va = 9
seq = linspace(0, fieldSize_va, int(math.sqrt(N)))  # linspace expects an integer sample count
coordinates = np.asarray(list(product(seq, seq))) - fieldSize_va/2.0

Solution
One easy way to solve this is to sweep the original array and keep only the pairs that are not in centralLines:
result = np.array([position for position in coordinates if position not in centralLines])
However, I must warn you that this solution is not optimized. Perhaps somebody else will come up with a faster vectorized solution.
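One vectorized alternative (a sketch, not from the original answer): note that position not in centralLines relies on NumPy's element-wise membership test, which can misfire when only one coordinate of a pair matches. Comparing whole rows via broadcasting avoids that and the Python loop:
import numpy as np
# coordinates has shape (1600, 2), centralLines has shape (4, 2);
# compare every coordinate row against every centralLines row
matches = np.isclose(coordinates[:, None, :], centralLines[None, :, :]).all(axis=2).any(axis=1)
result = coordinates[~matches]
np.isclose is used instead of == to tolerate floating-point round-off in the computed coordinates.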
Sidenote 1
I would recommend following the common guidelines of Python style, namely PEP 8.
Sidenote 2
Importing numpy just once improves readability of your code!
Repetitive:
import numpy as np
from numpy import linspace
seq = linspace(0, fieldSize_va, int(math.sqrt(N)))
Better:
import numpy as np
seq = np.linspace(0, fieldSize_va, int(math.sqrt(N)))
Sidenote 3
The square root is already included in numpy, as np.sqrt. You can then skip importing the math module altogether.
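For example (with the int cast, since linspace expects an integer sample count):
import numpy as np
N = 40 * 40
fieldSize_va = 9
# one numpy import covers both the square root and linspace
seq = np.linspace(0, fieldSize_va, int(np.sqrt(N)))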

Related

How can I interpolate a numpy array so that it becomes a certain length?

I have three numpy arrays each with different lengths:
A.shape = (3401,)
B.shape = (2200,)
C.shape = (4103,)
I would like to average the three arrays to produce a new array with the size of the largest array (in this case C):
D.shape = (4103,)
Problem is, I don't think I can do this without adding "fake" data to A and B, by interpolation.
How can I perform interpolation on the first two numpy arrays so that they are of the same length as array C?
Do I even need to interpolate here?
The first thing that comes to mind is zoom from scipy:
The array is zoomed using spline interpolation of the requested order.
Code:
import numpy as np
from scipy.ndimage import zoom
A = np.random.rand(3401)
B = np.random.rand(2200)
C = np.ones(4103)
for arr in [A, B]:
    zoom_rate = C.shape[0] / arr.shape[0]
    arr = zoom(arr, zoom_rate)  # rebinds the loop variable; A and B themselves stay unchanged
    print(arr.shape)
Output:
(4103,)
(4103,)
I think the simplest option is to do the following:
D = np.concatenate([np.average([A[:2200], B, C[:2200]], axis=0),
                    np.average([A[2200:3401], C[2200:3401]], axis=0),
                    C[3401:]])
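Another option (a sketch, not from the original answers) is plain linear resampling with np.interp, which maps each array onto a common grid of length 4103 before averaging; resample here is a hypothetical helper:
import numpy as np
A = np.random.rand(3401)
B = np.random.rand(2200)
C = np.random.rand(4103)
def resample(arr, n):
    # linearly resample arr onto n evenly spaced points in [0, 1]
    return np.interp(np.linspace(0, 1, n), np.linspace(0, 1, arr.shape[0]), arr)
D = np.average([resample(A, C.shape[0]), resample(B, C.shape[0]), C], axis=0)
print(D.shape)  # (4103,)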

Reconstructing a matrix from an SVD in python 3

Hi, so basically my question is: I have a matrix which I've decomposed with SVD and have in the variables u, s, and v. I've made some alterations to the s matrix to make it diagonal, as well as altered some of the numbers. Now I'm trying to reconstruct the original matrix from those 3 matrices. Does anyone know of any functions that do this? I can't seem to find any examples of this within numpy.
The only mildly tricky bit would be "expanding" s. If you have SciPy installed, it has scipy.linalg.diagsvd, which can do that for you:
>>> import numpy as np
>>> import scipy.linalg as la
>>>
>>> rng = np.random.default_rng()
>>> A = rng.uniform(-1,1,(4,3))
>>> u,s,v = np.linalg.svd(A)
>>>
>>> B = u @ la.diagsvd(s, *A.shape) @ v
>>>
>>> np.allclose(A,B)
True
I figured it out: using the np.matmul() function to multiply the three matrices u, s, and v together was enough to reconstruct the original matrix.
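A minimal sketch of that approach (assuming s still has to be expanded into a diagonal matrix of the right shape before multiplying):
import numpy as np
rng = np.random.default_rng()
A = rng.uniform(-1, 1, (4, 3))
u, s, v = np.linalg.svd(A)
# expand the singular values into a (4, 3) diagonal matrix by hand
S = np.zeros(A.shape)
np.fill_diagonal(S, s)
B = np.matmul(np.matmul(u, S), v)
print(np.allclose(A, B))  # True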

How to interpret the model once a set of coefficients is obtained for multivariable polynomial regression?

I was solving a multivariable polynomial regression problem, as part of an online course, where one must obtain a model (polynomial form) for determining the 'price of a car' as a function of 'horsepower', 'curb-weight', 'engine-size', and 'highway-mpg'. The code given in the course slide didn't work for me, so I tried to solve the problem on my own using a slightly different approach, and (not sure) I succeeded.
Now I want to determine which coefficient belongs to which variable and to what power.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
lm=LinearRegression()
pr=PolynomialFeatures(degree=2, include_bias=False)
zi=df[['horsepower','curb-weight','engine-size','highway-mpg']]
y=df["price"]
x_poly=pr.fit_transform(zi)
lm.fit(x_poly,y)
y_poly_pred=lm.predict(x_poly)
print(lm.intercept_)
print(lm.coef_)
The output of the 'print(lm.coef_)' is an array:
[ 3.76158683e+02, 1.09866844e+01, -1.15342835e+02, 2.20081486e+02,
1.67487147e+00, -1.85925420e-01, -1.27963440e+00, -1.97616945e+00,
5.93872420e-04, 1.11397083e-01, -2.12935236e-01, 1.04605018e-01,
2.69312438e-01, 4.36657298e+00]
How can I tell which variable and which power each of these coefficients corresponds to?
One way of doing this: you can get the PolynomialFeatures column names like this
pr.get_feature_names(zi.columns)
and
pd.DataFrame(zip(pr.get_feature_names(zi.columns),lm.coef_),columns=["feature","coef_"])
The above should print the coefficient for each feature.
Working example:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
data = pd.DataFrame.from_dict({
    'x': np.random.randint(low=1, high=10, size=5),
    'y': np.random.randint(low=-1, high=1, size=5),
})
lm=LinearRegression()
p = PolynomialFeatures(degree=2)
p_data = p.fit_transform(data)
lm.fit(p_data,data['y'])
print (p.get_feature_names(data.columns))
coefmapping = pd.DataFrame(zip(p.get_feature_names(data.columns),lm.coef_),columns=["feature","coef_"])
print(coefmapping)
output:
feature coef_
0 1 -1.204939e-14
1 x -1.165951e-15
2 y 5.000000e-01
3 x^2 -6.938894e-18
4 x y -3.156113e-16
5 y^2 -5.000000e-01
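Note that newer scikit-learn releases (1.0 and later) renamed get_feature_names to get_feature_names_out, so on a recent install the mapping would be built like this (a sketch under that assumption):
coefmapping = pd.DataFrame(
    zip(p.get_feature_names_out(data.columns), lm.coef_),
    columns=["feature", "coef_"],
)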

Multi Label Text Data Visualization

I have multi-label text data. I want to visualize this data in Python with some suitable graph to get an idea of how much overlap exists in my data, and I also want to know whether there is any pattern in the overlap, e.g. when class_1 appears, class_40 also appears 40% of the time.
Data is in this form:
paragraph_1 class_1
paragraph_11 class_2
paragraph_1 class_2
paragraph_1 class_3
paragraph_13 class_3
What is the best way to visualize such data? Which library can help in this case: seaborn, matplotlib, etc.?
You can try this:
%matplotlib inline
import matplotlib.pyplot as plt
from collections import Counter
x = ['paragraph1', 'paragraph1','paragraph1','paragraph1','paragraph2', 'paragraph2','paragraph3','paragraph1','paragraph4']
y = ['class1','class1','class1', 'class2','class3','class3', 'class1', 'class3','class4']
# count the occurrences of each point
c = Counter(zip(x,y))
# create a list of the sizes, here multiplied by 10 for scale
s = [10*c[(xx,yy)] for xx,yy in zip(x,y)]
plt.grid()
# plot it
plt.scatter(x, y, s=s)
plt.show()
The higher the occurrence, the bigger the marker.
A different question, but with the same approach proposed by @James, can be found here: How to have scatter points become larger for higher density using matplotlib?
Edit 1 (if you have a bigger dataset)
Different approach using heatmaps:
import numpy as np
from collections import Counter
import seaborn as sns
import pandas as pd
x = ['paragraph1', 'paragraph1','paragraph1','paragraph1','paragraph2', 'paragraph2','paragraph3','paragraph1','paragraph4']
y = ['class1','class1','class1', 'class2','class3','class3', 'class1', 'class3','class4']
# count the occurrences of each point
c = Counter(zip(x,y))
# fill pandas DataFrame with zeros
dff = pd.DataFrame(0, columns=np.unique(x), index=np.unique(y))
# count occurrences and prepare data for heatmap
for k, v in c.items():
    dff.loc[k[1], k[0]] = v
sns.heatmap(dff, annot=True, fmt="d")
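As a side note (an alternative sketch, not from the original answer), the counting loop can be replaced with pd.crosstab, which tabulates co-occurrences directly from the two label lists:
import pandas as pd
import seaborn as sns
dff = pd.crosstab(pd.Series(y, name='class'), pd.Series(x, name='paragraph'))
sns.heatmap(dff, annot=True, fmt='d')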

1-D interpolation using python 3.x

I have data that looks like a sigmoidal plot, but flipped relative to the vertical axis.
But the plot is a result of plotting 1D data instead of some sort of function.
My goal is to find the x value when the y value is at 50%. As you can see, there is no data point when y is exactly at 50%.
Interpolation comes to mind, but I'm not sure if it enables me to find the x value when the y value is 50%. So my question is: 1) can you use interpolation to find the x when the y is 50%, or 2) do you need to fit the data to some sort of function?
Below is what I currently have in my code
import numpy as np
import matplotlib.pyplot as plt
my_x = [4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66]
my_y_raw=np.array([0.99470977497817203, 0.99434995886145172, 0.98974611323163653, 0.961630837657524, 0.99327633558441175, 0.99338952769251909, 0.99428263292577534, 0.98690514212711611, 0.99111667721533181, 0.99149418924880861, 0.99133773062680464, 0.99143506380003499, 0.99151080464011454, 0.99268261743308517, 0.99289757252812316, 0.99100207861144063, 0.99157171773324027, 0.99112571824824358, 0.99031608691035722, 0.98978104266076905, 0.989782674787969, 0.98897835092187614, 0.98517540405423909, 0.98308943666187076, 0.96081810781994603, 0.85563541881892147, 0.61570811548079107, 0.33076276040577052, 0.14655134838124245, 0.076853147122142126, 0.035831324928136087, 0.021344669212790181])
my_y=my_y_raw/np.max(my_y_raw)
plt.plot(my_x, my_y,color='k', markersize=40)
plt.scatter(my_x,my_y,marker='*',label="myplot", color='k', edgecolor='k', linewidth=1,facecolors='none',s=50)
plt.legend(loc="lower left")
plt.xlim([4,102])
plt.show()
Using SciPy
The most straightforward way to do the interpolation is to use the SciPy interpolate.interp1d function. SciPy is closely related to NumPy and you may already have it installed. The advantage of interp1d is that it can sort the data for you. This comes at the cost of somewhat funky syntax. In many interpolation functions it is assumed that you are trying to interpolate a y value from an x value. These functions generally need the "x" values to be monotonically increasing. In your case, we swap the normal sense of x and y. The y values have an outlier, as @Abhishek Mishra has pointed out. In the case of your data, you are lucky and you can get away with leaving the outlier in.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
my_x = [4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,
48,50,52,54,56,58,60,62,64,66]
my_y_raw=np.array([0.99470977497817203, 0.99434995886145172,
0.98974611323163653, 0.961630837657524, 0.99327633558441175,
0.99338952769251909, 0.99428263292577534, 0.98690514212711611,
0.99111667721533181, 0.99149418924880861, 0.99133773062680464,
0.99143506380003499, 0.99151080464011454, 0.99268261743308517,
0.99289757252812316, 0.99100207861144063, 0.99157171773324027,
0.99112571824824358, 0.99031608691035722, 0.98978104266076905,
0.989782674787969, 0.98897835092187614, 0.98517540405423909,
0.98308943666187076, 0.96081810781994603, 0.85563541881892147,
0.61570811548079107, 0.33076276040577052, 0.14655134838124245,
0.076853147122142126, 0.035831324928136087, 0.021344669212790181])
# set assume_sorted to have scipy automatically sort for you
f = interp1d(my_y_raw, my_x, assume_sorted = False)
xnew = f(0.5)
print('interpolated value is ', xnew)
plt.plot(my_x, my_y_raw,'x-', markersize=10)
plt.plot(xnew, 0.5, 'x', color = 'r', markersize=20)
plt.plot((0, xnew), (0.5,0.5), ':')
plt.grid(True)
plt.show()
which gives
interpolated value is 56.81214249272691
Using NumPy
Numpy also has an interp function, but it doesn't do the sort for you. And if you don't sort, you'll be sorry:
Does not check that the x-coordinate sequence xp is increasing. If xp
is not increasing, the results are nonsense.
The only way I could get np.interp to work was to shove the data into a structured array.
import numpy as np
import matplotlib.pyplot as plt
my_x = np.array([4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,
48,50,52,54,56,58,60,62,64,66], dtype=float)  # np.float was removed from NumPy; use the builtin float
my_y_raw=np.array([0.99470977497817203, 0.99434995886145172,
0.98974611323163653, 0.961630837657524, 0.99327633558441175,
0.99338952769251909, 0.99428263292577534, 0.98690514212711611,
0.99111667721533181, 0.99149418924880861, 0.99133773062680464,
0.99143506380003499, 0.99151080464011454, 0.99268261743308517,
0.99289757252812316, 0.99100207861144063, 0.99157171773324027,
0.99112571824824358, 0.99031608691035722, 0.98978104266076905,
0.989782674787969, 0.98897835092187614, 0.98517540405423909,
0.98308943666187076, 0.96081810781994603, 0.85563541881892147,
0.61570811548079107, 0.33076276040577052, 0.14655134838124245,
0.076853147122142126, 0.035831324928136087, 0.021344669212790181],
dtype=float)
dt = np.dtype([('x', float), ('y', float)])
data = np.zeros( (len(my_x)), dtype = dt)
data['x'] = my_x
data['y'] = my_y_raw
data.sort(order = 'y') # sort data in place by y values
print('numpy interp gives ', np.interp(0.5, data['y'], data['x']))
which gives
numpy interp gives 56.81214249272691
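An alternative to the structured array (a sketch, not from the original answer) is to sort both arrays together with np.argsort:
order = np.argsort(my_y_raw)
print('numpy interp gives ', np.interp(0.5, my_y_raw[order], my_x[order]))
This produces the same value, since np.interp only needs its xp argument in increasing order.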
As you said, your data looks like a flipped sigmoid. Can we assume that your function is strictly decreasing? If that is the case, we can try the following steps (a code sketch follows the steps):
Remove all the points where the data is not strictly decreasing. For example, for your data that point will be near 0.
Use binary search to find the location where y = 0.5 would be inserted.
Now you know two (x, y) pairs where your desired y=0.5 should lie.
You can use simple linear interpolation if (x, y) pairs are very close.
Otherwise, you can look at how a sigmoid behaves near those pairs and approximate it there.
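A rough sketch of that recipe (find_crossing_x is a hypothetical helper; it assumes y is already strictly decreasing after the cleanup in step 1):
import numpy as np
def find_crossing_x(x, y, target=0.5):
    # locate the crossing of a strictly decreasing y with `target`
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.searchsorted needs ascending data, so search the reversed y
    i = np.searchsorted(y[::-1], target)
    hi = len(y) - i      # first index with y < target
    lo = hi - 1          # last index with y >= target
    # linear interpolation between the two bracketing points
    return x[lo] + (target - y[lo]) * (x[hi] - x[lo]) / (y[hi] - y[lo])
# usage (after step 1's cleanup): find_crossing_x(my_x_clean, my_y_clean)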
You might not need to fit any functions to your data. Simply find the following two elements:
The smallest x for which y < 50%
The largest x for which y > 50%
Then use interpolation to find x*. Below is the code:
import numpy as np
my_x = np.array([4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66])
my_y=np.array([0.99470977497817203, 0.99434995886145172, 0.98974611323163653, 0.961630837657524, 0.99327633558441175, 0.99338952769251909, 0.99428263292577534, 0.98690514212711611, 0.99111667721533181, 0.99149418924880861, 0.99133773062680464, 0.99143506380003499, 0.99151080464011454, 0.99268261743308517, 0.99289757252812316, 0.99100207861144063, 0.99157171773324027, 0.99112571824824358, 0.99031608691035722, 0.98978104266076905, 0.989782674787969, 0.98897835092187614, 0.98517540405423909, 0.98308943666187076, 0.96081810781994603, 0.85563541881892147, 0.61570811548079107, 0.33076276040577052, 0.14655134838124245, 0.076853147122142126, 0.035831324928136087, 0.021344669212790181])
tempInd1 = my_y<.5 # This will only work if the values are monotonic
x1 = my_x[tempInd1][0]
y1 = my_y[tempInd1][0]
x2 = my_x[~tempInd1][-1]
y2 = my_y[~tempInd1][-1]
np.interp(0.5, [y1, y2], [x1, x2])  # scipy.interp was just an alias for np.interp and has been removed from SciPy
