Value error while generating indexes using PCA in scikit-learn - python-3.x

Using the following function, I am trying to generate an index from the data:
Function:
import numpy as np
from sklearn.decomposition import PCA

def pca_index(data, components=1, indx=1):
    corrs = np.asarray(data.cov())
    pca = PCA(n_components=components).fit(corrs)
    trns = pca.transform(data)
    index = np.dot(trns[0:indx], pca.explained_variance_ratio_[0:indx])
    return index
Index generation from the principal components:
index = pca_index(data=mydata,components=3,indx=2)
The following error is generated when I call the function:
Traceback (most recent call last):
File "<ipython-input-411-35115ef28e61>", line 1, in <module>
index = pca_index(data=mydata,components=3,indx=2)
File "<ipython-input-410-49c0174a047a>", line 15, in pca_index
index=np.dot(trns[0:indx],pca.explained_variance_ratio_[0:indx])
ValueError: shapes (2,3) and (2,) not aligned: 3 (dim 1) != 2 (dim 0)
Can anyone help with the error?
According to my understanding, the error occurs at the following point, where I pass the subscript index as a variable (indx):
trns[0:indx], pca.explained_variance_ratio_[0:indx]

In np.dot you are trying to multiply a matrix of shape (2, 3) by an array of shape (2,), i.e. a vector.
However, you can only multiply an NxM matrix by an MxP one, e.g. (3, 2) by (2, 1) or (2, 3) by (3, 1).
In your example the second operand has shape (2,), which in numpy terms is similar to, but not the same as, (2, 1). You can reshape a vector into a matrix with vector.reshape([2, 1]).
You might also transpose your first matrix, converting its shape from (2, 3) to (3, 2).
However, make sure that you multiply the appropriate matrices, as the result may differ from what you expect.
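In this function specifically, trns has shape (n_samples, components), so trns[0:indx] slices the first indx rows (samples) rather than the first indx components. Below is a minimal sketch of the likely intended fix, assuming the goal is a per-sample weighted sum of the first indx component scores, and assuming PCA should be fitted on the data itself rather than on its covariance matrix:

import numpy as np
from sklearn.decomposition import PCA

def pca_index(data, components=1, indx=1):
    # fit on the data directly; fitting on the covariance matrix and then
    # transforming the raw data mixes two different spaces
    pca = PCA(n_components=components).fit(data)
    trns = pca.transform(data)  # shape (n_samples, components)
    # slice columns (components), not rows (samples), before the dot product:
    # (n_samples, indx) @ (indx,) -> (n_samples,)
    return trns[:, :indx] @ pca.explained_variance_ratio_[:indx]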

Related

Stacking up dataframes in a 3-dimensional numpy array

I have several pandas dataframes that I would like to stack up into a three-dimensional numpy array. I could do the job manually using the following code:
arr = np.array([df1.values, df2.values], dtype="object")
However, since I have many dataframes, I can neither write this line for all the dataframes nor automate it.
I tried to use the append function (np.append(df1.values, df2['1002'].values)) but it flattens the dataframes and ignores their structure. What I want is a three-dimensional numpy array where the first dimension is the number of dataframes (that I have), the second one is the number of rows in each dataframe, and the third one is the number of columns. In the first example that I mentioned earlier, I do not get a three-dimensional numpy array: when I run arr.shape the result is (2,), and when I run arr[0].shape and arr[1].shape, I get (26, 7) and (24, 7), respectively, which are the shapes of the corresponding dataframes.
I even ran np.append(df1.values, df2['1002'].values, axis=0) but I received ValueError: all the input array dimensions for the concatenation axis must match exactly. Is there any way that I can fix this problem and stack up all my dataframes in a 3-dimensional numpy array?
Looks like you start with 2 frames with 7 columns, but different numbers of rows. The equivalent of:
In [1]: arr1 = np.ones((26,7)); arr2 = np.zeros((24,7))
In [2]: arr = np.array([arr1, arr2], object)
In [3]: arr.shape
Out[3]: (2,)
In [4]: arr[0].shape
Out[4]: (26, 7)
You probably tried this without the object dtype and got a 'ragged array' warning. In any case, this is not a 3d array. It is a 1d array of shape (2,) holding two arrays. It's roughly the same as the list
[arr1, arr2]
The np.append docs should make it clear that it flattens the arguments when you don't specify an axis.
In [6]: np.append(arr1,arr2).shape
Out[6]: (350,)
You could specify an axis, and get a 2d array, where the 50 is the sum of 26 and 24.
In [7]: np.append(arr1,arr2,axis=0).shape
Out[7]: (50, 7)
This is the same as:
In [8]: np.concatenate((arr1,arr2), axis=0).shape
Out[8]: (50, 7)
np.append is a poorly named cover for np.concatenate. It is not a clone of list append. Learn to use concatenate and its stack derivatives instead.
With different dataframe shapes, you cannot make a 3d array. Arrays cannot be 'ragged'.
As for working with more than 2 dataframes, if you can make a list of all the frames, you can use the initial syntax.
alist = []
for a in frame_list:
    alist.append(a.values)
arr = np.array(alist, object)
But making such an array doesn't gain you much.
If the frames are all the same size, then you can make a 3d array:
In [10]: np.array([arr1[:10,:],arr2[:10,:]]).shape
Out[10]: (2, 10, 7)
In [11]: np.stack([arr1[:10,:],arr2[:10,:]]).shape
Out[11]: (2, 10, 7)
But if they differ, stack will complain about that:
In [12]: np.stack([arr1, arr2])
Traceback (most recent call last):
File "<ipython-input-12-23d05d0422dc>", line 1, in <module>
np.stack([arr1, arr2])
File "<__array_function__ internals>", line 180, in stack
File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 426, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
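If you really do need a single 3d array from frames with different row counts, one common workaround (a sketch, assuming NaN padding is acceptable for your analysis) is to pad every frame to the length of the longest one:

import numpy as np

arr1, arr2 = np.ones((26, 7)), np.zeros((24, 7))  # stand-ins for df1.values, df2.values
frames = [arr1, arr2]
nrows = max(a.shape[0] for a in frames)
padded = np.full((len(frames), nrows, frames[0].shape[1]), np.nan)
for k, a in enumerate(frames):
    padded[k, :a.shape[0], :] = a  # shorter frames keep trailing NaN rows
print(padded.shape)  # (2, 26, 7)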

only one element tensors can be converted to Python scalars error with pca.fit_transform

I am trying to perform dimensionality reduction using PCA, where outputs is a list of tensors, each with shape (1, 3, 32, 32). Here is the code:
from sklearn.decomposition import PCA
pca = PCA(10)
pca_result = pca.fit_transform(output)
But I keep getting this error, regardless of whatever I tried:
ValueError: only one element tensors can be converted to Python scalars
I know that the tensors of size (1, 3, 32, 32) are causing the issue, since the error says it is looking for a single element, but I do not know how to solve it.
I have tried flattening each tensor by looping over outputs (I don't know if it's the right way of solving this issue), using the following code, but it leads to an error in PCA:
new_outputs = []
for i in outputs:
    for j in i:
        j = j.cpu()
        j = j.detach().numpy()
        j = j.flatten()
        new_outputs.append(j)
pca_result = pca.fit_transform(new_outputs)
I would appreciate it if anybody could help with this error, and say whether the flattening approach I took is correct.
PS: I have read the existing posts (post1, post2) discussing this error, but none of them could solve my problem.
Assuming your tensors are stored in a single tensor with a shape like (10, 3, 32, 32), where 10 is the number of tensors, you should flatten each one like this:
import torch
from sklearn.decomposition import PCA
data = torch.rand((10, 3, 32, 32))
pca = PCA(10)
pca_result = pca.fit_transform(data.flatten(start_dim=1))
data.flatten(start_dim=1) reshapes your data to (10, 3*32*32).
The error you posted is actually related to one of the posts you linked: PCA's fit() expects an array-like object of samples, and you provided a list of tensors.
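Since outputs is a list of (1, 3, 32, 32) tensors rather than one stacked tensor, you first need to concatenate them along the batch dimension. A minimal sketch (the detach().cpu() call is an assumption, needed only if the tensors live on the GPU and track gradients):

import torch
from sklearn.decomposition import PCA

outputs = [torch.rand((1, 3, 32, 32)) for _ in range(10)]  # stand-in for your list

# concatenating N tensors of shape (1, 3, 32, 32) along dim 0 gives (N, 3, 32, 32)
stacked = torch.cat(outputs, dim=0).detach().cpu()

pca = PCA(10)
pca_result = pca.fit_transform(stacked.flatten(start_dim=1))  # (N, 3072) in
print(pca_result.shape)  # (N, 10)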

numpy code works in REPL, script says type error

Copying and pasting this code into the python3 REPL works, but when I run it as a script, I get a type error.
"""Softmax."""
scores = [3.0, 1.0, 0.2]
import numpy as np
from math import e
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
results = []
x = np.transpose(x)
for j in range(len(x)):
exps = [np.exp(s) for s in x[j]]
_sum = np.sum(np.exp(x[j]))
softmax = [i / _sum for i in exps]
results.append(softmax)
final = np.vstack(results)
return np.transpose(final)
# pass # TODO: Compute and return softmax(x)
print(softmax(scores))
# Plot softmax curves
import matplotlib.pyplot as plt
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
plt.plot(x, softmax(scores).T, linewidth=2)
plt.show()
The error I get running the script via CLI is the following:
bash$ python3 softmax.py
Traceback (most recent call last):
File "softmax.py", line 22, in <module>
print(softmax(scores))
File "softmax.py", line 13, in softmax
exps = [np.exp(s) for s in x[j]]
TypeError: 'numpy.float64' object is not iterable
This kind of crap makes me so nervous about running interpreted code in production with libraries like these; seriously, unreliable and undefined behaviour is totally unacceptable IMO.
At the top of your script, you define
scores = [3.0, 1.0, 0.2]
This is the argument in your first call of softmax(scores). When converted to a numpy array, scores is a 1-d array with shape (3,).
You pass scores into the function, and then it is converted to a numpy array by the call
x = np.transpose(x)
However, it is still 1-d, with shape (3,). The transpose function swaps dimensions, but it does not add a dimension to a 1-d array; in effect, transpose is a no-op when applied to a 1-d array.
Then, in the loop that follows, x[j] is a scalar of type numpy.float64, not a sequence, so [np.exp(s) for s in x[j]] fails: you cannot iterate over a scalar.
In the bottom part of your script, you redefine scores as
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
Now scores is a 2-d array (scores.shape is (3, 80)), so you don't get an error when you call softmax(scores).
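To make the function accept both forms, you can compute the softmax in a vectorized way over axis 0. A minimal sketch, assuming the intent is softmax down each column (which matches what the original loop computes for 2-d input):

import numpy as np

def softmax(x):
    """Softmax over axis 0; accepts a 1-d score list or a 2-d stack of score rows."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max(axis=0))  # subtract the max for numerical stability
    return e / e.sum(axis=0)

print(softmax([3.0, 1.0, 0.2]))          # 1-d input now works
print(softmax(np.ones((3, 5))).shape)    # (3, 5), softmax down each column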

Creating mathematical equations using numpy in python

I want to create equations using numpy array multiplication, i.e., I want to keep all the variables in one array and their coefficients in another array, and multiply the two to produce an expression that I can use in GEKKO's m.Equation() method. I tried the code below but failed; please let me know how I can achieve my goal.
By "it failed" I mean that it gave an error and did not let me use x*y==1 as an equation in GEKKO's m.Equation() method. My goal is to keep the variables in one array and their coefficients in another, and multiply them to get mathematical equations to use as input to m.Equation().
import numpy as np
from gekko import GEKKO
X = np.array([x,y,z])
y = np.array([4,5,6])
m = GEKKO(remote=False)
m.Equation(x*y==1)
# I wanted to get a result like 4x+5y+6z=1
The error I get is below:
Traceback (most recent call last):
File "C:\Users\kk\AppData\Local\Programs\Python\Python37\MY WORK FILES\numpy practise.py", line 5, in <module>
X = np.array([x,y,z])
NameError: name 'x' is not defined
You need to define variables and make the coefficients into a Gekko object. You can use an array to make the variables and a parameter for the coefficients:
from gekko import GEKKO
m = GEKKO(remote=False)
X = m.Array(m.Var, 3)
y = m.Param([4, 5, 6])
eq = m.Equation(X.dot(y) == 1)
print(eq.value)
Output:
((((v1)*(4))+((v2)*(5)))+((v3)*(6)))=1
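To actually solve for values, the model still needs enough constraints or an objective, since one equation in three unknowns has infinitely many solutions. A sketch under that assumption (m.Minimize requires a reasonably recent GEKKO version; the objective here is only there to pin down a unique solution):

from gekko import GEKKO

m = GEKKO(remote=False)
X = m.Array(m.Var, 3)
coeffs = [4, 5, 6]
m.Equation(sum(c * v for c, v in zip(coeffs, X)) == 1)  # 4x + 5y + 6z = 1
m.Minimize(sum(v**2 for v in X))  # pick the minimum-norm solution
m.solve(disp=False)
print([v.value[0] for v in X])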

Apply MinMaxScaler() on a pandas column

I am trying to use the sklearn MinMaxScaler to rescale a pandas column as below:
scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But got the following errors:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any idea what was wrong?
The input to MinMaxScaler needs to be array-like, with shape [n_samples, n_features]. So you can apply it on the column as a dataframe rather than a series (using double square brackets instead of single):
y = scaler.fit(df[['total_amount']])
Though from your description, it sounds like you want fit_transform rather than just fit (but I could be wrong):
y = scaler.fit_transform(df[['total_amount']])
A little more explanation:
If your dataframe had 100 rows, consider the difference in shape when you transform a column to an array:
>>> np.array(df[['total_amount']]).shape
(100, 1)
>>> np.array(df['total_amount']).shape
(100,)
The first returns a shape that matches [n_samples, n_features] (as required by MinMaxScaler), whereas the second does not.
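If you later need to map scaled values back to the original units (for example after modeling), the fitted scaler can invert the transform:

y = scaler.fit_transform(df[['total_amount']])
original = scaler.inverse_transform(y)  # recovers the unscaled column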
Alternatively, try it this way:
import pandas as pd
from sklearn import preprocessing
x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
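One caveat with the snippet above: pd.DataFrame(x_scaled) drops the original column labels and index. If you want to keep them (a small tweak, assuming the original df is still in scope), build the frame like this:

df_scaled = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)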
