I am trying to use the sklearn MinMaxScaler to rescale a pandas DataFrame column like below:
scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But got the following errors:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any idea what was wrong?
The input to MinMaxScaler needs to be array-like, with shape [n_samples, n_features]. So you can apply it on the column as a dataframe rather than a series (using double square brackets instead of single):
y = scaler.fit(df[['total_amount']])
Though from your description, it sounds like you want fit_transform rather than just fit (but I could be wrong):
y = scaler.fit_transform(df[['total_amount']])
A little more explanation:
If your dataframe had 100 rows, consider the difference in shape when you transform a column to an array:
>>> np.array(df[['total_amount']]).shape
(100, 1)
>>> np.array(df['total_amount']).shape
(100,)
The first returns a shape that matches [n_samples, n_features] (as required by MinMaxScaler), whereas the second does not.
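If you would rather keep the series, the reshape that the error message suggests works just as well; a minimal sketch of that equivalent fix:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# turn the (n_samples,) series into an (n_samples, 1) array, as the error message suggests
y = scaler.fit_transform(df['total_amount'].values.reshape(-1, 1))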
Alternatively, try it this way, scaling the whole DataFrame at once:
import pandas as pd
from sklearn import preprocessing
x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
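Note that this scales every column and the rebuilt DataFrame loses the original column names and index. If you want to keep them, replacing the last line with something like this should work:
# reuse the original labels (df on the right-hand side is still the unscaled frame when it is evaluated)
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)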
Related
I am trying to implement simple linear regression in Python using NumPy and Pandas, but I am getting a ValueError: matrices are not aligned when calling the dot function, which essentially performs matrix multiplication, as the documentation says. Following is the code snippet:
import numpy as np
import pandas as pd
#initializing the matrices for X, y and theta
#dataset = pd.read_csv("data1.csv")
dataset = pd.DataFrame([[6.1101,17.592],[5.5277,9.1302],[8.5186,13.662],[7.0032,11.854],[5.8598,6.8233],[8.3829,11.886],[7.4764,4.3483],[8.5781,12]])
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
X.insert(0, "x_zero", np.ones(X.size), True)
print(X)
print(f"\n{y}")
theta = pd.DataFrame([[0],[1]])
temp = pd.DataFrame([[1],[1]])
print(X.shape)
print(theta.shape)
print(X.dot(theta))
And this is the output:
x_zero 0
0 1.0 6.1101
1 1.0 5.5277
2 1.0 8.5186
3 1.0 7.0032
4 1.0 5.8598
5 1.0 8.3829
6 1.0 7.4764
7 1.0 8.5781
0 17.5920
1 9.1302
2 13.6620
3 11.8540
4 6.8233
5 11.8860
6 4.3483
7 12.0000
Name: 1, dtype: float64
(8, 2)
(2, 1)
Traceback (most recent call last):
File "linear.py", line 16, in <module>
print(X.dot(theta))
File "/home/tejas/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1063, in dot
raise ValueError("matrices are not aligned")
ValueError: matrices are not aligned
As you can see from the shape attributes, the inner dimensions match (both are 2), so the dot function should return an 8x1 DataFrame. Then why the error?
This misalignment does not come from the shapes; it comes from the pandas indexes. You have two options to fix the problem:
Tweak theta assignment:
theta = pd.DataFrame([[0],[1]], index=X.columns)
So the indexes you multiply will match.
Make the indexes irrelevant by converting the second DataFrame to numpy:
X.dot(theta.to_numpy())
This behaviour is actually useful in pandas - it tries to align the indexes intelligently; your case is just the specific one where that becomes counterproductive ;)
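For instance, with a tiny made-up frame of the same shape as yours (the values here are purely illustrative):
import pandas as pd

X = pd.DataFrame({"x_zero": [1.0, 1.0], "x_one": [6.1101, 5.5277]})
theta = pd.DataFrame([[0], [1]])      # integer index 0, 1 does not match X's column labels
# X.dot(theta)                        # would raise: matrices are not aligned

theta_aligned = pd.DataFrame([[0], [1]], index=X.columns)
print(X.dot(theta_aligned))           # option 1: make the labels match
print(X.dot(theta.to_numpy()))        # option 2: drop the labels by going through numpy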
Hi, I am practicing ML models and facing an issue while trying to predict on unseen data.
The error comes up while one-hot encoding the categorical data.
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_x_1 = LabelEncoder() #will encode country
X[:,1] = labelencoder_x_1.fit_transform(X[:,1])
labelencoder_x_2 = LabelEncoder() #will encode Gender
X[:,2] = labelencoder_x_2.fit_transform(X[:,2])
onehotencoder_x = OneHotEncoder(categorical_features=[1])
X= onehotencoder_x.fit_transform(X).toarray()
X = X[:,1:]
My X has 11 columns, and columns 2 and 3 are categorical (Country and Gender).
The model runs fine, but when I try to test it against a random input it fails at the one-hot encoding step.
input = [[619], ['France'], ['Male'], [42], [2], [0.0], [1], [1], [1],[101348.88]]
input[1] = labelencoder_x_1.fit_transform(input[1])
input[2] = labelencoder_x_2.fit_transform(input[2])
input= onehotencoder_x.fit_transform(input).toarray()
Error:
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:451:
DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20
and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
File "<ipython-input-44-44a43edf17aa>", line 1, in <module>
input= onehotencoder_x.fit_transform(input).toarray()
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 624, in
fit_transform
self._handle_deprecations(X)
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 453, in
_handle_deprecations
n_features = X.shape[1]
AttributeError: 'list' object has no attribute 'shape'
I believe this is because you have nested lists.
You should flatten your input list and use that for the prediction.
input[1] = labelencoder_x_1.fit_transform(input[1])
input[2] = labelencoder_x_2.fit_transform(input[2])
input = [item for sublist in input for item in sublist]
input= onehotencoder_x.fit_transform(input).toarray()
If you have a nested list, each element of the list is treated as a separate item that has to go through fit_transform, but since each one is a single element, it does not match the shape that fit_transform expects, which is [1, 10] (1 row, 10 columns).
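Separately, the deprecation warning in your traceback already points at the longer-term fix: build the encoding with a ColumnTransformer, fit it once on the training data, and then reuse it on new rows instead of calling fit_transform again on the test input. A minimal sketch, assuming X_train is the raw (un-encoded) feature matrix with Country in column 1 and Gender in column 2, and sample is one new row with the same values as your input list:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), [1, 2])],  # one-hot the categorical columns
    remainder="passthrough",                                                 # keep the numeric columns as-is
)
X_train_enc = ct.fit_transform(X_train)  # fit once, on the training data only

# at prediction time, reuse the already-fitted transformer on a 2-D input (one row):
sample = np.array([[619, "France", "Male", 42, 2, 0.0, 1, 1, 1, 101348.88]], dtype=object)
sample_enc = ct.transform(sample)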
Copying and pasting this code into the python3 REPL works, but when I run it as a script, I get a TypeError.
"""Softmax."""
scores = [3.0, 1.0, 0.2]
import numpy as np
from math import e
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
results = []
x = np.transpose(x)
for j in range(len(x)):
exps = [np.exp(s) for s in x[j]]
_sum = np.sum(np.exp(x[j]))
softmax = [i / _sum for i in exps]
results.append(softmax)
final = np.vstack(results)
return np.transpose(final)
# pass # TODO: Compute and return softmax(x)
print(softmax(scores))
# Plot softmax curves
import matplotlib.pyplot as plt
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
plt.plot(x, softmax(scores).T, linewidth=2)
plt.show()
The error I get running the script via CLI is the following:
bash$ python3 softmax.py
Traceback (most recent call last):
File "softmax.py", line 22, in <module>
print(softmax(scores))
File "softmax.py", line 13, in softmax
exps = [np.exp(s) for s in x[j]]
TypeError: 'numpy.float64' object is not iterable
This kind of thing makes me so nervous about running interpreted code in production with libraries like these; seriously, unreliable and undefined behaviour is totally unacceptable IMO.
At the top of your script, you define
scores = [3.0, 1.0, 0.2]
This is the argument in your first call of softmax(scores). When converted to a numpy array, scores is a 1-d array with shape (3,).
You pass scores into the function, and then it is converted to a numpy array by the call
x = np.transpose(x)
However, it is still 1-d, with shape (3,). The transpose function swaps dimensions, but it does not add a dimension to a 1-d array. In effect, transpose is a "no-op" when applied to a 1-d array.
Then, in the loop that follows, x[j] is a scalar of type numpy.float64, so it does not make sense to write [np.exp(s) for s in x[j]]. x[j] is a scalar, not a sequence, so you can't iterate over it.
In the bottom part of your script, you redefine scores as
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
Now scores is a 2-d array (scores.shape is (3, 80)), so you don't get an error when you call softmax(scores).
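If you want softmax to handle both cases, one option (a sketch, not necessarily the intended course solution) is to drop the loop and let numpy normalise along axis 0:
import numpy as np

def softmax(x):
    """Column-wise softmax that works for 1-d score lists and 2-d score arrays."""
    x = np.asarray(x, dtype=float)
    e_x = np.exp(x - np.max(x, axis=0))  # subtract the column max for numerical stability
    return e_x / np.sum(e_x, axis=0)

print(softmax([3.0, 1.0, 0.2]))          # the 1-d input now works as well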
I am trying to run the following code
fs = 1000
data = np.loadtxt("trainingdataset.txt", delimiter=",")
data1 = data[:,2]
data2 = data1.astype(int)
X,Y = data2['521']
but it gives me the following error:
Traceback (most recent call last):
File "C:\Users\hadeer.elziaat\Desktop\testspec.py", line 58, in <module>
X,Y = data2['521']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
My dataset:
1,4,6,10
2,100,125,10
3,100,7216,254
4,100,527,263
5,100,954,13
6,100,954,23
You're using the string '521' rather than the number 521 for indexing. Try X,Y = data2[521] instead.
If you are only given the string, you could cast it to an int first: X,Y = data2[int('521')], but this might result in some errors and/or unexpected behaviour.
Next problem: you are expecting two variables, one for X and one for Y, yet the data2[521] selection would only give you a single value (the number in the 3rd column, 522nd row).
You say you want all the data in the 3rd column.
I assume you also want some kind of x-axis, since you are attempting to do X, Y = .... How about using the first column for that? Then your code would be:
import numpy as np
data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
x = data[:, 0]
y = data[:, 2]
What remains unclear from your question is why you tried to index your data with the string '521' - which failed because you cannot use strings as indices on plain arrays.
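As a quick sanity check with the six sample rows from your question (assuming they are saved as trainingdataset.txt):
import numpy as np

data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
x = data[:, 0]
y = data[:, 2]
print(x)  # [1 2 3 4 5 6]
print(y)  # [   6  125 7216  527  954  954]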
Using the following function, I am trying to generate an index from the data:
Function:
import numpy as np
from sklearn.decomposition import PCA
def pca_index(data, components=1, indx=1):
    corrs = np.asarray(data.cov())
    pca = PCA(n_components=components).fit(corrs)
    trns = pca.transform(data)
    index = np.dot(trns[0:indx], pca.explained_variance_ratio_[0:indx])
    return index
Index generation from the principal components:
index = pca_index(data=mydata,components=3,indx=2)
The following error is generated when I call the function:
Traceback (most recent call last):
File "<ipython-input-411-35115ef28e61>", line 1, in <module>
index = pca_index(data=mydata,components=3,indx=2)
File "<ipython-input-410-49c0174a047a>", line 15, in pca_index
index=np.dot(trns[0:indx],pca.explained_variance_ratio_[0:indx])
ValueError: shapes (2,3) and (2,) not aligned: 3 (dim 1) != 2 (dim 0)
Can anyone help with the error?
According to my understanding, something goes wrong at the following point when I pass the subscript indices as a variable (indx):
trns[0:indx], pca.explained_variance_ratio_[0:indx]
In np.dot you are trying to multiply a matrix with dimensions (2,3) by a matrix with dimensions (2,), i.e. a vector.
However, you can only multiply NxM by MxP, e.g. (3,2) by (2,1) or (2,3) by (3,1).
In your example the second matrix has dimensions (2,), which, in numpy terms, is similar but not the same as (2,1). You can reshape a vector into a matrix with vector.reshape([2,1]).
You might also transpose your first matrix, thus converting its dimensions from (2,3) to (3,2).
However, make sure that you multiply the appropriate matrices, as the result will differ from what you might expect.
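As a concrete sketch with dummy arrays of the same shapes as in the traceback ((2, 3) and (2,)):
import numpy as np

trns = np.arange(6.0).reshape(2, 3)           # stands in for trns[0:indx], shape (2, 3)
ratios = np.array([0.7, 0.2])                 # stands in for explained_variance_ratio_[0:indx], shape (2,)

# np.dot(trns, ratios)                        # fails: 3 (dim 1) != 2 (dim 0)
print(np.dot(trns.T, ratios))                 # transpose the first matrix: (3, 2) dot (2,)   -> shape (3,)
print(np.dot(trns.T, ratios.reshape(2, 1)))   # or use a column vector:     (3, 2) dot (2, 1) -> shape (3, 1)
Which of these matches the index you actually want depends on what pca_index is supposed to compute.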