One Hot Encoding a single column - python-3.x

I am trying to use one hot encoder on the target column('Species') in the Iris dataset.
But I am getting the following errors:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
I googled the issue and found that most scikit-learn estimators need a 2D array rather than a 1D array.
At the same time, I also found that you can pass the column's index within the dataframe to encode a single column, but it didn't work:
onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
X = dataset.iloc[:,1:5].values
y = dataset.iloc[:, 5].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder= LabelEncoder()
y = labelencoder.fit_transform(y)
onehotencoder = OneHotEncoder(categorical_features=[0])
y = onehotencoder.fit_transform(y)
I am trying to encode a single categorical column and split into multiple columns (the way the encoding usually works)

ValueError: Expected 2D array, got 1D array instead: Reshape your
data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
This says the encoder needs a 2D array (a single column), not a 1D array. You can reshape it like this:
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np
# load iris dataset
>>> iris = datasets.load_iris()
>>> iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
>>> y = iris.target.values
>>> onehotencoder = OneHotEncoder(categories='auto')
>>> y = onehotencoder.fit_transform(y.reshape(-1,1))
# y will be a sparse matrix of type <class 'numpy.float64'>
# if you want it to be a dense array you need to call toarray()
>>> print(y.toarray())
[[1. 0. 0.]
 [1. 0. 0.]
 ...
 [0. 0. 1.]
 [0. 0. 1.]]
You can also use the pandas get_dummies function (docs):
>>> pd.get_dummies(iris.target).head()
0.0 1.0 2.0
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
Hope that helps!

For your case, since it looks like you are using the Kaggle dataset, I would just use
import pandas as pd
pd.get_dummies(df.Species).head()
Out[158]:
Iris-setosa Iris-versicolor Iris-virginica
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
Note that the default here encodes all the classes (3 species). It is common to encode just two and compare differences in means against the baseline group (e.g. the default in R, or typical practice in regression/ANOVA), which can be accomplished with the drop_first argument.
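For example, dropping the (alphabetically) first category leaves two indicator columns, with Iris-setosa as the implicit baseline, so the first five (all setosa) rows are all zeros:
pd.get_dummies(df.Species, drop_first=True).head()
   Iris-versicolor  Iris-virginica
0                0               0
1                0               0
2                0               0
3                0               0
4                0               0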

I came across a similar situation and found the method below to work:
use two square brackets around the column name in the fit or fit_transform call, so the encoder receives a 2D dataframe rather than a 1D series.
one_hot_enc = OneHotEncoder()
arr = one_hot_enc.fit_transform(data[['column']]).toarray()
df = pd.DataFrame(arr)
fit_transform returns a sparse matrix by default, so call toarray() before building the dataframe. You may append this to the original dataframe or assign it directly to existing columns.
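A minimal sketch of naming the encoded columns as well, assuming a recent scikit-learn (get_feature_names_out was added in 1.0; data and 'column' are the placeholders from above):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

one_hot_enc = OneHotEncoder()
arr = one_hot_enc.fit_transform(data[['column']]).toarray()
# columns are named 'column_<category>' for each distinct value
df_enc = pd.DataFrame(arr, columns=one_hot_enc.get_feature_names_out(['column']))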

Related

Python 3.6 adjacency Matrix: How to obtain it in a better way

The problem starts with a classical csv file. An example can be:
date;origing;destiny;minutes;distance
19-02-2020;A;B;36;4
20-02-2020;A;B;33;4
24-02-2020;B;A;37;4
25-02-2020;A;C;20;7
27-02-2020;C;B;20;3
28-02-2020;A;B;37.2;4
28-02-2020;A;Z;44;10
My first idea consists in solving it in a classical programming way:
a loop plus counter variables, then representing the counter variables in a matrix like:
A B C Z
A 0 3 1 1
B 1 0 0 0
C 0 1 0 0
Z 0 0 0 0
My first question is whether there is a better, more automatic way of implementing this in Python instead of a classical programming algorithm based on loops and counters.
And what about obtaining more complex adjacency matrices, like one whose values are, for example, an average of the times?
There are packages like networkx, but you could use pandas' groupby.
I don't think pandas with groupby is the fastest; networkx would probably be faster, but groupby is at least better than a loop (my guess).
import pandas as pd
import numpy as np
M = pd.read_csv('../sample_data.csv', sep=';')
M['constant'] = 1
print(M)
date origing destiny minutes distance constant
0 19-02-2020 A B 36.0 4 1
1 20-02-2020 A B 33.0 4 1
2 24-02-2020 B A 37.0 4 1
3 25-02-2020 A C 20.0 7 1
4 27-02-2020 C B 20.0 3 1
5 28-02-2020 A B 37.2 4 1
6 28-02-2020 A Z 44.0 10 1
With groupby we can get the counts:
counts = M.groupby(['origing','destiny']).count()[['constant']]
counts
                 constant
origing destiny
A       B               3
        C               1
        Z               1
B       A               1
C       B               1
And store those values in a zero matrix
def key_map(key):
    a, b = key
    return (ord(a) - ord('A'), ord(b) - ord('A'))
will get the indices, like
counts['constant'].keys().map(key_map).values
and we set those indices to any values. I use the counts here, but you can use the same groupby to aggregate sums, averages, or anything else from the other columns;
indici = np.array( [tuple(x) for x in counts['constant'].keys().map(key_map).values] )
indici = tuple(zip(*indici))
and store with
Z = np.zeros((26,26))
Z[ indici ] = counts['constant']
I print only the first few entries with
print(Z[:3,:3])
[[0. 3. 1.]
[1. 0. 0.]
[0. 1. 0.]]
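The same pattern works for other aggregates, as mentioned above. A sketch for the average of minutes per origin/destination pair, reusing key_map from above (variable names are just illustrative):
# average minutes per (origing, destiny) pair
means = M.groupby(['origing','destiny'])['minutes'].mean()
# map the letter pairs to matrix indices and fill a zero matrix
indici = np.array([key_map(k) for k in means.index])
indici = tuple(zip(*indici))
A = np.zeros((26,26))
A[indici] = means.values
print(A[:3,:3])
# [[ 0.  35.4 20. ]
#  [37.   0.   0. ]
#  [ 0.  20.   0. ]]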

How to change one column of data to multiple column based on row using Python Pandas?

I don't know if I put the question correctly.
For example, I want
1
0
1
0
1
0
1
0
change into
1 0 1
0 1 0
1 0 x
The first list should not be changed, and I want to change the type to DataFrame.
I tried numpy.array: flatten the array and reshape it to columns using reshape(-1,3).T, but since there are some missing values, I cannot reshape the array properly.
A possible solution would be to add the missing values to the array before resizing.
Starting point:
import numpy as np
import pandas as pd
# I assume you flattened the array.
data = np.array([1, 0, 1, 0, 1, 0, 1, 0, ])
Adding the new data based on the required shape and fill value:
new_shape = (3, 3)
fill_value = np.nan
missing_length = np.prod(new_shape) - data.size
missing_array = np.full(missing_length, fill_value)
data = np.hstack([data, missing_array])
Then apply the reshape and convert it to a dataframe:
data = data.reshape(new_shape)
df = pd.DataFrame(data)
print(df)
output:
0 1 2
0 1.0 0.0 1.0
1 0.0 1.0 0.0
2 1.0 0.0 NaN
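The padding step can also be written more compactly with np.pad (a sketch; the array is cast to float first so it can hold NaN):
padded = np.pad(data.astype(float), (0, missing_length), constant_values=np.nan)
df = pd.DataFrame(padded.reshape(new_shape))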

What's the best way of converting a numeric array in a text file to a numpy array?

So I'm trying to create an array from a text file, the text file is laid out as follows. The numbers in the first two columns both go to 165:
0 0 1.0 0.0
1 0 0.0 0.0
1 1 0.0 0.0
2 0 -9.0933087157900000E-5 0.0000000000000000E+00
2 1 -2.7220323615900000E-09 -7.5751829208300000E-10
2 2 3.4709851601400000E-5 1.6729490538300000E-08
3 0 -3.2035914003000000E-06 0.0000000000000000E+00
3 1 2.6327440121800000E-05 5.4643630898200000E-06
3 2 1.4188179329400000E-05 4.8920365004800000E-06
3 3 1.2286058944700000E-05 -1.7854480816400000E-06
4 0 3.1973095717200000E-06 0.0000000000000000E+00
4 1 -5.9966018301500000E-06 1.6619345194700000E-06
4 2 -7.0818069269700000E-06 -6.7836271726900000E-06
4 3 -1.3622983381300000E-06 -1.3443472287100000E-05
4 4 -6.0257787358300000E-06 3.9396371953800000E-06
I'm trying to write a function where an array is made using the numbers in the 3rd columns, taking their positions in the array from the first two columns, and the empty cells are 0s. For example:
1 0 0 0
0 0 0 0
-9.09330871579000e-05 -2.72203236159000e-09 3.47098516014000e-05 0
-3.20359140030000e-06 2.63274401218000e-05 1.41881793294000e-05 1.22860589447000e-05
At the same time, I'm also trying to make a second array but using the numbers from the 4th column not the 3rd. The code that I've written so far is as follows and this is the array produced, I'm not even sure where the 4.41278e-08 comes from:
import numpy as np
def createarray(filepath,maxdegree):
    Cnm = np.zeros((maxdegree+1,maxdegree+1))
    Snm = np.zeros((maxdegree+1,maxdegree+1))
    fid = np.genfromtxt(filepath)
    for row in fid:
        for n in range(0,maxdegree):
            for m in range(0,maxdegree):
                Cnm[n+1,m+1]=row[2]
                Snm[n+1,m+1]=row[3]
    return [Cnm, Snm]
0 0 0 0
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
I'm not getting any errors but I'm also not getting the right array. Can anyone shed some light on what I'm doing wrong?
Your data appear to be in a COO sparse matrix format already. This means that you could write your own function, but you could also capitalize on the work already done in the scipy.sparse package.
For example this code creates a function that would generate one of your matrices at a time. You could modify it to return both matrices.
import numpy as np
from scipy import sparse
def createarray(filepath, maxdegree, value_column):
    """Create single array from file"""
    # load sparse data into numpy array
    data = np.loadtxt(filepath)
    # the first two columns are the row/column indices; cast them to int
    rows = data[:,0].astype(int)
    cols = data[:,1].astype(int)
    # use coo_matrix to create the sparse matrix where the
    # values are found in the value_column column of data
    M = sparse.coo_matrix((data[:,value_column], (rows, cols)), shape=(maxdegree+1, maxdegree+1))
    # if you need a numpy array call toarray() otherwise you
    # can return M which is sparse and more memory efficient
    return M.toarray()
Then for the first matrix you wanted to create you would set value_column to 2, and for the second you would set value_column to 3.
# first matrix
Cnm = createarray(filepath, maxdegree, 2)
# second matrix
Snm = createarray(filepath, maxdegree, 3)
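If you want both matrices from a single pass over the file, a sketch of one possible modification along the same lines:
def createarrays(filepath, maxdegree):
    """Create both arrays from one file read"""
    data = np.loadtxt(filepath)
    rows = data[:,0].astype(int)
    cols = data[:,1].astype(int)
    shape = (maxdegree+1, maxdegree+1)
    # column 2 holds the Cnm values, column 3 the Snm values
    Cnm = sparse.coo_matrix((data[:,2], (rows, cols)), shape=shape).toarray()
    Snm = sparse.coo_matrix((data[:,3], (rows, cols)), shape=shape).toarray()
    return Cnm, Snm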

Is there any way to reverse categorical value to original string or text value?

Below is the data frame I have:
cc temp
0 US 37.0
1 CA 12.0
2 US 35.0
3 AU 20.0
Now I convert it into category codes using
df['cc'] = df.cc.astype('category').cat.codes
I'm getting this as output
cc temp
0 2 37.0
1 1 12.0
2 2 35.0
3 0 20.0
My requirement is: how can I reverse this back to the original values? Any idea?
You could use LabelEncoder from sklearn.preprocessing, which offers a similar functionality to what you've done.
Here's how to do it with your dataframe:
# Assuming you've created the dataframe already
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Transforming categorical variable to label encoded form
df['cc'] = le.fit_transform(df['cc'])
# Converting back from label encoded form to labels
df['cc'] = le.inverse_transform(df['cc'])
You can read about label encoder and scikit-learn's implementation at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html.
Might also help to read about other forms of encoding categorical variables such as one hot encoding and target encoding, and which to use where.
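If you'd rather stay in pandas, a minimal sketch: keep the categories around before overwriting the column, then rebuild the original values from the codes (variable names are illustrative):
cat = df['cc'].astype('category')
codes = cat.cat.codes              # integer codes, e.g. 2, 1, 2, 0
categories = cat.cat.categories    # original labels, e.g. ['AU', 'CA', 'US']
# rebuild the original values from the codes
restored = pd.Categorical.from_codes(codes, categories)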

Sparse vector to dataframe in pyspark

I have a sparse vector in pyspark which looks like this
SparseVector(5,{1:5,2:3,3:5,4:3,5:2})
How can I convert it to a pandas dataframe with two columns which looks like this
ID VALUE
1 5
2 3
3 5
4 3
5 2
I tried sparsevector.zipWithIndex() but it did not work
Your example array is malformed: you've specified 5 elements, so there cannot be an index 5. After you fix that issue, you can simply call toArray(), which will return a numpy.ndarray. Just pass that into the constructor for a pandas.DataFrame.
from pyspark.mllib.linalg import SparseVector # code works the same
#from pyspark.ml.linalg import SparseVector # code works the same
import pandas as pd
a = SparseVector(5,{0:5,1:3,2:5,3:3,4:2}) # note the index starts at 0
df = pd.DataFrame(a.toArray())
print(df)
# 0
#0 5.0
#1 3.0
#2 5.0
#3 3.0
#4 2.0
The code works the same whether you're working with pyspark.mllib.linalg.SparseVector or pyspark.ml.linalg.SparseVector.
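To get the exact two-column ID/VALUE layout from the question, one further sketch (the 1-based ID column is an assumption based on the desired output):
arr = a.toArray()
df = pd.DataFrame({'ID': range(1, len(arr) + 1), 'VALUE': arr})
print(df)
#    ID  VALUE
# 0   1    5.0
# 1   2    3.0
# 2   3    5.0
# 3   4    3.0
# 4   5    2.0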
