How to reverse one hot encoded value to Label? - keras

I am working on a simple dataset to detect rock or mine, with class names 'R' and 'M'. I have one-hot encoded 'R' to 1 and 'M' to 0. Now I want to reverse it.
I have tried many ways but couldn't find an approach to convert 1 back to 'R' and 0 back to 'M'.
import numpy as np
import pandas as pd
import keras
from sklearn.preprocessing import LabelEncoder
df=pd.read_csv('D:\\Datasets\\node-fussy-examples-master\\node-fussy-examples-master\\sonar\\training.csv')
ds=df.values
x_train=df[df.columns[0:60]].values
y_train=df[df.columns[60]]
encoder = LabelEncoder()
encoder.fit(y_train)
encoded_Y = encoder.transform(y_train)
I expect 1 to be R and 0 to be M

You can use the inverse_transform method:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
print(le.transform([1, 1, 2, 6]))          # [0 0 1 2]
print(le.inverse_transform([0, 0, 1, 2]))  # [1 1 2 6]
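Applied to the code in the question (a sketch, assuming the encoder fitted above), the same method decodes your labels; LabelEncoder stores classes in sorted order, so 'M' maps to 0 and 'R' to 1, just as you expect:
decoded = encoder.inverse_transform(encoded_Y)  # array of 'M'/'R' labels
print(decoded[:5])                              # e.g. ['R' 'M' ...], depending on your data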
If you need to do the same thing in TensorFlow, look at this thread.

I just came across a use case today where I needed to convert a one-hot-encoded tensor back to a normal label tensor. I know you can use np.argmax(probs, axis=1) to reverse a one-hot-encoded probability tensor, but that didn't work in my case: my data was not a soft probability tensor but a label tensor filled with either 0 or 1. I know this is not entirely relevant to the OP's question, but I thought someone might need to do something similar, so I will write my solution down here.
def reverse_onehot(onehot_data):
    # onehot_data assumed to be channel last
    data_copy = np.zeros(onehot_data.shape[:-1])
    for c in range(onehot_data.shape[-1]):
        img_c = onehot_data[..., c]
        data_copy[img_c == 1] = c
    return data_copy
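For what it's worth, assuming at most one channel is 1 at each position, a vectorized one-liner gives the same result (all-zero positions also map to 0 in both versions):
labels = np.argmax(onehot_data, axis=-1)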

Let's say y is your one-hot-encoded array. Then the following should give you the labels back:
unique_classes[np.argmax(y, axis=1)]
assuming you used unique_classes for encoding too (order is important).
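A hypothetical round trip, with unique_classes and y standing in for your own arrays:
import numpy as np

unique_classes = np.array(['M', 'R'])
y = np.array([[0, 1], [1, 0], [0, 1]])       # one-hot rows
print(unique_classes[np.argmax(y, axis=1)])  # ['R' 'M' 'R']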


Why does kmeans give exactly the same results every time?

I have re-run k-means 4 times and got exactly the same results each time. From other answers, I learned that:
Every time K-Means initializes the centroids, they are generated randomly.
Could you please explain why the results are exactly the same each time?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality

don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=2 * np.array(plt.rcParams['figure.figsize']))

for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters=4)
        kmeans.fit(don)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c=y_kmeans, cmap='viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)

plt.show()
They are not the same; they are similar. K-means is an algorithm that iteratively moves centroids so that they become better and better at splitting the data. While this process is deterministic, you have to pick initial values for those centroids, and that is usually done at random. A random start doesn't mean the final centroids will be random: they converge to something relatively good, and often similar across runs.
Have a look at your code with this simple modification:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality

don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=2 * np.array(plt.rcParams['figure.figsize']))

cc = []
for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters=4)
        kmeans.fit(don)
        cc.append(kmeans.cluster_centers_)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c=y_kmeans, cmap='viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)

plt.show()
cc
If you have a look at the exact values of those centroids, they will look like this:
[array([[ 4.97975722,  4.93316461],
        [ 5.21715504, -0.18757547],
        [ 0.31141141,  0.06726803],
        [ 0.00747797,  5.00534801]]),
 array([[ 5.21374245, -0.18608103],
        [ 0.00747797,  5.00534801],
        [ 0.30592308,  0.06549162],
        [ 4.97975722,  4.93316461]]),
 array([[ 0.30066361,  0.06804847],
        [ 4.97975722,  4.93316461],
        [ 5.21017831, -0.18735444],
        [ 0.00747797,  5.00534801]]),
 array([[ 5.21374245, -0.18608103],
        [ 4.97975722,  4.93316461],
        [ 0.00747797,  5.00534801],
        [ 0.30592308,  0.06549162]])]
Similar, but different sets of values.
Also:
Have a look at the default arguments of KMeans. There is one called n_init:
Number of time the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
By default it is equal to 10, which means that every time you run k-means it actually runs 10 times and picks the best result. Those best results will be even more similar than the results of a single run of k-means.
I am posting @AEF's comment to remove this question from the unanswered list:
Random initialization does not necessarily mean random results. Easiest example: k-means with k=1 always finds the mean in one step, regardless of where the centroid is initialised.
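A quick sketch illustrating that comment: with a single cluster, the fitted center is simply the data mean, no matter how it was initialized.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(50, 2)                    # any data will do
km = KMeans(n_clusters=1, n_init=1).fit(X)
print(np.allclose(km.cluster_centers_[0], X.mean(axis=0)))  # True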
Whenever randomization is part of a scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn't mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.
The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means). random_state's value may be None, an integer, or a numpy RandomState instance.
For reference: https://scikit-learn.org/stable/glossary.html#term-random_state
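A minimal sketch (with made-up data) of how fixing random_state makes repeated runs bit-for-bit identical:
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)    # hypothetical data
a = KMeans(n_clusters=4, random_state=42).fit(X).cluster_centers_
b = KMeans(n_clusters=4, random_state=42).fit(X).cluster_centers_
print(np.allclose(a, b))                     # True: identical centroids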

Statistical tests: how do perception, actual results, and outlook interact?

What is the interaction between perception, outcome, and outlook?
I've brought them into categorical variables to [potentially] simplify things.
import pandas as pd
import numpy as np

high, size = 100, 20
df = pd.DataFrame({'perception': np.random.randint(0, high, size),
                   'age': np.random.randint(0, high, size),
                   'smokes_cat': pd.Categorical(np.tile(['lots', 'little', 'not'],
                                                        size//3+1)[:size]),
                   'outcome': np.random.randint(0, high, size),
                   'outlook_cat': pd.Categorical(np.tile(['positive', 'neutral', 'negative'],
                                                         size//3+1)[:size])})

df.insert(2, 'age_cat', pd.Categorical(pd.cut(df.age, range(0, high+5, size//2),
                                              right=False,
                                              labels=["{0} - {1}".format(i, i + 9)
                                                      for i in range(0, high, size//2)])))

def tierify(i):
    if i <= 25:
        return 'lowest'
    elif i <= 50:
        return 'low'
    elif i <= 75:
        return 'med'
    return 'high'

df.insert(1, 'perception_cat', df['perception'].map(tierify))
df.insert(6, 'outcome_cat', df['outcome'].map(tierify))

np.random.shuffle(df['smokes_cat'])
Run online: http://ideone.com/fftuSv or https://repl.it/repls/MicroLeftSequences
This is faked data but should present the idea. The individual has a perceived view (perception), is then presented with the actual outcome, and from that can decide their outlook.
Using Python (pandas, or anything open-source really), how do I show the probability, and p-value, of the interaction between these 3 dependent columns (possibly using age and smokes_cat as potential confounders)?
You can use interaction plots for this particular purpose; they fit your case pretty well, so I would use such a plot for your data. I tried it on the dummy data generated in the question, and you can write your code as below. Treat it as pseudocode, though: you must tailor it to your needs.
In its simple form:
If the lines in the plot intersect, or look likely to intersect for other values, you may assume there is an interaction effect.
If the lines are parallel and unlikely to intersect, you may assume there is no interaction effect.
For a deeper understanding, I have also placed some links below that you can check out.
Code
... # The rest of the code in the question.

# Interaction plot
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot

p = interaction_plot(
    x=df['perception'],
    trace=df['outlook_cat'],
    response=df['outcome']
)
plt.savefig('./my_interaction_plot.png')  # or plt.show()
You can find the documentation of interaction_plot() here. Besides, I also suggest you run an ANOVA.
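As a sketch of that ANOVA suggestion (assuming the df built in the question), a two-way ANOVA with an interaction term via statsmodels' formula API; the C(perception_cat):C(outlook_cat) row of the resulting table carries the interaction's p-value:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('outcome ~ C(perception_cat) * C(outlook_cat)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))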
Further reading
You can check out these links:
A paper titled Interaction Effects in ANOVA.
A case study in practice.
Another case study in practice.
One option is a multinomial logit model:
# Create a label-encoded version of the categorical variables
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
all_enc_df = pd.DataFrame({column: enc.fit_transform(df[column])
                           for column in ('perception_cat', 'age_cat',
                                          'smokes_cat', 'outlook_cat')})

# Regression
X, y = (all_enc_df[['age_cat', 'smokes_cat', 'outlook_cat']],
        all_enc_df[['perception_cat']])

# Scikit-learn alternative:
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression(random_state=0, solver='lbfgs',
#                          multi_class='multinomial').fit(X, y)

import statsmodels.api as sm

mullogit = sm.MNLogit(y, X)
mulfit = mullogit.fit(method='bfgs', maxiter=100)
print(mulfit.summary())
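Note that, as written, MNLogit fits without an intercept; statsmodels does not add one automatically, so you may want to pass sm.add_constant(X) instead of X.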
https://repl.it/repls/MicroLeftSequences

how to find the class labels after the one hot encoding in LabelBinarizer sklearn

I am working on the CIFAR dataset for the classification of images, where I used one-hot encoding for the class labels as follows:
lists = ['frog',
         'truck',
         'deer',
         'automobile',
         'bird',
         'horse',
         'ship',
         'cat',
         'dog',
         'airplane']

from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer()
label_binarizer.fit(lists)

def one_hot_encode(x):
    return label_binarizer.transform(x)

# here y_train is the list of training labels
y_train = one_hot_encode(y_train)
print(y_train[0])
# output: [0 0 0 0 0 0 1 0 0 0]
Does that mean 'ship' from the list, or something else? If it is something else, can anyone help me get the class?
As far as I know, the first element in my list of training labels is 'frog', not 'ship'. Thanks.
LabelBinarizer has an inverse_transform method which can be used to get the original value back from the one-hot encoded value.
Check the documentation here
And by the way, the values will be stored in alphabetical order in the LabelBinarizer.
Example:
label_binarizer.inverse_transform([y_train[0]])
Output: 'frog'
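You can confirm the order yourself: the fitted classes are stored in the classes_ attribute, sorted alphabetically, so the 1 at index 6 of [0 0 0 0 0 0 1 0 0 0] is indeed 'frog':
print(label_binarizer.classes_)
# ['airplane' 'automobile' 'bird' 'cat' 'deer' 'dog' 'frog' 'horse' 'ship' 'truck']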
they may not be ordered as you wish. You can build a mapping table (a dict) for further use, like:
mapping = {}
for label in lists:
    # one-hot vector for this label, taken from the fitted binarizer
    mapping[label] = label_binarizer.transform([label])[0]
print("your mapping table:\n", mapping)
print("\nany value is then accessible by label:\n", 'ship', mapping['ship'])

Python Shape function for k-means clustering

I have one greyscale GeoTIFF image which gives me a (4377, 6172) 2D array. In the first part, I take the [:1024, :1024] values (1024 * 1024 = 1048576 values in total) for my compression algorithm. Through this algorithm, I get a total of 4 values in the finalmat list variable. After this, I apply the K-means algorithm to those values. The program is below:
import numpy as np
from osgeo import gdal
from sklearn import cluster
import matplotlib.pyplot as plt

dataset = gdal.Open("1.tif")
band = dataset.GetRasterBand(1)
img = band.ReadAsArray()

finalmat = [255, 0, 2, 2]
# Converting the list to an array for the dimensional change
ay = np.asarray(finalmat).reshape(-1, 1)

fig = plt.figure()
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(ay)
cluster_means = k_means.cluster_centers_.squeeze()
a_clustered = k_means.labels_
print('# of observation :', ay.shape)
print('Cluster Means : ', cluster_means)

a_clustered.shape = img.shape

fig = plt.figure(figsize=(125, 125))
ax = plt.subplot(2, 4, 8)
plt.axis('off')
xlabel = str(1) + ' clusters'
ax.set_title(xlabel)
plt.imshow(a_clustered)
plt.show()
fig.savefig('kmeans-1 clust ndvi08jan2010_guj 12 .png')
In the above program I am getting an error at the line a_clustered.shape = img.shape. The error is below:
Error line:
a_clustered.shape = img.shape
ValueError: cannot reshape array of size 4 into shape (4377,6172)
Actually, I want to visualize the clustering on the original image through the compressed values I am getting. Can you please suggest what to do?
It does not make a lot of sense to use K-means on 1-dimensional data, and it makes even less sense to use it on a 4 x 1 array!
Your error then comes from the fact that you can't just resize a 4 x 1 integer array into a large picture.
Just print the array a_clustered you are trying to plot. It probably contains [0, 1, 1, 1].
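A minimal sketch of why that assignment fails: setting .shape requires the total number of elements to stay the same.
import numpy as np

a = np.array([0, 1, 1, 1])
try:
    a.shape = (4377, 6172)   # 4 elements cannot fill 4377 * 6172 slots
except ValueError as e:
    print(e)                 # cannot reshape array of size 4 into shape (4377,6172)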

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement the K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, when I try to run the same code on the Iris.csv dataset, my implementation fails with a KeyError.
The only difference between the 2 datasets is that in Breast-Cancer-Wisconsin.csv there are only 2 classes ('2' for malignant and '4' for benign), both integer labels, whereas in Iris.csv there are 3 classes ('setosa', 'versicolor', 'virginica'), all of string type.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random

style.use('fivethirtyeight')

dataset = {'k': [[1,2],[2,3],[3,1]], 'r': [[6,5],[7,7],[8,6]]}
new_features = [5,7]

#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)

test_size = 0.2
train_set = {'setosa': [], 'versicolor': [], 'virginica': []}
test_set = {'setosa': [], 'versicolor': [], 'virginica': []}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0

for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how I can modify this algorithm to classify more than 2 or 3 classes in the future?
Also, how do I handle classes that are of string type instead of integer?
One solution I thought of was to convert all string types to integer types; would that work?
REFERENCES
Iris.csv
Breast-Cancer-Wisconsin.csv
Let's start from your last question:
One solution I thought of was to convert all string types to integer types; would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can write a function that reads all the distinct values of the class attribute and assigns a numeric value to each one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor' or 'virginica' (something like 'Iris-setosa', perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how I can modify this algorithm to classify more than 2 or 3 classes in the future?
As discussed before, you just need to avoid hard-coding the names of the classes in your code.
Also, how do I handle classes that are of string type instead of integer?
def get_class_values(data):
    classes_seen = {}
    for row in data:
        _class = row[-1]  # the class label is the last element of each row
        if _class not in classes_seen:
            classes_seen[_class] = len(classes_seen)
    return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
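A hypothetical usage, with made-up rows whose last element is the class label:
rows = [[5.1, 3.5, 'setosa'], [7.0, 3.2, 'versicolor'], [4.9, 3.0, 'setosa']]
mapping = get_class_values(rows)
print(mapping)                         # {'setosa': 0, 'versicolor': 1}
print([mapping[r[-1]] for r in rows])  # [0, 1, 0]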
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos, I came across a very simple yet elegant piece of code that solves the above problem. I hope it helps those who have faced this problem before (beginners especially!).
# read the csv file
df = pd.read_csv('iris.csv')

# clean the data file
df.replace('?', -99999, inplace=True)

# convert the string classes into integer types;
# integers are assigned from 0 to N-1.
# 'species' is the name of the column which has the class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)

# convert the data frame to a list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Post Debugging
It turns out that we need not use the above piece of code at all, i.e. I can get the answer without explicitly converting the string labels into integer labels (using the above code).
I have posted the original code after some minor changes (below), and the KeyError is now fixed. Also, I now get an accuracy of 97% to 100% (on the Iris dataset only).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work! Simple!
However, please note that the numbers have to be given as integers and not strings (otherwise it would lead to a KeyError!).
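Why 0, 1 and 2 are the right keys: pandas assigns cat.codes in sorted category order, as this quick sketch shows. (The lookup train_set[i[-1]] also succeeds with the float labels produced by astype(float), because in Python 0 == 0.0 and they hash identically.)
import pandas as pd

s = pd.Series(['virginica', 'setosa', 'versicolor']).astype('category')
print(dict(zip(s, s.cat.codes)))  # {'virginica': 2, 'setosa': 0, 'versicolor': 1}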
Wrap-Up
There are some commented-out lines in the original code which I thought would be good to explain, in case somebody runs into issues. Here's one snippet with the comments removed (compare with the original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, the kernel threw an error, because a string cannot be converted to float!
So one way to go about it is to not convert the data into floating-point values; then you won't get this error. However, in many cases you do need to convert all the data to floating point (e.g. for normalisation, accuracy computations, long mathematical calculations, or to prevent loss of precision).
Hence, after heavy debugging and going through a lot of articles, I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)

full_data = df.astype(float).values.tolist()
random.shuffle(full_data)

test_size = 0.2
train_set = {0: [], 1: [], 2: []}
test_set = {0: [], 1: [], 2: []}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0

for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy : ', (correct/total)*100, '%')
Hope this helps!
