Scikit code explanation needed - scikit-learn

Machine learning/Python noob here. Can someone explain the code below to me? I don't understand how this line works:
# This line in the code below, what does it do?
label_encoder.append(preprocessing.LabelEncoder())
label_encoder = []
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    if item.isdigit():
        X_encoded[:, i] = X[:, i]
    else:
        label_encoder.append(preprocessing.LabelEncoder())
        X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])
Thanks!

label_encoder is a list, which in Python is an ordered collection that you can use to store any kind of object. It is named incorrectly; it should be label_encoders, plural.
We first create an empty one:
label_encoders = []
Then when we encounter the need to encode a column
if item.isdigit():
    # Don't need to encode.
else:
    # Do need to encode.
we create a new preprocessing.LabelEncoder() object and save it for later use
label_encoders.append(preprocessing.LabelEncoder())
Finally, we use the most recently created LabelEncoder object to actually encode the column
X_encoded[:, i] = label_encoders[-1].fit_transform(X[:, i])
We need to store the new LabelEncoder objects somewhere, since we almost certainly will encounter a test set or new production data in the future, and will need to encode that data in the same way we encoded our training data.
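For example, here is a minimal sketch of how those stored encoders might be reused later (assuming the test data has the same columns, with the same columns being non-numeric):
X_test_encoded = np.empty(X_test.shape)
encoder_idx = 0
for i, item in enumerate(X_test[0]):
    if item.isdigit():
        X_test_encoded[:, i] = X_test[:, i]
    else:
        # Use transform, not fit_transform, to reuse the training-time mapping.
        X_test_encoded[:, i] = label_encoders[encoder_idx].transform(X_test[:, i])
        encoder_idx += 1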
I probably would have written the code like this, which is slightly clearer
label_encoders = []
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    if item.isdigit():
        X_encoded[:, i] = X[:, i]
    else:
        label_encoder = preprocessing.LabelEncoder()
        X_encoded[:, i] = label_encoder.fit_transform(X[:, i])
        label_encoders.append(label_encoder)
Thank you! I didn't realize that preprocessing.LabelEncoder() returned a list.
It doesn't! The list comes from the line
label_encoders = []
The preprocessing.LabelEncoder() call returns a LabelEncoder object. This implements the sklearn transformer interface, which lets you use the fit_transform and transform methods to encode your features.
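If it helps, here is a tiny standalone example of what a LabelEncoder does (the category values are made up):
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
print(le.fit_transform(['red', 'green', 'red', 'blue']))  # [2 1 2 0]
print(le.classes_)                     # ['blue' 'green' 'red'], sorted, mapped to 0..N-1
print(le.transform(['blue', 'red']))   # [0 2], reusing the learned mapping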

Related

How to correctly inverse_transform TFIDF vectorizer

I am trying to oversample data using imblearn with the code below:
def oversample(df):
    description = df['DESCRIPTION']
    labels = df['LABEL']
    vec = TfidfVectorizer(
        norm='l2',
        lowercase=True,
        strip_accents=None,
        encoding='utf-8',
        preprocessor=None,
        token_pattern=r"(?u)\S\S+")
    desc = vec.fit_transform(description)
    encoder = LabelEncoder()
    encoder.fit(labels)
    labels = encoder.transform(labels)
    over = RandomOverSampler(random_state=0)
    X, y = over.fit_resample(desc, labels)
    oversampled_descriptions = vec.inverse_transform(X)
    label = encoder.inverse_transform(y)
Yet I am having an issue with text ordering: after I inverse_transform the data, the text comes back in the wrong order.
How can I maintain the same order?
You can't.
inverse_transform() does not reconstruct the document. It only returns the n-grams that each document had and that were extracted during the fit. The only information it can use is what was stored in the vocabulary_ attribute.
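A tiny illustration (with made-up documents) of what does survive:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
X = vec.fit_transform(["the cat sat", "sat the mat"])
print(vec.inverse_transform(X))
# e.g. [array(['cat', 'sat', 'the'], ...), array(['mat', 'sat', 'the'], ...)]
# Only which terms occurred in each row survives; word order within a
# document is lost, because terms come back in vocabulary order.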
You can add the indices of description to desc before the resample and then use them to assign oversampled_descriptions.
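A minimal sketch of that idea, reusing the names from the question's oversample function (and assuming description is a pandas Series):
import numpy as np

indices = np.arange(len(description)).reshape(-1, 1)
over = RandomOverSampler(random_state=0)
resampled_idx, y = over.fit_resample(indices, labels)
# Look the original texts back up, so each row of y still matches its text.
oversampled_descriptions = description.iloc[resampled_idx.ravel()]
label = encoder.inverse_transform(y)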

ValueError: not enough values to unpack

I am trying to learn (on Python3) how to do sentiment analysis for NLP and I am using the "UMICH SI650 - Sentiment Classification" Database available on Kaggle: https://www.kaggle.com/c/si650winter11
At the moment I am trying to generate a vocabulary with some loops, here is the code:
import collections
import nltk
import os
Directory = "../Databases"
# Read training data and generate vocabulary
max_length = 0
freqs = collections.Counter()
num_recs = 0
training = open(os.path.join(Directory, "train_sentiment.txt"), 'rb')
for line in training:
    if not line:
        continue
    label, sentence = line.strip().split("\t".encode())
    words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
    if len(words) > max_length:
        max_length = len(words)
    for word in words:
        freqs[word] += 1
    num_recs += 1
training.close()
I keep getting this error, that I don't fully understand:
in label, sentence = line.strip().split("\t".encode())
ValueError: not enough values to unpack (expected 2, got 1)
I tried to add
if not line:
    continue
as suggested here: ValueError : not enough values to unpack. why?
But it didn't work for my case. How can I solve this error?
Thanks a lot in advance,
Here's a cleaner way to read the dataset from https://www.kaggle.com/c/si650winter11
Firstly, the context manager is your friend; use it: http://book.pythontips.com/en/latest/context_managers.html
Secondly, if it's a text file, avoid reading it as binary, i.e. open(filename, 'r'), not open(filename, 'rb'); then there's no need to mess with str/bytes and encode/decode.
And now:
from nltk import word_tokenize
from collections import Counter

word_counts = Counter()
with open('training.txt', 'r') as fin:
    for line in fin:
        label, text = line.strip().split('\t')
        # Avoid lowercasing before tokenization;
        # lowercasing after tokenization is much better,
        # just in case the tokenizer uses capitalization as cues.
        word_counts.update(map(str.lower, word_tokenize(text)))
print(word_counts)
The easiest way to resolve this would be to put the unpacking statement into a try/except block. Something like:
try:
    label, sentence = line.strip().split("\t".encode())
except ValueError:
    print(f'Error line: {line}')
    continue
My guess is that some of your lines have a label with nothing but whitespace afterwards.
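You can see why in a quick test (illustrative bytes, not your actual data):
line = b"positive   \n"           # a label followed by spaces instead of a tab
print(line.strip().split(b"\t"))  # [b'positive'] -- only one value to unpack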
You should check for the case where you have the wrong number of fields:
if not line:
    continue
fields = line.strip().split("\t".encode())
if len(fields) != 2:
    # you could print(fields) here to help debug
    continue
label, sentence = fields

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, when I try to run the same code on the Iris.csv dataset, my implementation fails and gives a KeyError.
The only difference between the 2 datasets is that in Breast-Cancer-Wisconsin.csv there are only 2 classes ('2' for benign and '4' for malignant), both of which are integers, whereas in Iris.csv there are 3 classes ('setosa', 'versicolor', 'virginica'), all of which are strings.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)

test_size = 0.2
train_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
test_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
Also, how do I handle if the classes are in string type instead of integer?
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
REFERENCES
Iris.csv
Breast-Cancer-Wisconsin.csv
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', 'virginica' (something like Iris-setosa perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discussed before, you just need to avoid hard-coding the names of the classes in your code.
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
    classes_seen = {}
    for i in data:
        _class = i[-1]   # the class label is the last field of each row
        if _class not in classes_seen:
            classes_seen[_class] = len(classes_seen)
    return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
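For example (hypothetical rows whose last field is the class label):
rows = [[5.1, 3.5, 'setosa'], [7.0, 3.2, 'versicolor'], [4.9, 3.0, 'setosa']]
print(get_class_values(rows))  # {'setosa': 0, 'versicolor': 1}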
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
# read the csv file
df = pd.read_csv('iris.csv')
# clean the data file
df.replace('?', -99999, inplace=True)
# convert the string classes into integer types;
# integers are assigned from 0 to N-1.
# 'species' is the name of the column which has the class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], 1, inplace=True)
# convert the data frame to a list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Post Debugging
It turns out that we need not use the above piece of code at all, i.e. I can get the answer without explicitly converting the string labels into integer labels (using the above code).
I have posted the original code after some minor changes (below) and the key error is now fixed. Also, I am now getting an accuracy of 97% to 100% (on the Iris dataset only).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work!! Simple!
However, please note that these numbers have to be given as integers, not strings (otherwise you would still get a key error!).
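Following the earlier advice about not hard-coding classes, here is a small sketch that builds those buckets from the data itself, so it works for any number of classes:
# Collect the distinct class codes that actually occur in the data.
class_codes = sorted({int(row[-1]) for row in full_data})
train_set = {code: [] for code in class_codes}
test_set = {code: [] for code in class_codes}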
Wrap-Up
There are some commented lines in the original code which I thought would be good to explain, in case somebody runs into issues. Here's that snippet with the lines uncommented (compare with the original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, the kernel threw an error because a string cannot be converted to float!
So one way to go about it is to not convert the data into floating-point values, and then you won't get this error. However, in many cases you need to convert all the data into floating point (e.g. for normalisation, accuracy computation, long mathematical calculations, prevention of loss of precision, etc.).
Hence after heavy debugging and going through a lot of articles I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], 1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)

test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy : ', (correct/total)*100, '%')
Hope this helps!

How to correctly encode labels with tensorflow's one-hot encoding?

I've been trying to learn Tensorflow with python 3.6 and decided on building a facial recognition program using data from the University of Essex's face data base (http://cswww.essex.ac.uk/mv/allfaces/index.html). So far I've been following Tensorflow's MNIST Expert guide, but when I start testing, my accuracy is 0 for every epoch, so I know something is wrong. I feel most shaky on how I'm handling the labels, so I figure that's where the problem is.
The labels in the dataset are either numeric IDs, like 987323, or someone's name, like "fordj". My idea to deal with this was to create a "pre-encoding" encode_labels function, which gives each unique label in the test and training sets their own unique integer value. I checked to make sure each unique label in the test and train sets have the same unique value. It also returns a dictionary so that I can easily map back to the original label from the encoded version. If I don't do this step and pass the labels as I retrieve them (i.e "fordj"), I get an error saying
UnimplementedError (see above for traceback): Cast string to int32 is not supported
[[Node: Cast = Cast[DstT=DT_INT32, SrcT=DT_STRING, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]
The way I'm interpreting this is that since many of the labels are people's names, tensorflow can't convert a label like "fordj" to a tf.int32. The code to grab labels and paths is here:
def get_paths_and_labels(path):
    """ image_paths : list of relative image paths
        labels : mix of alphanumeric characters """
    image_paths = [path + image for image in os.listdir(path)]
    labels = [i.split(".")[-3] for i in image_paths]
    labels = [i.split("/")[-1] for i in labels]
    return image_paths, labels

def encode_labels(train_labels, test_labels):
    """ Assigns a numeric value to each label since some are subject's names """
    found_labels = []
    index = 0
    mapping = {}
    for i in train_labels:
        if i in found_labels:
            continue
        mapping[i] = index
        index += 1
        found_labels.append(i)
    return [mapping[i] for i in train_labels], [mapping[i] for i in test_labels], mapping
Here is how I assign my training and testing labels. I then want to use tensorflow's one-hot encoder to encode them again for me.
def main():
    # Grabs the labels and each image's relative path
    train_image_paths, train_labels = get_paths_and_labels(TRAIN_PATH)
    # Smallish dataset so I can read it all into memory
    train_images = [cv2.imread(image) for image in train_image_paths]
    test_image_paths, test_labels = get_paths_and_labels(TEST_PATH)
    test_images = [cv2.imread(image) for image in test_image_paths]
    num_classes = len(set(train_labels))

    # Placeholders
    x = tf.placeholder(tf.float32, shape=[None, IMAGE_SIZE[0] * IMAGE_SIZE[1]])
    y_ = tf.placeholder(tf.float32, shape=[None, num_classes])
    x_image = tf.reshape(x, [-1, IMAGE_SIZE[0], IMAGE_SIZE[1], 1])

    # One-hot labels
    train_labels, test_labels, mapping = encode_labels(train_labels, test_labels)
    train_labels = tf.one_hot(indices=tf.cast(train_labels, tf.int32), depth=num_classes)
    test_labels = tf.one_hot(indices=tf.cast(test_labels, tf.int32), depth=num_classes)
I'm sure I'm doing something wrong. I know sklearn has a LabelEncoder, though I haven't tried it out yet. Thanks for any advice on this, all help is appreciated!
The way I'm interpreting this is that since many of the labels are people's names, tensorflow can't convert a label like "fordj" to a tf.int32.
You're right; TensorFlow can't do that. Instead, you can create a mapping from a name to a unique (and progressive) ID. Once you have done that, you can correctly one-hot encode every numeric ID.
You already have the relation between the numeric IDs and the string labels: encode_labels returns the numeric IDs directly (plus the mapping back to the original strings), hence you can do something like:
numeric_train_ids, numeric_test_ids, mapping = encode_labels(train_labels, test_labels)
one_hot_train_labels = tf.one_hot(indices=numeric_train_ids, depth=num_classes)
one_hot_test_labels = tf.one_hot(indices=numeric_test_ids, depth=num_classes)
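For intuition, here is a minimal sketch of what tf.one_hot produces (TensorFlow 1.x session API, made-up IDs):
import tensorflow as tf

one_hot = tf.one_hot(indices=[0, 2, 1], depth=3)
with tf.Session() as sess:
    print(sess.run(one_hot))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]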

Tensorflow reset or clear collection

In TensorFlow, I found the API tf.add_to_collection for adding some value to a collection, as in the code below.
def accuracy_rate(logits, labels):
    correct = tf.nn.in_top_k(logits, labels, 1)
    # Return the accuracy of true entries.
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    return accuracy

with tf.Session() as sess:
    logits, labels = ...
    accuracy = accuracy_rate(logits, labels)
    tf.add_to_collection('total_accuracy', sess.run(accuracy))
What I can't find in the API is that, how can I clear all values that I've already stored in one collection?
You can use tf.get_collection_ref to get a mutable reference to the collection, which you can then clear (it's just a Python list).
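For the 'total_accuracy' collection from the question, a minimal sketch of that would be:
# Get a mutable reference to the named collection and empty it in place.
acc_values = tf.get_collection_ref('total_accuracy')
del acc_values[:]  # the collection is now []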
I think this might be what you are looking for?
In [2]: import tensorflow as tf
In [3]: w = tf.Variable([[1,2,3], [4,5,6], [7,8,9], [3,1,5], [4,1,7]], collections=[tf.GraphKeys.WEIGHTS, tf.GraphKeys.GLOBAL_VARIABLES], dtype=tf.float32)
In [4]: params = tf.get_collection_ref(tf.GraphKeys.WEIGHTS)
In [5]: del params[:]
In [6]: tf.get_collection_ref(tf.GraphKeys.WEIGHTS)
Out[6]: []
In [10]: params = tf.get_collection_ref(tf.GraphKeys.GLOBAL_VARIABLES)
In [11]: params
Out[11]: [<tf.Variable 'Variable:0' shape=(5, 3) dtype=float32_ref>]
I found an alternative solution: use a different tf.Graph().
