How to correctly inverse_transform TFIDF vectorizer - scikit-learn

I am trying to oversample data with imblearn using the code below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import RandomOverSampler

def oversample(df):
    description = df['DESCRIPTION']
    labels = df['LABEL']
    vec = TfidfVectorizer(
        norm='l2',
        lowercase=True,
        strip_accents=None,
        encoding='utf-8',
        preprocessor=None,
        token_pattern=r"(?u)\S\S+")
    desc = vec.fit_transform(description)
    encoder = LabelEncoder()
    encoder.fit(labels)
    labels = encoder.transform(labels)
    over = RandomOverSampler(random_state=0)
    X, y = over.fit_resample(desc, labels)
    oversampled_descriptions = vec.inverse_transform(X)
    label = encoder.inverse_transform(y)
Yet I'm having an issue with text ordering: after I inverse_transform the data, I get the text back in the wrong order.
How can I maintain the same order?

You can't.
inverse_transform() does not reconstruct the original document; it only returns the n-grams that each document contained and that were extracted during the fit. The only information it can use is the information stored in the vocabulary_ attribute.
You can, however, resample the row indices of description alongside the labels and then use those indices to build oversampled_descriptions.
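A minimal sketch of that idea, assuming description is a pandas Series as in the code above: resample an array of row indices instead of the TF-IDF matrix, then index back into the original texts.

import numpy as np

# Resample row indices together with the labels
indices = np.arange(len(description)).reshape(-1, 1)
over = RandomOverSampler(random_state=0)
resampled_indices, y = over.fit_resample(indices, labels)

# Recover the original, untokenized texts in the resampled order
oversampled_descriptions = description.iloc[resampled_indices.ravel()]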

Related

after using an Embedding layer in Keras why do I get: Input to reshape is a tensor with 2 values, but the requested shape has 4 [Op:Reshape]

So I created the sample polars data frame below. I want to use Keras's Normalization and Embedding layers to preprocess my data. sum_cost and sum_gmv are my numerical columns, and I normalize each individual column with a Normalization layer. category is my categorical column, and I want to use an Embedding layer to get embedding vectors for each category.
import polars as pl
import tensorflow as tf

df = pl.DataFrame(
    {'sum_cost': [1., 4., 7., 3., 2.],
     'category': [311, 210, 450, 311, 567],
     'sum_gmv': [-4., -2., 0., 2., 4.],
     }
)

numeric_col = ['sum_cost', 'sum_gmv']
categorical_col = ['category']
all_inputs = []
inputs = {}
for col in numeric_col + categorical_col:
    if col in numeric_col:
        inputs[col] = tf.keras.Input(shape=(), name=col, dtype=tf.float32)
        normalizer = tf.keras.layers.Normalization(axis=None)
        normalizer.adapt(df[col].to_numpy())
        all_inputs.append(normalizer(inputs[col])[:, tf.newaxis])
    elif col in categorical_col:
        inputs[col] = tf.keras.Input(shape=(), name=col, dtype=tf.int32)
        embedding = tf.keras.layers.Embedding(
            567 + 1,
            output_dim=2,
            name='cat_embedding')(inputs[col])
        em_model = tf.keras.layers.Reshape((2,))(embedding)
        all_inputs.append(em_model)
outputs = tf.keras.layers.Concatenate(axis=1)(all_inputs)
model = tf.keras.Model([inputs[col] for col in numeric_col + categorical_col], outputs)
When I test my preprocessing model on a single data point, model(dict(df.to_pandas().iloc[1,:])), I receive the Reshape error quoted in the title.
On the other hand, when I pass this input:
model({'sum_cost': 1., 'sum_gmv': 1, 'category': np.array([[1.]])})
it works well. I don't understand why I should provide an array for category but a scalar for the numerical columns; in the original dataset they are all scalars. Also, I don't define a shape for my Input tensors. Why is this happening and how can I solve it?
Thanks!

ValueError: Number of features of the model must match the input. Model n_features is 464 and input n_features is 2

I am trying to predict the cost of perfumes, but I get an error on the line "answer = (clf.predict(result))":
cursor.execute('SELECT * FROM info')
info = cursor.fetchall()
for line in info:
    z.append(line[0:2])
    y.append(line[2])
enc.fit(z)
x = enc.transform(z).toarray()
result = []
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
new = input('enter the Model, Volume of your perfume to see the cost: example = Midnighto,50 ').split(',')
enc.fit([new])
result = enc.transform([new]).toarray()
answer = (clf.predict(result))
print(answer)
You don't have to fit enc again on your new input; only transform it with the encoder already fitted on your X data. (By refitting, you one-hot encode the new sample considering only the single value it contains for each feature, whereas you have to consider all the possible categories of each feature present in your X data.) So delete the following line:
enc.fit([new])
After that, please check that x and result have the same number of features; you can use the shape attribute for this.
Furthermore, I recommend using separate training and test data to see whether your model is overfitting. Then you can apply your own predictions.
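A minimal sketch of the corrected prediction step, reusing the enc already fitted on z above:

new = input('enter the Model, Volume of your perfume to see the cost: example = Midnighto,50 ').split(',')
result = enc.transform([new]).toarray()  # transform only; no second fit
answer = clf.predict(result)  # result now has the same number of features as x
print(answer)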

for-loop issue (how to use i-parameter on the obj name created within the for-loop)

I'd like to create a few random forest models within a for loop, in which I vary the number of estimators, train each of them on the same data sample, and measure the accuracy of each.
This is my beginning code:
r = range(0, 100)
for i in r:
    RF_model_%i = RandomForestClassifier(criterion="entropy", n_estimators=i, oob_score=True)
    RF_model_%i.fit(X_train, y_train)
    y_predict = RF_model_%i.predict(X_test)
    accuracy_%i = accuracy_score(y_test, y_predict)
What I'd like to understand is:
how can I put the i parameter into the name of each model (in order to recognise them)?
after training and calculating the accuracy score of each of the i models, how can I put the results in a list within the for loop?
You can't build variable names dynamically with %i, but you can do something like this, storing each model in a dict keyed by i:
results = []  # init
models = {}   # models keyed by their number of estimators
r = range(1, 100)  # n_estimators must be at least 1
for i in r:
    model = RandomForestClassifier(criterion="entropy", n_estimators=i, oob_score=True)
    model.id = i  # dynamically add fields to objects
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    models[i] = model  # the key i plays the role of the %i suffix
    results.append(accuracy_score(y_test, y_predict))  # put the result in a list within the for loop
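You can then, for example, pick out the best-performing model afterwards (a small usage sketch; best_i is an illustrative name):

best_i = max(models, key=lambda i: results[i - 1])  # results is 0-indexed, i starts at 1
print(best_i, results[best_i - 1])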

How to correctly encode labels with tensorflow's one-hot encoding?

I've been trying to learn Tensorflow with Python 3.6 and decided on building a facial recognition program using data from the University of Essex's face database (http://cswww.essex.ac.uk/mv/allfaces/index.html). So far I've been following Tensorflow's MNIST Expert guide, but when I start testing, my accuracy is 0 for every epoch, so I know something is wrong. I feel most shaky about how I'm handling the labels, so I figure that's where the problem is.
The labels in the dataset are either numeric IDs, like 987323, or someone's name, like "fordj". My idea for dealing with this was to create a "pre-encoding" encode_labels function, which gives each unique label in the test and training sets its own unique integer value. I checked to make sure each unique label in the test and train sets has the same unique value. It also returns a dictionary so that I can easily map back to the original label from the encoded version. If I don't do this step and pass the labels as I retrieve them (i.e. "fordj"), I get an error saying
UnimplementedError (see above for traceback): Cast string to int32 is not supported
[[Node: Cast = Cast[DstT=DT_INT32, SrcT=DT_STRING, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]
The way I'm interpreting this is that since many of the labels are people's names, tensorflow can't convert a label like "fordj" to a tf.int32. The code to grab labels and paths is here:
def get_paths_and_labels(path):
    """ image_paths : list of relative image paths
        labels : mix of alphanumeric characters """
    image_paths = [path + image for image in os.listdir(path)]
    labels = [i.split(".")[-3] for i in image_paths]
    labels = [i.split("/")[-1] for i in labels]
    return image_paths, labels

def encode_labels(train_labels, test_labels):
    """ Assigns a numeric value to each label since some are subject's names """
    found_labels = []
    index = 0
    mapping = {}
    for i in train_labels:
        if i in found_labels:
            continue
        mapping[i] = index
        index += 1
        found_labels.append(i)
    return [mapping[i] for i in train_labels], [mapping[i] for i in test_labels], mapping
Here is how I assign my training and testing labels. I then want to use tensorflow's one-hot encoder to encode them again for me.
def main():
    # Grabs the labels and each image's relative path
    train_image_paths, train_labels = get_paths_and_labels(TRAIN_PATH)
    # Smallish dataset so I can read it all into memory
    train_images = [cv2.imread(image) for image in train_image_paths]
    test_image_paths, test_labels = get_paths_and_labels(TEST_PATH)
    test_images = [cv2.imread(image) for image in test_image_paths]
    num_classes = len(set(train_labels))

    # Placeholders
    x = tf.placeholder(tf.float32, shape=[None, IMAGE_SIZE[0] * IMAGE_SIZE[1]])
    y_ = tf.placeholder(tf.float32, shape=[None, num_classes])
    x_image = tf.reshape(x, [-1, IMAGE_SIZE[0], IMAGE_SIZE[1], 1])

    # One-hot labels
    train_labels, test_labels, mapping = encode_labels(train_labels, test_labels)
    train_labels = tf.one_hot(indices=tf.cast(train_labels, tf.int32), depth=num_classes)
    test_labels = tf.one_hot(indices=tf.cast(test_labels, tf.int32), depth=num_classes)
I'm sure I'm doing something wrong. I know sklearn has a LabelEncoder, though I haven't tried it out yet. Thanks for any advice on this, all help is appreciated!
The way I'm interpreting this is that since many of the labels are people's names, tensorflow can't convert a label like "fordj" to a tf.int32.
You're right, Tensorflow can't do that. Instead, you can create a mapping from each name to a unique (and progressive) ID; once you have done that, you can correctly encode every numeric ID with its one-hot representation.
Your encode_labels function already builds this relation between the string labels and the numeric IDs, and it already returns the numeric IDs, hence you can do something like:
numeric_train_ids, numeric_test_ids, mapping = encode_labels(train_labels, test_labels)
one_hot_train_labels = tf.one_hot(indices=numeric_train_ids, depth=num_classes)
one_hot_test_labels = tf.one_hot(indices=numeric_test_ids, depth=num_classes)
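To map a predicted numeric ID back to the original string label, you can invert the mapping dict (an illustrative snippet, not part of the original code):

inverse_mapping = {v: k for k, v in mapping.items()}
print(inverse_mapping[numeric_train_ids[0]])  # recovers a label such as "fordj"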

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on an email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided it into train and test sets, and then calculated the accuracy, using the features that are available in sklearn's Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict labels for new emails: whether they are spam or not.
For example, say I have an email. I want to feed it to my classifier and get a prediction as to whether it is spam or not. How can I achieve this? Please help.
Code for the classifier file:
#!/usr/bin/python
import sys
from time import time
import logging

# Display progress logs on stdout
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess

### features_train and features_test are the features for the training and
### testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))

## Printing Metrics for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))

## Calculating Classifier Performance
from sklearn.metrics import classification_report

y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names=target_names, labels=labels))

# How to predict the label of a new text?
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy

numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))

### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand.
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))

### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)  #.toarray()

## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced

def preprocess():
    return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys

# sys.path.append("../tools/")

"""
Starter code to process the texts of the accurate and inaccurate categories to
extract the features and get the documents ready for classification.
The list of all the texts from the accurate category is in the accurate_files
list; likewise, the texts of the inaccurate category are in inaccurate_files.
The data is stored in lists and packed away in pickle files at the end.
"""

accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []

### temp_counter is a way to speed up the development -- there are thousands of
### lines of accurate and inaccurate text, so running over all of them can take
### a long time
### temp_counter helps you only look at the first 200 lines in the list so you
### can iterate your modifications quicker
temp_counter = 0

for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
    for path in from_text:
        ### only look at the first 200 texts when developing
        ### once everything is working, remove this line to run over the full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            print(path)
            text = open(path, "r")
            line = text.readline()
            while line:
                ### use a function parseOutText to extract the text from the opened file
                # stem_text = parseOutText(text)
                stem_text = text.readline().strip()
                print(stem_text)
                ### use str.replace() to remove any instances of the words
                # stem_text = stem_text.replace("germani", "")
                ### append the text to feature_data
                feature_data.append(stem_text)
                ### append a 0 to label_data if the text is from Sara, and a 1 if it is from Chris
                if (name == "accurate"):
                    label_data.append("0")
                elif (name == "inaccurate"):
                    label_data.append("1")
                line = text.readline()
            text.close()

print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also, I want to know whether I can incrementally train the classifier, meaning retrain a created model with newer data to refine it over time.
I would be really glad if someone could help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set; this is what pred = clf.predict(features_test) does. If you want to see these labels, do print(pred).
But perhaps you want to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data (a sketch follows the list below):
1) The first thing you need to do is generate features for your new email data. The feature generation step is not included in your code above, but it will need to occur.
2) You are using a TF-IDF vectorizer, which converts a collection of documents to a matrix of TF-IDF features based on term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print(pred) if you want to view the new label(s).
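A minimal sketch of steps 2-4, assuming vectorizer, selector, and clf are the fitted objects from the code above and that the new email text needs no extra preprocessing (step 1):

new_text = "You won a lottery at UK lottery commission. Reply to claim it"
new_features = vectorizer.transform([new_text])  # step 2: reuse the fitted vectorizer
new_features = selector.transform(new_features).toarray()  # step 3: reuse the fitted selector
pred = clf.predict(new_features)  # step 4: predict the label ('0' or '1')
print(pred)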
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.
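As a side note on incremental training: GaussianNB also exposes partial_fit, which updates an already-fitted model with new batches, so as long as you keep the fitted vectorizer and selector fixed (the feature space must not change), you can fold new labelled emails into the model without retraining from scratch. A rough sketch, where new_texts and new_labels are hypothetical batches of incoming data:

new_batch = selector.transform(vectorizer.transform(new_texts)).toarray()
clf.partial_fit(new_batch, new_labels)  # updates the fitted model in place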
