Keras Custom Standardization - keras

I am trying to apply a custom function to a tensor of dtype string. The function removes HTML tags and replaces email addresses with the string email. This code works perfectly fine.
This part is working
import tensorflow as tf
from tensorflow.keras import preprocessing as pp
from textacy import preprocessing as prep
from tensorflow.keras.layers import TextVectorization
import string
import re
def repl_email(text):
    # .numpy() is only available on eager tensors
    text = prep.replace.emails(text.numpy().decode('UTF-8'), '_EMAIL_')
    return text
def custom_standardization(input_data):
    input_data = tf.map_fn(fn=repl_email, elems=input_data, fn_output_signature=None)
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", "")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )
txt_lyr = tf.constant(['This is my email address some_email@yahoo.com<br /><br />',
                       'This is another email address another_email@yahoo.com<br /><br />'])
j = custom_standardization(txt_lyr)
print(j)
>>>tf.Tensor([b'this is my email address email' b'this is another email address email'], shape=(2,), dtype=string)
Moving on, I try to do this on the IMDB Movie dataset as used in an example here, which produces text_ds (a tensorflow.python.data.ops.dataset_ops.MapDataset). Running the custom function containing the tf.map_fn on it throws an error.
This part is not working
batch_size = 32
max_features = 20000
embedding_dim = 128
sequence_length = 500
raw_train_ds = pp.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)
text_ds = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
I get this error.
AttributeError: 'Tensor' object has no attribute 'numpy'
I understand that if I iterate through text_ds, it generates batches of tensors of size 32, like "i" below, each of which is a tensorflow.python.framework.ops.EagerTensor.
n = 0
for i in text_ds:
    print(n, len(i), i.numpy())
    n += 1
    if n > 3:
        break
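The error message is consistent with the standardization function being traced in graph mode during adapt(), where tensors are symbolic and have no .numpy() method. One possible workaround, sketched below, is to keep the whole function inside TensorFlow string ops. The email pattern here is a simplified assumption (it is not the regex textacy uses), and the tf.function wrapper is only there to show that the function survives graph tracing.

```python
import re
import string

import tensorflow as tf

# Simplified email pattern in RE2 syntax; an assumption for illustration,
# not the pattern textacy uses internally.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def custom_standardization(input_data):
    # Every step is a TensorFlow op, so no .numpy() call is needed.
    replaced = tf.strings.regex_replace(input_data, EMAIL_PATTERN, "_EMAIL_")
    lowercase = tf.strings.lower(replaced)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", "")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )


sample = tf.constant(
    ["This is my email address some_email@yahoo.com<br /><br />"]
)
# Wrapping in tf.function forces graph tracing, as a Dataset.map would.
graph_fn = tf.function(custom_standardization)
result = graph_fn(sample)
print(result)
```

A function written this way can be passed as standardize= to TextVectorization without relying on eager execution.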

Related

Huggingface Transformers NER - Offset Mapping Causing ValueError in NumPy boolean array indexing assignment

I was trying out the NER tutorial Token Classification with W-NUT Emerging Entities (https://huggingface.co/transformers/custom_datasets.html#tok-ner) in google colab using the Annotated Corpus for Named Entity Recognition data on Kaggle (https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv).
I will outline my process in detail to facilitate an understanding of what I was doing and to let the community help me figure out the source of the indexing assignment error.
To load the data from google drive where I have saved it, I used the following code
# import pandas library
import pandas as pd
# columns to select
cols_to_select = ["Sentence #", "Word", "Tag"]
# google drive data path
data_path = '/content/drive/MyDrive/Colab Notebooks/ner/ner_dataset.csv'
# load the data from google colab
dataset = pd.read_csv(data_path, encoding="latin-1")[cols_to_select].fillna(method = 'ffill')
I run the following code to parse the sentences and tags
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    def retrieve(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None
# get full data
getter = SentenceGetter(dataset)
# get sentences
sentences = [[s[0] for s in sent] for sent in getter.sentences]
# get tags/labels
tags = [[s[1] for s in sent] for sent in getter.sentences]
# take a look at the data
print(sentences[0][0:5], tags[0][0:5], sep='\n')
I then split the data into train, val, and test sets
# import the sklearn module
from sklearn.model_selection import train_test_split
# split data in to temp and test sets
temp_texts, test_texts, temp_tags, test_tags = train_test_split(sentences,
                                                                tags,
                                                                test_size=0.20,
                                                                random_state=15)
# split data into train and validation sets
train_texts, val_texts, train_tags, val_tags = train_test_split(temp_texts,
                                                                temp_tags,
                                                                test_size=0.20,
                                                                random_state=15)
After splitting the data, I created encodings for tags and the tokens
unique_tags=dataset.Tag.unique()
# create tags to id
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
# create id to tags
id2tag = {id: tag for tag, id in tag2id.items()}
I then installed the transformer library in colab
# install the transformers library
! pip install transformers
Next I imported the small bert model
# import the transformers module
from transformers import BertTokenizerFast
# import the small bert model
model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
I then created the encodings for the tokens
# create train set encodings
train_encodings = tokenizer(train_texts,
                            is_split_into_words=True,
                            return_offsets_mapping=True,
                            padding=True,
                            max_length=128,
                            truncation=True)
# create validation set encodings
val_encodings = tokenizer(val_texts,
                          is_split_into_words=True,
                          return_offsets_mapping=True,
                          padding=True,
                          max_length=128,
                          truncation=True)
# create test set encodings
test_encodings = tokenizer(test_texts,
                           is_split_into_words=True,
                           return_offsets_mapping=True,
                           padding=True,
                           max_length=128,
                           truncation=True)
The tutorial uses offset mapping to handle the problems that arise with WordPiece tokenization, specifically the mismatch between tokens and labels. It is when running the tutorial's offset-mapping code that I get the error. Below is the offset-mapping function used in the tutorial:
# the offset function
import numpy as np
def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an empty array of -100
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels
# return the encoded labels
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
test_labels = encode_tags(test_tags, test_encodings)
After running the above code, it gives me the following error, and I can't figure out where the source of the error lies. Any help and pointers would be appreciated.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-afdff0186eb3> in <module>()
17
18 # return the encoded labels
---> 19 train_labels = encode_tags(train_tags, train_encodings)
20 val_labels = encode_tags(val_tags, val_encodings)
21 test_labels = encode_tags(test_tags, test_encodings)
<ipython-input-19-afdff0186eb3> in encode_tags(tags, encodings)
11
12 # set labels whose first offset position is 0 and the second is not 0
---> 13 doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
14 encoded_labels.append(doc_enc_labels.tolist())
15
ValueError: NumPy boolean array indexing assignment cannot assign 38 input values to the 37 output values where the mask is true
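The message says the boolean mask selected 37 token positions while the sentence has 38 labels, so at least one word ended up without a "first sub-token" in the encoding. Plausible causes (assumptions, not confirmed by the post) are truncation at max_length=128 or a word the tokenizer reduces to no tokens at all. The NumPy-only sketch below reproduces the mismatch on hand-made data and shows an alignment driven by word indices instead of offsets. The offsets and word_ids lists are hypothetical stand-ins for encodings.offset_mapping[i] and encodings.word_ids(i).

```python
import numpy as np

# Hypothetical sentence with 4 words; word 2 produced no tokens at all.
doc_labels = [0, 1, 2, 3]  # one label id per word
# Token layout: [CLS], word 0, word 1, word 1 continuation, word 3, [SEP]
offsets = [(0, 0), (0, 5), (0, 2), (2, 4), (0, 3), (0, 0)]
word_ids = [None, 0, 1, 1, 3, None]

# The tutorial's mask finds only 3 word-initial tokens for 4 labels,
# which is the situation that makes the boolean assignment fail.
arr_offset = np.array(offsets)
mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
print(mask.sum(), len(doc_labels))


def align_labels(doc_labels, word_ids, ignore_index=-100):
    """Label the first token of each word; mark everything else as ignored."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            # special token or continuation of the same word
            aligned.append(ignore_index)
        else:
            aligned.append(doc_labels[wid])
        previous = wid
    return aligned


aligned = align_labels(doc_labels, word_ids)
print(aligned)
```

Because each label is looked up by word index, a word that is missing from the encoding is simply skipped instead of shifting every later label.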

Using a `tf.Tensor` as a Python `bool` is not allowed. Use `if t is not None:` instead of `if t:` to test if a tensor is defined,

I am trying to define func(x) in order to use the genetic algs library here:
https://github.com/bobirdmi/genetic-algorithms/tree/master/examples
However, when I try to use sga.init_random_population(population_size, params, interval), the code complains that I am using tf.Tensors as Python bools.
I only reference one bool in the entire code (elitism), so I have no idea why this error is showing. I asked others who have used sga.init_..., and they say my inputs and setup are fine. Any suggestions would be greatly appreciated.
Full traceback:
Traceback (most recent call last):
File "C:\Users\Eric\eclipse-workspace\hw1\ga2.py", line 74, in <module>
sga.init_random_population(population_size, params, interval)
File "C:\Program Files\Python36\lib\site-packages\geneticalgs\real_ga.py", line 346, in init_random_population
self._sort_population()
File "C:\Program Files\Python36\lib\site-packages\geneticalgs\standard_ga.py", line 386, in _sort_population
self.population.sort(key=lambda x: x.fitness_val, reverse=True)
File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 671, in __bool__
raise TypeError("Using a `tf.Tensor` as a Python `bool` is not allowed. "
TypeError: Using a `tf.Tensor` as a Python `bool` is not allowed. Use `if t is not None:` instead of `if t:` to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.
code
import hw1
#import matplotlib
from geneticalgs import BinaryGA, RealGA, DiffusionGA, MigrationGA
#import numpy as np
#import csv
#import time
#import pickle
#import math
#import matplotlib.pyplot as plt
from keras.optimizers import Adam
from hw1 import x_train, y_train, x_test, y_test
from keras.losses import mean_squared_error
#import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Dropout
# GA standard settings
generation_num = 50
population_size = 16
elitism = True
selection = 'rank'
tournament_size = None # in case of tournament selection
mut_type = 1
mut_prob = 0.05
cross_type = 1
cross_prob = 0.95
optim = 'min' # minimize or maximize a fitness value? May be 'min' or 'max'.
interval = (-1, 1)
# Migration GA settings
period = 5
migrant_num = 3
cloning = True
def func(x):
    #dimensions of weights and biases
    #layer0weights = [10][23]
    #layer0biases = [10]
    #layer1weights = [10][20]
    #layer1biases = [20]
    #layer2weights = [1][20]
    #layer2biases = [1]
    #split up x for weights and biases
    lay0 = x[0:230]
    bias0 = x[230:240]
    lay1 = x[240:440]
    bias1 = x[440:460]
    lay2 = x[460:480]
    bias2 = x[480:481]
    #fit to the shape of the actual model
    lay0 = lay0.reshape(23,10)
    bias0 = bias0.reshape(10,)
    lay1 = lay1.reshape(10,20)
    bias1 = bias1.reshape(20,)
    lay2 = lay2.reshape(20,1)
    bias2 = bias2.reshape(1,)
    #set the newly shaped object to layers
    hw1.model.layers[0].set_weights([lay0, bias0])
    hw1.model.layers[1].set_weights([lay1, bias1])
    hw1.model.layers[2].set_weights([lay2, bias2])
    res = hw1.model.predict(x_train)
    error = mean_squared_error(res,y_train)
    return error
ga_model = Sequential()
ga_model.add(Dense(10, input_dim=23, activation='relu'))
ga_model.add(Dense(20, activation='relu'))
ga_model.add(Dense(1, activation='sigmoid'))
sga = RealGA(func, optim=optim, elitism=elitism, selection=selection,
             mut_type=mut_type, mut_prob=mut_prob,
             cross_type=cross_type, cross_prob=cross_prob)
params = 481
sga.init_random_population(population_size, params, interval)
optimal = sga.best_solution[0]
predict = func(optimal)
print(predict)
TensorFlow generates a computational graph of operations to be executed in a TensorFlow session.
geneticalgs.RealGA.init_random_population is an operation that uses numpy.random.uniform to generate a NumPy array.
The generated population being a Tensor object could mean one of the following:
numpy.random.uniform, as invoked in geneticalgs.RealGA.init_random_population, was decorated to return Tensors
numpy.random.uniform was added to the computation graph to be executed in a session.
I would try executing the program eagerly by enabling eager execution.
tf.enable_eager_execution()
You can also, in a way, eagerly execute just the parts that you care about.
size = tf.placeholder(tf.int64)
dim = tf.placeholder(tf.int64)
# named interval_ph so it does not shadow the `interval` tuple fed below
interval_ph = tf.placeholder(tf.int64, shape=(2,))
init_random_population = tf.py_func(
    sga.init_random_population, [size, dim, interval_ph], [])
with tf.Session() as session:
    session.run(
        init_random_population,
        {size: population_size, dim: params, interval_ph: interval})
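Another reading of the traceback is worth considering: the exception is raised inside population.sort(key=lambda x: x.fitness_val, ...), and keras.losses.mean_squared_error returns a tensor rather than a number, so the sort may be comparing tensors returned by func. Under that assumption, having the fitness function return a plain Python float would sidestep the comparison. The sketch below uses NumPy only; Individual is a hypothetical stand-in for the library's population entries, and mse_as_float is an illustrative helper, not part of geneticalgs or Keras.

```python
import numpy as np


def mse_as_float(y_pred, y_true):
    """Scalar mean squared error returned as a plain Python float."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))


class Individual:
    """Hypothetical stand-in for one entry of the GA population."""

    def __init__(self, fitness_val):
        self.fitness_val = fitness_val


y_true = [0.0, 1.0, 1.0]
population = [
    Individual(mse_as_float([0.0, 1.0, 1.0], y_true)),
    Individual(mse_as_float([1.0, 0.0, 0.0], y_true)),
    Individual(mse_as_float([0.5, 0.5, 0.5], y_true)),
]

# Same sort call as in the traceback; floats compare without any error.
population.sort(key=lambda x: x.fitness_val, reverse=True)
ranked = [ind.fitness_val for ind in population]
print(ranked)
```

In func this would mean replacing the Keras loss call with a NumPy computation on res and y_train, so that the value handed back to the GA is an ordinary number.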

scikit-learn - Using a single string with RandomForestClassifier.predict()?

I'm an sklearn dummy... I'm trying to predict the label for a given string from a RandomForestClassifier() fitted with text, labels.
It's obvious I don't know how to use predict() with a single string. The reason I'm using reshape() is because I got this error some time ago "Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
How can I predict the label of a single text string?
The script:
#!/usr/bin/env python
''' Read a txt file consisting of '<label>: <long string of text>'
to use as a model for predicting the label for a string
'''
from argparse import ArgumentParser
import json
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
def main(args):
    '''
    args: Arguments obtained by _Get_Args()
    '''
    print('Loading data...')
    # Load data from args.txtfile and split the lines into
    # two lists (labels, texts).
    data = open(args.txtfile).readlines()
    labels, texts = ([], [])
    for line in data:
        label, text = line.split(': ', 1)
        labels.append(label)
        texts.append(text)
    # Print a list of unique labels
    print(json.dumps(list(set(labels)), indent=4))
    # Instantiate a CountVectorizer class and fit the texts
    # and labels into it.
    cv = CountVectorizer(
        stop_words='english',
        strip_accents='unicode',
        lowercase=True,
    )
    matrix = cv.fit_transform(texts)
    encoder = LabelEncoder()
    labels = encoder.fit_transform(labels)
    rf = RandomForestClassifier()
    rf.fit(matrix, labels)
    # Try to predict the label for args.string.
    prediction = Predict_Label(args.string, cv, rf)
    print(prediction)
def Predict_Label(string, cv, rf):
    '''
    string: str() - A string of text
    cv: The CountVectorizer class
    rf: The RandomForestClassifier class
    '''
    matrix = cv.fit_transform([string])
    matrix = matrix.reshape(1, -1)
    try:
        prediction = rf.predict(matrix)
    except Exception as E:
        print(str(E))
    else:
        return prediction
def _Get_Args():
    parser = ArgumentParser(description='Learn labels from text')
    parser.add_argument('-t', '--txtfile', required=True)
    parser.add_argument('-s', '--string', required=True)
    return parser.parse_args()
if __name__ == '__main__':
    args = _Get_Args()
    main(args)
The actual learning data text file is 43663 lines long but a sample is in small_list.txt which consists of lines each in the format: <label>: <long text string>
The error is noted in the Exception output:
$ ./learn.py -t small_list.txt -s 'This is a string that might have something to do with phishing or fraud'
Loading data...
[
"Vulnerabilities__Unknown",
"Vulnerabilities__MSSQL Browsing Service",
"Fraud__Phishing",
"Fraud__Copyright/Trademark Infringement",
"Attacks and Reconnaissance__Web Attacks",
"Vulnerabilities__Vulnerable SMB",
"Internal Report__SBL Notify",
"Objectionable Content__Russian Federation Objectionable Material",
"Malicious Code/Traffic__Malicious URL",
"Spam__Marketing Spam",
"Attacks and Reconnaissance__Scanning",
"Malicious Code/Traffic__Unknown",
"Attacks and Reconnaissance__SSH Brute Force",
"Spam__URL in Spam",
"Vulnerabilities__Vulnerable Open Memcached",
"Malicious Code/Traffic__Sinkhole",
"Attacks and Reconnaissance__SMTP Brute Force",
"Illegal content__Child Pornography"
]
Number of features of the model must match the input. Model n_features is 2070 and input n_features is 3
None
You need to get the vocabulary of the first CountVectorizer (cv) and use it to transform the new single text before predicting.
...
cv = CountVectorizer(
    stop_words='english',
    strip_accents='unicode',
    lowercase=True,
)
matrix = cv.fit_transform(texts)
encoder = LabelEncoder()
labels = encoder.fit_transform(labels)
rf = RandomForestClassifier()
rf.fit(matrix, labels)
# Try to predict the label for args.string.
cv_new = CountVectorizer(
    stop_words='english',
    strip_accents='unicode',
    lowercase=True,
    vocabulary=cv.vocabulary_
)
prediction = Predict_Label(args.string, cv_new, rf)
print(prediction)
...
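An equivalent route, offered here as a sketch rather than as the answer's own method, is to reuse the already fitted vectorizer and call transform() instead of fit_transform() on the new string; this keeps the feature count identical to what the forest was trained on. The toy texts and the predict_label helper below are hypothetical; the label names are taken from the question's output.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the '<label>: <text>' file.
texts = [
    "verify your bank account password immediately",
    "cheap pills discount offer buy today",
    "account login suspended please verify password",
    "discount offer buy cheap watches today",
]
labels = [
    "Fraud__Phishing",
    "Spam__Marketing Spam",
    "Fraud__Phishing",
    "Spam__Marketing Spam",
]

cv = CountVectorizer(stop_words='english', strip_accents='unicode', lowercase=True)
matrix = cv.fit_transform(texts)  # learns the vocabulary once
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
rf = RandomForestClassifier(random_state=0)
rf.fit(matrix, encoded)


def predict_label(string, cv, rf, encoder):
    # transform() reuses the fitted vocabulary; result has shape (1, n_features)
    features = cv.transform([string])
    return encoder.inverse_transform(rf.predict(features))[0]


new_features = cv.transform(["please verify your account password"])
prediction = predict_label("please verify your account password", cv, rf, encoder)
print(prediction)
```

With transform() the single sample is already a 1 x n_features sparse matrix, so the reshape(1, -1) step is not needed.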

Pytorch nn.embedding error

I was reading pytorch documentation on Word Embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(5)
word_to_ix = {"hello": 0, "world": 1, "how":2, "are":3, "you":4}
embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor(word_to_ix["hello"], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)
Output:
tensor([-0.4868, -0.6038, -0.5581, 0.6675, -0.1974])
This looks good but if I replace line lookup_tensor by
lookup_tensor = torch.tensor(word_to_ix["how"], dtype=torch.long)
I am getting the error as:
RuntimeError: index out of range at /Users/soumith/minicondabuild3/conda-bld/pytorch_1524590658547/work/aten/src/TH/generic/THTensorMath.c:343
I don't understand why it gives RunTime error on line hello_embed = embeds(lookup_tensor).
When you declare embeds = nn.Embedding(2, 5), the vocab size is 2 and the embedding size is 5, i.e. each word is represented by a vector of size 5 and there are only 2 words in the vocab, so the only valid indices are 0 and 1.
With lookup_tensor = torch.tensor(word_to_ix["how"], dtype=torch.long), embeds tries to look up the vector for index 2, the third word in the vocab, but the embedding has a vocab size of only 2. That is why you get the error.
If you declare embeds = nn.Embedding(5, 5), it should work fine.
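The same behaviour can be illustrated without PyTorch by treating the embedding as a lookup table with one row per vocabulary entry. The sketch below is a NumPy analogue, not the nn.Embedding implementation; make_table is a hypothetical helper, and the point is only that the table needs as many rows as there are distinct indices.

```python
import numpy as np

word_to_ix = {"hello": 0, "world": 1, "how": 2, "are": 3, "you": 4}
embedding_dim = 5
rng = np.random.default_rng(5)


def make_table(num_embeddings):
    """Random lookup table with one row per vocabulary index."""
    return rng.normal(size=(num_embeddings, embedding_dim))


# Two rows only: indices 0 and 1 are valid, index 2 ("how") is not.
small_table = make_table(2)
try:
    small_table[word_to_ix["how"]]
    lookup_failed = False
except IndexError:
    lookup_failed = True

# Sizing the table from the vocabulary avoids the mismatch.
full_table = make_table(len(word_to_ix))
how_vector = full_table[word_to_ix["how"]]
print(lookup_failed, how_vector.shape)
```

In the PyTorch code the analogous choice would be to pass len(word_to_ix) as the first argument of nn.Embedding rather than a hard-coded number.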

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on an email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided it into train and test sets, and then calculated the accuracy, using the features available in sklearn's Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails, i.e. whether they are spam or not.
For example, say I have an email. I want to feed it to my classifier and get a prediction as to whether it is spam or not. How can I achieve this? Please help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
    return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
"""
Starter code to process the texts of accurate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are
### thousands of lines of accurate and inaccurate text, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 lines in the list so you
### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
    for path in from_text:
        ### only look at first 200 texts when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter = 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            print(path)
            text = open(path, "r")
            line = text.readline()
            while line:
                ### use a function parseOutText to extract the text from the opened text
                # stem_text = parseOutText(text)
                stem_text = text.readline().strip()
                print(stem_text)
                ### use str.replace() to remove any instances of the words
                # stem_text = stem_text.replace("germani", "")
                ### append the text to feature_data
                feature_data.append(stem_text)
                ### append a 0 to label_data if text is from Sara, and 1 if text is from Chris
                if (name == "accurate"):
                    label_data.append("0")
                elif(name == "inaccurate"):
                    label_data.append("1")
                line = text.readline()
            text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
I also want to know whether I can train the classifier incrementally, that is, retrain an existing model with newer data to refine it over time.
I would be really glad if someone could help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print(pred).
But perhaps you want to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print(pred) if you want to view the new label(s).
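The steps above can be sketched as follows. The toy training texts are hypothetical stand-ins for the pickled dataset, and max_df is left out and percentile raised to 50 so that the tiny example keeps some features; the important part is that the new email goes through the same fitted vectorizer and selector, using transform() rather than fit_transform().

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data; "1" marks spam and "0" marks non-spam.
train_texts = [
    "win free lottery prize now",
    "claim your free prize money",
    "free lottery win claim now",
    "meeting schedule for project tomorrow",
    "please review the project report",
    "lunch meeting tomorrow at noon",
]
train_labels = ["1", "1", "1", "0", "0", "0"]

# Fit the vectorizer, selector and classifier on the training data only.
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
features_train = vectorizer.fit_transform(train_texts)
selector = SelectPercentile(f_classif, percentile=50)
features_train = selector.fit_transform(features_train, train_labels).toarray()
clf = GaussianNB()
clf.fit(features_train, train_labels)

# Steps 1-2: turn the new email into Tfidf features with the fitted vectorizer.
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
features_new = vectorizer.transform([new_text])
# Step 3: reduce dimensionality with the fitted selector, then densify.
features_new = selector.transform(features_new).toarray()
# Step 4: predict the label of the new email.
pred = clf.predict(features_new)
print(pred)
```

For this to work across script runs, the fitted vectorizer, selector and classifier would need to be kept around (for example pickled, as the dataset already is) rather than refitted on the new email.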
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.

Resources