What is the tfidf matrix giving ideally - python-3.x

when i ran the tfidf for a set of documents it returned me a tfidf matrix which looked like this.
(1, 12) 0.656240233446
(1, 11) 0.754552023393
(2, 6) 1.0
(3, 13) 1.0
(4, 2) 1.0
(7, 9) 1.0
(9, 4) 0.742540927053
(9, 5) 0.66980069547
(11, 19) 0.735138466738
(11, 7) 0.677916982176
(12, 18) 1.0
(13, 14) 0.697455191865
(13, 11) 0.716628394177
(14, 5) 1.0
(15, 8) 1.0
(16, 17) 1.0
(18, 1) 1.0
(19, 17) 1.0
(22, 13) 1.0
(23, 3) 1.0
(25, 6) 1.0
(26, 19) 0.476648253537
(26, 7) 0.879094103268
(28, 10) 0.532672175403
(28, 7) 0.523456282204
I want to know what is this, I am not able to understand how this is provided.
when i was in debug mode I got to know about indices, indptr and data... this things are some where co-relating with the data given. what are these?
there are lots of confusion in the numbers, I dont see 0th, 5th, 6th document if I say the first element in the parenthesis is the documents based on my prediction.
Kindly help me figure out how it is working here. However I know the general working of tfidf from wiki, taking the log on inverse documents and other stuff. I just want to know what are these 3 different kind of numbers here, what are the refering it to?
The source code is :
#This contains the list of file names
_filenames =[]
#This conatains the list if contents/text in the file
_contents = []
#This is a dict of filename:content
_file_contents = {}
class KmeansClustering():
def kmeansClusters(self):
global _report
self.num_clusters = 5
km = KMeans(n_clusters=self.num_clusters)
vocab_frame = TokenizingAndPanda().createPandaVocabFrame()
self.tfidf_matrix, self.terms, self.dist = TfidfProcessing().getTfidFPropertyData()
km.fit(self.tfidf_matrix)
self.clusters = km.labels_.tolist()
joblib.dump(km, 'doc_cluster2.pkl')
km = joblib.load('doc_cluster2.pkl')
class TokenizingAndPanda():
def tokenize_only(self,text):
'''
This function tokenizes the text
:param text: Give the text that you want to tokenize
:return: it gives the filter tokes
'''
# first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
filtered_tokens = []
# filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
return filtered_tokens
def tokenize_and_stem(self,text):
# first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
filtered_tokens = []
# filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
stems = [_stemmer.stem(t) for t in filtered_tokens]
return stems
def getFilnames(self):
'''
:return:
'''
global _path
global _filenames
path = _path
_filenames = FileAccess().read_all_file_names(path)
def getContentsForFilenames(self):
global _contents
global _file_contents
for filename in _filenames:
content = FileAccess().read_the_contents_from_files(_path, filename)
_contents.append(content)
_file_contents[filename] = content
def createPandaVocabFrame(self):
global _totalvocab_stemmed
global _totalvocab_tokenized
#Enable this if you want to load the filenames and contents from a file structure.
# self.getFilnames()
# self.getContentsForFilenames()
# for name, i in _file_contents.items():
# print(name)
# print(i)
for i in _contents:
allwords_stemmed = self.tokenize_and_stem(i)
_totalvocab_stemmed.extend(allwords_stemmed)
allwords_tokenized = self.tokenize_only(i)
_totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': _totalvocab_tokenized}, index=_totalvocab_stemmed)
print(vocab_frame)
return vocab_frame
class TfidfProcessing():
def getTfidFPropertyData(self):
tfidf_vectorizer = TfidfVectorizer(max_df=0.4, max_features=200000,
min_df=0.02, stop_words='english',
use_idf=True, tokenizer=TokenizingAndPanda().tokenize_and_stem, ngram_range=(1, 1))
# print(_contents)
tfidf_matrix = tfidf_vectorizer.fit_transform(_contents)
terms = tfidf_vectorizer.get_feature_names()
dist = 1 - cosine_similarity(tfidf_matrix)
return tfidf_matrix, terms, dist

Result of tfidf applied to data is usually a 2D matrix A, where A_ij is the normalized j'th term (word) frequency in i'th document. What you see in your output is sparse representation of this matrix, in other words - only elements which are non-zero are printed out, so:
(1, 12) 0.656240233446
means that 12th word (according to some vocabulary that has been built by sklearn) has a normalized frequency 0.656240233446 in the first document. The "missing" bits are zero, meaning that for example 3rd word cannot be find in 1st document (since there is no (1,3)) and so on.
The fact that some documents are missing is a result of your particular code/data (which you did not include), maybe you set the vocabulary by hand? Or maximum number of features considered? There are many parameters in TfidfVectorizer that can cause that, but without your exact code (and some exemplary data) nothing else can be said. For example setting min_df can cause that (as it drops very rare words) similarly max_features (the same effect)

Related

Try to concatenate tensors of inconsistent batch size in tensorflow2

I trie to concatenate 2 tensors of different batch size within a function decorated with #tf.function. I tried 2 methods and the first one is listed as below:
import tensorflow as tf
#tf.function # indispensable
def fun1(tensors, indices):
results = []
for i in tf.range(2): # batch size = 2
pos = tf.where(indices==i)
emb = tf.gather_nd(tensors, pos)
# do something to emb, but do nothing here for simplicity.
results += [emb]
results = tf.concat(results, axis=0)
return results
tensors = tf.random.uniform((5, 2))
fun1(tensors, indices=[0, 0, 1, 1, 1])
But it raises errors as following:
TypeError: 'results' does not have the same nested structure after one iteration.
The two structures don't have the same nested structure.
First structure: type=list str=[]
Second structure: type=list str=[<tf.Tensor 'while/GatherNd:0' shape=(None, 2) dtype=float32>]
More specifically: The two structures don't have the same number of elements. First structure: type=list str=[]. Second structure: type=list str=[<tf.Tensor 'while/GatherNd:0' shape=(None, 2) dtype=float32>]
Entire first structure:
[]
Entire second structure:
[.]
So I tried the second method:
import tensorflow as tf
#tf.function # indispensable
def fun2(tensors, indices):
results = tf.reshape(tf.constant([], dtype=tf.float32), (0, 2)) # make empty tensors
for i in tf.range(2): # batch size = 2
pos = tf.where(indices==i)
emb = tf.gather_nd(tensors, pos)
# do something to emb, but do nothing here for simplicity
results = tf.concat([results, emb], axis=0)
return results
tensors = tf.random.uniform((5, 2))
fun2(tensors, indices=[0, 0, 1, 1, 1])
But it raises errors:
ValueError: 'results' has shape (0, 2) before the loop, but shape (None, 2) after one iteration. Use tf.autograph.experimental.set_loop_options to set shape invariants.
How should I resolve the problems? Thanks
I found I could achieve it by adding one line of codes to the second method as following:
#tf.function
def fun2(tensors, indices):
results = tf.reshape(tf.constant([], dtype=tf.float32), (0, 2)) # make empty tensors
for i in tf.range(2): # batch size = 2
tf.autograph.experimental.set_loop_options(shape_invariants=[(results, tf.TensorShape([None, 2]))])
pos = tf.where(indices==i)
emb = tf.gather_nd(tensors, pos)
# do something to emb, but do nothing here for simplicity
results = tf.concat([results, emb], axis=0)
return results

Could someone please help me with sklearn.metrics.roc_curve's use and what does the function expect?

I am trying to construct 2 numpy ndarray-s from a networkx Graph's data structures that look like a list of tuples and a simple list. I would like to make a roc curve where
the validation set is the above mentioned list of tuples of the edges of a G graph that I was trying to construct like this:
x = []
for i in G_orig.nodes():
for j in G_orig.nodes():
if j > I and (i, j) not in G.edges():
if (i, j) in G_orig.edges():
x.append((i, j, 1))
else:
x.append((i, j, 0))
y_validation = np.array(x)
It looks something like this: [(1, 344, 1), (2, 23, 0), (3, 5, 0), ...... (333, 334, 1)].
The first 2 numbers mean 2 nodes, the 3rd one means whether there is an edge between them. 1 means edge, 0 means no edge.
Then roc_curve expects something called y_score in the documentation. I have a list for that made with a method called preferential attachment, therefore I named it pref_att_types. I tried to make a numpy array of it in case the roc_curve expects only it.
positive_class_predicted_probabilities = np.array(pref_att_types)
3.Then I just did what we used in class.
FPRs, TPRs, thresholds = roc_curve(y_validation,
positive_class_predicted_probabilities,
pos_label=1)
It is literally just Ctrl C + Ctrl V. But it says Value error and 'multiclass-multioutput format is not supported'. Please note that I am not a programmer just someone who studies to be a mathematics analyst.
The first argument, y_true, needs to be just the true labels, in this case 0/1 without the pair of nodes. Just be sure that the indices of the arrays y_validation and pref_att_types match
The code below draws the ROC curves for two RF models:
from sklearn.metrics import roc_curve
#create array of probabilities
y_test_predict1_probaRF = rf1.predict_proba(X_test)
y_test_predict2_probaRF = rf2.predict_proba(X_test)
RFfpr1, RFtpr1, thresholds = roc_curve(y_test, y_test_predict1_probaRF[:,1])
RFfpr2, RFtpr2, thresholds = roc_curve(y_test, y_test_predict2_probaRF[:,1])
def plot_roc_curve (fpr, tpr, label = None):
plt.plot(fpr, tpr, linewidth = 2, label = label)
plt.plot([0,1], [0,1], "k--")
plt.axis([0,1,0,1])
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plot_roc_curve (RFfpr1,RFtpr1,"RF1")
plot_roc_curve (RFfpr2,RFtpr2,"RF2")
plt.legend()
plt.show()

Separate the words in the sentence for text classification problem

I am solving a text classification problem and while annotating my data I found very long words which are sentence itself but are not separated by space.
One of the example which I found while annotating my data point is:
Throughnumerousacquisitionsandtransitions,Anacompstillexiststodaywithagreaterfocusondocumentmanagement
Desired output:
Through numerous acquisitions and transitions, Anacomp still exists today with a greater focus on document management.
I have looked upon various frameworks such as Keras, PyTorch to see if they provide any functionality to solve this issue but I couldn't find anything.
The problem that you are trying to solve is text/word segmentation. It is possible to approach this based on ML using a sequence model (such as LSTM) and a word embedding (such as BERT).
This link details such an approach for Chinese language. Chinese language does not adopt spaces, so this sort of approach is necessary as a preprocessing component in Chinese NLP processing tasks.
I would like to describe an automaton based approach using Aho-Corasick Algorithm.
First do a pip install pyahocorasick
I'm resorting to use only the words in your input string for the sake of demonstration. In a real world scenario you could just use a dictionary of words from something like Wordnet.
import ahocorasick
automaton = ahocorasick.Automaton()
input = 'Throughnumerousacquisitionsandtransitions, Anacompstillexiststodaywithagreaterfocusondocumentmanagement'
# Replace this with a large dictionary of words
word_dictionary = ['Through', 'numerous', 'acquisition', 'acquisitions', 'and', 'transitions', 'Anacomp', 'still',
'exists', 'today', 'with', 'a', 'greater', 'focus', 'on', 'document', 'management']
# add dictionary words to automaton
for idx, key in enumerate(word_dictionary):
automaton.add_word(key, (idx, key))
# Build aho-corasick automaton for search
automaton.make_automaton()
# to check for ambiguity, if there is a longer match then prefer that
previous_rng = range(0, 0)
previous_rs = set(previous_rng)
# Holds the end result dictionary
result = {}
# search the inputs using automaton
for end_index, (insert_order, original_value) in automaton.iter(input):
start_index = end_index - len(original_value) + 1
current_rng = range(start_index, end_index)
current_rs = set(current_rng)
# ignore previous as there is a longer match available
if previous_rs.issubset(current_rs):
# remove ambiguous short entry in favour of the longer entry
if previous_rng in result:
del result[previous_rng]
result[current_rng] = (insert_order, original_value)
previous_rng = current_rng
previous_rs = current_rs
# if there is no overlap of indices, then its a new token, add to result
elif previous_rs.isdisjoint(current_rs):
previous_rng = current_rng
previous_rs = current_rs
result[current_rng] = (insert_order, original_value)
# ignore current as it is a subset of previous
else:
continue
assert input[start_index:start_index + len(original_value)] == original_value
for x in result:
print(x, result[x])
Produces results :
range(0, 6) (0, 'Through')
range(7, 14) (1, 'numerous')
range(15, 26) (3, 'acquisitions')
range(27, 29) (4, 'and')
range(30, 40) (5, 'transitions')
range(43, 49) (6, 'Anacomp')
range(50, 54) (7, 'still')
range(55, 60) (8, 'exists')
range(61, 65) (9, 'today')
range(66, 69) (10, 'with')
range(71, 77) (12, 'greater')
range(78, 82) (13, 'focus')
range(83, 84) (14, 'on')
range(85, 92) (15, 'document')
range(93, 102) (16, 'management')

Keras One Hot Encoding Memory Management - best Possible way out

I know this problem has been answered in different ways in the past. But I am not able to figure out and fit in my code and need help. I am using the cornell movie corpus as my dataset. Trying to train a LSTM model for chatbot is the final expectation. But I am stuck with initial one hot encoding and is getting out of memory. Note the VM I am training is 86GB memory but still having issues.In nmt_special_utils_mod.py the one hot encoding is going beyond allocated memory and I am not able to pass the stage. Any alternative way to do these line will be helpful without loosing the functionality
Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), X)))
Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), Y)))
All codes as below to make the question clear
import_corpus_mod.py -
Change 1: updated less frequent word removal
def data_load():
TrainDataSetPath = 'D:\\Script\\Python\\NLP\\chatbotSeq2SeqWithAtt\\ChatBot\\'
####initializing libraries####
#import numpy as np
#import tensorflow as tf
import re
#import time
########### Data Pre-processing Part 1##########
def clean_text(text):
'''The function will clean known texts and make it more meaningful'''
text = text.lower()
text = re.sub(r"i'm", "i am", text)
text = re.sub(r"he's", "he is", text)
text = re.sub(r"she's", "she is", text)
text = re.sub(r"it's", "it is", text)
text = re.sub(r"let's", "let us", text)
text = re.sub(r"that's", "that is", text)
text = re.sub(r"what's", "what is", text)
text = re.sub(r"where's", "where is", text)
text = re.sub(r"how's", "how is", text)
text = re.sub(r"howz", "how is", text)
text = re.sub(r"\'ll", " will", text)
text = re.sub(r"\'ve", " have", text)
text = re.sub(r"\'re", " are", text)
text = re.sub(r"\'d", " would", text)
text = re.sub(r"don't", "do not", text)
text = re.sub(r"won't", "will not", text)
text = re.sub(r"can't", "cannot", text)
text = re.sub(r"wouldn't", "would not", text)
text = re.sub(r"wasn't", "was not", text)
text = re.sub(r"haven't", "have not", text)
text = re.sub(r"\s+"," ",text)
text = re.sub(r"[-()\"#/#;:<>+=~|{}.?,]", "", text)
#####Add more below this line######
#####Add more above this line######
return text
lines = open(TrainDataSetPath+'movie_lines.txt', encoding='utf-8', errors='ignore').read().split('\n')
conversations = open(TrainDataSetPath+'movie_conversations_short.txt', encoding='utf-8', errors='ignore').read().split('\n')
#Create dictionary which maps each line with its corresponding ID
id2line = {}
for line in lines:
_line = line.split(' +++$+++ ')
if len(_line) == 5:
id2line[_line[0]] = _line[4]
#Create list of all conversation
conversations_ids = []
for conversation in conversations[:-1]: #the last line in conversation is blank hence -1
#Split then pick last part[-1] which is conversation. Then Removing square bracket by [1:-1] and then replacing quotes and space
_conversation = conversation.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(" ","")
# Append to form a list of list separating by comma
conversations_ids.append(_conversation.split(","))
#Separating the question and answer - assuming the first is the question second is the answer in a conversation
questions = []
answers = []
threshold = 5 #If more than 15 counts of words
for conversation in conversations_ids:
for i in range(len(conversation)-1):
questions.append(id2line[conversation[i]])
answers.append(id2line[conversation[i+1]])
# Cleaning all questions
clean_questions = []
for question in questions:
clean_questions.append(clean_text(question))
# Cleaning all answers
clean_answers = []
for answer in answers:
clean_answers.append(clean_text(answer))
# Creating a dictionary that maps each word to its number of occurrence
word2count = {}
for question in clean_questions:
for word in question.split():
if word not in word2count:
word2count[word] = 1
else:
word2count[word] += 1
for answer in clean_answers:
for word in answer.split():
if word not in word2count:
word2count[word] = 1
else:
word2count[word] += 1
#Create dictionary of words which has more occurrence than threshold
for k in list(word2count):
if word2count[k] < threshold:
del word2count[k]
cleanest_questions, cleanest_answers, keys_list = [], [], list(word2count.keys())
for answers in clean_answers:
ans = []
for word in answers.split():
if word in keys_list:
ans.append(word)
else:
ans.append('<unk>')
cleanest_answers.append(' '.join(ans))
for question in clean_questions:
ques = []
for word in question.split():
if word in keys_list:
ques.append(word)
else:
ques.append('<unk>')
cleanest_questions.append(' '.join(ques))
return cleanest_questions, cleanest_answers
nmt_data_load_asmain_words.py
Change 1 : update less frequent word removal
from tqdm import tqdm
from import_corpus_mod import data_load
def load_dataset(clean_questions, clean_answers):
"""
Loads a dataset with m examples and vocabularies
:m: the number of examples to generate
"""
human_vocab = set()
machine_vocab = set()
dataset = []
lines = len(clean_questions)
for i in tqdm(range(lines)):
hu, mc = clean_questions[i], clean_answers[i]
if hu is not None:
dataset.append((hu, mc))
human_vocab.update(set(hu.split()))
machine_vocab.update(set(mc.split()))
human = dict(zip(sorted(human_vocab) + ['<pad>'],
list(range(len(human_vocab) + 1))))
#human = dict(zip(sorted(human_vocab) + ['<pad>'],
#list(range(len(human_vocab) + 1))))
#human = dict(zip(sorted(human_vocab),
#list(range(len(human_vocab)))))
machine = dict(zip(sorted(machine_vocab) + ['<pad>'],
list(range(len(machine_vocab) + 1))))
#machine = dict(zip(sorted(machine_vocab) + ['<pad>'],
#list(range(len(machine_vocab) + 1))))
inv_machine = {v:k for k,v in machine.items()}
inv_human = {p:q for q,p in human.items()}
return dataset, human, machine, inv_machine, inv_human
clean_questions, clean_answers = data_load()
dataset, human_vocab, machine_vocab, inv_machine_vocab, inv_human_vocab = load_dataset(clean_questions, clean_answers)
nmt_special_utils_mod.py
import numpy as np
from keras.utils import to_categorical
import keras.backend as K
import matplotlib.pyplot as plt
import sys
# Initiate a list to store integer version of sentences
X_into_int = []
Y_into_int = []
def preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty):
X, Y = zip(*dataset)
X = np.asarray([string_to_int(i, Tx, human_vocab) for i in X])
Y = [string_to_int(t, Ty, machine_vocab) for t in Y]
Xoh, Yoh = [], []
Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), X)))
Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), Y)))
return X, np.array(Y), Xoh, Yoh
def string_to_int(line, length, vocab):
#print("hello- inside function")
"""
Converts all strings in the vocabulary into a list of integers representing the positions of the
input string's characters in the "vocab"
Arguments:
string -- input string, e.g. 'Hello how are you'
length -- the number of time steps you'd like, determines if the output will be padded or cut
vocab -- vocabulary, dictionary used to index every character of your "string"
Returns:
rep -- list of integers (or '<unk>') (size = length) representing the position of the string's character in the vocabulary
"""
'''
#make lower to standardize
for string in listofstring:
string = string.lower()
string = string.replace(',','')
if len(string) > length:
string = string[:length]
rep = list(map(lambda x: vocab.get(x, '<unk>'), string))
if len(string) < length:
rep += [vocab['<pad>']] * (length - len(string))
#print (rep)
return rep
'''
newlist = []
if len(line.split()) > length:
line = line.split()
for i in range(length):
newlist.append(line[i])
line = ' '.join(newlist)
else:
line = line + ' <pad>' * (length - len(line.split()))
#print(line)
#print("hello- inside padded")
#words_into_int = []
ints = []
for word in line.split():
if word not in vocab:
ints.append(vocab['<unk>'])
else:
ints.append(vocab[word])
#print("hello- inside append if loop")
#words_into_int.append(ints)
#words_into_int = ",".join(x for x in words_into_int)
return ints
def int_to_string(ints, inv_vocab):
"""
Output a machine readable list of characters based on a list of indexes in the machine's vocabulary
Arguments:
ints -- list of integers representing indexes in the machine's vocabulary
inv_vocab -- dictionary mapping machine readable indexes to machine readable characters
Returns:
l -- list of characters corresponding to the indexes of ints thanks to the inv_vocab mapping
"""
l = [inv_vocab[i] for i in ints]
return l
EXAMPLES = ['3 May 1979', '5 Apr 09', '20th February 2016', 'Wed 10 Jul 2007']
def softmax(x, axis=1):
"""Softmax activation function.
# Arguments
x : Tensor.
axis: Integer, axis along which the softmax normalization is applied.
# Returns
Tensor, output of softmax transformation.
# Raises
ValueError: In case `dim(x) == 1`.
"""
ndim = K.ndim(x)
if ndim == 2:
return K.softmax(x)
elif ndim > 2:
e = K.exp(x - K.max(x, axis=axis, keepdims=True))
s = K.sum(e, axis=axis, keepdims=True)
return e / s
else:
raise ValueError('Cannot apply softmax to a tensor that is 1D')
def plot_attention_map(model, input_vocabulary, inv_output_vocabulary, text, n_s = 128, num = 6, Tx = 30, Ty = 10):
"""
Plot the attention map.
"""
attention_map = np.zeros((10, 30))
Ty, Tx = attention_map.shape
s0 = np.zeros((1, n_s))
c0 = np.zeros((1, n_s))
layer = model.layers[num]
encoded = np.array(string_to_int(text, Tx, input_vocabulary)).reshape((1, 30))
encoded = np.array(list(map(lambda x: to_categorical(x, num_classes=len(input_vocabulary)), encoded)))
f = K.function(model.inputs, [layer.get_output_at(t) for t in range(Ty)])
r = f([encoded, s0, c0])
for t in range(Ty):
for t_prime in range(Tx):
attention_map[t][t_prime] = r[t][0,t_prime,0]
# Normalize attention map
# row_max = attention_map.max(axis=1)
# attention_map = attention_map / row_max[:, None]
prediction = model.predict([encoded, s0, c0])
predicted_text = []
for i in range(len(prediction)):
predicted_text.append(int(np.argmax(prediction[i], axis=1)))
predicted_text = list(predicted_text)
predicted_text = int_to_string(predicted_text, inv_output_vocabulary)
text_ = list(text)
# get the lengths of the string
input_length = len(text)
output_length = Ty
# Plot the attention_map
plt.clf()
f = plt.figure(figsize=(8, 8.5))
ax = f.add_subplot(1, 1, 1)
# add image
i = ax.imshow(attention_map, interpolation='nearest', cmap='Blues')
# add colorbar
cbaxes = f.add_axes([0.2, 0, 0.6, 0.03])
cbar = f.colorbar(i, cax=cbaxes, orientation='horizontal')
cbar.ax.set_xlabel('Alpha value (Probability output of the "softmax")', labelpad=2)
# add labels
ax.set_yticks(range(output_length))
ax.set_yticklabels(predicted_text[:output_length])
ax.set_xticks(range(input_length))
ax.set_xticklabels(text_[:input_length], rotation=45)
ax.set_xlabel('Input Sequence')
ax.set_ylabel('Output Sequence')
# add grid and legend
ax.grid()
#f.show()
return attention_map
nmt_code_mod.py the main code
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 10 16:31:44 2018
#author: Anirban
"""
from keras.layers import Bidirectional, Concatenate, Dot, Input, LSTM
from keras.layers import RepeatVector, Dense, Activation
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.models import Model
import keras.backend as K
import numpy as np
from nmt_data_load_asmain_words import load_dataset
from import_corpus_mod import data_load
from nmt_special_utils_mod import *
epochs = 50
clean_questions, clean_answers = data_load()
dataset, human_vocab, machine_vocab, inv_machine_vocab, inv_human_vocab = load_dataset(clean_questions, clean_answers)
m = len(clean_questions)
Tx = 8
Ty = 8
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)
print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)
# Defined shared layers as global variables
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(20, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded from nmt_special_utils
dotor = Dot(axes = 1)
def one_step_attention(a, s_prev):
"""
Performs one step of attention: Outputs a context vector computed as a dot product of the attention weights
"alphas" and the hidden states "a" of the Bi-LSTM.
Arguments:
a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)
Returns:
context -- context vector, input of the next (post-attetion) LSTM cell
"""
### START CODE HERE ###
# Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
s_prev = repeator(s_prev)
# Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
concat = concatenator([a,s_prev])
# Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e. (≈1 lines)
e = densor1(concat)
# Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈1 lines)
energies = densor2(e)
# Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
alphas = activator(energies)
# Use dotor together with "alphas" and "a" to compute the context vector to be given to the next (post-attention) LSTM-cell (≈ 1 line)
context = dotor([alphas,a])
### END CODE HERE ###
return context
n_a = 32
n_s = 64
post_activation_LSTM_cell = LSTM(n_s, return_state = True)
output_layer = Dense(len(machine_vocab), activation=softmax)
def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
"""
Arguments:
Tx -- length of the input sequence
Ty -- length of the output sequence
n_a -- hidden state size of the Bi-LSTM
n_s -- hidden state size of the post-attention LSTM
human_vocab_size -- size of the python dictionary "human_vocab"
machine_vocab_size -- size of the python dictionary "machine_vocab"
Returns:
model -- Keras model instance
"""
# Define the inputs of your model with a shape (Tx,)
# Define s0 and c0, initial hidden state for the decoder LSTM of shape (n_s,)
X = Input(shape=(Tx, human_vocab_size))
s0 = Input(shape=(n_s,), name='s0')
c0 = Input(shape=(n_s,), name='c0')
s = s0
c = c0
# Initialize empty list of outputs
outputs = []
### START CODE HERE ###
# Step 1: Define your pre-attention Bi-LSTM. Remember to use return_sequences=True. (≈ 1 line)
a = Bidirectional(LSTM(n_a, return_sequences=True),input_shape=(m, Tx, n_a*2))(X)
# Step 2: Iterate for Ty steps
for t in range(Ty):
# Step 2.A: Perform one step of the attention mechanism to get back the context vector at step t (≈ 1 line)
context = one_step_attention(a, s)
# Step 2.B: Apply the post-attention LSTM cell to the "context" vector.
# Don't forget to pass: initial_state = [hidden state, cell state] (≈ 1 line)
s, _, c = post_activation_LSTM_cell(context,initial_state = [s, c])
# Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM (≈ 1 line)
out = output_layer(s)
# Step 2.D: Append "out" to the "outputs" list (≈ 1 line)
outputs.append(out)
# Step 3: Create model instance taking three inputs and returning the list of outputs. (≈ 1 line)
model = Model(inputs=[X,s0,c0],outputs=outputs)
### END CODE HERE ###
return model
model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))
opt = Adam(lr=0.05, beta_1=0.9, beta_2=0.999,decay=0.01)
model.compile(loss='categorical_crossentropy', optimizer=opt,metrics=['accuracy'])
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))
model.fit([Xoh, s0, c0], outputs, epochs=epochs, batch_size=5)
EXAMPLES = ['can we make this quick roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad again'
,'the thing is cameron i am at the mercy of a particularly hideous breed of loser my sister i cannot date until she does'
,'Hello how are you']
#EXAMPLES = ['13 May 1979', 'Tue 11 Jul 2007','Saturday May 9 2018', 'March 3 2001','March 3rd 2001', '1 March 2001','23 May 2017']
for example in EXAMPLES:
source = np.asarray([string_to_int(example, Tx, human_vocab)])
#need a try block here to prevent errors if vocab is small and example has characters not in the vocab
source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source))) #.swapaxes(0,1)
prediction = model.predict([source, s0, c0])
prediction = np.argmax(prediction, axis = -1)
output = [inv_machine_vocab[int(i)] for i in prediction]
pads = output.count('<pad>')
output = output[0:(len(output)-pads)]
print("source:", example)
print("output:", ' '.join(output))
Note: The code is as is code of very famous research paper in 2016 which coverts any date time to computer understandable date time. I was trying to re-use that for our Chatbot - Seq2Seq with Attention Model (bi-directional). The code is working - just that the movie corpus if loaded in 1000 conversation it works. When you load the full corpus it fails due to memory overload
EDIT
Thank You for collaboration efforts on this problem - Really appreciate the trouble you are taking to go through the code and trying to find out the best possible solution for this. As you instructed I have updated the import_corpus_mod.py to incorporate the threshold = 5 and at the very beginning converting the least frequent words less than 5 to < unk > without space. This change forced another small change in nmt_data_load_asmain_words.py to remove the addition of < unk > there.
Now based on the other point and the code shared by you - I hashed out the below lines in nmt_special_utils_mod.py
#Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), X)))
#Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), Y)))
And straight away change the input based on your guidance?
Xi = Input(shape=(Tx,))
X = Embedding( human_vocab_size, 100, embeddings_initializer='uniform', input_length=Tx , trainable=True )(Xi)
s0 = Input(shape=(n_s,), name='s0')
c0 = Input(shape=(n_s,), name='c0')
s = s0
c = c0
Got lot of errors
runfile('D:/Script/Python/NLP/chatbotSeq2SeqWithAtt/ChatBot/nmt_code_mod.py', wdir='D:/Script/Python/NLP/chatbotSeq2SeqWithAtt/ChatBot')
Reloaded modules: nmt_data_load_asmain_words, import_corpus_mod, nmt_special_utils_mod
100%|██████████| 384/384 [00:00<00:00, 24615.06it/s]
100%|██████████| 384/384 [00:00<?, ?it/s]
X.shape: (384, 8)
Y.shape: (384, 8)
D:\Python\Anaconda3\lib\site-packages\keras\engine\topology.py:1592: UserWarning: Model inputs must come from a Keras Input layer, they cannot be the output of a previous non-Input layer. Here, a tensor specified as input to "model_2" was not an Input tensor, it was generated by layer embedding_1.
Note that input tensors are instantiated via `tensor = Input(shape)`.
The tensor that caused the issue was: embedding_1/Gather:0
str(x.name))
Traceback (most recent call last):
File "<ipython-input-44-addb6f9e6bc1>", line 1, in <module>
runfile('D:/Script/Python/NLP/chatbotSeq2SeqWithAtt/ChatBot/nmt_code_mod.py', wdir='D:/Script/Python/NLP/chatbotSeq2SeqWithAtt/ChatBot')
File "D:\Python\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "D:\Python\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "D:/Script/Python/NLP/chatbotSeq2SeqWithAtt/ChatBot/nmt_code_mod.py", line 138, in <module>
model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))
File "D:/Script/Python/NLP/chatbotSeq2SeqWithAtt/ChatBot/nmt_code_mod.py", line 132, in model
model = Model(inputs=[X,s0,c0],outputs=outputs)
File "D:\Python\Anaconda3\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "D:\Python\Anaconda3\lib\site-packages\keras\engine\topology.py", line 1652, in __init__
layer.__class__.__name__))
TypeError: Input layers to a `Model` must be `InputLayer` objects. Received inputs: [<tf.Tensor 'embedding_1/Gather:0' shape=(?, 8, 100) dtype=float32>, <tf.Tensor 's0_1:0' shape=(?, 64) dtype=float32>, <tf.Tensor 'c0_1:0' shape=(?, 64) dtype=float32>]. Input 0 (0-based) originates from layer type `Embedding`
So reverting back the code here for nmt_code_mod.py and nmt_special_utils_mod.py
The problem is not one-hot encoding but rather storing the entire dataset in memory. The wise choice is to a generator, or a Sequence which will allow you to load and encode the data on the fly. This is commonly done for large image datasets for example.
I would recommend to perform all your pre-processing and save input, output pairs without encoding as a csv file, then you can create a generator that lazily loads and encodes:
class MySequence(Sequence):
def __init__(self, data, batch_size):
self.data_file = data
self.batch_size = batch_size
def __len__(self):
return int(np.ceil(len(self.x) / float(self.batch_size)))
def __getitem__(self, batch_id):
# Get corresponding batch data...
# one-hot encode
return X, Y
Note the generators (or Sequence[i]) returns a single batch.
I would not recommend using one-hot encodings and a dense matrix.
If you have a vocabulary of 100.000 words a 100.000 x 100.000 consumes more than 70Gb of RAM.
You can try using sparse a sparse matrix. But I guess that changes the rest of your code. you may take a look at this answer.
You could use is word embeddings representations, which are compact, memory friendly and used by all the state of the art NLP systems.
In any case, one think you must do with your model is to handle embedding inputs using the proper embedding layer.
This layer stores the embedding matrix once, and then you can build your training samples giving only one integer that represents the index of the word in the vocabulary.
If you want one hot encodings, you can build an embedding layer with a NxN identity matrix using a Keras initializer. Where N is the size of the vocabulary. Then your can pass as input the indexes of the words as integers. This will increase the size of your model, but it will reduce the size of your batches.
If you want word2vec embeddings, you can load an embedding matrix with a NxV dimensions. Where N is the size of the vocabulary and V is the dimension of the embeddings. You will notice that V is normally set to 100 or 200 dimensions, which is much smaller than N. Saving you a lot of memory.
EDIT: to clarify the usage of embeddings in your case:
You do:
X = Input(shape=(Tx, human_vocab_size))
s0 = Input(shape=(n_s,), name='s0')
c0 = Input(shape=(n_s,), name='c0')
s = s0
c = c0
Instead you can do one-hot-encoding this way:
Xi = Input(shape=(Tx,))
X = Embedding( human_vocab_size, human_vocab_size, embeddings_initializer=keras.initializers.Identity, input_length=Tx )(Xi)
s0 = Input(shape=(n_s,), name='s0')
c0 = Input(shape=(n_s,), name='c0')
s = s0
c = c0
By doing this, you can build your training samples using only the word indexes and not the one hot vectors. This will make you save some space in the training samples, but your model will be larger in size.
If it is still too large, you won't have the choice but using dense embeddings. To do so, you can do the following:
Xi = Input(shape=(Tx,))
X = Embedding( human_vocab_size, 100, embeddings_initializer='uniform', input_length=Tx , trainable=True )(Xi)
s0 = Input(shape=(n_s,), name='s0')
c0 = Input(shape=(n_s,), name='c0')
s = s0
c = c0
This initializes embeddings randomly with a compact representation (dimension 100 instead of human_vocab_size). This would save you a lot of memory.
Finally you could reduce the size of your vocabulary by putting everything in lowercase or replacing rare words (that appear only once or twice in the corpus) with an special token "RARE"

Implementing word dropout in pytorch

I want to add word dropout to my network so that I can have sufficient training examples for training the embedding of the "unk" token. As far as I'm aware, this is standard practice. Let's assume the index of the unk token is 0, and the index for padding is 1 (we can switch them if that's more convenient).
This is a simple CNN network which implements word dropout the way I would have expected it to work:
class Classifier(nn.Module):
def __init__(self, params):
super(Classifier, self).__init__()
self.params = params
self.word_dropout = nn.Dropout(params["word_dropout"])
self.pad = torch.nn.ConstantPad1d(max(params["window_sizes"])-1, 1)
self.embedding = nn.Embedding(params["vocab_size"], params["word_dim"], padding_idx=1)
self.convs = nn.ModuleList([nn.Conv1d(1, params["feature_num"], params["word_dim"] * window_size, stride=params["word_dim"], bias=False) for window_size in params["window_sizes"]])
self.dropout = nn.Dropout(params["dropout"])
self.fc = nn.Linear(params["feature_num"] * len(params["window_sizes"]), params["num_classes"])
def forward(self, x, l):
x = self.word_dropout(x)
x = self.pad(x)
embedded_x = self.embedding(x)
embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"]) # [batch_size, 1, seq_len * word_dim]
features = [F.relu(conv(embedded_x)) for conv in self.convs]
pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, params["feature_num"]) for feat in features]
pooled = torch.cat(pooled, 1)
pooled = self.dropout(pooled)
logit = self.fc(pooled)
return logit
Don't mind the padding - pytorch doesn't have an easy way of using non zero padding in CNNs, much less trainable non-zero padding, so I'm doing it manually. Dropout also doesn't allow me to use non zero dropout, and I want to separate the padding token from the unk token. I'm keeping it in my example because it's the reason for this question's existence.
This doesn't work because dropout wants Float Tensors so that it can scale them properly, while my input is Long Tensors that don't need to be scaled.
Is there an easy way of doing this in pytorch? I essentially want to use LongTensor-friendly dropout (bonus: better if it will let me specify a dropout constant that isn't 0, so that I could use zero padding).
Actually I would do it outside of your model, before converting your input into a LongTensor.
This would look like this:
import random
def add_unk(input_token_id, p):
#random.random() gives you a value between 0 and 1
#to avoid switching your padding to 0 we add 'input_token_id > 1'
if random.random() < p and input_token_id > 1:
return 0
else:
return input_token_id
#than you have your input token_id
#for this example I take just a random number, lets say 127
input_token_id = 127
#let p be your probability for UNK
p = 0.01
your_input_tensor = torch.LongTensor([add_unk(input_token_id, p)])
Edit:
So there are two options which come to my mind which are actually GPU-friendly. In general both solutions should be much more efficient.
Option one - Doing computation directly in forward():
If you're not using torch.utils and don't have plans using it later this is probably the way to go.
Instead of doing the computation before we just do it in the forward() method of main PyTorch class. However I see no (simple) way doing this in torch 0.3.1., so you would need to upgrade to version 0.4.0:
So imagine x is your input vector:
>>> x = torch.tensor(range(10))
>>> x
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
probs is a vector containing uniform probabilities for dropout so we can check later agains our probability for dropout:
>>> probs = torch.empty(10).uniform_(0, 1)
>>> probs
tensor([ 0.9793, 0.1742, 0.0904, 0.8735, 0.4774, 0.2329, 0.0074,
0.5398, 0.4681, 0.5314])
Now we apply the dropout probabilities probs on our input x:
>>> torch.where(probs > 0.2, x, torch.zeros(10, dtype=torch.int64))
tensor([ 0, 0, 0, 3, 4, 5, 0, 7, 8, 9])
Note: To see some effect I chose a dropout probability of 0.2 here. I reality you probably want it to be smaller.
You can pick for this any token / id you like, here is an example with 42 as unknown token id:
>>> unk_token = 42
>>> torch.where(probs > 0.2, x, torch.empty(10, dtype=torch.int64).fill_(unk_token))
tensor([ 0, 42, 42, 3, 4, 5, 42, 7, 8, 9])
torch.where comes with PyTorch 0.4.0:
https://pytorch.org/docs/master/torch.html#torch.where
I don't know about the shapes of your network, but your forward() should look something like this then (when using mini-batching you need to flatten the input before applying dropout):
def forward_train(self, x, l):
# probabilities
probs = torch.empty(x.size(0)).uniform_(0, 1)
# applying word dropout
x = torch.where(probs > 0.02, x, torch.zeros(x.size(0), dtype=torch.int64))
# continue like before ...
x = self.pad(x)
embedded_x = self.embedding(x)
embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"]) # [batch_size, 1, seq_len * word_dim]
features = [F.relu(conv(embedded_x)) for conv in self.convs]
pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, params["feature_num"]) for feat in features]
pooled = torch.cat(pooled, 1)
pooled = self.dropout(pooled)
logit = self.fc(pooled)
return logit
Note: I named the function forward_train() so you should use another forward() without dropout for evaluation / predicting. But you could also use some if conditions with train().
Option two: using torch.utils.data.Dataset:
If you're using Dataset provided by torch.utils it is very easy to do this kind of pre-processing efficiently. Dataset uses strong multi-processing acceleration by default so the the code sample above just has to be executed in the __getitem__ method of your Dataset class.
This could look like this:
def __getitem__(self, index):
'Generates one sample of data'
# Select sample
ID = self.input_tokens[index]
# Load data and get label
# using add ink_unk function from code above
X = torch.LongTensor(add_unk(ID, p=0.01))
y = self.targets[index]
return X, y
This is a bit out of context and doesn't look very elegant but I think you get the idea. According to this blog post of Shervine Amidi at Stanford it should be no problem to do more complex pre-processing steps in this function:
Since our code [Dataset is meant] is designed to be multicore-friendly, note that you
can do more complex operations instead (e.g. computations from source
files) without worrying that data generation becomes a bottleneck in
the training process.
The linked blog post - "A detailed example of how to generate your data in parallel with PyTorch" - provides also a good guide for implementing the data generation with Dataset and DataLoader.
I guess you'll prefer option one - only two lines and it should be very efficient. :)
Good luck!

Resources