Good-Turing Smoothing Results - nlp

My implementation of Good-Turing smoothing produced the perplexity numbers below. These don't seem correct, though. Any intuitions as to why? I am using a corpus of 1,000 movie reviews from NLTK. My implementation seems correct (reproduced below).
1gram ppl: 1057.398218919647
2gram ppl: 3262.444941553032
3gram ppl: 68.10224173098685
4gram ppl: 4.542117543343882
5gram ppl: 1.7044134004884632
def good_turing_prob(ngram_occurences,freq_of_freq,total_ngram_count):
# unseen gram
if ngram_occurences == 0:
N_1 = freq_of_freq[1]
N = total_ngram_count
return N_1/N
#ngram is present in model
else:
# take closest count if count+1 is not present
N_c_plus_1 = freq_of_freq[min(freq_of_freq, key= lambda x:abs(x-(ngram_occurences+1)))]
N_c = freq_of_freq[min(freq_of_freq, key= lambda x:abs(x-ngram_occurences))]
good_turing_count = (ngram_occurences+1) * (N_c_plus_1/N_c)
return good_turing_count/total_ngram_count

Related

What is an efficient way to make a dataset and dataloader for high frequency time series with multiple individuals?

I'm trying to forecast high frequency time series using LSTMs and PyTorch library. I'm going through PyTorch tutorial for creating custom datasets and models and figured out how to create my Dataset class and my Dataloader and they work perfectly fine but they take too much time to generate one batch.
I want to generate batches of fixed size, each batch contains time series from different individuals and the input window is of the same length as the output window (multi-step prediction).
I think the issue is due to the fact that I'm verifying the windows are correct.
My dataframe of a little bit more than 3M lines with 6 columns. I have some 100 individuals and for each individual I have 4 different time series $y_{1}$, $y_{2}$, $y_{3}$ and $y_{4}$. I have no missing values at all and the time steps are consecutive. For each individual I have the same time steps.
My code is:
class TSDataset(Dataset):
def __init__(self, train_data, unique_column = 'unique_id', input_length = 3840, target_length = 3840, targets = ['y1', 'y2', 'y3', 'y4'], transform = None):
self.train_data = train_data
self.unique_column = unique_column
self.input_length = input_length
self.target_length = target_length
self.total_window_length = input_length + target_length
self.targets = targets
def __len__(self):
return len(self.train_data)
def verify_time_steps(self, idx):
change = False
# Check if the window doesn't overlap over many individuals
num_individuals = self.train_data.iloc[np.arange(idx + self.total_window_length), :][self.unique_column].unique().shape[0]
if num_stations != 1:
change = True
if idx + self.total_window_length >= len(self.train_data):
change = True
return change
def reshuffle(self):
return np.random.randint(0, len(self.train_data))
def __getitem__(self, idx):
if torch.is_tensor(idx):
idx = idx.tolist()
change = self.verify_time_steps(idx)
if change == True:
while change != False:
idx = self.reshuffle()
change = self.verify_time_steps(idx)
sample = self.train_data.iloc[np.arange(idx, idx + self.input_length), :][self.targets].values
labels = self.train_data.iloc[np.arange(idx + self.input_length, idx + self.input_length + self.target_length), :][self.targets].values
sample = torch.from_numpy(sample)
labels = torch.from_numpy(labels)
return sample, labels
I've tried using the TimeSeriesDataset from PyTorchForecasting but I had a hard time creating models that suit it.
I've also tried creating the dataset outside, as a numpy array but my RAM can't handle it.
Hope you can help me figure out how to alleviate the computations.

How to efficient batch-process in huggingface?

I am using Huggingface library and transformers to find whether a sentence is well-formed or not. I am using a masked language model called XLMR. I first tokenize my sentence, and then mask each word of the sentence one by one, and then process the masked sentences and find the probability that the predicted masked word is right.
def calculate_scores(sent, model, tokenizer, device, print_pred=False, maskval=False):
k = 0
dic = {}
ls = tokenizer.batch_encode_plus(sent)
input_list = ls.input_ids
h=0
with torch.no_grad():
for i in tqdm(range(len(input_list))):
item = input_list[i]
real_input = item
attmask = [1]*len(item)
seg = [0]*len(item)
seglist = [seg]
masked_list = [real_input]
attlist = [attmask]
for j in range(1, len(item)-1):
input = copy.deepcopy(real_input)
input[j] = 50264
masked_list.append(input)
attlist.append(attmask)
seglist.append(seg)
inid = torch.tensor(masked_list)
segtensor = torch.tensor(seglist)
atttensor = torch.tensor(attlist)
inid=inid.to(device)
segtensor=segtensor.to(device)
output = model(inid, segtensor)
predictions_logits = output.logits
predictions = torch.softmax(predictions_logits, dim=2)
ppscore = 0
for j in range(1, len(item)-1):
ppscore = ppscore+math.log(predictions[j, j, item[j]], 2)
try:
score = math.pow(2, (-1/(len(item)-2))*ppscore)
dic[sent[i]] = score
except:
print(sent[i])
dic[sent[i]] = 10000000
# dic[sent[i]]=10000000
return dic
I will explain my code quickly. The function calculate_scores has sent as an input which is a list of sentences. I first batch encode this list of sentences. And then for each encoded sentence that I get, I generate masked sentences where only one word is masked and the rest are un-masked. Then I input these generated sentences to output and get the probability. Then I compute perplexity.
But the way I'm using this is not a very good way of utilizing GPU. I want to process multiple sentences at once but at the same time, I also need to find the perplexity scores for each sentence. How would I go about doing this?

How to remove key error from the program?

In the image you can see that i have ID still getting key error I am trying to do a recommendation algorithm so i got this error
#the first argument in the below function to be passed is the id of the book, second argument is the number of books you want to be recommended#
KeyError: <built-in function id>
I am sharing link of article https://towardsdatascience.com/recommender-engine-under-the-hood-7869d5eab072
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
ds = pd.read_csv("test1.csv") #you can plug in your own list of products or movies or books here as csv file#
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
#ngram explanation begins#
#ngram (1,3) can be explained as follows#
#ngram(1,3) encompasses uni gram, bi gram and tri gram
#consider the sentence "The ball fell"
#ngram (1,3) would be the, ball, fell, the ball, ball fell, the ball fell
#ngram explanation ends#
tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {} # dictionary created to store the result in a dictionary format (ID :
(Score,item_id))#
for idx, row in ds.iterrows(): #iterates through all the rows
# the below code 'similar_indice' stores similar ids based on cosine similarity. sorts them in ascending
order. [:-5:-1] is then used so that the indices with most similarity are got. 0 means no similarity and 1 means perfect similarity#
similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
#stores 5 most similar books, you can change it as per your needs
similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
results[row['ID']] = similar_items[1:]
#below code 'function item(id)' returns a row matching the id along with Book Title. Initially it is a dataframe, then we convert it to a list#
def item(id):
return ds.loc[ds['ID'] == id]['Book Title'].tolist()[0]
def recommend(id, num):
if (num == 0):
print("Unable to recommend any book as you have not chosen the number of book to be
recommended")
elif (num==1):
print("Recommending " + str(num) + " book similar to " + item(id))
else :
print("Recommending " + str(num) + " books similar to " + item(id))
print("----------------------------------------------------------")
recs = results[id][:num]
for rec in recs:
print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")
#the first argument in the below function to be passed is the id of the book, second argument is the number of books you want to be recommended#
recommend(5,2)
i have try and run successfully till results variable then getting error.
because python default id keyword is called when you call "def item(id):"
instead of id you have to declare another identifier....then i think this is the only reason for keyerror..
As the error suggests id is an build-in function in python-3. So if you change the name of the parameters id in def item(id) and def recommend(id, num) and all their references then the code should work.
After changing the id and correcting the indentation, an example could look like this:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
ds = pd.read_csv("test1.csv") # you can plug in your own list of products or movies or books here as csv file
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
# ngram explanation begins#
# ngram (1,3) can be explained as follows#
# ngram(1,3) encompasses uni gram, bi gram and tri gram
# consider the sentence "The ball fell"
# ngram (1,3) would be the, ball, fell, the ball, ball fell, the ball fell
# ngram explanation ends#
tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {} # dictionary created to store the result in a dictionary format (ID : (Score,item_id))
for idx, row in ds.iterrows(): # iterates through all the rows
# the below code 'similar_indice' stores similar ids based on cosine similarity. sorts them in ascending
# order. [:-5:-1] is then used so that the indices with most similarity are got. 0 means no similarity and
# 1 means perfect similarity
similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
# stores 5 most similar books, you can change it as per your needs
similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
results[row['ID']] = similar_items[1:]
# below code 'function item(id)' returns a row matching the id along with Book Title. Initially it is a dataframe,
# then we convert it to a list#
def item(ID):
return ds.loc[ds['ID'] == ID]['Book Title'].tolist()[0]
def recommend(ID, num):
if num == 0:
print("Unable to recommend any book as you have not chosen the number of book to be recommended")
elif num == 1:
print("Recommending " + str(num) + " book similar to " + item(ID))
else:
print("Recommending " + str(num) + " books similar to " + item(ID))
print("----------------------------------------------------------")
recs = results[ID][:num]
for rec in recs:
print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")
# the first argument in the below function to be passed is the id of the book, second argument is the number of books
# you want to be recommended
recommend(5, 2)

How to separate two concatenaded words

I have a review dataset and I want to process it using NLP techniques. I did all the preprocessing stages (remove stop words, stemming, etc.). My problem is that there are some words, which are connected to each other and my function doesn't understand those. Here is an example:
Great services. I had a nicemeal and I love it a lot.
How can I correct it from nicemeal to nice meal?
Peter Norvig has a nice solution to the word segmentation problem that you are encountering. Long story short, he uses a large dataset of word (and bigram) frequencies and some dynamic programming to split long strings of connected words into their most likely segmentation.
You download the zip file with the source code and the word frequencies and adapt it to your use case. Here is the relevant bit, for completeness.
def memo(f):
"Memoize function f."
table = {}
def fmemo(*args):
if args not in table:
table[args] = f(*args)
return table[args]
fmemo.memo = table
return fmemo
#memo
def segment(text):
"Return a list of words that is the best segmentation of text."
if not text: return []
candidates = ([first]+segment(rem) for first,rem in splits(text))
return max(candidates, key=Pwords)
def splits(text, L=20):
"Return a list of all possible (first, rem) pairs, len(first)<=L."
return [(text[:i+1], text[i+1:])
for i in range(min(len(text), L))]
def Pwords(words):
"The Naive Bayes probability of a sequence of words."
return product(Pw(w) for w in words)
#### Support functions (p. 224)
def product(nums):
"Return the product of a sequence of numbers."
return reduce(operator.mul, nums, 1)
class Pdist(dict):
"A probability distribution estimated from counts in datafile."
def __init__(self, data=[], N=None, missingfn=None):
for key,count in data:
self[key] = self.get(key, 0) + int(count)
self.N = float(N or sum(self.itervalues()))
self.missingfn = missingfn or (lambda k, N: 1./N)
def __call__(self, key):
if key in self: return self[key]/self.N
else: return self.missingfn(key, self.N)
def datafile(name, sep='\t'):
"Read key,value pairs from file."
for line in file(name):
yield line.split(sep)
def avoid_long_words(key, N):
"Estimate the probability of an unknown word."
return 10./(N * 10**len(key))
N = 1024908267229 ## Number of tokens
Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
You can also use the segment2 method as it uses bigrams and is much more accurate.

Training procedure

I noticed that the sklearn.linear_model.SGDClassifier implements gradient descent for a linear model, therefore one could state that the class combines the fitting procedure (SGD) and the model (linear model) in one class.
SGD is, however, not inherent to linear models, and linear_models can be trained using many other optimizations with particular pro's (memory usage, convergence speed, local optima avoidance*, ...). One could state that such optimization techniques implement how to iterate over the training data, whether to do it online or offline, in what feature-dimension to apply an update and when to stop (possibly based on an error function of a validation set).
In particular, I implemented a model using theano and wrapped it in a fit/predict interface. Theano is cool because it allows one to define a callable which applies gradient descent on one sample, or on a set of samples, as well as a callable which returns the error on a validation set. But this coolness is not inherent to theano, a lot more models can simply define an update and error-evaluation function, which can then be used by different iteration- and stopping policies for fitting.
The theano examples often use minibatch, and the minibatch code is copy-pasted or reimplemented a lot with just minor adjustments which can easily be factored out. So I was hoping that sklearn implements something that you initialize with some parameters and an update/error callable to fit 'any model'. Or possible there is some good practice on how to do this yourself (especially w.r.t. the interface of the fitter).
Is there anything like this (in sklearn), that is Fitters which do not define the model?
*In the particular case of linear models and a l2 cost function, local optima do not exist of course, but still.
EDIT
Fair enough, this calls for a suggestion. I coded these two classes, which are not 100% clean, but they given an idea of what I mean:
import numpy
class StochasticUpdate():
def __init__(self, model, update, n_epochs, n_data_points, error=None, test_fraction=None):
self.update = update
self.n_epochs = n_epochs
self.n_data_points = n_data_points
self.error = error
self.model = model
if self.error is None and test_fraction is not None:
raise ValueError('error parameter must be specified if a test_faction (value: %s) should be used.' % test_fraction)
self.do_test = test_fraction is not None
self.n_train_samples = int(n_data_points - test_fraction) if self.do_test else n_data_points
if self.do_test:
self.test_range = numpy.arange(self.n_train_samples, n_data_points)
self.n_test_samples = int(n_data_points * test_fraction)
self.train_range = numpy.arange(0, self.n_train_samples)
def fit(self):
if self.do_test: self.test_errors = []
self.train_errors = []
self.mean_cost_values = []
for epoch in range(self.n_epochs):
order = numpy.random.permutation(self.n_train_samples)
mean_cost_value = 0
for i in range(self.n_train_samples):
mean_cost_value += self.update([order[i]])
self.mean_cost_values.append(mean_cost_value/ self.n_data_points)
if self.error is not None:
self.train_errors.append(self.error(self.train_range))
if self.do_test:
self.test_errors.append(self.error(self.test_range))
return self.model
from math import ceil
class MinibatchStochasticUpdate(StochasticUpdate):
def __init__(self, model, update, n_epochs, n_data_points, error, batch_size, patience=5000, patience_increase=2,
improvement_threshold = 0.995, validation_frequency=None, validate_faction=0.1, test_fraction=None):
super().__init__(self, model, update, n_data_points, error, test_fraction)
self.update = update
self.n_epochs = n_epochs
self.n_data_points = n_data_points
self.model = model
self.batch_size = batch_size
self.patience = patience
self.patience_increase = patience_increase
self.improvement_threshold = improvement_threshold
self.n_validation_samples = int(n_data_points * validate_faction)
self.validation_range = numpy.arange(self.n_train_samples, self.n_train_samples + self.n_validation_samples)
self.n_train_batches = int(ceil(n_data_points / batch_size))
self.n_train_batches = int(ceil(self.n_train_samples / self.batch_size))
self.train_batch_ranges = [
numpy.arange(minibatch_index * self.batch_size, min((minibatch_index+1) * self.batch_size, self.n_train_samples))
for minibatch_index in range(self.n_train_batches)
]
self.validation_frequency = min(self.n_train_batches, patience/2) if validation_frequency is None else validation_frequency
def fit(self):
self.best_validation_error = numpy.inf
best_params = None
iteration = 0
for epoch in range(self.n_epochs):
for minibatch_index in range(self.n_train_batches):
self.update(self.train_batch_ranges[minibatch_index])
if (iter + 1) % self.validation_frequency == 0:
current_validation_error = self.error(self.validation_error)
if current_validation_error < self.best_validation_error:
if current_validation_error < self.best_validation_error * self.improvement_threshold:
patience = max(self.patience, iter * self.patience_increase)
best_params = self.model.copy_parameters()
self.best_validation_error = current_validation_error
if iteration <= patience:
self.model.set_parameters(best_params)
return self.model
iteration += 1
self.model.set_parameters(best_params)
return self.model
Then for in the fit of the model one could support different training approaches and stopping criteria like this:
def fit(self, X, y):
X_shared = theano.shared(X, borrow=True)
y_shared = theano.shared(y, borrow=True)
learning_rate = self.training_method_options['learning_rate']
trainer = {
'stochastic_gradient_descent': lambda: StochasticUpdate(
self,
update=self.update_stochastic_gradient_descent_function(X_shared, y_shared, learning_rate),
n_epochs=self.training_method_options['n_epochs'],
n_data_points=X.shape[0],
error=self.evaluation_function(X_shared, y_shared),
),
'minibatch_gradient_descent': lambda: MinibatchStochasticUpdate(
self,
update=self.update_stochastic_gradient_descent_function(X_shared, y_shared, learning_rate),
n_epochs=self.training_method_options['n_epochs'],
n_data_points=X.shape[0],
error=self.evaluation_function(X_shared, y_shared),
batch_size=self.training_method_options['batch_size']
)
}[self.training_method]()
trainer.fit()
return self
Obviosuly the hash-map part is hacky, and could be done more elegantly using a standardized interface for the two classes above (since the hashmaps are still O(N*M) in size for N fitters and M models).

Resources