Define and use a new smoothing method in nltk language models - nlp

I'm trying to provide and test a new smoothing method for language models. I'm using the nltk tools and don't want to redefine everything from scratch, so is there any way to define and use my own smoothing method in nltk models?
Edit:
I'm trying to do something like this:
def my_smoothing_method(model):
    # some code using the model's (MLE) counts
    ...

model = nltk.lm.MLE(n, smoothing_method=my_smoothing_method)
model.fit(train)

Here you can see the definition of MLE. As you can see, there is no option for a smoothing function (but there are other models in the same file; perhaps one of them fits your needs?).
The InterpolatedLanguageModel (see same file above) does accept a smoothing class, which needs to implement alpha_gamma(word, context) and unigram_score(word) and be a subclass of Smoothing:
model = nltk.lm.InterpolatedLanguageModel(smoothing_cls=MySmoothing, order=order)
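If that interpolated route works for you, a custom smoothing class only needs those two methods. Here is a minimal sketch, assuming an add-k style interpolation; the class name MySmoothing and the constant k are illustrative, not part of nltk:

from nltk.lm.api import Smoothing

class MySmoothing(Smoothing):
    """Hypothetical add-k style interpolation smoothing (illustrative only)."""

    def __init__(self, vocabulary, counter, k=0.5):
        super().__init__(vocabulary, counter)
        self.k = k

    def unigram_score(self, word):
        # Relative frequency of the word among all unigrams
        return self.counts.unigrams.freq(word)

    def alpha_gamma(self, word, context):
        counts = self.counts[context]
        total = counts.N() + self.k * len(self.vocab)
        # Mass kept for the observed higher-order count ...
        alpha = counts[word] / total
        # ... and mass handed down to the lower-order distribution
        gamma = (self.k * len(self.vocab)) / total
        return alpha, gamma

# Usage sketch:
# from nltk.lm.models import InterpolatedLanguageModel
# model = InterpolatedLanguageModel(smoothing_cls=MySmoothing, order=2)
# model.fit(train_data, vocab)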
So if you really need to add functionality to the MLE class, you could do something like this, but I am not sure it is a good idea:
class MLE_with_smoothing(LanguageModel):
    """Class for providing MLE ngram model scores.

    Inherits initialization from BaseNgramModel.
    """

    def unmasked_score(self, word, context=None):
        """Returns the MLE score for a word given a context.

        Args:
        - word is expected to be a string
        - context is expected to be something reasonably convertible to a tuple
        """
        freq = self.context_counts(context).freq(word)
        # Do some smoothing here before returning the score
        return freq
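Usage would then mirror nltk.lm.MLE. A rough sketch, assuming the smoothing logic above has been filled in (the toy corpus is made up):

from nltk.lm.preprocessing import padded_everygram_pipeline

train_sents = [["a", "b", "c"], ["a", "c", "b"]]  # toy corpus
train_data, vocab = padded_everygram_pipeline(2, train_sents)

lm = MLE_with_smoothing(2)
lm.fit(train_data, vocab)
print(lm.score("b", ("a",)))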

Related

Serve online learning models with mlflow

It is not clear to me if one could use mlflow to serve a model that is evolving continuously based on its previous predictions.
I need to be able to query a model in order to make a prediction on a sample of data, which is the basic use of mlflow serve. However, I also want the model to be updated internally now that it has seen new data.
Is this possible, or does it need a feature request?
I think you should be able to do that by implementing a custom Python model or custom flavor, as described in the documentation. In this case you need to create a class that inherits from mlflow.pyfunc.PythonModel and implements the predict method, and inside that method you're free to do anything. Here is a simple example from the documentation:
import mlflow.pyfunc

class AddN(mlflow.pyfunc.PythonModel):
    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        return model_input.apply(lambda column: column + self.n)
This model can then be saved and loaded again just like any normal model:
# Construct and save the model
model_path = "add_n_model"
add5_model = AddN(n=5)
mlflow.pyfunc.save_model(path=model_path, python_model=add5_model)
# Load the model in `python_function` format
loaded_model = mlflow.pyfunc.load_model(model_path)
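For the online-learning part of the question, the same mechanism could wrap an incrementally trainable estimator. A rough sketch, where the partial_fit-capable estimator and the ground-truth column named "y" are assumptions rather than part of the mlflow API; note that the updated state lives only in the serving process unless you explicitly re-save the model:

import mlflow.pyfunc

class OnlineModel(mlflow.pyfunc.PythonModel):
    def __init__(self, fitted_estimator):
        # e.g. a pre-fitted sklearn SGDRegressor (anything with partial_fit)
        self.estimator = fitted_estimator

    def predict(self, context, model_input):
        # model_input arrives as a pandas DataFrame when served
        has_labels = "y" in model_input.columns
        X = model_input.drop(columns=["y"]) if has_labels else model_input
        preds = self.estimator.predict(X)
        if has_labels:
            # Incremental update with the newly observed data
            self.estimator.partial_fit(X, model_input["y"])
        return preds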

Why do we need the custom dataset class and use of __getitem__ method in NLP, BERT fine tuning etc

I am a newbie in NLP and have been studying the usage of BERT for NLP tasks. In many notebooks, I see that a custom dataset class is defined and a __getitem__ method is defined (along with __len__):
Tweetdataset class in this notebook - https://www.kaggle.com/abhishek/roberta-inference-5-folds
and text_Dataset class in this notebook - https://engineering.wootric.com/when-bert-meets-pytorch
Can someone please explain why a custom dataset class and the __getitem__ (and __len__) methods need to be defined? Thank you.
It is a recommended abstraction in PyTorch to define datasets by inheriting from torch.utils.data.Dataset. Those objects define how many elements there are (the __len__ method) and how to get a single item via a specified index (__getitem__(index)).
Its source code:
class Dataset(object):
    def __getitem__(self, index):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])
So it's basically a thin wrapper which adds the possibility to concatenate two Dataset objects. For readability and API compatibility you should inherit from it (unlike the one provided in the Kaggle notebook).
You can read more about PyTorch's data functionality here
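To make this concrete for the BERT use case, a dataset typically tokenizes one example per index and returns tensors, and the DataLoader then batches them. A minimal sketch; the transformers tokenizer call and the field names are common-practice assumptions, not taken from the linked notebooks:

import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        # Tells the DataLoader how many samples there are
        return len(self.texts)

    def __getitem__(self, index):
        # Tokenize a single example on demand and return tensors
        enc = self.tokenizer(
            self.texts[index],
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "label": torch.tensor(self.labels[index]),
        }

# loader = DataLoader(TextDataset(texts, labels, tokenizer), batch_size=16, shuffle=True)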

example of doing simple prediction with pytorch-lightning

I have an existing model where I load some pre-trained weights and then do prediction (one image at a time) in pytorch. I am trying to basically convert it to a pytorch lightning module and am confused about a few things.
So currently, my __init__ method for the model looks like this:
self._load_config_file(cfg_file)
# just creates the pytorch network
self.create_network()
self.load_weights(weights_file)
self.cuda(device=0) # assumes GPU and uses one. This is probably suboptimal
self.eval() # prediction mode
From what I can gather from the lightning docs, I can pretty much do the same, except not make the cuda() call. So something like:
self.create_network()
self.load_weights(weights_file)
self.freeze() # prediction mode
So, my first question is whether this is the correct way to use lightning. How would lightning know if it needs to use the GPU? I am guessing this needs to be specified somewhere.
Now, for the prediction, I have the following setup:
def infer(frame):
    img = transform(frame)  # apply some transformation to the input
    img = torch.from_numpy(img).float().unsqueeze(0).cuda(device=0)
    with torch.no_grad():
        output = self.__call__(Variable(img)).data.cpu().numpy()
    return output
This is the bit that has me confused. Which functions do I need to override to make a lightning compatible prediction?
Also, at the moment, the input comes as a numpy array. Is that something that would be possible from the lightning module or do things always have to use some sort of a dataloader?
At some point, I want to extend this model implementation to do training as well, so want to make sure I do it right but while most examples focus on training models, a simple example of just doing prediction at production time on a single image/data point might be useful.
I am using 0.7.5 with pytorch 1.4.0 on GPU with cuda 10.1
LightningModule is a subclass of torch.nn.Module so the same model class will work for both inference and training. For that reason, you should probably call the cuda() and eval() methods outside of __init__.
Since it's just a nn.Module under the hood, once you've loaded your weights you don't need to override any methods to perform inference, simply call the model instance. Here's a toy example you can use:
import torchvision.models as models
from pytorch_lightning.core import LightningModule

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.resnet = models.resnet18(pretrained=True, progress=False)

    def forward(self, x):
        return self.resnet(x)

model = MyModel().eval().cuda(device=0)
And then to actually run inference you don't need a method, just do something like:
for frame in video:
    img = transform(frame)
    img = torch.from_numpy(img).float().unsqueeze(0).cuda(0)
    output = model(img).data.cpu().numpy()
    # Do something with the output
The main benefit of PyTorch Lightning is that you can also use the same class for training by implementing training_step(), configure_optimizers() and train_dataloader() on that class. You can find a simple example of that in the PyTorch Lightning docs.
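For reference, here is a rough sketch of what that could look like, reusing the MyModel class above; the loss, optimizer, and dummy data below are illustrative choices, not taken from the linked docs:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class MyTrainableModel(MyModel):
    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        # Dummy data just to keep the sketch self-contained
        x = torch.randn(32, 3, 224, 224)
        y = torch.randint(0, 1000, (32,))
        return DataLoader(TensorDataset(x, y), batch_size=8)

# trainer = pytorch_lightning.Trainer(max_epochs=1, gpus=1)
# trainer.fit(MyTrainableModel())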
Even though the above answer suffices, if one takes note of the following line
img = torch.from_numpy(img).float().unsqueeze(0).cuda(0)
one has to put both the model and the image on the right GPU. On a multi-GPU inference machine, this becomes a hassle.
To solve this, .predict was also recently introduced; see more at https://pytorch-lightning.readthedocs.io/en/stable/deploy/production_basic.html

what is the difference between transformer and estimator in sklearn?

I saw both transformer and estimator were mentioned in the sklearn documentation.
Is there any difference between these two words?
The basic difference is that a:
Transformer transforms the input data (X) in some ways.
Estimator predicts a new value (or values) (y) by using the input data (X).
Both the Transformer and Estimator should have a fit() method which can be used to train them (they learn some characteristics of the data). The signature is:
fit(X, y)
fit() returns the estimator itself (self); the learnt attributes are stored inside the object.
Here X represents the samples (feature vectors) and y is the target vector (which may have single or multiple values per corresponding sample in X). Note that y can be optional in some transformers where it's not needed, but it's mandatory for most estimators (supervised estimators). Look at StandardScaler for example: it needs the initial data X for finding the mean and std of the data (it learns the characteristics of X; y is not needed).
Each Transformer should have a transform(X) method which, like fit(), takes the input X and returns a new transformed version of X (which generally has the same number of samples but may or may not have the same features).
On the other hand, an Estimator should have a predict(X) method which outputs the predicted value of y from the given X.
There are some classes in scikit-learn which implement both transform() and predict(), like KMeans; in that case, carefully reading the documentation should resolve your doubts.
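A small illustration of the two interfaces (the data below is made up):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler().fit(X)             # transformer: learns mean_ and scale_
X_scaled = scaler.transform(X)               # returns a transformed X

clf = LogisticRegression().fit(X_scaled, y)  # estimator/predictor: needs y
preds = clf.predict(X_scaled)                # returns predicted y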
A Transformer is a type of Estimator that implements the transform method.
Let me support that statement with examples I have come across in sklearn implementation.
Class sklearn.preprocessing.FunctionTransformer:
This inherits from two other classes, TransformerMixin and BaseEstimator.
Class sklearn.preprocessing.PowerTransformer:
This also inherits from TransformerMixin and BaseEstimator.
From what I understand, Estimators just take data, do some processing, and store data based on the logic implemented in their fit method.
Note: Estimators aren't used to predict values directly. They don't even have a predict method in them.
Before I explain the above statement further, let me tell you about mixin classes.
Mixin class: these are classes that implement the mix-in design pattern. Wikipedia has a very good explanation of it; you can read it here. To summarise, these are classes you write that have methods which can be used in many different classes, so you write them in one class and just inherit them in many different classes (a form of composition; read these links - Link1 Link2).
In Sklearn there are many mixin classes. To name a few
ClassifierMixin, RegressorMixin, TransformerMixin.
Here, TransformerMixin is the class that's inherited by every Transformer used in sklearn. The TransformerMixin class has only one method which is reusable in every transformer, and that is fit_transform.
All transformers inherit two classes: BaseEstimator (which provides get_params and set_params) and TransformerMixin (which provides fit_transform). In addition, each transformer implements its own fit and transform methods based on its functionality.
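A minimal sketch of a custom transformer following that pattern; the class name and the multiply-by-a-constant logic are purely illustrative:

from sklearn.base import BaseEstimator, TransformerMixin

class ScaleBy(BaseEstimator, TransformerMixin):
    def __init__(self, factor=2.0):
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn here; real transformers store learnt attributes
        # in fit (e.g. StandardScaler stores mean_ and scale_)
        return self

    def transform(self, X):
        return X * self.factor

# fit_transform comes for free from TransformerMixin:
# ScaleBy(3).fit_transform(np.array([[1.0], [2.0]]))  # -> [[3.], [6.]]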
I guess that gives an answer to your question. Now, let me come back to the statement I made regarding Estimators and prediction.
Every model class has its own predict method that does prediction.
Consider LinearRegression, KNeighborsClassifier, or any other model class. They all have a predict function declared in them. That is what is used for prediction, not the Estimator base class.
The sklearn usage is perhaps a little unintuitive, but "estimator" doesn't mean anything very specific: basically everything is an estimator.
From the sklearn glossary:
estimator:
An object which manages the estimation and decoding of a model...
Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator.
transformer:
An estimator supporting transform and/or fit_transform...
As in #VivekKumar's answer, I think there's a tendency to use the word estimator for what sklearn instead calls a "predictor":
An estimator supporting predict and/or fit_predict. This encompasses classifier, regressor, outlier detector and clusterer...

How to add custom slangs into spaCy's norm_exceptions.py module?

SpaCy's documentation has some information on adding new slangs here.
However, I'd like to know:
(1) When should I call the following function?
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS)
The typical usage of spaCy, according to the introduction guide here, is as follows:
import spacy
nlp = spacy.load('en')
# Should I call the function add_lookups(...) here?
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
(2) When in the processing pipeline are norm exceptions handled?
I'm assuming a typical pipeline as such: tokenizer -> tagger -> parser -> ner.
Are norm exceptions handled right before the tokenizer? And also, how is the norm exceptions component organized with respect to the other pre-processing components such as stop words, lemmatizer (see full list of components here)? What comes before what?
I'm new to spaCy and any help would be much appreciated. Thanks!
The norm exceptions are part of the language data, and the attribute getter (the function that takes a text and returns the norm) is initialised with the language class, e.g. English. You can see an example of this here. This all happens before the pipeline is even constructed.
The assumption here is that the norm exceptions are usually language-specific and should thus be defined in the language data, independent of the processing pipeline. Norms are also lexical attributes, so their getters live on the underlying lexeme, the context-insensitive entry in the vocabulary (as opposed to a token, which is the word in context).
However, the nice thing about the token.norm_ is that it's writeable – so you can easily add a custom pipeline component that looks up the token's text in your own dictionary, and overwrites the norm if necessary:
def add_custom_norms(doc):
    for token in doc:
        if token.text in YOUR_NORM_DICT:
            token.norm_ = YOUR_NORM_DICT[token.text]
    return doc

nlp.add_pipe(add_custom_norms, last=True)
Keep in mind that the NORM attribute is also used as a feature in the model, so depending on the norms you want to add or overwrite, you might want to only apply your custom component after the tagger, parser or entity recognizer is called.
For example, by default, spaCy normalises all currency symbols to "$" to ensure that they all receive similar representations, even if one of them is less frequent in the training data. If your custom component now overwrites "€" with "Euro", this will also have an impact on the model's predictions. So you might see less accurate predictions for MONEY entities.
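For example, using the same v2-style API as the snippet above, the custom component could be registered to run only after the statistical pipes:

nlp.add_pipe(add_custom_norms, after="ner")  # runs after the entity recognizer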
If you're planning on training your own model that takes your custom norms into account, you might want to consider implementing a custom language subclass. Alternatively, if you think that the slang terms you want to add should be included in spaCy by default, you can always submit a pull request, for example to the English norm_exceptions.py.
