Ensure the presence of a word/token/noun in Encoder-Decoder text generation deep learning models - nlp

I am stuck on a problem where I want to ensure that specific tokens/words are produced while decoding and generating abstractive-style sentences.
I am working with deep learning models such as LSTMs and transformers to generate short sentences (100-200 characters). I want certain words, such as place names or other nouns (e.g. brand names), to be present in the generated text.
I am not sure whether there has been any research on this; I couldn't find a paper on it after an extensive search.
TIA, any leads or suggestions are appreciated. :)

I am not sure, but you could try conditioning your output on those specific words. Your model can be like a seq2seq decoder, but instead of attending to the encoder outputs it attends to those specific words.
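
Conditioning like this encourages the words to appear but does not strictly guarantee it, so a cheap safety net is to generate several candidates and keep only those that contain every required word. Below is a minimal sketch of that filtering step only (not the conditioning itself). The candidate sentences and the helper names are made up for illustration, and the whitespace tokenization is an assumption that will miss words attached to punctuation.

```python
from typing import List, Sequence


def contains_all(text: str, required: Sequence[str]) -> bool:
    """Return True if every required word occurs in text (case-insensitive)."""
    tokens = text.lower().split()
    return all(word.lower() in tokens for word in required)


def filter_candidates(candidates: List[str], required: Sequence[str]) -> List[str]:
    """Keep only the candidates that mention every required word, preserving order."""
    return [c for c in candidates if contains_all(c, required)]


# Hypothetical decoder outputs (e.g. from beam search), best-scoring first.
candidates = [
    "a refreshing drink for summer",
    "try the new malibu drink this summer",
    "malibu is back",
]
kept = filter_candidates(candidates, ["malibu", "drink"])
```

If `kept` comes back empty, you would need to generate more candidates or fall back to the best unconstrained one.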

Related

Text classification using BERT - how to handle misspelled words

I am not sure if this is the best place to submit this kind of question; perhaps Cross Validated would be a better place.
I am working on a text multiclass classification problem.
I built a model based on BERT, implemented in PyTorch (Hugging Face transformers library). The model performs pretty well, except when the input sentence contains an OCR error or, equivalently, a misspelling.
For instance, if the input is "NALIBU DRINK", the BERT tokenizer generates ['na', '##lib', '##u', 'drink'] and the model's prediction is completely wrong. On the other hand, if I correct the first character so that my input is "MALIBU DRINK", the BERT tokenizer generates two tokens, ['malibu', 'drink'], and the model makes a correct prediction with very high confidence.
Is there any way to enhance the BERT tokenizer so that it can work with misspelled words?
You can leverage BERT's power to rectify the misspelled word.
The article linked below explains the process with code snippets:
https://web.archive.org/web/20220507023114/https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/
To summarize, you can identify misspelled words via a SpellChecker function and get replacement suggestions. Then, find the most appropriate replacement using BERT.
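
Here is a rough sketch of the first half of that pipeline (detect an unknown word and propose a replacement), using only the standard library. It is not the article's code: the vocabulary is a made-up domain word list, and plain string similarity from `difflib` stands in for the BERT step, which would normally choose among several suggestions using the surrounding context.

```python
import difflib
from typing import List


def correct_text(text: str, vocabulary: List[str], cutoff: float = 0.8) -> str:
    """Replace out-of-vocabulary words with their closest vocabulary entry, if any."""
    corrected = []
    for word in text.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # Candidate generation by string similarity. In the full approach a
        # masked language model such as BERT would rank several candidates.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


vocabulary = ["malibu", "drink", "vodka"]  # hypothetical domain vocabulary
fixed = correct_text("NALIBU DRINK", vocabulary)
```

The corrected string can then be passed to the tokenizer and classifier in place of the raw OCR output. Words with no close match are left unchanged.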

Is there any way to classify text based on some given keywords using python?

I have been trying to learn a bit of machine learning for a project that I'm working on. So far I have managed to classify text using an SVM with sklearn and spaCy, with some good results, but I want the text to be classified not only by the SVM but also based on a list of keywords that I have. For example, if the sentence contains the word "fast" or "seconds", I would like it to be classified as performance.
I'm really new to machine learning and would appreciate any advice.
I assume that you are already taking a portion of your data, classifying it manually and then using the result as your training data for the SVM algorithm.
If yes, then you could just append your list of keywords (features) and desired classifications (labels) to your training data. If you are not doing it already, I'd recommend using the SnowballStemmer on your training data features.
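
A minimal sketch of the "append your keywords to the training data" idea is below. The sentences, labels, and the `augment_with_keywords` helper are hypothetical; the returned lists would then be fed to whatever vectorizer and SVM you already use.

```python
from typing import Dict, List, Tuple


def augment_with_keywords(
    texts: List[str], labels: List[str], keyword_labels: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return copies of the training data with one extra example per keyword."""
    new_texts = list(texts)
    new_labels = list(labels)
    for keyword, label in keyword_labels.items():
        new_texts.append(keyword)
        new_labels.append(label)
    return new_texts, new_labels


# Hypothetical hand-labelled data and keyword list.
texts = ["the page loads slowly", "the button is hard to find"]
labels = ["performance", "usability"]
keyword_labels = {"fast": "performance", "seconds": "performance"}

train_texts, train_labels = augment_with_keywords(texts, labels, keyword_labels)
```

Note that this only nudges the classifier towards the keyword labels; it does not force a sentence containing "fast" to be labelled performance.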

Text Classification - what can you do vs. what are your capabilities?

Text classification basically works on the input training sentences. A small number of variations in the sentences works fine. But it breaks down in a scenario like
What can you do <<==>> What are your capabilities
This scenario does not work well with regular classification or bot-building platforms.
Are there any approaches to classification that would help me achieve this?
What you are trying to solve is called Semantic Textual Similarity, a known and well-studied field.
There are many different ways to solve this, whether or not your data is tagged.
For example, Google has published the Universal Sentence Encoder (code example) which is intended to tell if two sentences are similar like in your case.
Another example would be any solution you can find in Quora Question Pairs Kaggle competition.
There are also datasets for this problem; for example, you can look for SemEval STS (STS stands for Semantic Textual Similarity) or the PAWS dataset.
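
Whichever sentence encoder you pick, the comparison step usually comes down to cosine similarity between two sentence embeddings. The sketch below shows only that step; the 3-dimensional vectors are made-up placeholders, not real encoder outputs (real embeddings have hundreds of dimensions).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up embeddings standing in for the output of a sentence encoder.
emb_can_do = [0.9, 0.1, 0.2]        # "What can you do"
emb_capabilities = [0.8, 0.2, 0.3]  # "What are your capabilities"
emb_weather = [0.1, 0.9, 0.1]       # "How is the weather today"

similar = cosine_similarity(emb_can_do, emb_capabilities)
different = cosine_similarity(emb_can_do, emb_weather)
```

You would then treat two sentences as the same intent when their similarity is above some threshold that you tune on your own data.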

Do I need to provide sentences for training Spacy NER or are paragraphs fine?

I am trying to train a new spaCy model to recognize references to law articles. I start from a blank model and train the ner pipe according to the example given in the documentation.
The performance of the trained model is really poor, even with several thousand input points. I am trying to figure out why.
One possible answer is that I am giving full paragraphs to train on, instead of sentences that are in the examples. Each of these paragraphs can have multiple references to law articles. Is this a possible issue?
Turns out I was making a huge mistake in my code. There is nothing wrong with paragraphs, as long as your code actually supplies them to spaCy.
Paragraphs should be fine. Could you give an example input data point?
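
For reference, this is roughly what a paragraph-level data point with several entities looks like in the `(text, {"entities": [(start, end, label)]})` tuple format used by the spaCy v2 training examples. The regex, the `LAW_ARTICLE` label, and the sample paragraph are assumptions made up for this sketch; the regex is only there to produce character offsets without hand-counting them.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical pattern for references such as "Article 12".
ARTICLE_PATTERN = re.compile(r"Article \d+")

Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def annotate(paragraph: str, label: str = "LAW_ARTICLE") -> Annotation:
    """Build one training example with character offsets for every match."""
    entities = [
        (m.start(), m.end(), label) for m in ARTICLE_PATTERN.finditer(paragraph)
    ]
    return paragraph, {"entities": entities}


paragraph = "Under Article 5 the court may act. Article 12 sets the time limit."
example = annotate(paragraph)

# Sanity check worth running on real data: every span must slice back
# to the text you meant to label.
spans = [paragraph[start:end] for start, end, _ in example[1]["entities"]]
```

Printing `spans` for a handful of your own examples is a quick way to confirm that the annotations actually reach the training loop and line up with the text.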

Features Vectors to build classifier to detect subjectivity

I am trying to build a classifier to detect subjectivity. I have text files tagged as subjective or objective. I am a little lost on how to create features from this data. I have found lexicons for the subjective and objective tags. One thing I can do is create a feature for whether words are present in the respective dictionary, or perhaps the count of words present in the subjective and objective dictionaries. After that I intend to use naive Bayes or an SVM to build the model.
My questions are as follows:
Is my approach correct?
Can I create more features? If possible, suggest some or point me to a paper or link.
Can I run a test such as chi-squared to identify effective words from the dictionary?
You are basically on the right track. I would try applying a classifier with the features you already have and see how well it works before doing anything else.
Actually, the best way to improve your work is to google subjectivity classification papers and read them (there are quite a number of them). For example this one lists typical features for this task.
And yes, chi-squared can be used to construct dictionaries for text classification (other commonly used methods are TF-IDF, pointwise mutual information and LDA).
Also, neural network-based methods for text classification, such as paragraph vectors and dynamic convolutional neural networks with k-max pooling, have recently demonstrated state-of-the-art results on sentiment analysis, so they should probably work well for subjectivity classification too.
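
To make the chi-squared suggestion concrete, here is a small sketch that scores a word from a 2x2 contingency table of document counts and ranks words by that score. The counts and word list are invented for illustration; in practice you would count them from your tagged files (or use a library routine instead of this hand-rolled formula).

```python
from typing import Dict, List, Tuple


def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a 2x2 table of document counts.

    a: subjective docs containing the word   b: objective docs containing it
    c: subjective docs without the word      d: objective docs without it
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def rank_words(counts: Dict[str, Tuple[int, int, int, int]]) -> List[str]:
    """Sort words from most to least associated with either class."""
    return sorted(counts, key=lambda word: chi_squared(*counts[word]), reverse=True)


# Hypothetical counts over 100 subjective and 100 objective documents.
counts = {
    "table": (20, 20, 80, 80),  # equally common in both classes
    "awful": (40, 5, 60, 95),   # mostly appears in subjective documents
}
ranked = rank_words(counts)
```

Words near the top of `ranked` are the ones worth keeping as dictionary features; words scoring close to zero carry little information about the class.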
