Training a sentence tokenizer in spaCy

I'm trying to tokenize sentences using spacy.
The text includes lots of abbreviations and comments that end with a period. Also, the text was obtained with OCR, so there are sometimes line breaks in the middle of sentences. spaCy doesn't seem to perform very well in these situations.
I have extracted some examples of how I want these sentences to be split. Is there any way to train spacy's sentence tokenizer?

spaCy is a little unusual in that the default sentence segmentation comes from the dependency parser, so you can't train a sentence boundary detector directly as such. You can, however, add your own custom component to the pipeline or pre-insert some boundaries that the parser will respect. See their documentation with examples: Spacy Sentence Segmentation
For the cases you're describing it would potentially also be useful to be able to specify that a particular position is NOT a sentence boundary, but as far as I can tell that's not currently possible.
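A minimal sketch of the custom-component route, assuming spaCy v3's @Language.component API; the component name and the blank-line rule are just placeholders to adapt to whatever your extracted examples show:

import spacy
from spacy.language import Language

@Language.component("extra_boundaries")
def extra_boundaries(doc):
    # Placeholder rule: treat a blank line in the OCR output as a certain
    # sentence break. Replace with whatever pattern is reliable in your data.
    for token in doc[:-1]:
        if token.text.count("\n") > 1:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("extra_boundaries", before="parser")  # must run before the parser

doc = nlp("One sentence ends here.\n\nAnother one, containing an abbrev. like e.g. this, follows.")
print([sent.text for sent in doc.sents])

The parser will keep any boundaries set this way and only add its own boundaries elsewhere.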

Related

Why is FastText not handling multi-word phrases?

FastText pre-trained model works great for finding similar words:
from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)
[('dogs', 0.8463464975357056),
('puppy', 0.7873005270957947),
('pup', 0.7692237496376038),
('canine', 0.7435278296470642),
...
However, it seems to fail for multi-word phrases, e.g.:
model.nearest_neighbors('Gone with the Wind', k=2000)
[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
0.71047443151474),
or
model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo',
0.5197194218635559),
Is it a limitation of FastText pre-trained models?
I'm not aware of FastText having any special ability to handle multi-word phrases.
So I expect your query is being interpreted as one long word that's not in the model, and which includes many character n-grams that include ' ' space characters.
And, as I don't expect the training data had any such n-grams with spaces, all such n-grams' vectors will be arbitrarily-random collisions in the model's n-gram buckets. Thus any synthesized out-of-vocabulary vector for such a 'word' is likely to be even noisier than the usual OOV vectors.
But also: the pyfasttext wrapper is an abandoned unofficial interface to FastText that hasn't been updated in over 2 years, and it carries this message on its PyPI page:
Warning! pyfasttext is no longer maintained: use the official Python binding from the fastText repository: https://github.com/facebookresearch/fastText/tree/master/python
You may find better results using it instead. See its doc/examples folder for example code showing how it can be queried for nearest-neighbors, and also consider its get_sentence_vector() as a way to split a string into words whose vectors are then averaged, rather than just treating the string as one long OOV word.
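For example, a quick sketch with the official binding (assuming the cc.en.300.bin file is available locally) might look like:

import fasttext

model = fasttext.load_model("cc.en.300.bin")

# Single-word neighbors work as before.
print(model.get_nearest_neighbors("dog", k=5))

# For a multi-word phrase, average the word vectors instead of treating the
# whole string (spaces included) as one out-of-vocabulary token.
print(model.get_sentence_vector("Gone with the Wind")[:5])

# The model's vocabulary contains no entries with spaces in them.
print(any(" " in w for w in model.words))  # expected: False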
As described in the documentation, official fastText unsupervised embeddings are built after a phase of tokenization, in which the words are separated.
If you look at your model vocabulary (model.words in the official python binding), you won't find multi-word phrases containing spaces.
Therefore, as pointed out by gojomo, the generated vectors are synthetic, artificial and noisy; you can deduce it from the result of your queries.
In essence, fastText official embeddings are not suitable for this task.
In my experience this does not depend on the version / wrapper used.

Is it necessary to do stopword removal and stemming/lemmatization for text classification while using spaCy or BERT?

Is stopword removal, stemming, and lemmatization necessary for text classification when using spaCy, BERT, or other advanced NLP models to get the vector embedding of the text?
text="The food served in the wedding was very delicious"
1. Since spaCy and BERT were trained on huge raw datasets, is there any benefit to applying stopword removal, stemming, and lemmatization to this text before generating the embedding with BERT/spaCy for a text classification task?
2. I can understand that stopword removal, stemming, and lemmatization help when we use CountVectorizer or a TF-IDF vectorizer to get embeddings of sentences.
You can test whether stemming, lemmatization, and stopword removal help. They don't always. I usually do it if I'm going to graph the results, since stopwords clutter them up.
A case for not removing stopwords
Keeping stopwords preserves context about the user's intent, which matters when you use a contextual model like BERT. In such models all stopwords are kept to provide enough context information, including negation words (not, nor, never) that are usually considered stopwords.
According to https://arxiv.org/pdf/1904.07531.pdf
"Surprisingly, the stopwords received as much attention as non-stop words, but removing them has no effect inMRR performances. "
With BERT you don't preprocess the texts: otherwise you either lose context (stemming, lemmatization) or change the texts outright (stopword removal).
Some more basic models (rule-based or bag-of-words) would benefit from some processing, but you must be very careful with stop words removal: many words that change the meaning of an entire sentence are stop words (not, no, never, unless).
Do not remove stopwords, as they add information (context-awareness) to the sentence; this applies to tasks like text summarization, machine/language translation, language modeling, and question answering.
Remove stopwords if we only want the general idea of the sentence; this applies to tasks like sentiment analysis, language/text classification, spam filtering, caption generation, auto-tag generation, and topic/document classification.
It's not mandatory. Removing stopwords can sometimes help and sometimes not. You should try both.
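If you want to try both, here is a minimal sketch, assuming spaCy's en_core_web_sm model; preprocess() is just a hypothetical helper for producing the two variants you would feed to your classifier:

import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text, remove_stopwords=True, lemmatize=True):
    doc = nlp(text)
    kept = []
    for token in doc:
        if remove_stopwords and token.is_stop:
            continue  # drop stopwords when requested
        kept.append(token.lemma_ if lemmatize else token.text)
    return " ".join(kept)

text = "The food served in the wedding was very delicious"
print(preprocess(text, remove_stopwords=False, lemmatize=False))  # raw tokens
print(preprocess(text))  # stopwords removed, lemmatized

Training one classifier on each output and comparing scores is the simplest way to settle the question for your own data.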

Text classification using BERT - how to handle misspelled words

I am not sure if this is the best place to submit this kind of question; perhaps Cross Validated would be a better place.
I am working on a text multiclass classification problem.
I built a model based on BERT concept implemented in PyTorch (huggingface transformer library). The model performs pretty well, except when the input sentence has an OCR error or equivalently it is misspelled.
For instance, if the input is "NALIBU DRINK" the Bert tokenizer generates ['na', '##lib', '##u', 'drink'] and model's prediction is completely wrong. On the other hand, if I correct the first character, so my input is "MALIBU DRINK", the Bert tokenizer generates two tokens ['malibu', 'drink'] and the model makes a correct prediction with very high confidence.
Is there any way to enhance Bert tokenizer to be able to work with misspelled words?
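For reference, the tokenization described above can be reproduced with the Hugging Face tokenizer (assuming a bert-base-uncased checkpoint, which the question doesn't name explicitly):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("NALIBU DRINK"))  # e.g. ['na', '##lib', '##u', 'drink']
print(tokenizer.tokenize("MALIBU DRINK"))  # e.g. ['malibu', 'drink']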
You can leverage BERT's power to rectify the misspelled word.
The article linked below explains the process nicely, with code snippets:
https://web.archive.org/web/20220507023114/https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/
To summarize, you can identify misspelled words via a SpellChecker function and get replacement suggestions. Then, find the most appropriate replacement using BERT.
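A rough sketch of that idea, assuming the pyspellchecker package and a bert-base-uncased fill-mask pipeline (this is not the article's exact code, and correct() is a hypothetical helper):

from spellchecker import SpellChecker
from transformers import pipeline

spell = SpellChecker()
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def correct(text):
    words = text.lower().split()
    misspelled = spell.unknown(words)
    for i, word in enumerate(words):
        if word not in misspelled:
            continue
        # Spelling-based replacement suggestions (fall back to the word itself).
        candidates = spell.candidates(word) or {word}
        # Let BERT score each candidate in context via the [MASK] slot.
        masked = " ".join(words[:i] + ["[MASK]"] + words[i + 1:])
        best = fill_mask(masked, targets=list(candidates))[0]
        words[i] = best["token_str"]
    return " ".join(words)

print(correct("nalibu drink"))

How well this works depends on the spell checker's dictionary and on the candidates being in (or close to) BERT's vocabulary, so treat it as a starting point rather than a finished solution.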

Can I tokenize using spaCy and then extract vectors for these tokens using pre-trained fastText word embeddings?

I am tokenizing my text corpus, which is in German, using spaCy's German model.
Since spaCy currently only has a small German model, I am unable to extract the word vectors using spaCy itself.
So, I am using fastText's pre-trained word embeddings from here: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning
Now, Facebook used the ICU tokenizer for the tokenization step before training these word embeddings, and I am using spaCy.
Can someone tell me if this is okay?
I feel spaCy and the ICU tokenizer might behave differently, and if so, many tokens in my text corpus would not have a corresponding word vector.
Thanks for your help!
UPDATE:
I tried the above method and after extensive testing, I found that this works well for my use case.
Most (almost all) of the tokens in my data matched the tokens present in fastText, and I was able to obtain the word vector representations for them.
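For anyone trying the same thing, a minimal sketch of this pipeline (the model names de_core_news_sm and cc.de.300.bin are just examples of a small spaCy German model and a fastText German binary):

import spacy
import fasttext

nlp = spacy.load("de_core_news_sm")
ft = fasttext.load_model("cc.de.300.bin")

doc = nlp("Das ist ein kurzer Beispielsatz.")
for token in doc:
    # get_word_vector falls back to character n-grams for tokens that are not
    # in the fastText vocabulary, so every spaCy token gets some vector.
    vector = ft.get_word_vector(token.text)
    print(token.text, vector[:3])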

Stanford NLP: punctuation error identification

I've just started working with Stanford NLP core.
My problem is that many of the sentences in my corpus do not end with a period (full stop).
Frankly, a bit of string parsing with regular expressions could probably fix the issue, but with some degree of error.
I am curious as to whether Stanford NLP can identify missing periods.
It looks like edu.stanford.nlp.process.DocumentPreprocessor can be used to split paragraphs into sentences, though I am not sure how well it works without proper punctuation.
There are many other sentence-level tokenizers you can use to preprocess your corpus. Check out NLTK's nltk.tokenize.punkt module, which uses an unsupervised ML algorithm to produce sentence tokens in the absence of good capitalization/punctuation.
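A small sketch of that approach (the corpus.txt path is a placeholder): train an unsupervised Punkt model on your own corpus, then use it to split new text:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

with open("corpus.txt", encoding="utf-8") as f:  # placeholder path to your corpus
    raw_text = f.read()

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also collect collocation statistics
trainer.train(raw_text)

tokenizer = PunktSentenceTokenizer(trainer.get_params())
# Abbreviation handling (e.g. "Dr.") depends on what the trainer has seen.
for sentence in tokenizer.tokenize("Dr. Smith reviewed the scans. He found nothing unusual"):
    print(sentence)

Note that Punkt still relies on sentence-final punctuation being present; it won't insert periods that the OCR dropped entirely.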
