Special characters when training NLP models

I just started with NLP in bots: a user asks a question, which is classified by LUIS and then forwarded to QnAMaker to get an answer. I have noticed that it behaves strangely with Spanish, since we have accented characters and paired question marks (¿?). For example:
[1] ¿qué es NLP?
[2] que es NLP
If I train my model with the first one and test it with the second, the model won't assign them the same intent. This is a very common way to write in Spanish, since some people save time by skipping accented characters and punctuation.
My questions are:
Should I normalize every utterance in my model (removing accents,
punctuation, etc.)? Or should I train it with every different example?
Are there any guidelines for training NLP models that I can base my work on?

Should I normalize every utterance in my model (removing accents,
punctuation, etc.)? Or should I train it with every different example?
That really depends on what you want, but to avoid duplicating a lot of work, it is probably better to normalize every utterance in your model.
Then, at the bot level, you can strip accents and other "special" characters (or replace them with their normalized equivalents) before sending the utterance to LUIS for intent prediction.
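For illustration, here is a minimal sketch of such a normalization step in Python; the function name and the exact set of characters to strip are my own choices, not anything prescribed by LUIS:

```python
import string
import unicodedata

def normalize_utterance(text: str) -> str:
    """Lowercase, strip accents and punctuation before sending text to LUIS.

    A generic normalization sketch, not an official LUIS API. Note that NFD
    decomposition also folds 'ñ' to 'n', which may or may not be desired.
    """
    text = text.lower()
    # Decompose accented characters (e.g. 'é' -> 'e' + combining accent)
    # and drop the combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Remove punctuation, including the inverted marks used in Spanish.
    to_remove = string.punctuation + "¿¡"
    text = "".join(ch for ch in text if ch not in to_remove)
    return " ".join(text.split())

print(normalize_utterance("¿Qué es NLP?"))  # -> "que es nlp"
print(normalize_utterance("que es NLP"))    # -> "que es nlp"
```

With this in place, both example utterances above map to the same string, so a single normalized training example covers both ways of typing the question.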

Related

Passing multiple sentences to BERT?

I have a dataset with paragraphs that I need to classify into two classes. These paragraphs are usually 3-5 sentences long. The overwhelming majority of them are less than 500 words long. I would like to make use of BERT to tackle this problem.
I am wondering how I should use BERT to generate vector representations of these paragraphs and especially, whether it is fine to just pass the whole paragraph into BERT?
There have been informative discussions of related problems here and here. These discussions focus on how to use BERT to represent whole documents. In my case the paragraphs are not that long and could indeed be passed to BERT without exceeding its maximum input length of 512 tokens. However, BERT was trained on sentences, and sentences are relatively self-contained units of meaning. I wonder whether feeding multiple sentences into BERT conflicts fundamentally with what the model was designed to do (although this appears to be done regularly).
I think your question is based on a misconception. Even though the BERT paper uses the term sentence quite often, it is not referring to a linguistic sentence. The paper defines a sentence as
an arbitrary span of contiguous text, rather than an actual linguistic sentence.
It is therefore completely fine to pass whole paragraphs to BERT; this is exactly why it can handle them.
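For illustration, a minimal sketch of encoding a whole paragraph in a single pass, assuming the Hugging Face transformers library (which the question does not mention):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

paragraph = (
    "First sentence of the paragraph. Second sentence with more detail. "
    "A third sentence that wraps up the point."
)

# Truncation only kicks in if the paragraph exceeds the 512-token limit.
inputs = tokenizer(paragraph, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] vector (or a mean over tokens) as the paragraph representation.
paragraph_vector = outputs.last_hidden_state[:, 0, :]
print(paragraph_vector.shape)  # torch.Size([1, 768])
```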

Removing junk sentences

I have transcripts of phone calls with customers and agents. I'm trying to find promises which were made by an agent to a customer.
I already did punctuation restoration, but there are a lot of sentences that don't make any sense, and I would like to remove them from the transcript. Most of them are just sets of unconnected words.
I wonder which approach is best for this task?
My ideas are:
• Use tf-idf and word2vec to create vectors for all sentences. After that we can do some kind of anomaly detection, e.g. find and delete vectors that deviate strongly from most other vectors.
• Spam filters. Maybe it is possible to apply spam filters to this task?
• Create a pattern of part-of-speech tags that a proper sentence must include. For example, any good sentence must include a noun + a verb. Or we could use, for example, dependency parses from spaCy.
Examples
Example of a sentence that I want to keep:
There's no charge once sent that you'll get a ups tracking number.
Example of a junk sentence:
Kinder pr just have to type it in again, clock drives bethel.
Another junk sentence:
Just so you have it on and said this is regarding that.
One thing I would try is to treat this as a classification problem (junk vs non-junk). You can train a model based on a labelled set (i.e. you need to label some subset of your dataset) and then classify the rest of the corpus.
You could use a pre-trained language model like BERT and fine-tune it with your labelled set, as in here (https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb).
The advantage of using a language model like this is that you don't have to worry much about linguistic (pre-)processing, i.e. you don't need part-of-speech tags or syntactic structure.
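As a rough sketch of that setup, here is the same idea with the Hugging Face transformers library rather than the TF Hub notebook linked above; the two example sentences, the labels, and the output directory are placeholders:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder labelled data: 0 = keep, 1 = junk. In practice you would label
# a few hundred or thousand transcript sentences.
sentences = [
    "There's no charge once sent that you'll get a ups tracking number.",
    "Kinder pr just have to type it in again, clock drives bethel.",
]
labels = [0, 1]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

encodings = tokenizer(sentences, truncation=True, padding=True)

class JunkDataset(torch.utils.data.Dataset):
    """Wraps the tokenized sentences and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="junk-classifier", num_train_epochs=3),
    train_dataset=JunkDataset(encodings, labels),
)
trainer.train()
```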
Comments regarding your ideas:
Anomaly detection with tf-idf and word2vec: it depends on the proportion of junk sentences in your corpus. If it is more than, say, 15%, they might not be all that anomalous. Also, I am assuming your junk sentences come from noisy automatic speech-to-text transcription; I am not sure to what extent parts of these junk sentences are correctly transcribed, and what effect the correctly transcribed portions have on how anomalous they look.
Spam filters: if you mean pre-existing spam filters trained on spam email, I would guess that the spamminess of emails is quite different from the junkiness of your transcripts.
Use POS tags or syntactic structure to manually create rules for valid sentences:
This seems a bit tedious to me, and I am not sure you would catch all the junk with it. For instance, in your junk examples the syntactic structure does not strike me as too unusual; e.g. "clock drives bethel" would likely be tagged as noun + verb + noun, which is quite a common tag sequence. The junkiness in this case comes from the meaning of the words.
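To illustrate that last point, a quick check with spaCy (assuming the small English model is installed; the exact tags it produces may differ, so treat the expected output as a guess):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("clock drives bethel")
print([(token.text, token.pos_) for token in doc])
# Likely something like [('clock', 'NOUN'), ('drives', 'VERB'), ('bethel', ...)],
# i.e. an ordinary noun + verb pattern even though the sentence is junk.
```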

Google's BERT for NLP: replace foreign characters in vocab.txt to add words?

I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of vocab.txt, but I also notice there are thousands of single foreign (unicode) characters in the file, which I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and have the model still work correctly?
The weights of the unused tokens are essentially randomly initialized, since they have never been used. If you just replace them with your own words but don't pre-train further on your domain-specific corpus, they will essentially remain random. So, IMO, there won't be much benefit if you simply replace them and continue with fine-tuning.
Let me point you to this GitHub issue. According to the author of the paper:
My recommendation would be to just use the existing wordpiece vocab
and run pre-training for more steps on the in-domain text, and it
should learn the compositionality "for free". Keep in mind that with a
wordpiece vocabulary there are basically no out-of-vocabulary words,
and you don't really know which words were seen in the pre-training
and not. Just because a word was split up by word pieces doesn't mean
it's rare, in fact many words which were split into wordpieces were
seen 5,000+ times in the pre-training data.
But if you want to add more vocab you can either: (a) Just replace the
"[unusedX]" tokens with your vocabulary. Since these were not used
they are effectively randomly initialized. (b) Append it to the end of
the vocab, and write a script which generates a new checkpoint that is
identical to the pre-trained checkpoint, but with a bigger vocab
where the new embeddings are randomly initialized (for initialized we
used tf.truncated_normal_initializer(stddev=0.02)). This will likely
require mucking around with some tf.concat() and tf.assign() calls.
Hope this helps!
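For what it's worth, the quote describes option (b) in terms of raw TensorFlow checkpoint surgery. If you happen to work with the Hugging Face transformers port of BERT instead, the same idea (append tokens and randomly initialize their embeddings, then keep pre-training or fine-tuning) is built in; the word list below is a placeholder:

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

new_words = ["mydomainword1", "mydomainword2"]  # placeholder domain vocabulary
num_added = tokenizer.add_tokens(new_words)

# Grows the embedding matrix; the new rows are randomly initialized and only
# become meaningful after further pre-training/fine-tuning on domain text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```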

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:
The corpora were lemmatized and POS-tagged with the Stanford CoreNLP (Manning et al., 2014) and each token was replaced with its lemma and POS tag
(http://www.ep.liu.se/ecp/131/039/ecp17131039.pdf)
For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP
(https://arxiv.org/pdf/1607.05368.pdf)
So my questions are:
Why does the first paper apply POS-tagging? Would each token then be replaced with something like {lemma}_{POS} and the whole thing used to train the model? Or are the tags used to filter tokens?
For example, gensim's WikiCorpus applies lemmatization by default and then keeps only a few part-of-speech types (verbs, nouns, etc.), discarding the rest. So what is the recommended way?
The quote from the second paper sounds to me like they only split up the words and then lowercase them. This is also what I tried first, before I used WikiCorpus. In my opinion, this should give better results for document embeddings, as most POS types contribute to the meaning of a sentence. Am I right?
In the original doc2vec paper I did not find details about their pre-processing.
For your first question, the answer is "it depends on what you are trying to accomplish!"
There isn't a recommended way per se to pre-process text. To clean a text corpus, the first steps are usually tokenization and lemmatization. Next, to remove unimportant terms/tokens, you can remove stop words or apply POS tags and drop tokens based on their grammatical category, on the assumption that some categories (such as adjectives) do not carry valuable information for, say, modelling a topic. But this depends entirely on the type of analysis you are going to perform after the pre-processing step.
For the second part of your question: as explained above, tokenisation and lowercasing are standard parts of the pre-processing routine. So I suspect that, regardless of the ML algorithm used later on, your results will be better if you carefully pre-process your data. I am not sure whether POS tags contribute to the meaning of a sentence, though.
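As a small illustration of both variants mentioned in the question (using spaCy here instead of Stanford CoreNLP, which the quoted papers used):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The corpora were lemmatized and POS-tagged before training.")

# Variant 1: replace every token with "{lemma}_{POS}".
lemma_pos = [f"{tok.lemma_}_{tok.pos_}" for tok in doc if not tok.is_punct]
print(lemma_pos)

# Variant 2: keep only content words (here: nouns and verbs) as lowercased lemmas.
content_only = [tok.lemma_.lower() for tok in doc if tok.pos_ in {"NOUN", "VERB"}]
print(content_only)
```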
Hope I provided some valuable feedback to your research. If not you could provide a code sample to further discuss this issue.

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this is a useful preprocessing step to do.
I think it really depends on the task you want to solve.
Essentially, lemmatization collapses different word forms together, making the vocabulary smaller and the data less sparse, which can help if you don't have enough training data.
But since word2vec is usually trained on fairly large corpora, if you have enough training data, lemmatization shouldn't gain you much.
Something more interesting is how to do tokenization with respect to the existing dictionary of word vectors inside the W2V model (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.'], after which you can replace each token with its vector from W2V. The challenge is that some tokenizers will tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
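One way to keep such multi-word entries intact is NLTK's MWETokenizer; in practice the list of multi-word expressions would come from the word2vec vocabulary itself, but it is hard-coded here as an example (NLTK's punkt data must be downloaded for word_tokenize):

```python
from nltk.tokenize import MWETokenizer, word_tokenize

# Merge consecutive tokens that form a known multi-word expression.
mwe = MWETokenizer([("New", "York")], separator=" ")
tokens = mwe.tokenize(word_tokenize("Good muffins cost $3.88 in New York."))
print(tokens)
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York', '.']
```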
The current project I am working on involves identifying gene names in biology paper abstracts using the vector space created by word2vec. When we run the algorithm without lemmatizing the corpus, two main problems arise:
The vocabulary gets way too big, since you have words in different forms which in the end have the same meaning.
As noted above, you get more representatives of a certain "meaning", but at the same time that meaning's occurrences get split among its representatives. Let me clarify with an example.
We are currently interested in a gene known by the acronym BAD. At the same time, "bad" is an English word with several forms (badly, worst, ...). Since word2vec builds its vectors from the probability of a word's context (its surrounding words), if you don't lemmatize some of these forms you can lose the relationship between them, and in the BAD case you might end up with a word closer to the gene names than to the adjectives in the vector space.
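As a toy illustration of this trade-off (my own sketch, assuming spaCy and gensim 4.x; the two sentences are made up): lemmatize the corpus but leave all-uppercase tokens such as BAD untouched, so the gene acronym is not folded into the lowercase adjective.

```python
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

raw_sentences = [
    "BAD promotes apoptosis in these cells.",
    "The results were worse than the badly designed baseline.",
]

def lemmatize_keep_acronyms(text):
    """Lowercased lemmas for ordinary words, but keep all-caps tokens as-is."""
    return [
        tok.text if tok.text.isupper() else tok.lemma_.lower()
        for tok in nlp(text)
        if not tok.is_punct
    ]

corpus = [lemmatize_keep_acronyms(s) for s in raw_sentences]
print(corpus)  # inspect which forms (e.g. "worse") get collapsed by the lemmatizer

model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, workers=1)
print(sorted(model.wv.index_to_key))  # "BAD" remains a separate vocabulary entry
```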
