Using spaCy with archaic / Old English words? - nlp

I am using en_core_web_lg to compare some texts for similarity, and I am not getting the expected results.
The issue, I guess, is that my texts are mostly religious, for example:
"Thus hath it been decreed by Him Who is the Source of Divine inspiration."
"He, verily, is the Expounder, the Wise."
"Whoso layeth claim to a Revelation direct from God, ere the expiration of a full thousand years, such a man is assuredly a lying impostor. "
My question is: is there a way I can check spaCy's "dictionary"? Does it include words like "whoso", "layeth", "decreed" or "verily"?

To check whether spaCy knows about individual words, you can look at tok.is_oov ("is out of vocabulary"), where tok is a token from a doc.
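For example, something along these lines (a minimal sketch, assuming en_core_web_lg is installed) prints the vocabulary status of each token:

    # Minimal sketch: inspect which tokens the en_core_web_lg vocabulary covers.
    import spacy

    nlp = spacy.load("en_core_web_lg")
    doc = nlp("Whoso layeth claim to a Revelation direct from God, ere the expiration of a full thousand years.")

    for tok in doc:
        # is_oov: token not in the model's vocabulary; has_vector: a word vector exists for it
        print(f"{tok.text:12} oov={tok.is_oov} has_vector={tok.has_vector}")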
spaCy is trained on a dataset called OntoNotes. While that does include some older texts, like the Bible, it's mostly relatively recent newspapers and similar sources. The word vectors are trained on Internet text. I would not expect it to work well with documents of the kind you are describing, which are very different from what it has seen before.
I would suggest you train custom word vectors on your dataset, which you can then load into spaCy. You could also look at the HistWords project.
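A rough sketch of that route (assuming gensim 4.x; the tiny corpus below is only a placeholder for your own tokenized texts):

    # Train word2vec vectors on your own corpus and copy them into spaCy's vocab.
    from gensim.models import Word2Vec
    import spacy

    sentences = [
        ["thus", "hath", "it", "been", "decreed", "by", "him"],
        ["he", "verily", "is", "the", "expounder", "the", "wise"],
    ]  # placeholder -- use your full tokenized corpus here

    w2v = Word2Vec(sentences, vector_size=300, min_count=1, epochs=50)

    nlp = spacy.load("en_core_web_lg")  # ships 300-dim vectors, matching vector_size above
    for word in w2v.wv.index_to_key:
        nlp.vocab.set_vector(word, w2v.wv[word])

    nlp.to_disk("en_core_web_lg_custom_vectors")

Alternatively, if you are on spaCy v3, the `spacy init vectors` CLI can build a pipeline directly from a word2vec-format text file.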

Related

Latest Pre-trained Multilingual Word Embedding

Are there any recent pre-trained multilingual word embeddings (multiple languages jointly mapped to the same vector space)?
I have looked at the following but they don't fit my needs:
FastText / MUSE (https://fasttext.cc/docs/en/aligned-vectors.html): this one seems too old, and the word vectors do not use subword / wordpiece information.
LASER (https://github.com/yannvgn/laserembeddings): I'm now using this one; it uses subword information (via BPE), but it's suggested not to use it for word embeddings because it's designed to embed sentences (https://github.com/facebookresearch/LASER/issues/69).
BERT multilingual (bert-base-multilingual-uncased in https://huggingface.co/transformers/pretrained_models.html): it provides contextualised embeddings that can be used to embed sentences, but it seems not to be good at embedding words without context.
Here is the problem I'm trying to solve:
I have a list of company names, which can be in any language (mainly English), and I have a list of keywords in English to measure how close a given company name is to the keywords. Right now I have a simple keyword-matching solution, but I want to improve it using pretrained embeddings. As you can see in the following examples, there are several challenges:
• the keyword and the brand name are not separated by a space (right now I'm using the package "wordsegment" to split names into subwords; see the sketch after the examples), so an embedding with subword information should help a lot
• the keyword list is not extensive and company names can be in different languages (that's why I want to use embeddings: "soccer" is close to "football")
Examples of company names: "cheapfootball ltd.", "wholesalefootball ltd.", "footballer ltd.", "soccershop ltd."
Examples of keywords: "football"
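For reference, the segmentation step mentioned above looks roughly like this (a sketch using the wordsegment package):

    # Split concatenated company names into words before the embedding lookup.
    from wordsegment import load, segment

    load()  # load the package's word-frequency data once
    for name in ["cheapfootball", "wholesalefootball", "soccershop"]:
        print(name, "->", segment(name))  # e.g. "cheapfootball" -> ['cheap', 'football']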
Check if one of these would do:
Multilingual BPE-based embeddings
Aligned multilingual sub-word vectors
If you're okay with whole word embeddings:
(Both of these are somewhat old, but putting them here in case it helps someone)
Multilingual FastText
ConceptNet NumberBatch
If you're okay with contextual embeddings:
Multilingual ELMo
XLM-RoBERTa
You can even try using the (sentence-piece tokenized) non-contextual input word embeddings of multilingual transformer implementations like XLM-R or mBERT, instead of their contextual output embeddings. (Not sure how it will perform.)
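A sketch of that last idea, using the Hugging Face transformers library and the public xlm-roberta-base checkpoint (untested for quality, as said):

    # Use the static input embedding matrix of XLM-R instead of its contextual outputs.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModel.from_pretrained("xlm-roberta-base")
    emb = model.get_input_embeddings().weight.detach()  # shape: (vocab_size, hidden_size)

    def word_vector(word):
        # Average the input embeddings of the word's sentence-piece tokens.
        ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        return emb[ids].mean(dim=0)

    sim = torch.cosine_similarity(word_vector("football"), word_vector("soccer"), dim=0)
    print(float(sim))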
I think it might be a little misleading to build this application on embeddings alone (learned by experience), because if there are two companies, football ltd and soccer ltd, the model might say both are a match, which might not be right.
One approach is to remove redundant words, e.g. "corporation" from "Facebook corporation" and "ltd" from "Facebook ltd", and then try matching.
Another approach is to use deepmatcher, which does deep-learning fuzzy matching based on word context.
Link
If sentence similarity is the primary approach you want to follow, STS Benchmark algorithms might be worth exploring: Link
Sent2vec (link) and InferSent (Link) use fastText but seem to have good results on the STS Benchmark.
If anyone is still looking for some sort of multilingual word embedding, this might help:
https://github.com/babylonhealth/fastText_multilingual
As far as I know, it doesn't support sentences, and my attempts to make it do so weren't really successful.

Removing junk sentences

I have transcripts of phone calls between customers and agents. I'm trying to find promises made by an agent to a customer.
I have already done punctuation restoration. But there are a lot of sentences that don't make any sense, and I would like to remove them from the transcript. Most of them are just a set of unconnected words.
I wonder which approach is best for this task?
My ideas are:
• Use tf-idf and word2vec to create vectors from all sentences. After that we can do some kind of anomaly detection, e.g. look for and delete vectors that deviate strongly from most other vectors.
• Spam filters. Maybe it is possible to apply spam filters to this task?
• Create a pattern of part-of-speech tags that a proper sentence must include. For example, any good sentence must include a noun + a verb. Or we could use, for example, dependency labels from spaCy.
Examples
Example of a sentence that I want to keep:
There's no charge once sent that you'll get a ups tracking number.
Example of a junk sentence:
Kinder pr just have to type it in again, clock drives bethel.
Another junk sentence:
Just so you have it on and said this is regarding that.
One thing I would try is to treat this as a classification problem (junk vs non-junk). You can train a model based on a labelled set (i.e. you need to label some subset of your dataset) and then classify the rest of the corpus.
You could use a pre-trained language model like BERT and fine-tune it with your labelled set, as in here (https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb).
The advantage of using a language model like this is that you don't have to worry too much about linguistic (pre-)processing, meaning you don't have to get the part-of-speech or syntactic structure.
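A condensed sketch of that setup with the Hugging Face transformers library (instead of the TF-Hub notebook linked above); the file train.csv with "text" and "label" columns stands in for your hand-labelled subset:

    # Fine-tune a pre-trained model as a junk / non-junk sentence classifier.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    data = load_dataset("csv", data_files={"train": "train.csv"})  # columns: text, label (0/1)
    data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length"), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="junk-classifier", num_train_epochs=3),
        train_dataset=data["train"],
    )
    trainer.train()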
Comments regarding your ideas:
Anomaly detection with tf-idf and word2vec: it depends on the proportion of junk sentences in your corpus. If it's more than about 15%, I would think they might not be so anomalous. Also, I am assuming your junk sentences come from noisy automatic speech-to-text transcription. I am not sure to what extent parts of these junk sentences are correctly transcribed, and what effect the correctly transcribed portion has on how anomalous they look.
If you mean pre-existing spam filters that are trained on spam email, I would guess that the spamminess of emails is quite different from the junkiness of your transcripts.
Use POS tags or syntactic structure to manually create rules for valid sentences:
This seems a bit tedious to me, and I am also not sure you will discover all the junk with this. For instance, in your junk examples the syntactic structure does not strike me as too unusual, e.g. "clock drives bethel" might be tagged as noun + verb + noun, which is quite a common tag sequence. The junkiness in this case comes from the meaning of the words.
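You can check this quickly with spaCy (a sketch; the exact tags depend on the model, so treat the output as illustrative):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    for tok in nlp("Clock drives bethel."):
        print(tok.text, tok.pos_)
    # plausibly something like NOUN VERB PROPN -- an ordinary pattern for a meaningless sentence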

Gensim keywords, how to load a German model?

I'm trying to get started with the gensim library. My goal is pretty simple: I want to use the keyword extraction provided by gensim on a German text. Unfortunately, I'm failing hard.
Gensim comes with keyword extraction built in, and it is built on TextRank. While the results look good on English text, it seems not to work on German. I simply installed gensim via PyPI and used it out of the box. Such AI products are usually driven by a model, so my guess is that gensim comes with an English model. A word2vec model for German is available on a GitHub page.
But here I'm stuck: I can't find out how the summarization module of gensim, which provides the keywords function I'm looking for, can work with an external model.
So the basic question is: how do I load the German model and get keywords from German text?
Thanks
There's nothing in the gensim docs, or the original TextRank paper (from 2004), suggesting that the algorithm requires a Word2Vec model as input. (Word2Vec was first published around 2013.) It just takes word-tokens.
See examples of its use in the tutorial notebook that's included with gensim:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb
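In short, the call works on the raw text, roughly like this (note: gensim.summarization shipped with gensim 3.x and was removed in gensim 4.0):

    # TextRank keyword extraction on plain text -- no word2vec model involved.
    from gensim.summarization import keywords

    text = open("my_german_article.txt", encoding="utf-8").read()  # hypothetical input file
    print(keywords(text, words=10, split=True))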
I'm not sure the same algorithm would work as well on German text, given the differing importance of compound words. (To my eyes, TextRank isn't very impressive with English, either.) You'd have to check the literature to see if it still gives respected results. (Perhaps some sort of extra stemming/intraword-tokenizing/canonicalization would help.)

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:
The corpora were lemmatized and POS-tagged with the Stanford CoreNLP (Manning et al., 2014) and each token was replaced with its lemma and POS tag
(http://www.ep.liu.se/ecp/131/039/ecp17131039.pdf)
For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP
(https://arxiv.org/pdf/1607.05368.pdf)
So my questions are:
Why does the first paper apply POS-tagging? Would each token then be replaced with something like {lemma}_{POS} and the whole thing used to train the model? Or are the tags used to filter tokens?
For example, gensim's WikiCorpus applies lemmatization by default and then only keeps a few parts of speech (verbs, nouns, etc.), discarding the rest. So what is the recommended way?
The quote from the second paper sounds to me like they only split up words and then lowercase them. This is also what I first tried before I used WikiCorpus. In my opinion, this should give better results for document embeddings, as most POS types contribute to the meaning of a sentence. Am I right?
In the original doc2vec paper I did not find details about their pre-processing.
For your first question, the answer is "it depends on what you are trying to accomplish!"
There isn't a recommended way per se to pre-process text. To clean a text corpus, the first steps are usually tokenization and lemmatization. Next, to remove unimportant terms/tokens, you can remove stop-words or even apply POS tags, so that you can remove tokens based on their grammatical category, on the assumption that some grammatical categories (such as adjectives) do not contain valuable information for, say, modelling a topic. But this purely depends on the type of analysis you are going to carry out after the pre-processing step.
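As an illustration of the "{lemma}_{POS}" replacement plus POS filtering discussed here, a sketch with spaCy (the papers used Stanford CoreNLP; this only shows the idea):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The corpora were lemmatized and POS-tagged before training the embeddings.")

    KEEP = {"NOUN", "VERB", "ADJ", "ADV"}  # rough filter, similar in spirit to gensim's WikiCorpus
    tokens = [f"{tok.lemma_.lower()}_{tok.pos_}" for tok in doc if tok.pos_ in KEEP]
    print(tokens)  # e.g. ['corpus_NOUN', 'lemmatize_VERB', ...]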
For the second part of your question: as explained above, tokenisation and lowercasing are standard parts of the pre-processing routine, so I also suspect that, regardless of the ML algorithm used later on, your results will be better if you carefully pre-process your data. I am not sure whether POS tags contribute to the meaning of a sentence, though.
Hope I provided some valuable feedback for your research. If not, you could provide a code sample so we can discuss this issue further.

Do I need to provide sentences for training spaCy NER or are paragraphs fine?

I am trying to train a new spaCy model to recognize references to law articles. I start with a blank model and train the ner pipe according to the example given in the documentation.
The performance of the trained model is really poor, even with several thousand input points, and I am trying to figure out why.
One possible answer is that I am training on full paragraphs, instead of the sentences used in the examples. Each of these paragraphs can have multiple references to law articles. Is this a possible issue?
Turns out I was making a huge mistake in my code. There is nothing wrong with paragraphs, as long as your code actually supplies them to spaCy.
Paragraphs should be fine. Could you give an example input data point?
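For instance, a paragraph-level data point in spaCy's v2-style training format could look like this (the text and the LAW label are made up; computing the offsets with str.index avoids off-by-one mistakes in the entity spans):

    # Hypothetical paragraph with two law references annotated by character offsets.
    paragraph = (
        "Pursuant to Article 6 of Regulation 2016/679, processing is lawful only "
        "if at least one condition applies. See also Article 9 for special categories."
    )

    entities = []
    for ref in ["Article 6 of Regulation 2016/679", "Article 9"]:
        start = paragraph.index(ref)
        entities.append((start, start + len(ref), "LAW"))

    TRAIN_DATA = [(paragraph, {"entities": entities})]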
