Gensim Keywords, how to load a german model? - nlp

I'm trying to get started with the gensim library. My goal is pretty simple: I want to use the keyword extraction provided by gensim on a German text. Unfortunately, I'm failing hard.
Gensim comes with keyword extraction built in, which is built on TextRank. While the results look good on English text, it does not seem to work on German. I simply installed gensim via PyPI and used it out of the box. Such AI products are usually driven by a model, so my guess is that gensim comes with an English model. A word2vec model for German is available on a GitHub page.
But here I'm stuck: I can't find a way to make gensim's summarization module, which provides the keywords function I'm looking for, work with an external model.
So the basic question is: how do I load the German model and get keywords from German text?
Thanks

There's nothing in the gensim docs, or in the original TextRank paper (from 2004), suggesting that the algorithm requires a Word2Vec model as input. (Word2Vec was first published around 2013.) It just takes word tokens.
See examples of its use in the tutorial notebook that's included with gensim:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb
I'm not sure the same algorithm would work as well on German text, given the differing importance of compound words. (To my eyes, TextRank isn't very impressive with English, either.) You'd have to check the literature to see if it still gives respected results. (Perhaps some sort of extra stemming/intraword-tokenizing/canonicalization would help.)
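To illustrate that TextRank-style keyword extraction needs only word tokens and no pretrained model, here is a minimal pure-Python sketch. It is not gensim's implementation (note also that the `summarization` module was removed in gensim 4.0). It builds a co-occurrence graph over the tokens and ranks words with the PageRank-style update from the TextRank paper. The function name `textrank_keywords`, the tiny German stopword list, and all parameter defaults are assumptions made for illustration; the original paper filters candidate words by part of speech rather than with a stopword list.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
GERMAN_STOPWORDS = {
    "der", "die", "das", "den", "und", "ist", "ein", "eine", "in", "im",
    "von", "mit", "zu", "für", "auf", "es", "sich", "nicht", "auch",
}


def textrank_keywords(text, stopwords=GERMAN_STOPWORDS, window=3,
                      top_n=5, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Tokenize, lowercase, and drop stopwords and very short tokens.
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if t not in stopwords and len(t) > 2]

    # Undirected graph: connect words that appear within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # Iterate S(w) = (1 - d) + d * sum(S(n) / degree(n)) over neighbors n.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


text = ("Die Katze jagt die Maus. Die Maus frisst den Käse. "
        "Die Katze mag Käse und die Katze schläft.")
print(textrank_keywords(text))
```

For German, splitting compounds or lemmatizing the tokens before building the graph would be the natural place to plug in the extra preprocessing mentioned above.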

Related

Named Entity Recognition Systems for German Texts

I am working on a Named Entity Recognition (NER) project in which I have a large amount of text, too much to read or even skim. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and building an index of the form (entity, list of pages/lines where it is mentioned). I have worked through Stanford's NLP lecture and (parts of) Eisenstein's Introduction to NLP book, and found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I would not know whether I could solve this challenge even if the texts were in English.
As a first step
are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I thought about fine-tuning or transfer learning the base NER model to a corpus of historic texts to improve NER.
A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
spaCy
probably NLTK and OpenNLP as well
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
Btw you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't collect exactly everything that you would like it to collect, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model, and this requires obtaining some annotated data (i.e. manually annotating or paying somebody to do it).
Even in this case, a NER model is statistical and will unavoidably make some mistakes, so don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not too much) about the actual entity, because it's not primarily based on the words in the entity. It's based on indications in the surrounding text. For example, in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by" and 'July' as a date because of "take place in". However, if the language used in the corpus is very different from the training data used for the model, the performance could be very bad.
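The index described in the question (entity, list of lines where it is mentioned) can be sketched independently of the NER toolkit. In the sketch below, `build_entity_index` and `toy_extractor` are hypothetical names; the extractor is a stand-in gazetteer lookup so the example runs without any model. With spaCy, the extractor would instead return `(ent.text, ent.label_)` pairs from `doc.ents`, assuming a German pipeline such as `de_core_news_sm` is installed.

```python
from collections import defaultdict


def build_entity_index(lines, extract_entities):
    """Map each (entity text, label) pair to the line numbers mentioning it."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for text, label in extract_entities(line):
            hits = index[(text, label)]
            # Record each line only once per entity.
            if not hits or hits[-1] != line_no:
                hits.append(line_no)
    return dict(index)


def toy_extractor(line):
    """Stand-in for a real NER model: looks words up in a tiny gazetteer."""
    gazetteer = {"Berlin": "LOC", "Goethe": "PER"}  # illustrative only
    words = line.replace(",", " ").replace(".", " ").split()
    return [(w, gazetteer[w]) for w in words if w in gazetteer]


# With spaCy (assumption: the de_core_news_sm pipeline is installed):
#   import spacy
#   nlp = spacy.load("de_core_news_sm")
#   extractor = lambda line: [(e.text, e.label_) for e in nlp(line).ents]

lines = ["Goethe reiste nach Berlin.", "Berlin war kalt.", "Nichts hier."]
print(build_entity_index(lines, toy_extractor))
```

Keeping the extractor as a parameter makes it easy to swap a pretrained model for a fine-tuned one later without touching the indexing code.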

Using spaCy with archaic/old English words?

I am using en_core_web_lg to compare some texts for similarity and I am not getting the expected results.
The issue I guess is that my texts are mostly religious, for example:
"Thus hath it been decreed by Him Who is the Source of Divine inspiration."
"He, verily, is the Expounder, the Wise."
"Whoso layeth claim to a Revelation direct from God, ere the expiration of a full thousand years, such a man is assuredly a lying impostor. "
My question is, is there a way I can check spacy's "dictionary"? Does it include words like "whoso" "layeth" "decreed" or "verily"?
To check if spaCy knows about individual words you can check tok.is_oov ("is out of vocabulary"), where tok is a token from a doc.
spaCy is trained on a dataset called OntoNotes. While that does include some older texts, like the Bible, it's mostly relatively recent newspapers and similar sources. The word vectors are trained on Internet text. I would not expect it to work well with documents of the type you are describing, which are very different from what it has seen before.
I would suggest you train custom word vectors on your dataset, which you can then load into spaCy. You could also look at the HistWords project.
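As a small sketch of the `is_oov` check mentioned above, the helper below collects the distinct out-of-vocabulary words in a document. The function name `oov_words` is hypothetical; it relies only on the token attributes `text`, `is_alpha`, and `is_oov`, so it works on a spaCy `Doc` and, for demonstration without a model, on simple stand-in objects.

```python
from types import SimpleNamespace


def oov_words(doc):
    """Return the distinct alphabetic words that are marked out-of-vocabulary."""
    return sorted({tok.text.lower() for tok in doc
                   if tok.is_alpha and tok.is_oov})


# Typical use (assumption: spaCy and en_core_web_lg are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   print(oov_words(nlp("Whoso layeth claim to a Revelation direct from God")))

# Stand-in tokens so the sketch runs without a model; the is_oov values
# here are made up and do not reflect what spaCy would actually report.
demo_doc = [
    SimpleNamespace(text="Whoso", is_alpha=True, is_oov=True),
    SimpleNamespace(text="claim", is_alpha=True, is_oov=False),
    SimpleNamespace(text=",", is_alpha=False, is_oov=True),
]
print(oov_words(demo_doc))
```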

What is the state of GPT-3 for text classification in Spanish?

I would like to know if it is possible to reuse GPT-3 in a different language, Spanish in this case.
Do I need a GPT-3 model specifically trained with a Spanish corpus, or can I use transfer learning to produce Spanish text?
GPT-3 is only available via an API and only to people who apply for access. The model is too big to run locally on any reasonable hardware, and fine-tuning is thus hardly an option.
Given how well GPT-3 works for machine translation, my guess is that it will work reasonably well for Spanish by default. However, if your task is text classification, you can do a much better job using a pre-trained BERT-like model. Hugging Face's Transformers already has several models for Spanish.

Latest Pre-trained Multilingual Word Embedding

Are there any recent pre-trained multilingual word embeddings (where multiple languages are jointly mapped into the same vector space)?
I have looked at the following but they don't fit my needs:
FastText / MUSE (https://fasttext.cc/docs/en/aligned-vectors.html): this one seems too old, and the word vectors are not using subwords / wordpiece information.
LASER (https://github.com/yannvgn/laserembeddings): I'm now using this one. It uses subword information (via BPE); however, it's suggested not to use it for word embeddings because it's designed to embed sentences (https://github.com/facebookresearch/LASER/issues/69).
BERT multilingual (bert-base-multilingual-uncased in https://huggingface.co/transformers/pretrained_models.html): these are contextualised embeddings that can be used to embed sentences, and they seem not to be good at embedding words without context.
Here is the problem I'm trying to solve:
I have a list of company names, which can be in any language (mainly English), and I have a list of keywords in English to measure how close a given company name is with regards to the keywords. Now I have a simple keyword matching solution, but I want to improve it using pretrained embeddings. As you can see in the following examples, there are several challenges:
keyword and brand name are not separated by a space (for now I'm using the package "wordsegment" to split words into subwords), so embeddings with subword information should help a lot
the keyword list is not exhaustive and company names could be in different languages (that's why I want to use embeddings, because "soccer" is close to "football")
Examples of company names: "cheapfootball ltd.", "wholesalefootball ltd.", "footballer ltd.", "soccershop ltd."
Examples of keywords: "football"
Check if this would do:
Multilingual BPE-based embeddings
Aligned multilingual sub-word vectors
If you're okay with whole word embeddings:
(Both of these are somewhat old, but I'm putting them here in case it helps someone)
Multilingual FastText
ConceptNet NumberBatch
If you're okay with contextual embeddings:
Multilingual ELMo
XLM-RoBERTa
You can even try using the (sentence-piece tokenized) non-contextual input word embeddings instead of the output contextual embeddings, of the multilingual transformer implementations like XLM-R or mBERT. (Not sure how it will perform)
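Whichever embedding source is chosen, the matching step from the question can be sketched as: segment the company name, embed each segment, and take the best cosine similarity against the keyword. The vectors below are made-up 3-dimensional toy values, not output of any real model, and `best_keyword_score` is a hypothetical helper; in practice the vectors would come from one of the embeddings listed above and the segments from a splitter such as the `wordsegment` package mentioned in the question.

```python
import math

# Made-up toy vectors for illustration only (not from a real model).
TOY_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "soccer": [0.8, 0.2, 0.1],
    "shop": [0.1, 0.9, 0.0],
    "cheap": [0.0, 0.7, 0.3],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_keyword_score(segments, keyword, vectors=TOY_VECTORS):
    """Highest similarity between any known name segment and the keyword."""
    keyword_vec = vectors.get(keyword)
    if keyword_vec is None:
        return 0.0
    return max((cosine(vectors[s], keyword_vec)
                for s in segments if s in vectors), default=0.0)


# "soccershop ltd." segmented into ["soccer", "shop"] (segmentation assumed).
print(best_keyword_score(["soccer", "shop"], "football"))
print(best_keyword_score(["cheap", "shop"], "football"))
```

Taking the maximum over segments rather than averaging keeps an unrelated segment such as "shop" from diluting a strong match.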
I think it might be a little misleading to build a model using embeddings for this application (learned from experience), because if there are two companies, football ltd and soccer ltd, the model might say both are a match, which might not be right.
One approach is to remove redundant words, e.g., corporation from Facebook corporation or ltd from Facebook ltd, and then try matching.
Another approach is to use deepmatcher, which uses deep learning fuzzy matching based on words context.
Link
If sentence similarity is the primary approach you want to follow, STSBenchmark algorithms might be worth exploring: Link
Sent2vec link and InferSent Link use fastText but seem to have good results on STSBenchmark
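The suffix-removal approach mentioned in this answer can be sketched in a few lines. The suffix list and the helper name `strip_legal_suffix` are assumptions for illustration; a real list would need to cover the legal forms that actually occur in the data.

```python
import re

# Illustrative list of legal-form words to drop (an assumption, not exhaustive).
LEGAL_SUFFIXES = {"ltd", "limited", "corporation", "corp", "inc", "gmbh", "llc"}


def strip_legal_suffix(name):
    """Lowercase a company name and drop trailing legal-form words."""
    words = re.findall(r"[^\W_]+", name.lower())
    # Keep at least one word so a name like "Ltd" is not emptied out.
    while len(words) > 1 and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


print(strip_legal_suffix("Facebook corporation"))
print(strip_legal_suffix("soccershop ltd."))
```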
If anyone is still looking for some sort of multilingual word embedding.
This might help:
https://github.com/babylonhealth/fastText_multilingual
As far as I know, it doesn't support sentences, and my trials to do it weren't really successful.

Entity Recognition and Sentiment Analysis using NLP

So, this question might be a little naive, but I thought asking the friendly people of Stack Overflow wouldn't hurt.
My current company has been using a third-party API for NLP for a while now. We basically URL-encode a string and send it over, and they extract certain entities for us (we have a list of entities that we're looking for) and return a JSON mapping of entity : sentiment. We've recently decided to bring this project in house instead.
I've been studying NLTK, Stanford NLP and LingPipe for the past 2 days now, and can't figure out if I'm basically reinventing the wheel doing this project.
We already have massive tables containing the original unstructured text and another table containing the extracted entities from that text and their sentiment. The entities are single words. For example:
Unstructured text : Now for the bed. It wasn't the best.
Entity : Bed
Sentiment : Negative
I believe that implies we have training data (unstructured text) as well as entities and sentiments. Now, how can I go about using this training data with one of the NLP frameworks to get what we want? No clue. I've sort of got the steps, but I'm not sure:
Tokenize sentences
Tokenize words
Find the noun in the sentence (POS tagging)
Find the sentiment of that sentence.
But that should fail for the case I mentioned above since it talks about the bed in 2 different sentences?
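One way to picture the steps listed above, including the two-sentence case, is the rule-based sketch below. It is only an illustration under strong assumptions: the entity list, the sentiment and negation word lists, and the function name `entity_sentiment` are all made up, the noun-finding step is replaced by a lookup in a known entity list, and sentiment found in a sentence without an entity is attached to the most recently mentioned entity. A trained model would replace these hand-written rules.

```python
import re

# Made-up word lists for illustration only.
ENTITIES = {"bed", "room", "staff"}
POSITIVE = {"best", "great", "good", "clean"}
NEGATIVE = {"worst", "bad", "dirty"}
NEGATIONS = {"not", "never", "no", "wasn't", "isn't"}


def entity_sentiment(text):
    """Return {entity: 'positive' | 'negative'} using simple lexicon rules."""
    results = {}
    last_entity = None
    # Step 1: split into sentences at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Step 2: split the sentence into lowercase word tokens.
        words = re.findall(r"[a-z']+", sentence.lower())
        # Step 3: look for known entities (stands in for POS tagging).
        mentioned = [w for w in words if w in ENTITIES]
        if mentioned:
            last_entity = mentioned[-1]
        # Step 4: score sentiment, flipping the sign if a negation is present.
        score = (sum(w in POSITIVE for w in words)
                 - sum(w in NEGATIVE for w in words))
        if any(w in NEGATIONS for w in words):
            score = -score
        # Attach the sentiment to the most recently mentioned entity.
        if score and last_entity:
            results[last_entity] = "positive" if score > 0 else "negative"
    return results


print(entity_sentiment("Now for the bed. It wasn't the best."))
```

Carrying `last_entity` across sentence boundaries is the part that handles the "bed" example; without it, the second sentence would have sentiment but nothing to attach it to.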
So the question: does anyone know what the best framework would be for accomplishing the above tasks, and are there any tutorials on the same? (Note: I'm not asking for a solution.) If you've done this stuff before, is this task too large to take on? I've looked up some commercial APIs but they're absurdly expensive to use (we're a tiny startup).
Thanks stackoverflow!
OpenNLP may also be a library to look at. At least they have a small tutorial on training the name finder and on using the document categorizer for sentiment analysis. To train the name finder you have to prepare training data by tagging the entities in your text with SGML tags.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
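As a rough sketch of preparing such training data from the existing tables, the helper below wraps known entity words in the `<START:type> ... <END>` markers used by the OpenNLP name finder format, with tokens separated by whitespace. The helper name `to_opennlp_line` and the label `entity` are assumptions, and the simple regex tokenizer is a stand-in; the linked manual describes the exact expectations (such as one sentence per line).

```python
import re


def to_opennlp_line(text, entities, label="entity"):
    """Tag known single-word entities in `text` with <START:label> ... <END>."""
    # Split into words (keeping contractions like "wasn't") and punctuation.
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    wanted = {e.lower() for e in entities}
    tagged = []
    for tok in tokens:
        if tok.lower() in wanted:
            tagged.append(f"<START:{label}> {tok} <END>")
        else:
            tagged.append(tok)
    return " ".join(tagged)


print(to_opennlp_line("Now for the bed. It wasn't the best.", ["Bed"]))
```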
NLTK provides a naive NER tagger along with resources, but it doesn't fit all cases (including finding dates). However, NLTK allows you to modify and customize the NER tagger according to your requirements. This link might give you some ideas, with basic examples of how to customize it. Also, if you are comfortable with Scala and functional programming, this is one tool you cannot afford to miss.
Cheers...!
I have discovered spaCy lately and it's just great! In the link you can find a comparison of its speed and accuracy against NLTK and CoreNLP, and it does really well!
That said, solving your task is not a matter of picking a framework. You can have two different systems, one for NER and one for sentiment, and they can be completely independent. The hype these days is to use neural networks, and if you are willing to, you can train a recurrent neural network (which has shown the best performance for NLP tasks) with an attention mechanism to find the entity and the sentiment too.
There are great demos everywhere on the internet; the last two I have read and found interesting are [1] and [2].
Similar to spaCy, TextBlob is another fast and easy package that can accomplish many of these tasks.
I use NLTK, spaCy, and TextBlob frequently. If the corpus is simple, generic, and straightforward, spaCy and TextBlob work well out of the box. If the corpus is highly customized, domain-specific, or messy (incorrect spelling or grammar), I'll use NLTK and spend more time customizing my text processing pipeline with scrubbing, lemmatizing, etc.
NLTK Tutorial: http://www.nltk.org/book/
spaCy Quickstart: https://spacy.io/usage/
TextBlob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html

Resources