Are there any recent pre-trained multilingual word embeddings (where multiple languages are jointly mapped into the same vector space)?
I have looked at the following but they don't fit my needs:
FastText / MUSE (https://fasttext.cc/docs/en/aligned-vectors.html): this one seems too old, and the word vectors are not using subwords / wordpiece information.
LASER (https://github.com/yannvgn/laserembeddings): I'm currently using this one. It uses subword information (via BPE); however, it is recommended not to use it for word embeddings because it is designed to embed sentences (https://github.com/facebookresearch/LASER/issues/69).
BERT multilingual (bert-base-multilingual-uncased in https://huggingface.co/transformers/pretrained_models.html): it produces contextualised embeddings that can be used to embed sentences, and it does not seem good at embedding words without context.
Here is the problem I'm trying to solve:
I have a list of company names, which can be in any language (mainly English), and a list of keywords in English, and I want to measure how close a given company name is to the keywords. At the moment I have a simple keyword-matching solution, but I want to improve it using pretrained embeddings. As you can see in the following examples, there are several challenges:
the keyword and the brand name are not separated by a space (I'm currently using the package "wordsegment" to split words into subwords), so embeddings with subword information should help a lot
the keyword list is not extensive and company names could be in different languages (that's why I want to use embeddings, because "soccer" is close to "football")
Examples of company names: "cheapfootball ltd.", "wholesalefootball ltd.", "footballer ltd.", "soccershop ltd."
Examples of keywords: "football"
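To illustrate the subword point on the examples above, here is a minimal sketch that compares strings by their character n-grams, similar in spirit to the subword units some embedding models use. The function names and the n-gram range (3 to 5) are assumptions made for illustration and do not come from any of the libraries mentioned.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of `word`, with boundary markers."""
    marked = f"<{word.lower()}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


keyword = "football"
for name in ["cheapfootball", "footballer", "soccershop"]:
    # Names containing the keyword share many n-grams with it; "soccershop" shares none.
    print(name, round(ngram_overlap(keyword, name), 3))
```

This handles the "no space between keyword and brand name" problem, but it is purely surface-based: "soccershop" gets no credit for being semantically close to "football", which is the part that needs embeddings.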
Check if this would do:
Multilingual BPE-based embeddings
Aligned multilingual sub-word vectors
If you're okay with whole word embeddings:
(Both of these are somewhat old, but I'm listing them here in case they help someone)
Multilingual FastText
ConceptNet NumberBatch
If you're okay with contextual embeddings:
Multilingual ELMo
XLM-RoBERTa
You can even try using the (sentence-piece tokenized) non-contextual input word embeddings instead of the output contextual embeddings, of the multilingual transformer implementations like XLM-R or mBERT. (Not sure how it will perform)
From experience, I think it might be a little misleading to build a model based on embeddings for this application. If there are two companies, football ltd and soccer ltd, the model might say they match, which might not be right.
One approach is to remove redundant words (e.g., "corporation" from "Facebook corporation", or "ltd" from "Facebook ltd") and then try matching.
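A minimal sketch of that clean-up step might look as follows. The suffix list is hypothetical and deliberately short, and the function name is made up for illustration; a real list would need to cover the legal forms that actually occur in your data.

```python
import re

# Hypothetical, non-exhaustive list of legal-form words; extend it for your data.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "corp", "corporation", "llc", "gmbh", "co"}


def strip_legal_suffixes(name):
    """Lowercase a company name, drop punctuation, and remove trailing legal-form words."""
    tokens = re.findall(r"[^\W_]+", name.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


for raw in ["Facebook corporation", "Facebook ltd", "cheapfootball ltd."]:
    print(repr(raw), "->", repr(strip_legal_suffixes(raw)))
```

Only trailing words are removed, so a name that legitimately contains one of these words in the middle is left alone.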
Another approach is to use deepmatcher, which uses deep learning fuzzy matching based on words context.
If sentence similarity is the primary approach you want to follow, the methods evaluated on STSBenchmark might be worth exploring.
Sent2vec and InferSent use fastText, but they seem to get good results on STSBenchmark.
If anyone is still looking for some sort of multilingual word embedding.
This might help:
https://github.com/babylonhealth/fastText_multilingual
As far as I know, it doesn't support sentences, and my attempts to make it work for sentences weren't really successful.
Related
I am working on a Named Entity Recognition (NER) project in which I have a large amount of text, too much to read or even skim. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and building an index of the form (entity, list of pages/lines where it is mentioned). I have worked through Stanford's NLP lecture and (parts of) Eisenstein's Introduction to NLP book, and I have found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I would not know whether I could solve this challenge even if the texts were in English.
As a first step
are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I thought about fine-tuning or transfer learning the base NER model to a corpus of historic texts to improve NER.
A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
Spacy
probably NLTK and OpenNLP as well
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
Btw you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't capture everything that you would like it to, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model, and this requires obtaining some annotated data (i.e. manually annotating it or paying somebody to do it).
Even in this case, a NER model is statistical and will unavoidably make some mistakes, so don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not too much) about the actual entity, because it's not primarily based on the words in the entity. It's based on indications in the surrounding text, for example in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by" and 'July' as a date because of "take place in". However if the language used in the corpus is very different from the training data used for the model, the performance could be very bad.
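Whichever NER system is used, the index described in the question (entity, list of pages where it is mentioned) can be built from its output in a few lines. The sketch below assumes the NER step has already produced (entity text, label, page number) triples; the sample triples and the function name are hypothetical.

```python
from collections import defaultdict


def build_entity_index(mentions):
    """Map each (entity, label) pair to the sorted list of pages it appears on."""
    index = defaultdict(set)
    for entity, label, page in mentions:
        index[(entity, label)].add(page)  # a set removes repeated mentions on one page
    return {key: sorted(pages) for key, pages in index.items()}


# Hypothetical NER output: (entity text, entity label, page number).
mentions = [
    ("Berlin", "LOC", 3),
    ("Goethe", "PER", 3),
    ("Berlin", "LOC", 7),
    ("Berlin", "LOC", 3),
]

index = build_entity_index(mentions)
for (entity, label), pages in sorted(index.items()):
    print(entity, label, pages)
```

Keeping the label in the key separates, for example, a person and a place that happen to share a name.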
I have a dataset that includes English, Spanish and German documents. I want to represent them using document embedding techniques to compute their similarities. However, as the documents are in different languages and each one is paragraph-sized, it is difficult to find a suitable pre-trained model (I do not have enough data for training).
I found some interesting models like Sent2Vec and LASER that also work in a multilingual context. However, both of them were designed for sentence representation. The question is twofold:
Is there any model that could be used to represent multilingual paragraphs?
Is it possible to employ sent2vec (or LASER) to represent paragraphs (I mean to represent each paragraph using an embeddings vector)?
Any help would be appreciated.
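One simple baseline for the second question is to embed each sentence of a paragraph separately and average the resulting vectors. The sketch below shows only the averaging step; the toy 3-dimensional vectors stand in for real sentence embeddings, and whether mean pooling preserves enough information for your similarity task is an assumption you would need to check on your data.

```python
def mean_pool(sentence_vectors):
    """Average a list of equal-length sentence vectors into one paragraph vector."""
    if not sentence_vectors:
        raise ValueError("need at least one sentence vector")
    dim = len(sentence_vectors[0])
    if any(len(vec) != dim for vec in sentence_vectors):
        raise ValueError("all sentence vectors must have the same dimension")
    count = len(sentence_vectors)
    return [sum(vec[i] for vec in sentence_vectors) / count for i in range(dim)]


# Toy vectors standing in for the embeddings of the two sentences of one paragraph.
paragraph_sentences = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
paragraph_vector = mean_pool(paragraph_sentences)
print(paragraph_vector)
```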
We have a news website where we have to match news to a particular user.
For the matching we can only use the user's textual information, for example the user's interests or a brief description of them.
I was thinking of treating both the user's textual information and the news text as documents and computing document similarity.
In this way I hope that, if my profile contains a sentence like "I loved the speech of the president in Chicago last year" and a news item says "Trump is going to speak in Illinois", I get a match (the example is purely arbitrary).
I first tried to embed my documents using TF-IDF and then ran k-means to see whether the clusters made sense, but I don't like the results much.
I think the problem comes from the poor embedding that TF-IDF gives me.
So I was thinking of using BERT to obtain embeddings of my documents and then using cosine similarity to compare two documents (a user profile and a news item).
Does this approach make sense? BERT can be used to obtain sentence embeddings, but is there a way to embed an entire document?
What would you advise?
Thank you
BERT is pre-trained on pairs of sentences, so it is unlikely to generalize to much longer texts. Also, the memory BERT requires grows quadratically with the length of the text, so very long texts might cause memory issues. In most implementations, it does not accept sequences longer than 512 subwords.
Making pre-trained Transformers work efficiently for long texts is an active research area; you can have a look at a paper called DocBERT to get an idea of what people are trying. But it will take some time until there is a nicely packaged working solution.
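If you still want to experiment with BERT on longer texts, a common workaround for the length limit is to split the subword sequence into overlapping windows and embed each window separately. The sketch below shows only the splitting; the function name and default values are assumptions, and the default window is smaller than 512 on the assumption that the model adds two special tokens per sequence.

```python
def chunk_tokens(tokens, max_len=510, stride=384):
    """Split a token list into overlapping windows of at most `max_len` tokens.

    Consecutive windows start `stride` tokens apart, so they overlap by
    `max_len - stride` tokens.
    """
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("require 0 < stride <= max_len")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


# Small numbers so the windows are easy to inspect.
windows = chunk_tokens(list(range(10)), max_len=4, stride=2)
print(windows)
```

How to combine the per-window embeddings afterwards (for example by averaging them) is a separate design choice that this sketch does not settle.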
There are also other methods for document embedding, for instance Gensim implements doc2vec. However, I would still stick with TF-IDF.
TF-IDF is typically very sensitive to data pre-processing. You certainly need to remove stopwords, and in many languages it also pays off to lemmatize. Given the specific domain of your texts, you can also try extending the standard list of stop words with words that appear frequently in news stories. You can get further improvements by detecting named entities and keeping them together.
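As a minimal sketch of the TF-IDF route, the following computes TF-IDF vectors with stop-word removal and compares a user profile to news items by cosine similarity, using only the standard library. The stop-word list is a tiny illustrative one, the documents are made up, and the smoothed IDF formula is one common variant rather than the only possible choice.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and", "i", "about"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed IDF: log((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "I loved the speech of the president in Chicago",  # user profile
    "The president is going to speak in Illinois",     # news item A
    "Recipe for a quick pasta dinner",                 # news item B
]
profile, news_a, news_b = tfidf_vectors(docs)
print(cosine(profile, news_a), cosine(profile, news_b))
```

In this toy example the profile and news item A match only through the shared word "president"; "speech"/"speak" and "Chicago"/"Illinois" contribute nothing, which is where lemmatization and entity handling come in.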
I'm kind of a newbie and not a native English speaker, so I have some trouble understanding Gensim's word2vec and doc2vec.
I think both give me the words most similar to the query word I request, via most_similar() (after training).
How can I tell in which cases I should use word2vec or doc2vec?
Could someone briefly explain the difference, please?
Thanks.
In word2vec, you train to find word vectors and then run similarity queries between words. In doc2vec, you tag your texts and you also get tag vectors. For instance, suppose you have documents from different authors and use the authors as tags on the documents. Then, after doc2vec training, you can use the same vector arithmetic to run similarity queries on author tags, i.e. who are the most similar authors to AUTHOR_X? If two authors generally use the same words, then their vectors will be closer. AUTHOR_X is not a real word in your corpus, just a label you choose, so you don't need to have it in your text or insert it manually. Gensim allows you to train doc2vec with or without word vectors (i.e. if you only care about similarities between tags).
Here is a good presentation on word2vec basics and how they use doc2vec in an innovative way for product recommendations (related blog post).
If you tell me what problem you are trying to solve, maybe I can suggest which method is more appropriate.
How do I get a word vector representation when using Deep Learning in NLP? The words are represented by fixed-length vectors; see http://machinelearning.wustl.edu/mlpapers/paper_files/BengioDVJ03.pdf for more details.
Deep Learning and NLP are quite complex subjects, so if you really want to understand them you'll need to follow a couple of courses in the field and read many papers. There are lots of different techniques for converting words into vector representations and it's a very active area of research. Socher's DL for NLP tutorial is a good next step if you are already well acquainted with NLP and Machine Learning (including deep learning).
With that said (and considering this is a programming forum), if for now you are just interested in using someone else's tools to quickly obtain vector representations that can be useful in some tasks, one library you should look at is word2vec. Take a look at its website: https://code.google.com/p/word2vec/. It's a very powerful tool, and for basic use it does not require much background knowledge.
To get the vector for a word, you can use the 300-dimensional Google News word vector model.
Download the model from here - https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing OR from here
https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz .
After downloading, load the model using the gensim Python library as below:
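from gensim.models import KeyedVectors
# Load Google's pre-trained Word2Vec model (binary word2vec format).
# Recent gensim versions expose this loader on KeyedVectors, not on Word2Vec.
model = KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)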
Then just query the model for the vector corresponding to a word, like
model['usa']
It returns a 300-dimensional word vector for usa.
Note that you may not find word vectors for all words in this model.
Also, other models can be used instead of this Google News model.
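Because some words are missing from the vocabulary, it can help to check for membership before indexing. The helper below is a hypothetical convenience function, not part of gensim; it relies only on the `in` operator and indexing, both of which gensim's KeyedVectors supports, and uses a plain dict as a stand-in so the sketch runs without downloading the model.

```python
def lookup_vector(model, word, default=None):
    """Return model[word] if the word is in the vocabulary, otherwise `default`."""
    return model[word] if word in model else default


# A plain dict stands in for the loaded model; the values are made-up toy vectors.
toy_model = {"usa": [0.1, 0.2, 0.3]}

print(lookup_vector(toy_model, "usa"))
print(lookup_vector(toy_model, "notaword"))
```

With the real model you would pass the loaded `model` object instead of `toy_model`.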