How to find the languages supported by Sentence Transformer models?

I want to get sentence embeddings to compute sentence similarities in my NLP project. Since I am working with a low-resource language (Sinhala), I want to know whether any sentence_transformers model supports it. However, I was unable to find which languages those models were pre-trained on. How can I find that?
If those models are not trained on this language, how can I implement a sentence embedding model?
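One practical check, sketched below, is to load a multilingual model with the sentence-transformers library and try it on sentences in your language. The model name used here (LaBSE, which covers 100+ languages) is only an example; whether it actually covers Sinhala should be confirmed on its Hugging Face model card.
from sentence_transformers import SentenceTransformer, util

# Example multilingual model; check its model card to confirm language coverage.
model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = ["first sentence here", "second sentence here"]  # replace with Sinhala text
embeddings = model.encode(sentences)

# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]))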

Related

How to create a custom BERT language model for a different language?

I want to create a language translation model using transformers. However, TensorFlow seems to only have a BERT model for English (https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4). If I want a BERT model for another language, what is the best way to accomplish this? Should I create a new BERT, or can I train TensorFlow's own BertTokenizer on another language?
The Hugging Face model hub contains a plethora of pre-trained monolingual and multilingual transformers (and the corresponding tokenizers) which can be fine-tuned for your downstream task.
However, if you are unable to locate a suitable model for your language, then yes, training from scratch is the only option. Beware, though, that training from scratch is a resource-intensive task that will require significant compute power. Here is an excellent blog post to get you started.
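As a rough illustration of the first option, the snippet below loads a multilingual checkpoint from the Hub with the transformers library; bert-base-multilingual-cased is just one example of such a checkpoint, not a specific recommendation.
from transformers import AutoTokenizer, AutoModel

# Example multilingual checkpoint from the Hugging Face model hub.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer("Some text in your target language", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, num_tokens, hidden_size)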

BERT multilingual model - For classification

I am trying to build a multilingual classification model with BERT.
I'm using a feature-based approach (concatenating the features from the top-4 hidden layers) and building a CNN classifier on top of that.
After that I'm testing with a different language (say Chinese) from the same domain, but the accuracy for these languages is near zero.
I am not sure that I understand the paper well, so here is my question:
Is it possible to fine-tune the BERT multilingual model on one language (e.g. English), or to use the feature-based approach to extract features and build a classifier, and then use that model for different languages (other languages from the list of supported languages in BERT's documentation)?
Also, is my hypothesis correct that BERT's embedding layer maps words from different languages that share the same context to similar clusters?
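For context, here is a minimal sketch of the feature-based extraction described above, assuming the Hugging Face transformers implementation of multilingual BERT; the model name and the top-4 layer choice simply mirror the question, not a verified recipe.
import torch
from transformers import AutoTokenizer, AutoModel

# Multilingual BERT with all hidden states exposed.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

inputs = tokenizer("An English training sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors, each (batch, tokens, hidden).
# Concatenate the top-4 layers along the feature dimension, as in the question.
features = torch.cat(outputs.hidden_states[-4:], dim=-1)
print(features.shape)  # (1, num_tokens, 4 * hidden_size)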

Is there a way to use French in Stanford CoreNLP sentiment analysis?

I am aware that only the English model is available for sentiment analysis, but I found edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz in stanford-parser-3.5.2-models.jar. I'm actually looking at https://github.com/stanfordnlp/CoreNLP. Is it possible to use this model instead of englishPCFG.ser.gz with CoreNLP, and if so, how?
CoreNLP does not include sentiment models for languages other than English. While we do ship French parser models, there is no available French sentiment model to use with the parser.
You may be able to find French sentiment analysis training data. There is plenty of information available about how to do this if you're interested; see e.g. this SO post.

Biasing word2vec towards special corpus

I am new to Stack Overflow. Please forgive my bad English.
I am using word2vec for a school project. I want to create word vectors with Word2Vec from a domain-specific corpus (a physics textbook, for example). On its own this does not give good results because the corpus is small. It especially hurts because we want to evaluate on words that may well be outside the textbook's vocabulary.
We want the textbook to encode the domain-specific relationships and semantic "nearness". For example, "Quantum" and "Heisenberg" are especially close in this textbook, which may not hold true for a background corpus. To handle generic words (like "any") we need a basic background model (like the one provided by Google on the word2vec site).
Is there any way we can supplement the background model with our newer corpus? Just training on the corpus alone does not work well.
Are there any attempts to combine vector representations from two corpora, one general and one specific? I could not find any in my searches.
Let's talk about gensim, since you tagged your question with it. You can load a previously trained model in Python using gensim and then continue training it. Would that be useful?
import gensim

# Load a model previously trained and saved with gensim:
model = gensim.models.Word2Vec.load(fname)
# Or load vectors stored in the original word2vec C format:
# model = gensim.models.Word2Vec.load_word2vec_format('/path/vectors.bin', binary=True)

# Continue training on the new, domain-specific sentences
# (recent gensim versions also require total_examples and epochs arguments here):
model.train(other_sentences)
model.save(fname)
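A quick sanity check, assuming the continued training above has run, is to look at the nearest neighbours of a domain term and see whether they have shifted toward the textbook's usage; the query word below is only an illustration taken from the question.
# Inspect neighbours of a domain-specific term after continued training.
# (In gensim 4.x the equivalent call is model.wv.most_similar.)
print(model.most_similar('quantum', topn=10))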

How to get word vector representation when using Deep Learning in NLP

How do I get word vector representations when using deep learning in NLP? The words are represented by a fixed-length vector; see http://machinelearning.wustl.edu/mlpapers/paper_files/BengioDVJ03.pdf for more details.
Deep learning and NLP are quite complex subjects, so if you really want to understand them you'll need to follow a couple of courses in the field and read many papers. There are lots of different techniques for converting words into vector representations, and it's a very active area of research. Socher's DL for NLP tutorial is a good next step if you are already well acquainted with NLP and machine learning (including deep learning).
With that said (and considering it's a programming forum), if for now you are just interested in using someone else's tools to quickly obtain vector representations that can be useful in some tasks, one library you should look at is word2vec. Take a look at its website: https://code.google.com/p/word2vec/. It's a very powerful tool, and for some basic things it can be used without much background knowledge.
To get the word vector for a word, you can use the 300-dimensional Google News word vector model.
Download the model from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing or from https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz.
After downloading, load the model using the gensim Python library as below:
import gensim

# Load Google's pre-trained Word2Vec model.
# (In recent gensim versions, use gensim.models.KeyedVectors.load_word2vec_format instead.)
model = gensim.models.Word2Vec.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
Then just query the model for the word vector corresponding to a word, e.g.
model['usa']
and it returns a 300-dimensional word vector for "usa".
Note that you may not find word vectors for all words in this model.
Also, other pre-trained models can be used instead of this Google News model.
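Given the note above about missing words, a minimal guard against out-of-vocabulary lookups, assuming the model loaded earlier, might look like this:
word = 'usa'
try:
    vector = model[word]        # a 300-dimensional numpy array
    print(vector.shape)
except KeyError:
    print("'%s' is not in the model's vocabulary" % word)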
