Dataset Language identification - nlp

I am working on a text classification problem with a multilingual dataset. I would like to know how the languages are distributed in my dataset and what languages are these. The number of languages might be approximately 8-12. I am considering this language detection as a part of the preprocessing. I would like to figure out the languages in order to be able to use the appropriate stop words and see how less data in some of the given languages could affect the occuracy of the classificatin.
Is langid.py or simple langdetect suitable? or any other suggestions?
Thanks

The easiest way to identify the language of a text is to have a list of common grammatical words of each language (pretty much your stop words, in fact), take a sample of the text and count which words occur in your (language-specific) word lists. Then sum them up and the word list with the largest overlap should be the language of the text.
If you want to be more advanced, you can use n-grams instead of words: collect n-grams from a text you know the language of, and use that as a classifier instead of your stop words.

You could use any transformer-based model trained on multiple languages. For instance, you could use XLM-Roberta which is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require lang tensors to understand which language is used (which is good in your case), and should be able to determine the correct language from the input ids. Besides like any other transformer based model, it comes with its tokenizer so you could jump the preprocessing part.
You could use the Huggingface library to use any of these models.
Check the XLM Roberta Huggingface documentation here

Related

Is there an embeddings technique to represent multilingual paragraphs?

I have a dataset that includes English, Spanish and German documents. I want to represent them using document embeddings techniques to compute their similarities. However, as the documents are in different languages and the length of each one is paragraph-sized, it is difficult to find a pre-trained model (I do not have enough data for training) .
I found some interesting models like Sent2Vec and LASER that also work on multilingual context. However, both of them have been implemented for sentence representation. The question is two folds:
Is there any model that could be used to represent multilingual paragraphs?
Is it possible to employ sent2vec (or LASER) to represent paragraphs (I mean to represent each paragraph using an embeddings vector)?
Any help would be appreciated.

Semantic Similarity across multiple languages

I am using word embeddings for finding similarity between two sentences. Using word2vec, I also get a similarity measure if one sentence is in English and the other one in Dutch (though not very good).
So I started wondering if it's possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially if the languages have some similarities (Englis/Dutch)?
Let's assume that your sentence-similarity scheme uses only word-vectors as an input – as in simple word-vector averaging schemes, or Word Mover's Distance.
It should be possible to do what you've suggested, provided that:
you have good sets of word-vectors for each language's words
the coordinate spaces of the word-vectors are compatible, meaning the words for the exact-same things in both languages have nearly-identical coordinates (and other words with similar meanings have close coordinates)
That second quality is not automatically assured. In fact, given the random initialization of word2vec models, and other randomization introduced by the algorithm/implementation, even subsequent training runs on the exact same data won't place words into the exact same places. So word-vectors trained on totally-separate English/Dutch corpuses won't likely place equivalent words at the same coordinates.
But, you can learn an algebraic-transformation between two spaces, based on certain anchor/reference word-pairs (that you know should have similar vectors). You can then apply that transformation to all words in one of the two sets, which results in you having vectors for those 'foreign' words within the comparable coordinate-space of the 'canonical' word-set.
In fact this very idea was used in one of the first word2vec papers:
"Exploiting Similarities among Languages for Machine Translation"
If you were to apply a similar transformation on one of your language word-vector sets, then use those transformed vectors as inputs to your sentence-vector scheme, those sentence-vectors would likely have some useful comparability to sentence-vectors in the other language, bootstrapped from word-vectors in the same coordinate-space.
Update: There's a very interesting recent paper that manages to train word-vectors in multiple languages simultaneously, using a corpus that includes both raw sentences in each single language, and a (smaller) set of aligned-sentences that are known to mean the same in both languages. Gensim doesn't yet support this mode, but there's discussion of supporting it in a future refactor.
I've recently produced a Python implementation of the technique mentioned in the paper from #gojomo's answer: transvec.
You'll need to provide word translation pairs as training data (I just threw words from my corpus into Google Translate to get as many such pairs as I can) and then you can use a wrapper model from transvec to produce comparable word embeddings for multiple languages. Here's an example:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
For the case of documents rather than words, things are a little trickier because Doc2Vec can't use pre-trained Word2Vec models as a starting point. However, you can get an approximate document vector by simply taking the mean of all the word vectors from that document. If you provide a 2d array to TranslationWordVectorizer's transform method, it will do exactly this and provide you with an approximate document vector so you can find documents with similar meaning even if the languages are different.

Natural Language Generation - how to go beyond templates

We've build a system that analyzes some data and outputs some results in plain English (i.e. no charts etc.). The current implementation relies on lots of templates and some randomization in order to give as much diversity to the text as possible.
We'd like to switch to something more advanced with the hope that the produced text is less repetitive and sounds less robotic. I've searched a lot on google but I cannot find something concrete to start from. Any ideas?
EDIT: The data fed to the NLG mechanism are in JSON format. Here is an example about web analytics data. The json file may contain for example a metric (e.g. visits), it's value in the last X days, whether the last value is expected or not and which dimensions (e.g. countries or marketing channels) affected its change.
The current implementation could give something like this:
Overall visits in the UK mainly from ABC email campaign reached 10K (+20% DoD) and were above the expected value by 10%. Users were mainly landing on XXX page while the increase was consistent across devices.
We're looking to finding a way to depend less on templates, sound even more natural and increase the vocabulary.
What you are looking for is a hot research area and a pretty tough task. Currently there is no way to generate 100% meaningful diverse and natural sentences. one approach to generate sentences is using n-grams. using these method you can generate sentences that look more natural and diverse that may look good but probably meaningless and grammatically incorrect.
A more up to date approach is using Deep learning. anyway if you want to generate meaningful sentences, maybe your best way is using your current template based method.
You can find an introduction to basics of n-gram based NLG here:
Generating Random Text with Bigrams
this tool sounds to implement some of the most famous techniques for natural language generation: simplenlg
Have you tried Neural Networks especially LSTM and GRU architectures? These models are the most recent developments in predicting sequences of words. Generating natural language means to generate a sequence of words such that it makes sense with respect to the input and earlier words in the sequence. This is equivalent to predicting time series. LSTM is designed for predicting time series. Hence, it is commonly used to predict a sequence of words, given an input sequence, an input word, or any other input that can be embedded in a vector.
Deep learning libraries such as Tensorflow, Keras, and Torch all have sequence to sequence implementations that can be used for generating natural language by predicting a sequence of words given an input.
Note that usually these models need a huge amount of training data.
You need to meet two criteria in order to benefit from such models:
You should be able to represent your input as a vector.
You need a relatively large amount of input/target pairs.

How to use hmmlearn to classify English text?

I want to implement a classic Markov model problem: Train MM to learn English text patterns, and use that to detect English text vs. random strings.
I decided to use hmmlearn so I don't have to write my own. However I am confused about how to train it. It seems to require the number of components in the HMM, but what is a reasonable number for English? Also, can I not do a simple higher order Markov model instead of hidden? Presumably the interesting property is is patterns of ngrams, not hidden states.
hmmlearn is designed for unsupervised learning of HMMs, while your problem is clearly supervised: given examples of English and random strings, learn to distinguish between the two. Also, as you've correctly pointed it out, the notion of hidden states is tricky to define for text data, therefore for your problem plain MMs would be more appropriate. I think you should be able to implement them in <100 lines of code in Python.

Stanford NER prop file meaning of DistSim

In one of the example .prop files coming with the Stanford NER software there are two options I do not understand:
useDistSim = true
distSimLexicon = /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters
Does anyone have a hint what DistSim stands for and where I can find any more documentation on how to use these options?
UPDATE: I just found out that DistSim means distributional similarity. I still wonder what that means in this context.
"DistSim" refers to using features based on word classes/clusters, built using distributional similarity clustering methods (e.g., Brown clustering, exchange clustering). Word classes group words which are similar, semantically and/or syntactically, and allow an NER system to generalize better, including handling words not in the training data of the NER system better. Many of our distributed models use a distributional similarity clustering features as well as word identity features, and gain significantly from doing so. In Stanford NER, there are a whole bunch of flags/properties that affect how distributional similarity is interpreted/used: useDistSim, distSimLexicon, distSimFileFormat, distSimMaxBits, casedDistSim, numberEquivalenceDistSim, unknownWordDistSimClass, and you need to look at the code in NERFeatureFactory.java to decode the details, but in the simple case, you just need the first two, and they need to be used while training the model, as well as at test time. The default format of the lexicon is just a text file with a series of lines with two tab separated columns of word clusterName. The cluster names are arbitrary.

Resources