Can CMU Sphinx support multiple languages in a sentence?

I know CMU Sphinx has many language models, dictionaries, and acoustic models.
I want to recognize a sentence which may contain several languages, for example, English and Mandarin.
Can it be done?

Yes, but it is better to try a more modern framework like Kaldi. You will also have to train models from your own data; there are no pretrained models for mixed-language recognition.

Related

How to find Sentence Transformer support languages?

I want to compute sentence embeddings to measure sentence similarity in my NLP project. Since I am working with a low-resource language (Sinhala), I want to know whether any sentence_transformer model supports it. However, I was unable to find which languages those models were pre-trained on. How can I find that?
If those models are not trained on this language, how can I implement a sentence embedding model?
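For what it's worth, a minimal sketch of the similarity step with the sentence-transformers package, assuming LaBSE as the model (one broadly multilingual checkpoint; its model card on the Hugging Face Hub lists the covered languages, so check there whether Sinhala is included):

```python
# Minimal sketch: multilingual sentence similarity with sentence-transformers.
# LaBSE is just one candidate model; verify on its model card that your
# language (e.g. Sinhala) is among the languages it was trained on.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = ["first sentence here", "second sentence here"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings.
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))
```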

Dataset Language identification

I am working on a text classification problem with a multilingual dataset. I would like to know how the languages are distributed in my dataset and which languages they are. The number of languages is probably around 8-12. I am treating this language detection as part of the preprocessing: identifying the languages would let me use the appropriate stop words and see how having less data in some languages affects the accuracy of the classification.
Would langid.py or the simpler langdetect be suitable, or are there other suggestions?
Thanks
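As a rough, hedged illustration (not part of the original question): langdetect could produce the per-language distribution described above. Note that langdetect is stochastic unless the seed is fixed:

```python
# Sketch: estimating the language distribution of a dataset with langdetect.
from collections import Counter
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is randomized; fix the seed for stable output

docs = [
    "This is an English sentence.",
    "Ceci est une phrase française.",
    "Dies ist ein deutscher Satz.",
]
distribution = Counter(detect(doc) for doc in docs)
print(distribution)  # e.g. Counter({'en': 1, 'fr': 1, 'de': 1})
```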
The easiest way to identify the language of a text is to keep a list of common grammatical words for each language (pretty much your stop words, in fact), take a sample of the text, and count how many of its words occur in each (language-specific) word list. The word list with the largest overlap should indicate the language of the text.
If you want to be more advanced, you can use n-grams instead of words: collect n-grams from a text you know the language of, and use that as a classifier instead of your stop words.
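A minimal sketch of the word-list overlap idea (the tiny word lists here are illustrative placeholders, not real stop-word lists):

```python
# Sketch: language identification by stop-word overlap, as described above.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "de", "dans"},
    "de": {"der", "die", "und", "ist", "von", "in"},
}

def guess_language(text: str) -> str:
    tokens = set(text.lower().split())
    # Count the overlap with each language-specific word list; largest wins.
    overlaps = {lang: len(tokens & words) for lang, words in STOP_WORDS.items()}
    return max(overlaps, key=overlaps.get)

print(guess_language("the cat is sleeping in the garden"))  # -> 'en'
```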
You could use any transformer-based model trained on multiple languages. For instance, you could use XLM-RoBERTa, a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require language tensors to understand which language is used (which is good in your case) and should be able to determine the correct language from the input ids alone. Besides, like any other transformer-based model, it comes with its own tokenizer, so you can skip most of the preprocessing.
You can load any of these models with the Huggingface library.
Check the XLM-RoBERTa Huggingface documentation here.
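A hedged sketch of loading XLM-RoBERTa for sequence classification with the transformers library; note that the classification head below is freshly initialized, so it must be fine-tuned on your labelled data before its outputs mean anything:

```python
# Sketch: multilingual text classification with XLM-RoBERTa (transformers).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # untrained head: fine-tune before use
)

# No language tensors needed: the model works from the input ids alone.
inputs = tokenizer("Ceci est un exemple.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```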

Is there a way to use french in Stanford CoreNLP sentiment analysis?

I am aware that only the English model is available for sentiment analysis, but I found edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz in stanford-parser-3.5.2-models.jar. I'm actually looking at https://github.com/stanfordnlp/CoreNLP Is it possible to use this model instead of englishPCFG.ser.gz with CoreNLP, and if so, how?
CoreNLP does not include sentiment models for languages other than English. While we do ship French parser models, there is no available French sentiment model to use with the parser.
You may be able to find French sentiment analysis training data. There is plenty of information available about how to do this if you're interested; see e.g. this SO post.
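As a hedged aside (not from the original answer): the shipped French models can be used for parsing, just not for sentiment. One way, assuming CoreNLP and its French models jar are installed, is to run the CoreNLP server and drive it from Python via the stanza client:

```python
# Sketch: parsing French with CoreNLP's bundled French models via the stanza
# client. Assumes CORENLP_HOME points at a CoreNLP install that includes the
# French models jar; this gives you parsing only, not sentiment.
from stanza.server import CoreNLPClient

with CoreNLPClient(
    annotators=["tokenize", "ssplit", "pos", "parse"],
    properties="french",  # load the bundled French defaults, incl. the parser model
    timeout=30000,
    memory="4G",
) as client:
    ann = client.annotate("Le chat dort sur le tapis.")
    print(ann.sentence[0].parseTree)
```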

Part of speech tagging in OpenNLP vs. StanfordNLP

I'm new to part-of-speech (POS) tagging, and I'm tagging a text document. I'm considering using either OpenNLP or StanfordNLP for this. For StanfordNLP I'm using a MaxentTagger loaded with the english-left3words-distsim.tagger model; in OpenNLP I'm using a POSModel loaded from en-pos-maxent.bin. How do these two taggers (MaxentTagger and POSTagger) and their models (english-left3words-distsim.tagger and en-pos-maxent.bin) differ, and which one usually gives better results?
Both POS taggers are based on maximum entropy machine learning. They differ in the parameters/features used to determine POS tags. For example, the StanfordNLP POS tagger uses: "(i) more extensive treatment of capitalization for unknown words; (ii) features for the disambiguation of the tense forms of verbs; (iii) features for disambiguating particles from prepositions and adverbs" (read more in the paper). The features used by OpenNLP are documented elsewhere; I don't currently know the details.
The models are probably trained on different corpora.
In general, it is really hard to tell which NLP tool performs better in terms of quality. This depends heavily on your domain, and you need to test the tools yourself (a minimal evaluation sketch follows at the end of this answer). See the following papers for more information:
Is Part-of-Speech Tagging a Solved Task?
Large Dataset for Keyphrases Extraction
In order to address this problem practically, I'm developing a Maven plugin and an annotation tool to create domain-specific NLP models more effectively.
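To make the "test your tools" advice above concrete, here is a minimal, tool-agnostic sketch (all names are placeholders I introduced) for scoring any tagger against a small hand-labelled sample from your own domain:

```python
# Sketch: token-level accuracy of a POS tagger on hand-labelled domain data.
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[str]]  # tokens in, predicted tags out

def tagging_accuracy(tagger: Tagger,
                     gold: List[Tuple[List[str], List[str]]]) -> float:
    correct = total = 0
    for tokens, gold_tags in gold:
        predicted = tagger(tokens)
        correct += sum(p == g for p, g in zip(predicted, gold_tags))
        total += len(gold_tags)
    return correct / total

# Wrap MaxentTagger, OpenNLP's POSTagger, etc. behind the Tagger interface
# and compare them on the same gold sample from your domain.
```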

What are some good tools/practises for aspect level sentiment analysis?

I am planning to collect some review data from TripAdvisor, and I want to be able to extract hotel-related aspects, assign polarity to them, and classify them as negative or positive.
What tools can I use for this purpose, and how and where do I start? I know there are some tools like GATE, Stanford NLP, OpenNLP, etc., but would I be able to perform the above specific tasks with them? If so, please let me know an approach to go forward. I am planning to use Java as the programming language and would like to use some APIs.
Also, should I go with a rule-based approach, an ML approach that uses a trained corpus of reviews, or some other approach entirely?
P.S.: I am new to NLP and need some help getting started.
Stanford CoreNLP has a lot of features in one package:
POS Tagger
NER Model
Sentiment Analysis
Parser
The Apache OpenNLP package, by contrast, consists of:
Sentence Detector
POS tagger
NER
Chunker
However, OpenNLP has no built-in feature for determining sentiment polarity, so you have to pass your tags to another resource such as SentiWordNet to find the polarity.
I have used both OpenNLP and Stanford CoreNLP. For either one, you need to adapt the sentiment corpus to the restaurant/hotel domain.
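A small sketch of the SentiWordNet lookup mentioned above, using NLTK's interface (the word and POS here are only illustrative):

```python
# Sketch: polarity lookup in SentiWordNet via NLTK.
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)

# 'a' = adjective; we take the first sense for illustration. A real system
# needs word-sense disambiguation to pick the right synset.
senses = list(swn.senti_synsets("dirty", "a"))
if senses:
    sense = senses[0]
    print(sense.pos_score(), sense.neg_score(), sense.obj_score())
```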
You can try ConceptNet (http://conceptnet5.media.mit.edu/). See, for instance, the bottom of this page: https://github.com/commonsense/conceptnet5/wiki/API, which shows how to "see 20 things in English with the most positive affect".
