I want beam search code for language translation in PyTorch, and I want to use it with the sample code from this tutorial:
https://pytorch.org/tutorials/beginner/translation_transformer.html
That page only shows how to do greedy decoding in the translation process.
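For reference, here is a minimal beam-search sketch that mirrors the tutorial's greedy_decode function. It assumes the names defined in that tutorial are already in scope (model.encode / model.decode / model.generator, generate_square_subsequent_mask, EOS_IDX, BOS_IDX, DEVICE); the beam_size parameter and the length-normalized scoring are illustrative choices, not something taken from the tutorial.

import torch

def beam_search_decode(model, src, src_mask, max_len, start_symbol, beam_size=5):
    # Encode the source once; all hypotheses share the same memory.
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)
    memory = model.encode(src, src_mask)
    # Each hypothesis is (token tensor of shape [len, 1], cumulative log-probability).
    beams = [(torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE), 0.0)]
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for ys, score in beams:
            if ys[-1].item() == EOS_IDX:
                finished.append((ys, score))
                continue
            tgt_mask = generate_square_subsequent_mask(ys.size(0)).type(torch.bool).to(DEVICE)
            out = model.decode(ys, memory, tgt_mask).transpose(0, 1)
            log_probs = torch.log_softmax(model.generator(out[:, -1]), dim=-1)
            top_lp, top_idx = log_probs.topk(beam_size, dim=-1)
            for lp, idx in zip(top_lp[0], top_idx[0]):
                new_ys = torch.cat([ys, torch.ones(1, 1).type_as(ys).fill_(idx.item())], dim=0)
                candidates.append((new_ys, score + lp.item()))
        if not candidates:  # every beam has already emitted EOS
            break
        # Keep the best beam_size partial hypotheses, scored per token.
        candidates.sort(key=lambda c: c[1] / c[0].size(0), reverse=True)
        beams = candidates[:beam_size]
    finished.extend(beams)
    finished.sort(key=lambda c: c[1] / c[0].size(0), reverse=True)
    return finished[0][0]

It is called the same way greedy_decode is called inside the tutorial's translate function, e.g. beam_search_decode(model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).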
Related
I am working on a text classification problem with a multilingual dataset. I would like to know how the languages are distributed in my dataset and which languages they are. The number of languages might be approximately 8-12. I am considering this language detection as part of the preprocessing. I would like to figure out the languages in order to be able to use the appropriate stop words and to see how having less data in some of the languages could affect the accuracy of the classification.
Is langid.py or simply langdetect suitable? Or are there any other suggestions?
Thanks
The easiest way to identify the language of a text is to have a list of common grammatical words of each language (pretty much your stop words, in fact), take a sample of the text and count which words occur in your (language-specific) word lists. Then sum them up and the word list with the largest overlap should be the language of the text.
If you want to be more advanced, you can use n-grams instead of words: collect n-grams from a text you know the language of, and use that as a classifier instead of your stop words.
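To make the counting idea concrete, here is a tiny sketch of that approach; the word lists are deliberately small illustrative samples, not real stop-word lists:

# Minimal stop-word-overlap language guesser (illustrative word lists only).
STOPWORDS = {
    'en': {'the', 'and', 'is', 'of', 'to', 'in', 'that'},
    'de': {'der', 'die', 'und', 'ist', 'von', 'zu', 'das'},
    'fr': {'le', 'la', 'et', 'est', 'de', 'que', 'les'},
}

def guess_language(text):
    tokens = text.lower().split()
    # Count how many tokens of the sample fall into each language's list.
    scores = {lang: sum(tok in words for tok in tokens) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language('die Katze ist auf dem Tisch und das Buch liegt daneben'))  # -> 'de'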
You could use any transformer-based model trained on multiple languages. For instance, you could use XLM-RoBERTa, which is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require lang tensors to understand which language is used (which is good in your case), and should be able to determine the correct language from the input ids. Besides, like any other transformer-based model, it comes with its own tokenizer, so you can skip that part of the preprocessing.
You could use the Huggingface library to use any of these models.
Check the XLM Roberta Huggingface documentation here
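For a quick start, here is a sketch of loading XLM-RoBERTa with the transformers library; the checkpoint name xlm-roberta-base is the public base model, and num_labels=4 is just a placeholder for however many classes your task has:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModelForSequenceClassification.from_pretrained('xlm-roberta-base', num_labels=4)

# The same tokenizer handles any of the 100 training languages, so no
# per-language preprocessing or lang tensors are needed.
inputs = tokenizer('Das ist ein Beispielsatz.', return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 4])

You would then fine-tune this classification head on your labeled data like any other sequence classification model.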
What is the difference between using CoreNLP (https://stanfordnlp.github.io/CoreNLP/ner.html) and the standalone distribution Stanford NER (https://nlp.stanford.edu/software/CRF-NER.html) for doing Named Entity Recognition? I noticed that the standalone distribution comes with a GUI, but are there any other differences in terms of supported functionality?
I'm trying to decide which one to use for a commercial purpose. I'm working on English models only.
There's no difference in terms of what algorithm is run. I would suggest the full version since you can use the pipeline code. But both versions use the exact same code for the actual NER part.
I'm trying to get started with the gensim library. My goal is pretty simple: I want to use the keyword extraction provided by gensim on a German text. Unfortunately, I'm failing hard.
Gensim comes with keyword extraction built in; it is built on TextRank. While the results look good on English text, it does not seem to work on German. I simply installed gensim via PyPI and used it out of the box. Such AI products are usually driven by a model, and my guess is that gensim comes with an English model. A word2vec model for German is available on a GitHub page.
But here I'm stuck: I can't find a way to make the summarization module of gensim, which provides the keywords function I'm looking for, work with an external model.
So the basic question is: how do I load the German model and get keywords from German text?
Thanks
There's nothing in the gensim docs, or in the original TextRank paper (from 2004), suggesting that the algorithm requires a Word2Vec model as input. (Word2Vec was first published around 2013.) It just takes word tokens.
See examples of its use in the tutorial notebook that's included with gensim:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb
I'm not sure the same algorithm would work as well on German text, given the differing importance of compound words. (To my eyes, TextRank isn't very impressive with English, either.) You'd have to check the literature to see if it still gives respected results. (Perhaps some sort of extra stemming/intraword-tokenizing/canonicalization would help.)
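For completeness, here is how the keyword extractor is typically called. Note that gensim.summarization was removed in gensim 4.0, so this assumes a 3.x install; no word2vec model is involved at any point:

from gensim.summarization import keywords

text = (
    'Gensim provides topic modelling, document indexing and similarity '
    'retrieval. The keywords function ranks words in a text with a '
    'graph-based TextRank algorithm and returns the top-ranked words.'
)
print(keywords(text, words=5, split=True))

Passing German text works the same way; whether the output is useful is the separate quality question raised above.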
I am planning to get some review data from TripAdvisor, and I want to be able to extract hotel-related aspects, assign polarity to them, and classify them as negative or positive.
What tools can I use for this purpose, and how and where do I start? I know there are some tools like GATE, Stanford NLP, OpenNLP, etc., but would I be able to perform the above specific tasks with them? If so, please let me know an approach to move forward. I am planning to use Java as the programming language and would like to use some APIs.
Also, should I go with a rule-based approach, an ML approach that uses a trained corpus of reviews, or some other approach entirely?
P.S.: I am new to NLP and I need some help to get started.
Stanford CoreNLP has a lot of features in one package:
POS Tagger
NER Model
Sentiment Analysis
Parser
The Apache OpenNLP package, on the other hand, consists of:
Sentence Detector
POS tagger
NER
Chunker
But OpenNLP doesn't have a built-in feature to determine sentiment polarity, so you have to pass your tags to another library, such as SentiWordNet, to find the polarity.
I have used both OpenNLP and Stanford CoreNLP, but for both you need to adapt the sentiment corpus to the restaurant domain.
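If you go the SentiWordNet route from Python, NLTK ships an interface to it; this sketch assumes the nltk package with the wordnet and sentiwordnet corpora downloaded:

import nltk
nltk.download('wordnet')
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

# Look up polarity scores for a specific WordNet sense of a word.
breakdown = swn.senti_synset('good.a.01')
print(breakdown.pos_score(), breakdown.neg_score(), breakdown.obj_score())

# Or scan all senses of a surface form.
for s in swn.senti_synsets('terrible'):
    print(s)

You still need POS tags (e.g. from OpenNLP or CoreNLP) to pick the right sense, as mentioned above.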
You can try ConceptNet (http://conceptnet5.media.mit.edu/). See, for instance, the bottom of this page, https://github.com/commonsense/conceptnet5/wiki/API, for how to "see 20 things in English with the most positive affect".
How do I get word vector representations when using deep learning in NLP? The words are represented by a fixed-length vector; see http://machinelearning.wustl.edu/mlpapers/paper_files/BengioDVJ03.pdf for more details.
Deep Learning and NLP are quite complex subjects, so if you really want to understand them you'll need to follow a couple of courses in the field and read many papers. There are lots of different techniques for converting words into vector representations and it's a very active area of research. Socher's DL for NLP tutorial is a good next step if you are already well acquainted with NLP and Machine Learning (including deep learning).
With that said (and considering it's a programming forum), if for now you are just interested in using someone else's tools to quickly obtain vector representations that can be useful in some tasks, one library you should look at is word2vec. Take a look at its website: https://code.google.com/p/word2vec/. It's a very powerful tool and, for some basic tasks, it can be used without much background knowledge.
To get the word vector for a word, you can use the Google News 300-dimensional word2vec model.
Download the model from here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing or from here: https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz.
After downloading it, load the model using the gensim Python library as below:
import gensim
# Load Google's pre-trained Word2Vec vectors. Older gensim used
# gensim.models.Word2Vec.load_word2vec_format; current versions expose this
# loader on KeyedVectors instead.
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
Then just query the model for the word vector corresponding to a word, like
model['usa']
and it returns a 300-dimensional word vector for "usa".
Note that you may not find word vectors for every word in this model.
Also, instead of this Google News model, other pre-trained models can be used.
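If the direct download links stop working, recent gensim versions can also fetch pre-trained vectors through the gensim-data downloader; the identifier below is the gensim-data name for the same Google News vectors (the download is large, roughly 1.6 GB):

import gensim.downloader as api

# Downloads and caches the vectors on first use, then returns a KeyedVectors object.
wv = api.load('word2vec-google-news-300')
print(wv['usa'].shape)            # (300,)
print(wv.most_similar('usa', topn=3))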