Microsoft Translate Adding Extra Words To Translation - azure

I am trying to translate to from English to Welsh. I have a data set of 3032 sentences which I am aware is below the recommended 10000 limit but the issue is random words being added to sentences or added at the end of the translation.
With the dataset I have, I am getting a BLEU score of 94.25.
Image of Translation Differences
I have attached four examples where extra words are being added throughout the form. At no point in the dataset is there duplication of words that match any of these formats and there is no trailing whitespace in the translations which would explain why "yn" in particular is appearing as a new sentence.
Is there any way of removing these erroneous extra words or increasing the accuracy of the translation? To increase the overall amount of sentences to more than 10000 would be a very large task and would not be something to undertake if the system is still going to have a high chance of returning random words.

I also raised this as a support request with Microsoft. They had said the issue was down to using a dictionary that included verbs as part of the translation.
I have since tried using English UK as the basis for the translation - an option that previously failed to build - and with the same dataset the BLEU score is 93.24 but the extra words have disappeared.
My issue has been resolved and it's now down to training out the incorrect translations. It appears the English to Welsh translation has a bug.

Related

Use the polarity distribution of word to detect the sentiment of new words

I have just started a project in NLP. Suppose I have a graph for each word that shows the polarity distribution of sentiments for that word in different sentences. I want to know what I can use to recognize the feelings of new words? Any other use you have in mind I will be happy to share.
I apologize for any possible errors in my writing. Thanks a lot
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of contexts, there's not much you can do. (Maybe, you could go out to try to find extra texts with those new words, such as vis dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occuring most often near the unknown word, and look up your prior labels, and avergae them together (perhaps weighted by the number of occurrences.)
You'll then have a number for the new word.
Alternatively, you could train a word2vec set-of-word-vectors for all of your texts, including the unknown & know words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity), to get a number for the previously unknown word.
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have specific sentiment is somewhat weak given the way that in actual language, their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniqyes are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc, then you should discard your labels of individual words an acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques – as long as those techniques have plenty of labeled training data.

Trying to detect products from text while using a dictionary

I have a list of products names and a collection of text generated from random users. I am trying to detect products mentioned in the text while talking into account spelling variation. For example the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- max tokenized products names and users text (i split words by punctuation and digits so s8 will be tokenized into 's' and '8'. Then i did a check on each token in user's text to see if it is in my vocabulary with damerau levenshtein distance <= 1 to allow for variation in spelling. Once i have detected a sequence of tokens that do exist in the vocabulary i do a search for the product that matches the query while checking the damerau levenshtein distance on each token. This gave poor results. Mainly because the sequence of tokens that exist in the vocabulary do not necessarily represent a product. For example since text is max tokenized numbers can be found in the vocabulary and as such dates are detected as products.
2- i constructed bigram and trigram indicies from the list of products and converted each user text into a query.. but also results weren't so great given the spelling variation
3- i manually labeled 270 sentences and trained a named entity recognizer with labels ('O' and 'Product'). I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. Results were okay.. not great tho
None of the above results achieved a reliable performance. I tried regular expressions but since there are so many different combinations to consider it became too complicated.. Are there better ways to tackle this problem? I suppose ner could give better results if i train more data but suppose there isn't enough training data, what do u think a better solution would be?
If i come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime I'm open to suggestions
Consider splitting your problem into two parts.
1) Conduct a spelling check using a dictionary of known product names (this is not a NLP task and there should be guides on how to impelement spell check).
2) Once you have done pre-processing (spell checking), use your NER algorithm
It should improve your accuracy.

When using word alignment tools like fast_align, does more sentences mean better accuracy?

I am using fast_align https://github.com/clab/fast_align to get word alignments between 1000 German sentences and 1000 English translations of those sentences. So far the quality is not so good.
Would throwing more sentences into the process help fast_align to be more accurate? Say I take some OPUS data with 100k aligned sentence pairs and then add my 1000 sentences in the end of it and feed it to fast_align. Will that help? I can't seem to find any info on whether this would make sense.
[Disclaimer: I know next to nothing about alignment and have not used fast_align.]
Yes.
You can prove this to yourself and also plot the accuracy/scale curve by removing data from your dataset to try it at at even lower scale.
That said, 1000 is already absurdly low, for these purposes 1000 ≈≈ 0, and I would not expect it to work.
More ideal would be to try 10K, 100K and 1M. More comparable to others' results would be some standard corpus, eg Wikipedia or data from the research workshops.
Adding data very different than the data that is important to you can have mixed results, but in this case more data can hardly hurt. We could be more helpful with suggestions if you mention a specific domain, dataset or goal.

Part of speech tagging : tagging unknown words

In the part of speech tagger, the best probable tags for the given sentence is determined using HMM by
P(T*) = argmax P(Word/Tag)*P(Tag/TagPrev)
T
But when 'Word' did not appear in the training corpus, P(Word/Tag) produces ZERO for given all possible tags, this leaves no room for choosing the best.
I have tried few ways,
1) Assigning small amount of probability for all unknown words, P(UnknownWord/AnyTag)~Epsilon... means this completely ignores the P(Word/Tag) for unknowns word by assigning the constant probability.. So decision making on unknown word is by prior probabilities.. As expected it is not producing good result.
2) Laplace Smoothing
I confused with this. I don't know what is difference between (1) and this. My way of understanding Laplace Smoothing adds the constant probability(lambda) to all unknown & Known words.. So the All Unknown words will get constant probability(fraction of lambda) and Known words probabilities will be the same relatively since all word's prob increased by Lambda.
Is the Laplace Smoothing same as the previous one ?
*)Is there any better way of dealing with unknown words ?
Your two approaches are similar, but, if I understand correctly, they differ in one key way. In (1) you are assigning extra mass to counts of unknown words and in (2) you are assigning extra mass to all counts. You definitely want to do (2) and not (1).
One of the problems with Laplace smoothing is that it give too much of a boost to unknown words and drags down the probabilities of high-probability words too much (relatively speaking). Your version (1) would actually worsen this problem. Basically, it would over-smooth.
Laplace smoothing words ok for an HMM, but it's not great. Most people do add-one smoothing but you could experiment with things like add-one-half or whatever.
If you want to move beyond this naive approach to smoothing, check out "one-count smoothing", as described in the Appendix of Jason Eisner's HMM tutorial. The basic idea here is that for unknown words more probability mass should be given to tags that appear with a wider variety of low frequency words. For example, since the tag NOUN appears on a large number of different words and DETERMINER appears on a small number of different words, it is more likely that an unseen word will be a NOUN.
If you want to get even fancier, you could use a Chinese Restaurant Process model taken from non-parametric Bayesian statistics to put a prior distribution on unseen word/tag combinations. Kevin Knight's Bayesian inference tutorial has details.
I think the HMM-based TnT tagger provides a better approach to handle unknown words (see the approach in TnT tagger's paper).
The accuracy results (for known words and unknown words) of TnT and other two POS and morphological taggers on 13 languages including Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese, can be found in this article.

Financial news headers classification to positive/negative classes

I'm doing a small research project where I should try to split financial news articles headers to positive and negative classes.For classification I'm using SVM approach.The main problem which I see now it that not a lot of features can be produced for ML. News articles contains a lot of Named Entities and other "garbage" elements (from my point of view of course).
Could you please suggest ML features which can be used for ML training? Current results are: precision =0.6, recall=0.8
Thanks
The task is not trivial at all.
The straightforward approach would be to find or create a training set. That is a set of headers with positive news and a set of headers with negative news.
You turn the training set to a TF/IDF representation and then you train a Linear SVM to separate the two classes. Depending on the quality and size of your training set you can achieve something decent - not sure for 0.7 break even point.
Then, to get better results you need to go for NLP approaches. Try use a part-of-speech tagger to identify adjectives (trivial), and then score them using some sentiment DB like SentiWordNet.
There is an excellent overview on Sentiment Analysis by Bo Pang and Lillian Lee you should read:
How about these features?
Length of article header in words
Average word length
Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
Ratio of words in that dictionary to total words in sentence
Similar to 3, but number of words in a "good" dictionary of words, e.g. dictionary = {boon, booming, employment, ...}
Similar to 5, but use the "good"-word dictionary
Time of the article's publication
Date of the article's publication
The medium through which it was published (you'll have to do some subjective classification)
A count of certain punctuation marks, such as the exclamation point
If you're allowed access to the actual article, you could use surface features from the actual article, such as its total length and perhaps even the number of responses or the level of opposition to that article. You could also look at many other dictionaries online such as Ogden's 850 basic english dictionary, and see if bad/good articles would be likely to extract many words from those. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.
iliasfl is right, this is not a straightforward task.
I would use a bag of words approach but use a POS tagger first to tag each word in the headline. Then you could remove all of the named entities - which as you rightly point out don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further along, if you still aren't close could be to only select the adjectives and verbs from the tagged data as they are the words that tend to convey the emotion or mood.
I wouldn't be too disheartened in your precision and recall figures though, an F number of 0.8 and above is actually quite good.

Resources