Microsoft Translator number-to-words weird behaviour - Azure

I've got an issue here: I can't get numbers translated as words properly; it only works on very small numbers.
This is how I call the web API:
https://api.microsofttranslator.com/V2/Http.svc/Translate?to=zh-chs&text=Nine
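For reference, the same call from Python. This is a minimal sketch; the token placeholder and the bearer-token header are assumptions, since the legacy V2 endpoint required Azure-issued credentials:

    import requests

    ACCESS_TOKEN = "<your-azure-token>"  # hypothetical placeholder

    resp = requests.get(
        "https://api.microsofttranslator.com/V2/Http.svc/Translate",
        params={"to": "zh-chs", "text": "Nine"},
        headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    )
    # The V2 endpoint returns a small XML document wrapping the translation
    print(resp.text)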
I've consulted the docs (MSDN, Cognitive Services) but haven't found anything yet.
Is there any way to force the translator to output the number as a word rather than in Arabic-numeral format?
Thanks,

No, there are no options for controlling semantic behaviors such as numbers, dates/times, etc. The result depends on the material used to train the engine for a particular language, and in this case it appears that for this language pair, number words are rendered as digits in the result.

Related

Semantic similarity (text comparison) in NLP - best package or cloud service?

I am trying to find a ready-made service for text comparison (comparing the meaning of sentences, including consideration of synonyms).
I found out that AWS Comprehend does not support text comparison. I am looking for a trusted way to implement this quickly.
The term "text comparison" is very broad. If you can give an example of the expected functionality, we can help better. Following are some points regarding semantics with Python's NLTK interface to WordNet.
This offers a way to identify synonyms, hypernymy, and hyponymy.
This also offers a way to measure the distance between two words and thereby helps in comparing the "proximity" of those words.
Please note that WordNet covers nouns, verbs, adjectives, and adverbs, but it does not contain every word in English.
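A minimal NLTK sketch of these points (assumes the WordNet corpus has been downloaded):

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)  # one-time corpus download

    # Synonyms, hypernyms (more general) and hyponyms (more specific)
    car = wn.synsets("car")[0]
    print(car.lemma_names())   # synonyms for this sense, e.g. ['car', 'auto', ...]
    print(car.hypernyms())     # e.g. [Synset('motor_vehicle.n.01')]
    print(car.hyponyms()[:3])  # a few more specific concepts

    # "Proximity" of two words via path distance in the hierarchy
    dog = wn.synsets("dog")[0]
    cat = wn.synsets("cat")[0]
    print(dog.path_similarity(cat))  # roughly 0.2; higher means closer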
I've used SprinkeML to do word comparisons and it's worked well.

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify thousands of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solving our problem.
I'll outline the current process and our challenges, and would love your help on the best way to solve the problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the first step of our process, i.e., checking each document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use regex to check for keywords, i.e., to check for compliance we use the following logic (a code sketch follows the list):
1. To check "states purpose", we do a regex match on 'purpose', 'intent'
2. To check "identifies audience", we do a regex match on 'identifies', 'is for'
3. To check "highlights relevance", we do a regex match on 'able to', 'allows', 'enables'
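A minimal sketch of that check (the patterns mirror the keywords above; the sample report text is shortened):

    import re

    # One illustrative pattern per required component
    CHECKS = {
        "states purpose": r"\b(purpose|intent)",
        "identifies audience": r"\b(identifies|is for)\b",
        "highlights relevance": r"\b(able to|allows|enables)\b",
    }

    def check_compliance(report):
        """Return which of the three components the report appears to satisfy."""
        return {name: bool(re.search(pat, report, re.IGNORECASE))
                for name, pat in CHECKS.items()}

    report = "The purpose of this document is to help farmers be able to get better prices."
    print(check_compliance(report))
    # {'states purpose': True, 'identifies audience': False, 'highlights relevance': True}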
The current regex approach seems very primitive and limited, so I wanted to ask the community if there is a better way of solving this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem; I believe it's a genuine research problem! In natural language processing, there are a few techniques that learn and extract templates from text and can then use them as gold annotations to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extracting templates from questions and then answering them). In your case it's more difficult, because you need to learn the structure from a report. From an NLP point of view, your problem is hard to address directly (no simple NLP task matches your problem definition), but you may not need any fancy (complex) model to solve it.
You can start with simple document matching and computing a similarity score. If you have a large collection of positive examples (well-formatted and well-specified reports), you can construct a dictionary based on tf-idf weights and then check for the presence of the dictionary tokens. You can also treat this as a binary classification problem; good machine-learning classifiers such as SVM and logistic regression work well for text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use; for text pre-processing, you can use NLTK. A sketch of the classification route follows.
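For example (the two training reports and their labels are invented for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical labelled data: compliant (1) vs. non-compliant (0) reports
    reports = [
        "This document states its purpose, names its audience and explains the benefits.",
        "Random field notes with no clear purpose or audience.",
    ]
    labels = [1, 0]

    # tf-idf features feeding a logistic-regression classifier
    model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
    model.fit(reports, labels)

    print(model.predict(["This study helps farmers price their produce."]))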
Since the reports will be generated by field workers and there are a few specific questions the reports must answer (you mentioned three specific components), I suspect simple keyword-matching techniques will be a good start. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data-annotation problem is covered, which is usually the most annoying part. Thankfully, you can rely on the curators: they can mark the specific sentences that state the audience, relevance, and purpose.
Train some models to identify these types of clauses (a sketch follows below). If all the classifiers fire for a given document, it means the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
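For instance, one binary classifier per clause type; the training-data layout here is an assumption about what the curators would provide:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    CLAUSE_TYPES = ["purpose", "audience", "relevance"]

    def train_clause_models(train_data):
        """train_data: {clause type: (sentences, 0/1 labels)} from the curators."""
        return {clause: make_pipeline(TfidfVectorizer(), LinearSVC()).fit(*train_data[clause])
                for clause in CLAUSE_TYPES}

    def is_properly_formatted(models, sentences):
        # The document passes only if every clause type fires on some sentence
        return all(any(models[c].predict([s])[0] == 1 for s in sentences)
                   for c in CLAUSE_TYPES)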
If you don't give yourself any hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing the reports to conform to a fixed document format, e.g., three parts, each with a predefined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem-ipsum filler text.
3. Take away
What can the reader do after reading it?
You parse this document (from .doc format, for instance) and extract the three parts; a sketch of that extraction follows. Then you can run spell checking, grammar checking, and text-complexity algorithms. Finally, you can extract, for instance, named entities (cf. named entity recognition) and low-TF-IDF words.
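Here is a minimal sketch, assuming numbered headings like the ones in the example above (the heading pattern is an assumption about the template):

    import re

    # Headings of the form "1. Purpose" on their own lines
    SECTION_RE = re.compile(r"^\s*\d+\.\s+(.+)$", re.MULTILINE)

    def split_sections(text):
        """Map each heading to the body text that follows it."""
        parts = SECTION_RE.split(text)  # [preamble, title1, body1, title2, body2, ...]
        return {parts[i].strip(): parts[i + 1].strip()
                for i in range(1, len(parts) - 1, 2)}

    doc = "\n".join([
        "1. Purpose",
        "Explains the purpose of the document.",
        "2. Topic / Problem",
        "Addresses the foobar problem.",
        "3. Take away",
        "What the reader can do afterwards.",
    ])
    print(list(split_sections(doc)))  # ['Purpose', 'Topic / Problem', 'Take away']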
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise: CoreNLP and OpenNLP, the libraries I'm most familiar with, have tools that can help with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether that word is a verb, a noun, an adjective, or an adverb. The POS taggers and parsers in these libraries can tell you the part of speech of the word, so you could keep, say, only the verbs that start with "inten", or, more strictly, only the third-person-singular verbs.
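For instance, with NLTK's tagger standing in for the CoreNLP/OpenNLP taggers mentioned above (the tagger models are downloaded on first use):

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = "The document is intended to state its intent and intentionally names its audience."
    tokens = nltk.word_tokenize(sentence)

    # Keep only verbs (Penn tags starting with VB) that begin with "inten";
    # this filters out the noun "intent" and the adverb "intentionally"
    verbs = [w for w, tag in nltk.pos_tag(tokens)
             if tag.startswith("VB") and w.lower().startswith("inten")]
    print(verbs)  # ['intended']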
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past-tense verb.
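A sketch of driving OpenIE from Python via stanza's CoreNLP client (assumes a local CoreNLP installation with CORENLP_HOME set; the annotator list is the usual prerequisite chain for OpenIE):

    from stanza.server import CoreNLPClient

    text = "Born in a small town, she took the midnight train going anywhere."

    # Starts a local CoreNLP server, annotates, and reads out the OpenIE triples
    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma",
                                   "depparse", "natlog", "openie"],
                       be_quiet=True) as client:
        ann = client.annotate(text)
        for sentence in ann.sentence:
            for triple in sentence.openieTriple:
                print(triple.subject, "|", triple.relation, "|", triple.object)
    # Expected to include something like: she | took | midnight train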
These libraries can accomplish many other tasks, such as tokenization, sentence splitting, and named entity recognition; it is up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

automatic keyword generation evaluation

I have a simple text analyzer which generates keywords for a given input text. Until now I have been evaluating it manually, i.e., selecting keywords of a text by hand and comparing them against the ones generated by the analyzer.
Is there any way in which I can automate this? I have googled a lot for free keyword generators that could help with this evaluation but have not found any so far. I would appreciate any suggestions on how to go about this.
Testing keyword generation is a difficult problem. In the past, I have used the following method to evaluate it.
1. Identify the popular association-rule measures such as confidence, Jaccard, lift, chi-squared, and mutual information. There are many papers that compare such measures.
2. Implementing these measures is fairly simple. They all involve some simple algebraic expression using one or more of term frequencies, document frequencies, and co-occurrence frequencies.
3. Generate related keywords using all of these measures and compute their union. Call this set TOTAL.
4. Compute the intersection of the keywords generated by your algorithm with the TOTAL set. Viewed as a fraction (intersection/TOTAL), this is a rough indicator of how powerful your measure is.
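A bare-bones sketch of a couple of those measures plus the final fraction (all counts and keyword sets are invented):

    import math

    # Hypothetical counts over N documents for terms a and b
    N = 10_000
    df_a, df_b, df_ab = 400, 300, 90  # docs containing a, b, and both

    confidence = df_ab / df_a                           # P(b | a)
    jaccard = df_ab / (df_a + df_b - df_ab)
    lift = (df_ab / N) / ((df_a / N) * (df_b / N))      # > 1 means associated
    pmi = math.log(lift)                                # pointwise mutual information
    print(confidence, jaccard, lift, pmi)

    # Step 4: what fraction of the TOTAL set does your algorithm recover?
    total = {"price", "farmer", "market", "income"}     # union over all measures
    generated = {"price", "income", "weather"}          # your algorithm's output
    print(len(generated & total) / len(total))          # 0.5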
I found an automatic keyword-generation evaluation tool, Text Mechanic's Keyword Suggestion Generator, which might help.
It says:
The Text Mechanic's "Keyword Suggestion Generator" will retrieve Google.com auto-suggest results* for your entered seed text in an easy-to-investigate format. Seed text can be a letter, number, word, or phrase related to what you (and others) are querying to find in Google search results.
I believe it can be automated.

Word/Phoneme Corpus for an Elman SRN (English)

I'm writing an Elman Simple Recurrent Network. I want to give it sequences of words, where each word is a sequence of phonemes, and I want a lot of training and test data.
So, what I need is a corpus of English words, together with the phonemes they're made up of, written as something like ARPAbet or SAMPA. British English would be nice but is not essential so long as I know what I'm dealing with. Any suggestions?
I do not currently have the time or the inclination to code something that derives the phonemes a word is made up of from spoken or written data, so please don't propose that.
Note: I'm aware of the CMU Pronouncing Dictionary, but it only claims to be based on the ARPAbet symbol set - does anyone know whether there are actually any differences and, if so, what they are? (If there aren't any, then I could just use that...)
EDIT: CMUPD 0.7a Symbol list - vowels may have lexical stress, and there are variants (of ARPABET standard symbols) indicating this.
CMUdict should be fine. "Arpabet symbol set" just means Arpabet. If there are any minor differences, they should be explained in the CMUdict documentation.
If you need data that's closer to real life than stringing together dictionary pronunciations of individual words, look for phonetically transcribed corpora, e.g., TIMIT.
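For example, through NLTK's copy of the dictionary:

    import nltk
    from nltk.corpus import cmudict

    nltk.download("cmudict", quiet=True)  # one-time download

    pron = cmudict.dict()  # word -> list of pronunciations (phoneme lists)
    print(pron["tomato"])
    # [['T', 'AH0', 'M', 'EY1', 'T', 'OW2'], ['T', 'AH0', 'M', 'AA1', 'T', 'OW2']]
    # The trailing digits are the lexical-stress markers mentioned in the question.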

associated words

I am developing a program but am stuck on a particular hurdle: I need to find words associated with other words. E.g., "green" might be associated with "environment", "leaf", "earth", "wind", "electric", "hybrid", etc. All I can find is Google Sets. Is there any other resource that is better?
If you have a large text collection (say, Wikipedia or Project Gutenberg), you can use co-occurrence scores to extract this kind of data. See, e.g., Padó and Lapata and the references therein.
I recently built a tool that mines this kind of association from Wikipedia database dumps by another method. It requires a lot of memory, though; other folks have tried to do the same using randomized methods.
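A bare-bones sketch of sentence-level co-occurrence scoring with PMI (the toy corpus and the sentence-window choice are placeholders; a real run would use, say, Wikipedia sentences):

    import math
    from collections import Counter
    from itertools import combinations

    # Toy corpus of tokenized sentences
    sentences = [
        ["green", "leaf", "on", "the", "tree"],
        ["green", "energy", "and", "electric", "cars"],
        ["the", "train", "left", "the", "station"],
    ]

    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        word_counts.update(set(sent))
        pair_counts.update(frozenset(p) for p in combinations(set(sent), 2))

    def pmi(a, b):
        """Pointwise mutual information of a and b co-occurring in a sentence."""
        p_ab = pair_counts[frozenset((a, b))] / n
        return math.log(p_ab / ((word_counts[a] / n) * (word_counts[b] / n)))

    print(pmi("green", "leaf"))  # positive: they co-occur more often than chance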
If you're still looking for a resource of semantically related words, I've just recently developed an API that takes a query and returns semantically related words. It offers parts of speech, relationships to the query word, and a word similarity measurement.
https://kiingo.co/rapid-associations-api
Disclaimer: I'm the developer of this API.
