Our application uses Azure speech to text in streaming mode to transcribe telephone calls. We're having an issue where the speech to text service returns a single joined number when numbers are uttered individually. For example, saying "fourteen, seven, thirty eight" results in the service returning "14738" (fourteen thousand seven hundred thirty-eight) instead of "14, 7, 38". This is not the behavior we want.
If I check the lexical result after setting the output format to detailed it shows the correct final transcription, albeit obviously in lexical instead of numerical form: "fourteen, seven, thirty eight". A half-way solution would be to implement a number recognition system based on this lexical output, but we would like to avoid this if possible.
I've tried changing the recognition mode (dictation, etc.) and the punctuation mode. I've also searched for settings that would give me some control over how Azure returns numerical values, but to no avail.
Is there any way of controlling this behavior?
Thanks.
I have text data from two different groups. In total I have around 4000 text passages with around 300 words.
I am searching for a tool that allows me to analyze the difference between these two groups.
In the best case, this tool can analyze different dimensions, e.g. the length of sentences, usage of superlatives, perspective of the narrator, usage of passive form, clear and objective writing VS hedging and imprecise writing.
In Python, you can use the nltk or spaCy packages to process the texts so that you can analyze them (using pandas, for example). But there's no ready-made software (as far as I know) that will do all of that for you. You're going to have to write your own code.
For example, you would create a pandas dataframe with a row for all of the texts, with their group ('A' or 'B' or whatever) as one of the columns and the raw text as the other. Then you use nltk to tokenize the text and do whatever other preprocessing you want to do, storing the clean, tokenized text in another column. Then you can have a column for, for example, sentence length (which you can compute using nltk). From there you'll be able to get the means of the two groups, standard deviation, statistical significance of difference, etc.
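To make the sentence-length step concrete, here is a minimal sketch. It deliberately uses only the standard library so it runs without any downloads: the regex splitter is a crude stand-in for nltk's sentence tokenizer, and the plain dictionary stands in for a pandas group-by. The sample texts and function names are made up for illustration.

```python
import re
from statistics import mean

# Hypothetical sample data: (group, raw text) pairs standing in for the real corpus.
texts = [
    ("A", "I went home. It was late and I was tired."),
    ("A", "We left early."),
    ("B", "The committee reviewed the proposal in considerable detail before voting."),
]

def sentence_lengths(text):
    """Return the word count of each sentence, using a crude regex splitter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def mean_sentence_length_by_group(rows):
    """Map each group label to the mean sentence length over all its texts."""
    lengths = {}
    for group, text in rows:
        lengths.setdefault(group, []).extend(sentence_lengths(text))
    return {group: mean(values) for group, values in lengths.items()}

result = mean_sentence_length_by_group(texts)
print(result)
```

With real data you would probably swap in a proper tokenizer, since splitting on punctuation breaks on abbreviations such as "e.g.".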
It's straightforward for something like sentence length, but the other features you mention are more difficult. What does it mean for a text to be clear and objective, or hedged and imprecise? That means nothing on its own: you have to decide what exactly you mean by that, and what features characterize it. For example, you could make a list of hedgers ('I think', 'may', 'might', 'I'm not sure but', etc.) and then count their frequency in each text.
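As a rough sketch of the hedge-counting idea, the snippet below counts whole-word matches of a small phrase list and normalizes by text length. The list is just the examples from above and the function name is hypothetical; what counts as a hedge is still your decision.

```python
import re

# Illustrative hedge list taken from the examples above; extend it for real use.
HEDGES = ["i think", "may", "might", "i'm not sure but"]

def hedge_rate(text):
    """Return hedge occurrences per 100 words (whole-word, case-insensitive)."""
    lowered = text.lower()
    words = lowered.split()
    if not words:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in HEDGES
    )
    return 100.0 * hits / len(words)

print(hedge_rate("I think it might rain"))
```

Note that simple matching like this cannot tell the modal "may" from the month "May", so some manual checking of the hits is advisable.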
Something like "perspective of the narrator" might need to be annotated manually, depending on what you mean by it. If you just mean 1st person vs. 3rd person, that could be easy to identify (compare the 'I's vs. the 'he/she's), but anything more subtle than that, I'm not sure how you'd do it.
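For the simple 1st person vs. 3rd person comparison, a sketch along these lines might be enough. The pronoun sets are my own assumption and certainly incomplete, and the tokenizer is a bare regex rather than nltk.

```python
import re
from collections import Counter

# Assumed pronoun lists; adjust them to the kind of text you have.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
THIRD_PERSON = {"he", "him", "his", "she", "her", "hers",
                "they", "them", "their", "theirs"}

def pronoun_counts(text):
    """Return (first_person_count, third_person_count) for one text."""
    # Splitting on non-letters turns "I'm" into "i" and "m", so "I" is counted.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        if token in FIRST_PERSON:
            counts["first"] += 1
        elif token in THIRD_PERSON:
            counts["third"] += 1
    return counts["first"], counts["third"]

print(pronoun_counts("I told her that my dog likes them"))
```

Comparing the two counts per text (or their ratio) gives a crude indicator of narrative perspective; dialogue inside a third-person story will skew it.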
Good luck with your project!
Using the telephony integration in DialogFlow, when trying to capture an intent like (for example)
I'm looking for the number six
Where six is defined as @sys.cardinal or @sys.number
I would get it to recognize any single digit except 2 & 4.
For those the text would almost consistently read as "to" & "for" respectively.
This would happen both on the phone, and when testing on the Dialogflow console, pressing the little microphone icon and recording the input.
Why is it missing these numbers when it knows I'm expecting a number in that position?
What can I do to give it better hints?
If the exact phrase the user speaks is "I'm looking for the number two" I believe the agent will detect it as a number based on the context of the phrase.
If they just say "two" it may detect as "to" instead.
Will users only be able to provide a single digit here? If so, perhaps you can create an example for every number (given there are only 10 digits that wouldn't be too onerous).
However, if you're expecting the user to provide a string of numbers perhaps try a different data type for the parameter. The number-sequence type might be more suitable.
I've got an issue here: I can't get numbers translated properly as words. It only works for very small numbers.
This is how I call the webAPI
https://api.microsofttranslator.com/V2/Http.svc/Translate?to=zh-chs&text=Nine
I've consulted the docs (MSDN, COGNITIVE) but I've yet to find anything.
Is there any way of forcing the translator to output numbers as words rather than as Arabic numerals?
Thanks,
No, there are no options for controlling semantic behaviors such as number, date/time, etc. The result is dependent on the materials used for training the engine for a particular language, and in this particular case, it appears that for this language pair, word numbers are represented by digits in the result.
I've been using the (preview) CRIS speech to text service in Azure. For some short wav files, I get a correct text equivalent, but it is followed by "non". Is this a keyword meaning "non-word" or is this a bug? -- It happens both when I use the base conversational model and when I use a custom language model based on the base conversational model, but it does not happen with the "search and dictation" model.
For example, I send a noisy wav file of someone saying "yes" and I get back "yes non". If the wav file is not noisy this doesn't happen, and if the spoken text is two or more words it doesn't happen. It just seems to happen for noisy one-word files. What does "non" mean?
After talking with the product group, this is apparently a bug in the current build of CRIS and will be fixed shortly. The "non" doesn't mean anything, it just appears when there are bursts of background noise.
I have to write some code working with locales. Is there a good introduction to the subject to get me started?
First posted at Everything you need to know about Locales
A long time ago when I was a senior developer in the Windows group at Microsoft, I was sent to the Far East to help get the F.E. version of Windows 3.1 shipped. That was my introduction to localizing software – basically being pushed into the deep end of the pool and told to learn how to swim. This is where I learned that localization is a lot more than translation.
Note: One interesting thing we hit - the infamous Blue Screen of Death switched the screen into text mode. You can't display Asian languages in text mode. So we (and by we I mean me) came up with a system where we put the screen in VGA mode, stored the 12 pt. Courier bitmap at that resolution for just the characters used in BSoD messages, and rendered it that way. You kids today have it so easy :).
So keep in mind that taking locale into account can lead to some very unexpected work.
The Locale
Ok, so forward to today. What is a locale and what do you need to know? A locale is fundamentally the language and country a program is running under. (There can also be a variant added to the country but use of this is extremely rare.) The locale is this combination but you can have any combination of these two parts. For example a Spanish national in Germany would set es_DE so that their user interface is in Spanish (es) but their country settings are those of Germany (DE). Do not assume location based on language or vice-versa.
The language part of the locale is very simple - that's the language you want your app to display its text in. If the user is a Spanish speaker, you want to display all text in Spanish. But what dialect of Spanish? It is quite different between Spain and Mexico (just as in America we spell color while in England it's colour). So the country can impact the language used, depending on the combination.
All languages that support locale specific resources (which is pretty much all of them today) use a fall-back system. They will first look for a resource for the language_country combination. While es_DE has probably never been done, there often is an es_MX and es_ES. So for a locale set to es_MX it will first look for the es_MX resource. If that is not found, it then looks for the es resource. This is the resource for that language, but not specific to any country. Generally this is copied from the largest country (economically) for that language. If that is not found, it then goes to the "general" resource which is almost always the native language the program was written in.
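The lookup order described above can be sketched in a few lines. This is only an illustration of the idea, not how any particular run-time implements it; the resource tables, keys, and function names are hypothetical.

```python
# Hypothetical resource tables keyed by locale name; "" is the general resource.
RESOURCES = {
    "es_MX": {"greeting": "¡Qué onda!"},
    "es": {"greeting": "Hola", "farewell": "Adiós"},
    "": {"greeting": "Hello", "farewell": "Goodbye", "help": "Help"},
}

def fallback_chain(locale_name):
    """Return the lookup order: language_country, then language, then general."""
    language = locale_name.split("_")[0]
    chain = [locale_name]
    if language != locale_name:
        chain.append(language)
    chain.append("")
    return chain

def lookup(key, locale_name, resources=RESOURCES):
    """Return the most specific string available for the given locale."""
    for name in fallback_chain(locale_name):
        table = resources.get(name, {})
        if key in table:
            return table[key]
    raise KeyError(key)

print(lookup("greeting", "es_MX"))  # found in es_MX
print(lookup("farewell", "es_MX"))  # falls back to es
print(lookup("help", "es_DE"))      # falls back to the general resource
```

Because the fallback works per string rather than per file, a partially translated resource still yields a usable UI.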
The theory behind this fallback is you only have to define different resources for the more specific resources - and that is very useful. But even more importantly, when new parts of the UI are made and you want to ship beta copies or you release before you can get everything translated, well then the translated parts are localized but the untranslated parts still display - but in English. This annoys the snot out of users in other countries, but it does get them the program sooner. (Note: We use Sisulizer for translating our resources - good product.)
The second half is the country. This is used primarily for number and date/time settings. This spans the gamut from what the decimal and thousand separator symbols are (12,345.67 in the U.S. is 12 345,67 in Russia) to what calendar is in use. The way to handle this is by using the run-time classes available for all operations on these elements when interacting with a user. Classes exist for both parsing user entered values as well as displaying them.
Keep a clear distinction between values the user enters or that are displayed to the user and values stored internally as data. A number is stored as a string in an XML file, but there it will be "12345.67" regardless of locale (unless someone did something very stupid). Keep your data strongly typed and only do the locale specific conversions when displaying or parsing text to/from the user. Storing data in a locale specific format will bite you in the ass sooner or later.
Chinese
Chinese does not have an alphabet but instead has a set of glyphs. The People's Republic of China several decades ago significantly revised how to draw the glyphs and this is called simplified. The Chinese glyphs used elsewhere continued with the original and that is called traditional. It is the exact same set of characters, but they are drawn differently. It is akin to our having both a text A and a script A - they both mean the same thing but are drawn quite differently.
This is more of a font issue than a translation issue, except that wording and usage has diverged a bit, in part due to the differences in approach between traditional and simplified Chinese. The end result is that you generally do want to have two Chinese language resources, one zh_CN (PRC) and one zh_TW (Taiwan). As to which should be the zh resource - that is a major geopolitical question and you're on your own (but keep in mind PRC has nukes - and you don't).
Strings with substituted values
So you need to display the message: Display("The operation had the error: " + msg); No, no, no! Because in another language the proper usage could be: Display("The error: " + msg + " was caused by the operation"); Every modern run-time library has a construct where you can have a string resource "The operation had the error: {0}" and will then substitute in your msg at {0}. (Some use a syntax other than {0}, {1}, …)
You store these strings in a resource file that can be localized. Then when you need to display the message, you load it from the resources, substitute in the variables, and display it. The combination of this, plus the number & date/time formatters make it easy to build up these strings. And once you get used to them, you'll find it easier than the old approach. (If you are using Visual Studio - download and install ResourceRefactoringTool to make this trivial.)
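Here is a small sketch of that load-then-substitute pattern, using Python's str.format, which happens to use the same {0} syntax. The resource dictionary, the "xx" language code, and the function name are made up for illustration; a real program would load the strings from localized resource files.

```python
# Hypothetical string resources; each language orders the sentence its own way.
MESSAGES = {
    "en": {"op_error": "The operation had the error: {0}"},
    "xx": {"op_error": "The error: {0} was caused by the operation"},
}

def format_message(language, key, *args):
    """Load the template for the language (English if missing) and fill it in."""
    template = MESSAGES.get(language, MESSAGES["en"])[key]
    return template.format(*args)

print(format_message("en", "op_error", "disk full"))
print(format_message("xx", "op_error", "disk full"))
```

The calling code never concatenates fragments, so the translator is free to move the placeholder wherever the target language needs it.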
Arabic, Hebrew, and complex scripts
Arabic & Hebrew are called bi-directional because parts of the text are right to left while other parts are left to right. Text in Arabic/Hebrew is written and read right to left. But when you get to Latin text or numbers, you then jump to the left-most part and read that left to right, then jump back to where that started and read right to left again. And then there is punctuation and other non-letter characters where the rules depend on where they are used.
Here's the bottom line - it is incredibly complex and there is no way you are going to learn how it works unless you take this on as a full-time job. But not to worry, again the run-time libraries for most languages have classes to handle this. The key to this is the text for a line is stored in the order you read the characters. So in the computer memory it is in left to right order for the order you would read (not display) the characters. In this way everything works normally except when you display the text and when you determine how to move the caret.
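A short sketch of what "stored in reading order" means in practice. The sample string is my own: the Hebrew word shalom followed by a number. The code only inspects the Unicode bidirectional class of each stored character; the actual reordering for display is left to the rendering library.

```python
import unicodedata

# Hebrew "shalom", a space, then "42", stored in logical (reading) order.
text = "\u05e9\u05dc\u05d5\u05dd 42"

# "R" = right-to-left letter, "WS" = whitespace, "EN" = European number.
classes = [unicodedata.bidirectional(ch) for ch in text]

for ch, cls in zip(text, classes):
    print(f"U+{ord(ch):04X} {cls}")
```

Index 0 holds the first letter you read (shin), even though it is drawn at the right edge of the line, so searching, slicing, and comparing work the same as for left to right text.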
Complex scripts like Indic scripts have a different problem. While they are read left to right, you can have cases where some combinations of letters are placed one above the other, so the string is no wider on the screen when the second letter is added. This tends to require a bit of care with caret movement but nothing more.
We even have cases like this in English where ae is sometimes rendered as a single æ character. (When the human race invented languages, they were not thinking computer friendly.)
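To illustrate why stacked characters affect caret movement, the sketch below groups each base character with the combining marks that follow it. This is a simplification I'm assuming for illustration (real text engines use full grapheme cluster rules), and the function name is hypothetical.

```python
import unicodedata

def caret_stops(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # Unicode general categories Mn, Mc and Me are all combining marks.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

accented = "e\u0301"          # "e" followed by COMBINING ACUTE ACCENT
devanagari = "\u0915\u0947"   # KA followed by DEVANAGARI VOWEL SIGN E

print(len(accented), len(caret_stops(accented)))
print(len(devanagari), len(caret_stops(devanagari)))
```

Each string holds two code points but only one caret stop, which is the situation where the string gets no wider when the second character is added.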
Don't Over-Stress it
It seems like a lot but it's actually quite simple. In most cases you need to display text based on the closest resource you have. And you use the number & date/time classes for all locales, including your native one. No matter where you live, most computer users are in another country speaking another language - so localizing well significantly increases your potential market.
And if you're a small company, consider offering a free copy for people who translate your product. When I created Page 2 Stage I offered a free copy (list price $79.95) for translating it - and got 28 translations. I also met some very nice people online in the process. For an enterprise level product, many times a VAR in another country will translate it for you at a reduced rate or even free if they see a good market potential. But in these cases, do the first translation in-house to get the kinks worked out.
One resource I find very useful is the Microsoft Language Portal where you can put in text in English and if that text is in any of the Microsoft products, it will give you the translation Microsoft used for a given language. This can give you a fast high-quality translation for up to 80% of your program in many cases.
Удачи! (Good Luck)