Pound sign (£) getting wrongly identified by Azure RecognizeText API in Cognitive Services - azure

I have many pictures of text containing a pound sign (£), but the sign is NEVER correctly recognized by the Azure Cognitive Services RecognizeText API, as far as I have tested. Other symbols, like the dollar sign ($) for example, are identified without problems.
I also made tests with screenshots of text containing £, since these should be easy for the OCR tool to convert, and again the pound sign is not correctly identified (it becomes an f, a 2, a 1, a $, etc.).
I suspect that the pound sign is not included in the set of characters that the tool supports, although I couldn't find a specific mention of that in the documentation (only that the tool is experimental and optimized for English).
Has anyone been able to correctly convert a £ using the tool, or does anyone know FOR SURE (possibly through documentation) that £ is not included in its character set?
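For context, this is roughly the kind of request I am testing. It is only a minimal sketch: the region, subscription key, file name, endpoint version and result field names are placeholders and may differ from the current API, so check the Cognitive Services documentation.
<?php
// Submit an image to the RecognizeText endpoint (asynchronous API).
$endpoint = 'https://westeurope.api.cognitive.microsoft.com/vision/v2.0/recognizeText?mode=Printed';

$ch = curl_init($endpoint);
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => file_get_contents('pound_sign_sample.png'),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HEADER         => true,   // keep headers: we need Operation-Location
    CURLOPT_HTTPHEADER     => [
        'Content-Type: application/octet-stream',
        'Ocp-Apim-Subscription-Key: <your-key>',
    ],
]);
$response = curl_exec($ch);
curl_close($ch);

// The POST only returns an Operation-Location header; the result has to be
// polled from that URL once the recognition has finished.
preg_match('/Operation-Location:\s*(\S+)/i', $response, $match);
sleep(5);
$result = json_decode(file_get_contents($match[1], false, stream_context_create([
    'http' => ['header' => "Ocp-Apim-Subscription-Key: <your-key>\r\n"],
])), true);

// Print the recognized lines; in my tests the £ comes back as f, 2, 1, $ etc.
foreach ($result['recognitionResult']['lines'] ?? [] as $line) {
    echo $line['text'], "\n";
}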
Thanks!

Related

Internationalization Web Number-Symbols

Do I need to use different number symbols when I want my webpage to be accessible in other countries? According to Microsoft there are different shapes of digits: https://learn.microsoft.com/en-us/globalization/locale/number-formatting#:~:text=formatting%20for%20details.-,The%20character%20used%20as%20the%20thousands%20separator,thousands%20separator%20is%20a%20space.
I have been searching for a few days for a clear answer but I can't find one. Also, on most international websites/apps I only ever see the digits 0,1,2,3,4,5,6,7,8,9, even though the digits for the language actually look different. That unsettles me; I feel like many websites/apps just ignore this fact. Can anybody help me further? Also, do I need to know how to activate foreign symbols in HTML?
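To make the point about digit shapes concrete, here is a small sketch using PHP's intl extension (the exact digits and separators come from the ICU locale data, so the output can vary slightly between ICU versions):
<?php
// Format the same number for a few locales; digits and separators
// are taken from the ICU locale data behind the intl extension.
$value = 1234567.89;
foreach (['en_US', 'de_DE', 'ar_EG'] as $locale) {
    $formatter = new NumberFormatter($locale, NumberFormatter::DECIMAL);
    echo $locale, ': ', $formatter->format($value), "\n";
}
// en_US: 1,234,567.89
// de_DE: 1.234.567,89
// ar_EG: ١٬٢٣٤٬٥٦٧٫٨٩  (Arabic-Indic digits)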
I do not know for sure which language you are translating/typing in HTML, but here is an example of what you can use as a guide to certain Arabic scripts: https://sites.psu.edu/symbolcodes/languages/mideast/arabic/arabicchart/
You may also need to use a converter. For example, I type Chinese on my website by entering the characters into a character-to-Unicode converter, then copying and pasting the Unicode references into my HTML text.
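If you prefer to script that conversion instead of using an online tool, a rough sketch in PHP (PHP 7.4+ with the mbstring extension; the function name is just illustrative):
<?php
// Convert each character of a UTF-8 string into an HTML numeric
// character reference, e.g. "£" -> "&#xA3;", "中" -> "&#x4E2D;".
function toNumericEntities(string $text): string {
    $out = '';
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $out .= sprintf('&#x%X;', mb_ord($char, 'UTF-8'));
    }
    return $out;
}

echo toNumericEntities('£100 中文');
// &#xA3;&#x31;&#x30;&#x30;&#x20;&#x4E2D;&#x6587;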

How to detect punctuation in Google Home or Assistant

Is there a way to detect punctuation such as periods and commas in the audio captured by Google Home or the Assistant? The output text is one long sentence instead of sentences separated by periods.
I am thinking it might be found in the action package or in the requests and responses of the fulfillment URL. The closest I have found is the Google Speech-to-Text API, which requires an audio file.
Thank you in advance.
Edit: I am using Actions SDK from Google Actions
If the user does not provide the punctuation verbally, you will only ever see a string of words with no punctuation.
You can instruct your users to dictate their punctuation, however.
It is not very hard. Let me show you how, by example.
Those last two sentences can be dictated:
it is not very hard period let me show you how comma by example period
That will get your desired outcome.
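If you take that approach, the dictated punctuation words can be turned back into symbols in your fulfillment code. A rough sketch (the function name and the handled words are just illustrative):
<?php
// Replace dictated punctuation words ("period", "comma") with the symbols
// and re-capitalize the start of each sentence.
function applyDictatedPunctuation(string $transcript): string {
    $text = preg_replace('/\s*\bcomma\b/i', ',', $transcript);
    $text = preg_replace('/\s*\bperiod\b/i', '.', $text);
    return preg_replace_callback('/(?:^|\.\s+)[a-z]/', function ($m) {
        return strtoupper($m[0]);
    }, $text);
}

echo applyDictatedPunctuation(
    'it is not very hard period let me show you how comma by example period'
);
// It is not very hard. Let me show you how, by example.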

Azure search hit-highlighting and match delimiter

I am using hit highlighting in Azure Search. It works fine, but I want to fine-tune it a bit.
Say, a field has the following value:
"It uses period as the delimiter. If not, please clarify"
If I search for "please" I will get a highlight hit on that field, e.g.:
"If not, <em>please</em> clarify"
If I search for "period" I will get a highlight hit on that field, e.g.:
"It uses <em>period</em> as the delimiter."
After trying it with several examples, it seems that it uses the period (".") as a delimiter so that it doesn't return the whole field.
From another SO question (Hit Highlighting in Azure Search Service) it seems that I cannot configure azure search to return the whole field with all terms highlighted.
I want to ask:
whether this is really the case or more complex rules apply
whether I have any control over how the field is split for hit highlighting, e.g. changing the delimiter to "," or "\n"
Thanks in advance
Unfortunately there is no way to customize how documents are split for hit highlighting. Feel free to use the Azure Search UserVoice website to post improvement ideas, giving other users the opportunity to vote for them and helping us prioritize: http://feedback.azure.com/forums/263029-azure-search
The hit highlighter splits documents into sentences. In general it's fair to assume it breaks on periods, but it also handles abbreviations etc.
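For reference, a request that asks for highlighting looks roughly like this (service name, index name, field name, API key and api-version are placeholders; adjust them to your own index):
<?php
// Ask Azure Search to return highlighted fragments for the "content" field.
$url = 'https://my-service.search.windows.net/indexes/my-index/docs/search'
     . '?api-version=2019-05-06';
$body = json_encode([
    'search'           => 'period',
    'highlight'        => 'content',
    'highlightPreTag'  => '<em>',
    'highlightPostTag' => '</em>',
]);

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $body,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => [
        'Content-Type: application/json',
        'api-key: <your-query-key>',
    ],
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// Each matching document carries its fragments under "@search.highlights";
// with the sentence-based splitting described above you get e.g.
// "It uses <em>period</em> as the delimiter."
foreach ($response['value'] as $doc) {
    print_r($doc['@search.highlights']['content'] ?? []);
}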

Analyzing Text for Accents

This is the first part of another question of mine that had a recommendation to make it two questions: Adding Accents to Speech Generation.
Summary: The other question asks how to add an accent programmatically to generated speech. Not an accent mark or inflection, but a full accent like a British, Scottish, or Russian one.
The first question (same as this one) asks how the original text could be analyzed to determine what accents need to be added and where.
Basically, how could text be analyzed to find these accents and generate a set of instructions that could be used to add any accent to any generated speech?

Intelligent file search for windows that can ignore whitespace and search in code?

Does anybody know of a Windows-based search tool that is easy to use and programmer-friendly?
The features I am looking for:
Ignore white space in search
i.e., capable of finding
myTestFunction ( $parameter, $another_parameter, $yet_another_parameter )
{ doThis();
using the query
myTestFunction($parameter,$another_parameter,$yet_another_parameter){doThis();
without regexes.
Search code "semantically" (for me, it would have to be PHP):
Search in comments only
Search in function names only
Search for parameters that are named $xyz
Search in (insert code construct here) only
If there is none around, it's high time somebody developed it! :)
I have opened a bounty for this.
See our SD Search Engine. This is a language-sensitive search engine designed to search large code bases, with special language classifiers for C, C++, Java, C#, COBOL, JavaScript, Ada, Python, Ruby and lots of other languages, including your specific target language PHP (PHP4 and PHP5).
I think it does everything you requested.
It indexes the language elements, so searches across large code bases are extremely fast (Linux kernel, ~7.5 million lines --> 2.5 seconds). (The indexing step runs on Windows, but the display engine is in Java.)
Search hits are shown in a one-line-context hit window showing the file and line number, as well as the line with the hit highlighted. Clicking on a hit brings up the source code, with tabs expanded appropriately and the line count correct even for languages with odd line-counting rules (such as GCC's treatment of form-feed characters), with the hit line and hit text highlighted. Clicking in the source window will launch your favorite editor on the file.
Because it understands language elements, it ignores language-specific whitespace. It skips over comments unless you insist they be inspected. Searches thus ignore whitespace, comments and line boundaries (if the language treats line boundaries as whitespace, which is why there are language-specific scanners). The query language allows you to specify which language tokens you want (specific tokens in quotes, or generic tokens such as identifiers I, numbers N, strings S, operators O and punctuation P), with constraints on the token value, as well as a series of tokens.
Your example search:
myTestFunction($parameter,$another_parameter,$yet_another_parameter){doThis();
would be expressed to the search engine precisely as:
I=myTestFunction '(' I ',' I ',' I ')' '{' I=dothis '(' ')' ';'
but it would probably be easier (less typing) to find it as:
I=myTest* ... I=dothis
where I=myTest* means an identifier starting with myTest and ... means "near".
The search engine also offers regular-expression searches on the text, if you insist, so you still have grep-like searches (a lot slower than indexed searches), but with the hit window and source display windows too.
I use ack really successfully for this kind of thing, particularly when trying to find things in large codebases. I run it on Linux myself, but I don't see any reason why it won't run on Windows, or in Cygwin at the very least. Check it out; I think you'll find it is exactly what you're looking for.
Search code "semantically" (for me, it would have to be PHP):
For this you could (and I think should) use some custom code built on token_get_all()
See also the available tokens
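A rough sketch of that idea (the file name is a placeholder; this only collects function names and comments, but the same token walk extends to parameters and other constructs):
<?php
// Walk the token stream of a PHP file and collect function names and comments.
$tokens = token_get_all(file_get_contents('some_source_file.php'));

$functions = [];
$comments  = [];
for ($i = 0; $i < count($tokens); $i++) {
    $token = $tokens[$i];
    if (!is_array($token)) {
        continue;                         // single-character tokens like "{" or ";"
    }
    [$id, $text] = $token;
    if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
        $comments[] = $text;              // "search in comments only"
    }
    if ($id === T_FUNCTION) {
        $j = $i + 1;                      // skip whitespace to reach the name
        while (isset($tokens[$j]) && is_array($tokens[$j]) && $tokens[$j][0] === T_WHITESPACE) {
            $j++;
        }
        if (isset($tokens[$j]) && is_array($tokens[$j]) && $tokens[$j][0] === T_STRING) {
            $functions[] = $tokens[$j][1]; // "search in function names only"
        }
    }
}

print_r($functions);
print_r($comments);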
Ignore white space in search
A simple regex should be sufficient. It depends on your regex library, but most come with a whitespace modifier/flag.
For my Windows desktop search, I use Agent Ransack. I use this as a replacement for the built-in Windows search.
You can use regular expressions, but there is a nice entry screen if you want to avoid entering them directly.
Take a look at the Google Desktop API; it has a very powerful set of methods to do what you're looking for.
Of course, it requires you to have Google Desktop installed.
After reviewing it a little, it provides some of the functionality, but nothing as specific as what you require.
I really like Crimson Editor and it allows RegEx searches. It has helped me a bunch over the past six years. I think it will fit your needs. Try it.
I use TextPad for searching code files in Windows. It has a very handy find-in-files function (Search / Find In Files) and you can use regex which should meet any search requirements. In the search results it will list the file location, line number and a snippet from that line.

Resources