Is there a way to detect punctuation such as periods and commas in the audio captured by Google Home or Assistant? The output text is one long sentence instead of sentences separated by periods.
I am thinking it can be found in the action package or in the requests and responses of the fulfillment URL. The closest thing I found is the Google Speech-to-Text API, which requires an audio file.
Thank you in advance.
Edit: I am using the Actions SDK from Google Actions.
If the user does not provide the punctuation verbally, you will only ever see a string of words with no punctuation.
You can instruct your users to dictate their punctuation, however.
It is not very hard. Let me show you how, by example.
Those last two sentences can be dictated:
it is not very hard period let me show you how comma by example period
That will get your desired outcome.
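If the transcript arrives with the punctuation still spelled out as words, the fulfillment code can convert it. The following is a minimal sketch under that assumption; the `SPOKEN_PUNCTUATION` table and the `punctuate` function are hypothetical names for illustration, not part of the Actions SDK.

```python
import re

# Hypothetical mapping from dictated words to punctuation symbols.
SPOKEN_PUNCTUATION = {
    "period": ".",
    "comma": ",",
    "question mark": "?",
    "exclamation mark": "!",
}


def punctuate(transcript: str) -> str:
    """Replace dictated punctuation words and capitalize sentence starts."""
    # Try longer names first so multi-word names are not shadowed.
    names = sorted(SPOKEN_PUNCTUATION, key=len, reverse=True)
    pattern = re.compile(
        r"\s*\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    # Swallow the whitespace before the word so "hard period" becomes "hard."
    text = pattern.sub(lambda m: SPOKEN_PUNCTUATION[m.group(1).lower()], transcript)
    # Capitalize the first letter of the text and of each new sentence.
    return re.sub(
        r"(^|[.?!]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(punctuate("it is not very hard period let me show you how comma by example period"))
# -> It is not very hard. Let me show you how, by example.
```

Note that this simple replacement cannot tell a dictated "period" from the ordinary word "period", so it only suits cases where users have been told to dictate punctuation.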
I have many images of text that contain a pound sign (£), but as far as I have tested, the sign is NEVER correctly recognized by the Azure Cognitive Services RecognizeText API. Other symbols, such as the dollar sign ($), are identified without problems.
I ran tests with screenshots of text containing £, since these should be easy for the OCR tool to convert, and again the pound sign was not correctly identified (it becomes an f, a 2, a 1, a $, etc.).
I suspect that the pound sign is not included in the set of characters the tool supports, although I couldn't find a specific mention of that in the documentation (only that the tool is experimental and is optimized for English).
Has anyone been able to correctly convert a £ using the tool, or does anyone know FOR SURE (possibly through documentation) that £ is not included in their character set?
Thanks!
I am currently coding in Python and have managed to use pdftotext to extract the text from a PDF.
The extracted text is split into a list of strings. Using regular expressions, I am able to find the specific words I am interested in. The reason I divide the text into a list is that I want to measure the distance between two specific words, where by distance I mean the number of words between them.
However, after finding the positions of the words I would like to be able to refer back to the original PDF. Specifically, I am interested in the page and maybe even the line (if PDF supports this kind of structure) where these words are located.
One idea I have is to run this process on each page of the PDF separately, so that when I find these words I know which page they are on. But this has the big disadvantage that page breaks do not necessarily fall in natural places: I would lose the ability to find the words if they happen to be separated by a page break.
Do you have any idea how to do this in a more sophisticated manner?
You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file; which one you use depends on what you want to do with the text after extraction. The ReadingOrderTextExtractor creates a list of lists that lets you extract the text and examine the content of paragraphs, the sentences within those paragraphs, and the words within each sentence. You'll be able to tell not only the distance between the words but also whether they are in the same sentence or paragraph. Once you've found a Word object, you can then find both its location on the page, allowing for highlighting, and the page number it's on.
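Staying with the Python approach from the question, a rougher alternative is to tag every word with its page and line before flattening the text, so that distances can still be measured across page breaks. The sketch below assumes the text is already available as one string per page (for example by splitting the pdftotext output on the form feed character, which is an assumption about how the text was extracted); `index_words` and `find_pairs` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

WordEntry = Tuple[str, int, int]  # (word, page number, line number on that page)


def index_words(pages: List[str]) -> List[WordEntry]:
    """Flatten per-page text into one word list that remembers page and line."""
    words: List[WordEntry] = []
    for page_no, page in enumerate(pages, start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            for word in line.split():
                words.append((word, page_no, line_no))
    return words


def find_pairs(words: List[WordEntry], first: str, second: str, max_gap: int) -> List[Dict]:
    """Find `second` following `first` with at most `max_gap` words in between."""
    hits = []
    for i, (word, page, line) in enumerate(words):
        if word.lower() != first:
            continue
        # Look ahead in the flat list, so page breaks do not interrupt the search.
        for j in range(i + 1, min(i + max_gap + 2, len(words))):
            if words[j][0].lower() == second:
                hits.append({
                    "gap": j - i - 1,
                    "first_at": (page, line),
                    "second_at": words[j][1:],
                })
    return hits


# Example with two tiny "pages"; real input might come from text.split("\f").
pages = ["alpha beta\ngamma delta", "epsilon zeta"]
print(find_pairs(index_words(pages), "delta", "zeta", max_gap=3))
```

The line numbers here are lines of the extracted text, which may not match the visual lines of the PDF, and words with attached punctuation would need extra normalization before matching.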
I am trying to find out whether there is a way to underline text posted to Slack. I am using a webhook to post messages to Slack.
You can approximate it with Unicode’s COMBINING LOW LINE character: http://www.fileformat.info/info/unicode/char/0332/index.htm . Before posting, split your string along grapheme boundaries and insert a COMBINING LOW LINE after each. This sort of works, but with Slack’s default font the underline sometimes splits visually between characters. It’s enough though to give an impression, which might be what you want if, for example, you’re trying to give an example of the position of a link within a piece of text.
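A minimal sketch of that insertion step, using only the standard library. It approximates grapheme boundaries by keeping a base character together with any combining marks that follow it; full grapheme segmentation (emoji sequences, for instance) would need a dedicated library, so treat the `underline` helper as an illustration only.

```python
import unicodedata

COMBINING_LOW_LINE = "\u0332"


def underline(text: str) -> str:
    """Append COMBINING LOW LINE after each base character and its combining marks."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Wait until the cluster is complete before adding the underline,
        # so existing accents stay attached to their base character.
        if not (nxt and unicodedata.combining(nxt)):
            out.append(COMBINING_LOW_LINE)
    return "".join(out)


print(underline("click here"))
```

The resulting string can be sent as the message text in the webhook payload like any other text.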
I don't think this can be done. See https://api.slack.com/docs/formatting for the available message formatting options.
I am using hit highlighting in Azure Search. It works fine, but I want to fine-tune it a bit.
Say, a field has the following value:
"It uses period as the delimiter. If not, please clarify"
If I search for "please" I will get a highlight hit on that field, e.g.:
"If not, <em>please</em> clarify"
If I search for "period" I will get a highlight hit on that field, e.g.:
"It uses <em>period</em> as the delimiter."
After trying it with several examples, it seems that it uses the period (".") as a delimiter so that it doesn't return the whole field.
From another SO question (Hit Highlighting in Azure Search Service), it seems that I cannot configure Azure Search to return the whole field with all terms highlighted.
I want to ask:
Is this really the case, or do more complex rules apply?
Do I have any control over how the field is split for hit highlighting, e.g. changing the delimiter to "," or "\n"?
Thanks in advance
Unfortunately, there is no way to customize how documents are split for hit highlighting. Feel free to use the Azure Search User Voice website to post improvement ideas, which gives other users the opportunity to vote for them and helps us prioritize: http://feedback.azure.com/forums/263029-azure-search
The hit highlighter splits documents into sentences. In general it's fair to assume it breaks on dots but it also handles abbreviations etc.
Question about Lucene:
I have a file that I would like to index and search with different analyzers. My goal is to be able to change how I search.
In one case I would like to search for an exact phrase with punctuation, e.g. "one,two", and only return exact matches with punctuation.
I would also like to be able to search for the exact phrase without punctuation, e.g. "one two", as with the StandardAnalyzer.
Essentially I need to change the search functionality on one field.
How can I change the search on the same file? I've tried using two analyzers (standard and whitespace); however, this makes the indexing time very long.
My second thought is to use just a WhitespaceAnalyzer and, when searching, pass a query that further tokenizes each string if needed. However, I am not sure which API offers this, if any does.
Also, is there good reading material on how analyzers and tokens work and are implemented?
Thanks
What do you mean by "tried two analyzers"? Did you duplicate the content into two separate fields with different analyzers? That would be my suggestion.
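To illustrate the two-field idea without relying on any particular Lucene version, here is a toy sketch in plain Python. It is not the Lucene API: the `TwoFieldIndex` class and both tokenizer functions are hypothetical stand-ins, and the search is a simple all-terms match rather than a true phrase query. In Lucene itself, one analyzer per field is commonly configured through `PerFieldAnalyzerWrapper`.

```python
import re
from collections import defaultdict


def whitespace_tokens(text):
    """Stand-in for a whitespace analyzer: punctuation stays inside tokens."""
    return text.split()


def standard_like_tokens(text):
    """Stand-in for a standard analyzer: lowercase, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


class TwoFieldIndex:
    """Index the same content twice, once per analyzer, as two separate fields."""

    def __init__(self):
        self.analyzers = {"exact": whitespace_tokens, "loose": standard_like_tokens}
        self.fields = {name: defaultdict(set) for name in self.analyzers}

    def add(self, doc_id, text):
        for field, analyze in self.analyzers.items():
            for token in analyze(text):
                self.fields[field][token].add(doc_id)

    def search(self, field, query):
        # Analyze the query with the same analyzer used for that field.
        tokens = self.analyzers[field](query)
        if not tokens:
            return set()
        postings = [self.fields[field].get(t, set()) for t in tokens]
        return set.intersection(*postings)


index = TwoFieldIndex()
index.add(1, "one,two three")
index.add(2, "one two three")
print(index.search("exact", "one,two"))  # only the document containing "one,two"
print(index.search("loose", "one two"))  # both documents
```

The caller picks the behavior at query time simply by choosing which field to search, at the cost of indexing the content twice.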