Google Speech API transcribe email - speech-to-text

I'm having trouble transcribing email using Google Speech REST API. The best I can get is most of the email address, however Google Speech ignores "dot" and "dot com". For example first.last#gmail.com returns "First Last at gmail". If I say "period" instead of "dot" I at least get "First. Last at gmail." I'm using the following:
{
"config": {
"encoding": "MULAW",
"sampleRateHertz": 8000,
"languageCode": "en-US",
"maxAlternatives": 0,
"profanityFilter": true,
"enableWordTimeOffsets": false,
"model": "phone_call",
"useEnhanced": true
},
"audio": {
"content":"&&NameBase64&&"
}
}
I've tried add "dot" as a speech context with no changes. ".", ".com", "com", and "kom" also didn't change the results.
{
"config": {
"encoding": "MULAW",
"sampleRateHertz": 8000,
"languageCode": "en-US",
"maxAlternatives": 1,
"profanityFilter": true,
"enableWordTimeOffsets": false,
"model": "phone_call",
"useEnhanced": true,
"speechContexts": [{
"phrases": ["dot"],
}],
},
"audio": {
"content":"Base64Recording"
}
}
I've tried adding alphanumberic speech contexts and spelling it out but the results were pretty bad.
Any thoughts on how I can get "." or "dot" and "com" to show up in the transcription would be greatly appreciated.

Have you tried providing a boost value for the phrase? I'm facing the same issue and I noticed that increasing the boost value helped in identifying the word "dot".
Boost values are usually between 0 and 20, but applying anything above 10 helped in recognizing the "dot".
Here's an example:-
"config": {
"encoding": "MULAW",
"sampleRateHertz": 8000,
"languageCode": "en-US",
"maxAlternatives": 1,
"profanityFilter": true,
"enableWordTimeOffsets": false,
"model": "phone_call",
"useEnhanced": true,
"speechContexts": [{
"phrases": ["dot"],
"boots": 15.0
}],
},
"audio": {
"content":"Base64Recording"
}
}
You can also have multiple key value pairs in the context, each with different boost values. For example, this is what I use to detect email addresses:-
[{
phrases: ["$OOV_CLASS_ALPHANUMERIC_SEQUENCE"],
boost: 14.0
},
{
phrases: ["gmail.com","yahoo.com","aol.com","outlook.com"],
boost:5.0
},
{
phrases: ["com",".","c o m",".com","dotcom","dot com","dot","at","at the rate","#"],
boost: 10.0
},
{
phrases: ["org","io","dot org","dot io","gov","dot gov","net","dot net","co","dot co"],
boost:8.0
},
{
phrases: ["$OOV_CLASS_DIGIT_SEQUENCE","8","naught","z","zed","zee","zz","d","aa","ae","ee","oo","ii","ay","eh","ahh","ah","ze","dee",
"1","2","3","4","5","6","7","8","9","0","zero",],
boost: -20.0
}
]
Notice, the phrases with negative boost values will help weed out words that are often misunderstood.

Related

How to get catalog search results by keywords via Amazon SP API that like Amazon website search results?

First method "listCatalogItems" produces correct results but limits max 10 ASINs. And now this method is deprecated.
Other method "searchCatalogItems" produces INcorrect random results.
fyi listCatalogItems says it's deprecated but it still works
I am getting correct results when I use searchCatalogItems. Here is my postman call: https://sellingpartnerapi-na.amazon.com/catalog/2022-04-01/items?marketplaceIds=ATVPDKIKX0DER&keywords=samsung
and part of my results:
{
"numberOfResults": 54592886,
"pagination": {
"nextToken": "9HkIVcuuPmX_bm51o3-igBfN45pxW4Ru7ElIM6GCECYCuXJKzT26f-3Tfs1Ro3IhelNA74VxDMJwt_JvE7qiRh0loZTzTpEBWUbZ8HB0T4ttV8cFw4xYQ4RMUzdY_udbnvAHOHCcZcycn0nW8RotZh1l1vj7KQoFIa7pWiOPHyaYWP7sBE9Fg7cGN2wE0an5ePw96h6ZL7m6olRxFOcqTWNanEVRjipq"
},
...
"items": [
{
"asin": "B09YN4W5C1",
"summaries": [
{
"marketplaceId": "ATVPDKIKX0DER",
"adultProduct": false,
"autographed": false,
"brand": "SAMSUNG",
"itemClassification": "VARIATION_PARENT",
"itemName": "SAMSUNG Jet Bot Robot Vacuum Cleaner",
"manufacturer": "SAMSUNG",
"memorabilia": false,
"packageQuantity": 1,
"tradeInEligible": false,
"websiteDisplayGroup": "home_display_on_website",
"websiteDisplayGroupName": "Home"
}
]
},
{
"asin": "B01AQ6OWAG",
"summaries": [
{
"marketplaceId": "ATVPDKIKX0DER",
"adultProduct": false,
"autographed": false,
"brand": "SAMSUNG",
"browseClassification": {
"displayName": "Remote Controls",
"classificationId": "10967581"
},
"itemClassification": "BASE_PRODUCT",
"itemName": "SAMSUNG TV Remote Control BN59-01199F by Samsung",
"manufacturer": "Samsung",
"memorabilia": false,
"modelNumber": "BN59-01199F",
"packageQuantity": 1,
"partNumber": "BN59-01199F",
"tradeInEligible": false,
"websiteDisplayGroup": "ce_display_on_website",
"websiteDisplayGroupName": "CE"
}
]
},

Indexing e-mails in Azure Search

I'm trying to best index contents of e-mail messages, subjects and email addresses. E-mails can contain both text and HTML representation. They can be in any language so I can't use language specific analysers unfortunately.
As I am new to this I have many questions:
First I used Standard Lucene analyser but after some testing and
checking what each analyser does I switched to using "simple"
analyser. Standard one didn't allow me to search by domain in
user#domain.com (It sees user and domain.com as tokens). Is "simple" the best I can use in my case?
How can I handle HTML contents of e-mail? I thought this should be
possible to do it in Azure Search but right now I think I would need
to strip HTML tags myself.
My users aren't tech savvy and I assumed "simple" query type will be
enough for them. I expect them to type word or two and find messages
containing this word/containing words starting with this word. From my tests it looks I need to append * to their queries to get "starting with" to work?
It would help if you included an example of your data and how you index and query. What happened, and what did you expect?
The standard Lucene analyzer will work with your user#domain.com example. It is correct that it produces the tokens user and domain.com. But the same happens when you query, and you will get records with the tokens user and domain.com.
CREATE INDEX
"fields": [
{"name": "Id", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] },
{"name": "Email", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "analyzer": "standard"}
]
UPLOAD
{
"value": [
{
"#search.action": "mergeOrUpload",
"Id": "1",
"Email": "user#domain.com"
},
{
"#search.action": "mergeOrUpload",
"Id": "2",
"Email": "some.user#some-domain.com"
},
{
"#search.action": "mergeOrUpload",
"Id": "3",
"Email": "another#another.com"
}
]
}
QUERY
Query, using full and all.
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/docs?search=user#domain.com&$count=true&$select=Id,Email&searchMode=all&queryType=full&api-version={{API-VERSION}}
Which produces results as expected (all records containing user and domain.com):
{
"#odata.context": "https://<your-search-env>.search.windows.net/indexes('dg-test-65392234')/$metadata#docs(*)",
"#odata.count": 2,
"value": [
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
},
{
"#search.score": 0.25316024,
"Id": "2",
"Email": "some.user#some-domain.com"
}
]
}
If your expected result is to only get the record above where the email matches completely, you could instead use a phrase search. I.e. replace the search parameter above with search="user#domain.com" and you would get:
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
}
Alternatively, you could use the keyword analyzer.
ANALYZE
You can compare the different analyzers directly via REST. Using the keyword analyzer on the Email property will produce a single token.
{
"text": "some-user#some-domain.com",
"analyzer": "keyword"
}
Results in the following tokens:
"tokens": [
{
"token": "some-user#some-domain.com",
"startOffset": 0,
"endOffset": 25,
"position": 0
}
]
Compared to the standard tokenizer, which does a decent job for most types of unstructured content.
{
"text": "some-user#some-domain.com",
"analyzer": "standard"
}
Which produces reasonable results for cases where the email address was part of some generic text.
"tokens": [
{
"token": "some",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{
"token": "user",
"startOffset": 5,
"endOffset": 9,
"position": 1
},
{
"token": "some",
"startOffset": 10,
"endOffset": 14,
"position": 2
},
{
"token": "domain.com",
"startOffset": 15,
"endOffset": 25,
"position": 3
}
]
SUMMARY
This is a long answer already, so I won't cover your other two questions in detail. I would suggest splitting them to separate questions so it can benefit others.
HTML content: You can use a built-in HTML analyzer that strips HTML tags. Or you can strip the HTML yourself using custom code. I typically use Beautiful Soup for cases like these or simple regular expressions for simpler cases.
Wildcard search: Usually, users don't expect automatic wildcards appended. The only application that does this is the Outlook client, which destroys precision. When I search for "Jan" (a common name), I annoyingly get all emails sent in January(!). And a search for Dan (again, a name), I also get all emails from Danmark (Denmark).
Everything in search is a trade-off between precision and recall. In your first example with the email address, your expectation was heavily geared toward precision. But, in your last wildcard question, you seem to prefer extreme recall with wildcards on everything. It all comes down to your expectations.

Returning partial matches in Azure Search

A while ago I set up a search index for a web application. One of the requirements was to return partial matches of the search terms. For instance, searching for Joh should find John Doe. The most straightforward way to implement this was to append a * to each search term before posting the query to Azure Search. So if a user types Joh, we actually ask Azure Search to search for Joh*.
One limitation of this approach is that all the matches of Joh* have the same search score. Because of this, sometimes a partial match appears higher in the results than an exact match. This is documented behavior, so I guess there is not much I can do about it. Or can I?
While my current way to return partial matches seems like a hack, it has worked well enough in practice that I didn't matter finding out how to properly solve the problem. Now I have the time to look into it and my instinct says there must be a "proper" way to do this. I have read the word "ngrams" here and there, and it seems to be part of the solution. I could probably find a passable solution after some of hours of hacking on it, but if there is any "standard way" to achieve what I want, I would rather follow that path instead of using a home-grown hack. Hence this question.
So my question is: is there a standard way to retrieve partial matches in Azure Search, while giving exact matches a higher score? How should I change the code below to make Azure Search return the search results I need?
The code
Index definition, as returned by the Azure API:
{
"name": "test-index",
"defaultScoringProfile": null,
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
},
{
"name": "name",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": []
}
The documents, as posted to the Azure API:
{
"value": [
{
"#search.action": "mergeOrUpload",
"id": "1",
"name": "Joh Doe"
},
{
"#search.action": "mergeOrUpload",
"id": "2",
"name": "John Doe"
}
]
}
Search query, as posted to the Azure API:
{
search: "Joh*"
}
Results, where the exact match appears second, while we would like it to appear first:
{
"value": [
{
"#search.score": 1,
"id": "2",
"name": "John Doe"
},
{
"#search.score": 1,
"id": "1",
"name": "Joh Doe"
}
]
}
This is a very good question and thanks for providing a detailed explanation. The easiest way to achieve that would be to use term boosting on the actual term and combine it with a wildcard query. You can modify the query in your post to -
search=Joh^10 OR Joh*&queryType=full
This will score the documents that match Joh exactly higher. If you have more complicated requirements, you can look at constructing a custom analyzer with ngrams to search on them to support partial search.

Returning the "search term" along with result - Elasticsearch

In the elasticsearch module I have built, is it possible to return the "input search term" in the search results ?
For example :
GET /signals/_search
{
"query": {
"match": {
"focused_content": "stock"
}
}
}
This returns
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.057534903,
"hits": [
{
"_index": "signals",
"_type": "signal",
"_id": "13",
"_score": 0.057534903,
"_source": {
"username": "abc#abc.com",
"tags": [
"News"
],
"content_url": "http://www.wallstreetscope.com/morning-stock-highlights-western-digital-corporation-wdc-fibria-celulose-sa-fbr-ametek-inc-ame-cott-corporation-cot-graftech-international-ltd-gti/25375462/",
"source": null,
"focused_content": "Morning Stock Highlights: Western Digital Corporation (WDC), Fibria Celulose SA (FBR), Ametek Inc. (AME), Cott Corporation (COT), GrafTech International Ltd. (GTI) - WallStreet Scope",
"time_stamp": "2015-08-12"
}
}
]
}
Is it possible to have the input search term "stock" along with each of the results (like an additional JSON Key along with "content_url","source","focused_content","time_stamp") to identify which search term had brought that result ?
Thanks in Advance !
All I can think of, would be using highlighting feature. So it would bring back additional key _highlightand it would highlight things, that matched.
It won't bring exact matching terms, tho. You'd have to deal with them in your application. You could use pre/post tags functionality to wrap them up somehow specially, so your app could recognize that it was a match.
You can use highlights on all fields, like #Evaldas suggested. This will return the result along with the value in the field which matched, surrounded by customisable tags (default is <em>).
GET /signals/_search
{
"highlight": {
"fields": {
"username": {},
"tags": {},
"source": {},
"focused_content": {},
"time_stamp": {}
}
},
"query": {
"match": {
"focused_content": "stock"
}
}
}

Elasticsearch index short words + make indexes applying EdgeNGram

I am using Elasticsearch with a EdgeNGram filter which is set as follows:
"edgeNGram": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 15,
},
The problem is that when I make a query using very short words, they are completely omitted from the search. Let's say I type in "Vitamin C" -> this gives me results for the first term "Vitamin" only. Is there any way how to tell Elasticsearch not to use EdgeNGram filter when indexing words up to 3 characters?
Thank you.
EDIT:
These are my settings:
ELASTICSEARCH_INDEX_SETTINGS = {
"settings": {
"analysis": {
"analyzer": {
"sk_hunspell": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"sk_lowercase", "sk_SK", "stopwords_SK",
"edgeNGram", "asciifolding",
"remove_duplicities",
]
},
},
"filter": {
"sk_SK": {
"type": "hunspell",
"locale": "sk_SK",
"dedup": True,
"recursion_level": 0,
"ignore_case": True,
},
"sk_lowercase": {
"type": "lowercase",
},
"stopwords_SK": {
"type": "stop",
"stopwords": STOPWORDS_SK,
},
"remove_duplicities": {
"type": "unique",
"only_on_same_position": True
},
"edgeNGram": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 15,
"token_chars": ["letter", "digit"],
},
},
}
}
}
In the database I store information about vitamins, minerals and medicinal plants. (Their use, collecting, blooming, health benefits etc.) The information are written in Slovak. (The names of the plants and minerals are also stored in Czech and Latin).
This idea may be a hack but you could pad words less than 3 with a special charecter before inserting them into the index so they are length 3.
When you accept the user's query you would have to also pad their words less than three with the same special charecter.
You would need to create a custom tokenizer for this.

Resources