Facebook's Duckling Cannot Identify Time Dimension Correctly - haskell

I'm using Facebook's Duckling to parse text. When I pass the text: 13h 47m it correctly classifies the entire text as DURATION (= 13 hours 47 minutes).
However, when I pass the text: 13h 47m 13s, it fails to recognise the 13s part of the string as belonging to the DURATION. I was expecting it to parse 13 hours, 47 minutes and 13 seconds, but the trailing 13s is instead returned as a separate number.
Command: curl -XPOST http://127.0.0.1:0000/parse --data 'locale=en_US&text=13h 47m 13s'
JSON Array:
[
{
"latent": false,
"start": 0,
"dim": "duration",
"end": 7,
"body": "13h 47m",
"value": {
"unit": "minute",
"normalized": {
"unit": "second",
"value": 49620
},
"type": "value",
"value": 827,
"minute": 827
}
},
{
"latent": false,
"start": 8,
"dim": "number",
"end": 10,
"body": "13",
"value": {
"type": "value",
"value": 13
}
}
]
Is this a bug? How can I update Duckling so that it parses the text as described above?

The documentation seems pretty clear about this:
To extend Duckling's support for a dimension in a given language, typically 4 files need to be updated:
Duckling/<Dimension>/<Lang>/Rules.hs
Duckling/<Dimension>/<Lang>/Corpus.hs
Duckling/Dimensions/<Lang>.hs (if not already present in Duckling/Dimensions/Common.hs)
Duckling/Rules/<Lang>.hs
Taking a look in Duckling/Duration/Rules.hs, I see:
ruleIntegerUnitofduration = Rule
{ name = "<integer> <unit-of-duration>"
, pattern =
[ Predicate isNatural
, dimension TimeGrain
]
-- ...
So next I peeked in Duckling/TimeGrain/EN/Rules.hs (because Duckling/TimeGrain/Rules.hs did not exist), and see:
grains :: [(Text, String, TG.Grain)]
grains = [ ("second (grain) ", "sec(ond)?s?", TG.Second)
-- ...
Presumably this means 13h 47m 13sec would parse the way you want. To make 13h 47m 13s parse in the same way, I guess the first thing I would try would be to make the regex above a bit more permissive, maybe something like s(ec(ond)?s?)?, and see if that does the trick without breaking anything else you care about.
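To sanity-check such a change after rebuilding, you could hit the /parse endpoint again. Here is a rough Python sketch of that check (port 8000 is the example server's default and only an assumption here; use whatever port your server actually listens on):
# Sketch: re-run the original query against a locally running Duckling server.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/parse",  # assumed port
    data={"locale": "en_US", "text": "13h 47m 13s"},
)
for entity in resp.json():
    print(entity["dim"], repr(entity["body"]))
# Before the change this prints a "duration" for "13h 47m" plus a "number"
# for "13"; after loosening the regex you would hope to see a single
# duration covering the whole string.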

Related

Searching for sub-objects with a date range containing the queried date value

Let's say we're handling the advertising of various job openings across several channels (newspapers, job boards, etc.). For each channel, we can buy a "publication period" which will mean the channel will advertise our job openings during that period. How can we find the jobs for a given channel that have a publication period valid for today (i.e. starting on or before today, and ending on or after today)? The intent is to be able to generate a feed of "active" job openings that (e.g.) a job board can consume periodically to determine which jobs should be displayed to its users.
Another wrinkle is that each job opening is associated with a given tenant id: the feeds will have to be generated scoped to tenant and channel.
Let's say we have the following simplified documents (if you think the data should be modeled differently, please let me know also):
{
"_id": "A",
"tenant_id": "foo",
"name": "Job A",
"publication_periods": [
{
"channel": "linkedin",
"start": "2021-03-10T00:00:0.0Z",
"end": "2021-03-17T00:00:0.0Z"
},
{
"channel": "linkedin",
"start": "2021-04-10T00:00:0.0Z",
"end": "2021-04-17T00:00:0.0Z"
},
{
"channel": "monster.com",
"start": "2021-03-10T00:00:0.0Z",
"end": "2021-03-17T00:00:0.0Z"
}
]
}
{
"_id": "B",
"tenant_id": "foo",
"name": "Job B",
"publication_periods": [
{
"channel": "linkedin",
"start": "2021-04-10T00:00:0.0Z",
"end": "2021-04-17T00:00:0.0Z"
},
{
"channel": "monster.com",
"start": "2021-03-15T00:00:0.0Z",
"end": "2021-03-20T00:00:0.0Z"
}
]
}
{
"_id": "C",
"tenant_id": "foo",
"name": "Job C",
"publication_periods": [
{
"channel": "monster.com",
"start": "2021-05-15T00:00:0.0Z",
"end": "2021-05-20T00:00:0.0Z"
}
]
}
{
"_id": "D",
"tenant_id": "bar",
"name": "Job D",
"publication_periods": [
...
]
}
How can I query the jobs linked to tenant "foo" that have an active publication period for "monster.com" on the date of 17.03.2021? (I.e. this query should return both jobs A and B.)
Note that the DB will contain documents of other (irrelevant) types.
Since I essentially need to "find all job openings containing an object in the publication_periods array having: CHAN as the channel value, "start" <= DATE, "end" >= DATE" it appears I'd require a Mango query to achieve this, as standard view queries don't provide comparison operators (if this is mistaken, please correct me).
Naturally, I want the Mango query to be executed only on relevant data (i.e. excluding documents that aren't job openings), but I can't find references on how to do this (whether in the docs or elsewhere): all resources I found simply seem to define the Mango index on the entire set of documents, relying on the fact that documents where the indexed field is absent won't be indexed.
How can I achieve what I'm after?
Initially, I was thinking of creating a view that would emit the publication period information along with a {'_id': id} object in order to "JOIN" the job opening document to the matching periods at query time (per Best way to do one-to-many "JOIN" in CouchDB). However, I realized that I wouldn't be able to query this view as needed (i.e. "start" value before today, "end" value after today) since I wouldn't have a definite start/end key to use... And I have no idea how to properly leverage a Mango index/query for this. Presumably I'd have to create a partial index based on document type and the presence of publication periods, but how can I even index the multiple publication periods that can be located within a single document? Can a Mango index be defined against a specific view as opposed to all documents in the DB?
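To make the "partial index" idea above concrete, this is roughly what I have in mind, using CouchDB's partial_filter_selector on a Mango "json" index (the database URL, credentials and names below are placeholders). As I understand it, the query's selector would then also have to repeat the filter condition for the index to be considered:
# Rough sketch of a partial index that only covers documents carrying
# publication periods, so other document types stay out of the index.
import requests

DB = "http://localhost:5984/jobs"  # placeholder database URL

index_def = {
    "index": {
        "fields": ["tenant_id"],
        "partial_filter_selector": {
            "publication_periods": {"$exists": True}
        },
    },
    "ddoc": "job-openings-partial-index",  # placeholder name
    "type": "json",
}
requests.post(f"{DB}/_index", json=index_def, auth=("admin", "secret"))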
I stumbled upon this answer Mango search in Arrays indicating that I should be able to index the data with
{
"index": {
"fields": [
"tenant_id",
"publication_periods.[].channel",
"publication_periods.[].start",
"publication_periods.[].end"
]
},
"ddoc": "job-openings-periods-index",
"type": "json"
}
And then query them with
{
"selector": {
"tenant_id": "foo",
"publication_periods": {
"$elemMatch": {
"$and": [
{
"channel": "monster.com"
},
{
"start": {
"$lte": "2021-03-17T00:00:0.0Z"
}
},
{
"end": {
"$gte": "2021-03-17T00:00:0.0Z"
}
}
]
}
}
},
"use_index": "job-openings-periods-index"
"execution_stats": true
}
Sadly, I'm informed that the index "was not used because it does not contain a valid index for this query", and the performance is terrible, which I will leave for another question.
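In case it helps to diagnose the index selection, here is a sketch of posting the same selector to the _explain endpoint, which reports the index CouchDB would use (database URL and credentials are placeholders; the explicit $and is dropped, which should be equivalent):
# Sketch: ask CouchDB which index it would pick for this selector.
import requests

selector = {
    "tenant_id": "foo",
    "publication_periods": {
        "$elemMatch": {
            "channel": "monster.com",
            "start": {"$lte": "2021-03-17T00:00:0.0Z"},
            "end": {"$gte": "2021-03-17T00:00:0.0Z"},
        }
    },
}
explanation = requests.post(
    "http://localhost:5984/jobs/_explain",  # placeholder database URL
    json={"selector": selector},
    auth=("admin", "secret"),
).json()
print(explanation["index"])  # the index CouchDB chose, if any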

Indexing e-mails in Azure Search

I'm trying to work out how best to index the contents of e-mail messages, their subjects and email addresses. E-mails can contain both text and HTML representations. They can be in any language, so unfortunately I can't use language-specific analysers.
As I am new to this I have many questions:
1. First I used the standard Lucene analyser, but after some testing and checking what each analyser does I switched to the "simple" analyser. The standard one didn't allow me to search by domain in user@domain.com (it sees user and domain.com as separate tokens). Is "simple" the best I can use in my case?
2. How can I handle the HTML contents of e-mails? I thought it would be possible to do this in Azure Search, but right now I think I would need to strip the HTML tags myself.
3. My users aren't tech savvy, and I assumed the "simple" query type would be enough for them. I expect them to type a word or two and find messages containing that word, or containing words starting with it. From my tests it looks like I need to append * to their queries to get "starting with" to work?
It would help if you included an example of your data and how you index and query. What happened, and what did you expect?
The standard Lucene analyzer will work with your user@domain.com example. It is correct that it produces the tokens user and domain.com. But the same tokenization happens at query time, so you will get back the records containing the tokens user and domain.com.
CREATE INDEX
"fields": [
{"name": "Id", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] },
{"name": "Email", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "analyzer": "standard"}
]
UPLOAD
{
"value": [
{
"#search.action": "mergeOrUpload",
"Id": "1",
"Email": "user#domain.com"
},
{
"#search.action": "mergeOrUpload",
"Id": "2",
"Email": "some.user#some-domain.com"
},
{
"#search.action": "mergeOrUpload",
"Id": "3",
"Email": "another#another.com"
}
]
}
QUERY
Query, using full and all.
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/docs?search=user@domain.com&$count=true&$select=Id,Email&searchMode=all&queryType=full&api-version={{API-VERSION}}
Which produces results as expected (all records containing user and domain.com):
{
"#odata.context": "https://<your-search-env>.search.windows.net/indexes('dg-test-65392234')/$metadata#docs(*)",
"#odata.count": 2,
"value": [
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
},
{
"#search.score": 0.25316024,
"Id": "2",
"Email": "some.user#some-domain.com"
}
]
}
If your expected result is to get only the record above where the email matches completely, you could instead use a phrase search, i.e. replace the search parameter above with search="user@domain.com", and you would get:
{
"#search.score": 0.51623213,
"Id": "1",
"Email": "user#domain.com"
}
Alternatively, you could use the keyword analyzer.
ANALYZE
You can compare the different analyzers directly via REST. Using the keyword analyzer on the Email property will produce a single token.
{
"text": "some-user#some-domain.com",
"analyzer": "keyword"
}
Results in the following tokens:
"tokens": [
{
"token": "some-user#some-domain.com",
"startOffset": 0,
"endOffset": 25,
"position": 0
}
]
Compare this with the standard analyzer, which does a decent job for most types of unstructured content.
{
"text": "some-user#some-domain.com",
"analyzer": "standard"
}
It produces reasonable results for cases where the email address is part of some generic text.
"tokens": [
{
"token": "some",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{
"token": "user",
"startOffset": 5,
"endOffset": 9,
"position": 1
},
{
"token": "some",
"startOffset": 10,
"endOffset": 14,
"position": 2
},
{
"token": "domain.com",
"startOffset": 15,
"endOffset": 25,
"position": 3
}
]
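If you want to script that comparison, here is a rough sketch of the two analyze calls above using the Analyze Text REST API (service name, index name, API key and api-version below are placeholders):
# Sketch: reproduce the keyword vs. standard analyzer comparison shown above.
import requests

URL = ("https://<your-search-env>.search.windows.net"
       "/indexes/<your-index>/search.analyze?api-version=2020-06-30")
HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}

for analyzer in ("keyword", "standard"):
    body = {"text": "some-user@some-domain.com", "analyzer": analyzer}
    tokens = requests.post(URL, headers=HEADERS, json=body).json()["tokens"]
    print(analyzer, [t["token"] for t in tokens])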
SUMMARY
This is a long answer already, so I won't cover your other two questions in detail. I would suggest splitting them to separate questions so it can benefit others.
HTML content: You can use a built-in HTML analyzer that strips HTML tags. Or you can strip the HTML yourself using custom code. I typically use Beautiful Soup for cases like these or simple regular expressions for simpler cases.
Wildcard search: Usually, users don't expect wildcards to be appended automatically. The only application I know of that does this is the Outlook client, and it destroys precision. When I search for "Jan" (a common name), I annoyingly get all emails sent in January(!). And when I search for Dan (again, a name), I also get all emails from Danmark (Denmark).
Everything in search is a trade-off between precision and recall. In your first example with the email address, your expectation was heavily geared toward precision. But, in your last wildcard question, you seem to prefer extreme recall with wildcards on everything. It all comes down to your expectations.

How to make Microsoft LUIS case sensitive?

I have an Azure LUIS instance for NLP and tried to extract alphanumeric values using a RegEx expression. It worked well, but the output was in lowercase.
For example:
CASE 1
My Input: " run job for AE0002" RegExCode = [a-zA-Z]{2}\d+
Output:
{
"query": " run job for AE0002",
"topScoringIntent": {
"intent": "Run Job",
"score": 0.7897274
},
"intents": [
{
"intent": "Run Job",
"score": 0.7897274
},
{
"intent": "None",
"score": 0.00434472738
}
],
"entities": [
{
"entity": "ae0002",
"type": "Alpha Number",
"startIndex": 15,
"endIndex": 20
}
]
}
I need to maintain the case of the input.
CASE 2
My Input : "Extract only abreaviations like HP and IBM" RegExCode = [A-Z]{2,}
Output :
{
"query": "extract only abreaviations like hp and ibm", // Query accepted by LUIS test window
"query": "extract only abreaviations like HP and IBM", // Query accepted as an endpoint url
"prediction": {
"normalizedQuery": "extract only abreaviations like hp and ibm",
"topIntent": "None",
"intents": {
"None": {
"score": 0.09844558
}
},
"entities": {
"Abbre": [
"extract",
"only",
"abreaviations",
"like",
"hp",
"and",
"ibm"
],
"$instance": {
"Abbre": [
{
"type": "Abbre",
"text": "extract",
"startIndex": 0,
"length": 7,
"modelTypeId": 8,
"modelType": "Regex Entity Extractor",
"recognitionSources": [
"model"
]
},
{
"type": "Abbre",
"text": "only",
"startIndex": 8,
"length": 4,
"modelTypeId": 8,
"modelType": "Regex Entity Extractor",
"recognitionSources": [
"model"
]
},....
{
"type": "Abbre",
"text": "ibm",
"startIndex": 39,
"length": 3,
"modelTypeId": 8,
"modelType": "Regex Entity Extractor",
"recognitionSources": [
"model"
]
}
]
}
}
}
}
This makes me doubt whether the entire training happens in lowercase. What shocked me was that all the words that were initially trained to their respective entities were retrained as Abbre.
Any input would be of great help :)
Thank you
For Case 1, do you need to preserve the case in order to query the job on your system? As long as the job identifier always has uppercase characters you can just use toUpperCase(), e.g. var jobName = step._info.options.entities.Alpha_Number.toUpperCase() (not sure about the underscore in Alpha Number, I've never had an entity with spaces before).
For Case 2, this is a shortcoming of the LUIS application. You can force case sensitivity in the regex with (?-i) (e.g. /(?-i)[A-Z]{2,}/g). However, LUIS appears to convert everything to lowercase first, so you'll never get any matches with that statement (which is better than matching every word, but that isn't saying much!). I don't know of any way to make LUIS recognize entities in the way you are requesting.
You could create a list entity with all of the abbreviations you are expecting, but depending on the inputs you are expecting, that could be too much to maintain. Plus abbreviations that are also words would be picked up as false positives (e.g. CAT and cat). You could also write a function to do it for you outside of LUIS, basically building your own manual entity detection. There could be some additional solutions based on exactly what you are trying to do after you identify the abbreviations.
You can simply use the word indexes provided in the output to get the values from the input string, exactly as they were provided.
{
"query": " run job for AE0002",
...
"entities": [
{
"entity": "ae0002",
"type": "Alpha Number",
"startIndex": 15,
"endIndex": 20
}
]
}
Once you get this reply, use a substring method on your query with startIndex and endIndex (or endIndex - startIndex if your method wants a length rather than an end index) in order to recover the value you are looking for, exactly as it was typed.
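As a small sketch of that substring step (this assumes the v2-style payload shown above, where endIndex looks inclusive, since 20 - 15 + 1 equals the length of "ae0002"):
# Sketch: slice the raw query with the indexes LUIS returns, so the value
# keeps the casing the user actually typed.
def original_text(query: str, entity: dict) -> str:
    return query[entity["startIndex"]:entity["endIndex"] + 1]

# Usage with the reply above:
#   original_text(reply["query"], reply["entities"][0])
# returns the entity text with its original casing, e.g. "AE0002".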

How do I add a start and end time to an RDF triple?

Supposing we have the following triple in Turtle syntax:
<http://example.com/Paul> <http://example.com/running> <http://example.com/10miles> .
How do I add a start and end time? For example, if I want to say he started at 10 am and finished his 10miles run at 12 pm. I want to use xsd:dateTime.
One way of doing this is through reification - making statements about the statement. Here, you have a choice of giving the statement a URI, so that it's externally dereferenceable, or using a blank node. In your case, that means identifying the statement by making statements about its subject, predicate and object, and then attaching further information to it - here, the start and end of the period it represents. This is how it would look with a blank node:
[
rdf:type rdf:Statement ; #this anonymous resource is a Statement...
rdf:subject ex:Paul ; #...with subject Paul
rdf:predicate ex:running ; #...predicate running
rdf:object "10miles" ; #...and object "10miles"
ex:hasPeriodStart "2018-04-09T10:00:00"^^xsd:dateTime ;
ex:hasPeriodEnd "2018-04-09T12:00:00"^^xsd:dateTime ;
].
When defining ex:hasPeriodStart and ex:hasPeriodEnd you might want to declare their type and range:
ex:hasPeriodStart
rdf:type owl:DatatypeProperty ;
rdfs:range xsd:dateTime .
Or you might prefer to assure the quality of your data with SHACL, where you'll define your constraints with shape expressions.
I'd recommend not defining your own time-related properties, but reusing those from the Time Ontology.
Give each of Paul’s runs its own URI:
@prefix ex: <http://example.com/> .
ex:Paul ex:running ex:PaulsRun1, ex:PaulsRun2, ex:PaulsRun3 .
This allows you (and others) to make statements about each run:
@prefix ex: <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:PaulsRun3
ex:lengthInMiles 10.0 ;
ex:startTime "2018-04-09T10:00:00"^^xsd:dateTime ;
ex:endTime "2018-04-09T12:00:00"^^xsd:dateTime .
Instead of listing all these runs as objects of ex:Paul ex:running, you could specify the runner for each run:
@prefix ex: <http://example.com/> .
ex:PaulsRun1
ex:runner ex:Paul .
# ex:lengthInMiles, ex:startTime, ex:endTime, etc.
ex:PaulsRun2
ex:runner ex:Paul .
# ex:lengthInMiles, ex:startTime, ex:endTime, etc.
ex:PaulsRun3
ex:runner ex:Paul .
# ex:lengthInMiles, ex:startTime, ex:endTime, etc.
If you don’t want to create a URI for each runner’s run, you could use (unlabeled) blank nodes instead. But this makes it hard/impossible for others to refer to these runs.
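For illustration, here is a rough sketch of that blank-node variant built with rdflib (the ex: properties are the same illustrative ones used above, not a fixed vocabulary):
# Sketch: describe one of Paul's runs as an unlabeled blank node.
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.com/")
g = Graph()
g.bind("ex", EX)

run = BNode()  # blank node: no stable URI that others can point at
g.add((run, EX.runner, EX.Paul))
g.add((run, EX.lengthInMiles, Literal(10.0)))
g.add((run, EX.startTime, Literal("2018-04-09T10:00:00", datatype=XSD.dateTime)))
g.add((run, EX.endTime, Literal("2018-04-09T12:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))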
Just as an idea.
1. The modelling part (not much RDF involved)
{
"runs": [
{
"id": "runs:0000001",
"distance": {
"length": 10.0,
"unit": "mile"
},
"time": {
"start": "2018-04-09T10:00:00",
"end": "2018-04-09T12:00:00"
},
"runner": {
"id": "runner:0000002",
"name": "Paul"
}
}
]
}
2. The RDF part: define a proper context for your document.
"#context": {
"ical": "http://www.w3.org/2002/12/cal/ical#",
"xsd": "http://www.w3.org/2001/XMLSchema#",
"runs": {
"#id": "info:stack/49726990/runs/",
"#container": "#list"
},
"distance": {
"#id": "info:stack/49726990/distance"
},
"length": {
"#id": "info:stack/49726990/length",
"#type": "xsd:double"
},
"unit": {
"#id": "info:stack/49726990/unit"
},
"runner": {
"#id": "info:stack/49726990/runner/"
},
"name": {
"#id": "info:stack/49726990/name"
},
"time": {
"#id": "info:stack/49726990/time"
},
"start": {
"#id":"ical:dtstart",
"#type": "xsd:dateTime"
},
"end": {
"#id":"ical:dtend",
"#type": "xsd:dateTime"
},
"id": "#id"
}
3. The fun part: Throw it to an RDF converter of your choice
This is how it looks in the JSON-LD Playground.

Identifying numbers correctly

I have an intent where I might say 'Transfer 4 to Bob' and it identifies this as 'Transfer for to Bob'.
Also I might say 'Transfer 10 to Bob' and it identifies this as 'Transfer 102 Bob', treating the word 'to' as a 2 tacked onto the end of the previous number.
What is the best way to get API.AI to recognise these parts correctly, so that 4 is not heard as 'for' and 'to' is not heard as 2?
You mentioned that you're using the Actions on Google platform. This means that speech recognition - the process of translating what the user says into text - is happening before the data gets to API.AI.
The problem you're experiencing is that Actions on Google is misrecognizing some numbers as words, e.g. four becomes for.
Because this happens before - and separately from - API.AI, you won't be able to fix the misrecognition.
Below, I'll explain how you can work around this issue in API.AI. However, it's also worth thinking about how you could make your conversation design as robust as possible so that issues like this are less likely to cause problems.
One way you could increase robustness would be to mark the number as a required parameter in API.AI so the user is prompted if it isn't detected due to a recognition error. In that case, the dialog would go like this:
User: Give me four lattes.
App: Sure, four lattes coming up.
User: Give me for lattes.
App: How many do you want?
User: Four.
App: Sure, four lattes coming up.
Regardless, here's a workaround you can use to help recover from misrecognition:
In your intent, provide examples of these commonly misrecognized values. Highlight and mark them as numbers.
Test your intent out in the console and you'll see that "for" is now matched as a "number" entity with value "for".
In your fulfillment webhook, check the parameter for this value and convert it to the appropriate number using a dictionary. Here's the JSON for the above query:
{
"id": "994c4e39-be49-4eae-94b0-077700ef87a3",
"timestamp": "2017-08-03T19:50:26.314Z",
"lang": "en",
"result": {
"source": "agent",
"resolvedQuery": "Get me for lattes",
"action": "",
"actionIncomplete": false,
"parameters": {
"drink": "lattes",
"number": "for" // NOTE: Convert this to "4" in your webhook
},
"contexts": [],
"metadata": {
"intentId": "0e1b0e72-78ba-4c61-a4fd-a73788034de1",
"webhookUsed": "false",
"webhookForSlotFillingUsed": "false",
"intentName": "get drink"
},
"fulfillment": {
"speech": "",
"messages": [
{
"type": 0,
"speech": ""
}
]
},
"score": 1
},
"status": {
"code": 200,
"errorType": "success"
},
"sessionId": "8b0891c1-50c8-43c6-99c4-8f77261acf86"
}
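As a sketch of that dictionary conversion (the word list below is illustrative only; extend it with whatever misrecognitions you actually observe):
# Sketch: webhook-side mapping of commonly misheard number words to digits.
MISHEARD_NUMBERS = {"for": 4, "fore": 4, "to": 2, "too": 2, "won": 1}

def normalize_number(raw):
    """Return an int for a 'number' parameter that may arrive as a word."""
    if isinstance(raw, str) and raw.strip().lower() in MISHEARD_NUMBERS:
        return MISHEARD_NUMBERS[raw.strip().lower()]
    return int(raw)

print(normalize_number("for"))  # -> 4
print(normalize_number("10"))   # -> 10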
