LUIS - understand any person name

We are building a product on LUIS and the Microsoft Bot Framework, and one question we have is about understanding person names. Anyone can use the product just by signing up on our website, which means any company that signs up can have any number of employees with any names.
From what we understand, the user entity is not able to recognize all names. We have created a phrase list, but as far as we know there is a limit on phrase lists (10K, or even if it's 100K), while the set of names in the world has no limit. The other approach we are considering is to not train the entity with utterances. However, if we have hundreds of customers with thousands of users each, utterances will not be a good idea.
I do not see any other way of handling this situation. Am I missing something? Has anyone faced a similar problem, and how did you handle it?
The worst case would be to create a separate LUIS instance for each customer, but that is a big task to take on only because we can't handle names.

As you might already know, a person's name could literally be anything: e.g. an animal, car, month, or color. So there isn't any definitive way to identify something as a name. The closest you can come is part-of-speech text analysis, followed by either taking a guess or comparing against an existing list. LUIS or any other NLP tool is unlikely to help with this. Here's one approach that might work out better: try something like Microsoft's Text Analytics cognitive service, with a POST to the Key Phrases endpoint, like this:
https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases
and the body:
{
"documents": [
{
"language": "en-us",
"id": "myid",
"text": "Please book a flight for John Smith at 2:30pm on Wednesday."
}
]
}
That returns:
{
"languageDetection": {
"documents": [
{
"id": "e4263091-2d54-4ab7-b660-d2b393c4a889",
"detectedLanguages": [
{
"name": "English",
"iso6391Name": "en",
"score": 1.0
}
]
}
],
"errors": []
},
"keyPhrases": {
"documents": [
{
"id": "e4263091-2d54-4ab7-b660-d2b393c4a889",
"keyPhrases": [
"John Smith",
"flight"
]
}
],
"errors": []
},
"sentiment": {
"documents": [
{
"id": "e4263091-2d54-4ab7-b660-d2b393c4a889",
"score": 0.5
}
],
"errors": []
}
}
Notice that you get "John Smith" and "flight" back as key phrases. "flight" is definitely not a name, but "John Smith" might be, which gives you a better idea of what the name is. Additionally, if you have a database of customer names, you can compare the value against it, using either an exact or a Soundex match, to increase your confidence in the name.
Sometimes the services don't give you a 100% answer and you have to be creative with workarounds. Please see the Text Analytics API docs for more info.
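The comparison step described above could be sketched roughly as follows. This is a minimal illustration, not part of any Microsoft SDK: `soundex` is a plain implementation of the classic four-character American Soundex code, and `likely_names` and the sample customer list are hypothetical names invented for the example.

```python
def soundex(word):
    """Classic 4-character American Soundex code for a single word."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    chars = [c for c in word.upper() if c.isalpha()]
    if not chars:
        return ""
    result = chars[0]
    prev = codes.get(chars[0], "")
    for ch in chars[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (result + "000")[:4]


def likely_names(key_phrases, known_names):
    """Return (phrase, matched_name, how) for phrases that resemble a known name."""
    exact = {name.lower(): name for name in known_names}
    by_sound = {}
    for name in known_names:
        by_sound.setdefault(tuple(soundex(p) for p in name.split()), name)
    matches = []
    for phrase in key_phrases:
        if phrase.lower() in exact:
            matches.append((phrase, exact[phrase.lower()], "exact"))
            continue
        key = tuple(soundex(p) for p in phrase.split())
        if key in by_sound:
            matches.append((phrase, by_sound[key], "soundex"))
    return matches


# Key phrases from the response above, checked against a made-up customer list.
print(likely_names(["John Smith", "flight"], ["Jon Smyth", "Mary Jones"]))
```

A Soundex hit only raises confidence that a phrase is a name; it does not prove it, so an exact match should still be preferred when one exists.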

I have asked a few Microsoft people in my local region about this; however, it seems there is no way LUIS can identify names at the moment.
It's not good that, as an NLP service, it is not able to handle such things :(
I found wit.ai to be the best so far at identifying names, and IBM Watson is also good up to a point. Let's see how they turn out in the future, but for now I have switched to https://wit.ai

Related

Azure Spell not detecting spelling mistakes

I've written a quick proof-of-concept console app to test the functionality of the AzureSpell Cognitive Services product; however, it often fails to detect obvious spelling mistakes.
Having experimented with recommendations from other SO answers, I've had limited success.
Even using the demo located at https://azure.microsoft.com/en-us/services/cognitive-services/spell-check/ produces no results.
For example, consider the following piece of text: "Currently growing my compny which is a UK based Online compny with clients across the world. Working since 2001 to help indivduals."
This produces no results. I've looked at regional settings, PROOF vs SPELL, and character counts, to no avail.
Has anyone had any success with this service, or, even better, does the above text snippet produce results for you?

Spell mode is working for me with your sample, see below:
The JSON result is:
{
"_type": "SpellCheck",
"flaggedTokens": [
{
"offset": 21,
"token": "compny",
"type": "UnknownToken",
"suggestions": [
{
"suggestion": "company",
"score": 0.9264452620075305
}
]
},
{
"offset": 55,
"token": "compny",
"type": "UnknownToken",
"suggestions": [
{
"suggestion": "company",
"score": 0.8740149238635179
}
]
},
{
"offset": 120,
"token": "indivduals",
"type": "UnknownToken",
"suggestions": [
{
"suggestion": "individuals",
"score": 0.753968656686115
}
]
}
]
}

OK, so after a fair amount of trial and error I've had some success, which has solved some issues and created others. I've not been able to get a reliable result from Spell mode, but I have with Proof. However, after adding a fairly short piece of text, it would again not report any results. Inspecting the API call shows that the text is encoded in the POST body; removing both "%0D" and "%0A" (the encoded carriage return and line feed characters) allows me to Proof long texts successfully. That would be fine, except that I'm UK based, and lots of correct spellings are now flagged as incorrect because PROOF mode is only available in the US. So I've still been unable to get a functioning SPELL result (it works only for very short pieces of text). I understand the documentation states up to 130 chars for GET but 10,000 chars for POST, and my typical example POSTs are around 1,000 chars. Possibly a ticket with MS, unless anyone has any ideas?
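The line-break workaround can be sketched like this. It is only an illustration of the clean-up step: `build_spellcheck_body` is a hypothetical helper, and it replaces each run of line breaks with a single space (rather than deleting it outright) so that words on adjacent lines do not get glued together. No request is actually sent.

```python
import re
from urllib.parse import urlencode


def build_spellcheck_body(text):
    """Replace CR/LF runs with one space, then form-encode as the 'text' field."""
    flattened = re.sub(r"[\r\n]+", " ", text)
    return urlencode({"text": flattened})


body = build_spellcheck_body("Currently growing my compny\r\nWorking since 2001")
print(body)  # contains neither %0D nor %0A
```

Any offsets the service reports would then refer to the flattened text, not to the original multi-line input.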

RESTful API design - naming an "activity" resource

When designing the endpoints for an activity resource that provides information on the activity of other resources, such as users and organisations, we are struggling with naming conventions.
What would be more semantic:
/organisations/activity
/organisations/activity/${activityId}
/users/activity
/users/activity/${activityId}
OR
/activity/users/${activityId}
/activity/users
/activity/organisations/${activityId}
/activity/organisations

There's no generic answer for this, especially since the mechanisms doing the lookup/retrieval at the other end and the associated back-ends vary so drastically, not to mention the use case and intended application.
That said, assuming for all intents and purposes that the "schema" (or rather the endpoint convention from the end user's point of view) is going to be flat, I have seen many more of the latter, activity-first convention, since the activity is the actual resource, and that is what many applications and APIs are developed around.
I've come to expect the following style of representation from APIs today (how they achieve the referencing and mapping is a different story, but this is from the point of view of the API reference):
{
"Activity": [
{
"date": "1970-01-01 08:00:00",
"some_other_resource_reference_uuid": "f1c4a41e-1639-4e35-ba98-e7b169d1c92d",
"user": "b3ababc4-461b-404a-a1a2-83b4ca8c097f",
"uuid": "0ccf1b41-aecf-45f9-a963-178128096c97"
}
],
"Users": [
{
"email": "johnanderson@mycompany.net",
"first": "John",
"last": "Anderson",
"user_preference_1": "somevalue",
"user_property_1": "somevalue",
"uuid": "b3ababc4-461b-404a-a1a2-83b4ca8c097f"
}
]
}
The StackExchange API also allows retrieving objects through multiple methods.
For example, the User type looks like this:
{
"view_count": 1000,
"user_type": "registered",
"user_id": 9999,
"link": "http://example.stackexchange.com/users/1/example-user",
"profile_image": "https://www.gravatar.com/avatar/a007be5a61f6aa8f3e85ae2fc18dd66e?d=identicon&r=PG",
"display_name": "Example User"
}
And on the Question type, the same user is shown under the owner object:
{
"owner": {
"user_id": 9999,
"user_type": "registered",
"profile_image": "https://www.gravatar.com/avatar/a007be5a61f6aa8f3e85ae2fc18dd66e?d=identicon&r=PG",
"display_name": "Example User",
"link": "https://example.stackexchange.com/users/1/example-user"
},
"is_answered": false,
"view_count": 31415,
"favorite_count": 1,
"down_vote_count": 2,
"up_vote_count": 3,
"answer_count": 0,
"score": 1,
"last_activity_date": 1494871135,
"creation_date": 1494827935,
"last_edit_date": 1494896335,
"question_id": 1234,
"link": "https://example.stackexchange.com/questions/1234/an-example-post-title",
"title": "An example post title",
"body": "An example post body"
}
On the Posts type reference (I'm using this as a separate example because there are only a handful of methods to reach this type), you'll see an example at the bottom:
Methods That Return This Type
  posts
  posts/{ids}
  users/{ids}/posts 2.2
  me/posts 2.2
So whilst you can access resources (or "types", as they are called on StackExchange) in a number of ways, including filters and complex queries, you can still reach the desired resource through a number of more direct, transparent URI conventions.
Different applications will clearly have different requirements. For example, the Gmail API is user based all the way. This makes sense from a user's point of view, given that in the context of the authenticated credential you're separating one user's objects from another's.
This doesn't mean Google uses the same convention for all of its APIs; its Activities API resource is all about the activity.
Even looking at the Twitter API, there is a Direct Messages endpoint resource that has sender and receiver objects within it.
I've not seen many APIs at all that are limited to accessing resources purely via a user endpoint, unless the situation obviously calls for it, as in the Gmail example above.
Regardless of how flexible a REST API can be, the minimum I have come to expect is that some kind of activity, location, physical object, or other entity is usually its own resource, and the user association is plugged in and referenced with varying degrees of flexibility (at a minimum, as in the example given at the top of this post).
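To make the "activity is its own resource" idea concrete, here is a small in-memory sketch. The data mirrors the representation at the top of this answer; the function names and the URL shapes mentioned in the docstrings (`/activity`, `/activity?user=<uuid>`, `/activity/<activityId>`) are assumptions made for illustration, not a recommendation from any particular API.

```python
ACTIVITY = [
    {
        "uuid": "0ccf1b41-aecf-45f9-a963-178128096c97",
        "user": "b3ababc4-461b-404a-a1a2-83b4ca8c097f",
        "date": "1970-01-01 08:00:00",
    },
]


def list_activity(activity, user=None):
    """Hypothetical GET /activity, optionally narrowed with ?user=<uuid>."""
    return [a for a in activity if user is None or a["user"] == user]


def get_activity(activity, activity_id):
    """Hypothetical GET /activity/<activityId>; None stands in for a 404."""
    return next((a for a in activity if a["uuid"] == activity_id), None)


print(list_activity(ACTIVITY, user="b3ababc4-461b-404a-a1a2-83b4ca8c097f"))
print(get_activity(ACTIVITY, "does-not-exist"))
```

The user appears only as a reference inside each activity record, so the same collection can later be filtered by organisation or anything else without adding a new URL hierarchy.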

It should be pointed out that in a true REST API the URI holds no meaning. It's the link relationships from your organizations and users resources that matter.
Clients should just discover those URLs, and should also adapt if you decide that you want a different URL structure after all.
That being said, it's nice to have a logical structure for this type of thing. However, either is fine. You're asking for an opinion; there is not really a standard or best practice. That said, I would choose option #1.
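A client that discovers URLs through link relations, instead of hard-coding either layout, might look roughly like this. The `_links` / `href` shape is an assumption borrowed from HAL-style representations, and both the `follow` helper and the sample user resource are hypothetical.

```python
def follow(resource, rel):
    """Return the href advertised for a link relation, or raise LookupError."""
    link = resource.get("_links", {}).get(rel)
    if link is None:
        raise LookupError("resource does not advertise a '%s' link" % rel)
    return link["href"]


# Hypothetical user representation; the server is free to change these hrefs.
user = {
    "uuid": "b3ababc4-461b-404a-a1a2-83b4ca8c097f",
    "_links": {
        "self": {"href": "/users/b3ababc4-461b-404a-a1a2-83b4ca8c097f"},
        "activity": {"href": "/users/activity?user=b3ababc4-461b-404a-a1a2-83b4ca8c097f"},
    },
}

print(follow(user, "activity"))
```

Because the client only knows the relation name "activity", switching the server from one URL structure to the other would not require a client change.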

What are the possible kinds of webhooks Trello can send? What attributes come in each?

I'm developing an app that is tightly integrated with Trello and uses Trello webhooks for a lot of things. However, I can't find anywhere in Trello's developer documentation what are the "actions" that may trigger a webhook and what data will come in each of these.
In fact, in my experience, the data that comes with each webhook is kinda random. For example, while most webhooks contain the shortLink of the card which is being the target of some action, some do not, in a totally unpredictable way. Also, creating cards from checklists doesn't seem to trigger the same webhook that is triggered when a card is created normally, and so on.
So, is that documented somewhere?

After fighting with these issues, and relying on my raw memory of what data should come in each webhook and what each action is called, I decided to document this myself. I released it as a set of JSON files (constantly updated as I find new webhooks out there) showing samples of the data each webhook will send to your endpoint:
https://github.com/fiatjaf/trello-webhooks
For example, when a board is closed, a webhook will be sent with
{
"id": "55d7232fc3597726f3e13ddf",
"idMemberCreator": "50e853a3a98492ed05002257",
"data": {
"old": {
"closed": false
},
"board": {
"shortLink": "V50D5SXr",
"id": "55af0b659f5c12edf972ac2e",
"closed": true,
"name": "Communal Website"
}
},
"type": "updateBoard",
"date": "2015-08-21T13:10:07.216Z",
"memberCreator": {
"username": "fiatjaf",
"fullName": "fiatjaf",
"avatarHash": "d2f9f8c8995019e2d3fda00f45d939b8",
"id": "50e853a3a98492ed05002257",
"initials": "F"
}
}
In fact, what comes is a JSON object like {"model": ..., "action": ... the data you see up there...}, but I've removed the outer keys for the sake of brevity and I'm showing only what comes inside the "action" key.
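Given that envelope, a receiver could unwrap the "action" key defensively, since fields such as shortLink are not always present. The sketch below is an assumption-laden illustration: `parse_webhook` is a hypothetical helper, and the `data.card` key is assumed by analogy with the `data.board` key shown in the sample above.

```python
import json


def parse_webhook(raw_body):
    """Pull the action type and any short links out of a webhook body."""
    payload = json.loads(raw_body)
    action = payload.get("action", {})
    data = action.get("data", {})
    return {
        "type": action.get("type"),
        # Either of these may be None: not every action carries them.
        "board_short_link": data.get("board", {}).get("shortLink"),
        "card_short_link": data.get("card", {}).get("shortLink"),
    }


sample = json.dumps({
    "model": {},
    "action": {
        "type": "updateBoard",
        "data": {"old": {"closed": False},
                 "board": {"shortLink": "V50D5SXr", "closed": True}},
    },
})
print(parse_webhook(sample))
```

Using `.get` with empty-dict defaults means a missing card or board section yields None instead of raising a KeyError in the handler.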

Based on @fiatjaf's repo, I gathered and summarized all* the webhook types.
addAttachmentToCard
addChecklistToCard
addLabelToCard
addMemberToBoard
addMemberToCard
commentCard
convertToCardFromCheckItem
copyCard
createCard
createCheckItem
createLabel
createList
deleteAttachmentFromCard
deleteCard
deleteCheckItem
deleteComment
deleteLabel
emailCard
moveCardFromBoard
moveCardToBoard
moveListFromBoard
moveListToBoard
removeChecklistFromCard
removeLabelFromCard
removeMemberFromBoard
removeMemberFromCard
updateBoard
updateCard
updateCheckItem
updateCheckItemStateOnCard
updateChecklist
updateComment
updateLabel
updateList
Hope it helps!
*I don't know whether this list includes all the available webhook types because, as I already said, it's based on fiatjaf's repo, which was created 2 years ago.
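Since the list above may be incomplete, one way to consume it is a dispatch table keyed by action type with a fallback for anything unrecognised. This is a sketch only: the `on` decorator, `dispatch`, and the handler names are hypothetical, and the handlers just return strings so the flow is easy to follow.

```python
HANDLERS = {}


def on(action_type):
    """Register the decorated function as the handler for one action type."""
    def register(func):
        HANDLERS[action_type] = func
        return func
    return register


@on("createCard")
@on("convertToCardFromCheckItem")  # cards created from checklist items
def card_created(action):
    return "card created via " + action["type"]


@on("updateBoard")
def board_updated(action):
    closed = action.get("data", {}).get("board", {}).get("closed")
    return "board closed" if closed else "board updated"


def dispatch(action):
    """Route an action to its handler; unknown types are reported, not fatal."""
    handler = HANDLERS.get(action.get("type"))
    if handler is None:
        return "ignored " + str(action.get("type"))
    return handler(action)


print(dispatch({"type": "convertToCardFromCheckItem"}))
print(dispatch({"type": "someFutureType"}))
```

Logging the ignored types in a real receiver is a cheap way to notice action types that are missing from the list.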

What are couchdb user._id and role field limits?

In the CouchDB _users database, I'm creating user IDs and roles with emails in them.
Are you aware of any special CouchDB problems this causes?
I can't find any docs on valid values for these fields. They appear to be docs, so it seems ok.
{
"_id": "org.couchdb.user:some@email.com",
"_rev": "1-0bb5ba9dd3e989a28bc8282efaf32aa2",
"password_scheme": "pbkdf2",
"iterations": 10,
"type": "user",
"name": "some@email.com",
"roles": [
"f@soddddddddddddddddddddddddddme@examddddddddddple.com"
],
"derived_key": "f1f41961688ffd35addebdd0ece7714b08242c5e",
"salt": "3d299831afccb98c39ddeb3308275acb"
}

CouchDB Core Dev here. There are no semantic limits on either the _id field or roles.
The only thing is that the _id needs to start with org.couchdb.user:.
Roles are just arrays of strings; anything that goes into a string can be a role.
General advice is to keep things short, but email addresses are totally within the realm of applicable values.
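As a small illustration of those two rules (the `org.couchdb.user:` prefix and roles being plain strings), a document for the _users database could be assembled like this. `make_user_doc` is a hypothetical helper; it assumes the plain-text `password` field is supplied at creation time and hashed by the server, which is why fields like `derived_key` and `salt` from the question are not set here.

```python
USER_PREFIX = "org.couchdb.user:"


def make_user_doc(name, password, roles=()):
    """Build a _users document whose _id carries the required prefix."""
    roles = list(roles)
    if not all(isinstance(role, str) for role in roles):
        raise TypeError("roles must be strings")
    return {
        "_id": USER_PREFIX + name,
        "name": name,
        "type": "user",
        "roles": roles,
        "password": password,  # assumption: hashed server-side on save
    }


doc = make_user_doc("some@email.com", "s3cret", ["team@example.com"])
print(doc["_id"])
```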

Freebase batch search

I'm trying to use Freebase to search for multiple items at a time (using one API call). For example, if I have two items:
Robert Downey, Jr.
The Avengers
I want to query Freebase once and get back results for both items. Basically all I need is the mid for the top 3 or 4 results for both items. I would like to rely on Freebase's search API to provide disambiguation for topics. For example, I'd like to be able to search for "Robert Downey, Jr." with the abbreviation: "RDJ".
This is easy to do when searching one item at a time:
https://www.googleapis.com/freebase/v1/search?query=rdj
Making two calls like this would give me exactly what I'm looking for, but I would like to stay away from making these calls individually.
Reconciliation
I did run across the json-rpc call for reconciliation, and I have tried the following:
Endpoint: https://www.googleapis.com/rpc
POST body:
[
{
"method": "freebase.reconcile",
"apiVersion": "v1",
"params": {
"name": ["RDJ"],
"key": "api_key",
"limit":10
}
},
{
"method": "freebase.reconcile",
"apiVersion": "v1",
"params": {
"name": ["the avengers"],
"key": "api_key",
"limit":10
}
}
]
This works fairly well for Robert Downey, Jr in that I get a result of type /film/actor as I did using the search api. However, for The Avengers, I get a set of results with type /book/book rather than the 2012 film. These results don't seem to be prioritized the same way as the search results.
I tried something similar using json-rpc for a Freebase search method:
{
"method": "freebase.search",
"apiVersion": "v1",
"params": {
"name": ["RDJ"],
"key": "api_key",
"limit":10
}
}
But the "freebase.search" method didn't seem to exist.
One thing to note is that I will not know the expected type of the items I am looking for beforehand.
Long story short: I want the exact results the search API provides, but with multiple queries wrapped up into one call.
Am I missing something terribly simple like an OR operator for the search API?? I've been searching for days, but can't seem to find a good solution. I would appreciate any help at all!

Why not just make two calls asynchronously? That would give you the results you need with almost no penalty in latency.
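Issuing the calls concurrently could be sketched as below. The helper takes the single-query function as a parameter, so nothing here depends on a particular HTTP client; `search_all` and `fake_search` are hypothetical names, and `fake_search` merely stands in for a real request to the search endpoint.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(queries, search_one, max_workers=4):
    """Run search_one for every query in parallel; return {query: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(search_one, queries))  # keeps input order
    return dict(zip(queries, results))


def fake_search(query):
    # Stand-in for an HTTP GET against the search endpoint.
    return ["result for " + query]


print(search_all(["rdj", "the avengers"], fake_search))
```

With threads, the total wait is roughly that of the slowest single call rather than the sum of all calls.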
A few relevant facts:
The Reconcile API is still experimental. It's intended for use in reconciling against a type at a minimum and usually scoring using additional property values.
The Search API isn't included in the RPC mechanism because its freeform output doesn't work with the assumptions of the RPC framework. Ditto for the Topic API, although that's not really relevant here.
The Search API has a fairly expressive S-expression language. You don't say if you want the queries scored independently or together, but if you want them ranked jointly, you can use a filter expression like (any name:rdj name:"The Avengers"):
https://www.googleapis.com/freebase/v1/search?query=&limit=10&filter=%28any%20name:rdj%20name:%22the%20avengers%22%29
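For more than a couple of names, the filter URL can be generated instead of hand-encoded. This sketch only builds the string and sends nothing; `build_any_filter_url` is a hypothetical helper, and it assumes names contain no double quotes of their own.

```python
from urllib.parse import quote

BASE = "https://www.googleapis.com/freebase/v1/search"


def build_any_filter_url(names, limit=10):
    """Build a search URL whose filter is (any name:... name:...)."""
    terms = []
    for name in names:
        # Multi-word names are wrapped in double quotes inside the filter.
        value = '"%s"' % name if " " in name else name
        terms.append("name:" + value)
    expression = "(any " + " ".join(terms) + ")"
    # Keep ':' literal so the result matches the hand-written URL above.
    return "%s?query=&limit=%d&filter=%s" % (BASE, limit, quote(expression, safe=":"))


print(build_any_filter_url(["rdj", "the avengers"]))
```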
