Train or Custom Word Entity Types? - google-cloud-nl

I was looking through the documentation and testing Google's Natural Language API and noticed that it gets a number of people, events, organizations, and locations wrong. It appears to use Wikipedia as a major data source, so if something is not in Wikipedia it seems to have trouble identifying the type of various words. Also, if certain words appear in a name (proper noun), it seems to always identify the entity as a certain type, which is not always correct.
For instance, "Congress" always seems to be identified as an organization [government], even when it is part of an event name. The name "WordCamp" shows up as a location, but it is an event.
Is there a way to train the Natural Language engine or provide a custom set of organizations, locations, events, etc. so that it provides more accurate type information for entities that are not extremely popular?

I am the Product Manager for this product. Custom entity types are not currently supported. As for your comment about not getting some entity types right: this is true for any NLP system, but our goal is to keep improving. We are working on ways for you to give us feedback on instances we get wrong so we can improve our accuracy, and will share the details shortly. Note that we have trained our models on multiple data sources, not just Wikipedia data. The API returns the most relevant Wikipedia article for a detected entity, so if an entity has multiple interpretations, we will only return the most commonly used one.
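For anyone who wants to inspect what the API currently returns, here is a minimal sketch using the Python client library (the sample sentence is made up; the relevant fields are the entity type and the Wikipedia metadata mentioned above):

    # Minimal sketch with the google-cloud-language client (pip install google-cloud-language);
    # requires Google Cloud credentials to actually run.
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()

    text = "WordCamp took place near Congress Avenue."  # made-up sample sentence
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )

    response = client.analyze_entities(request={"document": document})
    for entity in response.entities:
        # type_ is the detected entity type (PERSON, LOCATION, EVENT, ORGANIZATION, ...);
        # metadata may carry a wikipedia_url when the entity was linked to an article.
        print(
            entity.name,
            language_v1.Entity.Type(entity.type_).name,
            round(entity.salience, 3),
            entity.metadata.get("wikipedia_url", ""),
        )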

Related

Extract entities without specifying during intent specification

I am using Rasa 2.0 to build an FAQ chatbot. I have a large dataset, and specifying entities while defining intents does not seem efficient to me.
I have the intents and examples defined in nlu.yml and would like to extract entities.
Here is an example of what I want to achieve:
User message -> I want a hospital in Delhi.
Entity -> Delhi, hospital
Is it possible to do so?
Entity detection is not a solved problem. There are pre-trained models that integrate with Rasa, like Duckling and spaCy, and while these tools certainly contribute a lot of knowledge, they will make errors. If you're interested in more background on why these models can fail, this YouTube video on human name detection is worth watching.
That's why a popular alternative is to use name lists. There are downloadable lists of cities around the world, as well as lists of baby names, that can serve as a rule-based alternative. You can configure this in Rasa via the RegexEntityExtractor, but if you have name lists with 1000+ items then a FlashTextExtractor might be preferable.
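As a rough illustration of the name-list idea outside of Rasa, a minimal sketch with the flashtext library could look like this (the tiny keyword lists are placeholders for real downloadable name lists):

    # Name-list lookup sketch with flashtext (pip install flashtext).
    # The keyword lists are tiny placeholders for real name lists.
    from flashtext import KeywordProcessor

    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_dict({
        "city": ["Delhi", "Mumbai", "Bangalore"],
        "building_type": ["hospital", "school", "pharmacy"],
    })

    # extract_keywords returns the label for each match; span_info=True adds the
    # character offsets that an entity extractor needs.
    matches = keyword_processor.extract_keywords("I want a hospital in Delhi", span_info=True)
    print(matches)  # [('building_type', 9, 17), ('city', 21, 26)]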
If you've got labelled examples you can also train Rasa itself to recognise the entities. But in order to do this you will need to have labels available.
specifying entities while defining intents does not seem efficient to me
Labelling might not be super fun, but it is super effective. Without labelling your received utterances you won't know what intents your users are interested in.
You could use entity annotations in your nlu training data; for example, assuming you have defined building_type and city as entity names:
I want a [hospital](building_type) in [Delhi](city).
Alternatively, you could try out these options:
annotate a smaller sample (for example, those entities that are essential for your FAQ assistant)
use the RegexEntityExtractor to write some rules
if you have a list of entities, you can use lookup tables to generate the regular expressions (a rough sketch of that idea follows below)
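To make the lookup-table option concrete, here is a rough plain-Python sketch of turning lookup lists into regular expressions; the lists and entity names are invented for the example, and Rasa's own RegexEntityExtractor does something similar internally:

    # Rough sketch: compiling lookup tables into regular expressions.
    # The lookup lists and entity names are illustrative placeholders.
    import re

    lookup_tables = {
        "city": ["Delhi", "Mumbai", "Bangalore"],
        "building_type": ["hospital", "school", "pharmacy"],
    }

    patterns = {
        entity: re.compile(r"\b(?:" + "|".join(map(re.escape, values)) + r")\b", re.IGNORECASE)
        for entity, values in lookup_tables.items()
    }

    def extract_entities(message):
        entities = []
        for entity, pattern in patterns.items():
            for match in pattern.finditer(message):
                entities.append({"entity": entity, "value": match.group(0),
                                 "start": match.start(), "end": match.end()})
        return entities

    print(extract_entities("I want a hospital in Delhi."))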

Is there a general way to calculate the similarity between product models or specifications?

Product models and specifications always differ subtly.
For example:
iphone6, iphone7sp
12mm*10mm*8mm, 12*8*8, (L)12mm*(W)8mm*(H)8mm
brand-410B-12, brand-411C-09, brand410B12
So, in common E-commerce search, is there a general method to calculate the model or specification similarity?
is there a general method to calculate the model or specification similarity?
No.
This is a research topic sometimes referred to as "product matching", or more broadly "schema matching". It's a hard problem with no standard approach.
Finding out whether two strings refer to the same thing is covered by entity resolution, but that's typically used for things like the names of people or organizations, where a small change is more likely to be a typo or a meaningless variation than an important difference (example: Ulysses S. Grant vs Ulysses Grant). Because a small change in a model number may or may not be important, it's a different problem. Specifications make things even more complicated.
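To make that concrete, here is a small illustrative sketch (the normalization rule is invented and far too simple for production) showing why a generic string-similarity score is a poor fit for specifications and model numbers:

    # Illustrative only: a generic edit-distance score vs. a tiny hand-written
    # normalizer for dimension strings. Real product matching needs far more rules.
    import re
    from difflib import SequenceMatcher

    def naive_similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()

    def normalize_dimensions(spec):
        # Drop labels and units like "(L)" or "mm" and keep the numbers in order.
        return tuple(float(n) for n in re.findall(r"\d+(?:\.\d+)?", spec))

    a, b = "12mm*10mm*8mm", "(L)12mm*(W)10mm*(H)8mm"
    print(naive_similarity(a, b))                               # well below 1.0
    print(normalize_dimensions(a) == normalize_dimensions(b))   # True: same dimensions

    # Model numbers are the opposite problem: tiny edits matter.
    print(naive_similarity("brand-410B-12", "brand-411C-09"))   # comparable score, yet different models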
Here are some papers you can look at for example approaches:
Synthesizing Products for Online Catalogs - Semantic Scholar
Matching Unstructured Product Offers to Structured Product Descriptions - Microsoft Research
Tailoring entity resolution for matching product offers

Mention Types and Mention Classes in Watson Knowledge Studio

How important are Mention Types and Mention Classes to training a machine learning annotator model? Will they get assigned automatically when entities are highlighted? For example, when you click on the Mention Type tab, "NONE" seems to be preselected, and likewise "SPC" on the Mention Class tab. None of the videos in IBM's Watson Knowledge Studio playlist covers this aspect of using WKS, and the official documentation's explanations of whether and how to properly annotate mentions with these attributes are insufficient.
If you annotate mention types and classes and then train a model, the model can predict them like entities and relations.
However, there is currently no service that can return that output when using custom models created with WKS.
So if you will only consume entities and relations by using the model with services like Natural Language Understanding, Discovery, or Watson Explorer, it is generally inefficient to assign mention classes and types.
It depends on your use case.
All the content WKS provides about training is in the documentation:
https://console.bluemix.net/docs/services/knowledge-studio/index.html

Tool for protocol or interface description

We have to develop a protocol as an interface description between different systems in different companies. The implementations will be made in different (not yet known) languages by the developers in each company.
However, we want to develop the protocol together on the basis of a textual description. I will keep the master copy of the current version and send it out to everyone for comments.
What is a good tool to do so?
At the moment we are using MS Word, which leads to several problems:
We need a lot of time for text formatting.
It's not possible to reference a data type from a method's description.
The wording differs from chapter to chapter (different authors) and is hard to align.
Ideal would be:
A tool with a glossary and auto-completion.
References to other items (methods, data types, ...) with active links.
Automatic generation of a human-readable (PDF-) document.
Do you know such a tool?
PS: I could not get Sparx Systems Enterprise Architect to do the job. Maybe some hints for that one as well?
This is a very big question, since there are many possible aspects you may (or should) wish to document in a protocol specification. The two most important ones would be data structures and message sequences; then there's error management, authentication, timing, and so on.
UML can certainly be used to describe these things, and Enterprise Architect is eminently able to generate versatile documentation from a model - it will definitely help with your reference issues. But first you will have to determine quite strictly how each aspect of the protocol is to be modelled, and from that you will need to construct the necessary EA configuration / adaptation.
In order to get good quality documents out of EA, I recommend generating the documents from an Add-In using the Object Model's DocumentGenerator class as this gives you more flexibility than the traditional RTF generator - for one thing, you can access Word's API in addition to EA's and thus do far more with the document than is possible using EA's API alone.
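As a rough sketch of that approach, assuming the DocumentGenerator methods from the EA Object Model (NewDocument, DocumentPackage, SaveDocument) and driving them from Python over COM rather than from a compiled Add-In; the model path, package and template name are placeholders:

    # Rough sketch: driving EA's document generator over COM with pywin32
    # instead of a compiled Add-In. Paths and the template name are placeholders.
    import win32com.client

    repo = win32com.client.Dispatch("EA.Repository")
    repo.OpenFile(r"C:\models\protocol.eapx")                # placeholder model file

    package = repo.Models.GetAt(0).Packages.GetAt(0)         # package holding the protocol spec

    generator = repo.CreateDocumentGenerator()
    generator.NewDocument("")                                # "" = no master template
    generator.DocumentPackage(package.PackageID, 0, "Protocol Template")  # placeholder template
    generator.SaveDocument(r"C:\out\protocol.rtf", 0)        # 0 = RTF; convert to PDF from there

    repo.CloseFile()
    repo.Exit()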
Without knowing the size or complexity of your protocol, I'd say this would require at the very least a few weeks' work for someone who is experienced in writing EA adaptations. But if the scope of your project is such that there are several companies involved, it is likely to be worth the investment.

DDD modeling 1:1...N relationships with query performance in mind

I'm a DDD beginner, and I have a legacy project which would surely benefit from a proper domain layer. The application has to be modified to support multiple application and UI layers. The domain logic is at the moment implemented using the transaction script pattern. Basically I inherited a DB structure which is not allowed to be altered; the new application should be a drop-in replacement for the old one.
I stumbled upon an interesting modelling problem in a small part of the domain, which I'm sure experienced DDD practitioners will find interesting. I can't be too specific about the problem, so I'll describe a problem which closely matches mine.
Problem description
Let's suppose we have to manage a collection of products. Products are identified by ids, they contain some description, and every product has a few images associated with it. Here comes the tricky part. The images (their contents) are physically stored in the DB, so they are huge chunks of data. (Let's ignore for now how good or bad it is to store images in a DB; it's just an example.) There are some invariants that must be enforced on adding/editing/removing products.
Adding products
A product is only valid if it has images associated with it, without adding images a new product should not be allowed to be entered
Every product must be associated with exactly 5 images, no more, no less.
The order of images associated with the product must be maintained
Editing products
Images of existing products can be replaced, but the number and order of the associated images should be maintained
Removing products
When a product is removed, all of the images associated with it should also be removed
Considered solutions
The class diagrams of various solutions
Solution 1:
The simplest way to model these concepts would be the following.
The Product is the AR. The Images associated with the Product can be accessed and modified through the Product, so the Product is responsible for enforcing the 5-Images rule. The advantage of this approach is that Products can't be created as invalid or edited in a way that makes them invalid, and no Images will be left behind when a Product is removed. So the aggregate is formed around the transaction boundary. The problem with this approach is that in the vast majority of cases the UI just needs to present the list of products and maybe modify their descriptions. The UI would very rarely need to display or modify the Images associated with a product. So in 95% of cases, huge amounts of unnecessary data would be loaded into memory.
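For illustration, a bare-bones sketch of Solution 1 (all class and field names invented): the aggregate root owns the images and guards the exactly-five-ordered-images invariant:

    # Sketch of Solution 1: Product is the aggregate root and owns its images,
    # so the "exactly 5 ordered images" invariant lives in one place.
    # All names are invented for the example.
    from dataclasses import dataclass, field

    REQUIRED_IMAGE_COUNT = 5

    @dataclass(frozen=True)
    class Image:
        content: bytes                      # the large blob stored in the DB

    @dataclass
    class Product:
        product_id: int
        description: str
        _images: list = field(default_factory=list)

        @classmethod
        def create(cls, product_id, description, images):
            if len(images) != REQUIRED_IMAGE_COUNT:
                raise ValueError("a product must be created with exactly 5 images")
            return cls(product_id, description, list(images))   # order preserved

        def replace_image(self, position, new_image):
            # images can only be replaced, never added or removed, so the count
            # and ordering invariants cannot be violated after creation
            self._images[position] = new_image

        @property
        def images(self):
            return tuple(self._images)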
Lazy loading? The domain model has to be implemented in a language which doesn't have ORM tools with lazy-loading support. Implement my own lazy-loading mechanism? Domain objects shouldn't be aware of how they're persisted, or whether they're persisted at all. Instead, Solution 2 is what Vaughn Vernon recommends.
Solution 2:
The querying performance problems can be solved with this approach by favoring small aggregates and following the reference other aggregates by identity rule. Vaughn Vernon has a great series of articles describing how to achieve this.
The aggregate is split into two parts: Product and ImageSet. Both of them reference ProductId as a value object. The Product would be responsible for enforcing the no-product-without-Images rule, and the ImageSet would enforce the no-ImageSet-without-5-images rule. Querying is not a problem anymore; the ImageSet would be retrieved only when it's needed by a service.
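Sketched the same way (names again invented), the split might look roughly like this, with the two aggregates sharing only the ProductId value object:

    # Sketch of Solution 2: two small aggregates that reference each other by
    # identity only. Names are invented for the example.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProductId:
        value: int

    @dataclass(frozen=True)
    class Image:
        content: bytes

    @dataclass
    class Product:
        product_id: ProductId
        description: str                    # cheap to load; no image blobs here

    @dataclass
    class ImageSet:
        product_id: ProductId               # reference to the owning Product by identity
        images: tuple

        def __post_init__(self):
            if len(self.images) != 5:
                raise ValueError("an ImageSet must contain exactly 5 images")

    # Creating a new Product now touches two aggregates in one transaction,
    # which is the consistency concern discussed below.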
However, this problem is a lot more complex than what Vernon describes in his articles (a 0...N association). The problem is that creating a Product would lead to modifying or creating two aggregates, which defeats the purpose of modelling aggregates around transaction boundaries. The service which adds the new Product would be responsible for transaction management.
Solution 3:
The final solution would be the use of bounded contexts. For simplicity we name them BC1 and BC2. In BC1 a Product would just contain the ProductDetails. Services interested in querying Products for their details and maybe modifying them would use BC1 (the ProductRepository in BC1 wouldn't allow adding or removing products, just querying/modifying existing ones). In BC2 a Product would contain the ProductDetails and the Images associated with it. So services interested in adding/removing products, and in modifying/retrieving their images, would use BC2. Common value objects and entities would be shared between these two BCs.
This solution would solve all the transactional consistency and querying performance problems. However, based on their definition, I'm not sure BCs should be created in response to these kinds of problems.
I'm sorry for the long question, but I feel I should really point out which kinds of solutions I've already considered. And sorry for the linked images, I'm not allowed to upload images yet.
An important observation in your use-case is that the problems of the 1st solution are isolated to the query side of the application. There is no reason to use the same model for processing commands and enforcing constraints as the model used for queries. The read-model pattern can be used to separate the reads from the writes which would allow you to create specific read-models for specific UI requirements and the read-model won't affect your domain model. While it is tempting to utilize the same model for reading as the one for writing, especially given that most ORMs support intricate queries and given the DRY principle, in practice it is much easier to separate the read model from the executable domain model.
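As a rough sketch of that separation (names and SQL invented for the example), the query side maps straight to a thin DTO while the command side still goes through the aggregate:

    # Sketch of splitting reads from writes: the query side returns a thin DTO
    # shaped for the product list screen and never touches the image blobs,
    # while the command side still loads the full aggregate.
    # Names and the SQL are invented for the example.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProductListItem:                  # read model for the product list screen
        product_id: int
        description: str

    class ProductReadModel:
        def __init__(self, connection):
            self._connection = connection

        def list_products(self):
            rows = self._connection.execute(
                "SELECT product_id, description FROM products")   # no image columns
            return [ProductListItem(pid, desc) for pid, desc in rows]

    class ChangeProductDescriptionHandler:  # command side goes through the aggregate
        def __init__(self, product_repository):
            self._repository = product_repository

        def handle(self, product_id, new_description):
            product = self._repository.get(product_id)    # full aggregate, invariants enforced
            product.description = new_description
            self._repository.save(product)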
Also, the series of articles by Vaughn Vernon is a great resource for understanding the intricacies of aggregate design; however, the central focus of those articles is how to partition aggregates based on behavioral requirements, not query requirements.
