Mapping interchangeably terms such as Weight to Mass for QAnswering NLP - nlp

I've been working on a Question Answering engine in C#. I have implemented the features of most modern systems and are achieving good results. Despite the aid of Wordnet , one problem I haven't been able to solve yet is changing the user input to the correct term.
For example
changing Weight -> Mass
changing Tall -> Height
My question is about the existence of some sort of resource that can aid me in this task of changing the terms to the correct terms.
Thank You

Looking at all the synsets in WordNet for both Mass and Weight I can see that there is no shared synset and thus there is no meaning in common. Words that actually do have the same meaning can be matched by means of their synset labels, as I'm sure you've realized.
In my own natural language engine (http://nlp.abodit.com) I allow users to use any synset label in the grammar they define but I would still create two separate grammar rules in this case, one recognizing questions about mass and one recognizing questions about weight.
However, there are also files for Wordnet that give you class relationships between synsets too. For example, if you type 'define mass' into my demo page you'll see:-
4. wn30:synset-mass-noun-1
the property of a body that causes it to have weight in a gravitational field
--type--> wn30:synset-fundamental_quantity-noun-1
--type--> wn30:synset-physical_property-noun-1
ITokenText, IToken, INoun, Singular
And if you do the same for 'weight' you'll also see that it too has a class relationship to 'physical property'.
In my system you can write a rule that recognizes a question about a 'physical property' and perhaps a named object and then try to figure out which physical property they are likely to be asking about. And, perhaps, if you can't match maybe just tell them all about the physical properties of the object.
The method signature in my system would be something like ...
... QuestionAboutPhysicalProperties (... IPhysicalProperty prop,
INamedObject obj, ...)
... and in code I would look at the properties of obj and try to find one called 'prop'.

The only way that I know how to do this effectively requires having a large corpus of user query sessions and a happiness measure on sessions, and then finding correlations between substituting word x for word y (possibly given some context z) that improves user happiness.
Here is a reasonable paper on generating query substitutions.
And here is a new paper on generating synonyms from anchor text, which doesn't require a query log.

Related

Always valid domain model entails prefixing a bunch of Value Objects. Doesn't that break the ubiquitous language?

The principle of always valid domain model dictates that value object and entities should be self validating to never be in an invalid state.
This requires creating some kind of wrapper, sometimes, for primitive values. However it seem to me that this might break the ubiquitous language.
For instance say I have 2 entities: Hotel and House. Each of those entities has images associated with it which respect the following rules:
Hotels must have at least 5 images and no more than 20
Houses must have at least 1 image and no more than 10
This to me entails the following classes
class House {
HouseImages images;
// ...
}
class Hotel {
HotelImages images;
}
class HouseImages {
final List<Image> images;
HouseImages(this.images) : assert(images.length >= 1),
assert(images.length <= 10);
}
class HotelImages {
final List<Image> images;
HotelImages(this.images) : assert(images.length >= 5),
assert(images.length <= 20);
}
Doesn't that break the ubiquitous languages a bit ? It just feels a bit off to have all those classes that are essentially prefixed (HotelName vs HouseName, HotelImages vs HouseImages, and so on). In other words, my value object folder that once consisted of x, y, z, where x, y and z where also documented in a lexicon document, now has house_x, hotel_x, house_y, hotel_y, house_z, hotel_z and it doesn't look quite as english as it was when it was x, y, z.
Is this common or is there something I misunderstood here maybe ? I do like the assurance it gives though, and it actually caught some bugs too.
There is some reasoning you can apply that usually helps me when deciding to introduce a value object or not. There are two very good blog articles concerning this topic I would like to recommend:
https://enterprisecraftsmanship.com/posts/value-objects-when-to-create-one/
https://enterprisecraftsmanship.com/posts/collections-primitive-obsession/
I would like to address your concrete example based on the heuristics taken from the mentioned article:
Are there more than one primitive values that encapsulate a concept, i.e. things that always belong together?
For instance, a Coordinate value object would contain Latitude and Longitude, it would not make sense to have different places of your application knowing that these need to be instantiated and validated together as a whole. A Money value object with an amount and a currency identifier would be another example. On the other hand I would usually not have a separate value object for the amount field as the Money object would already take care of making sure it is a reasonable value (e.g. positive value).
Is there complexity and logic (like validation) that is worth being hidden behind a value object?
For instance, your HotelImages value object that defines a specific collection type caught my attention. If HotelImages would not be used in different spots and the logic is rather simple as in your sample I would not mind adding such a collection type but rather do the validation inside the Hotel entity. Otherwise you would blow up your application with custom value objects for basically everything.
On the other hand, if there was some concept like an image collection which has its meaning in the business domain and a set of business rules and if that type is used in different places, for instance, having a ImageCollection value object that is used by both Hotel and House it could make sense to have such a value object.
I would apply the same thinking concerning your question for HouseName and HotelName. If these have no special meaning and complexity outside of the Hotel and House entity but are just seen as some simple properties of those entities in my opinion having value objects for these would be an overkill. Having something like BuildingName with a set of rules what this name has to follow or if it even is consisting of several primitive values then it would make sense again to use a value object.
This relates to the third point:
Is there actual behaviour duplication that could be avoided with a value object?
Coming from the last point thinking of actual duplication (not code duplication but behaviour duplication) that can be avoided with extracting things into a custom value object can also make sense. But in this case you always have to be careful not to fall into the trap of incidental duplication, see also [here].1
Does your overall project complexity justify the additional work?
This needs to be answered from your side of course but I think it's good to always consider if the benefits outweigh the costs. If you have a simple CRUD like application that is not expected to change a lot and will not be long lived all the mentioned heuristics also have to be used with the project complexity in mind.

ID3 Implementation Clarification

I am trying to implement the ID3 algorithm, and am looking at the pseudo-code:
(Source)
I am confused by the bit where it says:
If examples_vi is empty, create a leaf node with label = most common value in TargegetAttribute in Examples.
Unless I am missing out on something, shouldn't this be the most common class?
That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
1) Unless I am missing out on something, shouldn't this be the most
common class?
You're correct, and the text also says the same. Look at the function description at the top :
Target_Attribute is the attribute whose value is to be predicted by the tree
so the value of Target_Attribute is the class/label.
2) That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Yes, but not among all samples in your whole dataset, but rather those samples that reached up to this point in the tree/recursion. (ID3 functions is recursive and so the current Examples is actually Examples_vi of the caller)
3) Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
No, picking a random class (with equal chances for each class) is not the same. Because often the inputs do have an unbalanced class distribution (this distribution is often called the prior distribution in many texts), so you may have 99% of examples are positive and only 1% negative. So whenever you really have no information whatsoever to decide on the outcome of some input, it makes sense to predict the most probable class, so that you have the most probability of being correct. This maximizes your classifier's accuracy on unseen data only under the assumption that the class distribution in your training data is the same as in the unseen data.
This explanation holds with the same reasoning for the base case when Attributes is empty (see 4 line in your pseudocode text); whenever we have no information, we just report the most common class of the data at hand.
If you never implemented the codes(ID3) but still want to know more in processing details, I suggest you to read this paper:
Building Decision Trees in Python
and here is the source code from the paper:
decision tree source code
This paper has a example or use example from your book(replace the "data" file with the same format). And you can just debug it (with some breakpoints) in eclipse to check the attribute values during the algorithms running.
Go over it, you will understand ID3 better.

Using variables in UML multiplicities

I was running a tutorial today, and a we were designing a Class diagram to model a road system. One of the constraints of the system is that any one segment of road has a maximum capacity; once reached, no new vehicles can enter the segment.
When drawing the class diagram, can I use capacity as one of the multiplicities? This way, instead of having 0..* vehicles on a road segment, I can have 0..capacity vehicles.
I had a look at ISO 1905-1 for inspiration, and I thought that what I want is similar to what they've called a 'multiplicity element'. In the standard, it states:
If the Multiplicity is associated with an element whose notation is a text string (such as an attribute, etc.), the multiplicity string will be placed within square brackets ([]) as part of that text string. Figure 9.33 shows two multiplicity strings as part of attribute specifications within a class symbol. -- section 9.12
However, in the examples it gives, they don't seem to employ this feature in the way I expected - they annotate association links rather than replace the multiplicities.
I would rather get a definitive answer for the students in question, rather than make a guess based on the standard, so I ask here: has anyone else faced this issue? How did you overcome it?
According to the UML specification you can use a ValueSpecification for lower and upper bounds of a multiplicity element. And a ValueSpecification can be an expression. So in theory it must be possible although the correct expression will be more complex. Indeed it mixes design and instance level.
In such a case it is more usual to use a constraint like this:
UML multiplicity constraint http://app.genmymodel.com/engine/xaelis/roads.jpg

Options for representing string input as an object

I am receiving as input a "map" represented by strings, where certain nodes of the map have significance (s). For example:
---s--
--s---
s---s-
s---s-
-----s
My question is, what reasonable options are there for representing this input as an object.
The only option that really comes to mind is:
(1) Each position translated to node with up,down,left,right pointers. The whole object contains a pointer to top right node.
This seems like just a graph representation specific to this problem.
Thanks for the help.
Additionally, if there are common terms for this type of input, please let me know
Well, it depends a lot on what you need to delegate to those objects. OOP is basically about asking objects to perform things in order to solve a given problem, so it is hard to tell without knowing what you need to accomplish.
The solution you mention can be a valid one, as can also be having a matrix (in this case of 6x5) where you store in each matrix cell an object representing the node (just as an example, I used both approaches once to model the Conway's game of life). If you could give some more information on what you need to do with the object representation of your map then a better design can be discussed.
HTH

What is the best way to classify following words in POS tagging?

I am doing POS tagging. Given the following tokens in the training set, is it better to consider each token as Word1/POStag and Word2/POStag or consider them as one word that is Word1/Word2/POStag ?
Examples: (the POSTag is not required to be included)
Bard/EMS
Interstate/Johnson
Polo/Ralph
IBC/Donoghue
ISC/Bunker
Bendix/King
mystery/comedy
Jeep/Eagle
B/T
Hawaiian/Japanese
IBM/PC
Princeton/Newport
editing/electronic
Heller/Breene
Davis/Zweig
Fleet/Norstar
a/k/a
1/2
Any suggestion is appreciated.
The examples don't seem to fall into one category with respect to the use of the slash -- a/k/a is a phrase acronym, 1/2 is a number, mystery/comedy indicates something in between the two words, etc.
I feel there is no treatment of the component words that would work for all the cases in question, and therefore the better option is to handle them as unique words. At decoding stage, when the tagger will probably be presented with more previously unseen examples of such words, the decision can often be made based on the context, rather than the word itself.

Resources