Converting NLP to CSP: Story Consistency - nlp

Background: I would like to know if anyone has succeeded in converting natural language to a knowledge base representative of a constraint satisfaction problem. I want to perform constraint satisfaction on a person's statements in order to see if any inconsistencies are present while performing a resolution proof on the statements. This could be used in a courtroom or during election debates.
So to lay out my idealistic story consistency algorithm:
1. A first statement comes in; convert it and add it to the Knowledge Base (KB).
2. While there is a next statement:
   - get the next statement
   - convert the statement to a clause
   - negate the clause
   - add the negated clause to the KB
   - check for a contradiction (perform resolution)
   - report the finding
   - remove the original clause to see if the story changes again
   - add the new clause
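The conversion from natural language to clauses is the hard, open part, but to make the resolution step of the loop concrete, here is a minimal, hedged Python sketch of propositional resolution over clauses represented as sets of string literals (e.g. {'~A', 'B'}). It is only an illustration of the consistency check, not a full NLP pipeline.

from itertools import combinations

def negate(literal):
    # Flip a propositional literal, e.g. 'A' <-> '~A'.
    return literal[1:] if literal.startswith('~') else '~' + literal

def resolvents(c1, c2):
    # All clauses obtained by resolving c1 and c2 on a complementary pair of literals.
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return out

def unsatisfiable(clauses):
    # Exhaustive propositional resolution: True iff the empty clause is derivable.
    clauses = set(map(frozenset, clauses))
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:              # empty clause derived -> contradiction
                    return True
                new.add(frozenset(r))
        if new <= clauses:             # saturated without a contradiction
            return False
        clauses |= new

# Hypothetical use inside the loop: the KB already holds {~A, B} ("A implies B") and {A};
# the next statement converts to the clause {~B}, which contradicts the story so far.
kb = [{'~A', 'B'}, {'A'}]
print(unsatisfiable(kb + [{'~B'}]))    # True -> report an inconsistency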
How would I convert a statement to a usable clause?
For example:
~A B C
A ~B C

How is it that in Spark, reduce and aggregate are the same?

• In math, we think of reduce when a denominator and numerator share the same multiplier and that multiplier is “reduced” to a simpler concept (as in “is divisible by”).
• In math, an aggregate, while similar, does not produce the same value in a reduced form; instead, an aggregate produces a single value that is representative of the whole, the whole being a derived state of the data, used primarily for statistical purposes. For example: “Out of 10 sales people, we generated $60,000 in capital sales.”
• https://docs.databricks.com/sql/language-manual/functions/reduce.html
• https://docs.databricks.com/sql/language-manual/functions/aggregate.html
Apparently in Apache Spark, reduce means the same thing as aggregate as explained by Databricks.
Can someone clarify the difference or explain how these two words (reduce and aggregate), with a perceptibly different context, can be considered to be the same?
From the docs you posted, you can find the explanation of reduce, which is:
This function is a synonym for aggregate function.
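If you want to see the synonym in action, here is a hedged PySpark sketch (assuming a recent Spark version in which the SQL function reduce is available alongside aggregate, as in the Databricks docs linked above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("reduce-vs-aggregate").getOrCreate()

# Both higher-order functions fold an array with the same start value and merge lambda;
# reduce is simply documented as a synonym for aggregate.
spark.sql("""
    SELECT
      aggregate(array(1, 2, 3, 4), 0, (acc, x) -> acc + x) AS agg_sum,
      reduce(array(1, 2, 3, 4), 0, (acc, x) -> acc + x)    AS reduce_sum
""").show()
# Both columns print 10.

Note that these are Spark SQL's higher-order array functions, not RDD.reduce/RDD.aggregate, which are a separate pair of APIs with genuinely different signatures.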

How to determine whether a given language is regular or not (by just looking at the language)?

Is there any trick to guess if a language is regular by just looking at the language?
In order to choose a proof method, I need to have some hypothesis first. Do you know any hints/patterns that reduce the time spent on long questions?
For instance, I'd like to avoid spending time on the pumping lemma when the language is actually regular, and I don't want to have to construct a DFA/grammar either.
For example:
1. L = { w ∈ {a,b}* | number of a's in w < number of b's in w }
2. L = { a^n b^m | n, m >= 0 }
How can I tell which is regular just by looking at the above examples?
In general, when looking at a language, a good rule of thumb for whether the language is regular or not is to think of a program that can read a string and answer the question "is this string in the language?"
To write such a program, do you need to store some arbitrary value in a variable or is the program's state (that is, the combination of all possible variables' values) limited to some finite fixed number of possibilities? If the language can be recognized by a program that only needs a fixed number of variables that can only have a fixed number of values, then you've got a regular language. If not, then not.
Using this, I can see that the first language is not regular, but the second language is. In the first language, I need to remember how many a's I've seen, and how many b's. (Or at the very least, I need to keep track of (# of a's) - (# of b's), and accept if the string ends while that count is negative.) At the same time, there's no limit on the number of a's, so this count could grow arbitrarily large.
In the second language, I don't care what n and m are at all. So with the second language, my program would just keep track of "have I seen at least one b yet?" to make sure we don't have any a characters that occur after the first b. (So, one variable with only two values - true or false)
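To make that concrete, here is a small Python sketch (just an illustration of the rule of thumb) of a recognizer for the second language that needs only one boolean of state:

def in_L2(w):
    # Recognize L2 = { a^n b^m : n, m >= 0 } with a single true/false variable,
    # rejecting any 'a' that appears after the first 'b'.
    seen_b = False
    for ch in w:
        if ch == 'b':
            seen_b = True
        elif ch == 'a':
            if seen_b:
                return False
        else:
            return False        # character outside {a, b}
    return True

assert in_L2("aaabb") and in_L2("") and not in_L2("aba")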
So one way to make language 1 into a regular language is to change it to be:
1. L = { w ∈ {a,b}* | number of a's in w < number of b's in w, and number of a's in w < 100 }
Now I don't need to keep track of the number of a's I've seen once I hit 100 (since then I know automatically that the string isn't in the language), and likewise with the number of b's - once I hit 100, I can stop counting because I know that will be enough unless the number of a's is itself too large.
One common case you should watch out for with this is when someone asks you about languages like "the number of a's is a multiple of 13" or "w ∈ {0,1}* and w is the binary representation of a multiple of 13". With these, it might seem like you need to keep track of the whole number to make the determination, but in fact you don't - in both cases, you only need to keep a variable that can count from 0 to 12. So watch out for "multiple of"-type languages (and the related "is odd", "is even", or "is 1 more than a multiple of 13").
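For illustration, here is a Python sketch of a recognizer for the binary-multiple-of-13 language that keeps only the running remainder - one variable with 13 possible values, which is exactly why this language is regular:

def is_binary_multiple_of_13(w):
    # Only the remainder mod 13 is tracked, never the whole number.
    remainder = 0
    for bit in w:
        remainder = (remainder * 2 + int(bit)) % 13
    return remainder == 0

assert is_binary_multiple_of_13("11010")       # 26 in binary
assert not is_binary_multiple_of_13("11011")   # 27 in binary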
Other mathematical properties though - for example, w ∈ {0,1}* and w is the binary representation of a perfect square - will result in non-regular languages.

SUM not working 'Invalid or missing field format'

I have an input file in this format: (length 20, 10 chars and 10 numerics)
jname1 0000500006
bname1 0000100002
wname1 0000400007
yname1 0000000006
jname1 0000100001
mname1 0000500012
mname2 0000700013
In my jcl I have defined my sysin data as such:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,FD)
DATAEND
*
It works fine as long as I don't add the SUM fields, so I'm wondering if I'm using the wrong format for my numerics. I know they start at position 11 and have a length of 10, so the format is the only thing that could be wrong.
As you might have already realised, the point of this JCL is just to list the values, grouped by the first letter of the name (so for the example data and JCL I have given, it would group the numerics for mname1 and mname2 together but leave the other records untouched).
I'm kind of new at this, so I was wondering what I need for the format if my numerics are like that in the input file.
If you are new to DFSORT, get hold of the DFSORT Getting Started guide for your version of DFSORT (http://www-01.ibm.com/support/docview.wss?uid=isg3T7000080).
This takes you through all the basic operations, with many examples.
The DFSORT Application Programming Guide describes everything you need to know, in detail, again with examples. Appendix C of that document contains all the data-types available (note: FD, which you tried to use, is not a valid data-type, so it was probably a typo). There are tables throughout the document listing which data-types are available where, and whether there is a particular limit.
For advanced techniques, consult the DFSORT Smart Tricks publication here: http://www-01.ibm.com/support/docview.wss?uid=isg3T7000094
You also need to understand a bit more about the way data is stored on a mainframe.
Decimals (which can be "packed-decimal" or "zoned-decimal") do not contain a decimal-point. The decimal-point is implied. In high-level languages you tell the compiler where the decimal-point is (in a fixed position) and the compiler does the alignments for you. In Assembler, you do everything yourself.
Decimals are 100% accurate, as there are machine-instructions which act directly on packed-decimal data giving packed-decimal results.
A field which actually contains a decimal-point, cannot be directly used in arithmetic.
An unsigned field is treated as positive when used in any arithmetic.
The SUM statement supports a limited number of numeric definitions, and you have chosen the correct one. It does not matter that your data is unsigned.
If the format of the output from SUM is not what you want, look at OPTION ZDPRINT (or NOZDPRINT).
If you want further formatting, you can use OUTREC or OUTFIL.
As an option to using SUM, you can use OUTFIL reporting functions (especially, although not limited to, if you want a report). You can use SECTIONS and TRAILER3 with TOT/TOTAL.
Something to watch for with SUM (which is not a problem with the reporting features) is if any given one (or more) of your SUMmed fields exceed the field size. To continue to use SUM if that happens, you need to extend the field in INREC and then get SUM to use the new, sufficient, size.
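If it helps to see the control statements as ordinary code, here is a rough Python model (not DFSORT, just an illustration of the behaviour) of what SORT FIELDS=(1,1,CH,A) plus SUM FIELDS=(11,10,ZD) do: sort on the one-byte key, then collapse records with equal keys by summing the unsigned 10-byte numeric field. The sample records assume the name field is padded to 10 bytes, as described in the question.

from collections import OrderedDict

records = [
    "jname1    0000500006",
    "bname1    0000100002",
    "jname1    0000100001",
]

totals = OrderedDict()
for rec in sorted(records, key=lambda r: r[0]):   # SORT FIELDS=(1,1,CH,A): key is byte 1
    key = rec[0]
    value = int(rec[10:20])                       # SUM FIELDS=(11,10,...): bytes 11-20
    totals[key] = totals.get(key, 0) + value

for key, total in totals.items():
    print(key, str(total).zfill(10))              # b 0000100002 / j 0000600007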
After some trial and error I finally found it; apparently the format I needed to use was ZD (zoned decimal, signed), so my SYSIN becomes this:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,ZD)
DATAEND
*
This works even though my records don't contain any decimals and they are unsigned. I don't really get it, so if someone knows why it's like that, please go ahead and explain it to me.
For now, the way I'm going to remember it is this: Z = the symbol for integers (so no decimals).

Stronger boosting by date in Solr

Boosting by date field in solr is defined as:
{!boost b=recip(ms(NOW,datefield),3.16e-11,1,1)}
I looked everywhere (for example Solr Dismax Config for Boost Scoring and Solr boost for multivalued date field, which all reference the SolrRelevancyFAQ), and the same definition is used everywhere. But I found that this is not boosting my results sufficiently. How can I make this date boosting stronger?
User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.
And the Solr debug output is way too confusing for me to understand the problem.
Now, this is not a huge problem. 99% of queries work fine and produce the expected results, so it's not like Solr is not working at all; I just found this situation very confusing and don't know how to proceed.
recip(x, m, a, b) implements f(x) = a/(xm+b), with:
x : the document age in ms, defined as ms(NOW,<datefield>).
m : a constant that defines a time scale which is used to apply boost. It should be relative to what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10ms) implies to use its inverse : 3.16e-11 (1/3.16e10 rounded).
a and b are constants (defined arbitrarily).
xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
xm ≈ 0 when the document is new, resulting in a value close to a/b.
Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.
With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
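To get a feel for these numbers, here is a quick Python sketch (hypothetical document ages, using the constants from this answer) that evaluates the recip multiplier a/(x*m + b) for documents of various ages:

MS_PER_YEAR = 3.16e10          # approx. milliseconds per year, as in the Solr FAQ

def recip(age_ms, m, a, b):
    return a / (age_ms * m + b)

for years in (0, 0.5, 1, 2, 5):
    default = recip(years * MS_PER_YEAR, 3.16e-11, 1, 1)   # reference_time = 1 year
    stronger = recip(years * MS_PER_YEAR, 6.33e-11, 1, 1)  # reference_time ≈ 6 months
    print(f"{years:>4} yr old: default={default:.3f}  stronger={stronger:.3f}")

# With the 1-year reference the multiplier falls roughly as 1, 2/3, 1/2, 1/3, 1/6;
# with the ~6-month reference it falls roughly twice as fast: 1, 1/2, 1/3, 1/5, 1/11.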
How to make date boosting stronger?
Increase m: choose a lower reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1-year reference, the multiplier decreases twice as fast as the document age increases.
Decreasing a and b expands the response curve of the function. This can be very aggressive; see this example (page 8).
Apply a boost to the boost function itself with the bf (Boost Functions) parameter (this is a dismax parameter, so it requires using the DisMax or eDisMax query parser), e.g.:
bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
It is important to note a few things :
bf is an additive boost and acts as a bonus added to the score of newer documents.
{!boost b} is a multiplicative boost and acts more as a penalty applied to the score of older documents.
A bf score (the "bonus" added to the global score) is calculated independently of the relevancy score (the global score), meaning that a resultset with higher scores may not be impacted as much as a resultset with lower scores. In contrast, multiplicative boosts affect scores the same way regardless of the resultset relevancy, that's why it is usually preferred.
Do not use recip() for dates more than one reference_time in the future or it will yield negative values.
See also this very insightful post by Nolan Lawson on Comparing boost methods in Solr.
User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.
Well, from your example it is clear that your results have landed in a tie situation. To understand this confusing debug output and devise a tie-breaker policy, it is important to understand DisMax.
With DisMax queries, the different terms of the user input are executed against different fields; if several of them hit (the term appears in different fields of the same document), the hit that scores highest is used. But what happens with the other sub-queries that hit in that document for that term? Well, that's what the tie parameter defines. DisMax calculates the score for a term query as:
score= [score of the top scoring subquery] + tie * (sum of other hitting subqueries)
In consequence, the tie parameter is a value between 0 and 1 that defines whether DisMax considers only the maximum hit score for a term (tie=0), all the hits for a term (tie=1), or something between those two extremes.
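As a back-of-the-envelope illustration (the sub-scores below are made up), the per-term formula behaves like this:

def dismax_term_score(subquery_scores, tie):
    # score = top-scoring subquery + tie * (sum of the other hitting subqueries)
    top = max(subquery_scores)
    return top + tie * (sum(subquery_scores) - top)

hits = [2.0, 1.5]                      # the term matched both title and description
print(dismax_term_score(hits, 0.0))    # 2.0  -> only the best field counts
print(dismax_term_score(hits, 0.1))    # 2.15 -> a small tie-breaker for multi-field hits
print(dismax_term_score(hits, 1.0))    # 3.5  -> every matching field counts fully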
The boost parameter is very similar to the bf parameter, but instead of adding its result to the final score, it will multiply it. This is only available in the Extended Dismax Query Parser or the Lucid Query Parser.
There is an interesting article Comparing Boost Methods of SOLR which may be useful to you.
References for this answer:
Advanced Apache Solr boosting: a case study
Using Solr’s Dismax Tie Parameter
There is an example very well presented in the ReciprocalFloatFunction that will give you a clear view on how the boosting recipe works. If you find that dismax does not offer you enough control over the boosting, you will have to do some tinkering with BoostQParserPlugin.
A multiplier of 3.16e-11 changes the units from milliseconds to years (since there are about 3.16e10 milliseconds per year). Thus, a very recent date will yield a value close to 1/(0+1) or 1, a date a year in the past will get a multiplier of about 1/(1+1) or 1/2, and a date two years old will yield 1/(2+1) or 1/3.

Mapping interchangeably terms such as Weight to Mass for QAnswering NLP

I've been working on a Question Answering engine in C#. I have implemented the features of most modern systems and am achieving good results. Despite the aid of WordNet, one problem I haven't been able to solve yet is changing the user input to the correct term.
For example
changing Weight -> Mass
changing Tall -> Height
My question is about the existence of some sort of resource that can aid me in this task of changing the terms to the correct terms.
Thank You
Looking at all the synsets in WordNet for both Mass and Weight I can see that there is no shared synset and thus there is no meaning in common. Words that actually do have the same meaning can be matched by means of their synset labels, as I'm sure you've realized.
In my own natural language engine (http://nlp.abodit.com) I allow users to use any synset label in the grammar they define but I would still create two separate grammar rules in this case, one recognizing questions about mass and one recognizing questions about weight.
However, there are also files for WordNet that give you class relationships between synsets. For example, if you type 'define mass' into my demo page you'll see:
4. wn30:synset-mass-noun-1
the property of a body that causes it to have weight in a gravitational field
--type--> wn30:synset-fundamental_quantity-noun-1
--type--> wn30:synset-physical_property-noun-1
ITokenText, IToken, INoun, Singular
And if you do the same for 'weight' you'll also see that it too has a class relationship to 'physical property'.
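If you are working outside that engine, a similar exploration is possible directly over WordNet. Here is a hedged Python sketch using NLTK's WordNet corpus reader (assuming nltk and its wordnet data are installed) that looks for shared hypernyms between the noun senses of 'mass' and 'weight'; the exact meeting synset and the depth cutoff below are illustrative and depend on your WordNet version:

from itertools import product
from nltk.corpus import wordnet as wn    # requires: import nltk; nltk.download('wordnet')

mass_synsets = wn.synsets('mass', pos=wn.NOUN)
weight_synsets = wn.synsets('weight', pos=wn.NOUN)

for m, w in product(mass_synsets, weight_synsets):
    common = m.lowest_common_hypernyms(w)
    # Skip very generic ancestors such as 'entity'; the depth cutoff is a rough heuristic.
    specific = [c for c in common if c.min_depth() > 3]
    if specific:
        print(m.name(), '+', w.name(), '->', [c.name() for c in specific])

A rule could then fire on a shared ancestor class (e.g. a physical-property synset) rather than on the individual word senses.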
In my system you can write a rule that recognizes a question about a 'physical property' and perhaps a named object and then try to figure out which physical property they are likely to be asking about. And, perhaps, if you can't match maybe just tell them all about the physical properties of the object.
The method signature in my system would be something like ...
... QuestionAboutPhysicalProperties (... IPhysicalProperty prop,
INamedObject obj, ...)
... and in code I would look at the properties of obj and try to find one called 'prop'.
The only way that I know how to do this effectively requires having a large corpus of user query sessions and a happiness measure on those sessions, and then finding substitutions of word x for word y (possibly given some context z) that correlate with improved user happiness.
Here is a reasonable paper on generating query substitutions.
And here is a new paper on generating synonyms from anchor text, which doesn't require a query log.
