ICU: What do NFD and NFC mean?

I found a snippet which reads
Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove; Lower();
...and is supposed to make an arbitrary string fit well into a URL.
So, I guess the things between the semicolons are something like "commands" which are to be executed, but what exactly do NFD and NFC stand for? I could not find anything, even in the official documentation...

See ICU transliterators and the linked page on TR15 normalization forms for complete examples.
Normalization Form D (NFD): Canonical Decomposition
Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD): Compatibility Decomposition
Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
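To make the forms concrete, here is a minimal sketch of what the middle of that transliterator chain does, written with Python's standard unicodedata module rather than ICU itself. The function name slug_like is made up for illustration, and the Any-Latin step is left out because the standard library has no equivalent; treat this as an approximation of the ICU rule, not a reimplementation of it.

```python
import unicodedata


def slug_like(text: str) -> str:
    """Rough stand-in for: NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove; Lower()."""
    # NFD: split precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # [:Nonspacing Mark:] Remove -> drop characters whose general category is Mn.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # NFC: recompose anything that can still be composed.
    recomposed = unicodedata.normalize("NFC", stripped)
    # [:Punctuation:] Remove -> drop every P* category (Po, Pd, Ps, ...).
    no_punct = "".join(
        ch for ch in recomposed if not unicodedata.category(ch).startswith("P")
    )
    # Lower()
    return no_punct.lower()


# NFD turns one code point (U+00E9) into two: "e" + U+0301 COMBINING ACUTE ACCENT.
print(len(unicodedata.normalize("NFD", "\u00e9")))
# NFC puts the pair back together into the single precomposed code point.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
print(slug_like("Crème Brûlée, s'il vous plaît!"))
```

Decomposing first is what makes the accent removal possible: once "è" has become "e" plus a separate combining mark, the mark can be dropped on its own.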

Related

Extracting <subject, predicate, object> triplet from unstructured text

I need to extract simple triplets from unstructured text. Usually it is of the form noun-verb-noun, so I have tried POS tagging and then extracting nouns and verbs from the neighbourhood.
However, it leads to a lot of cases and gives low accuracy.
Will Syntactic/semantic parsing help in this scenario?
Will ontology based information extraction be more useful?
I expect that syntactic parsing would be the best fit for your scenario. Some trivial template-matching method with POS tags might work, where you find verbs preceded and followed by a single noun, and take the former to be the subject and the latter the object. However, it sounds like you've already tried something like that -- unless your neighborhood extraction ignores word order (which would be a bit silly - you'd be guessing which noun was the subject and which was the object, and that's assuming exactly two nouns in each sentence).
Since you're looking for {s, v, o} triplets, chances are you won't need semantic or ontological information. That would be useful if you wanted more information, e.g. agent-patient relations or deeper knowledge extraction.
{s,v,o} is shallow syntactic information, and given that syntactic parsing is considerably more robust and accessible than semantic parsing, that might be your best bet. Syntactic parsing will be sensitive to simple word re-orderings, e.g. "The hamburger was eaten by John." => {John, eat, hamburger}; you'd also be able to specifically handle intransitive and ditransitive verbs, which might be issues for a more naive approach.
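For reference, the trivial template-matching baseline can be sketched as below. The input is assumed to be already POS-tagged with coarse tags (the tag names and the function extract_triplets are illustrative, not from any particular tagger). The passive sentence shows the kind of mistake that motivates a real syntactic parser.

```python
def extract_triplets(tagged):
    """Naive baseline: for each verb, take the nearest noun on the left as
    subject and the nearest noun on the right as object.

    tagged is a list of (word, tag) pairs with coarse tags such as
    "NOUN", "VERB", "DET" (hand-assigned here for illustration).
    """
    triplets = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "VERB":
            continue
        subj = next((w for w, t in reversed(tagged[:i]) if t == "NOUN"), None)
        obj = next((w for w, t in tagged[i + 1:] if t == "NOUN"), None)
        if subj is not None and obj is not None:
            triplets.append((subj, word, obj))
    return triplets


active = [("John", "NOUN"), ("ate", "VERB"), ("the", "DET"), ("hamburger", "NOUN")]
passive = [
    ("The", "DET"), ("hamburger", "NOUN"), ("was", "AUX"),
    ("eaten", "VERB"), ("by", "ADP"), ("John", "NOUN"),
]

print(extract_triplets(active))   # roles come out right
print(extract_triplets(passive))  # subject and object are swapped
```

Word order alone cannot tell that "hamburger" is the thing being eaten in the passive sentence; a parser that identifies the passive construction can.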

Differences between lexical features and orthographic features in NLP?

Features are used for model training and testing. What are the differences between lexical features and orthographic features in Natural Language Processing? Examples preferred.
I am not aware of such a distinction. Most of the time, when people talk about lexical features they mean using the word itself, in contrast to using only other features, i.e. its part-of-speech.
Here is an example of a paper that means "whole word orthograph" when they say lexical features
One could venture that orthographic could mean something more abstract than the sequence of characters themselves, for example whether the sequence is capitalized / titlecased / camelcased / etc. But we already have the useful and clearly understood shape feature denomination for that.
As such, I would recommend distinguishing features like this:
lexical features: whole word, prefix/suffix (various lengths possible), stemmed word, lemmatized word
shape features: uppercase, titlecase, camelcase, lowercase
grammatical and syntactic features: POS, part of a noun-phrase, head of a verb phrase, complement of a prepositional phrase, etc.
This is not an exhaustive list of possible features and feature categories, but it might help you categorize linguistic features in a clearer and more widely accepted way.
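As a rough illustration of that split, here is a small sketch of a per-token feature function. The feature names, the prefix/suffix length of 3, and the exact definition of "camelcase" are my own choices for the example; stemming, lemmatization and syntactic features beyond a supplied POS tag are left out because they need external tools.

```python
def token_features(word, pos=None):
    """Return a dict of lexical, shape and (optionally) grammatical features."""
    feats = {
        # Lexical features: the word itself and pieces of it.
        "lex:word": word.lower(),
        "lex:prefix3": word[:3].lower(),
        "lex:suffix3": word[-3:].lower(),
    }
    # Shape features: abstract away from the actual characters.
    if word.isupper():
        shape = "uppercase"
    elif word.islower():
        shape = "lowercase"
    elif word.istitle():
        shape = "titlecase"
    elif any(c.isupper() for c in word[1:]) and any(c.islower() for c in word):
        shape = "camelcase"
    else:
        shape = "other"
    feats["shape"] = shape
    # Grammatical feature: only included if a tagger supplied one.
    if pos is not None:
        feats["syn:pos"] = pos
    return feats


for w in ["IBM", "Paris", "iPhone", "running"]:
    print(w, token_features(w)["shape"])
```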

Mapping interchangeable terms such as Weight to Mass for Question Answering NLP

I've been working on a Question Answering engine in C#. I have implemented the features of most modern systems and am achieving good results. Despite the aid of WordNet, one problem I haven't been able to solve yet is changing the user input to the correct term.
For example
changing Weight -> Mass
changing Tall -> Height
My question is about the existence of some sort of resource that can aid me in this task of changing the terms to the correct terms.
Thank You
Looking at all the synsets in WordNet for both Mass and Weight I can see that there is no shared synset and thus there is no meaning in common. Words that actually do have the same meaning can be matched by means of their synset labels, as I'm sure you've realized.
In my own natural language engine (http://nlp.abodit.com) I allow users to use any synset label in the grammar they define but I would still create two separate grammar rules in this case, one recognizing questions about mass and one recognizing questions about weight.
However, there are also files for WordNet that give you class relationships between synsets. For example, if you type 'define mass' into my demo page you'll see:
4. wn30:synset-mass-noun-1
the property of a body that causes it to have weight in a gravitational field
--type--> wn30:synset-fundamental_quantity-noun-1
--type--> wn30:synset-physical_property-noun-1
ITokenText, IToken, INoun, Singular
And if you do the same for 'weight' you'll also see that it too has a class relationship to 'physical property'.
In my system you can write a rule that recognizes a question about a 'physical property' and perhaps a named object and then try to figure out which physical property they are likely to be asking about. And, perhaps, if you can't match maybe just tell them all about the physical properties of the object.
The method signature in my system would be something like ...
... QuestionAboutPhysicalProperties (... IPhysicalProperty prop,
INamedObject obj, ...)
... and in code I would look at the properties of obj and try to find one called 'prop'.
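The idea of falling back to a shared class can be sketched independently of any particular engine. The tiny TYPE_OF table below is hand-written from the output quoted above, not queried from WordNet, and whether fundamental_quantity sits between mass and physical_property or beside it is an assumption; the function names are also made up for the example.

```python
# Hand-written toy "type" relations, loosely following the demo output above.
TYPE_OF = {
    "mass": ["fundamental_quantity"],
    "fundamental_quantity": ["physical_property"],
    "weight": ["physical_property"],
}


def ancestors(term):
    """All classes reachable from term by following type links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in TYPE_OF.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_classes(a, b):
    """Classes that both terms belong to, directly or indirectly."""
    return ancestors(a) & ancestors(b)


print(common_classes("mass", "weight"))
```

If the user's term and a property of the object share such a class, the system has a reason to treat one as a candidate substitute for the other, even though they share no synset.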
The only way that I know how to do this effectively requires having a large corpus of user query sessions and a happiness measure on sessions, and then finding substitutions of word x for word y (possibly given some context z) that correlate with improved user happiness.
Here is a reasonable paper on generating query substitutions.
And here is a new paper on generating synonyms from anchor text, which doesn't require a query log.

What is the best way to classify following words in POS tagging?

I am doing POS tagging. Given the following tokens in the training set, is it better to consider each token as Word1/POStag and Word2/POStag or consider them as one word that is Word1/Word2/POStag ?
Examples: (the POSTag is not required to be included)
Bard/EMS
Interstate/Johnson
Polo/Ralph
IBC/Donoghue
ISC/Bunker
Bendix/King
mystery/comedy
Jeep/Eagle
B/T
Hawaiian/Japanese
IBM/PC
Princeton/Newport
editing/electronic
Heller/Breene
Davis/Zweig
Fleet/Norstar
a/k/a
1/2
Any suggestion is appreciated.
The examples don't seem to fall into one category with respect to the use of the slash -- a/k/a is a phrase acronym, 1/2 is a number, mystery/comedy indicates something in between the two words, etc.
I feel there is no treatment of the component words that would work for all the cases in question, and therefore the better option is to handle them as unique words. At the decoding stage, when the tagger will probably be presented with more previously unseen examples of such words, the decision can often be made based on the context rather than the word itself.

Mathematica: what is symbolic programming?

I am a big fan of Stephen Wolfram, but he is definitely one not shy of tooting his own horn. In many references, he extols Mathematica as a different symbolic programming paradigm. I am not a Mathematica user.
My questions are: what is this symbolic programming? And how does it compare to functional languages (such as Haskell)?
When I hear the phrase "symbolic programming", LISP, Prolog and (yes) Mathematica immediately leap to mind. I would characterize a symbolic programming environment as one in which the expressions used to represent program text also happen to be the primary data structure. As a result, it becomes very easy to build abstractions upon abstractions since data can easily be transformed into code and vice versa.
Mathematica exploits this capability heavily. Even more heavily than LISP and Prolog (IMHO).
As an example of symbolic programming, consider the following sequence of events. I have a CSV file that looks like this:
r,1,2
g,3,4
I read that file in:
Import["somefile.csv"]
--> {{r,1,2},{g,3,4}}
Is the result data or code? It is both. It is the data that results from reading the file, but it also happens to be the expression that will construct that data. As code goes, however, this expression is inert since the result of evaluating it is simply itself.
So now I apply a transformation to the result:
% /. {c_, x_, y_} :> {c, Disk[{x, y}]}
--> {{r,Disk[{1,2}]},{g,Disk[{3,4}]}}
Without dwelling on the details, all that has happened is that Disk[{...}] has been wrapped around the last two numbers from each input line. The result is still data/code, but still inert. Another transformation:
% /. {"r" -> Red, "g" -> Green}
--> {{Red,Disk[{1,2}]},{Green,Disk[{3,4}]}}
Yes, still inert. However, by a remarkable coincidence this last result just happens to be a list of valid directives in Mathematica's built-in domain-specific language for graphics. One last transformation, and things start to happen:
% /. x_ :> Graphics[x]
--> Graphics[{{Red,Disk[{1,2}]},{Green,Disk[{3,4}]}}]
Actually, you would not see that last result. In an epic display of syntactic sugar, Mathematica would instead show a picture of red and green circles.
But the fun doesn't stop there. Underneath all that syntactic sugar we still have a symbolic expression. I can apply another transformation rule:
% /. Red -> Black
Presto! The red circle became black.
It is this kind of "symbol pushing" that characterizes symbolic programming. A great majority of Mathematica programming is of this nature.
Functional vs. Symbolic
I won't address the differences between symbolic and functional programming in detail, but I will contribute a few remarks.
One could view symbolic programming as an answer to the question: "What would happen if I tried to model everything using only expression transformations?" Functional programming, by contrast, can be seen as an answer to: "What would happen if I tried to model everything using only functions?" Just like symbolic programming, functional programming makes it easy to quickly build up layers of abstractions. The example I gave here could easily be reproduced in, say, Haskell using a functional reactive animation approach. Functional programming is all about function composition, higher-order functions, combinators -- all the nifty things that you can do with functions.
Mathematica is clearly optimized for symbolic programming. It is possible to write code in functional style, but the functional features in Mathematica are really just a thin veneer over transformations (and a leaky abstraction at that, see the footnote below).
Haskell is clearly optimized for functional programming. It is possible to write code in symbolic style, but I would quibble that the syntactic representations of programs and data are quite distinct, making the experience suboptimal.
Concluding Remarks
In conclusion, I advocate that there is a distinction between functional programming (as epitomized by Haskell) and symbolic programming (as epitomized by Mathematica). I think that if one studies both, then one will learn substantially more than studying just one -- the ultimate test of distinctness.
Leaky Functional Abstraction in Mathematica?
Yup, leaky. Try this, for example:
f[x_] := g[Function[a, x]];
g[fn_] := Module[{h}, h[a_] := fn[a]; h[0]];
f[999]
Duly reported to, and acknowledged by, WRI. The response: avoid the use of Function[var, body] (Function[body] is okay).
You can think of Mathematica's symbolic programming as a search-and-replace system where you program by specifying search-and-replace rules.
For instance you could specify the following rule
area := Pi*radius^2;
Next time you use area, it'll be replaced with Pi*radius^2. Now, suppose you define a new rule
radius:=5
Now, whenever you use radius, it'll get rewritten into 5. If you evaluate area, it'll get rewritten into Pi*radius^2, which triggers the rewriting rule for radius, and you'll get Pi*5^2 as an intermediate result. This new form will trigger a built-in rewriting rule for the ^ operation, so the expression will get further rewritten into Pi*25. At this point rewriting stops because there are no applicable rules.
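For readers without Mathematica, the rewrite-until-nothing-changes loop can be imitated in a few lines of Python. This is only a toy model under my own assumptions: expressions are nested tuples with the head first, RULES plays the role of the two := definitions, and the single "built-in" rule handles integer powers. It is not how Mathematica's evaluator is implemented, and it collapses the radius substitution and the power step into one pass.

```python
# User-defined rewrite rules: symbol -> replacement expression.
RULES = {
    "area": ("Times", "Pi", ("Power", "radius", 2)),
    "radius": 5,
}


def rewrite_once(expr):
    """Apply one pass of rewriting to expr and return the result."""
    if isinstance(expr, str):
        # A bare symbol: replace it if a rule exists, else leave it alone.
        return RULES.get(expr, expr)
    if isinstance(expr, tuple):
        head, *args = expr
        args = [rewrite_once(a) for a in args]
        # Toy "built-in" rule: integer ^ integer evaluates to a number.
        if head == "Power" and all(isinstance(a, int) for a in args):
            return args[0] ** args[1]
        return (head, *args)
    return expr  # numbers are inert


def evaluate(expr):
    """Rewrite repeatedly until a fixed point is reached; return every step."""
    trace = [expr]
    while True:
        nxt = rewrite_once(trace[-1])
        if nxt == trace[-1]:
            return trace
        trace.append(nxt)


for step in evaluate("area"):
    print(step)
```

The last step, ("Times", "Pi", 25), stays as it is because no rule in this toy system matches it, just as Pi*25 stays symbolic in the description above.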
You can emulate functional programming by using your replacement rules as functions. For instance, if you want to define a function that adds, you could do
add[a_,b_]:=a+b
Now add[x,y] gets rewritten into x+y. If you want add to apply only to numeric a,b, you could instead do
add[a_?NumericQ, b_?NumericQ] := a + b
Now, add[2,3] gets rewritten into 2+3 using your rule and then into 5 using the built-in rule for +, whereas add[test1,test2] remains unchanged.
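The point that a non-matching expression is simply left alone, rather than raising an error, is easy to mimic. In this hypothetical Python sketch an expression is a tuple with the head first, and rewrite_add stands in for the add[a_?NumericQ, b_?NumericQ] rule; the names are mine, not Mathematica's.

```python
from numbers import Number


def rewrite_add(expr):
    """Rewrite ("add", a, b) to a + b only when both arguments are numeric."""
    if (
        isinstance(expr, tuple)
        and len(expr) == 3
        and expr[0] == "add"
        and all(isinstance(arg, Number) for arg in expr[1:])
    ):
        return expr[1] + expr[2]
    # The pattern did not match: the expression stays exactly as it was.
    return expr


print(rewrite_add(("add", 2, 3)))
print(rewrite_add(("add", "test1", "test2")))
```

An ordinary Python function would fail or misbehave on symbolic arguments; a rewrite rule just does not fire, which is what lets unevaluated expressions flow through a symbolic program as data.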
Here's an example of an interactive replacement rule
a := ChoiceDialog["Pick one", {1, 2, 3, 4}]
a+1
Here, a gets replaced with ChoiceDialog, which then gets replaced with the number the user chose on the dialog that popped up, which makes both quantities numeric and triggers the replacement rule for +. ChoiceDialog has a built-in replacement rule along the lines of "replace ChoiceDialog[some stuff] with the value of the button the user clicked".
Rules can be defined using conditions which themselves need to go through rule-rewriting in order to produce True or False. For instance, suppose you invented a new equation solving method, but you think it only works when the final result of your method is positive. You could define the following rule
solve[x + 5 == b_] := (result = b - 5; result /; result > 0)
Here, solve[x+5==20] gets replaced with 15, but solve[x + 5 == -20] is unchanged because there's no rule that applies. The condition that prevents this rule from applying is /;result>0. The evaluator essentially looks at the potential output of rule application to decide whether to go ahead with it.
Mathematica's evaluator greedily rewrites every pattern with one of the rules that apply for that symbol. Sometimes you want to have finer control, and in such cases you could define your own rules and apply them manually like this
myrules={area->Pi radius^2,radius->5}
area//.myrules
This will apply rules defined in myrules until result stops changing. This is pretty similar to the default evaluator, but now you could have several sets of rules and apply them selectively. A more advanced example shows how to make a Prolog-like evaluator that searches over sequences of rule applications.
One drawback of the current Mathematica version comes up when you need to use Mathematica's default evaluator (to make use of Integrate, Solve, etc.) and want to change the default sequence of evaluation. That is possible but complicated, and I like to think that some future implementation of symbolic programming will have a more elegant way of controlling the evaluation sequence.
As others here already mentioned, Mathematica does a lot of term rewriting. Haskell may not be the best comparison, though; Pure is a nice functional term-rewriting language (that should feel familiar to people with a Haskell background). Maybe reading their Wiki page on term rewriting will clear up a few things for you:
http://code.google.com/p/pure-lang/wiki/Rewriting
Mathematica uses term rewriting heavily. The language provides special syntax for various forms of rewriting, and special support for rules and strategies. The paradigm is not that "new" and of course it's not unique, but they're definitely on the bleeding edge of this "symbolic programming" thing, alongside other strong players such as Axiom.
As for the comparison to Haskell, well, you could do rewriting there, with a bit of help from the "scrap your boilerplate" library, but it's not nearly as easy as in the dynamically typed Mathematica.
Symbolic shouldn't be contrasted with functional; it should be contrasted with numerical programming. Consider as an example MATLAB vs Mathematica. Suppose I want the characteristic polynomial of a matrix. If I wanted to do that in Mathematica, I could get an identity matrix (I) and the matrix (A) itself into Mathematica, then do this:
Det[A-lambda*I]
And I would get the characteristic polynomial (never mind that there's probably a built-in characteristic polynomial function). In MATLAB, on the other hand, I couldn't do it with the base product, because base MATLAB is only good at calculating with finite-precision numbers, not with expressions that contain an arbitrary lambda (our symbol). What you'd have to do is buy the Symbolic Math Toolbox add-on, declare lambda as a symbol on its own line of code, and then write the expression out (whereupon it would convert your A matrix to a matrix of rational numbers rather than finite-precision decimals). While the performance difference would probably be unnoticeable for a small case like this, it would probably be much slower than Mathematica in relative terms.
So that's the difference: symbolic languages are interested in doing calculations with perfect accuracy (often using rational numbers as opposed to floating-point ones). Numerical programming languages, on the other hand, are very good at the vast majority of calculations you would need to do, and they tend to be faster at the numerical operations they're meant for (MATLAB is nearly unmatched in this regard among higher-level languages, excluding C++ and the like), but they are poor at symbolic operations.
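The exact-versus-finite-precision half of that contrast can be seen with nothing more than Python's standard fractions module. This small sketch is my own illustration, not Mathematica or MATLAB code: char_poly_2x2 is a hypothetical helper returning the coefficients of lambda^2 - trace(A)*lambda + det(A) for a 2x2 matrix, and it says nothing about relative speed.

```python
from fractions import Fraction


def char_poly_2x2(a):
    """Coefficients (c2, c1, c0) of det(A - lambda*I) for a 2x2 matrix A,
    i.e. lambda^2 - trace(A)*lambda + det(A)."""
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return (1, -trace, det)


# Finite-precision arithmetic: the familiar rounding surprise.
print(0.1 + 0.2 == 0.3)

# Exact rational arithmetic: no rounding at all.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))

exact = [[Fraction(1, 3), Fraction(1, 2)],
         [Fraction(1, 4), Fraction(1, 5)]]
print(char_poly_2x2(exact))
```

With Fraction entries the coefficients come out as exact rationals (-8/15 and -7/120 here); with float entries the same function would return rounded approximations. A symbolic system goes one step further and keeps lambda itself as an unevaluated symbol.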
