Chomsky hierarchy - examples with real languages - nlp

I'm trying to understand the four levels of the Chomsky hierarchy by using some real languages as models. He thought that all natural languages can be generated by a context-free grammar, but Shieber contradicted this theory by proving that languages such as Swiss German can only be generated by a context-sensitive grammar. Since Chomsky is from the US, I guess that the American language is an example of a context-free grammar. My questions are:
Are there languages which can be generated by regular grammars (type 3)?
Since recursively enumerable grammars (type 0) can generate all languages, why not use those? Are they too complicated and less linear?
What is the characteristic of Swiss German that makes it impossible to generate with a context-free grammar?

I don't think this is an appropriate question for StackOverflow, which is a site for programming questions. But I'll try to address it as best I can.
I don't believe Chomsky was ever under the impression that natural languages could be described with a Type 2 grammar. It is not impossible for noun-verb agreement (singular/plural) to be represented in a Type 2 grammar, because the number of cases is finite, but the grammar is awkward. But there are more complicated features of natural language, generally involving specific rules about how word order can be rearranged, which cannot be captured in a simple grammar. It was Chomsky's hope that a second level of analysis -- "transformational grammars" -- could usefully capture these rearrangement rules without making the grammar computationally intractable. That would require finding some systematization which fits between Type 1 and Type 2, because Type 1 grammars are not computationally tractable.
Since we do, in fact, correctly parse our own languages, it stands to reason that there must be some computational algorithm. But that line of reasoning might not actually be correct, because there is a limit to the complexity of a sentence which we can parse. Any finite language is regular (Type 3); only languages which have an unlimited number of potential sentences require more sophisticated grammars. So a large collection of finite patterns could suffice to understand natural language. These patterns might be a lot more sophisticated than regular expressions, but as long as each pattern only applies to a sentence of limited length, the pattern could be expressed mathematically as a regular expression. (The most obvious one is to just list all possible sentences as alternatives, which is a regular expression if the number of possible sentences is finite. But in many cases, that might be simplified into something more useful.)
As I understand it, modern attempts to deal with natural language using so-called "deep learning" are essentially based on pattern recognition through neural networks, although I haven't studied the field deeply and I'm sure that there are many complications I'm skipping over in that simple description.
Noam Chomsky is an American, but "American" is not a language (and if it were, it might well be Spanish, which is spoken by the majority of residents of the Americas). As far as I know, his first language is English, but he is not by any means unilingual, although I don't know how much Swiss German he speaks. Certainly, there have been criticisms over the years that his theories have an Indo-European bias. Certainly, I don't claim competence in Swiss German, despite having lived several years in Switzerland, but I did read Shieber's paper and some of the follow-ups and discussed them with colleagues who were native Swiss German speakers. (Opinions were divided.)
The basic issue has to do with morphological agreement in lists. As I mentioned earlier, many languages (all Indo-European languages, as far as I know) insist that the form of the verb agrees with the form of the subject, so that a singular subject requires a singular verb and a plural subject requires a plural verb. [Note 1]
In many languages, agreement is also required between adjectives and nouns, and this is not just agreement in number but also agreement in grammatical gender (if applicable). Also, many languages require agreement between the specific verb and the article or adjective of the object of the verb. [Note 2]
Simple agreement can be handled by a context-free (Type 2) grammar, but there is a huge restriction. To put it simply, a context-free grammar can only deal with parenthetic constructions. This can work even if there is more than one type of parenthesis, so a context-free grammar can insist that an [ be matched with a ] and not a ). But the grammar must have this "inside-out" form: the matching symbols must be in the reverse order to the symbols being matched.
One consequence of this is that there is a context-free grammar for palindromes -- sentences which read the same in both directions, which effectively means that they consist of a phrase followed by its reverse. But there is no context-free grammar for duplications: a language consisting of repeated phrases. In the palindrome, the matching words are in the reverse order to the matched words; in the duplicate, they are in the same order. Hence the difference.
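To make the contrast concrete, here is a minimal Python sketch (my own illustration, not from the original discussion). The palindrome check mirrors the context-free grammar S -> a S a | b S b | a | b | epsilon by stripping matching symbols from the outside in, while the copy-language check has to compare the two halves directly, in the same order, which no context-free grammar can do.

```python
def is_palindrome(s):
    """Recognize palindromes the way a CFG derives them: match the
    outermost pair of symbols, then recurse on what's inside."""
    if len(s) <= 1:
        return True
    return s[0] == s[-1] and is_palindrome(s[1:-1])

def is_duplicate(s):
    """Recognize the copy language {ww}: the matching symbols appear in
    the SAME order, so we must compare the halves position by position."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half] == s[half:]

print(is_palindrome("abba"))  # True
print(is_duplicate("abab"))   # True
```

Both functions are trivial to write in a general-purpose language; the point is that only the first corresponds to a context-free derivation.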
Agreement in natural languages mostly follows this pattern, and some of the exceptions can be dealt with by positing simple rules for reordering finite numbers of phrases -- Chomsky's transformational grammar. But Swiss German features at least one case where agreement is not parenthetic, but rather in the same order. [Note 3] This involves the feature of German in which many sentences are in the order Subject-Object-Verb, which can be extended to Subject Object Object Object... Verb Verb Verb... when the verbs have indirect objects. Shieber showed some examples in which object-verb agreement is ordered, even when there are intervening phrases.
In the general case, such "cross-serial agreement" cannot be expressed in a context-free grammar. But there is a huge underlying assumption: that the length of the agreeing series be effectively unlimited. If, on the other hand, there are a finite number of patterns actually in common use, the "deep learning" model referred to above would certainly be able to handle it.
(I want to say that I'm not endorsing deep learning here. In fact, the way "artificial intelligence" is "trained" involves the use of trainers whose cultural biases may well not be sufficiently understood. This could easily lead to the same unfortunate consequences alluded to in my first footnote.)
Notes
This is not the case in many native American languages, as Whorf pointed out. In those languages, using a singular verb with a plural noun implies that the action was taken collectively, while using a plural verb would imply that the action was taken separately. Roughly transcribed to English, "The dogs run" would be about a bunch of dogs independently running in different directions, whereas "The dogs runs" would be about a single pack of dogs all running together. Some European "teachers" who imposed their own linguistic prejudices on native languages failed to correctly understand this distinction, and concluded that the native Americans must be too primitive to even speak their own language "correctly"; to "correct" this "deficiency", they attempted to eliminate the distinction from the language, in some cases with success.
These rules, not present in English, are one of the reasons some English speakers are tortured by learning German. I speak from personal experience.
Ordered agreement, as opposed to parenthetic agreement, is known as cross-serial dependency.


How are computer languages made using automata theory concepts?

I tried really hard to find an answer to this question on Google.
But I wonder how these high-level programming languages are created using the principles of automata, or whether automata theory is not involved in defining the languages at all.
Language design tends to have two important levels:
Lexical analysis - the definition of what tokens look like. What is a string literal, what is a number, what are valid names for variables, functions, etc.
Syntactic analysis - the definition of how tokens work together to make meaningful statements. Can you assign a value to a literal, what does a block look like, what does an if statement look like, etc.
The lexical analysis is done using regular languages, and generally tokens are defined using regular expressions. It's not that a DFA is used (most regex implementations are not DFAs in practice), but that regular expressions tend to line up well with what most languages consider tokens. If, for example, you wanted a language where all variable names had to be palindromes, then your language's token specification would have to be context-free instead.
The input to the lexing stage is the raw characters of the source code. The alphabet would therefore be ASCII or Unicode or whatever input your compiler is expecting. The output is a stream of tokens with metadata, such as string-literal (value: hello world) which might represent "hello world" in the source code.
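As a sketch of what such a lexer might look like (a toy illustration of the idea; real lexer generators like lex/flex compile these same patterns into automata, and the token names here are my own choices), each token class is a regular expression and the raw characters become a stream of (kind, value) pairs:

```python
import re

# Each token class is defined by a regular expression, as is typical.
TOKEN_SPEC = [
    ("STRING", r'"[^"]*"'),        # a string literal (no escapes, for brevity)
    ("NUMBER", r'\d+'),            # an integer literal
    ("NAME",   r'[A-Za-z_]\w*'),   # identifiers: variables, functions, etc.
    ("OP",     r'[=+\-*/]'),       # single-character operators
    ("SKIP",   r'\s+'),            # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Turn raw source characters into a stream of (kind, value) tokens."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(source)
            if m.lastgroup != "SKIP"]

print(tokenize('greeting = "hello world"'))
# [('NAME', 'greeting'), ('OP', '='), ('STRING', '"hello world"')]
```

The output of this stage is exactly the token stream with metadata described above, ready to be handed to the parser.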
The syntactic analysis is typically done using a subset of context-free languages handled by LL or LR parsers. This is because the machines that recognize general CFGs (PDAs) are nondeterministic. LL and LR parsing are ways to make deterministic decisions about how to parse a given expression.
We use CFGs for code because this is the level on the Chomsky hierarchy where nesting occurs (where you can express the idea of "depth", such as with an if within an if). Higher or lower levels on the hierarchy are possible, but a regular syntax would not be able to express nesting easily, and context-sensitive syntax would probably cause confusion (but it's not unheard of).
The input to the syntactic analysis step is the token stream, and the output is some form of executable structure, typically a parse tree that is either executed immediately (as in interpreted languages) or stored for later optimization and/or execution (as in compiled languages) or something else (as in intermediate-compiled languages like Java). The alphabet of the CFG is therefore the possible tokens specified by the lexical analysis step.
So this whole thing is a long-winded way of saying that it's not so much the automata theory that's important, but rather the formal languages. We typically want to have the simplest language class that meets our needs. That typically means regular tokens and context-free syntax, but not always.
The implementation of a regular expression need not be an automaton, and the implementation of a CFG cannot directly be a PDA, because PDAs are nondeterministic in general, so we define deterministic parsers on reasonable subsets of the CFG class instead.
More generally we talk about Theory of computation.
What has happened through the history of programming languages is that it has been formally proven that higher-level constructs are equivalent to the constructs in the abstract machines of the theory.
We prefer the higher-level constructs in modern languages because they make programs easier to write, and easier to understand by other people. That in turn leads to easier peer review and teamwork, and thus better programs with fewer bugs.
The Wikipedia article about Structured programming tells part of the history.
As to Automata theory, it is still present in the implementation of regular expression engines, and in most programming situations in which a good solution consists in transitioning through a set of possible states.

Differences between lexical features and orthographic features in NLP?

Features are used for model training and testing. What are the differences between lexical features and orthographic features in Natural Language Processing? Examples preferred.
I am not aware of such a distinction, and most of the time when people talk about lexical features they mean using the word itself, in contrast to using only other features, e.g. its part of speech.
Here is an example of a paper that means "whole-word orthography" when it says lexical features.
One could venture that orthographic could mean something more abstract than the sequence of characters themselves, for example whether the sequence is capitalized / titlecased / camelcased / etc. But we already have the useful and clearly understood shape feature denomination for that.
As such, I would recommend distinguishing features like this:
lexical features:
whole word, prefix/suffix (various lengths possible), stemmed word, lemmatized word
shape features:
uppercase, titlecase, camelcase, lowercase
grammatical and syntactic features:
POS, part of a noun-phrase, head of a verb phrase, complement of a prepositional phrase, etc...
This is not an exhaustive list of possible features and feature categories, but it might help you categorizing linguistic features in a clearer and more widely-accepted way.
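A sketch of how these categories might look as feature-extraction code (the feature names and the `shape` encoding are my own illustrative choices, not a standard API):

```python
def features(token):
    """Illustrative lexical and shape features for a single token."""
    return {
        # lexical features: the word itself and its affixes
        "word": token.lower(),
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        # shape features: abstract away from the specific characters
        "is_upper": token.isupper(),
        "is_title": token.istitle(),
        "shape": "".join("X" if c.isupper() else
                         "x" if c.islower() else
                         "d" if c.isdigit() else c
                         for c in token),
    }

print(features("Singapore"))
# {'word': 'singapore', 'prefix3': 'sin', 'suffix3': 'ore',
#  'is_upper': False, 'is_title': True, 'shape': 'Xxxxxxxxx'}
```

Grammatical and syntactic features (POS tags, phrase membership) would come from a tagger or parser rather than from the token's surface form, so they are omitted here.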

Why do I need a tokenizer for each language? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
When processing text, why would one need a tokenizer specialized for the language?
Wouldn't tokenizing by whitespace be enough? In which cases is it not a good idea to simply use whitespace tokenization?
Tokenization is the identification of linguistically meaningful units (LMU) from the surface text.
Chinese: 如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。
English: If you only have time for one club in Singapore, then it simply has to be Zouk.
Indonesian: Jika Anda hanya memiliki waktu untuk satu klub di Singapura, pergilah ke Zouk.
Japanese: シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。
Korean: 싱가포르에서 클럽 한 군데밖에 갈시간이 없다면, Zouk를 선택하세요.
Vietnamese: Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore thì hãy đến Zouk.
Text Source: http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf
Tokenization of the parallel text above differs by language:
For English, it's simple because each LMU is delimited/separated by whitespace. However, in other languages this might not be the case. For most romanized languages, such as Indonesian, the same whitespace delimiter easily identifies the LMUs.
However, sometimes an LMU is a combination of two "words" separated by spaces. E.g. in the Vietnamese sentence above, you have to read thời_gian (it means time in English) as one token and not 2 tokens. Separating the two words into 2 tokens yields no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or wrong LMU(s) (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as one token rather than thời and gian.
For some other languages, their orthographies might have no spaces to delimit "words" or "tokens", e.g. Chinese, Japanese and sometimes Korean. In that case, tokenization is necessary for a computer to identify LMUs. Often there are morphemes/inflections attached to an LMU, so sometimes a morphological analyzer is more useful than a tokenizer in Natural Language Processing.
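A quick demonstration of why whitespace splitting is not enough, using the English and Chinese sentences from the parallel text above:

```python
# Whitespace splitting works for English but not for Chinese, whose
# orthography puts no spaces between words: the whole clause comes
# back as a single "token".
english = "If you only have time for one club in Singapore"
chinese = "如果您在新加坡只能前往一间夜间娱乐场所"

print(english.split())  # ten word tokens
print(chinese.split())  # one giant "token": the entire clause
```

A proper Chinese tokenizer would instead segment the clause into its constituent LMUs, which requires a dictionary or a statistical model, not just a delimiter.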
Some languages, like Chinese, don't use whitespace to separate words at all.
Other languages will use punctuation differently - an apostrophe might or might not be a part of a word, for instance.
Case-folding rules vary from language to language.
Stopwords and stemming are different between languages (though I guess I'm straying from tokenizer to analyzer here).
Edit by Bjerva: Additionally, many languages concatenate compound nouns. Whether these should be tokenised into several tokens cannot be easily determined using only whitespace.
The question also implies "What is a word?" and can be quite task-specific (even disregarding multilinguality as one parameter). Here's my try of a subsuming answer:
(Missing) Spaces between words
Many languages do not put spaces in between words at all, and so the
basic word division algorithm of breaking on whitespace is of no use
at all. Such languages include major East-Asian languages/scripts,
such as Chinese, Japanese, and Thai. Ancient Greek was also written by
Ancient Greeks without word spaces. Spaces were introduced (together
with accent marks, etc.) by those who came afterwards. In such
languages, word segmentation is a much more major and challenging
task. (MANNI:1999, p. 129)
Compounds
German compound nouns are written as a single word, e.g.
"Kartellaufsichtsbehördenangestellter" (an employee at the "Anti-Trust
agency"), and compounds de facto are single words -- phonologically (cf. (MANNI:1999, p. 120)).
Their information-density, however, is high, and one may wish to
divide such a compound, or at least to be aware of the internal
structure of the word, and this becomes a limited word segmentation
task.(Ibidem)
There is also the special case of agglutinating languages; prepositions, possessive pronouns, ... 'attached' to the 'main' word; e.g. Finnish, Hungarian, Turkish in European domains.
Variant styles and codings
Variant coding of information of a certain semantic type E.g. local syntax for phone numbers, dates, ...:
[...] Even if one is not dealing with multilingual text, any
application dealing with text from different countries or written
according to different stylistic conventions has to be prepared to
deal with typographical differences. In particular, some items such as
phone numbers are clearly of one semantic sort, but can appear in many
formats. (MANNI:1999, p. 130)
Misc.
One major task is the disambiguation of periods (or interpunctuation in general) and other non-alpha(-numeric) symbols: if e.g. a period is part of the word, keep it that way, so we can distinguish Wash., an abbreviation for the state of Washington, from the capitalized form of the verb wash (MANNI:1999, p.129). Besides cases like this, handling contractions and hyphenation can also not be viewed as a cross-language standard case (even disregarding the missing whitespace-separator).
If one wants to handle multilingual contractions/"cliticons":
English: They're my father's cousins.
French: Montrez-le à l'agent!
German: Ich hab's ins Haus gebracht. (in's is still a valid variant)
Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems. For anyone who wants a general direction:
Kiss, Tibor and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), p. 485-525.
Palmer, D. and M. Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics, 23(2), p. 241-267.
Reynar, J. and A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. Proceedings of the Fifth Conference on Applied Natural Language Processing, p. 16-19.
References
(MANNI:1999) Manning Ch. D., H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.

The difference between Chomsky type 3 and Chomsky type 2 grammar

I'm having trouble articulating the difference between Chomsky type 2 (context free languages) and Chomsky type 3 (Regular languages).
Can someone out there give me an answer in plain English? I'm having trouble understanding the whole hierarchy thing.
A Type II grammar is a Type III grammar with a stack
A Type II grammar is basically a Type III grammar with nesting.
Type III grammar (Regular):
Use Case - CSV (Comma Separated Values)
Characteristics:
can be read using an FSM (Finite State Machine)
requires no intermediate storage
can be read with Regular Expressions
usually expressed using a 1D or 2D data structure
is flat, meaning no nesting or recursive properties
Ex:
this,is,,"an "" example",\r\n
"of, a",type,"III\n",grammar\r\n
As long as you can figure out all of the rules and edge cases for the above text you can parse CSV.
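To illustrate that CSV fields really are regular, here is a sketch that splits one line using only regular expressions (a simplified illustration of my own: it ignores embedded newlines inside quoted fields and other edge cases a production parser must handle):

```python
import re

# A CSV field is either a quoted string (with "" as an escaped quote)
# or a run of characters containing no comma or quote.  Both patterns
# are regular, so no stack is needed.
FIELD = re.compile(r'"(?:[^"]|"")*"|[^,\r\n]*')

def parse_csv_line(line):
    """Split one CSV line into raw fields using only regular expressions."""
    fields, pos = [], 0
    while True:
        m = FIELD.match(line, pos)
        fields.append(m.group(0))
        pos = m.end()
        if pos >= len(line) or line[pos] != ',':
            break
        pos += 1  # skip the separating comma
    return fields

print(parse_csv_line('this,is,,"an "" example"'))
# ['this', 'is', '', '"an "" example"']
```

Note that the per-field matching is done entirely by a regular expression; the surrounding loop is just the "state machine" stepping from field to field.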
Type II grammar (Context Free):
Use Case - HTML (Hyper Text Markup Language) or SGML in general
Characteristics:
can be read using a DPDA (Deterministic Pushdown Automata)
will require a stack for intermediate storage
may be expressed as an AST (Abstract Syntax Tree)
may contain nesting and/or recursive properties
A flat snippet of HTML like this could be matched by a regular grammar:
<h1>Useless Example</h1>
<p>Some stuff written here</p>
<p>Isn't this fun</p>
But now try parsing this using an FSM:
<body>
<div id=titlebar>
<h1>XHTML 1.0</h1>
<h2>W3C's failed attempt to enforce HTML as a context-free language</h2>
</div>
<p>Back when the web was still pretty boring, the W3C attempted to standardize away the quirkiness of HTML by introducing a strict specification</p>
<p>Unfortunately, everybody ignored it.</p>
</body>
See the difference? Imagine you were writing a parser, you could start on an open tag and finish on a closing tag but what happens when you encounter a second opening tag before reaching the closing tag?
It's simple: you push the first opening tag onto a stack and start parsing the second tag. Repeat this process for as many levels of nesting as exist, and if the syntax is well-structured, the stack can be unwound one layer at a time in the opposite order from which it was built.
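The push/pop idea can be sketched in a few lines of Python (an illustration only; it ignores void elements like <br>, comments, and attribute values containing >, all of which a real HTML parser must handle):

```python
import re

# Match an opening or closing tag, capturing the slash and the tag name.
TAG = re.compile(r'<(/?)(\w+)[^>]*>')

def check_nesting(html):
    """Verify that tags nest like parentheses, using the explicit stack
    that lifts us from a Type 3 (regular) to a Type 2 (context-free)
    recognizer."""
    stack = []
    for m in TAG.finditer(html):
        closing, name = m.group(1), m.group(2)
        if not closing:
            stack.append(name)           # push the opening tag
        elif not stack or stack.pop() != name:
            return False                 # mismatched or stray closing tag
    return not stack                     # True only if nothing is left open

print(check_nesting("<body><div><h1>hi</h1></div></body>"))  # True
print(check_nesting("<body><div></body></div>"))             # False
```

An FSM could not do this for arbitrary nesting depth, because it has no stack to remember which tags are still open.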
Due to the strict nature of 'pure' context-free languages, they're relatively rare unless they're generated by a program. JSON is a prime example.
The benefit of context-free languages is that, while very expressive, they're still relatively simple to parse.
But wait, didn't I just say HTML is context-free? Yep, if it is well-formed (i.e. XHTML).
While XHTML may be considered context-free, the looser-defined HTML would actually be considered Type 1 (i.e. context-sensitive). The reason is that when the parser reaches poorly structured code, it actually makes decisions about how to interpret the code based on the surrounding context. For example, if an element is missing its closing tag, the parser needs to determine where that element exists in the hierarchy before it can decide where the closing tag should be placed.
Other features that can make a context-free language context-sensitive include templates, imports, preprocessors, macros, etc.
In short, context-sensitive languages look a lot like context-free languages, but the elements of a context-sensitive language may be interpreted in different ways depending on the program state.
Disclaimer: I am not formally trained in CompSci so this answer may contain errors or assumptions. If you asked me the difference between a terminal and a non-terminal you'll earn yourself a blank stare. I learned this much by actually building a Type III (Regular) parser and by reading extensively about the rest.
The wikipedia page has a good picture and bullet points.
Roughly, the underlying machine that can describe a regular language does not need memory. It runs as a state machine (DFA/NFA) on the input. Regular languages can also be expressed with regular expressions.
A language with the "next" level of complexity added to it is a context free language. The underlying machine describing this kind of language will need some memory to be able to represent the languages that are context free and not regular. Note that adding memory to your machine makes it a little more powerful, so it can still express languages (e.g. regular languages) that didn't need the memory to begin with. The underlying machine is typically a push-down automaton.
Type 3 grammars consist of a series of states. They cannot express embedding. For example, a Type 3 grammar cannot require matching parentheses because it has no way to show that the parentheses should be "wrapped around" their contents. This is because, as Derek points out, a Type 3 grammar does not "remember" anything about the previous states that it passed through to get to the current state.
Type 2 grammars consist of a set of "productions" (you can think of them as patterns) that can have other productions embedded within them. Thus, they are recursively defined. A production can only be defined in terms of what it contains, and cannot "see" outside of itself; this is what makes the grammar context-free.

Algorithm for Negating Sentences

I was wondering if anyone was familiar with any attempts at algorithmic sentence negation.
For example, given a sentence like "This book is good" provide any number of alternative sentences meaning the opposite like "This book is not good" or even "This book is bad".
Obviously, accomplishing this with a high degree of accuracy would probably be beyond the scope of current NLP, but I'm sure there has been some work on the subject. If anybody knows of any work, care to point me to some papers?
While I'm not aware of any work that specifically looks at automatically generating negated sentences, I imagine a good place to start would be to read up on linguistics work in formal semantics and pragmatics. A good accessible introduction would be Steven C. Levinson's Pragmatics book.
One issue that I think you'll run into is that it can be very difficult to negate all the information that is conveyed by a sentence. For example, take:
John fixed the vase that he broke.
Even if you change this to John did not fix the vase that he broke, there is a presupposition that there is a vase and that John broke it.
Similarly, simply negating the sentence John stopped using drugs as John did not stop using drugs still conveys that John, at one point, used drugs. A more thorough negation would be John never used drugs.
Some existing natural language processing (NLP) work that you might want to look at is MacCartney and Manning 2007's Natural Logic for Textual Inference. In this paper they use George Lakoff's notion of Natural Logic and Sanchez Valencia's monotonicity calculus to create software that automatically determines whether one sentence entails another. You could probably use some of their techniques for detecting non-entailment to artificially construct negated and contradicting sentences.
I'd recommend checking out WordNet. You can use it to look up antonyms for a word, so you could conceivably replace "bad" with "not good" since bad is an antonym of good. NLTK has a simple Python interface to WordNet.
The naïve way of course, is to try to add "not" right after {am,are,is}. I have no idea how this will work in your setting though, it will probably only work with predicate-like sentences.
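That naive rule can be sketched as follows (an illustration only; as noted, it handles nothing beyond simple copular sentences, and the copula list is my own choice):

```python
# Insert "not" after the first copula found, per the naive approach above.
COPULAS = {"am", "is", "are", "was", "were"}

def negate(sentence):
    """Naively negate a simple predicate-like sentence."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in COPULAS:
            words.insert(i + 1, "not")
            break
    return " ".join(words) + "."

print(negate("This book is good."))  # This book is not good.
```

Anything without a copula (e.g. "John fixed the vase.") passes through unchanged, which shows exactly where the naive approach stops working.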
For simple sentences, parse looking for adverbs or adjectives given the English grammar rules and substitute an antonym if only one meaning exists. Otherwise use the correct English negation rule to negate the verb (i.e. is -> is not).
High level algorithm:
Look up each word for its type (noun, verb, adjective, adverb, conjunction, etc...)
Infer sentence structure from word-type sequences (your sentence was: article, noun, verb, adjective/adverb; this is known to be a simple sentence).
For simple sentences, choose one invertible word and invert it. Either by using an antonym, or negating the verb.
For more complex sentences, such as those with subordinate clauses, you will need to have more complex analysis, but for simple sentences, this shouldn't be infeasible.
There's a similar process for first-order logic. The usual algorithm is to map P to not P, and then perform valid translations to move the not somewhere convenient, e.g.:
Original: (not R(x) => exists(y) (O(y) and P(x, y)))
Negate it: not (not R(x) => exists(y) (O(y) and P(x, y)))
Rearrange: not (R(x) or exists(y) (O(y) and P(x, y)))
not R(x) and not exists(y) (O(y) and P(x, y))
not R(x) and forall(y) not (O(y) and P(x, y))
not R(x) and forall(y) (not O(y) or not P(x, y))
Performing the same on English you'd be negating "If it's not raining here, then there is some activity that is an outdoors activity and can be performed here" to "It is NOT the case that ..." and finally into "It's not raining and every possible activity is either not for outdoors or can't be performed here."
Natural language is a lot more complicated than first-order logic, of course... but if you can parse the sentence into something where the words "not", "and", "or", "exists" etc. can be identified, then you should be able to perform similar translations.
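The negation-pushing steps above can be sketched for the propositional fragment (quantifiers omitted for brevity; the nested-tuple representation of formulas is my own choice for illustration):

```python
# Formulas are nested tuples: ("not", f), ("and", f, g), ("or", f, g),
# or a bare atom string.  negate_formula pushes the negation inward
# using double negation and De Morgan's laws, mirroring the
# rearrangement steps shown above.
def negate_formula(f):
    if isinstance(f, str):
        return ("not", f)                 # negate an atom directly
    op = f[0]
    if op == "not":
        return f[1]                       # double negation: not not P -> P
    if op == "and":                       # De Morgan: not(P and Q)
        return ("or", negate_formula(f[1]), negate_formula(f[2]))
    if op == "or":                        # De Morgan: not(P or Q)
        return ("and", negate_formula(f[1]), negate_formula(f[2]))

print(negate_formula(("or", "R", ("and", "O", "P"))))
# ('and', ('not', 'R'), ('or', ('not', 'O'), ('not', 'P')))
```

Extending this to quantifiers would add two more cases (not exists -> forall not, not forall -> exists not), exactly as in the worked example above.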
For a rule-based negation approach, you can take a look at the Python module negate1.
1 Disclaimer: I am the author of the module.
As for some papers related to the topic, you can take a look at:
Understanding by Understanding Not: Modeling Negation in Language Models
An Analysis of Natural Language Inference Benchmarks through the Lens of Negation
Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation
Nice demos using NLTK - http://text-processing.com/demo and a short writeup - http://text-processing.com/demo/sentiment/.