I would like to illustrate the difference between those two parsers, but I am not sure whether my representation is correct and would love to get some input:
It's not the most common way to draw either kind of parse (constituency parses usually don't use arrows, and dependency parses usually don't use arrows to link words with POS tags), but the basic structures are okay. In the dependency parse, VDB (I think you mean VBD) should be linked to drove.
I'm implementing a compiler in Haskell that compiles a source language to an assembly-like target language.
For debugging purposes, a source map is needed to map each target-language instruction back to its corresponding source position (line and column).
I've searched extensively through material on compiler implementation, but none of it covers generating a source map.
Can anyone please point me in the right direction on how to generate a source map?
Code samples, books, etc. are all welcome. Haskell is preferred, but other languages are also fine.
Details depend on the compilation technique you're applying.
If you're doing it via a sequence of transforms over intermediate languages, as most sane compilers do these days, your options are the following:
Annotate all intermediate representation (IR) nodes with source location information. Introduce special nodes for preserving variable names (they'll all be gone after you do, say, an SSA transform, so you need to track their origins separately)
Inject tons of intrinsic function calls (see how it's done in LLVM IR) instead of annotating each node
Do a mixture of the above
The first option can even be done nearly automatically: if each transform preserves the source location of an original node in all nodes it creates from it, you only have to adjust some non-trivial annotations by hand.
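To make the first option concrete, here is a minimal Haskell sketch under assumptions of my own: a toy IR (Expr), a hypothetical SrcLoc record, and a trivial constant-folding pass, so the names are illustrative rather than anything standard. It only shows the idea of threading locations through a transform and collecting (instruction, location) pairs at emission time:

-- Toy IR where every node carries the source location it originated from.
-- All names here (SrcLoc, Expr, Instr, constantFold, emit) are made up for illustration.
data SrcLoc = SrcLoc { srcLine :: Int, srcCol :: Int } deriving (Show, Eq)

data Expr
  = Lit SrcLoc Int
  | Var SrcLoc String
  | Add SrcLoc Expr Expr
  deriving Show

-- A transform (constant folding) that propagates the location of the node it rewrites,
-- so later passes and the emitter still know where the result came from.
constantFold :: Expr -> Expr
constantFold (Add l a b) =
  case (constantFold a, constantFold b) of
    (Lit _ x, Lit _ y) -> Lit l (x + y)   -- the folded node keeps the Add's origin
    (a', b')           -> Add l a' b'
constantFold e = e

-- Emission pairs every target instruction with the location of the IR node
-- it came from; the collected pairs are the source map.
data Instr = Push Int | Load String | AddOp deriving Show

emit :: Expr -> [(Instr, SrcLoc)]
emit (Lit l n)   = [(Push n, l)]
emit (Var l x)   = [(Load x, l)]
emit (Add l a b) = emit a ++ emit b ++ [(AddOp, l)]

The boilerplate this creates in every constructor and every pattern match is exactly what the Haskell caveat below refers to.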
Also keep in mind that some optimisations may render your source location information meaningless. E.g., value numbering will collapse a number of similar expressions into one, probably preserving the source location of only one arbitrary origin. The same goes for rematerialisation.
With Haskell, the first approach will result in a lot of boilerplate in your ADT definitions and pattern matching, even if you sugar-coat it with something like Scrap Your Boilerplate (SYB), so I'd recommend the second approach, which is extensively documented and nicely demonstrated in LLVM IR.
Are there any tools available for producing a parse tree of Node.js code? The closest I can find is the Closure Compiler, but I don't think that will give me a usable tree for analysis.
Node.js is just JavaScript (aka ECMAScript). I recommend Esprima (http://esprima.org/), which supports the de-facto standard AST established by Mozilla.
Esprima generates very nice ASTs, but if you need the parse tree you must look elsewhere: Esprima only returns ASTs and the token sequence for a given piece of text. If you don't want to model the language yourself, you could use another tool such as ANTLR (see: https://stackoverflow.com/a/5982455/206543).
For the difference between ASTs and parse trees, look here: https://stackoverflow.com/a/9864571/206543
Say I'd like to find instances of the following expression while using the Java7 grammar:
FoobarClass.getInstanceOfType("Bazz");
Using a ParseTreeWalker and listening for exitExpression() calls seemed like a good place to start. What surprised me was the amount of manual traversal of the Java7Parser.ExpressionContext required to find expressions of this type.
What's the appropriate way to find matches for the above expression? At this point, using a regex in place of ANTLR4 yields simpler code, but that won't scale.
ANTLR 4 does not currently include a feature allowing you to write concrete or abstract syntax queries. We hope to add something in the future to help with this type of application.
I've needed to write a few pattern recognition features for ANTLR 4 parse trees. I implemented the predicate itself with relative success by extending BaseMyParserVisitor<Boolean> (the parser in this example is called MyParser).
I'm having trouble articulating the difference between Chomsky type 2 (context-free languages) and Chomsky type 3 (regular languages).
Can someone out there give me an answer in plain English? I'm having trouble understanding the whole hierarchy thing.
A Type II grammar is a Type III grammar with a stack
A Type II grammar is basically a Type III grammar with nesting.
Type III grammar (Regular):
Use Case - CSV (Comma Separated Values)
Characteristics:
can be read using an FSM (Finite State Machine)
requires no intermediate storage
can be read with Regular Expressions
usually expressed using a 1D or 2D data structure
is flat, meaning no nesting or recursive properties
Ex:
this,is,,"an "" example",\r\n
"of, a",type,"III\n",grammar\r\n
As long as you can figure out all of the rules and edge cases for the above text you can parse CSV.
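As a rough illustration of "can be read with a finite state machine, no intermediate storage", here is a hedged Haskell sketch of a line splitter whose only state is a single Boolean telling it whether it is currently inside quotes. The function name is my own, and it glosses over many CSV edge cases; it is not a full CSV parser:

-- Split one CSV line with a two-state machine: inside quotes or not.
-- No stack and no recursion over nested structure; the state is just a Bool.
splitCSVLine :: String -> [String]
splitCSVLine = go False ""
  where
    -- go inQuotes fieldSoFarReversed remainingInput
    go _     field []           = [reverse field]
    go False field (',':cs)     = reverse field : go False "" cs
    go False field ('"':cs)     = go True  field cs
    go True  field ('"':'"':cs) = go True  ('"':field) cs   -- escaped quote
    go True  field ('"':cs)     = go False field cs
    go state field (c:cs)       = go state (c:field) cs

For example, splitCSVLine "this,is,,\"an \"\" example\"" yields ["this","is","","an \" example"].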
Type II grammar (Context Free):
Use Case - HTML (Hyper Text Markup Language) or SGML in general
Characteristics:
can be read using a DPDA (Deterministic Pushdown Automaton)
will require a stack for intermediate storage
may be expressed as an AST (Abstract Syntax Tree)
may contain nesting and/or recursive properties
HTML like this could be expressed as a regular grammar:
<h1>Useless Example</h1>
<p>Some stuff written here</p>
<p>Isn't this fun</p>
But now try parsing this using an FSM:
<body>
<div id=titlebar>
<h1>XHTML 1.0</h1>
<h2>W3C's failed attempt to enforce HTML as a context-free language</h2>
</div>
<p>Back when the web was still pretty boring, the W3C attempted to standardize away the quirkiness of HTML by introducing a strict specification</p>
<p>Unfortunately, everybody ignored it.</p>
</body>
See the difference? Imagine you were writing a parser: you could start on an opening tag and finish on a closing tag, but what happens when you encounter a second opening tag before reaching the first closing tag?
It's simple: you push the first opening tag onto a stack and start parsing the second tag. Repeat this process for as many levels of nesting as exist, and if the syntax is well-structured, the stack can be unrolled one layer at a time in the opposite order from which it was built.
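To make the stack idea concrete, here is a hedged Haskell sketch; the Tag type and the idea of tokenising the tags first are assumptions of mine, and this is nowhere near a real HTML parser. It only checks whether a sequence of opening and closing tags nests properly:

-- Check nesting with an explicit stack of currently open tag names.
data Tag = Open String | Close String deriving Show

wellNested :: [Tag] -> Bool
wellNested = go []
  where
    go stack      (Open name  : ts) = go (name : stack) ts   -- push the new opening tag
    go (top:rest) (Close name : ts)
      | top == name                 = go rest ts             -- pop its matching closing tag
    go []         []                = True                   -- everything opened was closed
    go _          _                 = False                  -- mismatched or leftover tags

With it, [Open "body", Open "div", Open "h1", Close "h1", Close "div", Close "body"] is accepted, while swapping the last two Close tags is rejected.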
Due to the strict nature of 'pure' context-free languages, they're relatively rare unless they're generated by a program. JSON is a prime example.
The benefit of context-free languages is that, while very expressive, they're still relatively simple to parse.
But wait, didn't I just say HTML is context-free? Yep, as long as it is well-formed (i.e. XHTML).
While XHTML may be considered context-free, the looser-defined HTML would actually be considered Type I (i.e. context-sensitive). The reason is that when the parser reaches poorly structured code, it makes decisions about how to interpret that code based on the surrounding context. For example, if an element is missing its closing tag, the parser needs to determine where that element sits in the hierarchy before it can decide where the closing tag should be placed.
Other features that can make a context-free language context-sensitive include templates, imports, preprocessors, macros, etc.
In short, context-sensitive languages look a lot like context-free languages, but the elements of a context-sensitive language may be interpreted in different ways depending on the program state.
Disclaimer: I am not formally trained in CompSci, so this answer may contain errors or assumptions. If you asked me the difference between a terminal and a non-terminal, you'd earn yourself a blank stare. I learned this much by actually building a Type III (regular) parser and by reading extensively about the rest.
The Wikipedia page on the Chomsky hierarchy has a good picture and bullet points.
Roughly, the underlying machine that can describe a regular language does not need memory. It runs as a state machine (DFA/NFA) on the input. Regular languages can also be expressed with regular expressions.
A language with the "next" level of complexity added to it is a context-free language. The underlying machine describing this kind of language needs some memory to be able to represent the languages that are context-free but not regular. Note that adding memory to your machine makes it a little more powerful, so it can still express languages (e.g. regular languages) that didn't need the memory to begin with. The underlying machine is typically a push-down automaton.
Type 3 grammars consist of a series of states. They cannot express embedding. For example, a Type 3 grammar cannot require matching parentheses because it has no way to show that the parentheses should be "wrapped around" their contents. This is because, as Derek points out, a Type 3 grammar does not "remember" anything about the previous states that it passed through to get to the current state.
Type 2 grammars consist of a set of "productions" (you can think of them as patterns) that can have other productions embedded within them. Thus, they are recursively defined. A production can only be defined in terms of what it contains, and cannot "see" outside of itself; this is what makes the grammar context-free.
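As a small, hedged illustration of such recursive definition (again in Haskell, with made-up names), the matching-parentheses example above can be written by translating the context-free production S -> '(' S ')' S | ε directly into a recursive recogniser. A Type 3 grammar, which has no memory of what it has already opened, cannot express this pairing:

-- S -> '(' S ')' S | ε  written as a recursive-descent recogniser.
-- parseS returns the leftover input if S matched a prefix of it.
parseS :: String -> Maybe String
parseS ('(':rest) =
  case parseS rest of                  -- the inner S
    Just (')':rest') -> parseS rest'   -- the ')' and then the trailing S
    _                -> Nothing
parseS rest = Just rest                -- ε: match nothing, consume nothing

balanced :: String -> Bool
balanced s = parseS s == Just ""       -- S must consume the whole input

So balanced "(()())" is True, while balanced "(()" and balanced ")(" are False.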
I have been successfully using UIMA ConceptMapper with a dictionary that I built. I set the TokenAnnotation parameter to uima.tt.TokenAnnotation and the SpanFeatureStructure parameter to uima.tt.SentenceAnnotation (based on the reference example). These types, I believe, come from the OpenNLP parser. But I also do another parse using medkatp and would like to use its types. So far I have not figured out how to do that. If I change either of these two parameters, the whole thing fails, saying that it cannot find the type.
I've searched for hours on the net but have found no examples of ConceptMapper that use anything except these two types. Any suggestions are welcome.
These types are I believe coming from the OpenNLP parser.
These types are defined in the component descriptor (usually in the desc folder), under the tag <typeSystemDescription>.
But I also do another parse using medkatp and would like to use their types.
If you want to use other types (from other UIMA components, or your own types), you need to define them wherever you use them (meaning: in every component descriptor where you need it, under the tag <typeSystemDescription>).