Simplest nested block parser - lexer

I want to write a simple parser for a nested block syntax, just hierarchical plain-text. For example:
Some regular text.
This is outputted as-is, foo{but THIS
is inside a foo block}.
bar{
Blocks can be multi-line
and baz{nested}
}
What's the simplest way to do this? I've already written 2 working implementations, but they are overly complex. I tried full-text regex matching, and streaming char-by-char analysis.
I have to teach the workings of it to people, so simplicity is paramount. I don't want to introduce a dependency on Lex/Yacc or Flex/Bison (or rather PEG.js/Jison, since this is JavaScript).

The good choices probably boil down as follows:
Given your constraints, it's going to be recursive descent. That's a fine way to go even without constraints.
You can either parse char-by-char (traditional) or write a lexical layer that uses the local string library to scan for { and }. Either way, you might want to return three terminal symbols plus EOF: BLOCK_OF_TEXT, LEFT_BRACE, and RIGHT_BRACE.

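// Char-by-char recursive descent: returns true when the braces are balanced.
var EOF = null;
var c; // most recently read character, shared by both functions

// Assumed helper: `input` is an object such as { text: "...", pos: 0 }.
function getCharacter(input) {
  return input.pos < input.text.length ? input.text[input.pos++] : EOF;
}

function parseNestedBlocks(input) {
  if (!parseStreamContent(input)) return false;
  return c !== "}"; // a "}" at top level has no matching "{"
}

function parseStreamContent(input) {
  for (;;) {
    c = getCharacter(input);
    if (c === "}" || c === EOF) return true;
    if (c === "{") {
      if (!parseStreamContent(input)) return false;
      if (c !== "}") return false; // reached EOF inside a block
    }
  }
}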
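The lexical-layer variant mentioned above can be sketched like this. It is only a rough illustration in Python rather than JavaScript, and the names tokenize and parse are made up for the example. The lexer emits the three terminals (end of input is just the end of the token list), and the parser turns them into nested lists. Note that a block name such as foo in foo{...} simply ends up at the tail of the preceding text; splitting it off would be an extra step.

```python
import re

# One token is either a single brace or a maximal run of non-brace characters.
TOKEN_RE = re.compile(r"[{}]|[^{}]+")


def tokenize(text):
    """Split text into (kind, value) pairs using the three terminal symbols."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        value = match.group()
        if value == "{":
            tokens.append(("LEFT_BRACE", value))
        elif value == "}":
            tokens.append(("RIGHT_BRACE", value))
        else:
            tokens.append(("BLOCK_OF_TEXT", value))
    return tokens


def parse(tokens):
    """Return a nested list: strings for text, lists for brace blocks."""
    pos = 0

    def parse_content(depth):
        nonlocal pos
        items = []
        while pos < len(tokens):
            kind, value = tokens[pos]
            pos += 1
            if kind == "BLOCK_OF_TEXT":
                items.append(value)
            elif kind == "LEFT_BRACE":
                items.append(parse_content(depth + 1))
            else:  # RIGHT_BRACE closes the block we are currently inside
                if depth == 0:
                    raise SyntaxError("unmatched '}'")
                return items
        if depth > 0:
            raise SyntaxError("missing '}'")
        return items

    return parse_content(0)


tree = parse(tokenize("a{b{c}d}e"))
```

Here tree is a list whose second element is itself a list for the outer block, so walking the result is one more small recursive function.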

Recently, I've been using parser combinators for some projects in pure JavaScript. I pulled out the code into a separate project; you can find it here. This approach is similar to the recursive descent parsers that @DigitalRoss suggested, but with a clearer split between code that's specific to your parser and general parser-bookkeeping code.
A parser for your needs (if I understood your requirements correctly) would look something like this:
var open = literal("{"), // matches only '{'
close = literal("}"), // matches only '}'
normalChar = not1(alt(open, close)); // matches any char but '{' and '}'
var form = new Parser(function() {}); // forward declaration for mutual recursion
var block = node('block',
['open', open ],
['body', many0(form)],
['close', close ]);
form.parse = alt(normalChar, block).parse; // set 'form' to its actual value
var parser = many0(form);
and you'd use it like this:
// assuming 'parser' is the parser
var parseResult = parser.parse("abc{def{ghi{}oop}javascript}is great");
The parse result is a syntax tree.
In addition to backtracking, the library also helps you produce nice error messages and threads user state between parser calls. The latter two I've found very useful for generating brace error messages, reporting both the problem and the location of the offending brace tokens when: 1) there's an open brace but no close; 2) there's mismatched brace types -- i.e. (...] or {...); 3) a close brace without a matching open.
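To make the three error cases concrete, here is a small stand-alone sketch in Python. It is not the library's mechanism, just a plain stack scan, and the function name brace_errors and the message wording are invented for the example. It keeps the offset of every opening brace so each message can point at the offending tokens.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def brace_errors(text):
    """Return a list of human-readable brace errors with character offsets."""
    stack = []   # (opening character, offset) for every brace still open
    errors = []
    for offset, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((ch, offset))
        elif ch in CLOSERS:
            if not stack:
                # case 3: a close brace without a matching open
                errors.append(f"unmatched '{ch}' at {offset}")
                continue
            opener, opened_at = stack.pop()
            if PAIRS[opener] != ch:
                # case 2: mismatched brace types
                errors.append(
                    f"'{opener}' at {opened_at} closed by '{ch}' at {offset}"
                )
    # case 1: anything left on the stack was opened but never closed
    for opener, opened_at in stack:
        errors.append(f"'{opener}' at {opened_at} is never closed")
    return errors
```

In a combinator library the same bookkeeping would presumably live in the threaded user state rather than in a local stack.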

Related

antlr4 grammar to iteratively parse repeating things from a single InputStream

I have an InputStream that contains repeating chunks like this:
fld1:val1
fld2:val2

[A B C D]
[E F]

fld1:val3
fld2:val4

[M N]
[Q S T Y]

fld1:val5
...
I wish to construct a solution where I can parse the fld:val block, skip the blank-line separator, parse the "listy" part, stop parsing at the next blank line, and then reset the parser on the same open stream to process the next chunk. I was thinking I might be able to do this in my override of the base listener's exitListy callback by getting access to the parser and calling reset(). Ideally, this would end the call chain to ParseTree t = parser.parse() and let control return to the code immediately following parse(). I experimented with this and, somewhat predictably, got a null pointer exception here: org.antlr.v4.runtime.Parser.exitRule(Parser.java:639). I cannot change the format of the input stream, for example by inserting snip-here markers.
(Completely new answer based on comment)
Listeners operate on ParseTrees returned once a parse completes. In your case, it appears you'll be listening on an essentially unending stream and want data back periodically.
I'd highly recommend "The Definitive ANTLR 4 Reference" from Pragmatic Programmers.
There are two very pertinent sections:
"Making Things Happen During the Parse"
"Unbuffered Character and Token Streams"
For your grammar, try something akin to the following "rough draft" (this may not be reporting back exactly when you want, but hopefully gives you the idea to work with)
grammar Streaming;
@parser::members {
java.util.function.Consumer<MyData> consumer;
MyData myData = new MyData();
public StreamingParser(TokenStream input, java.util.function.Consumer<MyData> consumer) {
this(input);
this.consumer = consumer;
}
}
stream: (fldLine+ emptyLine listLine+ emptyLine)* EOF;
fldLine:
fld = ITEM COLON val = ITEM EOL {
// add data to MyDataObject
};
listLine:
O_BRACKET (items += ITEM)* C_BRACKET EOL {
// add data to MyDataObject
};
emptyLine:
EOL {
consumer.accept(myData);
// reset myData
};
O_BRACKET: '[';
C_BRACKET: ']';
EOL: '\n';
COLON: ':';
ITEM: [a-zA-Z][a-zA-Z0-9]*;
SPACE: ' ' -> skip;
This takes advantage of embedded actions that are described in the first section.
Then the second section describes how to use Unbuffered streams.
Something like this (untested; much lifted directly from the referenced book)
CharStream input = new UnbufferedCharStream(<your stream>);
StreamingLexer lex = new StreamingLexer(input);
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
StreamingParser parser = new StreamingParser(tokens,
// This lambda will handle data reported back when a blank line is encountered
myData -> handle(myData));
// You just want ANTLR reporting back periodically
// not building a giant parse tree
parser.setBuildParseTree(false);
parser.stream(); // won't return until you shut down the input stream
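If it helps to see the data flow without ANTLR in the way, here is a hand-rolled sketch in Python (not ANTLR, and not the grammar above; read_chunks is a made-up name). It does the same job as the embedded action: consume the stream line by line and hand back one record each time the blank line after a list block is reached, without buffering the whole input.

```python
import io


def read_chunks(stream):
    """Yield (fields, lists) for each chunk, reading the stream lazily."""
    fields, lists = {}, []
    for raw in stream:
        line = raw.strip()
        if not line:
            # A blank line after the list block ends the chunk; a blank line
            # after the field block only separates the two parts.
            if lists:
                yield fields, lists
                fields, lists = {}, []
            continue
        if line.startswith("["):
            lists.append(line.strip("[]").split())
        else:
            name, _, value = line.partition(":")
            fields[name] = value
    if fields or lists:
        yield fields, lists  # final chunk with no trailing blank line


sample = io.StringIO(
    "fld1:val1\nfld2:val2\n\n[A B C D]\n[E F]\n\nfld1:val3\n\n[M N]\n"
)
chunks = list(read_chunks(sample))
```

With a real socket or pipe you would iterate with a for loop instead of calling list(), so each chunk is handled as soon as it is complete.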

rewriting AST Action Translation to ANTLR4

I have a grammar file written in antlr2 syntax and need help understanding how to rewrite some of the parser rules in antlr4 syntax. I know antlr4 eliminated the need for building an AST, so I'm not sure what to do with the rules that are AST action translations. ANTLR Tree Construction explains some of the syntax and how to use the # construct, but I'm still unsure how to read these rules and rewrite them.
temp_root :
temp { #temp_root = #([ROOT, "root"], #temp_root); } EOF;
temp :
c:temp_content
{ #temp = #(#([FUNCTION_CALL, "template"], #template), c);
reparent((MyAST)#temp, (MyAST)#c); };
temp_content :
(foo | bar);
foo :
{
StringBuilder result = new StringBuilder("");
}
: (c:FOO! { result.append(c.getText()); } )+
{ #foo = #([TEMPLATE_STRING_LITERAL, result.toString()], #foo); };
bar :
BEGIN_BAR! expr END_BAR!
exception
catch [Exception x] {
bar_AST = handleException(x);
};
You cannot manipulate the produced parse tree (at least not with grammar code), so simply remove all the tree rewriting stuff (you may have to adjust consumer code if it relies on a specific tree structure). Also remove the exclamation marks (which denote a token that should not appear in the AST). A surprise is the c:FOO part. I can't remember ever having seen this, but judging from the following action code I guess it's a variable assignment and should be rewritten as c = FOO.

Short-circuiting in functional Groovy?

"When you've found the treasure, stop digging!"
I'm wanting to use more functional programming in Groovy, and thought rewriting the following method would be good training. It's harder than it looks because Groovy doesn't appear to build short-circuiting into its more functional features.
Here's an imperative function to do the job:
fullyQualifiedNames = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']
String shortestUniqueName(String nameToShorten) {
def currentLevel = 1
String shortName = ''
def separator = '/'
while (fullyQualifiedNames.findAll { fqName ->
shortName = nameToShorten.tokenize(separator)[-currentLevel..-1].join(separator)
fqName.endsWith(shortName)
}.size() > 1) {
++currentLevel
}
return shortName
}
println shortestUniqueName('a/b/c/d/e')
Result: c/d/e
It scans a list of fully-qualified filenames and returns the shortest unique form. There are potentially hundreds of fully-qualified names.
As soon as the method finds a short name with only one match, that short name is the right answer, and the iteration can stop. There's no need to scan the rest of the name or do any more expensive list searches.
But turning to a more functional flow in Groovy, neither return nor break can drop you out of the iteration:
return simply returns from the present iteration, not from the whole .each so it doesn't short-circuit.
break isn't allowed outside of a loop, and .each {} and .eachWithIndex {} are not considered loop constructs.
I can't use .find() instead of .findAll() because my program logic requires that I scan all elements of the list, not just stop at the first.
There are plenty of reasons not to use try..catch blocks, but the best I've read is from here:
Exceptions are basically non-local goto statements with all the
consequences of the latter. Using exceptions for flow control
violates the principle of least astonishment, make programs hard to read
(remember that programs are written for programmers first).
Some of the usual ways around this problem are detailed here including a solution based on a new flavour of .each. This is the closest to a solution I've found so far, but I need to use .eachWithIndex() for my use case (in progress.)
Here's my own poor attempt at a short-circuiting functional solution:
fullyQualifiedNames = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']
def shortestUniqueName(String nameToShorten) {
def found = ''
def final separator = '/'
def nameComponents = nameToShorten.tokenize(separator).reverse()
nameComponents.eachWithIndex { String _, int i ->
if (!found) {
def candidate = nameComponents[0..i].reverse().join(separator)
def matches = fullyQualifiedNames.findAll { String fqName ->
fqName.endsWith candidate
}
if (matches.size() == 1) {
found = candidate
}
}
}
return found
}
println shortestUniqueName('a/b/c/d/e')
Result: c/d/e
Please shoot me down if there is a more idiomatic way to short-circuit in Groovy that I haven't thought of. Thank you!
There's probably a cleaner looking (and easier to read) solution, but you can do this sort of thing:
String shortestUniqueName(String nameToShorten) {
// Split the name to shorten, and make a list of all sequential combinations of elements
nameToShorten.split('/').reverse().inject([]) { agg, l ->
if(agg) agg + [agg[-1] + l] else agg << [l]
}
// Starting with the smallest element
.find { elements ->
fullyQualifiedNames.findAll { name ->
name.endsWith(elements.reverse().join('/'))
}.size() == 1
}
?.reverse()
?.join('/')
?: ''
}
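For comparison, the same "lazy candidates plus find" shape can be sketched in Python, where generators give the short-circuit for free. This is only an illustration of the idea, not Groovy; shortest_unique_name takes the name list as a parameter here, which is a change from the code above.

```python
def shortest_unique_name(name, fully_qualified_names, separator="/"):
    """Return the shortest suffix of name that exactly one known name ends with."""
    parts = name.split(separator)
    # Candidate suffixes from shortest to longest, produced lazily.
    candidates = (
        separator.join(parts[-n:]) for n in range(1, len(parts) + 1)
    )
    # next() stops pulling candidates as soon as one is unique.
    return next(
        (
            candidate
            for candidate in candidates
            if sum(fq.endswith(candidate) for fq in fully_qualified_names) == 1
        ),
        "",
    )


names = ["a/b/c/d/e", "f/g/h/i/j", "f/g/h/d/e"]
short = shortest_unique_name("a/b/c/d/e", names)
```

Each candidate still needs a full scan of the name list, which matches the requirement in the question; only the outer search over suffix lengths stops early.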

Check if string is Alphanumeric

Is there a standard function in D to check if a string is alphanumeric? If not what'd be the most efficient way to do it? I'm guessing there are better ways than looping through the string and checking if the character is in between a range?
I don't think there's a single pre-made function for it, but you could compose two phobos functions (which imo is just as good!):
import std.algorithm, std.ascii;
bool good = all!isAlphaNum(your_string);
I think that does unnecessary utf decoding, so it wouldn't be maximally efficient but that's likely irrelevant for this anyway since the strings are surely short. But if that matters to you perhaps using .representation (from std.string iirc) or foreach(char c; your_string) isAlphaNum(c); yourself would be a bit faster.
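The decode-versus-raw-bytes distinction is not specific to D. As a rough illustration in Python (not D; both function names are made up), the first version works on decoded text and the second checks the raw bytes directly, which is roughly what going through .representation buys you. One difference to be aware of: both of these reject the empty string, whereas all!isAlphaNum accepts it.

```python
def is_ascii_alnum(text: str) -> bool:
    """True when text is non-empty and every character is in [A-Za-z0-9]."""
    return text.isascii() and text.isalnum()


def is_ascii_alnum_bytes(data: bytes) -> bool:
    """Same test on raw bytes, so no Unicode decoding is involved."""
    return len(data) > 0 and all(
        48 <= b <= 57 or 65 <= b <= 90 or 97 <= b <= 122 for b in data
    )
```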
I think Adam D. Ruppe's solution may be a better one, but this can also be done using regular expressions. You can view an explanation of the regular expression here.
import std.regex;
import std.stdio;
void main()
{
// Backticks represent raw strings (convenient for regexes)
// Run-time alternative: auto alnumRegex = regex(`^[A-Za-z0-9]+$`);
// Compile-time regexes are preferred
enum alnumRegex = ctRegex!(`^[A-Za-z0-9]+$`);
auto testString = "abc123";
auto matchResult = match(testString, alnumRegex);
if(matchResult)
{
writefln("Match(es) found: %s", matchResult);
}
else
{
writeln("Match not found");
}
}
Of course, this only works for ASCII as well.

What's the name of this programming feature?

In some dynamic languages I have seen this kind of syntax:
myValue = if (this.IsValidObject)
{
UpdateGraph();
UpdateCount();
this.Name;
}
else
{
Debug.Log (Exceptions.UninitializedObject);
3;
}
Basically being able to return the last statement in a branch as the return value for a variable, not necessarily only for method returns, but they could be achieved as well.
What's the name of this feature?
Can this also be achieved in statically typed languages such as C#? I know C# has the ternary operator, but I mean using if statements or switch statements as shown above.
It is called "conditional-branches-are-expressions" or "death to the statement/expression divide".
See Conditional If Expressions:
Many languages support if expressions, which are similar to if statements, but return a value as a result. Thus, they are true expressions (which evaluate to a value), not statements (which just perform an action).
That is, if (expr) { ... } is an expression (it could possibly be an expression or a statement depending upon context) in the language grammar, just as ?: is an expression in languages like C, C# or Java.
This form is common in functional programming languages (which eschew side-effects) -- however, it is not "functional programming" per se and exists in other languages that accept/allow a "functional-like syntax" while still utilizing heavy side-effects and other paradigms (e.g. Ruby).
Some languages like Perl allow this behavior to be simulated. That is, $x = eval { if (true) { "hello world!" } else { "goodbye" } }; print $x will display "hello world!" because the eval expression evaluates to the last value evaluated inside even though the if grammar production itself is not an expression. ($x = if ... is a syntax error in Perl).
Happy coding.
To answer your other question:
Can this also be achieved in statically typed languages such as C#?
Is it a thing the language supports? No. Can it be achieved? Kind of.
C# --like C++, Java, and all that ilk-- has expressions and statements. Statements, like if-then and switch-case, don't return values and therefore can't be used as expressions. Also, as a slight aside, your example assigns myValue to either a string or an integer, which C# can't do because it is strongly typed. You'd either have to use object myValue and then accept the casting and boxing costs, use var myValue (which is still statically typed, just inferred), or some other bizarre cleverness.
Anyway, so if if-then is a statement, how do you do that in C#? You'd have to build a method to accomplish the goal of if-then-else. You could use a static method as an extension to bools, to model the Smalltalk way of doing it:
public static T IfTrue<T>(this bool value, Func<T> doThen, Func<T> doElse)
{
    // Both delegates return T, so the call itself can be used as an expression.
    if (value)
        return doThen();
    else
        return doElse();
}
To use this, you'd do something like
var myVal = (6 < 7).IfTrue(() => "Less than", () => "Greater than");
Disclaimer: I tested none of that, so it may not quite work due to typos, but I think the principle is correct.
The new IfTrue() function checks the boolean it is attached to and executes one of two delegates passed into it. They must have the same return type, and neither accepts arguments (use closures, so it won't matter).
Now, should you do that? No, almost certainly not. It's not the proper C# way of doing things, so it's confusing, and it's much less efficient than using an if-then. You're trading off something like 1 IL instruction for a complex mess of classes and method calls that .NET will build behind the scenes to support that.
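The pattern itself (pass both branches as zero-argument functions, run only one, return its value) is language-independent. A minimal sketch in Python, with if_else, then_branch and else_branch as made-up names, shows that only the chosen branch is ever executed:

```python
def if_else(condition, do_then, do_else):
    """Call exactly one of the two zero-argument callables and return its result."""
    return do_then() if condition else do_else()


calls = []  # records which branch actually ran


def then_branch():
    calls.append("then")
    return "Less than"


def else_branch():
    calls.append("else")
    return "Greater than"


result = if_else(6 < 7, then_branch, else_branch)
```

Python happens to have a conditional expression built in (used inside if_else above), so the wrapper is only there to mirror the Smalltalk-style call.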
It is a ternary conditional.
In C you can use, for example:
printf("Debug? %s\n", debug?"yes":"no");
Edited:
A compound statement can be evaluated as an expression in GNU C (this is a GCC extension, not standard C). The last statement should be an expression, and the whole brace-delimited compound statement is wrapped in parentheses.
For example:
#include <stdio.h>
int main(void)
{
int a=0, b=1;
a=({
printf("testing compound statement\n");
if(b==a)
printf("equals\n");
b+1;
});
printf("a=%d\n", a);
return 0;
}
So the feature you are describing amounts to assigning the value of a compound statement to a (local) variable; GCC calls this a statement expression. I think this helps you a little bit more. For more, please visit this source:
http://www.chemie.fu-berlin.de/chemnet/use/info/gcc/gcc_8.html
Take care,
Beco.
PS. This example makes more sense in the context of your question:
a=({
int c;
if(b==a)
c=b+1;
else
c=a-1;
c;
});
In addition to returning the value of the last expression in a branch, it's likely (depending on the language) that myValue is being assigned to an anonymous function -- or in Smalltalk / Ruby, code blocks:
A block of code (an anonymous function) can be expressed as a literal value (which is an object, since all values are objects.)
In this case, since myValue is actually pointing to a function that gets invoked only when myValue is used, the language probably implements them as closures, which are originally a feature of functional languages.
Because closures are first-class functions with free variables, closures exist in C#. However, the implicit return does not occur; in C# they're simply anonymous delegates! Consider:
Func<Object> myValue = delegate()
{
if (this.IsValidObject)
{
UpdateGraph();
UpdateCount();
return this.Name;
}
else
{
Debug.Log (Exceptions.UninitializedObject);
return 3;
}
};
This can also be done in C# using lambda expressions:
Func<Object> myValue = () =>
{
if (this.IsValidObject) { ... }
else { ... }
};
I realize your question is asking about the implicit return value, but I am trying to illustrate that there is more than just "conditional branches are expressions" going on here.
Can this also be achieved in statically
typed languages?
Sure, the types of the involved expressions can be statically and strictly checked. There seems to be nothing dependent on dynamic typing in the "if-as-expression" approach.
For example, Haskell, a strongly and statically typed language with a rich type system:
$ ghci
Prelude> let x = if True then "a" else "b" in x
"a"
(The example uses a let binding to mirror the assignment in your question; the expression alone is enough to demonstrate the feature:
Prelude> if True then "a" else "b"
"a"
.)

Resources