I'm encountering a hurdle in parsing nested blocks.
Specifically, I would like to parse (a small, defined subset of) LaTeX, and I'm having an issue with properly parsing nested \begin{} and \end{} pairs:
\begin{document}
Some text.
\begin{quote}
A quote.
\end{quote}
\end{document}
The expected parse tree would be something along the lines of:
- env
- - expression
- - env
- - - expression
But what I get instead is that the second \begin block, \begin{quote}, gets parsed as a command statement rather than as a block statement:
- env
- - expression
- - command
- - expression
- - command
Here's the grammar:
// Top level rule is `document`.
document = {
    SOI ~
    (section? ~ newline_char)* ~ section? ~
    EOI
}
section = {
    env_stmt |
    cmd_stmt |
    expression
}
// Expression grammar
expression = { ( cmd_stmt | literal )* }
literal = @{ char+ }
char = @{ ASCII_ALPHANUMERIC | punctuation }
punctuation = {
    "," | "." | ";" | "(" | ")" | "[" | "]" | "|" | "<" | ">" | ":"
}
// Control Statement Grammar
cmd_stmt = { ctrl_character ~ name ~ cmd_stmt_opt? ~ "{" ~ expression ~ "}" }
cmd_stmt_opt = { "[" ~ name ~ "]" }
name = @{ ASCII_ALPHA+ }
COMMENT = _{ "%" ~ (!newline_char ~ ANY)* ~ newline_char }
WHITESPACE = _{ " " }
newline_char = _{"\n"}
ctrl_character = _{ "\\" }
// Environment Grammar
env_stmt = { env_begin ~ env_content ~ env_end }
env_content = { (section? ~ newline_char)* }
env_begin = @{ ctrl_character ~ "begin" ~ "{" ~ PUSH(name) ~ "}" }
env_end = @{ ctrl_character ~ "end" ~ "{" ~ PEEK ~ "}" }
I'm guessing it has something to do with PUSH/PEEK. When I remove them, only the inner block gets parsed as an environment, and the outer one is parsed as a control statement.
Could someone please point me towards what I'm doing wrong?
I am new to shell scripting. I want to distribute all the data of a file in a table format and redirect the output into another file.
I have the below input file, File.txt:
Fruit_label:1 Fruit_name:Apple
Color:Red
Type: S
No.of seeds:10
Color of seeds :brown
Fruit_label:2 fruit_name:Banana
Color:Yellow
Type:NS
I want it to look like this:
Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds
1 | apple | red | S | 10 | brown
2 | banana| yellow | NS
I want to read all the data line by line from the text file, make a header like fruit_label, fruit_name, color, type, no.of seeds, color of seeds, and then print all the assigned values in rows. The data above is different for different fruits; for example, the banana doesn't have seeds, so I want to keep its row value blank.
Can anyone help me here?
Another approach is a "Decorate & Process" approach. What is "Decorate & Process"? To Decorate is to take the text you have and decorate it with another separator to make field-splitting easier -- in your case your fields can contain embedded whitespace along with the ':' separator between the field names and values, and the inconsistent whitespace around ':' makes it a nightmare to process simply.
So instead of worrying about what the separator is, think about "What should the fields be?", then add a new separator between the fields (Decorate) and then Process with awk.
Here sed is used to Decorate your input with '|' as the separator (a second call eliminates the '|' after the last field), and then a simpler awk process is used to split() each field on ':' to obtain the field name and field value. The field value is simply printed, while the field names are stored in an array; when a duplicate field name is found, it is used as a "seen" marker to designate the change between records, e.g.:
sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' file |
sed 's/|$//' |
awk '
BEGIN { FS = "|" }
{
    for (i=1; i<=NF; i++) {
        if (split ($i, parts, /[[:blank:]]*:[[:blank:]]*/)) {
            if (! n || parts[1] in fldnames) {
                printf "%s %s", n ? "\n" : "", parts[2]
                delete fldnames
                n = 1
            }
            else
                printf " | %s", parts[2]
            fldnames[parts[1]]++
        }
    }
}
END { print "" }
'
Example Output
With your input in file you would have:
1 | Apple | Red | S | 10 | brown
2 | Banana | Yellow | NS
You will also see "Decorate-Sort-Undecorate" used to sort data on a new, non-existent column of values by "Decorating" your data with a new last field, sorting on that field, and then "Undecorating" to remove the additional field when the sorting is done. This allows sorting by data that may be the sum (or combination) of any two columns, etc.
Here is my solution. It is a New Year's gift; usually you have to demonstrate what you have tried so far and we help you, rather than do it for you.
Disclaimer: some guru will probably come up with a simpler awk version, but this works.
File script.awk:
# Remove space prefix
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
# Remove space suffix
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
# Remove both suffix and prefix spaces
function trim(s) { return rtrim(ltrim(s)); }
# Initialise or reset a fruit array
function array_init() {
    for (i = 0; i <= 6; ++i) {
        fruit[i] = ""
    }
}
# Print the content of the fruit
function array_print() {
    # Keep track of whether something was printed; if so, print a newline at
    # the end. This avoids printing a newline for an empty array.
    printedsomething = 0
    for (i = 0; i <= 6; ++i) {
        # Do not print if the content is empty
        if (fruit[i] != "") {
            printedsomething = 1
            if (i == 1) {
                # The first field must be further split, to remove "Fruit_name"
                # Split on the space
                split(fruit[i], temparr, / /)
                printf "%s", trim(temparr[1])
            }
            else {
                printf " | %s", trim(fruit[i])
            }
        }
    }
    if (printedsomething == 1) {
        print ""
    }
}
BEGIN {
    FS = ":"
    print "Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds"
    array_init()
}
/Fruit_label/ {
    array_print()
    array_init()
    fruit[1] = $2
    fruit[2] = $3
}
/Color:/ {
    fruit[3] = $2
}
/Type/ {
    fruit[4] = $2
}
/No.of seeds/ {
    fruit[5] = $2
}
/Color of seeds/ {
    fruit[6] = $2
}
END { array_print() }
To execute, call awk -f script.awk File.txt
awk processes a file line by line. So the idea is to store fruit information in an array.
Every time a line "Fruit_label:....." is found, print the current fruit and start a new one.
Since each line is read in sequence, you tell awk what to do with each line, based on a pattern.
The patterns are what is enclosed between the / / characters at the beginning of each section of code.
Difficulty: since the first line contains two pieces of information for every fruit, and I cut the lines on the ':' character, the Fruit_label field will include "Fruit_name".
I.e., the first line is cut like this: $1 = Fruit_label, $2 = 1 Fruit_name, $3 = Apple
This is why the array_print() function is so complicated.
The trim functions are there to remove spaces.
For example, for the Apple, "Type: S", when split on the ':', results in " S", which is trimmed to "S".
If it meets your requirements, please see https://stackoverflow.com/help/someone-answers to accept it.
From what I have read, the way I defined my 'expression' should provide me with the following:
This is my input:
xyz = a + b + c + d
This should be my output:
xyz = ( ( a + b ) + ( c + d) )
But instead I get:
xyz = ( a + (b + (c + d) ) )
I bet this has been solved before and I just wasn't able to find the solution.
statementList : s=statement sl=statementList #multipleStatementList
| s=statement #singleStatementList
;
statement : statementAssign
| statementIf
;
statementAssign : var=VAR ASSIGN expr=expression #overwriteStatementAssign
| var=VAR PLUS ASSIGN expr=expression #addStatementAssign
| var=VAR MINUS ASSIGN expr=expression #subStatementAssign
;
expression : BRACKET_OPEN expr=expression BRACKET_CLOSE #priorityExp
| left=expression operand=('*'|'/') right=expression #mulDivExp
| left=expression operand=('+'|'-') right=expression #addSubExp
| <assoc=right> left=expression POWER right=expression #powExp
| variable=VAR #varExp
| number=NUMBER #numExp
;
BRACKET_OPEN : '(' ;
BRACKET_CLOSE : ')' ;
ASTERISK : '*' ;
SLASH : '/' ;
PLUS : '+' ;
MINUS : '-' ;
POWER : '^' ;
MODULO : '%' ;
ASSIGN : '=' ;
NUMBER : [0-9]+ ;
VAR : [a-z][a-zA-Z0-9\-]* ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
If you want the expression
xyz = ( ( a + b ) + ( c + d) )
Then you'll need to control the binding of the + operator with parentheses, just as you've done in your expected output. Otherwise,
xyz = ( a + (b + (c + d) ) )
is the way the parser is going to parse it because all the + operators have the same precedence, and the parser continues parsing until it reaches the end of the expression.
It recursively applies
left=expression operand=('+'|'-') right=expression
until the expression is completed, and you get the grouping you got. So use those parentheses; that's what they're for if you want to force the order of expression evaluation. ;)
If you change your input to
xyz = a * b + c + d
you'll see what I mean about the precedence, because the multiplication rule appears before the addition rule -- and hence binds more tightly -- which is the mathematical convention (lacking parentheses to group terms).
You're doing it right and the parser is too. Just group what you want if you want a specific binding order.
I have this grammar section in a Happy parser, given on the official Happy site, but I need a deeper explanation of the meaning of the rules in braces. Here is the token definition:
%token
let { TokenLet }
in { TokenIn }
int { TokenInt $$ }
var { TokenVar $$ }
'=' { TokenEq }
'+' { TokenPlus }
'-' { TokenMinus }
'*' { TokenTimes }
'/' { TokenDiv }
'(' { TokenOB }
')' { TokenCB }
and here is the grammar section:
Exp : let var '=' Exp in Exp { Let $2 $4 $6 }
| Exp1 { Exp1 $1 }
Exp1 : Exp1 '+' Term { Plus $1 $3 }
| Exp1 '-' Term { Minus $1 $3 }
| Term { Term $1 }
Term : Term '*' Factor { Times $1 $3 }
| Term '/' Factor { Div $1 $3 }
| Factor { Factor $1 }
Factor
: int { Int $1 }
| var { Var $1 }
| '(' Exp ')' { Brack $2 }
What I understand is that the lexer, defined below in the file, should produce tokens only of the types defined, and the parse tree is then built using the grammar. But what exactly does "{ Let $2 $4 $6 }" mean? I know that $2 refers to the second rule argument and so on, but if someone could give me a "human-readable version" of the rules I would be really happy. Hope I've been clear.
Thanks in advance.
In the %token section, the left column is the token names used elsewhere in the grammar, and the right column is a pattern that can be used in a case statement. Where you see $$, Happy will substitute its own variable. So if the resulting parser is expecting an integer at some point, then Happy will have a case statement with a pattern including TokenInt v1234, where the v1234 bit is a variable name created by Happy.
The "Let" is the constructor for the grammar expression being recognised. If you look a little lower in the example page you will see
data Exp
= Let String Exp Exp
| Exp1 Exp1
deriving Show
So the Let constructor takes a string and two sub-expressions (of type 'Exp'). If you look at the grammar you can see that there are six elements in the let rule. The first is just the constant string "let". That is used by the generated parser to figure out that it's looking at a "let" clause, but the resulting parse tree doesn't need it, so $1 doesn't appear. Instead, the first argument to the Let constructor has to be the name of the variable being bound, which is the second item in the grammar rule; hence this is $2. The other things are the two sub-expressions, which are $4 and $6 by the same logic. Both of these can be arbitrarily complex expressions: Happy figures out where they begin and end and parses them by following the other rules for what constitutes an expression.
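As a concrete illustration, here is a rough sketch of the value such a parse would build. It assumes the remaining data declarations from the same example page (Exp1, Term, Factor and their constructors), so treat those names as assumptions if your copy differs:

-- Assumed types from the Happy example page:
-- data Exp1   = Plus Exp1 Term | Minus Exp1 Term | Term Term
-- data Term   = Times Term Factor | Div Term Factor | Factor Factor
-- data Factor = Int Int | Var String | Brack Exp

-- Parsing "let x = 1 in x + 2" would produce roughly:
example :: Exp
example = Let "x"                                    -- $2: the bound variable name
              (Exp1 (Term (Factor (Int 1))))         -- $4: the Exp after '='
              (Exp1 (Plus (Term (Factor (Var "x")))  -- $6: the Exp after 'in'
                          (Factor (Int 2))))

The string "x" is what the var token carried, and the two Exp arguments are the parsed sub-expressions.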
This line is one rule for creating (parsing) the production Exp:
Exp : let var '=' Exp in Exp { Let $2 $4 $6 }
It corresponds to the rule:
if you see "let" ($1)
followed by a variable name ($2)
followed by "=" ($3)
followed by an Exp ($4)
followed by "in" ($5)
followed by another Exp ($6)
then return the value Let $2 $4 $6. The $n parameters will be replaced with the values of each sub-production. So if this rule is matched, the Let function (which is probably some data constructor) will be called with:
the value of the var token as the first parameter,
the first Exp parsed ($4) as the second parameter
and the second parsed Exp ($6) as the third parameter.
I believe here the value of the var token is the variable name.
Are character classes supported in ANTLR 4 lexers? I saw some examples suggesting that this is OK:
LITERAL: [a-zA-z]+;
but what I found is that it matches the string "OR[", including the opening bracket. Using ranges worked:
LITERAL: ('a'..'z' | 'A'..'Z')+;
and only identified "OR" as the LITERAL. Here is an example:
grammar Test;
@members {
private void log(String msg) {
System.out.println(msg);
}
}
parse
: expr EOF
;
expr
: atom {log("atom(" + $atom.text + ")");}
| l=expr OR r=expr {log("IOR:left(" + $l.text + ") right(" + $r.text + "}");}
| (OR '[' la=atom ra=atom ']') {log("POR:left(" + $la.text + ") right(" + $ra.text + "}");}
;
atom
: LITERAL
;
OR : O R ;
LITERAL: [a-zA-z]+;
//LITERAL: ('a'..'z' | 'A'..'Z')+;
SPACE
: [ \t\r\n] -> skip
;
fragment O: ('o'|'O');
fragment R: ('r'|'R');
When given the input "OR [ cat dog ]" it parses correctly, but "OR[ cat dog ]" does not.
You can use character sets in ANTLR 4 lexers, but the ranges are case sensitive. You used [a-zA-z] where I believe you meant [a-zA-Z]. The range A-z covers every character between 'A' and 'z' in ASCII order, which includes '[' (and several other punctuation characters), so "OR[" still matches LITERAL.
Hello good programmers,
I have built the following grammar in Happy (Haskell):
P : program C {Prog $2}
E : int {Num $1}
| ident {Id $1}
| true {BoolConst True}
| false {BoolConst False}
| read {ReadInput}
| '(' E ')' {Parens $2}
| E '+' E { Add $1 $3 }
| E '-' E { Sub $1 $3 }
| E '*' E { Mult $1 $3 }
| E '/' E { Div $1 $3 }
| E '=' E { Eq $1 $3 }
| E '>' E { Gt $1 $3 }
| E '<' E { Lt $1 $3 }
C : '(' C ')' {$2}
| ident assign E {Assign $1 $3}
| if E then C else C {Cond $2 $4 $6}
| output E {OutputComm $2}
| while E do C {While $2 $4 }
| begin D ';' C end {Declare $2 $4}
| C ';' C {Seq $1 $3 }
D : D ';' D {DSeq $1 $3 }
| '(' D ')' {$2}
| var ident assign E {Var $2 $4}
Now, the while loop includes all the commands that follow after 'do'. How do I change this behavior? I've already tried %left, %right... :(
Invent two kinds of C, one that allows sequencing, and one that doesn't. Something like this, perhaps:
Cnoseq : '(' Cseq ')'
| ident assign E
| while E do Cnoseq
Cseq : Cnoseq ';' Cseq
| Cnoseq
Consider this code fragment:
while True do output 1 ; output 2
This could be parsed as either
(while True do output 1) ; (output 2)
or
while True do (output 1 ; output 2)
This sort of ambiguity is the source of the conflict.
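If it helps, a fuller version of the split might look something like the sketch below. It simply reuses the constructors from the grammar in the question, so treat the exact set of productions as an assumption rather than a drop-in replacement:

Cseq   : Cnoseq ';' Cseq              { Seq $1 $3 }
       | Cnoseq                       { $1 }

Cnoseq : '(' Cseq ')'                 { $2 }
       | ident assign E               { Assign $1 $3 }
       | if E then Cnoseq else Cnoseq { Cond $2 $4 $6 }
       | output E                     { OutputComm $2 }
       | while E do Cnoseq            { While $2 $4 }
       | begin D ';' Cseq end         { Declare $2 $4 }

With this split, the body of a while is a single (possibly parenthesised) command, so the fragment above can only be read as (while True do output 1) ; (output 2); sequencing inside the loop requires explicit parentheses. The top-level rule would then use Cseq, e.g. P : program Cseq { Prog $2 }.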
If you don't want to change the grammar as @DanielWagner suggested, you can use precedence rules to resolve the ambiguity. Exactly how depends on what the precedence should be, but it's probably something like this:
%right do
%right then else
%right ';'
Precedences are listed from low to high, so in this case the above example would be parsed in the latter manner. Just add the precedence rules below the tokens but before the %% line.
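For placement, a minimal sketch of the file layout (the token-to-constructor mappings shown here are invented for illustration, and the elided parts stand for whatever your grammar file already contains):

%token
  while   { TokenWhile }
  do      { TokenDo }
  then    { TokenThen }
  else    { TokenElse }
  ';'     { TokenSemi }
  -- ... the rest of your tokens ...

%right do
%right then else
%right ';'

%%

P : program C { Prog $2 }
-- ... the rest of your grammar rules ...

That is, the %right lines sit after the %token block and before the %% that starts the grammar rules.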