PEG rule fails at EOI - Rust

I am trying to use a PEG expression to parse a file.
My PEG expression is:
WHITESPACE = _{" "}
level = {ASCII_DIGIT*}
verb = {ASCII_ALPHA{,4}}
value = {ASCII_ALPHANUMERIC*}
structure = { level ~ verb ~ value }
file = { SOI ~ (structure? ~ NEWLINE)* ~ EOI }
I parse this text:
0 HEAD
1 VERB test
2 STOP
The file rule parses the text successfully only if there is an extra \n at the end of the text. If I remove the \n, parsing fails with 'expected EOI'. I understand that this happens because of my rule for file. I tried different rules for file and got an infinite loop, so I don't know how to solve this. I am using Rust and the latest pest.

This seems to work. It can also handle an arbitrary number of newlines at the beginning or end:
file = { SOI ~ NEWLINE* ~ structure ~ (NEWLINE ~ structure)* ~ NEWLINE* ~ EOI }
WHITESPACE = _{" "}
level = {ASCII_DIGIT+}
verb = {ASCII_ALPHA{1,4}}
value = {ASCII_ALPHANUMERIC*}
structure = { level ~ verb ~ value }

I changed the rules to
level = {ASCII_DIGIT+}
verb = {ASCII_ALPHA{1,4}}
file = { SOI ~ (structure? ~ NEWLINE)* ~ structure? ~ EOI }
and that seemed to work just fine, regardless of the trailing newline. But maybe I overlooked something. If you could edit your question to show the rules and input that caused an infinite loop with this, that'd be great.
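The difference between the original file rule and the one with a trailing structure? can be sketched outside pest. The following is only a rough Python approximation using regular expressions, not pest itself: the " *" parts imitate the implicit WHITESPACE rule, NEWLINE is reduced to "\n", and the names STRUCTURE, ORIGINAL_FILE, FIXED_FILE and accepts are made up for this illustration.

```python
import re

# Regex stand-in for `structure` with level = ASCII_DIGIT+ and
# verb = ASCII_ALPHA{1,4}; " *" imitates pest's implicit WHITESPACE skipping.
STRUCTURE = r"[0-9]+ *[A-Za-z]{1,4} *[A-Za-z0-9]*"

# file = { SOI ~ (structure? ~ NEWLINE)* ~ EOI }
ORIGINAL_FILE = re.compile(rf"(?:(?:{STRUCTURE})?\n)*")
# file = { SOI ~ (structure? ~ NEWLINE)* ~ structure? ~ EOI }
FIXED_FILE = re.compile(rf"(?:(?:{STRUCTURE})?\n)*(?:{STRUCTURE})?")

text = "0 HEAD\n1 VERB test\n2 STOP"


def accepts(rule, source):
    # fullmatch plays the role of SOI ... EOI: the whole input must be consumed.
    return rule.fullmatch(source) is not None


print(accepts(ORIGINAL_FILE, text))         # False: the last line has no NEWLINE
print(accepts(ORIGINAL_FILE, text + "\n"))  # True
print(accepts(FIXED_FILE, text))            # True
print(accepts(FIXED_FILE, text + "\n"))     # True
```

The original rule requires every line, including the last one, to end with NEWLINE, which is why EOI is never reached without the trailing "\n".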

WHITESPACE = _{ " " }
level = {ASCII_DIGIT+}
verb = {ASCII_ALPHA{,4}}
value = {ASCII_ALPHANUMERIC*}
stop = { level ~ "STOP" }
structure = { level ~ verb ~ value }
line = {structure | stop}
file = { SOI ~ (line ~ NEWLINE?)* ~ EOI }
Checked on https://pest.rs/

Related

Extraneous input error when using "lexer rule actions" and "lexer commands"

I'm seeing an "extraneous input" error with input "\aa a" and the following grammar:
Cool.g4
grammar Cool;
import Lex;
expr
: STR_CONST # str_const
;
Lex.g4
lexer grammar Lex;
@lexer::members {
public static boolean initial = true;
public static boolean inString = false;
public static boolean inStringEscape = false;
}
BEGINSTRING: '"' {initial}? {
inString = true;
initial = false;
System.out.println("Entering string");
} -> more;
INSTRINGSTARTESCAPE: '\\' {inString && !inStringEscape}? {
inStringEscape = true;
System.out.println("The next character will be escaped!");
} -> more;
INSTRINGAFTERESCAPE: ~[\n] {inString && inStringEscape}? {
inStringEscape = false;
System.out.println("Escaped a character.");
} -> more;
INSTRINGOTHER: (~[\n\\"])+ {inString && !inStringEscape}? {
System.out.println("Consumed some other characters in the string!");
} -> more;
STR_CONST: '"' {inString && !inStringEscape}? {
inString = false;
initial = true;
System.out.println("Exiting string");
};
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
ID: [a-z][_A-Za-z0-9]*;
Here's the output:
$ grun Cool expr -tree
"\aa a"
Entering string
The next character will be escaped!
Escaped a character.
Consumed some other characters in the string!
Exiting string
line 1:0 extraneous input '"\aa' expecting STR_CONST
(expr "\aa a")
Interestingly, if I remove the ID rule, antlr parses the input fine. Here's the output when I remove the ID rule:
$ grun Cool expr -tree
"\aa a"
Entering string
The next character will be escaped!
Escaped a character.
Consumed some other characters in the string!
Exiting string
(expr "\aa a")
Any idea what might be going on? Why does antlr throw an error when ID is one of the Lexer rules?
That's a surprisingly complex way to parse strings with escape sequences. Did you print the resulting tokens to see what your lexer produced?
I recommend a different (and much simpler) approach:
STR_CONST: '"' ('\\"' | .)*? '"';
Then, in your semantic phase, when you post-process the parse tree, examine the matched text to find escape sequences. Convert them to the real characters and print a good error message when you find an invalid escape sequence (something you cannot do when you match escape sequences in the lexer).
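As an illustration of that post-processing step, here is a minimal Python sketch. It is not ANTLR code; the function name unescape and the set of accepted escapes in ESCAPES are assumptions for this example, and a real language would define its own set.

```python
# Assumed set of valid escapes; the real language defines which ones are legal.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def unescape(token_text):
    """Convert the raw text of a matched string token into its real characters."""
    body = token_text[1:-1]  # strip the surrounding double quotes
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        nxt = body[i + 1]
        if nxt not in ESCAPES:
            # i + 1 is the offset of the backslash within the whole token
            raise ValueError(f"invalid escape sequence \\{nxt} at offset {i + 1}")
        out.append(ESCAPES[nxt])
        i += 2
    return "".join(out)


print(repr(unescape(r'"line\nbreak"')))
```

Because the check runs after lexing, the error message can name the offending sequence and its position instead of producing a generic lexer error.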
Copying the answer I received from @sharwell on GitHub.
"Your ID rule is unpredicated, so it matches aa following the \ (aa is longer than the a matched by INSTRINGAFTERESCAPE, so it's preferred even though it's later in the grammar). If you add a println to WS and ID you'll see the strange behavior in the output."

OCaml: Issue manipulating string read from file

I am trying to read a file, line by line in OCaml. Each line in the file represents a string I want to parse, in the correct format expected by the Parsing tool. I am saving each line in a list structure.
I am having trouble parsing the string contained in each element of the list. I am using OCamllex and Menhir as parsing tools.
If I try to use print_string to print the contents of the list at every element, I get the correct file contents.
If I try to pass a string that I defined within the program to the function, then I get the desired output.
However, if I try to parse the string which I have just read from the file, I get an error: Fatal error: exception Failure("lexing: empty token")
Note that all of this has been tested against the same string.
Here is a snippet of the code:
let parse_mon m = Parser.monitor Lexer.token (from_string m)
let parse_and_print (mon : string) =
  print_endline (print_monitor (parse_mon mon) 0)
let get_file_contents file =
  let m_list = ref [] in
  let read_contents = open_in file in
  try
    while true do
      m_list := input_line read_contents :: !m_list
    done;
    !m_list
  with End_of_file -> close_in read_contents; List.rev !m_list
let rec print_file_contents cont_list = match cont_list with
  | [] -> ()
  | m :: ms -> parse_and_print m
let pt = print_file_contents (get_file_contents filename)
Ocamllex throws an exception Failure "lexing: empty token" when a text in the stream doesn't match any scanner pattern. Therefore, you will need to match with "catch-all" patterns such as ., _, or eof.
{ }
rule scan = parse
| "hello" as w { print_string w; scan lexbuf }
(* need these two for catch-all *)
| _ as c { print_char c; scan lexbuf }
| eof { exit 0 }
Without seeing your grammar and file I can only offer a wild guess: could it be that the file contains an empty line at the end? Depending on the .mll, that might result in the error you see. The reason is that get_file_contents prepends each new line to the front of the list, and print_file_contents only looks at the head of that list.
I agree with kne: it is hard to say without seeing the file. What you can do is try to isolate the line that causes the trouble, like this:
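let rec print_file_contents cont_list =
  match cont_list with
  | [] -> ()
  | m :: ms ->
    (* report the offending line instead of aborting, then keep going *)
    (try parse_and_print m
     with Failure _ -> print_endline ("failed to parse: " ^ m));
    print_file_contents ms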
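The same isolation technique, sketched in Python for illustration only. The parser is passed in as a function; toy_parse is a hypothetical stand-in that rejects empty input and is not the asker's Menhir parser.

```python
def failing_lines(lines, parse):
    """Return (line_number, line, message) for every line that parse rejects."""
    failures = []
    for number, line in enumerate(lines, start=1):
        try:
            parse(line)
        except Exception as exc:  # a real parser would raise something narrower
            failures.append((number, line, str(exc)))
    return failures


def toy_parse(line):
    # Hypothetical stand-in for the real parser: it rejects empty input.
    if line == "":
        raise ValueError("lexing: empty token")
    return line


print(failing_lines(["a", "", "b"], toy_parse))
# [(2, '', 'lexing: empty token')]
```

Running every line through the parser and collecting the failures shows quickly whether only a trailing empty line is at fault.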

Fastest way to remove all characters between '/' in a String

I have .txt files inside a directory, and I want to change the names of those files when saving them.
For example: /root/user/workspace/DataSet/dataset/file0.txt
I have already solved the problem, but in an inefficient way:
int c = 0;        // number of '/' characters seen so far
String out = "";
for (int i = 0; i < img_n.length(); i++) {
    char a = img_n.charAt(i);
    if (a == '/') {
        c++;
    }
    if (c >= 6) {
        out += a;
    }
}
return out;
I knew that '/' would occur six times, so once c >= 6 I append each character to the new string.
Note that this is NOT about removing every '/' from the input string. If you look at my code closely, it also drops the characters between the '/' separators.
So the question is: you don't know how many times '/' occurs, but you also want to remove the characters between the '/' separators.
How can I do this in a more generic and efficient way?
How about this:
int ix = img_n.lastIndexOf('/');
out = ix < 0 ? img_n : img_n.substring(ix+1);
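For comparison, here is the same last-separator idea as a minimal Python sketch. Python is used only for illustration, and the name base_name is hypothetical. str.rfind returns -1 when no '/' is present, mirroring lastIndexOf.

```python
def base_name(path: str) -> str:
    # Index of the last '/', or -1 if the path contains none.
    ix = path.rfind("/")
    return path if ix < 0 else path[ix + 1:]


print(base_name("/root/user/workspace/DataSet/dataset/file0.txt"))  # file0.txt
```

This does a single scan from the end of the string and does not depend on how many separators the path contains.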

Remove function from file using sed or awk

I want to remove the function engine "map" { ... "foobar" ... }.
I have tried many approaches, but it is hard because the block contains empty lines and ends with '}', so simple delimiters don't work.
mainfunc {
  var = "baz"
  engine "map" {
    func {
      var0 = "foo"
      border = { 1, 1, 1, 1 }
      var1 = "bar"
    }
  }
}
mainfunc {
  var = "baz"
  engine "map" {
    func {
      var0 = "foo"
      border = { 1, 1, 1, 1 }
      var1 = "foobar"
    }
  }
}
... # more functions like 'mainfunc'
I tried
sed '/engine/,/^\s\s}$/d' file
but that removes every engine block. I only need to remove the one containing "foobar". Maybe a pattern could match everything, even newlines, up to foobar, something like this:
sed '/engine(.*)foobar/,/^\s\s}$/d' file
Is it possible?
Try:
sed '/engine/{:a;N;/foobar/{N;N;d};/ }/b;ba}' filename
or:
awk '/engine/{c=1}c{b=b?b"\n"$0:$0;if(/{/)a++;if(/}/)a--;if(!a){if(b!~/foobar/)print b;c=0;b="";next}}!c' filename
I would simply count the opening and closing braces once engine "map" is matched. I cannot say whether this only works in gawk:
awk '
/^[ \t]*engine "map"/ {
    skip = 1   # skip is used as a boolean
    b = 0      # difference between open and close braces
}
skip {
    b += split($0, tmp, "{")   # adds (number of "{" in the line) + 1
    b -= split($0, tmp, "}")   # subtracts (number of "}" in the line) + 1
    # when open and close braces balance, the block has ended
    if (b == 0) {
        skip = 0
    }
    # do not print lines that belong to the block
    next
}
1   # print every other line
' file
split() returns the number of pieces, which is one more than the number of separator matches; the extra one cancels out when the two counts are subtracted.
split(string, array [, fieldsep [, seps ] ]) divides string into pieces separated by fieldsep and stores the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The third argument, fieldsep, is a regexp describing where to split string. If fieldsep is omitted, the value of FS is used. split() returns the number of elements created.
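The brace-counting idea can also be combined with buffering, so that only the engine "map" block containing "foobar" is dropped (the brace-counting awk script above skips every such block). The following is a minimal Python sketch, assuming braces are balanced and never appear inside quoted strings; the function name drop_blocks and its parameters are hypothetical.

```python
def drop_blocks(lines, start_marker='engine "map"', needle="foobar"):
    """Remove each brace-delimited block that starts at start_marker and contains needle."""
    out = []
    buf = []    # lines of the block currently being collected
    depth = 0   # open braces minus close braces inside the block
    for line in lines:
        if not buf and start_marker not in line:
            out.append(line)  # outside any candidate block: keep the line
            continue
        buf.append(line)
        depth += line.count("{") - line.count("}")
        if depth == 0:  # the block is closed: keep it unless it has the needle
            if not any(needle in buffered for buffered in buf):
                out.extend(buf)
            buf = []
    out.extend(buf)  # an unterminated block is kept unchanged
    return out


sample = """mainfunc {
  var = "baz"
  engine "map" {
    func {
      var1 = "bar"
    }
  }
}
mainfunc {
  var = "baz"
  engine "map" {
    func {
      var1 = "foobar"
    }
  }
}""".splitlines()

print("\n".join(drop_blocks(sample)))
```

Buffering costs a little memory per block, but it lets the decision wait until the whole block has been seen, which a streaming delete cannot do.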

ANTLR4 lexer rule with @init block

I have this lexer rule defined in my ANTLR v3 grammar file; it matches text in double quotes.
I need to convert it to ANTLR v4. The ANTLR compiler throws the error 'syntax error: mismatched input '@' expecting COLON while matching a lexer rule' (on the @init line). Can a lexer rule contain an @init block? How should this be rewritten?
DOUBLE_QUOTED_CHARACTERS
@init
{
int doubleQuoteMark = input.mark();
int semiColonPos = -1;
}
: ('"' WS* '"') => '"' WS* '"' { $channel = HIDDEN; }
{
RecognitionException re = new RecognitionException("Illegal empty quotes\"\"!", input);
reportError(re);
}
| '"' (options {greedy=false;}: ~('"'))+
('"'|';' { semiColonPos = input.index(); } ('\u0020'|'\t')* ('\n'|'\r'))
{
if (semiColonPos >= 0)
{
input.rewind(doubleQuoteMark);
RecognitionException re = new RecognitionException("Missing closing double quote!", input);
reportError(re);
input.consume();
}
else
{
setText(getText().substring(1, getText().length()-1));
}
}
;
Sample data:
" " -> throws error "Illegal empty quotes!";
"asd -> throws error "Missing closing double quote!"
"text" -> returns text (valid input, content of "...")
I think this is the right way to do this.
DOUBLE_QUOTED_CHARACTERS
:
{
int doubleQuoteMark = input.mark();
int semiColonPos = -1;
}
(
('"' WS* '"') => '"' WS* '"' { $channel = HIDDEN; }
{
RecognitionException re = new RecognitionException("Illegal empty quotes\"\"!", input);
reportError(re);
}
| '"' (options {greedy=false;}: ~('"'))+
('"'|';' { semiColonPos = input.index(); } ('\u0020'|'\t')* ('\n'|'\r'))
{
if (semiColonPos >= 0)
{
input.rewind(doubleQuoteMark);
RecognitionException re = new RecognitionException("Missing closing double quote!", input);
reportError(re);
input.consume();
}
else
{
setText(getText().substring(1, getText().length()-1));
}
}
)
;
There are some other errors in the rule above, such as the WS .. => ... predicate, but I am not correcting them as part of this answer, to keep things simple. I took a hint from here.
In case that link moves or becomes invalid, I am quoting the text as is:
Lexer actions can appear anywhere as of 4.2, not just at the end of the outermost alternative. The lexer executes the actions at the appropriate input position, according to the placement of the action within the rule. To execute a single action for a rule that has multiple alternatives, you can enclose the alts in parentheses and put the action afterwards:
END : ('endif'|'end') {System.out.println("found an end");} ;
The action conforms to the syntax of the target language. ANTLR copies the action’s contents into the generated code verbatim; there is no translation of expressions like $x.y as there is in parser actions.
Only actions within the outermost token rule are executed. In other words, if STRING calls ESC_CHAR and ESC_CHAR has an action, that action is not executed when the lexer starts matching in STRING.
I encountered this problem when my .g4 grammar imported a lexer file. Importing grammar files seems to trigger lots of undocumented shortcomings in ANTLR4, so ultimately I had to stop using import.
In my case, once I merged the lexer grammar into the parser grammar (one single .g4 file), my @init and @after parsing errors vanished. I should submit a test case and a bug report, at least to get this documented. I will update here once I do that.
I vaguely recall two or three issues with importing a lexer grammar into my parser that triggered undocumented behavior. Much of this is covered here on Stack Overflow.

Resources