ANTLR lexer token not being used - python-3.x

I have this small example, trying to parse a key:value type string. Real examples may be more complex, but I essentially want a [a-zA-Z0-9] style string, then a colon, then whatever else is on that line as the value (not including the colon).
https://gist.github.com/nmz787/4888cfadf707a575de0662f8a3914ce0
Unfortunately it isn't working, the INTERMEDIATE lexer token is not being found... I just can't figure it out. This is a really simple example extracted from a more complex parser and lexer that I've been handed to work on adding more features to. So I hope it's sufficient for this forum.

Using ANTLR4 to parse a key/value store is pretty much over-the-top. All you would need is to split your input into individual lines. Then split each line at the colon, trim the resulting strings and there you have it. No need for a parser at all.
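A minimal sketch of that split-and-trim approach in Python (the question's tag). The function name and the sample input are made up for illustration, and the sketch assumes that lines without a colon can simply be skipped:

```python
def parse_key_values(text):
    """Split text into lines, then split each line at the first colon."""
    pairs = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # assumption: lines without a colon are ignored
        # partition() splits at the FIRST colon only, so any later colons
        # stay in the value.
        key, _, value = line.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


sample = "name: widget\nurl: http://example.com/a\n\nnot a pair\n"
result = parse_key_values(sample)
print(result)
```

Splitting at the first colon only matters as soon as a value (a URL, a timestamp) contains colons itself.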

Related

String literal in ANTLR4

I'm using antlr4 C++ runtime and I'd like to create a string literal in my lexer definition file. How can I do this?
What I have so far:
V_STRING : '"' ~('\\' | '"')* '"';
It doesn't work with
printf("string literal\n");
but works with
printf("string literal\\n");
I don't want to explicitly escape the new line character.
My assumption is that antlr interprets the new line character as a regular new line (when reading a file, for example).
Thanks in advance.
It's always a good idea to list out your token stream to see if your Lexer rules really do what you expect. (Look into the tokens option of the TestRig; also, some plugins will show you your tokens)
In your case your rule essentially says that a String is a " followed by 0 or more characters that are not a \ or a ", and then a ".
So, when the lexer encounters your \, it has finished matching the ~('\\' | '"')* part of the rule and then looks for a " (which it does not find, since the \ is followed by an n), so it won't recognize "string literal\n" as a V_STRING token. (It also fails to match "string literal\\n" here, so I'm not quite sure what's going on with the example that "works".)
try:
V_STRING: '"' ~["]* '"';
Note: this is a very simple String rule, but it accepts your input. You probably want to examine grammars for other languages to see how you might want to handle strings in your language; there are several approaches (and many of them involve using Lexer modes). You can find examples here.
If you want the "\n" to be treated as a newline, just understand that the parser won't do that for you; you'll just see the characters "\" and "n". It'll be up to you to handle decoding the escaped characters (and it's once you try to handle \" that it'll get more complicated and you'll need to look into Lexer modes).
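That decoding step can be sketched as a small post-processing function applied to the token text. It is written in Python here for brevity even though the question uses the C++ runtime. The function name and the set of supported escapes are assumptions, not part of ANTLR:

```python
# Hypothetical escape table; extend it to whatever your language allows.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def unescape(token_text):
    """Turn the raw text of a string token into the value it denotes."""
    body = token_text[1:-1]  # drop the surrounding double quotes
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # Unknown escapes are kept verbatim (an assumption of this sketch).
            out.append(ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


raw = r'"string literal\n"'  # the characters the lexer actually sees
decoded = unescape(raw)
```

Here the backslash and the n arrive as two separate characters in the token text, and the function maps them to a single newline.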

LabVIEW Search Multiple Strings

I am trying to search for multiple strings all together in a text log with the following pattern:
s(n)KEY: some data
s(n)Measurement: some data
s(n)Units: some data
Where s(n) is the number of spaces, which varies. KEY will change at every iteration in the loop as it comes from the .ini file. As an example, see the following snippet of the log:
WHITE On Axis Lum_010 OPTICAL_TEST_01 some.seq
WHITE On Axis Lum_010 Failed
Bezel1 Luminance-Light Source: Passed
Measurement: 148.41
Units: fc
WHITE On Axis Lum_010: Failed
Measurement: 197.5
Units: fL
In this case, I only want to detect when the key (WHITE On Axis Lum_010) appears along with Measurement and I don't want to detect if it appears anywhere else in the log. My ultimate goal is to get the measurement and unit data from file.
Any help will be greatly appreciated. Thank you, Rav.
I'd do it similar to Salome, using regular expressions. Since those are a little tricky, I have a test VI for them:
The RegEx is:
^\s{2}(.*?):\s*(\S*)\n\s*Measurement:\s*(\S*)\n\s*Units:\s*(\S*)
and means:
^ Find the beginning of a line
\s{2} followed by exactly two whitespace characters
(.*?) followed by as few characters as possible
: followed by a ':'
\s* followed by zero or more whitespace characters
(\S*) followed by zero or more non-whitespace characters
\n followed by a newline
\s* followed by zero or more whitespace characters
Measurement: followed by this string
\s* followed by zero or more whitespace characters
(\S*) followed by zero or more non-whitespace characters
\n followed by a newline
... and the same for 'Units'
The parentheses denote groups, and allow to easily collect the interesting parts of the string.
The RegEx string might need more tuning if the format of the data is not as expected, but this is a starting point.
To find more data in your string, put this in a while loop and use a shift register to feed the offset past match into the offset of the next iteration, and stop if it's =-1.
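Outside LabVIEW, the same expression can be tried out quickly. The sketch below uses Python's re module instead of LabVIEW's regular expression VI, and finditer plays the role of the while loop with the offset shift register. The indentation in the sample log is an assumption (two spaces before the key, four before Measurement and Units), since the whitespace in the posted snippet was lost; the function name is made up:

```python
import re

# The expression from above; MULTILINE lets ^ match at every line start.
PATTERN = re.compile(
    r"^\s{2}(.*?):\s*(\S*)\n\s*Measurement:\s*(\S*)\n\s*Units:\s*(\S*)",
    re.MULTILINE,
)

# Sample log with ASSUMED indentation.
LOG = (
    "WHITE On Axis Lum_010 OPTICAL_TEST_01 some.seq\n"
    "WHITE On Axis Lum_010 Failed\n"
    "  Bezel1 Luminance-Light Source: Passed\n"
    "    Measurement: 148.41\n"
    "    Units: fc\n"
    "  WHITE On Axis Lum_010: Failed\n"
    "    Measurement: 197.5\n"
    "    Units: fL\n"
)


def find_measurements(log, key):
    """Return (measurement, units) for every block whose key matches."""
    return [
        (m.group(3), m.group(4))
        for m in PATTERN.finditer(log)
        if m.group(1) == key
    ]


readings = find_measurements(LOG, "WHITE On Axis Lum_010")
print(readings)
```

The first two lines of the log also contain the key, but they are not followed by a Measurement line, so only the last block is reported.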
It's easier to search through and to implement.
LabVIEW also has VIs to create and manage JSONs.
Alternatively, you could use Regular Expressions in a while-loop to look if it exists in your log, maybe something like this:
WHITE On Axis Lum_010:(\s)*((Failed)|(Pass))\n(\s)+Measurement:(\s)*[0-9]*((\.)[0-9]*){0,1}\n(\s)*Units:\s*\w*
Then you can split the string or pick lines and take the information.
But I would not recommend that, as it is impractical to change and not useful if you want to use the code for other keys.
I hope it helps you :)

Antlr4 pass text back to parser from the lexer as a string not individual characters

I have a grammar that needs to process comments starting with '{*' and ending at '*}' at any point in the input stream. It also needs to process template markers, which start with '{' followed by a '$' or an identifier and end on a '}', and pass everything else through as text.
The only way to achieve this seems to be to pass anything that isn't a comment or a token back to the parser as individual characters and let the parser build the string. This is incredibly inefficient, as the parser has to build a node for every character that it receives and then I have to walk the nodes and build a string from them. It would be a lot simpler and faster if the lexer could just return the text as a large string.
On an i7, running the program as a 32-bit C# program on a 90K text file with no tokens or comments, just text, it takes about 15 minutes before it crashes with an out-of-memory exception.
The grammar basically is
Parser:
text: ANY_CHAR+;
Lexer:
COMMENT: '{*' .*? '*}' -> skip;
... Token Definitions .....
ANY_CHAR: [ -~];
If I try to accumulate the text in the lexer it swallows everything and doesn't recognize the comments or tokens because something like ANY_CHAR+ matches everything and returns comments and template markers in the string.
Does anybody know a way around this problem? At the moment it looks like I have to hand write a lexer.
Yes, that is inefficient, but also not the way to do it. The solution is completely in lexer.
I understood that you want to detect comments, template markers and text. For this, you should use lexer modes. Every time you hit "{" go into some lexer mode, say MODE1, where you can detect only "*" or "$" or (since I didn't understand what you meant by '{' followed by a '$' or and identifier) something else, and depending on what you hit go into MODE2 or MODE3. After that (MODE2 or MODE3) wait for '}' and switch back to default mode. Of course, it is possible to add even more modes in between, depending on what you want to do, but for what I've just written:
MODE1 is where you determine whether you are now detecting a comment or a template marker. There are only two tokens in this mode: '*' and everything else. If it's '*' go to MODE2, if anything else go to MODE3.
MODE2 there is only one token here that you need, and that is COMMENT, but you also need to detect '*}' or '}' (depending on how you want to handle it).
MODE3 similarly as MODE2 - detect what you need and have a token that will switch back to default mode.
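To make the control flow concrete, here is a hand-written sketch of the same idea in Python rather than ANTLR grammar syntax. It is not generated lexer code. The token names are made up, and it assumes that every '{' not followed by '*' opens a template marker and that an unterminated comment or marker runs to the end of the input:

```python
def tokenize(source):
    """Return a list of (kind, text) tokens; comments are skipped."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        if source.startswith("{*", i):
            # Comment mode: skip everything up to and including '*}'.
            end = source.find("*}", i + 2)
            i = n if end == -1 else end + 2
        elif source[i] == "{":
            # Marker mode: collect everything up to the closing '}'.
            end = source.find("}", i + 1)
            if end == -1:
                end = n
            tokens.append(("MARKER", source[i + 1:end]))
            i = min(end + 1, n)
        else:
            # Default mode: take the whole run of text up to the next '{'
            # as ONE token instead of one token per character.
            end = source.find("{", i)
            if end == -1:
                end = n
            tokens.append(("TEXT", source[i:end]))
            i = end
    return tokens


tokens = tokenize("Hello {$name}, {* note *}bye")
print(tokens)
```

The point is the last branch: text between braces reaches the parser as a single TEXT token, which is what the lexer modes achieve in the grammar.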

Replace character with a safe character and vice-versa

Here's my problem:
I need to store sentences "somewhere" (it doesn't matter where).
The sentences must not contain spaces.
When I extract the sentences from that "somewhere", I need to restore the spaces.
So, before storing the sentence "I am happy" I could replace the spaces with a safe character, such as &. In C#:
theString.Replace(' ', '&');
This would yield 'I&am&happy'.
And when retrieving the sentence, I would do the reverse:
theString.Replace('&', ' ');
But what if the original sentence already contains the '&' character?
Say I would do the same thing with the sentence 'I am happy & healthy'. With the design above, the string would come back as 'I am happy healthy', since the '&' char has been replaced with a space.
(Of course, I could change the & character to a more unlikely symbol, such as ¤, but I want this to be bullet proof)
I used to know how to solve this, but I forgot how.
Any ideas?
Thanks!
Fredrik
Maybe you can use url encoding (percent encoding) as an inspiration.
Characters that are not valid in a url are escaped by writing %XX where XX is a numeric code that represents the character. The % sign itself can also be escaped in the same way, so that way you never run into problems when translating it back to the original string.
There are probably other similar encodings, and for your own application you can use an & just as well as a %, but by using an existing encoding like this, you can probably also find existing functions to do the encoding and decoding for you.
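As an illustration of the idea (in Python rather than the question's C#), the standard library's urllib.parse.quote and unquote already implement percent encoding. The two wrapper function names are made up for this sketch:

```python
from urllib.parse import quote, unquote


def to_stored_form(sentence):
    """Percent-encode every character that is not a letter, digit or _.-~"""
    # safe="" means '/' is encoded too, so nothing is left to chance.
    return quote(sentence, safe="")


def from_stored_form(stored):
    """Reverse to_stored_form()."""
    return unquote(stored)


original = "I am happy & healthy"
stored = to_stored_form(original)
restored = from_stored_form(stored)
print(stored)
```

Because the escape character % is itself encoded (as %25), a sentence that already contains % or & survives the round trip unchanged.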

Why doesn't Vim's errorformat take regular expressions?

Vim's errorformat (for parsing compile/build errors) uses an arcane format from C for parsing errors.
Trying to set up an errorformat for nant seems almost impossible; I've tried for many hours and can't get it. I also see from my searches that a lot of people seem to be having the same problem. A regex to solve this would take minutes to write.
So why does vim still use this format? It's quite possible that the C parser is faster but that hardly seems relevant for something that happens once every few minutes at most. Is there a good reason or is it just an historical artifact?
It's not that Vim uses an arcane format from C. Rather it uses the ideas from scanf, which is a C function. This means that the string that matches the error message is made up of 3 parts:
whitespace
characters
conversion specifications
Whitespace is your tabs and spaces. Characters are the letters, numbers and other normal stuff. Conversion specifications are sequences that start with a '%' (percent) character. In scanf you would typically match an input string against %d or %f to convert to integers or floats. With Vim's error format, you are searching the input string (error message) for files, lines and other compiler specific information.
If you were using scanf to extract an integer from the string "99 bottles of beer", then you would use:
int i;
scanf("%d bottles of beer", &i); // i would be 99, string read from stdin
Now with Vim's error format it gets a bit trickier but it does try to match more complex patterns easily. Things like multiline error messages, file names, changing directory, etc, etc. One of the examples in the help for errorformat is useful:
Error 275
line 42
column 3
' ' expected after '--'
The appropriate error format string has to look like this:
:set efm=%EError\ %n,%Cline\ %l,%Ccolumn\ %c,%Z%m
Here %E tells Vim that it is the start of a multi-line error message. %n is an error number. %C is the continuation of a multi-line message, with %l being the line number, and %c the column number. %Z marks the end of the multiline message and %m matches the error message that would be shown in the status line. You need to escape spaces with backslashes, which adds a bit of extra weirdness.
While it might initially seem easier with a regex, this mini-language is specifically designed to help with matching compiler errors. It has a lot of shortcuts in there. I mean you don't have to think about things like matching multiple lines, multiple digits, matching path names (just use %f).
Another thought: How would you map numbers to mean line numbers, or strings to mean files or error messages if you were to use just a normal regexp? By group position? That might work, but it wouldn't be very flexible. Another way would be named capture groups, but then this syntax looks a lot like a short hand for that anyway. You can actually use regexp wildcards such as .* - in this language it is written %.%#.
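For comparison, this is roughly what the named-capture-group alternative would look like for the four-line error above. It is a sketch in Python's re syntax, not something Vim accepts, and the group names are simply chosen to mirror %n, %l, %c and %m:

```python
import re

# One named group per errorformat item: %n, %l, %c, %m.
ERROR_RE = re.compile(
    r"^Error (?P<nr>\d+)\n"
    r"line (?P<lnum>\d+)\n"
    r"column (?P<col>\d+)\n"
    r"(?P<text>.*)$",
    re.MULTILINE,
)

compiler_output = "Error 275\nline 42\ncolumn 3\n' ' expected after '--'\n"

match = ERROR_RE.search(compiler_output)
entry = {
    "nr": int(match.group("nr")),
    "lnum": int(match.group("lnum")),
    "col": int(match.group("col")),
    "text": match.group("text"),
}
print(entry)
```

Each named group lines up with one %-item, which is the sense in which the errorformat codes read like a shorthand for this.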
OK, so it is not perfect. But it's not impossible either and makes sense in its own way. Get stuck in, read the help and stop complaining! :-)
I would recommend writing a post-processing filter for your compiler that uses regular expressions or whatever, and outputs messages in a simple format that is easy to write an errorformat for. Why learn some new, baroque, single-purpose language unless you have to?
According to :help quickfix,
it is also possible to specify (nearly) any Vim supported regular
expression in format strings.
However, the documentation is confusing and I didn't put much time into verifying how well it works and how useful it is. You would still need to use the scanf-like codes to pull out file names, etc.
They are a pain to work with, but to be clear: you can use regular expressions (mostly).
From the docs:
Pattern matching
The scanf()-like "%*[]" notation is supported for backward-compatibility
with previous versions of Vim. However, it is also possible to specify
(nearly) any Vim supported regular expression in format strings.
Since meta characters of the regular expression language can be part of
ordinary matching strings or file names (and therefore internally have to
be escaped), meta symbols have to be written with leading '%':
%\  The single '\' character. Note that this has to be
    escaped ("%\\") in ":set errorformat=" definitions.
%.  The single '.' character.
%#  The single '*'(!) character.
%^  The single '^' character. Note that this is not
    useful, the pattern already matches start of line.
%$  The single '$' character. Note that this is not
    useful, the pattern already matches end of line.
%[  The single '[' character for a [] character range.
%~  The single '~' character.
When using character classes in expressions (see |/\i| for an overview),
terms containing the "\+" quantifier can be written in the scanf() "%*"
notation. Example: "%\\d%\\+" ("\d\+", "any number") is equivalent to "%*\\d".
Important note: The \(...\) grouping of sub-matches can not be used in format
specifications because it is reserved for internal conversions.
lol try looking at the actual vim source code sometime. It's a nest of C code so old and obscure you'll think you're on an archaeological dig.
As for why vim uses the C parser, there are plenty of good reasons starting with that it's pretty universal. But the real reason is that sometime in the past 20 years someone wrote it to use the C parser and it works. No one changes what works.
If it doesn't work for you the vim community will tell you to write your own. Stupid open source bastards.
