parse program with tokens at fixed positions in a line - antlr4

I am new to Antlr and need to write a parser for a legacy assembly code that may have line numbers in fixed columns. Also, certain columns have significance - whether its a comment, continuation, etc. How can I detect these?
To give some examples:
000001 proc proc1
000002* comment
* comment without line numbers
continuation marker set ==> X
Arbitrary text as continuation
Thanks
xAn

I've faced something like this when programming an Antlr grammar to parse Cobol sources. Cobol have some characteristics like yours (fixed columns, column is significant, etc).
The only solution I've found for this problem: "pre-process" the input and turned it into some that Antlr could parse without problems!
Ex: In Cobol, an asterisk in column 7 indicates that the line is a comment line; I changed it (the asterisk itself) to ">>" and specified in my grammar that ">>" means that this line is a comment line.

Related

Technical difference between "line break" and "newline"?

Yesterday I added an answer to How to add a line break to text in UI5?. While trying to specify the question with more tags, I realized that there are two similar tags available on Stack Overflow: newline and line-breaks.
The differentiation between LF and CR is pretty clear. But what is the difference between the terms "newline" and "line break"? Aren't they synonym to each other?
The newline excerpt here on Stack Overflow says:
Newline refers to [...] a line break.
... while the disambiguation page of "Line break" from Wikipedia says:
Line break may refer to [...] newline.
And from that "Newline" page:
To denote a single line break, Unix programs use line feed [...] while most programs common to [...] Windows use carriage return+line feed.
Are there any technical standards or guidelines that make a clear distinction between those two terms in the software industry? Or can they be used interchangeably, having no technical difference?
My current assumption is that there is no difference: "line break" describes the result from either soft return (⇧ Shift+↵ Enter) or hard return (↵ Enter), whereas "newline" is a technical term for "line break" but has the same result.
I interpret line break as a semantic meaning. What you tell a typographer when you see a \n. C language "new line" means a line break. This is like A (0x41) in ASCII is the upper case of Latin letter a. So a name, a code, and a meaning.
It is like "SP" (\u0020) is a space character. There are many other space characters: C recognizes also TAB as space character. HTML recognize new line as space character (and not as a line break (but in <pre>).
CR and LF were defined by ASCII, and used together because typewritter usually worked in such manner, and to give some more time to move (they were connected serially, so with a timed inputs). You may find the scanned document of old ASCII about the meaning of control characters, but that changed a lot. \0 is now a end of string, before it was just a filling character (or to quit a sequence). Apple used CR, Unix LF as what typographers used as line break.
If you want a list of names and aliases, Unicode provides such names (but as usually, there were errors, so BELL is now used for a emoji, and so the old alias of BEL is not more valid, for control code). There are also several sources about standardization of control codes (mainly used for escape sequances, and so). But also this is not fully follower. Terminals tend to have quirks.

How to edit this file using grep or using cat or using vim or using another tool?

One of my elder brother who is studying in Statistics. Now, he is writing his thesis paper in LaTeX. Almost all contents are written for the paper. And he took 5 number after point(e.g. 5.55534) for each value those are used for his calculation. But, at the last time his instructor said to change those to 3 number after point(e.g. 5.555) which falls my brother in trouble. Finding and correcting those manually is not easy. So, he told me to help.
I believe there is also a easy solution which is know to me. The snapshot of a portion of the thesis looks like-
&se($\hat\beta_1$)&0.35581&0.35573&0.35573\\
&mse($\hat\beta_1$)&.12945&.12947&.12947\\
\addlinespace
&$\hat\beta_2$&0.03329&0.03331&0.03331 \\
&se($\hat\beta_2$)&0.01593&0.01592&0.01591\\
&mse($\hat\beta_2$)&.000265&.000264&.000264 \\
\midrule
{n=100} & $\hat\beta_1$&-.52006&-.52001&-.51946\\
&se($\hat\beta_1$)&.22819&.22814&.22795\\
&mse($\hat\beta_1$)&.05247&.05244&.05234\\
\addlinespace
&$\hat\beta_2$&0.03134&0.03134&0.03133 \\
&se($\hat\beta_2$)&0.00979&0.00979&0.00979\\
&mse($\hat\beta_2$)&.000098&.000098&.000098
I want -
&se($\hat\beta_1$)&0.355&0.355&0.355\\
&mse($\hat\beta_1$)&.129&.129&.129\\
......................................................................
........................................................................
........................................................................
Note: Don't feel boring for the syntax(These are LaTeX syntax).
If anybody has solution or suggestion, please provide. Thank you.
In sed:
$ sed 's/\(\.[0-9]\{3\}\)[0-9]*/\1/g' file
&se($\hat\beta_1$)&0.355&0.355&0.355\\
&mse($\hat\beta_1$)&.129&.129&.129\\
ie. replace period starting numeric strings with at least 3 numbers with the leading period and three first numbers.
Here is the command in vim:
:%s/\.\d\{3}\zs\d\+//g
Explanation:
: entering command-mode
% is the range of all lines of the file
s substitution command
\.\d\{3}\zs\d\+ pattern you would like to change
\. literal point (.)
\d\{3} match 3 consecutive digits
\zs start substitution from here
\d\+ one or more digits
g Replace all occurrences in the line
Concerning grep and cat they have nothing to do with replacing text. These commands are only for searching and printing contents of files.
Instead, what you are looking is substitution there are lots of commands in Linux that can do that mainly sed, perl, awk, ex etc.

Where are line breaks allowed within Haskell expressions?

Background
Most style guides recommend keeping line lengths to 79 characters or less. In Haskell, indentation rules mean that expressions frequently need to be broken up with new lines.
Questions:
Within expressions, where is it legal to place a new line?
Is this documented somewhere?
Extended question: I see GHC formatting my code when it reports an error so someone has figured out how to automate the process of breaking long lines. Is there a utility that I can put haskell code into and have it spit that code back nicely formatted?
You can place a newline anywhere between lexical tokens of an expression. However, there are constraints about how much indentation may follow the newline. The easy rule of thumb is to indent the next line to start to the right of the line containing the expression. Beyond that, some style things:
If you are indenting an expression that appears in a definition name = expression, it's good style to indent to the right of the = sign.
If you are indenting an expression that appears on the right-hand side of a do binding or a list comprehension, it's good style to indent to the right of the <- sign.
The authoritative documentation is probably the Haskell 98 Report (Chapter 2 on lexical structure), but personally I don't find this material very easy to read.

How to have a carriage return without bringing about a linebreak in VIM?

Is it possible to have a carriage return without bringing about a linebreak ?
For instance I want to write the following sentences in 2 lines and not 4 (and I do not want to type spaces of course) :
On a ship at sea: a tempestuous noise of thunder and lightning heard.
Enter a Master and a Boatswain
Master : Boatswain!
Boatswain : Here, master: what cheer?
Thanks in advance for your help
Thierry
In a text file, the expected line-end character or character sequence is platform dependent. On Windows, the sequence "carriage return (CR, \r) + line feed (LF, \n)" is used, while Unix systems use newline only (LF \n). Macintoshes traditionally used \r only, but these days on OS X I see them dealing with just about any version. Text editors on any system are often able to support all three versions, and to convert between them.
For VIM, see this article for tips how to convert/set line end character sequences.
However, I'm not exactly sure what advantage the change would have for you: Whichever sequence or character you use, it is just the marker for the end of the line (so there should be one of these, at the end of the first line and you'd have a 2 line text file in any event). However, if your application expects a certain character, you can either change the application -- many programming languages support some form of "universal" newline -- or change the data.
Just in case this is what you're looking for:
:set wrap
:set linebreak
The first tells vim to wrap long lines, and the second tells it to only break lines at word breaks, instead of in the middle of words when it reaches the window size.

Why doesn't Vims errorformat take regular expressions?

Vims errorformat (for parsing compile/build errors) uses an arcane format from c for parsing errors.
Trying to set up an errorformat for nant seems almost impossible, I've tried for many hours and can't get it. I also see from my searches that alot of people seem to be having the same problem. A regex to solve this would take minutesto write.
So why does vim still use this format? It's quite possible that the C parser is faster but that hardly seems relevant for something that happens once every few minutes at most. Is there a good reason or is it just an historical artifact?
It's not that Vim uses an arcane format from C. Rather it uses the ideas from scanf, which is a C function. This means that the string that matches the error message is made up of 3 parts:
whitespace
characters
conversion specifications
Whitespace is your tabs and spaces. Characters are the letters, numbers and other normal stuff. Conversion specifications are sequences that start with a '%' (percent) character. In scanf you would typically match an input string against %d or %f to convert to integers or floats. With Vim's error format, you are searching the input string (error message) for files, lines and other compiler specific information.
If you were using scanf to extract an integer from the string "99 bottles of beer", then you would use:
int i;
scanf("%d bottles of beer", &i); // i would be 99, string read from stdin
Now with Vim's error format it gets a bit trickier but it does try to match more complex patterns easily. Things like multiline error messages, file names, changing directory, etc, etc. One of the examples in the help for errorformat is useful:
1 Error 275
2 line 42
3 column 3
4 ' ' expected after '--'
The appropriate error format string has to look like this:
:set efm=%EError\ %n,%Cline\ %l,%Ccolumn\ %c,%Z%m
Here %E tells Vim that it is the start of a multi-line error message. %n is an error number. %C is the continuation of a multi-line message, with %l being the line number, and %c the column number. %Z marks the end of the multiline message and %m matches the error message that would be shown in the status line. You need to escape spaces with backslashes, which adds a bit of extra weirdness.
While it might initially seem easier with a regex, this mini-language is specifically designed to help with matching compiler errors. It has a lot of shortcuts in there. I mean you don't have to think about things like matching multiple lines, multiple digits, matching path names (just use %f).
Another thought: How would you map numbers to mean line numbers, or strings to mean files or error messages if you were to use just a normal regexp? By group position? That might work, but it wouldn't be very flexible. Another way would be named capture groups, but then this syntax looks a lot like a short hand for that anyway. You can actually use regexp wildcards such as .* - in this language it is written %.%#.
OK, so it is not perfect. But it's not impossible either and makes sense in its own way. Get stuck in, read the help and stop complaining! :-)
I would recommend writing a post-processing filter for your compiler, that uses regular expressions or whatever, and outputs messages in a simple format that is easy to write an errorformat for it. Why learn some new, baroque, single-purpose language unless you have to?
According to :help quickfix,
it is also possible to specify (nearly) any Vim supported regular
expression in format strings.
However, the documentation is confusing and I didn't put much time into verifying how well it works and how useful it is. You would still need to use the scanf-like codes to pull out file names, etc.
They are a pain to work with, but to be clear: you can use regular expressions (mostly).
From the docs:
Pattern matching
The scanf()-like "%*[]" notation is supported for backward-compatibility
with previous versions of Vim. However, it is also possible to specify
(nearly) any Vim supported regular expression in format strings.
Since meta characters of the regular expression language can be part of
ordinary matching strings or file names (and therefore internally have to
be escaped), meta symbols have to be written with leading '%':
%\ The single '\' character. Note that this has to be
escaped ("%\\") in ":set errorformat=" definitions.
%. The single '.' character.
%# The single '*'(!) character.
%^ The single '^' character. Note that this is not
useful, the pattern already matches start of line.
%$ The single '$' character. Note that this is not
useful, the pattern already matches end of line.
%[ The single '[' character for a [] character range.
%~ The single '~' character.
When using character classes in expressions (see |/\i| for an overview),
terms containing the "\+" quantifier can be written in the scanf() "%*"
notation. Example: "%\\d%\\+" ("\d\+", "any number") is equivalent to "%*\\d".
Important note: The \(...\) grouping of sub-matches can not be used in format
specifications because it is reserved for internal conversions.
lol try looking at the actual vim source code sometime. It's a nest of C code so old and obscure you'll think you're on an archaeological dig.
As for why vim uses the C parser, there are plenty of good reasons starting with that it's pretty universal. But the real reason is that sometime in the past 20 years someone wrote it to use the C parser and it works. No one changes what works.
If it doesn't work for you the vim community will tell you to write your own. Stupid open source bastards.

Resources