Lexer: Handling unterminated strings whilst tokenizing

I have started writing my own lexer and have run into a problem tokenising strings, since they have a start (") and an end (") character associated with them.
Does anyone know of a common technique by which a lexer can cope with an unterminated string and continue lexing?
I think ANTLR can do this; is it handled by an ATN in ANTLR?
I can see two issues here, assuming that strings must terminate on a single line:
String termination occurs on a later line -- in which case we warn the user that strings may only span a single line.
String termination never occurs -- in which case, at what point is it valid to continue? One heuristic is to resume at the next valid token after the newline.
For example:
char *mystring = "my string which is unterminated....
int id = 20;

If your language prohibits newlines in string literals, then just terminating the string at the end of the line is likely to be acceptable. It is reasonably unlikely that there will be a declaration or keyword statement on the same line as the string literal (and there is no reason to encourage bad style by trying to compensate for it.)
You might skip a useful close parenthesis:
printf("%s\n, line);
but you likely have recovery rules in place which can cope with that.
If string literals can contain newlines -- and there is ample evidence that this feature is often desired -- then recovery is rather more difficult and you might well find that the simplest solution all round is to just throw a syntax error which clearly states where the offending string started.
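To illustrate the single-line recovery described above, here is a minimal sketch of a hand-written lexer fragment (in Haskell for brevity; the Token type and lexString are hypothetical names, and escape sequences are ignored):
-- Tokens produced by the lexer; TError carries a recoverable diagnostic.
data Token = TString String | TError String
  deriving Show

-- Lex a string literal body, given the input just after the opening quote.
-- If a newline (or end of input) is reached before the closing quote, emit
-- an error token, terminate the string at the end of the line, and return
-- the rest of the input so lexing can continue with the next line.
lexString :: String -> (Token, String)
lexString input =
  case break (`elem` "\"\n") input of
    (body, '"' : rest) -> (TString body, rest)
    (body, rest)       -> (TError ("unterminated string: " ++ body), rest)
Applied to the example above, the unterminated literal is cut off at the end of its line and lexing resumes at int id = 20;.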


How do I parse a float from a string that might contain left-over characters without doing manual parsing?

How do I parse a float from a string that may contain left-over characters, and figure out where it ends, without doing manual parsing (e.g. parseFloat() in JS)?
An example would be the string "2e-1!". I want to evaluate the next float from the string so that I can split it into "2e-1" and "!" (for example, if I wanted to implement a calculator). How would I do this without writing a parser to see where the float ends, taking that as an &str, and then using .parse()?
I am aware that this means that I may have to parse nonstandard things such as "3e+1+e3" to 3e+1 (30) and "+e3". This is intended; I do not know all of the ways to format floating point numbers, especially the ones valid to Rust's parse::<f64>(), but I want to handle them regardless.
How can I do this, preferably without external libraries?
As mentioned in the comments, you need to either implement your own floating-point parser or use an external library. The parser in the standard library always errors out when it encounters additional junk in the input string – it doesn't even allow leading or trailing whitespace.
A good external crate to use is nom. It comes with an integrated parser for floating-point numbers that meets your requirements. Examples:
use nom::number::complete::double;

let parser = double::<_, ()>;
// `double` returns the unconsumed rest of the input alongside the parsed value.
assert_eq!(parser("2.0 abc"), Ok((" abc", 2.0)));
// "NaN" is recognised case-insensitively; NaN != NaN, so test with is_nan().
let result = parser("Nanananananana hey, Jude!").unwrap();
assert_eq!(result.0, "ananananana hey, Jude!");
assert!(result.1.is_nan());
The parser expects the floating-point number at the very beginning of the string. If you want to allow leading whitespace, you can remove it first using trim_start().

Read substrings from a string containing multiplication [duplicate]

I am a scientist programming in Fortran, and I have come across some strange behaviour. In one of my programs I have a string containing several "words", and I want to read all of them as substrings. The first word starts with an integer and a wildcard, like "2*something".
When I perform an internal read on that string, I expect to read all the words, but instead the READ statement repeatedly reads the first substring. I do not understand why, nor how to avoid this behaviour.
Below is a minimalist sample program that reproduces this behaviour. I would expect it to read the three substrings and to print "3*a b c" on the screen. Instead, I get "a a a".
What am I doing wrong? Can you please help me and explain what is going on?
I am compiling my programs under GNU/Linux x64 with Gfortran 7.3 (7.3.0-27ubuntu1~18.04).
PROGRAM testread
  IMPLICIT NONE
  CHARACTER(LEN=1024) :: string
  CHARACTER(LEN=16) :: v1, v2, v3
  string = "3*a b c"
  READ(string,*) v1, v2, v3
  PRINT *, v1, v2, v3
END PROGRAM testread
You are using list-directed input (the * format specifier). In list-directed input, a number (n) followed by an asterisk means "repeat this item n times", so it is processed as if the input was a a a b c. You would need to have as input '3*a' b c to get what you want.
I will use this as another opportunity to point out that list-directed I/O is sometimes the wrong choice as its inherent flexibility may not be what you want. That it has rules for things like repeat counts, null values, and undelimited strings is often a surprise to programmers. I also often see programmers complaining that list-directed input did not give an error when expected, because the compiler had an extension or the programmer didn't understand just how liberal the feature can be.
I suggest you pick up a Fortran language reference and carefully read the section on list-directed I/O. You may find you need to use an explicit format or change your program's expectations.
Following on from @SteveLionel's answer, here is the relevant part of the reference on list-directed sequential READ statements (in this case for Intel Fortran, but you can find the equivalent for your specific compiler, and it won't differ much).
A character string does not need delimiting apostrophes or quotation marks if the corresponding I/O list item is of type default character, and all of the following are true:
The character string does not contain a blank, comma (,), or slash ( / ).
The character string is not continued across a record boundary.
The first nonblank character in the string is not an apostrophe or a quotation mark.
The leading characters are not a string of digits followed by an asterisk.
A nondelimited character string is terminated by the first blank, comma, slash, or end-of-record encountered. Apostrophes and quotation marks within nondelimited character strings are transferred as is.
In total, there are four forms of sequential READ statement in Fortran, and you may choose the option that best fits your needs:
Formatted Sequential Read:
To use this, you change the * to an actual format specifier. If you know the length of the strings in advance, this is as easy as '(a3,a2,a2)'. Otherwise, you could come up with a format specifier that matches your data, but that generally requires knowing its length or layout.
Formatted Sequential List-Directed:
You are currently using this option (the * format descriptor). As already shown, this kind of I/O comes with a lot of magic and surprising behaviour. What is hitting you is the n*c form, which is interpreted as n repetitions of the constant c.
As Steve Lionel said, you could put quotation marks around the problematic word so that it is parsed as one piece. Or, as proposed by @evets, you could split your string using the intrinsics index or scan. Another option would be to change your wildcard from an asterisk to something else.
Formatted Namelist:
This could be an option if your data were (or could be) presented in namelist format, but I don't think that is your case.
Unformatted:
This may not apply to your case because you are reading from a character variable, and an internal READ statement can only be formatted.
Otherwise, you could split your string by means of a function instead of an I/O operation. There is no intrinsic for this, but you could come up with one without much trouble (see this thread for reference). As you may already have noticed, manipulating strings in Fortran is awkward, to say the least. There are libraries out there (like this one) that may be useful if you are doing a lot of string handling in Fortran.

Elegant way to parse "line splices" (backslashes followed by a newline) in megaparsec

For a small compiler project, we are currently implementing a compiler for a subset of C, for which we decided to use Haskell and megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot handle correctly yet. One of them is the treatment of backslashes followed by a newline. To quote from the specification:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice.
(§5.1.1, ISO/IEC 9899:201x)
So far we came up with two possible approaches to this problem:
1.) Implement a pre-lexing phase in which the initial input is preprocessed and every occurrence of \\\n is removed. The big disadvantage we see in this approach is that we lose accurate error locations, which we need.
2.) Implement a special char' combinator that behaves like char but looks an extra character ahead and silently consumes any \\\n. This would give us correct positions. The disadvantage here is that we would need to replace every occurrence of char with char' in every parser, even in the megaparsec-provided ones like string, integer, whitespace etc. (a sketch of such a combinator follows below).
Most likely we are not the first people trying to parse a language with such a "quirk" with parsec/megaparsec, so I could imagine that there is some nicer way to do it. Does anyone have an idea?
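A minimal sketch of one reading of option 2, assuming megaparsec's standard Text.Megaparsec.Char module; the names Parser, skipSplices and char' are ours, not megaparsec's:
import Data.Void (Void)
import Text.Megaparsec
import qualified Text.Megaparsec.Char as C

type Parser = Parsec Void String

-- Skip any number of backslash-newline splices.
skipSplices :: Parser ()
skipSplices = skipMany (try (C.char '\\' *> C.char '\n'))

-- Behaves like char, but silently consumes any line splices first,
-- so megaparsec's source positions remain accurate.
char' :: Char -> Parser Char
char' c = skipSplices *> C.char c
This keeps position tracking intact, but as the question notes, every character-level parser then has to be rebuilt on top of char', which is the real cost of this approach.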

Haskell Parsec parser for an expression starting with a word or parenthesis, ending with end of word or a parenthesis

I'm trying to figure out how to write a Haskell Parsec parser for consuming any of these Ruby expressions:
hello("test", 'test2')
my_variable
hello(world("test"))
(hello + " " + world)
When the parser starts at the beginning of any of these items, it should return the whole string and stop parsing at the end of the item. If any of these items is followed by a comma, the comma should not be consumed.
I've tried a few times to write a parser for these types of expressions but with no success. It's not necessary to parse the sub-components of these expressions -- I don't need a full AST. I just need to consume and capture these sorts of chunks.
I thought maybe an adequate heuristic could involve just balancing any parentheses and eating all the content within outer balanced parentheses, in addition to any preceding identifier. But I need some help writing a parser that works this way.
It doesn't make sense to try to parse without parsing everything. Either (a) write a structured, correct parser, or (b) write something that eats the input, does some counting and tracking but doesn't actually parse it. You'll find it hard to do (b) with parsec. The key question is correctness: how will you parse this(example + "(with" + (weird ("bracketing)?")+"(")) unless you parse strings? You should bite the bullet and write a string parser first, then an identifier parser, then mutually recursive expression, argumentList and function parsers. You don't have to return an AST.
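For reference, a minimal sketch along those lines using Text.Parsec; the names stringLit, parenGroup and chunk are hypothetical, and Ruby's escaping rules are simplified:
import Text.Parsec
import Text.Parsec.String (Parser)

-- An identifier such as my_variable or hello.
identifier :: Parser String
identifier = (:) <$> (letter <|> char '_') <*> many (alphaNum <|> char '_')

-- A string literal delimited by q, returned verbatim (escapes included).
stringLit :: Char -> Parser String
stringLit q = do
  _    <- char q
  body <- many (escaped <|> ((: []) <$> noneOf [q, '\\']))
  _    <- char q
  return (q : concat body ++ [q])
  where
    escaped = (\c -> ['\\', c]) <$> (char '\\' *> anyChar)

-- A balanced parenthesis group; strings inside may contain parentheses.
parenGroup :: Parser String
parenGroup = do
  _    <- char '('
  body <- many piece
  _    <- char ')'
  return ('(' : concat body ++ ")")
  where
    piece = parenGroup
        <|> stringLit '"'
        <|> stringLit '\''
        <|> ((: []) <$> noneOf "()\"'")

-- A whole chunk: an identifier optionally followed by an argument group,
-- or a bare parenthesised expression. A trailing comma is not consumed.
chunk :: Parser String
chunk = ((++) <$> identifier <*> option "" parenGroup) <|> parenGroup
For example, parse chunk "" "hello(world(\"test\")), rest" yields Right "hello(world(\"test\"))" and leaves the comma and everything after it unconsumed.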

Should I use a lexer when using a parser combinator library like Parsec?

When writing a parser with a parser combinator library like Haskell's Parsec, you usually have two choices:
Write a lexer to split your String input into tokens, then perform parsing on [Token]
Directly write parser combinators on String
The first method often seems to make sense given that many parsing inputs can be understood as tokens separated by whitespace.
In other places, I have seen people recommend against tokenizing (or scanning or lexing, as some call it), with simplicity quoted as the main reason.
What are general trade-offs between lexing and not doing it?
The most important difference is that lexing translates your input domain: your parsers then consume tokens instead of characters.
A nice result of this is that
You do not have to think about whitespace anymore. In a direct (non-lexing) parser, you have to sprinkle space parsers everywhere whitespace is allowed, which is easy to forget and clutters your code if whitespace must separate all your tokens anyway.
You can think about your input in a piece-by-piece manner, which is easy for humans.
However, if you do perform lexing, you get the problems that
You can no longer use common parsers on String - e.g. to parse a number with a library function parseFloat :: Parsec String s Float (that operates on a String input stream), you have to do something like takeNextToken :: TokenParser String and execute the parseFloat parser on the token's text, inspecting the parse result (usually Either ErrorMessage a). This is messy to write and limits composability.
You have to adjust all error messages. If your parser on tokens fails at the 20th token, where in the input string is that? You'll have to manually map error locations back to the input string, which is tedious (in Parsec this means adjusting all SourcePos values).
Error reporting is generally worse. Running string "hello" *> space *> float on wrong input like "hello4" will tell you precisely that the expected whitespace is missing after the hello, while a lexer will just claim to have found an "invalid token".
Many things that one would expect to be atomic units, separated by a lexer, are actually rather "too hard" for a lexer to identify. Take string literals, for example: "hello world" must stay one token, not the two whitespace-separated pieces "hello and world" (except, of course, when the quotes are escaped, like \") - while this is very natural for a parser, it means complicated rules and special cases for a lexer.
You cannot re-use parsers on tokens as nicely. If you define how to parse a double out of a String, you can export it and the rest of the world can use it; if your parser instead works on your token stream, others must run your (specialized) tokenizer first.
You are stuck with it. When you are developing the language to parse, using a lexer might lead you into making early decisions, fixing things that you might want to change afterwards. For example, imagine you defined a language that contains some Float token. At some point, you want to introduce negative literals (-3.4 and - 3.4) - this might not be possible due to the lexer interpreting whitespace as token separator. Using a parser-only approach, you can stay more flexible, making changes to your language easier. This is not really surprising since a parser is a more complex tool that inherently encodes rules.
To summarize, I would recommend writing lexer-free parsers for most cases.
In the end, a lexer is just a "dumbed-down"* parser - if you need a parser anyway, combine them into one.
* From computing theory, we know that all regular languages are also context-free languages; lexers are usually regular, parsers context-free or even context-sensitive (monadic parsers like Parsec can express context-sensitivity).
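As a closing illustration, here is a minimal sketch of that combined, lexer-free style in Parsec, where every token parser consumes its own trailing whitespace; the lexeme/symbol/integer names follow common convention and are not from the answer:
import Text.Parsec
import Text.Parsec.String (Parser)

-- Run a parser, then eat any trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- A fixed keyword or operator as a whitespace-aware "token".
symbol :: String -> Parser String
symbol = lexeme . string

-- An integer literal as a whitespace-aware "token".
integer :: Parser Integer
integer = lexeme (read <$> many1 digit)
With this pattern the grammar-level parsers never mention whitespace, which recovers the main convenience of a lexer without leaving the character stream.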
