Implementing String Interpolation in Flex/Bison

I'm currently writing an interpreter for a language I have designed.
The lexer/parser (GLR) is written in Flex/Bison and the main interpreter in D, and everything has been working flawlessly so far.
The thing is, I also want to add string interpolation: that is, identify string literals that contain a specific pattern (e.g. "[some expression]") and convert the included expression. I think this should be done at the parser level, from within the corresponding grammar action.
My idea is converting/treating the interpolated string as what it would look like with simple concatenation (as it works right now).
E.g.
print "this is the [result]. yay!"
to
print "this is the " + result + ". yay!"
However, I'm a bit confused as to how I could do that in Bison: basically, how do I tell it to re-parse a specific string (while constructing the main AST)?
Any ideas?

You could reparse the string, if you really wanted to, by generating a reentrant parser. You would probably want a reentrant scanner as well, although I suppose you could kludge something together with a default scanner, using flex's buffer stack. Indeed, it's worth learning how to build reentrant parsers and scanners on the general principle of avoiding unnecessary globals, whether or not you need them for this particular purpose.
But you don't really need to reparse anything; you can do the entire parse in one pass. You just need enough smarts in your scanner so that it knows about nested interpolations.
The basic idea is to let the scanner split the string literal with interpolations into a series of tokens, which can easily be assembled into an appropriate AST by the parser. Since the scanner may return more than one token out of a single string literal, we'll need to introduce a start condition to keep track of whether the scan is currently inside a string literal or not. And since interpolations can, presumably, be nested we'll use flex's optional start condition stack, enabled with %option stack, to keep track of the nested contexts.
So here's a rough sketch.
As mentioned, the scanner has extra start conditions: SC_PROGRAM, the default, which is in effect while the scanner is scanning regular program text, and SC_STRING, in effect while the scanner is scanning a string. SC_PROGRAM is only needed because flex does not provide an official interface to check whether the start condition stack is empty; aside from nesting, it is identical to the INITIAL top-level start condition. The start condition stack is used to keep track of interpolation markers ([ and ] in this example), and it is needed because an interpolated expression might use brackets (as array subscripts, for example) or might even include a nested interpolated string. Since SC_PROGRAM is, with one exception, identical to INITIAL, we'll make it an inclusive start condition (declared with %s); SC_STRING is exclusive (%x).
%option stack
%s SC_PROGRAM
%x SC_STRING
%%
Since we're using a separate start condition to analyse string literals, we can also normalise escape sequences as we parse. Not all applications will want to do this, but it's pretty common. But since that's not really the point of this answer, I've left out most of the details. More interesting is the way that embedded interpolation expressions are handled, particularly deeply nested ones.
The end result will be to turn string literals into a series of tokens, possibly representing a nested structure. In order to avoid actually parsing in the scanner, we don't make any attempt to create AST nodes or otherwise rewrite the string literal; instead, we just pass the quote characters themselves through to the parser, delimiting the sequence of string literal pieces:
["] { yy_push_state(SC_STRING); return '"'; }
<SC_STRING>["] { yy_pop_state(); return '"'; }
A very similar set of rules is used for interpolation markers:
<*>"[" { yy_push_state(SC_PROGRAM); return '['; }
<INITIAL>"]" { return ']'; }
<*>"]" { yy_pop_state(); return ']'; }
The second rule above avoids popping the start condition stack if it is empty (as it will be in the INITIAL state). It's not necessary to issue an error message in the scanner; we can just pass the unmatched close bracket through to the parser, which will then do whatever error recovery seems necessary.
To finish off the SC_STRING state, we need to return tokens for pieces of the string, possibly including escape sequences:
<SC_STRING>{
  [^[\\"]+           { yylval.str = strdup(yytext); return T_STRING; }
  \\n                { yylval.chr = '\n'; return T_CHAR; }
  \\t                { yylval.chr = '\t'; return T_CHAR; }
  /* ... Etc. */
  \\x[[:xdigit:]]{2} { /* skip the leading \x before converting */
                       yylval.chr = strtoul(yytext + 2, NULL, 16);
                       return T_CHAR; }
  \\.                { yylval.chr = yytext[1]; return T_CHAR; }
}
Returning escaped characters like that to the parser is probably not the best strategy; normally I would use an internal scanner buffer to accumulate the entire string. But it was simple for illustrative purposes. (Some error handling is omitted here; there are various corner cases, including newline handling and the annoying case where the last character in the program is a backslash inside an unterminated string literal.)
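For anyone who wants to see the start-condition stack at work without building a flex scanner, here is a minimal sketch in Python (not flex) of the same idea. Everything in it is an assumption made for illustration: the tokenize function, the T_WORD token kind, the two-entry ESCAPES table, and the crude handling of program text. Only the push/pop logic is meant to mirror the rules above.

```python
# Illustrative only: emulates the flex start-condition stack by hand.
PROGRAM, STRING = "SC_PROGRAM", "SC_STRING"
ESCAPES = {"n": "\n", "t": "\t"}       # only two escapes, as a simplification

def tokenize(src):
    """Return (kind, value) tokens; string literals are split at [ ... ]."""
    tokens, stack, i = [], [PROGRAM], 0    # bottom entry stands in for INITIAL
    while i < len(src):
        ch = src[i]
        if stack[-1] == STRING:
            if ch == '"':                  # closing quote: leave the string
                stack.pop()
                tokens.append(('"', ch))
                i += 1
            elif ch == "[":                # interpolation: back to program text
                stack.append(PROGRAM)
                tokens.append(("[", ch))
                i += 1
            elif ch == "\\":               # escape sequence -> single character
                esc = src[i + 1:i + 2]     # empty if the backslash is last
                tokens.append(("T_CHAR", ESCAPES.get(esc, esc or "\\")))
                i += 2
            else:                          # run of ordinary string characters
                j = i
                while j < len(src) and src[j] not in '["\\':
                    j += 1
                tokens.append(("T_STRING", src[i:j]))
                i = j
        elif ch.isspace():
            i += 1
        elif ch == '"':                    # opening quote: enter the string
            stack.append(STRING)
            tokens.append(('"', ch))
            i += 1
        elif ch == "[":
            stack.append(PROGRAM)
            tokens.append(("[", ch))
            i += 1
        elif ch == "]":
            if len(stack) > 1:             # never pop the bottom entry
                stack.pop()
            tokens.append(("]", ch))
            i += 1
        elif ch.isalnum() or ch == "_":    # identifiers and numbers, crudely
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(("T_WORD", src[i:j]))
            i = j
        else:                              # any other character is its own token
            tokens.append((ch, ch))
            i += 1
    return tokens

for tok in tokenize('print "this is the [result]. yay!"'):
    print(tok)
```

In this sketch a ] inside string text is treated as an ordinary character, and an unmatched ] at the top level is simply passed through, in the same spirit as leaving error recovery to the parser.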
In the parser, we just need to insert a concatenation node for interpolated strings. The only complication is that we don't want to insert such a node for the common case of a string literal without any interpolations, so we use two syntax productions, one for a string with exactly one contained piece, and one for a string with two or more pieces:
string : '"' piece '"'            { $$ = $2; }
       | '"' piece piece_list '"' { $$ = make_concat_node(
                                        prepend_to_list($2, $3));
                                  }
piece  : T_STRING                 { $$ = make_literal_node($1); }
       | '[' expr ']'             { $$ = $2; }
piece_list
       : piece                    { $$ = new_list($1); }
       | piece_list piece         { $$ = append_to_list($1, $2); }
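The effect of those two productions can be sketched outside Bison as well. The following Python snippet is only an illustration under assumed names: the tuple-based node shapes ("literal", "var", "concat"), make_string_node, and evaluate are all hypothetical and stand in for whatever AST the real interpreter uses.

```python
def make_string_node(pieces):
    """Mirror the two 'string' productions: one piece, or two or more."""
    if not pieces:
        # Assumption beyond the grammar shown: "" becomes an empty literal.
        return ("literal", "")
    if len(pieces) == 1:
        return pieces[0]               # no concat node for the common case
    return ("concat", list(pieces))

def evaluate(node, env):
    """Tiny evaluator, just enough to show what the concat node means."""
    kind, value = node
    if kind == "literal":
        return value
    if kind == "var":
        return str(env[value])
    if kind == "concat":
        return "".join(evaluate(child, env) for child in value)
    raise ValueError(f"unknown node kind: {kind}")

# Pieces as the parser would collect them for "this is the [result]. yay!"
pieces = [("literal", "this is the "), ("var", "result"), ("literal", ". yay!")]
tree = make_string_node(pieces)
print(evaluate(tree, {"result": 42}))   # this is the 42. yay!
```

The result is the same tree you would get from parsing "this is the " + result + ". yay!", which is what the question asked for, but without any reparsing.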

Related

Difference between single quote and double quote string *types* in octave? Reason of warning?

I am aware that in octave escape sequences are treated differently in single/double quotes. Nevertheless, there seems to be a type difference:
Whereas class("bla") and class('bla') are both char,
typeinfo("bla") is string, whereas typeinfo('bla') is sq_string,
which may be short for single quote string.
More interesting, warning("on", "Octave:mixed-string-concat") activates warning
that these two types are mixed.
So after activation, ["bla" 'bla'] yields a warning.
Note that typeinfo(["bla" "bla"]) is string,
whereas if one of the two strings concatenated is single quote, so is the result,
e.g. typeinfo(['bla' "bla"]) is sq_string.
I have a situation where someone activates the warning,
so I want to write my code in a way that avoids it.
Thus my question: is there a way to convert sq_string to string?
The core of my problem is that fieldnames seem to be single quoted strings.
What an interesting question. I've never thought one might have a need for such a warning or conversion ... though now that I think about it, it makes sense if you want to collect 'raw' strings, and have their escape sequences interpreted and vice versa ...
After some experimentation, I have found a way to do what you want: use sprintf. This seems to return a (double-quoted) string if your format string is in double quotes, and an sq_string if it's in single quotes. If your format string is simply "%s", then you can pass a bunch of strings as subsequent arguments, and these will be concatenated (as a double-quoted string).
If you'd prefer to go in the reverse direction and ensure your strings are always single quoted, you can still do the above with a single-quoted format string, or you can just use strcat: this does not trigger your warning, can also be called with a single argument, and seems to always return an sq_string.
Also, since I would generally recommend using either of these with "cell-generated sequence" syntax for convenience, this means that you would be better off "collecting" individual strings in cells more generally. E.g.
a = { 'one', 'two', 'three' }
b = { "four", "five", "six" }
typeinfo( sprintf( "%s", a{:} ) ) % outputs: string
typeinfo( strcat( b{:} ) ) % outputs: sq_string

Pyparsing - matching the outermost set of nested brackets

I'm trying to use pyparsing to build a parser that will match on all text within an arbitrarily nested set of brackets. If we consider a string like this:
"[A,[B,C],[D,E,F],G] Random Middle text [H,I,J]"
What I would like is for a parser to match in a way that it returns two matches:
[
"[A,[B,C],[D,E,F],G]",
"[H,I,J]"
]
I was able to accomplish a somewhat-working version of this using a barrage of originalTextFor mashed up with nestedExpr, but this breaks when your nesting is deeper than the number of OriginalTextFor expressions.
Is there a straightforward way to only match on the outermost expression grabbed by nestedExpr, or a way to modify its logic so that everything after the first paired match is treated as plaintext rather than being parsed?
update: One thing that seems to come close to what I want to accomplish is this modified version of the logic from nestedExpr:
def mynest(opener='{', closer='}'):
    content = (empty.copy()+CharsNotIn(opener+closer+ParserElement.DEFAULT_WHITE_CHARS))
    ret = Forward()
    ret <<= ( Suppress(opener) + originalTextFor(ZeroOrMore( ret | content )) + Suppress(closer) )
    return ret
This gets me most of the way there, although there's an extra level of list wrapping in there that I really don't need, and what I'd really like is for those brackets to be included in the string (without getting into an infinite recursion situation by not suppressing them).
parser = mynest("[","]")
result = parser.searchString("[A,[B,C],[D,E,F],G] Random Middle text [H,I,J]")
result.asList()
>>> [['A,[B,C],[D,E,F],G'], ['H,I,J']]
I know I could strip these out with a simple list comprehension, but it would be ideal if I could just eliminate that second, redundant level.
Not sure why this wouldn't work:
sample = "[A,[B,C],[D,E,F],G] Random Middle text [H,I,J]"
scanner = originalTextFor(nestedExpr('[',']'))
for match in scanner.searchString(sample):
    print(match[0])
prints:
'[A,[B,C],[D,E,F],G]'
'[H,I,J]'
What is the situation where "this breaks when your nesting is deeper than the number of OriginalTextFor expressions"?
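If pulling in pyparsing is not a requirement, the outermost groups can also be found with a plain depth counter. This is a different technique from nestedExpr, shown here only as a standard-library sketch; the function name outermost_groups is made up for the example, and it assumes single-character delimiters.

```python
def outermost_groups(text, opener="[", closer="]"):
    """Return each top-level opener...closer group, delimiters included."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == opener:
            if depth == 0:
                start = i              # remember where the outermost group begins
            depth += 1
        elif ch == closer and depth > 0:
            depth -= 1
            if depth == 0:             # back at top level: the group is complete
                groups.append(text[start:i + 1])
    return groups

sample = "[A,[B,C],[D,E,F],G] Random Middle text [H,I,J]"
print(outermost_groups(sample))
# ['[A,[B,C],[D,E,F],G]', '[H,I,J]']
```

Unbalanced closers are ignored and an unterminated group is dropped; whether that is the right behaviour depends on the input you expect.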

Searching in an indexed string

The plot
There is a string with a rather complicated format, such that no readable regex can parse it. The aim is to extract a specific substring, for example, and to get its original position. That substring is reached after some parsing: trimming, removing a prefix, and finding the n-th element, for example. This example is only meant to demonstrate the complexity; otherwise the problem is pretty general.
For demonstration, see this rudimentary example. Its exact form isn't really important; it only serves to arrive at a fairly complicated parsing model. Obviously, there can be more rules, and you could write a simpler model as well.
FirstBlock{Index1, Index2} SecondBlock ThirdBlock
{ FirstBlock {Index1,Index2} SecondBlock}
{FirstBlock SecondBlock ThirdBlock FourthBlock}
I've tried to make it as random as possible. The parsing model looks like this:
string text = "{ FirstBlock {Index1,Index2} SecondBlock}";
text = text.Trim();
if (text.First() == '{')
{
    // Drop the enclosing braces.
    text = text.Substring(1, text.Length - 2);
}
text = text.Trim();
string firstBlock = text.Split(new char[] { ' ', '{' })[0];
text = text.Remove(0, firstBlock.Length).Trim();
string indices = "";
if (text.First() == '{')
{
    // Element [0] is empty because the text starts with '{'; the indices are in [1].
    indices = text.Split(new char[] { '{', '}' })[1];
    // Remove the indices together with their two braces.
    text = text.Remove(0, indices.Length + 2).Trim();
}
string[] blocks = text.Split(' ');
The easy way
There is a way that is pretty easy to implement and straightforward, but it sometimes gives the wrong result. You parse the string, get the substring, and then search for it again, for example with string.IndexOf(), to get the position. But if there are two matches, you are given the first one even though it may not be the one you wanted.
My notion
The approach I have in mind is quite elegant, but still not perfect: index the characters of the string at the beginning, then parse it, so that you eventually end up with the proper characters together with their positions. My problem is that you then can't really use the functions the library provides, and I don't know a way around that. Using the snippet above:
List<Tuple<int, char>> indexedText = text
.Select((ch, index) => new Tuple<int, char>(index, ch))
.ToList();
With this structure you can still process the string, without library methods, and you get the position indices at the end. For example, trim:
indexedText = indexedText
.SkipWhile(indexedChar => char.IsWhiteSpace(indexedChar.Item2))
.ToList();
The actual question
The answer can be either a new solution or a way to use library methods with indexed strings. The aim is to get the indices back after parsing a string. It is possible that there is a very simple way that I have simply overlooked, but I haven't found a proper solution yet. The solution I don't want is simplifying the parsing system; as I said, it is just for demonstration.
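One possible direction, sketched here in Python rather than C# and not taken from the question itself, is to carry a (start, end) pair into the untouched source string instead of indexing every character. The Span class and its strip, slice, and split methods are hypothetical names invented for this illustration; they only cover the operations the demonstration parser needs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """A view into `source`; start is inclusive, end is exclusive."""
    source: str
    start: int
    end: int

    @property
    def text(self):
        return self.source[self.start:self.end]

    def strip(self):
        s, e = self.start, self.end
        while s < e and self.source[s].isspace():
            s += 1
        while e > s and self.source[e - 1].isspace():
            e -= 1
        return Span(self.source, s, e)

    def slice(self, a, b):
        # a and b are relative to this span, like str slicing.
        return Span(self.source, self.start + a, self.start + b)

    def split(self, seps):
        """Split on any character in seps, dropping empty pieces."""
        parts, begin = [], None
        for i in range(self.start, self.end):
            if self.source[i] in seps:
                if begin is not None:
                    parts.append(Span(self.source, begin, i))
                    begin = None
            elif begin is None:
                begin = i
        if begin is not None:
            parts.append(Span(self.source, begin, self.end))
        return parts

source = "{ FirstBlock {Index1,Index2} SecondBlock}"
span = Span(source, 0, len(source)).strip()
if span.text.startswith("{"):
    span = span.slice(1, len(span.text) - 1).strip()   # drop the outer braces
for part in span.split(" {},"):
    print(part.text, part.start)
```

Every piece still knows where it came from, so no second search with IndexOf is needed; the price is that each string operation has to be re-implemented (or wrapped) to return spans.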

What is the r#""# operator in Rust?

I saw the operator r#"" in Rust but I can't find what it does. It came in handy for creating JSON:
let var1 = "test1";
let json = r#"{"type": "type1", "type2": var1}"#;
println!("{}", json) // => {"type": "type1", "type2": var1}
What's the name of the operator r#""? How do I make var1 evaluate?
I can't find what it does
It has to do with string literals and raw strings. I think it is explained pretty well in this part of the documentation; the code block posted there shows what it does:
"foo"; r"foo"; // foo
"\"foo\""; r#""foo""#; // "foo"
"foo #\"# bar";
r##"foo #"# bar"##; // foo #"# bar
"\x52"; "R"; r"R"; // R
"\\x52"; r"\x52"; // \x52
It negates the need to escape special characters inside the string.
The r character at the start of a string literal denotes a raw string literal. It's not an operator, but rather a prefix.
In a normal string literal, there are some characters that you need to escape to make them part of the string, such as " and \. The " character needs to be escaped because it would otherwise terminate the string, and the \ needs to be escaped because it is the escape character.
In raw string literals, you can put an arbitrary number of # symbols between the r and the opening ". To close the raw string literal, you must have a closing ", followed by the same number of # characters as there are at the start. With zero or more # characters, you can put literal \ characters in the string (\ characters do not have any special meaning). With one or more # characters, you can put literal " characters in the string. If you need a " followed by a sequence of # characters in the string, just use the same number of # characters plus one to delimit the string. For example: r##"foo #"# bar"## represents the string foo #"# bar. The literal doesn't stop at the quote in the middle, because it's only followed by one #, whereas the literal was started with two #.
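The "same number plus one" rule can be turned into a small calculation. The sketch below is in Python rather than Rust, purely to illustrate the counting; hashes_needed and rust_raw_literal are hypothetical helper names, and the sketch ignores any other restrictions on what a raw string literal may contain.

```python
import re

def hashes_needed(s):
    """Smallest number of # that safely delimits s as a Rust raw string."""
    # For every quote in s, measure the run of # characters right after it.
    runs = [len(m.group(1)) for m in re.finditer(r'"(#*)', s)]
    # No quotes at all: plain r"..." works. Otherwise: longest run plus one.
    return max(runs) + 1 if runs else 0

def rust_raw_literal(s):
    """Format s as raw string literal source text (illustration only)."""
    hashes = "#" * hashes_needed(s)
    return f'r{hashes}"{s}"{hashes}'

print(rust_raw_literal("foo"))            # r"foo"
print(rust_raw_literal('"foo"'))          # r#""foo""#
print(rust_raw_literal('foo #"# bar'))    # r##"foo #"# bar"##
```

The three printed literals match the forms in the documentation excerpt quoted earlier in this thread.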
To answer the last part of your question, there's no way to have a string literal that evaluates variables in the current scope. Some languages, such as PHP, support that, but not Rust. You should consider using the format! macro instead. Note that for JSON, you'll still need to double the braces, even in a raw string literal, because the string is interpreted by the macro.
fn main() {
    let var1 = "test1";
    let json = format!(r#"{{"type": "type1", "type2": {}}}"#, var1);
    println!("{}", json) // => {"type": "type1", "type2": test1}
}
If you need to generate a lot of JSON, there are many crates that will make it easier for you. In particular, with serde_json, you can define regular Rust structs or enums and have them serialized automatically to JSON.
The first time I saw this weird notation was in the glium tutorials (an old crate for graphics management), where it is used to "encapsulate" GLSL (GL Shading Language) code and pass it to the GPU's shaders:
https://github.com/glium/glium/blob/master/book/tuto-02-triangle.md
As far as I understand, it looks like the content of r#...# is left untouched, it is not interpreted in any way. Hence raw string.

Multiline string literal in Matlab?

Is there a multiline string literal syntax in Matlab or is it necessary to concatenate multiple lines?
I found the verbatim package, but it only works in an m-file or function and not interactively within editor cells.
EDIT: I am particularly after readability and ease of modifying the literal in the code (imagine it contains indented blocks of different levels) - it is easy to make multiline strings, but I am looking for the most convenient syntax for doing that.
So far I have
t = {...
'abc'...
'def'};
t = cellfun(@(x) [x sprintf('\n')],t,'Unif',false);
t = horzcat(t{:});
which gives size(t) = 1 8, but is obviously a bit of a mess.
EDIT 2: Basically verbatim does what I want except it doesn't work in Editor cells, but maybe my best bet is to update it so it does. I think it should be possible to get current open file and cursor position from the java interface to the Editor. The problem would be if there were multiple verbatim calls in the same cell how would you distinguish between them.
I'd go for:
multiline = sprintf([ ...
'Line 1\n'...
'Line 2\n'...
]);
Matlab is an oddball in that escape processing in strings is a function of the printf family of functions instead of the string literal syntax. And no multiline literals. Oh well.
I've ended up doing two things. First, make CR() and LF() functions that just return processed \r and \n respectively, so you can use them as pseudo-literals in your code. I prefer doing this way rather than sending entire strings through sprintf(), because there might be other backslashes in there you didn't want processed as escape sequences (e.g. if some of your strings came from function arguments or input read from elsewhere).
function out = CR()
out = char(13); % # sprintf('\r')
function out = LF()
out = char(10); % # sprintf('\n');
Second, make a join(glue, strs) function that works like Perl's join or the cellfun/horzcat code in your example, but without the final trailing separator.
function out = join(glue, strs)
strs = strs(:)';
strs(2,:) = {glue};
strs = strs(:)';
strs(end) = [];
out = cat(2, strs{:});
And then use it with cell literals like you do.
str = join(LF, {
'abc'
'defghi'
'jklm'
});
You don't need the "..." ellipses in cell literals like this; omitting them does a vertical vector construction, and it's fine if the rows have different lengths of char strings because they're each getting stuck inside a cell. That alone should save you some typing.
Bit of an old thread but I got this
multiline = join([
"Line 1"
"Line 2"
], newline)
I think it makes things pretty easy but obviously it depends on what one is looking for :)
