(F)Lex checking symbol without "consuming" it - string

The purpose of this is to concatenate strings (with (f)lex if possible) if they're written consecutively separated only by whitespace.
Strings start and end with "s.
The thing is I used states and while it can concatenate the strings it also consumes the next character/symbol that comes right after the strings.
For example -- "this " "is only " "1 string"id -- this will concatenate the strings ("this is only 1 string") but it will also "consume" the i in id thus destroying one token.
Is there a way to check the next char/symbol without actually consuming (discarding) it?
\" yy_push_state(X_STRING); yylval.s = new std::string("");
<X_STRING>\" yy_push_state(X_CONC);
<X_STRING>. *yylval.s += yytext;
<X_STRING>\n yyerror("newline in string");
<X_CONC>[ ^\n] ;
<X_CONC>\" yy_pop_state();
<X_CONC>. yy_pop_state(); yy_pop_state(); return STRING;
Any way to do it?

You can use yyless(0) to push the whole current match back onto the input so that it is rescanned. Make sure you change the start condition first, or you'll end up in an endless loop.
By the way, I think your code would be more readable if you switched start conditions with BEGIN rather than using the state stack. In fact, you could easily avoid start conditions, but that would make interpreting escape sequences more complicated. Possibly better would be to just avoid X_CONC by using a rule for \"[[:space:]]*\"
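The idea behind yyless(0), looking at the next input and then handing it back, can be illustrated outside flex. What follows is a minimal Python sketch, not flex code; the function name scan, the token patterns and the (kind, value) tuples are all made up for illustration. It merges adjacent string literals separated only by whitespace and commits to a new position only when another literal really follows, so whatever comes after the last literal is left intact for the next token.

```python
import re

# Hypothetical token patterns, for illustration only.
STRING = re.compile(r'"([^"\n]*)"')
SPACE = re.compile(r'\s+')
IDENT = re.compile(r'[A-Za-z_]\w*')


def scan(text):
    """Yield (kind, value) tokens, merging adjacent string literals."""
    pos = 0
    while pos < len(text):
        m = SPACE.match(text, pos)
        if m:
            pos = m.end()
            continue
        m = STRING.match(text, pos)
        if m:
            parts = [m.group(1)]
            pos = m.end()
            while True:
                # Look past whitespace for another literal, but only
                # advance pos if one is really there.
                ws = SPACE.match(text, pos)
                lookahead = ws.end() if ws else pos
                nxt = STRING.match(text, lookahead)
                if not nxt:
                    break  # pos is unchanged: nothing was consumed
                parts.append(nxt.group(1))
                pos = nxt.end()
            yield ("STRING", "".join(parts))
            continue
        m = IDENT.match(text, pos)
        if m:
            yield ("ID", m.group(0))
            pos = m.end()
            continue
        raise ValueError(f"unexpected character {text[pos]!r} at {pos}")


print(list(scan('"this " "is only " "1 string"id')))
```

With the example from the question, the three literals come out as one STRING token and the trailing id survives as its own token.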

Related

Implementing String Interpolation in Flex/Bison

I'm currently writing an interpreter for a language I have designed.
The lexer/parser (GLR) is written in Flex/Bison and the main interpreter in D, and everything is working flawlessly so far.
The thing is I want to also add string interpolation, that is identify string literals that contain a specific pattern (e.g. "[some expression]") and convert the included expression. I think this should be done at parser level, from within the corresponding Grammar action.
My idea is converting/treating the interpolated string as what it would look like with simple concatenation (as it works right now).
E.g.
print "this is the [result]. yay!"
to
print "this is the " + result + ". yay!"
However, I'm a bit confused as to how I could do that in Bison: basically, how do I tell it to re-parse a specific string (while constructing the main AST)?
Any ideas?
You could reparse the string, if you really wanted to, by generating a reentrant parser. You would probably want a reentrant scanner, as well, although I suppose you could kludge something together with a default scanner, using flex's buffer stack. Indeed, it's worth learning how to build reentrant parsers and scanners on the general principle of avoiding unnecessary globals, whether or not you need them for this particular purpose.
But you don't really need to reparse anything; you can do the entire parse in one pass. You just need enough smarts in your scanner so that it knows about nested interpolations.
The basic idea is to let the scanner split the string literal with interpolations into a series of tokens, which can easily be assembled into an appropriate AST by the parser. Since the scanner may return more than one token out of a single string literal, we'll need to introduce a start condition to keep track of whether the scan is currently inside a string literal or not. And since interpolations can, presumably, be nested we'll use flex's optional start condition stack, enabled with %option stack, to keep track of the nested contexts.
So here's a rough sketch.
As mentioned, the scanner has extra start conditions: SC_PROGRAM, the default, which is in effect while the scanner is scanning regular program text, and SC_STRING, in effect while the scanner is scanning a string. SC_PROGRAM is only needed because flex does not provide an official interface to check whether the start condition stack is empty; aside from nesting, it is identical to the INITIAL top-level start condition. The start condition stack is used to keep track of interpolation markers ([ and ] in this example), and it is needed because an interpolated expression might use brackets (as array subscripts, for example) or might even include a nested interpolated string. Since SC_PROGRAM is, with one exception, identical to INITIAL, we'll make it an inclusive start condition (declared with %s).
%option stack
%s SC_PROGRAM
%x SC_STRING
%%
Since we're using a separate start condition to analyse string literals, we can also normalise escape sequences as we parse. Not all applications will want to do this, but it's pretty common. But since that's not really the point of this answer, I've left out most of the details. More interesting is the way that embedded interpolation expressions are handled, particularly deeply nested ones.
The end result will be to turn string literals into a series of tokens, possibly representing a nested structure. In order to avoid actually parsing in the scanner, we don't make any attempt to create AST nodes or otherwise rewrite the string literal; instead, we just pass the quote characters themselves through to the parser, delimiting the sequence of string literal pieces:
["] { yy_push_state(SC_STRING); return '"'; }
<SC_STRING>["] { yy_pop_state(); return '"'; }
A very similar set of rules is used for interpolation markers:
<*>"[" { yy_push_state(SC_PROGRAM); return '['; }
<INITIAL>"]" { return ']'; }
<*>"]" { yy_pop_state(); return ']'; }
The second rule above avoids popping the start condition stack if it is empty (as it will be in the INITIAL state). It's not necessary to issue an error message in the scanner; we can just pass the unmatched close bracket through to the parser, which will then do whatever error recovery seems necessary.
To finish off the SC_STRING state, we need to return tokens for pieces of the string, possibly including escape sequences:
<SC_STRING>{
  [^[\\"]+           { yylval.str = strdup(yytext); return T_STRING; }
  \\n                { yylval.chr = '\n'; return T_CHAR; }
  \\t                { yylval.chr = '\t'; return T_CHAR; }
  /* ... Etc. */
  \\x[[:xdigit:]]{2} { /* skip the leading \x before converting */
                       yylval.chr = strtoul(yytext + 2, NULL, 16);
                       return T_CHAR; }
  \\.                { yylval.chr = yytext[1]; return T_CHAR; }
}
Returning escaped characters like that to the parser is probably not the best strategy; normally I would use an internal scanner buffer to accumulate the entire string. But it was simple for illustrative purposes. (Some error handling is omitted here; there are various corner cases, including newline handling and the annoying case where the last character in the program is a backslash inside an unterminated string literal.)
In the parser, we just need to insert a concatenation node for interpolated strings. The only complication is that we don't want to insert such a node for the common case of a string literal without any interpolations, so we use two syntax productions, one for a string with exactly one contained piece, and one for a string with two or more pieces:
string : '"' piece '"'            { $$ = $2; }
       | '"' piece piece_list '"' { $$ = make_concat_node(
                                        prepend_to_list($2, $3));
                                  }
piece  : T_STRING                 { $$ = make_literal_node($1); }
       | '[' expr ']'             { $$ = $2; }
piece_list
       : piece                    { $$ = new_list($1); }
       | piece_list piece         { $$ = append_to_list($1, $2); }
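To make the token stream concrete, here is a rough Python sketch of the same scanning idea. It is not the flex scanner above: the function tokenize, the mode names and the ID token are assumptions made for illustration, and escape handling is reduced to "take the next character literally". A list plays the role of the start condition stack, so a quote or bracket pushes or pops a mode exactly as described.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")
TEXT = re.compile(r'[^\[\\"]+')  # same character class as the SC_STRING text rule


def tokenize(src):
    """Split source into (kind, value) tokens, tracking nesting with a mode stack."""
    stack = ["INITIAL"]  # stands in for flex's start condition stack
    pos, out = 0, []
    while pos < len(src):
        ch = src[pos]
        mode = stack[-1]
        if ch == '"':
            # A quote closes a string in STRING mode and opens one otherwise.
            if mode == "STRING":
                stack.pop()
            else:
                stack.append("STRING")
            out.append(('"', ch))
            pos += 1
        elif ch == "[":
            # An interpolation (or subscript) always pushes program mode.
            stack.append("PROGRAM")
            out.append(("[", ch))
            pos += 1
        elif mode == "STRING":
            if ch == "\\":
                if pos + 1 >= len(src):
                    raise ValueError("backslash at end of input")
                out.append(("T_CHAR", src[pos + 1]))  # simplified escape handling
                pos += 2
            else:
                m = TEXT.match(src, pos)
                out.append(("T_STRING", m.group(0)))
                pos = m.end()
        elif ch == "]":
            # Never pop the bottom entry; the parser can report the stray bracket.
            if len(stack) > 1:
                stack.pop()
            out.append(("]", ch))
            pos += 1
        elif ch.isspace():
            pos += 1
        else:
            m = IDENT.match(src, pos)
            if not m:
                raise ValueError(f"unexpected character {ch!r} at {pos}")
            out.append(("ID", m.group(0)))
            pos = m.end()
    return out


for token in tokenize('print "this is the [result]. yay!"'):
    print(token)
```

For the example from the question, the literal comes out as a quote, a T_STRING piece, the bracketed identifier, a second T_STRING piece and the closing quote, which is the shape the string and piece productions expect.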

Macros and string interpolation (Julia)

Let's say I make this simple string macro
macro e_str(s)
return string("I touched this: ",s)
end
If I apply it to a string with interpolation, I obtain:
julia> e"foobar $(log(2))"
"I touched this: foobar \$(log(2))"
Whereas I would like to obtain:
julia> e"foobar $(log(2))"
"I touched this: foobar 0.6931471805599453"
What changes do I have to make to my macro declaration?
It's better to parse the string at compile-time than to delegate to Julia. Basically, put the string into an IOBuffer, scan the string for $ signs, and use the parse function whenever they come up.
macro e_str(s)
    components = []
    buf = IOBuffer(s)
    while !eof(buf)
        push!(components, rstrip(readuntil(buf, '$'), '$'))
        if !eof(buf)
            push!(components, parse(buf; greedy=false))
        end
    end
    quote
        string($(map(esc, components)...))
    end
end
This doesn't work with escaped $ characters, but that can be resolved with some minor changes to handle \ also. I have included a basic example at the bottom of this post.
I wrote it this way because string macros are generally not for emulating Julia strings; regular macros with regular string literals are better for that purpose. So writing up the parsing yourself isn't that bad, especially because it allows customized extensions. If you really want parsing to be identical to how Julia parses it, you could escape the string and then reparse it, as @MattB suggested:
macro e_str(s)
esc(parse("\"$(escape_string(s))\""))
end
The resulting expression is a :string expression which you could dump and inspect, and then analyse the usual way.
String macros do not come with built-in interpolation facilities. However, it is possible to manually implement this functionality. Note that it is not possible to embed without escaping string literals that have the same delimiter as the surrounding string macro; that is, although """ $("x") """ is possible, " $("x") " is not. Instead, this must be escaped as " $(\"x\") ".
There are two approaches to implementing interpolation manually: implement parsing manually, or get Julia to do the parsing. The first approach is more flexible, but the second approach is easier.
Manual parsing
macro interp_str(s)
    components = []
    buf = IOBuffer(s)
    while !eof(buf)
        push!(components, rstrip(readuntil(buf, '$'), '$'))
        if !eof(buf)
            push!(components, parse(buf; greedy=false))
        end
    end
    quote
        string($(map(esc, components)...))
    end
end
Julia parsing
macro e_str(s)
esc(parse("\"$(escape_string(s))\""))
end
This method escapes the string (but note that escape_string does not escape the $ signs) and passes it back to Julia's parser to parse. Escaping the string is necessary to ensure that " and \ do not affect the string's parsing. The resulting expression is a :string expression, which can be examined and decomposed for macro purposes.

Remove first space if string contains exactly 2 spaces

I'm having issues when trying to remove the first space of a string if that string has 2 spaces in it. For example it should be turning "Fully Functional Method" into "FullyFunctional Method", but "Functional Method" should not be changed because it only has 1 space. I can't really think of a way to remove first space if the string contains 2 spaces.
I don't know exactly what you want to do, but you could look into RegExp and String.replace() to replace parts of a String.
Here is another link to understand the Characters, metacharacters, and metasequences.
var myPattern1:RegExp = /  /g;
var str1:String = "This  is  a  string  that  contains  double  spaces.";
trace(str1.replace(myPattern1, " "));
//this replaces all "  " by " "...
//outputs : This is a string that contains double spaces.
Or in your case (I suppose) something like this
var myPattern2:RegExp = / /;
var str2:String = "Fully Functional Method";
trace(str2.replace(myPattern2, ""));
//If you omit the g, only the first space will be replaced by ""
//outputs : FullyFunctional Method
There are so many things you can do with RegExp that I will not explain them all here.
Just check the Adobe website.
This is a quick and efficient way to work on Strings.
I hope this will help.
Once you have checked those links, you will understand that my example is rough and needs to be modified to give you a FullyFunctional Method. :D
Do a linear scan through the string. Count the number of spaces and record the index of the first space, if any. If there are two spaces, return a string that is the concatenation of the characters up to but not including the first space, and the characters after the first space.
Keep it simple. It is possible to solve your problem with regex, but keep in mind that the worst case time complexity of finding a particular character in an unsorted set is always going to be O(N), so it won't be faster.
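The linear scan described above can be sketched as follows. The thread is about ActionScript; this is a Python illustration of the same steps, and the function name remove_first_space_if_two is made up for the example.

```python
def remove_first_space_if_two(s: str) -> str:
    """Drop the first space of s only when s contains exactly two spaces."""
    first = -1   # index of the first space, if any
    count = 0    # number of spaces seen so far
    for i, ch in enumerate(s):
        if ch == " ":
            if count == 0:
                first = i
            count += 1
    if count == 2:
        # Everything before the first space, then everything after it.
        return s[:first] + s[first + 1:]
    return s


print(remove_first_space_if_two("Fully Functional Method"))
print(remove_first_space_if_two("Functional Method"))
```

The string is walked once, so the cost is linear in its length, and strings with one space or with three or more are returned unchanged.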

Multiline string literal in Matlab?

Is there a multiline string literal syntax in Matlab or is it necessary to concatenate multiple lines?
I found the verbatim package, but it only works in an m-file or function and not interactively within editor cells.
EDIT: I am particularly after readability and ease of modifying the literal in the code (imagine it contains indented blocks of different levels) - it is easy to make multiline strings, but I am looking for the most convenient syntax for doing that.
So far I have
t = {...
'abc'...
'def'};
t = cellfun(@(x) [x sprintf('\n')],t,'Unif',false);
t = horzcat(t{:});
which gives size(t) = 1 8, but is obviously a bit of a mess.
EDIT 2: Basically verbatim does what I want except it doesn't work in Editor cells, but maybe my best bet is to update it so it does. I think it should be possible to get current open file and cursor position from the java interface to the Editor. The problem would be if there were multiple verbatim calls in the same cell how would you distinguish between them.
I'd go for:
multiline = sprintf([ ...
'Line 1\n'...
'Line 2\n'...
]);
Matlab is an oddball in that escape processing in strings is a function of the printf family of functions instead of the string literal syntax. And no multiline literals. Oh well.
I've ended up doing two things. First, make CR() and LF() functions that just return processed \r and \n respectively, so you can use them as pseudo-literals in your code. I prefer doing it this way rather than sending entire strings through sprintf(), because there might be other backslashes in there you didn't want processed as escape sequences (e.g. if some of your strings came from function arguments or input read from elsewhere).
function out = CR()
out = char(13); % # sprintf('\r')
function out = LF()
out = char(10); % # sprintf('\n');
Second, make a join(glue, strs) function that works like Perl's join or the cellfun/horzcat code in your example, but without the final trailing separator.
function out = join(glue, strs)
strs = strs(:)';
strs(2,:) = {glue};
strs = strs(:)';
strs(end) = [];
out = cat(2, strs{:});
And then use it with cell literals like you do.
str = join(LF, {
'abc'
'defghi'
'jklm'
});
You don't need the "..." ellipses in cell literals like this; omitting them does a vertical vector construction, and it's fine if the rows have different lengths of char strings because they're each getting stuck inside a cell. That alone should save you some typing.
Bit of an old thread but I got this
multiline = join([
"Line 1"
"Line 2"
], newline)
I think it makes things pretty easy but obviously it depends on what one is looking for :)

Modifying a character in a string in Lua

Is there any way to replace a character at position N in a string in Lua.
This is what I've come up with so far:
function replace_char(pos, str, r)
return str:sub(pos, pos - 1) .. r .. str:sub(pos + 1, str:len())
end
str = replace_char(2, "aaaaaa", "X")
print(str)
I can't use gsub either as that would replace every capture, not just the capture at position N.
Strings in Lua are immutable. That means, that any solution that replaces text in a string must end up constructing a new string with the desired content. For the specific case of replacing a single character with some other content, you will need to split the original string into a prefix part and a postfix part, and concatenate them back together around the new content.
This variation on your code:
function replace_char(pos, str, r)
return str:sub(1, pos-1) .. r .. str:sub(pos+1)
end
is the most direct translation to straightforward Lua. It is probably fast enough for most purposes. I've fixed the bug that the prefix should be the first pos-1 chars, and taken advantage of the fact that if the last argument to string.sub is missing it is assumed to be -1 which is equivalent to the end of the string.
But do note that it creates a number of temporary strings that will hang around in the string store until garbage collection eats them. The temporaries for the prefix and postfix can't be avoided in any solution. But this also has to create a temporary for the first .. operator to be consumed by the second.
It is possible that one of two alternate approaches could be faster. The first is the solution offered by Paŭlo Ebermann, but with one small tweak:
function replace_char2(pos, str, r)
return ("%s%s%s"):format(str:sub(1,pos-1), r, str:sub(pos+1))
end
This uses string.format to do the assembly of the result in the hopes that it can guess the final buffer size without needing extra temporary objects.
But do beware that string.format is likely to have issues with any \0 characters in any string that it passes through its %s format. Specifically, since it is implemented in terms of standard C's sprintf() function, it would be reasonable to expect it to terminate the substituted string at the first occurrence of \0. (Noted by user Delusional Logic in a comment.)
A third alternative that comes to mind is this:
function replace_char3(pos, str, r)
return table.concat{str:sub(1,pos-1), r, str:sub(pos+1)}
end
table.concat efficiently concatenates a list of strings into a final result. It has an optional second argument which is text to insert between the strings, which defaults to "" which suits our purpose here.
My guess is that unless your strings are huge and you do this substitution frequently, you won't see any practical performance differences between these methods. However, I've been surprised before, so profile your application to verify there is a bottleneck, and benchmark potential solutions carefully.
You should use pos inside your function instead of literal 1 and 3, but apart from this it looks good. Since Lua strings are immutable you can't really do much better than this.
Maybe
("%s%s%s"):format(str:sub(1,pos-1), r, str:sub(pos+1, str:len()))
is more efficient than the .. operator, but I doubt it - if it turns out to be a bottleneck, measure it (and then decide to implement this replacement function in C).
With luajit, you can use the FFI library to cast the string to a list of unsigned chars:
local ffi = require 'ffi'
txt = 'test'
ptr = ffi.cast('uint8_t*', txt)
ptr[1] = string.byte('o')
