Searching in an indexed string - string

The plot
There is a string with a rather complicated format, one that no readable regex can parse. The aim is to extract a specific substring and also get its position in the original string. The substring is only reached after some parsing steps, such as trimming, removing a prefix, and then picking the n-th element. The example below is only meant to show that kind of complexity; the problem itself is general.
For demonstration, see this rudimentary example. Its exact form isn't important; it just needs to produce a fairly complicated parsing model. Obviously there could be more rules, and a simpler model is possible as well.
FirstBlock{Index1, Index2} SecondBlock ThirdBlock
{ FirstBlock {Index1,Index2} SecondBlock}
{FirstBlock SecondBlock ThirdBlock FourthBlock}
I've tried to make these as irregular as possible. The parsing model looks like this:
string text = "{ FirstBlock {Index1,Index2} SecondBlock}";
text = text.Trim();
if (text.First() == '{')
{
    // Drop the outer braces.
    text = text.Substring(1, text.Length - 2);
}
text = text.Trim();
string firstBlock = text.Split(new char[] { ' ', '{' })[0];
text = text.Remove(0, firstBlock.Length).Trim();
string indices = "";
if (text.First() == '{')
{
    // text starts with '{', so element 0 of the split is empty; the indices are element 1.
    indices = text.Split(new char[] { '{', '}' })[1];
    // Also remove the two braces around the indices.
    text = text.Remove(0, indices.Length + 2).Trim();
}
string[] blocks = text.Split(' ');
The easy way
There is an approach that is easy to implement and straightforward, but it sometimes gives the wrong result. You parse the string, get the substring, and then search for it again, for example with string.IndexOf(), to get its position. But if there are two matches, you get the first one, even though that may not be the one you wanted.
My notion
The approach I find quite elegant, though still not perfect, is to index the characters of the string at the start and then parse that, so you end up with the right characters together with their positions. My problem is that you then can't really use the functions the library provides, and I don't know a way around that. Using the snippet above:
List<Tuple<int, char>> indexedText = text
    .Select((ch, index) => new Tuple<int, char>(index, ch))
    .ToList();
With this structure you can still process the string, just without library methods, and you still have the position indices at the end. For example, trimming the start:
indexedText = indexedText
    .SkipWhile(indexedChar => char.IsWhiteSpace(indexedChar.Item2))
    .ToList();
The actual question
An answer can be either a different solution or a way to use library methods with indexed strings. The aim is to get the indices back after parsing a string. There may be a very simple way that I have just overlooked, but I haven't found a proper solution yet. The one solution I don't want is to simplify the parsing system; as I said, it is only there for demonstration.
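One possible direction, sketched in Python rather than C#: instead of pairing every character with its index, carry a small "span" object that holds the untouched source string plus a start and end offset, and give it its own trim and split operations that return new spans. The names `Span`, `whole`, `sub`, and `split` below are hypothetical and chosen for illustration only; this is a minimal sketch under those assumptions, not a library API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A view into `source` that remembers its original offsets."""
    source: str
    start: int
    end: int  # exclusive

    @property
    def text(self) -> str:
        return self.source[self.start:self.end]

    def strip(self) -> Span:
        # Move both ends inward past whitespace; the source is never copied.
        start, end = self.start, self.end
        while start < end and self.source[start].isspace():
            start += 1
        while end > start and self.source[end - 1].isspace():
            end -= 1
        return Span(self.source, start, end)

    def sub(self, rel_start: int, rel_end: int) -> Span:
        # Offsets are relative to this span, like slicing self.text.
        return Span(self.source, self.start + rel_start, self.start + rel_end)

    def split(self, seps: str = " ") -> list[Span]:
        # Split on any character in `seps`, dropping empty pieces.
        pieces, piece_start = [], None
        for i in range(self.start, self.end):
            if self.source[i] in seps:
                if piece_start is not None:
                    pieces.append(Span(self.source, piece_start, i))
                    piece_start = None
            elif piece_start is None:
                piece_start = i
        if piece_start is not None:
            pieces.append(Span(self.source, piece_start, self.end))
        return pieces


def whole(source: str) -> Span:
    return Span(source, 0, len(source))


source = "{ FirstBlock {Index1,Index2} SecondBlock}"
span = whole(source).strip()
if span.text.startswith("{"):
    span = span.sub(1, len(span.text) - 1).strip()  # drop the outer braces
last = span.split(" ")[-1]
print(last.text, last.start, last.end)  # prints: SecondBlock 29 40
```

Because every operation returns offsets into the original string, a repeated token keeps its own position: splitting "A B A" this way yields the third token at offset 4, whereas searching for "A" again would report 0.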

Related

Implementing String Interpolation in Flex/Bison

I'm currently writing an interpreter for a language I have designed.
The lexer/parser (GLR) is written in Flex/Bison and the main interpreter in D, and everything is working flawlessly so far.
The thing is, I also want to add string interpolation, that is, to identify string literals that contain a specific pattern (e.g. "[some expression]") and convert the included expression. I think this should be done at the parser level, from within the corresponding grammar action.
My idea is converting/treating the interpolated string as what it would look like with simple concatenation (as it works right now).
E.g.
print "this is the [result]. yay!"
to
print "this is the " + result + ". yay!"
However, I'm a bit confused as to how I could do that in Bison: basically, how do I tell it to re-parse a specific string (while constructing the main AST)?
Any ideas?
You could reparse the string, if you really wanted to, by generating a reentrant parser. You would probably want a reentrant scanner as well, although I suppose you could kludge something together with a default scanner, using flex's buffer stack. Indeed, it's worth learning how to build reentrant parsers and scanners on the general principle of avoiding unnecessary globals, whether or not you need them for this particular purpose.
But you don't really need to reparse anything; you can do the entire parse in one pass. You just need enough smarts in your scanner so that it knows about nested interpolations.
The basic idea is to let the scanner split the string literal with interpolations into a series of tokens, which can easily be assembled into an appropriate AST by the parser. Since the scanner may return more than one token out of a single string literal, we'll need to introduce a start condition to keep track of whether the scan is currently inside a string literal or not. And since interpolations can, presumably, be nested we'll use flex's optional start condition stack, enabled with %option stack, to keep track of the nested contexts.
So here's a rough sketch.
As mentioned, the scanner has extra start conditions: SC_PROGRAM, the default, which is in effect while the scanner is scanning regular program text, and SC_STRING, in effect while the scanner is scanning a string. SC_PROGRAM is only needed because flex does not provide an official interface to check whether the start condition stack is empty; aside from nesting, it is identical to the INITIAL top-level start condition. The start condition stack is used to keep track of interpolation markers ([ and ] in this example), and it is needed because an interpolated expression might use brackets (as array subscripts, for example) or might even include a nested interpolated string. Since SC_PROGRAM is, with one exception, identical to INITIAL, we'll declare it as an inclusive start condition (%s), while SC_STRING is exclusive (%x).
%option stack
%s SC_PROGRAM
%x SC_STRING
%%
Since we're using a separate start condition to analyse string literals, we can also normalise escape sequences as we parse. Not all applications will want to do this, but it's pretty common. But since that's not really the point of this answer, I've left out most of the details. More interesting is the way that embedded interpolation expressions are handled, particularly deeply nested ones.
The end result will be to turn string literals into a series of tokens, possibly representing a nested structure. In order to avoid actually parsing in the scanner, we don't make any attempt to create AST nodes or otherwise rewrite the string literal; instead, we just pass the quote characters themselves through to the parser, delimiting the sequence of string literal pieces:
["] { yy_push_state(SC_STRING); return '"'; }
<SC_STRING>["] { yy_pop_state(); return '"'; }
A very similar set of rules is used for interpolation markers:
<*>"[" { yy_push_state(SC_PROGRAM); return '['; }
<INITIAL>"]" { return ']'; }
<*>"]" { yy_pop_state(); return ']'; }
The second rule above avoids popping the start condition stack if it is empty (as it will be in the INITIAL state). It's not necessary to issue an error message in the scanner; we can just pass the unmatched close bracket through to the parser, which will then do whatever error recovery seems necessary.
To finish off the SC_STRING state, we need to return tokens for pieces of the string, possibly including escape sequences:
<SC_STRING>{
  [^[\\"]+           { yylval.str = strdup(yytext); return T_STRING; }
  \\n                { yylval.chr = '\n'; return T_CHAR; }
  \\t                { yylval.chr = '\t'; return T_CHAR; }
  /* ... Etc. */
  \\x[[:xdigit:]]{2} { /* skip the leading \x before converting the hex digits */
                       yylval.chr = strtoul(yytext + 2, NULL, 16);
                       return T_CHAR; }
  \\.                { yylval.chr = yytext[1]; return T_CHAR; }
}
Returning escaped characters like that to the parser is probably not the best strategy; normally I would use an internal scanner buffer to accumulate the entire string. But it was simple for illustrative purposes. (Some error handling is omitted here; there are various corner cases, including newline handling and the annoying case where the last character in the program is a backslash inside an unterminated string literal.)
In the parser, we just need to insert a concatenation node for interpolated strings. The only complication is that we don't want to insert such a node for the common case of a string literal without any interpolations, so we use two syntax productions, one for a string with exactly one contained piece, and one for a string with two or more pieces:
string : '"' piece '"'             { $$ = $2; }
       | '"' piece piece_list '"'  { $$ = make_concat_node(
                                            prepend_to_list($2, $3));
                                   }
       ;
piece  : T_STRING                  { $$ = make_literal_node($1); }
       | '[' expr ']'              { $$ = $2; }
       ;
piece_list
       : piece                     { $$ = new_list($1); }
       | piece_list piece          { $$ = append_to_list($1, $2); }
       ;
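To make the one-pass idea concrete outside of flex, here is a small Python sketch of what the scanner effectively produces: a string literal's contents split into literal pieces and bracketed expressions, with a depth counter standing in for the start condition stack. The function name `split_interpolated` and the "str"/"expr" tags are hypothetical. It is deliberately simpler than the scanner above: it does not tokenize the expression, and it does not handle string literals nested inside an interpolation.

```python
def split_interpolated(literal):
    """Split the contents of a string literal into ("str", text) and
    ("expr", source) pieces. A backslash escapes the next character."""
    pieces = []
    buf = []
    i, n = 0, len(literal)
    while i < n:
        ch = literal[i]
        if ch == "\\" and i + 1 < n:
            buf.append(literal[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == "[":
            if buf:
                pieces.append(("str", "".join(buf)))
                buf = []
            # Find the matching close bracket, allowing nested brackets.
            depth = 1
            j = i + 1
            while j < n and depth:
                if literal[j] == "[":
                    depth += 1
                elif literal[j] == "]":
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unterminated interpolation")
            pieces.append(("expr", literal[i + 1:j - 1]))
            i = j
        else:
            buf.append(ch)
            i += 1
    if buf:
        pieces.append(("str", "".join(buf)))
    return pieces


print(split_interpolated("this is the [result]. yay!"))
```

A literal with a single "str" piece corresponds to the first `string` production above; two or more pieces correspond to the concatenation node.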

Parsing a string to find next delimiter - Processing

So the idea here is that I'm taking a .csv into a string and each value needs to be stored into a variable. I am unsure how to properly parse a string to do this.
My idea is a function that looks like
final char delim = ',';
int nextItem(String data, int startFrom) {
  if (data.charAt(startFrom) != delim) {
    return data.charAt(startFrom);
  } else {
    return nextItem(data, startFrom + 1);
  }
}
so if I passed it something like
nextItem("45,621,9", 0);
it would return 45
and if I passed it
nextItem("45,621,9", 3);
it would return 621
I'm not sure if I have that setup properly to be recursive, but I could also use a For loop I suppose, only real stipulation is I can't use the Substring method.
Please don't use recursion for a matter that can be easily done iteratively. Recursion is expensive in terms of stack and calling frames: A very long string could produce a StackOverflowError.
I suggest you take a look at the standard indexOf method of java.lang.String.
A good alternative is Regular Expressions.
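A minimal iterative sketch of that suggestion, written in Python rather than Processing/Java; `str.find` plays the role of `indexOf(ch, fromIndex)` here. The function name `next_item`, the returned (value, next_start) pair, and the assumption that every item is a non-negative integer are illustrative choices, not part of the original answer. Characters are read one at a time, so no substring call is needed.

```python
DELIM = ","


def next_item(data, start_from):
    """Return (value, next_start) for the integer item beginning at start_from."""
    # Locate the delimiter that ends this item; -1 means this is the last item.
    end = data.find(DELIM, start_from)
    if end == -1:
        end = len(data)
    # Build the number digit by digit instead of taking a substring.
    value = 0
    for i in range(start_from, end):
        value = value * 10 + (ord(data[i]) - ord("0"))
    return value, end + 1


print(next_item("45,621,9", 0))  # prints: (45, 3)
print(next_item("45,621,9", 3))  # prints: (621, 7)
```

Returning the position after the delimiter lets the caller walk the whole line in a plain loop, with no recursion involved.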
You can separate the words using the comma ',' as the delimiter.
Code
String[] nextItem(String data) {
  String[] words = data.split(",");
  return words;
}
This will return an array of strings containing the words in your input string. Then you can use the array in any way you need.
Hope it helps ;)
Processing comes with a split() function that does exactly what you're describing.
From the reference:
String men = "Chernenko,Andropov,Brezhnev";
String[] list = split(men, ',');
// list[0] is now "Chernenko", list[1] is "Andropov"...
Behind the scenes it's using the String#split() function like H. Sodi's answer, but you should just use this function instead of defining your own.

Pyparsing - matching the outermost set of nested brackets

I'm trying to use pyparsing to build a parser that will match on all text within an arbitrarily nested set of brackets. If we consider a string like this:
"[A,[B,C],[D,E,F],G] Random Middle text [H,I,J]"
What I would like is for a parser to match in a way that it returns two matches:
[
"[A,[B,C],[D,E,F],G]",
"[H,I,J]"
]
I was able to accomplish a somewhat-working version of this using a barrage of originalTextFor mashed up with nestedExpr, but this breaks when your nesting is deeper than the number of OriginalTextFor expressions.
Is there a straightforward way to only match on the outermost expression grabbed by nestedExpr, or a way to modify its logic so that everything after the first paired match is treated as plaintext rather than being parsed?
update: One thing that seems to come close to what I want to accomplish is this modified version of the logic from nestedExpr:
def mynest(opener='{', closer='}'):
    content = (empty.copy()+CharsNotIn(opener+closer+ParserElement.DEFAULT_WHITE_CHARS))
    ret = Forward()
    ret <<= ( Suppress(opener) + originalTextFor(ZeroOrMore( ret | content )) + Suppress(closer) )
    return ret
This gets me most of the way there, although there's an extra level of list wrapping in there that I really don't need, and what I'd really like is for those brackets to be included in the string (without getting into an infinite recursion situation by not suppressing them).
parser = mynest("[","]")
result = parser.searchString("[A,[B,C],[D,E,F],G] Random Middle text [H,I,J]")
result.asList()
>>> [['A,[B,C],[D,E,F],G'], ['H,I,J']]
I know I could strip these out with a simple list comprehension, but it would be ideal if I could just eliminate that second, redundant level.
Not sure why this wouldn't work:
sample = "[A,[B,C],[D,E,F],G] Random Middle text [H,I,J]"
scanner = originalTextFor(nestedExpr('[',']'))
for match in scanner.searchString(sample):
    print(match[0])
prints:
'[A,[B,C],[D,E,F],G]'
'[H,I,J]'
What is the situation where "this breaks when your nesting is deeper than the number of OriginalTextFor expressions"?
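For comparison, the same outermost-bracket matching can be sketched without pyparsing, using a plain depth counter. This is not how nestedExpr works internally; it is a dependency-free illustration, and the function name `outermost_brackets` is hypothetical. It also returns each match's start offset, and it ignores any unmatched closing bracket.

```python
def outermost_brackets(text, opener="[", closer="]"):
    """Return (start, matched_text) for each outermost bracketed group."""
    matches = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == opener:
            if depth == 0:
                start = i  # a new outermost group begins here
            depth += 1
        elif ch == closer and depth > 0:
            depth -= 1
            if depth == 0:
                matches.append((start, text[start:i + 1]))
    return matches


sample = "[A,[B,C],[D,E,F],G] Random Middle text [H,I,J]"
for start, match in outermost_brackets(sample):
    print(start, match)
```

Nesting depth is unbounded here, since only a counter is kept rather than one expression per level.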

Is there an equivalent to the string function String(format: ...) using Swift formatting

I'm starting to like Swift string formatting, since it uses variable names in the string rather than ambiguous formatting tags like "%@".
I want to load a large string from a file that has Swift-style formatting in it (like this)
Now is the time for all good \(who) to come to babble incoherently.
Then I want to feed the contents of that String variable into a statement that lets me replace
\(who)
with the contents of the constant/variable who at runtime.
The code below works with a string constant as the formatting string.
let who = "programmers"
let aString = "Now is the time for all good \(who) to come to babble incoherently."
That code does formatting of a quoted string that appears in-line in my code.
Instead I want something like the code
let formatString = "Now is the time for all good %@ to come to babble incoherently."
aString = String(format: formatString, who)
But where I can pass in a Swift-style format string in a constant/variable I read from a file.
Is that possible? I didn't have any luck searching for it since I wasn't exactly sure what search terms to use.
I can always use C-style string formatting and the String class' initWithFormat method if I have to...
I don't think there's a way to do this. String interpolation is implemented via conforming to the StringInterpolationConvertible protocol, and presumably you're hoping to tap into that in the same way you can tap into the methods required by StringLiteralConvertible, a la:
let someString = toString(42)
// this is the method String implements to conform to StringLiteralConvertible
let anotherString = String(stringLiteral: someString)
// anotherString will be "42"
print(anotherString)
Unfortunately, you can't do quite the same trick with StringInterpolationConvertible. Seeing how the protocol works may help:
struct MyString: Printable {
    let actualString: String
    var description: String { return actualString }
}
extension MyString: StringInterpolationConvertible {
    // first, this will get called for each "segment"
    init<T>(stringInterpolationSegment expr: T) {
        println("Processing segment: " + toString(expr))
        actualString = toString(expr)
    }
    // here is a type-specific override for Int, that converts
    // small numbers into words:
    init(stringInterpolationSegment expr: Int) {
        if (0..<4).contains(expr) {
            println("Embiggening \(expr)")
            let numbers = ["zero","one","two","three"]
            actualString = numbers[expr]
        }
        else {
            println("Processing segment: " + toString(expr))
            actualString = toString(expr)
        }
    }
    // finally, this gets called with an array of all of the
    // converted segments
    init(stringInterpolation strings: MyString...) {
        // strings will be a bunch of MyString objects
        actualString = "".join(strings.map { $0.actualString })
    }
}
let number = 3
let aString: MyString = "Then shalt thou count to \(number), no more, no less."
println(aString)
// prints "Then shalt thou count to three, no more, no less."
So, while you can call String.init(stringInterpolation:) and String.init(stringInterpolationSegment:) directly yourself if you want (just try String(stringInterpolationSegment: 3.141) and String(stringInterpolation: "blah", "blah")), this doesn't really help you much. What you really need is a facade function that coordinates the calls to them. And unless there's a handy pre-existing function in the standard library that does exactly that which I've missed, I think you're out of luck. I suspect it's built into the compiler.
You could maybe write your own to achieve your goal, but it would take a lot of effort, since you'd have to manually break the string you want to interpolate into pieces and handle it yourself, calling the segment init in a loop. You'd also hit problems calling the combining function, since you can't splat an array into a variadic function call.
I don't think so. The compiler needs to be able to resolve the interpolated variable at compile time.
I'm not a Swift programmer specifically, but I think you can work around it and get pretty close to what you want using a Dictionary and standard string splitting and replacing methods:
var replacement = [String: String]()
replacement["who"] = "programmers"
With that in place, you can find each occurrence of "\(", read what follows up to the next ")" (this post can help with the splitting part, and this one with the replacing part), look that name up in the dictionary, and rebuild your string from the pieces you get.
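The dictionary approach can be sketched as follows, in Python rather than Swift. The function name `render` is hypothetical, and two assumptions are made that the answer does not state: placeholder names consist only of word characters, and unknown names are left in the text unchanged.

```python
import re

# Matches a literal backslash, "(", a name made of word characters, then ")".
_PLACEHOLDER = re.compile(r"\\\((\w+)\)")


def render(template, values):
    """Replace each \\(name) in `template` with values[name]."""
    def lookup(match):
        name = match.group(1)
        # Assumption: leave unknown placeholders as they are.
        return str(values[name]) if name in values else match.group(0)
    return _PLACEHOLDER.sub(lookup, template)


template = r"Now is the time for all good \(who) to come to babble incoherently."
print(render(template, {"who": "programmers"}))
```

Unlike compiler-level interpolation, this only substitutes named values; an arbitrary expression inside the parentheses would need a real parser.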
This one works like a charm:
let who = "programmers"
let formatString = "Now is the time for all good %@ to come to babble incoherently."
let aString = String(format: formatString, who)

repeat string with LINQ/extensions methods [duplicate]

Just a curiosity I was investigating.
The matter: simply repeating (multiplying, someone would say) a string/character n times.
I know there is Enumerable.Repeat for this aim, but I was trying to do this without it.
LINQ in this case seems pretty useless, because in a query like
from X in "s" select X
the string "s" is enumerated, so X is a char. The same goes for extension methods: for example, "s".Aggregate(blablabla) would again work on just the character 's', not on the string itself. To repeat the string, something "external" would be needed, so I thought of lambdas and delegates, but that can't be done without declaring a variable to assign the delegate/lambda expression to.
So something like defining a function and calling it inline:
( (a)=>{return " "+a;} )("a");
or
delegate(string a){return " "+a}(" ");
would give a "without name" error (and so no recursion, AFAIK, even by passing a possible lambda/delegate as a parameter), and in the end can't even be written in C# because of its limitations.
Maybe I'm looking at this from the wrong perspective. Any ideas?
This is just an experiment; I don't care about performance or memory use, only that it is one line and more or less self-contained. Maybe one could do something with Copy/CopyTo, or by casting to some other collection, I don't know. Reflection is accepted too.
To repeat a character n-times you would not use Enumerable.Repeat but just this string constructor:
string str = new string('X', 10);
To repeat a string, I don't know anything better than using string.Join and Enumerable.Repeat:
string foo = "Foo";
string str = string.Join("", Enumerable.Repeat(foo, 10));
edit: you could use string.Concat instead if you need no separator:
string str = string.Concat( Enumerable.Repeat(foo, 10) );
If you're trying to repeat a string, rather than a character, a simple way would be to use the StringBuilder.Insert method, which takes an insertion index and a count for the number of repetitions to use:
var sb = new StringBuilder();
sb.Insert(0, "hi!", 5);
Console.WriteLine(sb.ToString());
Otherwise, to repeat a single character, use the string constructor as I've mentioned in the comments for the similar question here. For example:
string result = new String('-', 5); // -----
For the sake of completeness, it's worth noting that StringBuilder provides an overloaded Append method that can repeat a character, but has no such overload for strings (which is where the Insert method comes in). I would prefer the string constructor to the StringBuilder if that's all I was interested in doing. However, if I was already working with a StringBuilder, it might make sense to use the Append method to benefit from some chaining. Here's a contrived example to demonstrate:
var sb = new StringBuilder("This item is ");
sb.Insert(sb.Length, "very ", 2) // insert at the end to append
  .Append('*', 3)
  .Append("special")
  .Append('*', 3);
Console.WriteLine(sb.ToString()); // This item is very very ***special***

Resources