How to split strings separated by commas with escapes? - string

I have a string that looks like this:
(The whole code block is the string, i.e. the string itself contains the quotation marks.)
"he\"llo", "world\n", "fro,m"
[update] In other words, the "actual" string is this:
"\"he\\\"llo\", \"world\\n\", \"fro,m\""
I want to get an array of strings like this:
[ "\"he\\\"llo\"", "\"world\\n\"", "\"fro,m\"" ]
[update] Commas inside quotation marks should be preserved.
In my opinion, there are several ways to solve this:
build an automaton (DFA or NFA) for this syntax
use several state flags such as inQuote and handle the logic with lots of if/else
write a complex but clever regular expression
Is there a general solution to this problem? Or how should I actually implement one of the approaches above?
P.S. It would be even better if syntax errors such as an "unclosed quotation mark" could be detected.

You need to first define your grammar. This is a simple grammar for your case:
document = *WS [string *WS *(',' *WS string *WS)]
string = %x22 *char %x22
char = %x20-21 / %x23-5B / escape / %x5D-10FFFF
escape = %x5C (%x5C / %x22 / 't' / 'n' / 'r')
WS = %x9 / %x20
You can read it as:
A document may begin and end with white space, and optionally contains one or more strings separated by commas. There may be white space before and after each comma.
A string is made of characters and begins and ends with a double quote (Unicode/ASCII hex code 22).
Each character (char) may be: 1) a non-control Unicode character before the double quote, i.e. hex 20 (space) or hex 21 (exclamation mark); 2) any character after the double quote and before the backslash \ (hex 5C); 3) an escape sequence; 4) any other Unicode character after the backslash (hex 5C).
The escape sequence (rule escape) begins with the backslash \ and is followed by another backslash, a double quote, or one of the characters t (tab), n (line feed), or r (carriage return). You may add other escapable characters if you want; for the C++ string syntax, see https://en.cppreference.com/w/cpp/language/escape .
A white space (WS) is a tab or a space; you may also add %xA and %xD for line feed and carriage return respectively.
With this grammar you get a syntax tree for your input (the original answer showed it as a screenshot).
The screenshot is from the Tunnel Grammar Studio online laboratory, which can run ABNF grammars such as the one above and which I work on.
After you have the grammar, you may use tools to generate a parser, or you can write one yourself. If you want to do it by hand (preferable for such a small and simple grammar), you can write one function per grammar rule that reads one character at a time and checks whether it is the expected one. If the input ends while you are parsing the string rule, the input contains a string that was opened but never closed.
The syntax tree for your actual string looks similar (also shown as a screenshot in the original answer).
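The hand-written approach described above can be sketched in Python. This is only an illustration under assumptions, not the answer author's code: the names split_quoted and ParseError are made up, each grammar rule becomes a small helper, and the items are returned raw (quotes and escapes kept) as the question asks.

```python
class ParseError(ValueError):
    """Raised when the input does not match the grammar."""

WS = " \t"            # rule WS
ESCAPABLE = '\\"tnr'  # characters allowed after a backslash (rule escape)

def split_quoted(text):
    """Return the quoted strings found in text, quotes and escapes included."""
    pos = 0
    items = []

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos] in WS:
            pos += 1

    def parse_string():  # rule string
        nonlocal pos
        start = pos
        if pos >= len(text) or text[pos] != '"':
            raise ParseError(f"expected opening quote at offset {pos}")
        pos += 1
        while pos < len(text):
            ch = text[pos]
            if ch == '"':
                pos += 1
                return text[start:pos]
            if ch == "\\":
                if pos + 1 >= len(text) or text[pos + 1] not in ESCAPABLE:
                    raise ParseError(f"invalid escape at offset {pos}")
                pos += 2
            elif ch < " ":
                raise ParseError(f"control character at offset {pos}")
            else:
                pos += 1
        raise ParseError(f"unclosed quotation mark opened at offset {start}")

    # rule document
    skip_ws()
    if pos < len(text):
        items.append(parse_string())
        skip_ws()
        while pos < len(text):
            if text[pos] != ",":
                raise ParseError(f"expected ',' at offset {pos}")
            pos += 1
            skip_ws()
            items.append(parse_string())
            skip_ws()
    return items

print(split_quoted('"he\\"llo", "world\\n", "fro,m"'))
```

Reaching the end of the input inside parse_string is exactly the "started but not finished string" case, so it is reported as its own error.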

Related

Replace line in text containing special characters (mathematical equation) linux text

I want to replace a line that represents part of a mathematical equation:
f(x,z,time,temp)=-(2.0)/(exp(128*((x-2.5*time)*(x-2.5*time)+(z-0.2)*(z-0.2))))+(
with a new line similar to the one above. Both the new and the old line are stored in bash variables.
The main problem is that the equation is full of special characters that prevent a proper search and replace in bash, even when I use a delimiter character that does not occur in the equation.
I used
sed -n "s|$OLD|$NEW|g" restart.k
and
sed -i "s|$OLD|$NEW|g" restart.k
but I always get wrong results.
Any idea how to solve this?
The only character in your pattern that needs escaping for sed is * (the . is also special, but it matches any character, including a literal dot), so escape it and do the replacement as usual:
sed "s:$(sed 's:[*]:\\&:g' <<<"$old"):$new:" infile
If your real data contains more special characters, add them inside the bracket expression []; a few characters need special placement:
^ can be placed anywhere in [] except first, because a leading ^ negates the bracket expression.
] must be the first character, because anywhere else it ends the bracket expression.
- must be the first or last character, because anywhere else it defines a range of characters.
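If sed itself is not a hard requirement, plain (non-regex) replacement avoids the escaping question altogether. The following is a minimal sketch in Python, which is a different technique from the sed command above; the function name replace_literal and the sample strings are made up for illustration.

```python
def replace_literal(text: str, old: str, new: str) -> str:
    """Replace every occurrence of old with new, treating both as plain text."""
    # str.replace does no pattern matching, so *, (, ), + and . need no escaping.
    return text.replace(old, new)

old = "f(x,z,time,temp)=-(2.0)/(exp(128*((x-2.5*time)*(x-2.5*time)+(z-0.2)*(z-0.2))))+("
new = "f(x,z,time,temp)=-(4.0)/(exp(64*((x-2.5*time)*(x-2.5*time)+(z-0.2)*(z-0.2))))+("
sample = "header\n" + old + "\nfooter\n"

print(replace_literal(sample, old, new))
```

To apply this to a file such as restart.k, read its contents, pass them through replace_literal, and write the result back.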

How to break a string over multiple lines and preserve spaces in YAML?

Please note that this question is similar to this one, but different enough that those answers won't solve my problem:
To insert control characters such as \x08, it seems that I have to use double quotes ".
All spaces need to be preserved exactly as given. For line breaks I explicitly use \n.
I have some string data which I need to store in YAML, e.g.:
" This is my quite long string data "
"This is my quite long string data"
"This_is_my_quite_long_string_data"
"Sting data\nwhich\x08contains control characters"
and need it in YAML as something like this:
Key: " This is my" +
" quite long " +
" string data "
This is no problem as long as I stay on a single line, but I don't know how to spread the string content over multiple lines.
YAML block scalar styles (>, |) won't help here, because they don't allow escaping and they also do some whitespace stripping and newline/space substitution, which is useless in my case.
It looks like the only way is to use double quotes " and backslashes \, like this:
Key: "\
This is \
my quite \
long string data\
"
Trying this in a YAML online parser results in "This is my quite long string data", as expected.
But unfortunately it fails if one of the "sub-lines" has a leading space, like this:
Key: "\
This is \
my quite\
 long st\
ring data\
"
This results in "This is my quitelong string data": the space between the words quite and long has been removed. The only solution that comes to mind is to replace the first leading space of each sub-line with \x20, like this:
Key: "\
This is \
my quite\
\x20long st\
ring data\
"
Since I chose YAML to get the most human-readable format possible, I find \x20 a rather ugly solution. Maybe someone knows a better approach?
To keep it human-readable, I also don't want to use !!binary for this.
Instead of \x20, you can simply escape the first non-indentation space on the line:
Key: "\
This is \
my quite\
\ long st\
ring data\
"
This also works with multiple spaces; you only need to escape the first one.
You are right in your observation that control characters can only be represented in double quoted scalars.
However, the parser doesn't fail when the sub-lines (in YAML speak: continuation lines) have a leading space. It is your interpretation of the YAML standard that is incorrect. The standard explicitly states for multi-line double quoted scalars:
All leading and trailing white space characters are excluded from the content.
So you can put as many spaces as you want before long; it will not make a difference.
The representer for double quoted scalars in Python (both in ruamel.yaml and PyYAML) always represents newlines as \n. I am not aware of YAML representers in other languages that give you more control over this (e.g. emitting double newlines to represent \n in your double quoted scalars). So you will probably have to write your own representer.
While writing a representer, you can try to make the line breaking smart, so that it minimizes the number of escaped spaces (by keeping spaces between words on the same line). But especially for strings with a high ratio of double spaces to words, combined with a small width to operate in, it will be hard (if not impossible) to do without escaped spaces.
Such a representer should IMO first check whether double quoting is necessary (i.e. whether there are control characters apart from newlines). If not, and there are newlines, you are probably better off representing the string as a block style literal scalar (in which spaces at the beginning or end of a line are not excluded).
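To make the idea of a custom representer more concrete, here is a rough, standalone sketch in plain Python. It is not hooked into PyYAML or ruamel.yaml. The function name fold_double_quoted and its width and indent parameters are hypothetical, the line breaking is naive (fixed width, no attempt to be "smart"), only ASCII control characters are escaped, and the key is emitted as given.

```python
def fold_double_quoted(key, value, width=8, indent="  "):
    """Render `key: "..."` as a multi-line double quoted YAML scalar."""

    def esc(ch):
        if ch in '\\"':
            return "\\" + ch                # backslash and double quote
        if ch == "\n":
            return "\\n"
        if ord(ch) < 0x20 or ord(ch) == 0x7F:
            return "\\x%02x" % ord(ch)      # other ASCII control characters
        return ch

    # Cut the escaped text into chunks of at most `width` characters,
    # never splitting inside an escape sequence.
    chunks, cur = [], ""
    for unit in map(esc, value):
        if cur and len(cur) + len(unit) > width:
            chunks.append(cur)
            cur = ""
        cur += unit
    if cur:
        chunks.append(cur)

    lines = ['%s: "\\' % key]
    for chunk in chunks:
        if chunk.startswith(" "):
            chunk = "\\" + chunk            # escape the first leading space
        lines.append(indent + chunk + "\\") # trailing backslash escapes the line break
    lines.append(indent + '"')
    return "\n".join(lines)

print(fold_double_quoted("Key", "This is my quite long string data"))
```

With the default width of 8, the third continuation line comes out as `\ long st\`, the same form as in the first answer. Loading the output with a YAML parser should give back the original string, but that round trip is not checked here.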

ANTLR: How to write a rule for enforcing line continuation character while writing a string?

I want to write a rule for parsing a string inside double quotes. I want to allow any character, with the only condition being that there MUST be a line continuation character \, when splitting the string on multiple lines.
Example:
variable = "first line \n second line \
still second line \n \
third line"
If the line continuation character is not found before a newline character is found, I want the parser to barf.
My current rule is this:
STRING : '"' (ESC|.)*? '"';
fragment ESC : '\\' [btnr"\\] ;
So I am allowing the string to contain any character, including a bunch of escape sequences. But I am not really enforcing that the line continuation character \ is required when splitting the text across lines.
How can I make the grammar enforce that rule?
Even though there is already an accepted answer, let me put in my 2 cents. I strongly recommend not handling this type of error in a lexer rule. The reason is that you will not be able to give the user a good error message. First, lexer errors are usually not reported separately in ANTLR4; they appear as follow-up parser errors. Second, the resulting error (likely something like "no viable alt at \n") is anything but helpful.
The better solution is to accept both variants (line break with or without escape) and do a semantic check afterwards. Then you know exactly what is wrong and can tell the user what you really expected.
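Such a semantic check could look roughly like the sketch below, written in Python for illustration rather than as an ANTLR listener. The function name find_unescaped_newlines is made up, the input is assumed to be the raw text of a STRING token, and only LF line endings are considered.

```python
def find_unescaped_newlines(token_text):
    """Return the offsets of line breaks that lack a preceding continuation backslash."""
    offsets = []
    i = 0
    while i < len(token_text):
        ch = token_text[i]
        if ch == "\\" and i + 1 < len(token_text):
            i += 2  # skip the whole escape pair, including backslash + newline
            continue
        if ch == "\n":
            offsets.append(i)
        i += 1
    return offsets

good = '"first line \\n second line \\\nstill second line \\n \\\nthird line"'
bad = '"first\nsecond"'
print(find_unescaped_newlines(good))
print(find_unescaped_newlines(bad))
```

Because the check runs after lexing, each offset can be turned into a precise message such as "line break inside string without a preceding \".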
Solution
fragment ESCAPE
: '\\' .
;
STRING
: '"' (ESCAPE | ~[\n"])* '"'
;
Explanation
The fragment ESCAPE matches escaped characters (in particular a backslash followed by a new line character, which acts as the continuation sign).
Token STRING will match inside double quotation marks:
Escaped characters (fragment ESCAPE)
Everything except new line and double quotation marks.

Ignore escape characters (backslashes) in R strings

While running an R-plugin in SPSS, I receive a Windows path string as input e.g.
'C:\Users\mhermans\somefile.csv'
I would like to use that path in subsequent R code, but the backslashes first need to be replaced with forward slashes; otherwise R interprets them as escapes (e.g. "\U used without hex digits" errors).
However, I have not been able to find a function that can replace the backslashes with forward slashes or double-escape them. All those functions assume the characters are already escaped.
So, is there something along the lines of:
>gsub('\\', '/', 'C:\Users\mhermans')
C:/Users/mhermans
You can try to use the 'allowEscapes' argument in scan()
X=scan(what="character",allowEscapes=F)
C:\Users\mhermans\somefile.csv
print(X)
[1] "C:\\Users\\mhermans\\somefile.csv"
As of version 4.0, introduced in April 2020, R provides a syntax for specifying raw strings. The string in the example can be written as:
path <- r"(C:\Users\mhermans\somefile.csv)"
From ?Quotes:
Raw character constants are also available using a syntax similar to the one used in C++: r"(...)" with ... any character sequence, except that it must not contain the closing sequence )". The delimiter pairs [] and {} can also be used, and R can be used in place of r. For additional flexibility, a number of dashes can be placed between the opening quote and the opening delimiter, as long as the same number of dashes appear between the closing delimiter and the closing quote.
First you need to get it assigned to a name:
pathname <- 'C:\\Users\\mhermans\\somefile.csv'
Notice that in order to assign it to a name you needed to double all the backslashes, which gives a hint about how you could use regex. Actually, if you read it in from a text file, R will do all the doubling for you. Mind you, it is not really doubling the backslashes: each one is stored as a single backslash, but it is displayed doubled and needs to be entered doubled at the console. Otherwise the R interpreter tries (and often fails) to turn it into a special character. To compound the problem, regex uses the backslash as an escape as well. So to detect a backslash with grep, sub, or gsub, you need to quadruple the backslashes:
gsub("\\\\", "/", pathname)
# [1] "C:/Users/mhermans/somefile.csv"
You needed to doubly "double" the backslashes. The first \ of each pair signals to the regex engine that what comes next is a literal.
Consider:
nchar("\\A")
# returns `[1] 2`
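The same two layers of escaping (string literal first, then regex) exist in many languages. For comparison only, here is the equivalent in Python rather than R; the helper name to_forward_slashes is hypothetical.

```python
import re

def to_forward_slashes(path: str) -> str:
    # r"\\" is a string of two backslashes; the regex engine reads that
    # pair as "one literal backslash".
    return re.sub(r"\\", "/", path)

raw = r"C:\Users\mhermans\somefile.csv"  # raw literal: backslashes kept as typed
print(to_forward_slashes(raw))           # C:/Users/mhermans/somefile.csv
print(len("\\A"))                        # 2: one backslash plus "A"
```

As in R, the stored string contains single backslashes; the doubling only exists in source code and in the regex pattern.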
If file E:\Data\junk.txt contains the following text (without quotes): C:\Users\mhermans\somefile.csv
You may get a warning with the following statement, but it will work:
texinp <- readLines("E:\\Data\\junk.txt")
If file E:\Data\junk.txt contains the following text (with quotes): "C:\Users\mhermans\somefile.csv"
The above readLines statement might also give you a warning, but texinp will now contain (as printed by R):
"\"C:\\Users\\mhermans\\somefile.csv\""
So, to get what you want, make sure there aren't quotes in the incoming file, and use:
texinp <- suppressWarnings(readLines("E:\\Data\\junk.txt"))

Characters to separate value

I need to create a string to store key/value pairs, for example:
key1::value1||key2::value2||key3::value3
When deserializing it, I may run into errors if a key or a value happens to contain || or ::
What are common techniques for dealing with this situation? Thanks.
A common way to deal with this is called an escape character or qualifier. Consider this Comma-Separated line:
Name,City,State
John Doe, Jr.,Anytown,CA
Because the name field contains a comma, the line gets split incorrectly.
If you enclose each data value by qualifiers, the parser knows when to ignore the delimiter, as in this example:
Name,City,State
"John Doe, Jr.",Anytown,CA
Qualifiers can be optional, used only on data fields that need it. Many implementations will use qualifiers on every field, needed or not.
You may want to implement something similar for your data encoding.
Escape || when serializing, and unescape it when deserializing. A common C-like way to escape is to prepend \. For example:
{ "a:b:c": "foo||bar", "asdf": "\\|||x||||:" }
serialize => "a\:b\:c:foo\|\|bar||asdf:\\\\\|\|\|x\|\|\|\|\:"
Note that \ needs to be escaped (and double escaped due to being placed in a C-style string).
If we assume that you have total control over the input string, then the common way of dealing with this problem is to use an escape character.
Typically, the backslash character \ is used as an escape that says "treat the next character literally", so in this case it should not be interpreted as a delimiter. The parser would see || and :: as delimiters, but would see \|\| as two pipe characters || in either the key or the value.
The next problem is that we have overloaded the backslash: how do I represent a literal backslash? This is solved by escaping the backslash as well, so to represent a \ you write \\, and the parser reads \\ as \.
Note that if you use escape characters, you can use a single character for the delimiters, which might make things simpler.
Alternatively, you can restrict the input and declare that || and :: are simply banned, failing on or removing them when the string is encoded.
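The backslash-escaping scheme described above might be implemented along these lines. This is a sketch under assumptions rather than a reference implementation: the function names are made up, the delimiters are the :: and || from the question, and every :, | and \ in the data is escaped individually.

```python
ESC = "\\"
SPECIAL = ":|\\"  # characters that must be escaped inside keys and values

def escape(s):
    return "".join(ESC + c if c in SPECIAL else c for c in s)

def serialize(pairs):
    return "||".join(escape(k) + "::" + escape(v) for k, v in pairs.items())

def deserialize(text):
    pairs = {}
    if not text:
        return pairs
    key, buf, i = None, [], 0
    while i < len(text):
        c = text[i]
        if c == ESC:
            if i + 1 >= len(text):
                raise ValueError("dangling escape at end of input")
            buf.append(text[i + 1])       # take the escaped character literally
            i += 2
        elif key is None and text.startswith("::", i):
            key, buf = "".join(buf), []   # key finished, value starts
            i += 2
        elif text.startswith("||", i):
            if key is None:
                raise ValueError(f"record without '::' before offset {i}")
            pairs[key] = "".join(buf)     # record finished
            key, buf = None, []
            i += 2
        elif c in ":|":
            raise ValueError(f"unescaped {c!r} at offset {i}")
        else:
            buf.append(c)
            i += 1
    if key is None:
        raise ValueError("last record has no '::'")
    pairs[key] = "".join(buf)
    return pairs

data = {"a:b:c": "foo||bar", "asdf": "\\|||x||||:"}
encoded = serialize(data)
print(encoded)
print(deserialize(encoded) == data)
```

Because every delimiter character in the data is escaped, an unescaped : or | can only be part of a real delimiter, which keeps the decoder a simple left-to-right scan.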
A simple solution is to escape a separator (with a backslash, for instance) any time it occurs in data:
Name,City,State
John Doe\, Jr.,Anytown,CA
Of course, the escape character itself will also need to be escaped when it occurs in data; in this case, a backslash would become \\.
You can use a non-printable character as the separator (e.g. vertical tab :-) ).
You can escape the separator character in your data during serialization. For example: if you use one character as separator (key1:value1|key2:value2|...) and your data is:
this:is:key1 this|is|data1
this:is:key2 this|is|data2
you double every colon and pipe character in your data when you serialize it. So you will get:
this::is::key1:this||is||data1|this::is::key2:this||is||data2|...
During deserialization, whenever you come across two colons or two pipe characters, you know that this is not a separator but part of your data, and that you have to change it back to one character. On the other hand, every single colon or pipe character is a separator.
Put a prefix (say "a") in front of every special character (say "b") that occurs in the keys and values before storing them. This is called escaping.
Then decode the key and values by simply replacing any "ab" sequence with "b". Bear in mind that the prefix is also a special character. An example:
Prefix: \
Special characters: :, |, \
Encoded:
title:Slashdot\: News for Nerds. Stuff that Matters.|shortTitle:\\.
Decoded:
title=Slashdot: News for Nerds. Stuff that Matters.
shortTitle=\.
The common technique is escaping reserved characters, for example:
In URLs you escape some characters using %HEX representation:
http://example.com?aa=a%20b
In programming languages you escape some characters with a backslash prefix:
"\"hello\""
