Is there a definitive documented answer on double quoted string escaping?

Say, for example, I want to write this header line in an HTTP response:
Content-Disposition: attachment; filename="I can't believe it's not header!.jpg"
It contains a mix of quotes, and doubling the quotes doesn't work (despite being the cleanest approach):
Header always set Content-Disposition "attachment; filename=""I can't believe it's not header!.jpg"""
It throws the error "Header has too many arguments".
Good old backslash works:
Header always set Content-Disposition "attachment; filename=\"I can't believe it's not header!.jpg\""
but the docs provide examples where backslashes are used unescaped, so I assume that "a\b" is parsed the same as "a\\b" because \b isn't special like \ and ". We know what they say about assumptions. Am I just being dense? Where are the docs?
Update: I opened a bug as I found other oddities.

Backslash-escapes are certainly the standard way of escaping characters in Apache config files, so backslash-escaping double quotes inside a string that is itself delimited by double quotes is certainly the way to go.
However, where is this documented? The page in the Apache docs that covers configuration file syntax does not explicitly cover this. (The only mention of backslashes is in regard to continuing directives across multiple lines, something which is rarely required.)
The Apache docs for mod_log_config (a base module) do state:
Literal quotes and backslashes should be escaped with backslashes.
This is where the argument is (always) enclosed in double quotes. The same happens to apply to pretty much all string arguments in all modules.
but the docs provide examples where backslashes are used unescaped, so I assume that "a\b" is parsed the same as "a\\b" because \b isn't special like \ and ".
I can't see where you are referring to? The link you provide does not seem to include such an example?
If the argument is an ordinary string then "a\b" would be seen as "ab" (the literal b is unnecessarily escaped). And "a\\b" would be "a\b" (the backslash itself is escaped for a literal backslash). However, if the argument takes a regex (as many of those examples on the Apache expressions page do) then \b itself is a special meta-character that asserts a word-boundary - there is no backslash-escape in this instance.
Note that arguments in Apache config files only need to be surrounded by double quotes if the value contains spaces. Many examples in the Apache docs include double quotes, but this is not a requirement. Spaces themselves can often be backslash-escaped (to avoid having to double quote the argument), but this tends to be less readable. For regex arguments it is often preferable to use \s instead (any whitespace character).

Related

What are the correct rules for Groovy escaping?

In the Groovy manual you can find these two pieces of text:
Any Groovy expression can be interpolated in all string literals,
apart from single and triple single quoted strings.
Slashy string
...
Only forward slashes need to be escaped with a backslash
They are obviously contradictory: according to the second sentence, /$a/ will be interpreted as '$a', but according to the first one, it will be interpreted as '-the meaning of variable a-'. In real life, it works the second way.
Interestingly, the dollar before something that looks like a variable should be escaped in single-quoted strings, too. Real life examples are here. Groovy tries to read $ as a variable name prefix even when not interpolating.
It seems that the explanation given for dollar slashy strings states the rule correctly for ALL strings:
except to escape the dollar of a string subsequence that would start
like a GString placeholder sequence, ...
Could you formulate correct, non-contradictory rules for Groovy escaping?
The practical tests were made on Gradle plugin for Intellij.

Ignore escape characters (backslashes) in R strings

While running an R-plugin in SPSS, I receive a Windows path string as input e.g.
'C:\Users\mhermans\somefile.csv'
I would like to use that path in subsequent R code, but then the slashes need to be replaced with forward slashes, otherwise R interprets it as escapes (eg. "\U used without hex digits" errors).
I have, however, not been able to find a function that can replace the backslashes with forward slashes or double-escape them. All those functions assume those characters are already escaped.
So, is there something along the lines of:
>gsub('\\', '/', 'C:\Users\mhermans')
C:/Users/mhermans

You can try to use the 'allowEscapes' argument in scan()
X=scan(what="character",allowEscapes=F)
C:\Users\mhermans\somefile.csv
print(X)
[1] "C:\\Users\\mhermans\\somefile.csv"

As of version 4.0, introduced in April 2020, R provides a syntax for specifying raw strings. The string in the example can be written as:
path <- r"(C:\Users\mhermans\somefile.csv)"
From ?Quotes:
Raw character constants are also available using a syntax similar to the one used in C++: r"(...)" with ... any character sequence, except that it must not contain the closing sequence )". The delimiter pairs [] and {} can also be used, and R can be used in place of r. For additional flexibility, a number of dashes can be placed between the opening quote and the opening delimiter, as long as the same number of dashes appear between the closing delimiter and the closing quote.

First you need to get it assigned to a name:
pathname <- 'C:\\Users\\mhermans\\somefile.csv'
Notice that in order to assign it to a name you needed to double all the backslashes, which gives a hint about how you could use regex. Actually, if you read it in from a text file, then R will do all the doubling for you. Mind you, it is not really doubling the backslashes: each one is stored as a single backslash, but it is displayed doubled and needs to be typed doubled at the console. Otherwise the R interpreter tries (and often fails) to turn it into a special character. And to compound the problem, regex uses the backslash as an escape as well. So to match a backslash with grep, sub, or gsub you need to quadruple the backslashes:
gsub("\\\\", "/", pathname)
# [1] "C:/Users/mhermans/somefile.csv"
You needed to doubly "double" the backslashes. The first of each pair of \'s signals to the regex engine that what comes next is a literal.
Consider:
nchar("\\A")
# returns `[1] 2`

If file E:\Data\junk.txt contains the following text (without quotes): C:\Users\mhermans\somefile.csv
You may get a warning with the following statement, but it will work:
texinp <- readLines("E:\\Data\\junk.txt")
If file E:\Data\junk.txt contains the following text (with quotes): "C:\Users\mhermans\somefile.csv"
The above readLines statement might also give you a warning, but texinp will now print as:
"\"C:\\Users\\mhermans\\somefile.csv\""
So, to get what you want, make sure there aren't quotes in the incoming file, and use:
texinp <- suppressWarnings(readLines("E:\\Data\\junk.txt"))

Characters to separate value

I need to create a string to store pairs of key/value data, for example:
key1::value1||key2::value2||key3::value3
When deserializing it, I may encounter an error if the key or the value happens to contain || or ::
What are common techniques to deal with such a situation? Thanks.

A common way to deal with this is called an escape character or qualifier. Consider this Comma-Separated line:
Name,City,State
John Doe, Jr.,Anytown,CA
Because the name field contains a comma, it of course gets split improperly and so on.
If you enclose each data value by qualifiers, the parser knows when to ignore the delimiter, as in this example:
Name,City,State
"John Doe, Jr.",Anytown,CA
Qualifiers can be optional, used only on data fields that need it. Many implementations will use qualifiers on every field, needed or not.
You may want to implement something similar for your data encoding.
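As a rough illustration of the qualifier idea, here is a minimal sketch using Python's standard csv module (my choice of tool, not something the question mentions). With its default settings the writer adds quotes only to fields that need them, and the reader strips them again:

```python
import csv
import io

# Sample rows taken from the example above.
rows = [["Name", "City", "State"], ["John Doe, Jr.", "Anytown", "CA"]]

# Write: by default only fields containing the delimiter, a quote,
# or a line break are wrapped in double quotes.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
text = buf.getvalue()
print(text)

# Read back: the quoted comma is kept as data, not treated as a delimiter.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)
```

The same approach could be adapted to key/value pairs by treating each pair as a two-field row.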

Escape || when serializing, and unescape it when deserializing. A common C-like way to escape is to prepend \. For example:
{ "a:b:c": "foo||bar", "asdf": "\\|||x||||:" }
serialize => "a\:b\:c:foo\|\|bar||asdf:\\\\\|\|\|x\|\|\|\|\:"
Note that \ needs to be escaped (and double escaped due to being placed in a C-style string).

If we assume that you have total control over the input string, then the common way of dealing with this problem is to use an escape character.
Typically, the backslash-\ character is used as an escape to say that "the next character is a special character", so in this case it should not be used as a delimiter. So the parser would see || and :: as delimiters, but would see \|\| as two pipe characters || in either the key or the value.
The next problem is that we have overloaded the backslash. The question is then, "how do I represent a backslash?" This is solved by saying that the backslash is also escaped, so to represent a \, you would have to say \\. So the parser would see \\ as \.
Note that if you use escape characters, you can use a single character for the delimiters, which might make things simpler.
Alternatively, you may have to restrict the input and say that || and :: are simply banned, and fail (or remove them) when the string is encoded.
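The escape-character approach could look roughly like the sketch below. It assumes single-character delimiters (| between pairs, : between key and value) and \ as the escape character; the function names are made up for illustration.

```python
ESC = "\\"       # escape character
PAIR_SEP = "|"   # separates key/value pairs (assumed single character)
KV_SEP = ":"     # separates a key from its value (assumed single character)
SPECIAL = {ESC, PAIR_SEP, KV_SEP}


def escape(s):
    """Prefix every special character, including the escape itself."""
    return "".join(ESC + ch if ch in SPECIAL else ch for ch in s)


def unescape(s):
    """Drop each escape character and keep the character that follows it."""
    out, i = [], 0
    while i < len(s):
        if s[i] == ESC and i + 1 < len(s):
            out.append(s[i + 1])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def split_unescaped(s, sep):
    """Split on sep, ignoring separators that are preceded by an escape."""
    parts, cur, i = [], [], 0
    while i < len(s):
        if s[i] == ESC and i + 1 < len(s):
            cur.append(s[i:i + 2])   # keep the escaped pair intact for now
            i += 2
        elif s[i] == sep:
            parts.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(s[i])
            i += 1
    parts.append("".join(cur))
    return parts


def serialize(d):
    return PAIR_SEP.join(escape(k) + KV_SEP + escape(v) for k, v in d.items())


def deserialize(s):
    result = {}
    if not s:
        return result
    for pair in split_unescaped(s, PAIR_SEP):
        key, value = split_unescaped(pair, KV_SEP)
        result[unescape(key)] = unescape(value)
    return result


data = {"a:b:c": "foo||bar", "asdf": "\\|||x||||:"}
assert deserialize(serialize(data)) == data
```

Splitting has to happen before unescaping; otherwise an escaped delimiter would be indistinguishable from a real one.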

A simple solution is to escape a separator (with a backslash, for instance) any time it occurs in data:
Name,City,State
John Doe\, Jr.,Anytown,CA
Of course, the escape character itself will need to be escaped when it occurs in data as well; in this case, a backslash would become \\.

You can use a non-printable character as the separator (e.g. vertical tab :-) ).
Alternatively, you can escape the separator character in your data during serialization. For example: if you use one character as the separator (key1:value1|key2:value2|...) and your data is:
this:is:key1 this|is|data1
this:is:key2 this|is|data2
you double every colon and pipe character in your data when you serialize it. So you will get:
this::is::key1:this||is||data1|this::is::key2:this||is||data2|...
During deserialization, whenever you come across two colons or two pipe characters you know that this is not your separator but part of your data, and that you have to change it back to one character. On the other hand, every single colon or pipe character is your separator.

Use a prefix (say "a") for your special characters (say "b") present in the key and values to store them. This is called escaping.
Then decode the key and values by simply replacing any "ab" sequence with "b". Bear in mind that the prefix is also a special character. An example:
Prefix: \
Special characters: :, |, \
Encoded:
title:Slashdot\: News for Nerds. Stuff that Matters.|shortTitle:\\.
Decoded:
title=Slashdot: News for Nerds. Stuff that Matters.
shortTitle=\.

The common technique is escaping reserved characters, for example:
In URLs you escape some characters using a %HEX representation:
http://example.com?aa=a%20b
In programming languages you escape some characters with a backslash prefix:
"\"hello\""
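Both styles of escaping are available in most standard libraries. As a small sketch in Python (the choice of language and of json.dumps as the string-literal example are mine):

```python
import json
from urllib.parse import quote, unquote

# Percent-encoding: the space becomes %20, as in the URL above.
encoded = quote("a b")
decoded = unquote(encoded)
print(encoded, "->", repr(decoded))

# Backslash escaping: JSON string literals escape embedded double quotes.
literal = json.dumps('"hello"')
print(literal)  # the literal shown above, quotes escaped with backslashes
```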

New lines in tab delimited or comma delimited output

I am looking for some best practices as far as handling csv and tab delimited files.
For CSV files I am already doing some formatting if a value contains a comma or double quote but what if the value contains a new line character? Should I leave the new line intact and encase the value in double quotes + escape any double quotes within the value?
Same question for tab delimited files. I assume the answer would be very similar if not the same.

Usually you keep \n unaltered, relying on the fact that the newline character will be enclosed in a " " string. This doesn't create ambiguities, but it's really ugly if you have to look at the file in a normal text editor.
But that is how you should do it, since you don't escape anything inside a string in a CSV except for the double quote itself.

@Jack is right that your best bet is to keep the \n unaltered, since you'll expect it inside of double-quotes if that is the case.
As with most things, I think consistency here is key. As far as I know, your values only need to be double-quoted if they span multiple lines, contain commas, or contain double-quotes. In some implementations I've seen, all values are escaped and double-quoted, since it makes the parsing algorithm simpler (there's never a question of escaping and double-quoting, and the reverse on reading the CSV).
This isn't the most space-optimized solution, but makes reading and writing the file a trivial affair, for both your own library and others that may consume it in the future.
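To make that concrete, here is a small sketch with Python's csv module (one possible library, chosen for illustration). The embedded newline is left as-is inside a quoted field, and the only thing escaped is the double quote, which is doubled:

```python
import csv
import io

# One row whose middle value contains a newline and double quotes.
row = ["id-1", 'line one\nline "two"', "x,y"]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
text = buf.getvalue()
print(text)

# newline="" stops the text layer from translating line endings,
# so the reader sees the embedded newline exactly as written.
back = next(csv.reader(io.StringIO(text, newline="")))
assert back == row
```

Passing quoting=csv.QUOTE_ALL to the writer would give the "quote every value" variant mentioned above.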

For TSV, if you want lossless representation of values, the "Linear TSV" specification is worth considering: http://paulfitz.github.io/dataprotocols/linear-tsv/index.html
For obvious reasons, most such conventions adhere to the following at a minimum:
\n for newline,
\t for tab,
\r for carriage return,
\\ for backslash
Some tools add \0 for NUL.
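A minimal sketch of those four escapes in Python follows. It covers only the conventions listed above, not the whole Linear TSV specification, and the helper names are invented for this example:

```python
# Character -> escape sequence, and the reverse lookup by escape letter.
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\t": "\\t", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "n": "\n", "t": "\t", "r": "\r"}


def encode_field(value):
    """Replace backslash, newline, tab and carriage return with escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def decode_field(text):
    """Reverse encode_field, scanning left to right so \\\\n is not misread."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in _UNESCAPES:
            out.append(_UNESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def encode_row(fields):
    return "\t".join(encode_field(f) for f in fields)


def decode_row(line):
    # After encoding, every remaining tab is a real field separator.
    return [decode_field(f) for f in line.split("\t")]


row = ["a\tb", "multi\nline", "back\\n", "cr\r"]
line = encode_row(row)
assert "\n" not in line and decode_row(line) == row
```

Because every encoded row is guaranteed to sit on one physical line, the file can be split on newlines and tabs without any quote-aware parsing.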

Are quotes a type of string delimiter? Or does 'delimiter' mean other types of characters not including quotes?

When people talk about string delimiters, does that include quotes or does that mean everything except quotes?

It means any character used to define the beginning and end of a string (e.g. quotes but, in other contexts, other characters).

There's a subtle difference. If you're talking about string delimiters, that nearly always means quotes, either " or '.
If you're talking about a delimited string, then you're normally talking about a string of tokens with delimiters between them, i.e.
"this,is,a,delimited,string"
It's very common to use a comma as the delimiter, but that leads to issues when a token already contains a comma, for instance
"one,million,dollars,$1,000,000"
In this instance it's common to further delimit the token so we get
"one,million,dollars,"$1,000,000""
Another common alternative is to use an unusual character as the delimiter, and there's a minor convention of using the pipe symbol |
"one|million|dollars|$1,000,000"
