Why Ampersand should be escaped because of XSS injection - security

The five characters that OWASP recommends escaping to prevent XSS injection are
&, <, >, ", '.
Among them, I cannot understand why & (ampersand) should be escaped and how it can be used as a vector to inject script. Can somebody give an example where the other four characters are escaped but the ampersand is not, and an XSS vulnerability results?
I have checked the other question but that answer really does not make things any clearer.

The answer here addresses the issue only in a nested JavaScript context within an HTML attribute context, whereas your question asks specifically about pure HTML context escaping.
In that question, the escaping should be as per the OWASP recommendation for JavaScript:
Except for alphanumeric characters, escape all characters with the \uXXXX unicode escaping format (X = Integer).
Which will already handle & because it is not alphanumeric.
To answer your question,
from a practical point of view, why wouldn't you escape ampersand?
The HTML representation of & is &amp;, so it makes a lot of sense to do that. If you didn't, any time a user entered &amp;, &lt;, or &gt; into your application, your application would render &, <, or > instead of &amp;, &lt;, or &gt;.
An edge case? Definitely. A security concern? It shouldn't be.
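The rendering problem described above can be reproduced with a minimal Python sketch. The helper escape_without_amp is hypothetical (it is not an OWASP or library function), and html.unescape is used only as a stand-in for the browser decoding character references in text content.

```python
import html


def escape_without_amp(text: str) -> str:
    """Hypothetical escaper: handles <, >, " and ' but deliberately skips &."""
    return (
        text.replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&#x27;")
    )


# The user literally typed an entity and expects to see it displayed as typed.
user_input = "Use &lt; for less-than"

partial = escape_without_amp(user_input)   # the & is left untouched
full = html.escape(user_input, quote=True)  # the & becomes &amp;

# html.unescape approximates what the browser shows after decoding.
shown_partial = html.unescape(partial)
shown_full = html.unescape(full)

print(shown_partial)  # the typed entity has been decoded away
print(shown_full)     # identical to what the user typed
```

With the ampersand escaped, the text survives the round trip unchanged; without it, the user's "&lt;" is silently turned into "<" on display.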
From the HTML5 syntax Character references section:
Character references must start with a U+0026 AMPERSAND character (&).
Following this, there are three possible kinds of character
references:
Named character references
Decimal numeric character reference
Hexadecimal numeric character reference
When an & is encountered:
Switch to the data state.
Attempt to consume a character reference, with no additional allowed
character.
If nothing is returned, emit a U+0026 AMPERSAND character (&) token.
Otherwise, emit the character tokens that were returned.
Therefore, anything after the & will cause either & itself to be output, or the character it represents. Because the characters that follow have to be alphanumeric (or else they won't be consumed), there is no chance of a character that needs escaping (e.g. ', ", >, <) being consumed and ignored, so there is little security risk of an attacker changing the parsing context. However, you never know whether some browser has a bug and doesn't quite follow the standard, so I would always escape &. Internet Explorer had an issue where you could specify <% and it would be interpreted as <, allowing the .NET Request Validation to be bypassed for XSS attack vectors. Always better to be safe than sorry.
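As a rough illustration of the parsing rule quoted above, Python's html.unescape (which is documented as following the HTML5 character reference rules) can stand in for a browser. This is only a sketch of the standard behaviour, not of any particular browser; the sample strings are made up for the example.

```python
import html

# Each sample starts with an ampersand; the comment says what is expected.
samples = [
    "&lt;",      # named reference: decoded
    "&#60;",     # decimal numeric reference: decoded
    "&#x3C;",    # hexadecimal numeric reference: decoded
    "& <b>",     # & followed by a space: nothing consumed, & emitted as-is
    '&"><b>',    # & followed by a quote: the quote is not swallowed
]

decoded = {s: html.unescape(s) for s in samples}

for original, result in decoded.items():
    print(repr(original), "->", repr(result))
```

In the last two samples the characters after the ampersand come through untouched, which matches the claim that a stray & cannot make a following quote or angle bracket disappear.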

Related

Is there a definitive documented answer on double quoted string escaping?

Say, for example, I want to write this header line in an HTTP response:
Content-Disposition: attachment; filename="I can't believe it's not header!.jpg"
It contains a mix of quotes. Repeating the quotes doesn't work (despite being the cleanest approach):
Header set always Content-Disposition "attachment; filename=""I can't believe it's not header!.jpg"""
It throws the error: Header has too many arguments.
Good old backslash works:
Header always set Content-Disposition "attachment; filename=\"I can't believe it's not header!.jpg\""
but the docs provide examples where backslashes are used unescaped, so I assume that "a\b" is parsed the same as "a\\b" because \b isn't special like \ and ". We know what they say about assumptions. Am I just being dense? Where are the docs?
Update: I opened a bug as I found other oddities.
Backslash-escapes are certainly the standard way of escaping characters in Apache config files, so backslash-escaping double quotes inside a string that is itself delimited by double quotes is certainly the way to go.
However, where is this documented? The page in the Apache docs that covers configuration file syntax does not explicitly cover this. (The only mention of backslashes is in regard to continuing directives across multiple lines, something which is rarely required.)
The Apache docs for mod_log_config (a base module) do state:
Literal quotes and backslashes should be escaped with backslashes.
This is where the argument is (always) enclosed in double quotes. The same happens to apply to pretty much all string arguments in all modules.
but the docs provide examples where backslashes are used unescaped, so I assume that "a\b" is parsed the same as "a\\b" because \b isn't special like \ and ".
I can't see what you are referring to. The link you provide does not seem to include such an example.
If the argument is an ordinary string then "a\b" would be seen as "ab" (the literal b is unnecessarily escaped). And "a\\b" would be "a\b" (the backslash itself is escaped for a literal backslash). However, if the argument takes a regex (as many of those examples on the Apache expressions page do) then \b itself is a special meta-character that asserts a word-boundary - there is no backslash-escape in this instance.
Note that arguments in Apache config files only need to be surrounded by double quotes if the value contains spaces. Many examples in the Apache docs include double quotes, but this is not a requirement. Spaces themselves can often be backslash-escaped (to avoid having to double-quote the argument), but this tends to be less readable. For regex arguments it is often preferable to use \s instead (any whitespace character).

Why does URL encoding exist for ASCII character set

It is clearly stated in W3Schools that
URLs can only be sent over the Internet using the ASCII character-set.
Why does URL encoding exist for ASCII characters like a, b, c when they can be sent over the Internet without any URL encoding?
E.g.: why encode 'a' when it can be sent as 'a'?
What are the possible reasons to encode ASCII characters? The only reason I can think of is hackers trying to make their URLs as unreadable as possible to carry out XSS attacks.
STD 66, Percent-Encoding:
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
So percent-encoding is a kind of escape mechanism: Some characters have a special meaning in URI components (→ they are reserved). If you want to use such a character without its special meaning, you percent-encode it.
Unreserved characters (like a, b, c, …) can always be used directly, but it’s also allowed to percent-encode them. Such URIs would be equivalent:
URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource.
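A small Python sketch of both points, using the standard urllib.parse module: percent-encoding a reserved character strips its special meaning, and percent-encoding an unreserved character changes nothing after decoding. The variable names and the sample value are just for illustration.

```python
from urllib.parse import quote, unquote

# Unreserved characters: the encoded and plain forms decode to the same text.
plain = "abc"
over_encoded = "%61%62%63"
same_after_decoding = unquote(over_encoded) == plain

# Reserved characters: "&" and "=" would otherwise be read as query delimiters,
# so they are percent-encoded when they are part of the data itself.
value = "a b&c=d"
encoded_value = quote(value, safe="")
query = "q=" + encoded_value

print(same_after_decoding)
print(query)
```

Passing safe="" asks quote to encode every character outside the unreserved set, including "/", which it would leave alone by default.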
Why is it allowed to percent-encode unreserved characters in the first place? The obsolete RFC 2396 contains (bold by me):
Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.
I can’t think of an example for such a "context", but this sentence suggests that there may be some.
Also, maybe some people/implementations like to simply percent-encode everything (except for delimiters etc.), so they don’t have to check if/which characters would need percent-encoding in the corresponding component.
URL encoding exists for the full range of ASCII because it was easier to define an encoding that works for all characters than to define one that only works for the set of characters with special meanings.
URL encoding allows for characters that have special meaning in a URL to be included in a segment, without their special meaning. There are many examples, but the most common ones to require encoding include " ", "?", "=" and "&"
URL encoding was designed so it can encode any ASCII character.
While = is encoded as %3d, ? is encoded as %3f and & is encoded as %26, it makes sense for a to be encoded as %61 and b to be encoded as %62, as the hex number after the % represents the ASCII code of the character.
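To make the "hex number after the %" point concrete, here is a minimal Python sketch that encodes every octet of an ASCII string, needed or not. The function name encode_every_octet is made up for this example; urllib.parse.unquote is the standard-library decoder.

```python
from urllib.parse import unquote


def encode_every_octet(text: str) -> str:
    """Percent-encode every byte of an ASCII string, whether it needs it or not."""
    return "".join(f"%{byte:02X}" for byte in text.encode("ascii"))


encoded = encode_every_octet("a=b")
decoded = unquote(encoded)

print(encoded)  # each character becomes % followed by its ASCII code in hex
print(decoded)
```

The same one-line rule covers 'a' (0x61) and '=' (0x3D) alike, which is the simplicity argument made above.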

Replace character with a safe character and vice-versa

Here's my problem:
I need to store sentences "somewhere" (it doesn't matter where).
The sentences must not contain spaces.
When I extract the sentences from that "somewhere", I need to restore the spaces.
So, before storing the sentence "I am happy" I could replace the spaces with a safe character, such as &. In C#:
theString.Replace(' ', '&');
This would yield 'I&am&happy'.
And when retrieving the sentence, I would do the reverse:
theString.Replace('&', ' ');
But what if the original sentence already contains the '&' character?
Say I did the same thing with the sentence 'I am happy & healthy'. With the design above, the string would come back as 'I am happy   healthy', since the '&' char has been replaced with a space.
(Of course, I could change the & character to a more unlikely symbol, such as ¤, but I want this to be bullet proof)
I used to know how to solve this, but I forgot how.
Any ideas?
Thanks!
Fredrik
Maybe you can use url encoding (percent encoding) as an inspiration.
Characters that are not valid in a URL are escaped by writing %XX, where XX is a hexadecimal code that represents the character. The % sign itself can also be escaped in the same way, so you never run into problems when translating it back to the original string.
There are probably other similar encodings, and for your own application you can use an & just as well as a %, but by using an existing encoding like this, you can probably also find existing functions to do the encoding and decoding for you.
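A minimal sketch of that idea, written in Python for brevity (the question's C# would use the same chained Replace calls). It escapes only two characters, the escape character "%" and the space, and the function names are made up for this example. The order of the replacements matters in both directions.

```python
def encode_spaces(sentence: str) -> str:
    """Escape '%' first so existing '%' signs cannot be confused with escapes."""
    return sentence.replace("%", "%25").replace(" ", "%20")


def decode_spaces(stored: str) -> str:
    """Undo the escapes in the opposite order: spaces first, then '%'."""
    return stored.replace("%20", " ").replace("%25", "%")


sentences = ["I am happy & healthy", "100% sure that %20 is a space"]
stored = [encode_spaces(s) for s in sentences]
restored = [decode_spaces(s) for s in stored]

for s in stored:
    print(s)
```

Because the escape character is itself escaped, a sentence that already contains "%20" or "&" comes back unchanged, which the plain character swap cannot guarantee.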

New lines in tab delimited or comma delimited output

I am looking for some best practices as far as handling csv and tab delimited files.
For CSV files I am already doing some formatting if a value contains a comma or double quote but what if the value contains a new line character? Should I leave the new line intact and encase the value in double quotes + escape any double quotes within the value?
Same question for tab delimited files. I assume the answer would be very similar if not the same.
Usually you keep \n unaltered, relying on the fact that the newline character will be enclosed in a double-quoted string. This doesn't create ambiguities, but it's really ugly if you have to look at the file in a normal text editor.
But that is how you should do it, since you don't escape anything inside a string in a CSV except for the double quote itself.
@Jack is right that your best bet is to keep the \n unaltered, since you'll expect it inside double quotes if that is the case.
As with most things, I think consistency here is key. As far as I know, your values only need to be double-quoted if they span multiple lines, contain commas, or contain double-quotes. In some implementations I've seen, all values are escaped and double-quoted, since it makes the parsing algorithm simpler (there's never a question of escaping and double-quoting, and the reverse on reading the CSV).
This isn't the most space-optimized solution, but makes reading and writing the file a trivial affair, for both your own library and others that may consume it in the future.
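Both approaches described above can be seen with Python's standard csv module, shown here as one possible illustration rather than a recommendation of any particular library. The sample row and the helper name to_csv are made up for the example.

```python
import csv
import io

rows = [["id1", "line one\nline two", 'say "hi"']]


def to_csv(data, quoting):
    """Serialize rows to a CSV string with the given quoting policy."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, quoting=quoting).writerows(data)
    return buffer.getvalue()


# Quote only fields that need it (embedded newline, comma or double quote).
minimal = to_csv(rows, csv.QUOTE_MINIMAL)
# Quote every field, which keeps the writer and reader logic uniform.
quote_all = to_csv(rows, csv.QUOTE_ALL)

# Reading back restores the embedded newline and the doubled quotes.
parsed = list(csv.reader(io.StringIO(minimal, newline="")))

print(repr(minimal))
print(repr(quote_all))
```

In both outputs the newline stays as a literal newline inside the quoted field and the only character that gets escaped is the double quote, by doubling it.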
For TSV, if you want lossless representation of values, the "Linear TSV" specification is worth considering: http://paulfitz.github.io/dataprotocols/linear-tsv/index.html
For obvious reasons, most such conventions adhere to the following at a minimum:
\n for newline,
\t for tab,
\r for carriage return,
\\ for backslash
Some tools add \0 for NUL.
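A rough Python sketch of the four escapes listed above follows. It is only an illustration of the convention, not a complete Linear TSV implementation, and the function names are made up. Unescaping is done in a single left-to-right pass so that an escaped backslash followed by "n" is not mistaken for a newline escape.

```python
import re

# Raw character -> escaped form, following the list above.
ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
UNESCAPES = {escaped: raw for raw, escaped in ESCAPES.items()}


def escape_field(value: str) -> str:
    """Replace backslash, tab, newline and carriage return with their escapes."""
    return re.sub(r"[\\\t\n\r]", lambda m: ESCAPES[m.group(0)], value)


def unescape_field(value: str) -> str:
    """Decode the escapes in one pass over the string."""
    return re.sub(r"\\[\\tnr]", lambda m: UNESCAPES[m.group(0)], value)


original = "a\tb\nc\\n"          # tab, newline, then a literal backslash + n
escaped = escape_field(original)
restored = unescape_field(escaped)

print(escaped)
```

After escaping, a field contains no raw tabs or newlines, so a file can be split on "\n" into rows and on "\t" into fields before any unescaping is done.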

Most reliable split character

Update
If you were forced to use a single char on a split method, which char would be the most reliable?
Definition of reliable: a split character that is not part of the individual sub strings being split.
We currently use
public const char Separator = ((char)007);
I think this is the beep sound, if I am not mistaken.
Aside from 0x0, which may not be available (because of null-terminated strings, for example), the ASCII control characters between 0x1 and 0x1f are good candidates. The ASCII characters 0x1c-0x1f are even designed for such a thing and have the names File Separator, Group Separator, Record Separator, Unit Separator. However, they are forbidden in transport formats such as XML.
In that case, the characters from the unicode private use code points may be used.
One last option would be to use an escaping strategy, so that the separation character can be entered somehow anyway. However, this complicates the task quite a lot and you cannot use String.Split anymore.
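A minimal sketch of the control-character approach, written in Python for brevity, using the ASCII Unit Separator (0x1F) mentioned above. The helper names are made up, and rejecting fields that contain the separator is an assumption of this sketch; an escaping scheme would be the alternative.

```python
UNIT_SEPARATOR = "\x1f"  # ASCII 0x1F, named "Unit Separator"


def join_fields(fields):
    """Join fields with the unit separator, refusing fields that contain it."""
    for field in fields:
        if UNIT_SEPARATOR in field:
            raise ValueError("field contains the separator character")
    return UNIT_SEPARATOR.join(fields)


def split_fields(record):
    """Split a record produced by join_fields back into its fields."""
    return record.split(UNIT_SEPARATOR)


fields = ["value, with comma", "pipe|inside", "tab\tinside"]
record = join_fields(fields)
restored = split_fields(record)

print(restored)
```

Commas, pipes and tabs pass through untouched because none of them is the delimiter; the explicit check makes the one remaining failure case loud instead of silent.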
You can safely use whatever character you like as delimiter, if you escape the string so that you know that it doesn't contain that character.
Let's for example choose the character 'a' as delimiter. (I intentionally picked a usual character to show that any character can be used.)
Use the character 'b' as escape code. We replace any occurrence of 'a' with 'b1' and any occurrence of 'b' with 'b2':
private static string Escape(string s) {
    // Escape the escape character 'b' first, then the delimiter 'a'.
    return s.Replace("b", "b2").Replace("a", "b1");
}
Now, the string doesn't contain any 'a' characters, so you can put several of those strings together:
string msg = Escape("banana") + "a" + Escape("aardvark") + "a" + Escape("bark");
The string now looks like this:
b2b1nb1nb1ab1b1rdvb1rkab2b1rk
Now you can split the string on 'a' and get the individual parts:
b2b1nb1nb1
b1b1rdvb1rk
b2b1rk
To decode the parts you do the replacement backwards:
private static string Unescape(string s) {
    // Reverse order: restore the delimiter first, then the escape character.
    return s.Replace("b1", "a").Replace("b2", "b");
}
So splitting the string and unencoding the parts is done like this:
string[] parts = msg.Split('a');
for (int i = 0; i < parts.Length; i++) {
    parts[i] = Unescape(parts[i]);
}
Or using LINQ:
string[] parts = msg.Split('a').Select<string,string>(Unescape).ToArray();
If you choose a less common character as delimiter, there are of course fewer occurrences that will be escaped. The point is that the method makes sure that the character is safe to use as delimiter without making any assumptions about what characters exists in the data that you want to put in the string.
I usually prefer a '|' symbol as the split character. If you are not sure what the user will enter in the text, you can restrict the user from entering some special characters and choose the split character from among them.
It depends what you're splitting.
In most cases it's best to use split chars that are fairly commonly used, for instance
value, value, value
value|value|value
key=value;key=value;
key:value;key:value;
You can use quoted identifiers nicely with commas:
"value", "value", "value with , inside", "value"
I tend to use , first, then |, then if I can't use either of them I use the section-break char §
Note that you can type such characters with ALT+number (on the numeric keypad only), so § is ALT+21
\0 is a good split character. It's pretty hard (impossible?) to enter from keyboard and it makes logical sense.
\n is another good candidate in some contexts.
And of course, .NET strings are Unicode, so there is no need to limit yourself to the first 255 characters. You can always use a rare Mongolian letter or some reserved or unused Unicode symbol.
There are overloads of String.Split that take string separators...
I'd personally say that it depends entirely on the situation; if you're writing a simple TCP/IP chat system, you obviously shouldn't use '\n' as the split character. But '\0' is a good character to use because users can't ever enter it!
First of all, in C# (or .NET), you can use more than one split character in one split operation.
From the String.Split(Char[]) method reference:
An array of Unicode characters that delimit the substrings in this instance, an empty array that contains no delimiters, or null reference (Nothing in Visual Basic).
In my opinion, there's no MOST reliable split character, however some are more suitable than others.
Popular split characters like tab, comma, and pipe are good for viewing the unsplit string/line.
If it's only for storing/processing, the safer characters are probably those that are seldom used or those not easily entered from the keyboard.
It also depends on the usage context. E.g. if you are expecting the data to contain email addresses, "@" is a no-no.
Say we were to pick one from the ASCII set. There are quite a number to choose from, e.g. " ` ", " ^ " and some of the non-printable characters. Do beware of some characters though; not all are suitable. E.g. 0x00 might have adverse effects on some systems.
It depends very much on the context in which it's used. If you're talking about a very general delimiting character then I don't think there is a one-size-fits-all answer.
I find that the ASCII null character '\0' is often a good candidate, or you can go with nitzmahone's idea and use more than one character, then it can be as crazy as you want.
Alternatively, you can parse the input and escape any instances of your delimiting character.
The "|" pipe sign is mostly used when you are passing arguments to a method that accepts just a single string parameter.
This is widely used in SQL Server SPs as well, where you need to pass an array as the parameter. Mostly it depends on the situation where you need it.
