Why does URL encoding exist for ASCII character set

It is clearly stated in W3Schools that
URLs can only be sent over the Internet using the ASCII character-set.
Why does URL encoding exist for ASCII characters like a, b, c when they can be sent over the Internet without any URL encoding?
E.g.: why encode 'a' when it can be sent as 'a'?
What are the possible reasons to encode ASCII characters? The only reason I can think of is hackers trying to make their URLs as unreadable as possible in order to carry out XSS attacks.

STD 66, Percent-Encoding:
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
So percent-encoding is a kind of escape mechanism: Some characters have a special meaning in URI components (→ they are reserved). If you want to use such a character without its special meaning, you percent-encode it.
Unreserved characters (like a, b, c, …) can always be used directly, but it’s also allowed to percent-encode them. Such URIs would be equivalent:
URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource.
Why is it allowed to percent-encode unreserved characters in the first place? The obsolete RFC 2396 contains (emphasis mine):
Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.
I can’t think of an example for such a "context", but this sentence suggests that there may be some.
Also, maybe some people/implementations like to simply percent-encode everything (except for delimiters etc.), so they don’t have to check if/which characters would need percent-encoding in the corresponding component.

URL encoding exists for the full range of ASCII because it was easier to define an encoding that works for all characters than to define one that only works for the set of characters with special meanings.

URL encoding allows for characters that have special meaning in a URL to be included in a segment, without their special meaning. There are many examples, but the most common ones to require encoding include " ", "?", "=" and "&"

URL encoding was designed so it can encode any ASCII character.
While = is encoded as %3d, ? is encoded as %3f and & is encoded as %26, it makes sense for a to be encoded as %61 and b to be encoded as %62, as the hex number after the % represents the ASCII code of the character.
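As a rough illustration of the points above, here is a small Python sketch using the standard `urllib.parse` module. The sample strings are made up for the example; the point is only that unreserved characters may be sent as-is, reserved ones get escaped, and `%61` decodes to the same `a`.

```python
from urllib.parse import quote, unquote

# Unreserved characters pass through quote() unchanged.
plain = quote("abc")

# Characters with special meaning in a URL are percent-encoded
# (safe="" asks quote() not to exempt "/" either).
reserved = quote("a b?c=d&e", safe="")

# Percent-encoding unreserved characters by hand:
# "%" followed by the character's ASCII code as two hex digits.
manual = "".join(f"%{ord(ch):02X}" for ch in "abc")

print(plain)     # abc
print(reserved)  # a%20b%3Fc%3Dd%26e
print(manual)    # %61%62%63

# Both spellings decode to the same text, which is why the URIs are equivalent.
assert unquote(manual) == unquote(plain) == "abc"
```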

Related

Why Ampersand should be escaped because of XSS injection

The five characters that OWASP recommend escape to prevent XSS injections are
&, <, >, ", '.
Among them, I cannot understand why &(ampersand) should be escaped and how it can be used as a vector to inject script. Can somebody give an example that all the other four characters that are escaped but ampersand is not so there will be XSS injection vulnerability.
I have checked the other question but that answer really does not make things any clearer.

The answer here addresses the issue only in a nested JavaScript context within an HTML attribute context, whereas your question asks specifically about pure HTML context escaping.
In that question, the escaping should be as per the OWASP recommendation for JavaScript:
Except for alphanumeric characters, escape all characters with the \uXXXX unicode escaping format (X = Integer).
Which will already handle & because it is not alphanumeric.
To answer your question:
from a practical point of view, why wouldn't you escape the ampersand?
The HTML representation of & is &amp;, so it makes a lot of sense to do that. If you didn't, any time a user entered &amp;, &lt;, or &gt; into your application, your application would render &, <, or > instead of &amp;, &lt; or &gt;.
An edge case? Definitely. A security concern? It shouldn't be.
From the HTML5 syntax Character references section:
Character references must start with a U+0026 AMPERSAND character (&).
Following this, there are three possible kinds of character references:
Named character references
Decimal numeric character reference
Hexadecimal numeric character reference
When an & is encountered:
Switch to the data state.
Attempt to consume a character reference, with no additional allowed character.
If nothing is returned, emit a U+0026 AMPERSAND character (&) token.
Otherwise, emit the character tokens that were returned.
Therefore, anything after the & causes either a literal & to be output or the character that the reference represents. Because the characters that follow have to be alphanumeric or else they won't be consumed, there is no chance of a significant character (e.g. ', ", >, <) being consumed and ignored, so there is little risk of an attacker changing the parsing context. However, you never know whether some browser has a bug and doesn't quite follow the standard, so I would always escape &. Internet Explorer had an issue where you could specify <% and it would be interpreted as <, allowing .NET Request Validation to be bypassed for XSS attack vectors. Always better to be safe than sorry.
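The practical (non-security) argument can be sketched with Python's standard `html` module. The `user_input` string and the `leaky` variant that skips the ampersand are hypothetical, chosen only to show how text that already looks like an entity gets corrupted when & is left alone.

```python
import html

# Hypothetical user input that already contains entity-like text.
user_input = "&lt;b&gt; & <script>"

# html.escape() replaces & first, then <, > and (with quote=True) " and '.
escaped = html.escape(user_input, quote=True)
print(escaped)  # &amp;lt;b&amp;gt; &amp; &lt;script&gt;

# A browser decoding the escaped text shows exactly what the user typed.
assert html.unescape(escaped) == user_input

# Escaping only < and > but not & loses information:
leaky = user_input.replace("<", "&lt;").replace(">", "&gt;")
print(html.unescape(leaky))  # <b> & <script>
assert html.unescape(leaky) != user_input
```

Here `html.unescape` merely stands in for what an HTML parser does with character references; it is not a model of any particular browser.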

%20 in filename, browser sees whitespaces. How to avoid that?

I have imported images to my own server. One of the many filenames has %20 in them, e.g. 50128%202789%20001V%20500.jpg. However the browser sees it as 50128 2789 001V 500.jpg, so it won't display the image.
What solution can I use, to display the image properly?

"%20" is a percent-encoding for a space, and has special meaning in a URL. Basically, you can't use spaces in a URI or URL, and have to replace them with a code to comply with the rules. Things that read URLs (the server for example) translate them back.
Unfortunately, that means if you have a filename containing this sequence, it will be mistaken for a percent-encoded space, as you're seeing.
The solution is to also percent-encode the '%'. The percent-encoding for a '%' character is "%25".
In your example, the name "50128%202789%20001V%20500.jpg" has to be encoded to "50128%25202789%2520001V%2520500.jpg" so that those '%' characters are not mistaken for spaces.
There are of course other things that get encoded. The rules are defined in the URI specification.
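If the links are generated by a program, the encoding does not have to be done by hand. A minimal Python sketch, assuming the file name from the question is the literal name on disk:

```python
from urllib.parse import quote, unquote

# Literal file name as stored on the server (contains real "%20" text).
filename = "50128%202789%20001V%20500.jpg"

# quote() percent-encodes the "%" itself as "%25".
url_path = quote(filename)
print(url_path)  # 50128%25202789%2520001V%2520500.jpg

# Decoding once, as the server does, gives back the literal file name.
assert unquote(url_path) == filename
```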

Why does question mark show up in web browser?

I was (re)reading Joel's great article on Unicode and came across this paragraph, which I didn't quite understand:
For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
Why is there a question mark, and what does he mean by "or, if you're really good, a box"? And what character is he trying to display?

There is a question mark because the encoding process recognizes that the encoding can't support the character, and substitutes a question mark instead. By "if you're really good," he means, "if you have a newer browser and proper font support," you'll get a fancier substitution character, a box.
In Joel's case, he isn't trying to display a real character, he literally included the Unicode replacement character, U+FFFD REPLACEMENT CHARACTER.

It’s a rather confusing paragraph, and I don’t really know what the author is trying to say. Anyway, different browsers (and other programs) have different ways of handling problems with characters. A question mark “?” may appear in place of a character for which there is no glyph in the font(s) being used, so that it effectively says “I cannot display the character.” Browsers may alternatively use a small rectangle, or some other indicator, for the same purpose.
But the “�” symbol is REPLACEMENT CHARACTER that is normally used to indicate data error, e.g. when character data has been converted from some encoding to Unicode and it has contained some character that cannot be represented in Unicode. Browsers often use “�” in display for a related purpose: to indicate that character data is malformed, containing bytes that do not constitute a character, in the character encoding being applied. This often happens when data in some encoding is being handled as if it were in some other encoding.
So “�” does not really mean “unknown character”, still less “undisplayable character”. Rather, it means “not a character”.

A question mark appears when a byte sequence in the raw data does not match the data's character set, so it cannot be decoded properly. That happens if the data is malformed, if the data's charset is explicitly stated incorrectly in the HTTP headers or the HTML itself, if the charset is guessed incorrectly by the browser when other information is missing, or if the user's browser settings override the data's charset with an incompatible charset.
A box appears when a decoded character does not exist in the font that is being used to display the data.

Just what it says: some browsers show "a weird character" or a question mark for characters outside of the current known character set. It's their "hey, I don't know what this is" character. Get an old version of Netscape, paste some text from Microsoft Word that uses smart quotes, and you'll get question marks.
http://blog.salientdigital.com/2009/06/06/special-characters-showing-up-as-a-question-mark-inside-of-a-black-diamond/ has a decent explanation.

Cyrillic characters in browser address bar

When I put cyrillic symbols in address bar like this:
http://ru2.php.net/manual-lookup.php?pattern=привет
it switches to
http://ru2.php.net/manual-lookup.php?pattern=%EF%F0%E8%E2%E5%F2
What do those characters, %EF%F0%E8%E2%E5%F2, mean? And why is this happening?

The characters are getting URL encoded. A URL may only contain a subset of ASCII characters, so anything outside plain alphanumeric and some special characters must be URL encoded.
Some browsers display non-ASCII characters as human readable characters, but that's entirely up to them. In protocols, URLs are always URL encoded.
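The answer does not say which character encoding produced those bytes. As an observation rather than something stated in the thread, they happen to match the Windows-1251 (cp1251) encoding of the word, which this small Python sketch checks; a UTF-8 encoding of the same word is shown for comparison.

```python
from urllib.parse import quote, unquote

word = "привет"

# Percent-encode the bytes of the word in Windows-1251 (an assumption
# about what the browser did; it matches the URL from the question).
as_cp1251 = quote(word, encoding="cp1251")
print(as_cp1251)  # %EF%F0%E8%E2%E5%F2

# The same word encoded as UTF-8 gives a different, longer sequence.
as_utf8 = quote(word, encoding="utf-8")
print(as_utf8)  # %D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82

# Decoding needs the same encoding that was used for encoding.
assert unquote(as_cp1251, encoding="cp1251") == word
```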

Characters to separate value

I need to create a string to store pairs of key/value data, for example:
key1::value1||key2::value2||key3::value3
When deserializing it, I may run into an error if a key or a value happens to contain || or ::
What are common techniques for dealing with this situation? Thanks

A common way to deal with this is called an escape character or qualifier. Consider this Comma-Separated line:
Name,City,State
John Doe, Jr.,Anytown,CA
Because the name field contains a comma, it of course gets split improperly and so on.
If you enclose each data value by qualifiers, the parser knows when to ignore the delimiter, as in this example:
Name,City,State
"John Doe, Jr.",Anytown,CA
Qualifiers can be optional, used only on data fields that need it. Many implementations will use qualifiers on every field, needed or not.
You may want to implement something similar for your data encoding.
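For reference, Python's standard `csv` module applies this kind of optional quoting by default. A minimal sketch with the row from the example above:

```python
import csv
import io

row = ["John Doe, Jr.", "Anytown", "CA"]

# The writer adds qualifiers (double quotes) only where a field needs them.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(repr(line))  # '"John Doe, Jr.",Anytown,CA\r\n'

# The reader ignores the delimiter inside the quoted field.
parsed = next(csv.reader(io.StringIO(line)))
assert parsed == row
```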

Escape || when serializing, and unescape it when deserializing. A common C-like way to escape is to prepend \. For example:
{ "a:b:c": "foo||bar", "asdf": "\\|||x||||:" }
serialize => "a\:b\:c:foo\|\|bar||asdf:\\\\\|\|\|x\|\|\|\|\:"
Note that \ needs to be escaped (and double escaped due to being placed in a C-style string).

If we assume that you have total control over the input string, then the common way of dealing with this problem is to use an escape character.
Typically, the backslash character \ is used as an escape to say that "the next character is to be taken literally", so in this case it should not be treated as a delimiter. So the parser would see || and :: as delimiters, but would see \|\| as two pipe characters || in either the key or the value.
The next problem is that we have overloaded the backslash. The problem is then, "how do I represent a backslash?" This is solved by saying that the backslash is also escaped, so to represent a \, you would have to write \\. So the parser would see \\ as \.
Note that if you use escape characters, you can use a single character for the delimiters, which might make things simpler.
Alternatively, you may have to restrict the input and say that || and :: are simply banned, and either fail or remove them when the string is encoded.
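A minimal sketch of the escape-character approach in Python. It assumes single-character delimiters (`|` between pairs, `:` between key and value) as suggested above, rather than the `||` and `::` of the question, and the function names are made up for the example.

```python
ESC = "\\"        # escape character
PAIR_SEP = "|"    # separates key/value pairs
KV_SEP = ":"      # separates a key from its value
SPECIAL = {ESC, PAIR_SEP, KV_SEP}


def escape(text):
    """Prefix every special character (including the escape itself) with ESC."""
    return "".join(ESC + ch if ch in SPECIAL else ch for ch in text)


def serialize(data):
    return PAIR_SEP.join(escape(k) + KV_SEP + escape(v) for k, v in data.items())


def deserialize(blob):
    if not blob:
        return {}
    pairs = []
    fields = [""]                 # fields of the pair currently being read
    chars = iter(blob)
    for ch in chars:
        if ch == ESC:
            # The character after an escape is always taken literally.
            fields[-1] += next(chars, "")
        elif ch == KV_SEP:
            fields.append("")     # unescaped ":" starts the value
        elif ch == PAIR_SEP:
            pairs.append(fields)  # unescaped "|" ends the pair
            fields = [""]
        else:
            fields[-1] += ch
    pairs.append(fields)
    # Malformed input (a pair without exactly one ":") raises ValueError here.
    return {key: value for key, value in pairs}


data = {"a:b": "foo||bar", "back\\slash": "x::y|"}
assert deserialize(serialize(data)) == data
```

Note that the parser walks the string character by character instead of calling `split`, because a plain split cannot tell an escaped delimiter from a real one.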

A simple solution is to escape a separator (with a backslash, for instance) any time it occurs in data:
Name,City,State
John Doe\, Jr.,Anytown,CA
Of course, the escape character itself will need to be escaped when it occurs in data as well; in this case, a backslash would become \\.

You can use a non-printable character as the separator (e.g. vertical tab :-) ).
You can escape the separator character in your data during serialization. For example: if you use one character as separator (key1:value1|key2:value2|...) and your data is:
this:is:key1 this|is|data1
this:is:key2 this|is|data2
you double every colon and pipe character in your data when you serialize it. So you will get:
this::is::key1:this||is||data1|this::is::key2:this||is||data2|...
During deserialization, whenever you come across two colons or two pipe characters you know that this is not your separator but part of your data, and that you have to change it to one character. On the other hand, every single colon or pipe character is your separator. Note that this becomes ambiguous when a key or value begins or ends with a separator character (the key a: followed by the value b yields a:::b), so such cases need extra handling.

Use a prefix (say "a") for your special characters (say "b") present in the key and values to store them. This is called escaping.
Then decode the key and values by simply replacing any "ab" sequence with "b". Bear in mind that the prefix is also a special character. An example:
Prefix: \
Special characters: :, |, \
Encoded:
title:Slashdot\: News for Nerds. Stuff that Matters.|shortTitle:\\.
Decoded:
title=Slashdot: News for Nerds. Stuff that Matters.
shortTitle=\.

The common technique is escaping reserved characters, for example:
In URLs you escape some characters using a %HEX representation:
http://example.com?aa=a%20b
In programming languages you escape some characters with a backslash prefix:
"\"hello\""
