Is there a limitation on the type of characters that are legal in an xsd:string (i.e. only letters and numbers)?
Is a #, for example, legal to be contained in the string?
According to http://www.w3schools.com/schema/schema_dtypes_string.asp, an xsd:string can contain:
The string data type can contain characters, line feeds, carriage returns, and tab characters.
Also, in XML syntax "characters" means any Unicode character (including accents, symbols, etc.), so the # is allowed. The only constraint is that you need to escape the &, < and > characters, for obvious reasons. Source: http://www.schemacentral.com/sc/xsd/t-xsd_string.html
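For instance, here is a minimal Python sketch of that escaping rule (the sample string is made up) using the standard library:

from xml.sax.saxutils import escape

raw = 'Bugs & features: #42 <urgent>'
print(escape(raw))
# Bugs &amp; features: #42 &lt;urgent&gt;  -- the # passes through untouched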
It is clearly stated in W3Schools that
URLs can only be sent over the Internet using the ASCII character-set.
Why does URL encoding exist for ASCII characters like a, b, c when they can be sent over the internet without any URL encoding?
E.g.: why encode 'a' when it can be sent as 'a'?
What are the possible reasons to encode ASCII characters? The only reason I can think of is hackers trying to make their URLs as unreadable as possible to carry out XSS attacks.
STD 66 (RFC 3986), Percent-Encoding:
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
So percent-encoding is a kind of escape mechanism: Some characters have a special meaning in URI components (→ they are reserved). If you want to use such a character without its special meaning, you percent-encode it.
Unreserved characters (like a, b, c, …) can always be used directly, but it’s also allowed to percent-encode them. Such URIs would be equivalent:
URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource.
Why is it allowed to percent-encode unreserved characters in the first place? The obsolete RFC 2396 says (emphasis mine):
Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.
I can’t think of an example for such a "context", but this sentence suggests that there may be some.
Also, maybe some people/implementations like to simply percent-encode everything (except for delimiters etc.), so they don’t have to check if/which characters would need percent-encoding in the corresponding component.
URL encoding exists for the full range of ASCII because it was easier to define an encoding that works for all characters than to define one that only works for the set of characters with special meanings.
URL encoding allows characters that have special meaning in a URL to be included in a segment without that special meaning. There are many examples, but the most common ones to require encoding include " ", "?", "=" and "&".
URL encoding was designed so it can encode any ASCII character.
While = is encoded as %3d, ? is encoded as %3f and & is encoded as %26, it makes sense for a to be encoded as %61 and b to be encoded as %62, as the hex number after the % represents the ASCII code of the character.
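As a small illustration (the strings are arbitrary), Python's urllib.parse shows both behaviours: unreserved characters are left alone by default, and a percent-encoded unreserved character decodes back to the same thing:

from urllib.parse import quote, unquote

print(quote('a', safe=''))        # a    -- unreserved, left as-is
print(quote('&', safe=''))        # %26  -- reserved, so percent-encoded
print(unquote('%61'))             # a    -- %61 is just another spelling of 'a'
print(unquote('b%61r') == 'bar')  # True -- the two spellings identify the same thing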
I was pasting BibTeX references into bibex. Some names contain characters that LaTeX skips, for example á. Is there a way in Vim or regex to search for all characters that are skipped by LaTeX? One way I can think of is to write a regex that searches for anything that isn't 0-9, a-z, A-Z and some characters like / \ $.
I am not familiar with which characters LaTeX ignores, but if the file you are editing is encoded in UTF-8, you might try searching for characters outside the ASCII repertoire (0–127; or 32–127).
As a search command in Vim:
/[^\d0-\d127]
/[^\d32-\d127]
You can also use hex or octal instead of decimal; see :help /[]. This requires that l and \ not be present in the value of cpoptions (they are not present in the default state).
This should work for any encoding that is “the same as ASCII (where it is defined)” (i.e. UTF-8 and most “latin” encodings). If you are dealing with an encoding that clashes with ASCII, then you will need to refine the range specification.
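If you would rather do the same check outside Vim, a rough Python equivalent could look like this (the filename refs.bib is only a placeholder):

import re

with open('refs.bib', encoding='utf-8') as f:
    for lineno, line in enumerate(f, 1):
        for m in re.finditer(r'[^\x20-\x7f]', line):   # anything outside 32-127
            print(f'{lineno}:{m.start() + 1}: {m.group()!r}')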
I am reading this tutorial, and I noticed that a bash script uses [...] as a wildcard. So what exactly does [...] stand for in a bash script?
It's a regex-style character matching syntax; from the Bash Reference Manual, §3.5.8.1 (Pattern Matching):
[...]
Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the ‘[’ is a ‘!’ or a ‘^’ then any character not enclosed is matched. A ‘-’ may be matched by including it as the first or last character in the set. A ‘]’ may be matched by including it as the first character in the set. The sorting order of characters in range expressions is determined by the current locale and the value of the LC_COLLATE shell variable, if set.
For example, in the default C locale, ‘[a-dx-z]’ is equivalent to ‘[abcdxyz]’. Many locales sort characters in dictionary order, and in these locales ‘[a-dx-z]’ is typically not equivalent to ‘[abcdxyz]’; it might be equivalent to ‘[aBbCcDdxXyYz]’, for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value ‘C’.
Within ‘[’ and ‘]’, character classes can be specified using the syntax [:class:], where class is one of the following classes defined in the POSIX standard:
alnum alpha ascii blank cntrl digit graph lower
print punct space upper word xdigit
A character class matches any character belonging to that class. The word character class matches letters, digits, and the character ‘_’.
Within ‘[’ and ‘]’, an equivalence class can be specified using the syntax [=c=], which matches all characters with the same collation weight (as defined by the current locale) as the character c.
Within ‘[’ and ‘]’, the syntax [.symbol.] matches the collating symbol symbol.
(emphasis added to the most common usage patterns)
It is used in the tutorial to speak about regular expressions in addition to globbing ('*' and '?'). For example, the expression [a-z] will match one lowercase character.
Actually, [abc], for example, is a wildcard: it matches any one of the three letters.
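If you want to experiment with the bracket syntax outside the shell, Python's fnmatch module implements the same shell-style wildcards (it does not cover everything the manual quotes above, e.g. the [:class:] and [=c=] forms); a small illustrative sketch:

from fnmatch import fnmatchcase

print(fnmatchcase('b', '[a-z]'))   # True  -- any one lowercase letter
print(fnmatchcase('B', '[a-z]'))   # False
print(fnmatchcase('x', '[!a-d]'))  # True  -- '!' negates the set, as in bash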
I need to create a string to store pairs of key/value data, for example:
key1::value1||key2::value2||key3::value3
When deserializing it, I may encounter an error if the key or the value happens to contain || or ::.
What are common techniques to deal with such a situation? Thanks.
A common way to deal with this is called an escape character or qualifier. Consider this Comma-Separated line:
Name,City,State
John Doe, Jr.,Anytown,CA
Because the name field contains a comma, it of course gets split improperly and so on.
If you enclose each data value by qualifiers, the parser knows when to ignore the delimiter, as in this example:
Name,City,State
"John Doe, Jr.",Anytown,CA
Qualifiers can be optional, used only on data fields that need it. Many implementations will use qualifiers on every field, needed or not.
You may want to implement something similar for your data encoding.
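Python's csv module, for example, applies exactly this kind of qualifier automatically whenever a field contains the delimiter (the rows below are the made-up example from above):

import csv, io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['Name', 'City', 'State'])
writer.writerow(['John Doe, Jr.', 'Anytown', 'CA'])
print(buf.getvalue())   # the name comes out quoted: "John Doe, Jr.",Anytown,CA

row = list(csv.reader(io.StringIO(buf.getvalue())))[1]
print(row[0])           # John Doe, Jr.  -- the comma survives the round trip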
Escape || when serializing, and unescape it when deserializing. A common C-like way to escape is to prepend \. For example:
{ "a:b:c": "foo||bar", "asdf": "\\|||x||||:" }
serialize => "a\:b\:c:foo\|\|bar||asdf:\\\\\|\|\|x\|\|\|\|\:"
Note that \ needs to be escaped (and double escaped due to being placed in a C-style string).
If we assume that you have total control over the input string, then the common way of dealing with this problem is to use an escape character.
Typically, the backslash-\ character is used as an escape to say that "the next character is a special character", so in this case it should not be used as a delimiter. So the parser would see || and :: as delimiters, but would see \|\| as two pipe characters || in either the key or the value.
The next problem is that we have overloaded the backslash. The problem is then, "how do I represent a backslash?" This is solved by saying that the backslash is also escaped, so to represent a \, you would have to say \\. So the parser would see \\ as \.
Note that if you use escape characters, you can use a single character for the delimiters, which might make things simpler.
Alternatively, you may have to restrict the input and say that || and :: are simply banned, and fail or remove them when the string is encoded.
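A minimal sketch of that escape-character approach for the :: / || format in the question (the function names here are my own, not a standard API); deserializing then needs a small scanner that treats a backslash-prefixed character as literal rather than doing a naive split:

def esc(s):
    # escape the escape character first, then the delimiter characters
    return s.replace('\\', '\\\\').replace(':', '\\:').replace('|', '\\|')

def serialize(pairs):
    return '||'.join(esc(k) + '::' + esc(v) for k, v in pairs.items())

print(serialize({'a:b': 'x||y', 'key2': 'plain'}))
# a\:b::x\|\|y||key2::plain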
A simple solution is to escape a separator (with a backslash, for instance) any time it occurs in data:
Name,City,State
John Doe\, Jr.,Anytown,CA
Of course, the escape character will need to be escaped when it occurs in data as well; in this case, a backslash would become \\.
You can use a character that is unlikely to appear in your data as the separator (e.g. a vertical tab :-) ).
You can escape the separator characters in your data during serialization. For example: if you use one character as the separator (key1:value1|key2:value2|...) and your data is:
this:is:key1 this|is|data1
this:is:key2 this|is|data2
you double every colon and pipe character in your data when you serialize it. So you will get:
this::is::key1:this||is||data1|this::is::key2:this||is||data2|...
During deserialization, whenever you come across two colons or two pipes you know that this is not your separator but part of your data, and that you have to collapse it back to one character. On the other hand, every single colon or pipe character is your separator.
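As a sketch, the encoding half of that doubling rule could look like this in Python (single : between key and value, single | between pairs, as in the example above):

def encode(s):
    return s.replace(':', '::').replace('|', '||')

data = {'this:is:key1': 'this|is|data1', 'this:is:key2': 'this|is|data2'}
print('|'.join(encode(k) + ':' + encode(v) for k, v in data.items()))
# this::is::key1:this||is||data1|this::is::key2:this||is||data2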
Use a prefix character (say "a") before each special character (say "b") present in the keys and values you store. This is called escaping.
Then decode the keys and values by simply replacing any "ab" sequence with "b". Bear in mind that the prefix is itself a special character. An example:
Prefix: \
Special characters: :, |, \
Encoded:
title:Slashdot\: News for Nerds. Stuff that Matters.|shortTitle:\\.
Decoded:
title=Slashdot: News for Nerds. Stuff that Matters.
shortTitle=\.
The common technique is escaping reserved characters. For example:
In URLs you escape some characters using a %HEX representation:
http://example.com?aa=a%20b
In programming languages you escape some characters with a backslash prefix:
"\"hello\""
When people talk about string delimiters, does that include quotes or does that mean everything except quotes?
It means any character used to define the beginning and end of a string (e.g. quotes but, in other contexts, other characters).
There's a subtle difference: if you're talking about string delimiters, that nearly always means quotes, either " or '.
If you're talking about a delimited string, then you're normally talking about a string of tokens with delimiters between them, i.e.
"this,is,a,delimited,string" -
It's very common to use a comma, as the delimiter, but that leads to issues when the token already contains a comma - for instance
"one,million,dollars,$1,000,000"
In this instance it's common to further delimit the token, so we get:
"one,million,dollars,"$1,000,000""
Another common alternative is to use an unusual character as the delimiter, and there's a minor convention of using the pipe symbol |:
"one|million|dollars|$1,000,000"