Acceptable encoding for Cosmos DB IDs to replace illegal characters? - azure

I'm trying to store data in Cosmos DB where the IDs use a slash (/). However slash is an illegal character in Cosmos IDs. I initially tried to resolve this by URL encoding slashes (%2F) as that's the form I'd generally receive them in through API requests. However, though percent (%) is not an illegal character for IDs, Cosmos still chokes on them being unable to retrieve many documents with a percent in the ID (it works for some, but it appears if the % is followed by certain characters it fails).
Is there a encoding that is suitable for Cosmos DB IDs that will replace illegal characters in the original ID text without introducing illegal or unhandled characters (like %) in the encoded ID text? I'd prefer to stay away from things like Base64 which makes the IDs hard to decipher for people. And I'd also like to avoid simple character replacement (/ becomes -) in case an ID uses the replacement character.

I ended up doing simple character replacement, swapping out slashes (/) with pipes (|).
The key thing to make this livable is adding a value converter with EntityFramework.
Expression<Func<string?, string>> toDB = v => v!.Replace("/", "|");
Expression<Func<string, string?>> fromDB = v => v!.Replace("|", "/");
builder.Property(p => p.Id).HasConversion(toDB, fromDB);
This allows the character replacement to happen automatically when reading & writing to the database. The only time you need to worry about the difference is if you're accessing the database directly or from other code without the converter. Or possibly doing custom searches. I manually do the translation for a filtering framework we use, and I suspect that other id search solutions would need the same manual translation.
Ultimately I decided this was acceptable as we are unlikely to have other characters that need translation for our case, the translation is easy to do visually, and it's transparent in most cases with ValueConverters. But it isn't a general solution that would work for any possible string id.
Edit:
On second thought, this solution is deficient. Cosmos does actually allow creating documents with illegal characters in the ID, it just doesn't allow accessing or deleting them easily. An ideal solution would prevent all illegal characters across the board, whether expected or not.

Related

fetching compact version of JSONB in PostgreSQL

How to fetch compact JSONB from PostgreSQL?
All I got when fetching is with spaces:
SELECT data FROM a_table WHERE id = 1; -- data is JSONB column
{"unique": "bla bla", "foo": {"bar": {"in ...
^ ^ ^ ^ ^ --> spaces
What I want is:
{"unique":"bla bla","foo":{"bar":{"in ...
json_strip_nulls() does exactly what you're looking for:
SELECT json_build_object('a', 1);
returns
{"a" : 1}
But
SELECT json_strip_nulls(json_build_object('a', 1));
returns
{"a":1}
This function not only strips nulls as indicated by its function name and as documented, but incidentally also strips insignificant whitespace. The latter is not explicitly documented in PostgreSQL manual.
Tested in PostgreSQL 11.3, but probably works with earlier versions too.
jsonb is rendered in a standardized format on output. You would have to use json instead to preserve insignificant white space. Per documentation:
Because the json type stores an exact copy of the input text, it will
preserve semantically-insignificant white space between tokens, as
well as the order of keys within JSON objects. Also, if a JSON object
within the value contains the same key more than once, all the
key/value pairs are kept. (The processing functions consider the last
value as the operative one.) By contrast, jsonb does not preserve
white space, does not preserve the order of object keys, and does not
keep duplicate object keys.
The whitespace really shouldn't matter for JSON values.
There is a discussion, started in 2016, about a function jsonb_compact() that will solve the problem... But, it could take years (!).
Pretty solution
  (a real solution for this question and this other one)
We must to agree with the PostgreSQL's convention for CAST(var_jsonb AS text). When you need another cast convention, for example to debug or human-readable output, the built-in jsonb_pretty() function is a good choice.
Unfortunately PostgreSQL not offers other choices, like the compact one. So, you can overload jsonb_pretty() with a compact option:
CREATE or replace FUNCTION jsonb_pretty(
jsonb, -- input
compact boolean -- true for compact format
) RETURNS text AS $$
SELECT CASE
WHEN $2=true THEN json_strip_nulls($1::json)::text
ELSE jsonb_pretty($1)
END
$$ LANGUAGE SQL IMMUTABLE;
SELECT jsonb_pretty( jsonb_build_object('a',1, 'bla','bla bla'), true );
-- results {"a":1,"bla":"bla bla"}
Rationale
The JSON standard, RFC 8259 says "... Insignificant whitespace is allowed before or after any of the six structural characters". In other words, the cast from jsonb datatype to text has no canonical form. The PostgreSQL cast convention (using spaces) is arbitrary.
A lot of applications need to minimize a big JSONb output. Two typical ones: minimizing file size of a big JSONb saved by pg_file_write(); output online in a REST interface.
The PostgreSQL team must to appreciate a real CAST procedure, not a parser, but a direct text production from JSONb internal representation.
The workaround — to remove spaces from "JSON text" — is not a simple task, it need a good parser to avoid tampering content. The solution is a parser, it is not a regular expression workaround... And in nowadays the built-in parser is json_strip_nulls(), even as "incidential behavior" parser.

Azure Table Storage RowKey restricted Character Patterns?

Are there restricted character patterns within Azure TableStorage RowKeys? I've not been able to find any documented via numerous searches. However, I'm getting behavior that implies such in some performance testing.
I've got some odd behavior with RowKeys consisting on random characters (the test driver does prevent the restricted characters (/ \ # ?) plus blocking single quotes from occurring in the RowKey). The result is I've got a RowKey that will insert fine into the table, but cannot be queried (the result is InvalidInput). For example:
RowKey: 9}5O0J=5Z,4,D,{!IKPE,~M]%54+9G0ZQ&G34!G+
Attempting to query by this RowKwy (equality) will result in an error (both within our app, using Azure Storage Explorer, and Cloud Storage Studio 2). I took a look at the request being sent via Fiddler:
GET /foo()?$filter=RowKey%20eq%20'9%7D5O0J=5Z,4,D,%7B!IKPE,~M%5D%54+9G0ZQ&G34!G+' HTTP/1.1
It appears the %54 in the RowKey is not escaped in the filter. Interestingly, I get similar behavior for batch requests to table storage with URIs in the batch XML that include this RowKey. I've also seen similar behavior for RowKeys with embedded double quotes, though I have not isolated that pattern yet.
Has anyone co me across this behavior? I can easily restrict additional characters from occurring in RowKeys, but would really like to know the 'rules'.
The following characters are not allowed in PartitionKey and RowKey fields:
The forward slash (/) character
The backslash (\) character
The number sign (#) character
The question mark (?) character
Further Reading: Azure Docs > Understanding the Table service data model
public static readonly Regex DisallowedCharsInTableKeys = new Regex(#"[\\\\#%+/?\u0000-\u001F\u007F-\u009F]");
Detection of Invalid Table Partition and Row Keys:
bool invalidKey = DisallowedCharsInTableKeys.IsMatch(tableKey);
Sanitizing the Invalid Partition or Row Key:
string sanitizedKey = DisallowedCharsInTableKeys.Replace(tableKey, disallowedCharReplacement);
At this stage you may also want to prefix the sanitized key (Partition Key or Row Key) with the hash of the original key to avoid false collisions of different invalid keys having the same sanitized value.
Do not use the string.GetHashCode() though since it may produce different hash code for the same string and shall not be used to identify uniqueness and shall not be persisted.
I use SHA256: https://msdn.microsoft.com/en-us/library/s02tk69a(v=vs.110).aspx
to create the byte array hash of the invalid key, convert the byte array to hex string and prefix the sanitized table key with that.
Also see related MSDN Documentation:
https://msdn.microsoft.com/en-us/library/azure/dd179338.aspx
Related Section from the link:
Characters Disallowed in Key Fields
The following characters are not allowed in values for the PartitionKey and RowKey properties:
The forward slash (/) character
The backslash (\) character
The number sign (#) character
The question mark (?) character
Control characters from U+0000 to U+001F, including:
The horizontal tab (\t) character
The linefeed (\n) character
The carriage return (\r) character
Control characters from U+007F to U+009F
Note that in addition to the mentioned chars in the MSDN article, I also added the % char to the pattern since I saw in a few places where people mention it being problematic. I guess some of this also depends on the language and the tech you are using to access the table storage.
If you detect additional problematic chars in your case, then you can add those to the regex pattern, nothing else needs to change.
I just found out (the hard way) that the '+' sign is allowed, but not possible to query in PartitionKey.
I found that in addition to the characters listed in Igorek's answer, these also can cause problems (e.g. inserts will fail):
|
[]
{}
<>
$^&
Tested with the Azure Node.js SDK.
I transform the key using this function:
private static string EncodeKey(string key)
{
return HttpUtility.UrlEncode(key);
}
This needs to be done for the insert and for the retrieve of course.

Node.js URL-encoding for pre-RFC3986 urls (using + vs %20)

Within Node.js, I am using querystring.stringify() to encode an object into a query string for usage in a URL. Values that have spaces are encoded as %20.
I'm working with a particularly finicky web service that will only accept spaces encoded as +, as used to be commonly done prior to RFC3986.
Is there a way to set an option for querystring so that it encodes spaces as +?
Currently I am simply doing a .replace() to replace all instances of %20 with +, but this is a bit tedious if there is an option I can set ahead of time.
If anyone still facing this issue, "qs" npm package has feature to encode spaces as +
qs.stringify({ a: 'b c' }, { format : 'RFC1738' })
I can't think of any library doing that by default, and unfortunately, I'd say your implementation may be the more efficient way to do this, since any other option would probably either do what you're already doing, or will use slower non-compiled pure JavaScript code.
What about asking the web service provider to follow the RFC?
https://github.com/kvz/phpjs is a node.js package that provides all the php functions. The http_build_query implementation at the time of writing this only supports urlencode (the query string includes + instead of spaces), but hopefully soon will include the enc_type parameter / rawurlencode (%20's for spaces).
See http://php.net/http_build_query.
RFC1738 (+'s) will be the default enc_type either way, so you can use it immediately for your purposes.

replaceAll quotes with backslashed quotes -- Is that enough?

I'm using replaceAll to replace single quotes with "\\\\'" per a colleague's suggestion, but I'm pretty sure that's not enough to prevent all SQL injections.
I did some googling and found this: http://wiki.postgresql.org/wiki/8.1.4_et._al._Security_Release_Technical_Info
This explains it for PostgreSQL, but does the replacing not work for all SQL managers? (Like, MySQL, for example?)
Also, I think I understand how the explanation I linked works for single backslash, but does it extend to my situation where I'm using four backslashes?
Please note that I'm not very familiar with databases and how they parse input, but this is my chance to learn more! Any insight would be appreciated.
Edit: I've gotten some really helpful, useful answers. My next question is, what kind of input would break my implementation? That is, if you give me input and I prepend all single quotes with four backslashes, what kind of input would you give me to inject SQL code? While I am convinced that my approach is naive and wrong, maybe some examples would better teach me how easy it is to inject SQL against my "prevention".
No, because what about backslashes? for instance if you turn ' into \' then the input \' will become \\' which is an unescaped single quote and a "character literal" backslash. For mysql there is mysql_real_escape_string() which should exist for every platform because its in the MySQL library bindings.
But there is another problem. And that is if you have no quote marks around the data segment. In php this looks like:
$query="select * from user where id=".$_GET[id];
The PoC exploit for this is very simple: http://localhost/vuln.php?id=sleep(10)
Even if you do a mysql_real_escape_string($_GET[id]) its still vulnerable to sqli because the attacker doesn't have to break out of quote marks in order to execute sql. The best solution is Parameterized Queries.
No.
This is not enough, and this is not the way to go. And I can say it without even knowing anything about your data, your SQL or even anything about your application. You should never, ever include any user data directly into your SQL. You should use parameterized statements instead.
Besides if you are asking this question you shouldn't write your own SQL by hand in the first place. Use a good ORM instead. Asking if your home-grown regular expression would make your application safe from SQL injection is like asking if your home-grown memory allocation routine that you have written in Assembly language is safe from buffer overruns - to which I would say: if you are asking this question then you should use a memory-safe language in the first place.
A simple case of SQL injection works like this (in pseudocode):
name = form_params["name"]
year = 2011
sql = "INSERT INTO Students (name, year) " +
"VALUES ('" + name + "', " + year + ");"
database_handle.query(sql)
year is supplied by you, the programmer, so it's not tainted, and can be embedded in the query in any way you find suitable; in this case — as an unquoted number.
But name is supplied by the user and so can be anything. Along comes Bobby Tables and inputs this value:
name = "Robert'); DROP TABLE Students; -- "
And the query becomes
INSERT INTO Students (name, year) VALUES ('Robert');
DROP TABLE Students; -- ', 2011);
That substitution turned your one query into two.
The first one gives an error because of the mismatched row count, but that doesn't matter, because the database is able to unambiguously find and run the second query. The attacker can work around the error by fiddling with the input anyway. The -- is a comment so that the rest of the input is ignored.
Note how data suddenly became code — a typical sign of a security problem.
What the suggested replacement does is this:
name = form_params["name"].regex_replace("'", "\\\\'")
How this works is confusing, hence my earlier comment. The string literal "\\\\'" represents the string \\'. The regex_replace function interprets that as the string \'. The database then sees
... VALUES ('Robert\'); DROP TABLE Students; -- ', 2011);
and interprets that correctly as a quite unusual name.
Among other problems this approach is very fragile. If the strings you use in your language don't substitute \\ as \, if your string substitution function doesn't interpret \\ as \ (if it's not a regex function or it uses $1 instead of \1 for backreferences) you could end up with an even number of slashes like
... VALUES ('Robert\\'); DROP TABLE Students; -- ', 2011);
and no SQL injection will be prevented.
The solution is not to check what the language and library does with all possible input you can think of, or to anticipate what it might do in a future version, but rather to use the facilities provided by the database. These usually come in two flavours:
database-aware escaping, which does exactly the right escaping of any data because the client library matches the server and it knows what the character encoding of the database you are querying is:
sql = "... '" + database_handle.escape(name) + "' ..."
out-of-band data submission (usually with prepared statments), so the data isn't even in the same string as the code:
sql = "... VALUES (:n, :y);"
database_handle.query(sql, n = name, y = year)

Delimiter to use within a query string value

I need to accept a list of file names in a query string. ie:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc
Do you have any recommendations on what delimiter to use?
Having query parameters multiple times is legal, and the only way to guarantee no parsing problems in all cases:
http://someSite/someApp/myUtil.ashx?file=file1.txt&file=file2.bmp&file=file3.doc
The semicolon ; must be URI encoded if part of a filename (turned to %3B), yet not if it is separating query parameters which is its reserved use.
See section 2.2 of this rfc:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
If they're filenames, a good choice would be a character which is disallowed in filenames. Suggestions so far included , | & which are generally allowed in filenames and therefore might lead to ambiguities. / on the other hand is generally not allowed, not even on Windows. It is allowed in URIs, and it has no special meaning in query strings.
Example:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc is bad because it may refer to the valid file file1.txt|file2.bmp.
http://someSite/someApp/myUtil.ashx?files=file1.txt/file2.bmp/file3.doc unambiguously refers to 3 files.
I would recommend making each file its own query parameter, i.e.
myUtil.ashx?file1=file1.txt&file2=file2.bmp&file3=file3.doc
This way you can just use standard query parsing and loop
Do you need to list the filenames as a string?
Most languages accepts arrays in the querystring so you could write it like
http://someSite/someApp/myUtil.ashx?files[]=file1.txt&files[]=file2.bmp&files[]=file3.doc
If it doesn't, or you can't use for some other reason, you should stick to a delimiter that is either not allowed or unusual in a filename. Pipe (|) is a good one, otherwise you could urlencode an invisible character since they are quite easy to use in coding, but harder to actually include in a filename.
I usually use arrays when possible and pipe otherwise.
I've always used double pipes "||". I don't have any good evidence to back up why this is a good choice other than 10 years of web programming and it's never been an issue.
This is one common problem. How i handled it was: I created a method which accepted a list of strings, then found a character that was not in any of the strings. (I did this by a simple concatenation of the strings, then testing for various characters.) Once a character was found, concatenated all the strings together but also prepended the string with the separation character. So in the given question, one example wud be:
http://someSite/someApp/myUtil.ashx?files=|file1.txt|file2.bmp|file3.doc
and another wud be:
http://someSite/someApp/myUtil.ashx?files=,file1.txt,file2.bmp,file3.doc
But since i actually use a method that guarantees my separator character is not in the rest of the strings, it is safe. It was a bit of work to create the first time, but i've used it MANY times in various applications.
I think I would consider using commas or semicolons.
I would build on MSalters answer by saying, to generalize, the best delimiter is one that is invalid to the items in the list. For example, if your list is prices, a comma is a bad delimiter because it can be confused with the values. For that reason, as most these answers suggest, I think a good general purpose delimiter is probably "|" as it is rarely a valid value. "/" is maybe not the best delimiter generally as it is valid for paths sometimes.

Resources