Azure Table Storage RowKey restricted Character Patterns? - azure

Are there restricted character patterns within Azure TableStorage RowKeys? I've not been able to find any documented via numerous searches. However, I'm getting behavior that implies such in some performance testing.
I've got some odd behavior with RowKeys consisting of random characters (the test driver prevents the restricted characters (/ \ # ?) and single quotes from occurring in the RowKey). The result is a RowKey that will insert fine into the table but cannot be queried (the result is InvalidInput). For example:
RowKey: 9}5O0J=5Z,4,D,{!IKPE,~M]%54+9G0ZQ&G34!G+
Attempting to query by this RowKey (equality) results in an error (within our app, in Azure Storage Explorer, and in Cloud Storage Studio 2). I took a look at the request being sent via Fiddler:
GET /foo()?$filter=RowKey%20eq%20'9%7D5O0J=5Z,4,D,%7B!IKPE,~M%5D%54+9G0ZQ&G34!G+' HTTP/1.1
It appears the %54 in the RowKey is not escaped in the filter. Interestingly, I get similar behavior for batch requests to table storage with URIs in the batch XML that include this RowKey. I've also seen similar behavior for RowKeys with embedded double quotes, though I have not isolated that pattern yet.
Has anyone come across this behavior? I can easily restrict additional characters from occurring in RowKeys, but would really like to know the 'rules'.

The following characters are not allowed in PartitionKey and RowKey fields:
The forward slash (/) character
The backslash (\) character
The number sign (#) character
The question mark (?) character
Further Reading: Azure Docs > Understanding the Table service data model

public static readonly Regex DisallowedCharsInTableKeys = new Regex(@"[\\#%+/?\u0000-\u001F\u007F-\u009F]");
Detection of Invalid Table Partition and Row Keys:
bool invalidKey = DisallowedCharsInTableKeys.IsMatch(tableKey);
Sanitizing the Invalid Partition or Row Key:
string sanitizedKey = DisallowedCharsInTableKeys.Replace(tableKey, disallowedCharReplacement);
At this stage you may also want to prefix the sanitized key (Partition Key or Row Key) with the hash of the original key to avoid false collisions of different invalid keys having the same sanitized value.
Do not use string.GetHashCode() for this, though, since it may produce different hash codes for the same string; it should not be used to identify uniqueness and should not be persisted.
I use SHA256 (https://msdn.microsoft.com/en-us/library/s02tk69a(v=vs.110).aspx) to create a byte-array hash of the invalid key, convert the byte array to a hex string, and prefix the sanitized table key with that.
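A minimal sketch tying the pieces above together (the class and method names here are my own illustration, not a prescribed API; the replacement string is whatever you choose):
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

public static class TableKeySanitizer
{
    static readonly Regex DisallowedCharsInTableKeys =
        new Regex(@"[\\#%+/?\u0000-\u001F\u007F-\u009F]");

    // Replaces disallowed characters and prefixes the result with a SHA256 hex
    // hash of the original key, so different invalid keys cannot collide after
    // sanitizing.
    public static string Sanitize(string tableKey, string disallowedCharReplacement)
    {
        if (!DisallowedCharsInTableKeys.IsMatch(tableKey))
            return tableKey;

        string sanitized = DisallowedCharsInTableKeys.Replace(tableKey, disallowedCharReplacement);
        using (var sha256 = SHA256.Create())
        {
            byte[] hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(tableKey));
            string hex = BitConverter.ToString(hash).Replace("-", "");
            return hex + sanitized;
        }
    }
}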
Also see related MSDN Documentation:
https://msdn.microsoft.com/en-us/library/azure/dd179338.aspx
Related Section from the link:
Characters Disallowed in Key Fields
The following characters are not allowed in values for the PartitionKey and RowKey properties:
The forward slash (/) character
The backslash (\) character
The number sign (#) character
The question mark (?) character
Control characters from U+0000 to U+001F, including:
The horizontal tab (\t) character
The linefeed (\n) character
The carriage return (\r) character
Control characters from U+007F to U+009F
Note that in addition to the characters mentioned in the MSDN article, I also added the % character to the pattern, since I have seen it mentioned as problematic in a few places. I guess some of this also depends on the language and technology you are using to access Table Storage.
If you detect additional problematic characters in your case, you can add those to the regex pattern; nothing else needs to change.

I just found out (the hard way) that the '+' sign is allowed, but not possible to query in PartitionKey.

I found that in addition to the characters listed in Igorek's answer, these can also cause problems (e.g. inserts will fail):
|
[]
{}
<>
$^&
Tested with the Azure Node.js SDK.
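Folding these reports together, a sketch of an extended pattern (building on the regex in the earlier answer; which extra characters you actually need to block is an assumption you should verify against your own SDK and query paths):
// The MSDN-disallowed characters plus '%' and '+' from the answers above, and
// the characters reported for the Node.js SDK: | [ ] { } < > $ ^ &
// (System.Text.RegularExpressions).
static readonly Regex ExtendedDisallowedCharsInTableKeys =
    new Regex(@"[\\#%+/?\u0000-\u001F\u007F-\u009F|\[\]{}<>$^&]");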

I transform the key using this function:
// Requires System.Web for HttpUtility.
private static string EncodeKey(string key)
{
    return HttpUtility.UrlEncode(key);
}
This needs to be done for the insert and for the retrieve of course.
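A minimal usage sketch, assuming the classic Microsoft.WindowsAzure.Storage SDK, where table is an initialized CloudTable and MyEntity derives from TableEntity (both hypothetical names here):
// Insert: store the encoded keys.
var entity = new MyEntity
{
    PartitionKey = EncodeKey(rawPartitionKey),
    RowKey = EncodeKey(rawRowKey)
};
table.Execute(TableOperation.Insert(entity));

// Retrieve: query by the encoded keys, then decode to recover the originals.
var retrieve = TableOperation.Retrieve<MyEntity>(EncodeKey(rawPartitionKey), EncodeKey(rawRowKey));
var stored = (MyEntity)table.Execute(retrieve).Result;
string originalRowKey = HttpUtility.UrlDecode(stored.RowKey);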

Related

Acceptable encoding for Cosmos DB IDs to replace illegal characters?

I'm trying to store data in Cosmos DB where the IDs use a slash (/). However, slash is an illegal character in Cosmos IDs. I initially tried to resolve this by URL-encoding slashes (%2F), as that's the form I'd generally receive them in through API requests. However, though percent (%) is not an illegal character for IDs, Cosmos still chokes on them and is unable to retrieve many documents with a percent in the ID (it works for some, but it appears to fail if the % is followed by certain characters).
Is there an encoding suitable for Cosmos DB IDs that replaces illegal characters in the original ID text without introducing illegal or unhandled characters (like %) in the encoded ID text? I'd prefer to stay away from things like Base64, which makes the IDs hard to decipher for people. And I'd also like to avoid simple character replacement (/ becomes -) in case an ID uses the replacement character.
I ended up doing simple character replacement, swapping out slashes (/) with pipes (|).
The key thing to make this livable is adding a value converter with EntityFramework.
Expression<Func<string?, string>> toDB = v => v!.Replace("/", "|");
Expression<Func<string, string?>> fromDB = v => v!.Replace("|", "/");
builder.Property(p => p.Id).HasConversion(toDB, fromDB);
This allows the character replacement to happen automatically when reading and writing to the database. The only time you need to worry about the difference is if you're accessing the database directly or from other code without the converter, or possibly when doing custom searches. I manually do the translation for a filtering framework we use, and I suspect other id-search solutions would need the same manual translation.
Ultimately I decided this was acceptable as we are unlikely to have other characters that need translation for our case, the translation is easy to do visually, and it's transparent in most cases with ValueConverters. But it isn't a general solution that would work for any possible string id.
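For context, a sketch of where that configuration could live in an EF Core model (the Document entity and Id property names are hypothetical; requires System.Linq.Expressions and Microsoft.EntityFrameworkCore):
// Inside your DbContext:
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Document>(builder =>
    {
        // Swap '/' for '|' going into Cosmos, and back again coming out.
        Expression<Func<string?, string>> toDB = v => v!.Replace("/", "|");
        Expression<Func<string, string?>> fromDB = v => v!.Replace("|", "/");
        builder.Property(p => p.Id).HasConversion(toDB, fromDB);
    });
}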
Edit:
On second thought, this solution is deficient. Cosmos does actually allow creating documents with illegal characters in the ID, it just doesn't allow accessing or deleting them easily. An ideal solution would prevent all illegal characters across the board, whether expected or not.

Azure CosmosDB: illegal characters in Document Id

I have the issue that the Ids that are being generated based on certain input contain the character "/". This leads to an error during the Upsert operation as "/" is not allowed in the Document id.
Which characters are not allowed beside that?
What are ways to handle such a situation?
The illegal characters are /, \, ?, # (see https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.resource.id?view=azure-dotnet)
Ways to deal with such a situation:
Remove the character already in the input used for generating the id
Replace the character in the id with another character / set of characters
Use Hashing / Encoding (e.g. Base64)
If you know a better way please share. Thanks
I'm base64 encoding the plaintext. Then replacing the '/' and '=' that can still be there from base64 with '-' and '_' respectively. This seems to be working well. I ran into other issues when I just tried UrlEncode the value.
Pseudo-code (here as concrete C#, assuming UTF-8 plain-text IDs):
var encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(plainTextId));
var preppedId = encoded.Replace('/', '-').Replace('=', '_');
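For completeness, a sketch of the reverse mapping under the same assumptions (undo the character replacement, then base64-decode):
var restoredBase64 = preppedId.Replace('-', '/').Replace('_', '=');
var plainTextId = Encoding.UTF8.GetString(Convert.FromBase64String(restoredBase64));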

U-SQL Error - Change the identifier to use at least one lower case letter

I am fairly new to U-SQL and am trying to run a U-SQL script in Azure Data Lake Analytics to process a Parquet file using the Parquet extractor functionality. I am getting the error below and can't find a way to get around it.
Error - Change the identifier to use at least one lower case letter. If that is not possible, then escape that identifier (for example: '[ACTIVITY]'), or embed it in a CSHARP() block (e.g CSHARP(ACTIVITY)).
Unfortunately all the fields generated in the Parquet file are capitalized, and I don't want to escape these identifiers. I also tried wrapping the identifier in a CSHARP() block, but that fails as well (E_CSC_USER_RESERVEDKEYWORDASIDENTIFIER: Reserved keyword CSHARP is used as an identifier.) Is there any way I could extract the Parquet file? Thanks for your help!
Code Snippet:
SET @@FeaturePreviews = "EnableParquetUdos:on";
@var1 =
    EXTRACT ACTIVITY string,
            AUTHOR_NAME string,
            AFFLIATION string
    FROM "adl://xxx.azuredatalakestore.net/Abstracts/FY2018_028"
    USING Extractors.Parquet();
@var2 =
    SELECT *
    FROM @var1
    ORDER BY ACTIVITY ASC
    FETCH 5 ROWS;
OUTPUT @var2
TO "adl://xxx.azuredatalakestore.net/Results/AbstractsResults.csv"
USING Outputters.Csv();
Based on your description, you are trying to write something like:
EXTRACT ALLCAPSNAME int FROM "/data.parquet" USING Extractors.Parquet();
In U-SQL, we reserve all caps identifiers so we can add new keywords in the future without invalidating old scripts.
To work around, you just have to quote the name (escape it) like in any other SQL dialect:
EXTRACT [ALLCAPSNAME] int FROM "/data.parquet" USING Extractors.Parquet();
Note that this is not changing the name of the field. It is just the syntactic way to address the field.
Also note that in most SQL communities, it is considered a best practice to always quote identifiers to avoid reserved-keyword clashes.
If all fields in the Parquet file are all caps, you will have to quote them all... In a future update you will be able to say EXTRACT * FROM … for Parquet (and Orc) files, but you still will need to quote the columns when you refer to them explicitly.

fetching compact version of JSONB in PostgreSQL

How to fetch compact JSONB from PostgreSQL?
All I get when fetching is JSON with spaces:
SELECT data FROM a_table WHERE id = 1; -- data is JSONB column
{"unique": "bla bla", "foo": {"bar": {"in ...
^ ^ ^ ^ ^ --> spaces
What I want is:
{"unique":"bla bla","foo":{"bar":{"in ...
json_strip_nulls() does exactly what you're looking for:
SELECT json_build_object('a', 1);
returns
{"a" : 1}
But
SELECT json_strip_nulls(json_build_object('a', 1));
returns
{"a":1}
This function not only strips nulls as indicated by its function name and as documented, but incidentally also strips insignificant whitespace. The latter is not explicitly documented in PostgreSQL manual.
Tested in PostgreSQL 11.3, but probably works with earlier versions too.
jsonb is rendered in a standardized format on output. You would have to use json instead to preserve insignificant white space. Per documentation:
Because the json type stores an exact copy of the input text, it will
preserve semantically-insignificant white space between tokens, as
well as the order of keys within JSON objects. Also, if a JSON object
within the value contains the same key more than once, all the
key/value pairs are kept. (The processing functions consider the last
value as the operative one.) By contrast, jsonb does not preserve
white space, does not preserve the order of object keys, and does not
keep duplicate object keys.
The whitespace really shouldn't matter for JSON values.
There is a discussion, started in 2016, about a function jsonb_compact() that will solve the problem... But, it could take years (!).
Pretty solution
  (a real solution for this question and this other one)
We must agree with PostgreSQL's convention for CAST(var_jsonb AS text). When you need another cast convention, for example for debugging or human-readable output, the built-in jsonb_pretty() function is a good choice.
Unfortunately, PostgreSQL does not offer other choices, such as a compact one. So you can overload jsonb_pretty() with a compact option:
CREATE OR REPLACE FUNCTION jsonb_pretty(
  jsonb,           -- input
  compact boolean  -- true for compact format
) RETURNS text AS $$
  SELECT CASE
    WHEN $2 = true THEN json_strip_nulls($1::json)::text
    ELSE jsonb_pretty($1)
  END
$$ LANGUAGE SQL IMMUTABLE;
SELECT jsonb_pretty( jsonb_build_object('a',1, 'bla','bla bla'), true );
-- results {"a":1,"bla":"bla bla"}
Rationale
The JSON standard, RFC 8259, says "... Insignificant whitespace is allowed before or after any of the six structural characters". In other words, the cast from the jsonb datatype to text has no canonical form. The PostgreSQL cast convention (using spaces) is arbitrary.
A lot of applications need to minimize a big JSONB output. Two typical cases: minimizing the file size of a big JSONB saved with pg_file_write(), and returning compact output online from a REST interface.
The PostgreSQL team should consider a real CAST procedure: not a parser, but direct text production from the internal JSONB representation.
The workaround of removing spaces from the JSON text is not a simple task; it needs a proper parser to avoid tampering with the content, so it is not a job for a regular-expression hack... And nowadays the built-in parser that does this is json_strip_nulls(), even if only as incidental behavior.

What are the user defined key's value restrictions?

In ArangoDB, when a collection is defined to allow user defined keys, what are the restrictions on the value of the key? For example, it appears that a key of "Name-2" works but a key of "Name,2" gives ArangoError 1221: invalid document key error.
Quoting from the manual
The key must be a string value. Numeric keys are not allowed, but any numeric value can be put into a string and can then be used as document key.
The key must be at least 1 byte and at most 254 bytes long. Empty keys are disallowed when specified (though it may be valid to completely omit the _key attribute from a document)
It must consist of the letters a-z (lower or upper case), the digits 0-9 or any of the following punctuation characters: _ - : . # ( ) + , = ; $ ! * ' %
Any other characters, especially multi-byte UTF-8 sequences, whitespace or punctuation characters cannot be used inside key values
The key must be unique within the collection it is used in
Keys are case-sensitive, i.e. myKey and MyKEY are considered to be different keys.
Restrictions (or naming conventions) for user defined keys can be found in docs here.
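Based on the rules quoted above, a rough validation sketch in C# (the character set is copied verbatim from the quoted manual text; check it against the docs for your ArangoDB version; the 254-byte limit equals 254 characters here because all allowed characters are single-byte ASCII):
using System.Text.RegularExpressions;

// 1 to 254 characters drawn from the allowed set: letters, digits, and the
// punctuation characters listed in the manual excerpt above.
static readonly Regex ValidArangoDocumentKey =
    new Regex(@"^[A-Za-z0-9_\-:.#()+,=;$!*'%]{1,254}$");

bool isValid = ValidArangoDocumentKey.IsMatch(candidateKey);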
