Delimiter to use within a query string value - string

I need to accept a list of file names in a query string. ie:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc
Do you have any recommendations on what delimiter to use?

Having query parameters multiple times is legal, and the only way to guarantee no parsing problems in all cases:
http://someSite/someApp/myUtil.ashx?file=file1.txt&file=file2.bmp&file=file3.doc
The semicolon ; must be URI encoded if part of a filename (turned to %3B), yet not if it is separating query parameters which is its reserved use.
See section 2.2 of this rfc:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="

If they're filenames, a good choice would be a character which is disallowed in filenames. Suggestions so far included , | & which are generally allowed in filenames and therefore might lead to ambiguities. / on the other hand is generally not allowed, not even on Windows. It is allowed in URIs, and it has no special meaning in query strings.
Example:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc is bad because it may refer to the valid file file1.txt|file2.bmp.
http://someSite/someApp/myUtil.ashx?files=file1.txt/file2.bmp/file3.doc unambiguously refers to 3 files.

I would recommend making each file its own query parameter, i.e.
myUtil.ashx?file1=file1.txt&file2=file2.bmp&file3=file3.doc
This way you can just use standard query parsing and loop

Do you need to list the filenames as a string?
Most languages accepts arrays in the querystring so you could write it like
http://someSite/someApp/myUtil.ashx?files[]=file1.txt&files[]=file2.bmp&files[]=file3.doc
If it doesn't, or you can't use for some other reason, you should stick to a delimiter that is either not allowed or unusual in a filename. Pipe (|) is a good one, otherwise you could urlencode an invisible character since they are quite easy to use in coding, but harder to actually include in a filename.
I usually use arrays when possible and pipe otherwise.

I've always used double pipes "||". I don't have any good evidence to back up why this is a good choice other than 10 years of web programming and it's never been an issue.

This is one common problem. How i handled it was: I created a method which accepted a list of strings, then found a character that was not in any of the strings. (I did this by a simple concatenation of the strings, then testing for various characters.) Once a character was found, concatenated all the strings together but also prepended the string with the separation character. So in the given question, one example wud be:
http://someSite/someApp/myUtil.ashx?files=|file1.txt|file2.bmp|file3.doc
and another wud be:
http://someSite/someApp/myUtil.ashx?files=,file1.txt,file2.bmp,file3.doc
But since i actually use a method that guarantees my separator character is not in the rest of the strings, it is safe. It was a bit of work to create the first time, but i've used it MANY times in various applications.

I think I would consider using commas or semicolons.

I would build on MSalters answer by saying, to generalize, the best delimiter is one that is invalid to the items in the list. For example, if your list is prices, a comma is a bad delimiter because it can be confused with the values. For that reason, as most these answers suggest, I think a good general purpose delimiter is probably "|" as it is rarely a valid value. "/" is maybe not the best delimiter generally as it is valid for paths sometimes.

Related

What is % in Lua [not for numeric operations]?

So I was using string.find() to find "(" in my string. But without "%" before "(" (like "%(") it says: "unfinished capture". What exactly this symbol doing?
Works:
local str = "(text)"
print(str:find("%("))
Don`t work:
local str = "(text)"
print(str:find("("))
It's used in patterns, used in some functions related to finding. For example, %s means "find a single whitespace character". %( searches for the character (. The reason you can't directly write ( is that that will create a capture, which is a mechanism to retrieve a part of a match. For other characters, you directly type them, unless there is a similar restriction.

Acceptable encoding for Cosmos DB IDs to replace illegal characters?

I'm trying to store data in Cosmos DB where the IDs use a slash (/). However slash is an illegal character in Cosmos IDs. I initially tried to resolve this by URL encoding slashes (%2F) as that's the form I'd generally receive them in through API requests. However, though percent (%) is not an illegal character for IDs, Cosmos still chokes on them being unable to retrieve many documents with a percent in the ID (it works for some, but it appears if the % is followed by certain characters it fails).
Is there a encoding that is suitable for Cosmos DB IDs that will replace illegal characters in the original ID text without introducing illegal or unhandled characters (like %) in the encoded ID text? I'd prefer to stay away from things like Base64 which makes the IDs hard to decipher for people. And I'd also like to avoid simple character replacement (/ becomes -) in case an ID uses the replacement character.
I ended up doing simple character replacement, swapping out slashes (/) with pipes (|).
The key thing to make this livable is adding a value converter with EntityFramework.
Expression<Func<string?, string>> toDB = v => v!.Replace("/", "|");
Expression<Func<string, string?>> fromDB = v => v!.Replace("|", "/");
builder.Property(p => p.Id).HasConversion(toDB, fromDB);
This allows the character replacement to happen automatically when reading & writing to the database. The only time you need to worry about the difference is if you're accessing the database directly or from other code without the converter. Or possibly doing custom searches. I manually do the translation for a filtering framework we use, and I suspect that other id search solutions would need the same manual translation.
Ultimately I decided this was acceptable as we are unlikely to have other characters that need translation for our case, the translation is easy to do visually, and it's transparent in most cases with ValueConverters. But it isn't a general solution that would work for any possible string id.
Edit:
On second thought, this solution is deficient. Cosmos does actually allow creating documents with illegal characters in the ID, it just doesn't allow accessing or deleting them easily. An ideal solution would prevent all illegal characters across the board, whether expected or not.

Regular Expression for space separated words

I am trying to match email footers like
Thanks,
Regards
Thanks & Regards,
with the help of regular expression I am able to get first 2 cases,
({language_keyword_footer[0]}|{language_keyword_footer[1]}(.*?)(\S+(.*?)))
language keyword footer is the metadata that I have created to add more cases in future,
keywords = {
"en": {'header':["From:", "Subject:"],'footer':["Regards,", "Thanks,","Thanks & Regards,"]}}
The problem is when I use this approach it captures only Regards, and discards Thanks & Regards,
is there a way I can add it to the existing re and capture this space separated scenario as well, any help is appreciated
You need to sort the items by length in dedcending order as alternation patterns in an NFA regex engine follow "first matched, first served" principle (see the "Remember That The Regex Engine Is Eager" article). Also, do not forget to escape the strings used as literal pattern inside the regex.
You can use something like
pattern = r'({})(.*?)(\S+)'.format("|".join(map(re.escape, sorted(language_keyword_footer, key=len, reverse=True))))
The resulting pattern would look like (Thanks\ \&\ Regards,|Regards,|Thanks,)(.*?)(\S+) here, and will match Thanks & Regards,, Regards, or Thanks, (in Group 1), then any zero or more chars other than line break chars as few as possible (in Group 2), and then one or more non-whitespace chars (with (\S+)).
Note the .*? at the end of a regex pattern has no effect on the final result.

LUA -- gsub problems -- passing a variable to the match string isn't working [duplicate]

This question already has an answer here:
How to match a sentence in Lua
(1 answer)
Closed 1 year ago.
Been stuck on this for over a day.
I'm trying to use gsub to extract a portion of an input string. The exact pattern of the input varies in different cases, so I'm trying to use a variable to represent that pattern, so that the same routine - which is otherwise identical - can be used in all cases, rather than separately coding each.
So, I have something along the lines of:
newstring , n = oldstring:gsub(matchstring[i],"%1");
where matchstring[] is an indexed table of the different possible pattern matches, set up so that "%1" will match the target sequence in each matchstring[].
For instance, matchstring[1] might be
"\[User\] <code:%w*>([^<]*)<\\code>.*" -- extract user name from within the <code>...<\code>
while matchstring[2] could be
"\[World\] (%w)* .*" -- extract user name as first word after prefix '[World] '
and matchstring[3] could be
"<code:%w*>([^<]*)<\\code>.*" -- extract username from within <code>...<\code> at start
This does not work.
Yet when, debugging one of the cases, I replace matchstring[i] with the exact same string -- only now passed as a string literal rather than saved in a variable -- it works.
So.. I'm guessing there must be some 'processing' of the string - stripping out special characters or something - when it's sent as a variable rather than a string literal ... but for the life of me I can't figure out how to adjust the matchstring[] entries to compensate!
Help much appreciated...
FACEPALM
Thankyou, Piglet, you got me on the right track.
Given how this particular platform processes & passes strings, anything within <...> needed the escape character \ for downstream use, but of course - duh - for the lua gsub's processing itself it needed the standard %
much obliged

Why does urllib.parse.urlencode not change '_' into %5F?

I am writing POST request for game I am trying to make scripts for. For this post, I am using the common req = urllib.request.Request(url=url, data=params, headers=headers) First though, I have a dictionary of the data needed for the request, and I must encode it with params = urllib.parse.urlencode(OrderedDict[])
However, this gives me a string, but not the proper one! It will give me:
&x=_1&y_=2&_z_=3
But, the way the game encodes things, it should be:
&x=%5F1&y%5F=2&%5Fz%5F=3
So mine doesn't encode the underscores to be "%5F". How do I fix this? If I can, I have the params that the game uses (in url, pre-encoded for), would I be able to use that in the data field of the request?
Underscores don't need to be encoded, because they are valid characters in URLs.
As per RFC 1738:
Unsafe:
Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
The characters < and > are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (") is used to
delimit URLs in some systems. The character # is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it. The character % is unsafe because it is used for
encodings of other characters. Other characters are unsafe because
gateways and other transport agents are known to sometimes modify
such characters. These characters are {, }, |, \, ^, ~,
[, ], and `.
All unsafe characters must always be encoded within a URL.
So the reason _ is not replaced by %5F is the same reason that a is not replaced by %61: it's just not necessary. Web servers don't (or shouldn't) care either way.
In case the web server you're trying to use does care (but please check first if that's the case), you'll have to do some manual work, as urllibs quoting does not support quoting _:
urllib.parse.quote(string, safe='/', encoding=None, errors=None)
Replace special characters in string using the %xx escape. Letters, digits, and the characters _.- are never quoted.
You can probably wrap quote() with your own function and pass that to urlencode(). Something like this (fully untested):
def extra_quote(*args, **kwargs):
quoted = urllib.pars.quote(*args, **kwargs)
return str.replace(quoted, '_', '%5F')
urllib.parse.urlencode(query, quote_via=extraquote)

Resources