Why does urllib.parse.urlencode not change '_' into %5F? - python-3.x

I am writing a POST request for a game I am trying to make scripts for. For this POST I am using the common req = urllib.request.Request(url=url, data=params, headers=headers). First, though, I have a dictionary of the data needed for the request, and I must encode it with params = urllib.parse.urlencode(ordered_dict) (where ordered_dict is an OrderedDict of the fields).
However, this gives me a string, but not the proper one! It will give me:
&x=_1&y_=2&_z_=3
But, the way the game encodes things, it should be:
&x=%5F1&y%5F=2&%5Fz%5F=3
So mine doesn't encode the underscores as "%5F". How do I fix this? Alternatively, I have the params that the game uses (pre-encoded, as they appear in the URL); would I be able to use that string directly in the data field of the request?

Underscores don't need to be encoded, because they are valid characters in URLs.
As per RFC 1738:
Unsafe:
Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
The characters < and > are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (") is used to
delimit URLs in some systems. The character # is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it. The character % is unsafe because it is used for
encodings of other characters. Other characters are unsafe because
gateways and other transport agents are known to sometimes modify
such characters. These characters are {, }, |, \, ^, ~,
[, ], and `.
All unsafe characters must always be encoded within a URL.
So the reason _ is not replaced by %5F is the same reason that a is not replaced by %61: it's just not necessary. Web servers don't (or shouldn't) care either way.
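You can confirm the default behavior quickly (a minimal check; the dictionary below just mirrors the values in the question):
import urllib.parse

# Underscores pass through urlencode untouched, as RFC 1738 allows.
print(urllib.parse.urlencode({'x': '_1', 'y_': '2', '_z_': '3'}))
# -> x=_1&y_=2&_z_=3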
In case the web server you're trying to use does care (but please check first whether that's actually the case), you'll have to do some manual work, as urllib's quoting does not support quoting _:
urllib.parse.quote(string, safe='/', encoding=None, errors=None)
Replace special characters in string using the %xx escape. Letters, digits, and the characters _.- are never quoted.
You can probably wrap quote() with your own function and pass that to urlencode(). Something like this (fully untested):
def extra_quote(*args, **kwargs):
    # Quote as usual first, then percent-encode the underscores as well.
    quoted = urllib.parse.quote(*args, **kwargs)
    return quoted.replace('_', '%5F')

urllib.parse.urlencode(query, quote_via=extra_quote)
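Used end to end, that looks like this (a sketch; quote_via requires Python 3.5+, and the field names are just the ones from the question):
import urllib.parse

def extra_quote(*args, **kwargs):
    # Quote normally, then also percent-encode underscores.
    quoted = urllib.parse.quote(*args, **kwargs)
    return quoted.replace('_', '%5F')

params = urllib.parse.urlencode({'x': '_1', 'y_': '2', '_z_': '3'},
                                quote_via=extra_quote)
print(params)  # -> x=%5F1&y%5F=2&%5Fz%5F=3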

Related

Can I get escape characters to still behave as such for a string provided by f:read() in Lua?

I'm working on a simple localization function for my scripts and, although it's starting to work quite well so far, I don't know how to stop escape/special characters from showing up in the UI as part of the text after feeding the widgets with the strings returned by f:read().
For example, if a certain line of Strings.ES.txt contains: Ignorar \"Etiquetas de capa\", I'd expect the backslashes not to end up being shown, just like when I feed the widget a normal string between double quotes such as: "Ignorar \"Etiquetas de capa\"", or at least to have a way to avoid it. I've been trial-and-erroring with the tostring() and load() functions and different (surely nonsensical 🙄) concatenations like load(tostring("[[" .. f:read() .. "]]")) and such, without any success, so here I am again...
Does someone know if there is a way to get escape characters in a string returned by f:read() to still behave as special as when they are found in a regular one?
I don't know how to stop escape/special characters from showing up in the UI as part of the text
What you want is to "unescape" or "unquote" a string to interpret escape sequences as if it were parsed as a quoted string by Lua.
[...] with the strings returned by f:read() [...]
The fact that this string was obtained using f:read() can be ignored; all that matters is that it is the contents of a string literal, without the surrounding quotes, using quoted-string escapes.
I've been trial-and-erroring with the tostring() and load() functions and different [...] concatenations like load(tostring("[[" .. f:read() .. "]]")) and such without any success [...]
This is almost how to do it, except you chose the wrong string literal type: "long" strings using pairs of square brackets ([ and ]) do not interpret escape sequences at all; they are intended for including long, raw, possibly multiline strings in Lua programs, and often come in handy when you need to represent literal strings with backslashes (e.g. regular expressions - not to be confused with Lua patterns, which use % for escapes and lack the basic alternation operator of regular expressions).
If you instead use single or double quotes to wrap the string, it will work fine:
local function unescape_string(escaped)
    return assert(load(('return "%s"'):format(escaped)))()
end
This will produce a tiny Lua program (a "chunk") for each string, which just consists of return "<contents>". Recall that Lua chunks are just functions. Thus you can simply call the function to obtain the value of the string it returns. That way, Lua interprets the escape sequences for us. The same approach is often used to read data serialized as Lua code.
Note also the use of assert for error handling: load returns nil, err if there is a syntax error. To deal with this gracefully, we can wrap the call to load in assert: assert returns its first argument (the chunk returned by load) if it is truthy; otherwise, if it is falsy (e.g. nil in this case), assert errors, using its second argument as an error message. If you omit the assert and your input causes a syntax error, you will instead get a cryptic "attempt to call a nil value" error.
You probably want to do additional validation, especially if these escaped strings are user-provided; otherwise a malicious string like str"; os.execute("...") can trivially open up a remote code execution (RCE) vulnerability, allowing an attacker both to execute arbitrary Lua (e.g. to block with while 1 do end, slow down, or hijack your application) and to run shell commands through os.execute. To guard against this, searching for an unescaped closing quote should be sufficient (syntax errors, e.g. through invalid escapes, will still be possible, but RCE should not be, barring Lua interpreter bugs):
local function unescape_string(escaped)
    -- match start & end of sequences of zero or more backslashes followed by a double quote
    for from, to in escaped:gmatch'()\\*()"' do
        -- number of preceding backslashes must be odd for the double quote to be escaped
        assert((to - from) % 2 ~= 0, "unescaped double quote")
    end
    return assert(load(('return "%s"'):format(escaped)))()
end
Alternatively, a more robust (but also more complex) and presumably more efficient way of unescaping this would be to manually implement escape sequences through string.gsub; that way you get full control, which is more suitable for user-provided input:
-- Single-character backslash escapes of Lua 5.1 according to the reference manual: https://www.lua.org/manual/5.1/manual.html#2.1
local escapes = {a = '\a', b = '\b', f = '\f', n = '\n', r = '\r', t = '\t', v = '\v', ['\\'] = '\\', ["'"] = "'", ['"'] = '"'}
local function unescape_string(escaped)
    -- parenthesize to drop gsub's second return value (the number of substitutions)
    return (escaped:gsub("\\(.)", escapes))
end
You may implement escapes here as you see fit; for example, this misses decimal escapes, which could easily be implemented as escaped:gsub("\\(%d%d?%d?)", string.char) (this uses the coercion of strings to numbers in string.char and a replacement function as the second argument of string.gsub).
This function can finally be used straightforwardly as unescape_string(f:read()).

Cannot find a urldecode function in terraform

I see there is a urlencode function in Terraform but no urldecode. Is there any reason why it is not there? What would be the workaround to achieve that in Terraform?
Thanks,
Ram
The functions built in to the Terraform language tend to (with some notable historical exceptions) focus on solving problems that seem to arise very commonly in typical definitions of infrastructure. URL encoding arises in such situations as building URLs for API calls, whereas URL decoding seems to come up less frequently in the typical scope where Terraform is used and so there haven't been sufficient real-world examples of the need to justify making it a built-in.
The Terraform language does include some features that allow for basic manual string tokenization and transforming, but a key missing piece for fully-general URL decoding is that Terraform does not have a function which can take a number and return the corresponding character as defined by a specific character lookup table, such as ASCII or Unicode.
If you know that in practice your inputs will only use URL escaping sequences for specific reserved symbols then you can approximate URL decoding with a lookup table combined with a tokenizing regular expression:
locals {
  input = "foo%3Fx%3Dtest"

  tokens = regexall("(?:%[0-9a-fA-F]{2}|[^%]+)", local.input)

  replacements = tomap({
    "%3f" = "?"
    "%3d" = "="
    "%25" = "%"
  })

  result = join("", [
    for token in local.tokens : (
      substr(token, 0, 1) == "%" ?
      local.replacements[lower(token)] :
      token
    )
  ])
}
The above works for the limited encoding vocabulary of only ?, =, and % but will fail if there are any encoded characters other than those. You can of course expand this vocabulary to include any additional characters you'd like to include, and you could potentially expand that table to include all 128 ASCII characters if you like.
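If you did want the full table, you could generate it once with a small script and paste the result into the configuration. A hedged sketch in Python (the hcl_char helper is hypothetical, and it assumes HCL's \uNNNN string escapes for characters that are awkward to write literally):
import json  # not needed; stdlib only below

# Sketch: emit a Terraform "replacements" map covering all 128 ASCII code points.
def hcl_char(code):
    ch = chr(code)
    # Use HCL's \uNNNN escape for quotes, backslashes, and control characters.
    if ch in ('"', '\\') or code < 0x20 or code == 0x7F:
        return '"\\u%04x"' % code
    return '"%s"' % ch

print('replacements = tomap({')
for code in range(128):
    print('  "%%%02x" = %s' % (code, hcl_char(code)))
print('})')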
It would not be possible to decode non-ASCII (i.e. Unicode-only) characters with this simplistic strategy, because URL encoding of those involves first encoding the character as UTF-8 and then percent-encoding the individual bytes that result, and the Terraform language does not include any facilities for working with raw bytes: it works only with Unicode strings.

How can I remove all characters inside angle brackets python?

How can I remove all characters inside angle brackets in a string, including the brackets themselves? How can I also remove all the text between "\r\n" and "." followed by any 3 characters? Is this possible? I am currently using the solution by @xkcdjerry, e.g.:
body = """Dear Students roads etc. you place a tree take a snapshot, then when you place a\r\nbuilding, take a snapshot. Place at least 5-6 objects and then have 5-6\r\nsnapshots. Please keep these snapshots with you as everyone will be asked\r\nto share them during the class.\r\n\r\nI am attaching one PowerPoint containing instructions and one video of\r\nexplanation for your reference.\r\n\r\nKind regards,\r\nTeacher Name\r\n zoom_0.mp4\r\n<https://drive.google.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>"""
d = re.compile("\r\n.+?\\....")
body = d.sub('', body)
a = re.compile("<.*?>")
body = a.sub('', body)
print(body)
For some reason the output is fine except that it has:
gle.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>
randomly attached to the end. How can I fix it?
Answer
Your problem can be solved by a regex:
Put this into the shell:
import re
a=re.compile("<.*?>")
a.sub('',"Keep this part of the string< Remove this part>Keep This part as well")
Output:
'Keep this part of the stringKeep This part as well'
Second question:
import re
a = re.compile("\r\n.*?\\..{3}")
a.sub('', "Hello\r\nFilename.png")
Output:
'Hello'
Breakdown
Regex is a robust way of finding, replacing, and mutating small strings inside bigger ones; for further reading, consult https://docs.python.org/3/library/re.html. Meanwhile, here are the breakdowns of the regex constructs used in this answer:
. means any char.
*? means as many of the before as needed but as little as possible(non-greedy match)
So .*? means any number of characters but as little as possible.
Note: The reason there is a \\. in the second regex is that a . in the match needs to be escaped by a \, which in its turn needs to be escaped as \\
The methods:
re.compile(pattern: str) compiles a regex for further use.
regex.sub(repl: str, string: str) replaces every match of regex in string with repl.
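As for the gle.com/... tail in the question: the filename pattern runs first, and \r\n.+?\\.... stops at the first dot-plus-three-characters inside the URL (consuming "\r\n<https://drive.goo"), which removes the leading < so the angle-bracket pattern can no longer match the rest. Running the angle-bracket substitution first avoids this; a minimal sketch (the body string here is a shortened, illustrative stand-in for the one in the question):
import re

body = ("Dear Students, please keep these snapshots.\r\n zoom_0.mp4\r\n"
        "<https://drive.google.com/file/d/XYZ/view?usp=drive_web>")

# Strip <...> spans first, so the URL is gone before the filename pattern runs.
body = re.sub("<.*?>", "", body)
# Then remove "\r\n" followed by text ending in a dot plus any 3 characters.
body = re.sub("\r\n.+?\\..{3}", "", body)
print(body)  # -> "Dear Students, please keep these snapshots.\r\n"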
Hope it helps.

ServiceStack does not escape control characters in JSON

ServiceStack's JsonSerializer does not seem to encode control characters correctly.
For example, this C# expression....
JsonSerializer.SerializeToString(new { Text = "\u0010" })
... evaluates to this...
{"Text":"?"}
... where the "?" is the literal control character.
Instead, according to http://www.json.org it should evaluate to this:
{"Text":"\u0010"}
Is this a known bug or am I missing something?
The bad JSON output by my services is causing errors during deserialization by my service consumers.
You need to tell the serializer to escape unicode characters.
JsConfig.EscapeUnicode = true;
JsonSerializer.SerializeToString(new{Text = "\u0010"});
The above evaluates to this:
{"Text":"\u0010"}
Thanks Mike, that works. But I think this approach escapes ALL non-ASCII Unicode characters in addition to control characters.
I'm expecting to have a lot of foreign language characters in my data (Arabic, for example) so this will cause significant size bloat versus just including those unescaped unicode characters in the JSON (which is still standard-compliant).
I imagine the purpose of EscapeUnicode = true is to produce JSON that can be stored or transmitted with simple ASCII encoding, which is certainly useful. And it apparently also encodes ASCII control characters as a side effect, which does solve my problem.
But in my opinion, JsonSerializer should escape control characters regardless of the EscapeUnicode setting since the standard requires it. I consider this a bug.
Since this is primarily a problem for me within my ServiceStack services, I also found this solution:
SetConfig(new EndpointHostConfig
{
    UseBclJsonSerializers = true
});
This tells ServiceStack to use .NET's built-in DataContractJsonSerializer instead of ServiceStack's JsonSerializer. I have verified that DataContractJsonSerializer does escape control characters correctly.
So it appears that I need to choose between JsonSerializer with EscapeUnicode = true (faster but with bloated output) and DataContractJsonSerializer (slower but with compact Unicode output).
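For comparison, this trade-off is not inherent to JSON serializers. Python's standard json module, for example, always escapes control characters while letting other non-ASCII text stay compact; a quick cross-check:
import json

# Control characters are escaped even when non-ASCII passthrough is enabled:
print(json.dumps({"Text": "\u0010"}, ensure_ascii=False))  # {"Text": "\u0010"}

# ...while other Unicode text stays compact:
print(json.dumps({"Text": "مرحبا"}, ensure_ascii=False))   # {"Text": "مرحبا"}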

Delimiter to use within a query string value

I need to accept a list of file names in a query string, i.e.:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc
Do you have any recommendations on what delimiter to use?
Having query parameters appear multiple times is legal, and it is the only way to guarantee no parsing problems in all cases:
http://someSite/someApp/myUtil.ashx?file=file1.txt&file=file2.bmp&file=file3.doc
The semicolon ; must be URI-encoded if it is part of a filename (turned into %3B), yet not if it is separating query parameters, which is its reserved use.
See section 2.2 of RFC 3986:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="
If they're filenames, a good choice would be a character which is disallowed in filenames. Suggestions so far have included ",", "|", and "&", which are generally allowed in filenames and therefore might lead to ambiguities. "/", on the other hand, is generally not allowed, not even on Windows. It is allowed in URIs, and it has no special meaning in query strings.
Example:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc is bad because it may refer to the valid file file1.txt|file2.bmp.
http://someSite/someApp/myUtil.ashx?files=file1.txt/file2.bmp/file3.doc unambiguously refers to 3 files.
I would recommend making each file its own query parameter, i.e.
myUtil.ashx?file1=file1.txt&file2=file2.bmp&file3=file3.doc
This way you can just use standard query parsing and loop over the results.
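For instance (a hedged sketch; Python's parse_qs collects repeated keys into lists, matching the repeated-parameter form from the first answer):
from urllib.parse import parse_qs

# Repeated parameters parse into a list under a single key.
params = parse_qs("file=file1.txt&file=file2.bmp&file=file3.doc")
for name in params.get("file", []):
    print(name)  # file1.txt, file2.bmp, file3.doc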
Do you need to list the filenames as a string?
Most languages accept arrays in the query string, so you could write it like:
http://someSite/someApp/myUtil.ashx?files[]=file1.txt&files[]=file2.bmp&files[]=file3.doc
If it doesn't, or you can't use that for some other reason, you should stick to a delimiter that is either not allowed or unusual in a filename. Pipe (|) is a good one; otherwise you could urlencode an invisible character, since they are quite easy to use in code but harder to actually include in a filename.
I usually use arrays when possible and pipe otherwise.
I've always used double pipes "||". I don't have any good evidence to back up why this is a good choice other than 10 years of web programming and it's never been an issue.
This is one common problem. How I handled it: I created a method which accepted a list of strings and then found a character that was not in any of the strings. (I did this by simply concatenating the strings and testing for various characters.) Once a character was found, I concatenated all the strings together, but also prepended the string with the separation character. So in the given question, one example would be:
http://someSite/someApp/myUtil.ashx?files=|file1.txt|file2.bmp|file3.doc
and another would be:
http://someSite/someApp/myUtil.ashx?files=,file1.txt,file2.bmp,file3.doc
But since I actually use a method that guarantees the separator character is not in the rest of the strings, it is safe. It was a bit of work to create the first time, but I've used it MANY times in various applications.
I think I would consider using commas or semicolons.
I would build on MSalters' answer by saying, to generalize, the best delimiter is one that is invalid among the items in the list. For example, if your list is prices, a comma is a bad delimiter because it can be confused with the values. For that reason, as most of these answers suggest, I think a good general-purpose delimiter is probably "|", as it is rarely a valid value. "/" is perhaps not the best delimiter in general, as it is sometimes valid in paths.
