I am using Base64.encodeBytes to encode my signed data, but it adds a newline character to the generated string after every 76 characters.
I found out that there is an option to pass DONT_BREAK_LINES to avoid new line chars.
But the description of this field says /** Don't break lines when encoding (violates strict Base64 specification) */
Can someone please explain why using this option violates the Base64 spec?
The term Base64 originated from MIME content transfer encoding.
The latest version of the RFC governing message line lengths is RFC 5322.
It says:
2.1.1. Line Length Limits
There are two limits that this specification places on the number of
characters in a line. Each line of characters MUST be no more than
998 characters, and SHOULD be no more than 78 characters, excluding
the CRLF.
The 76 itself comes from the MIME spec that defines the Base64 content transfer encoding, RFC 2045: "The encoded output stream must be represented in lines of no more than 76 characters each."
TBH, outside of actual MIME messages this only violates a formality and really nobody cares. If you had a line longer than 998 characters, then you would be in violation of the MUST above .. and probably nobody would care either.
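Python's standard library exposes both behaviours, which makes the difference easy to see (a Python sketch; the unwrapped variant corresponds to what the Java library's DONT_BREAK_LINES flag produces):

```python
import base64

data = b'x' * 100

# encodebytes follows the MIME convention: output is wrapped at 76
# characters, with a newline after each line.
mime = base64.encodebytes(data)
print(b'\n' in mime)               # True
print(len(mime.split(b'\n')[0]))   # 76

# b64encode emits the bare Base64 alphabet with no line breaks at all.
bare = base64.b64encode(data)
print(b'\n' in bare)               # False
```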
I have a large excel file which I read with pandas.read_excel(). I guess it was compiled from various sources and contains some badly encoded characters.
For instance, a string foo which should be für is printed as fãâ¼r and has foo.__repr__()=='fã\x83â¼r'.
Some other string bar which should be française is printed as franãâ§aise with bar.__repr__()=='franã\x83â§aise'.
And another one baz which should also be française is printed as franãâaise with baz.__repr__()=='franã\x83â\x87aise'.
Same goes for ñ with a __repr__ of ã\x83â\x91, and so on.
What would be the best way to sanitize this input? Or is there a way to avoid the problem altogether?
Also, is there a way to simply search for such characters in my data? A .contains('\x') fails because '\x' is an incomplete escape sequence with nothing after it (SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape).
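For reference, one round of this kind of corruption (UTF-8 bytes misdecoded as Latin-1/Windows-1252) can be reversed with a single encode/decode roundtrip, and suspicious cells can be found with a regex character class instead of a bare '\x' escape. A sketch, not specific to the doubly-encoded strings above, which may need the roundtrip applied more than once:

```python
import re

# Reverse one round of mojibake: turn the misdecoded text back into its
# raw bytes, then decode those bytes as the UTF-8 they really were.
def fix_once(s: str) -> str:
    return s.encode('latin-1').decode('utf-8')

print(fix_once('fÃ¼r'))         # für
print(fix_once('franÃ§aise'))   # française

# Any character in the Latin-1 upper range is a strong hint of mojibake
# in otherwise-Western text; this avoids the truncated-\xXX SyntaxError.
suspicious = re.compile('[\u0080-\u00ff]')
print(bool(suspicious.search('fÃ¼r')))  # True
print(bool(suspicious.search('fur')))   # False
```

With pandas, the same character class works in Series.str.contains to flag affected rows.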
When parsing /proc/self/mountinfo on Linux, some fields of each line describing a mount may well contain UTF-8 encoded characters. Since the mountinfo line format separates fields by spaces, it escapes at least the space and the backslash as the literal four-character sequences "\040" and "\134". How can I convert a field value ("/tmp/a\ 😀", Python string '/tmp/a\\134\\040😀') back into an unescaped string?
Is there a better way than the following rather involved one (from https://stackoverflow.com/a/26311382)? That is, with less encoding/decoding chaining?
>>> s='/tmp/a\\134\\040😀'
>>> s.encode().decode('unicode-escape').encode('latin-1').decode('utf-8')
'/tmp/a\\ 😀'
PS: Don't ask why anyone sane would use such path names; this is just for illustration ;)
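One alternative with no encoding round trips at all: since mountinfo only ever escapes ASCII characters (space, tab, newline, backslash) as three-digit octal sequences, a plain regex substitution on the string suffices. A sketch:

```python
import re

def unescape_mountinfo(field: str) -> str:
    # Replace each literal \NNN (three octal digits) with the character it
    # encodes; the rest of the text, including any UTF-8 characters, is
    # left untouched. re.sub scans left to right and never rescans its own
    # replacements, so a produced backslash is not expanded again.
    return re.sub(r'\\([0-7]{3})', lambda m: chr(int(m.group(1), 8)), field)

s = '/tmp/a\\134\\040😀'
print(unescape_mountinfo(s))   # /tmp/a\ 😀
```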
I'm reading a large (10Gb) bzipped file in python3, which is utf-8-encoded JSON. I only want a few of the lines though, that start with a certain set of bytes, so to save having to decode all the lines into unicode, I'm reading the file in 'rb' mode, like this:
import bz2

with bz2.open(filename, 'rb') as file:
    for line in file:
        if line.startswith(b'Hello'):
            ...  # decode line here, then do stuff
But I suddenly thought, what if one of the unicode characters contains the same byte as a newline character? By doing for line in file will I risk getting truncated lines? Or does the linewise iterator over a binary file still work by magic?
Line-wise iteration will work for UTF-8 encoded data.
Not by magic, but by design:
UTF-8 was created to be backwards-compatible to ASCII.
ASCII only uses the byte values 0 through 127, leaving the upper half of possible values for extensions of any kind.
UTF-8 takes advantage of this, in that any Unicode codepoint outside ASCII is encoded using bytes in the range 128..255.
For example, the letter "Ċ" (capital letter C with dot above) has the Unicode codepoint value U+010A.
In UTF-8, this is encoded with the byte sequence C4 8A, thus without using the byte 0A, which is the ASCII newline.
In contrast, UTF-16 encodes the same character as the bytes 0A 01 or 01 0A (depending on the endianness).
So I guess UTF-16 is not safe to do line-wise iteration over.
It's not that common as file encoding though.
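This is easy to check directly: splitting the raw UTF-8 bytes on the newline byte yields the same lines as splitting the decoded text, while the UTF-16 encoding of the same character really does contain the newline byte. A quick demonstration:

```python
text = 'Ċ before\nĊ after'
data = text.encode('utf-8')

# Splitting raw UTF-8 bytes on b'\n' matches splitting the decoded text,
# because no multi-byte UTF-8 sequence ever contains the byte 0x0A.
print([chunk.decode('utf-8') for chunk in data.split(b'\n')] == text.split('\n'))  # True

# The same character in UTF-16 (little-endian) is the bytes 0A 01, so a
# naive byte-level line split would cut straight through it.
print('Ċ'.encode('utf-16-le'))            # b'\n\x01'
print(b'\n' in 'Ċ'.encode('utf-16-le'))   # True
```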
I've tried all kinds of combinations of encode/decode with options 'surrogatepass' and 'surrogateescape' to no avail. I'm not sure what format this is in (it might even be a bug in Autoit), but I know for a fact the information is in there because at least one online utf decoder got it right. On the online converter website, I specified the file as utf8 and the output as utf16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
É at the place of É
é at the place of é
è at the place of è
The following table summarises what's going on (bytes in hexadecimal):

    Character   UTF-8 bytes   Same bytes decoded as Windows-1252
    É           C3 89         Ã‰
    é           C3 A9         Ã©
    è           C3 A8         Ã¨
All of É, é, and è are non-ASCII characters, and they are encoded with UTF-8 to 2-byte long codes.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character à (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings here.
So, that's why your decoding renders É instead of É. Idem for the other non-ASCII characters é and è.
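A couple of lines reproduce the effect described above:

```python
# Encode É as UTF-8, then (mis)decode the two bytes as Windows-1252:
# each byte is interpreted as its own one-byte character.
raw = 'É'.encode('utf-8')
print(raw)                     # b'\xc3\x89'
print(raw.decode('cp1252'))    # Ã‰
print('é'.encode('utf-8').decode('cp1252'))  # Ã©
print('è'.encode('utf-8').decode('cp1252'))  # Ã¨
```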
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')
I was (re)reading Joel's great article on Unicode and came across this paragraph, which I didn't quite understand:
For example, you could encode the Unicode string for Hello (U+0048
U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding,
or the Hebrew ANSI Encoding, or any of several hundred encodings that
have been invented so far, with one catch: some of the letters might
not show up! If there's no equivalent for the Unicode code point
you're trying to represent in the encoding you're trying to represent
it in, you usually get a little question mark: ? or, if you're really
good, a box. Which did you get? -> �
Why is there a question mark, and what does he mean by "or, if you're really good, a box"? And what character is he trying to display?
There is a question mark because the encoding process recognizes that the target encoding can't represent the character, and substitutes a question mark instead. By "if you're really good", he means that with a newer browser and proper font support you'll get a fancier substitution character: a box.
In Joel's case, he isn't trying to display a real character, he literally included the Unicode replacement character, U+FFFD REPLACEMENT CHARACTER.
It’s a rather confusing paragraph, and I don’t really know what the author is trying to say. Anyway, different browsers (and other programs) have different ways of handling problems with characters. A question mark “?” may appear in place of a character for which there is no glyph in the font(s) being used, so that it effectively says “I cannot display the character.” Browsers may alternatively use a small rectangle, or some other indicator, for the same purpose.
But the “�” symbol is REPLACEMENT CHARACTER that is normally used to indicate data error, e.g. when character data has been converted from some encoding to Unicode and it has contained some character that cannot be represented in Unicode. Browsers often use “�” in display for a related purpose: to indicate that character data is malformed, containing bytes that do not constitute a character, in the character encoding being applied. This often happens when data in some encoding is being handled as if it were in some other encoding.
So “�” does not really mean “unknown character”, still less “undisplayable character”. Rather, it means “not a character”.
A question mark appears when a byte sequence in the raw data does not match the data's character set, so it cannot be decoded properly. That happens if the data is malformed, if the data's charset is explicitly stated incorrectly in the HTTP headers or the HTML itself, if the charset is guessed incorrectly by the browser when other information is missing, or if the user's browser settings override the data's charset with an incompatible one.
A box appears when a decoded character does not exist in the font that is being used to display the data.
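Both behaviours are easy to reproduce: encoding into a charset that lacks the character substitutes "?", while decoding malformed bytes substitutes U+FFFD. A Python sketch:

```python
# Encoding a character the target charset cannot represent: the codec
# substitutes a question mark when told to replace rather than raise.
print('smart “quotes”'.encode('ascii', errors='replace'))  # b'smart ?quotes?'

# Decoding bytes that are invalid in the claimed encoding: the codec
# substitutes U+FFFD, the REPLACEMENT CHARACTER.
print(b'caf\xe9'.decode('utf-8', errors='replace'))  # caf�
```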
Just what it says - some browsers show "a weird character" or a question mark for characters outside the current known character set. It's their "hey, I don't know what this is" character. Get an old version of Netscape, paste in some text from Microsoft Word that uses smart quotes, and you'll get question marks.
http://blog.salientdigital.com/2009/06/06/special-characters-showing-up-as-a-question-mark-inside-of-a-black-diamond/ has a decent explanation.