Can ALL strings be decoded as valid binary data?

As is well known, Base64 encodes binary data into transferable ASCII strings, and we decode those strings back into data.
Now my question is the inverse: can every random string be decoded as binary data and then correctly encoded back to the exact original string?

It depends on your coding method. Some methods use only a limited range of characters, so a string containing other characters would not be legal. Base64 is such a method, so the answer is no. With other methods I'm sure it's possible, but I cannot think of an example other than simply treating the string as raw binary bytes.
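A minimal Python sketch of both halves of that answer (the string values are made up for illustration):

import base64
import binascii

# A string built only from Base64-alphabet characters, with valid length and
# padding, survives the decode -> encode round trip.
s = "SGVsbG8="
assert base64.b64encode(base64.b64decode(s)).decode("ascii") == s

# A string containing characters outside the Base64 alphabet cannot be decoded at all.
try:
    base64.b64decode("hello, world!", validate=True)
except binascii.Error as exc:
    print("not valid Base64:", exc)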

Related

SHA512 to UTF8 to byte encryption

A question:
A parcel service provider requires that the password be encoded in a specific way:
KEY -> UTF8 Encoding -> SHA512
The KEY should be in byte form, not a string.
currently I have this in Node.js with CryptoJS:
password = CryptoJS.SHA512(CryptoJS.enc.Utf8.parse(key))
or
password = CryptoJS.SHA512(CryptoJS.enc.Utf8.stringify(key))
Don't know which one is the right one.
I need to convert the key to bytes, how do I do that?
Keys are arbitrary sequences of bytes, and SHA-512 works on arbitrary sequences of bytes. However, UTF-8 can't encode arbitrary sequences of bytes. It can only encode Unicode code points. What you're asking for isn't possible. (I suggest posting precisely what the requirement is. It's possible you're misreading it.)
You need another encoding, such as Base64 or Hex. The output of either of those is compatible with UTF-8 (both produce plain ASCII, which is a subset of UTF-8).
That said, this is a very strange request, since you already have exactly the correct input for SHA-512. Converting it to a string and then converting that string back to (likely different) bytes seems a pointless step, but if you need it, you'll need a byte encoding like Base64 or Hex.
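The question is about CryptoJS, but the point is language-independent; here is a rough Python sketch (hashlib and base64 standing in for the CryptoJS calls, with a made-up key) of hashing raw key bytes and of the two string encodings mentioned above:

import base64
import hashlib
import os

key = os.urandom(32)                        # arbitrary key bytes (made-up example key)

# SHA-512 consumes raw bytes directly; no string conversion is needed.
digest = hashlib.sha512(key).digest()

# If a textual form of the key (or digest) is required, use an encoding that is
# defined for arbitrary bytes. Both outputs are plain ASCII, hence valid UTF-8.
hex_form = key.hex()
b64_form = base64.b64encode(key).decode("ascii")

# The round trip is lossless, unlike pretending the key bytes are UTF-8 text.
assert bytes.fromhex(hex_form) == key
assert base64.b64decode(b64_form) == key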

How JSONEachRow can support binary data?

This document suggests that JSONEachRow Format can handle binary data.
http://www.devdoc.net/database/ClickhouseDocs_19.4.1.3-docs/interfaces/formats/#jsoneachrow
From my understanding, binary data can contain any bytes, so the data can represent anything and could break the JSON structure.
For example, a string may contain bytes that, if interpreted as UTF-8, look like a quote character, or bytes that are not valid UTF-8 at all.
So:
How are they achieving this?
How do they know where that string actually ends?
In the end the DB needs to interpret the command and the values, so it must be using some encoding to do that.
Please correct me if I am wrong somehow.
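I can't speak for ClickHouse's internals, but the generic JSON half of the question can be sketched in Python: the writer escapes the troublesome characters, so the reader knows a string ends only at an unescaped quote, and bytes that are not valid text have to be escaped (or otherwise encoded) before they can appear in JSON at all:

import json

# A quote inside the value does not break the structure, because it is escaped;
# the parser knows the string ends only at an unescaped quote.
row = {"name": 'he said "hi"'}
encoded = json.dumps(row)
print(encoded)                               # {"name": "he said \"hi\""}
assert json.loads(encoded) == row

# Arbitrary bytes must first be mapped to text; here every problematic byte
# ends up as a \uXXXX escape.
blob = b"\x00\xff\x22"
print(json.dumps(blob.decode("latin-1")))    # prints "\u0000\u00ff\""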

Concept behind converting Base16 to Base64

I understand how to read decimal, binary, hex and base64; that is I can manually convert numbers/counts expressed as each of those bases to expressions in the other bases.
I'm doing the matasano crypto challenges and the very first assignment got me thinking (https://cryptopals.com/sets/1/challenges/1).
The approaches to this problem that I found convert the hex string to bytes (binary) and then the bytes to Base64, which I understand. Or so I thought. Could I simply concatenate these bytes and say I have the binary-string expression of the same number?
I noticed they basically read the hex string two hex characters at a time (because two hex characters represent at most one byte). This results in a binary string where each binary character (bit) is aligned with the hex character(s) it came from.
Does this mean I can just convert this binary string to decimal and it will be the same "number" that the hex string represents?
Could a similar character-by-character scheme be used to convert to Base64? How many hex characters per Base64 character?
@Flimzy shared this link, and the way it answered my question was by making me realize two things:
base16 is an octet-based encoding
base64 is a sextet-based encoding
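A short Python sketch of that realization (the hex value is just an example): each hex character carries 4 bits and each Base64 character carries 6, so it takes 1.5 hex characters per Base64 character, and the two alphabets only line up on 3-octet boundaries (6 hex characters / 4 Base64 characters):

import base64

hex_string = "49276d"                        # 6 hex characters = 3 octets
raw = bytes.fromhex(hex_string)              # go through raw bytes, as the challenge intends
b64 = base64.b64encode(raw).decode("ascii")

print(b64)                                   # SSdt -- 4 Base64 characters for 6 hex characters
# Read as a number, it is the same value in either representation.
assert int(hex_string, 16) == int.from_bytes(raw, "big")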

Is it safe to cast binary data from a byte array to a string and back in golang?

Maybe a stupid question, but if I have some arbitrary binary data, can I cast it to a string and back to a byte array without corrupting it?
Is []byte(string(byte_array)) always the same as byte_array?
The expression []byte(string(byte_slice)) evaluates to a slice with the same length and contents as byte_slice. The capacity of the two slices may be different.
Although some language features assume that strings contain valid UTF-8 encoded text, a string can contain arbitrary bytes.

Python 3 - if a string contains only ASCII, is it equal to the string as bytes?

Consider Python 3 SMTPD - the data received is contained in a string. http://docs.python.org/3.4/library/smtpd.html quote: "and data is a string containing the contents of the e-mail"
Facts (correct?):
Strings in Python 3 are Unicode.
Emails are always ASCII.
Pure ASCII is valid Unicode.
Therefore the email that came in is pure ASCII (which is valid Unicode), and therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?
Thus my question: if I decode the SMTPD DATA string to ASCII, or convert the DATA string to bytes, is this equivalent to the bytes of the actual email message that arrived via SMTP?
Context (and perhaps a better question) is: "How do I save Python 3's SMTPD DATA to a file as PRECISELY the bytes that were received?" My concern is that when DATA goes through the string-to-bytes conversion it somehow ends up changed from the original bytes that arrived via SMTP.
EDIT: it seems the Python developers think SMTPD should be returning binary data anyway. Doesn't seem to have been fixed... http://bugs.python.org/issue19662
if a string contains only ASCII, is it equal to the string as bytes?
No. It is not equal in Python 3:
>>> '1' == b'1'
False
A bytes object is not equal to a str (Unicode string) object, in a similar way that an integer is not equal to a string:
>>> '1' == 1
False
In some programming languages the above comparisons are true e.g., in Python 2:
>>> b'1' == u'1'
True
and 1 == '1' in Perl:
$ perl -e "print qq(True\n) if 1 == q(1)"
True
Your question is a good example of why the stricter Python 3 behaviour is preferable. It forces programmers to confront their text/bytes misconceptions without waiting for their code to break for some input.
Strings in Python 3 are Unicode.
Yes. Strings in Python 3 are immutable sequences of Unicode code points.
Emails are always ASCII.
Most emails are transported as 7-bit messages (ASCII range: hex 00-7F), though "virtually all modern email servers are 8-bit clean," i.e., 8-bit content won't be corrupted. And the 8BITMIME extension sanctions the passing of some 8-bit content.
In other words: emails are not always ASCII.
Pure ASCII is valid Unicode.
ASCII is a character encoding. You can decode some byte sequences into Unicode using the US-ASCII character encoding. Unicode strings have no associated character encoding, i.e., you can encode them into bytes using any character encoding that can represent the corresponding Unicode code points.
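A tiny Python illustration of that distinction (the byte value is arbitrary):

data = b"hello"                  # bytes, produced by *some* encoding
text = data.decode("ascii")      # decoding attaches an interpretation, yielding code points

# The resulting str carries no encoding of its own; any codec that covers these
# code points can serialize it back to bytes, and for ASCII text they all agree.
assert text.encode("ascii") == text.encode("utf-8") == text.encode("latin-1") == data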
Therefore the email that came in is pure ASCII (which is valid Unicode), and therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?
If the input is in the ASCII range, then data.decode('ascii', 'strict').encode('ascii') == data.
However, Lib/smtpd.py applies some conversions to the input data (according to RFC 5321), so the content that you get as data may be different even if the input is pure ASCII.
"How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?"
my goal is not to find malformed emails but to save inbound emails to disk in precisely the binary/bytes form in which they arrived.
The bug that you've linked (smtpd.py should not decode utf-8) makes smtpd.py not 8-bit clean.
You could override the SMTPChannel.collect_incoming_data method from smtpd.py to save the incoming bytes as-is.
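A rough sketch of that idea, assuming the asyncore-based smtpd.py from the linked bug report (the class names and file name below are mine, and smtpd/asyncore are deprecated in recent Python versions):

import asyncore
import smtpd

class RawChannel(smtpd.SMTPChannel):
    # Mirror every raw chunk asynchat hands us to disk before smtpd.py
    # decodes or otherwise transforms it.
    def collect_incoming_data(self, data):
        with open('inbound.raw', 'ab') as f:   # 'data' is bytes straight off the socket
            f.write(data)
        super().collect_incoming_data(data)    # let normal SMTP processing continue

class RawServer(smtpd.SMTPServer):
    channel_class = RawChannel                 # honored by SMTPServer in Python 3.4+

    def process_message(self, peer, mailfrom, rcpttos, data, **kwargs):
        pass                                   # the (possibly re-encoded) 'data' is ignored

if __name__ == '__main__':
    RawServer(('127.0.0.1', 8025), None)
    asyncore.loop()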
"A string of ASCII text is also valid UTF-8 text."
It is true. It is a nice property of the UTF-8 encoding: if you can decode a byte sequence into Unicode using the US-ASCII character encoding, then you can also decode those bytes using the UTF-8 character encoding (and the resulting Unicode code points are the same in both cases).
smtpd.py should have used either latin-1 (it decodes any byte sequence) or ascii (with the 'strict' error handler, to fail on any non-ASCII byte) instead of utf-8 (which accepts some non-ASCII byte sequences but not others -- bad).
Keep in mind:
some emails may contain bytes outside the ASCII range
de-transparency according to RFC 5321 doesn't preserve the input bytes as-is, even if they are all in the ASCII range (see the sketch below)
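On that last point, a one-line illustration of what the RFC 5321 transparency/de-transparency step does (the line content is made up):

# During DATA, a line that begins with "." is sent with an extra leading dot
# ("dot-stuffing"); the receiving server strips it again ("de-transparency"),
# so the stored bytes are not identical to the bytes that crossed the wire.
wire_line = b"..signature follows"
stored_line = wire_line[1:] if wire_line.startswith(b".") else wire_line
print(stored_line)                             # b'.signature follows'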
