Confusion surrounding MongoDB/GridFS BinData - python-3.x

I am using Python Mongoengine for inserting image files into GridFS, with the following method:
product = Product(name='New Product', price=20.0, ...)
with open(<IMAGE_FILE>, 'rb') as product_photo:
    product.image.put(product_photo, content_type='image/jpeg')
product.save()
When I view this data with NoSQLBooster (or anything else) the data is represented like so:
{
"_id" : ObjectId("5d71263eae9a187374359927"),
"files_id" : ObjectId("5d71263eae9a187374359926"),
"n" : 0,
"data" : BinData(0,"/9j/4AAQSkZJRgABAQEASABIAAD/4V6T... more 261096 bytes - image/jpeg")
},
Knowing that the second part of the tuple in the "data" field's BinData is base64, I'm confused: at which point do the raw bytes returned by open(<IMAGE_FILE>, 'rb') become base64 encoded?
Furthermore, since base64 encoding is 33% - 37% larger, this is bad for transferring that data. How can I choose the encoding, or at least stop it from using base64?
I have found this SO question, which mentions a HexData data type.
I also found others mentioning subtypes as well, which led me to find this about BSON data types:
Binary

Canonical:
{ "$binary":
   {
      "base64": "<payload>",
      "subtype": "<t>"
   }
}

Relaxed: <Same as Canonical>
Where the values are as follows:
"<payload>"
Base64 encoded (with padding as “=”) payload string.
"<t>"
A one- or two-character hex string that corresponds to a BSON binary subtype. See the extended BSON documentation
http://bsonspec.org/spec.html for the available subtypes.
Which clearly tells us the payload will be base64!
So can I change this, or does it have to be that way?

at which point the raw bytes ... becomes encoded with base64
Direct Answer
Only at the point where you choose to display them on your console or through some other "display" format. The native BSON that crosses the wire won't have this issue.
If you choose not to display the contents in your terminal or debugger, they will never be encoded to base64 or any other display format.
Point of Correction
which led me to find this about BSON data types.
Which clearly tells us the payload will be base64!
The linked page is referring to MongoDB Extended JSON, not the wire BSON format.
While it is true that Extended JSON encodes binary data as base64, that is not true of BSON itself.
As detailed below, the only time your driver will pass the data through the Extended JSON conversion is the moment you ask it to display the contents via a print or debug statement.
Details
Per the BSON spec (MongoDB's internal serialization format), binaries are stored in native byte format.
The relevant portion of the spec:
binary ::= int32 subtype (byte*)
indicates that a binary object is
length of the byte*,
followed by a 1-byte subtype
followed by the raw bytes
In the case of the bytes "Hello\x00World", which include a null byte right in the middle,
the "wire format" would be
[11] [0x00] [Hello\x00World]
Notice that Stack Overflow, like virtually every driver, struggles with the embedded null byte, as would just about any display terminal, unless the system made it evident that the null byte is actually part of the bytes to be displayed.
Meaning: the length (packed as a 32-bit integer), followed by the 1-byte subtype, followed by the literal bytes, is what will actually cross the wire.
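A minimal sketch of that layout in Python, using only the standard library (the subtype and payload here are illustrative):

```python
import struct

# Sketch of the body of a BSON binary element: int32 length (little-endian),
# one subtype byte, then the raw payload -- no base64 anywhere.
payload = b"Hello\x00World"                # 11 bytes, embedded null intact
subtype = 0x00                             # generic binary subtype
wire = struct.pack("<i", len(payload)) + bytes([subtype]) + payload

assert wire[:4] == b"\x0b\x00\x00\x00"     # length 11, little-endian int32
assert wire[5:] == payload                 # the literal bytes cross the wire
```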
As you pointed out, most languages would have immense trouble rendering this onscreen to a user.
Extended JSON is the specification that defines the most appropriate way for drivers to render non-displayable data.
Object IDs aren't just bytes, they're objects that can represent timestamps.
Timestamps aren't just numbers, they can represent timezones and be converted to display against the user timezone.
Binaries aren't always text, may have problematic bytes in them, and the easiest way not to bork up your terminal/GUI/debugger is to simply encode them away in an ASCII-safe format like base64.
Keep in Mind
bson.Binary and GridFS contents are not really supposed to be displayed/printed/written in their wire format. The wire format exists for the transfer layer.
To ease debugging and print statements, most drivers implement an easily "displayable" format that passes the native BSON through the Extended JSON spec.
If you simply choose not to display/encode as Extended JSON/debug/print, the binary bytes will never actually be base64 encoded by the driver.

Related

How JSONEachRow can support binary data?

This document suggests that the JSONEachRow format can handle binary data.
http://www.devdoc.net/database/ClickhouseDocs_19.4.1.3-docs/interfaces/formats/#jsoneachrow
From my understanding, binary data can contain any bytes, so it can represent anything, and it can break the JSON structure.
For example, a string may contain bytes that represent a quote if interpreted as UTF-8, or bytes that are simply invalid.
So,
How are they achieving it?
How do they know when that string actually ends?
In the end, the DB needs to interpret the commands and values, so they must be using some encoding to do that.
Please correct me if I am wrong somehow.
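No answer is recorded here, but part of the puzzle can be sketched with the standard json module: JSON never has to emit a problematic character raw, because quotes and control characters are escaped, so a string's closing quote is unambiguous. (Truly arbitrary bytes that are not valid UTF-8 would still need an encoding such as base64 first.)

```python
import json

# Quotes and control characters are escaped on output, so the string's
# closing quote cannot be confused with data.
s = 'he said "hi"\x00\n'
encoded = json.dumps(s)
print(encoded)                     # "he said \"hi\"\u0000\n"
assert json.loads(encoded) == s    # round-trips losslessly
```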

How to save bytes to file as binary mode

I have a bytes-like object something like:
aa = b'abc\u6df7\u5408def.mp3'
I want to save it into a file in binary mode. The code is below, but it does not work as expected:
if __name__ == "__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)
The data in the file looks wrong: the \uNNNN sequences are written out literally, but I want the Unicode characters themselves encoded in the binary data.
How can I convert the bytes to save them in the file?
\uNNNN escapes do not make sense in byte strings because they do not specify a sequence of bytes. Unicode code points are conceptually abstract representations of strings, and do not straightforwardly map to a serialization format (consisting of bytes, or, in principle, any other sort of concrete symbolic representation).
There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
With that out of the way, I believe the rest of your code should work as expected.
Unicode on disk is always encoded but you obviously have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
However, many places are standardizing on UTF-8 which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. Perhaps see also https://utf8everywhere.org/
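A minimal sketch of the difference between these encodings (the filename is illustrative):

```python
text = 'abc\u6df7\u5408def.mp3'

be = text.encode('utf-16-be')    # explicit big-endian, no BOM
bom = text.encode('utf-16')      # native byte order, BOM prepended

assert be[:2] == b'\x00a'                     # 'a' as a big-endian UTF-16 unit
assert b'\x6d\xf7' in be                      # U+6DF7 serialized big-endian
assert bom[:2] in (b'\xff\xfe', b'\xfe\xff')  # BOM marks the byte order

with open('database.bin', 'wb') as datafile:  # write the encoded bytes
    datafile.write(be)
```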

Why encode a JSON payload to base64?

On the Code Jam site they return a JSON string as a base64 encoded string.
The actual JSON payload is smaller than the base64 encoded string.
What's the reason behind returning the payload as a base64 encoded string?
7-bit schemes are, or tend to be, transport neutral. There is a natural corruption check at least as good as a CRC, and the JSON is less likely to be mangled by well-meaning library functions (CRLF conversion, anti-injection, SQL parsing). Yes, it's going to be longer: 7 goes into anything more times than 8 does.
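The size overhead is easy to quantify: base64 emits 4 output characters for every 3 input bytes (rounded up with padding), so the growth approaches 4/3. A sketch with an arbitrary payload:

```python
import base64
import json

payload = json.dumps({"msg": "hello", "answer": 42}).encode("utf-8")
encoded = base64.b64encode(payload)

# 4 output characters per 3 input bytes, rounded up for '=' padding
assert len(encoded) == 4 * ((len(payload) + 2) // 3)
assert base64.b64decode(encoded) == payload     # lossless round-trip
print(len(payload), "->", len(encoded))
```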

Concept behind converting Base16 to Base64

I understand how to read decimal, binary, hex and base64; that is I can manually convert numbers/counts expressed as each of those bases to expressions in the other bases.
I'm doing the matasano crypto challenges and the very first assignment got me thinking (https://cryptopals.com/sets/1/challenges/1).
The approaches to this problem that I found convert the hex string to bytes (binary) and then the bytes to base64, which I understand. Or so I thought. Could I simply concatenate these bytes and say I have the binary-string expression of the same number?
I noticed they basically read the hex string two characters at a time (because 2 hex characters are at most one byte). This results in a binary string where each bit is "aligned" with the hex character(s) it came from.
Does this mean I can just convert this binary string to decimal and get the same "number" the hex string represents?
Could a similar character-by-character scheme be used to convert to base64? How many hex characters per base64 character?
#Flimzy shared this link, and the way it answered my question was by realizing two things:
base16 is an octet-based encoding
base64 is a sextet-based encoding
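Both points can be checked directly with the stdlib: base16 decodes two characters per octet, base64 re-encodes six bits per character, and the underlying number is unchanged either way.

```python
import base64

hexstring = "4d616e"                     # "Man"
raw = bytes.fromhex(hexstring)           # 2 hex chars -> 1 byte (octet)
b64 = base64.b64encode(raw).decode()     # 1 base64 char <- 6 bits (sextet)

assert raw == b"Man"
assert b64 == "TWFu"                     # 6 hex chars (24 bits) -> 4 b64 chars
# The common alignment unit is 12 bits: 3 hex chars per 2 base64 chars.
assert int(hexstring, 16) == int.from_bytes(raw, "big")  # same number either way
```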

Is it possible to specify big/little endian for every field when using binary.Read() to decode a byte stream into a struct?

When decoding bytes, binary.Read() requires you to specify the expected byte order of that operation. binary.Read() also allows you to pass in a struct, but AFAIK, it uses the same byte order to decode the byte stream into every field in the struct.
This is inconvenient when the byte order of encoded integers is in little endian but encoded strings and floats are in big endian.
Is it possible to specify on a per-field basis, which byte-order to use when decoding a stream of bytes into a struct?
No, it doesn't look like it.
The Read method goes through all of the work of deciphering what it needs to read .. then all of the actual read methods have this:
d.order.....
So basically, they use the ByteOrder you've specified directly, and make no attempt (via struct tags or anything else) to let you specify it on a per-field basis.
Unfortunate .. but I smell an opportunity for someone to come along and make a neat package that can be shared with the community :)
