How can JSONEachRow support binary data? - string

This document suggests that JSONEachRow Format can handle binary data.
http://www.devdoc.net/database/ClickhouseDocs_19.4.1.3-docs/interfaces/formats/#jsoneachrow
From my understanding, binary data can contain any bytes, so it can represent anything and could break the JSON structure.
For example, a string may contain bytes that represent a quote when interpreted as UTF-8, or bytes that are invalid.
So:
How do they achieve it?
How do they know where the string actually ends?
In the end the DB needs to interpret the command and values, so it must be using some encoding to do that.
Please correct me if I am wrong.
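As a general illustration of how JSON text keeps string boundaries unambiguous, here is a minimal sketch using Python's standard json module. It is not ClickHouse's implementation, which the linked page does not show. Quotes and backslashes inside a value are escaped, so an unescaped quote always marks the end of the string. Wrapping raw bytes in base64 is shown only as one generic option; the variable names are made up for the example.

```python
import base64
import json

# A string value that contains a double quote and a backslash.
tricky = 'say "hi" \\ bye'

# The encoder escapes both characters, so the only unescaped quotes
# in the output are the ones that delimit the string.
encoded = json.dumps({"s": tricky})
print(encoded)

# Decoding reverses the escaping and recovers the original value.
assert json.loads(encoded)["s"] == tricky

# Arbitrary bytes are not text; one generic option is to base64 them first.
raw = b"\xff\x00\"quote\""
wrapped = json.dumps({"b": base64.b64encode(raw).decode("ascii")})
assert base64.b64decode(json.loads(wrapped)["b"]) == raw
```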

Related

Confusion surrounding MongoDB/GridFS BinData

I am using Python Mongoengine for inserting image files into GridFS, with the following method:
product = Product(name='New Product', price=20.0, ...)
with open(<IMAGE_FILE>, 'rb') as product_photo:
    product.image.put(product_photo, content_type='image/jpeg')
product.save()
When I view this data with NoSQLBooster (or anything else) the data is represented like so:
{
"_id" : ObjectId("5d71263eae9a187374359927"),
"files_id" : ObjectId("5d71263eae9a187374359926"),
"n" : 0,
"data" : BinData(0,"/9j/4AAQSkZJRgABAQEASABIAAD/4V6T... more 261096 bytes - image/jpeg")
},
Knowing that the second argument of BinData in the "data" field is base64-encoded, I'm confused about the point at which the raw bytes given by open(<IMAGE_FILE>, 'rb') become encoded with base64.
Furthermore, since base64 encoding makes the data 33% - 37% larger, this is bad for transferring that data. How can I choose the encoding, or at least stop it from using base64?
I have found this SO question which mentions a HexData data type.
I also found others mentioning subtypes as well, which led me to find this about BSON data types.
Binary
Canonical:
{ "$binary":
    {
        "base64": "<payload>",
        "subtype": "<t>"
    }
}
Relaxed:
<Same as Canonical>
Where the values are as follows:
"<payload>"
Base64 encoded (with padding as “=”) payload string.
"<t>"
A one- or two-character hex string that corresponds to a BSON binary subtype. See the extended bson documentation
http://bsonspec.org/spec.html for subtypes available.
Which clearly tells us the payload will be base64!
So can I change this, or does it have to be that way?
at which point the raw bytes ... become encoded with base64
Direct Answer
Only at the point where you chose to display them on your console or through some other "display" format. The native format that crosses the wire in BSON format won't have this issue.
If you choose not to display the contents to your terminal or debugger, it will never have been encoded to base64 or any other format.
Point of Correction
which led me to find this about BSON data types.
Which clearly tells us the payload will be base64!
The linked page is referring to MongoDB Extended JSON, not the wire BSON format.
It is true that Extended JSON encodes binary data as base64, but that is not true of BSON itself.
As described below, the only time your driver passes the data through the Extended JSON conversion is when you ask it to display the contents via a print or debug statement.
Details
In the BSON spec (BSON is MongoDB's internal serialization format), binaries are stored as native bytes.
The relevant portion of the spec:
binary ::= int32 subtype (byte*)
indicates that a binary object is:
the length of the byte* payload (an int32),
followed by a 1-byte subtype,
followed by the raw bytes.
In the case of the bytes "Hello\x00World", which include a null byte right in the middle,
the "wire format" would be
[11] [0x00] [Hello\x00World]
Notice that Stack Overflow, like virtually every driver or display terminal, struggles with the embedded null byte unless the system makes evident that the null byte is actually part of the bytes being displayed.
That is, the length (packed into a 32-bit little-endian integer), followed by the 1-byte subtype, followed by the literal bytes, is what actually crosses the wire.
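The layout described above can be sketched with Python's standard struct module. This builds only the `binary` production from the spec, not a complete BSON document (a real document also carries an element type byte and field name); the variable names are illustrative.

```python
import base64
import struct

payload = b"Hello\x00World"   # 11 bytes, with a null byte in the middle
subtype = 0x00                # generic binary subtype

# binary ::= int32 subtype (byte*)
# The int32 length is little-endian and counts only the payload bytes.
wire = struct.pack("<i", len(payload)) + bytes([subtype]) + payload

print(wire.hex(" "))

# Base64 only appears when the bytes are rendered for display.
display = base64.b64encode(payload).decode("ascii")
print(display)
```

The null byte needs no escaping on the wire because the reader knows the length up front and never scans for a terminator.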
As you pointed out, most languages would have immense trouble rendering this onscreen to a user.
Extended JSON is the specification for rendering data that is not directly displayable into a form that drivers can print.
Object IDs aren't just bytes, they're objects that can represent timestamps.
Timestamps aren't just numbers, they can represent timezones and be converted to display against the user timezone.
Binaries aren't always text and may contain problematic bytes, so the easiest way to avoid borking your terminal/GUI/debugger is to encode them in some ASCII format like base64.
Keep in Mind
bson.Binary and GridFS are not really supposed to be displayed/printed/written in their wire format. The wire format exists for the transfer layer.
To ease debugging and print statements, most drivers implement an easily "displayable" format that passes the native BSON format through the Extended JSON spec.
If you simply choose not to display/encode as Extended JSON/debug/print, the binary bytes will never actually be base64-encoded by the driver.

How to save bytes to file as binary mode

I have a bytes-like object something like:
aa = b'abc\u6df7\u5408def.mp3'
I want to save it to a file in binary mode. The code is below, but it does not work well:
if __name__=="__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)
The data in the file looks like this:
[image]
But I want it in this format, with the Unicode characters as binary data:
[image]
How can I convert the bytes before saving them to the file?
\uNNNN escapes do not make sense in byte strings because they do not specify a sequence of bytes. Unicode code points are conceptually abstract representations of strings, and do not straightforwardly map to a serialization format (consisting of bytes, or, in principle, any other sort of concrete symbolic representation).
There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
With that out of the way, I believe the rest of your code should work as expected.
Unicode on disk is always encoded but you obviously have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
However, many places are standardizing on UTF-8 which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. Perhaps see also https://utf8everywhere.org/
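To make the difference between the encodings concrete, here is a small sketch that writes the UTF-16 big-endian form to a file and reads it back. The temporary directory is an assumption made so the example does not touch the working directory; the file name is the one from the question.

```python
import os
import tempfile

text = "abc\u6df7\u5408def.mp3"      # 12 code points

utf16 = text.encode("utf-16-be")     # 2 bytes per code point here
utf8 = text.encode("utf-8")          # 1 byte for ASCII, 3 for each CJK character

# The question's bytes literal holds a literal backslash and "u"; the
# unicode-escape codec turns those escapes back into text.
escaped = b"abc\\u6df7\\u5408def.mp3"
assert escaped.decode("unicode-escape") == text

path = os.path.join(tempfile.mkdtemp(), "database.bin")
with open(path, "wb") as datafile:
    datafile.write(utf16)

# Reading back requires knowing which encoding was used to write.
with open(path, "rb") as datafile:
    assert datafile.read().decode("utf-16-be") == text
```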

how to write xdr packed data in binary files

I want to convert an array into xdr format and save it in binary format. Here's my code:
# myData is a Pandas data frame whose 3rd column is int (but it could be anything else)
import xdrlib
p=xdrlib.Packer()
p.pack_list(myData[2],p.pack_int)
newFile=open("C:\\Temp\\test.bin","wb")
# not sure what to put
# p.get_buffer() returns a string as per document, but how can I provide xdr object?
newFile.write(???)
newFile.close()
How can I provide the XDR-"packed" data to newFile.write function?
Thanks
XDR is a pretty raw data format. Its specification (RFC 1832) doesn't specify any file headers, or anything else, beyond the encoding of various data types.
The binary string you get from p.get_buffer() is the XDR encoding of the data you've fed to p. There is no other kind of XDR object.
I suspect that what you want is simply newFile.write(p.get_buffer()).
Unrelated to the XDR question, I'd suggest using a with statement to take care of closing the file.
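A minimal sketch of the whole flow follows. Because xdrlib was deprecated in Python 3.11 and removed in 3.13, this example uses struct to build the same bytes that `pack_list(values, p.pack_int)` produces; the helper `xdr_pack_int_list`, the sample values, and the temporary path are assumptions for illustration. With xdrlib available, `p.get_buffer()` would take the place of the helper's return value.

```python
import os
import struct
import tempfile

def xdr_pack_int_list(values):
    """Encode ints the way xdrlib's pack_list(values, pack_int) does."""
    out = bytearray()
    for value in values:
        out += struct.pack(">I", 1)      # "another item follows"
        out += struct.pack(">i", value)  # 4-byte big-endian signed int
    out += struct.pack(">I", 0)          # end-of-list marker
    return bytes(out)

buffer = xdr_pack_int_list([7, -1])

path = os.path.join(tempfile.mkdtemp(), "test.bin")
with open(path, "wb") as new_file:       # the with block closes the file
    new_file.write(buffer)
```

The buffer is an ordinary bytes object, so a file opened in "wb" mode accepts it directly.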

Can ALL string be decoded as valid binary data?

As known, Base-64 encodes binary data into transferable ASCII strings, and we decode these strings back to data.
Now my question is inverted: Can every random string be decoded as binary data, and correctly encoded back to the exact original string?
It depends on your encoding method. Some methods use only a limited range of characters, so a string containing other characters would not be legal. This is the case in Base64, so the answer is no. With other methods I'm sure it's possible, but I cannot think of an example other than simply treating the string as binary bytes.
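Both halves of that answer can be checked with a short sketch using Python's standard base64 module. The sample strings are made up for the example.

```python
import base64
import binascii

# Characters outside the Base64 alphabet are rejected when validating.
try:
    base64.b64decode("not base64!", validate=True)
    strict_ok = True
except binascii.Error:
    strict_ok = False

# In the default lenient mode they are silently discarded, so the
# string decodes but does not re-encode to itself.
original = "aG k="
decoded = base64.b64decode(original)
reencoded = base64.b64encode(decoded).decode("ascii")

# Treating the string as encoded text round-trips exactly.
anything = "any string, even with \u6df7\u5408!"
round_trip = anything.encode("utf-8").decode("utf-8")
```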

What defines data that can be stored in strings

A few days ago, I asked why it's not possible to store binary data, such as a jpg file, in a string variable.
Most of the answers I got said that string is used for textual information such as what I'm writing now.
What is considered textual data though? Bytes of a certain nature represent a jpg file and those bytes could be represented by character byte values...I think. So when we say strings are for textual information, is there some sort of range or list of characters that aren't stored?
Sorry if the question sounds silly. Just trying to 'get it'
I see three major problems with storing binary data in strings:
Most systems assume a certain encoding within string variables - e.g. if it's a UTF-8, UTF-16 or ASCII string. New line characters may also be translated depending on your system.
You should watch out for restrictions on the size of strings.
If you use C style strings, every null character in your data will terminate the string and any string operations performed will only work on the bytes up to the first null.
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
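The null-terminator problem can be observed from Python through ctypes, which exposes a C-style string view of a buffer. This is only an illustration of the C convention; the sample bytes are arbitrary.

```python
import ctypes

data = b"Hello\x00World"

# Python bytes objects carry their own length, so nothing is lost.
assert len(data) == 11

# Viewed as a C-style string, the value stops at the first null byte.
c_view = ctypes.c_char_p(data).value
print(c_view)
```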
I would prefer to store binary data as binary. You would only think of converting it to text when there's no other choice, since converting it to a textual representation wastes some bytes (not much, but it still counts). That's how attachments are put in email.
Base64 is a good textual representation of binary files.
I think you are referring to the binary-to-text encoding issue (translating a jpg into a string would require that sort of pre-processing).
Indeed, in that article, some characters are mentioned as not always supported, and others can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.
Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
In C#, you can put any data into a string... example:
byte[] myJpegByteArray = GetBytesFromSomeImage();
string myString = Encoding.ASCII.GetString(myJpegByteArray);
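Whether such a conversion preserves the data depends on the encoding chosen. The sketch below is a Python analogue, not the C# code above, and the behavior of .NET's Encoding.ASCII should be checked separately. It shows that ASCII cannot represent bytes above 0x7F, while Latin-1 maps every byte to exactly one code point.

```python
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])   # first bytes of a JFIF JPEG

# ASCII covers only 0x00-0x7F; a strict decode of other bytes fails.
try:
    jpeg_header.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode substitutes a replacement character, which is lossy.
lossy = jpeg_header.decode("ascii", errors="replace")

# Latin-1 maps every byte 0-255 to one code point, so it round-trips.
as_text = jpeg_header.decode("latin-1")
restored = as_text.encode("latin-1")
```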
Before internationalization, it didn't make much difference. ASCII characters are all bytes, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different.
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
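The line-ending point is easy to reproduce. This sketch forces the CRLF translation explicitly with `newline="\r\n"` so it behaves the same on every platform; the 8-byte PNG signature is used as sample data because it contains both CR and LF, and the temporary path is an assumption.

```python
import os
import tempfile

data = b"\x89PNG\r\n\x1a\n"   # the 8-byte PNG signature

path = os.path.join(tempfile.mkdtemp(), "sample.bin")

# Text mode with newline="\r\n" rewrites every "\n" as "\r\n" on write,
# which is what text mode does by default on Windows.
with open(path, "w", encoding="latin-1", newline="\r\n") as f:
    f.write(data.decode("latin-1"))

with open(path, "rb") as f:
    damaged = f.read()

# Binary mode writes the bytes untouched.
with open(path, "wb") as f:
    f.write(data)

with open(path, "rb") as f:
    intact = f.read()
```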
Deep down everything is just bytes.
Things like strings and pictures are defined by rules about how to order bytes.
C-style strings, for example, end in a byte with value 0 (other string types use other conventions)
jpg's don't
Depends on the language. For example, in Python 2 the string type (str) is really a byte array, so it can indeed be used for binary data. (In Python 3, str holds text and bytes is the byte-array type.)
In C the null byte is used for string termination, so a string cannot be used for arbitrary binary data, since binary data could contain null bytes.
In C# a string is an array of chars, and since a char is basically an alias for 16bit int, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not actually correspond to a legal unicode character), and some operations like case conversions will probably fail in strange ways.
In short, it might be possible in some languages to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforeseen trouble. Most languages have a byte-array type for storing arbitrary binary data.
I agree with Jacobus' answer:
In the end all data structures are made up of bytes. (Well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers, on how to access them.
In this regard, the string is an abstraction for data interpreted as a text. Text was invented for communication among humans, computers or programs do not communicate very well using text. SQL is textual, but is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human-to-human or human-to-machine interaction (say, for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries a lot of risk because you are using the data type for something it was not designed to handle. This makes it much more error-prone. You may be able to store binary data in strings, but just because you are able to shoot yourself in the foot does not mean you should.
Summary: You can do it. But you'd better not.
Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.

Resources