Are Node.js Buffers as JSON a portable storage format? - node.js

If I create a Node.js Buffer containing the bytes of a binary file such as a JPEG image and convert it to JSON, can I transport the binary content that way to other machines and have the image viewable on those machines too?
In other words, can I fill a Buffer on one machine with the bytes of an image file, transport the buffer as JSON to another machine, and then restore the image there simply by writing the same bytes to a file with the same name?
Would it work across platforms, say Linux, Windows and Mac? Does "endianness" become an issue?
Would TypedArrays be a better solution?

JSON isn't useful for transferring binary data... at least, not efficiently. You would have to base64-encode the data before putting it in JSON, which increases its size by 33% and adds an extra layer of processing on each end.
There is another standard serialization format you could use instead, called CBOR. It is binary in nature and supports byte strings natively. There are libraries for many languages.
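To illustrate the base64 route with nothing but Node built-ins, here is a minimal sketch (the file names are placeholders). Note that endianness is not a concern here: a Buffer is just a sequence of bytes, and byte order would only matter if both ends reinterpreted those bytes as multi-byte values, e.g. through a Uint16Array:
import { readFileSync, writeFileSync } from "node:fs";

// Sender: read the image and wrap its bytes as base64 inside JSON.
const original = readFileSync("photo.jpg");            // Buffer of raw bytes
const payload = JSON.stringify({
  name: "photo.jpg",
  data: original.toString("base64"),                   // roughly 33% larger than the raw bytes
});

// Receiver: parse the JSON and write the decoded bytes back out unchanged.
const parsed = JSON.parse(payload) as { name: string; data: string };
writeFileSync(parsed.name, Buffer.from(parsed.data, "base64"));
JSON.stringify(buf) on its own also works, because Buffer has a toJSON() method that produces {"type":"Buffer","data":[...]}, but that array-of-numbers form is even bulkier than base64.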

Related

Working effectively with remote binary data in Python 3

I have some remote devices that are accessed via sftp using paramiko over (usually) a cellular connection. I need to do two things to some binary files on them:
Read them byte-by-byte to enter the contents into a database
Write the entire file out to a networked drive.
I have code to do each of these things that works, but I need to do both things in the most reasonably efficient way.
I could pass around the file handle, read byte by byte to do the database entry, then do file_handle.seek(0) and pass it to other_file_handle.write. But I'm a little concerned about the flakiness of the cellular connection while reading remote files byte by byte and processing the results, and it means effectively iterating over the file twice.
I could fix the double iteration part of that problem by both translating the bytes to meaningful data and simultaneously writing them to a buffer to later dump to disk, but that seems awfully...manual?
I could read the entire remote binary file, write it to disk, open that for byte-by-byte processing but that's really inefficient compared to doing all the work in memory.
I could read in the remote data to an IO stream, and then manually both convert bytes and also put them into a write stream. But then the file writing code is totally coupled to the parsing code and again it's a lot of lower level manipulation.
The last is probably the "best" way but I'm hoping there's a better higher-level abstraction to use that lets me maintain a better separation of concerns. Is there an equivalent of the posix tee command or something here?

Determining where the extra information from squashfs comes from

I extracted the root file system from an IoT device, and I was able to peruse it using unsquashfs. I then changed a single byte in a single file, and recompressed it again using mksquashfs. When I inspect the two files, the original and the one I created, the output from binwalk is identical, except for the size. The original had a size of 1038570 bytes while the one I created had a size of 1086112. I have no idea where the extra data came from. Are there any tools or methods for determining what the difference is?
So it turns out I was missing a flag while creating the squashed file system.
When using the xz compressor, you can specify an additional flag, -Xbcj, which applies architecture-specific filters so the executable code in the image compresses better. Once I added this and chose arm as my architecture, the file sizes matched.

Ghostscript: Convert PDFs to other filetypes without using the filesystem

I want to use the C API to Ghostscript on Linux to convert PDFs to other things: PDFs with fewer pages and images being two examples.
My understanding was that, by supplying callback functions with gsapi_set_stdio, I could read and write data through them. However, from my experimentation and reading, this doesn't seem to be the case.
My motivation for doing this is I will be processing PDFs at scale, and don't want my throughput to be held back by a spinning disk.
Am I missing something?
The stdio API allows you to provide your own replacements for stdin, stdout and stderr; it doesn't affect any activity by the interpreter that doesn't use those streams.
The pdfwrite device makes extensive use of the filesystem to write temporary files which hold various intermediate portions of the PDF file as it is interpreted; these are later reassembled into the new PDF file. The temporary files aren't written to stdout or stderr.
There is no way to avoid this behaviour.
Rendering to images again uses the file system, unless you specify stdout as the destination of the bitmap, in which case you can use the stdio API to redirect stdout elsewhere. If the image is rendered at a high enough resolution, GS will use a display list, and the display list will be stored in a temporary file that is unaffected by stdio redirection.
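For the rendering case, here is a minimal sketch of the stdout route, driving the gs executable from Node/TypeScript rather than the C API; the flags are standard Ghostscript options, the input name is a placeholder, and the interpreter may still create its own temporary files internally as described above:
import { execFileSync } from "node:child_process";

// Render page 1 of a PDF to PNG and capture the bitmap from stdout,
// so the output bitmap itself never touches our own filesystem.
const png: Buffer = execFileSync("gs", [
  "-q", "-dBATCH", "-dNOPAUSE", "-dSAFER",
  "-sDEVICE=png16m", "-r150",
  "-dFirstPage=1", "-dLastPage=1",
  "-sOutputFile=-",            // '-' sends the bitmap to stdout
  "input.pdf",
], { maxBuffer: 256 * 1024 * 1024 });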

Image sanitization library

I have a website that displays images submitted by users. I am concerned about
some wiseguy uploading an image which may exploit some 0-day vulnerability in a
browser rendering engine. Moreover, I would like to purge images of metadata
(like EXIF data), and attempt to compress them further in a lossless manner
(there are several such command line utilities for PNG and JPEG).
With the above in mind, my question is as follows: is there some C/C++
library out there that caters to the above scenario? And even if the
full pipeline of parsing -> purging -> sanitizing -> compressing -> writing
is not available in any single library, can I at least implement the
parsing -> purging -> sanitizing -> writing pipeline (without compressing) in a
library that supports JPEG/PNG/GIF?
Your requirement is impossible to fulfill: if there is a 0-day vulnerability in one of the image reading libraries you use, then your code may be exploitable when it tries to parse and sanitize the incoming file. By "presanitizing" as soon as the image is received, you'd just be moving the point of exploitation earlier rather than later.
The only thing that would help is to parse and sanitize incoming images in a sandbox, so that, at least, if there was a vulnerability, it would be contained to the sandbox. The sandbox could be a separate process running as an unprivileged user in a chroot environment (or VM, for the very paranoid), with an interface consisting only of bytestream in, sanitized image out.
The sanitization itself could be as simple as opening the image with ImageMagick, decoding it to a raster, and re-encoding and emitting it in a standard format (say, PNG or JPEG). Note that if the input and output are both lossy formats (like JPEG), this transformation will be lossy.
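As a sketch of that decode-and-re-encode step, here is one way to shell out to ImageMagick from Node/TypeScript (on ImageMagick 7 the binary is magick rather than convert, the file names are placeholders, and running this inside the sandbox is still up to you):
import { execFileSync } from "node:child_process";

// Decode whatever came in, drop profiles and comments with -strip,
// and force a re-encode to PNG regardless of the claimed input format.
execFileSync("convert", [
  "untrusted-upload",        // input file (placeholder name)
  "-strip",                  // removes EXIF data, ICC profiles and comments
  "png:sanitized.png",       // the 'png:' prefix forces the output encoder
]);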
I know, I'm 9 years late, but...
You could use an idea similar to the PDF sanitizer in Qubes OS, which copies a PDF to a disposable virtual machine and runs a PDF parser there that converts the PDF to, essentially, TIFF images; these are sent back to the originating VM and reassembled into a PDF. This way you reduce your attack surface to TIFF parsing, which is tiny.
(See the diagram in this article: https://blog.invisiblethings.org/2013/02/21/converting-untrusted-pdfs-into-trusted.html)
If there really is a 0-day exploit for your specific parser in that PDF, it compromises the disposable VM, but since only valid TIFF is accepted by the originating VM and the disposable VM is discarded once the process is done, this is pointless. Unless, of course, the attacker also has either a Xen exploit at hand to break out of the disposable VM, or a Spectre-type full memory read primitive coupled with a side channel to leak data to their machines. Since the disposable VM is not connected to the internet and has no audio hardware assigned, this boils down to creating EM interference by modulating the CPU power consumption, so the attacker probably needs a big antenna and a location close to your server.
It would be an expensive attack.

How do I transparently compress/decompress a file as a program writes to/reads from it?

I have a program that reads and writes very large text files. However, because of the format of these files (they are ASCII representations of what should have been binary data), these files are actually very easily compressed. For example, some of these files are over 10GB in size, but gzip achieves 95% compression.
I can't modify the program but disk space is precious, so I need to set up a way that it can read and write these files while they're being transparently compressed and decompressed.
The program can only read and write files, so as far as I understand, I need to set up a named pipe for both input and output. Some people are suggesting a compressed filesystem instead, which seems like it would work, too. How do I make either work?
Technical information: I'm on a modern Linux. The program reads a separate input and output file. It reads through the input file in order, though twice. It writes the output file in order.
Check out zlibc: http://zlibc.linux.lu/.
Also, if FUSE is an option (i.e. the kernel is not too old), consider: compFUSEd http://www.biggerbytes.be/
Named pipes won't give you full-duplex operation, so it will be a little more complicated if you need to provide just one filename.
Do you know if your application needs to seek through the file?
Does your application work with stdin and stdout?
Maybe a solution is to create a small compressed file system that contains only a directory with your files.
Since you have separate input and output files, you can do the following:
mkfifo readfifo
mkfifo writefifo
# decompress the stored input into the fifo your program reads from
zcat yourinputfile.gz > readfifo &
# compress whatever your program writes to the other fifo
gzip < writefifo > youroutputfile.gz &
Then launch your program, pointing it at readfifo and writefifo!
Now, you will probably get into trouble with reading the input twice in order, because a named pipe cannot be rewound: once zcat has finished, your program's second pass will only see end-of-file.
The proper solution is probably to use a compressed file system like compFUSEd, because then you don't have to worry about unsupported operations like seeking.
btrfs (https://btrfs.wiki.kernel.org/index.php/Main_Page) provides support for pretty fast automatic transparent compression/decompression these days, and is present (though marked experimental) in newer kernels.
FUSE options:
http://apps.sourceforge.net/mediawiki/fuse/index.php?title=CompressedFileSystems
Which language are you using?
If you are using Java, take a look at the GZIPInputStream and GZIPOutputStream classes in the API docs.
If you are using C/C++, zlibc is probably the best way to go about it.
