Image sanitization library

Image sanitization library - security

I have a website that displays images submitted by users. I am concerned about
some wiseguy uploading an image which may exploit some 0-day vulnerability in a
browser rendering engine. Moreover, I would like to purge images of metadata
(like EXIF data), and attempt to compress them further in a lossless manner
(there are several such command line utilities for PNG and JPEG).
With the above in mind, my question is as follows: is there some C/C++
library out there that caters to the above scenario? And even if the
full pipeline of parsing -> purging -> sanitizing -> compressing -> writing
is not available in any single library, can I at least implement the
parsing -> purging -> sanitizing -> writing pipeline (without compressing) in a
library that supports JPEG/PNG/GIF?

Your requirement is impossible to fulfill: if there is a 0-day vulnerability in one of the image reading libraries you use, then your code may be exploitable when it tries to parse and sanitize the incoming file. By "presanitizing" as soon as the image is received, you'd just be moving the point of exploitation earlier rather than later.
The only thing that would help is to parse and sanitize incoming images in a sandbox, so that, at least, if there was a vulnerability, it would be contained to the sandbox. The sandbox could be a separate process running as an unprivileged user in a chroot environment (or VM, for the very paranoid), with an interface consisting only of bytestream in, sanitized image out.
The sanitization itself could be as simple as opening the image with ImageMagick, decoding it to a raster, and reencoding and emitting them in a standard format (say, PNG or JPEG). Note that if the input and output are both lossy formats (like JPEG) then this transformation will be lossy.

I know, I'm 9 years late, but...
You could use a idea similar to the PDF sanitizer in Qubes OS, which copies a PDF to a disposable virtual machine, runs a PDF parser which converts PDF to basically TIFF images, which are sent back to the originating VM and reassembled into a PDF there. This way you reduced your attack surface to TIFF files. Which is tiny.
(image taken from this article: https://blog.invisiblethings.org/2013/02/21/converting-untrusted-pdfs-into-trusted.html)
If there is really a 0-day exploit for your specific parser in that PDF, it compromises the disposable VM, but since only valid TIFF is accepted by the originating VM and since the disposable VM is discarded once the process is done, this is pointless. Unless of course the attacker also has a either Xen exploit at hand to break out of the disposable VM or a Spectre-type full memory read primitive coupled with a sidechannel to leak data to their machines. Since the disposable VM is not connected to the internet or has any audio hardware assigned, this boils down to creating EM interference by modulating the CPU power consumption, so the attacker probably needs a big antenna and a location close to your server.
It would be an expensive attack.

Related

find a video file in memory dump of a process

I have a player that plays encrypted video files and works like this:
I open an encrypted video file with it
it decrypts the video file and writes it to its memory
and plays the file from the memory after that
and I want to copy the decrypted video file from memory and play it with a usual video player like VLC so I tried to create its memory dump with task manager and hoped to find out the video file there. Sadly I don't know enough to find a video file in a large chunk of bits from memory. I tried to find mp4 patterns in a hex editor and done every solution that I find online but nothing worked for me so I hoped someone here maybe has an idea and willing to help me how to make it done.
I upload its memory dump here (after opening a short encrypted video with it)

Most probably, the software doesn't decode whole video file in one go, but instead in streaming fashion. This makes it impossible to catch a moment when the decoded video data is available in the memory dump.
If the player software is open source, compile it with debug symbols and run it under debugger. Otherwise, resort to reverse engineering.

I don't think the question is on-topic for StackOverflow in general, including but not limited to specifically reversing a software solution intended for digital rights management. However I would still leave an answer.
First of all, as comments suggest the topic in question is reversal of specific solution provided by a commercial provider. Ability to recover a media file from memory dump highly depends on implementation of this solution and methods the provider used to complicate the reversal. It is only the simplest and straightforward solution is easy to reverse and the more developer put in to cover traces, the harder - exponentially - is to reverse.
Even though there is a little chance to find the original file in full in memory (through memory dump analysis) it is unlikely to be possible for any media playback application, even such that does not do any decryption. Media playback is typically streaming: the data is loaded from disk, storage, network etc. as necessary for playback and not as a full download. Decryption needs to be applied to certain pieces of data needed momentarily, and then a decent DRM-enabled application would immediately erase the ephemeral clear data once it is no longer needed. That is, a memory dump would - at best - contain a ridiculously small amount of media data.
To capture/restore the original media file one would typically have to place himself as a middleman into some media streaming related process and be able to copy data as it is being streaming durign playback. A static memory dump is of little help here.

Are Node.js Buffers as JSON a portable storage format?

If I create Node.js Buffer containing the bytes of a binary file like a jpg -image, convert it to JSON, can I transport binary content in this way to other machines and have the images viewable on those other machines too?
In other words can I fill a Buffer on one machine with bytes of an image-file, and transport the buffer as JSON to another machine, then restore the image there by simply writing the same buffer to a file with the same name?
Would it work between platforms say Linux Windows and Mac? Does "endiannes" become an issue?
Would TypedArrays be a better solution?

JSON isn't useful for transferring binary data... at least, not efficiently. You would have to base64-encode the data before putting it in JSON, which increases its size by 33% and adds an extra layer of processing on each end.
There is another standard serialization format you can use called CBOR. It's binary in nature and supports a byte string. There are libraries for many languages.

Zip Create Process with Node Express of large ZIP packages

Goal
We standing up a low volume site, where users (browser client) will select image files (284 KB per file) and then request a Node Express Server to bundle them into a ZIP for download to the web client.
Issues & Design Constraints
The resultant ZIP might be on the order of 50 MB - 5 GB. Therefore we would like
to give the user a running progress bar while the ZIP is being
constructed. (We assume the browser will give running updates as to
the progress of the actual download).
While we expect low volume of requests
(1-2 request at a time). However, we do not want to completely tie up our 4
core server processor, so we want to minimize synchronous calls that tie up the express server.
Given the size of the ZIP, we cannot expect the zip to be assembled only in memory
Is there any other issues we should worry about?
Question
We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
So which of the following packages are very Node/ExpressJS friendly packages given the design constraints/goals listed above?
archiver: https://www.npmjs.com/package/archiver
jszip: https://www.npmjs.com/package/jszip
easyzip: https://www.npmjs.com/package/easy-zip
expresszip: https://www.npmjs.com/package/express-zip
zipstream: https://www.npmjs.com/package/zip-stream
What I am seeing above is that most packages first collect the files, and then finalize them to memory and then pipe them to the http request (probably not good for 5GB of data or am I missing something). Some seem to be able to use disk, but the question will be does one get update events as each file is added?
Others seem to be fully async and I don't see how you would get a running progress value as each file added to the ZIP package.

Of the packages listed above. Most were not appropriate
JSZIP is mainly for the browser
EasyZip is a node wrapper for of JSZIP, but it does not provide
progress notifications durring creation
Express-Zip is an in-memory express friendly RES solution (but
probably would not handle the size of the ZIP we are talking about)
ZIP-Stream is underlying utility underleath Archiver. Archiver has
the queuing services, so one should just user archiver
YAZL might work, but the interface is more complex for progress
tracking than Archiver
We chose Archiver, since it had most of the features desired:
Express Friendly
low memory footprint
as fast as 7ZIP for the particular image archives we create (we don't need to compress, files are large, etc.) You might have 25% hit in performance for other types of archives
It does not let you append to existing archives (that was one feature we wanted), but adm-zip might provide that gap
As for the 7zip solution. We tend not to like reading the entrails of a standard output stream from a spawned child process.
It is messy to find strings int he streams
it causes context switches to read the stream,
you have a brittle solution trying to deal with what output stream puts out (e.g. in the case of 7zip it sometimes leaps the counter by 30% sometimes by 1%), as well as other sources for brittle solutions.

We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
That appears to be a false assumption.
A command line like this will show progress for each file added to the archive on stdout as each new file is added:
7z a -bsp1 -bb3 test.7z *
So, you can launch that from node.js using the child process module and you should be able to capture the stdout progress as it happens. You will need to use spawn, not exec so you can get the stdout data live as it happens.
Running this as a child process will keep your nodejs process free to serve other requests and will allow the child process to manage its own memory, independent of nodejs.
The 7zip program handles extremely large archives and files with appropriate memory usage. With the right flags to get progress to stdout and running it as a child process, it appears to meet all your requirements.

Need advice for disk access program

I'm envisioning a program I will need to write and need some advice on the language. I will need to be doing raw disk access so I can display hex data, scroll or jump around on the disk, and do calculations from the data. I have been using Java the most and it's portability between OSes for my other projects is certainly a benefit, but raw disk access either isn't possible, would require JNI, or may be possible on *nix when you can access disks as "files". I keep reading different things. By the way I can handle this type of work using Files in Java, but in this project I need to be able to access the disk so disk imaging to files beforehand isn't needed.
It would be nice to make it as portable as I could since there is a real benefit to using different OSes, but it may not be worth it and I should just stick with Windows and a native compiling language. Is there any existing JNI code that could help? I have experience in other languages but I haven't used C++ in a long time. Should I forget about Java and tryout C#? Someone told me that Python has libraries available for this type of thing despite it being an interpreted language so what about Python? What would be best for the project? What would be good for me to learn?
Searching around for raw disk access, Java, Python, does not seem to give any useful results. Thanks for any help!
EDIT
It seems like this will be quite involved, learning what I need to know, and then learning that. It's too bad I couldn't use disk images instead because then I'd be able to start working on it immediately in Java, which I'm comfortable with and I know I could make a good product. I've gotten great throughput in other raw data processing projects with Java so that doesn't worry me. Plus it would be truly portable. Hmm might have to consider it more. I'd probably need a big azz storage system to hold all the images though :)
UPDATE
Just a note for anyone that finds this question... I have figured out this works just by specifying the disk for the File using the PhysicalDrive notation (in Windows) like the answer below by hunsricker. However there are some issues. First if you do a "exists" check File.exists(), it says the file does not exist. Also, the file size is zero, and when I get a "java.io.IOException: The drive cannot find the sector requested" is the way I know I'm at the end of the file. And the worst part- I was getting some odd runtime errors doing this when I was reading some bytes and skipping some (64) bytes in a loop. I altered my program a bit to read different amounts and that changed where the error occurred. I was using BufferedInputStream instead of RandomAccessFile like hunsricker below by the way, not sure if it makes a difference. My only answer for this issue is that since I'm doing physical disk access, it doesn't like that I am not reading in even 512 byte sectors or 1K blocks or such. Indeed when I read even 1K, 2K, 512bytes, etc., and don't skip anything, it works fine and runs to the end. The errors I saw were java.io.ioexception "incorrect function" and java.io.ioexception "the parameter is incorrect". There was no rhyme or reason to them. Then I made image files of the same data and ran my program on those and it would do any combination of reading and skipping bytes with no problem. Physical disk access was more picky I guess.

I was looking by myself for a possibility to access raw data of a physical drive. And now as I got it to work, I just want to tell you how. You can access raw disk data directly from within java ... just run the following code with administrator priviliges:
File diskRoot = new File ("\\\\.\\PhysicalDrive0");
RandomAccessFile diskAccess = new RandomAccessFile (diskRoot, "r");
byte[] content = new byte[1024];
diskAccess.readFully (content);
So you will get the first kB of your first physical drive on the system. To access logical drives - as mentioned above - just replace 'PhysicalDrive0' with the drive letter e.g. 'D:'
oh yes ... I tried with Java 1.7 on a Win 7 system ...
RageDs link brougth me to the solution ... thank you :-)

Disk access will depend on the disk's particular drivers. And since this is such a low-level task, I doubt Java/Python would have such support (these languages are generally used for fast, high-level software package development). Since you will probably not be aware of the disks' particular hardware implementations, you will probably have to end up using an operating system API (which is OS-dependent of course). I would recommend looking into C and/or the particular assembly language for the architecture you plan to do this work on. Then, I would recommend continuing your search to find the appropriate API for your target OS.
EDIT
For Windows, a good place to start is here. More specifically, MSDN's CreateFile() is probably a function you would be interested in.

Uploading & extracting archive (zip, rar, targz, tarbz) automatically - security issue?

I'd like to create following functionality for my web-based application:
user uploads an archive file (zip/rar/tar.gz/tar.bz etc) (content - several image files)
archive is automatically extracted after upload
images are shown in the HTML list (whatever)
Are there any security issues involved with extraction process? E.g. possibility of malicious code execution contained within uploaded files (or well-prepared archive file), or else?

Aside the possibility of exploiting the system with things like buffer overflows if it's not implemented carefully, there can be issues if you blindly extract a well crafted compressed file with a large file with redundant patterns inside (a zip bomb). The compressed version is very small but when you extract, it'll take up the whole disk causing denial of service and possibly crashing the system.
Also, if you are not careful enough, the client might hand a zip file with server-side executable contents (.php, .asp, .aspx, ...) inside and request the file over HTTP, which, if not configured properly can result in arbitrary code execution on the server.

In addition to Medrdad's answer: Hosting user supplied content is a bit tricky. If you are hosting a zip file, then that can be used to store Java class files (also used for other formats) and therefore the "same origin policy" can be broken. (There was the GIFAR attack where a zip was attached to the end of another file, but that no longer works with the Java PlugIn/WebStart.) Image files should at the very least be checked that they actually are image files. Obviously there is a problem with web browsers having buffer overflow vulnerabilities, that now your site could be used to attack your visitors (this may make you unpopular). You may find some client side software using, say, regexs to pass data, so data in the middle of the image file can be executed. Zip files may have naughty file names (for instance, directory traversal with ../ and strange characters).
What to do (not necessarily an exhaustive list):
Host user supplied files on a completely different domain.
The domain with user files should use different IP addresses.
If possible decode and re-encode the data.
There's another stackoverflow question on zip bombs - I suggest decompressing using ZipInputStream and stopping if it gets too big.
Where native code touches user data, do it in a chroot gaol.
White list characters or entirely replace file names.
Potentially you could use an IDS of some description to scan for suspicious data (I really don't know how much this gets done - make sure your IDS isn't written in C!).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string