How to create an archive (tar, 7z, zip) on the fly? - rust

I have an actix-web server and want to have an endpoint where user can download a dynamically generated archive (tar or zip or 7z).
How can I do that?
All the examples I saw either generated the archive in memory (not an option, it can be big) or wrote a temporary archive file.
I want to send the archive data as it is produced.

I'm not sure whether that's easily possible for zip files; the compression usually requires a lot of back-and-forth between data that has already been written and new data.
If you want to build a tar file, there's the tar crate, which provides a Builder type. The builder wraps a Write implementation which ultimately receives the archive data.
So in order to glue it together with actix-web, you'll probably need to write a struct which implements both Write and the futures::stream::Stream<Item = Bytes, Error = Error> trait.
The easiest route is probably something built on poll_fn from the futures crate. Just be aware that you must not keep too much data in memory: otherwise a client could DoS the server by requesting the archive, keeping the connection open, and never actually reading the data.
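For illustration, here is a rough sketch of that glue on a more recent stack (actix-web 4, tokio, tokio-stream, bytes and the tar crate are assumed here; ChannelWriter and the appended path are made-up names). A Write adapter pushes chunks into a bounded channel, a background thread drives tar::Builder, and the channel's receiving end becomes the streaming response body:

    use std::io::Write;

    use actix_web::{get, HttpResponse};
    use bytes::Bytes;
    use tokio::sync::mpsc;
    use tokio_stream::wrappers::ReceiverStream;

    // io::Write adapter that forwards every chunk into an async channel.
    struct ChannelWriter {
        tx: mpsc::Sender<Result<Bytes, std::io::Error>>,
    }

    impl Write for ChannelWriter {
        fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
            // blocking_send gives natural backpressure: if the client reads
            // slowly (or not at all), the archiving thread stalls instead of
            // buffering unbounded amounts of data.
            self.tx
                .blocking_send(Ok(Bytes::copy_from_slice(buf)))
                .map_err(|_| std::io::Error::new(std::io::ErrorKind::BrokenPipe, "client disconnected"))?;
            Ok(buf.len())
        }

        fn flush(&mut self) -> std::io::Result<()> {
            Ok(())
        }
    }

    #[get("/archive.tar")]
    async fn archive() -> HttpResponse {
        // A small channel bound keeps the in-memory buffer tiny.
        let (tx, rx) = mpsc::channel(8);

        // Build the tar on a plain thread; every write lands in the channel.
        std::thread::spawn(move || {
            let mut builder = tar::Builder::new(ChannelWriter { tx });
            // Placeholder path; append whatever files or directories you need.
            let _ = builder.append_path("some/local/file.txt");
            let _ = builder.finish();
        });

        HttpResponse::Ok()
            .content_type("application/x-tar")
            .streaming(ReceiverStream::new(rx))
    }

The bounded channel combined with blocking_send is what keeps memory use flat: a client that requests the archive but never reads it simply stalls the archiving thread instead of forcing the server to buffer the whole archive.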

Related

Copy/move semantics on FUSE

I have a hash-value database with tags and I want to implement a FUSE interface for it. Because values are indexed by their hashes, they must be read-only.
The native interface for this database is very simple:
You can download, upload or tag a file.
You can get the set of all defined tags.
You can search for files tagged in accordance with a boolean combination of tags.
The FUSE interface semantics are simple:
The database is viewed as a big synthetic directory hierarchy where values are files named by their hash and tags are directories.
cd-ing into a directory is semantically equivalent to searching for a given tag (naming conventions on paths can be used to implement boolean operations).
read-ing a file is semantically equivalent to downloading (part of) a value (FUSE allows stateless reads, so open and close can be no-ops).
Copying/moving a nonexistent file into a given path is equivalent to uploading and tagging it. Copying/moving an existing file into a given path is equivalent to adding new tags.
Any other operation throws an error.
This FUSE interface is quite usable and allows you to easily embed a tag file system inside a hierarchical one without the need for external tools like TagSpaces or Evernote.
My problem is distinguishing a file copy or move from any other forbidden operation through the FUSE interface: there are endless possible combinations of operations with equivalent semantics.
What is the most reliable way to identify a file copy or move through the FUSE interface?
Hooking a rename should be straightforward: implement the rename() FUSE call. In this call you get the paths of both the old and the new location, so you can check whether the file comes from outside or not. That said, this only works if the user-space tool renames the file by invoking the rename(2) kernel call.
On the other hand, hooking a file copy is harder: it can't be done directly, as there is no such FUSE call. Copying happens entirely in user space, so it's not directly detectable in kernel space.
You could try some heuristics and process the incoming FUSE operations to detect a copy of an already stored file (e.g. by hashing the content of the new file and comparing it with the existing files), but I'm not sure how much sense that makes in your case or whether it would actually be practical.
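As a small illustration of that content-hash heuristic, in Rust to match the language of this page's main question (sha2 is an assumed dependency, and known_hashes stands in for whatever index the database already keeps):

    use std::collections::HashSet;

    use sha2::{Digest, Sha256};

    // Returns true if `data` matches a value the store already holds, i.e. the
    // incoming file looks like a copy of something that is already stored.
    fn is_copy_of_existing(data: &[u8], known_hashes: &HashSet<[u8; 32]>) -> bool {
        let digest: [u8; 32] = Sha256::digest(data).into();
        known_hashes.contains(&digest)
    }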

Zip Create Process with Node Express of large ZIP packages

Goal
We are standing up a low-volume site where users (browser clients) will select image files (284 KB per file) and then ask a Node Express server to bundle them into a ZIP for download to the web client.
Issues & Design Constraints
The resultant ZIP might be on the order of 50 MB - 5 GB. We would therefore like to give the user a running progress bar while the ZIP is being constructed. (We assume the browser will give running updates on the progress of the actual download.)
We expect a low volume of requests (1-2 at a time). However, we do not want to completely tie up our 4-core server, so we want to minimize synchronous calls that block the Express server.
Given the size of the ZIP, we cannot expect it to be assembled entirely in memory.
Are there any other issues we should worry about?
Question
We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
So which of the following packages are Node/Express-friendly, given the design constraints and goals listed above?
archiver: https://www.npmjs.com/package/archiver
jszip: https://www.npmjs.com/package/jszip
easyzip: https://www.npmjs.com/package/easy-zip
expresszip: https://www.npmjs.com/package/express-zip
zipstream: https://www.npmjs.com/package/zip-stream
What I am seeing above is that most packages first collect the files, finalize them in memory, and then pipe them to the HTTP response (probably not good for 5 GB of data, or am I missing something?). Some seem to be able to use disk, but the question is whether one gets update events as each file is added.
Others seem to be fully async, and I don't see how you would get a running progress value as each file is added to the ZIP package.
Of the packages listed above, most were not appropriate:
JSZip is mainly for the browser.
EasyZip is a node wrapper for JSZip, but it does not provide progress notifications during creation.
Express-Zip is an in-memory, Express-friendly res solution (but probably would not handle the size of the ZIP we are talking about).
ZIP-Stream is the underlying utility underneath Archiver. Archiver adds the queuing services, so one should just use Archiver.
YAZL might work, but the interface is more complex for progress tracking than Archiver's.
We chose Archiver, since it had most of the features desired:
Express-friendly
low memory footprint
as fast as 7-Zip for the particular image archives we create (we don't need to compress; files are large, etc.); you might see a 25% performance hit for other types of archives
It does not let you append to existing archives (that was one feature we wanted), but adm-zip might fill that gap
As for the 7zip solution: we tend not to like reading the entrails of a standard output stream from a spawned child process.
It is messy to find strings in the stream,
it causes context switches to read the stream, and
you end up with a brittle solution trying to deal with whatever the output stream emits (e.g. in the case of 7zip it sometimes jumps the counter by 30%, sometimes by 1%), along with other sources of brittleness.
We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.
That appears to be a false assumption.
A command line like this will show progress on stdout as each new file is added to the archive:
7z a -bsp1 -bb3 test.7z *
So, you can launch that from node.js using the child process module, and you should be able to capture the stdout progress as it happens. You will need to use spawn, not exec, so you can get the stdout data live as it arrives.
Running this as a child process will keep your node.js process free to serve other requests and will allow the child process to manage its own memory, independently of node.js.
The 7zip program handles extremely large archives and files with appropriate memory usage. With the right flags to get progress to stdout and running it as a child process, it appears to meet all your requirements.
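This thread is about Node, but the pattern is language-agnostic. As an illustration in Rust (the language of this page's main question), here is a minimal sketch: spawn 7z with the progress flags and read its stdout line by line. The exact format of the progress lines varies between 7z versions, so the parsing is left as a comment:

    use std::io::{BufRead, BufReader};
    use std::process::{Command, Stdio};

    // Archive `dir` into `archive`, reporting 7z's own progress output.
    fn zip_with_progress(archive: &str, dir: &str) -> std::io::Result<()> {
        // -bsp1 sends percentage progress to stdout, -bb3 logs each file as it
        // is added. Note that without a shell, globs like `*` are not expanded,
        // so a directory (or an explicit file list) is passed instead.
        let mut child = Command::new("7z")
            .args(["a", "-bsp1", "-bb3", archive, dir])
            .stdout(Stdio::piped())
            .spawn()?;

        let stdout = child.stdout.take().expect("stdout was piped above");
        for line in BufReader::new(stdout).lines() {
            // Parse whatever you need out of the line and forward it to the
            // client (e.g. over SSE or a websocket).
            println!("7z: {}", line?);
        }

        child.wait()?;
        Ok(())
    }

The same structure applies in Node: child_process.spawn with a listener on the child's stdout stream, exactly as described above.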

Transactionally writing files in Node.js

I have a Node.js application that stores some configuration data in a file. If you change some settings, the configuration file is written to disk.
At the moment, I am using a simple fs.writeFile.
Now my question is: What happens when Node.js crashes while the file is being written? Is there the chance to have a corrupt file on disk? Or does Node.js guarantee that the file is written in an atomic way, so that either the old or the new version is valid?
If not, how could I implement such a guarantee? Are there any modules for this?
What happens when Node.js crashes while the file is being written? Is there the chance to have a corrupt file on disk? Or does Node.js guarantee that the file is written in an atomic way, so that either the old or the new version is valid?
Node implements only a (thin) async wrapper over system calls, thus it does not provide any guarantees about atomicity of writes. In fact, fs.writeAll repeatedly calls fs.write until all data is written. You are right that when Node.js crashes, you may end up with a corrupted file.
If not, how could I implement such a guarantee? Are there any modules for this?
The simplest solution I can come up with is the one used e.g. for FTP uploads:
Save the content to a temporary file with a different name.
When the content has been written to disk, rename the temporary file to the destination file.
The man page says that rename guarantees to leave an instance of newpath in place (on Unix systems like Linux or OS X).
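For illustration, the same temporary-file-then-rename pattern sketched in Rust (the question is about Node, where the write-file-atomic module mentioned further down does the equivalent); the .tmp suffix and the sync_all call are my own choices, not part of the answer:

    use std::fs;
    use std::io::Write;
    use std::path::Path;

    // Write `data` to `dest` via a temporary file in the same directory, then
    // rename it into place. On POSIX systems rename() atomically replaces the
    // destination, so readers see either the old file or the new one.
    fn write_atomically(dest: &Path, data: &[u8]) -> std::io::Result<()> {
        let tmp = dest.with_extension("tmp"); // example naming scheme
        {
            let mut file = fs::File::create(&tmp)?;
            file.write_all(data)?;
            file.sync_all()?; // flush to disk before the rename makes it visible
        }
        fs::rename(&tmp, dest)
    }

The temporary file must live on the same filesystem as the destination, otherwise the rename cannot be atomic; keeping it in the same directory takes care of that.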
fs.writeFile, just like all the other methods in the fs module, is implemented as a simple wrapper around standard POSIX functions (as stated in the docs).
Digging a bit into node.js' code, one can see that fs.js, where all the wrappers are defined, uses fs.c for all its file system calls. More specifically, the write method is used to write the contents of the buffer. It turns out that the POSIX specification for write explicitly says:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
So it seems pretty safe to write, as long as the size of the buffer is smaller than {PIPE_BUF}. This constant is system-dependent, though, so you would need to check its value on your target system.
write-file-atomic will do what you need. It writes to a temporary file, then renames it over the target. That's safe.

store some data in the struct inode

Hello, I am a newbie to kernel programming. I am writing a small kernel module based on the wrapfs template to implement a backup mechanism. This is purely for learning purposes.
I am extending wrapfs so that when a write call is made, wrapfs transparently makes a copy of that file in a separate directory and then the write is performed on the file. But I don't want to create a copy for every write call.
A naive approach would be to check for the existence of the file in that directory, but checking this on every call could be a severe penalty.
I could also check for the first write call and then store a value for that specific file using the private_data attribute. But that would not be stored on disk, so I would need to check again.
I was also thinking of making use of the modification time. I could save a modification time and create a copy only if the old modification time is before the saved one; otherwise I do nothing. I tried to use inode.i_mtime for this, but it held the modified time even before write was called; also, applications can modify that time.
So I was thinking of storing some value in the on-disk inode that indicates whether its backup has been created. Is that possible? Any other suggestions or approaches are welcome.
You are essentially saying you want a copy-on-write virtual filesystem layer.
IMO, some of these have been done, and it would be easier to implement this in userland (using libfuse and the fuse module, e.g.). That way, you can be king of your castle and add your metadata in whichever way you feel is appropriate:
just add (hidden) metadata files to each directory
use extended POSIX attributes (setfattr and friends; a sketch follows this answer)
heck, you could even use a sqlite database
If you really insist on doing these things in-kernel, you'll have a lot more work, since accessing the metadata from kernel mode is going to take a lot more effort (you'd most likely want to emulate your own database using memory-mapped files, so as to minimize the amount of 'userland (style)' work required and to make it relatively easy to get atomicity and reliability right [1]).
[1] On How Everybody Gets File IO Wrong: see also here
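For the extended-attributes route, here is a minimal userland sketch in Rust (matching the language of this page's main question) using the xattr crate; the crate and the user.backup_done attribute name are assumptions, not something from the answer:

    use std::io;
    use std::path::Path;

    // Attribute name is arbitrary; names under the "user." namespace can be
    // set from unprivileged userland code.
    const BACKUP_FLAG: &str = "user.backup_done";

    // Has this file already been backed up?
    fn backup_done(path: &Path) -> io::Result<bool> {
        Ok(xattr::get(path, BACKUP_FLAG)?.is_some())
    }

    // Mark the file so later writes can skip the copy step.
    fn mark_backup_done(path: &Path) -> io::Result<()> {
        xattr::set(path, BACKUP_FLAG, b"1")
    }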
You can use atime instead of mtime. In that case, setting the S_NOATIME flag on the inode prevents it from being updated (see the touch_atime() function in inode.c). The only thing you'll need is to mount your filesystem with the noatime option.

Uploading & extracting archive (zip, rar, targz, tarbz) automatically - security issue?

I'd like to create the following functionality for my web-based application:
the user uploads an archive file (zip/rar/tar.gz/tar.bz etc.) containing several image files
the archive is automatically extracted after upload
the images are shown in an HTML list (whatever)
Are there any security issues involved with the extraction process? E.g. the possibility of malicious code execution contained within the uploaded files (or a well-prepared archive file), or anything else?
Aside from the possibility of exploiting the system with things like buffer overflows if it's not implemented carefully, there can be issues if you blindly extract a well-crafted compressed file containing a huge file with redundant patterns (a zip bomb). The compressed version is very small, but when you extract it, it'll fill the whole disk, causing denial of service and possibly crashing the system.
Also, if you are not careful enough, the client might hand you a zip file with server-side executable content (.php, .asp, .aspx, ...) inside and then request that file over HTTP, which, if the server is not configured properly, can result in arbitrary code execution on the server.
In addition to Medrdad's answer: hosting user-supplied content is a bit tricky. If you are hosting a zip file, then that can be used to store Java class files (this also works for other formats) and therefore the "same origin policy" can be broken. (There was the GIFAR attack where a zip was attached to the end of another file, but that no longer works with the Java PlugIn/WebStart.) Image files should at the very least be checked that they actually are image files. Obviously there is a problem with web browsers having buffer overflow vulnerabilities: your site could now be used to attack your visitors (this may make you unpopular). You may find some client-side software using, say, regexes to parse data, so data in the middle of an image file could end up being executed. Zip files may have naughty file names (for instance, directory traversal with ../ and strange characters).
What to do (not necessarily an exhaustive list):
Host user supplied files on a completely different domain.
The domain with user files should use different IP addresses.
If possible decode and re-encode the data.
There's another Stack Overflow question on zip bombs - I suggest decompressing using ZipInputStream and stopping if it gets too big (see the sketch after this list).
Where native code touches user data, do it in a chroot gaol.
Whitelist characters, or replace file names entirely.
Potentially you could use an IDS of some description to scan for suspicious data (I really don't know how much this gets done - make sure your IDS isn't written in C!).
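Picking up the zip-bomb and file-name bullets above, here is a rough sketch in Rust, matching the language of this page's main question. The zip crate is an assumed dependency and the limits are made-up example values; the answer's own suggestion uses Java's ZipInputStream, and the same idea applies there:

    use std::io::{self, Read, Seek};

    // Example limits; tune them for your application.
    const MAX_TOTAL_BYTES: u64 = 512 * 1024 * 1024;
    const MAX_ENTRIES: usize = 1_000;

    // Walk the archive, rejecting path traversal in entry names and aborting
    // once the total decompressed size exceeds the cap (zip-bomb guard).
    fn check_archive<R: Read + Seek>(reader: R) -> io::Result<()> {
        let invalid = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_string());

        let mut archive = zip::ZipArchive::new(reader)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        if archive.len() > MAX_ENTRIES {
            return Err(invalid("too many entries"));
        }

        let mut total: u64 = 0;
        for i in 0..archive.len() {
            let mut entry = archive
                .by_index(i)
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;

            // Reject names like "../../etc/passwd".
            if entry.enclosed_name().is_none() {
                return Err(invalid("unsafe entry name"));
            }

            // Count what actually comes out of the decompressor instead of
            // trusting the size declared in the zip headers.
            let mut limited = entry.by_ref().take(MAX_TOTAL_BYTES - total + 1);
            total += io::copy(&mut limited, &mut io::sink())?;
            if total > MAX_TOTAL_BYTES {
                return Err(invalid("total decompressed size is too large"));
            }
        }
        Ok(())
    }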
