How do I transparently compress/decompress a file as a program writes to/reads from it? - linux

I have a program that reads and writes very large text files. However, because of the format of these files (they are ASCII representations of what should have been binary data), these files are actually very easily compressed. For example, some of these files are over 10GB in size, but gzip achieves 95% compression.
I can't modify the program but disk space is precious, so I need to set up a way that it can read and write these files while they're being transparently compressed and decompressed.
The program can only read and write files, so as far as I understand, I need to set up a named pipe for both input and output. Some people are suggesting a compressed filesystem instead, which seems like it would work, too. How do I make either work?
Technical information: I'm on a modern Linux. The program uses separate input and output files. It reads through the input file sequentially, though it makes two passes, and it writes the output file sequentially.

Check out zlibc: http://zlibc.linux.lu/.
Also, if FUSE is an option (i.e. the kernel is not too old), consider: compFUSEd http://www.biggerbytes.be/

Named pipes won't give you full-duplex operation, so it gets a little more complicated if you can only provide a single filename.
Do you know whether your application needs to seek through the file?
Does your application work with stdin and stdout?
Maybe a solution is to create a small compressed filesystem that contains only a directory with your files.
Since you have separate input and output files, you can do the following:
mkfifo readfifo
mkfifo writefifo
zcat yourinputfile.gz > readfifo &
gzip -c < writefifo > youroutputfile.gz &
Then launch your program, pointing it at readfifo for input and writefifo for output.
Now, you will probably run into trouble with the requirement to read the input twice: a pipe cannot be rewound, so once zcat has finished, your program will just see end-of-file on the read fifo and cannot go back for its second pass.
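One possible workaround, assuming the program closes and reopens its input file between the two passes (the input file name here is just a placeholder), is to feed the read fifo once per pass:
# Serve the decompressed input twice, once for each pass the program makes.
( zcat yourinputfile.gz > readfifo; zcat yourinputfile.gz > readfifo ) &
This still fails if the program keeps the same descriptor open and rewinds with lseek(), since pipes are not seekable.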
The proper solution is probably to use a compressed filesystem like compFUSEd, because then you don't have to worry about unsupported operations like seeking.

btrfs:
https://btrfs.wiki.kernel.org/index.php/Main_Page
provides support for pretty fast "automatic transparent compression/decompression" these days, and is present (though marked experimental) in newer kernels.
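For example, something like the following should enable it (the device, mount point and directory are placeholders; compress=zlib is the long-standing option, and newer kernels also accept lzo or zstd):
# Mount a btrfs volume with transparent compression enabled.
mount -o compress=zlib /dev/sdb1 /mnt/data
# Or mark an existing directory so that new writes to it are compressed.
chattr +c /mnt/data/bigfiles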

FUSE options:
http://apps.sourceforge.net/mediawiki/fuse/index.php?title=CompressedFileSystems

Which language are you using?
If you are using Java, take a look at the GZIPInputStream and GZIPOutputStream classes in the API docs.
If you are using C/C++, zlibc is probably the best way to go about it.

Related

How to check if a file is opened in Linux?

The thing is, I want to track whether a user tries to open a file on a shared account. I'm looking for any record or technique that helps me know, at run time, if the file in question is open.
I want to create a script which monitors whether the file is open, and if it is, sends an alert to a particular email address. The file I'm thinking of is a regular file.
I tried using lsof | grep filename for checking if a file is open in gedit, but the command doesn't return anything.
Actually, I'm trying this for a pet project, and thus the question.
The command lsof -t filename shows the IDs of all processes that have the particular file opened. lsof -t filename | wc -w gives you the number of processes currently accessing the file.
The fact that a file has been read into an editor like gedit does not mean that the file is still open. The editor most likely opens the file, reads its contents and then closes the file. After you have edited the file you have the choice to overwrite the existing file or save as another file.
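If you still want to poll with lsof and mail yourself an alert, a minimal sketch might look like this (the file path, polling interval and mail address are placeholders, and it assumes a working mail command):
#!/bin/sh
FILE=/home/shared/somefile        # file to watch (placeholder)
while sleep 10; do
    if lsof -t "$FILE" >/dev/null 2>&1; then
        echo "$FILE is currently open" | mail -s "file open alert" admin@example.com
    fi
done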
You could (in addition to the other answers) use the Linux-specific inotify(7) facilities.
My understanding is that you want to track one particular file (or a few), with a fixed file path (actually a given i-node): e.g. you would want to track when /var/run/foobar is accessed or modified, and do something when that happens.
In particular, you might want to install and use incrond(8) and configure it through incrontab(5).
If you want to run a script when some given file (on a native local filesystem, e.g. Ext4 or BTRFS, but not NFS) is accessed or modified, the inotify-based incrond is made for exactly that purpose.
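For instance, an incrontab(5) entry along these lines (the script name is a placeholder) would run a script on every access or modification of the file from the example above:
/var/run/foobar IN_ACCESS,IN_MODIFY /usr/local/bin/alert-script $@ $%
Here $@ expands to the watched path and $% to the textual event flags; see incrontab(5) for the full list of wildcards.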
PS. AFAIK, inotify doesn't work well for remote network files, e.g. NFS filesystems (in particular when another NFS client machine is modifying a file).
If the files you care about are source files of some kind, you might be interested in revision control systems (like git) or build systems (like GNU make); in a certain way these tools are all about file modification.
You could also have the particular file sit in some FUSE filesystem and write your own FUSE daemon.
If you can restrict and modify the programs accessing the file, you might want to use advisory locking, e.g. flock(2) or lockf(3).
Perhaps the data sitting in the file should be in some database (e.g. sqlite, or a real DBMS like PostgreSQL or MongoDB). ACID properties are important.
Notice that the filesystem and the mount options may matter a lot.
You might want to use the stat(1) command.
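For example (using the same placeholder path as above), GNU stat can print the last access and modification times:
stat -c 'last access: %x  last modification: %y' /var/run/foobar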
It is difficult to help more without understanding the real use case and the motivation; beware of the XY problem.
The workflow itself is probably wrong (a shared file that several users are able to write), and you should approach the overall issue in some other way. For a pet project I would at least recommend using some advisory lock, and accessing and modifying the information only through your own programs (perhaps setuid) using flock; this excludes ordinary editors like gedit and commands like cat. However, your implicit use case seems well suited to a DBMS approach (a database does not have to contain a lot of data; it can be tiny), or to some indexed, locked file of the kind the GDBM library handles.
Remember that on POSIX systems and Linux, several processes can access (and even modify) the same file simultaneously (unless you use some locking or synchronization).
Reading the Advanced Linux Programming book (freely available) would give you a broader picture (but it does not mention inotify, which appeared after the book was written).
You can use ls -lrt; it lists files sorted by modification time, with the most recently written ones last, so you can see recent write activity on the file (though not whether it is open right now). Make sure that you are in the right directory.

Ghostscript: Convert PDFs to other filetypes without using the filesystem

I want to use the C API to Ghostscript on Linux to convert PDFs to other things: PDFs with fewer pages and images being two examples.
My understanding was that, by supplying callback functions with gsapi_set_stdio, I could read and write data through them. However, from my experimentation and reading, this doesn't seem to be the case.
My motivation for doing this is I will be processing PDFs at scale, and don't want my throughput to be held back by a spinning disk.
Am I missing something?
The stdio API allows you to provide your own replacements for stdin, stdout and stderr; it doesn't affect any activity by the interpreter that doesn't use those streams.
The pdfwrite device makes extensive use of the filesystem to write temporary files which hold various intermediate portions of the PDF file as it is interpreted; these are later reassembled into the new PDF file. The temporary files aren't written to stdout or stderr.
There is no way to avoid this behaviour.
Rendering to images again uses the filesystem, unless you specify stdout as the destination of the bitmap, in which case you can use the stdio API call to redirect stdout elsewhere. If the image is rendered at a high enough resolution, GS will use a display list, and again the display list will be stored in a temporary file that is unaffected by stdio redirection.

Linux >2.6.33: could sendfile() be used to implement a faster 'cat'?

Having to concatenate lots of large files into an even larger single one, we currently use cat file1 file2 ... > output_file
but are wondering whether it could be done faster than with that old friend.
Reading the man page of sendfile(), one can specify an offset into input_file from which to send the remainder of it to output_file. But can I also specify an offset into output_file?
Or could I simply loop over all the input files, leaving my output FD open and calling sendfile() into it repeatedly, effectively concatenating the input files?
In other words: would the file pointer of my output FD remain at its end if I neither close it nor seek() on it?
Does anybody know of such a cat implementation using sendfile()?
Admittedly, I'm an admin, not a programmer, so please bear with my lack of 'real' coding knowledge...
Yes, the file pointer of the output fd will remain at its end (provided the file is new, or at least no larger than the data you have already written to it).
The documentation for sendfile() explicitly mentions (emphasis mine):
In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately.
I have personally never seen an implementation of cat that relies on sendfile(), maybe because 2.6.33 is quite recent and out_fd could not be fileno(stdout) before. sendfile() is also not portable, so doing this would result in a version of cat that only runs on Linux 2.6.33+ (although I guess it could still be implemented as a platform-dependent optimization enabled at compile time).

How can we create 'special' files, like /dev/random, in linux?

In Linux file system, there are files such as /dev/zero and /dev/random which are not real files on hard disk.
Is there any way that we can create a similar file and tell it to get its output from executing a program?
For example, can I create file, say /tmp/tarfile, such that any program reading it actually gets the output from the execution of a different program (/usr/bin/tar ...)?
It is possible to create such a file/program, but it would require creation of a special filesystem in order to insert hooks into the VFS so that accesses can be detected and handled properly.

How can I create a corrupt file with specified file size?

By corrupt I mean either making an empty file of a given size, or taking an actual file and corrupting it so that it becomes unreadable.
dd if=/dev/urandom of=somefile bs=somesize count=1
See the man page for details on the size.
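If you would rather take an existing file and clobber part of it in place (the offset and size here are arbitrary examples), dd can do that too:
# Overwrite 512 bytes with random garbage at byte offset 4096, without truncating the file.
dd if=/dev/urandom of=somefile bs=1 count=512 seek=4096 conv=notrunc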
Burn the file onto a CD/DVD, then lightly scratch the disc.
Files can contain any arbitrary pattern of bytes, so there is no (digital) way to create a file that is "corrupted". You can certainly modify a file in an existing format (for example, an XML file) so that it no longer is valid in whatever format it's supposed to be, but it's still just a file on disk and is perfectly readable.
You generally need to physically alter the storage medium on which the file is stored, in order to make the file actually unreadable.
My guess is that your best bet is to mount a FUSE filesystem which is modified to return errors for specific files. As FUSE really is a filesystem driver (albeit in userspace), it can throw back whatever error code you want.
