Modifying a file that is being used as an output redirection by another program - linux

If I have a file that a program's output is redirected to, what happens if I modify that file from another program? Will both changes be recorded in the file?
To illustrate:
Terminal 1 (a file is used to store the output of a program, using either tee or the redirection operator >):
$ ./program | tee output.log
Terminal 2 (at the same time, the log file is being modified by another program, e.g. vim):
$ vim output.log

It depends on the program and the system calls they make.
vim, for example, will not write to the file until you issue the ":w" or ":x" command. It will then detect that the file has changed on disk and make you confirm the overwrite.
If the program does open(2) on the file with the O_APPEND flag, before each write(2) the file offset is positioned at the end of the file, as if with lseek(2).
So if you have two commands that append, like tee, they will take turns appending.
However, with NFS you still may get corrupted files if more than one process appends data to a file at once, because NFS doesn't support appending to a file and the kernel has to simulate it.
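The append-mode behaviour is easy to see from the shell, since >> opens the file with O_APPEND. A minimal sketch:

```shell
# Two concurrent writers appending to the same file with >> (O_APPEND):
# every write lands at the current end of file, so no line is lost or
# overwritten, though the interleaving order is unpredictable.
out=$(mktemp)
for i in 1 2 3; do echo "writer-A $i" >> "$out"; done &
for i in 1 2 3; do echo "writer-B $i" >> "$out"; done &
wait
wc -l < "$out"    # 6: all six lines survive
rm -f "$out"
```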

The effect of two or more processes modifying the data of the same file (inode, in tech lingo) is undefined. This is a classic race condition: the result depends on the particular order in which the writing processes are scheduled.

Related

Impact of vim on a file being used by an other process

I am using SLURM for my jobs on a computing cluster. I want to check my output file using vim in the login node when the job is running and will not do any editing. Will this have any impact on my SLURM job in progress?
It shouldn't affect the current job.
Whenever two different programs open a file on a Unix operating system, the OS creates separate entries in the global file table. Since each entry has its own context (current position, mode, etc.), a reader won't interfere with a writer; at worst you will read partial information.
If you want a safeguard to prevent writing to the file, you can use vim -R file to open the file in read-only mode. You can also use tail -f to "follow" the file, writing output to the terminal as information is appended to the file.

Linux ~/.bashrc export most recent directory

I have several environment variables in my ~/.bashrc that point to different directories. I am running a program that creates a new folder every time it runs and puts a timestamp in the directory name, for example baseline_2015_11_10_15_40_31-model-stride_1-type_1. Is there a way of making a variable that links to the last created directory?
cd $CURRENT_DIR
Your mileage may vary a lot depending on what exactly you need to accomplish. In almost all cases, however, I would advise against doing something as weird and unreliable as what's described below; revise your architecture to avoid hunting for directories instead.
Method 1
If your program creates a subdirectory inside current directory, and you always know that nothing else happens in that directory and you want a subdirectory with latest creation timestamp, then you can do something like:
your_complex_program_that_creates_dir
TARGET_DIR=$(ls -t1 --group-directories-first | head -n1)
cd "$TARGET_DIR"
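A variant that avoids parsing ls output, assuming GNU find with the -printf option is available:

```shell
# Pick the most recently modified subdirectory of the current directory by
# sorting on the raw modification timestamp instead of parsing ls columns.
TARGET_DIR=$(find . -maxdepth 1 -mindepth 1 -type d -printf '%T@ %p\n' \
    | sort -rn | head -n1 | cut -d' ' -f2-)
cd "$TARGET_DIR"
```

This still has the same fundamental caveat: it assumes the newest subdirectory is the one your program created.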
Method 2
If a lot of stuff happens on the system, then you'll end up monitoring what your program does with the filesystem and reacting when it creates a directory. There are two ways to do that, using strace and inotify, both are relatively complex. Here's the way to do that with strace:
strace -o some_temp_file.strace your_complex_program_that_creates_dir
TARGET_DIR=$(sed -ne '/^mkdir(/ { s/^mkdir("\(.*\)", .*).*$/\1/; p }' some_temp_file.strace)
cd "$TARGET_DIR"
This snippet runs your_complex_program_that_creates_dir under control of strace, which essentially logs every system call your program makes into a file. Afterwards, this file is analyzed to seek a line like
mkdir("target_dir", 0777) = 0
and extract the value "target_dir" into a variable. Note that:
if your program creates more than one directory (even for temporary purposes, deleting them afterwards), there's really no way to determine which of them to grab
running a program under strace is much slower than normal due to the huge overhead of logging all the syscalls
it's highly non-portable: facilities like strace exist on most modern OSes, but the implementations vary a lot
A solution with inotify works in the same way but through a different mechanism: it uses an OS hook to log all the operations the process performs on the file system, so you can react to them (i.e. remember the created directory).
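A sketch of the inotify route using inotifywait from the inotify-tools package (an assumption: the package is installed; your_complex_program_that_creates_dir is the same placeholder as above). Note there is a small race before the watch is established, during which a created directory would be missed.

```shell
# Watch the current directory for newly created entries while the program
# runs, then grab the first directory name logged (ISDIR filters out files).
inotifywait -m -e create --format '%e %f' . > events.log 2>/dev/null &
WATCH_PID=$!
your_complex_program_that_creates_dir
TARGET_DIR=$(awk '/ISDIR/ { print $2; exit }' events.log)
kill "$WATCH_PID"
cd "$TARGET_DIR"
```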
However, I repeat: I'd strongly suggest avoiding either of these solutions beyond research interest.

Does the Linux system lock the file when I copy it?

I have written a program that updates a file periodically, and sometimes I want to copy the file to another computer to check its contents. If I copy the file while the program is not writing to it, there is no problem; but if I copy it while the program is writing, the copy is only partial. So I wonder whether Linux has a locking strategy to prevent this situation.
In fact, I copy the file in a bash script, so I want the script to check whether the program is writing to the file. If it is, the script should wait a few seconds and then copy the completed version. So, in a bash script, how can I check whether the file is open in, or being modified by, another program?
You could check from your script whether the file is being written to, and abort/pause copy if it is...
fuser -v /path/to/your/file 2>&1 | awk 'BEGIN{FS=""}$38=="F"{num++}END{print num}'
If the output is smaller than 1, you're good to copy :)
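Put together as a sketch, a script that waits until nothing holds the file open before copying (the destination path is a hypothetical placeholder; fuser exits non-zero when no process has the file open):

```shell
# Poll with fuser until no process has the file open, then copy it.
FILE=/path/to/your/file
while fuser "$FILE" >/dev/null 2>&1; do
    sleep 5        # still open somewhere; check again shortly
done
cp "$FILE" /destination/copy-of-file   # hypothetical destination
```

This is only a heuristic: the program could reopen the file between the check and the copy, so there is still a window for a partial copy.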
When your code writes into the file, it actually writes into an output buffer in memory. The buffer is flushed out to disk when it becomes full. Thus, if you copy the file before its buffer has been flushed, you will observe a partial file.
You can modify the buffering with a call to setvbuf. If you make the stream unbuffered, output is flushed out as it is written. Another thing you can do is call fflush() to flush the output to disk. Either of these should keep the file up to date as it is written.
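If you cannot modify the program's source, GNU coreutils' stdbuf can sometimes achieve the same effect from outside, by asking the C library to run unbuffered. This only works for dynamically linked programs that use stdio, and ./your_program is a hypothetical placeholder:

```shell
# Run the program with unbuffered stdout so each write reaches the
# log file immediately instead of sitting in a stdio buffer.
stdbuf -o0 ./your_program > output.log
```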

Retrieving a list of all file descriptors (files) that a process ever opened in linux

I would like to be able to get a list of all of the file descriptors (now considering this question to pertain to actual files) that a process ever opened during the runtime of the process. The problem with polling /proc/(PID)/fd/ is that you only get a snapshot in time of what is currently open. Is there a way to force linux to keep this information around long enough to log it for the entire run of the process?
First, note that a file descriptor which is open-ed and then close-d by the application is recycled by the kernel (a future open could return the same file descriptor number). See open(2) and close(2), and read Advanced Linux Programming.
Then, consider using strace(1); you'll be able to log all the syscalls (or perhaps just open, socket, close, accept, ... that is the syscalls changing the file descriptor table). Of course strace is using the ptrace(2) syscall (which you probably don't want to bother using directly).
The simplest way would be to run strace -o /tmp/mytrace.tr yourprog arguments... and to look, e.g. with a pager like less, into the quite big /tmp/mytrace.tr file.
As Gearoid Murphy commented you could restrict the output of strace using e.g. -e trace=file.
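For the specific case of files, a sketch of the filtered run plus a crude extraction of the opened paths (the grep patterns assume glibc's usual openat() trace format; modern glibc routes open() through openat()):

```shell
# Trace only file-related syscalls, then pull out the quoted paths passed
# to openat(). The two-step grep keeps the patterns POSIX-portable.
strace -f -e trace=file -o mytrace.tr yourprog arguments...
grep -o 'openat([^)]*"[^"]*"' mytrace.tr | grep -o '"[^"]*"' | sort -u
```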
BTW, to debug Makefiles this is the wrong approach; learn more about remake.

How do I transparently compress/decompress a file as a program writes to/reads from it?

I have a program that reads and writes very large text files. However, because of the format of these files (they are ASCII representations of what should have been binary data), these files are actually very easily compressed. For example, some of these files are over 10GB in size, but gzip achieves 95% compression.
I can't modify the program but disk space is precious, so I need to set up a way that it can read and write these files while they're being transparently compressed and decompressed.
The program can only read and write files, so as far as I understand, I need to set up a named pipe for both input and output. Some people are suggesting a compressed filesystem instead, which seems like it would work, too. How do I make either work?
Technical information: I'm on a modern Linux. The program reads a separate input and output file. It reads through the input file in order, though twice. It writes the output file in order.
Check out zlibc: http://zlibc.linux.lu/.
Also, if FUSE is an option (i.e. the kernel is not too old), consider: compFUSEd http://www.biggerbytes.be/
Named pipes won't give you full-duplex operation, so it will be a little more complicated if you need to provide just one filename.
Do you know if your application needs to seek through the file?
Does your application work with stdin, stdout ?
Maybe a solution is to create a mini compressed file system that contains only a directory with your files
Since you have separate input and output files, you can do the following:
mkfifo readfifo
mkfifo writefifo
zcat yourinputfile > readfifo &
gzip < writefifo > youroutputfile &
launch your program!
Now, you will probably get in trouble with reading the input twice in order: once zcat has finished reading the input file, your program will hit end of file on the FIFO, and a pipe cannot be rewound for a second pass.
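The recipe can be exercised end to end with cat standing in for the real program:

```shell
# Self-contained demo of the FIFO round trip: the input is stored
# compressed, the "program" (cat here) sees plain text on both sides,
# and the output ends up compressed again.
mkfifo readfifo writefifo
printf 'hello\n' | gzip > in.gz       # pretend this is the compressed input
zcat in.gz > readfifo &
gzip < writefifo > out.gz &
cat readfifo > writefifo              # stand-in for the real program
wait
zcat out.gz                           # prints: hello
rm -f readfifo writefifo in.gz out.gz
```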
The proper solution is probably to use a compressed file system like compFUSEd, because then you don't have to worry about unsupported operations like seeking.
btrfs:
https://btrfs.wiki.kernel.org/index.php/Main_Page
provides support for pretty fast "automatic transparent compression/decompression" these days, and is present (though marked experimental) in newer kernels.
FUSE options:
http://apps.sourceforge.net/mediawiki/fuse/index.php?title=CompressedFileSystems
Which language are you using?
If you are using Java, take a look at the GZIPInputStream and GZIPOutputStream classes in the API docs.
If you are using C/C++, zlibc is probably the best way to go about it.
