Syncing two files when one is still being written to

Syncing two files when one is still being written to - linux

I have an application (video stream capture) which constantly writes its data to a single file. Application typically runs for several hours, creating ~1 gigabyte file. Soon (in a matter of several seconds) after it quits, I'd like to have 2 copies of file it was writing - let's say, one in /mnt/disk1, another in /mnt/disk2 (the latter is an USB flash drive with FAT32 filesystem).
I don't really like an idea of modifying the application to write 2 copies simulatenously, so I though of:
Application starts and begins to write the file (let's call it /mnt/disk1/file.mkv)
Some utility starts, copies what's already there in /mnt/disk1/file.mkv to /mnt/disk2/file.mkv
After getting initial sync state, it continues to follow a written file in a manner like tail -f does, copying everything it gets from /mnt/disk1/file.mkv to /mnt/disk2/file.mkv
Several hours pass
Application quits, we stop our syncing utility
Afterwards, we run a quick rsync /mnt/disk1/file.mkv /mnt/disk2/file.mkv just to make sure they're the same. In case if they're the same, it should just run a quick check and quit fairly soon.
What is the best approach for syncing 2 files, preferably using simple Linux shell-available utilities? May be I could use some clever trick with FUSE / md device / tee / tail -f?
Solution
The best possible solution for my case seems to be
mencoder ... -o >(
tee /mnt/disk1/file.mkv |
tee /mnt/disk2/file.mkv |
mplayer -
)
This one uses bash/zsh-specific magic named "process substitution" thus eliminating the need to make named pipes manually using mkfifo, and displays what's being encoded, as a bonus :)

Hmmm... the file is not usable while it's being written, so why don't you "trick" your program into writing through a pipe/fifo and use a 2nd, very simple program, to create 2 copies?
This way, you have your two copies as soon as the original process ends.

Read the manual page on tee(1).

Related

Windows/Python check if file is open or in use

I am using python to monitor a folder and check if files are being copied in and if so, replicate those to a new location.
I am using the following to monitor the folder:
fsmonitor
The issue I am facing is that I am unable to discern if the file is in use and currently in the process of writing the contents onto disk. if so I want to wait till copying is complete and then start copying it to my new location.
So how do I find out if a file is in use/open?
I have seen some suggestions here where I try to write to the file question and if it fails then it indicates that the file is in use:
example answer (I've seen similar in python)
But I am reluctant to use such a method due to the fear that it might cause corruption and such issues.
Is there an alternative/safer way to do this? Or is testing write permissions safe?
Is anyone familiar with pywin32? Does it provide such tools? The site looks arcane, so wonder if it has the latest API provided by windows, even fsmointor mentioned above uses the same library and I wonder if there are newer/more efficient ways to do this.
Currently, I am using psutil, proc.open_files() to loop through all processes and all files to list out open files. if files that I am concerned about appear on this list I wait and try again. However, this process creates a humongous list of files and uses 12% of my CPU to create it, so I desperately need an alternative.
In response to Adrian McCarthy
I starting out assuming that it is safe to action whatever fsmonitor puts out, but if you see the following output which si for a single file copy:
0 86 0
create C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe 3684bf38
create C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe 3684bf38
0 86 0
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe a8cf3250
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe a8cf3250
0 160 0
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64.exe caef5c64
So the conundrum is at which 'modify' do I start copying the file? I can wait a few minutes/seconds to see if another 'modified' appeared for that file but how do I decide the time to wait for a large file over SFTP may take 30 minutes, so I need something scalable.
Also, I would like not the make multiple copy actions for a file since that will make the script inefficient.

This can help you
check if a file is open in Python
here is a code:
try: # try to open the file
with open("file", "r") as file:
# some code here
except IOError:
# if it throws an error that means it is in use

I think you're unnecessarily concerned about working with the file while another process still has it open.
On Windows. fsmonitor using the ReadDirectoryChangesW mechanism. That means you'll get a notification about a change after it happens. So if a process writes to foo.log, you'll get a notification after the write operation is completed. (In fact, I think it's after the update of the directory metadata.)
To copy the file, you need read access. So just go ahead and open it for reading.
If it opens, then it's safe to read, even if another process has it open. You cannot corrupt a file by reading it even if another process is writing to it.
If it fails to open, then another process has it open and is intentionally preventing other processes from reading it (probably because they know they'll be actively updating it). In that case, you can try again later.
Trying to first check whether another process is using the file doesn't actually help because the answer could change between the moment you check and the moment you try to act on that information.
When you open a file, the system does the permission check and the opening under a mutex*, so the answer cannot change in between. There's no way for you to simulate that yourself from user-mode code. Once you have the file open, you can safely use it.
If you try to read from a file at the same moment another process tries to write to it, the system will ensure that the read will get the data as it was before the write or as it is after the write. It won't get a result that's a mixture of old and new.
That said, if you're reading the file with a bunch of small read operations while another process is writing to the file with a bunch of small write operations, it's possible you might capture some intermediate state of the file. But that's okay. The original file is unharmed, and those writes will trigger another fsmonitor notification, so you're code will start over and try to make another copy of the file.
* I'm using "mutex" in a generic sense: It uses some sort of synchronization mechanism, but it might not necessarily be a Windows Mutex object.

How to cache files with Perl while playing sound files using vlc?

I would like to manually cache files in Perl, so when playing a sound there is little to no delay.
I wrote a program in Perl, which plays an audio file by doing a system call to VLC. When executing it, I noticed a delay before the audio started playing. The delay is usually between about 1.0 and 1.5 seconds. However, when I create a loop which does the same VLC call multiple times in a row, the delay is only about 0.2 - 0.3 seconds. I assume this is because the sound file was cached by Linux. I found Cache::Cache on CPAN, but I don't understand how it works. I'm interested in a solution without using a module. If that's not possible, I'd like to know how to use Cache::Cache properly.
(I know it's a bad idea to use a system call to VLC regarding execution speed)
use Time::HiRes;
use warnings;
use strict;
while (1) {
my $start = Time::HiRes::time();
system('vlc -Irc ./media/audio/noise.wav vlc://quit');
my $end = Time::HiRes::time();
my $duration = $end - $start;
print "duration = $duration\n";
<STDIN>;
}

Its not as easy as just "caching" a file in perl.
vlc or whatever program needs to interpret the content of the data (in your case the .wav file).
Either you stick with calling an external program and just give it a file to execute or you need to implement the whole stack in perl (and probably Perl XS Modules). By whole stack I mean:
1. Keeping the Data (your .wav file) in Memory (inside the perl runtime).
2. Interpreting the Data inside Perl.
The second part is where it gets tricky you would probably need to write a lot of code and/or use 3rd Party modules to get where you want.
So if you just want to make it work fast, stick with system calls. You could also look into Nama which might give you what you need.
From your Question it looks like you are mostly into getting the runtime of a .wav file. If its just about getting information about the File and not about playing the sound then maybe Audio::Wav could be the module for you.

Cacheing internal to Perl does not help you here.
Prime the Linux file cache by reading the file once, for example at program initialisation time. It might happen that at the time you want to play it, it has already been made stale and removed from the cache, so if you want to guarantee low access time, then put the files on a RAM disk instead.
Find and use a different media player with a lower footprint that does not load dozens of libraries in order to reduce start-up time. Try paplay from package pulseaudio-utils, or gst123, or mpg123 with mpg123-pulse.

How do I "dump" the contents of an X terminal programmatically a la /dev/vcs{,a} in the Linux console?

Linux's kernel-level console/non-X terminal emulator contains a very cool feature (if compiled in): each /dev/ttyN device corresponds with /dev/vcsaN and /dev/vcsN devices which represent the in-memory (displayed) state of that tty, with and without attributes (color, flashing, etc) respectively. This allows you to very easily cat /dev/vcs7 and see a dump of /dev/tty7 wherever cat was launched. I used this incredibly practical capability the other day to login to a system via SSH and remotely watch a dd process I'd forgotten to put inside a screen (or similar) session - it was running off a text console, so I took a few moments to finetune the character ranges that I wanted to grab, and presently I was watching dd's transfer status over SSH (once every second, incidentally).
To reiterate and clarify, /dev/vcs{,a}* are character devices that retrieve the current in-memory representation the kernel console VT100 emulator, represented as a single "line" of text (there are no "newlines" at the end of each "line" of the screen). Just to remove confusion, I want to note that I can't tail -f this device: it's not a character stream like the TTY itself is. (But I've never needed this kind of behavior, for what it's worth.)
I've kept my ears perked for many years for a method to dump the character-cell memory state of X terminal emulators - or indeed any arbitrary process that needs to work with ttys, in some similar manner as I can with the Linux console. And... I am rather surprised that there is no practical solution to this problem - since it has, arguably, existed for approximately 30 years - X was introduced in 1984 - or, to be pedantic, at least 19 years - /dev/vcs{,a}* was introduced in kernel 1.1.94; the newest file in that release is dated 22 Feb 1995. (The oldest is from 1st Dec 1993 :P)
I would like to say that I do understand and realize that the tty itself is not a "screen buffer" as such but a character stream, and that the nonstandard feature I essentially exploited above is a quirky capability specific to the Linux VT102 emulator. However, this feature is cool enough (why else would it be in the mainline tree? :D) that, in my opinion, there should be a counterpart to it for things that work with /dev/pts*.
This afternoon, I needed to screen-scrape the output of an interactive ncurses application so I could extract metadata from the information it presented in my terminal. (There was no other practical way to achieve the goal I was aiming for.) Linux' kernel VT100 driver would permit such a task to be completed very easily, and I made the mistake of thinking that it, in light of this, it couldn't truly be that hard to do the same under X11.
By 9AM, I'd decided that the easiest way to experimentally request a dump of a remote screen would be to run it in dtach (think "screen -x" without any other options) and hack the dtach code to request a screen update and quit.
Around 11AM-12PM, I was requesting screen updates and dumping them to stdout.
Around 3:30PM, I accepted that using dtach would be impossible:
First of all, it relies on the application itself to send the screen redraws on request, by design, to keep the code simple. This is great, but, as luck would have it, the application I was using didn't support whole-screen repaints - it would only redraw on screen-size change (and only if the screen size was truly different!).
Running the program inside a screen session (because screen is a true terminal emulator and has an internal 2D character-cell buffer), then running screen -x inside dtach, also mysteriously failed to produce character cell updates.
I have previously examined screen and found the code sufficiently insane enough to remove any inclinations I might otherwise have to hack on it; all I can say is that said insanity may be one of the reasons screen does not already have the capabilities I have presented here (which would arguably be very easy to implement).
Other questions similar to this one frequently get answers to use typescript, or script; I just want to clarify that script saves the stream of the tty itself to a file, which I would need to push through a VT100 emulator to obtain a screen image of the current state of the tty in question. In other words, script would be a very insane solution to my problem.

I'm not marking this as accepted since it doesn't solve the actual core issue (which is many years old), but I was able to achieve the specific goal I set out to do.
My specific requirements were that I wanted to screen-scrape the output of the ncdu interactive disk usage browser, so I could simply press Enter in another terminal (or perform some similar, easy sequence) to add the directory currently highlighted/selected in ncdu to a file-list of files I wanted to work with.My goal was not to have to distract myself with endless copy+paste and/or retyping of directory names (probably with not a few inaccuracies to boot), so I could focus on the directories I wanted to select.
screen has a refresh feature, accessed by pressing (by default) CTRL+A, CTRL+L. I extended my copy of dtach to be capable of sending keystrokes in addition to dumping remote screens to stdout, and wrapped dtach in a script that transmitted the refresh sequence (\001\014) to screen -x running inside dtach. This worked perfectly, retrieving complete screen updates without any flicker.
I will warn anyone interested in trying this technique, however, that you will need to perfect the art of dodging VT100 escape sequences. I used regular expressions for this so I wasn't writing thousands of lines of code; here's the specific part of the script that extracted out the two pieces of information I needed:
sh -c "(sleep 0.1; dtach -k qq $'\001\014') &"; path="$(dtach -d qq -t 130000 | sed -n $'/^\033\[7m.*\/\.\./q;/---.*$/{s/.*--- //;s/ -\+.*//;h};/^\033\[7m/{s/.\033.*//g;s/\r.*//g;s/ *$//g;s/^\033\[7m *[^ ]\+ \[[# ]*\] *\(\/*\)\(.*\)$/\/\\2\\1/;p;g;p;q}' | sed 'N;s/\(.*\)\n\(.*\)/\2\1/')"
Since screenshots are cool and help people visualize things, here's a look at how it works when it's running:
The file shown inverted at the bottom of the ncdu-scrape window is being screen-scraped from the ncdu window itself; the four files in the list are there because I selected them using the arrow keys in ncdu, moved my mouse over to the ncdu-scrape window (I use focus-follows-mouse), and pressed Enter. That added the file to the list (a simple text file itself).
Having said this, I would like to clarify that the regular expression above is not a code sample to run with; it is, rather, a warning: for anything beyond incredibly trivial (!!) content extractions such as the one presented here, you're basically getting into the same territory as large corporations/interests who want to convert from VT100-based systems to something more modern, who have to spend tends of thousands commissioning large translation frameworks that perform the kind of conversion outlined above on an especially large scale.
Saner solutions appreciated.

Time taken by `less` command to show output

I have a script that produces a lot of output. The script pauses for a few seconds at point T.
Now I am using the less command to analyze the output of the script.
So I execute ./script | less. I leave it running for sufficient time so that the script would have finished executing.
Now I go through the output of the less command by pressing Pg Down key. Surprisingly while scrolling at the point T of the output I notice the pause of few seconds again.
The script does not expect any input and would have definitely completed by the time I start analyzing the output of less.
Can someone explain how the pause of few seconds is noticable in the output of less when the script would have finished executing?

Your script is communicating with less via a pipe. Pipe is an in-memory stream of bytes that connects two endpoints: your script and the less program, the former writing output to it, the latter reading from it.
As pipes are in-memory, it would be not pleasant if they grew arbitrarily large. So, by default, there's a limit of data that can be inside the pipe (written, but not yet read) at any given moment. By default it's 64k on Linux. If the pipe is full, and your script tries to write to it, the write blocks. So your script isn't actually working, it stopped at some point when doing a write() call.
How to overcome this? Adjusting defaults is a bad option; what is used instead is allocating a buffer in the reader, so that it reads into the buffer, freeing the pipe and thus letting the writing program work, but shows to you (or handles) only a part of the output. less has such a buffer, and, by default, expands it automatically, However, it doesn't fill it in the background, it only fills it as you read the input.
So what would solve your problem is reading the file until the end (like you would normally press G), and then going back to the beginning (like you would normally press g). The thing is that you may specify these commands via command line like this:
./script | less +Gg
You should note, however, that you will have to wait until the whole script's output loads into memory, so you won't be able to view it at once. less is insufficiently sophisticated for that. But if that's what you really need (browsing the beginning of the output while the ./script is still computing its end), you might want to use a temporary file:
./script >x & less x ; rm x

The pipe is full at the OS level, so script blocks until less consumes some of it.

Flow control. Your script is effectively being paused while less is paging.
If you want to make sure that your command completes before you use less interactively, invoke less as less +G and it will read to the end of the input, you can then return to the start by typing 1G into less.

For some background information there's also a nice article by Alexander Sandler called "How less processes its input"!
http://www.alexonlinux.com/how-less-processes-its-input

Can I externally enforce line buffering on the script?
Is there an off the shelf pseudo tty utility I could use?
You may try to use the script command to turn on line-buffering output mode.
script -q /dev/null ./script | less # FreeBSD, Mac OS X
script -c "./script" /dev/null | less # Linux
For more alternatives in this respect please see: Turn off buffering in pipe.

File output redirection in Linux

I have two programs A and B. I can't change the program A - I can only run it with some parameters, but I have written the B myself, and I can modify it the way I like.
Program A runs for a long time (20-40 hours) and during that time it produces output to the file, so that its size increases constantly and can be huge at the end of run (like 100-200 GB). The program B then reads the file and calculates some stuff. The special property of the file is that its content is not correlated: I can divide the file in half and run calculations on each part independently, so that I don't need to store all the data at once: I can calculate on the first part, then throw it away, calculate on the second one, etc.
The problem is that I don't have enough space to store such a big files. I wonder if it is possible to pipe somehow the output of the A to B without storing all the data at once and without making huge files. Is it possible to do something like that?
Thank you in advance, this is crucial for me now, Roman.

If program A supports it, simply pipe.
A | B
Otherwise, use a fifo.
mkfifo /tmp/fifo
ls -la > /tmp/fifo &
cat /tmp/fifo
EDIT: Adjust buffer sizes with ulimit -p and then:
cat /tmp/fifo | B

It is possible to pipeline output of one program into another.
Read here to know the syntax and know-hows of Unix pipelining.

you can use socat which can take stdout and feed it to network and get from network and feed it to stdin
named or unnamed pipe have a problem of small ( 4k ? ) buffer .. that means too many process context switches if you are writing multi gb ...
Or if you are adventurous enough .. you can LD_PRELOAD a so in process A, and trap the open/write calls to do whatever ..

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string