Is it dangerous to pipe data from S3 to a long-running process - linux

I have a 30GB file and I want to feed it into a program that accepts data via stdin and that will take 24hr to process the data.
Can I just do aws s3 cp s3://bigfile.txt - | long_process.sh?
I'm attracted to this because I don't have to store bigfile.txt on disk and I can start working on the data stream immediately, but I worry that if the s3 command has a problem at any point during the 24 hours, the pipeline will crash and I will lose all the progress.
EDIT:
Alternatively, I am looking for a way to stage the data to a file and read from the staged data. For example: aws s3 cp s3://bigfile.txt /local/bigfile.txt & long_process.sh < /local/bigfile.txt. The trouble is that long_process.sh complains of a truncated file; it does not seem to wait for further input the way a pipe would. I have thought about tee and mkfifo but nothing seems to quite fit my needs.

Yes, it's dangerous: if either side of the pipe fails at any point during the 24 hours (the aws s3 cp download or long_process.sh itself), the stream ends early and you lose all of the progress.
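If you still want the staged-file approach from your EDIT, one way to make it behave is to stream the object into a file that grows strictly front-to-back and follow it with tail -f rather than a plain redirect. This is only a rough sketch, assuming GNU tail (for --pid); the paths are placeholders:

: > /local/bigfile.txt                                # pre-create so tail can open it right away
aws s3 cp s3://bigfile.txt - >> /local/bigfile.txt &  # stage the download sequentially via stdout
cp_pid=$!

# Unlike long_process.sh < /local/bigfile.txt, tail -f keeps waiting for more
# data as the file grows; --pid makes it exit once the download has finished.
tail -c +1 -f --pid="$cp_pid" /local/bigfile.txt | long_process.sh

Tailing the file written directly by aws s3 cp s3://bigfile.txt /local/bigfile.txt may not work as well, since the CLI can fetch a large object as several parts in parallel rather than strictly in order; streaming through stdout sidesteps that.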

Related

Reduce Size of .forever Log Files Without Disrupting forever Process

The log files (in /root/.forever) created by forever have reached a large size and are almost filling up the hard disk.
If the log file is deleted while the forever process is still running, forever logs 0 will return undefined. The only way to resume logging for the current forever process is to stop it and start the node script again.
Is there a way to just trim the log file without disrupting logging or the forever process?
So Foreverjs will continue to write to the same file handle; ideally it would support a signal that tells it to rotate to a different file.
Without that, which requires code change on the Forever.js package, your options look like:
A command line version:
Make a backup
Null out the file
cp forever-guid.log backup && :> forever-guid.log;
This has a slight risk: if you're writing to the log file at a rapid pace, you may end up writing a log line between the backup and the nulling, resulting in the loss of that log line.
Use Logrotate w/copytruncate
You can set up logrotate to watch the forever log directory and copy-and-truncate the logs automatically based on file size or time; a sample config is sketched at the end of this answer.
Have your node code handle this
You can have your logging code look at how large the log file is and then do the copy-and-truncate itself - this would allow you to avoid the potential data loss.
EDIT: I had originally thought that split and truncate could do the job. They probably can, but an implementation would look really awkward. split doesn't have a good way of splitting the file into a short one (the original log) and a long one (the backup). truncate (which, in addition to not always being installed) doesn't reset forever's write pointer, so forever just keeps writing at the same byte offset it would have anyway, resulting in strange data.
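For the logrotate option, a minimal config sketch might look like the following (the path, size and retention count are placeholders; copytruncate is what lets forever keep writing to its existing file handle):

/root/.forever/*.log {
    # rotate once a log grows past 50M, keep five compressed copies
    size 50M
    rotate 5
    compress
    missingok
    # copy the live log, then truncate it in place so forever keeps its handle
    copytruncate
}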
You can truncate the log file without losing its handle (reference).
cat /dev/null > largefile.txt

When does the writer of a named pipe do its work?

I'm trying to understand how a named pipe behaves in terms of performance. Say I have a large file I am decompressing that I want to write to a named pipe (/tmp/data):
gzip --stdout -d data.gz > /tmp/data
and then I sometime later run a program that reads from the pipe:
wc -l /tmp/data
When does gzip actually decompress the data, when I run the first command, or when I run the second and the reader attaches to the pipe? If the former, is the data stored on disk or in memory?
Pipes (named or otherwise) have only a very small buffer if any -- so if nothing is reading, then nothing (or very little) can be written.
In your example, gzip will do very little until wc is run, because before that point its efforts to write output will block. Out-of-the-box there is no nontrivial buffer either on-disk or in-memory, though tools exist which will implement such a buffer for you, should you want one -- see pv with its -B argument, or the no-longer-maintained (and, sadly, removed from Debian by folks who didn't understand its function) bfr.
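For example, here is the sequence from the question spelled out, plus a buffered variant. This is just a sketch; the buffered line assumes pv is installed, and the file names are placeholders:

mkfifo /tmp/data

# The writer blocks straight away: first on opening the FIFO (until a reader
# shows up), then again whenever the small pipe buffer fills.
gzip --stdout -d data.gz > /tmp/data &

# Only once the reader attaches does decompression actually make progress.
wc -l /tmp/data

# To let the writer run ahead of a slow reader, put an in-memory buffer in
# the middle, e.g. a 256 MB one with pv:
#   gzip --stdout -d data.gz | pv -q -B 256m > /tmp/data &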

Node.js - How does a readable stream react to a file that is still being written?

I have found a lot of information on how to pump, or pipe data from a read stream to a write stream in Node. The newest version even auto pauses, and resumes for you. However, I have a different need and would like some help.
I am writing a video file using ffmpeg (to a local file, not a writeable stream), and I would like to create a readstream that reads the data as it gets written. Obviously, the read stream speed will surpass how quickly ffmpeg encodes the file. What will happen when the read stream reaches the end of data before ffmpeg finishes writing the file? I assume it will stop the read stream before the file is fully encoded.
Anyone have any suggestions for the best way to pause/resume the read stream so that it doesn't reach the end of the locally encoding file until the encoding is 100% complete?
In summary:
This is what people normally do: readStream --> writeStream (using .pipe)
This is what I want to do: local file (in slow creation process) --> readStream
As always, thanks to the stackOverflow community.
The growing-file module is what you want.

Upload output of a program directly to a remote file by ftp

I have a program that generates a lot of data, specifically encrypted tarballs. I want to upload the result to a remote FTP server.
The files are quite big (about 60GB), so I don't want to waste disk space on a temporary directory, or the time it takes to fill one.
Is it possible? I checked the ncftpput utility, but there is no option to read from standard input.
curl can upload while reading from stdin:
-T, --upload-file
[...]
Use the file name "-" (a single dash) to use stdin instead of a given
file. Alternately, the file name "." (a single period) may be
specified instead of "-" to use stdin in non-blocking mode to allow
reading server output while stdin is being uploaded.
[...]
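For the tarballs in the question, the whole thing could be a single pipeline along these lines. This is only a sketch: the hostname, credentials, paths, and the gpg recipient are placeholders.

# Create the tarball, encrypt it on the fly, and stream the result straight
# to the FTP server without touching the local disk.
tar czf - /data/to/backup \
  | gpg --encrypt --recipient backup@example.com \
  | curl -T - ftp://user:password@ftp.example.com/upload/backup.tar.gz.gpg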
I guess you could do that with any upload program using a named pipe, but I foresee problems if some part of the upload goes wrong and you have to restart it: the data is gone and you cannot restart the upload, even if you only lost 1 byte. This also applies to a read-from-stdin strategy.
My strategy would be the following:
Create a named pipe using mkfifo.
Start the encryption process writing to that named pipe in the background. Soon the pipe buffer will be full and the encryption process will block trying to write data to the pipe. It will unblock once we start reading data from the pipe later.
Read a certain amount of data from the named pipe (say, 1 GB) and put it in a file. The dd utility can be used for that.
Upload that file through FTP the standard way. You can then deal with retries and network errors. Once the upload is complete, delete the file.
Go back to step 3 until you get an EOF from the pipe, which means the encryption process has finished writing to it.
On the server, append the files in order to an empty file, deleting each one once it has been appended. Using touch next_file; for f in ordered_list_of_files; do cat "$f" >> next_file; rm "$f"; done or some variant should do it.
You can of course prepare the next file while you upload the previous one to use concurrency at its maximum. The bottleneck will be either your encryption algorithm (CPU), your network bandwidth, or your disk bandwidth.
This method will cost you 2 GB of disk space on the client side (more or less, depending on the size of the chunks), and 1 GB of disk space on the server side. But you can be sure that you will not have to start over if your upload hangs near the end.
If you want to be doubly sure about the result of the transfer, you could compute a hash of your files while writing them to disk on the client side, and only delete each client file once you have verified its hash on the server side. The hash can be computed on the client side at the same time you write the file to disk, using dd ... | tee local_file | sha1sum. On the server side, you would have to compute the hash before doing the cat, and skip the cat if the hash does not match, so I cannot see how to do it without reading the file twice (once for the hash, and once for the cat).
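Putting the client-side steps together, a rough sketch might look like this. It assumes GNU dd (for iflag=fullblock and status=none); the producer command, pipe path, chunk size and FTP URL are placeholders, and error handling is deliberately minimal:

mkfifo /tmp/encrypted.pipe
encrypt_tarball > /tmp/encrypted.pipe &     # hypothetical producer from the question

# Keep one read end open for the whole run so each dd resumes where the last one stopped.
exec 3< /tmp/encrypted.pipe

chunk=0
while :; do
    part=$(printf 'part.%05d' "$chunk")
    # Pull up to 1 GB from the pipe into a local chunk file.
    dd of="$part" bs=1M count=1024 iflag=fullblock status=none <&3
    if [ ! -s "$part" ]; then               # a zero-byte chunk means the producer is done
        rm -f "$part"
        break
    fi
    # Upload the chunk, retrying until it succeeds, then reclaim the local disk space.
    until curl -T "$part" "ftp://user:password@ftp.example.com/upload/$part"; do
        sleep 10
    done
    rm -f "$part"
    chunk=$((chunk + 1))
done
exec 3<&-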
You can write to a remote file using ssh:
program | ssh -l userid host 'cd /some/remote/directory && cat - > filename'
Here is a sample of uploading to an FTP site with curl:
wget -O- http://www.example.com/test.zip | curl -T - ftp://user:password@ftp.example.com:2021/upload/test.zip

Syncing two files when one is still being written to

I have an application (video stream capture) which constantly writes its data to a single file. The application typically runs for several hours, creating a ~1 gigabyte file. Soon (within several seconds) after it quits, I'd like to have 2 copies of the file it was writing - say, one in /mnt/disk1 and another in /mnt/disk2 (the latter is a USB flash drive with a FAT32 filesystem).
I don't really like the idea of modifying the application to write 2 copies simultaneously, so I thought of:
Application starts and begins to write the file (let's call it /mnt/disk1/file.mkv)
Some utility starts, copies what's already there in /mnt/disk1/file.mkv to /mnt/disk2/file.mkv
After reaching the initial sync state, it continues to follow the file being written, in the manner of tail -f, copying everything it gets from /mnt/disk1/file.mkv to /mnt/disk2/file.mkv
Several hours pass
Application quits, we stop our syncing utility
Afterwards, we run a quick rsync /mnt/disk1/file.mkv /mnt/disk2/file.mkv just to make sure they're the same. If they are already identical, it should just run a quick check and quit fairly soon.
What is the best approach for syncing the 2 files, preferably using simple Linux shell-available utilities? Maybe I could use some clever trick with FUSE / an md device / tee / tail -f?
Solution
The best possible solution for my case seems to be
mencoder ... -o >(
tee /mnt/disk1/file.mkv |
tee /mnt/disk2/file.mkv |
mplayer -
)
This one uses bash/zsh-specific magic called "process substitution", eliminating the need to make named pipes manually with mkfifo, and it displays what's being encoded as a bonus :)
Hmmm... the file is not usable while it's being written, so why don't you "trick" your program into writing through a pipe/fifo and use a 2nd, very simple program, to create 2 copies?
This way, you have your two copies as soon as the original process ends.
Read the manual page on tee(1).
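A rough sketch of that idea, assuming the capture program can be pointed at an arbitrary output path (the program name and paths are placeholders):

mkfifo /tmp/capture.pipe

# tee reads the stream from the pipe and writes both copies as the data arrives:
# one via its file argument, the other via redirected stdout.
tee /mnt/disk1/file.mkv > /mnt/disk2/file.mkv < /tmp/capture.pipe &

# Point the capture application at the pipe instead of a regular file.
video_capture_app --output /tmp/capture.pipe     # hypothetical capture command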
