Redirecting multiple stdouts to single file - linux

I have a program running on multiple machines with NFS and I'd like to log all their outputs into a single file. Can I just run ./my_program >> filename on every machine or is there an issue with concurrency I should be aware of? Since I'm only appending, I don't think there would be a problem, but I'm just trying to make sure.

That could work, but yes, you will have concurrency issues with it, and the log file will be basically indecipherable.
What I would recommend is a separate log file for each machine, named after the machine, and then on some periodic basis (say nightly) concatenate them together:
for i in /path/to/logfiles/*; do
    echo "Machine: $i"
    cat "$i"
done > filename.log
That should give you some ideas, I think.

The NFS protocol does not support atomic append writes, so append writes are never atomic on NFS for any platform. Files WILL end up corrupt if you try.
When appending to a file from multiple threads or processes, the writes to that file are only atomic if the file was opened in append mode, the data written does not exceed the filesystem block size, and the filesystem is local. The last condition is exactly what NFS does not satisfy.
There is a workaround, although I would not know how to do it from a shell script. The technique is called close-to-open cache consistency.
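A minimal sketch of the safer per-machine approach suggested above (the log directory path is a placeholder): each machine appends only to its own file, so no two writers ever share a file, and a nightly job like the loop above merges them.
./my_program >> /path/to/logfiles/"$(hostname)".log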

Related

will io direction operation lock the file?

I have a growing nginx log file, about 20 GB already, and I wish to rotate it.
1. I mv the old log file to a new log file.
2. I run > old_log_file.log to truncate the old log file, which takes about 2~3 seconds.
Is there a lock (a write lock?) on the old log file while I am doing the truncating (those 2~3 seconds)?
During that 2~3 s period, does nginx return 502 because it has to wait to append logs to the old log file until the lock is released?
Thank you for explaining.
On Linux, there are (almost) no mandatory file locks (more precisely, there used to be a mandatory locking feature in the kernel, but it is deprecated and you really should avoid using it). File locking happens with flock(2) or lockf(3), is advisory, and has to be explicit (e.g. with the flock(1) command, or some program calling flock or lockf).
So any locking related to files is in practice a convention between all the software using that file (and mv(1) and the redirection done by your shell don't use file locking).
Remember that a file on Linux is mostly an i-node (see inode(7)) which could have zero, one or several file paths (see path_resolution(7), and be aware of link(2), rename(2), unlink(2)) and is used through some file descriptor. Read ALP (and perhaps Operating Systems: Three Easy Pieces) for more.
No file locking happens in the scenario of your question (and the i-nodes and file descriptors involved are independent).
Consider using logrotate(8).
Some software provides a way to reload its configuration and re-open its log files. You should read the documentation for your nginx.
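For nginx specifically, the usual rotation sequence avoids truncation entirely: rename the file, then ask the nginx master process to reopen its logs. A sketch (the paths are common defaults and may differ on your system):
mv /var/log/nginx/access.log /var/log/nginx/access.log.old
kill -USR1 "$(cat /var/run/nginx.pid)"   # nginx re-opens its log files on USR1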
It depends on the application whether it locks the file. The application that generates this log file must provide an option to clear or re-open its log file. One example: in an editor like vim, a file can be modified externally while it is still open in the editor.

Opening syslog files safely

I have a syslog server running syslog-ng (soon to be running rsyslog) on RHEL 6 with almost 1,000 hosts logging to it. I want to write a script to open each file (read-only and one at a time), pull some data from them and close them (probably in Ruby). Will this mess up syslog or cause any other issues? What other pitfalls might I need to be aware of?
My main worry is syslog trying to write data to a file that I have open, even though I may only have it open for a very short amount of time (maybe less than a second).
Possible pseudocode:
for file in /path/to/syslog/files/*; do   # placeholder path
    grep "search string" "$file"          # grep opens and closes the file itself
done
If the daemon only appends to the file, I think there's no risk. It's deleting/truncating I would worry about. I think it depends on the configuration of your daemon (not so much on which daemon is used), but I would assume most configs don't actually remove data.
Deleting is typically done with logrotate and other tools. I would look at these tools' configs (e.g. make them run at a different time than the script you're mentioning).

How to know if a file in a shared disk is in use by another Linux instance using bash?

I need to know if there is a way to tell whether a file on a shared disk is in use by another Linux instance.
I have 2 Linux machines sharing a disk. At random times, the first machine writes a results file (with a consecutive filename) to the shared disk when an analysis process is done.
On the other machine I have a bash script verifying whether the file has already been finished by the first machine.
The way I verify this now is a for loop in the bash script running the stat command to check whether the last-modified date of the file is later than the current date of the machine. If it is, I can process the file. If not, I sleep and then run stat again.
So, is there any way to avoid this and to know whether the file on the shared disk is in use by another machine? Or what is the best way to wait for the finished file?
Thanks in advance.
Write the result file into the same directory with a temporary name. Only rename it to its final name after closing the file under its temporary name, ensuring that contents are flushed. Thus, if the file exists under its final name, it is guaranteed to be complete.
This needs to be into the same directory because NFS renames are only guaranteed to be atomic within a single directory (whereas in non-NFS scenarios, a location anywhere on the same filesystem would work).
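A minimal sketch of that pattern on both sides (the filenames and the processing commands are placeholders):
# writer (first machine): write under a temporary name, then rename atomically
run_analysis > /shared/results/run_0042.tmp        # placeholder command and filename
mv /shared/results/run_0042.tmp /shared/results/run_0042.dat
# reader (second machine): a file that exists under its final name is complete
while [ ! -e /shared/results/run_0042.dat ]; do
    sleep 5
done
process_file /shared/results/run_0042.dat          # placeholder command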
You can try using flock as demonstrated here: How to prevent a script from running simultaneously?. There are other ways to synchronize file access as well (lockfile, mkdir, just to name a few). A simple Google search should give you what you need. I'm not 100% sure if this is appropriate for your setup with the shared disk, though.
Kind of the same approach as comparing the date:
Get the file size twice and compare:
STAT1=$(du "$FILE" | awk '{print $1}')
STAT2=$(du "$FILE" | awk '{print $1}')
[ "$STAT1" -ne "$STAT2" ] \
  && echo "writing to..." \
  || echo "FINISHED"
If you expect heavy I/O and iowaits, put a sleep 1 between the two stats.
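Wrapped in a wait loop, the same idea looks like this (a sketch; $FILE is a placeholder for the result file):
while :; do
    [ -e "$FILE" ] || { sleep 1; continue; }   # wait until the file appears at all
    STAT1=$(du "$FILE" | awk '{print $1}')
    sleep 1
    STAT2=$(du "$FILE" | awk '{print $1}')
    [ "$STAT1" -eq "$STAT2" ] && break         # size stopped growing
done
echo "FINISHED"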

How to use rsync instead of scp in my below shell script to copy the files?

I am using scp to copy files in parallel with GNU parallel in my shell script below, and it is working fine.
I am not sure how I can use rsync in place of scp in the script. I am trying to see whether rsync gives better performance than scp in terms of transfer speed.
Below is my problem description:
I am copying files from machineB and machineC into machineA, and I am running the shell script below on machineA.
If a file is not on machineB then it will definitely be on machineC, so I first try copying it from machineB and, if that fails, I try copying the same file from machineC.
I am copying the files in parallel using the GNU Parallel library and it is working fine. Currently I am copying five files in parallel, both for PRIMARY and SECONDARY.
Below is the shell script which I have:
#!/bin/bash
export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary
readonly FILERS_LOCATION=(machineB machineC)
export FILERS_LOCATION_1=${FILERS_LOCATION[0]}
export FILERS_LOCATION_2=${FILERS_LOCATION[1]}
PRIMARY_PARTITION=(550 274 2 546 278) # this will have more file numbers
SECONDARY_PARTITION=(1643 1103 1372 1096 1369 1568) # this will have more file numbers
export dir3=/testing/snapshot/20140103
do_Copy() {
    el=$1
    PRIMSEC=$2
    scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
}
export -f do_Copy
parallel --retries 10 -j 5 do_Copy {} $PRIMARY ::: "${PRIMARY_PARTITION[@]}" &
parallel --retries 10 -j 5 do_Copy {} $SECONDARY ::: "${SECONDARY_PARTITION[@]}" &
wait
echo "All files copied."
Is there any way of replacing the scp command above with rsync, while still copying 5 files in parallel for both PRIMARY and SECONDARY simultaneously?
rsync is designed to efficiently synchronise two hierarchies of folders and files.
While it can be used to transfer individual files, it won't help you very much used like that, unless you already have a version of the file at each end with small differences between them. Running multiple instances of rsync in parallel on individual files within a hierarchy defeats the purpose of the tool.
While triplee is right that your task is I/O-bound rather than CPU-bound, and so parallelizing the tasks won't help in the typical case whether you're using rsync or scp, there is one circumstance in which parallelizing network transfers can help: if the sender is throttling requests. In that case, there may be some value to running an instance of rsync for each of a number of different folders, but it would complicate your code, and you'd have to profile both solutions to discover whether you were actually getting any benefit.
In short: just run a single instance of rsync; any performance increase you're going to get from another approach is unlikely to be worth it.
You haven't really given us enough information to know if you are on a sensible path or not, but I suspect you should be looking at lsyncd or possibly even GlusterFS. These are different from what you are doing in that they are continuous sync tools rather than periodically run, though I suspect that you could run lsyncd periodically if that's what you really want. I haven't tried out lsyncd 2.x yet, but I see that they've added parallel synchronisation processes. If your actual scenario involves more than just the three machines you've described, it might even make sense to look at some of the peer-to-peer file sharing protocols.
In your current approach, unless your files are very large, most of the delay is likely to be associated with the overhead of setting up connections and authenticating them. Doing that separately for every single file is expensive, particularly over an ssh-based protocol. You'd be better off breaking your file list into batches and passing those batches to your copying mechanism. Whether you use rsync for that is likely to be of lesser importance, but if you first construct a list of files for an rsync process to handle, you can pass it to rsync with the --files-from option.
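A minimal sketch of that batching idea, reusing the host, path and file-number array from the question (one rsync call per remote host instead of one per file):
# build the list of filenames for the primary partition
printf 'new_weekly_2014_%s_200003_5.data\n' "${PRIMARY_PARTITION[@]}" > /tmp/primary_files.txt
# one rsync fetches the whole batch from machineB; list entries are relative to the remote directory
rsync -av --files-from=/tmp/primary_files.txt david@machineB:/testing/snapshot/20140103/ /test01/primary/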
You need to work out what the limiting factor is in your sync speed. Presumably it's one of network bandwidth, network latency, file I/O, or perhaps CPU (checksumming or compression, but probably only if you have low-end hardware).
It's likely also important to know something about the pattern of changes in files from one synchronisation run to another. Are there many unchanged files from the previous run? Do existing files change? Do those changes leave a significant number of blocks unchanged (e.g. database files), or do the files only get appended to (e.g. log files)? Can you safely count on metadata like file modification times and sizes to identify what's changed, or do you need to checksum the entire content?
Is your file content compressible? E.g. if you're copying plain text, you probably want to use compression options in scp or rsync, but if you have already-compressed image or video files, then compressing again would only slow you down. rsync is mostly helpful if you have files where just part of the file changes.
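For instance (a sketch; the host, path and file number are taken from the question, and the flags are the standard compression switches):
scp -C david@machineB:/testing/snapshot/20140103/new_weekly_2014_550_200003_5.data /test01/primary/     # -C compresses in transit
rsync -avz david@machineB:/testing/snapshot/20140103/new_weekly_2014_550_200003_5.data /test01/primary/  # -z compresses in transit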
You can download single files with rsync just as you would with scp. Just make sure not to use the rsync:// or hostname::path formats that call the daemon.
It can at the very least make the two remote hosts work at the same time. Additionally, if the files are on different physical disks or happen to be in cache, parallelizing them even on a single host can help. That's why I disagree with the other answer saying a single instance is necessarily the way to go.
I think you can just replace
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
by
rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
(note that the change is not only the command)
Perhaps you can get additional speed because rsync will use the delta-transfer algorithm, compared to scp which will blindly copy.

What happens if there are too many files under a single directory in Linux?

If there are like 1,000,000 individual files (mostly 100k in size) in a single directory, flatly (no other directories and files in them), is there going to be any compromises in efficiency or disadvantages in any other possible ways?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to say "too many arguments". Utilities that do some kind of globbing (and the shell itself) will have some functionality break.
If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on how much argument data can be passed to a program's entry point. So, let's take rm and the example rm -rf *: the shell is going to expand * into a space-delimited list of files, which in turn becomes the arguments to rm.
The same thing is going to happen with ls and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
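A quick way to see the limit, and to work around it by keeping the expansion out of the shell (the path is a placeholder):
getconf ARG_MAX                                              # the limit on this system, in bytes
find /path/to/dir -maxdepth 1 -name 'foo*' -exec rm -- {} +  # find batches the arguments itself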
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al. problems. That basically means that wildcarding (*) does not always work as expected anymore. This only matters if you really want to perform an operation on all the files at once.
Without dir_index however, you are really screwed :-D
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have this dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with partition for which you want to activate it.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
    aaavnj78t93ufjw4390
    aaavoj78trewrwrwrwenjk983
    aaaz84390842092njk423
    ...
abc/
    abckhr89032423
    abcnjjkth29085242nw
    ...
...
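A small sketch of how such a layout could be created from an existing flat directory (run inside that directory; the 3-character prefix length is an assumption):
for f in *; do
    [ -f "$f" ] || continue    # skip anything that is not a regular file
    d=${f:0:3}                 # first three characters of the name
    mkdir -p "$d"
    mv -- "$f" "$d/"
done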
The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit is reached (the time taken to read the output of ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into sub-folders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to run ls.
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present till first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (using -wholename in case you want to do partial matches):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Then read from the file containing the filenames and process all the files (with limited parallelism):
while read -r line; do
    if test "$(jobs | wc -l)" -ge 10; then
        wait -n                            # wait for one of the background jobs to finish
    fi
    {
        # do something with "$line" here, with up to 10x fan-out;
        # simply reading the file is enough to fault it in from the snapshot:
        cat "$line" > /dev/null
    } &
done < filenames.txt
Posting here in case anyone finds the specific work-around useful when working with too many files.
