Best way to copy a complex directory structure of files - Linux

As we know, we can use rsync, tar or cp to transfer directories or files on an NFS file system or between servers.
I'm just curious which way is best for transferring a complex directory tree.
Or would writing a customized C program be a simple and fast way to copy the tree, or to recreate the directory structure as a tree of links?

The best way is to use rsync. It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files at the destination. rsync is widely used for backups and mirroring, and as an improved copy command for everyday use.
The tar command, by its own description, is "designed to store and extract files from an archive file known as a tarfile", not to transfer data between servers. And cp is simpler than rsync: it just replaces all the files at the destination.
As for a customized C program, I think it is difficult to optimize something that has already been optimized.
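A minimal sketch of that everyday use, with a placeholder source directory and host name:

rsync -aH --numeric-ids /srv/data/ backupserver:/srv/data/            # preserves permissions, owners, times, symlinks and hard links
rsync -aH --numeric-ids --delete /srv/data/ backupserver:/srv/data/   # re-run later: only changed files are sent, removed files are pruned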

Related

Strategy for compressing and navigating large compressed directories

I manage a computer cluster. It is a multi-user system. I have a large directory filled with files (terabytes in size). I'd like to compress it so the user who owns it can save space and still be able to extract files from it.
Challenges with possible solutions:
tar: The directory's size makes it challenging to decompress the resulting tarball because of tar's poor random-access reads. I'm referring to the canonical way of compressing, i.e. tar cvzf mytarball.tar.gz mybigdir
squashfs: It appears that this would be a great solution, except that mounting it requires root access. I don't really want to be involved in mounting their squashfs file every time they want to access a file.
Compress then tar: I could compress the files first and then use tar to create the archive. This would have the disadvantage that I wouldn't save as much space with compression, and I wouldn't get back any inodes.
Similar questions (here) have been asked before, but the solutions are not appropriate in this case.
QUESTION:
Is there a convenient way to compress a large directory such that it is quick and easy to navigate and doesn't require root permissions?
You add zip in the tags but do not mention it in the question. For me, zip is the simplest way to manage big archives with many files: it compresses each member independently, so individual files can be listed and extracted quickly without touching the rest of the archive. Moreover, tar+gzip is really a two-step operation that needs special tricks to speed up. And zip is available for a lot of platforms, so you win in that direction too.
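A minimal sketch of that workflow, with placeholder directory and file names:

zip -r mybigdir.zip mybigdir                     # build the archive; no root needed
unzip -l mybigdir.zip                            # browse the contents
unzip mybigdir.zip mybigdir/sub/one-file.txt     # extract a single entry without unpacking the rest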

Rsync with .dd.xz files

I am trying different ways to update/write an image on a Linux device, and I am using rsync for this.
For file-system synchronization, rsync checks and only transfers missing/changed files, reducing the bandwidth used.
In a similar way, I created a 10 MB binary file (original.bin) and modified it with a few changes (modified.bin), then tried to rsync original.bin. The first time it transfers the whole file, as there is no copy on the device. Next, modified.bin is renamed to original.bin and rsynced again; this time only the changes are transferred. I want to know if this is the same with .dd.xz files as well. I have two .dd.xz files (image1.dd.xz and image2.dd.xz, which has a few DLLs and Mono packages added), and when these files are extracted to .dd files, rsync transfers only the changes.
But when I rsync the files as .dd.xz, it transfers the whole file again. Can someone help me understand whether this is expected behaviour, or whether rsync treats .dd files the same as any other text files?
xz is the extension used by the xz compression tool. Compressed files don't work well with rsync's delta transfer, because even a small change in the input typically changes most of the compressed output, leaving little unchanged data for rsync to match.
Consider whether you're better off rsyncing the dd images without compressing them. You can (de)compress them faster using the pixz command, which does its job in parallel using all available processors.
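A rough sketch of how to see the difference, with a hypothetical device host; --stats reports how many bytes rsync actually sent:

rsync -av --stats image1.dd    device:/images/     # delta transfer applies when an older copy already exists there
rsync -av --stats image1.dd.xz device:/images/     # after a small source change the whole file is resent

pixz    < image1.dd    > image1.dd.xz     # parallel compress, assuming pixz is installed and used as a stdin/stdout filter
pixz -d < image1.dd.xz > image1.dd        # parallel decompress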

convert tar to zip with stdout/stdin

How can I convert a tar file to zip using stdout/stdin?
zip -@ takes a list of file names from stdin; tar -t provides that, but doesn't actually extract the files. Using -xv gives me a list of files and extracts them to disk, but I'm trying to avoid touching the disk.
Something like the following, but obviously the command below will not produce the same file structure.
tar -xf somefile.tar -O | zip somefile.zip -
I can do it by temporarily writing files to disk but I'm trying to avoid that - I'd like to use pipes only.
Basically, I'm not aware of a "standard" utility that does the conversion on the fly.
But a similar question has been discussed over here using Python libraries; the script has a simple structure, so you can probably adapt it to your requirements.
I've never heard of a tool that performs the conversion directly. As a workaround, if your files are not extremely huge, you can extract the tar archive into a tmpfs-mounted directory (or similar), that is, a filesystem that resides completely in memory and never touches the disk. It should be much faster than extracting into a directory on disk.
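A minimal sketch of that workaround, assuming /dev/shm is a tmpfs mount (as on most Linux systems) and using the placeholder file names from the question:

mkdir -p /dev/shm/tar2zip
tar -xf somefile.tar -C /dev/shm/tar2zip              # unpack into RAM-backed storage, not the disk
(cd /dev/shm/tar2zip && zip -qr - .) > somefile.zip   # re-pack to stdout and capture it as the zip
rm -rf /dev/shm/tar2zip                               # free the memory again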

How to use rsync instead of scp in my below shell script to copy the files?

I am copying files in parallel using GNU parallel with the shell script below, and it is working fine.
I am not sure how I can use rsync in place of scp in this script. I am trying to see whether rsync gives better performance than scp in terms of transfer speed.
Below is my problem description -
I am copying the files from machineB and machineC into machineA, as I am running the shell script below on machineA.
If a file is not on machineB, then it should definitely be on machineC, so I will try copying the file from machineB first; if it is not there, I will try copying the same file from machineC.
I am copying the files in parallel using the GNU Parallel library, and it is working fine. Currently I am copying five files in parallel for both PRIMARY and SECONDARY.
Below is my shell script which I have -
#!/bin/bash
export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary
readonly FILERS_LOCATION=(machineB machineC)
export FILERS_LOCATION_1=${FILERS_LOCATION[0]}
export FILERS_LOCATION_2=${FILERS_LOCATION[1]}
PRIMARY_PARTITION=(550 274 2 546 278) # this will have more file numbers
SECONDARY_PARTITION=(1643 1103 1372 1096 1369 1568) # this will have more file numbers
export dir3=/testing/snapshot/20140103
do_Copy() {
  el=$1
  PRIMSEC=$2
  scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
}
export -f do_Copy
parallel --retries 10 -j 5 do_Copy {} $PRIMARY ::: "${PRIMARY_PARTITION[@]}" &
parallel --retries 10 -j 5 do_Copy {} $SECONDARY ::: "${SECONDARY_PARTITION[@]}" &
wait
echo "All files copied."
Is there any way to replace the scp command above with rsync, while still copying 5 files in parallel for both PRIMARY and SECONDARY simultaneously?
rsync is designed to efficiently synchronise two hierarchies of folders and files.
While it can be used to transfer individual files, it won't help you very much used like that, unless you already have a version of the file at each end with small differences between them. Running multiple instances of rsync in parallel on individual files within a hierarchy defeats the purpose of the tool.
While triplee is right that your task is I/O-bound rather than CPU-bound, and so parallelizing the tasks won't help in the typical case whether you're using rsync or scp, there is one circumstance in which parallelizing network transfers can help: if the sender is throttling requests. In that case, there may be some value to running an instance of rsync for each of a number of different folders, but it would complicate your code, and you'd have to profile both solutions to discover whether you were actually getting any benefit.
In short: just run a single instance of rsync; any performance increase you're going to get from another approach is unlikely to be worth it.
You haven't really given us enough information to know if you are on a sensible path or not, but I suspect you should be looking at lsyncd or possibly even GlusterFS. These are different from what you are doing in that they are continuous sync tools rather than periodically run, though I suspect that you could run lsyncd periodically if that's what you really want. I haven't tried out lsyncd 2.x yet, but I see that they've added parallel synchronisation processes. If your actual scenario involves more than just the three machines you've described, it might even make sense to look at some of the peer-to-peer file sharing protocols.
In your current approach, unless your files are very large, most of the delay is likely to be associated with the overhead of setting up connections and authenticating them. Doing that separately for every single file is expensive, particularly over an ssh-based protocol. You'd be better off breaking your file list into batches and passing those batches to your copying mechanism. Whether you use rsync for that is likely to be of lesser importance, but if you first construct a list of files for an rsync process to handle, then you can pass it to rsync with the --files-from option, as sketched below.
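A rough sketch of that batching idea against the script in the question (same placeholder user, hosts and variables); the list is built once and handed to a single rsync connection:

printf 'new_weekly_2014_%s_200003_5.data\n' "${PRIMARY_PARTITION[@]}" > /tmp/primary.list
rsync -av --files-from=/tmp/primary.list david@$FILERS_LOCATION_1:$dir3/ $PRIMARY/    # one connection fetches the whole batch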
You want to make sense of what the limiting factor is in your sync speed. Presumably it's one of Network bandwidth, Network latency, File IO, or perhaps CPU (checksumming or compression, but probably only if you have low end hardware).
It's likely also important to know something about the pattern of changes in files from one synchronisation run to another. Are there many unchanged files from the previous run? Do existing files change? Do those changes leave a significant number of blocks unchanged (eg database files), or only get appended (eg log files)? Can you safely count on metadata like file modification times and sizes to identify what's changed, or do you need to checksum the entire content?
Is your file content compressible? E.g. if you're copying plain text, you probably want to use compression options in scp or rsync, but if you have already-compressed image or video files, then compressing again would only slow you down. rsync is mostly helpful if you have files where just part of the file changes.
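Those compression switches, as a quick sketch with placeholder paths:

rsync -avz textfiles/ user@host:backup/    # -z compresses the stream in transit; good for text
scp -C notes.txt user@host:backup/         # -C turns on ssh-level compression
# For already-compressed image, video or .xz files, leave these flags off.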
You can download single files with rsync just as you would with scp. Just make sure not to use the rsync:// or hostname::path formats that call the daemon.
It can at the very least make the two remote hosts work at the same time. Additionally, if the files are on different physical disks or happen to be in cache, parallelizing the transfers even on a single host can help. That's why I disagree with the other answer saying a single instance is necessarily the way to go.
I think you can just replace
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
by
rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
(note that the change is not only the command)
Perhaps you can get additional speed because rsync will use its delta-transfer algorithm, whereas scp blindly copies the whole file every time.
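Put into the script's do_Copy function, that substitution would look roughly like this (same placeholder user and variables as in the question):

do_Copy() {
  el=$1
  PRIMSEC=$2
  rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data \
    || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
}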

Checking the integrity of a copied folder

I am copying a big folder (300 GB) onto an external hard drive. I want to make sure the copied files are complete and not corrupted before deleting the originals. How can I do that in Ubuntu?
You could use rsync --checksum to check the files, or simply use sha256sum or similar to check them manually. Using rsync is in my opinion more comfortable because it automatically checks recursively, but that largely depends on your use case.
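Two rough sketches with placeholder paths; the first is a dry run, so nothing is modified, and the second compares checksum lists by hand:

rsync -rcn --itemize-changes /data/bigfolder/ /media/external/bigfolder/   # output lines are files whose content differs or is missing

(cd /data/bigfolder && find . -type f -exec sha256sum {} + | sort -k2) > /tmp/src.sha
(cd /media/external/bigfolder && find . -type f -exec sha256sum {} + | sort -k2) > /tmp/dst.sha
diff /tmp/src.sha /tmp/dst.sha    # no output means every file hashed identically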
If you really require absolute integrity, you should also consider using an error-correcting code. Hard drives don't keep data intact forever; a bit might flip from time to time.
