Parallel copy to move a large amount of data in Linux

I have a folder of around 20 TB that needs to be migrated to an NFS location. When I try rsync it takes a very long time, even with parallel rsync. Is there any way to improve the copy rate? I just want to move the entire directory; I don't need ongoing syncing.
I tried the following, but it still takes very long:
rsync -avzm --stats --safe-links --ignore-existing --dry-run --human-readable /<dir>/ /<nfs path> > /tmp/sync.log
cat /tmp/sync.log | parallel --will-cite -j 5 rsync -avzm --relative --stats --safe-links --ignore-existing --human-readable {} /<nfs path (dest dir)> > /tmp/result.log
I think it is overwriting files and taking even longer than a plain rsync.
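One pattern worth trying (a sketch only; /srcdir and /nfs/dest stand in for your real paths) is one rsync per top-level subdirectory, driven by GNU parallel, and without -z, since compression rarely helps on a local or NFS copy:
# run one rsync per top-level subdirectory, at most 5 at a time
find /srcdir -mindepth 1 -maxdepth 1 -type d -print0 | \
  parallel -0 --will-cite -j 5 \
    rsync -a --ignore-existing --human-readable {} /nfs/dest/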

Related

How Can I Accelerate rsync with Simultaneous/Concurrent File Transfers?

We need to move 150TB of data from one RHEL server to another as quickly as possible. We're currently using rsync, which takes a long time.
So I'd like to run an rsync for each of the directories in /depots/, up to a maximum of, say, 5 at a time: 5 rsyncs would run for /depots/depot1, /depots/depot2, /depots/depot3, /depots/depot4, and /depots/depot5.
We are currently using:
ls /depots | xargs -n1 -P4 -I% rsync -Pa % targetremoteserver.com:/depots/
It's not working in my case. Is there a way we can speed up the synchronization process?
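A hedged variant of the same per-directory idea using GNU parallel (find produces full paths, so it works regardless of the current directory; the host name is the question's placeholder):
# one rsync per directory under /depots, at most 5 running at once
find /depots -mindepth 1 -maxdepth 1 -type d -print0 | \
  parallel -0 -j 5 rsync -Pa {} targetremoteserver.com:/depots/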

WGET - Simultaneous connections are SLOW

I use the following command to append the responses from a list of URLs to a single output file:
wget -i /Applications/MAMP/htdocs/data/urls.txt -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This works fine and when finished it says:
Total wall clock time: 1h 49m 32s
Downloaded: 9999 files, 3.5M in 0.3s (28.5 MB/s)
In order to speed this up I used:
cat /Applications/MAMP/htdocs/data/urls.txt | \
tr -d '\r' | \
xargs -P 10 $(which wget) -i - -O - \
>> /Applications/MAMP/htdocs/data/export.txt
Which opens simultaneous connections making it a little faster:
Total wall clock time: 1h 40m 10s
Downloaded: 3943 files, 8.5M in 0.3s (28.5 MB/s)
As you can see, it somehow omits more than half of the files and takes approximately the same time to finish. I cannot guess why. What I want to do here is download 10 files at once (parallel processing) using xargs and move on to the next URL when one finishes. Am I missing something, or can this be done some other way?
On the other hand, can someone tell me what limit can be set on the number of connections? It would really help to know how many connections my machine can handle without slowing down my system too much or even risking some kind of system failure.
My API rate limits are as follows:
Number of requests per minute: 100
Number of mapping jobs in a single request: 100
Total number of mapping jobs per minute: 10,000
Have you tried GNU Parallel? It will be something like this:
parallel -a /Applications/MAMP/htdocs/data/urls.txt wget -O - > result.txt
You can use this to see what it will do without actually doing anything:
parallel --dry-run ...
And either of these to see progress:
parallel --progress ...
parallel --bar ...
As your input file seems to be a bit of a mess, you can strip carriage returns like this:
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | parallel wget {} -O - > result.txt
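Given the rate limits quoted in the question, one hedged refinement is GNU parallel's --delay together with -j; the 0.6-second spacing below is simply 100 job starts per minute, derived from that limit rather than from anything in parallel itself:
# at most 10 concurrent wget jobs, no more than ~100 new jobs per minute
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | \
  parallel -j 10 --delay 0.6 wget -O - {} \
  >> /Applications/MAMP/htdocs/data/export.txt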
A few things:
I don't think you need the tr, unless there's something weird about your input file. xargs expects one item per line.
man xargs advises you to "Use the -n option with -P; otherwise chances are that only one exec will be done."
You are using wget -i - telling wget to read URLs from stdin. But xargs will be supplying the URLs as parameters to wget.
To debug, substitute echo for wget and check how it's batching the parameters (see the sketch after this list).
So this should work:
cat urls.txt | \
xargs --max-procs=10 --max-args=100 wget --output-document=-
(I've used the long options; --max-procs is -P and --max-args is -n.)
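To see how xargs batches the arguments before letting wget loose, the echo substitution suggested above might look like this (each output line is one batch of up to 100 URLs that would go to a single wget call):
cat urls.txt | xargs --max-procs=10 --max-args=100 echo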
See wget download with multiple simultaneous connections for alternative ways of doing the same thing, including GNU parallel and some dedicated multi-threading HTTP clients.
However, in most circumstances I would not expect parallelising to significantly increase your download rate.
In a typical use case, the bottleneck is likely to be your network link to the server. During a single-threaded download, you would expect to saturate the slowest link in that route. You may get very slight gains with two threads, because one thread can be downloading while the other is sending requests. But this will be a marginal gain.
So this approach is only likely to be worthwhile if you're fetching from multiple servers, and the slowest link in the route to some servers is not at the client end.

multiple wget -r a site simultaneously?

Is there a command, or wget with some options, for multithreaded, recursive, simultaneous download of a site?
I found a decent solution.
Read original at http://www.linuxquestions.org/questions/linux-networking-3/wget-multi-threaded-downloading-457375/
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
copied as many times as you deem fitting to have as many processes downloading. This isn't as elegant as a properly multithreaded app, but it will get the job done with only a slight amount of overhead.
The key here is the "-N" switch. This means transfer the file only if it is newer than what's on the disk. This will (mostly) prevent each process from downloading a file a different process has already downloaded; instead it skips that file and downloads what some other process hasn't downloaded yet. It uses the timestamp as a means of doing this, hence the slight overhead.
It works great for me and saves a lot of time. Don't run too many processes, as this may saturate the web site's connection and tick off the owner. Keep it to a max of 4 or so. However, the number is only limited by CPU and network bandwidth on both ends.
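A hedged loop version of the same trick, so the command isn't pasted by hand (the URL is a placeholder):
URL="http://example.com/"   # placeholder for the site to mirror
# launch 4 background wget mirrors; -N makes each process skip files
# another process has already fetched
for i in 1 2 3 4; do wget -r -np -N "$URL" & done
wait   # block until all background downloads finish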
That said, the solution below, which runs wget in parallel via xargs, seems much better:
https://stackoverflow.com/a/11850469/1647809
Use axel to download with multi connections
apt-get install axel
axel http://example.com/file.zip
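axel can also open several connections for a single file; -n sets how many (the URL is just an example):
# download one file over 10 parallel connections
axel -n 10 http://example.com/file.zip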
Well, you can always run multiple instances of wget, no?
Example:
wget -r http://somesite.example.org/ &
wget -r http://othersite.example.net/ &
etc. This syntax will work in any Unix-like environment (e.g. Linux or MacOS); not sure how to do this in Windows.
Wget itself does not support multithreaded operations - at least, neither the manpage nor its website has any mention of this. Anyway, since wget supports HTTP keepalive, the bottleneck is usually the bandwidth of the connection, not the number of simultaneous downloads.

Alternative to scp, transferring files between linux machines by opening parallel connections

Is there an alternative to scp for transferring a large file from one machine to another by opening parallel connections, and that can also pause and resume the download?
Please don't transfer this to serverfault.com. I am not a system administrator. I am a developer trying to transfer past database dumps between backup hosts and servers.
Thank you
You could try using split(1) to break the file apart and then scp the pieces in parallel. The file could then be combined into a single file on the destination machine with 'cat'.
# on local host
split -b 1M large.file large.file. # split into 1MiB chunks
for f in large.file.*; do scp "$f" remote_host: & done
# on remote host
cat large.file.* > large.file
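A hedged variant that caps the number of concurrent scp processes instead of backgrounding them all at once (remote_host is the placeholder from the answer above):
# copy the chunks with at most 4 scp processes running at a time
ls large.file.* | xargs -n1 -P4 -I{} scp {} remote_host:
# after reassembly, compare checksums (e.g. sha256sum large.file) on both ends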
Take a look at rsync to see if it will meet your needs.
The correct placement of questions is not based on your role, but on the type of question. Since this one is not strictly programming related it is likely that it will be migrated.
Similar to Mike K's answer, check out https://code.google.com/p/scp-tsunami/ - it handles splitting the file, starting several scp processes to copy the parts and then joins them again...it can also copy to multiple hosts...
./scpTsunami.py -v -s -t 9 -b 10m -u dan bigfile.tar.gz /tmp -l remote.host
That splits the file into 10MB chunks and copies them using 9 scp processes...
The program you are after is lftp. It supports sftp and parallel transfers using its pget command. It is available under Ubuntu (sudo apt-get install lftp) and you can read a review of it here:
http://www.cyberciti.biz/tips/linux-unix-download-accelerator.html
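A minimal sketch of lftp's segmented download over sftp (user, host and file path are placeholders):
# fetch one file in 8 segments; -c lets an interrupted transfer resume
lftp -e 'pget -n 8 -c /backups/dump.sql.gz; quit' sftp://user@backup.host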

Remote linux server to remote linux server large sparse files copy - How To?

I have two twins CentOS 5.4 servers with VMware Server installed on each.
What is the most reliable and fastest method for copying virtual machine files from one server to the other, given that I always use sparse files for my VMware virtual machines?
The VM files are a pain to copy since they are very large (50 GB), but since they are sparse files I think something can be done to improve the copy speed.
If you want to copy large data quickly, rsync over SSH is not for you. As running an rsync daemon for quick one-shot copying is also overkill, yer olde tar and nc do the trick as follows.
Create the process that will serve the files over network:
tar cSf - /path/to/files | nc -l 5000
Note that it may take tar a long time to examine sparse files, so it's normal to see no progress for a while.
And receive the files with the following at the other end:
nc hostname_or_ip 5000 | tar xSf -
Alternatively, if you want to get all fancy, use pv to display progress:
tar cSf - /path/to/files \
| pv -s `du -sb /path/to/files | awk '{ print $1 }'` \
| nc -l 5000
Wait a little until you see that pv reports that some bytes have passed by, then start the receiver at the other end:
nc hostname_or_ip 5000 | pv -btr | tar xSf -
Have you tried rsync with the --sparse option (possibly over ssh)?
From man rsync:
Try to handle sparse files efficiently so they take up less space on the destination. Conflicts with --inplace because it's not possible to overwrite data in a sparse fashion.
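For example (a sketch with placeholder paths and host), over ssh that might look like:
# -a archive mode, -S handle sparse files efficiently, -P progress + keep partial files
rsync -aSP /var/lib/vmware/vm1/ user@other-server:/var/lib/vmware/vm1/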
Since rsync is terribly slow at copying sparse files, I usually resort to using tar over ssh:
tar Scjf - my-src-files | ssh sylvain@my.dest.host tar Sxjf - -C /the/target/directory
You could have a look at http://www.barricane.com/virtsync
(Disclaimer: I am the author.)
