multiple wget -r a site simultaneously? - multithreading

Is there any command, or a set of wget options, that downloads a site recursively over several simultaneous (multithreaded) connections?

I found a decent solution.
Read original at http://www.linuxquestions.org/questions/linux-networking-3/wget-multi-threaded-downloading-457375/
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
copied as many times as you deem fitting, to have as many processes
downloading. This isn't as elegant as a properly multithreaded app,
but it will get the job done with only a slight amount of overhead.
The key here is the "-N" switch, which means: transfer a file only
if it is newer than what's on the disk. This will (mostly) prevent
each process from re-downloading a file that a different process has
already downloaded; it skips that file and moves on to one that no
other process has fetched yet. It uses the time stamp as a means of
doing this, hence the slight overhead.
It works great for me and saves a lot of time. Don't have too many
processes as this may saturate the web site's connection and tick off
the owner. Keep it around a max of 4 or so. However, the number is
only limited by CPU and network bandwidth on both ends.
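The repeated command lines above can be wrapped in a small helper. This is only a sketch of the same trick: the function name `parallel_mirror` and the `WGET` override variable are made up for illustration, and the wget flags are exactly the ones quoted in the post.

```shell
#!/bin/sh
# Launch several identical recursive wget processes and wait for all of them.
# WGET is a hypothetical override: set WGET=echo to preview the commands
# without downloading anything.
parallel_mirror() {
    url=$1
    count=${2:-4}           # the post suggests keeping this around 4
    i=0
    while [ "$i" -lt "$count" ]; do
        "${WGET:-wget}" -r -np -N "$url" &
        i=$((i + 1))
    done
    wait                    # block until every background process has exited
}
```

For example, `WGET=echo parallel_mirror http://example.com/ 3` prints the three command lines that would run, and dropping the `WGET=echo` prefix starts the real downloads.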

Running wget in parallel through xargs looks like a much better solution:
https://stackoverflow.com/a/11850469/1647809

Use axel to download over multiple connections:
apt-get install axel
axel http://example.com/file.zip

Well, you can always run multiple instances of wget, no?
Example:
wget -r http://somesite.example.org/ &
wget -r http://othersite.example.net/ &
etc. This syntax will work in any Unix-like environment (e.g. Linux or MacOS); not sure how to do this in Windows.
Wget itself does not support multithreaded operations - at least, neither the manpage nor its website has any mention of this. Anyway, since wget supports HTTP keepalive, the bottleneck is usually the bandwidth of the connection, not the number of simultaneous downloads.

Related

WGET - Simultaneous connections are SLOW

I use the following command to append the responses for a list of URLs to an output file:
wget -i /Applications/MAMP/htdocs/data/urls.txt -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This works fine and when finished it says:
Total wall clock time: 1h 49m 32s
Downloaded: 9999 files, 3.5M in 0.3s (28.5 MB/s)
In order to speed this up I used:
cat /Applications/MAMP/htdocs/data/urls.txt | \
tr -d '\r' | \
xargs -P 10 $(which wget) -i - -O - \
>> /Applications/MAMP/htdocs/data/export.txt
Which opens simultaneous connections making it a little faster:
Total wall clock time: 1h 40m 10s
Downloaded: 3943 files, 8.5M in 0.3s (28.5 MB/s)
As you can see, it somehow omits more than half of the files and takes approximately the same time to finish. I cannot guess why. What I want to do here is download 10 files at once (parallel processing) using xargs and move on to the next URL once the output for the current one has been written to ‘STDOUT’. Am I missing something, or can this be done some other way?
Also, can someone tell me what limit I should set on the number of connections? It would really help to know how many connections my processor can handle without slowing my system down too much or even causing some kind of system failure.
My API Rate-Limiting is as follows:
Number of requests per minute: 100
Number of mapping jobs in a single request: 100
Total number of mapping jobs per minute: 10,000
Have you tried GNU Parallel? It will be something like this:
parallel -a /Applications/MAMP/htdocs/data/urls.txt wget -O - > result.txt
You can use this to see what it will do without actually doing anything:
parallel --dry-run ...
And either of these to see progress:
parallel --progress ...
parallel --bar ...
As your input file seems to be a bit of a mess, you can strip carriage returns like this:
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | parallel wget {} -O - > result.txt
A few things:
I don't think you need the tr, unless there's something weird about your input file. xargs expects one item per line.
man xargs advises you to "Use the -n option with -P; otherwise
chances are that only one exec will be done."
You are using wget -i - telling wget to read URLs from stdin. But xargs will be supplying the URLs as parameters to wget.
To debug, substitute echo for wget and check how it's batching the parameters
So this should work:
cat urls.txt | \
xargs --max-procs=10 --max-args=100 wget --output-document=-
(I've preferred long params - --max-procs is -P. --max-args is -n)
See wget download with multiple simultaneous connections for alternative ways of doing the same thing, including GNU parallel and some dedicated multi-threading HTTP clients.
However, in most circumstances I would not expect parallelising to significantly increase your download rate.
In a typical use case, the bottleneck is likely to be your network link to the server. During a single-threaded download, you would expect to saturate the slowest link in that route. You may get very slight gains with two threads, because one thread can be downloading while the other is sending requests. But this will be a marginal gain.
So this approach is only likely to be worthwhile if you're fetching from multiple servers, and the slowest link in the route to some servers is not at the client end.
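The debugging tip from this answer (substitute echo for wget and look at how the parameters are batched) can be tried with a sketch like the following. The function name `fetch_urls` and the `FETCH` override are hypothetical; the xargs options are the `-P` and `-n` discussed above.

```shell
#!/bin/sh
# Fetch every URL listed in a file, running several downloader processes at once.
# FETCH is a hypothetical override: set FETCH=echo to print each batch of
# arguments instead of downloading.
fetch_urls() {
    url_file=$1
    procs=${2:-10}          # -P: maximum number of simultaneous processes
    per_call=${3:-100}      # -n: maximum number of URLs handed to each process
    # Strip carriage returns, then let xargs pass the URLs as arguments.
    # The unquoted expansion is deliberate so "wget -O -" splits into words.
    tr -d '\r' < "$url_file" |
        xargs -P "$procs" -n "$per_call" ${FETCH:-wget -O -}
}
```

With `FETCH=echo fetch_urls urls.txt 2 2`, each printed line is one batch that a single process would receive. One thing to keep in mind, as an observation rather than a guarantee: when several processes write to the same stdout, their output may interleave, so writing one file per URL is probably safer than appending everything to a single export file.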

How to measure IOPS for a command in linux?

I'm working on a simulation model, where I want to determine when the storage IOPS capacity becomes a bottleneck (e.g. an HDD has ~150 IOPS, while an SSD can have 150,000). So I'm trying to come up with a way to benchmark the IOPS of a command (git) for some of its different operations (push, pull, merge, clone).
So far, I have found tools like iostat, however, I am not sure how to limit the report to what a single command does.
The best idea I can come up with is to determine my HDD IOPS capacity, use time on the actual command, see how long it lasts, multiply that by IOPS and those are my IOPS:
HDD ->150 IOPS
time df -h
real 0m0.032s
150 * .032 = 4.8 IOPS
But, this is of course very stupid, because the duration of the execution may have been related to CPU usage rather than HDD usage, so unless usage of HDD was 100% for that time, it makes no sense to measure things like that.
So, how can I measure the IOPS for a command?
There are multiple time(1) commands on a typical Linux system; the default is a bash(1) builtin which is somewhat basic. There is also /usr/bin/time which you can run by either calling it exactly like that, or telling bash(1) to not use aliases and builtins by prefixing it with a backslash thus: \time. Debian has it in the "time" package which is installed by default, Ubuntu is likely identical, and other distributions will be quite similar.
Invoking it in a similar fashion to the shell builtin is already more verbose and informative, albeit perhaps more opaque unless you're already familiar with what the numbers really mean:
$ \time df
[output elided]
0.00user 0.00system 0:00.01elapsed 66%CPU (0avgtext+0avgdata 864maxresident)k
0inputs+0outputs (0major+261minor)pagefaults 0swaps
However, I'd like to draw your attention to the man page which lists the -f option to customise the output format, and in particular the %w format which counts the number of times the process gave up its CPU timeslice for I/O:
$ \time -f 'ios=%w' du Maildir >/dev/null
ios=184
$ \time -f 'ios=%w' du Maildir >/dev/null
ios=1
Note that the first run stopped for I/O 184 times, but the second run stopped just once. The first figure is credible, as there are 124 directories in my ~/Maildir: the reading of the directory and the inode gives roughly two IOPS per directory, less a bit because some inodes were likely next to each other and read in one operation, plus some extra again for mapping in the du(1) binary, shared libraries, and so on.
The second figure is of course lower due to Linux's disk cache. So the final piece is to flush the cache. sync(1) is a familiar command which flushes dirty writes to disk, but doesn't flush the read cache. You can flush that one by writing 3 to /proc/sys/vm/drop_caches. (Other values are also occasionally useful, but you want 3 here.) As a non-root user, the simplest way to do this is:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Combining that with /usr/bin/time should allow you to build the scripts you need to benchmark the commands you're interested in.
As a minor aside, tee(1) is used because this won't work:
sudo echo 3 >/proc/sys/vm/drop_caches
The reason? Although the echo(1) runs as root, the redirection is as your normal user account, which doesn't have write permissions to drop_caches. tee(1) effectively does the redirection as root.
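Putting the pieces of this answer together might look like the sketch below. It assumes Linux, GNU time installed at /usr/bin/time, and sudo rights; the function name `measure_waits` is made up for illustration. Note that `%w` counts voluntary context switches, which is a proxy for I/O waits rather than an exact IOPS figure.

```shell
#!/bin/sh
# Run a command with a cold page cache and report how often it waited.
# Assumes Linux, GNU time at /usr/bin/time, and permission to use sudo.
measure_waits() {
    sync                                                    # flush dirty pages first
    echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null   # then drop the read cache
    # The report goes to stderr; the command's own stdout is discarded.
    /usr/bin/time -f 'ios=%w elapsed=%es' "$@" >/dev/null
}
```

It would be called as, say, `measure_waits du Maildir`. Running it twice in a row should give comparable numbers, since the cache is dropped before each run.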
The iotop command collects I/O usage information about processes on Linux. By default, it is an interactive command, but you can run it in batch mode with -b / --batch. You can also restrict it to a list of processes with -p / --pid. Thus, you can monitor the activity of a git command with:
$ sudo iotop -p $(pidof git) -b
You can change the delay with -d / --delay.
You can use pidstat:
pidstat -d 2
More specifically pidstat -d 2 | grep COMMAND or pidstat -C COMMANDNAME -d 2
The pidstat command is used for monitoring individual tasks currently being managed by the Linux kernel. It writes to standard output activities for every task selected with option -p or for every task managed by the Linux kernel if option -p ALL has been used. Not selecting any tasks is equivalent to specifying -p ALL but only active tasks (tasks with non-zero statistics values) will appear in the report.
The pidstat command can also be used for monitoring the child processes of selected tasks.
-C comm: Display only tasks whose command name includes the string comm. This string can be a regular expression.

Alternative to scp, transferring files between linux machines by opening parallel connections

Is there an alternative to scp for transferring a large file from one machine to another that opens parallel connections and can also pause and resume the transfer?
Please don't migrate this to serverfault.com. I am not a system administrator; I am a developer trying to transfer old database dumps between backup hosts and servers.
Thank you
You could try using split(1) to break the file apart and then scp the pieces in parallel. The file could then be combined into a single file on the destination machine with 'cat'.
# on local host
split -b 1M large.file large.file. # split into 1MiB chunks
for f in large.file.*; do scp "$f" remote_host: & done; wait # copy the pieces in parallel, then wait for all of them
# on remote host
cat large.file.* > large.file
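Before relying on this for real dumps, the split and reassemble steps can be checked locally. The following is a sketch with no network involved (the scp step would sit between `split` and `cat`); the function name `split_roundtrip` is hypothetical.

```shell
#!/bin/sh
# Split a file into fixed-size pieces, reassemble them, and verify the result.
# Returns 0 if the rejoined file is byte-identical to the original.
split_roundtrip() {
    src=$1
    chunk=${2:-1M}
    workdir=$(mktemp -d) || return 1
    split -b "$chunk" "$src" "$workdir/part."     # creates part.aa, part.ab, ...
    cat "$workdir"/part.* > "$workdir/rejoined"   # the glob expands in suffix order
    cmp -s "$src" "$workdir/rejoined"             # exit status 0 means identical
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

A call such as `split_roundtrip large.file 1M && echo identical` confirms that concatenating the pieces in glob order reproduces the original. On the remote side, comparing a checksum of the rejoined file against the source serves the same purpose.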
Take a look at rsync to see if it will meet your needs.
The correct placement of questions is not based on your role, but on the type of question. Since this one is not strictly programming related it is likely that it will be migrated.
Similar to Mike K's answer, check out https://code.google.com/p/scp-tsunami/ - it handles splitting the file, starting several scp processes to copy the parts and then joins them again...it can also copy to multiple hosts...
./scpTsunami.py -v -s -t 9 -b 10m -u dan bigfile.tar.gz /tmp -l remote.host
That splits the file into 10MB chunks and copies them using 9 scp processes...
The program you are after is lftp. It supports sftp and parallel transfers using its pget command. It is available under Ubuntu (sudo apt-get install lftp) and you can read a review of it here:
http://www.cyberciti.biz/tips/linux-unix-download-accelerator.html

Remote linux server to remote linux server large sparse files copy - How To?

I have two twins CentOS 5.4 servers with VMware Server installed on each.
What is the most reliable and fast method for copying virtual machines files from one server to the other, assuming that I always use sparse file for my vmware virtual machines?
The vm's files are a pain to copy since they are very large (50 GB) but since they are sparse files I think something can be done to improve the speed of the copy.
If you want to copy large data quickly, rsync over SSH is not for you. As running an rsync daemon for quick one-shot copying is also overkill, yer olde tar and nc do the trick as follows.
Create the process that will serve the files over network:
tar cSf - /path/to/files | nc -l 5000
Note that it may take tar a long time to examine sparse files, so it's normal to see no progress for a while.
And receive the files with the following at the other end:
nc hostname_or_ip 5000 | tar xSf -
Alternatively, if you want to get all fancy, use pv to display progress:
tar cSf - /path/to/files \
| pv -s `du -sb /path/to/files | awk '{ print $1 }'` \
| nc -l 5000
Wait a little until you see that pv reports that some bytes have passed by, then start the receiver at the other end:
nc hostname_or_ip 5000 | pv -btr | tar xSf -
Have you tried rsync with the option --sparse (possibly over ssh)?
From man rsync:
Try to handle sparse files efficiently so they take up less
space on the destination. Conflicts with --inplace because it’s
not possible to overwrite data in a sparse fashion.
Since rsync is terribly slow at copying sparse files, I usually resort to using tar over ssh:
tar Scjf - my-src-files | ssh sylvain@my.dest.host tar Sxjf - -C /the/target/directory
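The effect of tar's S flag can be tried locally before sending 50 GB over the wire. This sketch replaces the ssh or nc hop with a plain pipe and assumes GNU tar; the function name `sparse_copy` is hypothetical.

```shell
#!/bin/sh
# Copy a directory through a tar pipe with -S so holes in sparse files
# are detected on the sending side and recreated on the receiving side.
# Local stand-in for the ssh/nc pipelines above; assumes GNU tar.
sparse_copy() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    tar -C "$src_dir" -cSf - . | tar -C "$dest_dir" -xSf -
}
```

After a copy, comparing `du -h --apparent-size` with plain `du -h` on the destination shows whether the holes survived; the exact allocated size depends on the filesystem.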
You could have a look at http://www.barricane.com/virtsync
(Disclaimer: I am the author.)

Automated website folder backup system needed? Any recommendations?

Hi guys, is there any backup software that can take periodic backups of online website folders and store them offline on a local system? I need something robust, and it would be nice if there's something free that can do the job :)
Thanks for the links. I have FTP access and it's my website. It's a document-sharing website with user uploads, and I would like to keep a periodic backup of the files uploaded to it. I just want to automate this process. My local system is Windows-based, though.
If you are referring to a website that you access from your browser (rather than as the administrator of the site), you should check out WGet. And if you need to use WGet from a Windows system, check out Cygwin.
If you have access to the webserver, a cronjob which emails or ftps out the archive would do the job.
If you don't have shell access at the site, you can use wget:
#!/bin/bash
export BCKDIR=`date -u +"%Y%m%dT%H%M%SZ"`
wget -m -np -P $BCKDIR http://www.example.com/path/to/dir
wget options:
-m - Mirror everything, follow links
-np - Don't access parent directories (avoids downloading the whole site)
-P - Store files below $BCKDIR
If you have shell access, you can use rsync. One way to do it, is to have this loop running in a screen(1) session with automatic login using ssh-agent:
#!/bin/bash
while :; do
export BCKDIR=`date -u +"%Y%m%dT%H%M%SZ"`
rsync -az user@hostname:/path/to/dir "$BCKDIR"
sleep 86400 # Sleep 24 hours
done
Not sure what OS you're using, but this should run fine under *NIX. And for MS Windows, there's Cygwin.

Resources