WGET - Simultaneous connections are SLOW - multithreading

I use the following command to append the server's response for each URL in a list to a single output file:
wget -i /Applications/MAMP/htdocs/data/urls.txt -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This works fine and when finished it says:
Total wall clock time: 1h 49m 32s
Downloaded: 9999 files, 3.5M in 0.3s (28.5 MB/s)
In order to speed this up I used:
cat /Applications/MAMP/htdocs/data/urls.txt | \
tr -d '\r' | \
xargs -P 10 $(which wget) -i - -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This opens simultaneous connections, making it a little faster:
Total wall clock time: 1h 40m 10s
Downloaded: 3943 files, 8.5M in 0.3s (28.5 MB/s)
As you can see, it somehow omits more than half of the files and takes approximately the same time to finish. I cannot work out why. What I want to do here is download 10 files at once (parallel processing) using xargs, moving on to the next URL as soon as one finishes writing to stdout. Am I missing something, or can this be done another way?
On the other hand, what limit should I set on the number of connections? It would really help to know how many connections my machine can handle without slowing the system down too much, or even causing some kind of system failure.
My API rate limiting is as follows:
Number of requests per minute 100
Number of mapping jobs in a single request 100
Total number of mapping jobs per minute 10,000

Have you tried GNU Parallel? It will be something like this:
parallel -a /Applications/MAMP/htdocs/data/urls.txt wget -O - > result.txt
You can use this to see what it will do without actually doing anything:
parallel --dry-run ...
And either of these to see progress:
parallel --progress ...
parallel --bar ...
As your input file seems to be a bit of a mess, you can strip carriage returns like this:
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | parallel wget {} -O - > result.txt
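Since your API allows 100 requests per minute, you may also want to pace the job starts; GNU Parallel's --delay option can do that. A rough sketch, assuming a 0.6 s spacing (60 s / 100 requests) and 10 parallel jobs:
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | \
  parallel --delay 0.6 -j10 wget -O - {} \
  >> /Applications/MAMP/htdocs/data/export.txt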

A few things:
I don't think you need the tr, unless there's something weird about your input file. xargs splits its input on whitespace (including newlines), so one URL per line is fine.
man xargs advises you to "Use the -n option with -P; otherwise chances are that only one exec will be done."
You are using wget -i -, which tells wget to read URLs from stdin, but xargs will be supplying the URLs as command-line arguments to wget.
To debug, substitute echo for wget and check how it's batching the parameters (see the sketch after the command below).
So this should work:
cat urls.txt | \
xargs --max-procs=10 --max-args=100 wget --output-document=-
(I've used the long options here: --max-procs is -P and --max-args is -n.)
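As a sketch of the debugging step mentioned above, you can substitute echo for wget to see how xargs batches the arguments without downloading anything (this assumes GNU xargs; BSD/macOS xargs only understands the short -P and -n forms):
xargs --max-procs=1 --max-args=100 echo < urls.txt | head -n 3
Each printed line is one batch of up to 100 URLs that would be handed to a single wget invocation; --max-procs=1 is used here just to keep the debug output from interleaving.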
See wget download with multiple simultaneous connections for alternative ways of doing the same thing, including GNU parallel and some dedicated multi-threading HTTP clients.
However, in most circumstances I would not expect parallelising to significantly increase your download rate.
In a typical use case, the bottleneck is likely to be your network link to the server. During a single-threaded download, you would expect to saturate the slowest link in that route. You may get very slight gains with two threads, because one thread can be downloading while the other is sending requests. But this will be a marginal gain.
So this approach is only likely to be worthwhile if you're fetching from multiple servers, and the slowest link in the route to some servers is not at the client end.

Related

Launching parallel network tasks using xargs whilst minimising context switching overhead

I want to run 100 networking (non-CPU-intensive) jobs in parallel and want to understand the best approach.
Specifically, is it possible to run 100+ jobs using xargs, and what are the drawbacks?
I understand that there is a point where more time is spent on context switching than on actual packet processing.
How do I work out where that point is, and what is the best way to minimise it?
For example, are there better tools to use than xargs?
Better will often be a matter of taste.
Using GNU Parallel you can do something like this to fetch 1000 images, 100 at a time:
seq 1000 | parallel -j100 wget https://foo.bar/image{}.jpg
If you want data from 100 servers and you get a full line every time:
parallel -a servers.txt -j0 --line-buffer my_connect {}
Or:
parallel -a servers.txt -j0 --line-buffer --tag my_connect {}
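Here my_connect is just a placeholder for whatever command talks to one server and prints lines; a minimal hypothetical stand-in using curl might look like this (the URL scheme and path are assumptions):
#!/usr/bin/env bash
# my_connect: fetch line-oriented data from a single server (hypothetical)
server="$1"
curl -s "http://${server}/status"
With --tag, each output line is then prefixed with the server it came from.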
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU and wait for each batch to finish. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

How to measure IOPS for a command in linux?

I'm working on a simulation model where I want to determine when storage IOPS capacity becomes a bottleneck (e.g. an HDD has ~150 IOPS, while an SSD can have 150,000). So I'm trying to come up with a way to benchmark the IOPS of a command (git) for some of its different operations (push, pull, merge, clone).
So far, I have found tools like iostat, however, I am not sure how to limit the report to what a single command does.
The best idea I can come up with is to determine my HDD's IOPS capacity, use time on the actual command to see how long it lasts, multiply the two, and call that my IOPS:
HDD ->150 IOPS
time df -h
real 0m0.032s
150 * .032 = 4.8 IOPS
But, this is of course very stupid, because the duration of the execution may have been related to CPU usage rather than HDD usage, so unless usage of HDD was 100% for that time, it makes no sense to measure things like that.
So, how can I measure the IOPS for a command?
There are multiple time(1) commands on a typical Linux system; the default is a bash(1) builtin which is somewhat basic. There is also /usr/bin/time which you can run by either calling it exactly like that, or telling bash(1) to not use aliases and builtins by prefixing it with a backslash thus: \time. Debian has it in the "time" package which is installed by default, Ubuntu is likely identical, and other distributions will be quite similar.
Invoking it in a similar fashion to the shell builtin is already more verbose and informative, albeit perhaps more opaque unless you're already familiar with what the numbers really mean:
$ \time df
[output elided]
0.00user 0.00system 0:00.01elapsed 66%CPU (0avgtext+0avgdata 864maxresident)k
0inputs+0outputs (0major+261minor)pagefaults 0swaps
However, I'd like to draw your attention to the man page which lists the -f option to customise the output format, and in particular the %w format which counts the number of times the process gave up its CPU timeslice for I/O:
$ \time -f 'ios=%w' du Maildir >/dev/null
ios=184
$ \time -f 'ios=%w' du Maildir >/dev/null
ios=1
Note that the first run stopped for I/O 184 times, but the second run stopped just once. The first figure is credible, as there are 124 directories in my ~/Maildir: the reading of the directory and the inode gives roughly two IOPS per directory, less a bit because some inodes were likely next to each other and read in one operation, plus some extra again for mapping in the du(1) binary, shared libraries, and so on.
The second figure is of course lower due to Linux's disk cache. So the final piece is to flush the cache. sync(1) is a familiar command which flushes dirty writes to disk, but doesn't flush the read cache. You can flush that one by writing 3 to /proc/sys/vm/drop_caches. (Other values are also occasionally useful, but you want 3 here.) As a non-root user, the simplest way to do this is:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Combining that with /usr/bin/time should allow you to build the scripts you need to benchmark the commands you're interested in.
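As a minimal sketch of such a script (the command to benchmark is whatever you pass on the command line; nothing here is specific to git):
#!/usr/bin/env bash
# iobench.sh: flush caches, then report voluntary I/O waits (%w) for one command.
set -e
sync                                                   # flush dirty writes to disk
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null  # drop the read caches
\time -f 'ios=%w elapsed=%es' "$@"                     # e.g. ./iobench.sh git clone <url>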
As a minor aside, tee(1) is used because this won't work:
sudo echo 3 >/proc/sys/vm/drop_caches
The reason? Although the echo(1) runs as root, the redirection is as your normal user account, which doesn't have write permissions to drop_caches. tee(1) effectively does the redirection as root.
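(An equivalent alternative is to run the whole command line, redirection included, in a root shell: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'.)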
The iotop command collects I/O usage information about processes on Linux. By default it is an interactive command, but you can run it in batch mode with -b / --batch. Also, you can pass a list of processes with -p / --pid. Thus, you can monitor the activity of a git command with:
$ sudo iotop -p $(pidof git) -b
You can change the delay with -d / --delay.
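For example, one way to monitor a command you launch yourself is to background it and hand its PID to iotop; note that -p follows only that PID, not any children it spawns (the repository URL below is a placeholder):
git clone https://example.com/repo.git &   # placeholder URL
sudo iotop -b -d 1 -p $!                   # sample that process's I/O once per second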
You can use pidstat:
pidstat -d 2
More specifically pidstat -d 2 | grep COMMAND or pidstat -C COMMANDNAME -d 2
The pidstat command is used for monitoring individual tasks currently being managed by the Linux kernel. It writes to standard output activities for every task selected with option -p or for every task managed by the Linux kernel if option -p ALL has been used. Not selecting any tasks is equivalent to specifying -p ALL but only active tasks (tasks with non-zero statistics values) will appear in the report.
The pidstat command can also be used for monitoring the child processes of selected tasks.
-C comm  Display only tasks whose command name includes the string comm. This string can be a regular expression.
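For instance, a rough sketch for watching a clone (placeholder URL), run from a second terminal while the clone is in progress:
# terminal 1
git clone https://example.com/repo.git
# terminal 2: disk I/O every 2 s for any task whose command name matches "git"
pidstat -C git -d 2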

How do I find out what inotify watches have been registered?

I have my inotify watch limit set to 1024 (I think the default is 128?). Despite that, yeoman, Guard and Dropbox constantly fail, and tell me to up my inotify limit. Before doing so, I'd like to know what's consuming all my watches (I have very few files in my Dropbox).
Is there some area of /proc or /sys, or some tool I can run, to find out what watches are currently registered?
Oct 31 2022 update
While my script below works fine as it is, Michael Sartain has implemented a native executable that is much faster, along with additional functionality not present in my script (below). Worth checking out if you can spend a few seconds compiling it! I have also contributed some PRs to align the functionality, so it should be pretty much 1:1, just faster.
Upvote his answer on the Unix Stackexchange.
Original answer with script
I already answered this in the same thread on Unix Stackexchange that @cincodenada mentioned, but thought I could repost my ready-made answer here, seeing that no one really has something that works:
I have a premade script, inotify-consumers, that lists the top offenders for you:
INOTIFY INSTANCES
WATCHES PER
COUNT PROCESS PID USER COMMAND
------------------------------------------------------------
21270 1 11076 my-user /snap/intellij-idea-ultimate/357/bin/fsnotifier
201 6 1 root /sbin/init splash
115 5 1510 my-user /lib/systemd/systemd --user
85 1 3600 my-user /usr/libexec/xdg-desktop-portal-gtk
77 1 2580 my-user /usr/libexec/gsd-xsettings
35 1 2475 my-user /usr/libexec/gvfsd-trash --spawner :1.5 /org/gtk/gvfs/exec_spaw/0
32 1 570 root /lib/systemd/systemd-udevd
26 1 2665 my-user /snap/snap-store/558/usr/bin/snap-store --gapplication-service
18 2 1176 root /usr/libexec/polkitd --no-debug
14 1 1858 my-user /usr/bin/gnome-shell
13 1 3641 root /usr/libexec/fwupd/fwupd
...
21983 WATCHES TOTAL COUNT
INotify instances per user (e.g. limits specified by fs.inotify.max_user_instances):
INSTANCES USER
----------- ------------------
41 my-user
23 root
1 whoopsie
1 systemd-ti+
...
Here you quickly see why the default limit of 8K watches is too little on a development machine: a single WebStorm instance quickly maxes it out when it encounters a node_modules folder with thousands of subfolders. Add a webpack watcher to guarantee problems ...
Even though it was much faster than the other alternatives when I made it initially, Simon Matter added some speed enhancements for heavily loaded Big Iron Linux (hundreds of cores) that sped it up immensely, taking it down from ten minutes (!) to 15 seconds on his monster rig.
Later on, Brian Dowling contributed instance count per process, at the expense of relatively higher runtime. This is insignificant on normal machines with a runtime of about one second, but if you have Big Iron, you might want the earlier version with about 1/10 the amount of system time :)
How to use
Run inotify-consumers --help 😊. To get it on your machine, just copy the contents of the script and put it somewhere in your $PATH, like /usr/local/bin. Alternatively, if you trust this stranger on the net, you can avoid copying it and pipe it into bash over http:
$ curl -s https://raw.githubusercontent.com/fatso83/dotfiles/master/utils/scripts/inotify-consumers | bash
INOTIFY
WATCHER
COUNT PID USER COMMAND
--------------------------------------
3044 3933 myuser node /usr/local/bin/tsserver
2965 3941 myuser /usr/local/bin/node /home/myuser/.config/coc/extensions/node_modules/coc-tsserver/bin/tsserverForkStart /hom...
6990 WATCHES TOTAL COUNT
How does it work?
For reference, the main content of the script is simply this (inspired by this answer)
find /proc/*/fd \
-lname anon_inode:inotify \
-printf '%hinfo/%f\n' 2>/dev/null \
\
| xargs grep -c '^inotify' \
| sort -n -t: -k2 -r
Changing the limits
In case you are wondering how to increase the limits
$ inotify-consumers --limits
Current limits
-------------
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 524288
Changing settings permanently
-----------------------------
echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf
sudo sysctl -p # re-read config
inotify filesystem options
sysctl fs.inotify
opened files
lsof | grep inotify | wc -l
Increase the values like this
sysctl -n -w fs.inotify.max_user_watches=16384
sysctl -n -w fs.inotify.max_user_instances=512
The default maximum number of inotify watches is 8192; it can be increased by writing to /proc/sys/fs/inotify/max_user_watches.
You can use sysctl fs.inotify.max_user_watches to check current value.
Use tail -f to verify whether your OS has exceeded the inotify maximum watch limit.
The internal implementation of the tail -f command uses the inotify mechanism to monitor file changes.
If you've run out of inotify watches, you'll most likely get this error:
tail: inotify cannot be used, reverting to polling: Too many open files
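A quick way to check is simply to tail any file; if the limit is exhausted, tail prints that warning and falls back to polling:
touch /tmp/inotify-test
tail -f /tmp/inotify-test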
To find out what inotify watches have been registered, you may refer to this, and this. I tried, but didn't get the ideal result. :-(
Reference:
https://askubuntu.com/questions/154255/how-can-i-tell-if-i-am-out-of-inotify-watches
https://unix.stackexchange.com/questions/15509/whos-consuming-my-inotify-resources
https://bbs.archlinux.org/viewtopic.php?pid=1340049
I think
sudo ls -l /proc/*/fd/* | grep notify
might be of use. You'll get a list of the pids that have an inotify fd registered.
I don't know how to get more info than this! HTH
Since this is high in Google results, I'm copy-pasting part of my answer from a similar question over on the Unix/Linux StackExchange:
I ran into this problem, and none of these answers give you the answer of "how many watches is each process currently using?" The one-liners all give you how many instances are open, which is only part of the story, and the trace stuff is only useful to see new watches being opened.
This will get you a file with a list of open inotify instances and the number of watches they have, along with the pids and binaries that spawned them, sorted in descending order by watch count:
sudo lsof | awk '/anon_inode/ { gsub(/[urw]$/,"",$4); print "/proc/"$2"/fdinfo/"$4; }' | while read fdi; do count=$(sudo grep -c inotify $fdi); exe=$(sudo readlink $(dirname $(dirname $fdi))/exe); echo -e $count"\t"$fdi"\t"$exe; done | sort -nr > watches
If you're interested in what that big ball of mess does and why, I explained in depth over on the original answer.
The following terminal command worked perfectly for me on my Ubuntu 16.04 Machine:
for foo in /proc/*/fd/*; do readlink -f $foo; done | grep '^/proc/.*inotify' | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %a' -p '{}' | uniq -c | sort -n
My problem was that I had a good majority of my HDD loaded as a folder in Sublime Text. Between /opt/sublime_text/plugin_host 8992 and /opt/sublime_text/sublime_text, Sublime had 18 instances of inotify while the rest of my programs were all between 1-3.
Since I was doing Ionic Mobile App development I reduced the number of instances by 5 by adding the large Node.js folder "node_modules" to the ignore list in the Sublime settings.
"folder_exclude_patterns": [".svn", ".git", ".hg", "CVS", "node_modules"]
Source: https://github.com/SublimeTextIssues/Core/issues/1195
Based on the excellent analysis of cincodenada, I made my own one-liner, which works better for me:
find /proc/*/fd/ -type l -lname "anon_inode:inotify" -printf "%hinfo/%f\n" | xargs grep -cE "^inotify" | column -t -s:
It helps to find all inotify watchers and their watching count. It does not translate process ids to their process names or sort them in any way but that was not the point for me. I simply wanted to find out which process consumes most of the watches. I then was able to search for that process using its process id.
You can omit the last column command if you don't have it installed. It's only there to make the output look nicer.
Okay, as you can see, there is a similar and less fork-hungry approach from @oligofren above. You are probably better off using his simple script; it's very nice. I was also able to shrink my one-liner because I was not aware of the -lname parameter of find, which comes in very handy here.

multiple wget -r a site simultaneously?

Is there any command, or wget with options, for a multithreaded, recursive, simultaneous download of a site?
I found a decent solution.
Read original at http://www.linuxquestions.org/questions/linux-networking-3/wget-multi-threaded-downloading-457375/
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
Copy the line as many times as you deem fitting to have as many processes
downloading. This isn't as elegant as a properly multithreaded app,
but it will get the job done with only a slight amount of overhead.
The key here is the "-N" switch, which means: transfer the file only
if it is newer than what's on disk. This will (mostly) prevent
each process from downloading a file that a different process has
already downloaded; instead it skips it and downloads what some other
process hasn't fetched yet. It uses the timestamp as the means of doing
this, hence the slight overhead.
It works great for me and saves a lot of time. Don't have too many
processes as this may saturate the web site's connection and tick off
the owner. Keep it around a max of 4 or so. However, the number is
only limited by CPU and network bandwidth on both ends.
Using parallel wget via xargs, as in this answer, seems a much better solution:
https://stackoverflow.com/a/11850469/1647809
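For instance, a rough sketch along those lines, fetching several sites recursively with up to four wget processes at once (sites.txt is an assumed file with one start URL per line):
xargs -P 4 -n 1 wget -r -np -N < sites.txt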
Use axel to download with multi connections
apt-get install axel
axel http://example.com/file.zip
Well, you can always run multiple instances of wget, no?
Example:
wget -r http://somesite.example.org/ &
wget -r http://othersite.example.net/ &
etc. This syntax will work in any Unix-like environment (e.g. Linux or MacOS); not sure how to do this in Windows.
Wget itself does not support multithreaded operations - at least, neither the manpage nor its website has any mention of this. Anyway, since wget supports HTTP keepalive, the bottleneck is usually the bandwidth of the connection, not the number of simultaneous downloads.

Remote linux server to remote linux server large sparse files copy - How To?

I have two twin CentOS 5.4 servers with VMware Server installed on each.
What is the most reliable and fast method for copying virtual machine files from one server to the other, assuming that I always use sparse files for my VMware virtual machines?
The VM files are a pain to copy since they are very large (50 GB), but since they are sparse files I think something can be done to improve the speed of the copy.
If you want to copy large data quickly, rsync over SSH is not for you. As running an rsync daemon for quick one-shot copying is also overkill, yer olde tar and nc do the trick as follows.
Create the process that will serve the files over network:
tar cSf - /path/to/files | nc -l 5000
Note that it may take tar a long time to examine sparse files, so it's normal to see no progress for a while.
And receive the files with the following at the other end:
nc hostname_or_ip 5000 | tar xSf -
Alternatively, if you want to get all fancy, use pv to display progress:
tar cSf - /path/to/files \
| pv -s `du -sb /path/to/files | awk '{ print $1 }'` \
| nc -l 5000
Wait a little until you see that pv reports that some bytes have passed by, then start the receiver at the other end:
nc hostname_or_ip 5000 | pv -btr | tar xSf -
Have you tried rsync with the --sparse option (possibly over ssh)?
From man rsync:
Try to handle sparse files efficiently so they take up less
space on the destination. Conflicts with --inplace because it’s
not possible to overwrite data in a sparse fashion.
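A minimal sketch, with hypothetical paths and hostname:
rsync -av --sparse --progress /var/lib/vmware/vms/ user@other-server:/var/lib/vmware/vms/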
Since rsync is terribly slow at copying sparse file, I usually resort using tar over ssh :
tar Scjf - my-src-files | ssh sylvain@my.dest.host tar Sxjf - -C /the/target/directory
You could have a look at http://www.barricane.com/virtsync
(Disclaimer: I am the author.)
