Launching parallel network tasks using xargs whilst minimising context switching overhead - linux

I want to run 100 networking (non cpu intense) jobs in parallel and want to understand the best approach.
Specifically is it possible to run 100+ jobs using xargs and what are the drawbacks?
I understand that there is a point where more time is spent on context switching than on actual packet processing.
How do I work out where that point is, and what is the best way to minimise it?
For example, are there better tools to use than xargs, etc.?

Better will often be a matter of taste.
Using GNU Parallel you can do something like this to fetch 1000 images, 100 at a time in parallel:
seq 1000 | parallel -j100 wget https://foo.bar/image{}.jpg
If you want data from 100 servers and you get a full line every time:
parallel -a servers.txt -j0 --line-buffer my_connect {}
Or:
parallel -a servers.txt -j0 --line-buffer --tag my_connect {}
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU and wait for each batch to finish.
GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
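For example, a minimal sketch of that behaviour (do_work and the job count are placeholders, not something from the original answer):
seq 1 32 | parallel -j4 do_work {}
-j4 keeps exactly 4 jobs running; as soon as one finishes, the next number from seq is handed to a fresh do_work process.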
Installation
For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Related

WGET - Simultaneous connections are SLOW

I use the following command to append the responses from a list of URLs to a single output file:
wget -i /Applications/MAMP/htdocs/data/urls.txt -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This works fine and when finished it says:
Total wall clock time: 1h 49m 32s
Downloaded: 9999 files, 3.5M in 0.3s (28.5 MB/s)
In order to speed this up I used:
cat /Applications/MAMP/htdocs/data/urls.txt | \
tr -d '\r' | \
xargs -P 10 $(which wget) -i - -O - \
>> /Applications/MAMP/htdocs/data/export.txt
Which opens simultaneous connections making it a little faster:
Total wall clock time: 1h 40m 10s
Downloaded: 3943 files, 8.5M in 0.3s (28.5 MB/s)
As you can see, it somehow omits more than half of the files and takes roughly the same time to finish, and I cannot work out why. What I want is to download 10 files at once (parallel processing) using xargs, moving on to the next URL as soon as one finishes. Am I missing something, or can this be done another way?
On the other hand, can someone tell me what limit I can set on the number of connections? It would really help to know how many connections my machine can handle without slowing down my system too much or even risking some kind of system failure.
My API rate limiting is as follows:
Number of requests per minute: 100
Number of mapping jobs in a single request: 100
Total number of mapping jobs per minute: 10,000
Have you tried GNU Parallel? It will be something like this:
parallel -a /Applications/MAMP/htdocs/data/urls.txt wget -O - > result.txt
You can use this to see what it will do without actually doing anything:
parallel --dry-run ...
And either of these to see progress:
parallel --progress ...
parallel --bar ...
As your input file seems to be a bit of a mess, you can strip carriage returns like this:
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | parallel wget {} -O - > result.txt
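If the rate limit quoted in the question (100 requests per minute) actually applies, one option worth considering is GNU Parallel's --delay, which waits a fixed time between starting jobs; 0.6 seconds between starts keeps you at or below 100 starts per minute:
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | parallel --delay 0.6 wget {} -O - > result.txt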
A few things:
I don't think you need the tr, unless there's something weird about your input file. xargs expects one item per line.
man xargs advises you to "Use the -n option with -P; otherwise chances are that only one exec will be done."
You are using wget -i -, telling wget to read URLs from stdin, but xargs will be supplying the URLs as arguments to wget.
To debug, substitute echo for wget and check how xargs batches the parameters, as in the sketch below.
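For example, a quick sanity check without -P, so the output doesn't interleave (urls.txt is the same input file as above):
cat urls.txt | xargs -n 100 echo | head -3
Each printed line is one batch of up to 100 URLs that would have been handed to a single wget invocation.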
So this should work:
cat urls.txt | \
xargs --max-procs=10 --max-args=100 wget --output-document=-
(I've preferred the long options: --max-procs is -P, --max-args is -n.)
See wget download with multiple simultaneous connections for alternative ways of doing the same thing, including GNU parallel and some dedicated multi-threading HTTP clients.
However, in most circumstances I would not expect parallelising to significantly increase your download rate.
In a typical use case, the bottleneck is likely to be your network link to the server. During a single-threaded download, you would expect to saturate the slowest link in that route. You may get very slight gains with two threads, because one thread can be downloading while the other is sending requests. But this will be a marginal gain.
So this approach is only likely to be worthwhile if you're fetching from multiple servers, and the slowest link in the route to some servers is not at the client end.

How to control multithreaded background jobs in for loop in shell script

I found that my Linux workstation with 12 CPUs almost ground to a halt after I executed a shell script (tcsh) containing a for-loop in which several hundred iterations were launched simultaneously by adding '&' at the end of the command. Is there any way to control the number of background processes, or their execution time, in a tcsh for-loop?
GNU Parallel is made for this kind of situation.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU and wait for each batch to finish.
GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
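Applied to the question, a minimal sketch (my_command and the iteration count of 200 are placeholders for whatever the tcsh loop runs):
seq 1 200 | parallel -j12 my_command {}
-j12 caps the number of simultaneous jobs at one per CPU on the 12-CPU workstation; adjust it to taste.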
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

How to measure IOPS for a command in linux?

I'm working on a simulation model where I want to determine when storage IOPS capacity becomes a bottleneck (e.g. an HDD has ~150 IOPS, while an SSD can have 150,000). So I'm trying to come up with a way to benchmark the IOPS of a command (git) for some of its different operations (push, pull, merge, clone).
So far, I have found tools like iostat, however, I am not sure how to limit the report to what a single command does.
The best idea I can come up with is to determine my HDD's IOPS capacity, run time on the actual command, see how long it lasts, and multiply the two:
HDD ->150 IOPS
time df -h
real 0m0.032s
150 * .032 = 4.8 IOPS
But, this is of course very stupid, because the duration of the execution may have been related to CPU usage rather than HDD usage, so unless usage of HDD was 100% for that time, it makes no sense to measure things like that.
So, how can I measure the IOPS for a command?
There are multiple time(1) commands on a typical Linux system; the default is a bash(1) builtin which is somewhat basic. There is also /usr/bin/time which you can run by either calling it exactly like that, or telling bash(1) to not use aliases and builtins by prefixing it with a backslash thus: \time. Debian has it in the "time" package which is installed by default, Ubuntu is likely identical, and other distributions will be quite similar.
Invoking it in a similar fashion to the shell builtin is already more verbose and informative, albeit perhaps more opaque unless you're already familiar with what the numbers really mean:
$ \time df
[output elided]
0.00user 0.00system 0:00.01elapsed 66%CPU (0avgtext+0avgdata 864maxresident)k
0inputs+0outputs (0major+261minor)pagefaults 0swaps
However, I'd like to draw your attention to the man page which lists the -f option to customise the output format, and in particular the %w format which counts the number of times the process gave up its CPU timeslice for I/O:
$ \time -f 'ios=%w' du Maildir >/dev/null
ios=184
$ \time -f 'ios=%w' du Maildir >/dev/null
ios=1
Note that the first run stopped for I/O 184 times, but the second run stopped just once. The first figure is credible, as there are 124 directories in my ~/Maildir: the reading of the directory and the inode gives roughly two IOPS per directory, less a bit because some inodes were likely next to each other and read in one operation, plus some extra again for mapping in the du(1) binary, shared libraries, and so on.
The second figure is of course lower due to Linux's disk cache. So the final piece is to flush the cache. sync(1) is a familiar command which flushes dirty writes to disk, but doesn't flush the read cache. You can flush that one by writing 3 to /proc/sys/vm/drop_caches. (Other values are also occasionally useful, but you want 3 here.) As a non-root user, the simplest way to do this is:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Combining that with /usr/bin/time should allow you to build the scripts you need to benchmark the commands you're interested in.
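A minimal sketch of such a script (the repository URL and target directory are placeholders):
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
\time -f 'ios=%w' git clone https://example.com/some-repo.git /tmp/iops-test
Dropping the read cache first means the counted waits reflect real disk reads rather than cached ones.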
As a minor aside, tee(1) is used because this won't work:
sudo echo 3 >/proc/sys/vm/drop_caches
The reason? Although the echo(1) runs as root, the redirection is as your normal user account, which doesn't have write permissions to drop_caches. tee(1) effectively does the redirection as root.
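An equivalent alternative is to run the whole redirection inside a root shell:
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'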
The iotop command collects I/O usage information about processes on Linux. By default it is an interactive command, but you can run it in batch mode with -b / --batch. You can also restrict it to a list of processes with -p / --pid. Thus, you can monitor the activity of a git command with:
$ sudo iotop -p $(pidof git) -b
You can change the delay with -d / --delay.
You can use pidstat:
pidstat -d 2
More specifically pidstat -d 2 | grep COMMAND or pidstat -C COMMANDNAME -d 2
The pidstat command is used for monitoring individual tasks currently being managed by the Linux kernel. It writes to standard output activities for every task selected with option -p or for every task managed by the Linux kernel if option -p ALL has been used. Not selecting any tasks is equivalent to specifying -p ALL but only active tasks (tasks with non-zero statistics values) will appear in the report.
The pidstat command can also be used for monitoring the child processes of selected tasks.
-C comm
Display only tasks whose command name includes the string comm. This string can be a regular expression.
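For the git example above, a hedged one-liner (the 2-second interval is just an example):
pidstat -d -C git 2
-d reports per-task disk I/O (kB read and written per second) for every task whose command name matches git, refreshed every 2 seconds.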

Have 5 scripts running at any given time

I have a bash script (running under CentOS 6.4) that launches 90 different PHP scripts, i.e.:
#!/bin/bash
php path1/some_job_1.php&
php path2/some_job_2.php&
php path3/some_job_3.php&
php path4/some_job_4.php&
php path5/some_job_5.php
php path6/some_job_6.php&
php path7/some_job_7.php&
php path8/some_job_8.php&
php path9/some_job_9.php&
php path10/some_job_10.php
...
exit 0
In order to avoid overloading my server I use the ampersand &. It works, but my goal is to always have 5 scripts running at the same time.
Is there a way to achieve this?
This question has popped up several times, but I could not find a proper answer for it. I think I have now found a good solution!
Unfortunately parallel is not part of the standard distributions, but make is. It has a -j switch to run jobs in parallel.
From man make(1) (more info on make's parallel execution):
-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously. If
there is more than one -j option, the last one is effective. If
the -j option is given without an argument, make will not limit
the number of jobs that can run simultaneously.
So with a proper Makefile the problem could be solved.
.PHONY: all $(PHP_DEST)
# Create an array of targets in the form of PHP1 PHP2 ... PHP90
PHP_DEST := $(addprefix PHP, $(shell seq 1 1 90))
# Default target
all: $(PHP_DEST)
# Run the proper script for each target
$(PHP_DEST):
	N=$(subst PHP,,$@); php path$$N/some_job_$$N.php
It creates 90 PHP# targets, each of which calls php path#/some_job_#.php. If you run make -j 5 it will run 5 instances of php in parallel; when one finishes it starts the next.
I renamed the Makefile to parallel.mak, ran chmod 700 parallel.mak and added #!/usr/bin/make -f as the first line. Now it can be called as ./parallel.mak -j 5.
Or you can even use the more sophisticated -l switch:
-l [load], --load-average[=load]
Specifies that no new jobs (commands) should be started if there
are other jobs running and the load average is at least load (a
floating-point number). With no argument, removes a previous load
limit.
In this case make will decide how many jobs can be launched depending on the system's load.
I tested it with ./parallel.mak -j -l 1.0 and it ran nicely: it started 4 programs in parallel at first, whereas -j without an argument means run as many processes in parallel as possible.
Use cron and schedule them at the same time.
Or use parallel.
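For instance, with GNU Parallel the original 90 scripts could be capped at 5 concurrent jobs (a sketch reusing the path{}/some_job_{}.php naming from the question):
seq 1 90 | parallel -j5 'php path{}/some_job_{}.php'
As with the make approach, a new job starts as soon as one of the 5 slots frees up.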

linux batch jobs in parallel

I have seven licenses of a particular piece of software, so I want to start 7 jobs simultaneously. I can do that using '&'. However, 'wait' waits until all 7 processes have finished before spawning the next 7. I would like to write a shell script where, after I start the first seven, a new job is started as soon as any one of them completes, because some of those 7 jobs might take very long while others finish quickly, and I don't want to waste time waiting for all of them. Is there a way to do this in Linux? Could you please help me?
Thanks.
GNU parallel is the way to go. It is designed for launching multiple instances of the same command, each with a different argument retrieved either from stdin or from an external file.
Let's say your licensed script is called myScript, each instance having the same options --arg1 --arg2 and taking a variable parameter --argVariable for each instance spawned, those parameters being stored in the file myParameters:
cat myParameters | parallel -halt 1 --jobs 7 ./myScript --arg1 --argVariable {} --arg2
Explanations:
-halt 1 tells parallel to halt all jobs if one fails
--jobs 7 will launch 7 instances of myScript
On a Debian-based Linux system, you can install parallel using:
sudo apt-get install parallel
As a bonus, if your licenses allow it, you can even tell parallel to launch these 7 instances amongst multiple computers.
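A hedged sketch of that (server1 and server2 are placeholder hostnames reachable over ssh, and ./myScript must exist on them):
cat myParameters | parallel -S server1,server2 --jobs 7 ./myScript --arg1 --argVariable {} --arg2
With -S (--sshlogin) the jobs are spread across the listed machines, and --jobs then applies per machine, so lower it if the 7 licenses are counted globally.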
You could check how many are currently running and start more if you have less than 7:
while true; do
    if [ "`ps ax -o comm | grep process-name | wc -l`" -lt 7 ]; then
        process-name &
    fi
    sleep 1
done
Write two scripts: one which restarts a job every time it finishes, and one that starts the first script 7 times.
Like:
script1:
./script2 job1 &
...
./script2 job7 &
and
script2:
while true; do
    ./"$1"    # rerun the job passed as the first argument whenever it exits
done
I found a fairly good solution using make, which is part of the standard distributions. See the make-based answer under "Have 5 scripts running at any given time" above.
