Parallel download using Curl command line utility

Parallel download using Curl command line utility - linux

I want to download some pages from a website and I did it successfully using curl but I was wondering if somehow curl downloads multiple pages at a time just like most of the download managers do, it will speed up things a little bit. Is it possible to do it in curl command line utility?
The current command I am using is
curl 'http://www...../?page=[1-10]' 2>&1 > 1.html
Here I am downloading pages from 1 to 10 and storing them in a file named 1.html.
Also, is it possible for curl to write output of each URL to separate file say URL.html, where URL is the actual URL of the page under process.

My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.
The one-liner I would use is, simply:
$ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s $url'
This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running all the time, each one handling a single argument, until all of the input arguments have been processed.
You can think of this as MapReduce in the shell. Or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork bomb your machine. It's possible to do something similar in a for loop in a shell, but end up doing process management, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.
Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:
parallel --jobs 2 curl -O -s http://example.com/?page{}.html ::: {1..10}

Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel and sending their outputs to different files.
curl can use the filename part of the URL to generate the local file. Just use the -O option (man curl for details).
You could use something like the following
urls="http://example.com/?page1.html http://example.com?page2.html" # add more URLs here
for url in $urls; do
# run the curl job in the background so we can start another job
# and disable the progress bar (-s)
echo "fetching $url"
curl $url -O -s &
done
wait #wait for all background jobs to terminate

As of 7.66.0, the curl utility finally has built-in support for parallel downloads of multiple URLs within a single non-blocking process, which should be much faster and more resource-efficient compared to xargs and background spawning, in most cases:
curl -Z 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'
This will download 18 links in parallel and write them out to 18 different files, also in parallel. The official announcement of this feature from Daniel Stenberg is here: https://daniel.haxx.se/blog/2019/07/22/curl-goez-parallel/

For launching of parallel commands, why not use the venerable make command line utility.. It supports parallell execution and dependency tracking and whatnot.
How? In the directory where you are downloading the files, create a new file called Makefile with the following contents:
# which page numbers to fetch
numbers := $(shell seq 1 10)
# default target which depends on files 1.html .. 10.html
# (patsubst replaces % with %.html for each number)
all: $(patsubst %,%.html,$(numbers))
# the rule which tells how to generate a %.html dependency
# $# is the target filename e.g. 1.html
%.html:
curl -C - 'http://www...../?page='$(patsubst %.html,%,$#) -o $#.tmp
mv $#.tmp $#
NOTE The last two lines should start with a TAB character (instead of 8 spaces) or make will not accept the file.
Now you just run:
make -k -j 5
The curl command I used will store the output in 1.html.tmp and only if the curl command succeeds then it will be renamed to 1.html (by the mv command on the next line). Thus if some download should fail, you can just re-run the same make command and it will resume/retry downloading the files that failed to download during the first time. Once all files have been successfully downloaded, make will report that there is nothing more to be done, so there is no harm in running it one extra time to be "safe".
(The -k switch tells make to keep downloading the rest of the files even if one single download should fail.)

Curl can also accelerate a download of a file by splitting it into parts:
$ man curl |grep -A2 '\--range'
-r/--range <range>
(HTTP/FTP/SFTP/FILE) Retrieve a byte range (i.e a partial docu-
ment) from a HTTP/1.1, FTP or SFTP server or a local FILE.
Here is a script that will automatically launch curl with the desired number of concurrent processes: https://github.com/axelabs/splitcurl

Starting from 7.68.0 curl can fetch several urls in parallel. This example will fetch urls from urls.txt file with 3 parallel connections:
curl --parallel --parallel-immediate --parallel-max 3 --config urls.txt
urls.txt:
url = "example1.com"
output = "example1.html"
url = "example2.com"
output = "example2.html"
url = "example3.com"
output = "example3.html"
url = "example4.com"
output = "example4.html"
url = "example5.com"
output = "example5.html"

curl and wget cannot download a single file in parallel chunks, but there are alternatives:
aria2 (written in C++, available in Deb and Cygwin repo's)
aria2c -x 5 <url>
axel (written in C, available in Deb repo)
axel -n 5 <url>
wget2 (written in C, available in Deb repo)
wget2 --max-threads=5 <url>
lftp (written in C++, available in Deb repo)
lftp -n 5 <url>
hget (written in Go)
hget -n 5 <url>
pget (written in Go)
pget -p 5 <url>

Run a limited number of process is easy if your system have commands like pidof or pgrep which, given a process name, return the pids (the count of the pids tell how many are running).
Something like this:
#!/bin/sh
max=4
running_curl() {
set -- $(pidof curl)
echo $#
}
while [ $# -gt 0 ]; do
while [ $(running_curl) -ge $max ] ; do
sleep 1
done
curl "$1" --create-dirs -o "${1##*://}" &
shift
done
to call like this:
script.sh $(for i in `seq 1 10`; do printf "http://example/%s.html " "$i"; done)
The curl line of the script is untested.

I came up with a solution based on fmt and xargs. The idea is to specify multiple URLs inside braces http://example.com/page{1,2,3}.html and run them in parallel with xargs. Following would start downloading in 3 process:
seq 1 50 | fmt -w40 | tr ' ' ',' \
| awk -v url="http://example.com/" '{print url "page{" $1 "}.html"}' \
| xargs -P3 -n1 curl -o
so 4 downloadable lines of URLs are generated and sent to xargs
curl -o http://example.com/page{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}.html
curl -o http://example.com/page{17,18,19,20,21,22,23,24,25,26,27,28,29}.html
curl -o http://example.com/page{30,31,32,33,34,35,36,37,38,39,40,41,42}.html
curl -o http://example.com/page{43,44,45,46,47,48,49,50}.html

Bash 3 or above lets you populate an array with multiple values as it expands sequence expressions:
$ urls=( "" http://example.com?page={1..4} )
$ unset urls[0]
Note the [0] value, which was provided as shorthand to make the indices line up with page numbers, since bash arrays autonumber starting at zero. This strategy obviously might not always work. Anyway, you can unset it in this example.
Now you have a an array, and you can verify the contents with declare -p:
$ declare -p urls
declare -a urls=([1]="http://example.com?Page=1" [2]="http://example.com?Page=2" [3]="http://example.com?Page=3" [4]="http://example.com?Page=4")
Now that you have a list of URLs in an array, expand the array into a curl command line:
$ curl $(for i in ${!urls[#]}; do echo "-o $i.html ${urls[$i]}"; done)
The curl command can take multiple URLs and fetch all of them, recycling the existing connection (HTTP/1.1) to a common server, but it needs the -o option before each one in order to download and save each target. Note that characters within some URLs may need to be escaped to avoid interacting with your shell.

I am not sure about curl, but you can do that using wget.
wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains website.org \
--no-parent \
www.website.org/tutorials/html/

Related

How to replace backup file with timestamp in its name without producing duplicates in Linux Bash (shell script)

#!/usr/bin/env bash
# usage: wttr [location], e.g. wttr Berlin, wttr New\ York
# Standard location if no parameters were passed
location=''
language=''
time=`date`
# Expand terminal display
if [ -z "$language" ]; then
language=${LANG%_*}
fi
curl \
-H -x "Accept-Language: ${language}" \
-x wttr.in/"${1:-${location}}" |
head -n 7 |
tee /home/of/weather.txt |
tee -a /home/of/weather.log |
tee /home/of/BACKUP/weather_"$time".txt
#cp weather.txt /home/of/BACKUP
#mv -f /home/of/BACKUP/weather.txt /home/of/BACKUP/weather_"$time".txt
I'm very new to Linux Bash and Shell scripting and can't figure out the following.
I have a problem with the shell script above.
It works fine so far (curling ASCII data from website and writing it to weather.txt and .log).
It is also in set in crontab to run every 5 minutes.
Now I need to make a backup of weather.txt under /home/of/, in /home/of/BACKUP with the filename weather_<timestamp>.txt.
I tried to delete (rm weather*.txt) the old timestamped files in /home/of/BACKUP and then copy and rename the file everytime the cronjob is running.
I tried piping cp and mv and so on but somehow I end up with producing many duplicates as due to the timestamp the filenames are different or nothing at all when I try to delete the content of the folder first.
All I need is ONE backup file of weather.txt as weather_<timestamp>.txt which gets updated every 5 minutes with the actual timestamp bit I can't figure it out.

If I understand your question at all, then simply
rm -f /home/of/BACKUP/weather_*.txt
cp /home/of/weather.txt /home/of/BACKUP/weather_"$time".txt
cp lets you rename the file you are copying to; it doesn't make sense to separately cp and then mv.
For convenience, you might want to cd /home/of so you don't have to spell out the full paths, or put them in a variable.
dir=/home/of
rm -f "$dir"/BACKUP/weather_*.txt
cp "$dir"/weather.txt "$dir"/BACKUP/weather_"$time".txt
If you are running out of the cron of the user named of then your current working directory will be /home/of (though if you need to be able to run the script manually from anywhere, that cannot be guaranteed).
Obviously, make sure the wildcard doesn't match any files you actually want to keep.
As an aside, you can simplify the tee commands slightly. If this should only update the files and not print anything to the terminal, you could even go with
curl \
-H -x "Accept-Language: ${language}" \
-x wttr.in/"${1:-${location}}" |
head -n 7 |
tee /home/of/weather.txt \
>>/home/of/weather.log
I took out the tee to the backup file since you are deleting it immediately after anyway. You could alternatively empty the backup directory first, but then you will have no backups if the curl fails.
If you want to keep printing to the terminal, too, probably run the script with redirection to /dev/null in the cron job to avoid having your email inbox fill up with unread copies of the output.

Bash increase pid kernel to unlimited for huge loop

I've been try to make cURL on a huge loop and I run the cURL into background process with bash, there are about 904 domains that will be cURLed
and the problem is that 904 domains can't all be embedded because of the PID limit on the Linux kernel. I have tried adding pid_max to 4194303 (I read in this discussion Maximum PID in Linux) but after I checked only domain 901 had run in background proccess, before I added pid_max is only around 704 running in the background process.
here is my loop code :
count=0
while IFS= read -r line || [[ -n "$line" ]];
do
(curl -s -L -w "\\n\\nNo:$count\\nHEADER CODE:%{http_code}\\nWebsite : $line\\nExecuted at :$(date)\\n==================================================\\n\\n" -H "X-Gitlab-Event: Push Hook" -H 'X-Gitlab-Token: '$SECRET_KEY --insecure $line >> output.log) &
(( count++ ))
done < $FILE_NAME
Anyone have another solution or fix it to handle huge loop to run cURL into background process ?

a script example.sh can be created
#!/bin/bash
line=$1
curl -s -L -w "\\n\\nNo:$count\\nHEADER CODE:%{http_code}\\nWebsite : $line\\nExecuted at :$(date)\\n==================================================\\n\\n" -H "X-Gitlab-Event: Push Hook" -H 'X-Gitlab-Token: '$SECRET_KEY --insecure $line >> output.log
then the command could be (to limit number of running process at a time to 50)
xargs -n1 -P50 --process-slot-var=count ./example.sh < "$FILE_NAME"

Even if you could run that many processes in parallel, it's pointless - starting that many DNS queries to resolve 900+ domain names in a short span of time will probably overwhelm your DNS server, and having that many concurrent outgoing HTTP requests at the same time will clog your network. A much better approach is to trickle the processes so that you run a limited number (say, 100) at any given time, but start a new one every time one of the previously started ones finishes. This is easy enough with xargs -P.
xargs -I {} -P 100 \
curl -s -L \
-w "\\n\\nHEADER CODE:%{http_code}\\nWebsite : {}\\nExecuted at :$(date)\\n==================================================\\n\\n" \
-H "X-Gitlab-Event: Push Hook" \
-H "X-Gitlab-Token: $SECRET_KEY" \
--insecure {} <"$FILE_NAME" >output.log
The $(date) result will be interpolated at the time the shell evaluates the xargs command line, and there is no simple way to get the count with this mechanism. Refactoring this to put the curl command and some scaffolding into a separate script could solve these issues, and should be trivial enough if it's really important to you. (Rough sketch:
xargs -P 100 bash -c 'count=0; for url; do
curl --options --headers "X-Notice: use double quotes throughout" \
"$url"
((count++))
done' _ <"$FILE_NAME" >output.log
... though this will restart numbering if xargs receives more URLs than will fit on a single command line.)

parallel gnu command in conjunction with piping

I am relatively new to informatics, and have just discovered the virtues of the parallel command. However, I am having trouble using this in conjunction with piping and output.
I am using this command:
parallel -j 2 echo ./hisat2 --dta -p 32 -x path/to/index -U {} | ./samtools view -b - > /path/to/storage/folder/{/.}.bam :::: fs1 > executable.sh
fs1 contains a list of all the files I want to run. executable.sh is the executable command list. I wish for each file listed in fs1 to be individually processed by a program (called hisat2) and the ouput sam file to be converted into bam format with samtools. However, it does not seem to like the piping - it complains with the following:
bash: /path/to/storage/folder/{/.}.bam: No such file or directory
parallel: Warning: Input is read from the terminal. Only experts do this on purpose. Press CTRL-D to exit.
How can I overcome this? Is the only way around this to first process all files to sam, and then parallel bam convert?

You need to quote the pipe and redirection:
parallel -j 2 "./hisat2 --dta -p 32 -x path/to/index -U {} | ./samtools view -b - > /path/to/storage/folder/{/.}.bam" :::: fs1
Use --dry-run to see what would be run:
parallel --dry-run -j 2 "./hisat2 --dta -p 32 -x path/to/index -U {} | ./samtools view -b - > /path/to/storage/folder/{/.}.bam" :::: fs1
(Are you sure samtools is in current dir? Usually that is installed for a wider audience.)
May I suggest you spend an hour walking through man parallel_tutorial? Your command line will love you for it.

Executing several bash scripts simultaneously from one script?

I want to make a bash script that will execute around 30 or so other scripts simultaneously, these 30 scripts all have wget commands iterating through some lists.
I thought of doing something with screen (send ctrl + shift + a + d) or send the scripts to background but really I dont know what to do.
To summarize: 1 master script execution will trigger all other 30 scripts to execute all at the same time.
PS: I've seen the other questions but I don't quite understand how the work or the are a bit more than what I need(expecting a return value, etc)
EDIT:
Small snippet of the script(this part is the one that executes with the config params I specified)
if [ $WP_RANGE_STOP -gt 0 ]; then
#WP RANGE
for (( count= "$WP_RANGE_START"; count< "$WP_RANGE_STOP"+1; count=count+1 ));
do
if cat downloaded.txt | grep "$count" >/dev/null
then
echo "File already downloaded!"
else
echo $count >> downloaded.txt
wget --keep-session-cookies --load-cookies=cookies.txt --referer=server.com http://server.com/wallpaper/$count
cat $count | egrep -o "http://wallpapers.*(png|jpg|gif)" | wget --keep-session-cookies --load-cookies=cookies.txt --referer=http://server.com/wallpaper/$number -i -
rm $count
fi

Probably the most straightforward approach would be to use xargs -P or GNU parallel. Generate the different arguments for each child in the master script. For simplicity's sake, let's say you just want to download a bunch of different content at once. Either of
xargs -P 30 wget < urls_file
parallel -j 30 wget '{}' < urls_file
will spawn up to 30 simultaneous wget processes with different args from the given input. If you give more information about the scripts you want to run, I might be able to provide more specific examples.
Parallel has some more sophisticated tuning options compared to xargs, such as the ability to automatically split jobs across cores or cpus.
If you're just trying to run a bunch of heterogeneous different bash scripts in parallel, define each individual script in its own file, then make each file executable and pass it to parallel:
$ cat list_of_scripts
/path/to/script1 arg1 arg2
/path/to/script2 -o=5 --beer arg3
…
/path/to/scriptN
then
parallel -j 30 < list_of_scripts

Wget Output in Recursive mode

I am using wget -r to download 3 .zip files from a specified webpage. Here is what I have so far:
wget -r -nd -l1 -A.zip http://www.website.com/example
Right now, the zip files all begin with abc_*.zip where * seems to be a random. I want to have the first downloaded file to be called xyz_1.zip, the second to be xyz_2.zip, and the third to be xyz_3.zip.
Is this possible with wget?
Many thanks!

I don't think it's possible with wget alone. After downloading you could use some simple shell scripting to rename the files, like:
i=1; for f in abc_*.zip; do mv "$f" "xyz_$i.zip"; i=$(($i+1)); done

Try to get a listing first and then download each file separately.
let n=1
wget -nv -l1 -r --spider http://www.website.com/example 2>&1 | \
egrep -io 'http://.*\.zip'| \
while read url; do
wget -nd -nv -O $(echo $url|sed 's%^.*/\(.*\)_.*$%\1%')_$n.zip "$url"
let n++
done

I don't think there is a way you can do it within a single wget command.
wget does have a -O option which you can use to tell it which file to output to, but it won't work in your case because multiple files will get concatenated together.
You will have to write a script which renames the files from abc_*.zip to xyz_*.zip after wget has completed.
Alternatively, invoke wget for one zip file at a time and use the -O option.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Parallel download using Curl command line utility - linux

I am not sure about curl, but you can do that using wget. wget \ --recursive \ --no-clobber \ --page-requisites \ --html-extension \ --convert-links \ --restrict-file-names=windows \ --domains website.org \ --no-parent \ www.website.org/tutorials/html/

Related

How to replace backup file with timestamp in its name without producing duplicates in Linux Bash (shell script)

Bash increase pid kernel to unlimited for huge loop

parallel gnu command in conjunction with piping

Executing several bash scripts simultaneously from one script?

Wget Output in Recursive mode

Categories

Resources