wget: don't follow redirects - linux

How do I prevent wget from following redirects?

--max-redirect 0
I haven't tried this; it will either allow no redirects at all or allow an unlimited number.
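For instance (a sketch with a placeholder URL, not tested against a real server):
# Stop at the first redirect instead of following it.
# http://example.com/old is just a placeholder for a URL that redirects.
wget --max-redirect=0 http://example.com/old
# wget prints the 301/302 response headers it received and reports
# "0 redirections exceeded" instead of fetching the redirect target.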

Use curl without -L instead of wget. Omitting that option when using curl prevents the redirect from being followed.
If you use curl -I <URL> then you'll get the headers instead of the redirect HTML.
If you use curl -IL <URL> then you'll get the headers for the URL, plus those for the URL you're redirected to.
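For example (a sketch, again with a placeholder URL):
# Headers only; the redirect is reported (Location header) but not followed.
curl -I http://example.com/old
# Headers for the original URL plus those for the URL it redirects to.
curl -IL http://example.com/old
# Body of the original URL itself; without -L the redirect is not followed,
# so for a 3xx response this is typically just the short redirect HTML.
curl http://example.com/old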

Some versions of wget have a --max-redirect option; see the wget manual.

wget follows up to 20 redirects by default. However, it does not span hosts. If you have asked wget to download example.com, it will not touch any resources at www.example.com. wget will detect this as a request to span to another host and decide against it.
In short, you should probably be executing:
wget --mirror www.example.com
Rather than
wget --mirror example.com
Now let's say the owner of www.example.com has several subdomains at example.com and we are interested in all of them. How to proceed?
Try this:
wget --mirror --domains=example.com example.com
wget will now visit all subdomains of example.com, including m.example.com and www.example.com.
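If one of those subdomains should be skipped, wget also has an --exclude-domains option. A sketch (untested; m.example.com is just an illustrative subdomain):
# Mirror example.com and its subdomains, but leave out m.example.com.
wget --mirror --domains=example.com --exclude-domains=m.example.com example.com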

In general, it is not a good idea to depend on a specific number of redirects.
For example, to download IntelliJ IDEA, the URL that is promised to always resolve to the latest Community Edition for Linux is something like https://download.jetbrains.com/product?code=IIC&latest&distribution=linux, but if you visit that URL today you will be redirected twice before you reach the actual downloadable file. In the future you might be redirected three times, or not at all.
The way to solve this problem is to use the HTTP HEAD verb. Here is how I solved it in the case of IntelliJ IDEA:
# This is the starting URL.
URL="https://download.jetbrains.com/product?code=IIC&latest&distribution=linux"
echo "URL: $URL"
# Issue a HEAD request; wget follows the redirects itself until it reaches the actual target.
# The resulting log line contains the target location, among some irrelevant stuff.
LOC=$(wget --no-verbose --method=HEAD --output-file=- "$URL")
echo "LOC: $LOC"
# Extract the URL from the result, stripping the irrelevant stuff.
URL=$(cut --delimiter=' ' --fields=4 <<< "$LOC")
echo "URL: $URL"
# Optional: download the actual file.
wget "$URL"

Related

Get files from a page that is redirected using wget

I am trying to download the contents of http://someurl.com/123 using wget. The problem is that http://someurl.com/123 is redirected to http://someurl.com/456.
I already tried using --max-redirect 0 and this is my result:
wget --max-redirect 0 http://someurl/123
...
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://someurl.com/456 [following]
0 redirections exceeded.
Is there a way to get the actual contents of http://someurl.com/123?
By default, wget can follow up to 20 redirections, but since you set --max-redirect to 0, it reported "0 redirections exceeded" and stopped.
Try this command:
wget --content-disposition http://someurl.com/123
--max-redirect=number
Specifies the maximum number of redirections to follow for a resource. The default is 20, which is usually far more than necessary. However, on those occasions where you want to allow more (or fewer), this is the option to use.
--content-disposition
If this is set to on, experimental (not fully-functional) support for Content-Disposition headers is enabled. This can currently result in extra round-trips to the server for a HEAD request, and is known to suffer from a few bugs, which is why it is not currently enabled by default.
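Putting the two quoted options together, a sketch (not tested against that server): allow the single 301 hop and let the Content-Disposition header name the saved file.
# Follow at most one redirection and honour the server-supplied filename.
wget --max-redirect=1 --content-disposition http://someurl.com/123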

wget: obtaining files matching regex

According to the wget man page, --accept-regex is the argument to use when I need to selectively transfer files whose names match a certain regular expression. However, I am not sure how to use --accept-regex.
Assuming I want to obtain files diffs-000107.tar.gz, diffs-000114.tar.gz, diffs-000121.tar.gz, diffs-000128.tar.gz in IMDB data directory ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/. "diffs\-0001[0-9]{2}\.tar\.gz" seems to be an ok regex to describe the file names.
However, when executing the following wget command
wget -r --accept-regex='diffs\-0001[0-9]{2}\.tar\.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/
wget indiscriminately acquires all files in the ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/ directory.
I wonder if anyone could tell what I have possibly done wrong?
Be careful: --accept-regex matches against the complete URL. But our target is some specific files, so we will use -A instead.
For example,
wget -r -np -nH -A "IMG[012][0-9].jpg" http://x.com/y/z/
will download all the files from IMG00.jpg to IMG29.jpg from the URL.
Note that a matching pattern contains shell-like wildcards, e.g. ‘books’ or ‘zelazny196[0-9]*’.
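Applied to the files in the question, a sketch (untested; -np and -nH are taken from the example above to avoid climbing to the parent directory and creating a host directory):
wget -r -np -nH -A 'diffs-0001[0-9][0-9].tar.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/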
reference:
wget manual: https://www.gnu.org/software/wget/manual/wget.html
regex: https://regexone.com/
I'm reading in the wget man page:
--accept-regex urlregex
--reject-regex urlregex
Specify a regular expression to accept or reject the complete URL.
and noticing that it mentions the complete URL (e.g. something like ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/diffs-000121.tar.gz)
So I suggest (without having tried it) to use
--accept-regex='.*diffs\-0001[0-9][0-9]\.tar\.gz'
(and perhaps give the appropriate --regex-type too)
BTW, for such tasks, I would also consider using some scripting language à la Python (or use libcurl or curl)
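Putting that suggestion together with the original command, a sketch (untested; the regex now matches the complete URL):
# --regex-type=pcre could be added if your wget build supports PCRE,
# but the pattern below works as a POSIX regex too.
wget -r --accept-regex='.*diffs-0001[0-9][0-9]\.tar\.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/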

Wget Macro for downloading multiple URLS?

(NOTE: You need at least 10 reputation to post more than 2 links, so I had to remove the http prefixes and the real URLs, but it's still understandable, I hope!)
Hello!
I am trying to Wget an entire website for personal educational use. Here's what the URL looks like:
example.com/download.php?id=1
I want to download all the pages from 1 to the last page, which is 4952.
So the first URL is:
example.com/download.php?id=1
and the last is
example.com/download.php?id=4952
What would be the most efficient method to download the pages from 1 to 4952?
My current command is (it's working perfectly fine, the exact way I want it to):
wget -P /home/user/wget -S -nd --reject=.rar http://example.com/download.php?id=1
NOTE: The website has a troll: if you try to run the following command:
wget -P /home/user/wget -S -nd --reject=.rar --recursive --no-clobber --domains=example.com --no-parent http://example.com/download.php
it will download a 1000GB .rar file just to troll you!!!
I'm new to Linux, please be nice! I'm just trying to learn!
Thank you!
Notepad++ =
your URL + Column Editor = massive list of all URLs
wget -i your_file_with_all_urls = success!
thanks to Barmar
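If you would rather stay on the command line, a sketch of the same idea in plain bash (untested): generate the 4952 URLs with seq and feed the list to wget's -i option.
# Generate the full list of page URLs (ids 1 through 4952).
for i in $(seq 1 4952); do
  echo "http://example.com/download.php?id=$i"
done > urls.txt
# Reuse the working single-page command, but read the URLs from the list.
wget -P /home/user/wget -S -nd --reject=.rar -i urls.txt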

wget: don't follow redirects if it points to location X, otherwise do

As I read above, redirection can easily be turned off with --max-redirect 0. But what about the case where there are two kinds of redirection: a good one and a bad one?
In my case good redirection is:
http://someaddres.com/888.html -> http://someaddres.com/some-string-in-url-describing-page.html
where bad redirection is:
http://someaddres.com/555.html -> http://someaddres.com/
What can I do to follow only the good redirections?
The only way I can think of off the top of my head would be to turn off redirects as you've said, then parse the response (I suggest using sed or grep, but I'm sure there are other options) looking for a redirect request. You may need the parameter --server-response to get the headers depending on the method used for redirecting. If you find one, do a new wget to the redirect target (unless it's one you don't want to redirect to).
As Thor84no said, one solution is to parse the response. This is mine:
# Ask wget not to follow the redirect, then pull the target out of the Location header.
REDIRECTED_TO=$(wget --max-redirect 0 "$ADDRESS" 2>&1 | grep "Location" | sed 's|.*\(http://.*/.*\) .*|\1|')
# Fetch the target only if it is not the known bad redirection.
if [ "$REDIRECTED_TO" != "$BAD_REDIRECTION" ]; then wget "$REDIRECTED_TO"; fi

How do I pull image links from a website and download them using wget?

I really want to download images from a website, but I don't know enough about wget to do so. They host the images on a separate website, so how do I pull the image links from the page using cat or something, so that I can use wget to download them all? All I know is the wget part. An example would be Reddit.com.
wget -i download-file-list.txt
Try this:
wget -r -l 1 -A jpg,jpeg,png,gif,bmp -nd -H http://reddit.com/some/path
It will recurse one level deep starting from the page http://reddit.com/some/path, will not create a directory structure (if you want directories, remove the -nd), and will only download files ending in "jpg", "jpeg", "png", "gif", or "bmp". The -H option lets it span hosts.
I would use the Perl module WWW::Mechanize. The following dumps all links to stdout:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->get("URL");
$mech->dump_links(undef, 'absolute' => 1);
Replace URL with the actual URL you want.
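To feed those links straight into wget, a sketch (untested; dump_links.pl is a hypothetical name for a file containing the snippet above):
# Keep only links that look like images, then hand the list to wget.
perl dump_links.pl | grep -Ei '\.(jpe?g|png|gif|bmp)$' > image-urls.txt
wget -i image-urls.txt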
