(NOTE: You need at least 10 reputation to post more than 2 links. I had to remove the http:// prefixes from the URLs, but it's still understandable, I hope!)
Hello!
I am trying to wget an entire website for personal educational use. Here's what the URL looks like:
example.com/download.php?id=1
I want to download all the pages from 1 to the last page, which is 4952,
so the first URL is:
example.com/download.php?id=1
and the last is
example.com/download.php?id=4952
What would be the most efficient way to download pages 1 through 4952?
My current command is (it's working perfectly fine, exactly the way I want it to):
wget -P /home/user/wget -S -nd --reject=.rar http://example.com/download.php?id=1
NOTE: The website has a troll: if you try to run the following command:
wget -P /home/user/wget -S -nd --reject=.rar --recursive --no-clobber --domains=example.com --no-parent http://example.com/download.php
it will download a 1000 GB .rar file just to troll you!
I'm new to Linux, so please be nice! I'm just trying to learn!
Thank you!
Notepad++ =
your URL + Column Editor = massive list of all URLs
wget -i your_file_with_all_urls = Success!
thanks to Barmar
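If you'd rather skip Notepad++, the same URL list can be generated entirely in the shell with seq and sed (the download directory and reject pattern below are copied from the question's command; adjust as needed):

```shell
# Build the list of 4952 URLs, one per line
seq 1 4952 | sed 's|^|http://example.com/download.php?id=|' > urls.txt

head -n 1 urls.txt   # http://example.com/download.php?id=1
tail -n 1 urls.txt   # http://example.com/download.php?id=4952

# Then feed the list to wget (not run here):
# wget -P /home/user/wget -S -nd --reject=.rar -i urls.txt
```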
Related
I need to download the most recent file, based on the file creation time, from a remote site using curl. How can I achieve it?
These are the files on the remote site:
user-producer-info-etl-2.0.0-20221213.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221212.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221214.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221215.111513-53-exec.jar
The most recent file, which I want to download, is user-producer-info-etl-2.0.0-20221215.111513-53-exec.jar. How can I achieve it?
Luckily for you, the file names contain dates that are alphabetically sortable!
I don't know what environment you have, so I'm guessing you have at least a shell, and I propose this bash answer.
First, get the last file name:
readonly endpoint="https://your-gitlab.local"
# Get the last filename
readonly most_recent_file="$(curl -s "${endpoint}/get_list_uri" | sort | tail -n 1)"
# Download it
curl -LOs "${endpoint}/get/${most_recent_file}"
You will obviously need to replace the URLs accordingly, but I'm sure you get the idea.
-L : follow HTTP redirects (e.g. 302)
-O : download the file to the local dir, keeping the remote name as is
-s : silent; don't show progress or timing info
You can also specify a different local name with -o <the_most_recent_file>
for more info:
man curl
hth
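You can sanity-check the sort | tail logic locally with the file names from the question:

```shell
# The dates embedded in the names sort alphabetically, so plain sort works
printf '%s\n' \
    user-producer-info-etl-2.0.0-20221213.111513-53-exec.jar \
    user-producer-info-etl-2.0.0-20221212.111513-53-exec.jar \
    user-producer-info-etl-2.0.0-20221214.111513-53-exec.jar \
    user-producer-info-etl-2.0.0-20221215.111513-53-exec.jar \
    | sort | tail -n 1
# prints user-producer-info-etl-2.0.0-20221215.111513-53-exec.jar
```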
when I use this to download a file from an ftp server:
wget ftp://blah:blah@ftp.haha.com/"$(date +%Y%m%d -d yesterday)-blah.gz" /myFolder/Documents/"$(date +%Y%m%d -d yesterday)-blah.gz"
It says "20131022-blah.gz saved" (it downloads fine), however I get this:
/myFolder/Documents/20131022-blah.gz: Scheme missing (I believe this error prevents it from saving the file in /myFolder/Documents/).
I have no idea why this is not working.
Save the filename in a variable first:
OUT=$(date +%Y%m%d -d yesterday)-blah.gz
and then use -O switch for output file:
wget ftp://blah:blah@ftp.haha.com/"$OUT" -O /myFolder/Documents/"$OUT"
Without the -O, the output file name looks like a second file/URL to fetch, but it's missing http:// or ftp:// or some other scheme to tell wget how to access it. (Thanks @chepner)
Also note: if the two $(date ...) substitutions happen to straddle a date rollover, the filename being saved will differ from the filename being downloaded. Capturing the value once in a variable avoids this.
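A quick way to confirm the variable holds what you expect before involving the FTP server (GNU date assumed; the credentials and paths are placeholders):

```shell
# Evaluate the date exactly once and reuse it everywhere
OUT="$(date +%Y%m%d -d yesterday)-blah.gz"
echo "$OUT"   # e.g. 20131022-blah.gz: eight digits, then -blah.gz

# Then download (not run here):
# wget "ftp://user:pass@ftp.haha.com/$OUT" -O "/myFolder/Documents/$OUT"
```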
In my case I had it working with the npm module http-server, and I discovered that I simply had a leading space before http://.
So this was wrong " http://localhost:8080/archive.zip".
Changed to working solution "http://localhost:8080/archive.zip".
In my case, in cPanel, I used:
wget https://www.blah.com.br/path/to/cron/whatever
I'm trying to enter a number on this website and get the result with wget.
[http://]
I've tried wget --cookies=on --save-cookies=site.txt URL to save the session ID/cookie,
then went on with wget --cookies=on --keep-session-cookies --load-cookies=site.txt URL, but with no luck.
I also tried monitoring the POST data sent with Wireshark and replicating it with wget --post-data, --referer, etc., but also without luck.
Anyone who has an easy way of doing this? I'm all ears! :)
All help is much appreciated.
Thank you.
The trick is to send the second request to http://202.91.22.186/dms-scw-dtac_th/page/?wicket:interface=:0:2:::0:
With my Xidel it seems to work like this (I don't have a valid number to test with):
xidel "http://202.91.22.186/dms-scw-dtac_th/page/" -f '{"post": "number=0812345678", "url": resolve-uri("?wicket:interface=:0:2:::0:")}' -e "#id5"
I really want to download images from a website, but I don't know enough about wget to do so. They host the images on a separate website; how do I pull the image links from the page, using cat or something, so that I can use wget to download them all? All I know is the wget part. An example would be Reddit.com.
wget -i download-file-list.txt
Try this:
wget -r -l 1 -A jpg,jpeg,png,gif,bmp -nd -H http://reddit.com/some/path
It will recurse 1 level deep starting from the page http://reddit.com/some/path; it will not create a directory structure (if you want directories, remove the -nd); it will only download files ending in "jpg", "jpeg", "png", "gif", or "bmp"; and it will span hosts.
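If you want to extract the image links yourself, as the question hints at, a rough shell sketch: save the page, grep anything that looks like an absolute image URL out of the HTML, and feed the list to wget -i. The sample HTML and example.com URLs below are made up for illustration:

```shell
# Pretend this file was saved with: wget -qO page.html http://reddit.com/some/path
cat > page.html <<'EOF'
<a href="https://i.example.com/a.jpg">photo</a>
<img src="https://i.example.com/b.png">
<a href="https://example.com/about.html">about</a>
EOF

# Pull out anything that looks like an absolute image URL
grep -oE 'https?://[^"]+\.(jpg|jpeg|png|gif|bmp)' page.html | sort -u > images.txt
cat images.txt
# https://i.example.com/a.jpg
# https://i.example.com/b.png

# Then download the list (not run here):
# wget -i images.txt
```

This is cruder than a real HTML parser (it only catches quoted absolute URLs), but it is often enough for a one-off scrape.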
I would use the Perl module WWW::Mechanize. The following dumps all links to stdout:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->get("URL");
$mech->dump_links(undef, 1);  # 1 = print absolute URLs
Replace URL with the actual URL you want.
How do I prevent wget from following redirects?
--max-redirect 0
I haven't tried this; it will either allow none or allow infinitely many.
Use curl without -L instead of wget. Omitting that option when using curl prevents the redirect from being followed.
If you use curl -I <URL> then you'll get the headers instead of the redirect HTML.
If you use curl -IL <URL> then you'll get the headers for the URL, plus those for the URL you're redirected to.
Some versions of wget have a --max-redirect option; see the wget documentation.
wget follows up to 20 redirects by default. However, it does not span hosts. If you have asked wget to download example.com, it will not touch any resources at www.example.com. wget will detect this as a request to span to another host and decide against it.
In short, you should probably be executing:
wget --mirror www.example.com
Rather than
wget --mirror example.com
Now let's say the owner of www.example.com has several subdomains at example.com and we are interested in all of them. How to proceed?
Try this:
wget --mirror --domains=example.com example.com
wget will now visit all subdomains of example.com, including m.example.com and www.example.com.
In general, it is not a good idea to depend on a specific number of redirects.
For example, in order to download IntelliJ IDEA, the URL that is promised to always resolve to the latest version of the Community Edition for Linux is something like https://download.jetbrains.com/product?code=IIC&latest&distribution=linux, but if you visit that URL nowadays, you are redirected twice before you reach the actual downloadable file. In the future you might be redirected three times, or not at all.
The way to solve this problem is with the use of the HTTP HEAD verb. Here is how I solved it in the case of IntelliJ IDEA:
# This is the starting URL.
URL="https://download.jetbrains.com/product?code=IIC&latest&distribution=linux"
echo "URL: $URL"
# Issue HEAD requests until the actual target is found.
# The result contains the target location, among some irrelevant stuff.
LOC=$(wget --no-verbose --method=HEAD --output-file - "$URL")
echo "LOC: $LOC"
# Extract the URL from the result, stripping the irrelevant stuff.
URL=$(cut --delimiter=" " --fields=4 <<< "$LOC")
echo "URL: $URL"
# Optional: download the actual file.
wget "$URL"