I am trying to use wget to mirror a full website I have created. I am wondering why this bit in the terminal won't work:
wget --mirror-no-check-certificate [my site goes here]
I am also trying to get a site that doesn't contain a robots.txt, so if anyone knows a workaround that would be great.
Thanks in advance, guys.
This doesn't work because there is no option called --mirror-no-check-certificate.
You may have intended to use the two separate options --mirror and --no-check-certificate:
wget --mirror --no-check-certificate [your site goes here]
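As for the robots.txt part: wget honours robots.txt during recursive downloads, but you can tell it not to with -e robots=off, which passes a wgetrc setting on the command line. A sketch of the combined command, with the URL still a placeholder:
wget --mirror --no-check-certificate -e robots=off [your site goes here]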
Related
I need to download and archive about 50 subsites (including all working links within each subsite) that were created as part of my company's main portal. I need wget to download the subsites without downloading the entire site.
From the bit of searching I've done, this is what I've tried so far:
wget --mirror --page-requisites --convert-links --recursive --adjust-extension --compression=auto --reject-regex "/search|/rss" --no-if-modified-since --no-check-certificate --user=xxxxxxx --password=xxxxxxx
This instead downloaded only the home page of every subsite, without any of the actual links working.
You should add --no-parent to restrict the download to the part of the site you want.
An example line would be wget --mirror --convert-links --page-requisites --no-parent -P /path/to/download https://example-domain.com
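Applied to the option set you already tried, a sketch for a single subsite might look like this; the subsite URL and download path are placeholders, and you may not need every option (for instance --recursive is already implied by --mirror):
wget --mirror --page-requisites --convert-links --adjust-extension --no-parent --no-check-certificate --user=xxxxxxx --password=xxxxxxx -P /path/to/download https://portal.example.com/subsite1/
Run it once per subsite URL, or put the URLs in a file and pass them with -i.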
Sometimes wget will refuse to download the specified file. By adding --no-check-certificate, I am often able to download the file anyway.
1) Briefly, what is this certificate which wget checks by default? How does it perform this check?
2) Does the need of --no-check-certificate for some particular URL vary from machine to machine? That is, if I'm able to download some file using wget www.website/file, can I be sure that my friend using some other machine can do the same, also without the --no-check-certificate option?
When hitting a website that is secure (HTTPS), wget will attempt to validate the certificate. In order to trust certificates, wget needs access to a certificate store, which is essentially an SSL directory of trusted certs (see here for more info: https://wiki.openwrt.org/doc/howto/wget-ssl-certs). This can be bypassed, as you have seen, by using the --no-check-certificate option.
Using the --no-check-certificate option should work regardless of which machine you are running wget from. That option is specific to the wget program itself and is not machine dependent.
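If you would rather keep certificate checking enabled, wget can also be pointed at a CA bundle explicitly; a sketch, where the .pem path is just a placeholder:
wget --ca-certificate=/path/to/ca-bundle.pem www.website/file
That way the download only succeeds if the server's certificate actually chains up to the CA you supplied.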
I am trying to get an SVN repository using wget. I have:
wget -m -np http://svn.wikia-code.com/wikia/releases/release-126/
This does not work and starts to pull down everything above too. I just want the release 126 folder and everything it contains.
What is the correct syntax please?
SOLUTION found by me:
wget -r -np http://svn.wikia-code.com/wikia/releases/release-126/
You are getting higher-level directories because the -m switch for wget follows all links. The first link on the directory listing points to the parent directory.
As others have correctly pointed out, you should be checking out the code with svn co, or following the project's instructions for acquiring & installing the software.
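For reference, a plain Subversion checkout of just that folder would look something like this (the local directory name is up to you):
svn checkout http://svn.wikia-code.com/wikia/releases/release-126/ release-126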
I want to download all these RPMs from SourceForge in one go with wget:
Link
How do I do this?
Seeing how, for example, HeaNet is one of the SF mirrors hosting this project (and many others), you could find out where SF redirects you, specifically:
http://ftp.heanet.ie/mirrors/sourceforge/h/project/hp/hphp/CentOS%205%2064bit/SRPM/
... and download that entire directory with the -r option (and probably the --no-parent switch, too).
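For example, something along these lines should grab just the RPMs from that mirror directory (untested; -np stops wget ascending to parent directories, -nd keeps it from recreating the directory tree locally, and -A restricts the accepted files to .rpm):
wget -r -np -nd -A '*.rpm' http://ftp.heanet.ie/mirrors/sourceforge/h/project/hp/hphp/CentOS%205%2064bit/SRPM/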
One of two ways:
Create a script that parses the HTML file, extracts the links that end in .rpm, and downloads each of those links with wget $URL; a rough sketch follows below.
Or start copying and pasting those URLs and run wget $URL from the console.
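A rough sketch of the scripted approach, assuming the page lists absolute links that end in .rpm (the grep pattern may need adjusting to the actual HTML):
PAGE_URL='http://ftp.heanet.ie/mirrors/sourceforge/h/project/hp/hphp/CentOS%205%2064bit/SRPM/'
wget -qO- "$PAGE_URL" | grep -oE 'href="[^"]*\.rpm"' | sed -e 's/^href="//' -e 's/"$//' | wget -i -
The first wget dumps the listing page to stdout, grep and sed pull out the .rpm URLs, and the second wget reads them from standard input via -i -.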
Suppose I have a directory accessible via HTTP, e.g.
http://www.abc.com/pdf/books
Inside the folder I have many PDF files.
Can I use something like
wget http://www.abc.com/pdf/books/*
wget -r -l1 -A.pdf http://www.abc.com/pdf/books
From the wget man page:
Wget can follow links in HTML and XHTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as "recursive downloading." While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). Wget can be instructed to convert the links in downloaded HTML files to the local files for offline viewing.
and
Recursive Retrieval Options
-r
--recursive
Turn on recursive retrieving.
-l depth
--level=depth
Specify recursion maximum depth level depth. The default maximum depth is 5.
It depends on the web server and its configuration. Strictly speaking, a URL is not a directory path, so http://something/books/* is meaningless.
However, if the web server serves http://something/books as an index page listing all the books on the site, then you can play around with the recursive and spider options, and wget will happily follow any links in that index page.
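In that case the command already given above, with --no-parent added so wget stays inside the books directory, is a reasonable sketch (note the trailing slash, so books/ rather than pdf/ is treated as the no-parent boundary):
wget -r -l1 -np -A.pdf http://www.abc.com/pdf/books/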