I have tried hard with different options, but I still can't wget
http://auno.org/ao/nanos.php?prof=nano-technician
I would like to get all the little .gif images.
My current request is:
wget -A.gif http://auno.org/ao/nanos.php?prof=nano-technician/
I have tried many options such as -O or -U firefox, etc.
Thanks for reading.
This way you're only downloading the requested page, nothing else. If you want to download more than that, you must turn on recursive downloading (-r).
Also, as the images are on different hosts, you may want to enable host spanning (-H), and when you do that you should also specify a restrictive recursion limit to avoid downloading half the internet (-l):
wget -A .gif -r -l 1 -H http://auno.org/ao/nanos.php?prof=nano-technician/
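If host spanning pulls in files from hosts you don't care about, you can additionally whitelist domains with -D. A hedged variant (the second domain is only a placeholder for wherever the images are actually hosted; quoting the URL also keeps the shell from globbing the question mark):
wget -A .gif -r -l 1 -H -D auno.org,img.example.com 'http://auno.org/ao/nanos.php?prof=nano-technician'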
I checked the Internet Archive's original site, and they mention a couple of steps to follow there, which include using the wget utility via Cygwin on Windows. I followed those steps: I ran an advanced search, extracted the CSV file, converted it to .txt, and then tried to run the following command:
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
The terminal emulator then gets stuck, and no log message or even an error message appears to indicate any practical progress. I want to know what I have done wrong so far.
The ia command-line tool is the official way to do this. If you can craft a search term that captures all your items, you can have ia download everything that matches.
For example:
ia download --search 'creator:Hamilton Commodore User Group'
will download all of the items attributed to this (now defunct) computer user group. This is a live, working query that downloads roughly 8.6 MB of data for 40 Commodore 64 disk images.
It will also download from an itemlist, as above.
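As a minimal sketch of the itemlist variant (assuming itemlist.txt holds one identifier per line):
ia download --itemlist=itemlist.txt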
After some time I figured out how to resolve this matter. The commands posted in the Internet Archive help blog are general commands meant to demonstrate the wget utility; the options we actually need here are simply the following:
--cut-dirs=1
-A .pdf,.epub
-e robots=off
-i ./itemlist.txt
and of course the source URL base:
-B 'http://archive.org/download/'
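Assembled into a single command, and keeping the recursive options from the original help-page example (with the quote around the base URL closed), that would look something like:
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'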
I have to create a full website copy / mirror of https://www.landesmuseum-mecklenburg.de. This copy has to run locally on a Windows 7 system that has no network connection. Windows 7 has a path length limitation (255 characters), which is exceeded in my case. How do I circumvent this?
I create a static website copy on a Debian system with: wget --mirror --adjust-extension --convert-links --restrict-file-names=windows --page-requisites -e robots=off --no-clobber --no-parent --base=./ https://www.landesmuseum-mecklenburg.de/.
This way I got nearly everything I need. Some special URLs / images are downloaded via a separate URL list with: wget -c -x -i imagelist.txt
After I create an archive from these files, transfer it to my Windows 7 test system, and extract it into a local web server (called "MiniWebserver"), I can visit http://localhost/ and everything seems to work.
But some deep links, especially images, have a path length of over 255 characters in the Windows NTFS filesystem. None of these images are displayed in the local web pages.
I tried the -nH --cut-dirs=5 option of wget, but with no acceptable result (the index.html gets overwritten each time).
Another idea was to use the DOS short-name compatibility feature, so that long directory names would be translated to 8-character names, e.g. longdirname translated to longdi~1. But I have no idea how to automate this.
Update: Yet another idea
One more thing that came to my mind was to hash the entire path (e.g. with md5) and use the hash instead of the full path + filename. Additionally, all URLs in the downloaded .html files would have to be rewritten too. But again: I have no idea how to accomplish this with Debian command-line tools.
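For what it's worth, here is a rough sketch of that hashing idea using standard Debian tools. It assumes the mirror sits in ./mirror; the 200-character threshold and the _short directory name are arbitrary choices, not anything wget produces:
cd mirror
mkdir -p _short
find . -type f | while IFS= read -r f; do
  rel="${f#./}"
  # only shorten paths that would exceed the Windows limit
  [ "${#rel}" -le 200 ] && continue
  hash=$(printf '%s' "$rel" | md5sum | cut -d' ' -f1)
  ext="${rel##*.}"
  short="_short/$hash.$ext"
  mv "$f" "$short"
  # rewrite references in the HTML files; paths containing regex
  # metacharacters, and relative links produced by --convert-links,
  # would need extra handling here
  grep -rlF --include='*.html' "$rel" . | xargs -r sed -i "s|$rel|$short|g"
done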
You can try the following:
wget --trust-server-names <url>
--trust-server-names
If this is set to on, on a redirect the last component of the
redirection URL will be used as the local file name. By default, the
last component of the original URL is used.
I want to download the contents of a website where the URLs are built as
http://www.example.com/level1/level2?option1=1&option2=2
Within the URL, only the http://www.example.com/level1/level2 part is unique for each page, while the values of option1 and option2 change. In fact, every unique page can appear under hundreds of different URLs because of these variables. I am using wget to fetch all the site's content, and because of this I have already downloaded more than 3 GB of data. Is there a way to tell wget to ignore everything after the URL's question mark? I can't find it in the man pages.
You can use --reject-regex to specify the pattern to reject the specific URL addresses, e.g.
wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/
This will mirror the website, but it will ignore the addresses with a question mark - useful for mirroring wiki sites.
wget2 has this built in via options --cut-url-get-vars and --cut-file-get-vars.
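An invocation using those options might look roughly like this (hedged; wget2 has to be installed separately, and the URL is the one from the question):
wget2 -r --cut-url-get-vars --cut-file-get-vars http://www.example.com/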
This does not help in your case, but for those who have already downloaded all of these files: you can quickly rename them to remove the question mark and everything after it as follows:
rename -v -n 's/[?].*//' *[?]*
The above command does a trial run and shows you how files will be renamed. If everything looks good with the trial run, then run the command again without the -n (nono) switch.
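Once the trial run looks right, the actual rename is the same command without -n:
rename -v 's/[?].*//' *[?]*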
Problem solved. I noticed that the URLs I want to download are all search-engine friendly, with descriptions formed using dashes:
http://www.example.com/main-topic/whatever-content-in-this-page
All other URLs had references to the CMS. I got everything I needed with
wget -r http://www.example.com -A "*-*"
This did the trick. Thanks for sharing your thoughts!
@kenorb's answer using --reject-regex is good, but it did not work for me on an older version of wget. Here is the equivalent using wildcards, which works with GNU Wget 1.12:
wget --reject "*\?*" -m -c --content-disposition http://example.com/
How can I make wget save only certain file types that are linked from pages linked to by the target page, regardless of which domain those files are on?
I'm trying to speed up a task I have to do often.
I've been rooting through the wget docs and googling, but nothing seems to work. I keep getting either just the target page or the subpages without the files (even when using -H), so I'm obviously doing something wrong.
So, essentially, example.com/index1/ contains links to example.com/subpage1/ and example.com/subpage2/, while the subpages contain links to example2.com/file.ext and example2.com/file2.ext, etc. However, example.com/index1.html may link to example.com/index2/ which has links to more subpages I don't want.
Can wget even do this, and if not then what do you suggest I use? Thanks.
The following command worked for me.
wget -r --accept "*.ext" --level 2 "example.com/index1/"
It needs to be done recursively, so -r should be added.
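Since the question says the files sit on a second domain (example2.com), host spanning will likely also be needed, restricted to the two domains so the crawl doesn't wander off. A hedged variant:
wget -r -l 2 -H -D example.com,example2.com -A "*.ext" "http://example.com/index1/"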
Something like this should work:
wget --accept "*.ext" --level 2 "example.com/index1/"
Can wget be used to get all the files on a server? Suppose my site foo.com uses the Django framework, and this is the directory structure:
/web/project1
/web/project2
/web/project3
/web/project4
/web/templates
Without knowing the names of the directories (project1, project2, ...), is it possible to download all the files?
You could use
wget -r -np http://www.foo.com/pool/main/z/
-r (fetch files/folders recursively)
-np (do not descend to the parent directory when retrieving recursively)
or
wget -nH --cut-dirs=2 -r -np http://www.foo.com/pool/main/z/
--cut-dirs (makes wget not "see" the given number of remote directory components)
-nH (invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.)
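To make the difference concrete, a file at http://www.foo.com/pool/main/z/foo.txt (foo.txt is just a placeholder name) would be saved as:
www.foo.com/pool/main/z/foo.txt   (plain -r)
pool/main/z/foo.txt               (-r -nH)
z/foo.txt                         (-r -nH --cut-dirs=2)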
First of all, wget can only be used to retrieve files served by the web server. It's not clear from your question whether you mean actual files on disk or the web pages served by Django. I would guess from the way you phrased your question that your intent is to download the server's source files, not the pages Django serves. If this is correct, then no, wget won't work. You need to use something like rsync or scp.
If you do mean using wget to retrieve all of the generated pages from Django, then this will only work if links point to those directories. So, you need a page that has code like:
<ul>
<li><a href="/web/project1/">Project1</a></li>
<li><a href="/web/project2/">Project2</a></li>
<li><a href="/web/project3/">Project3</a></li>
<li><a href="/web/project4/">Project4</a></li>
<li><a href="/web/templates/">Templates</a></li>
</ul>
wget is not a psychic; it can only pull in pages it knows about.
Try recursive retrieval: the -r option.
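Applied to the layout in the question, a hedged starting point would be the following, assuming the server actually exposes an index page or links under /web/:
wget -r -np http://www.foo.com/web/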