How to bulk download files from the Internet Archive - Cygwin

I checked the Internet Archive's own site, and it describes a couple of steps to follow, which include using the wget utility through Cygwin on Windows. I followed those steps: I ran an advanced search, extracted the CSV file, converted it to .txt, and then tried to run the following command:
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/
The terminal emulator just gets stuck afterwards; no log message or even an error message appears to indicate any progress. I want to know what I have done wrong so far.

The ia command-line tool is the official way to do this. If you can craft a search term that captures all your items, you can have ia download everything that matches.
For example:
ia download --search 'creator:Hamilton Commodore User Group'
will download all of the items attributed to this (now defunct) computer user group. This is a live, working query that downloads roughly 8.6 MB of data for 40 Commodore 64 disk images.
It will also download from an itemlist, as above.
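For an itemlist like the one in the question, that would be roughly the following; the --itemlist option name here is taken from the tool's help output, so check ia download --help for the version you have installed:
ia download --itemlist=itemlist.txt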

After some time I figured out how to resolve this. The commands posted in the Internet Archive help blog are general commands meant to illustrate the wget utility; the options we actually need here are simply the following:
--cut-dirs=1
-A .pdf,.epub
-e robots=off
-i ./itemlist.txt
and of course the base URL:
-B 'http://archive.org/download/'
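Assembled into a single command (note that the closing single quote after the -B URL was missing in the original, which by itself leaves the shell waiting for more input), a working invocation might look like this:
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'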

Related

wget: downloading all files in directories/subdirectories

Basically, on a webpage there is a list of directories, and each of these has further subdirectories. The subdirectories contain a number of files, and I want to download, to a single location on my Linux machine, the one file from each subdirectory whose name contains the letter sequence 'RMD'.
E.g., say the main webpage links to directories dir1, dir2, dir3..., and each of these has subdirectories dir1a, dir1b..., dir2a, dir2b... etc. I want to download files of the form:
webpage/dir1/dir1a/file321RMD210
webpage/dir1/dir1b/file951RMD339
...
webpage/dir2/dir2a/file416RMD712
webpage/dir2/dir2b/file712RMD521
The directories/subdirectories are not sequentially numbered like in the above example (that was just me making it simpler to read) so is there a terminal command that will recursively go through each directory and subdirectory and download every file with the letters 'RMD' in the file name?
The website in question is: here
I hope that's enough information.
An answer with a lot of remarks:
In case the website supports FTP, you had better use MichaelBaldry's answer. This answer aims to give a way to do it with wget (but this is less efficient for both server and client).
Only if the website provides a directory listing can you use the -r flag for this (the -r flag aims to find links in web pages and then downloads those pages as well).
The following method is inefficient for both server and client and can result in a huge load if the pages are generated dynamically. The website you mention furthermore specifically asks not to fetch data that way.
wget -e robots=off -r -k -nv -nH -l inf -R jpg,jpeg,gif,png,tif --reject-regex '(.*)\?(.*)' --no-parent 'http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'
with:
wget: the program you want to call;
-e robots=off: ignore the website's request not to fetch it automatically;
-r: download recursively;
-R jpg,jpeg,gif,png,tif: reject the downloading of media files (the small images);
--reject-regex '(.*)\?(.*)': do not follow or download query pages (the sorted views of the index pages);
-l inf: keep recursing to an unlimited depth;
--no-parent: prevent wget from fetching links in the parent of the website (for instance the .. links to the parent directory).
wget downloads the files breadth-first so you will have to wait a long time before it eventually starts fetching the real data files.
Note that wget has no means to guess the directory structure at server-side. It only aims to find links in the fetched pages and thus with this knowledge aims to generate a dump of "visible" files. It is possible that the webserver does not list all available files, and thus wget will fail to download all files.
I've noticed this site supports the FTP protocol, which is a far more convenient way of reading files and folders. (It's for transferring files, not web pages.)
Get an FTP client (there are lots of them about), open ftp://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/ and you can probably just highlight all the folders in there and hit download.
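If you would rather stay on the command line, wget itself speaks FTP, and with FTP recursion the -A accept list is matched against file names, so a sketch along these lines might do what the question asks (-nd drops everything into one directory rather than recreating the remote tree):
wget -r -np -nd -A '*RMD*' 'ftp://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'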
One solution using saxon-lint:
saxon-lint --html --xpath 'string-join(//a/@href, "^M")' http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/ |
awk '/SOL/{print "http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/"$0}' |
while read url; do
    saxon-lint --html --xpath 'string-join(//a/@href, "^M")' "$url" |
    awk -vurl="$url" '/SOL/{print url$0}'
done |
while read url2; do
    saxon-lint --html --xpath 'string-join(//a/@href, "^M")' "$url2" |
    awk -vurl2="$url2" '/RME/{print url2$0}'
done |
xargs wget
Replace the "^M" with a literal Control-M character (Unix) or \r\n on Windows.

wget to download new wildcard files and overwrite old ones

I'm currently using wget to download specific files from a remote server. The files are updated every week, but they always have the same file names: e.g. a newly uploaded file1.jpg will replace the local file1.jpg.
This is how I am grabbing them, nothing fancy:
wget -N -P /path/to/local/folder/ http://xx.xxx.xxx.xxx/remote/files/file1.jpg
This downloads file1.jpg from the remote server if it is newer than the local version then overwrites the local one with the new one.
Trouble is, I'm doing this for over 100 files every week and have set up cron jobs to fire the 100 different download scripts at specific times.
Is there a way I can use a wildcard for the file name and have just one script that fires every 5 minutes for example?
Something like....
wget -N -P /path/to/local/folder/ http://xx.xxx.xxx.xxx/remote/files/*.jpg
Will that work? Will it check the local folder for all current file names, see what is new and then download and overwrite only the new ones? Also, is there any danger of it downloading partially uploaded files on the remote server?
I know that some kind of file sync script between servers would be a better option but they all look pretty complicated to set up.
Many thanks!
You can specify the files to be downloaded one by one in a text file, and then pass that file name using option -i or --input-file.
e.g. contents of list.txt:
http://xx.xxx.xxx.xxx/remote/files/file1.jpg
http://xx.xxx.xxx.xxx/remote/files/file2.jpg
http://xx.xxx.xxx.xxx/remote/files/file3.jpg
....
then
wget .... --input-file list.txt
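With the options from your current command filled in, that might look like:
wget -N -P /path/to/local/folder/ --input-file=list.txt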
Alternatively, if all your *.jpg files are linked from a particular HTML page, you can use recursive downloading, i.e. let wget follow links on your page to all linked resources. You might need to limit the "recursion level" and file types in order to prevent downloading too much. See wget --help for more info.
wget .... --recursive --level=1 --accept=jpg --no-parent http://.../your-index-page.html
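Concretely, with the directory from your example (this assumes the server exposes an index listing at /remote/files/, and -nd keeps all the images directly in the target folder):
wget -N -P /path/to/local/folder/ -r -l 1 -np -nd -A jpg http://xx.xxx.xxx.xxx/remote/files/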

Using wget but ignore url parameters

I want to download the contents of a website where the URLs are built as
http://www.example.com/level1/level2?option1=1&option2=2
Within the URL only the http://www.example.com/level1/level2 is unique for each page, and the values for option1 and option2 are changing. In fact, every unique page can have hundreds of different notations due to these variables. I am using wget to fetch all the site's content. Because of the problem I already downloaded more than 3GB of data. Is there a way to tell wget to ignore everything behind the URL's question mark? I can't find it in the man pages.
You can use --reject-regex to specify the pattern to reject the specific URL addresses, e.g.
wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/
This will mirror the website, but it'll ignore the addresses with question mark - useful for mirroring wiki sites.
wget2 has this built in via options --cut-url-get-vars and --cut-file-get-vars.
This does not help in your case, but for those who have already downloaded all of these files: you can quickly rename them to remove the question mark and everything after it as follows:
rename -v -n 's/[?].*//' *[?]*
The above command does a trial run and shows you how files will be renamed. If everything looks good with the trial run, then run the command again without the -n (nono) switch.
Problem solved. I noticed that the URLs I want to download are all search-engine friendly, with descriptions formed using dashes:
http://www.example.com/main-topic/whatever-content-in-this-page
All other URLs had references to the CMS. I got all I needed with
wget -r http://www.example.com -A "*-*"
This did the trick. Thanks for sharing your thoughts!
kenorb's answer using --reject-regex is good. It did not work in my case, though, on an older version of wget. Here is the equivalent using wildcards, which works with GNU Wget 1.12:
wget --reject "*\?*" -m -c --content-disposition http://example.com/

Using wget to download select directories from ftp server

I'm trying to understand how to use wget to download specific directories from a bunch of different ftp sites with economic data from the US government.
As a simple example, I know that I can download an entire directory using a command like:
wget --timestamping --recursive --no-parent ftp://ftp.bls.gov/pub/special.requests/cew/2013/county/
But I envision running more complex downloads, where I might want to limit a download to a handful of directories. So I've been looking at the --include option. But I don't really understand how it works. Specifically, why doesn't this work:
wget --timestamping --recursive -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/
The following does work, in the sense that it downloads files, but it downloads way more than I need (everything in the 2013 directory, vs just the county subdirectory):
wget --timestamping --recursive -I /pub/special.requests/cew/2013/ ftp://ftp.bls.gov/pub/special.requests/cew/
I can't tell if I'm not understanding something about wget or if my issue is with something more fundamental to FTP server structures.
Thanks for the help!
Based on this doc it seems that the filtering functions of wget are very limited.
When using the --recursive option, wget will download all linked documents after applying the various filters, such as --no-parent and -I, -X, -A, -R options.
In your example:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/
This won't download anything, because the -I option specifies to include only links matching /pub/special.requests/cew/2013/county/, but on the page /pub/special.requests/cew/ there are no such links, so the download stops there. This will work though:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/
... because in this case the /pub/special.requests/cew/2013/ page does have a link to county/
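Putting that together with the options from the original command, a sketch of the working invocation would be:
wget --timestamping --recursive --no-parent -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/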
Btw, you can find more details in this doc than on the man page:
http://www.gnu.org/software/wget/manual/html_node/
Can't you simply do the following (adding --timestamping/--no-parent etc. as needed)?
wget -r ftp://ftp.bls.gov/pub/special.requests/cew/2013/county
The -I option seems to work one directory level at a time, so if we step up one level from county/ we could do:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/
But apparently we can't step further up and do
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/

wget some website .gif files

I tried hard with different options but I still can't wget
http://auno.org/ao/nanos.php?prof=nano-technician
I would like to get all the little .gif images.
My current request is:
wget -A.gif http://auno.org/ao/nanos.php?prof=nano-technician/
I tried many options like -O or -U firefox etc.
Thanks for reading.
This way you're only downloading the requested page, nothing else. If you want to download more than that, you must turn on recursive downloading (-r).
Also, as the images are on different hosts, you may want to enable host spanning (-H), and when you do that you should also specify a restrictive recursion limit to avoid downloading half the internet (-l):
wget -A .gif -r -l 1 -H http://auno.org/ao/nanos.php?prof=nano-technician/
