wget: downloading all files in directories/subdirectories - linux

Basically, a webpage lists a number of directories, each of which has further subdirectories. The subdirectories contain a number of files, and I want to download, to a single location on my Linux machine, the one file from each subdirectory whose name contains the letter sequence 'RMD'.
E.g., say the main webpage links to directories dir1, dir2, dir3..., and each of these has subdirectories dir1a, dir1b..., dir2a, dir2b... etc. I want to download files of the form:
webpage/dir1/dir1a/file321RMD210
webpage/dir1/dir1b/file951RMD339
...
webpage/dir2/dir2a/file416RMD712
webpage/dir2/dir2b/file712RMD521
The directories/subdirectories are not sequentially numbered like in the above example (that was just me making it simpler to read) so is there a terminal command that will recursively go through each directory and subdirectory and download every file with the letters 'RMD' in the file name?
The website in question is: here
I hope that's enough information.

An answer with a lot of remarks:
In case the website supports FTP, you had better use @MichaelBaldry's answer. This answer aims to give a way to do it with wget (but this is less efficient for both server and client).
Only if the website works with a directory listing can you use the -r flag for this (the -r flag finds links in web pages and then downloads those pages as well).
The following method is inefficient for both server and client and can result in a huge load if the pages are generated dynamically. The website you mention furthermore specifically asks not to fetch data that way.
wget -e robots=off -r -k -nv -nH -l inf -R jpg,jpeg,gif,png,tif --reject-regex '(.*)\?(.*)' --no-parent 'http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'
with:
wget: the program you want to call;
-e robots=off: ignore the website's request not to download this automatically;
-r: download recursively;
-R jpg,jpeg,gif,png,tif: reject the downloading of media (the small images);
--reject-regex '(.*)\?(.*)': do not follow or download query pages (sorting of index pages);
-l inf: keep descending to an unlimited depth;
--no-parent: prevent wget from fetching links in the parent of the starting URL (for instance the .. links to the parent directory).
wget downloads the files breadth-first so you will have to wait a long time before it eventually starts fetching the real data files.
Note that wget has no means to guess the directory structure at server-side. It only aims to find links in the fetched pages and thus with this knowledge aims to generate a dump of "visible" files. It is possible that the webserver does not list all available files, and thus wget will fail to download all files.
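To keep only the files the question asks for, the recursive command above can probably be narrowed with wget's accept list. This is a minimal sketch, assuming the server exposes plain directory listings. The helper name fetch_rmd, the DRY_RUN switch and the destination directory are illustrative and not part of the original answer:

```shell
#!/bin/sh
# fetch_rmd URL DEST  (hypothetical helper, not from the original answer)
#   -r -l inf   recurse without a depth limit
#   -np         never ascend above URL
#   -nd         do not recreate remote directories; all files land in DEST
#   -nv         less verbose output
#   -A '*RMD*'  keep only files whose names contain RMD (index pages are
#               still fetched to discover links, then deleted)
#   -P DEST     directory to save into
fetch_rmd() {
    url=$1
    dest=$2
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} wget -r -l inf -np -nd -nv -A '*RMD*' -P "$dest" "$url"
}
```

Calling DRY_RUN=1 fetch_rmd 'http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/' ./rmd only prints the wget command; without DRY_RUN the download actually runs.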

I've noticed this site supports the FTP protocol, which is a far more convenient way of reading files and folders. (It's for transferring files, not web pages.)
Get an FTP client (lots of them about) and open ftp://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/; you can probably just highlight all the folders in there and hit download.

One solution using saxon-lint :
saxon-lint --html --xpath 'string-join(//a/@href, "^M")' http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/ | awk '/SOL/{print "http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/"$0}' | while read url; do saxon-lint --html --xpath 'string-join(//a/@href, "^M")' "$url" | awk -vurl="$url" '/SOL/{print url$0}'; done | while read url2; do saxon-lint --html --xpath 'string-join(//a/@href, "^M")' "$url2" | awk -vurl2="$url2" '/RME/{print url2$0}'; done | xargs wget
Replace the "^M" with control+M (Unix) or \r\n (Windows).
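If saxon-lint is not available, the link-extraction step can be approximated with grep and sed. This is a different technique from the XPath query, and it assumes simple listing HTML in which every link is written as href="..." with double quotes. The name list_links is hypothetical:

```shell
#!/bin/sh
# list_links: read HTML on stdin and print each double-quoted href value
# on its own line. Assumption: attributes look like href="value".
# Note: grep -o is not in POSIX, but GNU and BSD grep both provide it.
list_links() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}
```

For example, wget -qO- "$url" | list_links | grep RMD would list the RMD file names linked from one directory page.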

Related

(Linux) Recursively overwrite all files in folder with data from another file

I find myself in a situation similar to this question:
Linux: Overwrite all files in folder with specified data?
The answers there work nicely, however, they are for typed-out text. Allow me to provide context.
I have a Linux terminal with the following file structure (files and folders irrelevant to the question removed):
root/
    empty.svg
    svg/
        257238.svg
        297522.svg
        a7yf872.svg
        236y27fh.svg
        38277.svg
        ... (~200 other .svg files with arbitrary names)
        2903852.svg
The framework I am working with requires those .svg files to exist with those specific filenames, but obviously, it does not care about the SVG images they contain. I do not plan on using these files and they take up a hefty amount of disk space, so I wish to convert them all into empty SVGs, i.e. copies of the empty.svg file in my root directory, which is a 12x12 transparent SVG file (124 bytes). This way the framework shouldn't error out like it did when I simply overwrote the raw data of those SVGs with plaintext using the answer to the question linked at the top. I've tried many methods, getting creative with my basic Linux command-line knowledge, but had no success. How do I accomplish this?
TL;DR: How to recursively overwrite all files in a folder with the raw data of another file from Linux CLI?
Similar to the linked answer, you can use the tee command, but instead of echo use cat, which reads the contents of a file, to supply the data.
cat empty.svg | tee svg/257238.svg svg/297522.svg <etc>
But if there are a lot of files in the svg directory, it is more practical to automate the previous command with a loop:
for f in svg/*; do
    if [[ "$f" == *.svg ]]; then
        cat empty.svg > "$f"
    fi
done
Here we use pipes and redirections to connect commands and redirect previous command output.
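The loop above only touches files directly inside svg/. Since the question asks for a recursive overwrite, here is a minimal sketch using find instead. The function name overwrite_svgs is made up for illustration, and the source file is assumed to live outside the target directory (as empty.svg does here):

```shell
#!/bin/sh
# overwrite_svgs SRC DIR  (hypothetical helper)
# Replace the contents of every *.svg file under DIR, at any depth, with
# the contents of SRC. File names are kept; only the data changes.
# Assumption: SRC is not itself inside DIR (cp refuses to copy a file
# onto itself).
overwrite_svgs() {
    src=$1
    dir=$2
    find "$dir" -type f -name '*.svg' -exec cp -- "$src" {} \;
}
```

From the root directory this would be run as overwrite_svgs empty.svg svg.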

wget to download new wildcard files and overwrite old ones

I'm currently using wget to download specific files from a remote server. The files are updated every week, but always have the same file names, e.g. a newly uploaded file1.jpg will replace the local file1.jpg.
This is how I am grabbing them, nothing fancy :
wget -N -P /path/to/local/folder/ http://xx.xxx.xxx.xxx/remote/files/file1.jpg
This downloads file1.jpg from the remote server if it is newer than the local version then overwrites the local one with the new one.
Trouble is, I'm doing this for over 100 files every week and have set up cron jobs to fire the 100 different download scripts at specific times.
Is there a way I can use a wildcard for the file name and have just one script that fires every 5 minutes for example?
Something like....
wget -N -P /path/to/local/folder/ http://xx.xxx.xxx.xxx/remote/files/*.jpg
Will that work? Will it check the local folder for all current file names, see what is new and then download and overwrite only the new ones? Also, is there any danger of it downloading partially uploaded files on the remote server?
I know that some kind of file sync script between servers would be a better option but they all look pretty complicated to set up.
Many thanks!
You can specify the files to be downloaded one by one in a text file, and then pass that file name using option -i or --input-file.
e.g. contents of list.txt:
http://xx.xxx.xxx.xxx/remote/files/file1.jpg
http://xx.xxx.xxx.xxx/remote/files/file2.jpg
http://xx.xxx.xxx.xxx/remote/files/file3.jpg
....
then
wget .... --input-file list.txt
Alternatively, if all your *.jpg files are linked from a particular HTML page, you can use recursive downloading, i.e. let wget follow links on your page to all linked resources. You might need to limit the "recursion level" and file types in order to prevent downloading too much. See wget --help for more info.
wget .... --recursive --level=1 --accept=jpg --no-parent http://.../your-index-page.html
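For the list-file approach, the URL list can be generated rather than typed by hand. A minimal sketch, assuming the file names are known in advance; make_url_list is a hypothetical helper and the file names are the ones from the question:

```shell
#!/bin/sh
# make_url_list BASE_URL NAME...  (hypothetical helper)
# Print one URL per file name, ready to be saved as list.txt.
make_url_list() {
    base=${1%/}        # drop a trailing slash so we do not emit "//"
    shift
    for name in "$@"; do
        printf '%s/%s\n' "$base" "$name"
    done
}

# Example (not executed here):
#   make_url_list http://xx.xxx.xxx.xxx/remote/files file1.jpg file2.jpg > list.txt
#   wget -N -P /path/to/local/folder/ -i list.txt
```

Because -N is kept, a single cron job running the second command should still fetch only the files that are newer on the server.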

Using wget but ignore url parameters

I want to download the contents of a website where the URLs are built as
http://www.example.com/level1/level2?option1=1&option2=2
Within the URL only the http://www.example.com/level1/level2 is unique for each page, and the values for option1 and option2 are changing. In fact, every unique page can have hundreds of different notations due to these variables. I am using wget to fetch all the site's content. Because of the problem I already downloaded more than 3GB of data. Is there a way to tell wget to ignore everything behind the URL's question mark? I can't find it in the man pages.
You can use --reject-regex to specify the pattern to reject the specific URL addresses, e.g.
wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/
This will mirror the website, but it'll ignore the addresses with question mark - useful for mirroring wiki sites.
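To see which URLs that pattern would exclude, the same expression can be tried against sample URLs with grep -E. This only approximates wget's own POSIX regex matching, and would_reject is a name invented for this sketch:

```shell
#!/bin/sh
# would_reject URL  (hypothetical helper)
# Succeeds if URL matches the pattern passed to --reject-regex above,
# i.e. if it contains a literal question mark anywhere.
would_reject() {
    printf '%s\n' "$1" | grep -Eq '(.*)\?(.*)'
}

# Example (not executed here):
#   would_reject 'http://www.example.com/level1/level2?option1=1&option2=2' && echo skipped
```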
wget2 has this built in via options --cut-url-get-vars and --cut-file-get-vars.
This does not help in your case, but for those who have already downloaded all of these files: you can quickly rename the files to remove the question mark and everything after it as follows:
rename -v -n 's/[?].*//' *[?]*
The above command does a trial run and shows you how files will be renamed. If everything looks good with the trial run, then run the command again without the -n (nono) switch.
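Where the Perl-based rename tool is not installed, roughly the same clean-up can be sketched in plain shell. The function name strip_query_names is hypothetical, and unlike the command above it takes the directory as an argument:

```shell
#!/bin/sh
# strip_query_names DIR  (hypothetical helper)
# Rename every file in DIR whose name contains "?" so that the "?" and
# everything after it are removed. Existing targets are never overwritten.
strip_query_names() {
    dir=$1
    for f in "$dir"/*\?*; do
        [ -e "$f" ] || continue        # the glob matched nothing
        base=${f##*/}                  # file name without the directory
        new=$dir/${base%%\?*}          # cut at the first "?"
        if [ -e "$new" ]; then
            echo "skipping $f: $new already exists" >&2
        else
            mv -- "$f" "$new"
        fi
    done
}
```

When several downloaded variants map to the same name, only the first is renamed and the rest are reported and left alone.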
Problem solved. I noticed that the URLs that I want to download are all search engine friendly, with descriptions formed using dashes:
http://www.example.com/main-topic/whatever-content-in-this-page
All other URLs had references to the CMS. I got all I needed with
wget -r http://www.example.com -A "*-*"
This did the trick. Thanks for thought sharing!
@kenorb's answer using --reject-regex is good. It did not work in my case, though, on an older version of wget. Here is the equivalent using wildcards that works with GNU Wget 1.12:
wget --reject "*\?*" -m -c --content-disposition http://example.com/

How do I mirror a directory with wget without creating parent directories?

I want to mirror a folder via FTP, like this:
wget --mirror --user=x --password=x ftp://ftp.site.com/folder/subfolder/evendeeper
But I do not want to create a directory structure like this:
ftp.site.com -> folder -> subfolder -> evendeeper
I just want:
evendeeper
And anything below it to be the resulting structure. It would also be acceptable for the contents of evendeeper to wind up in the current directory as long as subdirectories are created for subdirectories of evendeeper on the server.
I am aware of the -np option, but according to the documentation that just keeps it from following links to parent pages (a non-issue for the binary files I'm mirroring via FTP). I am also aware of the -nd option, but this prevents creating any directory structure at all, even for subdirectories of evendeeper.
I would consider alternatives as long as they are command-line-based, readily available as Ubuntu packages and easily automated like wget.
For a path like: ftp.site.com/a/b/c/d
-nH would download all files to the directory a/b/c/d in the current directory, and -nH --cut-dirs=3 would download all files to the directory d in the current directory.
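The effect of --cut-dirs on the local path can be modelled with a few lines of shell. This is only an illustration of the mapping described above, not wget's actual code, and cut_dirs is a made-up name:

```shell
#!/bin/sh
# cut_dirs N PATH  (hypothetical helper)
# Print PATH with its first N directory components removed, the way
# -nH --cut-dirs=N shortens the path under which a file is saved.
# The file name itself is never removed.
cut_dirs() {
    n=$1
    path=${2#/}                            # ignore a leading slash
    while [ "$n" -gt 0 ]; do
        case $path in
            */*) path=${path#*/} ;;        # drop one leading directory
            *)   break ;;                  # only the file name is left
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$path"
}

# Example (not executed here):
#   cut_dirs 3 a/b/c/d/file.bin
```

By the same logic, for ftp://ftp.site.com/folder/subfolder/evendeeper the combination -nH --cut-dirs=2 should leave evendeeper and everything below it as the local structure.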
I had a similar requirement and the following combination seems to be the perfect choice:
In the below example, all the files in http://url/dir1/dir2 (alone) are downloaded to local directory /dest/dir
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Thanks @ffledgling for the hint on "-nd"
For the above example:
wget -nd -np --mirror --user=x --password=x ftp://ftp.site.com/folder/subfolder/evendeeper
Snippets from manual:
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
The -np (no parent) option will probably do what you want, tied in with -l 1 (I think, don't have a wget install before me), which limits the recursion to one level.
EDIT. ok. gah... maybe I should wait until I've had coffee.. There is a --cut or similar option, which allows you to "cut" a specified number of directories from the output path, so for /a/b/c/d, a cut of 2 would force wget to create c/d on your local machine
Instead of using:
-nH --cut-dirs=1
use:
-nH --cut-dirs=100
This will cut more directories and no folders will be created.
Note: 100 = the number of folders to skip creating.
You can change 100 to any number.

Can wget be used to get all the files on a server?

Can wget be used to get all the files on a server? Suppose my site foo.com uses the Django framework and has this directory structure:
/web/project1
/web/project2
/web/project3
/web/project4
/web/templates
Without knowing the names of the directories (/project1, /project2, ...), is it possible to download all the files?
You could use
wget -r -np http://www.foo.com/pool/main/z/
-r (fetch files/folders recursively)
-np (do not ascend to the parent directory when retrieving recursively)
or
wget -nH --cut-dirs=2 -r -np http://www.foo.com/pool/main/z/
--cut-dirs (makes Wget not "see" the given number of remote directory components)
-nH (invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.)
First of all, wget can only be used to retrieve files served by the web server. It's not clear in the question you're posting whether you mean actual files or web pages. I would guess from the way you phrased your question that your intent is to download the server files, not the web pages served by Django. If this is correct, then no, wget won't work. You need to use something like rsync or scp.
If you do mean using wget to retrieve all of the generated pages from Django, then this will only work if links point to those directories. So, you need a page that has code like:
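<ul>
<li><a href="/web/project1/">Project1</a></li>
<li><a href="/web/project2/">Project2</a></li>
<li><a href="/web/project3/">Project3</a></li>
<li><a href="/web/project4/">Project4</a></li>
<li><a href="/web/templates/">Templates</a></li>
</ul>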
wget is not a psychic; it can only pull in pages it knows about.
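For the first case (copying the actual server files rather than the rendered pages), here is a minimal rsync sketch, assuming you have SSH access to the server. The login user@foo.com, the remote path /web/, the local directory and the helper name are all placeholders:

```shell
#!/bin/sh
# pull_site_files REMOTE DEST  (hypothetical helper)
#   REMOTE  an rsync source such as user@foo.com:/web/ (placeholder)
#   DEST    local directory to copy into
#   -a      archive mode: recurse and preserve permissions and times
#   -v      list files as they are transferred
pull_site_files() {
    remote=$1
    dest=$2
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} rsync -av "$remote" "$dest"
}
```

DRY_RUN=1 pull_site_files user@foo.com:/web/ ./web-copy/ prints the rsync command that would be run.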
Try recursive retrieval: the -r option.

Resources