Using wget to download select directories from ftp server - linux

I'm trying to understand how to use wget to download specific directories from a bunch of different ftp sites with economic data from the US government.
As a simple example, I know that I can download an entire directory using a command like:
wget --timestamping --recursive --no-parent ftp://ftp.bls.gov/pub/special.requests/cew/2013/county/
But I envision running more complex downloads, where I might want to limit a download to a handful of directories. So I've been looking at the --include-directories (-I) option, but I don't really understand how it works. Specifically, why doesn't this work:
wget --timestamping --recursive -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/
The following does work, in the sense that it downloads files, but it downloads way more than I need (everything in the 2013 directory, vs just the county subdirectory):
wget --timestamping --recursive -I /pub/special.requests/cew/2013/ ftp://ftp.bls.gov/pub/special.requests/cew/
I can't tell if I'm misunderstanding something about wget or if my issue is with something more fundamental to FTP server structures.
Thanks for the help!

Based on this doc it seems that the filtering functions of wget are very limited.
When using the --recursive option, wget will download all linked documents after applying the various filters, such as --no-parent and -I, -X, -A, -R options.
In your example:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/
This won't download anything, because the -I option restricts the download to links matching /pub/special.requests/cew/2013/county/, but the listing at /pub/special.requests/cew/ contains no such links, so the download stops there. This will work, though:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/
... because in this case the /pub/special.requests/cew/2013/ page does have a link to county/
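If the goal is really just a handful of known directories, a simpler route is to skip -I entirely and pass each target directory as its own starting URL (a rough sketch; the 2012/county/ path is only an assumed example of a second target):
# assumption: 2012/county/ is just an illustrative second directory, not verified on the server
wget --timestamping --recursive --no-parent ftp://ftp.bls.gov/pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2012/county/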
Btw, you can find more details in this doc than on the man page:
http://www.gnu.org/software/wget/manual/html_node/

Can't you simply do the following (adding --timestamping, --no-parent, etc. as needed)?
wget -r ftp://ftp.bls.gov/pub/special.requests/cew/2013/county
The -I option seems to work one directory level at a time, so if we go up one level from county/ we can do:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/
But apparently we can't go further up and do:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/
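A possible workaround, sketched here with made-up sibling names, is to start one level up and filter the other way around with -X (--exclude-directories), which also takes a comma-separated list:
# "csa" and "msa" are hypothetical sibling directories used only for illustration
wget -r --no-parent -X /pub/special.requests/cew/2013/csa,/pub/special.requests/cew/2013/msa ftp://ftp.bls.gov/pub/special.requests/cew/2013/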

Related

How can I download all the files from a remote directory to my local directory?

I want to download all the files in a specific directory of my site.
Let's say I have 3 files in my remote SFTP directory
www.site.com/files/phone/2017-09-19-20-39-15
a.txt
b.txt
c.txt
My goal is to create a local folder on my desktop with ONLY those downloaded files. No parent files or parent directories needed. I am trying to get a clean report.
I've tried
wget -m --no-parent -l1 -nH -P ~/Desktop/phone/ www.site.com/files/phone/2017-09-19-20-39-15 --reject=index.html* -e robots=off
I got the whole remote directory structure recreated locally under ~/Desktop/phone/ (files/phone/2017-09-19-20-39-15/a.txt and so on).
I want to get just a.txt, b.txt and c.txt sitting directly in ~/Desktop/phone/.
How do I tweak my wget command to get something like that?
Should I use anything else other than wget ?
Ihue, taking a shell-programmatic perspective, I would recommend you try the following command line; note that I have also added a citation so you can see the original thread.
wget -r -P ~/Desktop/phone/ -A txt www.site.com/files/phone/2017-09-19-20-39-15 --reject=index.html* -e robots=off
-r enables recursive retrieval. See Recursive Download for more information.
-P sets the directory prefix where all files and directories are saved to.
-A sets a whitelist for retrieving only certain file types. Strings and patterns are accepted, and both can be used in a comma separated list. See Types of Files for more information.
Ref: #don-joey
https://askubuntu.com/questions/373047/i-used-wget-to-download-html-files-where-are-the-images-in-the-file-stored
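If you also want the local copy flat, so that only a.txt, b.txt and c.txt end up in ~/Desktop/phone/ with none of the remote directory levels recreated, a variant worth trying (a sketch built on the same URL from the question) adds -nd:
# -nd (--no-directories) keeps the local folder flat; -np and -l1 keep wget from wandering up or deeper
wget -r -l1 -np -nd -P ~/Desktop/phone/ -A txt --reject "index.html*" -e robots=off www.site.com/files/phone/2017-09-19-20-39-15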

wget for selected samples from ftp?

I wanted to download selected files from this site:
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP042/SRP042286
If I wanted to download all of them I could do:
wget -r ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP042/SRP042286/*
But my question is: what is a simple way to download just selected samples, such as SRR1299458/ through SRR1299466/?
You can use wget -r ftp://example.com/path/{dir1,dir2}:
wget -r ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP042/SRP042286/{SRR1299458,SRR1299466}
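The braces are expanded by the shell, not by wget, so in bash a numeric range can cover the whole SRR1299458 through SRR1299466 span. A sketch, assuming the accession numbers really are consecutive:
# bash expands SRR12994{58..66} into SRR1299458 SRR1299459 ... SRR1299466 before wget runs
wget -r ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP042/SRP042286/SRR12994{58..66}/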

A better way to download FTP contents to another server

I frequently have to move files from one server to another, when moving websites or when I need a code package that is located on another server.
Currently I use the following commands:
wget -m --ftp-user=username --ftp-password=password ftp://ftp.domain.std/public_html
cp -rf ftp.domain.std/public_html/* .
cp -rf ftp.domain.std/public_html/.* .
This works fine, but I wonder if there is a method that would make the second and third lines unnecessary?
You can give the -nH --cut-dirs=1 parameters to skip the host directory (-nH) and cut away one level of directories (--cut-dirs=1).
(This may vary by wget version; this is from GNU wget.)
wget -nH --cut-dirs=1 -m --ftp-user=username --ftp-password=password ftp://ftp.domain.std/public_html
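If you want the files to land straight in the target document root instead of the current directory, -P can be added as well; a sketch, with /var/www/site standing in for whatever the real destination is:
# /var/www/site is a placeholder destination, not taken from the question
wget -nH --cut-dirs=1 -m -P /var/www/site --ftp-user=username --ftp-password=password ftp://ftp.domain.std/public_html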

Can wget be used to get all the files on a server?

Can wget be used to get all the files on a server? Suppose this is the directory structure of my site foo.com, which uses the Django framework:
/web/project1
/web/project2
/web/project3
/web/project4
/web/templates
Without knowing the names of the directories (project1, project2, ...), is it possible to download all the files?
You could use
wget -r -np http://www.foo.com/pool/main/z/
-r (fetch files/folders recursively)
-np (do not ascend to the parent directory when retrieving recursively)
or
wget -nH --cut-dirs=2 -r -np http://www.foo.com/pool/main/z/
--cut-dirs=number (it makes wget not "see" the first number remote directory components; with --cut-dirs=2, pool/main/ is stripped from the saved paths)
-nH (invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.)
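Applied to the layout in the question, that would look roughly like this, assuming foo.com actually serves a browsable index for /web/ (the next answer explains why it often won't):
# works only if http://foo.com/web/ returns a directory listing or pages linking to the project directories
wget -r -np -nH --cut-dirs=1 http://foo.com/web/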
First of all, wget can only be used to retrieve files served by the web server. It's not clear from your question whether you mean actual files or web pages. I would guess from the way you phrased your question that your intent is to download the server files, not the web pages served by Django. If this is correct, then no, wget won't work. You need to use something like rsync or scp.
If you do mean using wget to retrieve all of the generated pages from Django, then this will only work if links point to those directories. So, you need a page that has code like:
<ul>
<li><a href="/web/project1/">Project1</a></li>
<li><a href="/web/project2/">Project2</a></li>
<li><a href="/web/project3/">Project3</a></li>
<li><a href="/web/project4/">Project4</a></li>
<li><a href="/web/templates/">Templates</a></li>
</ul>
wget is not a psychic; it can only pull in pages it knows about.
Try recursive retrieval with the -r option.

wget .listing file, is there a way to specify the name of it

OK, so I need to run wget, but I'm prohibited from creating 'dot' files in the location where I need to run it. So my question is: can I get wget to use a name other than .listing, one that I can specify?
Further clarification: this is to sync/mirror an FTP folder with a local one, so using the -O option is not really useful, as I require all files to keep their original names and structure.
You can use the -O option to set the output filename, as in:
wget -O file http://stackoverflow.com
You can also use wget --help to get a complete list of options.
For folks that come along afterwards, and are surprised by an answer to the wrong question, here is a copy of one of the comments below:
@FelixD: yes, I unfortunately misunderstood the question. Looking at the code for wget version 1.19 (Feb 2017), specifically ftp.c, it appears that the .listing filename is hard-coded in the macro LIST_FILENAME, with no override possible. There are probably better options for mirroring FTP sites; maybe take a look at lftp and its mirror command, which also supports parallel downloads: lftp.yar.ru
@Paul: You can use the -O option specified by spong.
No. You can't do this.
wget/src/ftp.c
/* File where the "ls -al" listing will be saved. */
#ifdef MSDOS
#define LIST_FILENAME "_listing"
#else
#define LIST_FILENAME ".listing"
#endif
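Since the name is hard-coded, one practical way around it is the lftp mirror command mentioned in the comment above, which sidesteps wget's .listing mechanism entirely. A rough sketch, with host, credentials and paths as placeholders:
# host, credentials and both paths below are placeholders, not from the question
lftp -u username,password ftp.example.com -e "mirror --parallel=4 /remote/dir /local/dir; quit"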
I have the same problem: wget seems to save the .listing file in the current directory where wget was called from, regardless of -O path/output_file.
As an ugly/desperate workaround, we can try running wget from random directories:
cd /temp/random_1; wget ftp://example.com/ -O /full/save_path/to_file_1.txt
cd /temp/random_2; wget ftp://example.com/ -O /full/save_path/to_file_2.txt
Note: the manual says that using the --no-remove-listing option will cause wget to create .listing.1, .listing.2, etc., so that might be an option to avoid conflicts.
Note: the .listing file is not created at all if the FTP login fails.
