very new to this... apologies for mistakes/stupidity in advance.
Trying to use wget on Mac to download PDFs from a list. I have a text file of DOIs (e.g. 10.1046/j.1365-294X.2001.01258.x); essentially I want to enter these DOIs into sci-hub.io and download the corresponding PDFs. I have appended each DOI to form a legitimate web address (e.g. http://sci-hub.io/10.1046/j.1365-294X.2001.01258.x). These work when entered manually into Chrome, but not with wget to automate the process.
Tried:
wget -i file.txt
wget -A.pdf -i file.txt
wget -A.pdf -erobots=off -i file.txt
All of these still return files containing HTML rather than the PDFs.
Any suggestions greatly appreciated.
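For what it's worth, the DOI-to-URL step can be scripted rather than done by hand. A minimal sketch, assuming the raw DOIs live in dois.txt, one per line (the filename is just illustrative):
sed 's|^|http://sci-hub.io/|' dois.txt > file.txt
wget -i file.txt
Note that wget simply saves whatever the server returns for each URL, so if the address serves an HTML viewer page wrapping the PDF rather than the PDF itself, the saved file will be HTML, which matches what is described above.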
Related
I checked the Internet Archive's own site, which lists a couple of steps to follow, including using the wget utility via Cygwin on Windows. I followed those steps: I made an advanced search, extracted the CSV file, converted it to .txt, and then tried to run the following command:
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/
The terminal emulator gets stuck afterwards, with no log output or even an error message indicating any real progress. I want to know what I have done wrong so far.
The ia command-line tool is the official way to do this. If you can craft a search term that captures all your items, you can have ia download everything that matches.
For example:
ia download --search 'creator:Hamilton Commodore User Group'
will download all of the items attributed to this (now defunct) computer user group. This is a live, working query that downloads roughly 8.6 MB of data for 40 Commodore 64 disk images.
It will also download from an itemlist, as above.
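For example, something along these lines should work for an itemlist (assuming the tool's --itemlist option, with itemlist.txt holding one identifier per line):
ia download --itemlist itemlist.txt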
After some time I figured out how to resolve this. The commands posted in the Internet Archive help blog are general commands meant to show how to use the wget utility; the ones actually needed here are simply the following:
--cut-dirs=1
-A .pdf,.epub
-e robots=off
-i ./itemlist.txt
and of course the base URL:
-B 'http://archive.org/download/'
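Plugged back into the full command from the question, and with the closing quote on the -B argument (the missing quote may be why the terminal appeared to hang), that gives:
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'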
The page that I'm trying to download from (made up contact info):
http://www.filltext.com/?rows=1000&nro={index}&etunimi={firstName}&sukunimi={lastName}&email={email}&puhelinnumero={phone}&pituus={numberRange|150,200}&syntymaaika={date|10-01-1950,30-12-1999}&postinumero={zip}&kaupunki={city}&maa={country}&pretty=true
The command that I have been using (I have tried a lot of different options etc.):
wget -r -O -F [filename] URL
It works in the sense that it downloads the web page content to the file, but instead of being the raw data that is inside the cells, it's just a bunch of curly brackets.
How do I download the actual raw data instead of the JSON file? Any help would be very much appreciated!
Thank you.
Do you want something like wget google.com -q -O - ?
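Applied to the page above, that would be something like the following, writing the response to a file (data.json is just an illustrative name; note that the URL must be quoted, because the & characters would otherwise be interpreted by the shell):
wget -q -O data.json 'http://www.filltext.com/?rows=1000&nro={index}&etunimi={firstName}&sukunimi={lastName}&email={email}&puhelinnumero={phone}&pituus={numberRange|150,200}&syntymaaika={date|10-01-1950,30-12-1999}&postinumero={zip}&kaupunki={city}&maa={country}&pretty=true'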
I'm trying to understand how to use wget to download specific directories from a bunch of different ftp sites with economic data from the US government.
As a simple example, I know that I can download an entire directory using a command like:
wget --timestamping --recursive --no-parent ftp://ftp.bls.gov/pub/special.requests/cew/2013/county/
But I envision running more complex downloads, where I might want to limit a download to a handful of directories. So I've been looking at the --include option. But I don't really understand how it works. Specifically, why doesn't this work:
wget --timestamping --recursive -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/
The following does work, in the sense that it downloads files, but it downloads way more than I need (everything in the 2013 directory, vs just the county subdirectory):
wget --timestamping --recursive -I /pub/special.requests/cew/2013/ ftp://ftp.bls.gov/pub/special.requests/cew/
I can't tell if I'm not understanding something about wget or if my issue is with something more fundamental to FTP server structures.
Thanks for the help!
Based on the wget documentation (linked below), it seems that the filtering functions of wget are quite limited.
When using the --recursive option, wget will download all linked documents after applying the various filters, such as --no-parent and -I, -X, -A, -R options.
In your example:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/
This won't download anything, because the -I option specifies to include only links matching /pub/special.requests/cew/2013/county/, but on the page /pub/special.requests/cew/ there are no such links, so the download stops there. This will work though:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/
... because in this case the /pub/special.requests/cew/2013/ page does have a link to county/
Btw, you can find more details in this doc than on the man page:
http://www.gnu.org/software/wget/manual/html_node/
Can't you simply do the following (and add --timestamping/--no-parent etc. as needed)?
wget -r ftp://ftp.bls.gov/pub/special.requests/cew/2013/county
The -I option seems to work one directory level at a time, so if we go one level up from county/ we can do:
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/
But apparently we can't step further up and do
wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/
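Putting the answers together, a command for just the county data would therefore start the recursion at the 2013/ directory, e.g. (a sketch, using the same flags as the original question):
wget --timestamping --recursive --no-parent -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/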
I want to download a number of files which are as follows:
http://example.com/directory/file1.txt
http://example.com/directory/file2.txt
http://example.com/directory/file3.txt
http://example.com/directory/file4.txt
.
.
http://example.com/directory/file199.txt
http://example.com/directory/file200.txt
Can anyone help me with this using shell scripting? Here is what I'm using, but it downloads only the first file.
for i in {1..200}
do
exec wget http://example.com/directory/file$i.txt;
done
wget http://example.com/directory/file{1..200}.txt
should do it. That expands to wget http://example.com/directory/file1.txt http://example.com/directory/file2.txt ....
Alternatively, your current code should work fine if you remove the call to exec, which is unnecessary and doesn't do what you seem to think it does.
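For reference, the loop from the question with exec removed:
for i in {1..200}
do
    wget http://example.com/directory/file$i.txt
done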
To download a list of files you can use wget -i <file>, where <file> is a text file containing the list of URLs to download.
For more details you can review the help page: man wget
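For example, the URL list itself could be generated with the same brace expansion and then passed to -i (urls.txt is just an illustrative name):
printf 'http://example.com/directory/file%d.txt\n' {1..200} > urls.txt
wget -i urls.txt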
I am migrating a site in PHP, and someone has hardcoded all the links into a function call: display image('http://whatever.com/images/xyz.jpg').
I can easily use TextMate to convert all of these to http://whatever.com/images/xyz.jpg.
But I also need to download the images themselves, for example with wget -i images.txt.
So I need to write a bash script that compiles images.txt with all the links, to save me doing this manually, because there are a lot of them!
Any help you can give on this is greatly appreciated.
I found a one-liner on that website that should work (replace index.php with your source file):
wget `cat index.php | grep -P -o 'http:(\.|-|\/|\w)*\.(gif|jpg|png|bmp)'`
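To produce the images.txt mentioned in the question (for wget -i), the same pattern can be written to a file first and downloaded in a second step, e.g.:
grep -P -o 'http:(\.|-|\/|\w)*\.(gif|jpg|png|bmp)' index.php > images.txt
wget -i images.txt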
If you wget the file via a web server, won't you get the output of the PHP script? That will contain img tags, which you can extract using xml_grep or some such tool.