I want to know how to define the input file for wget so that it downloads several files and sets a custom name for each of them, using wget -i filename.
An analogous example using -O:
wget -O customname url
-O filename works only when you give it a single URL.
With multiple URLs, all downloaded content ends up in filename.
You can use a while ... read loop:
cat urls.txt | while read -r url
do
wget "$url" -O "${url##*/}" # <-- put your custom name here; ${url##*/} keeps only the part after the last "/"
done
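wget -i by itself has no syntax for pairing each URL with a custom name, so one option is to keep the names in the list file and read both columns in the loop. A minimal sketch, assuming a hypothetical urls.txt with whitespace-separated "URL name" pairs:
# urls.txt (assumed format):
#   http://example.com/a.jpg  first.jpg
#   http://example.com/b.jpg  second.jpg
cat urls.txt | while read -r url name
do
wget "$url" -O "$name"
done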
Related
When I download a file using wget without the -O argument, it saves the file with a default name. I would like to append some text to that default name, but the -O option completely overrides the default name.
For example:
wget www.google.com --> saves as index.html (this is the default name)
wget -O foo.html www.google.com --> saves as foo.html
I would like to save the file as ${default_name}_sometext.html. For the above example, it would look something like index_sometext.html. Any ideas appreciated!
Edit: For those who are wondering why I might need something like this, I have scraped thousands of URLs for a personal project, and I need to preserve the default name along with some attributes for each file.
#!/bin/bash
domain=google.com
some_text=something
# capture the default filename from wget's "Saving to: 'index.html'" output line
default_name=$(wget "https://${domain}" 2>&1 | grep Saving | cut -d ' ' -f 3 | sed -e 's/[^A-Za-z0-9._-]//g')
# insert the suffix before the extension: index.html -> index_something.html
mv "$default_name" "${default_name%.*}_${some_text}.${default_name##*.}"
I think it will work for you.
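Since the question mentions thousands of scraped URLs, the same idea can be wrapped in a loop; a minimal sketch, assuming a hypothetical urls.txt with one URL per line:
some_text=something
while read -r url
do
name=$(wget "$url" 2>&1 | grep Saving | cut -d ' ' -f 3 | sed -e 's/[^A-Za-z0-9._-]//g')
mv "$name" "${name%.*}_${some_text}.${name##*.}"
done < urls.txt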
I'm using wget in the Mac terminal to download images from a file where each image URL is on its own line, and that works perfectly with this command:
cut -f1 -d, images.txt | while read url; do wget ${url} -O $(basename ${url}); done
However, I want to specify the output filename it's saved as instead of using the basename. The filename is given in the next column, separated by either a space or a comma, and I can't quite figure out how to tell wget to use that 2nd column as the -O name.
I'm sure it's a simple change to my above command but after reading dozens of different posts on here and other sites I can't figure it out. Any help would be appreciated.
If you use whitespace as the separator, it's very easy:
cat images.txt | while read -r url name; do wget "${url}" -O "${name}"; done
Explanation: instead of reading just one variable per line (${url}) as in your example, you read two (${url} and ${name}). The second one is your local filename. I assumed your images.txt file looks something like this:
http://cwsmgmt.corsair.com/newscripts/landing-pages/wallpaper/v3/Wallpaper-v3-2560x1440.jpg test.jpg
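If the columns in images.txt are separated by a comma instead (the question mentions both), a minimal variant of the same loop sets IFS for the read:
cat images.txt | while IFS=',' read -r url name; do wget "${url}" -O "${name}"; done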
I'm currently using
curl -w "%%{filename_effective}" "http://example.com/file.jpg" -L -J -O -s
to print the server-imposed filename from a download address but it downloads the file to the directory. How can I get the filename only without downloading the file?
Do a HEAD request in your batch file:
curl -w "%%{filename_effective}" "http://example.com/file.jpg" -I -X HEAD -L -J -O -s
This way, the requested file itself will not be downloaded, as per RFC2616:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. (...) This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself.
Note that this will create two files nonetheless, albeit only with the header data curl has received. I understand that your main interest is not downloading (potentially many and/or big) files, so I guess that should be OK.
Edit: On the follow-up question:
Is there really no way to get only and only the content-disposition filename without downloading anything at all?
It doesn't seem possible with curl alone, I'm afraid. The only way to have curl not output a file is the -o NUL (Windows) routine, and if you do that, %{filename_effective} becomes NUL as well.
curl gives an error when trying to combine -J and -I, so the only solution that I found is to parse the header output with grep and sed:
curl "http://example.com/file.jpg" -LIs | grep ^Content-Disposition | sed -r 's/.*"(.*)".*/\1/'
I have a text file containing a bunch of webpages:
http://rest.kegg.jp/link/pathway/7603
http://rest.kegg.jp/link/pathway/5620
…
My aim is to download all info on these pages to a single text file.
The following works perfectly, but it gives me 3000+ text files. How could I simply merge all the output into a single file during the loop?
while read i; do wget $i; done < urls.txt
Thanks a lot
Use the -O file option. Within a single wget invocation it writes every download to the specified file, one after another; note, however, that each separate wget invocation truncates that file, so in a loop you should write to stdout and redirect instead:
while read -r i; do wget -q -O - "$i"; done < urls.txt > outputFile
outputFile will then contain the concatenated contents of all the pages.
You can also skip the while loop entirely by specifying the input file with -i:
wget -O outputFile -i urls.txt
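If you later need to tell which part of the merged file came from which URL (my addition, not something the question asks for), a small variation of the loop prints a marker line before each page:
while read -r i
do
echo "===== $i =====" # separator so each page's source URL is visible in the merged file
wget -q -O - "$i"
done < urls.txt > outputFile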
So I have a list of about 1000 URLs in a txt file, one per line. I wish to save the contents of every page to a file. How can I automate this?
You can use wget with the -i option to let it download a list of URLs. Assuming your URLs are stored in a file called urls.txt:
wget -i urls.txt
The problem here might be that the filenames can be the same for multiple websites (e.g. index.html), so wget will append a number, which makes it hard or impossible to connect a file to the original URL just by looking at the filename.
The solution to that would be to use a loop like this:
while read -r line
do
wget "$line" -O <...>
done < urls.txt
You can specify a custom filename with the -O option.
Or you can "build" the file name from the url you are processing.
while read -r line
do
fname=$(echo "$line" | sed -e 's~http[s]*://~~g' -e 's~[^A-Za-z0-9]~-~g')
fname=${fname}.html
wget "$line" -O "$fname"
done < urls.txt
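And since the concern above was connecting each saved file back to its original URL, one extra line inside the same loop (my addition, using a hypothetical map.txt) records the mapping:
while read -r line
do
fname=$(echo "$line" | sed -e 's~http[s]*://~~g' -e 's~[^A-Za-z0-9]~-~g')
fname=${fname}.html
wget "$line" -O "$fname"
echo "$line $fname" >> map.txt # keep a "URL filename" record for later lookup
done < urls.txt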