wget: Append text to default file name - linux

When I download a file using wget without the -O argument, it saves the file with a default name. I would like to append some text to that default name, but the -O option completely overrides the default name.
For example:
wget www.google.com --> saves as index.html (this is the default name)
wget -O foo.html www.google.com --> saves as foo.html
I would like to save the file as ${default_name}_sometext.html. For the above example, it would look something like index_sometext.html. Any ideas appreciated!
Edit: For those wondering why I might need something like this, I have scraped thousands of URLs for a personal project, and I need to preserve the default name along with some attributes for each file.

#!/bin/bash
domain=google.com
some_text=something
# Grab the default file name from wget's "Saving to: 'index.html'" line
default_name=$(wget "https://${domain}" 2>&1 | grep 'Saving to' | cut -d ' ' -f 3 | sed -e 's/[^A-Za-z0-9._-]//g')
# index.html -> index_something.html
mv "$default_name" "${default_name%.*}_${some_text}.${default_name##*.}"
I think it will work for you.
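An alternative sketch that avoids parsing wget's log output: let wget pick its default name inside an empty temporary directory, then rename whatever lands there. This assumes one URL per invocation; the url and suffix values are placeholders.
#!/bin/bash
# Sketch: download into an empty temp dir so the default name is the only
# file present, then rename it with a suffix appended before the extension.
url="https://www.google.com"
suffix="sometext"

tmpdir=$(mktemp -d)
wget -q -P "$tmpdir" "$url"

for f in "$tmpdir"/*; do
    name=$(basename "$f")
    mv "$f" "./${name%.*}_${suffix}.${name##*.}"
done
rmdir "$tmpdir"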

Related

Get wget to download only new items from a list

I've got a file that contains a list of file paths. I’m downloading them like this with wget:
wget -i cram_download_list.txt
However the list is long and my session gets interrupted. I’d like to look at the directory for which files already exist, and only download the outstanding ones.
I’ve been trying to come up with an option involving comm, but can’t work out how to loop it in with wget.
File contents look like this:
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239280/NA07037.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239286/NA11829.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239293/NA11918.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239298/NA11994.final.cram
I’m currently trying to do something like this:
ls *.cram | sed 's/^/ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/run\/ERR323\/ERR3239480\//' > downloaded.txt
comm -3 <(sort cram_download_list.txt) <(sort downloaded.txt) | tr -d " \t" > to_download.txt
wget -i to_download.txt
I’d like to look at the directory for which files already exist, and only download the outstanding ones.
To get that behavior you can use the -nc (--no-clobber) flag: it skips any download whose output file already exists instead of overwriting it. So in your case:
wget -nc -i cram_download_list.txt
Beware that this solution does not handle partially downloaded files.
wget -c -i <(find . -type f -name '*.cram' -printf '%f$\n' |
    grep -vf - cram_download_list.txt)
find prints the names of the existing .cram files, each followed by a $ and a newline. That output is used as an inverted regex match list for grep, i.e. any line of your download list that ends in one of the existing file names is removed before the list is handed to wget.
Added:
-c to finish incomplete files (i.e. resume downloads)
Note: this does not handle spaces or newlines in file names well, but these are FTP URLs, so that should not be a problem in the first place.
If you also want to handle partially transferred files, you need to pass wget the complete list of URLs so that it can check each file's length. For that scenario the only way is:
wget -c -i cram_download_list.txt
Files that are already complete will only be checked and then skipped.
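For completeness, here is a sketch of the comm-style filtering the question was aiming for, done with awk instead. It assumes every URL ends in the file's basename and that downloaded files keep those basenames.
# Build a list of basenames already present locally
ls *.cram 2>/dev/null | sort > downloaded_basenames.txt

# Keep only URLs whose basename is not in that list
awk -F/ 'NR==FNR {seen[$0]; next} !($NF in seen)' \
    downloaded_basenames.txt cram_download_list.txt > to_download.txt

wget -c -i to_download.txt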

How to pull down a list of domains using wget and scan them using grep

I have a list of domain names contained within a file named "domains.txt", formatted like this:
www.google.com
www.stackoverflow.com
www.apple.com
etc...
I want to perform a wget command to pull down a copy of each domain listed inside "domains.txt" and save it as a .html page.
I can do this individually using wget www.google.com but I'm wondering, instead of doing each one separately, can I iterate through the list and save each domain name as a separate .html file?
The second action I want to perform is a scan of these pulled down .html files for keywords, which I have contained in a text file named "keywords.txt". They're formatted like this:
first_keyword
second_keyword
third_keyword
etc...
Ideally, I'd like to have an output that prints the domain name to a text file, with a "yes" beside it if it has been found to contain any of the keywords contained in "keywords.txt". If it's possible to print what keywords were found beside each domain that would be brilliant, but a simple "yes" would be great too. I'm brand new to Linux and scripting, so any help would be greatly appreciated!
I assume the files don't actually contain the quotes; otherwise more code would be needed to strip them.
domains.txt
www.google.com
www.stackoverflow.com
www.apple.com
keywords.txt
first_keyword
second_keyword
third_keyword
You can try something like this:
outfile=tmp.html
while IFS= read -r domain
do
    # Download the page to a temporary file
    wget -O "$outfile" "$domain"
    # -q: report only via exit status; -f keywords.txt: use each line as a fixed-string pattern
    if fgrep -q -f keywords.txt "$outfile"
    then
        echo "$domain" yes
    else
        echo "$domain" no
    fi
    rm "$outfile"
done < domains.txt
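If you also want to keep each page and see which keywords matched (the "if it's possible" part of the question), here is a sketch along the same lines. It assumes the domain names are safe to use as file names and writes the report to a hypothetical results.txt.
#!/bin/bash
# Sketch: keep one .html file per domain and list the keywords found in it.
while IFS= read -r domain
do
    outfile="${domain}.html"
    wget -q -O "$outfile" "$domain"

    # grep -o prints each matching keyword; sort -u removes duplicates
    matches=$(grep -o -F -f keywords.txt "$outfile" | sort -u | tr '\n' ' ')

    if [ -n "$matches" ]
    then
        echo "$domain yes ($matches)"
    else
        echo "$domain no"
    fi
done < domains.txt > results.txt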

Using wget to download images and saving with specified filename

I'm using wget in the macOS terminal to download images from a file where each image URL is on its own line, and that works perfectly with this command:
cut -f1 -d, images.txt | while read url; do wget ${url} -O $(basename ${url}); done
However, I want to specify the output filename it's saved as instead of using the basename. The file name is specified in the next column, separated by either a space or a comma, and I can't quite figure out how to tell wget to use that second column as the -O output name.
I'm sure it's a simple change to my above command but after reading dozens of different posts on here and other sites I can't figure it out. Any help would be appreciated.
If you use whitespace as the separator it's very easy:
while read -r url name; do wget "${url}" -O "${name}"; done < images.txt
Explanation: instead of reading just one variable per line (${url}) as in your example, you read two (${url} and ${name}). The second one is your local filename. I assumed your images.txt file looks something like this:
http://cwsmgmt.corsair.com/newscripts/landing-pages/wallpaper/v3/Wallpaper-v3-2560x1440.jpg test.jpg
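If the file is comma-separated instead, a small variation works (a sketch assuming exactly two columns, URL then file name, with no commas inside either field):
# Read comma-separated "url,name" pairs and download each URL to its given name
while IFS=, read -r url name
do
    wget "$url" -O "$name"
done < images.txt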

Saving multiple URLs at once using Linux CentOS

So I have a list of about 1000 URLs in a txt file, one per line. I wish to save the contents of every page to a file; how can I automate this?
You can use wget with the -i option to let it download a list of URLs. Assuming your URLs are stored in a file called urls.txt:
wget -i urls.txt
The problem here might be that the filename can be the same for multiple websites (e.g. index.html), so wget will append a number, which makes it hard or impossible to connect a file to its original URL just by looking at the filename.
The solution to that would be to use a loop like this:
while read -r line
do
    wget "$line" -O <...>
done < urls.txt
You can specify a custom filename with the -O option.
Or you can "build" the file name from the url you are processing.
while read -r line
do
    # Strip the protocol and replace anything non-alphanumeric with a dash
    fname=$(echo "$line" | sed -e 's~http[s]*://~~g' -e 's~[^A-Za-z0-9]~-~g')
    fname=${fname}.html
    wget "$line" -O "$fname"
done < urls.txt
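If you need to trace each saved file back to its URL later, the same loop can record the pairs as it goes. A sketch, with mapping.tsv as an assumed output name:
# Record "URL<TAB>filename" pairs so every download can be traced back later
while read -r line
do
    fname=$(echo "$line" | sed -e 's~http[s]*://~~g' -e 's~[^A-Za-z0-9]~-~g').html
    wget "$line" -O "$fname"
    printf '%s\t%s\n' "$line" "$fname" >> mapping.tsv
done < urls.txt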

How can I download and set the filenames using wget -i?

I wanted to know how to define the input file for wget, in order to download several files and set a custom name for them, using wget -i filename
Analogous example using -O:
wget -O customname url
-O filename works only when you give it a single URL.
With multiple URLs, all downloaded content ends up in filename.
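For illustration (the example.com URLs are placeholders), both responses end up appended to the same file:
# Everything wget fetches here is written into one file, everything.html
wget -O everything.html https://example.com/a.html https://example.com/b.html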
You can use a while...loop:
while read -r url
do
    # ${url##*/} strips everything up to the last "/", i.e. the URL's basename;
    # replace it with whatever custom name you want
    wget "$url" -O "${url##*/}"
done < urls.txt
