Wget Output in Recursive mode - linux

I am using wget -r to download 3 .zip files from a specified webpage. Here is what I have so far:
wget -r -nd -l1 -A.zip http://www.website.com/example
Right now, the zip files all begin with abc_*.zip where * seems to be a random. I want to have the first downloaded file to be called xyz_1.zip, the second to be xyz_2.zip, and the third to be xyz_3.zip.
Is this possible with wget?
Many thanks!

I don't think it's possible with wget alone. After downloading you could use some simple shell scripting to rename the files, like:
i=1; for f in abc_*.zip; do mv "$f" "xyz_$i.zip"; i=$(($i+1)); done

Try to get a listing first and then download each file separately.
let n=1
wget -nv -l1 -r --spider http://www.website.com/example 2>&1 | \
egrep -io 'http://.*\.zip'| \
while read url; do
wget -nd -nv -O $(echo $url|sed 's%^.*/\(.*\)_.*$%\1%')_$n.zip "$url"
let n++
done

I don't think there is a way you can do it within a single wget command.
wget does have a -O option which you can use to tell it which file to output to, but it won't work in your case because multiple files will get concatenated together.
You will have to write a script which renames the files from abc_*.zip to xyz_*.zip after wget has completed.
Alternatively, invoke wget for one zip file at a time and use the -O option.

Related

Create Directory, download file and execute command from list of URL

I am working on a Red Hat Linux server. My end goal is to run CRB-BLAST on multiple fasta files and have the results from those in separate directories.
My approach is to download the fasta files using wget then run the CRB-BLAST. I have multiple files and would like to be able to download them each to their own directory (the name perhaps should come from the URL list files), then run the CRB-BLAST.
Example URLs:
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_CB_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_13_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_37_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_123_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_195_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_31_chr.v0.1.liftover.CDS.fasta.gz
Ideally, the file name determines the directory name, for example, TC_3370/.
I think there might be a solution with cat URL.txt | mkdir | cd | wget | crb-blast
Currently I just run the commands in line:
mkdir TC_3370
cd TC_3370/
wget url
http://assemblies/Genomes/final_assemblies/10x_meta_assemblies_v1.0/TC_3370_chr.v1.0.maker.CDS.fasta.gz
crb-blast -q TC_3370_chr.v1.0.maker.CDS.fasta.gz -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
Try this Shellcheck-clean program:
#! /bin/bash -p
while read -r url; do
file=${url##*/}
dir=${file%%_chr.*}
mkdir -v -- "$dir"
(
cd "./$dir" || exit 1
wget -- "$url"
crb-blast -q "$file" -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
)
done <URL.txt
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of ${url##*/} etc.
The subshell (( ... )) is used to ensure that the cd doesn't affect the main program.
Another implementation
#!/bin/sh
# Read lines as url as long as it can
while read -r url
do
# Get file name by stripping-out anything up to the last / from the url
file_name=${url##*/}
# Get the destination dir name by stripping anything from the first __chr
dest_dir=${file_name%%_chr*}
# Compose the wget output path
fasta_path="$dest_dir/$file_name"
if
# Successfully created the destination directory AND
mkdir -p -- "$dest_dir" &&
# Successfully downloaded the file
wget --output-file="$fasta_path" --quiet -- "$url"
then
# Process the fasta file into fna
fna_path="$dest_dir/TCV2_annot_cds.fna"
crb-blast -q "$fasta_path" -t "$fna_path" -e 1e-20 -h 4 -o rbbh_TC
else
# Cleanup remove destination directory if any of mkdir or wget failed
rm -fr -- "$dest_dir"
fi
# reading from the URL.txt file for the whole while loop
done < URL.txt
Download files from list is task for -i file option, if you have file named say urls.txt with one URL per line you might simply do
wget -i urls.txt
Note that this will put all files inside current working directory, so if you wish to have them in separate dirs, you would need to move them after wget finish.

How to download multiple files using wget in linux and save them with custom name for each downloaded file

How to download multiple files using wget. Lets say i have a urls.txt containing several urls and i want to save them automatically with a custom filename for each file. How to do this?
I tried downloading 1 by 1 with this format "wget -c url1 -O filename1" successfully and now i want to try do batch download.
You might take look at xargs command, you would need to prepare file with arguments for each wget call, lets say it is named download.txt and
-O file1.html https://www.example.com
-O file2.html https://www.duckduckgo.com
and then use it as follows
cat download.txt | xargs wget -c
which is equivalent to doing
wget -c -O file1.html https://www.example.com
wget -c -O file2.html https://www.duckduckgo.com
Add the input file and a loop:
for i in `cat urls.txt`; do wget $i -O filename-$i; done

Bash, wget remove comma from output filename

I am reading a file with URL's by line by line then I pass URL to the wget:
FILE=/home/img-url.txt
while read line; do
url=$line
wget -N -P /home/img/ $url
done < $FILE
This works, but some file contains comma in the filename. How I can save the file without the comma?
Example:
http://xy.com/0005.jpg -> saved as 0005.jpg
http://xy.com/0022,22.jpg -> save as 002222.jpg not as 0022,22
I hope you find my question interesting.
UPDATE:
We have some nice solution, but is there any solution to the time stamping error?
WARNING: timestamping does nothing in combination with -O. See the manual
for details.
This should work:
url="$line"
filename="${url##*/}"
filename="${filename//,/}"
wget -P /home/img/ "$url" -O "$filename"
Using -N and -O both will throw a warning message. wget manual says:
-N (for timestamp-checking) is not supported in
combination with -O: since file is always newly created, it will always
have a very new timestamp.
So, when you use -O option, it actually creates a new file with new timestamping and thus the -N option becomes dummy (it can't do what it is for). If you want to preserve the timestapming, then a workaround might be this:
url="$line"
wget -N -P /home/img/ "$url"
file="${url##*/}"
newfile="${filename//,/}"
[[ $file != $newfile ]] && cp -p /home/img/"$file" /home/img/"$newfile" && rm /home/img/"$file"
In the body of the loop you need to generate the filename from the URL without commas and, without the leading part of the URL, and tell wget to save under other name.
url=$line
file=`echo $url | sed -e 's|^.*/||' -e 's/,//g'`
wget -N -P /home/image/dema-ktlg/ -O $file $url
In the meantime I wrote this:
url=$line
$file=`echo ${url##*/} | sed 's/,//'`
wget -N -P /home/image/dema-ktlg/ -O $file $url
Seems to work fine, is there any trivial problem with my code?

wget --accept files containing pattern

I am trying to write a script that downloads all the files linked in a certain page. These files must contain specific strings and have certain extensions.
Let's say that I want to download all files that contain the string "1080" or "1080p" etc. and that have as extension ".mov" ".avi" ".wmv" etc. This was to show that both the strings and the extensions are multiple.
This is what I have done so far:
wget -Amov -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com
Any help is really apreciated.
Thank you.
You can add a pattern for the -A switch, like this:
wget -A "*1080*mov" -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com
This example will get all files with "1080", except gif & png files:
wget -A "*1080*" -R gif,png -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com

wget -O for non-existing save path?

I can't wget while there is no path already to save. I mean, wget doens't work for the non-existing save paths. For e.g:
wget -O /path/to/image/new_image.jpg http://www.example.com/old_image.jpg
If /path/to/image/ is not previously existed, it always returns:
No such file or directory
How can i make it work to automatically create the path and save?
Try curl
curl http://www.site.org/image.jpg --create-dirs -o /path/to/save/images.jpg
mkdir -p /path/i/want && wget -O /path/i/want/image.jpg http://www.com/image.jpg
To download a file with wget, into a new directory, use --directory-prefix without -O:
wget --directory-prefix=/new/directory/ http://www.example.com/old_image.jpg
Using -O new_file in conjunction with --directory-prefix, will not create the new directory structure, and will save the new file in the current directory.
It may even fail with "No such file or directory" error, if you specify -O /new/directory/new_file
I was able to create folder if it doesn't exists with this command:
wget -N http://www.example.com/old_image.jpg -P /path/to/image
wget is only getting a file NOT creating the directory structure for you (mkdir -p /path/to/image/), you have to do this by urself:
mkdir -p /path/to/image/ && wget -O /path/to/image/new_image.jpg http://www.example.com/old_image.jpg
You can tell wget to create the directory (so you dont have to use mkdir) with the parameter --force-directories
alltogether this would be
wget --force-directories -O /path/to/image/new_image.jpg http://www.example.com/old_image.jpg
After searching a lot, I finally found a way to use wget to download for non-existing path.
wget -q --show-progress -c -nc -r -nH -i "$1"
=====
Clarification
-q
--quiet --show-progress
Kill annoying output but keep the progress-bar
-c
--continue
Resume download if the connection lost
-nc
--no-clobber
Overwriting file if exists
-r
--recursive
Download in recursive mode (What topic creator asked for!)
-nH
--no-host-directories
Tell wget do not use the domain as a directory (for e.g: https://example.com/what/you/need
- without this option, it will download to "example.com/what/you/need")
-i
--input-file
File with URLs need to be download (in case you want to download a lot of URLs,
otherwise just remove this option)
Happy wget-ing!

Resources