how to recursively fetch some data with pattern using wget - linux

I am trying to download some specific files from this website (http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod/), which keeps 10 days of data. I want to download all the files starting with "ST4" from all the directories starting with "nam_pcpn_anal". I can download all the files starting with "ST4" from one folder like this:
wget -r -nd -N --no-parent -nH --cut-dirs=100 -P ~/test/ -A ST4* 'http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod/nam_pcpn_anal.20160625/'
but I do not know how to search for the ST4 files recursively across all the directories. I thought the following would work, but it does not:
wget -r -nd -N --no-parent -nH --cut-dirs=100 -P ~/test/ -A ST4* --accept nam_pcpn_anal*/ST4* 'http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod/'
Any ideas?

The wget manual shows:
-I list
--include-directories=list
Specify a comma-separated list of directories you wish to follow
when downloading. Elements of list may contain wildcards.
So, you could try:
wget -r -nd -N --no-parent -nH --cut-dirs=100 -P ~/test/ \
-A 'ST4*' -I '*/nam_pcpn_anal.*' \
'http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod/'
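If the wildcard in -I does not behave as hoped, one fallback is to call the single-directory command (the one the question reports as working) once per date. The following is only a sketch: the helper names dir_url and fetch_dates are made up for illustration, and the dates have to be supplied by the caller.

```shell
#!/bin/sh
base='http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod'

# Print the directory URL for one date given as YYYYMMDD (hypothetical helper).
dir_url() {
    printf '%s/nam_pcpn_anal.%s/\n' "$base" "$1"
}

# Download the ST4* files for every date passed as an argument (hypothetical helper).
fetch_dates() {
    for d in "$@"; do
        wget -r -nd -N --no-parent -nH -P ~/test/ -A 'ST4*' "$(dir_url "$d")"
    done
}
```

For example, fetch_dates 20160625 20160626 would run the download for those two directories.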

Related

Create Directory, download file and execute command from list of URL

I am working on a Red Hat Linux server. My end goal is to run CRB-BLAST on multiple fasta files and have the results from those in separate directories.
My approach is to download the fasta files using wget then run the CRB-BLAST. I have multiple files and would like to be able to download them each to their own directory (the name perhaps should come from the URL list files), then run the CRB-BLAST.
Example URLs:
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_CB_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_13_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_37_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_123_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_195_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_31_chr.v0.1.liftover.CDS.fasta.gz
Ideally, the file name determines the directory name, for example, TC_3370/.
I think there might be a solution with cat URL.txt | mkdir | cd | wget | crb-blast
Currently I just run the commands in line:
mkdir TC_3370
cd TC_3370/
wget url
http://assemblies/Genomes/final_assemblies/10x_meta_assemblies_v1.0/TC_3370_chr.v1.0.maker.CDS.fasta.gz
crb-blast -q TC_3370_chr.v1.0.maker.CDS.fasta.gz -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
Try this Shellcheck-clean program:
#! /bin/bash -p
while read -r url; do
file=${url##*/}
dir=${file%%_chr.*}
mkdir -v -- "$dir"
(
cd "./$dir" || exit 1
wget -- "$url"
crb-blast -q "$file" -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
)
done <URL.txt
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of ${url##*/} etc.
The subshell (( ... )) is used to ensure that the cd doesn't affect the main program.
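To see what the two parameter expansions do, here is a small sketch that applies them to the first example URL from the question and prints the results:

```shell
#!/bin/sh
url='http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz'

# ${url##*/} removes the longest prefix matching "*/", leaving the file name.
file=${url##*/}
# ${file%%_chr.*} removes the longest suffix matching "_chr.*", leaving the directory name.
dir=${file%%_chr.*}

printf '%s\n' "$file" "$dir"
# prints:
# TC_3370_chr.v0.1.liftover.CDS.fasta.gz
# TC_3370
```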
Another implementation
#!/bin/sh
# Read lines as url as long as it can
while read -r url
do
# Get file name by stripping-out anything up to the last / from the url
file_name=${url##*/}
# Get the destination dir name by stripping everything from the first _chr
dest_dir=${file_name%%_chr*}
# Compose the wget output path
fasta_path="$dest_dir/$file_name"
if
# Successfully created the destination directory AND
mkdir -p -- "$dest_dir" &&
# Successfully downloaded the file
wget --output-document="$fasta_path" --quiet -- "$url"
then
# Process the fasta file into fna
fna_path="$dest_dir/TCV2_annot_cds.fna"
crb-blast -q "$fasta_path" -t "$fna_path" -e 1e-20 -h 4 -o rbbh_TC
else
# Cleanup remove destination directory if any of mkdir or wget failed
rm -fr -- "$dest_dir"
fi
# reading from the URL.txt file for the whole while loop
done < URL.txt
Downloading files from a list is a job for wget's -i file option. If you have a file named, say, urls.txt with one URL per line, you can simply run:
wget -i urls.txt
Note that this puts all the files in the current working directory, so if you want them in separate directories, you need to move them after wget finishes.
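The move step could look roughly like this. It is a sketch that assumes the file naming from the question (a prefix followed by _chr.); the function name sort_into_dirs is made up for illustration.

```shell
#!/bin/sh
# Move every *_chr.* file in the current directory into a directory
# named after the part of the file name before "_chr." (hypothetical helper).
sort_into_dirs() {
    for f in *_chr.*; do
        # Skip the literal pattern when nothing matches.
        [ -e "$f" ] || continue
        d=${f%%_chr.*}
        mkdir -p -- "$d" && mv -- "$f" "$d/"
    done
}
```

Run it in the download directory after wget -i urls.txt has finished.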

Using wget on Linux to scan sub-folders for specific files

Running wget -r -P Home -A jpg http://example.com gets me the files from that website directory. What I am looking for is a way to query a range, something like: wget -r -P home -A jpg http://example.com/from 65121 to 75121/ file_ 100 to 200.jpg
Example(s):
wget -r -P home -A jpg http://example.com/65122/file_102.jpg
wget -r -P home -A jpg http://example.com/65123/file_103.jpg
wget -r -P home -A jpg http://example.com/65124/file_104.jpg
Is it possible to achieve that on a Linux distro?
I'm fairly new to Linux OS, any tips are welcome.
Use a nested for loop and some bash scripting:
for i in {65121..75121}; do for j in {100..200}; do wget -r -P home -A jpg "http://example.com/${i}/file_${j}.jpg"; done; done
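The shell's brace expansion can generate the URLs, so a single wget call is enough. Note that -A takes one comma-separated list, so the file range is written here as glob patterns rather than as a brace expansion:
wget -nd -H -p -A 'file_1[0-9][0-9].jpg,file_200.jpg' -e robots=off http://example.com/{65121..75121}/
If the directories contain only file_{100..200}.jpg, it is simpler:
wget -nd -H -p -A jpg -e robots=off http://example.com/{65121..75121}/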
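Because -A expects a single comma-separated list while brace expansion produces separate words, an explicit list of file names has to be joined with commas first. A minimal sketch of such a helper follows; the name accept_list is made up for illustration.

```shell
#!/bin/sh
# Print a comma-separated accept list: file_<first>.jpg,...,file_<last>.jpg
# (hypothetical helper; $1 is the first number, $2 the last).
accept_list() {
    i=$1
    list=
    while [ "$i" -le "$2" ]; do
        # Add a comma only when the list is not empty yet.
        list="${list:+$list,}file_$i.jpg"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}
```

The result can then be passed as one argument, for example -A "$(accept_list 100 200)".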

wget --accept files containing pattern

I am trying to write a script that downloads all the files linked in a certain page. These files must contain specific strings and have certain extensions.
Let's say that I want to download all files that contain the string "1080" or "1080p" etc. and that have as extension ".mov" ".avi" ".wmv" etc. This was to show that both the strings and the extensions are multiple.
This is what I have done so far:
wget -Amov -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com
Any help is really appreciated.
Thank you.
You can add a pattern for the -A switch, like this:
wget -A "*1080*mov" -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com
This example will get all files with "1080", except gif & png files:
wget -A "*1080*" -R gif,png -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com
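Since the question asks for several extensions, the patterns can be combined into one comma-separated -A list. The sketch below builds such a list from a string and a set of extensions; the function name accept_patterns is made up for illustration.

```shell
#!/bin/sh
# Print "*<string>*.<ext>" for each extension, joined by commas
# (hypothetical helper; $1 is the string, the rest are extensions).
accept_patterns() {
    str=$1
    shift
    list=
    for ext in "$@"; do
        list="${list:+$list,}*$str*.$ext"
    done
    printf '%s\n' "$list"
}

accept_patterns 1080 mov avi wmv
# prints: *1080*.mov,*1080*.avi,*1080*.wmv
```

The output would be used as -A "$(accept_patterns 1080 mov avi wmv)" in the wget command above. A file containing "1080p" is matched as well, because the pattern only requires "1080" to appear somewhere in the name.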

wget -O for non-existing save path?

I can't use wget when the save path does not already exist. I mean, wget doesn't work for non-existing save paths. For example:
wget -O /path/to/image/new_image.jpg http://www.example.com/old_image.jpg
If /path/to/image/ does not already exist, it always returns:
No such file or directory
How can I make it create the path automatically and save the file?
Try curl
curl http://www.site.org/image.jpg --create-dirs -o /path/to/save/images.jpg
mkdir -p /path/i/want && wget -O /path/i/want/image.jpg http://www.com/image.jpg
To download a file with wget, into a new directory, use --directory-prefix without -O:
wget --directory-prefix=/new/directory/ http://www.example.com/old_image.jpg
Using -O new_file in conjunction with --directory-prefix will not create the new directory structure, and will save the new file in the current directory.
It may even fail with a "No such file or directory" error if you specify -O /new/directory/new_file
I was able to create the folder if it doesn't exist with this command:
wget -N http://www.example.com/old_image.jpg -P /path/to/image
wget only fetches the file; it does not create the directory structure for a -O path, so you have to create it yourself with mkdir -p:
mkdir -p /path/to/image/ && wget -O /path/to/image/new_image.jpg http://www.example.com/old_image.jpg
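To avoid typing the directory twice, the parent directory can be derived from the output path. This is a minimal sketch; the function name ensure_parent is made up for illustration.

```shell
#!/bin/sh
# Create the parent directory of the given file path if it is missing
# (hypothetical helper).
ensure_parent() {
    mkdir -p -- "$(dirname -- "$1")"
}
```

With out=/path/to/image/new_image.jpg, the download then becomes ensure_parent "$out" && wget -O "$out" http://www.example.com/old_image.jpg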
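wget's --force-directories option does create directories, but only a hierarchy that mirrors the URL (host and path) under the --directory-prefix; it does not create missing directories for a -O path.
For example, this saves the file as /path/to/image/www.example.com/old_image.jpg:
wget --force-directories -P /path/to/image http://www.example.com/old_image.jpg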
After searching a lot, I finally found a way to use wget to download to a non-existing path.
wget -q --show-progress -c -nc -r -nH -i "$1"
=====
Clarification
-q
--quiet --show-progress
Kill annoying output but keep the progress-bar
-c
--continue
Resume the download if the connection is lost
-nc
--no-clobber
Do not overwrite a file if it already exists
-r
--recursive
Download in recursive mode (What topic creator asked for!)
-nH
--no-host-directories
Tell wget not to use the domain as a directory (e.g. for https://example.com/what/you/need,
without this option it will download to "example.com/what/you/need")
-i
--input-file
File with the URLs to download (in case you want to download a lot of URLs;
otherwise just remove this option)
Happy wget-ing!

Wget Output in Recursive mode

I am using wget -r to download 3 .zip files from a specified webpage. Here is what I have so far:
wget -r -nd -l1 -A.zip http://www.website.com/example
Right now, the zip files all begin with abc_*.zip where * seems to be random. I want the first downloaded file to be called xyz_1.zip, the second xyz_2.zip, and the third xyz_3.zip.
Is this possible with wget?
Many thanks!
I don't think it's possible with wget alone. After downloading you could use some simple shell scripting to rename the files, like:
i=1; for f in abc_*.zip; do mv "$f" "xyz_$i.zip"; i=$(($i+1)); done
Try to get a listing first and then download each file separately.
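n=1
wget -nv -l1 -r --spider http://www.website.com/example 2>&1 | \
grep -Eio 'http://.*\.zip' | \
while read -r url; do
# Keep the part of the file name before the last underscore, then append the counter
prefix=$(printf '%s\n' "$url" | sed 's%^.*/\(.*\)_.*$%\1%')
wget -nd -nv -O "${prefix}_$n.zip" "$url"
n=$((n + 1))
done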
I don't think there is a way you can do it within a single wget command.
wget does have a -O option which you can use to tell it which file to output to, but it won't work in your case because multiple files will get concatenated together.
You will have to write a script which renames the files from abc_*.zip to xyz_*.zip after wget has completed.
Alternatively, invoke wget for one zip file at a time and use the -O option.
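The one-file-at-a-time approach needs a numbered output name for each URL. A rough sketch of that pairing step follows; the function name number_urls and the input file urls.txt are assumptions made for illustration.

```shell
#!/bin/sh
# Read URLs on stdin and print "<output name><TAB><url>" for each one,
# numbering the names xyz_1.zip, xyz_2.zip, ... (hypothetical helper).
number_urls() {
    n=1
    while read -r url; do
        printf 'xyz_%d.zip\t%s\n' "$n" "$url"
        n=$((n + 1))
    done
}
```

Each printed pair can then be fed to a separate wget -nd -nv -O "$name" "$url" call, for instance by reading the output of number_urls < urls.txt line by line.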
