Using wget on Linux to scan sub-folders for specific files

Running wget -r -P home -A jpg http://example.com gives me a list of files from that website's directory. What I'm looking for is a way to query a range of directories and file names, something like: wget -r -P home -A jpg http://example.com/(from 65121 to 75121)/file_(100 to 200).jpg
Example(s):
wget -r -P home -A jpg http://example.com/65122/file_102.jpg
wget -r -P home -A jpg http://example.com/65123/file_103.jpg
wget -r -P home -A jpg http://example.com/65124/file_104.jpg
Is it possible to achieve that on a Linux distro?
I'm fairly new to Linux, so any tips are welcome.

Use a nested for loop and some bash scripting:
for i in {65121..75121}; do
    for j in {100..200}; do
        wget -r -P home -A jpg "http://example.com/${i}/file_${j}.jpg"
    done
done
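Note that any directory/file combination that doesn't exist on the server simply produces a 404 and the loop moves on to the next URL, since wget's exit status isn't checked here.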

Shell brace expansion can generate the whole range of directory URLs for wget, and -A takes a comma-separated list of file-name patterns:
wget -nd -H -p -A 'file_1[0-9][0-9].jpg,file_200.jpg' -e robots=off http://example.com/{65121..75121}/
If the directories contain nothing but file_100.jpg to file_200.jpg, accepting every JPEG is simpler:
wget -nd -H -p -A jpg -e robots=off http://example.com/{65121..75121}/
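You can preview exactly what brace expansion will hand to wget by echoing a small range first:
echo http://example.com/{65121..65124}/
# prints: http://example.com/65121/ http://example.com/65122/ http://example.com/65123/ http://example.com/65124/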

Related

Create Directory, download file and execute command from list of URL

I am working on a Red Hat Linux server. My end goal is to run CRB-BLAST on multiple fasta files and have the results from those in separate directories.
My approach is to download the fasta files using wget and then run CRB-BLAST. I have multiple files and would like to download each one into its own directory (ideally the directory name would come from the URLs in the list), then run CRB-BLAST.
Example URLs:
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_CB_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_13_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_37_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_123_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_195_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_31_chr.v0.1.liftover.CDS.fasta.gz
Ideally, the file name determines the directory name, for example, TC_3370/.
I think there might be a solution with cat URL.txt | mkdir | cd | wget | crb-blast
Currently I just run the commands one at a time:
mkdir TC_3370
cd TC_3370/
wget http://assemblies/Genomes/final_assemblies/10x_meta_assemblies_v1.0/TC_3370_chr.v1.0.maker.CDS.fasta.gz
crb-blast -q TC_3370_chr.v1.0.maker.CDS.fasta.gz -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
Try this Shellcheck-clean program:
#! /bin/bash -p
while read -r url; do
    file=${url##*/}
    dir=${file%%_chr.*}
    mkdir -v -- "$dir"
    (
        cd "./$dir" || exit 1
        wget -- "$url"
        crb-blast -q "$file" -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
    )
done <URL.txt
See Removing part of a string (BashFAQ/100: How do I do string manipulation in bash?) for an explanation of ${url##*/} etc.
The subshell ( ... ) is used to ensure that the cd doesn't affect the main program.
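As a quick illustration with the first URL from the question, the two parameter expansions produce:
url=http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz
file=${url##*/}       # TC_3370_chr.v0.1.liftover.CDS.fasta.gz (everything up to and including the last / removed)
dir=${file%%_chr.*}   # TC_3370 (longest match of _chr.* removed from the end)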
Another implementation
#!/bin/sh
# Read URLs line by line for as long as there are lines to read
while read -r url
do
    # Get the file name by stripping everything up to the last / from the url
    file_name=${url##*/}
    # Get the destination dir name by stripping everything from the first _chr onward
    dest_dir=${file_name%%_chr*}
    # Compose the wget output path
    fasta_path="$dest_dir/$file_name"
    if
        # Successfully created the destination directory AND
        mkdir -p -- "$dest_dir" &&
        # Successfully downloaded the file
        wget --quiet --output-document="$fasta_path" -- "$url"
    then
        # Run CRB-BLAST on the downloaded fasta file
        fna_path="$dest_dir/TCV2_annot_cds.fna"
        crb-blast -q "$fasta_path" -t "$fna_path" -e 1e-20 -h 4 -o rbbh_TC
    else
        # Cleanup: remove the destination directory if either mkdir or wget failed
        rm -fr -- "$dest_dir"
    fi
# The whole while loop reads from the URL.txt file
done < URL.txt
Downloading files from a list is exactly what the -i file option is for. If you have a file named, say, urls.txt with one URL per line, you can simply do:
wget -i urls.txt
Note that this will put all the files in the current working directory, so if you want them in separate directories, you will need to move them after wget finishes.
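If you want to reproduce the per-directory layout afterwards, a minimal sketch (assuming the same *_chr.* naming scheme as the URLs above) moves each downloaded file into a directory named after its prefix:
for f in *_chr.*; do
    d=${f%%_chr*}        # e.g. TC_3370
    mkdir -p -- "$d"
    mv -- "$f" "$d/"
done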

How to recursively fetch some data with a pattern using wget

I am trying to download some specific files from this website (http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod/); they keep 10 days of data. I want to download all the files starting with "ST4" from all the directories starting with "nam_pcpn_anal". I can download all the files starting with "ST4" from one folder like:
wget -r -nd -N --no-parent -nH --cut-dirs=100 -P ~/test/ -A ST4* 'http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod/nam_pcpn_anal.20160625/'
but I do not know how to search for ST4 recursively. I thought the following would work, but it doesn't:
wget -r -nd -N --no-parent -nH --cut-dirs=100 -P ~/test/ -A ST4* --accept nam_pcpn_anal*/ST4* 'http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod/'
Any ideas?
The wget manual shows:
-I list
--include-directories=list
Specify a comma-separated list of directories you wish to follow
when downloading. Elements of list may contain wildcards.
So, you could try:
wget -r -nd -N --no-parent -nH --cut-dirs=100 -P ~/test/ \
-A 'ST4*' -I '*/nam_pcpn_anal.*' \
'http://nomads.ncep.noaa.gov/pub/data/nccf/com/hourly/prod/'

Linux downloading a file appending to file

So I'm downloading a file only to observe the transfer, not actually storing the file itself. I just want to see the speeds and log them.
wget http://www.google.com/download -a log.log -O /dev/null &
wget http://www.google.com/download -a log.log -O /dev/null &
wget http://www.google.com/download -a log.log -O /dev/null
I am trying to download simultaneously, but the log output overlaps. How do I prevent this?
You could first write your output to different files and then merge them together.
This should work:
wget http://www.google.com/download -a log1.log -O /dev/null &
wget http://www.google.com/download -a log2.log -O /dev/null &
wget http://www.google.com/download -a log3.log -O /dev/null
wait                  # make sure the two background downloads have finished
cat log2.log >> log1.log
cat log3.log >> log1.log
Now you should have all your output in log1.log
If you want to append, -a is fine. If you don't want to append but overwrite the log instead, change -a to -o:
wget http://www.google.com/download -o log.log -O /dev/null
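If you scale this up to more simultaneous downloads, a small loop (a sketch reusing the same throwaway URL as above) keeps one log per process and waits for all of them before merging:
for i in 1 2 3 4 5; do
    wget http://www.google.com/download -a "log$i.log" -O /dev/null &
done
wait                  # let all background downloads finish
cat log2.log log3.log log4.log log5.log >> log1.log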

UNIX loop through list of url and save using wget

I am trying to download many files and can do so in a long-winded way in Unix, but how can I do it with a loop? I have many tables like CA30 and CA1-3 to download. Can I put the table names in a list, like list("CA30", "CA1-3"), and have a loop go through the list?
#!/bin/bash
# get the CA30 files and put into folder for CA30
sudo wget -PO "https://www.bea.gov/regional/zip/CA30.zip"
sudo mkdir -p in/CA30
sudo unzip O/CA30.zip -d in/CA30
# get the CA1-3 files and put into folder for CA1-3
sudo wget -PO "https://www.bea.gov/regional/zip/CA1-3.zip"
sudo mkdir -p in/CA1-3
sudo unzip O/CA1-3.zip -d in/CA1-3
#!/bin/bash
for base in CA30 CA1-3; do
    # get the $base files and put into folder for $base
    wget -PO "https://www.bea.gov/regional/zip/${base}.zip"
    mkdir -p "in/${base}"
    unzip "O/${base}.zip" -d "in/${base}"
done
I have removed sudo - there's no reason to run these operations with superuser privileges. If there is a particular folder you can write into, change the working directory to it instead of escalating.
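If you would rather keep the table names in a file instead of hard-coding them, a sketch of the same loop reading from a hypothetical tables.txt (one name per line) looks like this:
#!/bin/bash
# Sketch: read table names (e.g. CA30, CA1-3) from tables.txt, one per line
while read -r base; do
    wget -PO "https://www.bea.gov/regional/zip/${base}.zip"
    mkdir -p "in/${base}"
    unzip "O/${base}.zip" -d "in/${base}"
done < tables.txt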

wget --accept files containing pattern

I am trying to write a script that downloads all the files linked in a certain page. These files must contain specific strings and have certain extensions.
Let's say that I want to download all files that contain the string "1080" or "1080p" etc. and that have the extension ".mov", ".avi", ".wmv", etc. The point is that there are multiple strings as well as multiple extensions.
This is what I have done so far:
wget -Amov -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com
Any help is really appreciated.
Thank you.
You can add a pattern for the -A switch, like this:
wget -A "*1080*mov" -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com
This example will get all files with "1080", except gif & png files:
wget -A "*1080*" -R gif,png -r -np -nc -l1 --no-check-certificate -e robots=off http://www.example.com
