Create directory, download file, and execute command from a list of URLs - Linux

I am working on a Red Hat Linux server. My end goal is to run CRB-BLAST on multiple fasta files and have the results from those in separate directories.
My approach is to download the fasta files using wget and then run CRB-BLAST. I have multiple files and would like to download each one to its own directory (the name should perhaps come from the URL list), then run CRB-BLAST.
Example URLs:
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_CB_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_13_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_37_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_123_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_195_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_31_chr.v0.1.liftover.CDS.fasta.gz
Ideally, the file name determines the directory name, for example, TC_3370/.
I think there might be a solution with cat URL.txt | mkdir | cd | wget | crb-blast
Currently I just run the commands one at a time:
mkdir TC_3370
cd TC_3370/
wget http://assemblies/Genomes/final_assemblies/10x_meta_assemblies_v1.0/TC_3370_chr.v1.0.maker.CDS.fasta.gz
crb-blast -q TC_3370_chr.v1.0.maker.CDS.fasta.gz -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC

Try this Shellcheck-clean program:
#! /bin/bash -p

while read -r url; do
    # File name: everything after the last / in the URL
    file=${url##*/}
    # Directory name: everything before the first _chr. in the file name
    dir=${file%%_chr.*}
    mkdir -v -- "$dir"
    (
        cd "./$dir" || exit 1
        wget -- "$url"
        crb-blast -q "$file" -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
    )
done <URL.txt
See "How do I do string manipulation in bash?" (BashFAQ/100) for an explanation of ${url##*/} and ${file%%_chr.*}.
The subshell ( ... ) is used to ensure that the cd doesn't affect the main program.
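To see what those expansions do, here is a quick illustration using the first URL from the question:
url=http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz
file=${url##*/}       # TC_3370_chr.v0.1.liftover.CDS.fasta.gz  (strip everything up to the last /)
dir=${file%%_chr.*}   # TC_3370  (strip everything from the first _chr. onwards)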

Another implementation
#!/bin/sh

# Read the URLs line by line
while read -r url
do
    # Get the file name by stripping everything up to the last / from the URL
    file_name=${url##*/}
    # Get the destination dir name by stripping everything from the first _chr onwards
    dest_dir=${file_name%%_chr*}
    # Compose the wget output path
    fasta_path="$dest_dir/$file_name"
    if
        # Successfully created the destination directory AND
        mkdir -p -- "$dest_dir" &&
        # Successfully downloaded the file (--output-document saves the download under that path)
        wget --output-document="$fasta_path" --quiet -- "$url"
    then
        # Run CRB-BLAST; the annotation CDS file is expected to be in the destination directory
        fna_path="$dest_dir/TCV2_annot_cds.fna"
        crb-blast -q "$fasta_path" -t "$fna_path" -e 1e-20 -h 4 -o rbbh_TC
    else
        # Clean up: remove the destination directory if mkdir or wget failed
        rm -fr -- "$dest_dir"
    fi
# The whole while loop reads from URL.txt
done < URL.txt

Downloading files from a list is a job for the -i file option. If you have a file named, say, urls.txt with one URL per line, you can simply do
wget -i urls.txt
Note that this will put all the files in the current working directory, so if you want them in separate directories, you will need to move them after wget finishes.
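If you go that route, a rough follow-up sketch could look like this, assuming the same TC_*_chr... naming convention as in the question, with the directory name again taken from the part before _chr:
wget -i urls.txt
for f in TC_*_chr.*.fasta.gz; do
    d=${f%%_chr*}
    mkdir -p -- "$d"
    mv -- "$f" "$d/"
done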

Related

Executing the same command for several files in the same directory in Linux

I'd like to execute the following command for several files in the same directory in Linux:
../../../../../openSMILE-2.1.0/SMILExtract -C ../../../../../openSMILE-2.1.0/config/IS13_ComParE.conf -I inputfilename.wav -D outputfilename.csv
there are several files (named 1.wav, 2.wav, 3.wav) in the directory and if I execute
../../../../../openSMILE-2.1.0/SMILExtract -C ../../../../../openSMILE-2.1.0/config/IS13_ComParE.conf -nologfile 1 -noconsoleoutput 1 -I 1.wav -D 1.csv
it outputs 1.csv.
How can I create 1.csv, 2.csv, 3.csv, ... by executing just one command in Linux? (Or do I have to make a .sh file?)
It's probably cleaner to put the following into a script, but you can type it directly into the bash command line as well:
#! /bin/bash
for file in *.wav ; do
    prefix=${file%.wav}   # Strip the .wav suffix
    ../../../../../openSMILE-2.1.0/SMILExtract \
        -C ../../../../../openSMILE-2.1.0/config/IS13_ComParE.conf \
        -I "$file" -D "$prefix".csv
done

Extracting a tar file and then cd'ing into it when the extracted directory has a different name

I am creating a bash script to extract a tar file, cd into it, and then run another script. So far this has been working pretty well with my code below; however, I ran into a case where the extracted folder name is different from the .tar file name, which causes an issue. So my question is: how should I handle cases where the extracted directory name is different from the .tar file name?
e.g., my_file.tar ---> after extraction ----> my_different_file_name
#!/bin/bash
fname=$1
echo "the file you are about to extract is $fname"
if [ -f "$fname" ]; then    # if the file exists
    tar -xvzf "$fname"      # extract it
    cd "${fname%.*}"        # `%.*` strips the extension from filename.tgz; cd into the result
    echo "${fname%.*}"
    echo "$PWD"
    loadIt                  # another script to load
fi
You could do a:
topDir=$(tar -xvzf "$fname" | sed "s|/.*$||" | uniq)
[ "$(wc -w <<< "$topDir")" -eq 1 ] || exit 1
echo "topDir=$topDir"
Explanation: the first command untars verbosely (printing every file as it is extracted), strips each path down to its leading directory name with sed, and pipes the result into uniq, so it effectively returns a list of the top-level directories in the tar file. The next line checks that there is exactly one entry in topDir, and exits otherwise.
At this point $topDir will be the directory you want to cd into.
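Plugged into the script from the question, that could look roughly like this (a sketch, assuming loadIt is on the PATH as in the original):
#!/bin/bash
fname=$1
if [ -f "$fname" ]; then
    topDir=$(tar -xvzf "$fname" | sed "s|/.*$||" | uniq)
    [ "$(wc -w <<< "$topDir")" -eq 1 ] || exit 1
    cd "$topDir" || exit 1
    loadIt
fi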
Maybe you could do something like this:
cd "$(tar -tf "$fname" | head -1)"
If you don't mind moving the directory around after you extract it you can do something like this
# Create a temporary directory
$ tmpd=$(mktemp -d)
# Change to the temporary directory
$ pushd "$tmpd"
# Extract the tarball
$ tar -xf "$fname"
# Glob the directory name
$ d=(*)
# Error if we have more (or fewer) than one directory
$ [ "${#d[@]}" -eq 1 ] || exit 1
# Explicitly use just the first directory (optional since `$d` does the same thing)
$ d=${d[0]}
# Move the extracted directory to the previous directory
$ mv "$d" "$OLDPWD"
# Change back to the starting directory
$ popd
# Remove the (now empty) temporary directory
$ rmdir "$tmpd"
# Change into the extracted directory
$ cd "$d"
# Run 'loadIt'
$ loadIt

Piping commands into an if statement

I have a bash script that puts a bunch of commands to make a directory into a text file. Then it cats the file into sh to run the commands. What I am trying to do is only run the command if the directory doesn't already exist.
Here is what I have:
A text file with something like this:
mkdir /path/to/a/directory
mkdir /path/to/another/directory
mkdir /path/to/yet/another/directory
In my script I have a line like this
cat /path/to/my/file.txt | sh
But is there a way to do something like this?
cat /path/to/my/file.txt | if path already exists then go to the next, if not | sh
In other words I would like to skip the attempt to make the directory if the path already exists.
Update: The OP has since clarified that the use of mkdir is just an example, and that he needs a generic mechanism to conditionally execute lines from a text file containing shell commands, based on whether each command refers to an existing directory or not:
while read -r cmd dir; do [[ -d $dir ]] || eval "$cmd $dir"; done < /path/to/my/file.txt
The while loop reads the text file containing the shell commands line by line.
read -r cmd dir parses each line into the first token - assumed to be the command (mkdir in the sample input) - and the rest, assumed to be the directory path.
[[ -d $dir ]] tests the existence of the directory path, and || only executes its RHS if the test fails, i.e., if the directory does not exist.
eval "$cmd $path" then executes the line; note that use of eval here is not any less secure than piping to sh - in both cases you must trust the strings representing the commands. (Using eval from the current Bash shell means that Bash will execute the command, not sh, but I'm assuming that's not a problem.)
Original answer, based on the assumption that mkdir is actually used:
The simplest approach in your case is to add the -p option to your mkdir calls, which will quietly ignore attempts to create a directory that already exists:
mkdir -p /path/to/a/directory
mkdir -p /path/to/another/directory
mkdir -p /path/to/yet/another/directory
To put it differently: mkdir -p ensures the existence of the target directory, whether it already exists or has to be created.
(mkdir -p can still fail, for instance when the target path is a file rather than a directory, or if you have insufficient permissions to create the directory.)
You can then simply pass the file to sh (no need for cat and a pipe, which is less efficient):
sh /path/to/my/file.txt
In case you do not control creation of the input file, you can use sed to insert the -p option:
sed 's/^mkdir /&-p /' /path/to/my/file.txt | sh
I'm not clear whether you want to check for the existence of files or directories, but here's how to do it:
Run your command if the file exists:
[ -f /path/to/my/file.txt ] && cat /path/to/my/file.txt | sh
or to check for directories:
[ -d /path/to/my/directory ] && cat /path/to/my/file.txt | sh
Write your own mkdir function.
Assuming your file doesn't use mkdir -p anywhere, this should work (command mkdir bypasses the function and calls the real mkdir, which avoids infinite recursion; note that the exported function is only picked up if sh is actually bash):
mkdir() {
    for dir; do
        [ -d "$dir" ] || command mkdir "$dir"
    done
}
export -f mkdir
sh < file

UNIX: loop through a list of URLs and save using wget

I am trying to download many files and can do so in a long-winded manner using Unix, but how can I do it using a loop? I have many tables like CA30 and CA1-3 to download. Can I put the table names in a list, e.g. list("CA30", "CA1-3"), and have a loop go through the list?
#!/bin/bash
# get the CA30 files and put into folder for CA30
sudo wget -PO "https://www.bea.gov/regional/zip/CA30.zip"
sudo mkdir -p in/CA30
sudo unzip O/CA30.zip -d in/CA30
# get the CA1-3 files and put into folder for CA1-3
sudo wget -PO "https://www.bea.gov/regional/zip/CA1-3.zip"
sudo mkdir -p in/CA1-3
sudo unzip O/CA1-3.zip -d in/CA1-3
#!/bin/bash
for base in CA30 CA1-3; do
    # get the $base files and put them into the folder for $base
    wget -P O "https://www.bea.gov/regional/zip/${base}.zip"
    mkdir -p "in/${base}"
    unzip "O/${base}.zip" -d "in/${base}"
done
I have removed sudo - there's no reason to perform unprivileged operations with superuser privileges. If you can write into a particular folder, change the working directory.
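If the list of table names grows, it could also be read from a file instead of being hard-coded. A sketch, where tables.txt is a hypothetical file with one table name per line:
#!/bin/bash
while read -r base; do
    wget -P O "https://www.bea.gov/regional/zip/${base}.zip"
    mkdir -p "in/${base}"
    unzip "O/${base}.zip" -d "in/${base}"
done < tables.txt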

Wget Output in Recursive mode

I am using wget -r to download 3 .zip files from a specified webpage. Here is what I have so far:
wget -r -nd -l1 -A.zip http://www.website.com/example
Right now, the zip files all begin with abc_*.zip, where * seems to be a random string. I want the first downloaded file to be called xyz_1.zip, the second xyz_2.zip, and the third xyz_3.zip.
Is this possible with wget?
Many thanks!
I don't think it's possible with wget alone. After downloading you could use some simple shell scripting to rename the files, like:
i=1; for f in abc_*.zip; do mv "$f" "xyz_$i.zip"; i=$(($i+1)); done
Try to get a listing first and then download each file separately.
n=1
wget -nv -l1 -r --spider http://www.website.com/example 2>&1 | \
    egrep -io 'http://.*\.zip' | \
    while read -r url; do
        # Keep the original name prefix (the part before the last _) and number the files
        wget -nd -nv -O "$(echo "$url" | sed 's%^.*/\(.*\)_.*$%\1%')_$n.zip" "$url"
        n=$((n + 1))
    done
I don't think there is a way you can do it within a single wget command.
wget does have a -O option which you can use to tell it which file to output to, but it won't work in your case because multiple files will get concatenated together.
You will have to write a script which renames the files from abc_*.zip to xyz_*.zip after wget has completed.
Alternatively, invoke wget for one zip file at a time and use the -O option.
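If the individual URLs are known up front, that last approach might look roughly like this (zip_urls.txt is a hypothetical list with one URL per line):
n=1
while read -r url; do
    wget -O "xyz_${n}.zip" "$url"
    n=$((n + 1))
done < zip_urls.txt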
