Creating a script from a complex bash command - linux

I need to run the following command in a folder containing a lot of gzipped files.
perl myscript.pl -d <path> -f <protocol> "<startdate time>"
"<enddate time>" -o "0:00 23:59" -v -g -b <outputfilename>
My problem is that the command does not take gzipped files as input, so I need to unzip all those gzipped files on the fly and run the command on the unzipped files. The gzipped files are in another folder where I am not allowed to unzip them. I want a shell script that takes the path of the remote gzipped files and stores the unzipped copies under path (which is also going to be an argument to the script), and then runs the above command.
N.B: The "protocol", "startdate time", "enddate time" and "outputfilename" don't have to be arguments for now. I will just put them directly in the script so that it is less complex.

You can do:
for fname in path/to/*.gz; do gunzip -c "$fname" | perl myscript.pl ; done
Expanded:
for fname in path/to/*.gz; do
gunzip -c "$fname" | perl myscript.pl
done
And to make it accept filenames with spaces:
old_IFS=$IFS
IFS=$'\n'
for fname in path/to/*.gz; do
gunzip -c "$fname" | perl myscript.pl -f <protocol> "<startdate time>" \
"<enddate time>" -o "0:00 23:59" -v -g -b <outputfilename>
done
IFS=$old_IFS
This way, you make the script read standard input, which will contain the file content, without having to use temporary files.
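The loop above can also be wrapped in a small function. This is only a sketch: `stream_each_gz` is a hypothetical name, and it assumes, as the loop does, that the command being run reads its data from standard input. Because the result of a glob is not word-split, quoting "$fname" is enough to handle file names with spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command once per .gz file in a directory,
# feeding it the decompressed content on standard input.
stream_each_gz() {
    local dir=$1
    shift
    local fname
    for fname in "$dir"/*.gz; do
        # An unmatched glob is left as a literal pattern, so skip non-files.
        [ -f "$fname" ] || continue
        gunzip -c "$fname" | "$@"
    done
}

# Usage (wc -l stands in for the real command):
#   stream_each_gz path/to wc -l
```

Nothing is written to disk here, so the folder with the gzipped files is never modified.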
EDIT: Here's a wrapper script that solves the problem as initially suggested in the question:
`myscriptwrapper`:
#!/bin/bash
gzip_path="$1"
temp_path="$2"
# loop through the files in gzip_path
for fname in "$gzip_path"/*.gz; do
    # name of the unzipped copy: the original name without the .gz suffix
    basename=$(basename "$fname" .gz)
    # write the unzipped copy into the target dir
    gunzip -c "$fname" > "$temp_path/$basename"
done
# finally, call our script
perl myscript.pl -d "$temp_path" -f <protocol> "<startdate time>" "<enddate time>" -o "0:00 23:59" -v -g -b <outputfilename>
EDIT 2: Using tar.gz files:
`myscriptwrapper`:
#!/bin/bash
# gzip_path should be an absolute path, because the script changes directory below
gzip_path="$1"
temp_path="$2"
cd "$temp_path" || exit 1
# loop through the archives in gzip_path
for fname in "$gzip_path"/*.tar.gz; do
    tar -xzf "$fname"
done
# finally, call our script
perl myscript.pl -d "$temp_path" -f <protocol> "<startdate time>" "<enddate time>" -o "0:00 23:59" -v -g -b <outputfilename>

Related

Create Directory, download file and execute command from list of URL

I am working on a Red Hat Linux server. My end goal is to run CRB-BLAST on multiple fasta files and have the results from those in separate directories.
My approach is to download the fasta files using wget then run the CRB-BLAST. I have multiple files and would like to be able to download them each to their own directory (the name perhaps should come from the URL list files), then run the CRB-BLAST.
Example URLs:
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_CB_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_13_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_37_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_123_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_195_chr.v0.1.liftover.CDS.fasta.gz
http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_31_chr.v0.1.liftover.CDS.fasta.gz
Ideally, the file name determines the directory name, for example, TC_3370/.
I think there might be a solution with cat URL.txt | mkdir | cd | wget | crb-blast
Currently I just run the commands in line:
mkdir TC_3370
cd TC_3370/
wget url
http://assemblies/Genomes/final_assemblies/10x_meta_assemblies_v1.0/TC_3370_chr.v1.0.maker.CDS.fasta.gz
crb-blast -q TC_3370_chr.v1.0.maker.CDS.fasta.gz -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
Try this Shellcheck-clean program:
#! /bin/bash -p
while read -r url; do
file=${url##*/}
dir=${file%%_chr.*}
mkdir -v -- "$dir"
(
cd "./$dir" || exit 1
wget -- "$url"
crb-blast -q "$file" -t TCV2_annot_cds.fna -e 1e-20 -h 4 -o rbbh_TC
)
done <URL.txt
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of ${url##*/} etc.
The subshell, ( ... ), is used to ensure that the cd doesn't affect the main program.
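To see what the two expansions and the subshell do, here is a short sketch that runs them on the first example URL from the question. The variable `before` is only there for illustration.

```shell
#!/usr/bin/env bash
url='http://assemblies/Genomes/final_assemblies/10x_assemblies_v0.1/TC_3370_chr.v0.1.liftover.CDS.fasta.gz'

# ${var##pattern} removes the longest match of pattern from the front,
# so everything up to and including the last "/" is dropped.
file=${url##*/}

# ${var%%pattern} removes the longest match of pattern from the end,
# so everything from "_chr." onwards is dropped.
dir=${file%%_chr.*}

printf '%s\n' "$file" "$dir"

# A cd inside ( ... ) only changes the directory of the subshell.
before=$PWD
( cd / || exit 1 )
printf '%s\n' "still in: $PWD"
```

The first two lines printed are TC_3370_chr.v0.1.liftover.CDS.fasta.gz and TC_3370.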
Another implementation
#!/bin/sh
# Read lines as url as long as it can
while read -r url
do
# Get file name by stripping-out anything up to the last / from the url
file_name=${url##*/}
# Get the destination dir name by stripping everything from the first _chr
dest_dir=${file_name%%_chr*}
# Compose the wget output path
fasta_path="$dest_dir/$file_name"
if
# Successfully created the destination directory AND
mkdir -p -- "$dest_dir" &&
# Successfully downloaded the file
wget --output-document="$fasta_path" --quiet -- "$url"
then
# Process the fasta file into fna
fna_path="$dest_dir/TCV2_annot_cds.fna"
crb-blast -q "$fasta_path" -t "$fna_path" -e 1e-20 -h 4 -o rbbh_TC
else
# Cleanup remove destination directory if any of mkdir or wget failed
rm -fr -- "$dest_dir"
fi
# reading from the URL.txt file for the whole while loop
done < URL.txt
Downloading files from a list is a job for wget's -i file option. If you have a file named, say, urls.txt with one URL per line, you can simply run:
wget -i urls.txt
Note that this puts all the files in the current working directory, so if you want them in separate directories, you will need to move them after wget finishes.
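The moving step could look like the sketch below. It assumes the downloaded files follow the naming from the question (a sample name followed by _chr.), and `sort_into_dirs` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move each *_chr.* file in the current directory into
# a directory named after the part of the file name before "_chr.".
sort_into_dirs() {
    local file dir
    for file in *_chr.*; do
        # Skip the literal pattern if nothing matched.
        [ -f "$file" ] || continue
        dir=${file%%_chr.*}
        mkdir -p -- "$dir" && mv -- "$file" "$dir/"
    done
}

# Usage, after the download has finished:
#   wget -i urls.txt && sort_into_dirs
```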

How to copy latest file from sftp to local directory using shell script?

I have multiple files on an SFTP server, from which I need to copy only the latest one. I have written some sample code, but in it I am passing the file name. What logic do I need to add so that it identifies the latest file on the SFTP server and copies it to my local directory?
On the SFTP server:
my_data_20220428.csv
my_data_20220504.csv
my_data_20220501.csv
my_data_20220429.csv
The code I am running:
datadir="/script/data"
cd ${datadir}
rm -f ${datadir}/my_data*.csv
rm -f ${logfile}
lftp<<END_SCRIPT
open sftp://${sftphost}
user ${sftpuser} ${sftppassword}
cd ${sftpfolder}
lcd $datadir
mget my_data_20220504.csv
bye
END_SCRIPT
What changes do I need to make so that it automatically picks the latest file from the server without hardcoding the filename?
You can try this script. It is mainly copied from your sample, so it expects the variables to have been set already.
#!/usr/bin/env bash
datadir="/script/data"
rm -f "$datadir"/my_data*.csv
rm -f "$logfile"
new=$(echo "ls -halt $sftpfolder" | lftp -u "${sftpuser}","${sftppassword}" sftp://"${sftphost}" | sed -n '/my_data/s/.* \(.*\)/\1/p' | head -1)
lftp -u "${sftpuser}","${sftppassword}" sftp://"${sftphost}" << --EOF--
cd "$sftpfolder"
lcd "$datadir"
get "$new"
bye
--EOF--
You could try:
latest=$(lftp "sftp://$sftpuser:$sftppassword@$sftphost" \
-e "cd $sftpfolder; glob rels -1t *.csv; bye" |
head -1)
lftp "sftp://$sftpuser:$sftppassword@$sftphost" \
-e "cd $sftpfolder; mget $latest; bye"
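Since the file names in the question embed the date as YYYYMMDD, the latest file can also be picked from a plain list of names, without relying on modification times. The following is a sketch under that assumption. `latest_by_name` is a hypothetical helper, and how the name listing is fetched from the server is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read file names on stdin, one per line, and print the
# my_data_YYYYMMDD.csv name with the latest date. With a fixed-width date,
# sorting the names as plain text also sorts them by date.
latest_by_name() {
    grep -E '^my_data_[0-9]{8}\.csv$' | LC_ALL=C sort | tail -n 1
}

# The four names from the question:
printf '%s\n' my_data_20220428.csv my_data_20220504.csv \
              my_data_20220501.csv my_data_20220429.csv | latest_by_name
```

For the listing in the question this prints my_data_20220504.csv.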

Commands work on terminal but not in shell script

The following commands work in my terminal but not in my shell script. I later found out that my terminal shell was /bin/tcsh. Can somebody tell me what changes I need to make for /bin/sh? Here are the commands I need to change:
cp source_dir/*/dir1/*.xml destination_dir/
Error in sh-> cp: cannot stat `source_dir/*/dir1/*.xml': No such file or directory
sed -i "s+${initial_name}+${final_name}+" $file_name
This one does not complain, but it does not work either.
I am adding an example for testing. The code is meant to rename the xml files and also the names inside them. For example:
The file name crr.ya.na.aa.xml should be changed to aa.xml
The same name inside crr.ya.na.aa.xml should also be changed from crr.ya.na.aa to aa
Here is the code:
#!/bin/sh
# Create dir structure for testing
rm -rf audience
mkdir audience
mkdir audience/dir1 audience/dir2 audience/dir3
mkdir audience/dir1/ipxact audience/dir2/ipxact audience/dir3/ipxact
touch audience/dir1/ipxact/crr.ya.na.aa.xml
echo "<spirit:name>crr.ya.na.aa</spirit:name>" > audience/dir1/ipxact/crr.ya.na.aa.xml
touch audience/dir2/ipxact/crr.ya.na.bb.xml
echo "<spirit:name>crr.ya.na.bb</spirit:name>" > audience/dir2/ipxact/crr.ya.na.bb.xml
touch audience/dir3/ipxact/crr.ya.na.cc.xml
echo "<spirit:name>crr.ya.na.cc</spirit:name>" > audience/dir3/ipxact/crr.ya.na.cc.xml
# Create a dir for ipxact_drop files if it does not exist
mkdir -p ipxact_drop
rm -rf ipxact_drop/*
cp audience/*/ipxact/*.xml ipxact_drop/
ls ipxact_drop/ > ipxact_drop_files.log
cat ipxact_drop_files.log | \
awk '{ split($0,a,"."); print a[length(a)-1] "." a[length(a)] }' ipxact_drop_files.log > file_names.log
cat ipxact_drop_files.log | \
awk '{ split($0,a,"."); print "mv ipxact_drop/" $0 " ipxact_drop/" a[length(a)-1] "." a[length(a)] }' ipxact_drop_files.log > command.log
chmod +x command.log
./command.log
while read line
do
echo ipxact_drop/$line
initial_name=`grep -m 1 crr ipxact_drop/$line | sed -e 's/<spirit:name>//' | sed -e 's/<\/spirit:name>//' `
final_name="${line%.*}"
echo $initial_name
echo $final_name
sed -i "s+${initial_name}+${final_name}+" ipxact_drop/$line
done < file_names.log
echo " ***** SCRIPT RUN FINISHED *****"
Only the sed command at the end is not working.
I was reading some other posts and understood that xml files can have problems with scripts. Here is what has worked for me up to now.
To remove the cp error: replace #!/bin/sh -f with #!/bin/sh (in sh, the -f option turns off filename expansion, so the wildcards reached cp unexpanded).
To remove the sed error for the test input: replace sed -i ... with sed -i.back ...
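The effect of -f can be reproduced inside a running script with set -f, which turns off the same filename expansion. A small sketch, using a scratch directory and a file name (a.xml) made up for the demonstration:

```shell
#!/bin/sh
# Work in a scratch directory that contains exactly one matching file.
tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1
touch a.xml

set -f              # same effect as starting the script with "#!/bin/sh -f"
off=$(echo *.xml)   # the pattern is passed on unexpanded
set +f              # turn filename expansion back on
on=$(echo *.xml)    # the pattern expands to the matching file

printf 'with -f:    %s\nwithout -f: %s\n' "$off" "$on"
```

With -f in effect, cp receives the literal string source_dir/*/dir1/*.xml, which is exactly the name in the "cannot stat" error.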

Executing same command for several files in same repository in linux

I'd like to execute the following command for several files in same repository in linux:
../../../../../openSMILE-2.1.0/SMILExtract -C ../../../../../openSMILE-2.1.0/config/IS13_ComParE.conf -I inputfilename.wav -D outputfilename.csv
there are several files (named 1.wav, 2.wav, 3.wav) in the directory and if I execute
../../../../../openSMILE-2.1.0/SMILExtract -C ../../../../../openSMILE-2.1.0/config/IS13_ComParE.conf -nologfile 1 -noconsoleoutput 1 -I 1.wav -D 1.csv
it outputs 1.csv.
How can I create 1.csv, 2.csv, 3.csv, ... by executing just one single command in Linux? (Or do I have to make a .sh file?)
It's probably cleaner to put the following into a script, but you can type it directly into the bash command line as well:
#! /bin/bash
for file in *.wav ; do
prefix=${file%.wav} # Remove from the right.
../../../../../openSMILE-2.1.0/SMILExtract \
-C ../../../../../openSMILE-2.1.0/config/IS13_ComParE.conf \
-I "$file" -D "$prefix".csv
done

bash: process one file after the other

I wrote a script that should take the first tar file, execute script.sh on it, then take the second tar file, and so on.
This is what script.sh looks like:
tarball=(`ls -a | cut -d "." -f 1`)
mkdir ./$tarball
tar -zxvf $tarball.tar -C ./$tarball
I execute script.sh with the following command:
for tarball in ./*.tar; do bash script.sh; done
but the assignment of the variable tarball only takes the first file and processes it (after the code posted above there are some awk commands that write some output to a file).
How do I script it so that after the first tar file is processed, the second one is taken, and so on?
You can do the whole thing in your script, or put just the while loop in the script and pipe whatever list you want into it:
ls -1 | cut -d "." -f 1 |
while read tarball
do
mkdir "$tarball"
tar -zxvf "${tarball}.tar" -C ./"$tarball"
done
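A variant of the same loop that iterates over the .tar files directly, instead of parsing the output of ls, might look like this. It is a sketch with a hypothetical function name, `extract_each_tar`. It uses tar -xf rather than -zxvf on the assumption that the installed tar detects compression on its own when reading an archive, as GNU tar and bsdtar do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract every .tar file in a directory into a
# subdirectory named after the archive.
extract_each_tar() {
    local dir=${1:-.}
    local tarball name
    for tarball in "$dir"/*.tar; do
        # Skip the literal pattern if nothing matched.
        [ -f "$tarball" ] || continue
        name=${tarball##*/}     # drop the leading path
        name=${name%.tar}       # drop the .tar suffix
        mkdir -p -- "$dir/$name" &&
            tar -xf "$tarball" -C "$dir/$name"
    done
}

# Usage:
#   extract_each_tar .
```

Stripping only the .tar suffix keeps names that contain other dots intact, which cut -d "." -f 1 would truncate.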

Resources