abyss-pe: variables to assemble multiple genomes with one command - linux

How do I rewrite the following to properly replace the variable with my genomeID? (I have it working with this method in Spades and Masurca assemblers, so it's something about Abyss that doesn't like this approach, and I need a workaround.)
I am trying to run abyss on a cluster server but am running into trouble with how abyss-pe is reading my variable input:
my submit file loads a script for each genome listed in a .txt file
my script writes in the genome name throughout the script
the abyss assembly fumbles the variable replacement
Input.sub:
queue genomeID from genomelisttest.txt
Input.sh:
#!/bin/bash
genomeID=$1
cp /mnt/gluster/harrow2/trim_output/${genomeID}_trim.tar.gz ./
tar -xzf ${genomeID}_trim.tar.gz
rm ${genomeID}_trim.tar.gz
for k in `seq 86 10 126`; do
mkdir k$k
abyss-pe -C k$k name=${genomeID} k=$k lib='pe1 pe2' pe1='../${genomeID}_trim/${genomeID}_L1_1.fq.gz ../${genomeID}_trim/${genomeID}_L1_2.fq.gz' pe2='../${genomeID}_trim/${genomeID}_L2_1.fq.gz ../${genomeID}_trim/${genomeID}_L2_2.fq.gz'
done
Error that I get:
`../enome_trim/enome_L1_1.fq.gz': No such file or directory
This is where "enome" is supposed to replace with a five digit genomeID, which happens properly in the earlier part of the script up to this point, where abyss comes in.

pe1='../'"$genomeID"'_trim/'"$genomeID"'_L1_1.fq.gz ...'
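I added a single quote before and after each variable. That closes the single-quoted string, lets the shell expand the double-quoted variable, and then reopens the single-quoted string.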
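To see why this helps: inside single quotes the shell leaves ${genomeID} untouched, so abyss-pe receives that literal text rather than your ID, while inside double quotes the shell substitutes the value first. A minimal sketch, using a made-up ID of 12345 (in Input.sh the value comes from $1):

```shell
#!/bin/bash
# Hypothetical ID for illustration; in Input.sh this comes from $1.
genomeID=12345

# Single quotes: the shell passes the text through unexpanded.
single='../${genomeID}_trim/${genomeID}_L1_1.fq.gz'

# Double quotes: the shell substitutes the variable before the command sees it.
double="../${genomeID}_trim/${genomeID}_L1_1.fq.gz"

printf '%s\n' "$single" "$double"
```

Applied to the loop, an equivalent and arguably simpler form is to put the whole value in double quotes, e.g. pe1="../${genomeID}_trim/${genomeID}_L1_1.fq.gz ../${genomeID}_trim/${genomeID}_L1_2.fq.gz" (and likewise for pe2).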

Related

Shell script Fetching data from 5 different directories

I'm trying to run a shell script to get data from multiple directories.
My target (targetDir) has 5 directories. So the program, when executed, should search data from these 5 different directories, but when I execute it, it treats all the 5 folders same line. Any advice?
targetDir="snavis_bub snavis_bub2 snavis_bub3 snavis_hdw snavis_ldw"
datadir=/opt/pkg/home/tools/zform/marnel/$targetDir/of_inspect
Upon execute:
./orsInspect.sh: line 60:
cd: /opt/pkg/home/tools/zform/marnel/snavis_bub,snavis_bub2,snavis_bub3,snavis_hdw,snavis_ldw/oref_inspect: No such file or directory
There are many things you can do. For example, you can use an array and a for loop, and perform a task in each iteration of the loop:
#!/bin/bash
declare -a targetDirs=("snavis_bub" "snavis_bub2" "snavis_bub3" "snavis_hdw" "snavis_ldw")
for the_dir in "${targetDirs[@]}" ;do
datadir="/opt/pkg/home/tools/zform/marnel/${the_dir}/of_inspect"
echo "$datadir"
# ... do something for each datadir
done
example output (just echoing):
/opt/pkg/home/tools/zform/marnel/snavis_bub/of_inspect
/opt/pkg/home/tools/zform/marnel/snavis_bub2/of_inspect
/opt/pkg/home/tools/zform/marnel/snavis_bub3/of_inspect
/opt/pkg/home/tools/zform/marnel/snavis_hdw/of_inspect
/opt/pkg/home/tools/zform/marnel/snavis_ldw/of_inspect

Iterate through files in a directory, create output files, linux

I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.
(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:
Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)
Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)
Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result in a new file that has the same name as the initial file, but with ".txt" at the end.
Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.
for sequence in ./sequences/{.,}*;
do
echo "$sequence";
blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt
done
When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".
If I can provide any further information to clarify anything, please let me know.
If you're only interested in *.fa files I would limit your input to only those matching files like this:
for sequence in sequences/*.fa;
do
I can propose the following improvements:
for fasta_file in ./sequences/*.fa # ";" is not necessary if you already have a new line for your "do"
do
# ${variable%something} is the part of $variable
# before the string "something"
# basename path/to/file is the name of the file
# without the full path
# $(some command) allows you to use the result of the command as a string
# Combining the above, we can form a string based on our fasta file
# This string can be useful to name stuff in a clean manner later
sequence_name=$(basename ${fasta_file%.fa})
echo ${sequence_name}
# Create a directory for the results for this sequence
# -p option avoids a failure in case the directory already exists
mkdir -p ${sequence_name}
# Define the name of the file for the results
# (including our previously created directory in its path)
blast_results=${sequence_name}/${sequence_name}_blast.txt
blastp -query ${fasta_file} -db database.faa \
-out ${blast_results} \
-evalue 1e-10 -outfmt 7
# Define a file name for the top hits
top_hits=${sequence_name}/${sequence_name}_top_hits.txt
# alternatively, using "%"
#top_hits=${blast_results%_blast.txt}_top_hits.txt
# No need to cat: awk can take a file as argument
awk '/hits found/{getline;print}' ${blast_results} \
| grep -v "#" > ${top_hits}
done
I made more intermediate variables, with (hopefully) meaningful names.
I used \ to escape line ends and allow putting commands in several lines.
I hope this improves code readability.
I haven't tested. There may be typos.
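To make the naming steps concrete, here is a small sketch of what the expansions produce for one made-up input file (sample1.fa is only an illustration, not one of your files):

```shell
#!/bin/bash
# Hypothetical input path for illustration.
fasta_file=./sequences/sample1.fa

# ${fasta_file%.fa} strips the shortest trailing match of ".fa".
without_ext=${fasta_file%.fa}                 # ./sequences/sample1

# basename drops the leading directories.
sequence_name=$(basename "$without_ext")      # sample1

# Paths built from the clean name, as in the loop above.
blast_results=${sequence_name}/${sequence_name}_blast.txt
top_hits=${blast_results%_blast.txt}_top_hits.txt

printf '%s\n' "$without_ext" "$sequence_name" "$blast_results" "$top_hits"
```

Because the output names are built from sequence_name rather than from the full input path, the results no longer land in the sequences directory, so a later run of the loop does not pick them up as input.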
You should be using *.fa if you only want files with a .fa ending. Additionally, if you want to redirect your output to new folders you need to create those directories somewhere using
mkdir 'folder_name'
then you need to redirect your -o outputs to those files, something like this
'command' -o /path/to/output/folder
To help you test this script out, you can run each line one by one to test them. You need to make sure each line works by itself before combining.
One last thing: be careful with your use of semicolons. It should look something like this:
for filename in *.fa; do 'command'; done

Linux - Recursively list all the zip files and keep only latest modified 5 files and delete the remaining

On the command line, how can we recursively find all the zip files in a directory and its subdirectories, keep only the 5 most recently modified files, and delete the rest?
The files paths would be something like below:
basedirectory/2015/12/18/abc.zip
basedirectory/2015/12/18/def.zip
basedirectory/2015/12/18/ghi.zip
basedirectory/2015/12/18/jkl.zip
basedirectory/2015/12/08/mno.zip
basedirectory/2015/12/08/pqr.zip
basedirectory/2015/12/08/stu.zip
basedirectory/2015/12/07/stu.zip
I have a way, but it involves several (easy) steps. There are probably more elegant ways of doing this, but here is how I know how. They come from a couple of sources, which I list at the end of my answer. You will use the already installed utilities cd, find, ls, rm and head. It will involve creating and executing two bash scripts.
Open a terminal and change into your base directory with cd ~/basedirectory
This sets up the following commands. It is important that you stay in this directory for the rest of the commands.
Type find "$(pwd)" -name '*.zip' > find_zip
This creates a list of all the zip files with the full path relative to the directory you changed in to. Instead of printing them to the screen, it writes them to a find_zip file in the directory you changed into.
Type cp find_zip remove_old_zip
This creates a second, duplicate file that you will later use to delete the old files.
Open the find_zip file in your favorite text editor. If you're not used to using any, you can use gedit. If you don't have it, install it with sudo apt-get update && sudo apt-get install gedit
Do a search and replace as follows (in gedit): search for \n , and replace it with " \\\n"
This places the list of folders within quotes. the first backslash places a "\" at the end of each line, which means continue reading the next line and execute all the code together. The \n preserves the line endings. The last " puts a quote at the beginning of each line. You need the quotes to escape special characters like ' and ( that may be in your file name.
Create 2 new lines at the top of the file and type:
#!/bin/bash
ls -lt \
The first line turns your file into a bash script. The second line will list all the files you found with the find command and order them by date.
Create a new line at the bottom of your file and type: | head -5. Save and exit the file.
| is a "pipe" that will take the output of the ordered file list that ls creates and feed it into the head command. The head command will list just the 5 most recently modified files and display or print them on your screen.
As a result of steps 5-7, your file should go from looking like this:
basedirectory/2015/12/18/abc.zip
basedirectory/2015/12/18/def.zip
basedirectory/2015/12/18/ghi.zip
basedirectory/2015/12/18/jkl.zip
basedirectory/2015/12/08/mno.zip
basedirectory/2015/12/08/pqr.zip
basedirectory/2015/12/08/stu.zip
basedirectory/2015/12/07/stu.zip
to this:
#!/bin/bash
ls -lt \
basedirectory/2015/12/18/abc.zip \
basedirectory/2015/12/18/def.zip \
basedirectory/2015/12/18/ghi.zip \
basedirectory/2015/12/18/jkl.zip \
basedirectory/2015/12/08/mno.zip \
basedirectory/2015/12/08/pqr.zip \
basedirectory/2015/12/08/stu.zip \
basedirectory/2015/12/07/stu.zip \
| head -5
Type bash find_zip in the terminal. With your newfound list of the 5 most recent files, open up the remove_old_zip file created in step 3.
You will also be turning this file into a bash script, but it will remove all but the five newest files.
Delete the lines in the remove_old_zip file containing the 5 files you want to keep.
Do a search and replace as follows (in gedit): search for \n , and replace it with " \\\n"
This is the same as step 5.
Create 2 new lines at the top of the file and type:
#!/bin/bash
rm \
This is similar to step 6 except that rm will delete the files still listed.
Remove the final \ on the final line of the remove_old_zip file. Save and exit.
Type bash remove_old_zip.
Type rm find_zip remove_old_zip.
This removes the two scripts, which are now useless since the files have been deleted.
sources:
How can I list (ls) the 5 last modified files in a directory?
http://www.geekinterview.com/talk/758-how-to-continue-to-next-line.html
List files recursively in Linux CLI with path relative to the current directory
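For comparison, the same list, sort, keep-five, delete-the-rest idea can be sketched as a single pipeline. This is not the method described above, only an alternative under some assumptions: GNU find (for -printf), file names without embedded newlines, and a helper name (prune_old_zips) that I made up for illustration.

```shell
#!/bin/bash
# Sketch only: assumes GNU find/sort/tail and file names without newlines.
# prune_old_zips is a hypothetical helper name, not from the answer above.
prune_old_zips() {
    local base=$1 keep=${2:-5}
    # Print "mtime<TAB>path" for every .zip, sort newest first,
    # skip the first $keep lines, then delete whatever is left.
    find "$base" -type f -name '*.zip' -printf '%T@\t%p\n' \
        | sort -rn \
        | tail -n +"$((keep + 1))" \
        | cut -f2- \
        | while IFS= read -r path; do
              rm -- "$path"
          done
}
```

You would call it as prune_old_zips ~/basedirectory 5. Before running it on real data, it is worth replacing rm with echo once to check which files would be deleted.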

What does this bash script command mean (sed - e)?

I'm totally new to bash scripting, but I want to solve this problem.
the command is:
objfil=`echo ${srcfil} | sed -e "s,c$,o,"`
The idea of the bash script is to go through the source files and check whether there is a corresponding object file in the OBJ directory. If there is, the rest of the program runs smoothly; if not, the iteration terminates, skips the current source file, and moves on to the next one. It works with .c files but not with headers, since the object filenames depend on the .c files. I want to write this command so that it checks the object files for .h files as well as .c files, without skipping them. I know I have to do something else too, but I need to understand exactly what this line does before I can move on. Thanks. (Sorry for my English.)
UPDATE:
if test -r ${curOBJdir}/${objfil}
then
cp -v ${srcfil} ./SAVEDSRC/${srcfil}
fdone="NO"
linenums=ALL
else
fdone="YES"
err="${curOBJdir}/${objfil} is missing - ${srcfil} skipped)"
echo ${err}
echo ${err} >>${log}
fi
while test ${fdone} == "NO"
do
#rest of code ...
Here is the rest of the program. I tried commenting out the "test" part to skip the comparison, because I only want my script to work on .h files without checking whether, e.g., abc.h has an abc.o file. (The object file generation is needed because at the end of the script there is a comparison between the hexdumps of the original and modified object files.) The whole script is for replacing basic types with typedefs, for example int with sint32_t.
This particular command replaces a c at the very end of the line with o:
srcfil=abcd.c
objfil=`echo ${srcfil} | sed -e "s,c$,o,"`
echo $objfil
Output:
abcd.o
P.S. It uses a different match/replace separator: default is / but it uses ,.
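If you also want header names handled, one option is to let the pattern match a trailing c or h. The sketch below shows that with sed and, as an alternative, with plain parameter expansion. Whether abc.h should really map to abc.o depends on your project, so treat that mapping as an assumption; obj_name is a helper name I made up.

```shell
#!/bin/bash
# Sketch: obj_name is a hypothetical helper, not part of the original script.
# It maps foo.c and foo.h to foo.o using parameter expansion instead of sed.
obj_name() {
    local srcfil=$1
    case $srcfil in
        *.c) printf '%s\n' "${srcfil%.c}.o" ;;
        *.h) printf '%s\n' "${srcfil%.h}.o" ;;
        *)   printf '%s\n' "$srcfil" ;;   # anything else is left unchanged
    esac
}

# The sed form from the question, extended to match a trailing c or h.
srcfil=abcd.h
objfil=$(echo "${srcfil}" | sed -e 's,[ch]$,o,')
echo "$objfil"
```

The case version only rewrites names that actually end in .c or .h, whereas the sed version rewrites any name ending in the letter c or h, with or without a dot in front.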

How do you format output string in bash script for input by another script?

I need to unzip a bunch of student assignment (jar) files so that I can use a script to submit the contents to the Moss (Stanford) plagiarism detection server. I did the same thing in Java, which was trivial, but I'm trying to re-implement it as a bash script.
I am trying to do the following:
Get a list of student names (each student has a directory).
In each student directory, sub-directories exist numbered from 1 to the
latest submission. I need to get the directory with the highest
number.
Inside of each of those submission directories contains a
jar file that I need. I copy each jar into a temp directory with the
same name as the student and unzip it.
I need that temp directory listing formatted as a string in the form
/tempDir/studentName1/*.languageExt /tempDir/studentName2/*.languageExt
The student directory has the basic structure:
Student_Root_Directory:
Student1
Student2
Student1
Sub-Directories: 1 2 3 4 5
1: student1.jar
2: student1.jar
...
Student2
Sub-Directories: 1 2 3
1. student2.jar
...
To do the first 3 steps above I did:
#!/bin/bash
# Extract all jar files into a temp directory called /home/moss/tempJarFiles/studentName
# $1 is the command line argument that contains the path to the institution submission dir.
# $2 is the language extension: .c, .cpp, .java, .py
students=`ls $1`
student_dir=$1
languageExt=$2
mossDir="/home/moss"
tempDir="/home/moss/tempJarStorage"
for student in $students
do
latestSubmissionDir=`ls -t $student_dir/$student | head -1`
for jarDir in $latestSubmissionDir
do
mkdir $tempDir/$student
cp $student_dir/$student/$jarDir/*.jar $tempDir/$student
unzip -d $tempDir/$student/ -o -j $tempDir/$student/$student.jar *.$languageExt
rm $tempDir/$student/$student.jar
done
done
...which results in a number of student directories being created in a temp directory that contains only the unzipped contents for the student submissions.
I need the ls output of the new temp directories formatted as a string that contains:
/tempDir/studentName1/*.languageExt /tempDir/studentName2/*.languageExt
I have tried variations on
find "$tempDir" -iname "*.$languageExt" -printf "%p/*.$languageExt"
using iname and not - but I either have output that contains extra directory information such as $tempDir/*.languageExt (when I just need the subdirectories $tempDir/$studentName/*.languageExt) or I have output where the path for every source file is also listed such as:
$tempDir/$studentName/studentNameA.java
$tempDir/$studentName/studentNameB.java
when I only need
$tempDir/$studentName/*.java
I think this should be really easy and I'm just over thinking it. Any hints for improving the script also appreciated.
Here's a revised version of the script that may work:
#!/bin/bash
# Extract all jar files into a temp directory called /home/moss/tempJarFiles/studentName
# $1 is the command line argument that contains the path to the institution submission dir.
# $2 is the language extension: c, cpp, java, py
students_dir=$1
languageExt=$2
studentPathsT=( "$students_dir"/*/ )
mossDir='/home/moss'
tempDir='/home/moss/tempJarStorage'
for studentPathT in "${studentPathsT[@]}"; do
student=$(basename "$studentPathT")
mkdir "$tempDir/$student"
submissionDirsT=( "$studentPathT"*/ )
latestSubmissionDirT=${submissionDirsT[${#submissionDirsT[@]}-1]}
cp "$latestSubmissionDirT"*.jar "$tempDir/$student/"
unzip -d "$tempDir/$student/" -o -j "$tempDir/$student/*.jar" "*.$languageExt"
rm "$tempDir/$student"/*.jar
done
# Note that at this point `"$tempDir"/*/*.$languageExt` would expand
# to all extracted submission files, across all students.
# Finally, output each student's extracted files as an unexpanded glob à la
# /{tempDir}/{studentName1}/*.{languageExt}
for pT in "$tempDir"/*/; do
echo "$pT*.$languageExt"
# Note: If there is a chance that your filenames contain
# embedded newlines (rare in practice) using `echo` won't work properly
# as @Charles Duffy points out.
# If that is a concern, use
# printf '%s\0' "$pT*.$languageExt"
# and process the output with a utility that can process NUL characters
# as separators, such as `xargs -0`.
done
It avoids using ls and only uses pathname expansion and array variables so as to properly deal with paths that contain embedded spaces and other shell metacharacters.
The suffix ...T in variable names indicates that a particular path or array of paths is *T*erminated, i.e., that it ends in a /.
The assumption is that the numbered subdirectories do not go beyond 9, as the implicit lexical sorting of pathname expansion is relied upon; if the numbers go higher, explicit numerical sorting must be applied.
Note that the globs (pathname patterns) passed to unzip are intentionally double-quoted, as they should be interpreted by unzip, not the shell.
Note that, based on your original code, I've assumed that $languageExt does NOT start with . (e.g., cpp rather than .cpp), despite what your comment says.
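If the submission numbers can go past 9, the explicit numerical sorting mentioned above could look roughly like this. It is only a sketch: latest_numbered_dir is a helper name I made up, and it assumes the submission subdirectories are named with plain integers.

```shell
#!/bin/bash
# Sketch: latest_numbered_dir is a hypothetical helper. It assumes the
# subdirectories of the given directory are named with plain integers.
latest_numbered_dir() {
    local parent=${1%/} dir name latest='' max=-1
    for dir in "$parent"/*/; do
        name=$(basename "$dir")
        # Skip anything that is not purely digits (including an unmatched glob).
        case $name in
            ''|*[!0-9]*) continue ;;
        esac
        # 10# forces base 10 so names like 08 are not read as octal.
        if (( 10#$name > max )); then
            max=$((10#$name))
            latest=$dir
        fi
    done
    # Fail if no numbered subdirectory was found.
    [ -n "$latest" ] || return 1
    printf '%s\n' "$latest"
}
```

In the loop you would then write latestSubmissionDirT=$(latest_numbered_dir "$studentPathT"). The result ends in a /, so it fits the ...T naming convention used in the script.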
