Integrating several shell scripts into one script - linux

I would like to integrate a few short scripts into one script where I can update an argument for the input file from the command line. I am going through 22 files and counting lines where $5!="1".
Here is a sample head of the input file:
CHROM POS N_ALLELES N_CHR {FREQ}
2 45895 2 162 0.993827 0.00617284
2 45953 2 162 0.993827 0.00617284
2 264985 2 162 1 0
2 272051 2 162 0.944444 0.0555556
Currently, I have the following 3 short scripts:
1) count lines (saved as wcYRI.sh): $5!="1"{sum++}END{print sum}
2) apply linecount (saved as check-annos.sh): awk -f wcYRI.sh ~/folder$1/file$1
3) apply linecount for 22 files, sum the output:
for i in {1..22};
do sh check-annos.sh $i; done |
awk '{sum+=$1}END{print sum}'
It's relatively simple, but sometimes script 1 gets a little longer for data files that look like this:
Chr Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene AAChange.refGene LJB2_SIFT LJB2_PolyPhen2_HDIV LJB2_PP2_HDIV_Pred LJB2_PolyPhen2_HVAR LJB2_PolyPhen2_HVAR_Pred LJB2_LRT LJB2_LRT_Pred LJB2_MutationTaster LJB2_MutationTaster_Pred LJB_MutationAssessor LJB_MutationAssessor_Pred LJB2_FATHMM LJB2_GERP++ LJB2_PhyloP LJB2_SiPhy
16 101593 101593 C T exonic POLR3K nonsynonymous SNV POLR3K:NM_016310:exon2:c.G164A:p.G55E 0.000000 0.997 D 0.913 D 0.000000 D 0.999989 D 2.205 medium 0.99 5.3 2.477000 17.524
...and I am using an awk file like this (performing an array match) as input -f to script 2 above:
NR==FNR{
arr[$1$2];next
}
$1$2 in arr && $0~/exonic/&&/nonsynonymous SNV/{nonsyn++};
$1$2 in arr && $0~/exonic/&&/synonymous SNV/ && $0!~/nonsynonymous/{syn++}
END{
print nonsyn,"nonsyn YRI","\t",syn,"YRI syn"
}
My goal is to integrate this process a bit more so I don't need to go into script 2 and change the ~/folder$1/file$1 each time; I'd like to be able to pass ~/folder$1/file$1 as an argument on the command line. However, when I try to use something like this in a for-loop at the command line, it doesn't accept $1 the way it does when $1 is built into a separate script called by the for-do-done loop (as in script 3); i.e. script 3 will take script 2, but I can't just put the contents of script 2 directly into the for-loop as arguments.
I am actually not so concerned about having a separate AWK file to handle the line parsing. The main thing annoying me is that I am modifying script 2 for each folder/file set; I would like to drive this from the command line, so that when I tell the script ~/folder$1/file$1 it cycles through numbers 1-22, and I can save one universal script for this process, since I have many folder/file combinations to look at.
Any advice is appreciated for shortening the pipeline in general, but specifically the command line argument problem is bugging me a lot!

If I understand the problem correctly, I see two ways to handle it. If the path format is consistent (i.e. the number always occurs twice, in the same positions), you could make the script accept the parts of the path as two different parameters. The script would look like this:
#!/bin/bash
folderPrefix="$1"
filePrefix="$2"
for num in {1..22}; do
awk -f wcYRI.sh "$folderPrefix$num/$filePrefix$num"
done |
awk '{sum+=$1}END{print sum}'
... and then you'd run it with ./scriptname ~/folder file. Alternately, if you need to be able to define the folder/file path format more flexibly, you could do something like this:
#!/bin/bash
for num in {1..22}; do
eval "awk -f wcYRI.sh $1"
done |
awk '{sum+=$1}END{print sum}'
... and then run it with ./scriptname '~/folder$num/file$num'. Note that the single-quotes are needed here so that the $var references don't get expanded until eval forces them to be.
BTW, the file wcYRI.sh is an awk script, not a shell script, so I'd recommend changing its file extension to prevent confusion. Actually, the preferred way to do this (for both shell and awk scripts) is to add a shebang line as the first line in the script (see my examples above; for an awk script it would be #!/usr/bin/awk -f), then make the script executable, and then run it with just ./scriptname and let the shebang take care of specifying the interpreter (sh, bash, awk -f, whatever).
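For example, the line-count script could be given an awk shebang and saved as, say, wcYRI.awk (the file name here is just a suggestion):
#!/usr/bin/awk -f
# count lines where the 5th field is not "1"
$5!="1"{sum++}
END{print sum}
After chmod +x wcYRI.awk it can be run directly as ./wcYRI.awk ~/folder1/file1, or still be passed via awk -f as before.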

Related

BASH - Is there a way to save output of a command to a file with the source formatting?

So, to put it less confusingly:
I run a command which prints some formatted values in bash, e.g.:
NodeID (lot of whitespace) Heap_size (again) Time
And when I try to save the output with Name:~$ script > file.txt, the output is:
ESC[93mnode_s1aESC[0m^MESC[25C1.0g
Expected output:
node_s1a 1.0g ...
node_s2aaaaa 2.0g ...
Is there a way to save raw output with the formatting into a text file ?
You can use the printf command which is like the printf function in C or Java.
printf "%-20s%s" Name:~$ script >> test.txt
I'm assuming Name:~$ and script are variables because I've never seen them before.
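If the goal is simply aligned columns in the file, a minimal sketch with printf might look like this (the variable names are hypothetical stand-ins for whatever fields your command actually prints):
node="node_s1a"; heap="1.0g"; tm="12:30"
printf "%-20s%-12s%s\n" "$node" "$heap" "$tm" >> file.txt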

balancing the bash calculations

We have a tool for cutting adaptors, https://github.com/vsbuffalo/scythe/blob/master/README.md, and we want it to be run on all the files in the raw folder, writing the output for each file separately as OUT+filename.
Something is wrong with this script I wrote: it doesn't process each file separately, and the whole thing doesn't work properly. It just generates an empty file named OUT+files.
The expected operation looks like this:
take file1, use scythe on it, write output as OUTfile1
take file2 etc.
#!/bin/bash
FILES=/home/dave/raw/*
for f in $FILES
do
echo "Processing the $f file..."
/home/deve/scythe/scythe -a /home/dev/scythe/illumina_adapters.fa -o "OUT"+$f $f
done
Additionally, I noticed (testing for a single file) that the script uses only one core out of 130 available. Is there any way to improve it?
There is no string concatenation operator in shell. Use juxtaposition instead; it's "OUT$f", not "OUT"+$f.
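A hedged sketch of the corrected call (stripping the directory with ${f##*/} is an extra assumption here, so the output lands in the current directory as OUT<filename> instead of inheriting the full input path):
# "OUT${f##*/}" concatenates by juxtaposition and drops the leading path from $f
/home/deve/scythe/scythe -a /home/dev/scythe/illumina_adapters.fa -o "OUT${f##*/}" "$f"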

Prevent script running with same arguments twice

We are looking into building a logcheck script that will tail a given log file and email when the given arguments are found. I am having trouble accurately determining if another version of this script is running with at least one of the same arguments against the same file. Script can take the following:
logcheck -i <filename(s)> <searchCriterion> <optionalEmailAddresses>
I have tried to use ps aux with a series of grep, sed, and cut, but it always ends up being more code than the script itself and seldom works very efficiently. Is there an efficient way to tell if another version of this script is running with the same filename and search criteria? A few examples of input:
EX1 .\logcheck -i file1,file2,file3 "foo string 0123" email#address.com
EX2 .\logcheck -s file1 Hello,World,Foo
EX3 .\logcheck -i file3 foo email#address1.com,email#address2.com
In this case 3 should not run because 1 is already running with parameters file3 and foo.
There are many solutions to your problem; I would recommend creating a lock file with the following format:
arg1Ex1 PID#(Ex1)
arg2Ex1 PID#(Ex1)
arg3Ex1 PID#(Ex1)
arg4Ex1 PID#(Ex1)
arg1Ex2 PID#(Ex2)
arg2Ex2 PID#(Ex2)
arg3Ex2 PID#(Ex2)
arg4Ex2 PID#(Ex2)
when your script starts:
It will search in the file for all the arguments it has received (awk command or grep)
If one of the arguments is present in the list, fetch that process's PID (awk '{print $2}', for example) and check whether it is still running (ps). This double-checks for concurrency, and also covers the case where a previous process ended abnormally and left garbage behind in the file.
If the PID is still there, the script will not run
Else append the arguments to the lock file with the current process PID and run the script.
At the end of the execution, remove the lines that contain the arguments used by the script, or remove all lines with its PID.
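A minimal sketch of that approach in bash (the lock-file path, the kill -0 liveness check, and the sed cleanup are assumptions of this sketch, not part of your script):
#!/bin/bash
lockfile=/tmp/logcheck.lock        # assumed location
touch "$lockfile"
# refuse to start if any of our arguments is already held by a live process
for arg in "$@"; do
    while read -r lockedArg lockedPid; do
        if [ "$arg" = "$lockedArg" ] && kill -0 "$lockedPid" 2>/dev/null; then
            echo "already running with argument '$arg' (PID $lockedPid)" >&2
            exit 1
        fi
    done < "$lockfile"
done
# register our own arguments, one per line, with our PID
for arg in "$@"; do
    echo "$arg $$" >> "$lockfile"
done
# ... the actual logcheck work would go here ...
# clean up: drop every line that ends with our PID
sed -i "/ $$\$/d" "$lockfile"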

Using AWK and setting results to bash variables/arrays?

I have a file that replicates the results of the show processlist command from MySQL.
The file looks like this:
*************************** 1. row ***************************
Id: 1
User: system user
Host:
db: NULL
Command: Connect
Time: 1030455
State: Waiting for master to send event
Info: NULL
*************************** 2. row ***************************
Id: 2
User: system user
Host:
db: NULL
Command: Connect
Time: 1004
State: Has read all relay log; waiting for the slave
I/O thread to update it
Info: NULL
And it keeps going on for a few more times in the same structure.
I want to use AWK to get only these parameters: Time, ID, Command and State, and store every one of these parameters in a different variable or array so that I can later use / print them in my bash shell.
The problem is, I am pretty bad with AWK; I don't know how to both separate the parameters I want from the file and set them as bash variables or arrays.
Many thanks in advance for the help!
EDIT: Here is my code so far
echo "Enter age"
read age
cat data | awk 'BEGIN{ RS="row"
FS="\n"
OFS="\n"}
{ print $2,$7}
' | awk 'BEGIN{ RS="Id"}
{if ($4 > $age){print $2}}'
The file 'data' contains blocks like I have pasted above. The code should, if the 'age' entered is smaller than the Time parameter in the data file (which is $4 in my awk code), return the ID parameter, but it returns nothing.
If I remove the if statement and print $4 instead of $2, this is my output:
Enter age
1
1030455
1004
2144
2086
0
So I was thinking maybe that blank line is somehow messing up my AWK print? Is there a simple way to ignore that blank line while keeping my other data?
This is how you'd use awk to produce the values you want as a set of tab-separated fields on each line per "row" block from the input:
$ cat tst.awk
BEGIN {
RS="[*]+ [[:digit:]]+[]. row [*]+\n"
FS="\n"
OFS="\t"
}
NR>1 {
sub(/\n$/,"") # remove the trailing newline
gsub(/\n\s+/," ") # compress all multi-line fields into single lines
gsub(OFS," ") # ensure the only OFS in the output IS between fields
delete n2v
for (i=1; i<=NF; i++) {
name = gensub(/:.*/,"","",$i)
value = gensub(/^[^:]+:\s+/,"","",$i)
n2v[name] = value
}
if (n2v["Time"]+0 > age) { # force a numeric comparison
print n2v["Time"], n2v["Id"], n2v["Command"], n2v["State"]
}
}
$ awk -v age=2000 -f tst.awk file
1030455 1 Connect Waiting for master to send event
If the target age is already stored in a shell variable, just init the awk variable from the shell variable of the same name:
$ age="2000"
$ awk -v age="$age" -f tst.awk file
The above uses GNU awk for multi-char RS (which you already had), gensub(), \s, and delete array.
When you say "and store every one of these parameters into a different variable or array" it could mean one of several things so I'll leave that part up to you but you might be looking for something like:
arr=( $(awk '...') )
or
awk '...' |
while IFS="\t" read -r Time Id Command State
do
<do something with those 4 vars>
done
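If the shell route is really what you need, one hedged variant stores the values in bash arrays (the array names are illustrative; process substitution keeps the loop in the current shell so the arrays survive):
Times=(); Ids=(); Commands=(); States=()
while IFS=$'\t' read -r Time Id Command State
do
    Times+=("$Time"); Ids+=("$Id"); Commands+=("$Command"); States+=("$State")
done < <(awk -v age="$age" -f tst.awk file)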
but by far the most likely situation is that you don't want to use shell at all but instead just stay inside awk.
Remember - every time you write a loop in shell just to manipulate text you have the wrong approach. UNIX shell is an environment from which to call UNIX tools and the UNIX tool for general text manipulation is awk.
Until you edit your question to tell us more about your problem though, we can't guess what the right solution is from this point on.
At the first level you have your shell, which you use to run any other child process. It's impossible to modify the parent's environment from within a child process. When you run your bash script file (which has the +x right) it's spawned as a new process (a child). It can set its own environment, but when it ends its life you'll get back to the original (parent) environment.
You can set some variables in bash and export them to its environment; they'll be inherited by its children. However, it can't be done in the opposite direction (a parent can't inherit from its child).
If you wish to execute some commands from the script file in the current bash's context you can source the script file. source ./your_script.sh or . ./your_script.sh will do that for you.
If you need to run awk to filter some data for you and keep results in the bash you can do:
awk ... | read foo
Note that read is a shell builtin rather than an external process (check type read, help, help read, man bash to see this), but in bash each side of a pipeline runs in a subshell, so foo set this way won't survive once the pipeline finishes.
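One hedged workaround in bash is process substitution, which keeps read in the current shell so the variable survives:
read -r foo < <(awk '...')
echo "$foo"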
or:
foo=`awk ....`
There are many other constructions you can use. Whatever bash script you write, please compare your code with the Bash Pitfalls webpage.

"bad interpreter" error message when trying to run awk executable

I'm trying to make an awk file executable. I've written the script, and did chmod +x filename. Here is the code:
#!/bin/awk -v
'TOPNUM = $1
## pick1 - pick one random number out of y
## main routine
BEGIN {
## set seed
srand ()
## get a random number
select = 1 +int(rand() * TOPNUM)
# print pick
print select
}'
When I try and run the program and put in a variable for the TOPNUM:
pick1 50
I get the response:
-bash: /home/petersone/bin/pick1: /bin/awk: bad interpreter: No such file or directory
I'm sure that there's something simple that I'm messing up, but I simply cannot figure out what it is. How can I fix this?
From a command line, run this command:
which awk
This will print the path of AWK, which is likely /usr/bin/awk. Correct the first line and your script should work.
Also, your script shouldn't have the single-quote characters at the beginning and end. You can run AWK from the command line and pass in a script as a quoted string, or you can write a script in a file and use the #!/usr/bin/awk first line, with the commands just in the file.
Also, the first line of your script isn't going to work right. In AWK, setup code needs to be inside the BEGIN block, and $1 is a reference to the first word in the input line. You need to use ARGV[1] to refer to the first argument.
http://www.gnu.org/software/gawk/manual/html_node/ARGC-and-ARGV.html
As #TrueY pointed out, there should be a -f on the first line:
#!/usr/bin/awk -f
This is discussed here: Invoking a script, which has an awk shebang, with parameters (vars)
Working, tested version of the program:
#!/usr/bin/awk -f
## pick1 - pick one random number out of y
## main routine
BEGIN {
TOPNUM = ARGV[1]
## set seed
srand ()
## get a random number
select = 1 +int(rand() * TOPNUM)
# print pick
print select
}
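For reference, once it is saved as pick1 and marked executable, it can be run like this (50 is just the example upper bound from the question; because everything happens in the BEGIN block, awk never tries to open a file named 50):
chmod +x pick1
./pick1 50     # prints a random integer between 1 and 50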
Actually this form is preferable:
#! /bin/awk -E
The man page says:
-E Similar to -f, however, this option is the last one processed and should be used with #! scripts, particularly for CGI applications, to avoid passing in options or source code (!) on the command line from a URL. This option disables command-line variable assignments.
