I thought it would be easy to define a string such as "1 2 3" and use it within AWK (GAWK) to extract the required fields; how wrong I have been.
I have tried creating AWK arrays, BASH arrays, splitting, string substitution etc., but could not find any method to use the resulting 'chunks' (i.e. the column/field numbers) in a print statement.
I believe Akshay Hegde has provided an excellent solution with the get_cols function, here
but it was over 8 years ago, and I am really struggling to work out 'how it works', namely, what this line is doing:
s = length(s) ? s OFS $(C[i]) : $(C[i])
I am unable to post a comment asking for clarification due to my lack of reputation (and it is an old post).
Is someone able to explain how the solution works?
NB I don't think I need the sub() as I am using the following to clean up (replace all non-numeric characters with a comma, i.e. a separator, and sort numerically):
Columns=$(echo $Input_string | sed 's/[^0-9]\+/,/g')
Columns=$(echo $Columns | xargs -n1 | sort -n | xargs)
(using this string, the awk would be executed as awk -v cols=$Columns -f test.awk infile in the given solution)
Given the informative answer from @Ed Morton, with a nice worked example, I have attempted to remove the need for a function (and also an additional awk program file). The intention is to have this within a shell script, and I would rather it be self-contained; it is also a further investigation into 'how it works'.
Fields="1 2 3"
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print "s="s " arr1="Column[1]" arr2="Column[2]" arr3="Column[3]}'
The results have surprised me (taking note of my Comment to Ed)
s=1 2 3 arr1=1 arr2=2 arr3=3
The above clearly shows the split has worked into the array, but I thought s would include a $ for each ternary-operator concatenation, i.e. "$1 $2 $3".
Moreover, I was hoping to append the actual file to the above command, as I have found that echo $string | awk '{program}' file.name is accepted.
NB it is a little insulting that my question has been marked as -1 indicating little research effort, as I have spent days trying to work this out.
Taking all the information above, I think s results in "1 2 3", but the print doesn't accept this in the same way as it does when called from a function; it simply tries to 'print 1 2 3' in relation to the file, which seems to be how all my efforts have ended up.
This really confuses me, as Ed's 'diagonal' example works from the command line, indicating that the concept of 'print s' is absolutely fine when used with a file name input.
Can anyone suggest how this (example below) can work?
I don't know if using an echo pipe and appending the file name is strictly allowed, but it appears to be accepted (?!)
(failed result)
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print s}' myfile.txt
This appears to go through myfile.txt and output all its lines of comma-separated values, i.e. the whole file (I haven't included the values; the lines below are for illustration only):
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
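For reference, here is a minimal self-contained sketch of what I am trying to achieve (assuming a comma-separated myfile.txt); it passes the field list in via -v, since awk reads the named file and ignores the pipe whenever a file operand is given:
Fields="1 2 3"
awk -F "," -v cols="$Fields" '
BEGIN { n = split(cols, Column, " ") }   # split the wanted column numbers once
{
    s = ""                               # reset for every input line
    for (i = 1; i <= n; i++)
        s = length(s) ? s OFS $(Column[i]) : $(Column[i])
    print s
}' myfile.txt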
what this is doing: s = length(s) ? s OFS $(C[i]) : $(C[i])
You have encountered a ternary operator; it has the following syntax:
condition ? valueiftrue : valueiffalse
The length function, when provided with a single argument, returns the number of characters. In GNU AWK the integer 0 is considered false and other integers are considered true, so in this case it is a not-empty check.
When s is not empty, it is concatenated with the output field separator (OFS, a space by default) and the value of the C[i]-th field, and the result is assigned back to s; when s is empty (it might also not be initialized yet, as GNU AWK will assume an empty string in that case), s is assigned just the value of the C[i]-th field. Used multiple times, this allows building a string of values separated by OFS.
Consider the following simple example: say you want to get the diagonal of a 2D matrix, stored in file.txt with the following content
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
then you might do
awk '{s = length(s) ? s OFS $(NR) : $(NR)}END{print s}' file.txt
which will output
1 7 13 19 25
Explanation: NR is the row number, so for the 1st row $(NR) is the 1st field, for the 2nd row it is the 2nd field, for the 3rd the 3rd field, and so on.
(tested in GNU Awk 5.0.1)
I have the following simple script that tries to count the tag encoded with "CB:Z" in a SAM/BAM file:
samtools view -h small.bam | grep "CB:Z:" |
sed 's/.*CB:Z:\([ACGT]*\).*/\1/' |
sort |
uniq -c |
awk '{print $2 " " $1}'
Typically it needs to process 40 million lines. That code takes around 1 hour to finish.
The line sed 's/.*CB:Z:\([ACGT]*\).*/\1/' is very time-consuming.
How can I speed it up?
The reason I used the regex is that the column-wise position of the "CB" tag
is not fixed. Sometimes it's at column 20 and sometimes column 21.
Example BAM file can be found HERE.
Update
Speed comparison on the complete 40-million-line file:
My initial code:
real 21m47.088s
user 26m51.148s
sys 1m27.912s
James Brown's with AWK:
real 1m28.898s
user 2m41.336s
sys 0m6.864s
James Brown's with MAWK:
real 1m10.642s
user 1m41.196s
sys 0m6.484s
Another awk, pretty much like @tripleee's, I'd assume:
$ samtools view -h small.bam | awk '
match($0,/CB:Z:[ACGT]*/) { # use match for the regex match
a[substr($0,RSTART+5,RLENGTH-5)]++ # len("CB:Z:")==5, hence the +5/-5
}
END {
for(i in a)
print i,a[i] # sample output, tweak it to your liking
}'
Sample output:
...
TCTTAATCGTCC 175
GGGAAGGCCTAA 190
TCGGCCGATCGG 32
GACTTCCAAGCC 76
CCGCGGCATCGG 36
TAGCGATCGTGG 125
...
Notice: Your sed 's/.*CB:Z:... matches the last instance, whereas my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.
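A quick demonstration of the difference, on a hypothetical line containing the tag twice:
$ echo 'x CB:Z:AAAA y CB:Z:CCCC z' | sed 's/.*CB:Z:\([ACGT]*\).*/\1/'
CCCC
$ echo 'x CB:Z:AAAA y CB:Z:CCCC z' | awk 'match($0,/CB:Z:[ACGT]*/){print substr($0,RSTART+5,RLENGTH-5)}'
AAAA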
Notice 2: Quoting @Sundeep in the comments: using LC_ALL=C mawk '..' will give even better speed.
With perl
perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'
CB:Z:\K[ACGT]++ will match any sequence of ACGT characters preceded by CB:Z:. \K is used here to prevent CB:Z: from being part of the matched portion, which is available via the $& variable.
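A tiny illustration of the \K behaviour on a hypothetical line:
$ echo 'xx CB:Z:ACGT yy' | perl -nE 'say $& if /CB:Z:\K[ACGT]++/'
ACGT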
Sample times with the small.bam input file. mawk is fastest for this input, but that might change for a larger input file.
# script.awk is the one mentioned in James Brown's answer
# result here shown with GNU awk
$ time LC_ALL=C awk -f script.awk small.bam > f1
real 0m0.092s
# mawk is faster compared to GNU awk for this use case
$ time LC_ALL=C mawk -f script.awk small.bam > f2
real 0m0.054s
$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3
real 0m0.064s
$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical
Better to avoid parsing the output of samtools view in the first place. Here's one way to get what you need just using python and the pysam library:
import pysam
from collections import defaultdict

counts = defaultdict(int)
tag = 'CB'

with pysam.AlignmentFile('small.bam') as sam:
    for aln in sam:
        if aln.has_tag(tag):
            counts[aln.get_tag(tag)] += 1

for k, v in counts.items():
    print(k, v)
Following your original pipeline approach:
pcre2grep -o 'CB:Z:\K[^\t]*' small.bam |
awk '{++c[$0]} END {for (i in c) print i,c[i]}'
In case you're interested in trying to speed up sed (although it's not likely to be the fastest):
sed 't a;s/CB:Z:/\n/;D;:a;s/\t/\n/;P;d' small.bam |
awk '{++c[$0]} END {for (i in c) print i,c[i]}'
The above syntax is compatible with GNU sed.
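For readability, here is the same script spread out with comments (GNU sed accepts # comments):
sed '
  t a            # substitution matched on the previous cycle: jump to :a
  s/CB:Z:/\n/    # replace the first CB:Z: with a newline
  D              # no newline: drop the line; else delete up to it and restart
  :a
  s/\t/\n/       # cut the barcode off at the next tab
  P              # print only the barcode part
  d              # discard the rest and read the next line
' small.bam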
Regarding the AWK-based solutions, I've noticed a few taking advantage of FS.
I'm not too familiar with the BAM format. If CB only shows up once per line, then
mawk/mawk2/gawk -b 'BEGIN { FS = "CB:Z:";
} $2 ~ /^[ACGT]/ { # if FS never matches, $2 would be beyond
# end of line, then this would just match
# against null string, & eval to false
seen[substr($2, 1, -1 + match($2, /[^ACGT]|$/))]++
} END { for (x in seen) { print seen[x] " " x } }'
If it shows up more than once, then change that to a loop over every field greater than 1 (see the sketch after the next paragraph). This version uses the laziest evaluation model possible to speed it up, then does all the uniq -c work in the same pass.
While this is rather similar to the best answer above, by having FS pre-split the fields, it causes match() and substr() to do a lot less work. I'm simply matching 1 single char after the genetic sequence, directly using its return, minus 1, as the substring length, and skipping RSTART and RLENGTH altogether.
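A hypothetical sketch of that multi-occurrence variant, still letting FS do the splitting:
mawk 'BEGIN { FS = "CB:Z:" }
NF > 1 {                                  # at least one tag on this line
    for (i = 2; i <= NF; i++)             # each field after a separator match
        if ($i ~ /^[ACGT]/)
            seen[substr($i, 1, match($i, /[^ACGT]|$/) - 1)]++
}
END { for (x in seen) print seen[x], x }' small.bam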
Regarding:
$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical
there's absolutely no need to write them out to disk and run a diff. Simply pipe the output of each to a very high-speed hashing algorithm that adds close to no time (when the output is gigantic enough, you might even end up saving time versus going to disk).
My personal favorite is xxhash in 128-bit mode, available via Python's pip. It's NOT a cryptographic hash, but it's much faster than even something like MD5. This method also allows for hassle-free comparison, since the benchmark timing will also perform the accuracy check.
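A minimal sketch of the idea, assuming the python-xxhash package (pip install xxhash) is installed; each pipeline's sorted output is hashed in memory and only the short digests need comparing:
$ awk -f script.awk small.bam | sort |
    python3 -c 'import sys, xxhash; print(xxhash.xxh128(sys.stdin.buffer.read()).hexdigest())'
$ mawk -f script.awk small.bam | sort |
    python3 -c 'import sys, xxhash; print(xxhash.xxh128(sys.stdin.buffer.read()).hexdigest())'
If the two digests match, the outputs agree.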
I am facing a problem extracting a specific value from a .txt file using grep and awk.
I show below an excerpt from the .txt file:
"-
bravais-lattice index = 2
lattice parameter (alat) = 10.0000 a.u.
unit-cell volume = 250.0000 (a.u.)^3
number of atoms/cell = 2
number of atomic types = 1
number of electrons = 28.00
number of Kohn-Sham states= 18
kinetic-energy cutoff = 60.0000 Ry
charge density cutoff = 300.0000 Ry
convergence threshold = 1.0E-09
mixing beta               = 0.7000
I also defined some variables: ELEMENT and lat.
I want to extract the "unit-cell volume" value which is equal to 250.00.
I tried the following to extract the value using grep and awk:
volume=`grep "unit-cell volume" ./latt.10/$ELEMENT.scf.latt_$lat.out | awk '{printf "%15.12f\n",$5}'`
However, when I run the bash file I always get 00.000000 as a result instead of the correct value of 250.00.
Can anyone help, please?
Thanks in advance.
awk '{printf "%15.12f\n",$5}'
You're asking awk to print out the fifth field of the line ($5).
unit-cell   volume   =   250.0000   (a.u.)^3
    1         2      3       4          5
The fifth field is (a.u.)^3, which you are then asking awk to interpret as a number via the %f format code. It's not a number, though (or actually, doesn't start with a number), and when awk is asked to treat a non-numeric string as a number, it uses 0 instead. Thus it prints 0.
Solution: use $4 instead.
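With that one change, the posted command becomes (keeping the asker's variables):
volume=`grep "unit-cell volume" ./latt.10/$ELEMENT.scf.latt_$lat.out | awk '{printf "%15.12f\n",$4}'`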
By the way, you can skip invoking grep by using awk itself to select the line, e.g.
awk '/^ unit-cell/ {...}'
The /^ unit-cell/ is a regular expression that matches "unit-cell" (with a leading space) at the beginning of the line. Adjust as necessary if you have other lines that start with unit-cell which you don't want to select.
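Putting the two together, a sketch of the grep-less version:
volume=$(awk '/^ unit-cell/ {printf "%15.12f\n", $4}' ./latt.10/$ELEMENT.scf.latt_$lat.out)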
You never need grep when you're using awk since awk can do anything useful that grep can do. It sounds like this is all you need:
$ awk -F'=' '/unit-cell volume/{printf "%.2f\n",$2}' file
250.00
The above works because when FS is = that means $2 is <spaces>250.0000 (a.u.)^3 and when awk is asked to convert a string to a number it strips off leading spaces and anything after the numeric part, so that leaves 250.0000 to be converted to a number by %.2f.
In the script you posted, $5 was failing because the 5th space-separated field in:
    $1         $2     $3     $4          $5
<unit-cell> <volume>  <=>  <250.0000>  <(a.u.)^3>
is (a.u.)^3 - you could have just added print $5 to see that.
Since you are processing key-value pairs where the key can have a variable amount of space in it, you need to tune that field number ($4, $5, etc.) separately for each record you want to process, unless you set the field separator (FS) appropriately to FS=" *= *". Then the key will always be in $1 and the value in $2.
Then use split to split the value and unit parts from each other.
Also, you can lose that grep by defining in awk a pattern (or condition, /unit-cell volume/) for that print action:
$ awk 'BEGIN{FS=" *= *"} /unit-cell volume/{split($2,a," +");print a[1]}' file
250.0000
Explained:
$ awk '
BEGIN { FS=" *= *" } # set appropriate field separator
/unit-cell volume/ { # pattern or condition
split($2,a," +") # split value part to value and possible unit parts
print a[1] # output value part
}' file