Separating data into files based on random ID numbers using shell script - linux

I have a tab-delimited file (1993_NYA.tab) with radiosonde data, where each line carries an ID. I want to extract the data for each ID into a separate tab file. The file looks like this:
1993-01-01T10:45:03 083022143 250 78.93018 11.95426 960.0 -16.8 76 1.7 276
1993-01-01T10:45:16 083022143 300 78.93011 11.95529 953.7 -17.2 77 1.8 288
1993-01-01T10:45:30 083022143 350 78.93000 11.95638 947.3 -17.6 79 2.0 297
Here 083022143 is the ID, but it changes randomly (not in ascending order). The code I tried is as follows:
ID=$(cat 1993_NYA.tab | cut -f 2 | sort | uniq)
for i in {$ID}
do
awk -F '\t' '$2 = "$i"' 1993_NYA.tab > 1993_$i.tab
done
This is not storing the data for each ID in a file named after that ID. Can anyone please help?

There are three small mistakes.
{$ID} should be just $ID.
'$2 = "$i"' has the assignment operator = rather than the comparison ==.
'$2 = "$i"' does not interpolate the value of i into the argument because of the single quotes; we can write e.g. "\$2 == $i" instead.
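Putting the three fixes together, the loop would look something like this (a sketch; passing the ID via awk's -v option instead of interpolating it into the program text would be even more robust):
ID=$(cat 1993_NYA.tab | cut -f 2 | sort | uniq)
for i in $ID
do
    awk -F '\t' "\$2 == $i" 1993_NYA.tab > "1993_$i.tab"
done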
With the above changes, your code works. But awk has built-in magic which makes the task much easier. The single command
awk -F '\t' '{print >"1993_"$2".tab"}' 1993_NYA.tab
does what you want.
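One aside (not from the original answer): with very many distinct IDs, some awk implementations can hit a "too many open files" limit with that redirection. A variant that closes each output file after writing would be:
awk -F '\t' '{f = "1993_" $2 ".tab"; print >> f; close(f)}' 1993_NYA.tab   # ">>" so a reopened file is appended to, not truncated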

Related

AWK - string containing required fields

I thought it would be easy to define a string such as "1 2 3" and use it within AWK (GAWK) to extract the required fields; how wrong I have been.
I have tried creating AWK arrays, BASH arrays, splitting, string substitution, etc., but could not find any method to use the resulting 'chunks' (i.e. the column/field numbers) in a print statement.
I believe Akshay Hegde has provided an excellent solution with the get_cols function, here
but it was over 8 years ago, and I am really struggling to work out 'how it works', namely, what this is doing:
s = length(s) ? s OFS $(C[i]) : $(C[i])
I am unable to post a comment asking for clarification due to my lack of reputation (and it is an old post).
Is someone able to explain how the solution works?
NB I don't think I need the sub as I am using the following to clean up (replace all non-numeric characters with a comma, i.e. a separator, and sort numerically):
Columns=$(echo $Input_string | sed 's/[^0-9]\+/,/g')
Columns=$(echo $Columns | xargs -n1 | sort -n | xargs)
(using this string, the awk would be executed as awk -v cols=$Columns -f test.awk infile in the given solution)
Given the informative answer from @Ed Morton, with a nice worked example, I have attempted to remove the need for a function (and also an additional awk program file). The intention is to have this within a shell script, and I would rather it be self-contained, but this is also further investigation into 'how it works'.
Fields="1 2 3"
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print "s="s " arr1="Column[1]" arr2="Column[2]" arr3="Column[3]}'
The results have surprised me (taking note of my Comment to Ed)
s=1 2 3 arr1=1 arr2=2 arr3=3
The above clearly shows the split into the array has worked, but I thought s would include the $ for each ternary-operator concatenation, i.e. "$1 $2 $3".
Moreover, I was hoping to append the actual file to the above command, since I have found that echo $string | awk '{program}' file.name is accepted syntax.
NB it is a little insulting that my question has been marked as -1 indicating little research effort, as I have spent days trying to work this out.
Taking all the information above, I think s results in "1 2 3", but the print doesn't accept this in the same way as it does when called from a function; it simply tries to 'print 1 2 3' against the file, which seems to be how all my efforts have ended up.
This really confuses me, as Ed's 'diagonal' example works from the command line, indicating that the concept of 'print s' is absolutely fine when used with a file-name input.
Can anyone suggest how this (example below) can work?
I don't know if using an echo pipe and appending the file name is strictly allowed, but it appears to work (?!?!?!)
(failed result)
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print s}' myfile.txt
This appears to go through myfile.txt and output all lines containing many comma-separated values, i.e. the whole file (I haven't included the values; this is for illustration only):
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
what this is doing: s = length(s) ? s OFS $(C[i]) : $(C[i])
You have encountered the ternary operator, which has the following syntax:
condition ? valueiftrue : valueiffalse
The length function, when given a single argument, returns the number of characters. In GNU AWK the integer 0 is considered false and other integers are considered true, so here it serves as an is-not-empty check. When s is not empty (it might also not be initialized yet, in which case GNU AWK assumes an empty string), it is concatenated with the output field separator (OFS, a space by default) and the C[i]-th field value, and the result is assigned back to s; when s is empty, just the C[i]-th field value is assigned. Used multiple times, this builds up a string of values separated by OFS. Consider the following simple example: say you want to get the diagonal of a 2D matrix, stored in file.txt with the following content
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
then you might do
awk '{s = length(s) ? s OFS $(NR) : $(NR)}END{print s}' file.txt
which will get output
1 7 13 19 25
Explanation: NR is the record (row) number, so for the 1st row $(NR) is the 1st field, for the 2nd row it is the 2nd field, for the 3rd it is the 3rd field, and so on.
(tested in GNU Awk 5.0.1)
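As for the command that surprised you (the one with both the echo pipe and myfile.txt): when awk is given file operands it reads those files and ignores stdin, so the piped string never reaches the program, and the split/loop runs on every line of myfile.txt instead. A self-contained sketch that avoids a separate program file is to pass the column list in with -v (myfile.txt here is just a placeholder name):
Fields="1 2 3"
awk -v cols="$Fields" '
    BEGIN { n = split(cols, C, " ") }   # split the column list once, before any input is read
    {
        s = ""                          # rebuild s for every input line
        for (i = 1; i <= n; i++)
            s = length(s) ? s OFS $(C[i]) : $(C[i])
        print s
    }' myfile.txt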

How to speed-up sed that uses Regex on very large single cell BAM file

I have the following simple script that tries to count the tag encoded with "CB:Z" in a SAM/BAM file:
samtools view -h small.bam | grep "CB:Z:" |
sed 's/.*CB:Z:\([ACGT]*\).*/\1/' |
sort |
uniq -c |
awk '{print $2 " " $1}'
Typically it needs to process 40 million lines. That code takes around 1 hour to finish.
This line sed 's/.*CB:Z:\([ACGT]*\).*/\1/' is very time consuming.
How can I speed it up?
The reason I used the regex is that the "CB" tag's column position is not fixed: sometimes it's at column 20 and sometimes at column 21.
Example BAM file can be found HERE.
Update
Speed comparison on complete 40 million lines file:
My initial code:
real 21m47.088s
user 26m51.148s
sys 1m27.912s
James Brown's with AWK:
real 1m28.898s
user 2m41.336s
sys 0m6.864s
James Brown's with MAWK:
real 1m10.642s
user 1m41.196s
sys 0m6.484s
Another awk, pretty much like @tripleee's, I'd assume:
$ samtools view -h small.bam | awk '
    match($0,/CB:Z:[ACGT]*/) {             # use match for the regex match
        a[substr($0,RSTART+5,RLENGTH-5)]++ # len("CB:Z:")==5, hence +/-5
    }
    END {
        for(i in a)
            print i,a[i]                   # sample output, tweak it to your liking
    }'
Sample output:
...
TCTTAATCGTCC 175
GGGAAGGCCTAA 190
TCGGCCGATCGG 32
GACTTCCAAGCC 76
CCGCGGCATCGG 36
TAGCGATCGTGG 125
...
Notice: Your sed 's/.*CB:Z:... matches the last instance, whereas my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.
Notice 2: Quoting @Sundeep in the comments: using LC_ALL=C mawk '..' will give even better speed.
With perl
perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'
CB:Z:\K[ACGT]++ will match any sequence of ACGT characters preceded by CB:Z:. \K is used here to prevent CB:Z: from being part of the matched portion, which is available via the $& variable.
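A quick way to see what ends up in $& when \K is involved (a toy input line, not real BAM data):
$ echo 'xx CB:Z:ACGT yy' | perl -ne 'print "$&\n" if /CB:Z:\K[ACGT]++/'
ACGT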
Sample times with the small.bam input file. mawk is fastest for this input, but that might change for a larger input file.
# script.awk is the one mentioned in James Brown's answer
# result here shown with GNU awk
$ time LC_ALL=C awk -f script.awk small.bam > f1
real 0m0.092s
# mawk is faster compared to GNU awk for this use case
$ time LC_ALL=C mawk -f script.awk small.bam > f2
real 0m0.054s
$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3
real 0m0.064s
$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical
Better to avoid parsing the output of samtools view in the first place. Here's one way to get what you need just using python and the pysam library:
import pysam
from collections import defaultdict

counts = defaultdict(int)
tag = 'CB'

with pysam.AlignmentFile('small.bam') as sam:
    for aln in sam:
        if aln.has_tag(tag):
            counts[aln.get_tag(tag)] += 1

for k, v in counts.items():
    print(k, v)
Following your original pipeline approach:
pcre2grep -o 'CB:Z:\K[^\t]*' small.bam |
awk '{++c[$0]} END {for (i in c) print i,c[i]}'
In case you're interested in trying to speed up sed (although it's not likely to be the fastest):
sed 't a;s/CB:Z:/\n/;D;:a;s/\t/\n/;P;d' small.bam |
awk '{++c[$0]} END {for (i in c) print i,c[i]}'
The above syntax is compatible with GNU sed.
Regarding the awk-based solutions, I've noticed few taking advantage of FS.
I'm not too familiar with the BAM format. If CB only shows up once per line, then
mawk/mawk2/gawk -b 'BEGIN { FS = "CB:Z:" }

$2 ~ /^[ACGT]/ {    # if FS never matches, $2 would be beyond the end of the line,
                    # so this would just match against the null string
                    # and evaluate to false
    seen[substr($2, 1, -1 + match($2, /[^ACGT]|$/))]++
}

END { for (x in seen) { print seen[x] " " x } }'
If it shows up more than once, change that to a loop over every field greater than 1. This version uses the laziest evaluation model possible to speed things up, then does all the uniq -c work inside awk.
While this is rather similar to the best answer above, having FS pre-split the fields makes match() and substr() do a lot less work. I'm simply matching a single character after the genetic sequence, directly using match()'s return value, minus 1, as the substring length, and skipping RSTART and RLENGTH altogether.
Regarding:
$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical
there's absolutely no need to write them out to disk and do a diff. Simply pipe the output of each to a very high-speed hashing algorithm that adds close to no time (when the output is gigantic enough you might even end up saving time versus going to disk).
My personal favorite is xxhash in 128-bit mode, available via python pip. It's NOT a cryptographic hash, but it's much faster than even something like MD5. This method also allows for a hassle-free comparison, since timing the benchmark also performs the accuracy check.
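For instance (a sketch, assuming the python xxhash package is installed and script.awk is the awk program from above), each pipeline's sorted output can be hashed in memory rather than written to a file, and identical digests then mean identical output:
# hash the sorted output instead of writing f1/f2/f3 and diffing them
LC_ALL=C mawk -f script.awk small.bam | sort |
    python3 -c 'import sys, xxhash; print(xxhash.xxh3_128(sys.stdin.buffer.read()).hexdigest())'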

How to extract two part-numerical values from a line in shell script

I have multiple text files in this format. I would like to extract lines matching this pattern "pass filters and QC".
File1:
Before main variant filters, 309 founders and 0 nonfounders present.
0 variants removed due to missing genotype data (--geno).
9302015 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
7758518 variants and 309 people pass filters and QC.
Calculating allele frequencies... done.
I was able to grep the line, but when I tried to assign it to the line variable it just doesn't work.
grep 'people pass filters and QC' File1
line="$(echo grep 'people pass filters and QC' File1)"
I am new to shell script and would appreciate if you could help me do this.
I want to create a tab-separated file with just
"File1" "7758518 variants" "309 people"
GNU awk
gawk '
    BEGIN { patt = "([[:digit:]]+ variants) .* ([[:digit:]]+ people) pass filters and QC" }
    match($0, patt, m) { printf "\"%s\" \"%s\" \"%s\"\n", FILENAME, m[1], m[2] }
' File1
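With the sample File1 shown in the question, this should print something like:
"File1" "7758518 variants" "309 people"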
You are almost there; just remove the double quotes and echo from your command:
line=$(grep 'people pass filters and QC' File1)
Now view the value stored in variable:
echo $line
And if your file structure is same, i.e., it will always be in this form: 7758518 variants and 309 people pass filters and QC, you can use awk to get selected columns from output. So complete command would be like below:
OIFS=$IFS; IFS=$'\n'
for i in $line; do echo $i; echo ''; done |
    awk -F "[: ]" '{print $1"\t"$2" "$3"\t"$5" "$6}'
IFS=$OIFS
Explanation:
IFS means internal field separator, and we set it to the newline character because we need that in the for loop.
But before that, we back it up in another variable, OIFS, so we can restore it later.
We use a for loop to iterate through all the matched strings, and use awk to select the required columns ($1, $2, $3, $5 and $6 with the given field separator) as per your requirement.
But please note, if your file structure varies, we may need to use a different technique to extract the "7758518 variants" and "309 people" parts.
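For what it's worth, if the matched line really is always of the form shown, a shorter sketch that avoids the IFS juggling would be the following (the filename is passed in with -v, since grep does not prefix it when searching a single file):
grep 'people pass filters and QC' File1 |
    awk -v f=File1 '{printf "\"%s\"\t\"%s %s\"\t\"%s %s\"\n", f, $1, $2, $4, $5}'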

How to extract specific value using grep and awk?

I am facing a problem extracting a specific value from a .txt file using grep and awk.
I show below an excerpt from the .txt file:
bravais-lattice index = 2
lattice parameter (alat) = 10.0000 a.u.
unit-cell volume = 250.0000 (a.u.)^3
number of atoms/cell = 2
number of atomic types = 1
number of electrons = 28.00
number of Kohn-Sham states= 18
kinetic-energy cutoff = 60.0000 Ry
charge density cutoff = 300.0000 Ry
convergence threshold = 1.0E-09
mixing beta = 0.7000
I have also defined some variables: ELEMENT and lat.
I want to extract the "unit-cell volume" value which is equal to 250.00.
I tried the following to extract the value using grep and awk:
volume=`grep "unit-cell volume" ./latt.10/$ELEMENT.scf.latt_$lat.out | awk '{printf "%15.12f\n",$5}'`
However, when I run the bash file I always get 00.000000 as a result instead of the correct value of 250.00.
Can anyone help, please?
Thanks in advance.
awk '{printf "%15.12f\n",$5}'
You're asking awk to print out the fifth field of the line ($5).
unit-cell volume = 250.0000 (a.u.)^3
    1       2    3     4        5
The fifth field is (a.u.)^3, which you are then asking awk to interpret as a number via the %f format code. It's not a number, though (or actually, doesn't start with a number), and when awk is asked to treat a non-numeric string as a number, it uses 0 instead. Thus it prints 0.
Solution: use $4 instead.
By the way, you can skip invoking grep by using awk itself to select the line, e.g.
awk '/^ unit-cell/ {...}'
The /^ unit-cell/ is a regular expression that matches "unit-cell" (with a leading space) at the beginning of the line. Adjust as necessary if you have other lines that start with unit-cell which you don't want to select.
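Combined into a single call, a sketch reusing the file path and printf format from your command (drop or adjust the anchor as needed) would be:
volume=$(awk '/unit-cell volume/ {printf "%15.12f\n", $4}' "./latt.10/$ELEMENT.scf.latt_$lat.out")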
You never need grep when you're using awk since awk can do anything useful that grep can do. It sounds like this is all you need:
$ awk -F'=' '/unit-cell volume/{printf "%.2f\n",$2}' file
250.00
The above works because when FS is =, $2 is <spaces>250.0000 (a.u.)^3, and when awk is asked to convert a string to a number it strips off leading spaces and anything after the numeric part, so that leaves 250.0000 to be converted to a number by %.2f.
In the script you posted $5 was failing because the 5th space-separated field in:
$1 $2 $3 $4 $5
<unit-cell> <volume> <=> <250.0000> <(a.u.)^3>
is (a.u.)^3 - you could have just added print $5 to see that.
Since you are processing key-value pairs where the key can have a variable amount of space in it, you need to tune that field number ($4, $5, etc.) separately for each record you want to process, unless you set the field separator (FS) appropriately to FS=" *= *". Then the key will always be in $1 and the value in $2.
Then use split to split the value and unit parts from each other.
Also, you can lose that grep by defining in awk a pattern (or condition, /unit-cell volume/) for that print action:
$ awk 'BEGIN{FS=" *= *"} /unit-cell volume/{split($2,a," +");print a[1]}' file
250.0000
Explained:
$ awk '
BEGIN { FS=" *= *" } # set appropriate field separator
/unit-cell volume/ { # pattern or condition
split($2,a," +") # split value part to value and possible unit parts
print a[1] # output value part
}' file

Finding duplicate entries across very large text files in bash

I am working with very large data files extracted from a database. There are duplicates across these files that I need to remove. If there are duplicates, they will exist across files, not within the same file. The files contain entries that look like the following:
File1
623898/bn-oopi-990iu/I Like Potato
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
File2
568798/jj-ytut-786hh/Hello Mike
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
So the Sesame Street line will have to be removed, possibly even across 5 files, but it must remain in at least one of them. From what I have been able to put together so far, I can run cat * | sort | uniq -cd to give me each duplicated line and the number of times it has been duplicated, but I have no way of getting the file name. cat * | sort | uniq -cd | grep "" * doesn't work. Any ideas or approaches for a solution would be great.
Expanding your original idea:
sort * | uniq -cd | awk '{print $2}' | grep -Ff- *
i.e. from the output, print only the duplicate strings, then search all the files for them (the list of things to search for is taken from -, i.e. stdin), literally (-F).
Something along these lines might be useful:
awk '!seen[$0] { print $0 > FILENAME ".new" } { seen[$0] = 1 }' file1 file2 file3 ...
twalberg's solution works perfectly, but if your files are really large it could exhaust the available memory, because it creates one entry in an associative array per unique record encountered. If that happens, you can try a similar approach where there is only one entry per duplicate record (I assume you have GNU awk and your files are named *.txt):
sort *.txt | uniq -d > dup
awk 'BEGIN {while(getline < "dup") {dup[$0] = 1}} \
!($0 in dup) {print >> (FILENAME ".new")} \
$0 in dup {if(dup[$0] == 1) {print >> (FILENAME ".new");dup[$0] = 0}}' *.txt
Note that if you have many duplicates it could also exhaust the available memory. You can solve this by splitting the dup file into smaller chunks and running the awk script on each chunk.
