Print range of rows if a condition is met in AWK - linux

What I am trying to do is show the 2 rows above and 2 rows below a line that meets a certain criterion, without a pipe, using awk. For example, I am searching for the string 's62234' and, when it is found, I want to print that row together with the two rows before and the two rows after it.
This is the file I am using (thefmifile.txt)
s62098:x:1271:504:Velizar Vrabchev,SI,3,1:/home/SI/s62098:/bin/bash
s62101:x:1272:504:Georgi Georgiev,SI,3,5:/home/SI/s62101:/bin/bash
s62108:x:1273:504:Sherif Kunch,SI,3,1:/home/SI/s62108:/bin/bash
s62111:x:1274:504:Yulian Bizeranov,SI,3,3:/home/SI/s62111:/bin/bash
s62121:x:1275:504:Daniel Dimitrov,SI,2,1:/home/SI/s62121:/bin/bash
s62133:x:1276:504:Ivaylo Ivanov,SI,2,2:/home/SI/s62133:/bin/bash
s62160:x:1277:504:Veniyana Tsolova,SI,2,3:/home/SI/s62160:/bin/bash
s62199:x:1278:504:Nikola Petrov,SI,2,5:/home/SI/s62199:/bin/bash
s62219:x:1279:504:Viliyan Ivanov,SI,2,6:/home/SI/s62219:/bin/bash
s62234:x:1280:504:Viktoriya Dobreva,SI,2,3:/home/SI/s62234:/bin/bash
s855264:x:1281:504:Toni Dupkarski,SI,4,2:/home/SI/s855264:/bin/bash
s81555:x:1282:503:Elena Georgieva,KN,2,0:/home/KN/s81555:/bin/bash
s81585:x:1283:503:Stela Marinova,KN,2,0:/home/KN/s81585:/bin/bash
s81441:x:1284:503:Vesela Plamenova Borislavova , KN, k2, g7:/home/KN/s81441:/bin/bash
s81644:x:1285:503:Viktor Rusev, KN, k2, g7:/home/KN/s81644:/bin/bash
s81628:x:1286:503:Iliyan Yordanov Yordanov, KN, k2, g6:/home/KN/s81628:/bin/bash
s81490:x:1287:503:Yana Spasova, KN, k2, g6:/home/KN/s81490:/bin/bash
What I have tried is using awk to find the row that meets the criterion and using NR to get the numbers of the other rows needed, but it seems I am missing something.
Here is the command I used:
cat thefmifile.txt | awk -F ':' '$1==s62234 {for (x = NR -2; x <= NR + 2; x++){print}}'
However, this does not give me the output I want.
And this is the desired output:
s62199:x:1278:504:Nikola Petrov,SI,2,5:/home/SI/s62199:/bin/bash
s62219:x:1279:504:Viliyan Ivanov,SI,2,6:/home/SI/s62219:/bin/bash
s62234:x:1280:504:Viktoriya Dobreva,SI,2,3:/home/SI/s62234:/bin/bash
s855264:x:1281:504:Toni Dupkarski,SI,4,2:/home/SI/s855264:/bin/bash
s81555:x:1282:503:Elena Georgieva,KN,2,0:/home/KN/s81555:/bin/bash
When I use {print x} it shows the numbers of the lines I need, but is there some way to access the lines of the file as elements of an array and just use this 'x' as an index (e.g. something like NR[x])?
Or is there some other way to retrieve these rows?
Thank you!

$ awk -v n=2 -F':' '$1=="s62234"{for (i=0;i<n;i++) print buf[(NR+i)%n]; c=n+1} c&&c--; {buf[NR%n]=$0}' file
s62199:x:1278:504:Nikola Petrov,SI,2,5:/home/SI/s62199:/bin/bash
s62219:x:1279:504:Viliyan Ivanov,SI,2,6:/home/SI/s62219:/bin/bash
s62234:x:1280:504:Viktoriya Dobreva,SI,2,3:/home/SI/s62234:/bin/bash
s855264:x:1281:504:Toni Dupkarski,SI,4,2:/home/SI/s855264:/bin/bash
s81555:x:1282:503:Elena Georgieva,KN,2,0:/home/KN/s81555:/bin/bash
buf[] is just an array storing the n lines preceding the current line, so those can be printed when your $1=="s62234" condition is met. c&&c--; is a condition that is true while c is greater than zero, which makes awk print the current line (the default action); since c=n+1 is set when your condition is met, this prints the matched line plus the n subsequent lines, decrementing c until it reaches zero.
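Written out over several lines with comments, the same program reads:
awk -v n=2 -F':' '
$1 == "s62234" {                    # the target line has been found
    for (i = 0; i < n; i++)         # print the n buffered lines that precede it
        print buf[(NR + i) % n]
    c = n + 1                       # arrange to print this line plus n more
}
c && c--                            # true while c > 0; the default action prints $0
{ buf[NR % n] = $0 }                # keep a rolling buffer of the last n lines
' file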

Could you please try the following; a simple grep could handle this task.
grep -A2 -B2 '^s62234:' Input_file
Also, more concisely, you could use -C to get the same amount of context on each side of the exact match:
grep -C2 '^s62234:' Input_file

That's easily doable with grep:
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM
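For the file in the question, that would be, for example:
grep -C2 '^s62234:' thefmifile.txt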
With awk, you could do something like this:
awk -F ':' '$1=="s62234"{print l2;print l1;a=3}a&&a-->0{print}{l2=l1;l1=$0}' thefmifile.txt
You can handle the number of before-lines dynamically by storing them in an array and using a loop.
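Here is a sketch of that idea (untested beyond the sample file above; the -v key and -v n parameters are my own generalization): buffer the last n lines in an array keyed by NR and replay them when the first field matches.
awk -F ':' -v n=2 -v key="s62234" '
$1 == key {                                 # the key has been found
    for (i = n; i >= 1; i--)                # print up to n preceding buffered lines
        if ((NR - i) in buf) print buf[NR - i]
    a = n + 1                               # also print this line and the n following ones
}
a && a-- { print }
{ buf[NR] = $0; delete buf[NR - n] }        # remember this line, drop one no longer needed
' thefmifile.txt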

Related

How to cut a file into chunks

How to get information from specimen1 to specimen3 and paste it into another file 'DNA_combined.txt'?
I tried the cut command and the awk command, but I found that it is tricky to cut by paragraph(?) or sequence.
My attempt was something like cut -d '>' -f 1-3 dna1.fasta > DNA_combined.txt
You can get the line number for each row in vi by pressing Esc + : and typing set nu
Once you get the line number corresponding to each row:
Note down the line numbers of the lines corresponding to >Specimen1 (say X) and Specimen3 (say Y)
Then use the sed command to get the text between those two lines:
sed -n 'X,Yp' dna1.fasta > DNA_combined.txt
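For instance, if >Specimen1 started on line 1 and the Specimen3 block ended on line 12 (hypothetical numbers, just for illustration), that would be:
sed -n '1,12p' dna1.fasta > DNA_combined.txt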
Please let me know if you have any questions.
If you want the first three sequences irrespective of the content after >, you can use this:
$ cat ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
>four
ATGC
>five
GTA
$ awk '/^>/ && ++count==4{exit} 1' ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
/^>/ matches the start of a sequence
for such sequences, increment the count variable
if count reaches 4, the exit command will terminate the script
1 is an idiomatic way to print the contents of the input record
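The same one-liner, spread over lines with comments (identical behaviour):
awk '
/^>/ && ++count == 4 { exit }   # stop as soon as the 4th sequence header appears
1                               # default action: print the current record
' ip.txt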
Would you please try the following:
awk '
BEGIN {print ">Specimen1-3"} # print header
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
# set the flag depending on the number
f # print if f == 1
' dna1.fasta > DNA_combined.txt

How to extract a specific value using grep and awk?

I am facing a problem extracting a specific value from a .txt file using grep and awk.
I show below an excerpt from the .txt file:
bravais-lattice index = 2
lattice parameter (alat) = 10.0000 a.u.
unit-cell volume = 250.0000 (a.u.)^3
number of atoms/cell = 2
number of atomic types = 1
number of electrons = 28.00
number of Kohn-Sham states= 18
kinetic-energy cutoff = 60.0000 Ry
charge density cutoff = 300.0000 Ry
convergence threshold = 1.0E-09
mixing beta = 0.7000
I have also defined some variables: ELEMENT and lat.
I want to extract the "unit-cell volume" value which is equal to 250.00.
I tried the following to extract the value using grep and awk:
volume=`grep "unit-cell volume" ./latt.10/$ELEMENT.scf.latt_$lat.out | awk '{printf "%15.12f\n",$5}'`
However, when I run the bash file I always get 00.000000 as a result instead of the correct value of 250.00.
Can anyone help, please?
Thanks in advance.
awk '{printf "%15.12f\n",$5}'
You're asking awk to print out the fifth field of the line ($5).
unit-cell volume = 250.0000 (a.u.)^3
    1       2    3    4        5
The fifth field is (a.u.)^3, which you are then asking awk to interpret as a number via the %f format code. It's not a number, though (or actually, doesn't start with a number), and when awk is asked to treat a non-numeric string as a number, it uses 0 instead. Thus it prints 0.
Solution: use $4 instead.
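So the pipeline from the question, with only the field number changed, should produce the expected value:
volume=`grep "unit-cell volume" ./latt.10/$ELEMENT.scf.latt_$lat.out | awk '{printf "%15.12f\n",$4}'`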
By the way, you can skip invoking grep by using awk itself to select the line, e.g.
awk '/^ unit-cell/ {...}'
The /^ unit-cell/ is a regular expression that matches "unit-cell" (with a leading space) at the beginning of the line. Adjust as necessary if you have other lines that start with unit-cell which you don't want to select.
You never need grep when you're using awk since awk can do anything useful that grep can do. It sounds like this is all you need:
$ awk -F'=' '/unit-cell volume/{printf "%.2f\n",$2}' file
250.00
The above works because when FS is = the value of $2 is <spaces>250.0000 (a.u.)^3, and when awk is asked to convert a string to a number it strips off leading spaces and anything after the numeric part, so that leaves 250.0000 to be converted to a number by %.2f.
In the script you posted $5 was failing because the 5th space-separated field in:
    $1        $2     $3     $4         $5
<unit-cell> <volume> <=> <250.0000> <(a.u.)^3>
is (a.u.)^3 - you could have just added print $5 to see that.
Since you are processing key-value pairs where the key can contain a variable amount of space, you need to tune that field number ($4, $5, etc.) separately for each record you want to process, unless you set the field separator (FS) appropriately to FS=" *= *". Then the key will always be in $1 and the value in $2.
Then use split to separate the value and unit parts from each other.
Also, you can lose that grep by defining a pattern (or condition, /unit-cell volume/) in awk for that print action:
$ awk 'BEGIN{FS=" *= *"} /unit-cell volume/{split($2,a," +");print a[1]}' file
250.0000
Explained:
$ awk '
BEGIN { FS=" *= *" } # set appropriate field separator
/unit-cell volume/ { # pattern or condition
split($2,a," +") # split value part to value and possible unit parts
print a[1] # output value part
}' file

Get list of all duplicates based on first column within large text/csv file in linux/ubuntu

I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ Million lines). Format is like so:
foo0:bar0
foo1:bar1
foo2:bar2
The first column is any lowercase UTF-8 string and the second column is any UTF-8 string. I have been able to sort my file based on the first column, and only the first column, with:
sort -t':' -k1,1 filename.txt > output_sorted.txt
I have also been able to drop all duplicates with:
sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt
These operations take 4-8 min.
I am now trying to extract all duplicates based on the first column, and only the first column, to make sure all the entries in the second column are matching.
I think I can achieve this with awk with this code:
BEGIN { FS = ":" }
{
    count[$1]++;
    if (count[$1] == 1) {
        first[$1] = $0;
    }
    if (count[$1] == 2) {
        print first[$1];
    }
    if (count[$1] > 1) {
        print $0;
    }
}
running it with:
awk -f awk.dups input_sorted.txt > output_dup.txt
Now the problem is that this takes way too long (3+ hours) and is still not done. I know uniq can get all duplicates with something like:
uniq -D sorted_file.txt > output_dup.txt
The problem is specifying the delimiter and only using the first column. I know uniq has a -f N option to skip the first N fields. Is there a way to get these results without having to change/process my data? Is there another tool that could accomplish this? I have already used Python + pandas with read_csv and getting the duplicates, but this leads to errors (segmentation fault) and it is not efficient, since I should not have to load all the data in memory when the data is already sorted. I have decent hardware:
i7-4700HQ
16GB ram
256GB ssd samsung 850 pro
Anything that can help is welcome,
Thanks.
SOLUTION FROM BELOW
Using:
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'
with the time command I get the following performance:
real 0m46.058s
user 0m40.352s
sys 0m2.984s
If your file is already sorted you don't need to store more than one line, try this
$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input
If you try this please post the timings...
I have changed the awk script slightly because I couldn't fully understand what was happening in the above answer.
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}' sorted.input > duplicate.entries
I have tested it and it produces the same output as the above, but it might be easier to understand.
{if(p!=$1){p=$1; c=0; p0=$0} else c++}
If the first field of the line is not the same as on the previous line, we save it in p, set c to 0 and save the whole line into p0. If it is the same, we increment c.
c>=1{if(c==1){print p0;} print $0}
When the key repeats, we check whether it is the first repeat. If it is, we print the saved line and the current line; if not, we just print the current line.
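For reference, the same script written out over several lines with comments (behaviour unchanged):
awk -F: '
{
    if (p != $1) {       # first field differs from the previous line:
        p  = $1          #   remember the new key,
        c  = 0           #   reset the repeat counter,
        p0 = $0          #   and save the whole line
    } else {
        c++              # same key again: count the repeat
    }
}
c >= 1 {                 # this line repeats the previous key
    if (c == 1)          # first repeat: also emit the saved first occurrence
        print p0
    print $0             # emit the duplicate itself
}' sorted.input > duplicate.entries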

Merge values for same key

Is it possible to use awk to merge the values of the same key into one row?
For instance:
a,100
b,200
a,131
a,102
b,203
b,301
Can I convert them to a file like this:
a,100,131,102
b,200,203,301
You can use awk like this:
awk -F, '{a[$1] = a[$1] FS $2} END{for (i in a) print i a[i]}' file
a,100,131,102
b,200,203,301
We use -F, to set comma as the delimiter and use the array a to keep the aggregated values.
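Written out with comments, the same program reads (note that for (i in a) does not guarantee any particular key order):
awk -F, '
{ a[$1] = a[$1] FS $2 }          # append this value, preceded by FS, to the list for $1
END {
    for (i in a)                 # iterate over the collected keys
        print i a[i]             # the stored list already starts with a separator
}' file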
Reference: Effective AWK Programming
If Perl is an option,
perl -F, -lane '$a{$F[0]} = "$a{$F[0]},$F[1]"; END{for $k (sort keys %a){print "$k$a{$k}"}}' file
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
@F is the array of words in each line, indexed starting with $F[0]
$F[0] is the first element in @F (the key)
$F[1] is the second element in @F (the value)
%a is a hash which stores a string containing all matches of each key
tl;dr
If you presort the input, it is possible to use sed to join the lines, e.g.:
sort foo | sed -nE ':a; $p; N; s/^([^,]+)([^\n]+)\n\1/\1\2/; ta; P; s/.+\n//; ba'
A bit more explanation
The above one-liner can be saved into a script file. See below for a commented version.
parse.sed
# A goto label
:a
# Always print when on the last line
$p
# Read one more line into pattern space and join the
# two lines if the key fields are identical
N
s/^([^,]+)([^\n]+)\n\1/\1\2/
# Jump to label 'a' and redo the above commands if the
# substitution command was successful
ta
# Assuming sorted input, we have now collected all the
# fields for this key, print it and move on to the next
# key
P
s/.+\n//
ba
The logic here is as follows:
Assume sorted input.
Look at two consecutive lines. If their key fields match, remove the key from the second line and append the value to the first line.
Repeat 2. until key matching fails.
Print the collected values and reset to collect values for the next key.
Run it like this:
sort foo | sed -nEf parse.sed
Output:
a,100,102,131
b,200,203,301
With datamash
$ datamash -st, -g1 collapse 2 <ip.txt
a,100,131,102
b,200,203,301
From manual:
-s, --sort
sort the input before grouping; this removes the need to manually pipe the input through 'sort'
-t, --field-separator=X
use X instead of TAB as field delimiter
-g, --group=X[,Y,Z]
group via fields X,[Y,Z]
collapse
comma-separated list of all input values

Extracting the first N lines, or lines matching a condition, from a file

I have a file with almost 5*(10^6) lines of integer numbers, so my file is fairly big.
The question is all about extracting specific lines, filtering them by a condition.
For example, I'd like to:
Extract the first N lines without reading the entire file.
Extract the lines with numbers less than or equal to X (or >=, <=, <, >)
Extract the lines satisfying a condition related to a number (a math predicate)
Is there a clever way to perform these tasks (using sed or awk or cat or head)?
Thanks in advance.
To extract the first $NUMBER lines,
head -n $NUMBER filename
Assuming every line contains just a number (although it will also work if the first token is one), 2 can be solved like this:
awk '$1 >= 1234 && $1 < 5678' filename
And keeping in the spirit of that, 3 is just the extension:
awk 'condition' filename
It would have helped if you had specified what the condition is supposed to be, though. This way, you'll have to read the awk documentation to find out how to code it. Again, the number will be represented by $1.
I don't think I can explain anything about the head call; it's really just what it says on the tin. As for the awk lines: awk, like sed, works linewise. awk fetches lines in a loop and applies your code to each line. This code takes the form
condition1 { action1 }
condition2 { action2 }
# and so forth
For every line awk fetches, the conditions are checked in the order they appear, and the action associated with each condition is performed if the condition is true. It would, for example, have been possible to extract the first $NUMBER lines of a file with awk like this:
awk -v number="$NUMBER" '1 { print } NR == number { exit }' filename
where 1 is synonymous with true (like in C) and NR is the line number. The -v command line option initializes the awk variable number to $NUMBER. If no action is specified, the default action is { print }, which prints the whole line. So
awk 'condition' filename
is shorthand for
awk 'condition { print }' filename
...which prints every line where the condition holds.
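For example, a possible predicate for task 3 that keeps only the lines whose number is even would be:
awk '$1 % 2 == 0' filename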
