How to compare 2 lists of ranges in bash? - linux

Using bash script (Ubuntu 16.04), I'm trying to compare 2 lists of ranges: does any number in any of the ranges in file1 coincide with any number in any of the ranges in file2? If so, print the row in the second file. Here I have each range as 2 tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.
file1:
1 4
5 7
8 11
12 15
file2:
3 4
8 13
20 24
Desired output:
3 4
8 13
My best attempt has been:
awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next};
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then
{print $1, $2}}}}}' file1 file2 > output.txt
This returns an empty file.
I'm thinking that the script will need to involve range comparisons using if-then conditions and iterate through each line in both files. I've found examples of each concept, but can't figure out how to combine them.
Any help appreciated!

It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:
declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges
# read ranges in second file, store them in arrays min and max
while read a b; do
    min+=( "$a" )
    max+=( "$b" )
done < file2
# read ranges in first file
while read a b; do
    # loop over indexes of min (and max) array
    for i in "${!min[@]}"; do
        if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
            echo "${min[i]} ${max[i]}" # print range
            unset min[i] max[i] # performance optimization
        fi
    done
done < file1
This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.
EDIT 1: improved the range overlap test.
EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.
EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:
awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
     { for (i in a)
           if ($1 <= b[i] && a[i] <= $2) {
               print a[i], b[i]; delete a[i]; delete b[i];
           }
     }' file2 file1
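For the sample files in the question, this should print:
3 4
8 13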

awk solution:
awk 'NR==FNR{ a[$1]=$2; next }
     { for(i in a)
           if (($1>=i+0 && $1<=a[i]) || ($2<=a[i] && $2>=i+0)) {
               print i,a[i]; delete a[i];
           }
     }' file2 file1
The output:
3 4
8 13
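One caveat (this is what EDIT 3 of the first answer alludes to with "initial small bugs"): the two-sided test above can miss the case where a range in file1 completely contains a range in file2, e.g. 1 20 in file1 versus 3 4 in file2. A variant sketched along the same lines, using the usual interval-overlap test instead, would be:
awk 'NR==FNR{ a[$1]=$2; next }
     { for(i in a)
           if ($1 <= a[i] && i+0 <= $2) {   # overlap: start1 <= end2 && start2 <= end1
               print i, a[i]; delete a[i]
           }
     }' file2 file1
For the sample files it prints the same two lines.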

awk 'FNR == 1 && NR == 1 { file=1 } FNR == 1 && NR != 1 { file=2 } file ==1 { for (q=1;q<=NF;q++) { nums[$q]=$0} } file == 2 { for ( p=1;p<=NF;p++) { for (i in nums) { if (i == $p) { print $0 } } } }' file1 file2
Break down:
FNR == 1 && NR == 1 {
    file=1
}
FNR == 1 && NR != 1 {
    file=2
}
file == 1 {
    for (q=1;q<=NF;q++) {
        nums[$q]=$0
    }
}
file == 2 {
    for ( p=1;p<=NF;p++) {
        for (i in nums) {
            if (i == $p) {
                print $0
            }
        }
    }
}
Basically we set file = 1 when we are processing the first file and file = 2 when we are processing the second file. When we are in the first file, read the line into an array keyed on each field of the line. When we are in the second file, process the array (nums) and check if there is an entry for each field on the line. If there is, print it.
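Note that this matches on exact field values rather than testing full range overlap, which happens to be enough for the sample data; run against the question's files it should print:
3 4
8 13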

For GNU awk, since I'm controlling the for scanning order to optimize the runtime:
$ cat program.awk
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_desc"
}
NR==FNR { # hash file1 to a
    if(($2 in a==0) || $1<a[$2]) # avoid collisions
        a[$2]=$1
    next
}
{
    for(i in a) { # in desc order
        # print "DEBUG: For:",$0 ":", a[i], i # remove # for debug
        if(i+0>$1) { # next after
            if($1<=i+0 && a[i]<=$2) {
                print
                next
            }
        }
        else
            next
    }
}
Test data:
$ cat file1
0 3 # testing for completely overlapping ranges
1 4
5 7
8 11
12 15
$ cat file2
1 2 # testing for completely overlapping ranges
3 4
8 13
20 24
Output:
$ awk -f program.awk file1 file2
1 2
3 4
8 13
and
$ awk -f program.awk file2 file1
0 3
1 4
8 11
12 15

If a Perl solution is preferred, then the one-liner below would work:
/tmp> cat marla1.txt
1 4
5 7
8 11
12 15
/tmp> cat marla2.txt
3 4
8 13
20 24
/tmp> perl -lane ' BEGIN { %kv=map{split(/\s+/)} qx(cat marla2.txt) } { foreach(keys %kv) { if($F[0]==$_ or $F[1]==$kv{$_}) { print "$_ $kv{$_}" }} } ' marla1.txt
3 4
8 13
/tmp>

If the ranges are ordered according to their lower bounds, we can use this to make the algorithms more efficient. The idea is to alternately proceed through the ranges in file1 and file2. More precisely, when we have a certain range R in file2, we take further and further ranges in file1 until we know whether these overlap with R. Once we know this, we switch to the next range in file2.
#!/bin/bash
exec 3< "$1" # file whose ranges are checked for overlap with those ...
exec 4< "$2" # ... from this file, and if so, are written to stdout
l4=-1 # lower bound of current range read from fd 4 (i.e. from "$2")
u4=-1 # upper bound
# initialized with -1 so the first range is read on the first iteration
echo "Ranges in $1 that intersect any ranges in $2:"
while read l3 u3; do # read next range from fd 3 (i.e. from "$1")
    if (( u4 >= l3 )); then
        (( l4 <= u3 )) && echo "$l3 $u3"
    else # the current upper bound from fd 4 is below the lower bound from fd 3, so ...
        while read l4 u4; do # ... we read further ranges from fd 4 until ...
            if (( u4 >= l3 )); then # ... their upper bound is high enough
                (( l4 <= u3 )) && echo "$l3 $u3"
                break
            fi
        done <&4
    fi
done <&3
The script can be called with ./script.sh file2 file1
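For the sample files from the question, that call should print something like:
$ ./script.sh file2 file1
Ranges in file2 that intersect any ranges in file1:
3 4
8 13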

Related

Substitute values based on condition and compare values from multiple files with AWK

I've spent the day trying to figure this out but didn't succeed. I have two files like this:
File1:
chr id pos
14 ABC-00 123
13 AFC-00 345
5 AFG-99 988
File2:
index id chr
1 ABC-00 14
2 AFC-00 11
3 AFG-99 7
I want to check whether the chr value in File 1 differs from the chr value in File 2 for the same id; if it does, I want to print some columns from both files to get an output like the one below.
Expected output file:
ID OLD_chr(File1) NEW_chr(File2)
AFC-00 13 11
AFG-99 5 7
.....
Total number of position changes: 2
I got one caveat though. In File 1 I have to substitute some values in the $1 column before comparing the files. Like this:
30 and 32 >> X
31 >> Y
33 >> MT
Because in File 2 that's how those values are coded. And then compare the two files. How in the hell can I achieve this?
I've tried to recode File 1:
awk '{
if($1=30 || $1=32) gsub(/30|32/,"X",$1);
if($1=31) gsub(/31/,"Y",$1);
if($1=33) gsub(/33/,"MT",$1);
print $0
}' File 1 > File 1 Recoded
And I was trying to match the columns and print the output with:
awk 'NR==FNR{a[$1]=$1;next} (a[$1] !=$3){print $2, a[$1], $3 }' File 1 File 2 > output file
$ cat tst.awk
BEGIN {
    map[30] = map[32] = "X"
    map[31] = "Y"
    map[33] = "MT"
    print "ID", "Old_chr("ARGV[1]")", "NEW_chr("ARGV[2]")"
}
NR==FNR {
    a[$2] = ($1 in map ? map[$1] : $1)
    next
}
a[$2] != $3 {
    print $2, a[$2], $3
    cnt++
}
END {
    print "Total number of position changes: " cnt+0
}
$ awk -f tst.awk file1 file2
ID Old_chr(file1) NEW_chr(file2)
AFC-00 13 11
AFG-99 5 7
Total number of position changes: 2
Like this:
awk '
BEGIN{ # executed at the BEGINning
    print "ID OLD_chr("ARGV[1]") NEW_chr("ARGV[2]")"
}
FNR==NR{ # this code block for File1
    if ($1 == 30 || $1 == 32) $1 = "X"
    if ($1 == 31) $1 = "Y"
    if ($1 == 33) $1 = "MT"
    a[$2]=$1
    next
}
{ # this for File2
    if (a[$2] != $3) {
        print $2, a[$2], $3
        count++
    }
}
END{ # executed at the END
    print "Total number of position changes: " count+0
}
' File1 File2
ID OLD_chr(File1) NEW_chr(File2)
AFC-00 13 11
AFG-99 5 7
Total number of position changes: 2

AWK print every other column, starting from the last column (and next to last column) for N iterations (print from right to left)

Hopefully someone out there in the world can help me, and anyone else with a similar problem, find a simple solution to capturing data. I have spent hours trying a one liner to solve something I thought was a simple problem involving awk, a csv file, and saving the output as a bash variable. In short here's the nut...
The Missions:
1) To output every other column, starting from the LAST COLUMN, with a specific iteration count.
2) To output every other column, starting from NEXT TO LAST COLUMN, with a specific iteration count.
The Data (file.csv):
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
Desired results for Mission 1:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Desired results for Mission 2:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
My Attempts:
The closest I have come to solving any of the above problems is an ugly pipe (which is OK for skinning a cat) for Mission 1. However, it doesn't use any declared iterations (which should be 5). Also, I'm completely lost on solving Mission 2.
Any help to simplify the below and solving Mission 2 will be HELLA appreciated!
outcome=$( awk 'BEGIN {FS = "#"} {for (i = 0; i <= NF; i += 2) printf ("%s%c", $(NF-i), i + 2 <= NF ? "#" : "\n");}' file.csv | sed 's/##.*//g' | awk -F# '{for (i=NF;i>0;i--){printf $i"#"};printf "\n"}' | sed 's/#$//g' | awk -F# '{$1="";print $0}' OFS=# | sed 's/^#//g' );
Also, if doing a loop for a specific number of iterations is helpful in solving this problem, the magic number is 5. Maybe a solution could be a for-loop that counts from right to left, skipping every other column as one iteration, with the starting column declared as an awk variable (just a thought; I have no idea how to do it).
Thank you for looking over this problem.
There are certainly more elegant ways to do this, but I am not really an awk person:
Part 1:
awk -F# '{ x = ""; for (f = NF; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Part 2:
awk -F# '{ x = ""; for (f = NF - 1; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
The literal 5 in each of those is your "number of iterations."
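If you would rather not hard-code that literal, it can be passed in as an awk variable; here is a minimal sketch for Mission 1, where the variable name n is my choice rather than part of the answer above:
awk -F# -v n=5 '{ x = ""; for (f = NF; f > (NF - n * 2); f -= 2) { x = x ? $f "#" x : $f } print x }' file.csv
For Mission 2, the loop would start at NF - 1 instead of NF, exactly as in Part 2.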
Sample data:
$ cat mission.dat
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
One awk solution:
NOTE: OP can add logic to validate the input parameters.
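For instance, a minimal validation sketch (hypothetical, not part of the original answer; it checks the same two positional parameters the script below uses) could be placed near the top of the script:
case "${1}" in
    1|2) ;;
    *)   echo "usage: ${0##*/} { 1 | 2 } { number_of_fields_to_display }" >&2; exit 1 ;;
esac
[[ "${2}" =~ ^[1-9][0-9]*$ ]] || { echo "number_of_fields_to_display must be a positive integer" >&2; exit 1; }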
$ cat mission
#!/bin/bash
# format: mission { 1 | 2 } { number_of_fields_to_display }
mission=${1} # assumes user inputs "1" or "2"
offset=$(( mission - 1 )) # subtract one to determine awk/NF offset
iteration_count=${2} # assume for now this is a positive integer
awk -F"#" -v offset=${offset} -v itcnt=${iteration_count} 'BEGIN { OFS=FS }
{ # we will start by counting fields backwards until we run out of fields
# or we hit "itcnt==iteration_count" fields
loopcnt=0
for (i=NF-offset ; i>=0; i-=2) # offset=0 for mission=1; offset=1 for mission=2
{ loopcnt++
if (loopcnt > itcnt)
break
fstart=i # keep track of the field we want to start with
}
# now printing our fields starting with field # "fstart";
# prefix the first printf with a empty string, then each successive
# field is prefixed with OFS=#
pfx = ""
for (i=fstart; i<= NF-offset; i+=2)
{ printf "%s%s",pfx,$i
pfx=OFS
}
# terminate a line of output with a linefeed
printf "\n"
}
' mission.dat
Some test runs:
###### mission #1
# with offset/iteration = 4
$ mission 1 4
2.25#1.5#1#3.25
5.25#4#3#3.25
.2#.5#1#3.75
13.75#13#8.5#6
.2#.5#3#6.5
10.25#10.5#11#12.75
#with offset/iteration = 5
$ mission 1 5
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
# with offset/iteration = 6
$ mission 1 6
12#2#2.25#1.5#1#3.25
7#9#5.25#4#3#3.25
4#4#.2#.5#1#3.75
3#8#13.75#13#8.5#6
10#1#.2#.5#3#6.5
8#7#10.25#10.5#11#12.75
###### mission #2
# with offset/iteration = 4
$ mission 2 4
4#3#1#1
6#5#4#2
1#1#2#3
8#8#6#4
3#2#3#5
7#7#8#6
# with offset/iteration = 5
$ mission 2 5
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
# with offset/iteration = 6;
# notice we pick up field #1 = empty string so output starts with a '#'
$ mission 2 6
#SayWhat#4#3#1#1
#Smarty#6#5#4#2
#IfYouLike#1#1#2#3
#LaughingHard#8#8#6#4
#AtFunny#3#2#3#5
#PunchLines#7#7#8#6
This is probably not what you're asking for, but it will perhaps give you an idea.
$ awk -F_ -v skip=4 -v endoff=0 '
BEGIN {OFS=FS}
{offset=(NF-endoff)%skip;
for(i=offset;i<=NF-endoff;i+=skip) printf "%s",$i (i>=(NF-endoff)?ORS:OFS)}' file
112_116_120
122_126_130
132_136_140
142_146_150
You specify the column skip and the end offset as input variables. Here the end offset is set to zero (end on the last column) and the skip is 4.
For clarity I used the input file
$ cat file
_111_112_113_114_115_116_117_118_119_120
_121_122_123_124_125_126_127_128_129_130
_131_132_133_134_135_136_137_138_139_140
_141_142_143_144_145_146_147_148_149_150
Changing FS to match your format should work.

Accumulating different lines into different files in awk

I have a huge .txt file (15 GB) and having almost 30 million lines.
I want to put its lines into different files based on the 4th column. The number of unique values in the 4th column is around 2 million.
file1.txt
1 10 ABC KK-LK
1 33 23 KK-LK
2 34 32 CK-LK,LK
11 332 2 JK#
11 23 2 JK2
Right now, I can separate these lines into different files in the same folder as follows:
awk '{ print $0 >> $4"_sep.txt" }' file1.txt
And it results in 4 different files as:
KK-LK_sep.txt
1 10 ABC KK-LK
1 33 23 KK-LK
and
CK-LK,LK_sep.txt
2 34 32 CK-LK,LK
and
JK#_sep.txt
11 332 2 JK#
and finally,
JK2_sep.txt
11 23 2 JK2
What I want is to not put 2 million files in one folder, but to separate them into 20 different folders. I can make the folders as folder1, 2, 3, ...:
mkdir folder{1..20}
With the answers below, I suppose something like the broken code below would work:
#!/bin/env bash
shopt -s nullglob
numfiles=(*)
numfiles=${#numfiles[@]}
numdirs=(*/)
numdirs=${#numdirs[@]}
(( numfiles -= numdirs ))
echo $numfiles
var1=$numfiles
awk -v V1=var1 '{
    if(V1 <= 100000)
    {
        awk '{ print $0 >> $4"_sep.txt" }' file1.txt
    }
    else if(V1 => 100000)
    {
        cd ../folder(cnt+1)
        awk '{ print $0 >> $4"_sep.txt" }' file1.txt
    }
}'
But then, how can I make this a loop that stops adding to folder1 once it has 100,000 files in it, and starts adding files to folder2, and so on?
Maybe this is what you want (untested since your question doesn't include an example we can test against):
awk '
!($4 in key2out) {
    if ( (++numKeys % 100000) == 1 ) {
        dir = "dir" ++numDirs
        system("mkdir -p " dir)
    }
    key2out[$4] = dir "/" $4 "_sep.txt"
}
{ print > key2out[$4] }
' file1.txt
That relies on GNU awk to manage the number of open files internally. With other awks you'd need to change that last line to { print >> key2out[$4]; close(key2out[$4]) } or otherwise handle how many concurrently open files you have, to avoid getting a "too many open files" error. For example, if your $4 values are usually grouped together, then rather than opening and closing the output file on every single write, you could just do it when the $4 value changes:
awk '
$4 != prevKey { close(key2out[prevKey]) }
!($4 in key2out) {
    if ( (++numKeys % 100000) == 1 ) {
        dir = "dir" ++numDirs
        system("mkdir -p " dir)
    }
    key2out[$4] = dir "/" $4 "_sep.txt"
}
{ print >> key2out[$4]; prevKey=$4 }
' file1.txt
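For the five-line sample file1.txt in the question, all four distinct $4 values fall into the first 100,000-key bucket, so either variant should end up with a layout roughly like:
dir1/KK-LK_sep.txt
dir1/CK-LK,LK_sep.txt
dir1/JK#_sep.txt
dir1/JK2_sep.txt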
Something like this?
Count the unique keys and increment the bucket number after each 100,000-key threshold:
count += !keys[$4]++;
bucket=count/100000;
ibucket=int(bucket);
ibucket=ibucket==bucket?ibucket:ibucket+1;
folder="folder"ibucket

How to calculate the percent in linux

Sample input data:
Col1, Col2
120000,1261
120000,119879
120000,117737
120000,14051
200000,58411
200000,115292
300000,279892
120000,98572
250000,249598
120000,14051
......
I used Excel with the following steps:
Col3=Col2/Col1.
Format Col3 with percentage
Use countif to group by Col3
How can I do this task with awk or another tool on the Linux command line?
Expected result:
percent|count
0-20% | 10
21-50% | 5
51-100%| 10
I calculated the percentages, but I'm still trying to find a way to group by Col3:
cat input.txt |awk -F"," '$3=100*$2/$1'
awk approach:
awk 'BEGIN {
    FS=",";
    OFS="|";
}
(NR > 1){
    percent = 100 * $2 / $1;
    if (percent <= 20) {
        a["0-20%"] += 1;
    } else if (percent <= 50) {
        a2 += 1;
        a["21-50%"] += 1;
    } else {
        a["51-100%"] += 1;
    }
}
END {
    print "percent", "count"
    for (i in a) {
        print i, a[i];
    }
}' data
Sample output:
percent|count
0-20%|3
21-50%|1
51-100%|6
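One caveat: for (i in a) iterates in an unspecified order, so the rows are not guaranteed to come out sorted as shown. If a fixed order matters, the END block could print the labels explicitly, for example (a sketch, not part of the original answer):
END {
    print "percent", "count"
    n = split("0-20% 21-50% 51-100%", labels, " ")   # fixed label order
    for (k = 1; k <= n; k++)
        if (labels[k] in a)
            print labels[k], a[labels[k]]
}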
A generic, self-documented version. It needs some fine tuning of the group names in the result (whether to show +1% on the lower bounds or not, but that is not the real point):
awk -F ',' -v Step='0|20|50|100' '
BEGIN {
    # define the groups
    Gn = split( Step, aEdge, "|")
}
NR>1{
    # compute which percentage
    L = $2 * 100 / ($1>0 ? $1 : 1)
    # find which group it falls into
    for( j=1; ( L < aEdge[j] || L >= aEdge[j+1] ) && j < Gn;) j++
    # add to the group
    G[j]++
}
# print the result, ordered
END {
    print "percent|count"
    for( i=1;i<Gn;i++) printf( "%d-%d%%|%d\n", aEdge[i], aEdge[i+1], G[i])
}
' data
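For the sample data above, this should print:
percent|count
0-20%|3
20-50%|1
50-100%|6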
Another awk, with parametric bins and formatted output:
$ awk -F, -v OFS=\| -v bins='20,50,100' '
      BEGIN {n=split(bins,b)}
      NR>1  {for(i=1;i<=n;i++)
                 if($2/$1 <= b[i]/100)
                     {a[b[i]]++; next}}
      END   {print "percent","count";
             b[0]=-1;
             for(i=1;i<=n;i++)
                 printf "%-7s|%3s\n", b[i-1]+1"-"b[i]"%",a[b[i]]}' file
percent|count
0-20% | 3
21-50% | 1
51-100%| 6
Pure bash:
# arguments are histogram boundaries *in ascending order*
hist () {
    local lower=0$(printf '+(val*100>sum*%d)' "$@") val sum count n
    set -- 0 "$@" 100
    read -r
    printf '%7s|%5s\n' percent count
    while IFS=, read -r sum val; do echo $((lower)); done |
        sort -n | uniq -c |
        while read count n; do
            printf '%2d-%3d%%|%5d\n' "${@:n+1:2}" $count
        done
}
Example:
$ hist 20 50 < csv.dat
percent|count
0- 20%| 3
20- 50%| 1
50-100%| 6
Potential Issue: Does not print intervals with no values:
$ hist 20 25 45 50 < csv.dat
percent|count
0- 20%| 3
25- 45%| 1
50-100%| 6
Explanation:
lower is set to an arithmetic expression which counts how many of the given boundaries the percentage 100*val/sum exceeds
The list of intervals is augmented with 0 and 100 so that the limits print correctly
The header line is ignored
The output header is printed
For each csv row, read the variables $sum and $val and send the numeric evaluation of $lower (which uses those variables) to...
count the number of occurrences of each interval index...
and print the interval and count
Another, in GNU awk, using switch and regex to identify the values (since parsing was tagged in OP):
NR>1{
    switch(p=$2/$1){
    case /0\.[01][0-9]|\.20/:
        a["0-20%"]++;
        break;
    case /\.[2-4][0-9]|\.50/:
        a["21-50%"]++;
        break;
    default:
        a["51-100%"]++
    }
}
END{ for(i in a)print i, a[i] }
Run it:
$ awk -F, -f program.awk file
21-50% 1
0-20% 3
51-100% 6

in-file selection of a number and only keeping those lines starting with that number in linux

I have files with the format given below. Please note that the entries are space separated.
16402 8 3858 3877 3098 3099
3858 -9.0743538e+01 1.5161710e+02 -5.4964638e+00
3244 -9.7903877e+01 1.8551400e-13 1.0194137e+01
3877 -9.2467590e+01 1.5160857e+02 -5.4969416e+00
4330 -9.3877419e+01 8.8259323e+01 -5.4966841e+00
3098 -9.2476135e+01 1.5336685e+02 -5.4963140e+00
5431 -6.1601208e+01 3.3540974e+01 1.0309820e+01
3099 -9.0752136e+01 1.5337535e+02 -5.4963264e+00
3600 -6.3099121e+01 1.3944173e+02 -5.4964156e+00
5418 -6.6785469e+01 2.9993099e+01 1.0291004e+01
There are lines with 6 entries and lines with 4 entries. In the lines with 6 entries, the last 4 entries are node numbers; in the lines with 4 entries, those node numbers are listed with their spatial coordinates. I want to keep only those nodes in the 4-entry lines which are listed in the 6-entry lines and delete all the others, so that my file would look like
16402 8 3858 3877 3098 3099
3858 -9.0743538e+01 1.5161710e+02 -5.4964638e+00
3877 -9.2467590e+01 1.5160857e+02 -5.4969416e+00
3098 -9.2476135e+01 1.5336685e+02 -5.4963140e+00
3099 -9.0752136e+01 1.5337535e+02 -5.4963264e+00
This file was already created by some data processing, so keeping the format is important. I have thousands of 6-entry and 4-entry lines in a file, so a general solution would be helpful for me to learn and apply to other cases too. Any suggestion with sed or awk?
thanks
I would store the 4 numbers in an array, and then test that $1 occurs in the array.
awk '
NF == 6 {
    delete n
    for (i=3; i<=NF; i++)
        n[$i]=1
    print
    next
}
$1 in n
' file
If the 6-field lines consistently appear before the 4-field lines they select, then
awk 'NF == 6 { for(i = 3; i <= 6; ++i) a[$i]; print } NF == 4 && $1 in a' filename
will work. That is as follows:
NF == 6 {                             # in a six-field line:
    for(i = 3; i <= 6; ++i) a[$i]     # remember the relevant fields
    print
}
NF == 4 && $1 in a                    # and subsequently select four-field lines
                                      # by them
Otherwise, you'll need a second pass over the file and handle the six-field lines in the first and the four-field lines in the second pass:
awk 'NR == FNR && NF == 6 { for(i = 3; i <= 6; ++i) a[$i]; print } FNR != NR && NF == 4 && $1 in a' filename filename
You can use the following awk script:
awk 'NF==6{print;b=b" "$3" "$4" "$5" "$6}NF==4{if(b ~ "\\y"$1"\\y") print}' input.txt
Explanation:
The command manages a buffer, b, which accumulates the last 4 fields of every line with six columns. Every time awk encounters a line with six columns, it prints that line and appends those fields to b.
If a line with 4 columns is encountered, awk checks whether b contains the value of the first field $1, using a regex match with word boundaries (\y, a GNU awk extension).
Output:
16402 8 3858 3877 3098 3099
3858 -9.0743538e+01 1.5161710e+02 -5.4964638e+00
3877 -9.2467590e+01 1.5160857e+02 -5.4969416e+00
3098 -9.2476135e+01 1.5336685e+02 -5.4963140e+00
3099 -9.0752136e+01 1.5337535e+02 -5.4963264e+00
Note: if it is safe to assume that a line with 6 columns applies only to the 4-column lines that follow it, until the next 6-column line appears, the command can be changed to reset the buffer instead of appending to it:
awk 'NF==6{print;b=" "$3" "$4" "$5" "$6}NF==4{if(b ~ "\\y"$1"\\y") print}' input.txt
which would perform a lot better, since the maximum buffer size will be only a single line.

Resources