Extracting several rows with overlap using awk - linux

I have a big file that looks like this (it actually has 12368 rows):
Header
175566717.000
175570730.000
175590376.000
175591966.000
175608932.000
175612924.000
175614836.000
.
.
.
175680016.000
175689679.000
175695803.000
175696330.000
What I want to do is delete the header, then extract the first 2000 lines (lines 1 to 2000), then lines 1501 to 3500, then 3001 to 5000, and so on...
What I mean is: extract windows of 2000 lines with an overlap of 500 lines between contiguous windows, until the end of the file.
From a previous post, I got this:
tail -n +2 myfile.txt | awk 'BEGIN{file=1} ++count && count==2000 {print > "window"file; file++; count=500} {print > "window"file}'
But that isn't what I want: there is no 500-line overlap, and my first window has 1999 rows instead of 2000.
Any help would be appreciated

awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}
END{while(i<NR-1){for(k=i;k<i+t && k in a;k++)print a[k] > i".txt"; close(i".txt");i=i+t-d}}' file
Try the line above; you can change the numbers to fit your requirement, and you can define your own filenames too.
A little test with t=10 (your 2000) and d=5 (your 500):
kent$ cat f
header
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
kent$ awk -v i=1 -v t=10 -v d=5 'NR>1{a[NR-1]=$0}END{while(i<NR-1){for(k=i;k<i+t && k in a;k++)print a[k] > i".txt"; close(i".txt");i=i+t-d}}' f
kent$ head *.txt
==> 1.txt <==
1
2
3
4
5
6
7
8
9
10
==> 6.txt <==
6
7
8
9
10
11
12
13
14
15
==> 11.txt <==
11
12
13
14
15

awk is not ideal for this. In Python you could do something like:
with open("data") as fin:
    lines = fin.readlines()

# remove header
lines = lines[1:]

# print the windows: 2000 lines each, overlapping by 500
i = 0
while True:
    print("\nstarting window")
    if len(lines) < i + 2000:
        # we're done; whatever is left over in the file is ignored
        break
    for line in lines[i:i + 2000]:
        print(line[:-1])  # strip the trailing \n
    i += 2000 - 500

Reading the entire file into memory is usually not a great idea, and in this case is not necessary. Given a line number, you can easily compute which files it should go into. For example:
awk '{
    a = int( NR / (t-d) )
    b = int( (NR-t) / (t-d) )
    for ( f = b; f <= a; f++ ) {
        if ( f >= 0 && (f * (t-d)) < NR && NR <= f * (t-d) + t )
            print > ("window" (f+1))
    }
}' t=2000 d=500
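To drop the header as the question requires, feed the script via tail. A minimal sketch, assuming the program above is saved as windows.awk (a hypothetical filename):

tail -n +2 myfile.txt | awk -v t=2000 -v d=500 -f windows.awk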

Related

How to write a code for more than one file in awk

I wrote a script in AWK called exc7
./exc7 file1 file2
In every file there is a matrix
file1 :
2 6 7
10 5 4
3 8 4
file2:
-60 10
10 -60
The code that I wrote is :
#!/usr/bin/awk -f
{
for (i=1;i<=NF;i++)
A[NR,i]=$i
}
END{
for (i=1;i<=NR;i++){
sum += A[i,1]
}
for (i=1;i<=NF;i++)
sum2 += A[1,i]
for (i=0;i<=NF;i++)
sum3 += A[NR,i]
for (i=0;i<=NR;i++)
sum4 += A[i,NF]
print sum,sum2,sum3,sum4
if (sum==sum2 && sum==sum3 && sum==sum4)
print "yes"
}
It should check, for every file, whether the sum of the first column, the last column, the first line, and the last line is the same. It should print the four sums and say yes if they are all equal. Then it should print the largest sum over all numbers in all the files.
When I try it on one file it is right; for example, on file1 it prints:
15 15 15 15
yes
But when I try it on two or more files, like file1 file2, the output is:
-35 8 -50 -31
You should use FNR instead of NR, and with gawk you can use ENDFILE instead of END. However, the following should work with any awk:
awk 'function sumline(last, rn) {
         n = split(last, lr)
         for (i = 1; i <= n; i++) rn += lr[i]
         return rn
     }
     function printresult(c1, r1, rn, cn) {
         print c1, r1, rn, cn
         print (r1 == rn && c1 == cn && r1 == c1) ? "yes" : "no"
     }
     FNR == 1 {
         if (last) { rn = sumline(last); printresult(c1, r1, rn, cn) }
         rn = cn = c1 = 0
         r1 = sumline($0)
     }
     { c1 += $1; cn += $NF; last = $0 }
     END { rn = sumline(last); printresult(c1, r1, rn, cn) }' file1 file2
15 15 15 15
yes
-50 -50 -50 -50
yes
Essentially, instead of checking for the end of a file, you check for the start of the next one and print the previous file's results. The first file needs to be treated differently, and you still need the END block to handle the last file.
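For completeness, a minimal sketch of the gawk-only variant mentioned above, using the BEGINFILE/ENDFILE extensions so that neither the first nor the last file needs special-casing:

gawk '
BEGINFILE { c1 = cn = r1 = 0; last = "" }          # reset the sums for each file
FNR == 1  { for (i = 1; i <= NF; i++) r1 += $i }   # sum of the first row
          { c1 += $1; cn += $NF; last = $0 }       # column sums; remember the last row
ENDFILE   {
    rn = 0
    n = split(last, f)
    for (i = 1; i <= n; i++) rn += f[i]            # sum of the last row
    print c1, r1, rn, cn
    print (c1 == r1 && r1 == rn && rn == cn ? "yes" : "no")
}' file1 file2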
UPDATE
Based on the questions you asked, I think it's better for you to keep your script as is and change the way you call it.
for file in file1 file2; do
    echo "$file"
    ./exc7 "$file"
done
You'll be calling the script once per file, so all the multi-file complications go away.
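The question also asks for the largest sum over all the files; a hedged add-on to the loop above (assuming the four sums are the only 4-field lines the script prints):

for file in file1 file2; do
    echo "$file"
    ./exc7 "$file"
done | awk '{ print }   # pass the per-file output through
            NF == 4 { for (i = 1; i <= NF; i++) if (!seen++ || $i + 0 > max) max = $i + 0 }
            END { print "largest sum:", max }'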

NTILE a column in csv - Linux

I have a csv file that reads like this:
a,b,c,2
d,e,f,3
g,h,i,3
j,k,l,4
m,n,o,5
p,q,r,6
s,t,u,7
v,w,x,8
y,z,zz,9
I want to assign quintiles to this data (like NTILE in SQL), preferably using a bash command on Linux. The quintiles, if assigned as a new column, will make the final output look like:
a,b,c,2, 1
d,e,f,3, 1
g,h,i,3, 2
j,k,l,4, 2
m,n,o,5, 3
p,q,r,6, 3
s,t,u,7, 4
v,w,x,8, 4
y,z,zz,9, 5
The only thing I am able to achieve is adding a new incremental column to the csv file:
awk '{$3=","a[$3]++}1' f1.csv > f2.csv
But I am not sure how to do the quintiles. Please help. Thanks.
awk '{ a[NR] = $0 }
END{
    for (i = 1; i <= NR; i++) {
        p = 100 / NR * i
        q = 1
        if (p > 20) { q = 2 }
        if (p > 40) { q = 3 }
        if (p > 60) { q = 4 }
        if (p > 80) { q = 5 }
        print a[i] ", " q
    }
}' file
Output:
a,b,c,2, 1
d,e,f,3, 2
g,h,i,3, 2
j,k,l,4, 3
m,n,o,5, 3
p,q,r,6, 4
s,t,u,7, 4
v,w,x,8, 5
y,z,zz,9, 5
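Note that these cut-offs differ slightly from SQL's NTILE, which hands the larger buckets to the earlier rows (sizes 2,2,2,2,1 for 9 rows, which is exactly the desired output in the question). A two-pass sketch with NTILE(5) semantics, reading the file twice:

awk 'NR == FNR { n = NR; next }                      # first pass: count the lines
     { print $0 ", " (int((FNR - 1) * 5 / n) + 1) }  # second pass: bucket = 1 + floor((row-1)*5/n)
' file file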
Short wc + awk approach:
awk -v n=$(wc -l < file) \
    'BEGIN{ OFS=","; n = sprintf("%.f", n * 0.2); c = 1 }
     { $(NF+1) = " " c } !(NR % n){ ++c } 1' file
n=$(wc -l < file) - get the total number of lines of the input file file
n*0.2 - one fifth (20 percent) of the range
$(NF+1)=" "c - append the current rank value c as a new last field
The output:
a,b,c,2, 1
d,e,f,3, 1
g,h,i,3, 2
j,k,l,4, 2
m,n,o,5, 3
p,q,r,6, 3
s,t,u,7, 4
v,w,x,8, 4
y,z,zz,9, 5

How to compare 2 lists of ranges in bash?

Using bash script (Ubuntu 16.04), I'm trying to compare 2 lists of ranges: does any number in any of the ranges in file1 coincide with any number in any of the ranges in file2? If so, print the row in the second file. Here I have each range as 2 tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.
file1:
1 4
5 7
8 11
12 15
file2:
3 4
8 13
20 24
Desired output:
3 4
8 13
My best attempt has been:
awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next};
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then
{print $1, $2}}}}}' file1 file2 > output.txt
This returns an empty file.
I'm thinking that the script will need to involve range comparisons using if-then conditions and iterate through each line in both files. I've found examples of each concept, but can't figure out how to combine them.
Any help appreciated!
It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:
declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges

# read ranges in second file, store them in arrays min and max
while read a b; do
    min+=( "$a" )
    max+=( "$b" )
done < file2

# read ranges in first file
while read a b; do
    # loop over indexes of min (and max) array
    for i in "${!min[@]}"; do
        if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
            echo "${min[i]} ${max[i]}"  # print range
            unset min[i] max[i]         # performance optimization
        fi
    done
done < file1
This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.
EDIT 1: improved the range overlap test.
EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.
EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:
awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next }
     { for (i in a)
           if ($1 <= b[i] && a[i] <= $2) {
               print a[i], b[i]; delete a[i]; delete b[i]
           }
     }' file2 file1
awk solution:
awk 'NR==FNR{ a[$1]=$2; next }
     { for (i in a)
           if ($1 <= a[i]+0 && $2 >= i+0) {
               print i, a[i]; delete a[i]
           }
     }' file2 file1
The output:
3 4
8 13
awk 'FNR == 1 && NR == 1 { file=1 } FNR == 1 && NR != 1 { file=2 } file ==1 { for (q=1;q<=NF;q++) { nums[$q]=$0} } file == 2 { for ( p=1;p<=NF;p++) { for (i in nums) { if (i == $p) { print $0 } } } }' file1 file2
Break down:
FNR == 1 && NR == 1 {
    file = 1
}
FNR == 1 && NR != 1 {
    file = 2
}
file == 1 {
    for (q = 1; q <= NF; q++) {
        nums[$q] = $0
    }
}
file == 2 {
    for (p = 1; p <= NF; p++) {
        for (i in nums) {
            if (i == $p) {
                print $0
            }
        }
    }
}
Basically we set file = 1 when we are processing the first file and file = 2 when we are processing the second file. When we are in the first file, read the line into an array keyed on each field of the line. When we are in the second file, process the array (nums) and check if there is an entry for each field on the line. If there is, print it.
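A more compact sketch of the same idea: let FNR==1 bump a file counter, and test membership with awk's in operator instead of scanning the array keys (this also prints each matching line only once, even if several of its fields match):

awk 'FNR == 1 { file++ }                                   # FNR resets at each new file
     file == 1 { for (q = 1; q <= NF; q++) nums[$q] = 1 }  # remember every field of file1
     file == 2 { for (p = 1; p <= NF; p++)
                     if ($p in nums) { print; next } }     # print on the first match
' file1 file2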
For GNU awk, as I'm controlling the for scanning order to optimize runtime:
$ cat program.awk
BEGIN {
    PROCINFO["sorted_in"] = "#ind_num_desc"
}
NR == FNR {                        # hash file1 to a
    if (!($2 in a) || $1 < a[$2])  # avoid collisions
        a[$2] = $1
    next
}
{
    for (i in a) {                 # in desc order
        # print "DEBUG: For:", $0 ":", a[i], i  # uncomment for debugging
        if (i+0 >= $1) {           # still a candidate while the upper bound is not below $1
            if ($1 <= i+0 && a[i] <= $2) {
                print
                next
            }
        }
        else
            next
    }
}
Test data:
$ cat file1
0 3 # testing for completely overlapping ranges
1 4
5 7
8 11
12 15
$ cat file2
1 2 # testing for completely overlapping ranges
3 4
8 13
20 24
Output:
$ awk -f program.awk file1 file2
1 2
3 4
8 13
and
$ awk -f program.awk file2 file1
0 3
1 4
8 11
12 15
If a Perl solution is preferred, then the one-liner below would work:
/tmp> cat marla1.txt
1 4
5 7
8 11
12 15
/tmp> cat marla2.txt
3 4
8 13
20 24
/tmp> perl -lane ' BEGIN { %kv=map{split(/\s+/)} qx(cat marla2.txt) } { foreach(keys %kv) { if($F[0]==$_ or $F[1]==$kv{$_}) { print "$_ $kv{$_}" }} } ' marla1.txt
3 4
8 13
/tmp>
If the ranges are ordered according to their lower bounds, we can use this to make the algorithms more efficient. The idea is to alternately proceed through the ranges in file1 and file2. More precisely, when we have a certain range R in file2, we take further and further ranges in file1 until we know whether these overlap with R. Once we know this, we switch to the next range in file2.
#!/bin/bash

exec 3< "$1" # file whose ranges are checked for overlap with those ...
exec 4< "$2" # ... from this file, and if so, are written to stdout

l4=-1 # lower bound of current range from file 2
u4=-1 # upper bound
# initialized with -1 so the first range is read on the first iteration

echo "Ranges in $1 that intersect any ranges in $2:"
while read l3 u3; do                # read next range from file 1
    if (( u4 >= l3 )); then
        (( l4 <= u3 )) && echo "$l3 $u3"
    else # the upper bound from file 2 is below the lower bound from file 1, so ...
        while read l4 u4; do        # ... we read further ranges from file 2 until ...
            if (( u4 >= l3 )); then # ... their upper bound is high enough
                (( l4 <= u3 )) && echo "$l3 $u3"
                break
            fi
        done <&4
    fi
done <&3
The script can be called with ./script.sh file2 file1
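As noted above, this assumes both files are sorted numerically by their lower bounds; if they are not, a preprocessing step along these lines would be needed first:

sort -n -k1,1 file1 > file1.sorted
sort -n -k1,1 file2 > file2.sorted
./script.sh file2.sorted file1.sorted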

In bash how can I split a column into several columns of fixed dimension

How can I split a single column into several columns of fixed dimension? For example, I have a column like this:
1
2
3
4
5
6
7
8
and for a size of e.g. 4, I want to obtain
1 5
2 6
3 7
4 8
or for a size of e.g. 2, I want to obtain
1 3 5 7
2 4 6 8
Using awk:
awk '
BEGIN {
    # number of rows to print
    n = 4
}
{
    # append to the entry with key = 0, 1, 2, 3, 0, 1, 2, ...
    l[(NR-1) % n] = l[(NR-1) % n] " " $0
}
END {
    # print the array
    for (i = 0; i < length(l); i++) {
        print l[i]
    }
}
' file
OK, this is a bit long winded and not infallible but the following should work:
td=$( mktemp -d ); split -l <rows> <file> ${td}/x ; paste ${td}/x* ; rm -rf ${td}; unset td
Where <rows> is the number of rows you want and <file> is your input file.
Explanation:
td=$( mktemp -d )
Creates a temporary directory so that we can put temporary files into it. Store this in td - it's possible that your shell has a td variable already but if you sub-shell for this your scope should be OK.
split -l <rows> <file> ${td}/x
Split the original file into many smaller files, each <rows> lines long. These will be put into your temp directory, and all of them will be prefixed with x.
paste ${td}/x*
Paste these files side by side, so that consecutive chunks end up in consecutive columns (the glob expands in sorted order, which matches the order split created the chunks in).
rm -rf ${td}
Remove the files and directory.
unset td
Clean the environment.
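For the sample input above with <rows> set to 4 (assuming the input is in a file named file), a concrete run would look like:

td=$( mktemp -d )
split -l 4 file ${td}/x   # creates ${td}/xaa (lines 1-4) and ${td}/xab (lines 5-8)
paste ${td}/x*            # prints the chunks as tab-separated columns: 1 5, 2 6, 3 7, 4 8
rm -rf ${td}; unset td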
Assuming you know the number of rows in your column (here, 8):
n=8
# to get output with 4 rows:
seq $n | pr -ts" " -$((n/4))
1 5
2 6
3 7
4 8
# to get output with 2 rows:
seq $n | pr -ts" " -$((n/2))
1 3 5 7
2 4 6 8
If you know the desired output width you can use column.
# Display in columns for an 80 column display
cat file | column -c 80
$ cat tst.awk
{ a[NR] = $0 }
END {
    OFS = ","
    numRows = (numRows ? numRows : 1)
    numCols = ceil(NR / numRows)
    for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
        for ( colNr=1; colNr<=numCols; colNr++ ) {
            idx = rowNr + ( (colNr - 1) * numRows )
            printf "%s%s", a[idx], (colNr<numCols ? OFS : ORS)
        }
    }
}
function ceil(x, y) { y = int(x); return (x > y ? y+1 : y) }
$ awk -v numRows=2 -f tst.awk file
1,3,5,7
2,4,6,8
$ awk -v numRows=4 -f tst.awk file
1,5
2,6
3,7
4,8
Note that the above produces a CSV with the same number of fields in every row, even when the number of input rows isn't an exact multiple of the desired number of output rows:
$ seq 10 | awk -v numRows=4 -f tst.awk
1,5,9
2,6,10
3,7,
4,8,
See https://stackoverflow.com/a/56725452/1745001 for how to do the opposite, i.e. generate a number of rows given a specified number of columns.

in-file selection of a number and only keeping those lines starting with that number in linux

I have files with the format given below. Please note that the entries are space separated.
16402 8 3858 3877 3098 3099
3858 -9.0743538e+01 1.5161710e+02 -5.4964638e+00
3244 -9.7903877e+01 1.8551400e-13 1.0194137e+01
3877 -9.2467590e+01 1.5160857e+02 -5.4969416e+00
4330 -9.3877419e+01 8.8259323e+01 -5.4966841e+00
3098 -9.2476135e+01 1.5336685e+02 -5.4963140e+00
5431 -6.1601208e+01 3.3540974e+01 1.0309820e+01
3099 -9.0752136e+01 1.5337535e+02 -5.4963264e+00
3600 -6.3099121e+01 1.3944173e+02 -5.4964156e+00
5418 -6.6785469e+01 2.9993099e+01 1.0291004e+01
There are lines with 6 entries and lines with 4 entries. In the lines with 6 entries, the last 4 entries are node numbers; the lines with 4 entries give those node numbers with their spatial coordinates. I want to keep only those 4-entry lines whose node numbers are listed in the 6-entry lines, and delete all the others, so that my file would look like:
16402 8 3858 3877 3098 3099
3858 -9.0743538e+01 1.5161710e+02 -5.4964638e+00
3877 -9.2467590e+01 1.5160857e+02 -5.4969416e+00
3098 -9.2476135e+01 1.5336685e+02 -5.4963140e+00
3099 -9.0752136e+01 1.5337535e+02 -5.4963264e+00
This file was already created by some data processing, so keeping the format is important. I have thousands of 6-entry and 4-entry lines in a file, so a general solution would be helpful for me to learn and apply to other cases too. Any suggestions with sed or awk?
thanks
I would store the 4 numbers in an array, and then test whether $1 occurs in the array.
awk '
NF == 6 {
    delete n
    for (i = 3; i <= NF; i++)
        n[$i] = 1
    print
    next
}
$1 in n
' file
If the 6-field lines consistently appear before the 4-field lines they select, then
awk 'NF == 6 { for(i = 3; i <= 6; ++i) a[$i]; print } NF == 4 && $1 in a' filename
will work. That is as follows:
NF == 6 {                          # in a six-field line:
    for (i = 3; i <= 6; ++i) a[$i] # remember the relevant fields
    print
}
NF == 4 && $1 in a                 # and subsequently select four-field lines by them
Otherwise, you'll need a second pass over the file and handle the six-field lines in the first and the four-field lines in the second pass:
awk 'NR == FNR && NF == 6 { for(i = 3; i <= 6; ++i) a[$i]; print } FNR != NR && NF == 4 && $1 in a' filename filename
You can use the following awk script:
awk 'NF==6{print;b=b" "$3" "$4" "$5" "$6}NF==4{if(b ~ "\\y"$1"\\y") print}' input.txt
Explanation:
The command manages a buffer b which contains the last 4 fields of every line with six columns. Every time awk enters a line with six columns, it prints that line and appends those fields to b.
When a line with 4 columns is read, awk checks whether b contains the value of the first field $1, using a regular-expression match. The \y is GNU awk's word-boundary anchor, so 385 will not falsely match 3858.
Output:
16402 8 3858 3877 3098 3099
3858 -9.0743538e+01 1.5161710e+02 -5.4964638e+00
3877 -9.2467590e+01 1.5160857e+02 -5.4969416e+00
3098 -9.2476135e+01 1.5336685e+02 -5.4963140e+00
3099 -9.0752136e+01 1.5337535e+02 -5.4963264e+00
Note: if it is safe to assume that a line with 6 columns applies only to the following lines with 4 columns, up to the next line with 6 columns, the command can be changed to reset the buffer rather than append to it:
awk 'NF==6{print;b=" "$3" "$4" "$5" "$6}NF==4{if(b ~ "\\y"$1"\\y") print}' input.txt
which would perform a lot better, since the buffer never holds more than a single line's worth of node numbers.
