awk - all rows where half of columns are bigger than x - linux

As the title suggests I'm trying to find all rows in an large tsv file, where at least 50% of the columns have a value bigger than a value x using awk.
E.g for x=5:
9 6 7 2 3
0 1 2 7 6
1 3 8 9 10
should return
9 6 7 2 3
1 3 8 9 10

awk to the rescue!
$ awk -v t=5 '{c=0; for(i=1;i<=NF;i++) c+=($i>t)} c/NF>0.5' file
9 6 7 2 3
1 3 8 9 10

Using Perl:
perl -ane '$x = 5; print if #F / 2 <= grep $_ > $x, #F' -- file.tsv

Using an input .tsv file which looks like this:
Num1 Num2 Num3 Num4 Num5
9 6 7 2 3
0 1 2 7 6
1 3 8 9 10
This code will do it in a awk script. I've left comments to see
the form of a script so you can adjust accordingly.
#!/usr/bin/awk -f
# reads from stdin.
# Usage: $ ./bigcols.awk < input1.tsv
# Run at start.
BEGIN {
# print "Start"
# print "TSV setting. Field seperator set to tab."
FS = "\t"
# He wants to find lines with avg greater than var x
x=5
}
# main. Run for each record. This code uses newlines to denote records.
{
# Find lines which are of this form: (skip header)
# #+,
# ie. start with one or more numbers in column 1.
if ($1 ~ /^[0-9]+/) {
the_avg = ($1 + $2 + $3 + $4 + $5)/5
if (the_avg > x) {
print $1, $2, $3, $4, $5
}
}
}
# run at end
#END { print "Stop" }

Related

how do I remove gaps between int values in a file?

Given a file containing two columns of integers, I want to get rid of the gaps between the integer values. By gap I mean that if we take two integers A and B, in a way that there is no C such as A
1 2
1 3
2 5
6 9
3 5
7 9
11 6
7 11
to this:
1 2
1 3
2 4
5 7
3 4
6 7
8 5
6 8
In the first two columns, the present integers are {1,2,3,5,6,7,9,11}. The missing values are {4,8,10}. the goal is to decrease every integer by the number of missing values that are smaller than it.
so 5,6 and 7 are decreased by 1, 9 us decreased by 2, and 11 is decreased by 3.
so the values {1,2,3,5,6,7,9,11} are replaced by {1,2,3,4,5,6,7,8}.
does anyone know how to do it efficiently, using a linux command, a bash script or and awk command?
Thank you!
Edit:
I tried to do it but I didn't find a way to do it in a shell script, I had to write a c program which executes shell scripts.
the first part just sorts the file, the second, does what I talked about in the question.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#define MAX_INTS 100000000
void process_file(char *path){
//FIRST PART
char *outfpath="tmpfile";
char *command=calloc(456+3*strlen(path)+strlen(outfpath),sizeof(char));
sprintf(command,"#!/bin/bash \nvar1=$( cat %s | head -n 4 && ( cat %s | tail -n +5 | awk '{split( $0, a, \" \" ); asort( a ); for( i = 1; i <= length(a); i++ ) printf( \"%c%c \", a[i] ); printf( \"\\n\" ); }' | sort -n -k1,1 -k2 | uniq) )\nvar2=$( ( (echo \"$var1\" | tail -n +5 | cut -f 1 -d\" \") && (echo \"$var1\" | tail -n +5 | cut -f 2 -d\" \" ) ) | sort -n -k1,1 | uniq | awk '{for(i=p+1; i<$1; i++) print i} {p=$1}' )\necho \"$var1\" > %s\necho \"$var2\"| tr \"\\n\" \" \" > %s",path,path,'%','s',path,outfpath);
if(system(command)==-1){
fprintf(stderr,"Erreur à l'exécution de la commande \n%s\n",command);
}
//the first part only sorts the file and puts in outpath the list of the missing integers
//SECOND PART
long unsigned start=0,end=0,val,index=0;
long unsigned *intvals=calloc(MAX_INTS,sizeof(long unsigned));
FILE *f=fopen(outfpath,"r");
//reads the files and loads the missing ints to the array intvals
while(fscanf(f,"%lu ",&val)==1){
end=index;
intvals[index]=val;
index++;
}
if (index==0) return;
intvals=realloc(intvals,index*sizeof(long unsigned));
fclose(f);
free(command);
f=fopen(path,"r+w");
char *line=calloc(1000,sizeof(char));
command=calloc(1000,sizeof(char));
char *str;
long unsigned v1,v2,
d1=0,d2=0,
c=0,prec=-1,start_l=0;
int pos1, pos2;
//read a file containing two columns of ints
//for each pair v1 v2, count d1 d2,
//such as d1 is the number of missing values smaller than v1, d2 the number of missing values smaller than v2
//and overrwrite the line in the file using sed with the values v1-d1 and v2-d2
while(fgets(line,1000,f)!=NULL && line[0]=='#'){ continue; }
do{
str=strtok(line," \t");
v1=atoi(str);
str=strtok(NULL," \t");
v2=atoi(str);
if(prec!=v1) {
prec=v1;
d2=d1;
start_l=start;
}
for(index=start;index<=end;index++){
if(intvals[index]<v1){
d1++;
start++;
c=1;
}else{
start=d1;
break;
}
}
for(index=start_l;index<=end;index++){
if(intvals[index]<v2){
d2++;
start_l++;
c=1;
}else{
break;
}
}
if(c){
sprintf(command,"sed -i 's/%lu %lu/%lu %lu/' %s",v1,v2,v1-d1,v2-d2,path);
if(system(command)==-1){
fprintf(stderr,"Erreur à l'exécution de la commande \n%s\n",command);
}
}
c=0;
}while(fgets(line,1000,f)!=NULL);
fclose(f);
free(command);
free(line);
free(intvals);
}
int main(int argc,char* argv[]){
process_file(argv[1]);
return 0;
}
This might do it:
awk '(NR==FNR){for(i=1;i<=NF;++i) {a[$i]; max=(max<$i?$i:max)};next}
(FNR==1) {for(i=1;i<=max;++i) if(i in a) a[i]=++c }
{for(i=1;i<=NF;++i) $i=a[$i]}1' file file
If file has as input:
1 2
1 3
2 5
6 9
3 5
7 9
11 6
7 11
The above command will return:
1 2
1 3
2 4
5 7
3 4
6 7
8 5
6 8
The idea of this method is to keep track of an array a which is indexed by the old value and returns the new value : a[old]=new. We scan the file twice and store all possible values in a[old]. When we read the file for the second time, we first check what the new values are going to be. When that is done, we just update all the fields with the new values and print the result.
The above can also be done by reading the file a single time, you just need to buffer a bit:
awk '{b[FNR]=$0;for(i=1;i<=NF;++i) {a[$i]; max=(max<$i?$i:max)}}
END {
for(i=1;i<=max;++i) if(i in a) a[i]=++c
for(n=1;n<=FNR;++n) {
$0=b[n]
for(i=1;i<=NF;++i) $i=a[$i]
print
}
}' file
Using GNU awk and asorti():
$ gawk '{ # GNU awk only or implement sort
a[$1];a[$2] # hash field values to a array
f1[NR]=$1;f2[NR]=$2 # hash fields $1 and $2 index on NR
}
END { # after all data is hashed
asorti(a,a,"#ind_num_asc") # sort index of a where the values are
for(i in a) # make a reverse map
b[a[i]]=i
for(i=1;i<=NR;i++) # iterate the stored "records"
print b[f1[i]],b[f2[i]] # print and fetch from reverse map
}' file
a[] stores the uniques field values: a[6] a[5] then asorti() re-indexes a[]: a[1]=5 a[2]=6 and we get correponding new values. b[] is reverse mapping of a[]: b[5]=1 b[6]=2 which is used to get new values for old field values when outputing.
Output:
1 2
1 3
2 4
5 7
3 4
6 7
8 5
6 8
Assuming your input looks like this:
input.txt
2 1
4 3
5 5
6 2
1 4
8 7
9 6
7 9
Note: No 3 in col1 and 8 in col2 just to make it easier to track.
Then sort each column individually and store it:
$sort -k1,1 input.txt | awk '{ print $1}' > 1_sorted
$cat 1_sorted
1
2
4
5
6
7
8
9
$sort -k2,2 input.txt | awk '{ print $2}' > 2_sorted
$cat 2_sorted
1
2
3
4
5
6
7
9
Now just merge the two files:
$paste -d' ' 1_sorted 2_sorted > merged_again
$ cat merged_again
1 1
2 2
4 3
5 4
6 5
7 6
8 7
9 9
There might be a more performant / elegant method but I can't think of it right now.

Insert a row and a column in a matrix using awk

I have a gridded dataset with 250 rows x 300 columns in matrix form:
ifile.txt
2 3 4 1 2 3
3 4 5 2 4 6
2 4 0 5 0 7
0 0 5 6 3 8
I would like to insert the latitude values at the first column and longitude values at the top. Which looks like:
ofile.txt
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The increment is 0.33
I can do it for a small size matrix in manually, but I can't able to get any idea how to get my output in my desired format. I was writing a script in the following way, but completely useless.
echo 20 > latitude.txt
for i in `seq 1 250`;do
i1=$(( i + 0.33 )) #bash can't recognize fractions
echo $i1 >> latitude.txt
done
echo 100 > longitude.txt
for j in `seq 1 300`;do
j1=$(( j + 0.33 ))
echo $j1 >> longitude.txt
done
paste longitude.txt ifile.txt > dummy_file.txt
cat latitude.txt dummy_file.txt > ofile.txt
$ cat tst.awk
BEGIN {
lat = 100
lon = 20
latWid = lonWid = 6
latDel = lonDel = 0.33
latFmt = lonFmt = "%*.2f"
}
NR==1 {
printf "%*s", latWid, ""
for (i=1; i<=NF; i++) {
printf lonFmt, lonWid, lon
lon += lonDel
}
print ""
}
{
printf latFmt, latWid, lat
lat += latDel
for (i=1; i<=NF; i++) {
printf "%*s", lonWid, $i
}
print ""
}
$ awk -f tst.awk file
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
Following awk may also help you on same.
awk -v col=100 -v row=20 'FNR==1{printf OFS;for(i=1;i<=NF;i++){printf row OFS;row=row+.33;};print ""} {col+=.33;$1=$1;print col OFS $0}' OFS="\t" Input_file
Adding non one liner form of above solution too now:
awk -v col=100 -v row=20 '
FNR==1{
printf OFS;
for(i=1;i<=NF;i++){
printf row OFS;
row=row+.33;
};
print ""
}
{
col+=.33;
$1=$1;
print col OFS $0
}
' OFS="\t" Input_file
Awk solution:
awk 'NR == 1{
long = 20.00; lat = 100.00; printf "%12s%.2f", "", long;
for (i=1; i<NF; i++) { long += 0.33; printf "\t%.2f", long } print "" }
NR > 1{ lat += 0.33 }
{
printf "%.2f%6s", lat, "";
for (i=1; i<=NF; i++) printf "\t%d", $i; print ""
}' file
With perl
$ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
print join "\t", 100+(0.33*$i++), #F' ip.txt
20 20.33 20.66 20.99 21.32 21.65
100 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
-a to auto-split input on whitespaces, result saved in #F array
See https://perldoc.perl.org/perlrun.html#Command-Switches for details on command line options
if $.==1 for the first line of input
map {20.00+$_*0.33} 0..$#F iterate based on size of #F array, and for each iteration, we get a value based on equation inside {} where $_ will be 0, 1, etc upto last index of #F array
print join "\t", "", map... use tab separator to print empty element and results of map
For all the lines, print contents of #F array pre-fixed with results of 100+(0.33*$i++) where $i will be initially 0 in numeric context. Again, tab is used as separator while joining these values
Use sprintf if needed for formatting, also $, can be initialized instead of using join
perl -lane 'BEGIN{$,="\t"; $st=0.33}
print "", map { sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
print sprintf("%.2f", 100+($st*$i++)), #F' ip.txt

Compare columns from two files and print not match

I want to compare the first 4 columns of file1 and file2. I want to print all lines from file1 + the lines from file2 that are not in file1.
File1:
2435 2 2 7 specification 9-8-3-0
57234 1 6 4 description 0-0 55211
32423 2 44 3 description 0-0 24242
File2:
2435 2 2 7 specification
7624 2 2 1 namecomplete
57234 1 6 4 description
28748 34 5 21 gateway
32423 2 44 3 description
832758 3 6 namecomplete
output:
2435 2 2 7 specification 9-8-3-0
57234 1 6 4 description 0-0 55211
32423 2 44 3 description 0-0 24242
7624 2 2 1 namecomplete
28748 34 5 21 gateway
832758 3 6 namecomplete
I don't understand how to print things that don't match.
You can do it with an awk script like this:
script.awk
FNR == NR { mem[ $1 $2 $3 $4 $5 ] = 1;
print
next
}
{ key = $1 $2 $3 $4 $5
if( ! ( key in mem) ) print
}
And run it like this: awk -f script.awk file1 file2 .
The first part memorizes the first 5 fields, prints the whole line and moves to the next line. This part is exclusively applied to lines from the first file.
The second part is only applied to lines from the second file. It checks if the line is not in mem, in that case the line was not in file1 and it is printed.

How to sum column of different file in bash scripting

I have two files:
file-1
1 2 3 4
1 2 3 4
1 2 3 4
file-2
0.5
0.5
0.5
Now I want to add column 1 of file-2 to column 3 of file-1
Output
1 2 3.5 4
1 2 3.5 4
1 2 3.5 4
I've tried this, but it does not work correctly:
awk '{print $1, $2, $3+file-2 }' file-2=$1_of_file-2 file-1 > file-3
I know the awk statement is not right but I want to use something like this; can anyone help me?
Your data isn't very exciting…
awk 'FNR == NR { for (i = 1; i <= NF; i++) { line[NR,i] = $i } fields[NR] = NF }
FNR != NR { line[FNR,3] += $1
pad = ""
for (i = 1; i <= fields[FNR]; i++) { printf "%s%s", pad, line[FNR,i]; pad = " " }
printf "\n"
}' file-1 file-2
The first pattern matches the lines in the first file; it saves each field into the pseudo-multidimensional array line, and also records how many fields there are in that line.
The second pattern matches the lines in the second file; it adds the value in column one to column three of the saved data, then prints out all the fields with a space between them, and adds a newline to the end.
Given this (mildly) modified input, the script (saved in file so-25657951.sh) produces the output shown:
$ cat file-1
1 2 3 4
2 3 6 5
3 4 9 6
$ cat file-2
0.1
0.2
0.3
$ bash so-25657951.sh
1 2 3.1 4
2 3 6.2 5
3 4 9.3 6
$
Note that because this slurps the whole of the first file into memory before reading anything from the second file, the input files should not be too large (say sub-gigabyte size). If they're bigger than that, you should probably devise an alternative strategy.
For example, there is a getline function (even in POSIX awk) which could be used to read a line from file 2 for each line in file 1, and you could then simply print the data without needing to accumulate anything:
awk '{ getline add < "file-2"; $3 += add; print }' file-1
This works reasonably cleanly for any size of file (as long as the files have the same number of lines — or, more precisely, as long as file-2 has at least as many lines as file-1).
This may work:
cat f1
1 2 3 4
2 3 6 5
3 4 9 6
cat f2
0.1
0.2
0.3
awk 'FNR==NR {a[NR]=$1;next} {$3+=a[FNR]}1' f2 f1
1 2 3.1 4
2 3 6.2 5
3 4 9.3 6
After I posted it, I do see that its the same as Jaypal posted in a comment.

How to extract every N columns and write into new files?

I've been struggling to write a code for extracting every N columns from an input file and write them into output files according to their extracting order.
(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)
In a simpler case below, extracting every 3 columns(fields) from an input file with a start point of the 2nd column.
for example, if the input file looks like:
aa 1 2 3 4 5 6 7 8 9
bb 1 2 3 4 5 6 7 8 9
cc 1 2 3 4 5 6 7 8 9
dd 1 2 3 4 5 6 7 8 9
and I want the output to look like this:
output_file_1:
1 2 3
1 2 3
1 2 3
1 2 3
output_file_2:
4 5 6
4 5 6
4 5 6
4 5 6
output_file_3:
7 8 9
7 8 9
7 8 9
7 8 9
I tried this, but it doesn't work:
awk 'for(i=2;i<=10;i+a) {{printf "%s ",$i};a=3}' <inputfile>
It gave me syntax error and the more I fix the more problems coming out.
I also tried the linux command cut but while I was dealing with large files this seems effortless. And I wonder if cut would do a loop cut of every 3 fields just like the awk.
Can someone please help me with this and give a quick explanation? Thanks in advance.
Actions to be performed by awk on the input data must be included in curled braces, so the reason the awk one-liner you tried results in a syntax error is that the for cycle does not respect this rule. A syntactically correct version will be:
awk '{for(i=2;i<=10;i+a) {printf "%s ",$i};a=3}' <inputfile>
This is syntactically correct (almost, see end of this post.), but does not do what you think.
To separate the output by columns on different files, the best thing is to use awk redirection operator >. This will give you the desired output, given that your input files always has 10 columns:
awk '{ print $2,$3,$4 > "file_1"; print $5,$6,$7 > "file_2"; print $8,$9,$10 > "file_3"}' <inputfile>
mind the " " to specify the filenames.
EDITED: REAL WORLD CASE
If you have to loop along the columns because you have too many of them, you can still use awk (gawk), with two loops: one on the output files and one on the columns per file. This is a possible way:
#!/usr/bin/gawk -f
BEGIN{
CTOT = 24005 # total number of columns, you can use NF as well
DELTA = 800 # columns per file
START = 6 # first useful column
d = CTOT/DELTA # number of output files.
}
{
for ( i = 0 ; i < d ; i++)
{
for ( j = 0 ; j < DELTA ; j++)
{
printf("%f\t",$(START+j+i*DELTA)) > "file_out_"i
}
printf("\n") > "file_out_"i
}
}
I have tried this on the simple input files in your example. It works if CTOT can be divided by DELTA. I assumed you had floats (%f) just change that with what you need.
Let me know.
P.s. going back to your original one-liner, note that the loop is an infinite one, as i is not incremented: i+a must be substituted by i+=a, and a=3 must be inside the inner braces:
awk '{for(i=2;i<=10;i+=a) {printf "%s ",$i;a=3}}' <inputfile>
this evaluates a=3 at every cycle, which is a bit pointless. A better version would thus be:
awk '{for(i=2;i<=10;i+=3) {printf "%s ",$i}}' <inputfile>
Still, this will just print the 2nd, 5th and 8th column of your file, which is not what you wanted.
awk '{ print $2, $3, $4 >"output_file_1";
print $5, $6, $7 >"output_file_2";
print $8, $9, $10 >"output_file_3";
}' input_file
This makes one pass through the input file, which is preferable to multiple passes. Clearly, the code shown only deals with the fixed number of columns (and therefore a fixed number of output files). It can be modified, if necessary, to deal with variable numbers of columns and generating variable file names, etc.
(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)
In that case, you're correct; you need a loop. In fact, you need two loops:
awk 'BEGIN { gap = 800; start = 6; filebase = "output_file_"; }
{
for (i = start; i < start + gap; i++)
{
file = sprintf("%s%d", filebase, i);
for (j = i; j <= NF; j += gap)
printf("%s ", $j) > file;
printf "\n" > file;
}
}' input_file
I demonstrated this to my satisfaction with an input file with 25 columns (numbers 1-25 in the corresponding columns) and gap set to 8 and start set to 2. The output below is the resulting 8 files pasted horizontally.
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
With GNU awk:
$ awk -v d=3 '{for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3",""); print "----"}' file
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
Just redirect the output to files if desired:
$ awk -v d=3 '{sfx=0; for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3","") > ("output_file_" ++sfx)}' file
The idea is just to tell gensub() to skip the first few (i-1) fields then print the number of fields you want (d = 3) and ignore the rest (.*). If you're not printing exact multiples of the number of fields you'll need to massage how many fields get printed on the last loop iteration. Do the math...
Here's a version that'd work in any awk. It requires 2 loops and modifies the spaces between fields but it's probably easier to understand:
$ awk -v d=3 '{sfx=0; for(i=2;i<=NF;i+=d) {str=fs=""; for(j=i;j<i+d;j++) {str = str fs $j; fs=" "}; print str > ("output_file_" ++sfx)} }' file
I was successful using the following command line. :) It uses a for loop and pipes the awk program into it's stdin using -f -. The awk program itself is created using bash variable math.
for i in 0 1 2; do
echo "{print \$$((i*3+2)) \" \" \$$((i*3+3)) \" \" \$$((i*3+4))}" \
| awk -f - t.file > "file$((i+1))"
done
Update: After the question has updated I tried to hack a script that creates the requested 800-cols-awk script dynamically ( a version according to Jonathan Lefflers answer) and pipe that to awk. Although the scripts looks good (for me ) it produces an awk syntax error. The question is, is this too much for awk or am I missing something? Would really appreciate feedback!
Update: Investigated this and found documentation that says awk has a lot af restrictions. They told to use gawk in this situations. (GNU's awk implementation). I've done that. But still I'll get an syntax error. Still feedback appreciated!
#!/bin/bash
# Note! Although the script's output looks ok (for me)
# it produces an awk syntax error. is this just too much for awk?
# open pipe to stdin of awk
exec 3> >(gawk -f - test.file)
# verify output using cat
#exec 3> >(cat)
echo '{' >&3
# write dynamic script to awk
for i in {0..24005..800} ; do
echo -n " print " >&3
for (( j=$i; j <= $((i+800)); j++ )) ; do
echo -n "\$$j " >&3
if [ $j = 24005 ] ; then
break
fi
done
echo "> \"file$((i/800+1))\";" >&3
done
echo "}"

Resources