Put into file: index of line and total number of a specific pattern - linux

I'm trying to write a command which appends to a file the index of every line where the number of commas is greater than 5, together with the number of commas in that line.
Let's assume this input:
'abc','abc','abc','abc,'abc,'abc
'abc','abc','abc','abc,'abc,'abc,'abc
'abc','abc','abc','abc,'abc,'abc,'abc,'abc'
So in the first line there are 5 commas, in the second 6, and in the third 7
and the expected result:
Index: 2 Number of commas : 6
Index: 3 Number of commas : 7
I came up with this command, which puts the whole contents of a line into errors.csv when the line has more than 50 fields:
awk -F , 'NF > 50' <filename.csv >> errors.csv
The hardest part for me is how to retrieve the line index and put it into the file.
Could you help me?

You can get this expected output using NR and NF variables of awk:
awk -F"," '{ if(NF > 6) printf("Index: %d Number of commas : %d\n", NR, NF-1); }' filename.csv
NR gives you the number of the current record (the line number), and NF gives the number of fields in it, so NF-1 is the number of commas on the line.
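If you also want this report appended to a file, as with your errors.csv example, the same command can simply redirect its output (a minimal sketch building on the answer above):
awk -F',' 'NF > 6 { printf("Index: %d Number of commas : %d\n", NR, NF-1) }' filename.csv >> errors.csv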

Related

Split and compare in awk

I want to split and compare values in an awk command.
Input file (tab-delimited)
1 aaa 1|3
2 bbb 3|3
3 ccc 0|2
Filtration
First column value > 1
First value of the third column, split by "|", > 2
Process
Check whether the first column value is bigger than 1
Split the third column value by "|"
Check whether the first value of the third column is bigger than 2
Print the line only if both conditions hold
Command line (example)
awk -F "\t" '{if($1>1 && ....?) print}' file
Output
2 bbb 3|3
Please let me know the command line for the above processing.
You can set the field separator to either tab or pipe and check the 1st and 3rd values:
awk -F'\t|\\|' '$1>1 && $3>2' file
or
awk -F"\t|\\\\|" '$1>1 && $3>2' file
You can read about all this character escaping in Ed Morton's comprehensive answer to awk: fatal: Invalid regular expression when setting multiple field separators.
Otherwise, you can split the 3rd field and check the value of the first slice:
awk -F"\t" '{split($3,a,"|")} $1>1 && a[1]>=2' file

How to format decimal places using awk in Linux

Original file:
a|||a 2 0.111111
a|||book 1 0.0555556
a|||is 2 0.111111
Now I need to format the third column to 6 decimal places.
I tried awk '{print $1,$2; printf "%.6f\t",$3}'
but the output is not what I want.
result :
a|||a 2
0.111111 a|||book 1
0.055556 a|||is 2
That's weird. How can I do it so that only the third column is modified?
Your print is adding a newline character. Include your third field inside the same print, but formatted; try the sprintf() function, like:
awk '{print $1,$2, sprintf("%.6f", $3)}' infile
That yields:
a|||a 2 0.111111
a|||book 1 0.055556
a|||is 2 0.111111
Print adds a newline on the end of printed strings, whereas printf by default doesn't. This means a newline is added after every second field and none is added after the third.
You can use printf for the whole string and manually add a newline.
Also, I'm not sure why you are adding a tab to the end of the lines, so I removed that:
awk '{printf "%s %d %.6f\n",$1,$2,$3}' file
a|||a 2 0.111111
a|||book 1 0.055556
a|||is 2 0.111111
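Another option, if you really only want to touch the third column and leave the rest of the line alone, is to overwrite $3 in place before printing; this is just a sketch assuming the same space-separated input as above:
awk '{$3 = sprintf("%.6f", $3); print}' infile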

Awk statement generating extra output line; skip blank input lines

jon,doe 5 5
sam,smith 10 5
I am required to calculate the average for each row and column. The input file contains a name, score1 and score2; I need to read the contents from the file and then calculate the average row-wise and column-wise. I am getting the desired result, but there is one extra '0' caused by whitespace. I would appreciate it if someone could help.
awk 'BEGIN {print "name\tScore1\tScore2\tAverage"} {s+=$2} {k+=$3} {print $1,"\t",$2,"\t",$3,"\t",($2+$3)/2} END {print "Average", s/2,k/2}' input.txt
This is the output that i am getting-
name Score1 Score2 Average
jon,doe 5 5 5
sam,smith 10 5 7.5
0
Average 7.5 5
It looks like you have an extra empty or blank (all-whitespace) line in your input file.
Adding NF==0 {next} as the first pattern-action pair will skip all empty or blank lines and give the desired result.
NF==0 only matches if no fields (data) were found in the input line.
The next statement skips remaining statements for the current input line (record) and continues processing on the next line (record).
awk 'BEGIN {print "name\tScore1\tScore2\tAverage"} NF==0 {next} {s+=$2} {k+=$3} {print $1,"\t",$2,"\t",$3,"\t",($2+$3)/2} END {print "Average", s/2,k/2}' input.txt
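With blank lines skipped, the stray 0 no longer appears and the rest of the output is unchanged (assuming the same input, which ends with a blank line):
name Score1 Score2 Average
jon,doe 5 5 5
sam,smith 10 5 7.5
Average 7.5 5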

How to print 3rd field in 3rd column itself

In my file I have 3 fields. I want to print the third field in the third column only, but the output ends up under the first field. Please check my file and output:
cat filename
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1,2,3,4,5,5
q,w,e,r t,y,g,t,i 9,8,7,6,5,5
I'm using the following command to print the third field only in the third column
cat filename |awk '{print $3}' |tr ',' '\n'
The output prints the 3rd field's values in the 1st field's place; I want them to appear only under the 3rd field.
Current output (the values land under the first field):
---------------
1
2
3
4
5
5
expected output
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Input
[akshay#localhost tmp]$ cat file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1,2,3,4,5,5
q,w,e,r t,y,g,t,i 9,8,7,6,5,5
Script
[akshay#localhost tmp]$ cat test.awk
NR<3 || !NF { print; next }
{
    split($0, D, /[^[:space:]]*/)
    c1 = sprintf("%*s", length($1), "")
    c2 = sprintf("%*s", length($2), "")
    split($3, A, /,/)
    for (i=1; i in A; i++)
    {
        if (i==2)
        {
            $1 = c1
            $2 = c2
        }
        printf("%s%s%s%s%d\n", $1, D[2], $2, D[3], A[i])
    }
}
Output
[akshay#localhost tmp]$ awk -f test.awk file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Explanation
NR<3 || !NF{ print; next}
NR gives you the number of the record currently being processed; in short, the NR variable holds the current line number.
NF gives you the total number of fields in a record.
The next statement forces awk to immediately stop processing the
current record and go on to the next record.
If the line number is less than 3, or NF is zero (meaning the record has no fields, i.e. it is a blank line), print the current record and go on to the next one.
split($0,D,/[^[:space:]]*/)
Since we want to preserve the original formatting, we save the separators between the fields in array D here. If you have GNU awk you can make use of the 4th arg of split(): it lets you split the line into two arrays, one with the fields and the other with the separators between the fields, so you can operate on the fields array and then print, inserting the separator elements between the field elements, to rebuild the original $0.
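As an aside, a tiny gawk-only sketch of that 4th-argument behaviour (not part of the script above; the loop simply rebuilds each line from the two arrays):
gawk '{
    n = split($0, flds, /[[:space:]]+/, seps)
    line = ""
    for (i = 1; i <= n; i++)
        line = line flds[i] seps[i]
    print line    # rebuilds the original line from fields plus separators
}' file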
c1=sprintf("%*s",length($1),"") and c2=sprintf("%*s",length($2),"")
Here sprintf() is used to build a string of spaces as long as the corresponding field ($1 or $2); %*s takes the width from the length() argument.
split($3,A,/,/)
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array. The first piece
is stored in array[1], the second piece in array[2], and so forth. The
string value of the third argument, fieldsep, is a regexp describing
where to split string (much as FS can be a regexp describing where to
split input records). If fieldsep is omitted, the value of FS is used.
split() returns the number of elements created.
Loop as long as i in A is true; the i=1 initialization and the i++ increment control the order in which the array is traversed (thanks to Ed Morton for pointing this out).
if (i==2)
{
    $1 = c1
    $2 = c2
}
When i == 1 we print a,b,c,d and d,e,f,g,h; on the following iterations we replace $1 and $2 with the c1 and c2 space strings created above, since you want those fields shown only once, as requested.
printf("%s%s%s%s%d\n",$1,D[2],$2,D[3],A[i])
Finally, print field1 ($1), the separator between field1 and field2 that we saved above (D[2]), field2 ($2), the separator between field2 and field3 (D[3]), and one element of array A per line, which we created with split($3,A,/,/).
$ cat tst.awk
NR<3 || !NF { print; next }
{
    front = gensub(/((\S+\s+){2}).*/,"\\1","")
    split($3, a, /,/)
    for (i=1; i in a; i++) {
        print front a[i]
        gsub(/\S/, " ", front)
    }
}
$ awk -f tst.awk file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
The above uses GNU awk for gensub(), with other awks use match()+substr(). It also uses \S and \s shorthand for [^[:space:]] and [[:space:]].
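For completeness, one possible match()+substr() version of the same idea for non-GNU awks; this is only a sketch, assuming the same input layout with two space-separated leading columns:
NR<3 || !NF { print; next }
{
    # grab the first two columns plus the whitespace that follows them
    match($0, /^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+/)
    front = substr($0, 1, RLENGTH)
    split($3, a, /,/)
    for (i=1; i in a; i++) {
        print front a[i]
        # blank out the leading columns after the first printed line
        gsub(/[^[:space:]]/, " ", front)
    }
}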
Considering the columns are tab separated, I would say:
awk 'BEGIN{FS=OFS="\t"}
     NR<=2 || !NF {print; next}
     NR>2 {n=split($3,a,",")
           for (i=1; i<=n; i++)
               print (i==1 ? $1 OFS $2 : "" OFS ""), a[i]
     }' file
This prints the 1st, 2nd and any empty lines unchanged.
Then it slices the 3rd field using the comma as separator.
Finally, it loops through the pieces, printing one per line; the first two columns are printed on the first iteration only, and just the value afterwards.
Test
$ awk 'BEGIN{FS=OFS="\t"} NR<=2 || !NF {print; next} NR>2{n=split($3,a,","); for (i=1;i<=n; i++) print (i==1?$1 OFS $2:"" OFS ""), a[i]}' a
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Note the output alignment is a bit ugly, since tab-separating the columns leaves it looking like this.
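If you would rather keep the columns visually aligned, one rough alternative is to pad the first two columns with printf instead of tabs; the 10-character widths below are just an assumption based on the sample data:
awk 'BEGIN{FS="\t"}
     NR<=2 || !NF {print; next}
     {n=split($3,a,",")
      for (i=1; i<=n; i++)
          printf "%-10s %-10s %s\n", (i==1 ? $1 : ""), (i==1 ? $2 : ""), a[i]
     }' file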

Find the first line having a variable value bigger than a specific number

I have a very large text file and I want to know how I can find the first line in which the value of a variable is bigger than 1000.
Assume that the variable and its value have only one space in between, like this:
abcd 24
Find the first occurrence of abcd with a value greater than 1000, print the line number and the matching line, and quit:
$ awk '$1=="abcd" && $2>1000{print NR, $0; exit}' file
To find any variable greater than 1000, just drop the first condition:
$ awk '$2>1000{print NR, $0; exit}' file
