Find the first line having a variable value greater than a specific number - linux

I have a huge text file. How can I find the first line in which the value of a variable is greater than 1000?
Assume that the variable and its value have only one space between them, like this:
abcd 24

To find the first occurrence of abcd with a value greater than 1000, print the line number and the matching line, then quit:
$ awk '$1=="abcd" && $2>1000{print NR, $0; exit}' file
To find any variable greater than 1000 just drop the first condition:
$ awk '$2>1000{print NR, $0; exit}' file
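As a quick check, here is the first one-liner run against a small throwaway file (the file name and sample values are invented for illustration):

```shell
# Create a small sample file (values are made up)
printf '%s\n' 'abcd 24' 'xyz 2000' 'abcd 1500' 'abcd 900' > vals.txt

# First line where the variable is "abcd" AND its value exceeds 1000
awk '$1=="abcd" && $2>1000{print NR, $0; exit}' vals.txt
# 3 abcd 1500
```

The exit statement is what makes this efficient on a huge file: awk stops reading as soon as the first match is printed.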


How to insert a column at the start of a txt file using awk?

How can I insert a column at the start of a txt file, running from 1 to 2059 (which corresponds to the number of rows in my file), using awk? I know the command will be something like this:
awk '{$1=" "}1' File
I'm not sure what to put between the quotes to get 1-2059.
I also want to include a header in the header row, so 1 should technically only go in the second row.
Input:
**ID** Heading1
RQ1293939 -7.0494
RG293I32SJ -903.6868
RQ19238983 -0899977
rq747585950 988349303
Desired output:
FID **ID** Heading1
1 RQ1293939 -7.0494
2 RG293I32SJ -903.6868
3 RQ19238983 -0899977
4 rq747585950 988349303
So I need to insert an FID column with 1 - 2059 running down the first column.
What you show does not work: it just replaces the first field ($1) with a space and prints the result. If you do not have empty lines, try:
awk 'NR==1 {print "FID\t" $0; next} {print NR-1 "\t" $0}' File
Explanations:
NR is the awk variable that counts the records (the lines, in our case), starting from 1. So NR==1 is a condition that holds only when awk processes the first line. In this case the action block says to print FID, a tab (\t), and the original line ($0), and then move on to the next line.
The second action block is executed only if the first one has not been (due to the final next statement). It prints NR-1, that is, the line number minus one, a tab, and the original line.
If you have empty lines and you want to skip them, we need a counter variable to keep track of the current non-empty line number:
awk 'NR==1 {print "FID\t" $0; next} NF==0 {print; next} {print ++cnt "\t" $0}' File
Explanations:
NF is the awk variable that counts the fields in a record (the space-separated words, in our case). So NF==0 is a condition that holds only on empty lines (or lines that contain only spaces). In this case the action block says to print the empty line and move to the next.
The last action block is executed only if none of the two others have been executed (due to their final next statement). It increments the cnt variable, prints it, prints a tab, and prints the original line.
Uninitialized awk variables (like cnt in our example) take value 0 when they are used for the first time as a number. ++cnt increments variable cnt before its value is used by the print command. So the first time this block is executed cnt takes value 1 before being printed. Note that cnt++ would increment after the printing.
Assuming you don't really have a blank row between your header line and the rest of your data:
awk '{print (NR>1 ? NR-1 : "FID"), $0}' file
Use awk -v OFS='\t' '...' file if you want the output to be tab-separated or pipe it to column -t if you want it visually tabular.
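A minimal sketch of that last variant, run against an invented data.txt with a header and two data rows:

```shell
# Invented sample: header line followed by data rows
printf '%s\n' 'ID Heading1' 'RQ1293939 -7.0494' 'RG293I32SJ -903.6868' > data.txt

# The header row gets "FID"; every data row gets its running number (tab-separated)
awk -v OFS='\t' '{print (NR>1 ? NR-1 : "FID"), $0}' data.txt
```

The ternary picks the literal string FID on line 1 and NR-1 everywhere else, so the numbering starts at 1 on the first data row.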

Splitting the first column of a file in multiple columns using AWK

File looks like this, but with millions of lines (TAB separated):
1_number_column_ranking_+ 100 200 Target "Hello"
I want to split the first column by the _ so it becomes:
1 number column ranking + 100 200 Target "Hello"
This is the code I have been trying:
awk -F"\t" '{n=split($1,a,"_");for (i=1;i<=n;i++) print $1"\t"a[i]}'
But it's not quite what I need.
Any help is appreciated (the other threads on this topic were not helpful for me).
No need to split; a replacement will do:
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1'
Eg:
$ cat file
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
1 number column ranking + 100 200 Target "Hello"
gsub replaces all occurrences; when no 3rd argument is given, it replaces in $0.
The final 1 is a shortcut for {print}: a condition that is always true, with the implied action {print}.
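For example (the sample line is built with printf so the tabs between the real columns are reproducible):

```shell
# Build the sample line with real tabs between columns
printf '1_number_column_ranking_+\t100\t200\tTarget\t"Hello"\n' > file

# Replace every "_" inside the first tab-separated field with a tab
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
```

Because FS and OFS are both set to a tab, only the first field is touched and the rest of the line passes through unchanged.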
Another awk, in case the "_" appears only in the first column:
Split the input fields with the regex "[_\t]" and do a dummy operation like $1=$1 in the main section, so that $0 is reconstructed with OFS="\t".
$ cat steveman.txt
1_number_column_ranking_+ 100 200i Target "Hello"
$ awk -F"[_\t]" ' BEGIN { OFS="\t"} { $1=$1; print } ' steveman.txt
1 number column ranking + 100 200i Target "Hello"
$
Thanks @Ed; updated from -F"[_\t]+" to -F"[_\t]", which avoids collapsing empty fields.

remove lines with similar keyword if they appear in consecutive lines

I have got a text file of following format
sam has got grade B
score for him is 70
bob has got grade A
score for him is 90
score for him is 60
ronny has got grade B
score for him is 75
tony has got grade A
score for him is 91
As we see, lines 4 and 5 both have a score, and the grade line is missing before line 5.
One way I could think of:
grep 'grade' file.txt -A 1
However, this would only filter lines where the grade is missing. There could also be lines where the grade is there but the score is missing.
Is there a better unix/linux command to remove such consecutive lines, where two adjacent lines both contain grade or both contain score?
Here is my awk solution,
awk '{ if (prev != $2 $3 $4) {print $0} ; prev = $2 $3 $4 ; }' file.txt
Note that this solution has a minor bug: if there are multiple similar lines at the end, it will output one extra line at the end, which can easily be removed.
By default, awk uses spaces to separate the words in each line and names them $1, $2, $3, etc., in order. prev = $2 $3 $4 saves the concatenation of the second, third, and fourth words in the variable prev. For consecutive similar lines, as in your case, $2, $3, and $4 will be the same as those in the previous line. If they are not the same, print $0 prints the whole line.
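Here is that solution run against the sample from the question (the file name grades.txt is invented for the sketch):

```shell
cat > grades.txt <<'EOF'
sam has got grade B
score for him is 70
bob has got grade A
score for him is 90
score for him is 60
ronny has got grade B
score for him is 75
EOF

# Print a line only when words 2-4 differ from the previous line's words 2-4
awk '{ if (prev != $2 $3 $4) print $0; prev = $2 $3 $4 }' grades.txt
```

This drops "score for him is 60", the second of the two consecutive score lines, and keeps everything else.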
Not a Bash command line, but if you want to get rid of two consecutive lines that both contain 'grade' or both contain 'score', you can open vim and run
:%s/^score.*\zs\nscore.*$//
To eliminate lines that begin with 'score' following a line that begins with 'score', and
:%s/grade.*\zs\n.*grade.*$//
To eliminate lines that have 'grade' in them following a line with 'grade' in it.

Awk statement generating extra output line; skip blank input lines

jon,doe 5 5
sam,smith 10 5
I am required to calculate averages by row and by column. The input file contains a name, score1, and score2; I need to read the contents from the file and then calculate the averages row-wise and column-wise. I am getting the desired result, but there is one extra '0' caused by whitespace; I would appreciate it if someone could help.
awk 'BEGIN {print "name\tScore1\tScore2\tAverage"} {s+=$2} {k+=$3} {print $1,"\t",$2,"\t",$3,"\t",($2+$3)/2} END {print "Average", s/2,k/2}' input.txt
This is the output I am getting:
name Score1 Score2 Average
jon,doe 5 5 5
sam,smith 10 5 7.5
0
Average 7.5 5
It looks like you have an extra empty or blank (all-whitespace) line in your input file.
Adding NF==0 {next} as the first pattern-action pair will skip all empty or blank lines and give the desired result.
NF==0 only matches if no fields (data) were found in the input line.
The next statement skips remaining statements for the current input line (record) and continues processing on the next line (record).
awk 'BEGIN {print "name\tScore1\tScore2\tAverage"} NF==0 {next} {s+=$2} {k+=$3} {print $1,"\t",$2,"\t",$3,"\t",($2+$3)/2} END {print "Average", s/2,k/2}' input.txt
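To see the fix in action, here is a lightly tidied equivalent of the command above, run against the reproduced input (including the trailing blank line that caused the stray row):

```shell
# Reproduce the input, including a trailing blank line
printf 'jon,doe 5 5\nsam,smith 10 5\n\n' > input.txt

awk 'BEGIN {print "name\tScore1\tScore2\tAverage"}
     NF==0 {next}
     {s+=$2; k+=$3; print $1 "\t" $2 "\t" $3 "\t" ($2+$3)/2}
     END {print "Average", s/2, k/2}' input.txt
```

Without the NF==0 {next} rule, the blank line would reach the main block and print a bogus row averaging 0.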

awk sum every 4th number - field

So my input file is:
1;a;b;2;c;d;3;e;f;4;g;h;5
1;a;b;2;c;d;9;e;f;101;g;h;9
3;a;b;1;c;d;3;e;f;10;g;h;5
I want to sum the numbers and then write the result to a file (so I need every 4th field).
I tried many sum examples on the net, but I didn't find an answer to my problem.
My output file should look like this:
159
Thanks!
Update:
a;b;**2**;c;d;g
3;e;**3**;s;g;k
h;5;**2**;d;d;l
The problem is the same.
I want to sum the 3rd number in each line (here it is the 3rd field).
So 2+3+2.
Output: 7
Apparently you want to sum every 3rd field, not every 4th. The following code loops through all the fields, summing each one at a 3k+1 position.
$ awk -F";" '{for (i=1; i<=NF; i+=3) sum+=$i} END{print sum}' file
159
The value is printed after processing the whole file, in the END {} block.
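Run against the three sample lines from the question, the command indeed yields 159 (the per-line sums are 15, 122, and 22):

```shell
printf '%s\n' '1;a;b;2;c;d;3;e;f;4;g;h;5' \
              '1;a;b;2;c;d;9;e;f;101;g;h;9' \
              '3;a;b;1;c;d;3;e;f;10;g;h;5' > file

# Sum fields 1, 4, 7, ... (positions 3k+1) across all lines
awk -F';' '{for (i=1; i<=NF; i+=3) sum+=$i} END{print sum}' file
# 159
```

The non-numeric fields are never visited because the loop steps i by 3 starting from 1.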
