How to insert a column at the start of a txt file using awk? - linux

How can I use awk to insert a column at the start of a txt file, numbered from 1 to 2059 (the number of rows in my file)? I know the command will be something like this:
awk '{$1=" "}1' File
I am not sure what to put between the quotation marks to get 1-2059.
I also want a header in the header row, so 1 should only go in the second row. My file currently looks like this:
**ID** Heading1
RQ1293939 -7.0494
RG293I32SJ -903.6868
RQ19238983 -0899977
rq747585950 988349303
Desired output:
FID **ID** Heading1
1 RQ1293939 -7.0494
2 RG293I32SJ -903.6868
3 RQ19238983 -0899977
4 rq747585950 988349303
So I need to insert an FID column, running from 1 to 2059, as the first column.

What you show does not work: it just replaces the first field ($1) with a space and prints the result. If you do not have empty lines, try:
awk 'NR==1 {print "FID\t" $0; next} {print NR-1 "\t" $0}' File
Explanations:
NR is the awk variable that counts the records (the lines, in our case), starting from 1. So NR==1 is a condition that holds only when awk processes the first line. In this case the action block says to print FID, a tab (\t), and the original line ($0), and then move on to the next line.
The second action block is executed only if the first one has not been executed (because of its final next statement). It prints NR-1, that is, the line number minus one, a tab, and the original line.
If you have empty lines and want to skip them, you will need a counter variable to keep track of the current non-empty line number:
awk 'NR==1 {print "FID\t" $0; next} NF==0 {print; next} {print ++cnt "\t" $0}' File
Explanations:
NF is the awk variable that counts the fields in a record (the space-separated words, in our case). So NF==0 is a condition that holds only on empty lines (or lines that contain only spaces). In this case the action block says to print the empty line and move to the next.
The last action block is executed only if neither of the other two has been executed (because of their final next statements). It increments the cnt variable, prints it, prints a tab, and prints the original line.
Uninitialized awk variables (like cnt in our example) take value 0 when they are used for the first time as a number. ++cnt increments variable cnt before its value is used by the print command. So the first time this block is executed cnt takes value 1 before being printed. Note that cnt++ would increment after the printing.
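To make the difference between the two commands concrete, here is a small sketch that runs both on the same input. The sample data (three rows from the question plus one empty line) and the shell variable names sample, by_nr and by_cnt are assumptions made for illustration only.

```shell
# Hypothetical sample: header, two data rows, an empty line, one more data row.
sample=$(printf 'ID Heading1\nRQ1293939 -7.0494\nRG293I32SJ -903.6868\n\nRQ19238983 -0899977\n')

# Numbering with NR-1: the empty line consumes number 3.
by_nr=$(printf '%s\n' "$sample" |
  awk 'NR==1 {print "FID\t" $0; next} {print NR-1 "\t" $0}')

# Numbering with a counter: the empty line is passed through and not counted.
by_cnt=$(printf '%s\n' "$sample" |
  awk 'NR==1 {print "FID\t" $0; next} NF==0 {print; next} {print ++cnt "\t" $0}')

printf '%s\n---\n%s\n' "$by_nr" "$by_cnt"
```

With this input the last data row is numbered 4 by the first command and 3 by the second.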

Assuming you don't really have a blank row between your header line and the rest of your data:
awk '{print (NR>1 ? NR-1 : "FID"), $0}' file
Use awk -v OFS='\t' '...' file if you want the output to be tab-separated or pipe it to column -t if you want it visually tabular.
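As a quick check of the tab-separated variant, the sketch below feeds a shortened, hypothetical copy of the question's data through the same one-liner with OFS set to a tab. The variable names sample and tabbed are illustrative.

```shell
# Hypothetical three-line sample: header plus two data rows.
sample=$(printf 'ID Heading1\nRQ1293939 -7.0494\nRG293I32SJ -903.6868\n')

# The comma in print inserts OFS, here a tab, between the new column and $0.
tabbed=$(printf '%s\n' "$sample" |
  awk -v OFS='\t' '{print (NR>1 ? NR-1 : "FID"), $0}')

printf '%s\n' "$tabbed"
```

Only the separator between the new first column and the original line becomes a tab; the original line itself is printed unchanged.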

Related

Less rows than expected after comparing two files

I have two files to be compared:
"base" file from where I get values in the second column after comparing it with "temp" file
"temp" file which is continuously changing (e.g., in every loop)
"base" file:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
"temp" file:
2.3
1.8
4.5
For comparison, the following code is used:
awk 'NR==FNR{A[$1]=$2;next} {i=int($1+.01)} i in A {print A[i]}' base temp
Therefore, it outputs:
b
a
d
As you can see, even though the "temp" file contains decimal numbers, the corresponding letters are found and printed. However, with a larger file (e.g., more than a couple of thousand records in the "temp" file) the code always outputs 158 fewer rows than the actual number of rows in the "temp" file. I do not understand why this happens and would appreciate help working around it.
In the following example, "tmpctd" is the base file and "tmpsf" is the changing file.
awk 'NR==FNR{A[$1]=$2;next} {i=int($1+.01)} i in A {print A[i]}' tmpctd tmpsf
The above comparison produces 22623 rows, but "tmpsf" (i.e., the "temp" file) has 22781 rows, so 158 rows are missing after comparing the two files. For testing, the files are available here: https://file.io/pxi24ZtPt0kD and https://file.io/tHgdI3dkbKhr.
Any hints are welcomed.
PS. I updated both links, sorry for that.
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==NR{
a[int($1)]
next
}
($1 in a){
print $2
}
' temp_file base_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when temp_file is being read.
a[int($1)] ##Creating array a which has index as integer value of 1st field of current line.
next ##next will skip all further statements from here.
}
($1 in a){ ##Checking condition if first field is present in array a then do following.
print $2 ##Printing 2nd field of current line.
}
' temp_file base_file ##Mentioning Input_file names here.
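Note that the command above prints in the order of base_file and prints each key at most once, so its output can differ from the question's command. One way to find out which rows the question's command drops is to print the "temp" rows whose integer key is not present in "base". The sketch below does that; the data is the question's sample plus two hypothetical values (0.4 and 12.7) that have no matching key, and the temporary file names are illustrative.

```shell
# Hypothetical data: the question's samples plus 0.4 and 12.7,
# which have no matching key in the base file.
dir=$(mktemp -d)
printf '1 a\n2 b\n3 c\n4 d\n' > "$dir/base"
printf '2.3\n0.4\n1.8\n12.7\n4.5\n' > "$dir/temp"

# Same lookup as in the question, but report the temp rows that are NOT found,
# with their line number and the integer key that was tried.
missing=$(awk 'NR==FNR {A[$1]=$2; next}
               {i=int($1+.01)}
               !(i in A) {print FNR ": " $1 " -> " i}' "$dir/base" "$dir/temp")

printf '%s\n' "$missing"
rm -r "$dir"
```

Running the same pattern on the real files should list the rows that are dropped, which would show whether they are values outside the key range of the base file, blank lines, or something else.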

Subtract a constant number from a column

I have two large files (~10GB) as follows:
file1.csv
name,id,dob,year,age,score
Mike,1,2014-01-01,2016,2,20
Ellen,2, 2012-01-01,2016,4,35
.
.
file2.csv
id,course_name,course_id
1,math,101
1,physics,102
1,chemistry,103
2,math,101
2,physics,102
2,chemistry,103
.
.
I want to subtract 1 from the "id" columns of these files:
file1_updated.csv
name,id,dob,year,age,score
Mike,0,2014-01-01,2016,2,20
Ellen,1, 2012-01-01,2016,4,35
file2_updated.csv
id,course_name,course_id
0,math,101
0,physics,102
0,chemistry,103
1,math,101
1,physics,102
1,chemistry,103
I have tried awk '{print ($1 - 1) "," $0}' file2.csv, but did not get the correct result:
-1,id,course_name,course_id
0,1,math,101
0,1,physics,102
0,1,chemistry,103
1,2,math,101
1,2,physics,102
1,2,chemistry,103
You've added an extra column in your attempt. Instead set your first field $1 to $1-1:
awk -F"," 'BEGIN{OFS=","} {$1=$1-1;print $0}' file2.csv
The semicolon separates the two statements. We set the input field delimiter to a comma (-F",") and the Output Field Separator to a comma (BEGIN{OFS=","}). The statement that subtracts 1 from the first field runs first and the print statement runs second, so the entire record, $0, contains the new $1 value when it is printed.
It might be helpful to only subtract 1 from records that are not your header. So you can add a condition to the first command:
awk -F"," 'BEGIN{OFS=","} NR>1{$1=$1-1} {print $0}' file2.csv
Now we only subtract when the record number (NR) is greater than 1. Then we just print the entire record.
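The commands above target file2.csv, where id is the first field. In file1.csv the id is the second field, so the same idea would presumably apply with $2 instead of $1. A minimal sketch, using the two sample rows from the question (with the stray space after the comma removed) and the illustrative variable names file1 and updated:

```shell
# Sample rows from the question's file1.csv; id is the SECOND column here.
file1=$(printf 'name,id,dob,year,age,score\nMike,1,2014-01-01,2016,2,20\nEllen,2,2012-01-01,2016,4,35\n')

# Skip the header (NR>1), subtract 1 from field 2, and print every record.
updated=$(printf '%s\n' "$file1" |
  awk -F, 'BEGIN{OFS=","} NR>1{$2=$2-1} {print}')

printf '%s\n' "$updated"
```

For the real 10GB file, the awk command would read file1.csv directly and redirect its output to file1_updated.csv.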

Awk statement generating extra output line; skip blank input lines

jon,doe 5 5
sam,smith 10 5
I am required to calculate averages by row and by column. The input file contains a name, score1 and score2; I need to read the contents from the file and then calculate the average row-wise and column-wise. I am getting the desired result, but there is one extra '0' in the output, apparently caused by white space. I would appreciate it if someone could help.
awk 'BEGIN {print "name\tScore1\tScore2\tAverage"} {s+=$2} {k+=$3} {print $1,"\t",$2,"\t",$3,"\t",($2+$3)/2} END {print "Average", s/2,k/2}' input.txt
This is the output that I am getting:
name Score1 Score2 Average
jon,doe 5 5 5
sam,smith 10 5 7.5
0
Average 7.5 5
It looks like you have an extra empty or blank (all-whitespace) line in your input file.
Adding NF==0 {next} as the first pattern-action pair will skip all empty or blank lines and give the desired result.
NF==0 only matches if no fields (data) were found in the input line.
The next statement skips remaining statements for the current input line (record) and continues processing on the next line (record).
awk 'BEGIN {print "name\tScore1\tScore2\tAverage"} NF==0 {next} {s+=$2} {k+=$3} {print $1,"\t",$2,"\t",$3,"\t",($2+$3)/2} END {print "Average", s/2,k/2}' input.txt
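The command above still divides by a hard-coded 2 and prints a space on either side of each tab (print inserts OFS around every "\t" argument). A possible variant, offered as a sketch rather than a drop-in replacement, counts the data rows in a variable (n, a name chosen here) and lets OFS supply the tabs. The input is the question's two rows plus a blank line.

```shell
# The question's two rows followed by a whitespace-only line.
report=$(printf 'jon,doe 5 5\nsam,smith 10 5\n   \n' |
  awk 'BEGIN {OFS="\t"; print "name", "Score1", "Score2", "Average"}
       NF==0 {next}                      # skip empty or blank lines
       {s+=$2; k+=$3; n++                # column sums and row count
        print $1, $2, $3, ($2+$3)/2}     # row average
       END {if (n) print "Average", s/n, k/n}')

printf '%s\n' "$report"
```

Because n counts only the non-blank rows, the column averages stay correct when the file grows beyond two rows.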

Find the first line having a variable value bigger than a specific number

I have a very large text file. How can I find the first line in which the value of a variable is bigger than 1000?
Assume that the variable and its value are separated by a single space, like this:
abcd 24
Find the first occurrence of abcd greater than 1000 and print the line number and matching line and quit:
$ awk '$1=="abcd" && $2>1000{print NR, $0; exit}' file
To find any variable greater than 1000 just drop the first condition:
$ awk '$2>1000{print NR, $0; exit}' file
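A short illustration of both commands on made-up input; the four sample lines and the variable names first_abcd and first_any are assumptions for the example.

```shell
# Hypothetical input: "name value" pairs, one per line.
data=$(printf 'abcd 24\nefgh 5000\nabcd 1500\nabcd 2000\n')

# First line where abcd exceeds 1000; exit stops awk from reading the rest.
first_abcd=$(printf '%s\n' "$data" | awk '$1=="abcd" && $2>1000 {print NR, $0; exit}')

# First line where any variable exceeds 1000.
first_any=$(printf '%s\n' "$data" | awk '$2>1000 {print NR, $0; exit}')

printf '%s\n%s\n' "$first_abcd" "$first_any"
```

The exit statement is what makes this practical on a very large file: awk stops at the first match instead of scanning to the end.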

CSV grep but keep the header

I have a CSV file that look like this:
A,B,C
1,2,3
4,4,4
1,2,6
3,6,9
Is there an easy way to grep all the rows in which the B column is 2, and keep the header? For example, I want the output to look like this:
A,B,C
1,2,3
1,2,6
I am working under Linux.
Using awk:
awk -F, 'NR==1 || $2==2' file
NR==1 -> if first line,
$2==2 -> if second column is equal to 2. Lines are printed if either of the above is true.
To choose the column using the header column name:
awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
Replace B with the appropriate name of the column which you want to check against.
You can use addresses in sed:
sed -n '1p;/^[^,]*,2/p'
It means:
1p Print the first line.
/ Start a match.
^ Match the beginning of a line.
[^,] Match anything but a comma
* zero or more times.
, Match a comma.
2 Match a 2.
/p End of match, if it matches, print.
If the header can contain the value you are looking for, you should be more careful:
sed -n '1p;1!{/^[^,]*,2/p}'
1!{ ... } just means "Do the following for lines other than the first one".
For column number n>2, you can add a quantifier:
sed -n '1p;1!{/^\([^,]*,\)\{M\}2/p}'
where M=n-1. The quantifier just means repetition, so the non-comma-0-or-more-times-comma thing is repeated M times.
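As a worked instance of the quantifier form, the sketch below selects rows whose third column (n=3, so M=2) starts with 6, using the question's sample CSV. The variable names csv and rows are illustrative, and the ; before the closing } is added here only because some non-GNU sed implementations require it.

```shell
# The sample CSV from the question.
csv=$(printf 'A,B,C\n1,2,3\n4,4,4\n1,2,6\n3,6,9\n')

# n=3, so M=2: skip two "non-commas followed by a comma" groups, then match 6.
rows=$(printf '%s\n' "$csv" | sed -n '1p;1!{/^\([^,]*,\)\{2\}6/p;}')

printf '%s\n' "$rows"
```

Like the patterns above, this matches any value that begins with 6 (for example 60), not only the exact value 6.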
For true CSV files where a value can contain a comma, switch to Perl and Text::CSV.
$ awk -F, 'NR==1 { for (i=1;i<=NF;i++) h[$i] = i; print; next } $h["B"] == 2' file
A,B,C
1,2,3
1,2,6
By the way, sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk: the code will be clearer and MUCH easier to enhance in future if necessary.

Resources