Awk average of column by moving difference of grouping column variable - linux

I have a file that looks like this:
1 snp1 0.0 4
1 snp2 0.2 6
1 snp3 0.3 4
1 snp4 0.4 3
1 snp5 0.5 5
1 snp6 0.6 6
1 snp7 1.3 5
1 snp8 1.3 3
1 snp9 1.9 4
The file is sorted by column 3. I want the average of the 4th column, grouped by column 3 in windows 0.5 units apart. For example, it should output:
1 snp1 0.0 4.4
1 snp6 0.6 6.0
1 snp7 1.3 4.0
1 snp9 1.9 4.0
I can print the first row of each group, without the average, like this:
awk 'NR==1 {pos=$3; print $0} $3>=pos+0.5{pos=$3; print $0}' input
But I am not able to figure out how to print the average of the 4th column. It would be great if someone could help me find a solution to this problem. Thanks!

Something like this, maybe:
awk '
NR==1 {c1=$1; c2=$2; v=$3; n=1; s=$4; next}
$3>v+0.5 {print c1, c2, v, s/n; c1=$1; c2=$2; v=$3; n=1; s=$4; next}
{n+=1; s+=$4}
END {print c1, c2, v, s/n}
' input
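A quick way to sanity-check this is to run it on the sample from the question (a sketch; the scratch file name `input` is just an assumption). Note that awk's default number formatting prints 6 rather than 6.0; use printf "%.1f" instead of print if the trailing decimal matters.

```shell
# Recreate the sample input from the question (scratch file name assumed).
cat > input <<'EOF'
1 snp1 0.0 4
1 snp2 0.2 6
1 snp3 0.3 4
1 snp4 0.4 3
1 snp5 0.5 5
1 snp6 0.6 6
1 snp7 1.3 5
1 snp8 1.3 3
1 snp9 1.9 4
EOF

# A new group starts whenever $3 moves more than 0.5 past the group's first value;
# each printed line is the group's first three columns plus the mean of column 4.
result=$(awk '
NR==1 {c1=$1; c2=$2; v=$3; n=1; s=$4; next}
$3>v+0.5 {print c1, c2, v, s/n; c1=$1; c2=$2; v=$3; n=1; s=$4; next}
{n+=1; s+=$4}
END {print c1, c2, v, s/n}
' input)
echo "$result"
```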

Related

How to match two different length and different column text file with header using join command in linux

I have two text files of different lengths, A.txt and B.txt.
A.txt looks like:
ID pos val1 val2 val3
1 2 0.8 0.5 0.6
2 4 0.9 0.6 0.8
3 6 1.0 1.2 1.3
4 8 2.5 2.2 3.4
5 10 3.2 3.4 3.8
B.txt looks like:
pos category
2 A
4 B
6 A
8 C
10 B
I want to match on the pos column in both files, and I want the output like this:
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
I used the join command join -1 2 -2 1 <(sort -k2 A.txt) <(sort -k1 B.txt) > C.txt
but C.txt comes out without a header:
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
I want to get output with a header from join. Kindly help me out.
Thanks in advance.
In case you are OK with awk, you could try the following. Written and tested with the shown samples in GNU awk.
awk 'FNR==NR{a[$1]=$2;next} ($2 in a){$2=a[$2] OFS $2} 1' B.txt A.txt | column -t
Explanation: a detailed breakdown of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when B.txt is being read.
a[$1]=$2 ##Creating array a with index of 1st field and value is 2nd field of current line.
next ##next will skip all further statements from here.
}
($2 in a){ ##Checking condition if 2nd field is present in array a then do following.
$2=a[$2] OFS $2 ##Adding array a value along with 2nd field in 2nd field as per output.
}
1 ##1 will print current line.
' B.txt A.txt | column -t ##Mentioning Input_file names and passing awk program output to column to make it look better.
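As a sketch of why the header survives: B.txt's own header line stores a["pos"]="category", so A.txt's header row matches the array just like a data row (samples trimmed to two data rows; column -t omitted here):

```shell
# Trimmed samples from the question (file names as in the question).
cat > A.txt <<'EOF'
ID pos val1 val2 val3
1 2 0.8 0.5 0.6
2 4 0.9 0.6 0.8
EOF
cat > B.txt <<'EOF'
pos category
2 A
4 B
EOF

# B.txt's header maps "pos" -> "category", so A.txt's header row is
# rewritten to "ID category pos ..." by the same rule as the data rows.
result=$(awk 'FNR==NR{a[$1]=$2;next} ($2 in a){$2=a[$2] OFS $2} 1' B.txt A.txt)
echo "$result"
```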
As you requested... It is perfectly possible to get the desired output using just GNU join:
$ join -1 2 -2 1 <(sort -k2 -g A.txt) <(sort -k1 -g B.txt) -o 1.1,2.2,1.2,1.3,1.4,1.5
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
$
The key to getting the correct output is using the sort -g option, and specifying the join output column order using the -o option.
To "pretty print" the output, pipe to column -t
$ join -1 2 -2 1 <(sort -k2 -g A.txt) <(sort -k1 -g B.txt) -o 1.1,2.2,1.2,1.3,1.4,1.5 | column -t
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
$

how to do sorting of column 3, and change corresponding value of column 2 in new file using shell scripting? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I want to sort using the sort command.
The input file is 1.txt:
1 2 2
1 3 5.5
1 4 1.5
1 5 2.2
2 1 1.1
2 3 0.7
2 4 0.9
2 5 0.4
The output file should be:
1 4 1.5
1 2 2
1 5 2.2
1 3 5.5
2 5 0.4
2 3 0.7
2 4 0.9
2 1 1.1
Column 3 should be sorted, and the corresponding column-2 values should move with their rows.
Seems like you just want to do a numeric sort on two keys:
$ sort -n -k1 -k3 file
1 4 1.5
1 2 2
1 5 2.2
1 3 5.5
2 5 0.4
2 3 0.7
2 4 0.9
2 1 1.1
-n does a numeric sort, first on field 1 (-k1) and then on field 3 (-k3). Strictly speaking, -k1 means "from field 1 to the end of the line", so the bounded form -k1,1 -k3,3 is more precise, though it makes no difference for this data.
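A minimal check of the bounded-key form on the sample data (a sketch; the file name 1.txt is taken from the question):

```shell
# Recreate the sample input file from the question.
cat > 1.txt <<'EOF'
1 2 2
1 3 5.5
1 4 1.5
1 5 2.2
2 1 1.1
2 3 0.7
2 4 0.9
2 5 0.4
EOF

# Numeric sort on field 1 only, then on field 3 only.
result=$(sort -k1,1n -k3,3n 1.txt)
echo "$result"
```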
Try this, then tune it:
cat 1.txt | sed -E -e 's/[[:blank:]]+/ /g' | awk 'BEGIN {FS=" "; OFS=" "} {print $1, $3, $2}' | sort | awk 'BEGIN {FS=" "; OFS=" "} {print $1, $3, $2}'
sed - unify the column separators to a single space
awk - change order of columns
sort - sort 'em accordingly
awk - change order of columns back to normal
Here is one using GNU awk. It reads the data into memory and sorts while outputting, so a huge file may cause you problems:
$ awk '{
a[$1][$3]=$0
}
END {
PROCINFO["sorted_in"]="#ind_num_asc"
for(i in a)
for(j in a[i])
print a[i][j]
}' file
Output (cleaned off the leading space after awk output):
1 4 1.5
1 2 2
1 5 2.2
1 3 5.5
2 5 0.4
2 3 0.7
2 4 0.9
2 1 1.1

How to compare two text files on the first column: if they match, print the value; if not, print zero?

1.txt contains:
1
2
3
4
5
.
.
180
2.txt contains:
3 0.5
4 0.8
9 9.0
120 3.0
179 2.0
I want the output such that, if the first column of a line in 1.txt matches the first column of a line in 2.txt, it prints the value from the second column of 2.txt; if there is no match, it prints zero.
The output should be:
1 0.0
2 0.0
3 0.5
4 0.8
5 0.0
.
.
8 0.0
9 9.0
10 0.0
11 0.0
.
.
.
120 3.0
121 0.0
.
.
150 0.0
.
179 2.0
180 0.0
awk 'NR==FNR{a[$1]=$2;next}{if($1 in a){print $1,a[$1]}else{print $1,"0.0"}}' 2.txt 1.txt
Brief explanation:
NR==FNR{a[$1]=$2;next}: record the second column of 2.txt into array a, indexed by the first column.
While reading 1.txt: if $1 exists in array a, print $1 and a[$1]; otherwise print $1 and 0.0.
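A runnable sketch of this answer on a smaller range (seq 1 5 stands in for the question's 1..180 index file):

```shell
seq 1 5 > 1.txt                   # index column 1..5 (the question uses 1..180)
printf '3 0.5\n4 0.8\n' > 2.txt   # sparse key/value pairs

# Load 2.txt into array a, then walk 1.txt: print the stored value
# when the key exists, otherwise print 0.0.
result=$(awk 'NR==FNR{a[$1]=$2;next}{if($1 in a){print $1,a[$1]}else{print $1,"0.0"}}' 2.txt 1.txt)
echo "$result"
```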
You could try the following and let me know if it helps.
awk 'FNR==NR{a[$1];next} {for(i=prev+1;i<=($1-1);i++){print i,"0.0"}}{prev=$1;$1=$1;print}' OFS="\t" 1.txt 2.txt
Explanation of code:
awk '
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when 1.txt is being read.
a[$1]; ##Creating an array a whose index is $1.
next ##next will skip all further statements from here.
}
{
for(i=prev+1;i<=($1-1);i++){ ##Starting a for loop from prev+1 up to one less than the value of the first field.
print i,"0.0"} ##Printing value of variable i and 0.0 here.
}
{
prev=$1; ##Setting $1 value to variable prev here.
$1=$1; ##Reassigning $1 to itself forces awk to rebuild the line using OFS (TAB) as the output delimiter.
print ##Printing the current line here.
}' OFS="\t" 1.txt 2.txt ##Setting OFS as TAB and mentioning Input_file(s) name here.
Execution of above code:
Input_file(s):
cat 1.txt
1
2
3
4
5
6
7
cat 2.txt
3 0.5
4 0.8
9 9.0
Output will be as follows:
awk 'FNR==NR{a[$1];next} {for(i=prev+1;i<=($1-1);i++){print i,"0.0"}}{prev=$1;$1=$1;print}' OFS="\t" 1.txt 2.txt
1 0.0
2 0.0
3 0.5
4 0.8
5 0.0
6 0.0
7 0.0
8 0.0
9 9.0
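The same demo is reproducible as a script (a sketch using the demo files above). Two things worth noting: the output is TAB-delimited because of OFS, and it ends at 2.txt's last key, since the array a built from 1.txt is never actually consulted.

```shell
seq 1 7 > 1.txt                         # demo index file from the answer
printf '3 0.5\n4 0.8\n9 9.0\n' > 2.txt  # demo key/value file

# For each line of 2.txt, first print the missing indices since the
# previous key with value 0.0, then the line itself, TAB-delimited.
result=$(awk 'FNR==NR{a[$1];next} {for(i=prev+1;i<=($1-1);i++){print i,"0.0"}}{prev=$1;$1=$1;print}' OFS="\t" 1.txt 2.txt)
echo "$result"
```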
This might work for you (GNU sed):
sed -r 's#^(\S+)\s.*#/^\1\\s*$/c&#' file2 | sed -i -f - -e 's/$/ 0.0/' file1
This builds a sed script from file2: if the first field of a line in file1 matches the first field of a line in file2, that line is changed to the contents of the matching line from file2. All other lines are then zeroed, i.e. lines not changed have 0.0 appended.

Average column if value in other column matches and print as additional column

I have a file like this:
Score 1 24 HG 1
Score 2 26 HG 2
Score 5 56 RP 0.5
Score 7 82 RP 1
Score 12 97 GM 5
Score 32 104 LS 3
I would like to average column 5 for rows where column 4 is identical, and print the average as column 6, so that it looks like this:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
I have tried a couple of solutions I found on here.
e.g.
awk '{ total[$4] += $5; ++n[$4] } END { for(i in total) print i, total[i] / n[i] }'
but they all end up with this:
HG 1.5
RP 0.75
GM 5
LS 3
Which is undesirable as I lose a lot of information.
You can iterate through your table twice: calculate the averages (as you already) do on the first iteration, and then print them out on the second iteration:
awk 'NR==FNR { total[$4] += $5; ++n[$4] } NR>FNR { print $0, total[$4] / n[$4] }' file file
Notice the file twice at the end. While going through the "first" file, NR==FNR, and we sum the appropriate values, keeping them in memory (variables total and n). During "second" file traversal, NR>FNR, and we print out all the original data + averages:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
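A sketch verifying the two-pass approach on a trimmed sample (file name file assumed, space-separated as in the question):

```shell
# Trimmed sample: two keys, two rows each.
cat > file <<'EOF'
Score 1 24 HG 1
Score 2 26 HG 2
Score 5 56 RP 0.5
Score 7 82 RP 1
EOF

# Pass 1 accumulates sums and counts per key in column 4;
# pass 2 reprints each line with its group's average appended.
result=$(awk 'NR==FNR { total[$4] += $5; ++n[$4] } NR>FNR { print $0, total[$4] / n[$4] }' file file)
echo "$result"
```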
You can use one pass through the file, but you have to store the entire file in memory, so it is a disk-I/O versus memory tradeoff (note the BEGIN block assumes tab-separated input):
awk '
BEGIN {FS = OFS = "\t"}
{total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4}
END {for (i=1; i<=NR; i++) print line[i], total[key[i]] / n[key[i]]}
' file

How to sum a column from a different file in bash scripting

I have two files:
file-1
1 2 3 4
1 2 3 4
1 2 3 4
file-2
0.5
0.5
0.5
Now I want to add column 1 of file-2 to column 3 of file-1.
Output
1 2 3.5 4
1 2 3.5 4
1 2 3.5 4
I've tried this, but it does not work correctly:
awk '{print $1, $2, $3+file-2 }' file-2=$1_of_file-2 file-1 > file-3
I know the awk statement is not right but I want to use something like this; can anyone help me?
Your data isn't very exciting…
awk 'FNR == NR { for (i = 1; i <= NF; i++) { line[NR,i] = $i } fields[NR] = NF }
FNR != NR { line[FNR,3] += $1
pad = ""
for (i = 1; i <= fields[FNR]; i++) { printf "%s%s", pad, line[FNR,i]; pad = " " }
printf "\n"
}' file-1 file-2
The first pattern matches the lines in the first file; it saves each field into the pseudo-multidimensional array line, and also records how many fields there are in that line.
The second pattern matches the lines in the second file; it adds the value in column one to column three of the saved data, then prints out all the fields with a space between them, and adds a newline to the end.
Given this (mildly) modified input, the script (saved in file so-25657951.sh) produces the output shown:
$ cat file-1
1 2 3 4
2 3 6 5
3 4 9 6
$ cat file-2
0.1
0.2
0.3
$ bash so-25657951.sh
1 2 3.1 4
2 3 6.2 5
3 4 9.3 6
$
Note that because this slurps the whole of the first file into memory before reading anything from the second file, the input files should not be too large (say sub-gigabyte size). If they're bigger than that, you should probably devise an alternative strategy.
For example, there is a getline function (even in POSIX awk) which could be used to read a line from file 2 for each line in file 1, and you could then simply print the data without needing to accumulate anything:
awk '{ getline add < "file-2"; $3 += add; print }' file-1
This works reasonably cleanly for any size of file (as long as the files have the same number of lines — or, more precisely, as long as file-2 has at least as many lines as file-1).
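A minimal sketch of the getline variant, using the modified demo files (note the file name "file-2" is hard-coded inside the awk program):

```shell
printf '1 2 3 4\n2 3 6 5\n3 4 9 6\n' > file-1
printf '0.1\n0.2\n0.3\n' > file-2

# For every line of file-1, read the next line of file-2 into "add"
# and fold it into column 3 before printing.
result=$(awk '{ getline add < "file-2"; $3 += add; print }' file-1)
echo "$result"
```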
This may work:
cat f1
1 2 3 4
2 3 6 5
3 4 9 6
cat f2
0.1
0.2
0.3
awk 'FNR==NR {a[NR]=$1;next} {$3+=a[FNR]}1' f2 f1
1 2 3.1 4
2 3 6.2 5
3 4 9.3 6
After I posted it, I do see that it's the same as Jaypal posted in a comment.
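For completeness, the short two-file version is easy to check as a script (same demo data; a sketch):

```shell
printf '1 2 3 4\n2 3 6 5\n3 4 9 6\n' > f1
printf '0.1\n0.2\n0.3\n' > f2

# First pass stores f2's column 1 by line number; second pass adds
# the stored value to column 3 of the matching line of f1.
result=$(awk 'FNR==NR {a[NR]=$1;next} {$3+=a[FNR]}1' f2 f1)
echo "$result"
```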
