Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I would like to extract sum, mean and average in each 6 numbers interval from a column.
I found many discussions related to this problem, but all those are for whole column. e.g.
To compute sum of a column:
awk '{sum+=$1} END { print sum}'
To calculate Average:
awk '{sum+=$1} END { print sum/NR}'
To find maximum or minimum, sort command can be used.
I need all these in an interval. e.g., my input file is
inputfile.txt
1 3
2 5
3 4
4 3
5 2
6 1
7 3
8 3
9 4
10 2
11 2
12 2
13 5
14 4
15 2
16 3
17 7
18 3
Output files are
sum.txt
1 18
2 16
3 24
average.txt
1 3
2 2.67
3 4
maximum.txt
1 5
2 4
3 7
Please make a search before you ask any question many posts are already there
You can try something like below, modify accordingly
Input
[akshay#localhost tmp]$ cat input.txt
1 3
2 5
3 4
4 3
5 2
6 1
7 3
8 3
9 4
10 2
11 2
12 2
13 5
14 4
15 2
16 3
17 7
18 3
Script
[akshay#localhost tmp]$ cat test.awk
{
sum += $2
max = max > $2 ? max : $2
}
!(FNR%6){
print ++c,sum > "sum.txt"
print c,sum/6 > "average.txt"
print c,max > "maximum.txt"
sum = max = ""
}
Output
[akshay#localhost tmp]$ awk -f test.awk input.txt
Sum
[akshay#localhost tmp]$ cat sum.txt
1 18
2 16
3 24
Average
[akshay#localhost tmp]$ cat average.txt
1 3
2 2.66667
3 4
Maximum
[akshay#localhost tmp]$ cat maximum.txt
1 5
2 4
3 7
This side isn't meant to write whole programs, and that's basically what you're asking us to do.
What you need to do is keep track of how many lines you've read and then every 6th line you produce output. Consider something like this:
awk '{sum += $1} (NR%6)==0 {print(sum); sum=0}' input.txt
I'm not going to explain what I did, because I expect you to please search the internet for awk tutorials, and get an understanding of what I am doing yourself.
Related
I have several lines of text. I want to extract the number after specific word using awk.
I tried the following code but it does not work.
At first, create the test file by: vi test.text. There are 3 columns (the 3 fields are generated by some other pipeline commands using awk).
Index AllocTres CPUTotal
1 cpu=1,mem=256G 18
2 cpu=2,mem=1024M 16
3 4
4 cpu=12,gres/gpu=3 12
5 8
6 9
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21
Please note there are several empty fields in this file.
what I want to achieve is to extract the number after the first gres/gpu= in each line (if no gres/gpu= occurs in this line, the default number is 0) using a pipeline like: cat test.text | awk '{some_commands}' to output 4 columns:
Index AllocTres CPUTotal GPUAllocated
1 cpu=1,mem=256G 18 0
2 cpu=2,mem=1024M 16 0
3 4 0
4 cpu=12,gres/gpu=3 12 3
5 8 0
6 9 0
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20 4
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21 3
Firstly: awk do not need cat, it could read files on its' own. Combining cat and awk is generally discouraged as useless use of cat.
For this task I would use GNU AWK following way, let file.txt content be
cpu=1,mem=256G
cpu=2,mem=1024M
cpu=12,gres/gpu=3
cpu=13,gres/gpu=4,gres/gpu:ret6000=2
mem=12G,gres/gpu=3,gres/gpu:1080ti=1
then
awk 'BEGIN{FS="gres/gpu="}{print $2+0}' file.txt
output
0
0
0
3
0
0
4
3
Explanation: I inform GNU AWK that field separator (FS) is gres/gpu= then for each line I do print 2nd field increased by zero. For lines without gres/gpu= $2 is empty string, when used in arithmetic context this is same as zero so zero plus zero gives zero. For lines with at least one gres/gpu= increasing by zero provokes GNU AWK to find longest prefix which is legal number, thus 3 (4th line) becomes 3, 4, (7th line) becomes 4, 3, (8th line) becomes 3.
(tested in GNU Awk 5.0.1)
With your shown samples in GNU awk you can try following code. Written and tested in GNU awk. Simple explanation would be using awk's match function where using regex gres\/gpu=([0-9]+)(escaping / here) and creating one and only capturing group to capture all digits coming after =. Once match is found printing current line followed by array's arr's 1st element +0(to print zero in case no match found for any line) here.
awk '
FNR==1{
print $0,"GPUAllocated"
next
}
{
match($0,/gres\/gpu=([0-9]+)/,arr)
print $0,arr[1]+0
}
' Input_file
Using sed
$ sed '1s/$/\tGPUAllocated/;s~.*gres/gpu=\([0-9]\).*~& \t\1~;1!{\~gres/gpu=[0-9]~!s/$/ \t0/}' input_file
Index AllocTres CPUTotal GPUAllocated
1 cpu=1,mem=256G 18 0
2 cpu=2,mem=1024M 16 0
3 4 0
4 cpu=12,gres/gpu=3 12 3
5 8 0
6 9 0
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20 4
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21 3
awk '
BEGIN{FS="\t"}
NR==1{
$(NF+1)="GPUAllocated"
}
NR>1{
$(NF+1)=FS 0
}
/gres\/gpu=/{
split($0, a, "=")
gp=a[3]; gsub(/[ ,].*/, "", gp)
$NF=FS gp
}1' test.text
Index AllocTres CPUTotal GPUAllocated
1 cpu=1,mem=256G 18 0
2 cpu=2,mem=1024M 16 0
3 4 0
4 cpu=12,gres/gpu=3 12 3
5 8 0
6 9 0
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20 4
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21 3
I would appreciate some help with an awk script, or whatever would do the job.
So, I've got multiple files (the same amount of lines and columns) and I want to do an average of every number in every column (except the first) from all the files. I have got no idea how many columns there are in a file (though i could probably get the number if needed).
filename.1
1 1 2 3 4
2 3 4 5 6
3 2 3 5 6
filename.2
1 3 4 6 6
2 5 6 7 8
3 4 5 7 8
output
1 2 3 5 5
2 4 5 6 7
3 3 4 6 7
I've found this somewhere on here that does it for a single column (as far as I understand it
awk '{a[FNR]+=$2;b[FNR]++;}END{for(i=1;i<=FNR;i++)print i,a[i]/b[i];}' fort.*
So the only? change would be to replace the +=$2 with a cycle over all columns? Is there a way to do that without knowing the exact number of columns?
Thanks.
$ cat tst.awk
{
key[FNR] = $1
for (colNr=2; colNr<=NF; colNr++) {
sum[FNR,colNr] += $colNr
}
}
END {
for (rowNr=1; rowNr<=FNR; rowNr++) {
printf "%s%s", key[rowNr], OFS
for (colNr=2; colNr<=NF; colNr++) {
printf "%s%s", int(sum[rowNr,colNr]/ARGIND+0.5), (colNr<NF ? OFS : ORS)
}
}
}
$ awk -f tst.awk file1 file2
1 2 3 5 5
2 4 5 6 7
3 3 4 6 7
The above uses GNU awk for ARGIND, with other awks just add a line FNR==1{ARGIND++} at the start.
I have a .txt file with 25,000 lines. Each line there is a number from 1 to 20. I want to compute the total occurrence of each number in the file. I don't know should I use grep or awk and how to use it. And I'm worried about I got confused with 1 and 11, which both contain 1's. Thank you very much for helping!
I was trying but this would double count my numbers.
grep -o '1' degreeDistirbution.txt | wc -l
With grep you can match the beginning and end of a line with '^' and '$' respectively. For the whole thing I'll use an array, but to illustrate this point I'll just use one variable:
one="$(grep -c "^1$" ./$inputfile)"
then we put that together with the magic of bash loops and loop through all the numbers with a while like so:
i=1
while [[ $i -le 20 ]]
do
arr[i]="$(grep -c "^$i$" ./$inputfile)"
i=$[$i+1]
done
if you like you can of course use a for as well
An easier method is:
sort -n file | uniq -c
Which will count the occurrences of each number in the sorted file and display the results like:
$ sort -n dat/twenty.txt | uniq -c
3 1
3 2
3 3
4 4
4 5
4 6
4 7
4 8
4 9
4 10
4 11
3 12
2 13
2 14
4 15
4 16
4 17
2 18
2 19
2 20
Showing I have 3 ones, 3 twos, etc.. in the sample file.
I've been struggling to write a code for extracting every N columns from an input file and write them into output files according to their extracting order.
(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)
In a simpler case below, extracting every 3 columns(fields) from an input file with a start point of the 2nd column.
for example, if the input file looks like:
aa 1 2 3 4 5 6 7 8 9
bb 1 2 3 4 5 6 7 8 9
cc 1 2 3 4 5 6 7 8 9
dd 1 2 3 4 5 6 7 8 9
and I want the output to look like this:
output_file_1:
1 2 3
1 2 3
1 2 3
1 2 3
output_file_2:
4 5 6
4 5 6
4 5 6
4 5 6
output_file_3:
7 8 9
7 8 9
7 8 9
7 8 9
I tried this, but it doesn't work:
awk 'for(i=2;i<=10;i+a) {{printf "%s ",$i};a=3}' <inputfile>
It gave me syntax error and the more I fix the more problems coming out.
I also tried the linux command cut but while I was dealing with large files this seems effortless. And I wonder if cut would do a loop cut of every 3 fields just like the awk.
Can someone please help me with this and give a quick explanation? Thanks in advance.
Actions to be performed by awk on the input data must be included in curled braces, so the reason the awk one-liner you tried results in a syntax error is that the for cycle does not respect this rule. A syntactically correct version will be:
awk '{for(i=2;i<=10;i+a) {printf "%s ",$i};a=3}' <inputfile>
This is syntactically correct (almost, see end of this post.), but does not do what you think.
To separate the output by columns on different files, the best thing is to use awk redirection operator >. This will give you the desired output, given that your input files always has 10 columns:
awk '{ print $2,$3,$4 > "file_1"; print $5,$6,$7 > "file_2"; print $8,$9,$10 > "file_3"}' <inputfile>
mind the " " to specify the filenames.
EDITED: REAL WORLD CASE
If you have to loop along the columns because you have too many of them, you can still use awk (gawk), with two loops: one on the output files and one on the columns per file. This is a possible way:
#!/usr/bin/gawk -f
BEGIN{
CTOT = 24005 # total number of columns, you can use NF as well
DELTA = 800 # columns per file
START = 6 # first useful column
d = CTOT/DELTA # number of output files.
}
{
for ( i = 0 ; i < d ; i++)
{
for ( j = 0 ; j < DELTA ; j++)
{
printf("%f\t",$(START+j+i*DELTA)) > "file_out_"i
}
printf("\n") > "file_out_"i
}
}
I have tried this on the simple input files in your example. It works if CTOT can be divided by DELTA. I assumed you had floats (%f) just change that with what you need.
Let me know.
P.s. going back to your original one-liner, note that the loop is an infinite one, as i is not incremented: i+a must be substituted by i+=a, and a=3 must be inside the inner braces:
awk '{for(i=2;i<=10;i+=a) {printf "%s ",$i;a=3}}' <inputfile>
this evaluates a=3 at every cycle, which is a bit pointless. A better version would thus be:
awk '{for(i=2;i<=10;i+=3) {printf "%s ",$i}}' <inputfile>
Still, this will just print the 2nd, 5th and 8th column of your file, which is not what you wanted.
awk '{ print $2, $3, $4 >"output_file_1";
print $5, $6, $7 >"output_file_2";
print $8, $9, $10 >"output_file_3";
}' input_file
This makes one pass through the input file, which is preferable to multiple passes. Clearly, the code shown only deals with the fixed number of columns (and therefore a fixed number of output files). It can be modified, if necessary, to deal with variable numbers of columns and generating variable file names, etc.
(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)
In that case, you're correct; you need a loop. In fact, you need two loops:
awk 'BEGIN { gap = 800; start = 6; filebase = "output_file_"; }
{
for (i = start; i < start + gap; i++)
{
file = sprintf("%s%d", filebase, i);
for (j = i; j <= NF; j += gap)
printf("%s ", $j) > file;
printf "\n" > file;
}
}' input_file
I demonstrated this to my satisfaction with an input file with 25 columns (numbers 1-25 in the corresponding columns) and gap set to 8 and start set to 2. The output below is the resulting 8 files pasted horizontally.
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
With GNU awk:
$ awk -v d=3 '{for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3",""); print "----"}' file
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
Just redirect the output to files if desired:
$ awk -v d=3 '{sfx=0; for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3","") > ("output_file_" ++sfx)}' file
The idea is just to tell gensub() to skip the first few (i-1) fields then print the number of fields you want (d = 3) and ignore the rest (.*). If you're not printing exact multiples of the number of fields you'll need to massage how many fields get printed on the last loop iteration. Do the math...
Here's a version that'd work in any awk. It requires 2 loops and modifies the spaces between fields but it's probably easier to understand:
$ awk -v d=3 '{sfx=0; for(i=2;i<=NF;i+=d) {str=fs=""; for(j=i;j<i+d;j++) {str = str fs $j; fs=" "}; print str > ("output_file_" ++sfx)} }' file
I was successful using the following command line. :) It uses a for loop and pipes the awk program into it's stdin using -f -. The awk program itself is created using bash variable math.
for i in 0 1 2; do
echo "{print \$$((i*3+2)) \" \" \$$((i*3+3)) \" \" \$$((i*3+4))}" \
| awk -f - t.file > "file$((i+1))"
done
Update: After the question has updated I tried to hack a script that creates the requested 800-cols-awk script dynamically ( a version according to Jonathan Lefflers answer) and pipe that to awk. Although the scripts looks good (for me ) it produces an awk syntax error. The question is, is this too much for awk or am I missing something? Would really appreciate feedback!
Update: Investigated this and found documentation that says awk has a lot af restrictions. They told to use gawk in this situations. (GNU's awk implementation). I've done that. But still I'll get an syntax error. Still feedback appreciated!
#!/bin/bash
# Note! Although the script's output looks ok (for me)
# it produces an awk syntax error. is this just too much for awk?
# open pipe to stdin of awk
exec 3> >(gawk -f - test.file)
# verify output using cat
#exec 3> >(cat)
echo '{' >&3
# write dynamic script to awk
for i in {0..24005..800} ; do
echo -n " print " >&3
for (( j=$i; j <= $((i+800)); j++ )) ; do
echo -n "\$$j " >&3
if [ $j = 24005 ] ; then
break
fi
done
echo "> \"file$((i/800+1))\";" >&3
done
echo "}"
I have the following file
ENST001 ENST002 4 4 4 88 9 9
ENST004 3 3 3 99 8 8
ENST009 ENST010 ENST006 8 8 8 77 8 8
Basically I want to count how many times ENST* is repeated in each line so the expected results is
2
1
3
Any suggestion please ?
Try this (and see it in action here):
awk '{print gsub("ENST[0-9]+","")}' INPUTFILE