This is the sample input (the data has user-IDs and the number of hours the user spent on each weekday, Monday through Friday):
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
I need to read the data, find all User-IDs ending in even numbers (2, 4, 6, 8, ...), and find the average number of hours spent (over the five days).
I wrote the following script:
hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [[ $col2 == *"2" ]]; then
#echo "$col2"
((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"4" ]]; then
#echo "$col2"
((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"6" ]]; then
#echo "$col2"
((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"8" ]]; then
#echo "$col2"
((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"10" ]]; then
#echo "$col2"
((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"
This is not a very good way of doing this. Also, the numbers aren't adding up correctly.
I am getting the following output (for the first bucket, User2):
27
5
I am expecting the following output:
27
5.4
What would be a better way to do it? Any help would be appreciated.
TIA
Your description is fairly imprecise, but here's an attempt primarily based on the sample output:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file
20
4
27
5.4
$2~/[24680]$/ makes sure we only look at "even" user-IDs.
for(i=3;i<=7;i++){} iterates over the day columns and adds them.
printf "%.2g\n" prints the average with at most two significant digits, which is why User4's average shows as 4 rather than 4.0.
Edit 1:
Accommodating new requirement:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad
User4 4
User2 5.4
Sample data showing userIDs with even and odd endings, a userID showing up more than once (e.g., User2), and some non-integer values:
$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5
One awk solution that finds total hours plus averages across the five days, with duplicate userIDs rolled into a single set of numbers, limited to userIDs that end in an even number:
$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt
Where:
-F ',' - use comma as input field delimiter
FNR==1 { next } - skip first line
$2 ~ /[02468]$/ - if field 2 ends in an even number
tot[$2]+=($3+$4+$5+$6+$7) - add current line's hours to array where userID is the array index; this will add up hours from multiple input lines (for same userID) into a single array cell
for (...) { print ...} - loop through array indices printing the index, total hours and average hours (total divided by 5)
The above generates:
User120 27 5.4
User2 55 11
User4 20 4
Depending on the OP's desired output, the print can be replaced with printf and a suitable format string ...
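For example, a hypothetical format string giving tab-separated output with one decimal place for the average could be:
awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) printf "%s\t%g\t%.1f\n", i, tot[i], tot[i]/5 }' user-list.txt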
Your issue is echo "$((hoursarray[0]/5))": Bash arithmetic does not do floating point, so it returns the integer portion only.
Easy to demonstrate:
$ hours=27
$ echo "$((hours/5))"
5
If you want to stick to Bash, you could use bc for the floating point result:
$ echo "$hours / 5.0" | bc -l
5.40000000000000000000
Or use awk, perl, python, ruby etc.
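For instance, a one-shot awk does the same division in floating point:
$ awk -v h=27 'BEGIN { printf "%.1f\n", h/5 }'
5.4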
Here is an awk you can build on. It is easily modified to your use case (which is a little unclear to me):
awk -F, 'FNR==1{print $2; next}
{arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User1 27 5.4
User2 27 5.4
User3 22 4.4
User4 20 4
User5 40 8
If you only want even users, filter for userIDs that end in any of 0, 2, 4, 6, 8:
awk -F, 'FNR==1{print $2; next}
$2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User2 27 5.4
User4 20 4
Here is your script modified a little bit:
while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]   # col3 soaks up all five hour fields
do
    # sed strips the non-digit prefix from the userID; for even IDs the
    # arithmetic test (( ... % 2 )) fails, so the part after || runs.
    # tr turns the comma-separated hours into an addition expression for bc.
    (( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
done < <(tail -n+2 list.txt)
prints:
For Computer3 User4 average is: 4.00000000000000000000
For Computer5 User2 average is: 5.40000000000000000000
I am not sure how to ignore the missing data here.
My ;-separated file looks like this (written with spaces so that it is readable):
Col1 Col2 Col3 Col4 Col5
12 a ? ? ?
1 b ? ? ?
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
and so on....
I want to fetch the records with the highest values of column 3, ignoring the rows with missing data:
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
I sorted the file on column3 in reverse order. Not sure how to proceed further.
sort -t';' -k3 -r original.txt > newfile.txt
Something great about the command line is that you can easily use the best tool for each job, either chaining output with pipes (|) or creating temporary files like newfile.txt.
In this case, using sort is the apt choice for sorting your data. Once it's sorted, you can use a separate tool that's very efficient at parsing data, awk.
Starting from your sorted newfile.txt, this awk operation will only print lines with 5 fields (assuming your missing data is actually missing and there are no empty separators, e.g. your line looks like 45;c; rather than 45;c;;;):
awk -F';' 'NF == 5 { print }' newfile.txt
However, in the case that the empty fields are delimited (e.g. 45;c;;;), and assuming that only columns 3 through 5 may have missing data, this will handle it:
awk -F';' 'NF == 5 && $3 && $4 && $5 { print }' newfile.txt
Note that since the default behavior of awk is to print, the above { print } is actually unnecessary, but included pedagogically.
Thus, from start to finish, you can get your desired result with:
sort -t ';' -rk3 original.txt | awk -F';' 'NF==5 && $3 && $4 && $5' > result.txt
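Note that the sample data in the question actually uses ? as a placeholder rather than leaving the field empty; a non-empty string like ? is true in awk, so the $3 && $4 && $5 test would not drop those rows. If that is your format, an explicit comparison (a sketch, assuming every missing value is written as ?) handles it:
sort -t ';' -rk3 original.txt | awk -F';' '$3 != "?" && $4 != "?" && $5 != "?"' > result.txt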
You can use the following command:
$ head -1 fileToSort; (sort -k3 -n -r <(tail -n +2 fileToSort) | head)
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
1 b ? ? ?
12 a ? ? ?
where fileToSort is
$ cat fileToSort
Col1 Col2 Col3 Col4 Col5
12 a ? ? ?
1 b ? ? ?
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
Explanations:
Use -t';' if your field separator is ;
<(tail -n +2 fileToSort) will exclude the header of the input file
You then sort it via the 3rd key in reverse and numeric mode using -n
head will limit the output to the first 10 records
head -1 fileToSort; will print the header line before printing the top 10 records
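If your real file is ;-separated, the same pipeline with the separator flag might look like this (a sketch; sort then keys on the third ;-delimited field):
$ head -1 fileToSort; sort -t';' -k3 -n -r <(tail -n +2 fileToSort) | head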
If you need an awk solution:
awk 'NR==1;NF == 5 && $3~/^[0-9]+(\.[0-9]+)?$/ && $4~/^[0-9]+(\.[0-9]+)?$/ && $5~/^[0-9]+(\.[0-9]+)?$/{buff[$3]=$0}END{n=asorti(buff,out); for (i = n; i >= 1; i--){print buff[out[i]]}}' fileToSort
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
You might need to add -F';' just after the awk command if your file uses ; instead of spaces; the command then becomes awk -F';' ...
NR==1; print the first line
NF == 5 && $3~/^[0-9]+(\.[0-9]+)?$/ && $4~/^[0-9]+(\.[0-9]+)?$/ && $5~/^[0-9]+(\.[0-9]+)?$/ checks that you have 5 fields and that the values of the 3 last columns are numerical (with an optional decimal part)
{buff[$3]=$0} save each line in a buffer indexed by the col3 value
END{n=asorti(buff,out); for (i = n; i >= 1; i--){print buff[out[i]]}} at the end of processing, sorts the array by its indices (the col3 values) and prints the lines in reverse order.
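One caveat: asorti is a GNU awk extension, and by default it compares the indices as strings, which works for this sample but would place a value like 10.5 after 6.24 in the descending output. gawk 4.0+ accepts a predefined sort order as a third argument, so a numerically descending variant of the END block could be:
END{n=asorti(buff, out, "@ind_num_desc"); for (i = 1; i <= n; i++) print buff[out[i]]}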
I have a file with 3 columns like this
Col1 Col2 Col3
A B <-
C D ->
E F ->
I want to swap the entries of Col1 and Col2 whenever there is <- in the third column. I want my output file to be like
Col1 Col2 Col3
B A ->
C D ->
E F ->
awk '($3=="<-"){$3=$2;$2=$1;$1=$3;$3="->"}1' <file>
Essentially, if $3=="<-", then swap the columns and redefine $3. Then print.
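The same swap written with an explicit scratch variable (t is just a temporary name) may be easier to read:
awk '$3=="<-"{t=$1; $1=$2; $2=t; $3="->"}1' <file>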
A short awk example is
awk '{if (match($3,"<-")){print $2,$1,$3}else{print $1,$2,$3}}' foooo
where foooo is the file name.
If you also want to change the "<-" then the code would be
awk '{if (match($3,"<-")){print $2,$1,"->"}else{print $1,$2,$3}}' foooo
This is a great example of how to solve the problem if I want to print differences between subsequent lines of a single column:
awk 'NR>1{print $1-p} {p=$1}' file
But how would I do it if I have an unknown number of columns in the file and I want the differences for all of them, e.g. (note that the number of columns is not necessarily 3; it can be 10, 15, or more):
col1 col2 col3
---- ---- ----
1 3 2
2 4 10
1 9 -3
. . .
the output would be:
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
. . .
Instead of saving the first column, save the entire line; you can then split it and print the differences using a loop:
awk 'NR>1{for(i=1;i<=NF;i++) printf "%d ", $i - a[i]; print ""}
{split($0, a)}' file
If you need the column titles, you can print them from a BEGIN block, as in the sketch below.
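A sketch, assuming the input file contains only the numeric rows (the titles here are made up, since the real ones are unknown):
awk 'BEGIN{print "col1", "col2", "col3"}
NR>1{for(i=1;i<=NF;i++) printf "%d ", $i - a[i]; print ""}
{split($0, a)}' file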
$ awk 'NR<3; NR>3{for (i=1;i<=NF;i++) printf "%d%s", $i-p[i], (i<NF?OFS:ORS)} {split($0,p)}' file | column -t
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
I have two text files and I want to compare their corresponding values by row and column. Each value (field) in the text files is separated by tabs.
Here are the files:
file1.txt
Name Col1 Col2 Col3
-----------------------
row1 1 4 7
row2 2 5 8
row3 3 6 9
file2.txt
Name Col1 Col2 Col3
-----------------------
row2 1 4 11
row1 2 5 12
row3 3 9
Here is the code I have so far:
awk '
FNR < 2 {next}
FNR == NR {
for (i = 2; i <= NF; i++) {
a[i,$1] = $i;
}
b[$1]; # remember which rows file1 has, so ($1 in b) below can match
next;
}
# only compare if a row in file2 exists in file1
($1 in b) {
for (i = 2; i <= NF; i++)
{
if (a[i,$1] == $i)
{
print "EQUAL"
}
else if ( //condition that checks if value is null// )
{
print "NULL"
}
else
{
print "NOT EQUAL"
}
}
}' file1.txt file2.txt
I am having difficulties with checking if there is a null value (row3 and col2 in file2.txt) in file2.txt. I don't even get an output for that null value. So far I tried if ($i == "") and it is still not giving me any output. Any suggestions? Thanks. (I'm using gnu awk in a bash script)
Let me know if further explanation is required.
Just set the FS to tab:
awk -F'\t' '....'
With the default FS, awk splits on any run of spaces and tabs, so the empty Col2 field in row3 of file2.txt disappears and the remaining fields shift left; your $i == "" test can then never match. With a tab FS, two adjacent tabs delimit an empty field that does compare equal to "".
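A quick demonstration of the difference, feeding one tab-separated line (with an empty third field) to awk both ways:
$ printf 'row3\t3\t\t9\n' | awk '{print NF}'
3
$ printf 'row3\t3\t\t9\n' | awk -F'\t' '{print NF; if ($3 == "") print "NULL"}'
4
NULL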