Bash: While-read loop skips the last line of comma-separated text file - linux

I am trying to read some comma-separated data from a text file, parse it, and calculate the average of column 5.
The input is in the following form:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
I am using the following script for this:
hours = 0
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7
do
((hours = hours + col5))
echo "$col1, $col2"
done < <(tail -n+2 user-list.txt)
echo "$hours"
The problem with the script is that it does not read/parse the last line of the text file.
What can I do about that?
Also, every time I run the script, the value of hours keeps on increasing (probably the previous value is stored). How can the value be reset to zero every time the script runs?
TIA

The following code worked for me:
hours=0
#echo "\n" >> user-list.txt
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
((hours = hours + col5))
#echo "$col1, $col2"
done < <(tail -n+2 user-list.txt)
((hours = hours/10))
echo "$hours"

Related

Bash: Reading CSV text file and finding average of rows

This is the sample input (the data has user-IDs and the number of hours spent by the user):
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
I need to read the data, find all User-IDs ending in even numbers (2, 4, 6, 8, ...) and find the average number of hours spent (over five days).
I wrote the following script:
hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [[ $col2 == *"2" ]]; then
#echo "$col2"
((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"4" ]]; then
#echo "$col2"
((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"6" ]]; then
#echo "$col2"
((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"8" ]]; then
#echo "$col2"
((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"10" ]]; then
#echo "$col2"
((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"
This is not a very good way of doing this. Also, the numbers aren't adding up correctly.
I am getting the following output (for the first one, User2):
27
5
I am expecting the following output:
27
5.4
What would be a better way to do it? Any help would be appreciated.
TIA
Your description is fairly imprecise, but here's an attempt primarily based on the sample output:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file
20
4
27
5.4
$2~/[24680]$/ makes sure we only look at "even" user-IDs.
for(i=3;i<=7;i++){} iterates over the day columns and adds them.
Edit 1:
Accommodating new requirement:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad
User4 4
User2 5.4
Sample data showing userIDs with even and odd endings, a userID showing up more than once (e.g., User2), and some non-integer values:
$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5
One awk solution to find total hours plus averages, across 5 days, with duplicate userIDs rolled into a single set of numbers, but limited to userIDs that end in an even number:
$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt
Where:
-F ',' - use comma as input field delimiter
FNR==1 { next } - skip first line
$2 ~ /[02468]$/ - if field 2 ends in an even number
tot[$2]+=($3+$4+$5+$6+$7) - add current line's hours to array where userID is the array index; this will add up hours from multiple input lines (for same userID) into a single array cell
for (...) { print ...} - loop through array indices printing the index, total hours and average hours (total divided by 5)
The above generates:
User120 27 5.4
User2 55 11
User4 20 4
Depending on the OP's desired output, the print can be replaced with printf and a suitable format string, for example as sketched below.
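A possible formatted variant (the format string here is just for illustration, not taken from the question):
awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) printf "%s\t%g\t%.1f\n", i, tot[i], tot[i]/5 }' user-list.txt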
Your issue is echo "$((hoursarray[0]/5))": Bash arithmetic does not support floating point, so it returns the integer portion only.
Easy to demonstrate:
$ hours=27
$ echo "$((hours/5))"
5
If you want to stick to Bash, you could use bc for the floating point result:
$ echo "$hours / 5.0" | bc -l
5.40000000000000000000
Or use awk, perl, python, ruby etc.
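For instance, the same division done in awk (just an illustration):
$ awk -v h=27 'BEGIN { printf "%.1f\n", h/5 }'
5.4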
Here is an awk script you can pick apart. It is easily modified to your use case (which is a little unclear to me):
awk -F, 'FNR==1{print $2; next}
{arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User1 27 5.4
User2 27 5.4
User3 22 4.4
User4 20 4
User5 40 8
If you only want even users, filter for user IDs that end in any of 0, 2, 4, 6, 8:
awk -F, 'FNR==1{print $2; next}
$2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User2 27 5.4
User4 20 4
Here is your script modified a little bit:
while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]
do
(( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
done < <(tail -n+2 list.txt)
prints:
For Computer3 User4 average is: 4.00000000000000000000
For Computer5 User2 average is: 5.40000000000000000000
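A slightly expanded sketch of the same idea, with the steps spelled out (the variable names here are illustrative, not from the original):
while IFS=, read -r computer user hours || [[ -n $computer ]]
do
    num=$(sed 's/[^[:digit:]]*//' <<<"$user")             # strip the leading non-digits, e.g. "User4" -> "4"
    if (( num % 2 == 0 )); then                           # keep only user IDs ending in an even digit
        avg=$(echo "($(tr , + <<<"$hours"))/5" | bc -l)   # "0,8,0,8,4" -> "(0+8+0+8+4)/5"
        echo "For $computer $user average is: $avg"
    fi
done < <(tail -n+2 list.txt)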

Bash: Reading a CSV file and sorting column based on a condition

I am trying to read a CSV text file and print all entries of one column (sorted), based on a condition.
The input sample is as below:
Computer ID,User ID,M
Computer1,User3,5
Computer2,User5,8
computer3,User4,9
computer4,User10,3
computer5,User9,0
computer6,User1,11
The user-ID (2nd column) needs to be printed if the hours (third column) is greater than zero. However, the printed data should be sorted based on the user-id.
I have written the following script:
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [ $col3 -gt 0 ]
then
echo "$col2" > login.txt
fi
done < <(tail -n+2 user-list.txt)
The output of this script is:
User3
User5
User4
User10
User1
I am expecting the following output:
User1
User3
User4
User5
User10
Any help would be appreciated. TIA
awk -F, 'NR == 1 { next } $3 > 0 { match($2,/[[:digit:]]+/);map[$2]=substr($2,RSTART) } END { PROCINFO["sorted_in"]="#val_num_asc";for (i in map) { print i } }' user-list.txt > login.txt
Set the field delimiter to commas with -F,.
Ignore the header with NR == 1 { next }.
Set an index of the array map to the user when the 3rd delimited field is greater than 0; the value is set to the number part of the User field (found with the match function).
In the END block, set the sort order to value, numeric, ascending, and loop through the map array, printing each index.
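If you are not on GNU awk (PROCINFO["sorted_in"] is gawk-specific), a sketch of the same idea using an external sort instead (assuming a sort that supports -V):
awk -F, 'NR == 1 { next } $3 > 0 { print $2 }' user-list.txt | sort -V > login.txt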
The problem with your script (and, I presume, with the "sorting isn't working") is the place where you redirect (and may have tried to sort): > inside the loop truncates login.txt on every iteration, so only the last match survives. The following variant of your own script does the job:
#!/bin/bash
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [ $col3 -gt 0 ]
then
echo "$col2"
fi
done < <(tail -n+2 user-list.txt) | sort > login.txt
Edit 1: Match new requirement
Sure, we can fix the sorting: sort -k1.5,1.7n > login.txt
Of course, that, too, will only work if your user IDs are all 4 alphabetic characters followed by digits ...
Sort ASCIIbetically:
tail -n +2 user-list.txt | perl -F',' -lane 'print if $F[2] > 0;' | sort -t, -k2,2
computer6,User1,11
computer4,User10,3
Computer1,User3,5
computer3,User4,9
Computer2,User5,8
Or sort by the trailing user number (version sort):
tail -n +2 user-list.txt | perl -F',' -lane 'print if $F[2] > 0;' | sort -t, -k2,2V
computer6,User1,11
Computer1,User3,5
computer3,User4,9
Computer2,User5,8
computer4,User10,3
Using awk for condition handling and sort for ordering:
$ awk -F, ' # comma delimiter
FNR>1 && $3 { # skip header and accept only non-zero hours
a[$2]++ # count instances for duplicates
}
END {
for(i in a) # all stored usernames
for(j=1;j<=a[i];j++) # remove this if there are no duplicates
print i | "sort -V" # send output to sort -V
}' file
Output:
User1
User3
User4
User5
User10
If there are no duplicated usernames, you can replace a[$2]++ with just a[$2] and remove the latter for loop. Also, there is no real need for sort to be inside the awk program; you could just as well pipe the data from awk to sort, like:
$ awk -F, 'FNR>1&&$3{a[$2]++}END{for(i in a)print i}' file | sort -V
FNR>1 && $3 skips the header and processes records where the hours column is not zero or empty. If your data has records with negative hours and you only want positive hours, change it to FNR>1 && $3>0, as shown below.
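For instance, that positive-hours variant of the short pipeline would look like:
$ awk -F, 'FNR>1 && $3>0 {a[$2]++} END{for(i in a) print i}' file | sort -V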
Or you could use grep with PCRE and sort:
$ grep -Po "(?<=,).*(?=,[1-9])" file | sort -V

Linux: I want to fetch the top 10 records of the column3. The column has some missing data. I have sorted the file

I am not sure how to ignore the missing data here.
My ;-separated file looks like this (written with spaces so that it is readable):
Col1 Col2 Col3 Col4 Col5
12 a ? ? ?
1 b ? ? ?
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
and so on....
I want to fetch the records with the maximum values of column 3:
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
I sorted the file on column3 in reverse order. Not sure how to proceed further.
sort -t';' -k3 -r original.txt > newfile.txt
Something great about the command line is that you can easily use the best tool for the proper application, either chaining output with pipes | or by creating temporary files like newfile.txt.
In this case, using sort is the apt choice for sorting your data. Once it's sorted, you can use a separate tool that's very efficient at parsing data, awk.
Starting from your sorted newfile.txt, this awk operation will only print lines with 5 fields (assuming your missing data is actually missing and there are no empty separators, e.g. your line looks like 45;c; rather than 45;c;;;):
awk -F';' 'NF == 5 { print }' newfile.txt
However, in the case that the empty fields are delimited (e.g. 45;c;;;), and assuming that only columns 3 through 5 may have missing data, this will handle it:
awk -F';' 'NF == 5 && $3 && $4 && $5 { print }' newfile.txt
Note that since the default behavior of awk is to print, the above { print } is actually unnecessary, but included pedagogically.
Thus, from start to finish, you can get your desired result with:
sort -t ';' -rk3 original.txt | awk -F';' 'NF==5 && $3 && $4 && $5' > result.txt
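Since the question asks for the top 10 records, you would presumably also cap the output with head, e.g. (a sketch; 10 lines happens to be head's default anyway):
sort -t ';' -rk3 original.txt | awk -F';' 'NF==5 && $3 && $4 && $5' | head -n 10 > result.txt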
You can use the following command:
$ head -1 fileToSort; (sort -k3 -n -r <(tail -n +2 fileToSort) | head)
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
1 b ? ? ?
12 a ? ? ?
where fileToSort is
cat fileToSort
Col1 Col2 Col3 Col4 Col5
12 a ? ? ?
1 b ? ? ?
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
Explanations:
Use -t';' if your field separator is ;
<(tail -n +2 fileToSort) will exclude the header of the input file
You then sort it on the 3rd key in reverse (-r) and numeric (-n) mode
head will limit the output at the first 10 records
head -1 fileToSort; will print the header line before printing the top 10 records
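For a ;-separated file, the same approach could be written as (a sketch):
$ head -1 fileToSort; (sort -t';' -k3 -n -r <(tail -n +2 fileToSort) | head)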
If you need an awk solution:
awk 'NR==1;NF == 5 && $3~/^[0-9]+(\.[0-9]+)?$/ && $4~/^[0-9]+(\.[0-9]+)?$/ && $5~/^[0-9]+(\.[0-9]+)?$/{buff[$3]=$0}END{n=asorti(buff,out); for (i = n; i >= 1; i--){print buff[out[i]]}}' fileToSort
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
You might need to add -F';' just after the awk command if your file uses ; instead of spaces. Your command will then become: awk -F';' ...
NR==1; print the first line
NF == 5 && $3~/^[0-9]+(\.[0-9]+)?$/ && $4~/^[0-9]+(\.[0-9]+)?$/ && $5~/^[0-9]+(\.[0-9]+)?$/ check that you have 5 fields and that the values of the last 3 columns are numerical (an integer or decimal number)
{buff[$3]=$0} save each line in a buffer indexed by the col3 value
END{n=asorti(buff,out); for (i = n; i >= 1; i--){print buff[out[i]]}} at the end of the processing, sort the array indices with asorti (a GNU awk function) and print the saved lines in reverse (descending) order.
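Note that asorti as used here compares the col3 values as strings, which happens to work for this sample. With GNU awk 4.0 or later you can pass a sort specification as a third argument so the indices compare numerically in descending order, which also removes the need for the manual reverse loop; a sketch of just the END block:
END{n = asorti(buff, out, "@ind_num_desc"); for (i = 1; i <= n; i++) print buff[out[i]]}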

Awk: Sum up column values across multiple files with identical column layout

I have a number of files with the same header:
COL1, COL2, COL3, COL4
You can ignore COL1-COL3. COL4 contains a number. Each file contains about 200 rows. I am trying to sum up COL4 across the corresponding rows of the files. For example:
File 1
COL1 COL2 COL3 COL4
x y z 3
a b c 4
File 2
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Then a new file is returned:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
Is there a simple way to do this without AWK? I will use AWK if need be, I just thought there might be a simple one-liner that I could just run right away. The AWK script I have in mind feels a bit long.
Thanks
Combining paste with awk, as in Kristo Mägi's answer, is your best bet:
paste merges the corresponding lines from the input files,
which sends a single stream of input lines to awk, with each input line containing all fields to sum up.
Assuming a fixed number of input files and columns, Kristo's answer can be simplified to (making processing much more efficient):
paste file1 file2 | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
Note: The above produces space-separated output columns, because awk's default value for OFS, the output-field separator, is a single space.
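If you prefer tab-separated output, OFS can be set explicitly, for example:
paste file1 file2 | awk -v OFS='\t' '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'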
Assuming that all files have the same column structure and line count, below is a generalization of the solution, which:
generalizes to more than 2 input files (and more than 2 data rows)
generalizes to any number of fields, as long as the field to sum up is the last one.
#!/bin/bash
files=( file1 file2 ) # array of input files
paste "${files[#]}" | awk -v numFiles=${#files[#]} -v OFS='\t' '
{
row = sep = ""
for(i=1; i < NF/numFiles; ++i) { row = row sep $i; sep = OFS }
sum = $(NF/numFiles) # last header col. / (1st) data col. to sum
if (NR > 1) { for(i=2; i<=numFiles; ++i) sum += $(NF/numFiles * i) } # add other cols.
printf "%s%s%s\n", row, OFS, sum
}
'
Note that \t (the tab char.) is used to separate output fields and that, due to relying on awk's default line-splitting into fields, preserving the exact input whitespace between fields is not guaranteed.
If all files have the same header - awk solution:
awk '!f && FNR==1{ f=1; print $0 }FNR>1{ s[FNR]+=$NF; $NF=""; r[FNR]=$0 }
END{ for(i=2;i<=FNR;i++) print r[i],s[i] }' File[12]
The output (for 2 files):
COL1 COL2 COL3 COL4
x y z 8
a b c 14
This approach can be applied to multiple files (in that case you may specify globbing File* for filename expansion)
One more option.
The command:
paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
The result:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
What it does:
Test files:
$ cat f1.txt
COL1 COL2 COL3 COL4
x y z 3
a b c 4
$ cat f2.txt
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Command: paste f{1,2}.txt
Joins 2 files and gives output:
COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
x y z 3 x y z 5
a b c 4 a b c 10
Command: sed '1d'
Is meant to remove the header temporarily
Command: awk '{print $1,$2,$3,$4+$8}'
Returns COL1-COL3 and sums $4 and $8 from the paste result.
Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
Adds header back
EDIT:
Following @mklement0's comment: he is right about the header handling, as I forgot the NR==1 part.
So I'll include his updated version here as well:
paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
You state you have "a number of files", i.e., more than 2.
Given these 3 files (and should work with any number):
$ cat f1 f2 f3
COL1 COL2 COL3 COL4
x y z 3
a b c 4
COL1 COL2 COL3 COL4
x y z 5
a b c 10
COL1 COL2 COL3 COL4
x y z 10
a b c 15
You can do:
$ awk 'FNR==1{next}
{sum[$1]+=$4}
END{print "COL1 COL4";
for (e in sum) print e, sum[e]} ' f1 f2 f3
COL1 COL4
x 18
a 29
It is unclear what you intend to do with COL2 or COL3, so I did not add that.
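If you did want to carry COL2 and COL3 along (assuming they are identical across the files for corresponding rows, as in the samples), one sketch keeps all three leading columns as the array key:
$ awk 'FNR==1{next}
       {key = $1 FS $2 FS $3; sum[key]+=$4}
       END{print "COL1 COL2 COL3 COL4"; for (k in sum) print k, sum[k]}' f1 f2 f3
Note that the for (k in sum) traversal order is unspecified, so the data rows may come out in any order.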
$ awk '
NR==1 { print }
{ sum[FNR]+=$NF; sub(/[^[:space:]]+[[:space:]]*$/,""); pfx[FNR]=$0 }
END { for(i=2;i<=FNR;i++) print pfx[i] sum[i] }
' file1 file2
COL1 COL2 COL3 COL4
x y z 8
a b c 14
The above will work robustly and efficiently with any awk on any UNIX system, with any number of input files and with any contents of those files. The only potential problem would be that it has to retain the equivalent of 1 of those files in memory so if each file was absolutely massive then you may exhaust available memory.

awk difference between subsequent lines

This is a great example of how to solve the problem if I want to print differences between subsequent lines of a single column:
awk 'NR>1{print $1-p} {p=$1}' file
But how would I do it if I have an unknown number of columns in the file and I want the differences for all of them, e.g. (note that the number of columns is not necessarily 3; it can be 10, 15, or more):
col1 col2 col3
---- ---- ----
1 3 2
2 4 10
1 9 -3
. . .
the output would be:
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
. . .
Instead of saving the first column, save the entire line; you will then be able to split it and print the differences using a loop:
awk 'NR>1{for(i=1;i<=NF;i++) printf "%d ", $i - a[i] ; print ""}
{p=split($0, a)}' file
If you need the column titles, you can print them using a BEGIN block, as sketched below.
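For example, assuming the input contains only the numeric rows (no header of its own), a sketch with the titles printed from BEGIN:
awk 'BEGIN{print "col1 col2 col3"; print "---- ---- ----"}
     NR>1{for(i=1;i<=NF;i++) printf "%d ", $i - a[i]; print ""}
     {split($0, a)}' file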
$ awk 'NR<3; NR>3{for (i=1;i<=NF;i++) printf "%d%s", $i-p[i], (i<NF?OFS:ORS)} {split($0,p)}' file | column -t
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
