Bash: Reading a CSV file and sorting column based on a condition

I am trying to read a CSV text file and print all entries of one column (sorted), based on a condition.
The input sample is as below:
Computer ID,User ID,M
Computer1,User3,5
Computer2,User5,8
computer3,User4,9
computer4,User10,3
computer5,User9,0
computer6,User1,11
The user ID (2nd column) needs to be printed if the hours value (third column) is greater than zero. However, the printed data should be sorted by user ID.
I have written the following script:
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [ $col3 -gt 0 ]
then
echo "$col2" > login.txt
fi
done < <(tail -n+2 user-list.txt)
The output of this script is:
User3
User5
User4
User10
User1
I am expecting the following output:
User1
User3
User4
User5
User10
Any help would be appreciated. TIA

awk -F, 'NR == 1 { next } $3 > 0 { match($2,/[[:digit:]]+/);map[$2]=substr($2,RSTART) } END { PROCINFO["sorted_in"]="@val_num_asc";for (i in map) { print i } }' user-list.txt > login.txt
-F, - set the field delimiter to a comma
NR == 1 { next } - ignore the header
$3 > 0 { ... } - when the 3rd delimited field is greater than 0, use the user as the index of an array (map); the value is the numeric part of the user ID, found with the match function
END { ... } - set the iteration order to value, numeric, ascending via PROCINFO["sorted_in"] and loop through the map array, printing each index
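As a standalone illustration of the PROCINFO["sorted_in"] iteration order (GNU awk only; throwaway data, not the question's file):
gawk 'BEGIN {
    m["User3"] = 3; m["User10"] = 10; m["User1"] = 1
    PROCINFO["sorted_in"] = "@val_num_asc"   # iterate by value, numerically, ascending
    for (k in m) print k                     # prints User1, User3, User10
}'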

The problem with your script (and, I presume, the reason the sorting isn't working) is the place where you redirect (and may have tried to sort): redirecting inside the loop reopens login.txt on every iteration, and there is no sort in the pipeline at all. The following variant of your own script does the job:
#!/bin/bash
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [ $col3 -gt 0 ]
then
echo "$col2"
fi
done < <(tail -n+2 user-list.txt) | sort > login.txt
Edit 1: Match new requirement
Sure, we can fix the sorting: sort -k1.5,1.7n > login.txt sorts numerically on characters 5 through 7 of each output line (the digits after "User").
Of course, that, too, will only work if your user IDs are all 4 alphas and n digits ...
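Putting it together, a sketch of the full variant with that key-based sort (assuming the same user-list.txt layout as in the question):
#!/bin/bash
# Print user IDs with non-zero hours, then sort numerically on the digits after "User"
while IFS=, read -r col1 col2 col3 rest || [[ -n $col1 ]]
do
    if [ "$col3" -gt 0 ]
    then
        echo "$col2"
    fi
done < <(tail -n +2 user-list.txt) | sort -k1.5,1.7n > login.txt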

Sort ASCIIbetically:
tail -n +2 user-list.txt | perl -F',' -lane 'print if $F[2] > 0;' | sort -t, -k2,2
computer6,User1,11
computer4,User10,3
Computer1,User3,5
computer3,User4,9
Computer2,User5,8
Or sort by the embedded user number, using a version-sort key (-k2,2V) so that User10 comes after User5:
tail -n +2 user-list.txt | perl -F',' -lane 'print if $F[2] > 0;' | sort -t, -k2,2V
computer6,User1,11
Computer1,User3,5
computer3,User4,9
Computer2,User5,8
computer4,User10,3
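If only the user IDs themselves are wanted (as in the original question), printing just the second field before sorting is a small change, for example:
tail -n +2 user-list.txt | perl -F',' -lane 'print $F[1] if $F[2] > 0;' | sort -V
User1
User3
User4
User5
User10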

Using awk for condition handling and sort for ordering:
$ awk -F, ' # comma delimiter
FNR>1 && $3 { # skip header and accept only non-zero hours
a[$2]++ # count instances for duplicates
}
END {
for(i in a) # all stored usernames
for(j=1;j<=a[i];j++) # remove this if there are no duplicates
print i | "sort -V" # send output to sort -V
}' file
Output:
User1
User3
User4
User5
User10
If there are no duplicated usernames, you can replace a[$2]++ with just a[$2] and remove the latter for loop. Also, there is no real need for sort to be inside the awk program; you could just as well pipe the data from awk to sort, like:
$ awk -F, 'FNR>1&&$3{a[$2]++}END{for(i in a)print i}' file | sort -V
FNR>1 && $3 skips the header and processes records where the hours column is non-zero. If your data has records with negative hours and you only want positive hours, change it to FNR>1 && $3>0.
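For instance, a minimal sketch of that stricter variant, still piping to sort:
$ awk -F, 'FNR>1 && $3>0 {a[$2]} END {for (i in a) print i}' file | sort -V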
Or you could use grep with PCRE (the lookbehind/lookahead keep just the user field on lines whose hours start with a non-zero digit) and sort:
$ grep -Po "(?<=,).*(?=,[1-9])" file | sort -V

Linux Unique values count

I have a .csv file, and I want to count the total number of values from column 5, but only if the corresponding value of column 8 is not equal to "999".
I have tried this, but I am not getting the desired output.
cat test.csv | sed "1 d" |awk -F , '$8 != 999' | cut -d, -f5 | sort | uniq | wc -l >test.txt
Note that the total number of records is more than 20K.
I am getting the number of unique values, but it is not excluding the rows whose column 8 is 999.
Can anyone help?
Sample Input:
Col1,Col2,Col3,Col4,Col5,Col7,Col8,Col9,Col10,Col11
1,0,0,0,ABCD,0,0,5436,0,0,0
1,0,0,0,543674,0,0,18999,0,0,0
1,0,0,0,143527,0,0,1336,0,0,0
1,0,0,0,4325,0,0,999,0,0,0
1,0,0,0,MCCDU,0,0,456,0,0,0
1,0,0,0,MCCDU,0,0,190,0,0,0
1,0,0,0,4325,0,0,190,0,0,0
What I want to do is not count the value from col5 if the corresponding value from col8 == 999.
By count total I mean total lines.
In the above sample input, the col5 values of lines 6 and 7 are the same, so I need them to count as one.
I need to sort because Col5 values could be duplicated, and I need the total number of unique values.
Script:
awk 'BEGIN {FS=","} NR > 1 && $8 != 999 {uniq[$5]++} END {for(key in uniq) {total+=uniq[key]}; print "Total: "total}' input.csv
Output:
543674 1
143527 1
ABCD 1
MCCDU 2
4325 1
Total: 6
With an awk that supports length(array) (e.g. GNU awk and some others):
$ awk -F',' '(NR>1) && ($8!=999){vals[$5]} END{print length(vals)}' test.csv
5
With any awk:
$ awk -F',' '(NR>1) && ($8!=999) && !seen[$5]++{ cnt++ } END{print cnt+0}' test.csv
5
The +0 in the END is so you get numeric output even if the input file is empty.
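For comparison, the same count can be sketched as a plain pipeline, closer to the approach in the question (sort -u replaces sort | uniq); with the sample data it should also print 5:
$ tail -n +2 test.csv | awk -F',' '$8 != 999' | cut -d',' -f5 | sort -u | wc -l
5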

Bash: Reading CSV text file and finding average of rows

This is the sample input (the data has user-IDs and the number of hours spent by the user):
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
I need to read the data, find all User-IDs ending in even numbers (2,4,6,8..) and find average number of hours spent (over five days).
I wrote the following script:
hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [[ $col2 == *"2" ]]; then
#echo "$col2"
((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"4" ]]; then
#echo "$col2"
((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"6" ]]; then
#echo "$col2"
((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"8" ]]; then
#echo "$col2"
((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"10" ]]; then
#echo "$col2"
((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"
This is not a very good way of doing this. Also, the numbers aren't adding up correctly.
I am getting the following output (for the first one, User2):
27
5
I am expecting the following output:
27
5.4
What would be a better way to do it? Any help would be appreciated.
TIA
Your description is fairly imprecise, but here's an attempt primarily based on the sample output:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file
20
4
27
5.4
$2~/[24680]$/ makes sure we only look at "even" user-IDs.
for(i=3;i<=7;i++){} iterates over the day columns and adds them.
Edit 1:
Accommodating new requirement:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad
User4 4
User2 5.4
Sample data showing userIDs with even and odd endings, userID showing up more than once (eg, User2), and some non-integer values:
$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5
One awk solution to find total hours plus averages, across the 5 days, with duplicate userIDs rolled into a single set of numbers, but limited to userIDs that end in an even number:
$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt
Where:
-F ',' - use comma as input field delimiter
FNR==1 { next } - skip first line
$2 ~ /[02468]$/ - if field 2 ends in an even number
tot[$2]+=($3+$4+$5+$6+$7) - add current line's hours to array where userID is the array index; this will add up hours from multiple input lines (for same userID) into a single array cell
for (...) { print ...} - loop through array indices printing the index, total hours and average hours (total divided by 5)
The above generates:
User120 27 5.4
User2 55 11
User4 20 4
Depending on the OP's desired output, the print can be replaced with printf and the desired format string ...
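For example, a sketch assuming a tab-separated "userID average" layout is what is wanted:
$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) printf "%s\t%.1f\n", i, tot[i]/5 }' user-list.txt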
Your issue is echo "$((hoursarray[0]/5))": Bash does not have floating-point arithmetic, so it returns the integer portion only.
Easy to demonstrate:
$ hours=27
$ echo "$((hours/5))"
5
If you want to stick to Bash, you could use bc for the floating point result:
$ echo "$hours / 5.0" | bc -l
5.40000000000000000000
Or use awk, perl, python, ruby etc.
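For example, a quick awk sketch just for that division:
$ awk 'BEGIN { printf "%.1f\n", 27/5 }'
5.4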
Here is an awk you can build on. It is easily modified to your use case (which is a little unclear to me):
awk -F, 'FNR==1{print $2; next}
{arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User1 27 5.4
User2 27 5.4
User3 22 4.4
User4 20 4
User5 40 8
If you only want even users, filter for user IDs that end in any of 0,2,4,6,8:
awk -F, 'FNR==1{print $2; next}
$2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User2 27 5.4
User4 20 4
Here is your script modified a little bit (the sed strips the non-digit prefix from the user ID so the remainder can be tested for evenness, and tr turns the comma-separated hours into a sum for bc):
while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]
do
(( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
done < <(tail -n+2 list.txt)
prints:
For Computer3 User4 average is: 4.00000000000000000000
For Computer5 User2 average is: 5.40000000000000000000

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for every group of lines that has the same first two columns.
For example the data will look like,
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 1
B A 1
B A 2
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems to be working, but since I am learning to code better, I was wondering if there is a better way to go about this. Plus, my code generates many files that I will have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items needed to understand the awk program: associative arrays and fields.
If you reference an empty awk array element, it is an empty container -- ready for anything you put into it. You can use that as a counter.
You state If first two columns are equal...
The sort puts the file in order desired. The statement a[$1,$2] uses the values of the first two fields as a unique entry into an associative array.
You then state ...select top 3 based on descending order of 3rd column...
Once again, the sort put the file into the desired order, and the statement a[$1,$2]++ counts them. Now just count up to three.
awk is organized into blocks of condition { action }. The statement a[$1,$2]++<3 is true for the first three lines seen with a given first-two-column key; once more than 3 of the same pattern have been seen, it becomes false.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
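A quick throwaway illustration of the same counting idiom and the default print action:
seq 5 | awk 'c++ < 3'    # prints 1, 2, 3 with no explicit {print}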
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
A great place to start is the online book Effective AWK Programming by Arnold D. Robbins.
@Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file primarily by the first two columns and secondarily by the 3rd one in descending numeric order, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
((c++))
else
c=0
fi
if (( c < 3 )) ; then
echo $c1 $c2 $n
l1=$c1
l2=$c2
fi
done

Linux command (Calculating the sum)

I have a .txt file with the following content:
a 3
a 4
a 5
a 6
b 1
b 3
b 5
c 9
c 10
I am wondering if there is any command (no awk if possible) that can read the .txt file and give the following output (Sorted by the second column):
c 19
a 18
b 9
You can use awk piped to sort:
awk '{sums[$1] += $2} END {for (i in sums) print i, sums[i]}' file | sort -rnk2
c 19
a 18
b 9
sums[$1] += $2 adds the value of $2 into an array sums that is indexed by field #1 ($1).
sort -rnk2 reverse-sorts the awk output numerically on field 2.
You can use this code:
cat 1.txt | awk '{arr[$1]+=$2}END{for (var in arr) print var," ",arr[var]}' | sort -rnk 2
Explanation:
cat 1.txt - read the 1.txt file content
awk - a language very useful for data manipulation
{arr[$1]+=$2} - for each line in the file, increase the array item keyed by the first field by the value of the second field. The field separator is whitespace by default.
END{for (var in arr) print var," ",arr[var]} - after all lines are processed, print the array contents
sort -rnk 2 - reverse numeric sort on field 2
Non-awk solutions.
perl
perl -lane '
$sum{$F[0]} += $F[1]
} END {
$, = " ";
print $_, $sum{$_} for reverse sort {$sum{$a} <=> $sum{$b}} keys %sum
' file.txt
bash version 4
declare -A sum
while read key val; do (( sum[$key] += $val )); done < file.txt
for key in "${!sum[#]}"; do echo "$key ${sum[$key]}"; done | sort -rn -k2
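With the sample file above, that should print:
c 19
a 18
b 9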
non-awk challenge accepted
vars=$(cut -d" " -f1 nums | uniq); paste <(echo "$vars") <(cat <(sed -e 's/ /+=/' nums) <(echo "$vars" | sed 's/$/;/') | bc) | sort -k2,2nr
c 19
a 18
b 9

Count occurrences of character per line/field on Unix

Given a file with data like this (i.e. the stores.dat file):
sid|storeNo|latitude|longitude
2tt|1|-28.0372000t0|153.42921670
9|2t|-33tt.85t09t0000|15t1.03274200
What is the command that would return the number of occurrences of the 't' character per line?
eg. would return:
count lineNum
4 1
3 2
6 3
Also, to count occurrences per field, what is the command to return the following results?
eg. input of column 2 and character 't'
count lineNum
1 1
0 2
1 3
eg. input of column 3 and character 't'
count lineNum
2 1
1 2
4 3
To count occurrences of a character per line you can do:
awk -F'|' 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"") "\t" NR}' file
count lineNum
4 1
3 2
6 3
To count occurrences of a character per field/column you can do:
column 2:
awk -F'|' -v fld=2 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"",$fld) "\t" NR}' file
count lineNum
1 1
0 2
1 3
column 3:
awk -F'|' -v fld=3 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"",$fld) "\t" NR}' file
count lineNum
2 1
1 2
4 3
The gsub() function's return value is the number of substitutions made, so we use that to print the count.
NR holds the line number, so we use it to print the line number.
For counting occurrences in a particular field, we create a variable fld holding the number of the field we wish to count in.
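A quick throwaway check of the gsub() return value:
echo "latitude" | awk '{print gsub(/t/,"")}'
2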
grep -n -o "t" stores.dat | sort -n | uniq -c | cut -d : -f 1
gives almost exactly the output you want:
4 1
3 2
6 3
Thanks to @raghav-bhushan for the grep -o hint, what a useful flag. The -n flag includes the line number as well.
To count occurrences of a character per line:
$ awk -F 't' '{print NF-1, NR}' input.txt
4 1
3 2
6 3
this sets field separator to the character that needs to be counted, then uses the fact that number of fields is one greater than number of separators.
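A quick sanity check of the NF-1 idea on a throwaway string:
$ echo '2tt' | awk -F 't' '{print NF-1}'
2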
To count occurrences in a particular column, cut out that column first:
$ cut -d '|' -f 2 input.txt | awk -F 't' '{print NF-1, NR}'
1 1
0 2
1 3
$ cut -d '|' -f 3 input.txt | awk -F 't' '{print NF-1, NR}'
2 1
1 2
4 3
One possible solution using perl:
Content of script.pl:
use warnings;
use strict;
## Check arguments:
## 1.- Input file
## 2.- Char to search.
## 3.- (Optional) field to search. If blank, zero or bigger than number
## of columns, default to search char in all the line.
(@ARGV == 2 || @ARGV == 3) or die qq(Usage: perl $0 input-file char [column]\n);
my ($char,$column);
## Get values or arguments.
if ( @ARGV == 3 ) {
($char, $column) = splice @ARGV, -2;
} else {
$char = pop @ARGV;
$column = 0;
}
## Check that $char must be a non-white space character and $column
## only accept numbers.
die qq[Bad input\n] if $char !~ m/^\S$/ or $column !~ m/^\d+$/;
print qq[count\tlineNum\n];
while ( <> ) {
## Remove last '\n'
chomp;
## Get fields.
my @f = split /\|/;
## If column is a valid one, select it to the search.
if ( $column > 0 and $column <= scalar @f ) {
$_ = $f[ $column - 1];
}
## Count.
my $count = eval qq[tr/$char/$char/];
## Print result.
printf qq[%d\t%d\n], $count, $.;
}
The script accepts three parameters:
Input file
Char to search
Column to search: if the column is not a valid column number, it searches the whole line.
Running the script without arguments:
perl script.pl
Usage: perl script.pl input-file char [column]
With arguments and its output:
Here 0 is not a valid column, so it searches the whole line.
perl script.pl stores.dat 't' 0
count lineNum
4 1
3 2
6 3
Here it searches in column 1.
perl script.pl stores.dat 't' 1
count lineNum
0 1
2 2
0 3
Here it searches in column 3.
perl script.pl stores.dat 't' 3
count lineNum
2 1
1 2
4 3
'th' is not a single char.
perl script.pl stores.dat 'th' 3
Bad input
No need for awk or perl, only with bash and standard Unix utilities:
cat file | tr -c -d "t\n" | cat -n |
{ echo "count lineNum"
while read num data; do
test ${#data} -gt 0 && printf "%4d %5d\n" ${#data} $num
done; }
And for a particular column:
cut -d "|" -f 2 file | tr -c -d "t\n" | cat -n |
{ echo -e "count lineNum"
while read num data; do
test ${#data} -gt 0 && printf "%4d %5d\n" ${#data} $num
done; }
And we can even avoid tr and the cats:
echo "count lineNum"
num=1
while read data; do
new_data=${data//t/}
count=$((${#data}-${#new_data}))
test $count -gt 0 && printf "%4d %5d\n" $count $num
num=$(($num+1))
done < file
and even the cut:
echo "count lineNum"
num=1; OLD_IFS=$IFS; IFS="|"
while read -a array_data; do
data=${array_data[1]}
new_data=${data//t/}
count=$((${#data}-${#new_data}))
test $count -gt 0 && printf "%4d %5d\n" $count $num
num=$(($num+1))
done < file
IFS=$OLD_IFS
awk '{gsub("[^t]",""); print length($0),NR;}' stores.dat
The call to gsub() deletes everything in the line that is not a t; then we just print the length of what remains and the current line number.
Want to do it just for column 2?
awk 'BEGIN{FS="|"} {gsub("[^t]","",$2); print NR,length($2);}' stores.dat
You could also split the line or field on "t"; the count is then the length of the resulting array minus 1. Set the col variable to 0 for the whole line or 1 through 3 for a column:
awk -F'|' -v col=0 -v OFS=$'\t' 'BEGIN {
print "count", "lineNum"
}{
split($col, a, "t"); print length(a) - 1, NR
}
' stores.dat
$ cat -n test.txt
1 test 1
2 you want
3 void
4 you don't want
5 ttttttttttt
6 t t t t t t
$ awk '{n=split($0,c,"t")-1;if (n!=0) print n,NR}' test.txt
2 1
1 2
2 4
11 5
6 6
cat stores.dat | awk 'BEGIN {FS = "|"}; {print $1}' | awk 'BEGIN {FS = "t"}; {print NF - 1}'
Where $1 would be the column you want to count in (the second awk splits on "t", so NF - 1 is the number of t's).
perl -e 'while(<>) { $count = tr/t//; print "$count ".++$x."\n"; }' stores.dat
Another perl answer yay! The tr/t// function returns the count of the number of times the translation occurred on that line, in other words the number of times tr found the character 't'. ++$x maintains the line number count.
