Linux Unique values count [closed]

I have a .csv file and I want to count the values in column 5, but only for rows where the corresponding value in column 8 is not equal to "999".
I have tried this, but I am not getting the desired output:
cat test.csv | sed "1 d" |awk -F , '$8 != 999' | cut -d, -f5 | sort | uniq | wc -l >test.txt
Note that the total number of records is more than 20K.
I am getting the number of unique values, but it is not excluding the rows whose column 8 value is 999.
Can anyone help?
Sample Input:
Col1,Col2,Col3,Col4,Col5,Col7,Col8,Col9,Col10,Col11
1,0,0,0,ABCD,0,0,5436,0,0,0
1,0,0,0,543674,0,0,18999,0,0,0
1,0,0,0,143527,0,0,1336,0,0,0
1,0,0,0,4325,0,0,999,0,0,0
1,0,0,0,MCCDU,0,0,456,0,0,0
1,0,0,0,MCCDU,0,0,190,0,0,0
1,0,0,0,4325,0,0,190,0,0,0
What I want to do is not count the value from col5 if the corresponding value in col8 == 999.
By "count total" I mean the total number of lines.
In the above sample input, the col5 value of lines 6 and 7 is the same, so I need them to be counted as one.
I need to sort because col5 values can be duplicated and I need the total number of unique values.

Script:
awk 'BEGIN {FS=","} NR > 1 && $8 != 999 {uniq[$5]++} END {for (key in uniq) {print key, uniq[key]; total += uniq[key]}; print "Total: " total}' input.csv
Output:
543674 1
143527 1
ABCD 1
MCCDU 2
4325 1
Total: 6

With an awk that supports length(array) (e.g. GNU awk and some others):
$ awk -F',' '(NR>1) && ($8!=999){vals[$5]} END{print length(vals)}' test.csv
5
With any awk:
$ awk -F',' '(NR>1) && ($8!=999) && !seen[$5]++{ cnt++ } END{print cnt+0}' test.csv
5
The +0 in the END is so you get numeric output even if the input file is empty.
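As a quick check of that empty-input behaviour (using /dev/null here to stand in for an empty file; without the +0 the same command would print a blank line instead):
$ awk -F',' '(NR>1) && ($8!=999) && !seen[$5]++{ cnt++ } END{print cnt+0}' /dev/null
0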

Related

Select only those rows from a column where column 2 has more than 2 leading zeroes in Linux [closed]

So I want to grab only the rows that have 2 or more leading zeroes in the ID column ($2) between the 5th and 10th characters. For example, column 2 has ID 156700923134, so the 5th to 10th characters (1567-009231-34) are 009231; in this case we do see the leading zeroes. However, in the second row we have 777754635373, so we grab 546353, which does not have leading zeroes. I am working on a pipe-delimited file.
Ex:
1 | 156700923134 | hkohin | 23
4 | 777754635373 | hhkdys | 45
3 | 678387700263 | ieysff | 09
Expected output: 1 | 156700923134 | hkohin | 23
--OR--
156700923134
So far I have the substrings 009231, 546353 and 877002 as output, but I don't know how to check for the leading zeroes.
This is what I used to get to the above result:
awk -F'|' '{print $2, substr($2, 5, 6) }' file.dat | head -5
The parenthesized test condition in awk can be any valid expression, so match() can be used directly as the pattern:
awk -F'|' '( match($2,"^....00") ) { print $2, substr($2, 5, 6) }' file.dat
Answer #2:
Takes more lines to be generic:
zstart=5
zcnt=3
zeros=$(eval printf '0%.0s' {1..$zcnt})
echo 'xxx|1234000890|end' |
awk -F'|' -vzstart=$zstart -vzcnt=$zcnt -vzeros="$zeros" '
### debug { print substr($2, zstart, zcnt); }
(zeros == substr($2, zstart, zcnt)) { print }'
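Applied to the original sample, a sketch along the same lines (assuming the spaces around the pipes are literal padding in the file, so they are stripped before taking the substring):
awk -F'|' '{ id = $2; gsub(/ /, "", id) } substr(id, 5, 2) == "00" { print }' file.dat
1 | 156700923134 | hkohin | 23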

Wanted to count total attemps by day awk [duplicate]

This question already has answers here:
Best way to simulate "group by" from bash?
(17 answers)
I'm trying to count the number of occurrences for a list in awk. I am able to get the total attempts for each user, but I want the total attempts by day. I have a txt file something like:
ID, Event, Date, Type, Message, API, User, Protocol, Attemps
1, ERROR, 30-NOV-20, 4, TEXT, 2, user1, GUI, 9
I used the awk below to count total attempts:
awk 'FNR == NR {count[$(NF-3)]++; next} {print $(NF-3), $3 "\t" count[$(NF-3)]}' file file
Can someone help me?
Expected output:
USER ATTEMPS DATE
user1 3 20-NOV-2020
user1 6 22-NOV-2020
user2 2 01-DEC-2020
user3 4 12-NOV-2020
user3 19 18-NOV-2020
This is not awk-only, but it should work if each input line represents one attempt and you want the count per day:
awk -F, '{print $3}' file | sort | uniq -c
Edit: To get the count per day and per user, print both fields before counting:
awk -F, '{print $3, $7}' file | sort | uniq -c
This should do it, but I couldn't test with more data:
$ awk -F', ' -v OFS='\t' '
NR==1 {print $7,$NF,$3; next}
NF {a[$7,$3]+=$NF}
END {for(k in a)
{split(k,ks,SUBSEP);
print ks[1],a[k],ks[2]}}' file
User Attemps Date
user1 9 30-NOV-20
awk -F, 'NR > 1 { map[$3]+=$NF } END { for (i in map) { print i" - "map[i] } }' file
Using GNU awk: set the field delimiter to a comma, then use the 3rd field (the date) as the index of an array map, with the value being a running total of attempts ($NF). Once all lines are processed, loop through the map array, printing each index and its value, i.e. the date and the attempt total.
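With the single sample line above this prints the following (note the leading space before the date: -F, splits on bare commas while the sample uses comma-plus-space, so use -F', ' if that matters):
$ awk -F, 'NR > 1 { map[$3]+=$NF } END { for (i in map) { print i" - "map[i] } }' file
 30-NOV-20 - 9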

Bash: Reading a CSV file and sorting column based on a condition

I am trying to read a CSV text file and print all entries of one column (sorted), based on a condition.
The input sample is as below:
Computer ID,User ID,M
Computer1,User3,5
Computer2,User5,8
computer3,User4,9
computer4,User10,3
computer5,User9,0
computer6,User1,11
The user ID (2nd column) needs to be printed if the hours (3rd column) are greater than zero, and the printed data should be sorted by user ID.
I have written the following script:
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [ $col3 -gt 0 ]
then
echo "$col2" > login.txt
fi
done < <(tail -n+2 user-list.txt)
The output of this script is:
User3
User5
User4
User10
User1
I am expecting the following output:
User1
User3
User4
User5
User10
Any help would be appreciated. TIA
awk -F, 'NR == 1 { next } $3 > 0 { match($2,/[[:digit:]]+/);map[$2]=substr($2,RSTART) } END { PROCINFO["sorted_in"]="#val_num_asc";for (i in map) { print i } }' user-list.txt > login.txt
Set the field delimiter to commas with -F,. Ignore the header with NR == 1 { next }. When the 3rd field is greater than 0, add an entry to the array map, indexed by the user, whose value is the numeric part of the User field (found with the match function). In the END block, set the traversal order to value, numeric, ascending (PROCINFO["sorted_in"] is GNU awk specific) and loop through the map array, printing each index.
The problem with your script (and, I presume, with the sorting not working) is where you redirect (and may have tried to sort): redirecting with > inside the loop truncates login.txt on every iteration, so redirect, and sort, once after the loop. The following variant of your own script does the job:
#!/bin/bash
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [ $col3 -gt 0 ]
then
echo "$col2"
fi
done < <(tail -n+2 user-list.txt) | sort > login.txt
Edit 1: To match the new requirement (numeric order of the user IDs):
Sure, we can fix the sorting: replace the final sort with sort -k1.5,1.7n > login.txt, which sorts numerically on characters 5 through 7 of the first field.
Of course, that too will only work if your user IDs are all 4 letters followed by digits ...
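In the script above, that means the final line becomes (a sketch, assuming the User<digits> naming):
done < <(tail -n+2 user-list.txt) | sort -k1.5,1.7n > login.txt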
Sort ASCIIbetically:
tail -n +2 user-list.txt | perl -F',' -lane 'print if $F[2] > 0;' | sort -t, -k2,2
computer6,User1,11
computer4,User10,3
Computer1,User3,5
computer3,User4,9
Computer2,User5,8
Or sort numerically by the user number:
tail -n +2 user-list.txt | perl -F',' -lane 'print if $F[2] > 0;' | sort -t, -k2,2V
computer6,User1,11
Computer1,User3,5
computer3,User4,9
Computer2,User5,8
computer4,User10,3
Using awk for condition handling and sort for ordering:
$ awk -F, ' # comma delimiter
FNR>1 && $3 { # skip header and accept only non-zero hours
a[$2]++ # count instances for duplicates
}
END {
for(i in a) # all stored usernames
for(j=1;j<=a[i];j++) # remove this if there are no duplicates
print i | "sort -V" # send output to sort -V
}' file
Output:
User1
User3
User4
User5
User10
If there are no duplicated usernames, you can replace a[$2]++ with just a[$2] and remove the second for loop. Also, there is no real need for sort to be inside the awk program; you could just as well pipe the data from awk to sort, like:
$ awk -F, 'FNR>1&&$3{a[$2]++}END{for(i in a)print i}' file | sort -V
FNR>1 && $3 skips the header and processes records where the hours column is non-empty and non-zero. If your data has records with negative hours and you only want positive hours, change it to FNR>1 && $3>0.
Or you could use grep with PCRE and sort:
$ grep -Po "(?<=,).*(?=,[1-9])" file | sort -V
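With the sample above, that prints:
User1
User3
User4
User5
User10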

Bash Column sum over a table of variable length

I'm trying to get the column sums (except for the first one) of a tab-delimited file containing numbers.
To find the number of columns and store it in a variable I use:
cols=$(awk '{print NF}' file.txt | sort -nu | tail -n 1)
Next I want to calculate the sum of all numbers in each column and store it in a variable, again in a for loop:
for c in 2:$col
do
num=$(cat file.txt | awk '{sum+$2 ; print $0} END{print sum}'| tail -n 1)
done
This:
num=$(cat file.txt | awk '{sum+$($c) ; print $0} END{print sum}'| tail -n 1)
works fine on its own with a fixed column number in place of $($c), but I cannot get it to accept the for-loop variable.
Thanks for the support
P.S. It would also be fine if I could sum all columns (except the first one) at once, without the loop trouble.
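For reference, a sketch of the loop approach the question is aiming for (assuming bash and the tab-delimited file.txt described above; awk -v passes the column number in, and a C-style for replaces 2:$col):
cols=$(awk '{ print NF }' file.txt | sort -nu | tail -n 1)
for ((c = 2; c <= cols; c++)); do
    num=$(awk -v col="$c" '{ sum += $col } END { print sum }' file.txt)
    echo "column $c sum: $num"
done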
Assuming you want the sums of the individual columns,
$ cat file
1 2 3 4
5 6 7 8
9 10 11 12
$ awk '
{for (i=2; i<=NF; i++) sum[i] += $i}
END {for (i=2; i<=NF; i++) printf "%d%s", sum[i], OFS; print ""}
' file
18 21 24
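One caveat: NF inside the END block still holds the field count of the last input record, so if rows can have different widths a safer variant (same idea, just remembering the widest row) is:
$ awk '
{for (i=2; i<=NF; i++) sum[i] += $i; if (NF > maxnf) maxnf = NF}
END {for (i=2; i<=maxnf; i++) printf "%d%s", sum[i], OFS; print ""}
' file
18 21 24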
In case you're not bound to awk, there's a nice tool for "command-line statistical operations" on textual files called GNU datamash.
With datamash, summing (probably the simplest operation of all) a 2nd column is as easy as:
$ datamash sum 2 < table
9
Assuming the table file holds tab-separated data like:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
To sum all columns from 2 to n, use column ranges (available since datamash 1.2):
$ n=4
$ datamash sum 2-$n < table
9 12 15
To include headers, see the --header-out option.
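For instance, a sketch with the table shown above (which has no header line of its own): --header-out labels the output columns, while -H would additionally treat the first input line as a header:
$ datamash --header-out sum 2-$n < table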

How to add number of identical line next to the line itself? [duplicate]

This question already has answers here:
Find duplicate lines in a file and count how many time each line was duplicated?
(7 answers)
I have file file.txt which look like this
a
b
b
c
c
c
I want to know the command which takes file.txt as input and produces the output:
a 1
b 2
c 3
I think uniq is the command you are looking for. The output of uniq -c is a little different from your format, but this can be fixed easily.
$ uniq -c file.txt
1 a
2 b
3 c
If you want to count the occurrences you can use uniq with -c.
If the file is not sorted you have to use sort first
$ sort file.txt | uniq -c
1 a
2 b
3 c
If you really need the line first followed by the count, swap the columns with awk
$ sort file.txt | uniq -c | awk '{ print $2 " " $1}'
a 1
b 2
c 3
You can use this awk:
awk '!seen[$0]++{ print $0, (++c) }' file
a 1
b 2
c 3
seen is an array indexed by the whole record; !seen[$0]++ is true only the first time a given line appears, so each distinct line is printed once, together with an incrementing counter c. Note that c is the position of the unique line, not the number of times it occurs (the two happen to coincide for this input).
Update: Based on a comment below, if the intent is to get the repeat count in the 2nd column, then use this awk command:
awk 'seen[$0]++{} END{ for (i in seen) print i, seen[i] }' file
a 1
b 2
c 3
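Since for (i in seen) iterates in an unspecified order, pipe through sort if the output must stay ordered (a minimal variation of the same command):
$ awk 'seen[$0]++{} END{ for (i in seen) print i, seen[i] }' file | sort
a 1
b 2
c 3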
