How to count number of unique values of a field in a tab-delimited text file? - linux

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So, it's like the first column has colors, and I want to know how many different unique values there are in that column. I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of each of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green etc. coloured objects there are.

You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 gives the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique values you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
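The chain above reports how many distinct values a column has. For the updated question, which asks how often each value occurs, one possible variation swaps wc -l for uniq -c. The sketch below is only an illustration: the function name count_values and the "count, tab, value" output layout are assumptions, not part of the original answer.

```shell
# count_values FILE COLUMN  (hypothetical helper name)
# Prints "count<TAB>value" for every distinct value in the given
# tab-delimited column, most frequent value first.
count_values() {
    cut -f "$2" "$1" |
        sort |
        uniq -c |
        sort -k1,1nr |
        awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}
```

For example, count_values input_file 1 summarises the colours in the first column. The final awk step only strips the padding that uniq -c puts in front of each count.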

awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv

You can use awk, sort & uniq to do this, for example to list all the unique values in the first column
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l

Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk -F'\t' '{print $4}' | sort | uniq
Where the fields of the example line are:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold

# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l

Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
# Number of columns = number of tabs in the first line + 1
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 1))
for ((i=1; i <= cols; i++))
do
echo Column $i ::
cut -f $i < "$FILE" | sort | uniq -c
echo
done

This script outputs every unique value in each column of a given file together with its count. It assumes that the first line of the file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter to this script. Note that the arrays of arrays used here (arr[...][...]) require GNU awk (gawk) 4.0 or later.
Code
#!/bin/bash
awk '
(NR==1){
for(fi=1; fi<=NF; fi++)
fname[fi]=$fi;
}
(NR!=1){
for(fi=1; fi<=NF; fi++)
arr[fname[fi]][$fi]++;
}
END{
for(fi=1; fi<=NF; fi++){
out=fname[fi];
for (item in arr[fname[fi]])
out=out"\t"item"_"arr[fname[fi]][item];
print(out);
}
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66
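Where gawk is not available, a similar per-column summary can be sketched with plain POSIX awk by using a combined (column, value) array key instead of arrays of arrays. This variant is an assumption-laden illustration rather than the answer's script: it prints only the number of distinct values per column, and the name column_summary is made up for the example.

```shell
# column_summary FILE  (hypothetical helper name)
# For each column of a tab-delimited file with a header row, prints
# "header<TAB>number_of_distinct_values". Uses only POSIX awk features.
column_summary() {
    awk -F '\t' '
        NR == 1 {
            ncols = NF
            for (i = 1; i <= NF; i++) name[i] = $i
            next
        }
        {
            for (i = 1; i <= NF; i++)
                if (!seen[i, $i]++) distinct[i]++   # first time this value shows up in column i
        }
        END {
            for (i = 1; i <= ncols; i++) print name[i] "\t" (distinct[i] + 0)
        }
    ' "$1"
}
```

The seen[i, $i] key keeps the values of different columns apart, so the same string appearing in two columns is counted once for each of them.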

Related

grep for a substring

I have a file that has the following user names in random places in the file:
albert#ghhdh
albert#jdfjgjjg
john#jfkfeie
mike#fjfkjf
bill#fjfj
bill#fkfkfk
Usernames are the names to the left of the # symbol.
I want to use unix commands to grep the file for usernames, then make a count of unique usernames.
Therefore using the example above, the output should state that there are 4 unique users (I just need the count as the output, no words)
Can someone help me determine the correct count?
You could extract the words before #, sort them and count them :
cat test.txt | cut -d '#' -f 1 | sort | uniq -c
With test.txt :
albert#ghhdh
john#jfkfeie
bill#fjfj
mike#fjfkjf
bill#fkfkfk
albert#jdfjgjjg
It outputs :
2 albert
2 bill
1 john
1 mike
Note that the duplicate usernames don't have to be grouped in the input list.
If you're just interested in the count of uniq users :
cat test.txt | cut -d '#' -f 1 | sort -u | wc -l
# => 4
Or shorter :
cut -d '#' -f 1 test.txt | sort -u | wc -l
Here is the solution that finds the usernames anywhere on the line (not just at the beginning), even if there are multiple usernames on a single line, and finds their unique count:
grep -oE '\b[[:alpha:]_][[:alnum:]_.]*#' file | cut -f1 -d# | sort -u | wc -l
-o only fetches the matched portion
-E processes extended regex
\b[[:alpha:]_][[:alnum:]_.]*# matches usernames (a string following a word boundary \b that starts with an alpha or underscore, followed by zero or more alphanumeric and other permitted characters, ending with a #)
cut -f1 -d# extracts the username portion which is then sorted and counted for unique names
Faster with one awk command, if awk is allowed:
awk -F"#" '!seen[$1]++{c++}END{print "Unique users =" c}' file
Small explanation:
Using # as the delimiter (-F), the username is field 1 ($1 in awk).
Each time a field 1 is seen for the first time, we increase the counter c.
At the same time we increment that name's entry in seen, so if the name is found again the "not seen" test fails.
At the end we just print the counter of unique names.
As a plus, this solution does not require pre-sorting. Duplicates would be found even if file is not sorted.
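Since the question asks for the bare count with no words, the same !seen[$1]++ idiom can be wrapped as below. This is a minimal sketch; the function name count_unique_users is an assumption made for the example.

```shell
# count_unique_users FILE  (hypothetical helper name)
# Prints only the number of distinct values found before the first "#".
count_unique_users() {
    awk -F '#' '
        !seen[$1]++ { c++ }    # true only the first time a name is met
        END { print c + 0 }    # "+ 0" prints 0 rather than an empty line for empty input
    ' "$1"
}
```

The post-increment returns the old value of seen[$1], so the pattern is true exactly once per distinct name, whatever the order of the lines.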

Count lines and group by prefix word

I want to count number of lines in a document and group it by the prefix word. Prefix is a set of alphanumeric characters delimited by first underscore. I don't care much about sorting them but it would be nice to list them descending by number of occurrences.
The file looks like this:
prefix1_data1
prefix1_data2_a
differentPrefix_data3
prefix1_data2_b
differentPrefix_data5
prefix2_data4
differentPrefix_data5
The output should be the following:
prefix1 3
differentPrefix 3
prefix2 1
I already did this in Python, but I am curious whether it is possible to do this more efficiently using the command line or a bash script. The uniq command has -c and -w options, but the length of the prefix may vary.
The solution using combination of sed, sort and uniq commands:
sed -rn 's/^([^_]+)_.*/\1/p' testfile | sort | uniq -c
The output:
3 differentPrefix
3 prefix1
1 prefix2
^([^_]+)_ - matches a sub-string(prefix, containing any characters except _) from the start of the string to the first occurrence of underscore _
You could use awk:
awk -F_ '{a[$1]++}END{for(i in a) print i,a[i]}' file
The field separator is set to _.
An array a is filled with the first field of every line as the key and the number of occurrences as the value.
Once the whole file has been read, the array content is printed.
I like RomanPerekhrest's answer. It's more concise. Here is a small change to make it even more concise by using cut in place of sed.
cut -d_ -f1 testfile | sort | uniq -c
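The question also mentions that it would be nice to list the prefixes in descending order of occurrences, in a "prefix count" layout. One way to get that from the cut pipeline is sketched here; the function name prefix_counts and the alphabetical tie-break are assumptions, not something the answers specify.

```shell
# prefix_counts FILE  (hypothetical helper name)
# Prints "prefix count" pairs, highest count first; ties are broken
# alphabetically by prefix.
prefix_counts() {
    cut -d_ -f1 "$1" |
        sort |
        uniq -c |
        awk '{ print $2, $1 }' |
        sort -k2,2nr -k1,1
}
```

Calling prefix_counts testfile swaps the two columns printed by uniq -c and then sorts numerically on the count.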
It can also be done as follows when the prefixes are known in advance; testfile is the file with the contents shown above.
printf '%-20s%d\n' prefix1 "$(grep -c '^prefix1_' testfile)"
printf '%-20s%d\n' differentPrefix "$(grep -c '^differentPrefix_' testfile)"
printf '%-20s%d\n' prefix2 "$(grep -c '^prefix2_' testfile)"
You can then compare this with your own code to see which one is more efficient.

Identify duplicate lines in a file that have 2 fields using linux command line

I have a file with 2 fields that contains a long list of entries, where the first field is the ID and the second field is a counter.
What I want is to display the duplicated IDs.
Example of the file:
tXXXXXXXXXX 12345
tXXXXXXXXXX 53321
tXXXXXXXXXXXX 422642
I know the logic for solving this problem (I need to iterate or loop over the file), but I do not know how to write the syntax of the command.
I would appreciate any help.
You can use this, which prints every line that exactly repeats an earlier line, prefixed with its line number:
perl -ne '++$i;print $i," ",$_ if $line{$_}++' FILENAME
If you mean you just want a list of duplicate IDs in the file, then this can be easily achieved with cut, sort and uniq.
cat <filename> | cut -f1 -d ' ' | sort | uniq -d
If you want to print all the lines with duplicate IDs on, the below can be used:
FILE=/tmp/sdhjkhsfds ; for dupe in $(cut -f1 -d ' ' "$FILE" | sort | uniq -d); do grep "^$dupe " "$FILE"; done
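The same result can be sketched without a shell loop by letting awk read the file twice: once to tally the IDs and once to print the lines whose ID was seen more than once. This is an illustrative alternative, and the function name lines_with_duplicate_ids is made up for the example.

```shell
# lines_with_duplicate_ids FILE  (hypothetical helper name)
# Prints every line whose first whitespace-separated field occurs more
# than once in FILE, keeping the original line order.
lines_with_duplicate_ids() {
    awk '
        NR == FNR { count[$1]++; next }   # first pass: tally each ID
        count[$1] > 1                     # second pass: print lines with a repeated ID
    ' "$1" "$1"
}
```

Because awk compares whole fields, an ID that is merely a prefix of a longer ID is not reported by mistake.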

How to use sed and wc command to handle whitespace

If I have a CSV file and I want to know the number of columns, I'll use the following command:
head -1 CSVFile.csv | sed 's/,/\t/g' | wc -w
However, whenever each column has a column name with a space in it, the command doesn't work and gives me a nonsense figure.
What would be the way to edit this command such that it gives me the correct number of columns?
e.g. in my file I could have column name (t - ZK) or (e - 22)
For example, my file could be (first 2 rows):
ZZ(v - 1),Tat(t - 1000)
1.1240128401924,2929292929
You are piping the sed output to wc -w which would return the number of words in the output. So if a field header contains spaces, those would be considered as different words.
You can use awk:
head -1 CSVFile.csv | awk -F, '{print NF}'
This would return the number of columns in the file (assuming the file is comma-delimited).
Maybe use the last line instead of the first. Change "head" to "tail". That would be a quick, easy solution.
Try using awk
awk -F, 'NR==1 {print NF; exit}' CSVFile.csv
If you wish to keep the chain of head, sed and wc:
Use sed to replace the delimiter with a newline \n instead of a tab \t, and then count the number of lines with wc -l instead of counting words with wc -w
head -1 CSVFile.csv | sed 's/,/\n/g' | wc -l
perl -F, -lane 'print scalar(@F) if $. == 1' your_file
Assuming there is no "," in header name (like field1,"Surname,name",field3, ...)
sed "1 s/[^,]//g;q" CSVFile.csv | wc -c
It could also be done in sed alone, but that is a bit heavy for counting.
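When a header name may itself contain a comma inside double quotes (the field1,"Surname,name",field3 case), a rough way to count columns is to count only the commas that fall outside quotes. The sketch below assumes the header fits on one line and uses standard double-quote CSV quoting; the function name count_csv_columns is hypothetical.

```shell
# count_csv_columns FILE  (hypothetical helper name)
# Counts the columns on the first line of a CSV file, ignoring commas
# that appear inside double-quoted fields.
count_csv_columns() {
    head -n 1 "$1" | awk '{
        n = 1; inq = 0
        for (i = 1; i <= length($0); i++) {
            ch = substr($0, i, 1)
            if (ch == "\"") inq = !inq            # toggle the inside-quotes state
            else if (ch == "," && !inq) n++       # separator outside quotes
        }
        print n
    }'
}
```

A doubled quote ("") inside a quoted field toggles the state twice, so it leaves the count unaffected.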

unix - count of columns in file

Given a file with data like this (i.e. stores.dat file)
sid|storeNo|latitude|longitude
2|1|-28.03720000|153.42921670
9|2|-33.85090000|151.03274200
What would be a command to output the number of column names?
i.e. In the example above it would be 4. (number of pipe characters + 1 in the first line)
I was thinking something like:
awk '{ FS = "|" } ; { print NF}' stores.dat
but it returns all lines instead of just the first and for the first line it returns 1 instead of 4
awk -F'|' '{print NF; exit}' stores.dat
Just quit right after the first line.
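The attempt in the question prints 1 for the first line because FS is assigned inside the main block, after awk has already split that record; the new separator only applies from the next record on. Setting FS in a BEGIN block (or with -F, as above) avoids this. A minimal sketch, with count_columns as an assumed helper name:

```shell
# count_columns FILE  (hypothetical helper name)
# Prints the number of pipe-separated fields on the first line of FILE.
count_columns() {
    awk 'BEGIN { FS = "|" } { print NF; exit }' "$1"
}
```

With the stores.dat sample from the question, count_columns stores.dat should report 4.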
This is a workaround (for me: I don't use awk very often):
Display the first row of the file containing the data, replace all pipes with newlines and then count the lines:
$ head -1 stores.dat | tr '|' '\n' | wc -l
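Unless the fields themselves contain spaces, you can turn the separators into spaces and use wc -w on the first line (for example head -1 stores.dat | tr '|' ' ' | wc -w).
wc is "Word Count"; with -w it counts the whitespace-separated words in its input. If you send it only one line prepared this way, it will tell you the number of columns.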
You could try
cat FILE | awk '{print NF}'
Perl solution similar to Mat's awk solution:
perl -F'\|' -lane 'print $#F+1; exit' stores.dat
I've tested this on a file with 1000000 columns.
If the field separator is whitespace (one or more spaces or tabs) instead of a pipe:
perl -lane 'print $#F+1; exit' stores.dat
If you have python installed you could try:
python3 -c 'import sys; f=open(sys.argv[1]); print(len(f.readline().split("|")))' \
stores.dat
This is usually what I use for counting the number of fields:
head -n 1 file.name | awk -F'|' '{print NF; exit}'
Select any row in the file (in the example below, it's the 2nd row) and count the number of columns, where the delimiter is a space:
sed -n 2p text_file.dat | tr ' ' '\n' | wc -l
Proper pure bash way
Simply counting columns in file
Under bash, you could simply:
IFS=\| read -ra headline <stores.dat
echo ${#headline[@]}
4
This is a lot quicker because it needs no forks, and it is reusable because $headline holds the full header line. You could, for example:
printf " - %s\n" "${headline[@]}"
- sid
- storeNo
- latitude
- longitude
Note: this syntax correctly handles spaces and other characters in column names.
Alternative: strict checking for the maximum number of columns across all rows
What if some rows contain extra columns?
This command searches for the longest line, counting only separators:
tr -dc $'\n|' <stores.dat |wc -L
3
If there are at most 3 separators, then there are 4 fields. Alternatively, consider that
each separator (|) is preceded by a "before" and followed by an "after", each trimmed to one letter (b and a), with adjacent "ab" pairs merged into a single a:
tr -dc $'\n|' <stores.dat|sed 's/./b&a/g;s/ab/a/g;s/[^ab]//g'|wc -L
4
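Note that wc -L is not part of POSIX wc, so it may be missing on some systems. A portable way to get the same maximum is to let awk track the largest NF it sees. This is only a sketch, and max_columns is an assumed name:

```shell
# max_columns FILE  (hypothetical helper name)
# Prints the largest number of pipe-separated fields found on any line.
max_columns() {
    awk -F '|' 'NF > max { max = NF } END { print max + 0 }' "$1"
}
```

Unlike the header-only commands, this reads the whole file, so a single row with extra columns raises the reported number.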
Counting columns in a CSV file
Under bash, you may use csv loadable plugins:
enable -f /usr/lib/bash/csv csv
IFS= read -r line <file.csv
csv -a fields <<<"$line"
echo ${#fields[@]}
4
For more information, see "How to parse a CSV file in Bash?".
Based on Cat Kerr's response, this command works on Solaris (it splits on whitespace by default; add -F'|' for a pipe-delimited file such as stores.dat):
awk '{print NF; exit}' stores.dat
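You may try the following, which counts the separators on the first line (the number of columns is that count plus one):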
head -1 stores.dat | grep -o \| | wc -l
