Uniq skipping middle part of the line when comparing lines - linux

Sample file
aa\bb\cc\dd\ee\ff\gg\hh\ii\jj
aa\bb\cc\dd\ee\ll\gg\hh\ii\jj
aa\bb\cc\dd\ee\ff\gg\hh\ii\jj
I want to ignore the 6th field ('ff' in the first line) when comparing lines for uniqueness, and I also want the number of duplicate lines printed in front of each line.
I tried this, without any luck:
sort -t'\' -k1,5 -k7 --unique xslin1 > xslout
Expected output
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj

$ awk -F'\' -v OFS='\' '{$6="*"} 1' xslin1 | sort | uniq -c
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj
Discussion
With --unique, sort outputs only unique lines but it does not count them. One needs uniq -c for that. Further, sort outputs all unique lines, not just those that sort to the same value.
The solution above takes the simple approach of setting the sixth field to *, as in your expected output, and then uses the standard pipeline, sort | uniq -c, to count the identical lines.

You can do this in one awk:
awk 'BEGIN{FS=OFS="\\"} {$6="*"; uniq[$0]++}
END {for (i in uniq) print uniq[i] "\t" i}' file
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj
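The same idea can be parameterised on the field to ignore. The following is only a sketch, assuming backslash-separated input as in the question; count_masked is a made-up helper name, and the sample lines are invented for illustration.

```shell
# count_masked N: read backslash-separated lines on stdin, replace field N
# with "*", and print each distinct masked line preceded by its count.
# (count_masked is a hypothetical helper name used only for this sketch.)
count_masked() {
    awk -v col="$1" '
        BEGIN { FS = "[\\\\]"; OFS = "\\" }   # split on a literal backslash, rejoin with one
        { $col = "*"; seen[$0]++ }            # mask the ignored field, then tally the line
        END { for (line in seen) print seen[line] " " line }
    ' | sort                                  # "for (x in arr)" order is unspecified, so sort
}

# Example with made-up data: the first two lines differ only in field 4.
printf '%s\n' 'aa\bb\cc\ff\gg' 'aa\bb\cc\ll\gg' 'aa\bb\xx\ff\gg' | count_masked 4
```

FS is written as a bracket expression here because a bare backslash as field separator is handled differently by some awk implementations; with GNU awk the shorter FS=OFS="\\" from the answer above should behave the same.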

Related

Count lines and group by prefix word

I want to count number of lines in a document and group it by the prefix word. Prefix is a set of alphanumeric characters delimited by first underscore. I don't care much about sorting them but it would be nice to list them descending by number of occurrences.
The file looks like this:
prefix1_data1
prefix1_data2_a
differentPrefix_data3
prefix1_data2_b
differentPrefix_data5
prefix2_data4
differentPrefix_data5
The output should be the following:
prefix1 3
differentPrefix 3
prefix2 1
I already did this in Python, but I am curious whether it can be done more efficiently from the command line or with a bash script. The uniq command has the -c and -w options, but the length of the prefix may vary.
The solution using combination of sed, sort and uniq commands:
sed -rn 's/^([^_]+)_.*/\1/p' testfile | sort | uniq -c
The output:
3 differentPrefix
3 prefix1
1 prefix2
^([^_]+)_ - matches a sub-string(prefix, containing any characters except _) from the start of the string to the first occurrence of underscore _
You could use awk:
awk -F_ '{a[$1]++}END{for(i in a) print i,a[i]}' file
The field separator is set to _.
The array a is indexed by the first field of each line and holds the count for that prefix.
Once the whole file has been read, the contents of the array are printed.
I like RomanPerekhrest's answer. It's more concise. Here is a small change to make it even more concise by using cut in place of sed.
cut -d_ -f1 testfile | sort | uniq -c
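The question also asked for the counts in descending order and in "prefix count" form. A possible sketch of that, built on the same cut | sort | uniq -c pipeline; count_prefixes is a hypothetical name, and lines without any underscore are counted under the whole line because cut passes them through unchanged.

```shell
# count_prefixes [FILE...]: count lines per prefix (the text before the
# first "_"), most frequent first, printed as "prefix count".
# (count_prefixes is a hypothetical helper name used only for this sketch.)
count_prefixes() {
    cut -d_ -f1 "$@" |          # keep only the prefix
        sort | uniq -c |        # group equal prefixes and count them
        sort -k1,1nr -k2 |      # highest count first, ties ordered by prefix
        awk '{ print $2, $1 }'  # swap the columns to "prefix count"
}

# Example using the sample data from the question.
printf '%s\n' prefix1_data1 prefix1_data2_a differentPrefix_data3 \
    prefix1_data2_b differentPrefix_data5 prefix2_data4 \
    differentPrefix_data5 | count_prefixes
```

With no file argument the function reads standard input, so it can also be used as count_prefixes testfile.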
It can also be done as follows, where testfile is the file with the contents shown above. Note that the prefixes are hard-coded, and the trailing _ in each pattern keeps prefix1 from also matching a longer prefix such as prefix10:
printf '%-20s%d\n' prefix1 "$(grep -c '^prefix1_' testfile)"
printf '%-20s%d\n' differentPrefix "$(grep -c '^differentPrefix_' testfile)"
printf '%-20s%d\n' prefix2 "$(grep -c '^prefix2_' testfile)"
You can then compare this with your own code to see which one is more efficient.

Linux sort numerically based on first column

I'm trying to numerically sort a long list of csv file based on the number in the first column, using below command:
-> head -1 file.csv ; tail -n +2 file.csv | sort -t , -k1n
(I'm piping head/tail command to skip the first line of the file, as it's a header and contains string)
However, it doesn't return a fully sorted list. Half of it is sorted, the other half is like this:
9838,2361,8,947,2284
9842,2135,2,261,2511
9846,2710,1,176,2171
986,2689,32,123,2177
9888,2183,15,30,2790
989,2470,33,887,2345
Can somebody tell me what I'm doing wrong? I've also tried below with same result:
-> sort -k1n -t"," file.csv
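tail -n +2 file.csv | sort -k1,1 -n -t"," should do the trick. With -k1 alone the key runs from field 1 to the end of the line, so the key has to be limited to the first field.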
To perform a numeric sort by the first column use the following approach:
tail -n +2 file.csv | sort -n -t, -k1,1
The output:
986,2689,32,123,2177
989,2470,33,887,2345
9838,2361,8,947,2284
9842,2135,2,261,2511
9846,2710,1,176,2171
9888,2183,15,30,2790
-k pos1[,pos2]
Specify a sort field that consists of the part of the line between pos1 and pos2
(or the end of the line, if pos2 is omitted), inclusive.
In its simplest form pos specifies a field number (starting with 1) ...
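Putting the header handling and the -k1,1 key together, a minimal sketch could look like this. It assumes the CSV arrives on standard input with exactly one header line; sort_csv_keep_header is a made-up name, and the sample rows are invented.

```shell
# sort_csv_keep_header: copy the first line (the header) through unchanged,
# then sort the remaining lines numerically on the first comma-separated field.
# (sort_csv_keep_header is a hypothetical helper name used only for this sketch.)
sort_csv_keep_header() {
    IFS= read -r header || return 0   # empty input: nothing to print
    printf '%s\n' "$header"
    sort -t, -k1,1n                   # key starts and ends at field 1
}

# Example with made-up rows.
printf '%s\n' 'id,value' '9838,a' '986,b' '989,c' | sort_csv_keep_header
```

For a file on disk this would be called as sort_csv_keep_header < file.csv. Limiting the key to field 1 also keeps the commas out of the numeric comparison, which matters in locales where a comma is a thousands separator.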

sort and remove duplicate based on different columns in a file

I have a file with three columns (yyyy-mm-dd, hh:mm:ss.000 and a 13-digit number):
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.578 5001234567891
I want to first sort the file by date and time (the first two columns) and then remove the rows with duplicate numbers (third column), keeping the earliest row for each number. After this, the file above should look like:
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.568 5001234567890
I have used sort with a key and an awk command (as below), but the results aren't correct. (I am not sure which entries are being removed, as the files I am processing are too big to check.)
Commands:
sort -k1 inputFile > sortedInputFile
awk '!seen[$3]++' sortedInputFile > outputFile
I am not sure how to do this.
If you want to keep the earliest instance of each 3rd column entry, you can sort twice; the first time to group duplicates and the second time to restore the sort by time, after duplicates are removed. (The following assumes a default sort works with both dates and values and that all lines have three columns with consistent whitespace.)
sort -k3 -k1,2 inputFile | uniq -f2 | sort > sortedFile
The -f2 option tells uniq to skip the first two fields when comparing, so that the date and time are not considered.
If the milliseconds don't matter, another approach is to strip them and then sort and deduplicate whole lines:
awk '{print $1" "substr($2,1,index($2,".")-1)" "$3 }' file1.txt | sort | uniq
Here is one in awk. It groups on the $3 and stores the earliest timestamp but the output order is random, so the output should be piped to sort.
$ awk '
(a[$3] == "" || a[$3] > ($1 OFS $2)) && a[$3]=($1 OFS $2) { next }
END{ for(i in a) print a[i], i }
' file # | sort goes here
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.478 5001234567891
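For comparison, here is a sketch of the single-sort variant: order by date and time once, then keep only the first line seen for each number. It assumes three whitespace-separated columns and fixed-width timestamps, so that text order equals chronological order; earliest_per_id is a hypothetical name.

```shell
# earliest_per_id [FILE...]: sort by date (field 1) and time (field 2), then
# print only the first, i.e. earliest, line for each value of field 3.
# (earliest_per_id is a hypothetical helper name used only for this sketch.)
earliest_per_id() {
    sort -k1,1 -k2,2 "$@" | awk '!seen[$3]++'
}

# Example using the sample data from the question.
printf '%s\n' \
    '2016-11-30 23:40:45.578 5001234567890' \
    '2016-11-30 23:40:45.568 5001234567890' \
    '2016-11-30 23:40:45.578 5001234567890' \
    '2016-11-30 23:40:45.478 5001234567891' \
    '2016-11-30 23:40:45.578 5001234567891' | earliest_per_id
```

Because awk prints lines in the order it reads them, the result stays sorted by time and no second sort is needed.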

what is the meaning of delimiter in cut and why in this command it is sorting twice?

I am trying to understand what this command does. I only know the basics, and this is what I have found so far:
last | cut -d" " -f 1 | sort | uniq -c | sort
last searches back through the file /var/log/wtmp (or the file designated by the -f flag) and displays a list of all users logged in (and out) since that file was created.
cut shows the desired column.
The option -d specifies the field delimiter used in the input.
-f specifies which field you want to extract.
1 is the field number, I think, but I am not sure.
Then it sorts, and then it runs uniq, which is used to remove or detect duplicate entries in a file.
If anyone can explain this command, and in particular why there are two sorts, I would appreciate it.
You are right in your explanation of cut: cut -d" " -f1 (the space after -f is not needed) takes the first field of each line, using " " (space) as the delimiter.
Then why sort | uniq -c | sort?
From man uniq:
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
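That's why you need to sort the lines before piping them to uniq. Finally, as the uniq -c output is ordered by value rather than by count, you sort again to order it by count. A plain sort compares the lines as text in ascending order; use sort -n for a numeric comparison, or sort -rn to see the most repeated items first.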
See an example of sort and uniq -c for a given file with repeated items:
$ seq 5 >>a
$ seq 5 >>a
$ cat a
1
2
3
4
5
1
2
3
4
5
$ sort a | uniq -c | sort   # <--- no repeated matches
2 1
2 2
2 3
2 4
2 5
$ uniq -c a | sort   # <---- repeated matches
1 1
1 1
1 2
1 2
1 3
1 3
1 4
1 4
1 5
1 5
Note you can do the sort | uniq -c all together with this awk:
last | awk '{a[$1]++} END{for (i in a) print i, a[i]}'
This stores the values of the first column as keys of the a[] array and increments the counter each time a value is seen again. The END{} block prints the results unsorted, so you may want to pipe them to sort.
uniq -c is being used to create a frequency histogram. The reason for the second sort is that you are then sorting your histogram by frequency order.
The reason for the first sort is that uniq is only comparing each line to its previous when deciding whether the line is unique or not.
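To see the two sorts at work without depending on the contents of /var/log/wtmp, the pipeline can be fed made-up input. This is only a sketch: login_histogram is a hypothetical name, the sample lines merely imitate the first columns of last output, and the second sort uses -rn so that the highest count comes first.

```shell
# login_histogram: take the first space-delimited field of each line,
# count how often each value occurs, and list the most frequent first.
# (login_histogram is a hypothetical helper name used only for this sketch.)
login_histogram() {
    cut -d' ' -f1 |   # keep the user name
        sort |        # first sort: make equal names adjacent for uniq
        uniq -c |     # collapse adjacent duplicates, prefixing the count
        sort -rn      # second sort: order by count, highest first
}

# Example with made-up lines standing in for the output of "last".
printf '%s\n' 'alice pts/0' 'bob pts/1' 'alice pts/2' | login_histogram
```

On a real system the same function would be used as last | login_histogram.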

How to count number of unique values of a field in a tab-delimited text file?

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So the first column holds colors, and I want to know how many different unique values there are in that column. I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of each of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green etc. coloured objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 gives the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique values, you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this, for example to list all the unique values in the first column
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l
Assuming the data file is actually tab-separated, not space-aligned, tell awk to split on tabs so that values containing spaces stay in one field:
<test.tsv awk -F'\t' '{print $4}' | sort | uniq
where the fields are:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is the integer column number
# INPUT_FILE is the input file name
cut -f "${COLUMN}" < "${INPUT_FILE}" | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
# Number of columns = number of tabs in the first line + 1
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 1))
for ((i = 1; i <= cols; i++))
do
    echo "Column $i ::"
    # Each distinct value in column i, preceded by its count
    cut -f "$i" < "$FILE" | sort | uniq -c
    echo
done
This script outputs the count of each unique value in every column of a given file. It assumes that the first line of the file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and pass the tab-delimited file as a parameter.
Code
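#!/bin/bash
# Requires GNU awk (gawk) 4.0 or later for arrays of arrays: arr[x][y]
awk -F'\t' '
(NR==1){
    # Header line: remember the column names and the column count
    nf=NF;
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
(NR!=1){
    # Data line: count each value under its column name
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;
}
END{
    # One output line per column: name, then value_count pairs
    for(fi=1; fi<=nf; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"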
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66
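Where GNU awk is not available, a similar per-column count can be sketched with the classic awk idiom of a single array keyed by (column, value). This is an illustration under assumptions, not a drop-in replacement: column_value_counts is a hypothetical name, the input is assumed to be tab-delimited with a header line, and the output is one "column, value, count" line per distinct value rather than one line per column.

```shell
# column_value_counts [FILE...]: for a tab-delimited file with a header line,
# print "column-name <TAB> value <TAB> count" for every distinct value.
# (column_value_counts is a hypothetical helper name used only for this sketch.)
column_value_counts() {
    awk -F'\t' '
        NR == 1 {                       # header: remember names and column count
            nf = NF
            for (i = 1; i <= NF; i++) name[i] = $i
            next
        }
        { for (i = 1; i <= nf; i++) count[i, $i]++ }   # key is i SUBSEP value
        END {
            for (key in count) {
                split(key, part, SUBSEP)               # part[1]=column, part[2]=value
                print name[part[1]] "\t" part[2] "\t" count[key]
            }
        }
    ' "$@" | LC_ALL=C sort              # for-in order is unspecified, so sort
}

# Example with made-up data.
printf 'color\tkind\nRed\tBall\nBlue\tBat\nRed\tBat\n' | column_value_counts
```

Rows with fewer fields than the header are counted as having an empty value in the missing columns.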

Resources