Unix commands "uniq" & "sort" - Linux

As we know,
uniq [options] [file1 [file2]]
removes duplicate adjacent lines from sorted file1. The option -c prints each line once, prefixed with a count of its occurrences. So suppose we have the following result:
34 Operating System
254 Data Structure
5 Crypo
21 C++
1435 C Language
589 Java 1.6
If we sort the above data using "sort -k1nr", the result is as below:
1435 C Language
589 Java 1.6
254 Data Structure
34 Operating System
21 C++
5 Crypo
Can anyone tell me how to output only the book names in this order (without the numbers)?

uniq -c filename | sort -k1nr | awk '{$1=""; print}'
(Clearing $1 makes awk rebuild the line, which leaves a single leading space on each name.)

You can also use sed for that, as follows:
uniq -c filename | sort -k1nr | sed 's/[0-9]\+ \(.\+\)/\1/g'
Test:
echo "34 Data Structure" | sed 's/[0-9]\+ \(.\+\)/\1/g'
Data Structure
This can also be done with a simplified regex (courtesy William Pursell):
echo "34 Data Structure" | sed 's/[0-9]* *//'
Data Structure
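Putting the pieces together, a minimal end-to-end run on invented input (the ^ *[0-9]* * pattern also strips the leading padding that uniq -c adds):
printf 'C++\nJava 1.6\nC++\nC Language\nC++\n' | sort | uniq -c | sort -k1nr | sed 's/^ *[0-9]* *//'
C++
C Language
Java 1.6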

Why do you use uniq -c to print the number of occurrences, which you then want to remove with some cut/awk/sed dance?
Instead, you could just use
sort -u "$file1" "$file2" /path/to/more_files_to_glob*
Or do some systems come with a version of sort which doesn't support -u?
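For the record, sort -u merges duplicate lines while sorting, so it behaves like sort | uniq; note it drops the counts, so it only fits when you don't need them. A quick check on invented input:
printf 'b\na\nb\na\n' | sort -u
a
b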

Related

Validating file records shell script

I have a file with the following content and want to validate it as follows:
1. I have entries of rec$NUM, and each such record should be repeated exactly 7 times. For example, for rec1.any_attribute, rec1 should appear only 7 times in the whole file.
2. I need a validating script for this. If a rec$NUM occurs fewer or more than 7 times, the script should report that record.
The file is as follows:
rec1:sourcefile.name=
rec1:mapfile.name=
rec1:outputfile.name=
rec1:logfile.name=
rec1:sourcefile.nodename_col=
rec1:sourcefle.snmpnode_col=
rec1:mapfile.enc=
rec2:sourcefile.name=abc
rec2:mapfile.name=
rec2:outputfile.name=
rec2:logfile.name=
rec2:sourcefile.nodename_col=
rec2:sourcefle.snmpnode_col=
rec2:mapfile.enc=
rec3:sourcefile.name=abc
rec3:mapfile.name=
rec3:outputfile.name=
rec3:logfile.name=
rec3:sourcefile.nodename_col=
rec3:sourcefle.snmpnode_col=
rec3:mapfile.enc=
Please Help
Thanks in Advance... :)
Simple awk:
awk -F: '/^rec/{a[$1]++}END{for(t in a){if(a[t]!=7){print "Some error for record: " t}}}' test.rc
grep '^rec1' file.txt | wc -l
grep '^rec2' file.txt | wc -l
grep '^rec3' file.txt | wc -l
All of the above should return 7.
The commands:
grep rec file2.txt | cut -d':' -f1 | uniq -c | egrep -v '^ *7 '
will print nothing if the file follows your rules, and will print the offending records if it doesn't.
(Insert a sort before uniq -c if record numbers can be interleaved rather than grouped.)

Binned histogram of timings in log file on command line

To quickly evaluate the timings of various operations from a log file on a Linux server, I would like to extract them from the log and create a textual/TSV-style histogram. To get a better idea of how the timings are distributed, I want to bin them into ranges of 0-10ms, 10-20ms, etc.
The output should look something like this:
121 10
39 20
12 30
7 40
1 100
How can I achieve this with the usual set of Unix command-line tools?
Quick answer:
cat <file> | egrep -o '[0-9]+' | sed "s/$/ \/10*10/" | bc | sort -n | uniq -c
Detailed answer:
grep the pattern of your timing or number. You may need multiple grep steps to extract exactly the numbers you want from your logs.
use sed to append an arithmetic expression that integer-divides by the desired bin size and multiplies back by the same factor
bc performs the calculation
the well-known sort | uniq -c combo counts the occurrences
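A quick sanity check of the binning step on a handful of invented timings (3, 12, 17, 25, 103 should land in the 0, 10, 10, 20 and 100 bins):
printf '%s\n' 3 12 17 25 103 | sed "s/$/ \/10*10/" | bc | sort -n | uniq -c
1 0
2 10
1 20
1 100
If you prefer to avoid bc, awk '{print int($1/10)*10}' performs the same integer binning in one step.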

Linux - About sorting shell output

I have output from a customised log file like this:
8 24 yum
8 24 yum
8 24 make
8 24 make
8 24 cd
8 24 cd
8 25 make
8 25 make
8 25 make
8 26 yum
8 26 yum
8 26 make
8 27 yum
8 27 install
8 28 ./linux
8 28 yum
I'd like to know if there's any way to count the number of occurrences of specific values in the third field. For example, I may want to count only the number of cd, yum, and install.
You can use awk to match on the third-field values and wc -l to count the matching lines.
awk '$3=="cd"||$3=="yum"||$3=="install"||$3=="cat" {print $0}' file | wc -l
You can also use egrep, but this will look for these words not only in the third field, but anywhere in the line.
egrep "(cd|yum|install|cat)" file | wc -l
If you want to count a specific word in the third field, you can do the above without multiple patterns:
awk '$3=="cd" {print $0}' file | wc -l
A classic shell script to do the job is:
awk '{print $3}' "$file" | sort | uniq -c | sort -n
Extract values from column 3 with awk, sort the identical names together, count the repeats, sort the output in increasing order of count. The sort | uniq -c | sort -n part is a common meme.
If you're using GNU awk, you can do it all in the awk script; it might be more efficient, but for really humungous files, it can run out of memory where the pipeline doesn't (sort spills to disk when necessary; writing code to spill to disk in awk is not sensible).
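A sketch of that all-in-GNU-awk variant; it relies on the gawk-only PROCINFO["sorted_in"] extension, and as noted above it keeps all counts in memory:
gawk '{ count[$3]++ }
END {
    PROCINFO["sorted_in"] = "@val_num_asc"   # iterate in ascending order of count
    for (cmd in count) print count[cmd], cmd
}' file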
Use cut, sort and uniq:
$ cut -d" " -f3 inputfile | sort | uniq -c
2 cd
1 install
1 ./linux
6 make
6 yum
For your input, this
awk '{++a[$3]}END{for(i in a)print i "\t" a[i];}' file
Would print:
cd 2
install 1
./linux 1
make 6
yum 6
Using awk to count the occurrences of field three and sort to order the output:
$ awk '{a[$3]++}END{for(k in a)print a[k],k}' file | sort -n
1 install
1 ./linux
2 cd
6 make
6 yum
To filter by command:
$ awk '/cd|yum|install/{a[$3]++}END{for(k in a)print a[k],k}' file | sort -n
1 install
2 cd
6 yum
To stop partial matches (the way a pattern like grep would match inside egrep), use the word boundaries \< and \>, so the filter would be /\<cd\>|\<yum\>|\<install\>/
You can use grep to filter by multiple terms at the same time:
cut -f3 -d' ' file | grep -x -e yum -e make -e install | sort | uniq -c
Explanation:
The -x flag is to match only the lines that match exactly, as if with ^pattern$
The cut extracts the 3rd column only
We sort and then uniq -c at the end, for efficiency, after all the junk has been removed from the input
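A quick check of the -x behaviour on invented lines; without -x, yummy would match as well:
printf 'yum\nyummy\n' | grep -x -e yum
yum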
I guess you want to count the values of yum, install, and cd separately. If so, you should use three separate awk statements:
awk '$3=="cd" {print $0}' file | wc -l
awk '$3=="yum" {print $0}' file | wc -l
awk '$3=="install" {print $0}' file | wc -l

Linux: How to list information about files in a directory (size, permissions, number of files by type) in total

Suppose I am in the current directory and I want to list the total number of files, as well as their size, permissions, and the number of files by type.
Here is a sample:
Print information about "/home/user/poker"
total number of file : 83
pdf files : 5
html files : 9
text files : 15
unknown : 5
NB: any file without an extension can be considered unknown.
I hope to use some simple commands like ls, cut, sort, uniq (just examples), put each different extension in a file, and use wc -l to count the number of lines. Or do I need to use grep, awk, or something else?
I hope to get everybody's advice. Thank you!
The best way is to use file to output only the MIME type and pass it to awk.
file -ib * | awk -F'[;/.]' '{print $(NF-1)}' | sort -n | uniq -c
On my home directory it produces this output.
35 directory
3 html
1 jpeg
1 octet-stream
1 pdf
32 plain
5 png
1 spreadsheet
7 symlink
1 text
1 x-c++
3 x-empty
1 xml
2 x-ms-asf
4 x-shellscript
1 x-shockwave-flash
If you think text/x-c++ and text/plain should be counted as the same, use this:
file -ib * | awk -F'[;/.]' '{print $1}' | sort -n | uniq -c
6 application
6 image
45 inode
40 text
2 video
Change the {print $1} part according to your need to get the appropriate output.
You need bash:
files=(*)      # every entry in the current directory
pdfs=(*.pdf)   # entries matching *.pdf
echo "${#files[@]}"                        # total count
echo "${#pdfs[@]}"                         # pdf count
echo "$(( ${#files[@]} - ${#pdfs[@]} ))"   # everything else
Note that if no *.pdf files exist, bash leaves the literal pattern in the array unless shopt -s nullglob is set.
find . -type f | xargs -n1 basename | fgrep . | sed 's/.*\.//' | sort | uniq -c | sort -n
That gives you a recursive list of file extensions. If you want only the current directory, add -maxdepth 1 to the find command.
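To produce output in exactly the asker's report format, here is a minimal sketch built on the same find/wc idea (the directory argument and the extension list are assumptions):
#!/bin/bash
dir=${1:-.}
echo "Print information about \"$dir\""
echo "total number of files : $(find "$dir" -maxdepth 1 -type f | wc -l)"
for ext in pdf html txt; do
    echo "$ext files : $(find "$dir" -maxdepth 1 -type f -name "*.$ext" | wc -l)"
done
# Files whose names contain no dot are counted as unknown.
echo "unknown : $(find "$dir" -maxdepth 1 -type f ! -name '*.*' | wc -l)"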

Sorting in bash

I have been trying to get the unique values in each column of a tab delimited file in bash. So, I used the following command.
cut -f <column_number> <filename> | sort | uniq -c
It works fine and I can get the unique values in a column and its count like
105 Linux
55 MacOS
500 Windows
What I want to do is, instead of sorting by the column value names (which in this example are OS names), sort them by count, possibly with the count in the second column of the output. So it would have to look like:
Windows 500
Linux 105
MacOS 55
How do I do this?
Use:
cut -f <col_num> <filename> |
    sort |
    uniq -c |
    sort -r -k1 -n |
    awk '{print $2" "$1}'
The sort -r -k1 -n sorts in reverse order, using the first field as a numeric value. The awk simply reverses the order of the columns. You can test the added pipeline commands thus (with nicer formatting):
pax> echo '105 Linux
55 MacOS
500 Windows' | sort -r -k1 -n | awk '{printf "%-10s %5d\n",$2,$1}'
Windows 500
Linux 105
MacOS 55
Mine:
cut -f <column_number> <filename> | sort | uniq -c | awk '{ print $2" "$1}' | sort
This will alter the column order (awk) and then just sort the output. Note that the final sort here is alphabetical by name, not numeric by count.
Hope this will help you
Using sed with a tagged regular expression (the leading ^ * is needed because uniq -c pads its counts with spaces):
cut -f <column_number> <filename> | sort | uniq -c | sort -r -k1 -n | sed 's/^ *\([0-9]*\)[ ]*\(.*\)/\2 \1/'
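Test (the padded input mimics what uniq -c emits):
echo "    105 Linux" | sed 's/^ *\([0-9]*\)[ ]*\(.*\)/\2 \1/'
Linux 105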
Doesn't produce output in a neat format though.
