Sorting in bash - Linux

I have been trying to get the unique values in each column of a tab delimited file in bash. So, I used the following command.
cut -f <column_number> <filename> | sort | uniq -c
It works fine and I get the unique values in a column and their counts, like:
105 Linux
55 MacOS
500 Windows
What I want to do is, instead of sorting by the column values (which in this example are OS names), sort them by count, and possibly have the count in the second column of the output. So it will have to look like:
Windows 500
MacOS 105
Linux 55
How do I do this?

Use:
cut -f <col_num> <filename> |
    sort |
    uniq -c |
    sort -r -k1 -n |
    awk '{print $2" "$1}'
The sort -r -k1 -n sorts in reverse order, using the first field as a numeric value. The awk simply reverses the order of the columns. You can test the added pipeline commands thus (with nicer formatting):
pax> echo '105 Linux
55 MacOS
500 Windows' | sort -r -k1 -n | awk '{printf "%-10s %5d\n",$2,$1}'
Windows 500
Linux 105
MacOS 55

Mine:
cut -f <column_number> <filename> | sort | uniq -c | awk '{print $2" "$1}' | sort -k2 -nr
This will swap the column order (awk) and then sort the output numerically on the count, highest first.
Hope this helps you.

Using sed with a tagged regular expression (backreferences):
cut -f <column_number> <filename> | sort | uniq -c | sort -r -k1 -n | sed 's/^ *\([0-9]*\)[ ]*\(.*\)/\2 \1/'
It doesn't produce output in a neat format, though.
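If you do want neater alignment, one option (just a sketch) is to append column -t, which pads the fields into even columns:
cut -f <column_number> <filename> | sort | uniq -c | sort -r -k1 -n | sed 's/^ *\([0-9]*\)[ ]*\(.*\)/\2 \1/' | column -t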

Related

Find duplicate entries in a text file using shell

I am trying to find duplicate *.sh entries mentioned in a text file (test.log) and delete them, using a shell program. Since the paths differ, uniq -u always prints both entries, even though first_prog.sh appears twice in the file.
cat test.log
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/first_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
Expected output:
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
I tried a couple of approaches with a few commands, but I have no idea how to get the output above:
rev test.log | cut -f1 -d/ | rev | sort | uniq -d
Any clue on this?
You can use awk for this by splitting fields on / and using $NF (last field) in an associative array:
awk -F/ '!seen[$NF]++' test.log
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
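For clarity, the same idiom spelled out longhand (equivalent logic, just more verbose):
awk -F/ '{
    file = $NF              # last /-separated field: the bare filename
    if (!(file in seen))    # first time this filename appears?
        print               # print the whole line
    seen[file]++            # remember it for later lines
}' test.log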
awk shines for this kind of task, but here is a non-awk solution:
$ sed 's|.*/|& |' file | sort -k2 -u | sed 's|/ |/|'
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
or, if your paths are balanced (the same number of path components for every file):
$ sort -t/ -k5 -u file
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
Or, if you simply want to drop the one known duplicate path:
awk '!/my_shellprog\/test\/first/' file
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh

How to filter multiple files and eliminate duplicate entries, selecting a single entry, using the Linux shell

I have a folder that contains several files, all with identical columns.
Let us say file1 and file2 have the following contents (there can be more than two files):
$cat file1.txt
9999999999|1200
8888888888|1400
7777777777|1255
6666666666|1788
7777777777|1289
9999999999|1300
$cat file2.txt
9999999999|2500
8888888888|2450
6666666666|2788
9999999999|3000
2222222222|3001
In each file, the 1st column is a mobile number and the 2nd is a count. The same mobile number can appear in multiple files. I want to end up with a file of unique mobile numbers, keeping for each one the record with the highest count.
The output should be as follows:
$cat output.txt
7777777777|1289
8888888888|2450
6666666666|2788
9999999999|3000
2222222222|3001
Any help would be appreciated.
That's probably not very efficient but it does the job:
Put this into phones.sh and run sh phones.sh:
#!/bin/bash
files="
file1.txt
file2.txt
"
phones=$(cat $files | cut -d'|' -f1 | sort -u)
for phone in $phones; do grep -h "^$phone|" $files | sort -t'|' -k2 -nr | head -n1; done | sort -t'|' -k2 -n
What it does is basically: extract all the phone numbers in the files, iterate over them, grep each one in all the files, and select the line with the highest count. The grep pattern is anchored as ^number| so a number can never match inside a count. Then I also sorted the final result numerically by count, which is what your expected result suggests. sort -t'|' -k2 -nr means sort on the second |-delimited field in decreasing numerical order, and head -n1 selects the first line. You can add other files to the files variable.
Another way of doing this is to use the power of sort and awk:
cat file1.txt file2.txt | sort -t '|' -k1,1 -k2,2nr | awk -F"|" '!_[$1]++' | sort -t '|' -k2,2n
I think the one-liner is pretty self-explanatory, except for the awk part: it keeps only the first line seen for each distinct value of the first column (a uniq by column 1). Since the preceding sort puts the highest count first within each number, that surviving line is the maximum. The last sort is just to get the final order that you wanted.
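A single-pass alternative (just a sketch, assuming all counts are positive numbers) keeps the largest count per number in an awk array, so no pre-sort is needed:
awk -F'|' '$2 > max[$1] { max[$1] = $2 }          # keep the largest count seen per number
END { for (n in max) print n "|" max[n] }' file1.txt file2.txt | sort -t'|' -k2,2n
The trailing sort only arranges the final output by count, as in the expected result.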

Calculating awk output divided by mega=1048576

Hi, can someone please let me know how I can convert the output field from this command to MB?
The command below shows the 20 largest files in a directory and its subdirectories,
but I need to convert the output to MB. In my script I use an array, but if you can show me how to use awk to divide this output by mega=1048576,
I would really appreciate it. Please explain the options!
ls -1Rs | sed -e "s/^ *//" | grep "^[0-9]" | sort -nr | head -n20 | awk '{print $1}'
Thanks
You don't show any sample input or expected output, so this is a guess, but this MAY be what you want (assuming you can't follow all the other good advice about not parsing ls output, and you don't have GNU awk for internal sorting):
ls -1Rs | awk '/^ *[0-9]/' | sort -nr | awk 'NR<21{print $1/1024}'
Note that you don't need all those other commands and pipes when you're already using awk.
To turn it into MB, you have to divide by 1024:
ls -1Rs | sed -e "s/^ *//" | grep "^[0-9]" | sort -nr | head -n20 | awk '{print $1 / 1024}'
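If you also want the filename next to the size, a variation (a sketch; it assumes 1K blocks, the GNU ls default, and that everything after the size is the name):
ls -1Rs | awk '/^ *[0-9]/' | sort -nr | head -n20 | awk '{ name = $0; sub(/^ *[0-9]+ */, "", name); printf "%8.2f MB  %s\n", $1/1024, name }'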

Validating file records shell script

I have a file with the following content and want to validate it:
1. There are entries of the form rec$NUM, and each such field should be repeated exactly 7 times. For example, rec1 (as in rec1:any_attribute) should appear only 7 times in the whole file.
2. I need a validating script for this: if there are fewer or more than 7 records for a rec$NUM, the script should report that record.
The file is as follows:
rec1:sourcefile.name=
rec1:mapfile.name=
rec1:outputfile.name=
rec1:logfile.name=
rec1:sourcefile.nodename_col=
rec1:sourcefle.snmpnode_col=
rec1:mapfile.enc=
rec2:sourcefile.name=abc
rec2:mapfile.name=
rec2:outputfile.name=
rec2:logfile.name=
rec2:sourcefile.nodename_col=
rec2:sourcefle.snmpnode_col=
rec2:mapfile.enc=
rec3:sourcefile.name=abc
rec3:mapfile.name=
rec3:outputfile.name=
rec3:logfile.name=
rec3:sourcefile.nodename_col=
rec3:sourcefle.snmpnode_col=
rec3:mapfile.enc=
Please help. Thanks in advance... :)
Simple awk:
awk -F: '/^rec/{a[$1]++}END{for(t in a){if(a[t]!=7){print "Some error for record: " t}}}' test.rc
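A variant (a sketch) that also reports the offending count and exits non-zero, so a calling script can test the result:
awk -F: '/^rec/ { a[$1]++ }
END {
    for (t in a)
        if (a[t] != 7) { printf "%s appears %d times (expected 7)\n", t, a[t]; bad = 1 }
    exit bad
}' test.rc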
grep '^rec1' file.txt | wc -l
grep '^rec2' file.txt | wc -l
grep '^rec3' file.txt | wc -l
All of the above should return 7.
The commands:
grep '^rec' file2.txt | cut -d':' -f1 | uniq -c | egrep -v '^ *7 '
print nothing if the file follows your rules, and print the failing record (with its count) if it doesn't. The trailing space in '^ *7 ' stops counts like 70 or 77 from slipping through.
(Replace uniq -c with sort | uniq -c if the record numbers can be interleaved rather than grouped.)
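With that change, the full pipeline for record numbers that may appear in any order is:
grep '^rec' file2.txt | cut -d':' -f1 | sort | uniq -c | egrep -v '^ *7 '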

Linux - About sorting shell output

I have output from a customised log file like this:
8 24 yum
8 24 yum
8 24 make
8 24 make
8 24 cd
8 24 cd
8 25 make
8 25 make
8 25 make
8 26 yum
8 26 yum
8 26 make
8 27 yum
8 27 install
8 28 ./linux
8 28 yum
I'd like to know if there's any way to count the number of occurrences of specific values in the third field. For example, I may want to count only cd, yum and install.
You can use awk to get the third-field values and wc -l to count them:
awk '$3=="cd"||$3=="yum"||$3=="install"||$3=="cat" {print $0}' file | wc -l
You can also use egrep, but this will look for these words not only in the third field, but everywhere else in the line:
egrep "(cd|yum|install|cat)" file | wc -l
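If your grep supports -w (GNU and BSD grep both do), you can at least require whole-word matches, although the search is still not limited to the third field:
egrep -w "(cd|yum|install|cat)" file | wc -l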
If you want to count a specific word in the third field, you can do the above without multiple regexes:
awk '$3=="cd" {print $0}' file | wc -l
A classic shell script to do the job is:
awk '{print $3}' "$file" | sort | uniq -c | sort -n
Extract values from column 3 with awk, sort the identical names together, count the repeats, sort the output in increasing order of count. The sort | uniq -c | sort -n part is a common meme.
If you're using GNU awk, you can do it all in the awk script; it might be more efficient, but for really humungous files, it can run out of memory where the pipeline doesn't (sort spills to disk when necessary; writing code to spill to disk in awk is not sensible).
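For example, with GNU awk (gawk) the counting and the ordering can happen in one process. This sketch relies on gawk's PROCINFO["sorted_in"], which plain awk does not have:
gawk '{ count[$3]++ }
END {
    PROCINFO["sorted_in"] = "@val_num_asc"   # iterate in increasing order of count
    for (cmd in count) print count[cmd], cmd
}' "$file"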
Use cut, sort and uniq:
$ cut -d" " -f3 inputfile | sort | uniq -c
2 cd
1 install
1 ./linux
6 make
6 yum
For your input, this:
awk '{++a[$3]}END{for(i in a)print i "\t" a[i];}' file
Would print:
cd 2
install 1
./linux 1
make 6
yum 6
Using awk to count the occurrences of field three and sort to order the output:
$ awk '{a[$3]++}END{for(k in a)print a[k],k}' file | sort -n
1 install
1 ./linux
2 cd
6 make
6 yum
So filter by command:
$ awk '/cd|yum|install/{a[$3]++}END{for(k in a)print a[k],k}' file | sort -n
1 install
2 cd
6 yum
To stop partial matches (such as the pattern grep also matching egrep), use the word-boundary operators \< and \>, so the filter would be /\<cd\>|\<yum\>|\<install\>/.
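Alternatively (a sketch in portable awk), anchor the test to field three itself, so the other columns can never trigger a match:
awk '$3 ~ /^(cd|yum|install)$/ { a[$3]++ } END { for (k in a) print a[k], k }' file | sort -n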
You can use grep to filter by multiple terms at the same time:
cut -f3 -d' ' file | grep -x -e yum -e make -e install | sort | uniq -c
Explanation:
The cut extracts the 3rd column only.
The -x flag makes grep match only lines that match exactly, as if anchored with ^pattern$.
We sort and count with uniq -c at the end, for efficiency, after all junk has been removed from the input.
I guess you want to count the values of yum, install and cd separately. If so, you should use three separate awk statements:
awk '$3=="cd" {print $0}' file | wc -l
awk '$3=="yum" {print $0}' file | wc -l
awk '$3=="install" {print $0}' file | wc -l
