How to filter multiple files and eliminate duplicate entries, keeping a single entry, using the Linux shell

I have a folder that contains several files, all with identical columns. Say file1 and file2 have the following contents (there can be more than two files):
$ cat file1.txt
9999999999|1200
8888888888|1400
7777777777|1255
6666666666|1788
7777777777|1289
9999999999|1300
$ cat file2.txt
9999999999|2500
8888888888|2450
6666666666|2788
9999999999|3000
2222222222|3001
In my files the 1st column is a mobile number and the 2nd is a count. The same mobile number can appear in multiple files. I want to produce a file of unique mobile numbers, each with its highest count.
The output should be as follows:
$ cat output.txt
7777777777|1289
8888888888|2450
6666666666|2788
9999999999|3000
2222222222|3001
Any help would be appreciated.

That's probably not very efficient, but it does the job. Put this into phones.sh and run sh phones.sh:
#!/bin/bash
files="
file1.txt
file2.txt
"
phones=$(cat $files | cut -d'|' -f1 | sort -u)
for phone in $phones; do grep -h "^$phone|" $files | sort -t'|' -k2 -nr | head -n1; done | sort -t'|' -k2 -n
What it does, basically: extract all the phone numbers from the files, iterate over them, and grep each one across all files (anchored to the start of the line so a number cannot match inside another field), keeping the line with the highest count. The final result is then also sorted by count, which is what your expected output suggests. sort -t'|' -k2 -nr means: sort on the second '|'-delimited column in decreasing numerical order; head -n1 then selects the first line. You can add other files to the files variable.
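For larger inputs, a single-pass awk alternative avoids re-reading all the files once per phone number. This is a sketch of the same idea, not part of the original answer: keep the maximum count per number in an associative array, then print and sort.
awk -F'|' '$2+0 > max[$1]+0 { max[$1] = $2 } END { for (p in max) print p "|" max[p] }' file1.txt file2.txt | sort -t'|' -k2,2n
The $2+0 forces a numeric comparison, and the END block prints one line per number; since awk's array order is arbitrary, the trailing sort produces the count-ascending output.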

Another way of doing this is to use the power of sort and awk:
cat file1.txt file2.txt | sort -t '|' -k1,1 -k2,2nr | awk -F"|" '!_[$1]++' | sort -t '|' -k2,2n
The one-liner is fairly self-explanatory, except perhaps for the awk part: it acts as a uniq keyed on the first column. Because the preceding sort puts the highest count first within each phone number, keeping only the first occurrence of each number keeps its highest count. The last sort just produces the final order you wanted.
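If the !_[$1]++ idiom is unfamiliar: the expression is true only the first time a given first field is seen, and awk's default action is to print the line. Written out long-hand, with a more descriptive array name, the same stage would be:
awk -F'|' '{ if (!seen[$1]) { print; seen[$1] = 1 } }'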

Related

Find duplicate entries in a text file using shell

I am trying to find duplicate *.sh entries in a text file (test.log) and delete them using a shell program. Since the paths differ, uniq never treats the lines as duplicates, even though first_prog.sh appears twice in the file:
cat test.log
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/first_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
Desired output:
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
I tried a couple of commands, but I don't have an idea how to get the above output. This at least lists the duplicated basename:
rev test.log | cut -f1 -d/ | rev | sort | uniq -d
Any clue on this?
You can use awk for this, splitting fields on / and using $NF (the last field, i.e. the basename) as the key of an associative array:
awk -F/ '!seen[$NF]++' test.log
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
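As an aside (not part of the original answer), inverting the condition lists exactly the lines that would be discarded, which is handy for checking before deleting anything:
awk -F/ 'seen[$NF]++' test.log
/mnt/abc/my_shellprog/test/first_prog.sh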
awk shines for this kind of task, but here is a non-awk solution: insert a space after the last /, deduplicate on the resulting second field (the basename), then remove the space again:
$ sed 's|.*/|& |' file | sort -k2 -u | sed 's|/ |/|'
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
or, if your paths are balanced (the same number of path components for all files):
$ sort -t/ -k5 -u file
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh
If you only need to drop one specific, known duplicate, a hardcoded filter also works:
awk '!/my_shellprog\/test\/first/' file
/mnt/abc/shellprog/test/first_prog.sh
/mnt/abc/shellprog/test/second_prog.sh
/mnt/abc/my_shellprog/test/third_prog.sh

How to sort a text file numerically and then store the results in the same text file?

I have tried sort -n test.txt > test.txt. However, this leaves me with an empty text file. What is going on here, and what can I do to solve this problem?
sort does not sort the file in place; it outputs a sorted copy instead, so you need to redirect to a different file, e.g. sort -n -k 4 out.txt > sorted-out.txt (the -k 4 and the file names here assume an example file sorted on its 4th column).
Edit: to get the order you want, you have to sort the file with the numbers read in reverse. This does it:
cut -d' ' -f4 out.txt | rev | paste - out.txt | sort -k1 -n | cut -f2- > sorted-out.txt
For reference:
sort -nk4 file
-n for numeric sort
-k for specifying the sort key
Add the -r option for reverse (descending) order:
sort -nrk4 file
It is because you are reading from and writing to the same file: the shell truncates the output file before sort ever reads it. You can use a temporary file (e.g. via mktemp), or simply:
sort -n test.txt > test1.txt
mv test1.txt test.txt
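A minimal sketch of the mktemp variant, so the intermediate name cannot collide with an existing file (the tmp variable name is just illustrative):
tmp=$(mktemp) && sort -n test.txt > "$tmp" && mv "$tmp" test.txt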
For sort specifically, you can also do the following:
sort -n test.txt -o test.txt
Unlike shell redirection, -o is safe here: sort reads all of its input before opening the output file.

Issue with unix sort

This is more a point of confusion than a question.
So I have an input file like this:
$ cat test
class||sw sw-explr bot|results|id,23,0a522b36-556f-4116-b485-adcf132b6cad,20130325,/html/body/div/div[3]/div[2]/div[2]/div[3]/div/div/div/div/div/div[2]/div/div/ul/li[4]/div/img
class||sw sw-explr bot|results|id,40,30cefa2c-6ebf-485e-b49c-3a612fe3fd73,20130323,/html/body/div/div[3]/div[2]/div[3]/div[3]/div/div/div/div/div[3]/div/div/ul/li[8]/div/img
class||sw sw-explr bot|results|id,3,72805487-72c3-4173-947f-e5abed6ea1e4,20130324,/html/body/div/div[3]/div[2]/div[2]/div[2]/div/div/div/div/div/div[3]/div/div/div[2]/ul/li[20]/div/img
Each line more or less identifies an element in an HTML page. Treat it as five comma-separated columns. I want to sort this file by the second column, i.e. the values 23, 40, 3, and I am not sure why unix sort isn't working. These are the commands I tried; surprisingly, none gave the desired result:
cat test | sort -nt',' -k2
cat test | sort -n -t, -k2
cat test | sort -n -t$',' -k2
cat test | sort -t"," -k2
cat test | sort -n -k2
Is there something about sort that I don't know? This didn't block me, since I separated the columns, sorted, and joined them again, but why didn't sort work?
NB: if I remove the 3rd column of this file and then sort, it works fine!
This line should work for you:
sort -t, -n -k2,2 test
You don't need cat test | sort; just pass the file to sort.
The default end position of -k is the end of the line, so sort -k2 means sort on everything from the 2nd field through the end of the line, whereas you need to sort on exactly the 2nd field. This also explains why your sort worked once you removed the 3rd column.
Testing with your example:
kent$ sort -t, -n -k2,2 file
class||sw sw-explr bot|results|id,3,72805487-72c3-4173-947f-e5abed6ea1e4,20130324,/html/body/div/div[3]/div[2]/div[2]/div[2]/div/div/div/div/div/div[3]/div/div/div[2]/ul/li[20]/div/img
class||sw sw-explr bot|results|id,23,0a522b36-556f-4116-b485-adcf132b6cad,20130325,/html/body/div/div[3]/div[2]/div[2]/div[3]/div/div/div/div/div/div[2]/div/div/ul/li[4]/div/img
class||sw sw-explr bot|results|id,40,30cefa2c-6ebf-485e-b49c-3a612fe3fd73,20130323,/html/body/div/div[3]/div[2]/div[3]/div[3]/div/div/div/div/div[3]/div/div/ul/li[8]/div/img
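To see the end-of-line default in isolation, here is a minimal, hypothetical input (-s disables sort's last-resort whole-line comparison, so only the key matters); with -k2 the comparison runs into the 3rd field, while with -k2,2 it stops at the 2nd:
$ printf 'a,2,c\nb,2,a\n' | sort -s -t, -k2
b,2,a
a,2,c
$ printf 'a,2,c\nb,2,a\n' | sort -s -t, -k2,2
a,2,c
b,2,a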
Here comes a working solution:
sort -t, -k2n,2 test
Explanation:
-t, # Set field separator to ','
-k2n,2 # sort by exactly the second column, numerically

How do I randomly merge two input files to one output file using unix tools?

I have two text files, of different sizes, which I would like to merge into one file, but with the content mixed randomly; this is to create some realistic data for some unit tests. One text file contains the true cases, while the other the false.
I would like to use standard Unix tools to create the merged output. How can I do this?
Random sort using -R:
$ sort -R file1 file2 -o file3
My version of sort does not support -R either. Here is an alternative using awk: insert a random number in front of each line, sort on those numbers, then strip them off. (Note that rand() is unseeded here, so the order repeats between runs; the next answer seeds it with srand().)
awk '{print int(rand()*1000), $0}' file1 file2 | sort -n | awk '{$1="";print $0}'
This adds a random number to the beginning of each line with awk, sorts based on that number, and then removes it. This will even work if you have duplicates (as pointed out by choroba) and is slightly more cross platform.
awk 'BEGIN { srand() } { print rand(), $0 }' file1 file2 |
sort -n |
cut -f2- -d" "
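If GNU coreutils is available, shuf (not mentioned in the original answers) does this directly:
cat file1 file2 | shuf > file3
Unlike sort -R, which sorts by a random hash of each line and therefore keeps identical lines adjacent, shuf produces a true random permutation.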

Sorting in bash

I have been trying to get the unique values in each column of a tab delimited file in bash. So, I used the following command.
cut -f <column_number> <filename> | sort | uniq -c
It works fine, and I get the unique values in a column and their counts, like:
105 Linux
55 MacOS
500 Windows
Instead of sorting by the column values (OS names in this example), I want to sort by count, and ideally have the count in the second column. So the output would have to look like:
Windows 500
MacOS 105
Linux 55
How do I do this?
Use:
cut -f <col_num> <filename> |
    sort |
    uniq -c |
    sort -r -k1 -n |
    awk '{print $2" "$1}'
The sort -r -k1 -n sorts in reverse order, using the first field as a numeric value. The awk simply reverses the order of the columns. You can test the added pipeline commands thus (with nicer formatting):
pax> echo '105 Linux
55 MacOS
500 Windows' | sort -r -k1 -n | awk '{printf "%-10s %5d\n",$2,$1}'
Windows 500
Linux 105
MacOS 55
Mine:
cut -f <column_number> <filename> | sort | uniq -c | awk '{ print $2" "$1}' | sort
This alters the column order (awk) and then just sorts the output lexically, by name. Hope this helps.
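Note that the trailing sort in the previous answer orders by name, not by count as in the desired output; a numeric, reversed key on the second column (an adjustment, not from the original answer) gives the count-descending order:
cut -f <column_number> <filename> | sort | uniq -c | awk '{print $2" "$1}' | sort -k2,2rn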
Using sed with tagged regular expressions (back-references) to swap the two columns:
cut -f <column_number> <filename> | sort | uniq -c | sort -r -k1 -n | sed 's/^ *\([0-9]*\) *\(.*\)/\2 \1/'
Doesn't produce output in a neat format though.
