Linux: How to sort the lines of a file

I have a file called abc. The content of abc is:
ccc
abc
ccc
ccc
a
b
dd
ccc
I want to sort the lines of the file and delete all duplicates (in this case, the ccc lines are duplicates).
In the shell script I use this:
sort -u < $1
But the sorted result goes to standard output instead of being saved into the abc file. How do I do this?

You can redirect the output to a file, as long as it is not the same file you are reading from (the shell would truncate it before sort runs):
sort -u < $1 > abc

Try:
sort -u abc -o abc_sorted
or, if you want to replace the file in place:
sort -u abc -o abc
You could also do:
sort abc | uniq > abc_sorted

You can do it using the commands sort and uniq, | (pipe), and > (redirection). But beware: if your file name is file, then
sort file | uniq > file
loses your data, because the shell opens and truncates file for the > redirection before sort ever reads it. Redirect to a different name instead:
sort file | uniq > file.sorted
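If you do want the result back in the same file, sort's -o flag is the safe route: sort is specified to read all of its input before opening the output, so naming the input file as the output is fine. A minimal sketch:
sort -u -o file file
Alternatively, go through a temporary file:
sort file | uniq > file.tmp && mv file.tmp file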

Related

Am I using the proper command?

I am trying to write a one-line terminal command to count all the unique "gene-MIR" entries in a very large file. Each "gene-MIR" is followed by a series of numbers, e.g. gene-MIR334223, gene-MIR633235, gene-MIR53453, etc., and there are multiple occurrences of the same entry, e.g. gene-MIR342433 may show up 10 times in the file.
My question is: how do I write a command that will list and count the unique "gene-MIR" entries present in my file?
The commands I have been using so far are:
grep -c "gene-MIR" myfile.txt | uniq
grep "gene-MIR" myfile.txt | sort -u
The first command provides me with a count; however, I believe it ignores the number series after "MIR" and only counts how many lines contain "gene-MIR" itself.
Thanks!
Assuming all the entries are on separate lines, try this:
grep "gene-MIR" myfile.txt | sort | uniq -c
If the entries are mixed up with other text, and the system has GNU grep, try this:
grep -o 'gene-MIR[0-9]*' myfile.txt | sort | uniq -c
To get the total count:
grep -o 'gene-MIR[0-9]*' myfile.txt | wc -l
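If instead you want the number of distinct identifiers, each counted only once, a small variation on the same pipeline (again assuming GNU grep for -o):
grep -o 'gene-MIR[0-9]*' myfile.txt | sort -u | wc -l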
If you have information like this:
Inf1
Inf2
Inf1
Inf2
And you want to know the number of distinct "Inf" kinds, you always need to sort it first, because uniq only collapses adjacent duplicates. Only afterwards can you start counting.
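A quick illustration of why the sort matters; on unsorted input, every line survives:
printf 'Inf1\nInf2\nInf1\nInf2\n' | uniq -c
      1 Inf1
      1 Inf2
      1 Inf1
      1 Inf2
printf 'Inf1\nInf2\nInf1\nInf2\n' | sort | uniq -c
      2 Inf1
      2 Inf2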
Edit
I've created a similar file containing the examples mentioned in the requester's comment, as follows:
Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR4232
gene-MIR2334
gene-MIR93284
More nonsense
On that file, I've applied both commands mentioned in the question:
grep -c "gene-MIR" myfile.txt | uniq
Which results in 6, just like the following command:
grep -c "gene-MIR" myfile.txt
Why? Because grep -c answers the question "How many lines contain the string gene-MIR?", which is clearly not the requested information.
The other command also is not correct:
grep "gene-MIR" myfile.txt | sort -u
The result:
gene-MIR2334
gene-MIR4232
gene-MIR93284
Explanation:
grep "gene-MIR" ... means: show all the lines, which contain "gene-MIR"
| sort -u means: sort the displayed lines and, if there are multiple instances of the same line, show only one of them.
This, too, is not what the requester wants. Therefore I have the following proposal:
grep "gene-MIR" myfile.txt | sort | uniq -c
With following result:
2 gene-MIR2334
2 gene-MIR4232
2 gene-MIR93284
This is more what the requester is looking for, I presume.
What does it mean?
grep "gene-MIR" myfile.txt : only show the lines which contain "gene-MIR"
| sort : sort the lines that are shown. This gives you an intermediate result like this:
gene-MIR2334
gene-MIR2334
gene-MIR4232
gene-MIR4232
gene-MIR93284
gene-MIR93284
| uniq -c : collapse the now-adjacent duplicate lines and show the count for each distinct one.
Unfortunately, the example is badly chosen as every instance occurs exactly two times. Therefore, for clarification purposes, I've created another "myfile.txt", as follows:
Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR2334
gene-MIR2334
gene-MIR93284
More nonsense
I've applied the same command again:
grep "gene-MIR" myfile.txt | sort | uniq -c
With following result:
3 gene-MIR2334
1 gene-MIR4232
2 gene-MIR93284
Here you can see much more clearly that the proposed command is correct.
... and your next question is: "Yes, but is it possible to sort the result?", to which I answer:
grep "gene-MIR" myfile.txt | sort | uniq -c | sort -n
With following result:
1 gene-MIR4232
2 gene-MIR93284
3 gene-MIR2334
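If you prefer the most frequent entries first, reverse the numeric sort; head can then trim the list to a top N (the 5 below is just an arbitrary example):
grep "gene-MIR" myfile.txt | sort | uniq -c | sort -nr | head -n 5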
Have fun!

How to filter multiple files and eliminate duplicate entries to select a single entry using the Linux shell

I have a folder that contains several files, all with identical columns. Let us say file1 and file2 have the following contents (there can be more than two files):
$cat file1.txt
9999999999|1200
8888888888|1400
7777777777|1255
6666666666|1788
7777777777|1289
9999999999|1300
$cat file2.txt
9999999999|2500
8888888888|2450
6666666666|2788
9999999999|3000
2222222222|3001
In each file, the first column is a mobile number and the second is a count. The same mobile number can appear in multiple files. Now I want to get the records into a file with unique mobile numbers, keeping for each number the record with the highest count.
The output should be as follows:
$cat output.txt
7777777777|1289
8888888888|2450
6666666666|2788
9999999999|3000
2222222222|3001
Any help would be appreciated.
That's probably not very efficient, but it does the job. Put this into phones.sh and run bash phones.sh:
#!/bin/bash
# List of input files; add more names here as needed.
files="
file1.txt
file2.txt
"
# Collect the distinct phone numbers from the first |-separated column.
phones=$(cat $files | cut -d'|' -f1 | sort -u)
# For each number, keep the line with the highest count (column 2).
# The pattern is anchored so a number cannot accidentally match a count.
for phone in $phones; do grep -h "^$phone|" $files | sort -t'|' -k 2 -nr | head -n1; done | sort -t'|' -k 2 -n
What it does, basically: extract all the phone numbers from the files, iterate over them, grep each one in all files, and select the line with the highest count. The final result is then also sorted by count, which is what your expected output suggests. sort -t'|' -k 2 -nr means: sort on the second column, using | as the delimiter, in decreasing numerical order. head -n1 selects the first line. You can add other files to the files variable.
Another way of doing this is to use the power of sort and awk:
cat file1.txt file2.txt | sort -t '|' -k1,1 -k2,2nr | awk -F"|" '!_[$1]++' | sort -t '|' -k2,2n
I think the one-liner is pretty self-explanatory, except for the awk part: it performs a uniq keyed on the first column, keeping only the first line seen for each mobile number, which thanks to the preceding sort is the one with the highest count (see the illustration below). The last sort is just to get the final order that you wanted.
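The !_[$1]++ idiom works like this: _ is an associative array keyed on the first field, and the expression is true only the first time a given key is seen; awk prints any line for which the pattern is true. A standalone illustration:
printf '9999999999|3000\n9999999999|2500\n' | awk -F'|' '!_[$1]++'
9999999999|3000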

Unix command "uniq" & "sort"

As we know:
uniq [options] [file1 [file2]]
removes duplicate adjacent lines from the sorted file1. The option -c prints each line once, prefixed with a count of its instances. So if we have the following result:
34 Operating System
254 Data Structure
5 Crypo
21 C++
1435 C Language
589 Java 1.6
And if we sort the above data using "sort -k1nr", the result is as below:
1435 C Language
589 Java 1.6
254 Data Structure
34 Operating System
21 C++
5 Crypo
Can anyone help me out with how to output only the book names in this order (without the numbers)?
uniq -c filename | sort -k 1nr | awk '{$1="";print}'
You can also use sed for that, as follows:
uniq -c filename | sort -k 1nr | sed 's/[0-9]\+ \(.\+\)/\1/g'
Test:
echo "34 Data Structure" | sed 's/[0-9]\+ \(.\+\)/\1/g'
Data Structure
This can also be done with a simplified regex (courtesy William Pursell):
echo "34 Data Structure" | sed 's/[0-9]* *//'
Data Structure
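One caveat: uniq -c right-aligns the count with leading blanks (e.g. "     34 Operating System"), and on such input the simplified pattern only strips the blanks, because sed replaces the first match it finds. Anchoring the expression handles both shapes:
echo "     34 Data Structure" | sed 's/^ *[0-9]* *//'
Data Structure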
Why do you use uniq -c to print the number of occurrences, which you then want to remove with some cut/awk/sed dance?
Instead, you could just use
sort -u $file1 $file2 /path/to/more_files_to_glob*
Or do some systems come with a version of sort which doesn't support -u?

Linux: How to list information about the files in a directory (size, permissions, number of files by type) in total

Suppose I am in the current directory and I want to list the total number of files, as well as their sizes and permissions, and also the number of files by type.
Here is a sample output:
Print information about "/home/user/poker"
total number of file : 83
pdf files : 5
html files : 9
text files : 15
unknown : 5
NB: any file without an extension can be considered unknown.
I hope to use some simple commands like ls, cut, sort, uniq (just examples) to put each different extension in a file and use wc -l to count the number of lines, or do I need to use grep, awk, or something else?
I hope to get everybody's advice. Thank you!
The best way is to use file to output only the MIME type and pass it to awk.
file * -ib | awk -F'[;/.]' '{print $(NF-1)}' | sort -n | uniq -c
On my home directory it produces this output.
35 directory
3 html
1 jpeg
1 octet-stream
1 pdf
32 plain
5 png
1 spreadsheet
7 symlink
1 text
1 x-c++
3 x-empty
1 xml
2 x-ms-asf
4 x-shellscript
1 x-shockwave-flash
If you think text/x-c++ and text/plain should be counted in the same group, use this:
file * -ib | awk -F'[;/.]' '{print $1}' | sort -n | uniq -c
6 application
6 image
45 inode
40 text
2 video
Change the {print $1} part according to your need to get the appropriate output.
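For reference, file -ib prints a MIME type plus encoding, which is why the [;/.] field separator works. On a typical GNU/Linux system you get something like:
file -ib /etc/hostname
text/plain; charset=us-ascii
so $1 is the type (text) and $(NF-1) is the subtype (plain).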
You need bash:
#!/bin/bash
# nullglob makes an unmatched pattern expand to nothing
# instead of to itself (which would otherwise count as one file).
shopt -s nullglob
files=(*)
pdfs=(*.pdf)
echo "${#files[@]}"                        # total number of entries
echo "${#pdfs[@]}"                         # number of .pdf files
echo "$(( ${#files[@]} - ${#pdfs[@]} ))"   # everything else
find . -type f | xargs -n1 basename | fgrep . | sed 's/.*\.//' | sort | uniq -c | sort -n
That gives you a recursive list of file extensions. If you want only the current directory, add a -maxdepth 1 to the find command.
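To produce the exact report layout from the question, here is a minimal sketch using only find, wc, and printf (GNU find is assumed for -maxdepth; the pdf/html/txt list and the "no dot means unknown" rule are taken from the question, and report.sh is a hypothetical name):
#!/bin/sh
# Usage: sh report.sh [directory]; defaults to the current directory.
dir=${1:-.}
printf 'Print information about "%s"\n' "$dir"
printf 'total number of file : %s\n' "$(find "$dir" -maxdepth 1 -type f | wc -l)"
for ext in pdf html txt; do
    printf '%s files : %s\n' "$ext" "$(find "$dir" -maxdepth 1 -type f -name "*.$ext" | wc -l)"
done
# Any file without a dot in its name is counted as unknown.
printf 'unknown : %s\n' "$(find "$dir" -maxdepth 1 -type f ! -name '*.*' | wc -l)"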

Looping through a text file containing domains using a bash script

I have written a script that reads the href tags of a webpage, fetches the links on that page, and writes them to a text file. Now I have a text file containing links such as these, for example:
http://news.bbc.co.uk/2/hi/health/default.stm
http://news.bbc.co.uk/weather/
http://news.bbc.co.uk/weather/forecast/8?area=London
http://newsvote.bbc.co.uk/1/shared/fds/hi/business/market_data/overview/default.stm
http://purl.org/dc/terms/
http://static.bbci.co.uk/bbcdotcom/0.3.131/style/3pt_ads.css
http://static.bbci.co.uk/frameworks/barlesque/2.8.7/desktop/3.5/style/main.css
http://static.bbci.co.uk/frameworks/pulsesurvey/0.7.0/style/pulse.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie6.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie7.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie8.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/main.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/img/iphone.png
http://www.bbcamerica.com/
http://www.bbc.com/future
http://www.bbc.com/future/
http://www.bbc.com/future/story/20120719-how-to-land-on-mars
http://www.bbc.com/future/story/20120719-road-opens-for-connected-cars
http://www.bbc.com/future/story/20120724-in-search-of-aliens
http://www.bbc.com/news/
I would like to be able to filter them such that I return something like:
http://www.bbc.com : 6
http://static.bbci.co.uk : 15
The values on the side indicate the number of times the domain appears in the file. How can I achieve this in bash, given that I would have a loop going through the file? I am a newbie to bash shell scripting.
$ cut -d/ -f-3 urls.txt | sort | uniq -c
3 http://news.bbc.co.uk
1 http://newsvote.bbc.co.uk
1 http://purl.org
8 http://static.bbci.co.uk
1 http://www.bbcamerica.com
6 http://www.bbc.com
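Here, cut -d/ -f-3 splits each line on / and keeps fields one through three, which for a URL are the scheme, the empty field between the two slashes, and the host:
echo 'http://news.bbc.co.uk/2/hi/health/default.stm' | cut -d/ -f-3
http://news.bbc.co.uk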
Just like this:
egrep -o '^http://[^/]+' domain.txt | sort | uniq -c
Output of this on your example data:
3 http://news.bbc.co.uk/
1 http://newsvote.bbc.co.uk/
1 http://purl.org/
8 http://static.bbci.co.uk/
6 http://www.bbc.com/
1 http://www.bbcamerica.com/
This solution works even if your line is made up of a plain URL without a trailing slash, so
http://www.bbc.com/news
http://www.bbc.com/
http://www.bbc.com
will all be in the same group.
If you want to allow https, then you can write:
egrep -o '^https?://[^/]+' domain.txt | sort | uniq -c
If other protocols are possible, such as ftp, mailto, etc. you can even be very loose and write:
egrep -o '^[^:]+://[^/]+' domain.txt | sort | uniq -c
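And if you want the exact "domain : count" layout from the question, one way (a sketch) is to post-process the uniq -c output with awk, which sees the count as $1 and the domain as $2:
egrep -o '^[^:]+://[^/]+' domain.txt | sort | uniq -c | awk '{print $2 " : " $1}'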
