Am I using the proper command? - Linux

I am trying to write a one-line terminal command to count all the unique "gene-MIR" entries in a very large file. Each "gene-MIR" is followed by a series of numbers, e.g. gene-MIR334223, gene-MIR633235, gene-MIR53453, etc., and the same "gene-MIR" can appear multiple times, e.g. gene-MIR342433 may show up 10 times in the file.
My question is: how do I write a command that will count the unique "gene-MIR" entries that are present in my file?
The commands I have been using so far are:
grep -c "gene-MIR" myfile.txt | uniq
grep "gene-MIR" myfile.txt | sort -u
The first command provides me with a count; however, I believe it ignores the number series after "MIR" and only counts how many lines contain "gene-MIR" at all.
Thanks!

Assuming all the entries are on separate lines, try this:
grep "gene-MIR" myfile.txt | sort | uniq -c
If the entries are mixed up with other text, and the system has GNU grep, try this:
grep -o 'gene-MIR[0-9]*' myfile.txt | sort | uniq -c
To get the total count:
grep -o 'gene-MIR[0-9]*' myfile.txt | wc -l
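If what you actually want is the number of distinct gene-MIR identifiers, rather than the total number of occurrences, drop the duplicates before counting:
grep -o 'gene-MIR[0-9]*' myfile.txt | sort -u | wc -l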

If you have information like this:
Inf1
Inf2
Inf1
Inf2
and you want to know how many there are of each kind of "Inf", you always need to sort the input first; only afterwards can you start counting, because uniq only collapses adjacent duplicate lines.
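A quick demonstration of why (using printf to fake the input; the exact whitespace in the uniq -c output may differ on your system):
printf 'Inf1\nInf2\nInf1\nInf2\n' | uniq -c     # without sort: no duplicates are adjacent
1 Inf1
1 Inf2
1 Inf1
1 Inf2
printf 'Inf1\nInf2\nInf1\nInf2\n' | sort | uniq -c
2 Inf1
2 Inf2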
Edit
I've created a similar file containing the examples mentioned in the requester's comment, as follows:
Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR4232
gene-MIR2334
gene-MIR93284
More nonsense
On that file, I've applied both commands mentioned in the question:
grep -c "gene-MIR" myfile.txt | uniq
Which results in 6, just like the following command:
grep -c "gene-MIR" myfile.txt
Why? The question this command answers is: how many lines contain the string "gene-MIR"?
This is clearly not the requested information.
The other command is also not correct:
grep "gene-MIR" myfile.txt | sort -u
The result:
gene-MIR2334
gene-MIR4232
gene-MIR93284
Explanation:
grep "gene-MIR" ... means: show all the lines, which contain "gene-MIR"
| sort -u means: sort the displayed lines and if there are multiple instances of the same, only show one of them.
This, too, is not what the requester wants. Therefore I have the following proposal:
grep "gene-MIR" myfile.txt | sort | uniq -c
With the following result:
2 gene-MIR2334
2 gene-MIR4232
2 gene-MIR93284
This is more like what the requester is looking for, I presume.
What does it mean?
grep "gene-MIR" myfile.txt : only show the lines which contain "gene-MIR"
| sort : sort the lines, which are shown. Like this, you get an intermediate result like this:
gene-MIR2334
gene-MIR2334
gene-MIR4232
gene-MIR4232
gene-MIR93284
gene-MIR93284
| uniq -c : group those results together and show the count for every instance.
Unfortunately, the example is badly chosen as every instance occurs exactly two times. Therefore, for clarification purposes, I've created another "myfile.txt", as follows:
Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR2334
gene-MIR2334
gene-MIR93284
More nonsense
I've applied the same command again:
grep "gene-MIR" myfile.txt | sort | uniq -c
With the following result:
3 gene-MIR2334
1 gene-MIR4232
2 gene-MIR93284
Here you can see in a much clearer way that the proposed command is correct.
... and your next question is: "Yes, but is it possible to sort the result?", to which I answer:
grep "gene-MIR" myfile.txt | sort | uniq -c | sort -n
With the following result:
1 gene-MIR4232
2 gene-MIR93284
3 gene-MIR2334
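And if you prefer the most frequent entries first, add -r to reverse the numeric sort:
grep "gene-MIR" myfile.txt | sort | uniq -c | sort -nr
With the following result:
3 gene-MIR2334
2 gene-MIR93284
1 gene-MIR4232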
Have fun!

Related

How to create a script that takes a list of words as input and prints only words that appear exactly once

Requirements for input and output files:
Input format: One line, one word
Output format: One line, one word
Words should be sorted
I tried to use this command to solve this question
sort list | uniq
but it fails.
Can anyone help me solve it?
Try this:
cat <file_name> | sort | uniq -c | grep -E '^\s*1\s' | awk '{print $NF}'
Explanation:
cat <file_name> | sort | uniq -c --> will print all the entries, sorted, together with the count of each one.
grep -E '^\s*1\s' --> matches only the lines whose count field is exactly 1, excluding every entry that appears more than once (a looser pattern such as '^\s.*1\s' would wrongly match counts like 11 or 21).
awk is used to remove the count and print just the name.
It would be nicer, simpler, and more elegant to use this command to perform the task:
cat <file_name> | sort | uniq -u
And it does the task perfectly.
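For example, with a small hypothetical word list:
printf 'apple\nbanana\napple\ncherry\n' | sort | uniq -u
banana
cherry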
The answer given by @Evans Fone assisted me.
If you're trying to implement a script that runs as:
cat list | ./scriptname
Then do the following:
Step 1:
Type
emacs scriptname
Step 2:
Press
ENTER
Step 3:
Type
#!/bin/bash
sort | uniq -u
Step 4:
Press
CTRL+X
CTRL+S
CTRL+X
CTRL+C
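Step 5:
Make the script executable, otherwise ./scriptname will not run:
chmod +x scriptname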
sort | uniq -u
As simple as that.
sort without arguments reads from standard input and sorts it; the result is piped to uniq -u, which prints only the words that appear exactly once.

Pipelining cut sort uniq

I am trying to get a certain field from a SAM file, sort it, and then find the number of unique numbers in the file. I have been trying:
cut -f 2 practice.sam > field2.txt | sort -o field2.txt sortedfield2.txt |
uniq -c sortedfield2.txt
The cut works to pull out the numbers from field two; however, when I try to sort the numbers into a new file (or the same file) I just get a blank. I have tried breaking the pipeline into sections but still get the same error. I am meant to use those three commands to achieve the output count.
Use
cut -f 2 practice.sam | sort | uniq -c
In your original code, you're redirecting the output of cut to field2.txt and at the same time, trying to pipe the output into sort. That won't work (unless you use tee). Either separate the commands as individual commands (e.g., use ;) or don't redirect the output to a file.
Ditto the second half, where you write the output to a file with sort -o and thus end up with nothing going to stdout, and nothing being piped into uniq. (Note also the argument order: it is sort -o OUTPUT INPUT, so sorting field2.txt into sortedfield2.txt must be written as sort -o sortedfield2.txt field2.txt.)
So an alternative could be:
cut -f 2 practice.sam > field2.txt ; sort -o sortedfield2.txt field2.txt ; uniq -c sortedfield2.txt
which is the same as
cut -f 2 practice.sam > field2.txt
sort -o sortedfield2.txt field2.txt
uniq -c sortedfield2.txt
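If you do want to keep the intermediate field2.txt while still using a single pipeline, tee (mentioned above) writes its input to a file and passes it along at the same time:
cut -f 2 practice.sam | tee field2.txt | sort | uniq -c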
You can use this command:
cut -f 2 practice.sam | sort | uniq > sorted.txt
Note that sort has to come before uniq, since uniq only removes adjacent duplicates. Your original code fails with "No such file or directory" because of how the pipe and the redirection are combined. You can learn how pipes are used at this link:
https://www.guru99.com/linux-pipe-grep.html

Count lines of CLI output in Linux

Hi, I have the following command:
lsscsi | grep HITACHI | awk '{print $6}'
I want the output to be the number of lines of the original output.
For example, if the original output is:
/dev/sda
/dev/sdb
/dev/sdc
The final output will be 3.
Basically, the command wc -l can be used to count the lines in a file or pipe. However, since you want to count the number of lines after a filter has been applied, I would recommend using grep for that:
lsscsi | grep -c 'HITACHI'
-c just prints the number of matching lines.
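For comparison, the wc -l variant applied to your original pipeline would be:
lsscsi | grep HITACHI | awk '{print $6}' | wc -l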
Another thing: in your example you are using grep ... | awk. That's a useless use of grep. It should be
lsscsi | awk '/HITACHI/{print $6}'
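And if all you need is the count, awk can do the filtering and the counting in a single process (a sketch):
lsscsi | awk '/HITACHI/ {n++} END {print n+0}'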

Unix script: sort showing the number of items?

I have a shell script which greps the contents of a file and then calls sort -u to get the unique entries. Is there a way to have sort also tell me how many of each of those entries there are? So the output would be something like:
user1 - 50
user2 - 23
user3 - 40
etc..
Use sort input | uniq -c. uniq does what -u does in sort -u, but also has the additional -c option for counting.
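If you also want the exact "user - count" layout from the question, a small awk step (a sketch; adjust the separator to taste) can reorder the uniq -c columns:
sort input | uniq -c | awk '{print $2 " - " $1}'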
grep has a -c switch that counts matching lines:
grep -c needle haystack
will give the number of lines containing needle, which you can sort as needed.
Given a sorted list, uniq -c will show each item and how many times it occurs. The count will be the first column, so I will often do something like:
sort file.txt | uniq -c | sort -nr
The -n makes sort compare numbers correctly, like 9 before 11 (and the -r reverses the order, since I usually want the higher-count lines first).

Looping through a text file containing domains using bash script

I have written a script that reads the href tags of a webpage, fetches the links on that page, and writes them to a text file. Now I have a text file containing links such as these, for example:
http://news.bbc.co.uk/2/hi/health/default.stm
http://news.bbc.co.uk/weather/
http://news.bbc.co.uk/weather/forecast/8?area=London
http://newsvote.bbc.co.uk/1/shared/fds/hi/business/market_data/overview/default.stm
http://purl.org/dc/terms/
http://static.bbci.co.uk/bbcdotcom/0.3.131/style/3pt_ads.css
http://static.bbci.co.uk/frameworks/barlesque/2.8.7/desktop/3.5/style/main.css
http://static.bbci.co.uk/frameworks/pulsesurvey/0.7.0/style/pulse.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie6.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie7.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie8.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/main.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/img/iphone.png
http://www.bbcamerica.com/
http://www.bbc.com/future
http://www.bbc.com/future/
http://www.bbc.com/future/story/20120719-how-to-land-on-mars
http://www.bbc.com/future/story/20120719-road-opens-for-connected-cars
http://www.bbc.com/future/story/20120724-in-search-of-aliens
http://www.bbc.com/news/
I would like to be able to filter them such that I return something like:
http://www.bbc.com : 6
http://static.bbci.co.uk : 15
The values on the side indicate the number of times the domain appears in the file. How can I achieve this in bash, considering I would have a loop going through the file? I am a newbie to bash shell scripting.
$ cut -d/ -f-3 urls.txt | sort | uniq -c
3 http://news.bbc.co.uk
1 http://newsvote.bbc.co.uk
1 http://purl.org
8 http://static.bbci.co.uk
1 http://www.bbcamerica.com
6 http://www.bbc.com
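Here -d/ tells cut to split each line on "/", and -f-3 keeps fields 1 through 3, which for a URL are the scheme ("http:"), the empty field between the two slashes, and the host. For example:
cut -d/ -f-3 <<< 'http://news.bbc.co.uk/2/hi/health/default.stm'
http://news.bbc.co.uk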
Just like this:
egrep -o '^http://[^/]+' domain.txt | sort | uniq -c
Output of this on your example data:
3 http://news.bbc.co.uk
1 http://newsvote.bbc.co.uk
1 http://purl.org
8 http://static.bbci.co.uk
6 http://www.bbc.com
1 http://www.bbcamerica.com
This solution works even if your line is made up of a simple URL without a trailing slash, so
http://www.bbc.com/news
http://www.bbc.com/
http://www.bbc.com
will all be in the same group.
If you want to allow https, then you can write:
egrep -o '^https?://[^/]+' domain.txt | sort | uniq -c
If other protocols are possible, such as ftp, mailto, etc., you can even be very loose and write:
egrep -o '^[^:]+://[^/]+' domain.txt | sort | uniq -c
