Pipelining cut sort uniq - linux

I'm trying to get a certain field from a SAM file, sort it, and then find the number of unique numbers in the file. I have been trying:
cut -f 2 practice.sam > field2.txt | sort -o field2.txt sortedfield2.txt |
uniq -c sortedfield2.txt
The cut is working to pull out the numbers from field two; however, when I try to sort the numbers into a new file (or the same file) I just get a blank. I have tried breaking the pipeline into sections but still get the same error. I am meant to use those three commands to produce the count.

Use
cut -f 2 practice.sam | sort | uniq -c
In your original code, you're redirecting the output of cut to field2.txt and at the same time, trying to pipe the output into sort. That won't work (unless you use tee). Either separate the commands as individual commands (e.g., use ;) or don't redirect the output to a file.
Ditto the second half: sort -o sends its output to a file (note that -o takes the output file as its argument, so sort -o field2.txt sortedfield2.txt would actually read sortedfield2.txt and write to field2.txt), so nothing goes to stdout and nothing is piped into uniq.
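If you really do want the intermediate files and a pipeline at the same time, a minimal sketch using tee (file names taken from the question) would be:
cut -f 2 practice.sam | tee field2.txt | sort | tee sortedfield2.txt | uniq -c
tee writes its input to the named file and also passes it through on stdout, so each later stage still receives the data.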
So an alternative could be:
cut -f 2 practice.sam > field2.txt ; sort -o sortedfield2.txt field2.txt ; uniq -c sortedfield2.txt
which is the same as
cut -f 2 practice.sam > field2.txt
sort -o sortedfield2.txt field2.txt
uniq -c sortedfield2.txt

You can use this command:
cut -f 2 practice.sam | sort | uniq > sorted.txt
Your code is wrong; the "No such file or directory" error comes from how the pipe is being used. You can learn how pipes are used at this link:
https://www.guru99.com/linux-pipe-grep.html

Related

Am I using the proper command?

I am trying to write a one-line command on the terminal to count all the unique "gene-MIR" in a very large file. The "gene-MIR" are followed by a series of numbers, e.g. gene-MIR334223, gene-MIR633235, gene-MIR53453, etc., and there are multiples of the same "gene-MIR", e.g. gene-MIR342433 may show up 10 times in the file.
My question is, how do I write a command that will annotate the unique "gene-MIR" that are present in my file?
The commands I have been using so far are:
grep -c "gene-MIR" myfile.txt | uniq
grep "gene-MIR" myfile.txt | sort -u
The first command provides me with a count; however, I believe it does not take the number series after "MIR" into account and is only counting how many lines containing "gene-MIR" are present.
Thanks!
Assuming all the entries are on separate lines, try this:
grep "gene-MIR" myfile.txt | sort | uniq -c
If the entries are mixed up with other text, and the system has GNU grep, try this:
grep -o 'gene-MIR[0-9]*' myfile.txt | sort | uniq -c
To get the total count:
grep -o 'gene-MIR[0-9]*' myfile.txt | wc -l
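And if what you actually need is the number of distinct gene-MIR identifiers rather than the total number of matches, a small variation on the same idea (still assuming GNU grep for -o) would be:
grep -o 'gene-MIR[0-9]*' myfile.txt | sort -u | wc -l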
If you have information like this:
Inf1
Inf2
Inf1
Inf2
and you want to know how many there are of each "Inf" kind, you always need to sort first. Only afterwards can you start counting.
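A quick illustration of why, using the four lines above (printf here is just a stand-in for the real file):
$ printf 'Inf1\nInf2\nInf1\nInf2\n' | uniq -c
      1 Inf1
      1 Inf2
      1 Inf1
      1 Inf2
$ printf 'Inf1\nInf2\nInf1\nInf2\n' | sort | uniq -c
      2 Inf1
      2 Inf2
Without the sort, uniq only collapses adjacent duplicates, so the counts come out wrong.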
Edit
I've created a similar file containing the examples mentioned in the requester's comment, as follows:
Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR4232
gene-MIR2334
gene-MIR93284
More nonsense
On that file, I've applied both commands mentioned in the question:
grep -c "gene-MIR" myfile.txt | uniq
Which results in 6, just like the following command:
grep -c "gene-MIR" myfile.txt
Why? The question here is "How many lines contain the string "gene-MIR"?".
This is clearly not the requested information.
The other command is also not correct:
grep "gene-MIR" myfile.txt | sort -u
The result:
gene-MIR2334
gene-MIR4232
gene-MIR93284
Explanation:
grep "gene-MIR" ... means: show all the lines, which contain "gene-MIR"
| sort -u means: sort the displayed lines and if there are multiple instances of the same, only show one of them.
This is also not what the requester wants. Therefore I have the following proposal:
grep "gene-MIR" myfile.txt | sort | uniq -c
With the following result:
2 gene-MIR2334
2 gene-MIR4232
2 gene-MIR93284
This is more what the requester is looking for, I presume.
What does it mean?
grep "gene-MIR" myfile.txt : only show the lines which contain "gene-MIR"
| sort : sort the lines that are shown. This gives an intermediate result like this:
gene-MIR2334
gene-MIR2334
gene-MIR4232
gene-MIR4232
gene-MIR93284
gene-MIR93284
| uniq -c : group those results together and show the count for every instance.
Unfortunately, the example is badly chosen as every instance occurs exactly two times. Therefore, for clarification purposes, I've created another "myfile.txt", as follows:
Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR2334
gene-MIR2334
gene-MIR93284
More nonsense
I've applied the same command again:
grep "gene-MIR" myfile.txt | sort | uniq -c
With the following result:
3 gene-MIR2334
1 gene-MIR4232
2 gene-MIR93284
Here you can see in a much clearer way that the proposed command is correct.
... and your next question is: "Yes, but is it possible to sort the result?", on which I answer:
grep "gene-MIR" myfile.txt | sort | uniq -c | sort -n
With the following result:
1 gene-MIR4232
2 gene-MIR93284
3 gene-MIR2334
Have fun!

How to create a script that takes a list of words as input and prints only words that appear exactly once

Requirements for input and output files:
Input format: One line, one word
Output format: One line, one word
Words should be sorted
I tried to use this command to solve this question
sort list | uniq
but it fails.
Anyone who can help me to solve it?
Try this:
cat <file_name> | sort | uniq -c | grep -e '^\s*1\s' | awk '{print $NF}'
Explanation:
cat <file_name> | sort | uniq -c --> prints all the entries, sorted, with the count of each one.
grep -e '^\s*1\s' --> a regex that keeps only the entries whose count is exactly 1 (optional leading whitespace, the digit 1, then whitespace).
awk is used to remove the count and print just the name.
It would be nicer, simpler and more elegant to use this command to perform the task:
cat <file_name> | sort | uniq -u
And it does the task perfectly.
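To see the difference between plain uniq and uniq -u (and why the sort list | uniq attempt from the question fails), here is a small sketch with a hypothetical word list:
$ printf 'apple\nbanana\napple\ncherry\n' | sort | uniq
apple
banana
cherry
$ printf 'apple\nbanana\napple\ncherry\n' | sort | uniq -u
banana
cherry
Plain uniq keeps one copy of every word, while uniq -u prints only the words that are not repeated at all, which is what the exercise asks for.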
The answer given by @Evans Fone assisted me.
If you're trying to implement a script that runs as:
cat list | ./scriptname
Then do the following:
Step 1:
Type
emacs scriptname
Step 2:
Press
ENTER
Step 3:
Type
#!/bin/bash
sort | uniq -u
Step 4:
Press
CTRL+X
CTRL+S
CTRL+X
CTRL+C
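Assuming the script has also been made executable (for example with chmod +x scriptname, a detail the steps above skip), a quick test might look like:
$ printf 'apple\nbanana\napple\ncherry\n' | ./scriptname
banana
cherry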
sort | uniq -u
as simple as that.
sort without an argument reads from standard input, sorts it, and pipes the result to uniq -u, which prints only the words that are not repeated.

Why does uniq -c command return duplicates in some cases?

I am trying to grep for words in one file that are not present in another file
grep -v -w -i -r -f "dont_use_words.txt" "list_of_words.txt" >> inverse_match_words.txt
uniq -c -i inverse_match_words.txt | sort -nr
But I get duplicate values in my uniq command. Why so?
I am wondering if it might be because grep differentiates between strings, say, "AAA" found in "GIRLAAA", "AAABOY", "GIRLAAABOY" and therefore, I end up with duplicates.
When I do a grep -F "AAA" all of them are returned though.
I'd appreciate if someone could help me out on this. I am new to Linux OS.
uniq eliminates all but one line in each group of consecutive duplicate lines. The conventional way to use it, therefore, is to pass the input through sort first. You're not doing that, so yes, it is entirely possible that (non-consecutive) duplicates will remain in the output.
Example:
grep -v -w -i -f dont_use_words.txt list_of_words.txt \
| sort -f \
| uniq -c -i \
| sort -nr

How to sort a text file numerically and then store the results in the same text file?

I have tried sort -n test.text > test.text. However, this leaves me with an empty text file. What is going on here and what can I do to solve this problem?
Sort does not sort the file in-place. It outputs a sorted copy instead.
You need sort -n -k 4 out.txt > sorted-out.txt.
Edit: To get the order you want you have to sort the file with the numbers read in reverse. This does it:
cut -d' ' -f4 out.txt | rev | paste - out.txt | sort -k1 -n | cut -f2- > sorted-out.txt
For more learning -
sort -nk4 file
-n for numerical sort
-k for providing key
or add -r option for reverse sorting
sort -nrk4 file
It is because you are reading from and writing to the same file. You can't do that. You can use a temporary file (for example one created with mktemp), or even something as simple as:
sort -n test.text > test1.txt
mv test1.txt test.text
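A sketch of the mktemp variant mentioned above (same idea, just with a safely created temporary file):
tmp=$(mktemp)
sort -n test.text > "$tmp" && mv "$tmp" test.text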
For sort specifically, you can also do the following, which is safe because sort reads all of its input before it writes to the file given with -o:
sort -n test.text -o test.text

Sorting in bash

I have been trying to get the unique values in each column of a tab delimited file in bash. So, I used the following command.
cut -f <column_number> <filename> | sort | uniq -c
It works fine and I can get the unique values in a column and their counts, like
105 Linux
55 MacOS
500 Windows
What I want to do is, instead of sorting by the column values (which in this example are OS names), sort them by count, and possibly have the count in the second column of the output. So it would have to look like:
Windows 500
Linux 105
MacOS 55
How do I do this?
Use:
cut -f <col_num> <filename> \
| sort \
| uniq -c \
| sort -r -k1 -n \
| awk '{print $2" "$1}'
The sort -r -k1 -n sorts in reverse order, using the first field as a numeric value. The awk simply reverses the order of the columns. You can test the added pipeline commands thus (with nicer formatting):
pax> echo '105 Linux
55 MacOS
500 Windows' | sort -r -k1 -n | awk '{printf "%-10s %5d\n",$2,$1}'
Windows      500
Linux        105
MacOS         55
Mine:
cut -f <column_number> <filename> | sort | uniq -c | awk '{ print $2" "$1}' | sort
This will alter the column order (awk) and then just sort the output. Note that the final sort orders the lines alphabetically by name; use sort -k2 -nr at the end instead if you want them ordered by count.
Hope this will help you
Using sed based on Tagged RE:
cut -f <column_number> <filename> | sort | uniq -c | sort -r -k1 -n | sed 's/^[ ]*\([0-9]*\)[ ]*\(.*\)/\2 \1/'
Doesn't produce output in a neat format though.
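If you do want neat columns, one option (just a sketch, reusing the printf idea from the earlier answer) is to replace the sed stage with awk:
cut -f <column_number> <filename> | sort | uniq -c | sort -r -k1 -n | awk '{printf "%-10s %5d\n", $2, $1}'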
