How to create a script that takes a list of words as input and prints only the words that appear exactly once - linux

Requirements for the input and output files:
Input format: one word per line
Output format: one word per line
Words should be sorted
I tried to solve this with the command
sort list | uniq
but it fails.
Can anyone help me solve it?

Try the following:
cat <file_name> | sort | uniq -c | grep -E '^\s*1\s' | awk '{print $NF}'
Explanation:
cat <file_name> | sort | uniq -c --> prints all the entries, sorted, with the count of each word.
grep -E '^\s*1\s' --> a regex that keeps only entries whose count is exactly 1 (the looser '^\s.*1\s' would also match counts ending in 1, such as 21 or 31).
awk '{print $NF}' removes the count and prints just the word.
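The idea can be sketched on a tiny sample; words.txt is a hypothetical file, and awk's own numeric comparison is used here in place of the grep step:

```shell
# Hypothetical sample: one word per line
printf 'apple\nbanana\napple\ncherry\n' > words.txt

# Count occurrences, keep only counts of exactly 1, print just the word
sort words.txt | uniq -c | awk '$1 == 1 {print $2}'
# banana
# cherry
```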

It would be nicer, simpler and more elegant to use this command for the task:
sort <file_name> | uniq -u
uniq -u prints only the lines that are not repeated, which is exactly what is asked.
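The distinction matters here: plain uniq (the original attempt) collapses duplicates to one copy, while uniq -u drops repeated lines entirely. A quick sketch on a hypothetical words.txt:

```shell
printf 'apple\nbanana\napple\ncherry\n' > words.txt

sort words.txt | uniq      # deduplicates: apple banana cherry
sort words.txt | uniq -u   # keeps only unrepeated lines: banana cherry
```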

The answer given by @Evans Fone assisted me.
If you're trying to implement a script that runs as:
cat list | ./scriptname
Then do the following:
Step 1:
Type
emacs scriptname
Step 2:
Press
ENTER
Step 3:
Type
#!/bin/bash
sort | uniq -u
Step 4:
Press
CTRL+X
CTRL+S
CTRL+X
CTRL+C

sort | uniq -u
As simple as that.
sort without a file argument reads from standard input, sorts it, and pipes it to uniq -u, which prints only the words that appear exactly once.
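To run the script as ./scriptname it also needs to be executable; a minimal sketch of saving and using it (scriptname and the sample words are placeholders):

```shell
# Save the two-line script and make it executable
printf '#!/bin/bash\nsort | uniq -u\n' > scriptname
chmod +x scriptname

# Feed it a word list on stdin
printf 'dog\ncat\ndog\n' | ./scriptname
# cat
```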

Related

Am I using the proper command?

I am trying to write a one-line terminal command to count all the unique "gene-MIR" entries in a very large file. Each "gene-MIR" is followed by a series of numbers, e.g. gene-MIR334223, gene-MIR633235, gene-MIR53453, etc., and the same "gene-MIR" can occur multiple times, e.g. gene-MIR342433 may show up 10 times in the file.
My question is, how do I write a command that will annotate the unique "gene-MIR" that are present in my file?
The commands I have been using so far is:
grep -c "gene-MIR" myfile.txt | uniq
grep "gene-MIR" myfile.txt | sort -u
The first command provides me with a count; however, I believe it does not take the number series after "MIR" into account and is only counting how many lines contain "gene-MIR" itself.
Thanks!
Assuming all the entries are on separate lines, try this:
grep "gene-MIR" myfile.txt | sort | uniq -c
If the entries are mixed up with other text, and the system has GNU grep try this:
grep -o 'gene-MIR[0-9]*' myfile.txt | sort | uniq -c
To get the total count:
grep -o 'gene-MIR[0-9]*' myfile.txt | wc -l
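A sketch of the -o variant on a fabricated myfile.txt, where the IDs are mixed with other text (the gene numbers here are invented):

```shell
printf 'Nonsense gene-MIR42 more text\ngene-MIR42\ngene-MIR93\n' > myfile.txt

# -o prints each match on its own line, so IDs embedded in text still count
grep -o 'gene-MIR[0-9]*' myfile.txt | sort | uniq -c
grep -o 'gene-MIR[0-9]*' myfile.txt | wc -l   # total matches: 3
```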
If you have information like this:
Inf1
Inf2
Inf1
Inf2
And you want to know how many distinct "Inf" kinds there are, you always need to sort the input first; only afterwards can you start counting, because uniq only compares adjacent lines.
Edit
I've created a similar file, containing the examples, mentioned in the requester's comment, as follows:
Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR4232
gene-MIR2334
gene-MIR93284
More nonsense
On that, I've applied both commands, as mentioned in the question:
grep -c "gene-MIR" myfile.txt | uniq
Which results in 6, just like the following command:
grep -c "gene-MIR" myfile.txt
Why? The question here is "How many lines contain the string "gene-MIR"?".
This is clearly not the requested information.
The other command also is not correct:
grep "gene-MIR" myfile.txt | sort -u
The result:
gene-MIR2334
gene-MIR4232
gene-MIR93284
Explanation:
grep "gene-MIR" ... means: show all the lines, which contain "gene-MIR"
| sort -u means: sort the displayed lines and if there are multiple instances of the same, only show one of them.
This is not what the requester wants either. Therefore I have the following proposal:
grep "gene-MIR" myfile.txt | sort | uniq -c
With following result:
2 gene-MIR2334
2 gene-MIR4232
2 gene-MIR93284
This is more what the requester is looking for, I presume.
What does it mean?
grep "gene-MIR" myfile.txt : only show the lines which contain "gene-MIR"
| sort : sort the lines that are shown. This gives an intermediate result like this:
gene-MIR2334
gene-MIR2334
gene-MIR4232
gene-MIR4232
gene-MIR93284
gene-MIR93284
| uniq -c : group those results together and show the count for every instance.
Unfortunately, the example is badly chosen as every instance occurs exactly two times. Therefore, for clarification purposes, I've created another "myfile.txt", as follows:
Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR2334
gene-MIR2334
gene-MIR93284
More nonsense
I've applied the same command again:
grep "gene-MIR" myfile.txt | sort | uniq -c
With following result:
3 gene-MIR2334
1 gene-MIR4232
2 gene-MIR93284
Here you can see in a much clearer way that the proposed command is correct.
... and your next question is: "Yes, but is it possible to sort the result?", on which I answer:
grep "gene-MIR" myfile.txt | sort | uniq -c | sort -n
With following result:
1 gene-MIR4232
2 gene-MIR93284
3 gene-MIR2334
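For completeness, sort -nr reverses the numeric sort so the most frequent gene comes first; the fabricated input from above is recreated here so the snippet stands alone:

```shell
printf 'gene-MIR4232\ngene-MIR2334\ngene-MIR2334\ngene-MIR2334\ngene-MIR93284\ngene-MIR93284\n' > myfile.txt

grep "gene-MIR" myfile.txt | sort | uniq -c | sort -nr
# counts, descending: 3 gene-MIR2334 / 2 gene-MIR93284 / 1 gene-MIR4232
```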
Have fun!

What do backticks do in this bash script code?

So I'm doing a problem with a bash script, this one: ./namefreq.sh ANA should return a list of two names (on separate lines), ANA and RENEE, both of which have frequency 0.120.
Basically I have a file table.csv, shown in the code below, that has names with a frequency number next to each, e.g. Anna, 0.120
I'm still unsure what the `` (backticks) do in this code, and I'm also struggling to understand how it is able to print out two names with identical frequencies. The way I read the code is:
grep matches the word (-w) typed by the user (./bashscript.sh Anna) against table.csv; cut then extracts the 2nd comma-separated field of that line, which is the frequency from table.csv. The second grep finds every line containing that frequency, and | cut -f1 -d"," prints out the first fields, which are the names sharing it.
Would this be correct?
thanks :)
#!/bin/bash
a=`grep -w $1 table.csv | cut -f2 -d','`
grep -w $a table.csv | cut -f1 -d',' | sort -d
When a command is in backticks or $(), the output of the command is substituted into the command line in its place. So if the file has Anna,0.120
a=`grep -w Anna table.csv | cut -f2 -d','`
will execute the grep and cut commands, which will output 0.120, so it will be equivalent to
a=0.120
Then the command looks for all the lines that match 0.120, extracts the first field with cut, and sorts them.
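The substitution itself is easy to see in isolation; this sketch mimics one line of the hypothetical table.csv:

```shell
# The command inside $() (or backticks) runs first; its output replaces the expression
a=$(printf 'Anna,0.120\n' | cut -f2 -d',')
echo "$a"
# 0.120
```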

Pipelining cut sort uniq

Trying to get a certain field from a sam file, sort it and then find the number of unique numbers in the file. I have been trying:
cut -f 2 practice.sam > field2.txt | sort -o field2.txt sortedfield2.txt |
uniq -c sortedfield2.txt
The cut is working to pull out the numbers from field two, however when trying to sort the numbers into a new file or the same file I am just getting a blank. I have tried breaking the pipeline into sections but still getting the same error. I am meant to use those three functions to achieve the output count.
Use
cut -f 2 practice.sam | sort | uniq -c
In your original code, you're redirecting the output of cut to field2.txt and at the same time, trying to pipe the output into sort. That won't work (unless you use tee). Either separate the commands as individual commands (e.g., use ;) or don't redirect the output to a file.
Ditto the second half, where you write the output to sortedfield2.txt and thus end up with nothing going to stdout, and nothing being piped into uniq.
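If the intermediate file is actually wanted, tee keeps it while the data continues down the pipe; a sketch with a fabricated practice.sam (cut's default delimiter is a tab):

```shell
# Fabricated two-column, tab-separated stand-in for practice.sam
printf 'r1\t5\nr2\t7\nr3\t5\n' > practice.sam

# tee writes field 2 to field2.txt AND passes it on to sort | uniq -c
cut -f 2 practice.sam | tee field2.txt | sort | uniq -c
```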
So an alternative could be:
cut -f 2 practice.sam > field2.txt ; sort -o sortedfield2.txt field2.txt ; uniq -c sortedfield2.txt
which is the same as
cut -f 2 practice.sam > field2.txt
sort -o sortedfield2.txt field2.txt
uniq -c sortedfield2.txt
(Note that -o names sort's output file; the input file comes after it.)
You can use this command:
cut -f 2 practice.sam | sort | uniq -c > sorted.txt
(sort must come before uniq, because uniq only collapses adjacent duplicate lines.) Your original code fails with "No such file or directory" because sort -o field2.txt sortedfield2.txt tries to read sortedfield2.txt, which does not exist yet. You can learn how pipes are used at this link:
https://www.guru99.com/linux-pipe-grep.html

Find line number in a text file - without opening the file

In a very large file I need to find the position (line number) of a string, then extract the 2 lines above and below it.
To do this right now I launch vi, find the string, note its line number, exit vi, then use sed to extract the lines surrounding that string.
Is there a way to streamline this process... ideally without having to run vi at all.
Maybe use grep like this:
grep -n -2 your_searched_for_string your_large_text_file
This will give you almost what you expect:
-n : tells grep to print the line number
-2 : print 2 lines of context before and after each match (and the matched line, of course)
You can do
grep -C 2 yourSearch yourFile
To send the output to a file, do
grep -C 2 yourSearch yourFile > result.txt
Use grep -n string file to find the line number without opening the file.
You can use cat -n to prepend line numbers, then grep for the word and use awk to extract the line number of the match:
cat -n FILE | grep WORD | awk '{print $1;}'
although grep already does what you want if you give it -C 2 (2 lines above/below):
grep -C 2 WORD FILE
You can do it with grep's -A and -B options, like this:
grep -B 2 -A 2 "searchstring" file.txt | sed 3d
grep finds the line and shows two lines of context before and after it; sed then removes the third line, i.e. the match itself. (This assumes a single match with at least two lines above it.)
If you want to automate this, you can simply write a shell script. You may try the following:
#!/bin/bash
VAL="your_search_keyword"
NUM1=$(grep -n "$VAL" file.txt | cut -f1 -d ':')
echo "$NUM1"                     # show the line number of the matched keyword
MYNUMUP=$((NUM1 - 1))            # line above the keyword
MYNUMDOWN=$((NUM1 + 1))          # line below the keyword
sed -n "${MYNUMUP}p" file.txt    # display the line above the keyword
sed -n "${MYNUMDOWN}p" file.txt  # display the line below the keyword
The plus point of the script is that you can change the keyword in the VAL variable as you like and execute it to get the needed output. (It assumes the keyword matches exactly one line; the deprecated $[...] arithmetic has been replaced with $((...)).)
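The same idea fits in two lines without a script; this is a sketch under two assumptions: -m1 restricts grep to the first match, and the match must not sit within the first two lines of the file (otherwise the sed range underflows):

```shell
# Find the line number of the first match, then print 2 lines either side
n=$(grep -n -m1 'your_search_keyword' file.txt | cut -f1 -d':')
sed -n "$((n - 2)),$((n + 2))p" file.txt
```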

Why this command adds \n at the last line

I'm using this command to sort and remove duplicate lines from a file.
sort file2.txt | uniq > file2_uniq.txt
After running the command, the last line of the output is empty (just a \n), which causes me problems. What can I do to avoid it?
You could also let sort take care of uniquing the output; since an empty line sorts first, omitting the first line avoids it:
sort -u file2.txt | tail -n +2
Edit
If you wanted to remove all empty lines I would suggest using:
grep -v '^$' file2.txt | sort -u
Just filter out what you don't want:
sort file2.txt | egrep -v "^$" | uniq > file2_uniq.txt
I solved the problem by removing the last line using:
sed '$d' infile > outfile
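sed '$d' simply deletes the last line of its input, whatever that line contains; a quick sketch on a fabricated infile with a trailing empty line:

```shell
printf 'a\nb\n\n' > infile   # third line is empty
sed '$d' infile > outfile
cat outfile
# a
# b
```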
