Difference between using the uniq command with sort or without it in Linux

When I use uniq -u data.txt, it lists the whole file, but when I use sort data.txt | uniq -u, it omits the repeated lines. Why does this happen?
The uniq man page says that -u, --unique only prints unique lines. I don't understand why I need to use a pipe to get the correct output.

uniq removes adjacent duplicates. If you want to omit duplicates that are not adjacent, you'll have to sort the data first.
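For example, with a hypothetical data.txt in which the duplicate lines are not adjacent:
$ cat data.txt
apple
banana
apple
$ uniq -u data.txt
apple
banana
apple
$ sort data.txt | uniq -u
banana
The two apple lines are never next to each other in the raw file, so uniq -u keeps them both; after sorting they become adjacent and are dropped.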

Related

How to create a script that takes a list of words as input and prints only the words that appear exactly once

Requirements for input and output files:
Input format: one word per line
Output format: one word per line
Words should be sorted
I tried to use this command to solve the problem:
sort list | uniq
but it fails.
Can anyone help me solve it?
Try the following:
cat <file_name> | sort | uniq -c | grep -e '^\s*1\s' | awk '{print $NF}'
Explanation:
cat <file_name> | sort | uniq -c --> prints all the entries, sorted, with the count of each word.
grep -e '^\s*1\s' --> a regex that keeps only the entries whose count is exactly 1.
awk '{print $NF}' --> strips the count and prints just the word.
It would be nice, simple and elegant to use this command to perform this task.
cat <file_name> | sort | uniq -u
And it would do the task perfectly.
The answer given by @Evans Fone helped me.
If you're trying to implement a script that runs as:
cat list | ./scriptname
Then do the following:
Step 1:
Type
emacs scriptname
Step 2:
Press
ENTER
Step 3:
Type
#!/bin/bash
sort | uniq -u
Step 4:
Press
CTRL+X
CTRL+S
CTRL+X
CTRL+C
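Put together, scriptname contains just these two lines (a minimal sketch using the file names from the question):
#!/bin/bash
sort | uniq -u
It also needs execute permission once before it can be invoked as in the question:
chmod +x scriptname
cat list | ./scriptname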
sort | uniq -u
As simple as that.
sort without a file argument reads from standard input, sorts it, and the pipe feeds the result to uniq -u, which prints only the words that appear exactly once.

How to display the number of times a word is repeated after a common pattern

I have a file which has N lines.
For example
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
How can I get the result below using the uniq command:
This/is/workshop/ =5
Okay, so there are a couple of tools you can use here. Familiarize yourself with grep, cut, and uniq. My process for doing something like this may not be ideal, but given your original question I'll tailor it to the lines in the file you've given.
First you'll want to grep the file for the relevant strings. Then you can pass the result through cut, selecting the fields you want to keep by specifying the delimiter and the field numbers. Lastly, you can pipe this through uniq -c to count it.
Example:
Contents of file.txt
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
Use grep, cut and uniq
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | uniq -c
5 This/is/workshop
To specify the delimiter in cut, you use the -d flag followed by the delimiter character. Each field is what exists between delimiters, numbered starting at 1; here we want the first three. Then just pipe the result through uniq -c to get the count you are after.
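If you want the exact layout asked for (This/is/workshop/ =5), you can reformat the uniq -c output with awk; this is a sketch built on the same pipeline:
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | uniq -c | awk '{print $2 "/ =" $1}'
This/is/workshop/ =5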

Recursively grep unique pattern in different files

Sorry, the title is not very clear.
So let's say I'm grepping recursively for URLs like this:
grep -ERo '(http|https)://[^/"]+' /folder
and in the folder there are several files containing the same URL. My goal is to output this URL only once. I tried piping the grep output to uniq or sort -u, but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
If you only want the address and not the file it was found in, grep has the -h option to suppress the file name prefix; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl regular expressions (-P instead of -E) with variable length look-behind (\K):
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the cut command:
cut -d ':' -f 2- should work. Basically, it cuts each line into fields separated by a delimiter (here ':') and selects the 2nd and following fields (-f 2-).
After that, you can pipe the result through sort and uniq (or sort -u) to filter out the duplicates.
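Putting it together, a sketch of the full pipeline under that assumption (note that -f 2- rejoins the remaining fields with the delimiter, so the colon inside http:// survives intact; sort -u then removes the duplicates):
grep -ERo 'https?://[^/"]+' /folder | cut -d ':' -f 2- | sort -u
http://www.w3.org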
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0,length($1)+1)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL into the key (here, everything after the file name, i.e. from the first colon onwards) requires a bit of additional trickery with substr.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
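For illustration, a small sketch running the idiom on two of the example lines from the question; the key is everything after the first colon, so the second line is recognized as a duplicate:
$ awk -F: '!a[substr($0,length($1)+1)]++' <<'EOF'
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
EOF
/www/tmpl/button.tpl.php:http://www.w3.org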

unix script sort showing the number of items?

I have a shell script which greps the results from a file and then calls sort -u to get the unique entries. Is there a way to also have it tell me how many of each of those entries there are? The output would be something like:
user1 - 50
user2 - 23
user3 - 40
etc..
Use sort input | uniq -c. Plain uniq collapses duplicates just like sort -u does, but it also has the -c option for counting them.
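If you want the exact name - count layout from the question, you can reformat the uniq -c output with awk (a sketch; the grep stage stands in for whatever your script already greps):
grep 'pattern' input | sort | uniq -c | awk '{print $2 " - " $1}'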
grep has a -c switch that counts matching lines:
grep -c needle haystack
will give the number of needles, which you can then sort as needed.
Given a sorted list, uniq -c will show each item and how many times it occurs. The count will be the first column, so I will often do something like:
sort file.txt | uniq -c | sort -nr
The -n in the second sort compares numbers correctly, so 9 sorts before 11 (and with -r the order is reversed, since I usually want the lines with the higher counts first).
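For example, with a hypothetical users.txt containing the entries from the question, the output would look like:
$ sort users.txt | uniq -c | sort -nr
     50 user1
     40 user3
     23 user2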

grep and sed command

I have a truckload of files with SQL commands in them, and I have been asked to extract all the database table names from the files.
How can I use grep and sed to parse the files and create a list of the unique table names in a text file, one per line?
The table names all seem to start with "db_", which is handy!
What would be the best way to use grep and sed together to pull the table names out?
This will search for lines containing the table names. The output of this will quickly reveal if a more selective search is needed:
grep "\<db_[a-zA-Z0-9_]*" *.sql
Once the proper search is sorted out, remove all other characters from the lines with table names:
grep "\<db_[a-zA-Z0-9_]*" *.sql | sed 's/.*\(\<db_[a-zA-Z0-9_]*\).*/\1/'
Once that's running, add on a sort and remove duplicates:
(same last pipe expression) | sort | uniq
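Written out in full, the pipeline becomes (a sketch; the sed keeps one table name per matching line, which is usually enough):
grep "\<db_[a-zA-Z0-9_]*" *.sql | sed 's/.*\(\<db_[a-zA-Z0-9_]*\).*/\1/' | sort | uniq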
You just need grep:
grep -owE "db_[a-zA-Z0-9_]+" file | sort -u
or awk:
awk '{for(i=1;i<=NF;i++)if($i~/^db_[a-zA-Z0-9_]+/){print $i}}' file | sort -u
