Sort & uniq in Linux shell

What is the difference between the following two commands?
sort -u FILE
sort FILE | uniq

Using sort -u does less I/O than sort | uniq, but the end result is the same. In particular, if the file is big enough that sort has to create intermediate files, there's a decent chance that sort -u will use slightly fewer or slightly smaller intermediate files, because it can eliminate duplicates while sorting each batch. If the data is highly duplicative, this could be beneficial; if in fact there are few duplicates, it won't make much difference (definitely a second-order performance effect, compared to the first-order effect of the pipe).
Note that there are times when piping is appropriate. For example:
sort FILE | uniq -c | sort -n
This sorts the file into order of the number of occurrences of each line in the file, with the most repeated lines appearing last. (It wouldn't surprise me to find that this combination, which is idiomatic for Unix or POSIX, can be squished into one complex 'sort' command with GNU sort.)
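A quick illustration of that idiom with a small hypothetical input:
$ printf 'b\na\nb\nb\na\n' | sort | uniq -c | sort -n
      2 a
      3 b
The least repeated line (a) sorts first and the most repeated (b) last.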
There are times when not using the pipe is important. For example:
sort -u -o FILE FILE
This sorts the file 'in situ'; that is, the output file is specified by -o FILE, and this operation is guaranteed safe (the file is read before being overwritten for output).

There is one slight difference: return code.
The thing is that unless the pipefail option is enabled (set -o pipefail), the return code of a pipeline is the return code of the last command, and uniq always returns zero (success). Try examining the exit code, and you'll see something like this (pipefail is not set here):
pavel#lonely ~ $ sort -u file_that_doesnt_exist ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
2
pavel#lonely ~ $ sort file_that_doesnt_exist | uniq ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
0
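Continuing the same hypothetical session with pipefail enabled, the failure propagates:
pavel#lonely ~ $ set -o pipefail
pavel#lonely ~ $ sort file_that_doesnt_exist | uniq ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
2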
Other than this, the commands are equivalent.

Beware! While it's true that "sort -u" and "sort|uniq" are equivalent, any additional options to sort can break the equivalence. Here's an example from the coreutils manual:
For example, 'sort -n -u' inspects only the value of the initial numeric string when checking for uniqueness, whereas 'sort -n | uniq' inspects the entire line.
Similarly, if you sort on key fields, the uniqueness test used by sort won't necessarily look at the entire line anymore. After being bitten by that bug in the past, these days I tend to use "sort|uniq" when writing Bash scripts. I'd rather have higher I/O overhead than run the risk that someone else in the shop won't know about that particular pitfall when they modify my code to add additional sort parameters.
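A minimal illustration of that pitfall, assuming GNU sort (which of the two lines survives is an implementation detail):
$ printf '1 apple\n1 banana\n' | sort -n -u
1 apple
$ printf '1 apple\n1 banana\n' | sort -n | uniq
1 apple
1 banana
Both lines compare equal as numbers, so sort -n -u silently drops one of them, while uniq keeps both because the full lines differ.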

sort -u will be slightly faster, because it does not need to pipe the output between two commands
Also see my question on the topic: calling uniq and sort in different orders in shell

I have worked on some servers where sort doesn't support the '-u' option. There we have to use:
sort xyz | uniq

Nothing; they will produce the same result.

Related

How to create a dynamic command in bash?

I want to have a command in a variable that runs a program and specifies the output filename for it, depending on the number of files that exist (to work on a new file each time).
Here is what I have:
export MY_COMMAND="myprogram -o ./dir/outfile-0.txt"
However, I would like this outfile number to increase each time MY_COMMAND is executed. You may suppose myprogram creates the file soon enough before the next call, so the number can be retrieved from the number of files existing in the directory ./dir/. I do not have access to change myprogram itself or the way MY_COMMAND is used.
Thanks in advance.
Given that you can't change myprogram (its -o option will always write to the file given on the command line), and assuming that something also out of your control is running MY_COMMAND so you can't change the way MY_COMMAND gets called, you still have control of MY_COMMAND itself.
For the rest of this answer I'm going to change the name MY_COMMAND to callprog mostly because it's easier to type.
You can define callprog as a variable as in your example export callprog="myprogram -o ./dir/outfile-0.txt", but you could instead write a shell script and name that callprog, and a shell script can do pretty much anything you want.
So, you have a directory full of outfile-<num>.txt files and you want to output to the next non-colliding outfile-<num+1>.txt.
Your shell script can get the number by listing the files, cutting out only the numbers, sorting them, and taking the highest one.
If we have these files in dir:
outfile-0.txt
outfile-1.txt
outfile-5.txt
outfile-10.txt
ls -1 ./dir/outfile*.txt produces the list
./dir/outfile-0.txt
./dir/outfile-1.txt
./dir/outfile-10.txt
./dir/outfile-5.txt
(using outfile and .txt means this will work even if there are other files not named outfile)
Scrape out the number by piping it through the stream editor sed … capture the number and keep only that part:
ls -1 ./dir/outfile*.txt | sed -e 's:^.*dir/outfile-\([0-9][0-9]*\)\.txt$:\1:'
(I'm using colon : instead of the standard slash / so I don't have to escape the directory separator in dir/outfile)
Now you just need to pick the highest number. Sort the numbers and take the top
| sort -rn | head -1
Sorting with -n is numeric, not lexicographic, sorting; -r reverses so the highest number will be first, not last.
Putting it all together, this will list the files, edit the names keeping only the numeric part, sort, and get just the first entry. You want to assign that to a variable to work with it, so it is:
high=$(ls -1 ./dir/outfile*.txt | sed -e 's:^.*dir/outfile-\([0-9][0-9]*\)\.txt$:\1:' | sort -rn | head -1)
In the shell (I'm using bash) you can do arithmetic on that: $((high + 1)) produces 11 if high is 10.
You would use that as the numeric part of your filename.
The whole shell script then just needs to use that number in the filename. Here it is, with lines broken for better readability:
#!/bin/sh
# Find the highest existing outfile number, then print the next command to run.
high=$(ls -1 ./dir/outfile*.txt \
    | sed -e 's:^.*dir/outfile-\([0-9][0-9]*\)\.txt$:\1:' \
    | sort -rn | head -1)
echo "myprogram -o ./dir/outfile-$((high + 1)).txt"
Of course you wouldn't echo myprogram, you'd just run it.
You could do this in a bash function in your .bashrc, using wc to get the number of files in the directory and then adding 1 to the result:
yourfunction () {
    dir=/path/to/dir
    # count the files in $dir (one per line of ls output) and add 1
    filenum=$(( $(ls "$dir" | wc -l) + 1 ))
    myprogram -o "$dir/outfile-${filenum}.txt"
}
This gets the number of files in $dir and adds 1 to that number to get the number you need for the filename. If you place it in your .bashrc or .bash_aliases and source .bashrc, it should work like any other shell command.
You can try exporting a function for MY_COMMAND to run.
next_outfile () {
    myprogram -o ./dir/outfile-${_next_number}.txt
    (( _next_number++ ))
}
export -f next_outfile
export MY_COMMAND="next_outfile" _next_number=0
This relies on a "private" global variable _next_number being initialized to 0 and not otherwise modified.
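A usage sketch, assuming MY_COMMAND is invoked in this same shell session (each run in a separate child shell would start again from the exported value 0):
$ $MY_COMMAND    # writes ./dir/outfile-0.txt
$ $MY_COMMAND    # writes ./dir/outfile-1.txt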

rsync verbose with final stats but no file list

I see that when I use rsync with the -v option it prints the changed-files list and some useful info at the end, like the total transfer size.
Is it somewhat possible to cut out the first (long) part and just print the stats? I am using it in a script, and the log shouldn't be so long. Only the stats are useful.
Thank you.
As I was looking for an answer, I came across this question:
rsync also supports the --stats option.
The best solution for now, I think:
rsync --info=progress0,name0,flist0,stats2 ...
progress0 hides progress
progress2 displays progress
name0 hides file names
stats2 displays stats at the end of the transfer
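A full invocation might look like this (a sketch; src/ and dest/ are placeholder paths):
rsync -a --info=progress0,name0,flist0,stats2 src/ dest/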
This solution is more a "hack" than the right way to do it because the output is generated but only filtered afterwards. You can use the option --out-format.
rsync ... --out-format="" ... | grep -v -E "^sending|^created" | tr -s "\n"
The grep filter should probably be updated with unwanted lines you see in the output. The tr is here to squeeze the resulting runs of newlines.
grep -E for extended regexes
grep -v to invert the match. "Selected lines are those not matching any of the specified patterns."
tr -s to squeeze the repeated newlines into a single one

General solution for bypassing file headers in shell commands

I make extensive use of piping multiple linux shell commands, for example:
grep BLAH file1 | sed 's/old/new/' | sort -k 1,1 > file3
My files often have a header line, and often I have to preserve it throughout the pipeline. So, for example, I would want to grep, sed and sort from line 2 and on, while keeping the 1st line unchanged.
I am looking for some general solution that given some command(s) would preserve the header. I usually write the header to a file before the pipe and then cat it back after the pipe ends. I have started using zshell, so I was wondering if that might help to get a more streamlined solution.
Perhaps something like a pipeline that splits in two, runs a command on each branch, and merges the results (the original post illustrated this with a diagram in which the arrows were pipes),
but I am not sure how to get that to work in zshell or if it is even possible. One problem is that I need to follow up the first pipe split with a command on both pipes.
Any creative solutions?
Vaughn and devnull have already directed you towards the solution. Their answers both contain typos, though, and I have some remarks to add, so I would advise using this instead:
{ head -n 1 file1; tail -n +2 file1 | grep BLAH | sed 's/old/new/' | sort -k 1,1; } >file3
What it does is take the first line of file1 in one command (your header) and does your grep/sed/whatever magic in a second command on the rest of the file (sans the header, tail -n +2) and redirects the combined output to file3.
Notes:
If your shell supports { } it is preferred over the ( ) construct in this case as it does not spawn a subshell (sometimes it is desirable to have the subshell, though).
head -2 is deprecated, you should use the -n parameter like head -n 2
You can skip the tail -n +2 file1 part if you absolutely know that what you are grepping for cannot be found in your header, but it is certainly cleaner this way.
This should work in most recent shells, btw (bash, ksh, zsh).
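If you want this as a reusable tool, here is a minimal sketch (the helper name body is hypothetical): it passes the first line of its input through untouched and runs whatever command you give it on the rest.
body () {
    IFS= read -r header     # grab the header line
    printf '%s\n' "$header" # emit it unchanged
    "$@"                    # run the given command on the remaining input
}
Each stage then preserves the header, so the original pipeline becomes:
body grep BLAH < file1 | body sed 's/old/new/' | body sort -k 1,1 > file3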

Why doesn't "sort file1 > file1" work?

When I am trying to sort a file and save the sorted output in itself, like this
sort file1 > file1;
the contents of file1 are getting erased altogether, whereas when I try to do the same with the 'tee' command, like this
sort file1 | tee file1;
it works fine [ed: "works fine" only for small files with lucky timing, will cause lost data on large ones or with unhelpful process scheduling], i.e it is overwriting the sorted output of file1 in itself and also showing it on standard output.
Can someone explain why the first case is not working?
As other people explained, the problem is that the I/O redirection is done before the sort command is executed, so the file is truncated before sort gets a chance to read it. If you think for a bit, the reason why is obvious - the shell handles the I/O redirection, and must do that before running the command.
The sort command has 'always' (since at least Version 7 UNIX) supported a -o option to make it safe to output to one of the input files:
sort -o file1 file1 file2 file3
The trick with tee depends on timing and luck (and probably a small data file). If you had a megabyte or larger file, I expect it would be clobbered, at least in part, by the tee command. That is, if the file is large enough, the tee command would open the file for output and truncate it before sort finished reading it.
It doesn't work because '>' redirection implies truncation: rather than buffering sort's whole output in memory and writing the file afterwards, bash truncates and redirects the output before running sort. Thus, the contents of file1 will be truncated before sort has a chance to read them.
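A quick demonstration of the truncation:
$ printf 'b\na\n' > file1
$ sort file1 > file1
$ wc -c file1
0 file1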
It's unwise to depend on either of these commands to work the way you expect.
The way to modify a file in place is to write the modified version to a new file, then rename the new file to the original name:
sort file1 > file1.tmp && mv file1.tmp file1
This avoids the problem of reading the file after it's been partially modified, which is likely to mess up the results. It also makes it possible to deal gracefully with errors; if the file is N bytes long, and you only have N/2 bytes of space available on the file system, you can detect the failure creating the temporary file and not do the rename.
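A sketch of that pattern as a helper (the name sort_inplace is hypothetical; mktemp generates a unique temporary name):
sort_inplace () {
    # temp file in the same directory, so the mv stays on one filesystem
    tmp=$(mktemp "$1.XXXXXX") &&
    sort "$1" > "$tmp" &&
    mv "$tmp" "$1"
}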
Or you can rename the original file, then read it and write to a new file with the same name:
mv file1 file1.bak && sort file1.bak > file1
Some commands have options to modify files in place; for example, perl and sed both have -i options (note that the syntax of sed's -i option can vary). But these options work by creating temporary files; it's just done internally.
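For instance, the in-place flag is spelled differently across sed implementations:
sed -i 's/old/new/' file1      # GNU sed: no backup file
sed -i '' 's/old/new/' file1   # BSD/macOS sed: a (here empty) backup suffix is required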
Redirection is processed first. So in the first case, > file1 takes effect first and empties the file.
The first command doesn't work (sort file1 > file1), because when using the redirection operator (> or >>) shell creates/truncates file before the sort command is even invoked, since it has higher precedence.
The second command (sort file1 | tee file1) may appear to work because sort must read all of its input before writing anything, but as noted above this still depends on timing: tee truncates file1 when it starts, which can happen before sort has read it.
So when using any other similar command, you should avoid using redirection operator when reading and writing into the same file, but you should use relevant in-place editors for that (e.g. ex, ed, sed), for example:
ex '+%!sort' -cwq file1
or use other utils such as sponge.
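For example, sponge (from the moreutils package) soaks up all of its input before opening the output file, so this is safe:
sort file1 | sponge file1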
Luckily for sort there is the -o parameter, which writes the results to the file (as suggested by @Jonathan), so the solution is straightforward: sort -o file1 file1.
Bash opens a new, empty file when it processes the redirection, and then runs sort.
In the second case, tee may open the file after sort has already read the contents, but that is a matter of timing rather than a guarantee.
You can use this method
sort file1 -o file1
This will sort and store back to the original file. Also, you can use this command to remove duplicated lines:
sort -u file1 -o file1

Diff-ing files with Linux command

What Linux command allows me to check if all the lines in file A exist in file B? (It's almost like a diff, but not quite.) Also, file A has unique lines, as does file B.
The comm command compares two sorted files, line by line, and is part of GNU coreutils.
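Applied to this question, and assuming bash or zsh process substitution to sort the files on the fly, the lines of A missing from B can be listed with:
comm -23 <(sort A) <(sort B)
If this prints nothing, every line of A exists in B (-2 suppresses lines only in B, -3 suppresses lines common to both).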
Are you looking for a better diff tool?
https://stackoverflow.com/questions/12625/best-diff-tool
So, what if A has
a
a
b
and b has
a
b
What would you want the output to be (yes or no)?
Use the diff command.
# Lines from A appear twice in "A A B" and lines also present in B appear
# three times, so a count of exactly 2 marks lines that are in A but not in B.
if cat A A B | sort | uniq -c | grep -E '^[[:space:]]*2[[:space:]]' > /dev/null; then
    echo "A has lines that are not in B."
fi
If you do not redirect the output, you will get a list of all the lines that are in A but not in B (except each line will have a 2 in front of it). This relies on the lines in A being unique, and the lines in B being unique.
If they aren't, and you don't care about counting duplicates, it's relatively simple to transform each file into a list of unique lines using sort and uniq.
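For example, a quick normalization pass (the .uniq filenames are arbitrary):
sort -u A > A.uniq
sort -u B > B.uniq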
