Bash sort -nu results in unexpected behaviour - linux

A colleague of mine noticed some odd behaviour with the sort command today, and I was wondering if anyone knows if the output of this command is intentional or not?
Given the file:
ABC_22
ABC_43
ABC_1
ABC_1
ABC_43
ABC_10
ABC_123
We are looking to sort the file with numeric sort, and also make it unique, so we run:
sort file.txt -nu
The output is:
ABC_22
Now, we know that the numeric sort won't work in this case as the lines don't begin with numbers (and that's fine, this is just part of a larger script), but I would have expected something more along the lines of:
ABC_1
ABC_10
ABC_123
ABC_22
ABC_43
Does anyone know why this isn't the case? The sort acts as one would expect if given just the -u or -n options individually.

With -n, an empty number is zero:
Sort numerically. The number begins each line and consists of optional
blanks, an optional ‘-’ sign, and zero or more digits possibly
separated by thousands separators, optionally followed by a
decimal-point character and zero or more digits. An empty number is
treated as ‘0’.
All these lines have an empty number at the start of the line, so they all compare equal to zero under -n. With -u, sort keeps only the first line of each set that compares equal, which is why a single line survives. If you'd started each line with the same number, say 1, the effect would be the same. You should specify the field containing the numbers explicitly, or use version sort (-V):
$ sort -Vu foo
ABC_1
ABC_10
ABC_22
ABC_43
ABC_123
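To see the collapse for yourself, here is a small sketch that feeds the question's sample lines to both commands. It assumes GNU sort (needed for -V); the variable names are only for illustration.

```shell
# Sample lines from the question.
input='ABC_22
ABC_43
ABC_1
ABC_1
ABC_43
ABC_10
ABC_123'

# -n finds no leading number, so every key is 0; -u keeps one line per key.
numeric_unique=$(printf '%s\n' "$input" | sort -nu)

# Version sort (a GNU extension) compares the embedded numbers naturally.
version_unique=$(printf '%s\n' "$input" | sort -Vu)

printf 'sort -nu:\n%s\n\nsort -Vu:\n%s\n' "$numeric_unique" "$version_unique"
```

The first result is a single line, the second is the five distinct lines in natural numeric order.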

You need to tell sort which delimiter to use and that the key is the second field. With GNU sort:
sort -nu -t'_' -k2 file
ABC_1
ABC_10
ABC_22
ABC_43
ABC_123
The -n flag requests a numeric sort and -u keeps only unique lines. The key part is setting the delimiter to _ with -t and sorting on the field after it with -k2.
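One caveat worth sketching: when -u is combined with a key, lines count as duplicates if the key alone compares equal, not the whole line. The example below assumes GNU sort and adds a made-up XYZ_1 line (not in the question's data) to show the effect.

```shell
# XYZ_1 is a hypothetical extra line that shares the numeric key 1 with ABC_1.
result=$(printf '%s\n' ABC_22 ABC_1 XYZ_1 ABC_10 | sort -t_ -k2,2n -u)

# Only one of ABC_1 / XYZ_1 survives, because their keys compare equal.
printf '%s\n' "$result"
line_count=$(printf '%s\n' "$result" | wc -l)
```

If the prefix matters for uniqueness, de-duplicate whole lines first (sort -u) and then sort by the key in a second step.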


Trouble pipelining grep into a sort

I have data in file in the form:
Torch Lake township | Antrim | 1194
I'm able to use grep to look for keywords and pipe that into a sort, but the sort isn't behaving how I intended.
This is what I have:
grep '| Saginaw' data | sort -k 5,5
I would like to be able to sort by the numerical value in the last column but it currently isn't and I'm unsure what I'm actually doing wrong.
A few things seem to be tripping you up.
First, the vertical bar can be a special character in grep, where it means OR. Ex:
A|B
could be interpreted as A or B, and not A vertical bar B.
That only applies to extended regular expressions (grep -E or egrep), though. In grep's default basic syntax a plain | is literal, so your pattern already matches a literal bar. Do not escape it there: in GNU grep, \| is the OR operator in basic syntax, so '\| Saginaw' would match every line. If you want to rule out any special meaning, use a fixed-string match:
grep -F '| Saginaw' data
or simply remove the bar altogether, if your data format allows that.
Second, the sort command needs to know what your column separator is. By default, fields are separated by the transition from non-blank to blank characters, so sort -k 5,5 actually says "sort on the 5th word".
To specify that your column separator is actually the vertical bar, use the -t option. With | as the separator, your sample line has only three fields, so the number is in field 3, not field 5:
sort -t'|' -k 3,3
alternatively,
sort --field-separator='|' -k 3,3
Third, you've got a bit of a sticky wicket now. Your data is formatted as:
Field1 | Field2 | Field3
...and not...
Field1|Field2|Field3
You may have issues with that additional space. Or maybe not. If all of your data has EXACTLY the same white-space, you'll be fine. If some have a single space, some have 2 spaces, and others have a tab, a lexical sort will be thrown off.
Fourth, sorting by numbers may not be intuitive for you. In the default lexical order, 10 comes after 1 and before 2.
To sort the way you think it ought to be, where 10 comes after 9, use the option -n for numeric sort.
grep -F '| Saginaw' data | sort -t'|' -n -k 3,3
Only the leading number in field 3 is compared numerically (leading blanks are skipped). Lines that tie on that number are then ordered by comparing the whole line, so 10 Abbington will come before 10 Andover.
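As a rough end-to-end sketch of the pipeline, assuming | is the separator (which makes the number field 3) and using made-up rows in the question's layout; only the Torch Lake line comes from the question:

```shell
# Hypothetical rows in the question's "name | county | number" layout.
data='Alpha township | Saginaw | 2468
Torch Lake township | Antrim | 1194
Beta township | Saginaw | 987
Gamma township | Saginaw | 10321'

# -F matches the pattern as a fixed string, so "|" has no special meaning.
# With "|" as the separator the number is field 3; -k3,3n compares it numerically.
sorted=$(printf '%s\n' "$data" | grep -F '| Saginaw' | sort -t'|' -k3,3n)

printf '%s\n' "$sorted"
```

The Antrim row is filtered out and the remaining rows come back in the order 987, 2468, 10321.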

unix sort by single column only

I have a file of numbers separated by commas. I want to sort the list by column 2 only. My expectation is that lines will be sorted by only the second column and not by additional columns. I am not looking to sort by multiple keys. I know how to sort by a single key. My question here is: when I am sorting and giving the start POS and end POS as column 2, why is it also sorting column 3?
FILE
cat chris.num
1,5,2
1,4,3
1,4,1
1,7,2
1,7,1
AFTER SORT
sort -t',' -k2,2 chris.num
1,4,1
1,4,3
1,5,2
1,7,1
1,7,2
However my expected output is
1,4,3
1,4,1
1,5,2
1,7,2
1,7,1
So I thought that since I give the start and end key as -k2,2, it would sort based only on this column, but it seems to be sorting the other columns too. How can I get this to sort only by column 2 and not by the others?
From the POSIX description of sort:
Except when the -u option is specified, lines that otherwise compare equal shall be ordered as if none of the options -d, -f, -i, -n, or -k were present (but with -r still in effect, if it was specified) and with all bytes in the lines significant to the comparison. The order in which lines that still compare equal are written is unspecified.
So in your case, when two lines have the same value in the second column and thus are equal, the entire lines are then compared to get the final ordering.
GNU sort (and possibly other implementations, but it's not mandated by POSIX) has the -s option for a stable sort, where lines with keys that compare equal appear in the same order as in the original input, which is what it appears you want:
$ sort -t, -s -k2,2n chris.num
1,4,3
1,4,1
1,5,2
1,7,2
1,7,1
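The difference the -s flag makes can be checked side by side. This is a minimal sketch using the question's data in a temporary file; it assumes a sort that supports -s (GNU sort does), and LC_ALL=C is set only to pin the tie-breaking order to plain byte comparison.

```shell
# Recreate the question's sample file in a temporary location.
tmp=$(mktemp)
printf '%s\n' 1,5,2 1,4,3 1,4,1 1,7,2 1,7,1 > "$tmp"

# Without -s, ties on column 2 are broken by comparing the whole lines.
default_order=$(LC_ALL=C sort -t, -k2,2n "$tmp")

# With -s (stable), tied lines keep their input order.
stable_order=$(LC_ALL=C sort -t, -s -k2,2n "$tmp")

rm -f "$tmp"
printf 'default:\n%s\n\nstable:\n%s\n' "$default_order" "$stable_order"
```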

Differences between Unix commands for Sorting CSV

What's the difference between:
!tail -n +2 hits.csv | sort -k 1n -o output.csv
and
!tail -n +2 hits.csv | sort -t "," -k1 -n -k2 > output.csv
?
I'm trying to sort a csv file by first column first, then by the second column, so that lines with the same first column are still together.
It seems like the first one already does that correctly, by first sorting by the field before the first comma, then by the field following the first comma. (breaking ties, that is.)
Or does it not actually do that?
And what does the second command do/mean? (And what's the difference between the two?) There is a significant difference between the two output.csv files when I run the two.
And, finally, which one should I use? (Or are they both wrong?)
See also the answer by @morido for some other pointers, but here's a description of exactly what those two sort invocations do:
sort -k 1n -o output.csv
This assumes that the "fields" in your file are delimited by a transition from non-whitespace to whitespace (i.e. leading whitespace is included in each field, not stripped, as many might expect/assume), and tells sort to order things by a key that starts with the first field and extends to the end of the line, and assumes that the key is formatted as a numeric value. The output is sent explicitly to a specific file.
sort -t "," -k1 -n -k2
This defines the field separator as a comma, and then defines two keys to sort on. The first key again starts at the first field and extends to the end of the line, and the second key, which will be used when values of the first key compare equal, starts with the second field and extends to the end of the line. The -n between them is not attached to either key, so it acts as a global option and applies to both keys regardless of its position: each key is compared by the number at its start. Because neither key has an end field, it is clearer to bound them explicitly (-k1,1 and -k2,2), which makes it obvious which field each key actually covers.
Since you didn't provide sample data, it's unknown whether the data in the first two fields is numeric or not, but I suspect you want something like what was suggested in the answer by @morido:
sort -t, -k1,1 -k2,2
or
sort -t, -k1,1n -k2,2n (alternatively sort -t, -n -k1,1 -k2,2)
if the data is numeric.
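To make the difference concrete, here is a small sketch with a made-up stand-in for hits.csv (the column names and values are hypothetical). LC_ALL=C is used so that the comma is never read as a thousands separator and ties are broken in plain byte order.

```shell
# Hypothetical stand-in for hits.csv: a header row plus two numeric columns.
csv='id,score
10,2
2,10
2,9
1,5'

# No -t: the whole line is one field, so only the leading number is compared;
# ties fall back to byte order of the full line, which puts 2,10 before 2,9.
one_key=$(printf '%s\n' "$csv" | tail -n +2 | LC_ALL=C sort -k 1n)

# Comma separator with two bounded numeric keys: 2,9 now precedes 2,10.
two_keys=$(printf '%s\n' "$csv" | tail -n +2 | LC_ALL=C sort -t, -k1,1n -k2,2n)

printf 'one key:\n%s\n\ntwo keys:\n%s\n' "$one_key" "$two_keys"
```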
First off: you want to remove the leading ! from these two commands. In Bash (and probably other shells, since this comes from csh) a leading ! triggers history expansion, so !tail refers to the most recent command in your history that started with tail, which does not make sense here.
The main difference between your two versions is that in the first case you are not taking the second column into account.
This is how I would do it:
tail -n +2 hits.csv | sort -t "," -n --key=1,1 --key=2,2 > output.csv
-t specifies the field separator
-n turns on numerical sorting order
--key specifies the fields that should be used for sorting (in order of precedence)

How do I sort input with a variable number of fields by the second-to-last field?

Editor's note: The original title of the question mentioned tabs as the field separators.
In a text such as
500 east 23rd avenue Toronto 2 890 400000 1
900 west yellovillage blvd Mississauga 3 800 600090 3
how would you sort in ascending order of the second to last column?
Editor's note: The OP later provided another sample input line, 500 Jackson Blvd Toronto 3 700 40000 2, which contains only 8 whitespace-separated input fields (compared to the 9 above), revealing the need to deal with a variable number of fields in the input.
Note: There are several, potentially separate questions:
Update: Question C was the relevant one.
Question A: As implied by the question's title only: how can you use the tab character (\t) as the field separator?
Question B: How can you sort input by the second-to-last field, without knowing that field's specific index up front, given a fixed number of fields?
Question C: How can you sort input by the second-to-last field, without knowing that field's respective index up front, given a variable number of fields?
Answer to question A:
sort's -t option allows you to specify a field separator.
By default, sort uses any run of line-interior whitespace as the separator.
Assuming Bash, Ksh, or Zsh, you can use an ANSI C-quoted string ($'...') to specify a single tab as the field separator ($'\t'):
sort -t $'\t' -n -k8,8 file # -n sorts numerically; omit for lexical sorting
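A quick way to try this out, sketched with made-up tab-separated rows. The tab is produced with printf instead of $'\t' so the snippet does not depend on ANSI C quoting; the names tab, rows, and by_number are just illustrative.

```shell
# A literal tab character, built portably.
tab=$(printf '\t')

# Hypothetical rows: a name that contains a space, a tab, then a number.
rows=$(printf 'New York\t30\nLos Angeles\t4\nChicago\t12\n')

# With a tab as the separator, spaces inside a field no longer split it,
# so field 2 is always the number.
by_number=$(printf '%s\n' "$rows" | sort -t "$tab" -k2,2n)

printf '%s\n' "$by_number"
```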
Answer to question B:
Note: This assumes that all input lines have the same number of fields, and that input comes from file file:
# Determine the index of the next-to-last column, based on the first
# line, using Awk:
nextToLastColNdx=$(head -n 1 file | awk -F '\t' '{ print NF - 1 }')
# Sort numerically by the next-to-last column (omit -n to sort lexically):
sort -t $'\t' -n -k$nextToLastColNdx,$nextToLastColNdx file
Note: To sort by a single field, always specify it as the end field too (e.g., -k8,8), as above, because sort, given only a start field index (e.g., -k8), sorts from the specified field through the remainder of the line.
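The same two steps can be tried on a throwaway file. This is only a sketch with made-up three-column, tab-separated data, so the computed index is 2; the variable names are hypothetical.

```shell
tab=$(printf '\t')
tmp=$(mktemp)

# Hypothetical tab-separated rows with a fixed number of fields (three).
printf 'a\t30\tx\nb\t4\ty\nc\t12\tz\n' > "$tmp"

# Index of the next-to-last field, taken from the first line.
ndx=$(head -n 1 "$tmp" | awk -F '\t' '{ print NF - 1 }')

# Sort numerically on exactly that field (start and end index are the same).
sorted=$(sort -t "$tab" -n -k"$ndx,$ndx" "$tmp")

rm -f "$tmp"
printf '%s\n' "$sorted"
```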
Answer to question C:
Note: This assumes that input lines may have a variable number of fields, and that on each line it is that line's second-to-last field that should act as the sort field; input comes from file file:
awk '{ printf "%s\t%s\n", $(NF-1), $0 }' file |
sort -n -k1,1 | # omit -n to perform lexical sorting
cut -f2-
The awk command extracts each line's second-to-last field and prepends it to the input line on output, separated by a tab.
The result is sorted by the first field (i.e., each input line's second-to-last field).
Finally, the artificially prepended sort field is removed again, using cut.
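Applied to the three sample lines from the question, the pipeline can be sketched as follows. The only deviation from the command above is that the tab is passed to sort explicitly with -t, which is an extra precaution rather than a requirement.

```shell
# The question's sample lines, including the shorter 8-field one.
input='500 east 23rd avenue Toronto 2 890 400000 1
900 west yellovillage blvd Mississauga 3 800 600090 3
500 Jackson Blvd Toronto 3 700 40000 2'

tab=$(printf '\t')

# Decorate each line with its second-to-last field, sort on that, undecorate.
sorted=$(printf '%s\n' "$input" |
  awk '{ printf "%s\t%s\n", $(NF-1), $0 }' |
  sort -t "$tab" -k1,1n |
  cut -f2-)

printf '%s\n' "$sorted"
```

The lines come back ordered by 40000, 400000, 600090, regardless of how many fields precede those values.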
I suggest looking at "man sort".
You will see how to specify a field separator and how to specify the field index that should be used as a key for sorting.
You can use sort -k 2
For example :
echo -e '000 west \n500 east\n500 east\n900 west' | sort -k 2
The result is :
500 east
500 east
900 west
000 west
You can find more information in the man page of sort. Take a look at the end of the man page: just before the AUTHOR section there is some interesting information :)

use uniq -d on a particular column?

Have a text file like this.
john,3
albert,4
tom,3
junior,5
max,6
tony,5
I'm trying to fetch the records whose column 2 value appears more than once. My desired output:
john,3
tom,3
junior,5
tony,5
I'm checking if we can use uniq -d on second column?
Here's one way using awk. It reads the input file twice, but avoids the need to sort:
awk -F, 'FNR==NR { a[$2]++; next } a[$2] > 1' file file
Results:
john,3
tom,3
junior,5
tony,5
Brief explanation:
FNR==NR is a common AWK idiom that is true only while the first file in the argument list is being read. During that pass, the value of column two is used as an array key and its count is incremented; the next keyword skips the rest of the code. On the second read of the file, we simply print the lines whose column-two count is greater than one.
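Here is the same two-pass idea as a runnable sketch, with the question's data written to a temporary file and the array renamed to count for readability (the name is arbitrary):

```shell
# Recreate the question's file in a temporary location.
tmp=$(mktemp)
printf '%s\n' john,3 albert,4 tom,3 junior,5 max,6 tony,5 > "$tmp"

# The file is passed twice so awk reads it in two passes.
dups=$(awk -F, '
  FNR == NR { count[$2]++; next }   # first pass: count each column-2 value
  count[$2] > 1                     # second pass: print rows whose value repeats
' "$tmp" "$tmp")

rm -f "$tmp"
printf '%s\n' "$dups"
```

Because nothing is sorted, the matching rows keep their original order.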
You can use uniq on fields (columns), but not easily in your case.
uniq's -f and -s options skip a number of leading fields and characters, respectively, before comparing. However, neither of these quite does what you want.
-f divides fields by whitespace and you separate them with commas.
-s skips a fixed number of characters and your names are of variable length.
Overall though, uniq is used to compress input by consolidating duplicates into unique lines. You are actually wishing to retain duplicates and eliminate singletons, which is the opposite of what uniq is used to do. It would appear you need a different approach.
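For completeness, a uniq-based route is possible if the commas are temporarily turned into spaces. This is only a sketch under two assumptions: the names contain no spaces, and GNU uniq is available, since -D (print all duplicated lines) is not part of POSIX.

```shell
# Sort by column 2 so equal values are adjacent, swap commas for spaces so
# that uniq -f 1 can skip the name field, keep every duplicated line with -D,
# then restore the commas.
dups=$(printf '%s\n' john,3 albert,4 tom,3 junior,5 max,6 tony,5 |
  LC_ALL=C sort -t, -k2,2n |
  tr ',' ' ' |
  uniq -f 1 -D |
  tr ' ' ',')

printf '%s\n' "$dups"
```

Unlike the awk approach, the output here is ordered by the second column rather than by position in the original file.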
