I want to sort a unix file by the id column but when I use sort -k4,4 or -k4,4n I do not get the expected result.
The column of interest should be sorted like this:
id1
id2
id3
id4
etc.
Instead, it is sorted like this when I do sort -k4,4:
id1
id10
id100
id1000
id10000
id10001
etc.
My Unix version's sort supports the following options:
sort --help
Usage: sort [OPTION]... [FILE]...
Write sorted concatenation of all FILE(s) to standard output.
Mandatory arguments to long options are mandatory for short options too.
Ordering options:
-b, --ignore-leading-blanks ignore leading blanks
-d, --dictionary-order consider only blanks and alphanumeric characters
-f, --ignore-case fold lower case to upper case characters
-g, --general-numeric-sort compare according to general numerical value
-i, --ignore-nonprinting consider only printable characters
-M, --month-sort compare (unknown) < `JAN' < ... < `DEC'
-n, --numeric-sort compare according to string numerical value
-r, --reverse reverse the result of comparisons
Other options:
-c, --check check whether input is sorted; do not sort
-k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)
-m, --merge merge already sorted files; do not sort
-o, --output=FILE write result to FILE instead of standard output
-s, --stable stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE use SIZE for main memory buffer
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run
-z, --zero-terminated end lines with 0 byte, not newline
--help display this help and exit
--version output version information and exit
Use the -V or --version-sort option for version sort:
sort -V -k4,4 file.txt
Example:
$ cat file.txt
id5
id3
id100
id1
id10
Output:
$ sort -V file.txt
id1
id3
id5
id10
id100
EDIT:
If your implementation of sort doesn't have the -V option, a workaround is to use sed to remove the id prefix so a numeric sort -n can be done, then use sed again to put id back, like this:
sed -E 's/id([0-9]+)/\1/' file.txt | sort -n -k4,4 | sed -E 's/( *)([0-9]+)( *|$)/\1id\2\3/'
Note: this solution is data-dependent; it only works if no purely numeric columns appear before the ID column.
As sudo_o has already mentioned, the easiest option is to use --version-sort, which does a natural sort of numbers embedded in text.
If your version of sort does not have that option, a hacky alternative is to temporarily strip the "id" prefix from field 4 before sorting, then put it back. Here's one way, using awk (the trailing 1 ensures every line is printed even when the substitution doesn't match; a bare sub(...) used as a pattern would silently drop those lines):
awk '{sub(/^id/, "", $4)} 1' file.txt | sort -k4,4n | awk '{sub(/^/, "id", $4)} 1'
Note that awk rebuilds each line it modifies, so runs of whitespace between fields are squeezed to single spaces.
If your sort supports it, you can also use the syntax F.C to use specific characters from a field.
This would sort on field 4, from chars 3 to 10, numerical value:
sort -bn -k 4.3,4.10 file
And this would sort on field 4, from chars 3 to end of field, numerical value:
sort -bn -k 4.3,4 file
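With the single-column data from the example above (so the key is field 1 and the digits start at character 3), the char-offset syntax looks like this:

```shell
# Hypothetical one-column data: the id is field 1, digits start at char 3
printf 'id5\nid100\nid10\n' | sort -bn -k1.3,1
# sorts numerically on the digits alone: id5, id10, id100
```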
Related
When I use uniq -u data.txt lists the whole file and when I use sort data.txt | uniq -u it omits repeated lines. Why does this happen?
uniq man says that -u, --unique only prints unique lines. I don't understand why I need to use pipe to get correct output.
uniq removes adjacent duplicates. If you want to omit duplicates that are not adjacent, you'll have to sort the data first.
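A quick demonstration with made-up data:

```shell
# uniq only collapses *adjacent* duplicates, so both "apple" lines survive
printf 'apple\nbanana\napple\n' | uniq

# sorting first makes duplicates adjacent, so uniq can collapse them
printf 'apple\nbanana\napple\n' | sort | uniq

# with -u, only lines that appear exactly once are printed
printf 'apple\nbanana\napple\n' | sort | uniq -u
# -> banana
```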
I have a couple of csv's that i need to merge. I want to consider entries which have the same first and second columns as duplicates.
I know the command for this is something like
sort -t"," -u -k 1,1 -k 2,2 file1 file2
I additionally want to resolve the duplicates in such a way that the entry from the second file is chosen every time. What is the way to do that?
If the suggestion to reverse the order of files to the sort command doesn't work (see other answer), another way to do this would be to concatenate the files, file2 first, and then sort them with the -s switch.
cat file2 file1 | sort -t"," -u -k 1,1 -k 2,2 -s
-s forces a stable sort, meaning that identical lines will appear in the same relative order. Since the input to sort has all of the lines from file2 before file1, all of the duplicates in the output should come from file2.
The sort man page doesn't explicitly state that input files will be read in the order that they're supplied on the command line, so I guess it's possible that an implementation could read the files in reverse order, or alternating lines, or whatever. But if you concatenate the files first then there's no ambiguity.
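To illustrate with two tiny throwaway files (the names and contents are invented for the demo):

```shell
# key "1,a" exists in both files; we want file2's line to win
printf '1,a,from_file1\n2,b,from_file1\n' > file1
printf '1,a,from_file2\n'                 > file2

# file2's lines come first, and -s keeps the first of each equal run
cat file2 file1 | sort -t"," -u -k 1,1 -k 2,2 -s
# -> 1,a,from_file2
#    2,b,from_file1

rm file1 file2
```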
Changing the order of the two files and adding -s (as @Jim Mischel hinted) will solve your problem:
sort -t"," -u -k 1,1 -k 2,2 -s file2 file1
man sort
-u, --unique
with -c, check for strict ordering; without -c, output only the
first of an equal run
-s, --stable
stabilize sort by disabling last-resort comparison
Short answer:
awk -F"," '{out[$1,$2]=$0} END {for (i in out) print out[i]}' file1 file2
(Using out[$1,$2] rather than out[$1$2] joins the two fields with SUBSEP, so keys like "ab"+"c" and "a"+"bc" can't collide.)
A longer, commented version:
awk 'BEGIN {
FS=OFS=","; # set ',' as field separator
}
{
out[$1,$2]=$0; # save the line keyed by fields 1 and 2; a later value replaces an earlier one
}
END {
for (i in out) { # at the end, print every saved value (note: in arbitrary order)
print out[i];
}
}' file1 file2
How can I find the files from a folder where a specific word appears on more than 3 lines? I tried using recursive grep for finding that word and then using -c to count the number of lines where the word appears.
This command will recursively list the files in the current directory where word appears on more than 3 lines, along with the matches count for each file:
grep -c -r 'word' . | grep -v -e ':[0123]$' | sort -n -t: -k2
The final sort is not necessary if you don't want the results sorted, but I'd say it's convenient.
The first command in the pipeline (grep -c -r 'word' .) recursively walks the current directory and prints file:count for every file, where count is the number of lines containing word (note that -c counts matching lines, not total occurrences). The intermediate grep discards every count that is 0, 1, 2 or 3, so you only get counts greater than 3 (the -v option inverts the sense of matching to select non-matching lines). The final sort step sorts the list by match count: it sets the field delimiter to : and instructs sort(1) to sort numerically on the 2nd field (the count).
Here's a sample output from some tests I ran:
./file1:4
./dir1/dir2/file3:5
./dir1/file2:8
If you just want the filenames without the match counts, you can use sed(1) to discard the :count portions:
grep -m 4 -c -r 'word' . | grep -v -e ':[0123]$' | sed -r 's/:[0-9]+$//'
As noted in the comments, if the match count is not important, we can optimize the first grep with -m 4, which stops reading each file after 4 matching lines.
UPDATE
The solution above works fine for small thresholds, but it does not scale well to larger numbers. If you want to filter on an arbitrary number, you can use awk(1) instead (and it actually ends up much cleaner), like so:
grep -c -r 'word' . | awk -F: '$2 > 10'
The -F: argument to awk(1) is necessary; it instructs awk(1) to separate fields by : rather than the default (whitespace and tab). This solution generalizes well to any number.
Again, if matches count doesn't matter and all you want is to get a list of the filenames, do this instead:
grep -c -r 'word' . | awk -F: '$2 > 10 { print $1 }'
I wanted to sort a file in numerical order as well as uniquify with the sort -nu [filename].
$ *** | sort -n | wc
201172
$ *** | sort -nu | wc
9599
$ *** | sort -un | wc
9599
$ *** | sort -n | sort -u | wc
201149
$ *** | sort -u | wc
201149
Why is there a decrease in the number of lines with sort -un? I tried running the above commands on a small numeric file to see if there was any problem, and it worked as expected.
Am I missing something obvious, or are those options incompatible with each other? I've checked man sort, and no information was provided about this combination.
Thanks in advance.
EDIT
How should I fix this ? (using the n and u options separately ?)
-u removes duplicates, so yes, it will obviously reduce the line count if a key is repeated within the file. The subtlety is that with -n, the sort key is the leading numeric string, so lines whose numbers compare equal are treated as duplicates even when the rest of the line differs.
The difference with
sort -n | sort -u
is that the second sort -u command compares the full line, not just the numeric key.
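A small made-up example makes the difference visible:

```shell
# With -n, the sort key is the leading number, so -u treats both
# "10 ..." lines as duplicates and keeps only one of them:
printf '10 apple\n10 banana\n2 cherry\n' | sort -nu
# 2 lines of output

# Piping through a separate sort -u compares whole lines instead,
# so all three distinct lines survive:
printf '10 apple\n10 banana\n2 cherry\n' | sort -n | sort -u
# 3 lines of output
```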
So you need to understand the meaning of -u and -n.
man sort
-u Unique: suppresses all but one in each set
of lines having equal keys. If used with the
-c option, checks that there are no lines
with duplicate keys in addition to checking
that the input file is sorted.
-n Restricts the sort key to an initial numeric string,
consisting of optional blank characters, optional
minus sign, and zero or more digits with an optional
radix character and thousands separators (as defined
in the current locale), which is sorted by arithmetic
value. An empty digit string is treated as zero.
Leading zeros and signs on zeros do not affect order-
ing.
I have a shell script which is grepping the results of a file then it calls sort -u to get the unique entries. Is there a way to have sort also tell me how many of each of those entries there are? So the output would be something like:
user1 - 50
user2 - 23
user3 - 40
etc..
Use sort input | uniq -c. uniq does what -u does in sort -u, but also has the additional -c option for counting.
grep has a -c switch that counts matching lines:
grep -c needle haystack
will print the number of lines containing needle, which you can then sort as needed.
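One caveat worth knowing, shown with throwaway data: grep -c counts matching lines, not total occurrences, so a line containing the word twice still counts once.

```shell
printf 'needle needle\nhay\nneedle\n' > haystack
grep -c needle haystack
# -> 2  (two matching lines, even though "needle" appears three times)
rm haystack
```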
Given a sorted list, uniq -c will show the item, and how many. It will be the first column, so I will often do something like:
sort file.txt | uniq -c | sort -nr
The -n tells sort to compare numerically, so 9 sorts before 11; the -r reverses the order, putting the lines with the highest counts first (which is usually what I want).
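Putting it together with some invented user data:

```shell
# count occurrences of each name, highest counts first
printf 'user2\nuser1\nuser1\nuser2\nuser2\nuser3\n' | sort | uniq -c | sort -nr
# -> 3 user2
#    2 user1
#    1 user3   (uniq -c pads the counts with leading spaces)
```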