linux sort order that I don't understand - linux

I notice the following sort outputs. Who understands why the '.' gets sorted in front the first time and at the end the second time?
I was trying to debug a program which looks up lines in a large sorted file, but the culprit seems to be my expectation/understanding of linux sort.
$ sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
/mnt/x/E
/mnt/x/.
<ctrl-D>
/mnt/x/.
________
/mnt/x/E
________
$ sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
/mnt/x/Ed
/mnt/x/.T
<ctrl-D>
/mnt/x/Ed
_________
/mnt/x/.T
_________
$

It's not that "." comes before or after other characters - it's that it's not being examined at all; it's sorting purely based on the alphabetic characters.
In your first example, <end-of-string> sorts before E; in the second example, E sorts before T.
This behaviour is dependent on the locale settings for collation. You can influence this with environment variables, such as LC_COLLATE:
$ env LC_COLLATE=C sort
/mnt/x/Ed
/mnt/x/.T
^D
/mnt/x/.T
/mnt/x/Ed
$ env LC_COLLATE=en_US.UTF-8 sort
/mnt/x/Ed
/mnt/x/.T
^D
/mnt/x/Ed
/mnt/x/.T
$
Under the C locale, all ASCII characters are considered, and are sorted in their ASCII order; in many other locales punctuation is ignored - this is presumably what is causing the behaviour you're seeing.
You can examine your locale settings using the locale command.

Related

wp wc product_cat get - Remove leading whitespace?

I am using the WP-CLI for updating WooCommerce product_cat terms. When using wp wc product_cat get to retrieve individual fields, a line feed character (a0) seems to get inserted as leading character. Example:
$ echo "»"$(wp wc product_cat --user=4 get 44277 --field="description")"«"
» All widgets for A.-C.«
Another example - Note that the leading character is before the opening "
$ i1=$(wp wc product_cat --user=4 get 18869 --field="name" --format="json")
$ echo "format=json: »"$i1"«"
format=json: » "AEG"«
Additional information:
This happens for all fields
I verified that the added character is a0 by updating the field and checking in the database
Using --format didn't make a difference
Using --context didn't make a difference
I'm working on Linux Mint with Bash version 5.0.17(1).
Did I make a mistake somehwere in my syntaxis that inadvertently inserted this leading character? Or am I missing something in how WP-CLI or Bash works? Thanks in advance! Jeroen

Strange behaviour of redhat linux sort

Redhat linux, file to sort - "aaa":
4;AAA;456
3;BBB;567
2;AAA;123
1;BBB;234
5;AAA;000
sort only by second field - command:
sort -t ";" -k2,2 aaa
output is:
2;AAA;123
4;AAA;456
5;AAA;000
1;BBB;234
3;BBB;567
In my opinion output should be:
4;AAA;456
2;AAA;123
5;AAA;000
3;BBB;567
1;BBB;234
Error in sort?
There could be other reasons, but I'll guess that it is your "opinion", because you think that for records with equal keys, whichever one was first encountered in the file should be first in the output.
That is known as a "stable sort".
Stable sorts can take more work, and in most cases aren't required, so by default the sort command doesn't do it. Hence the results you saw.
It can do it if you want it to though:
$ sort --stable --field-separator=";" --key="2,2" aaa
4;AAA;456
2;AAA;123
5;AAA;000
3;BBB;567
1;BBB;234

What does "cat -A" command option mean in Unix

I'm currently working on a Unix box and came across this post which I found helpful, in order to learn about cat command in Unix. At the bottom of the page found this line saying: -A = Equivalent to -vET
As I'm new into Unix, I'm unaware of what does this mean actually? For example lets say I've created a file called new using cat and then apply this command to the file:
cat -A new, I tried this command but an error message comes up saying it's and illegal option.
To cut short, wanted to know what does cat -A really mean and how does it effect when I apply it to a file. Any help would be appreciated.
It means show ALL.
Basically its a combination of -vET
E : It will display '$' at the end of every line.
T : It will display tab character as ^I
v : It will use ^ and M-notation
^ and M-notation:
(Display control characters except for LFD(LineFeed or NewLine) and TAB using '^' notation and precede characters that have the high bit set with
'M-') M- notation is a way to display high-bit characters as low bit ones by preceding them with M-
You should read about little-endian and big-endian if you like to know more about M notation.
For example:
!http://i.imgur.com/0DGET5k.png?1
Check your manual page as below and it will list all options avaialable with your command and check is there -A present, if it is not present it is an illegal option.
man cat
It displays non-printing characters
In Mac OS you need to use -e flag and
-e Display non-printing characters (see the -v option), and display a dollar sign (`$') at the end of each line.

UNIX sort ignores whitespaces

Given a file txt:
ab
a c
a a
When calling sort txt, I obtain:
a a
ab
a c
In other words, it is not proper sorting, it kind of deletes/ignores the whitespaces! I expected this to be the behavior of sort -i but it happens with or without the -i flag.
I would like to obtain "correct" sorting:
a a
a c
ab
How should I do that?
Solved by:
export LC_ALL=C
From the sort() documentation:
WARNING: The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
(works for ASCII at least, no idea for UTF8)
Like mentioned before, LC_ALL=C sort does the trick. This is simply because different languages have different rules for sorting characters, which are often laid out by senior linguists instead of CS experts. And these rules, in the case of your locale, seem to say that spaces ought to be ignored in sorting.
By prefixing LC_ALL=C (or, when LC_ALL is unset, LC_COLLATE=C suffices), you explicitely declare language-agnostic sorting (and, with LC_ALL, number-formatting and stuff), which is what you want in this context. If you want to make this your default, export LC_COLLATE in your environment.
The default is chosen in this way to keep consistency with the "normal", real-world sorting schemes (like the white pages), which often ignored spaces.
Using the C locale i.e. sorting just by byte values is not a good solution in languages where some letters are outside the range [A-Za-z]. Such letters are represented as multiple bytes in UTF-8 and then the byte value collating order is not what one desires. (Some characters may have two equivalent representations (pre-composed and de-composed)).
Nevertheless, the treatment of spaces is a problem. I tried the following:
$ cat stest
a b
a c
ab
a d
$ sort stest
ab
a b
a c
a d
$ sort -k 1,1 stest
a b
a c
a d
ab
For my needs, the -k 1,1 did the trick. Another but clumsier solution I tried, was to change spaces to some auxiliary character, then sort, then change the auxiliaries back into blanks.
You could use the 'env' program to temporarily change your LC_COLLATE for the duration of the sort; e.g.
/usr/bin/env LC_COLLATE=POSIX /bin/sort file1 file2
It's a little cumbersome on the command line but if you're using it in a script should be transparent.
I have been looking at this for a little while, wanting to optimize a shell script I maintain that has a heavy international userbase. (heavy as in percentage, not quantity).
Most of the options I saw around the web and SO seem to recommend what I see here, setting the locale globally (overkill)
export LC_ALL=C
or piping it into each individual command like this from gnu.org (tedious)
$ echo abcdefghijklmnopqrstuvwxyz | LC_ALL=C /usr/xpg4/bin/tr 'a-z' 'A-Z' ABCDEFGHIJKLMNOPQRSTUVWXYZ
I wanted to avoid clobbering the user's locale as a unseen side effect of running my program. This turned out to be easily accomplished just as you would expect, by leaving off the globalization. No need to export this variable past your program.
I had to set LANG instead of LC_ALL for some reason, but all the individual locales were set which is functionally enough for me.
Here is the test, simple as can be
#!/bin/bash
# locale_checker.sh
#Check and set locale to LC_ALL to optimize character sort and search.
echo "locale was $LANG"
LANG=C
locale
and output + proof that it is temporary and can be restricted to my script's process.
mateor#:~/snippets$ ./locale_checker.sh
locale was en_US.UTF-8
LANG=C
LANGUAGE=en_US:en
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
mateor#:~/snippets$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
There you go. You get the optimized locale without clobbering another person's innocent environment as well as avoid the tedium of piping it everywhere you think it may help.
Weird, works here (cygwin).
Try sort -d txt.
Actually for me
$ cat txt
ab
a c
a a
$ sort txt
a a
a c
ab
I'll bet between your a and c you have a non-breaking space or an enspace or an empspace or other high-codepoint space!
EDIT
Just ran it on Linux. I should have looked at the tags. Yes I get the same output you do! My first run was on the Mac. Looks like a difference between GNU and BSD. I will investigate further.
EDIT 2:
Linux uses a field-based sort.... still looking for how to suppress it. Tried
sort -t, txt
hoping to trick GNU into thinking the whole line was one field, but it still used the current locale to sort.
EDIT 3:
The OP solved the problem by setting the locale to C with
export LC_ALL=C
There seems to be no other approach. The sort command will use the current locale, and although it often says the C (or its alias POSIX) is the default locale, if you have Linux it has probably been set for you. Enter locale -a to see the available locales. On my system:
$ locale -a
C
POSIX
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
It seems like setting the locale to C (or its alias POSIX) is the only way to break the field-based behavior of sort and treat the whole line as one field. It is rather odd IMHO that this is how to do it. I would think the -t or -k options, or perhaps some new option would be a more sensible way to make this happen.
BTW, it looks like this question has been asked before on SO: unexpected result from gnu sort.

How to do `sort -V` in OSX?

I wrote a script for a Linux bash shell.
One line takes a list of filenames and sorts them. The list looks like this:
char32.png char33.png [...] char127.png
It goes from 32 to 127.
The default sorting of ls of this list is like this
char100.png char101.png [...] char32.png char33.png [...] char99.png
Luckily, there is sort, which has the handy -V switch which sorts the list correctly (as in the first example).
Now, I have to port this script to OSX and sort in OSX is lacking the -V switch.
Do you have a clever idea of how to sort this list correctly?
Do they all start with a fixed string (char in your example)? If so:
sort -k1.5 -n
-k1.5 means to sort on the first key (there’s only one key in your example) starting from the 5th character, which will be the first digit. -n means to sort numerically. This works on Linux too.

Resources