UNIX sort ignores whitespaces

Given a file txt:
ab
a c
a a
When calling sort txt, I obtain:
a a
ab
a c
In other words, this is not a proper sort: it seems to delete or ignore the whitespace! I expected this to be the behavior of sort -i, but it happens with or without the -i flag.
I would like to obtain "correct" sorting:
a a
a c
ab
How should I do that?

Solved by:
export LC_ALL=C
From the sort(1) documentation:
WARNING: The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
(works for ASCII at least, no idea for UTF8)

Like mentioned before, LC_ALL=C sort does the trick. This is simply because different languages have different rules for sorting characters, which are often laid out by senior linguists instead of CS experts. And these rules, in the case of your locale, seem to say that spaces ought to be ignored in sorting.
By prefixing LC_ALL=C (or, when LC_ALL is unset, LC_COLLATE=C suffices), you explicitly declare language-agnostic sorting (and, with LC_ALL, number formatting and the other locale categories as well), which is what you want in this context. If you want to make this your default, export LC_COLLATE in your environment.
The default is chosen in this way to keep consistency with the "normal", real-world sorting schemes (like the white pages), which often ignored spaces.
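As a minimal sketch of the per-command form described above, the following feeds the three sample lines from the question to sort with LC_ALL=C set for that one invocation only. The variable name sorted is just for illustration.

```shell
# Sort the question's sample lines by raw byte value.
# LC_ALL=C applies to this single sort invocation only; the
# surrounding shell keeps whatever locale it already had.
sorted=$(printf '%s\n' 'ab' 'a c' 'a a' | LC_ALL=C sort)
printf '%s\n' "$sorted"
# prints:
# a a
# a c
# ab
```

A space is byte 0x20, which is lower than any letter, so "a a" and "a c" sort ahead of "ab".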

Using the C locale i.e. sorting just by byte values is not a good solution in languages where some letters are outside the range [A-Za-z]. Such letters are represented as multiple bytes in UTF-8 and then the byte value collating order is not what one desires. (Some characters may have two equivalent representations (pre-composed and de-composed)).
Nevertheless, the treatment of spaces is a problem. I tried the following:
$ cat stest
a b
a c
ab
a d
$ sort stest
ab
a b
a c
a d
$ sort -k 1,1 stest
a b
a c
a d
ab
For my needs, the -k 1,1 did the trick. Another, clumsier solution I tried was to change spaces to some auxiliary character, then sort, then change the auxiliary characters back into blanks.

You could use the 'env' program to temporarily change your LC_COLLATE for the duration of the sort; e.g.
/usr/bin/env LC_COLLATE=POSIX /bin/sort file1 file2
It's a little cumbersome on the command line but if you're using it in a script should be transparent.

I have been looking at this for a little while, wanting to optimize a shell script I maintain that has a heavy international userbase. (heavy as in percentage, not quantity).
Most of the options I saw around the web and SO seem to recommend what I see here, setting the locale globally (overkill)
export LC_ALL=C
or setting it for each individual command, like this example from gnu.org (tedious)
$ echo abcdefghijklmnopqrstuvwxyz | LC_ALL=C /usr/xpg4/bin/tr 'a-z' 'A-Z'
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I wanted to avoid clobbering the user's locale as an unseen side effect of running my program. This turned out to be easily accomplished, just as you would expect, by not setting it globally. There is no need for this variable to outlive your program.
I had to set LANG instead of LC_ALL for some reason, but all the individual locales were set which is functionally enough for me.
Here is the test, simple as can be
#!/bin/bash
# locale_checker.sh
# Check the current locale, then set LANG=C to optimize character sort and search.
echo "locale was $LANG"
LANG=C
locale
and output + proof that it is temporary and can be restricted to my script's process.
mateor#:~/snippets$ ./locale_checker.sh
locale was en_US.UTF-8
LANG=C
LANGUAGE=en_US:en
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
mateor#:~/snippets$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
There you go. You get the optimized locale without clobbering another person's innocent environment as well as avoid the tedium of piping it everywhere you think it may help.
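The same containment can be sketched without a separate script file. This is an assumption swapped in for illustration, not the author's method: a subshell stands in for the script's process, and the names before, inside and after are hypothetical.

```shell
# Record the caller's LC_ALL (or the word "unset" if it is not set).
before=${LC_ALL-unset}

# Change the locale inside a subshell only. The command substitution
# runs in its own process, just as a script would.
inside=$(
  LC_ALL=C
  export LC_ALL
  printf '%s\n' "$LC_ALL"
)

# Back in the parent shell, the original value is untouched.
after=${LC_ALL-unset}
printf 'before=%s inside=%s after=%s\n' "$before" "$inside" "$after"
```

The before and after values are always identical, whatever the caller's environment was.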

Weird, works here (cygwin).
Try sort -d txt.

Actually for me
$ cat txt
ab
a c
a a
$ sort txt
a a
a c
ab
I'll bet that between your a and c you have a non-breaking space, an en space, an em space, or some other high-codepoint space!
EDIT
Just ran it on Linux. I should have looked at the tags. Yes I get the same output you do! My first run was on the Mac. Looks like a difference between GNU and BSD. I will investigate further.
EDIT 2:
Linux uses a field-based sort.... still looking for how to suppress it. Tried
sort -t, txt
hoping to trick GNU into thinking the whole line was one field, but it still used the current locale to sort.
EDIT 3:
The OP solved the problem by setting the locale to C with
export LC_ALL=C
There seems to be no other approach. The sort command will use the current locale, and although it often says the C (or its alias POSIX) is the default locale, if you have Linux it has probably been set for you. Enter locale -a to see the available locales. On my system:
$ locale -a
C
POSIX
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
It seems like setting the locale to C (or its alias POSIX) is the only way to break the field-based behavior of sort and treat the whole line as one field. It is rather odd IMHO that this is how to do it. I would think the -t or -k options, or perhaps some new option would be a more sensible way to make this happen.
BTW, it looks like this question has been asked before on SO: unexpected result from gnu sort.

Related

Why ip_forward_use_pmtu added in the result of sysctl in linux server

So I upgraded the OS version on a Linux server, and was checking whether any settings had changed.
When I typed "sysctl -a | grep net.ipv4.ip_forward",
the following line had been added:
net.ipv4.ip_forward_use_pmtu = 0
I know that this is because this parameter is in /proc/sys.
But I think that if the output of sysctl before the upgrade did not show this line, it was not in /proc/sys before either, right?
I know that 0 means this setting is not applied, so basically it does not do anything.
But why was this line added?
The question is:
Is there any possible reason that would add this line?
Thank you in advance.

Even the question itself "added in the result of sysctl in linux server" is wrong here.
sysctl in the way you invoked it, lists all the entries.
grep, which you used to filter those entries, selects every line that contains the pattern. If you ran grep foo against the list:
foo
foobar
both items would be matched. That's exactly what you see but the only difference is instead of "foo" you have "net.ipv4.ip_forward".
Using --color shows that clearly:
Pay attention to the use of fgrep instead of grep because people tend to forget that grep interprets some characters as regular expressions, and the dot . means any character, which might also lead to unexpected matches.
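A small sketch of both points, substring matching and the regex dot. The three input lines are invented for illustration (the third one exists only to show the dot matching an underscore), and grep -F is used as the equivalent spelling of fgrep.

```shell
# Sample input, made up for this example.
lines='net.ipv4.ip_forward = 1
net.ipv4.ip_forward_use_pmtu = 0
net_ipv4_ip_forward = x'

# Plain grep: the pattern is a regex, so each "." matches any character.
regex_count=$(printf '%s\n' "$lines" | grep -c 'net.ipv4.ip_forward')

# grep -F: the pattern is a fixed string, so "." only matches a literal dot,
# but longer keys that contain the string still match.
fixed_count=$(printf '%s\n' "$lines" | grep -cF 'net.ipv4.ip_forward')

# grep -F -w: additionally require the match to be a whole word, which
# excludes net.ipv4.ip_forward_use_pmtu ("_" counts as a word character).
word_count=$(printf '%s\n' "$lines" | grep -cFw 'net.ipv4.ip_forward')

printf 'regex=%s fixed=%s word=%s\n' "$regex_count" "$fixed_count" "$word_count"
# prints: regex=3 fixed=2 word=1
```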

Bash round 2 decimal places with dot

I used printf "%0.2f\n" $myVar to display a value with 2 decimal places, but it doesn't work for numbers that use a dot (.) as the decimal mark, only for those that use a comma (,).
Does anybody have any idea what I should do?
http://puu.sh/owM1p/21f5be08c2.jpg

Try setting your locale environment variable LC_NUMERIC to some locale that uses period. E.g.
LC_NUMERIC="C" printf "%0.2f\n" 3.1415
The locale needs to be installed in your system. To get full list of the locales installed, use locale -a
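A minimal sketch of the same idea with the result captured in a variable. The value and the name rounded are illustrative. LC_ALL is used here because it overrides LC_NUMERIC when both are set; LC_NUMERIC=C, as in the answer, is enough when LC_ALL is unset.

```shell
myVar=3.14159

# Force the C locale for this one printf call so that "." is accepted
# (and printed) as the decimal mark, whatever the user's locale is.
rounded=$(LC_ALL=C printf '%0.2f' "$myVar")

printf '%s\n' "$rounded"
# prints: 3.14
```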

The problem was the locale, as someone pointed out.
I didn't know what caused the error, and I thank you.
More at this link:
/bin/bash printf does not work with other LANG than C

linux sort order that I don't understand

I notice the following sort outputs. Who understands why the '.' gets sorted in front the first time and at the end the second time?
I was trying to debug a program which looks up lines in a large sorted file, but the culprit seems to be my expectation/understanding of linux sort.
$ sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
/mnt/x/E
/mnt/x/.
<ctrl-D>
/mnt/x/.
________
/mnt/x/E
________
$ sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
/mnt/x/Ed
/mnt/x/.T
<ctrl-D>
/mnt/x/Ed
_________
/mnt/x/.T
_________
$

It's not that "." comes before or after other characters - it's that it's not being examined at all; it's sorting purely based on the alphabetic characters.
In your first example, <end-of-string> sorts before E; in the second example, E sorts before T.
This behaviour is dependent on the locale settings for collation. You can influence this with environment variables, such as LC_COLLATE:
$ env LC_COLLATE=C sort
/mnt/x/Ed
/mnt/x/.T
^D
/mnt/x/.T
/mnt/x/Ed
$ env LC_COLLATE=en_US.UTF-8 sort
/mnt/x/Ed
/mnt/x/.T
^D
/mnt/x/Ed
/mnt/x/.T
$
Under the C locale, all ASCII characters are considered, and are sorted in their ASCII order; in many other locales punctuation is ignored - this is presumably what is causing the behaviour you're seeing.
You can examine your locale settings using the locale command.
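Since the original goal was looking up lines in a sorted file, one hedged sketch is to sort and verify under the same explicit locale, so that the producer and the consumer agree on the order. The names sorted and in_order are illustrative.

```shell
# Sort the two sample paths by byte value: "." (0x2E) sorts before "E" (0x45).
sorted=$(printf '%s\n' '/mnt/x/Ed' '/mnt/x/.T' | LC_ALL=C sort)
printf '%s\n' "$sorted"
# prints:
# /mnt/x/.T
# /mnt/x/Ed

# sort -c only checks the order; it exits with status 0 when the input
# is already sorted under the locale in effect.
if printf '%s\n' "$sorted" | LC_ALL=C sort -c; then
  in_order=yes
else
  in_order=no
fi
printf 'in order under C: %s\n' "$in_order"
```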

Using -s command in bash script

I have a trivial error that I can't seem to get around. I'm trying to return the various section numbers of, let's say, "man", since it resides in all the sections. I am using the -s command but am having problems. Every time I use it I keep getting "What manual page do you want?". Any help?

In the case of getting the section number of a command, you want something like man -k "page_name" | awk -F'-' '/^page_name \(/ {print $1}', replacing any occurrence of page_name with whatever command you need. (The awk program goes in single quotes so that the shell does not expand $1 before awk sees it.)
This won't work for all systems necessarily as the format for the "man" output is "implementation-defined". In other words, the format on FreeBSD, OS X, various flavours of Linux, etc. may not be the same. For example, mine is:
page_name (1) - description
If you want the section number only, I'm sure there is something you can do such as saving the result of that line in a shell variable and use parameter expansion to remove the parentheses around the section number:
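man -k "page_name" | awk -F'-' '/^page_name \(/ {print $1}' | while IFS= read -r sect ; do
    # strip everything up to and including the last "("
    sect="${sect##*[(]}"
    # strip the closing ")" and anything after it
    sect="${sect%[)]*}"
    printf '%s\n' "$sect"
done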
To get the number of sections a command appears in, add | wc -l at the end on the same line as the done keyword. For the mount command, I have 3:
2
2freebsd
8
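The parameter-expansion step can be tried on its own, without depending on the local man -k output format. This sketch uses the sample line shown above as a literal string; the variable name line is illustrative.

```shell
# One line in the "page_name (section) - description" format shown earlier.
line='page_name (1) - description'

# Remove the shortest prefix that ends in "(".
sect=${line#*\(}
# Remove the longest suffix that starts with ")".
sect=${sect%%\)*}

printf '%s\n' "$sect"
# prints: 1
```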

You've misinterpreted the nature of -s. From man man:
-S list, -s list, --sections=list
List is a colon- or comma-separated list of `order specific' manual sections to search. This option overrides the $MANSECT environment variable. (The -s spelling is for compatibility with System V.)
So when man sees man -s man it thinks you want to look for a page in section "man" (which most likely doesn't exist, since it is not a normal section), but you didn't say what page, so it asks:
What manual page do you want?
BTW, wrt "man is just the test case cuz i believe its in all the sections" -- nope, it is probably only in one, and AFAIK there isn't any word with a page in all sections. More than 2 or 3 would be very unusual.
The various standard sections are described in man man too.

The correct syntax requires an argument. Typically you're looking for either
man -s 1 man
to read the documentation for the man(1) command, or
man -s 7 man
to read about the man(7) macro package.
If you want a list of standard sections, the former contains that. You may have additional sections installed locally, though. A directory listing of /usr/local/share/man might reveal additional sections, for example.
(Incidentally, -s is not a "command" in this context, it's an option.)

How to get terminal's Character Encoding

I have changed my gnome-terminal's character encoding to "GBK" (the default is UTF-8). How can I read back that value (the character encoding) on my Linux system?

The terminal uses environment variables to determine which character set to use, therefore you can determine it by looking at those variables:
echo $LC_CTYPE
or
echo $LANG

The locale command with no arguments will print the values of all of the relevant environment variables except for LANGUAGE.
For current encoding:
locale charmap
For available locales:
locale -a
For available encodings:
locale -m
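A short sketch of using locale charmap from a script. Note the assumption: this reports the encoding the current locale settings imply, which is not necessarily what the terminal emulator itself has been switched to. The names charmap and kind are illustrative.

```shell
# Ask the locale system which character encoding is in effect.
charmap=$(locale charmap)

# Branch on the answer; only UTF-8 is singled out here.
case $charmap in
  UTF-8) kind='utf-8' ;;
  *)     kind='other' ;;
esac

printf 'locale reports %s (%s)\n' "$charmap" "$kind"
```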

Check encoding and language:
$ echo $LC_CTYPE
ISO-8859-1
$ echo $LANG
pt_BR
Get all languages:
$ locale -a
Change to pt_PT.utf8:
$ export LC_ALL=pt_PT.utf8
$ export LANG="$LC_ALL"

If you have Python:
python -c "import sys; print(sys.stdout.encoding)"

To my knowledge, no.
Circumstantial indications from $LC_CTYPE, locale and such might seem alluring, but these are completely separated from the encoding the terminal application (actually an emulator) happens to be using when displaying characters on the screen.
The only way to detect the encoding for sure is to output something only present in that encoding, e.g. ä, take a screenshot, analyze that image and check whether the output character is correct.
So no, it's not possible, sadly.

To see the current locale information use locale command. Below is an example on RHEL 7.8
[usr#host ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Examination of https://invisible-island.net/xterm/ctlseqs/ctlseqs.html, the xterm control character documentation, shows that it follows the ISO 2022 standard for character set switching. In particular ESC % G selects UTF-8.
So to force the terminal to use UTF-8, this command would need to be sent. I find no way of querying which character set is currently in use, but there are ways of discovering if the terminal supports national replacement character sets.
However, from charsets(7), it doesn't look like GBK (or GB2312) is an encoding supported by ISO 2022 and xterm doesn't support it natively. So your best bet might be to use iconv to convert to UTF-8.
Further reading shows that a (significant) subset of GBK is EUC, which is an ISO 2022 code, so ISO 2022 capable terminals may be able to display GBK natively after all, but I can't find any mention of activating this programmatically, so the terminal's user interface would be the only recourse.
