Some confusion about UTF-8 on a Linux server? - linux

I've always had questions about character sets.
Local (Mac OS)
bash-3.2$ 你好
bash: 你好: command not found
bash-3.2$ locale
LANG="en_US.UTF-8"
Linux Server (ssh)
root@hg:~# 你好
-bash: $'\344\275\240\345\245\275': command not found
root@hg:~# locale
LANG=en_US.utf8
Question 1
Why, when both are UTF-8, does the server show 你好 as \344\275\240\345\245\275 rather than the literal 你好?
Question 2
Does \344\275\240\345\245\275 represent the UTF-8 encoding of 你好? Shouldn't it be \xE4\xBD\xA0\xE5\xA5\xBD? Are there different kinds of UTF-8?

Why, when both are UTF-8, does the server show 你好 as \344\275\240\345\245\275 rather than the literal 你好?
The display comes from your shell. Locally you're running the old bash 3.2 that ships with macOS, which copies your input into the error message verbatim.
Remotely you're running a newer bash, which tries to keep the message readable even if the bytes aren't printable in the effective locale (or you're missing the right fonts), so it renders each byte in octal inside $'...' quoting.
Does \344\275\240\345\245\275 represent the UTF-8 encoding of 你好? Shouldn't it be \xE4\xBD\xA0\xE5\xA5\xBD? Are there different kinds of UTF-8?
The first is the octal representation of the same bytes: octal 344 is hex E4, octal 275 is hex BD, and so on. All six bytes match; it's just a different display format, not a different UTF-8.
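You can check the correspondence yourself. A minimal Python 3 sketch (any interpreter will do) prints the UTF-8 bytes of 你好 in both notations:
# Show the six UTF-8 bytes of 你好 in octal and in hex: same bytes, two notations.
data = '你好'.encode('utf-8')
print(' '.join(f'\\{b:03o}' for b in data))   # \344 \275 \240 \345 \245 \275
print(' '.join(f'\\x{b:02X}' for b in data))  # \xE4 \xBD \xA0 \xE5 \xA5 \xBD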

Related

What's the default encoding in bash standard input? [duplicate]

I am using Gina Trapani's excellent todo.sh to organize my todo list.
However, being a Dane, it would be nice if the script accepted special Danish characters like ø and æ.
I am an absolute UNIX n00b, so it would be a great help if anybody could tell me how to fix this! :)
Slowly, the Unix world is moving from ASCII and other regional encodings to UTF-8. You need to be running a UTF-8-capable terminal, such as a modern xterm or PuTTY.
In your ~/.bash_profile, set your language to one of the UTF-8 locales:
export LANG=C.UTF-8
or
export LANG=en_AU.UTF-8
etc.
You should then be able to write UTF-8 characters in the terminal, and include them in bash scripts.
#!/bin/bash
echo "UTF-8 is græat ☺"
See also: https://serverfault.com/questions/11015/utf-8-and-shell-scripts
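Other programs consult the same setting. As a quick check, this Python 3 sketch (run it in a fresh login shell so the new LANG is in effect) prints the encoding your processes will inherit:
import locale, os

# The locale a process inherits is driven by the LANG/LC_* environment variables.
print(os.environ.get('LANG'))         # e.g. en_AU.UTF-8
locale.setlocale(locale.LC_ALL, '')   # adopt the environment's locale
print(locale.getpreferredencoding())  # 'UTF-8' for a UTF-8 locale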
What does this command show?
locale
It should show something like this for you:
LC_CTYPE="da_DK.UTF-8"
LC_NUMERIC="da_DK.UTF-8"
LC_TIME="da_DK.UTF-8"
LC_COLLATE="da_DK.UTF-8"
LC_MONETARY="da_DK.UTF-8"
LC_MESSAGES="da_DK.UTF-8"
LC_PAPER="da_DK.UTF-8"
LC_NAME="da_DK.UTF-8"
LC_ADDRESS="da_DK.UTF-8"
LC_TELEPHONE="da_DK.UTF-8"
LC_MEASUREMENT="da_DK.UTF-8"
LC_IDENTIFICATION="da_DK.UTF-8"
LC_ALL=
If not, you might try doing this before you run your script:
export LANG=da_DK.UTF-8
You don't say what happens when you run the script and it encounters these characters. Are they in the todo file? Are they entered at a prompt? Is there an error message? Is something output in place of the expected output?
Try this and see what you get:
read -p "Enter some characters: " string
echo "$string"

Unicode character not visible while doing cat

I have a CSV file generated by a Windows system. The file is then moved to Linux. The Linux environment is NAME="Red Hat Enterprise Linux Server", VERSION="7.3 (Maipo)", ID="rhel".
When I use the vi editor, all characters are visible. For example, one line reads: "Sarah--bitte nicht löschen".
But when I cat the file, I get something like "Sarah--bitte nicht l▒schen".
This file is consumed by the DataStage application, and these Unicode characters come through as "?" in DataStage. Since cat is not showing the character properly, I believe the issue is on the Linux server. Any help is appreciated.
vi reads the file using the encoding given by its fenc (fileencoding) setting and displays the content according to your locale (the $LANG environment variable). If fenc differs from the locale's encoding, vi handles the translation.
But cat does no such translation; it always outputs the exact byte stream without any conversion.
Your terminal then renders the output of both vi and cat using your local PC's locale setting.
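You can reproduce the effect without vi or cat. This Python 3 sketch (the file name is made up, and cp1252 is an assumption about what the Windows side wrote) saves the line in a Windows encoding and rereads it as UTF-8, which is effectively what a UTF-8 terminal does with cat's raw output:
# Write a Windows-1252 encoded file, then reread it assuming UTF-8.
with open('sample.csv', 'w', encoding='cp1252') as f:
    f.write('Sarah--bitte nicht löschen\n')

with open('sample.csv', encoding='utf-8', errors='replace') as f:
    print(f.read())  # "Sarah--bitte nicht l�schen": the lone 0xF6 byte ('ö' in cp1252) is not valid UTF-8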

UnicodeEncodeError: 'charmap' codec can't encode... solution in traceback? [duplicate]

OK, I want to print a string in my Windows XP console.
There are several characters the console can't print, so I have to encode to my stdout.encoding, which is 'cp437'. But when printing the encoded string, 'ß' is printed as '\xe1'. After decoding back to Unicode and printing the string, I get the output I want, but this feels somewhat wrong. What is the correct way to print a string and get '?' for non-printable characters?
>>> var
'Bla \u2013 großes'
>>> print(var)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013'
>>> var.encode('cp437', 'replace')
b'Bla ? gro\xe1es'
>>> print(var.encode('cp437', 'replace'))
b'Bla ? gro\xe1es'
>>> var.encode('cp437', 'replace').decode('cp437')
'Bla ? großes'
>>> print(var.encode('cp437', 'replace').decode('cp437'))
Bla ? großes
edit:
@Mark Ransom: since I print a lot, this makes the code pretty bloated, I feel :/
@eryksun: exactly what I was looking for. Thanks a lot!
To print Unicode characters that can't be represented in the console codepage, you could use the win-unicode-console Python package, which uses Unicode APIs such as ReadConsoleW/WriteConsoleW to read/write Unicode from/to the Windows console directly:
#!/usr/bin/env python3
import win_unicode_console
win_unicode_console.enable()
try:
    print('Bla \u2013 großes')
finally:
    win_unicode_console.disable()
Save it to a file named test_unicode.py and run it:
C:\> py test_unicode.py
You should see:
Bla – großes
As a preferred alternative, you could use the run module (included in the package) to run an ordinary script with Unicode support enabled in the Windows console:
C:\> py -m run unmodified_script_that_prints_unicode.py
To install the win_unicode_console module, run:
C:\> pip install win-unicode-console
Make sure to select a font able to display Unicode characters in the Windows console.
To save the output of a Python script to a file, you could use the PYTHONIOENCODING environment variable:
C:\> set PYTHONIOENCODING=utf-8:backslashreplace
C:\> py unmodified_script_that_prints_unicode.py >output_utf8.txt
Do not hardcode the character encoding of your environment inside your script; print Unicode instead. The examples show that the same script may be used to print to the console and to a file, using different encodings and different methods.
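If installing a package is not an option, a rough fallback (a sketch of the asker's manual approach, not part of win-unicode-console) is to catch the error once in a helper:
import sys

def print_safe(text):
    # Try a normal print first; on legacy consoles fall back to '?' replacements.
    try:
        print(text)
    except UnicodeEncodeError:
        encoding = sys.stdout.encoding or 'ascii'
        print(text.encode(encoding, 'replace').decode(encoding))

print_safe('Bla \u2013 großes')  # prints 'Bla ? großes' on a cp437 console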
An alternative solution is to avoid the crippled Windows console for general Unicode output. Tk text widgets (accessed as tkinter Text instances) handle all BMP characters as long as the selected font does.
Since IDLE uses tkinter, it can as well. Running an IDLE editor file (call it tem.py) containing
print('Bla \u2013 großes')
prints the following in the Shell window.
Bla – großes
A file can be run through IDLE from the console with -m and -r:
C:\>python -m idlelib -r c:/programs/python34/tem.py
This opens a Shell window and prints the same as above. Or you can create your own Tk window with a Label or Text widget.
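A minimal sketch of that last option (window title and padding are arbitrary choices):
import tkinter as tk

# Show the problematic string in a Tk window instead of the console.
root = tk.Tk()
root.title('Unicode output')
tk.Label(root, text='Bla \u2013 großes').pack(padx=20, pady=20)
root.mainloop()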

Character encoding problems

I attempted to convert a file I wrote in Vim to UTF-8. Vim defaulted the encoding to us-ascii. I ran this command: recode UTF-8 [filename]. It reported no errors, but when I run file -i [filename] it still says the encoding is ASCII. Is this a known error or the expected result? Thanks in advance :-)
If your file contains only ASCII characters, there is no difference between the ASCII-encoded and the UTF-8-encoded file, because for ASCII characters the UTF-8 encoding is byte-for-byte identical to ASCII.
But if your file contains some non-ASCII character, you will see the difference.
Vim's "fileencodings" option is the list of encodings Vim tries when detecting a file's encoding, and it may list "ascii" before "utf-8". If the file can be read as ASCII, the later utf-8 entry is never tried, even though utf-8 would also be correct.

encoding problem?

I work with txt files, and I recently found e.g. these characters in a few of them:
http://pastebin.com/raw.php?i=Bdj6J3f4
What could these characters be? A wrong character encoding? I just want to use normal UTF-8 TXT files, but when I use:
iconv -t UTF-8 input.txt > output.txt
it's still the same.
When I open the files in gedit and copy+paste their contents into other txt files, there are no characters like the ones in the pastebin. So gedit can solve this problem; it encodes the TXT files well. But there are too many txt files.
Why are there http://pastebin.com/raw.php?i=Bdj6J3f4 -like chars in the text files? Can they be converted to "normal" chars? I can't see e.g. the "Ì" char when I open the files with vim, only after I "work with them" (e.g. with awk, etc.).
It would help if you posted the actual binary content of your file (perhaps by using the output of od -t x1). The pastebin returns this as HTML:
"Ì"
"Ã"
"é"
The first line corresponds to U+00C3 U+0152. The last line corresponds to U+00C3 U+00A9, which is the string "\u00e9" ("é") encoded in UTF-8 as "\xc3\xa9", with the UTF-8 bytes reinterpreted as Latin-1.
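The round trip is easy to demonstrate; this Python 3 sketch shows how "é" turns into "é" when its UTF-8 bytes are reinterpreted as Latin-1:
# UTF-8 bytes of 'é' misread as Latin-1 produce the mojibake 'Ã©'.
print('é'.encode('utf-8').decode('latin-1'))  # Ã©

# Reversing the mistake recovers the original character:
print('Ã©'.encode('latin-1').decode('utf-8'))  # é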
From man iconv:
The iconv program converts text from one encoding to another encoding. More precisely, it converts from the encoding given for the -f option to the encoding given for the -t option. Either of these encodings defaults to the encoding of the current locale.
Because you didn't specify the -f option, iconv assumes the file is encoded in your current locale's encoding (probably UTF-8), which apparently is not true. Your text editors (gedit, vim) do some encoding detection; you can check which encoding they detect (I don't know how, as I don't use either of them) and pass that as the -f option to iconv (or save the open file with your desired encoding using one of those editors).
You can also use a tool for encoding detection, like the Python chardet module:
$ python -c "import chardet as c; print(c.detect(open('file.txt', 'rb').read(4096)))"
{'confidence': 0.7331842298102511, 'encoding': 'ISO-8859-2'}
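Once you know the source encoding, the conversion itself is one read/write pass. A sketch (the file names are placeholders, and ISO-8859-2 is just the encoding chardet guessed above), equivalent to iconv -f ISO-8859-2 -t UTF-8 file.txt > file_utf8.txt:
src_encoding = 'ISO-8859-2'  # whatever chardet or your editor reported

with open('file.txt', encoding=src_encoding) as src:
    text = src.read()
with open('file_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)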
Solved!
How:
I just right-clicked the folders containing the TXT files and pasted them into another folder... :O and presto, there are no more ugly chars.
