I'm trying to get a very simple example running on OS X with Python 3.5.1, but I'm really stuck. I have read many articles that deal with similar problems, but I cannot fix this myself. Do you have any hints on how to resolve this issue?
I would like to get the correctly encoded latin-1 output, as defined in mylist, without any errors.
My code:
# coding=<latin-1>
mylist = [u'Glück', u'Spaß', u'Ähre',]
print(mylist)
The error:
Traceback (most recent call last):
File "/Users/abc/test.py", line 4, in <module>
print(mylist)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 4: ordinal not in range(128)
Here is how I can avoid the error, although the output on stdout (print) is still wrong:
mylist = [u'Glück', u'Spaß', u'Ähre',]
for w in mylist:
    print(w.encode("latin-1"))
What I get as output:
b'Gl\xfcck'
b'Spa\xdf'
b'\xc4hre'
What 'locale' shows me:
LANG="de_AT.UTF-8"
LC_COLLATE="de_AT.UTF-8"
LC_CTYPE="de_AT.UTF-8"
LC_MESSAGES="de_AT.UTF-8"
LC_MONETARY="de_AT.UTF-8"
LC_NUMERIC="de_AT.UTF-8"
LC_TIME="de_AT.UTF-8"
LC_ALL=
What python3 shows me:
Python 3.5.1 (default, Jan 22 2016, 08:54:32)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Try running your script with the PYTHONIOENCODING environment variable explicitly set:
PYTHONIOENCODING=utf-8 python3 script.py
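To see which codec Python is actually using for standard output (and confirm whether it is the ascii codec named in the traceback), you can inspect sys.stdout.encoding; a minimal check:

```python
import sys

# The codec Python uses when print() writes to stdout. If this reports
# 'ascii' (or 'US-ASCII'), non-ASCII characters such as ü cannot be
# printed directly, which produces the UnicodeEncodeError above.
print(sys.stdout.encoding)
```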
Remove the characters < and >:
# coding=latin-1
Those characters are often used in examples to indicate where the encoding name goes, but the literal characters < and > must not appear in your file.
For that to work, your file must be encoded using latin-1. If your file is actually encoded using utf-8, the encoding line should be
# coding=utf-8
For example, when I run this script (saved as a file with latin-1 encoding):
# coding=latin-1
mylist = [u'Glück', u'Spaß', u'Ähre',]
print(mylist)
for w in mylist:
    print(w.encode("latin-1"))
I get this output (with no errors):
['Glück', 'Spaß', 'Ähre']
b'Gl\xfcck'
b'Spa\xdf'
b'\xc4hre'
That output looks correct. For example, the latin-1 encoding of ü is '\xfc'.
I used my editor to save the file with latin-1 encoding. The contents of the file in hexadecimal are:
$ hexdump -C codec-question.py
00000000 23 20 63 6f 64 69 6e 67 3d 6c 61 74 69 6e 2d 31 |# coding=latin-1|
00000010 0a 0a 6d 79 6c 69 73 74 20 3d 20 5b 75 27 47 6c |..mylist = [u'Gl|
00000020 fc 63 6b 27 2c 20 75 27 53 70 61 df 27 2c 20 75 |.ck', u'Spa.', u|
00000030 27 c4 68 72 65 27 2c 5d 0a 70 72 69 6e 74 28 6d |'.hre',].print(m|
00000040 79 6c 69 73 74 29 0a 0a 66 6f 72 20 77 20 69 6e |ylist)..for w in|
00000050 20 6d 79 6c 69 73 74 3a 0a 20 20 20 20 70 72 69 | mylist:. pri|
00000060 6e 74 28 77 2e 65 6e 63 6f 64 65 28 22 6c 61 74 |nt(w.encode("lat|
00000070 69 6e 2d 31 22 29 29 0a |in-1")).|
00000078
Note that the first byte (represented in hexadecimal) in the third line (i.e. the character at position 0x20) is fc. That is the latin-1 encoding of ü. If the file was encoded using utf-8, the character ü would be represented using two bytes, c3 bc.
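The difference is easy to confirm from Python itself; the two encodings produce exactly the bytes described above:

```python
# latin-1 packs ü into a single byte, while utf-8 needs two.
print('ü'.encode('latin-1'))  # b'\xfc'
print('ü'.encode('utf-8'))    # b'\xc3\xbc'
```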
If you are facing this problem while reading or writing a file, then try this:
import codecs

# File read
with codecs.open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# File write
with codecs.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
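Note that in Python 3 the codecs module is not required for this; the built-in open() accepts an encoding argument directly. A small self-contained sketch (the filename is hypothetical):

```python
import os
import tempfile

# Hypothetical file used only for this demonstration.
filename = os.path.join(tempfile.gettempdir(), 'demo.txt')

# File write
with open(filename, 'w', encoding='utf8') as f:
    f.write('Glück')

# File read
with open(filename, 'r', encoding='utf8') as f:
    text = f.read()

print(text)  # Glück
```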
Related
I'm trying to render the output of a linux shell command in HTML. For example, systemctl status mysql looks like this in my terminal:
Based on Floz'z Misc, I was expecting the underlying character stream to contain control codes. But looking at it in, say, hexyl (systemctl status mysql | hexyl), I can't see any codes:
Looking near the bottom, on lines 080 and 090 where the text "Active: failed" is displayed, I was hoping to find some control sequences that change the color to red. The bytes aren't necessarily ASCII, but I used ASCII tables to help me:
Looking at the second group of 8 bytes on line 090, where the letters "ive: fa" are displayed, I find:
69 = i
76 = v
65 = e
3a = :
20 = space
66 = f
61 = a
69 = i
There are no bytes for control sequences.
I wondered if hexyl was choosing not to display them, so I wrote a Java program that outputs the raw bytes after executing the process through a shell, and the results are the same: no control sequences.
The Java is roughly:
p = Runtime.getRuntime().exec(new String[]{"/bin/sh", "-c", "systemctl status mysql"}); // runs in the shell
p.waitFor();
byte[] bytes = p.getInputStream().readAllBytes();
for (byte b : bytes) {
    System.out.println(b + "\t" + ((char) b));
}
That outputs:
...
32
32
32
32
32
65 A
99 c
116 t
105 i
118 v
101 e
58 :
32
102 f
97 a
105 i
108 l
101 e
100 d
...
So the question is: How does bash know that it has to display the word "failed" red?
systemctl detects that its output is not a terminal, and it removes the color codes from the output.
Related: Detect if stdin is a terminal or pipe? , https://unix.stackexchange.com/questions/249723/how-to-trick-a-command-into-thinking-its-output-is-going-to-a-terminal , https://superuser.com/questions/1042175/how-do-i-get-systemctl-to-print-in-color-when-being-interacted-with-from-a-non-t
Some tools (not all) come with options to force color codes on, such as ls --color=always and grep --color=always; in the case of systemd it is the SYSTEMD_COLORS environment variable.
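The terminal check itself is a one-liner in most languages. A minimal Python sketch of how a tool might make the decision (the colorize helper and the plain-green sequence are illustrative, not what systemd actually emits):

```python
import sys

def colorize(text, force=False):
    # Emit ANSI green only when stdout is a terminal, or when forced
    # (the equivalent of flags like --color=always).
    if force or sys.stdout.isatty():
        return '\x1b[32m' + text + '\x1b[0m'
    return text

print(colorize('active (running)'))
```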
What tool can I use to see them?
You can use hexyl to see them.
How does bash know that it has to mark the word "failed" red?
Bash is the shell; it is completely unrelated here.
Your terminal (the graphical window in which you are viewing the output) knows to mark it red because of ANSI escape sequences in the output. There is no interaction with Bash.
$ SYSTEMD_COLORS=1 systemctl status dbus.service | grep runn | hexdump -C
00000000 20 20 20 20 20 41 63 74 69 76 65 3a 20 1b 5b 30 | Active: .[0|
00000010 3b 31 3b 33 32 6d 61 63 74 69 76 65 20 28 72 75 |;1;32mactive (ru|
00000020 6e 6e 69 6e 67 29 1b 5b 30 6d 20 73 69 6e 63 65 |nning).[0m since|
00000030 20 53 61 74 20 32 30 32 32 2d 30 31 2d 30 38 20 | Sat 2022-01-08 |
00000040 31 39 3a 35 37 3a 32 35 20 43 45 54 3b 20 35 20 |19:57:25 CET; 5 |
00000050 64 61 79 73 20 61 67 6f 0a |days ago.|
00000059
I want to use https://pypi.org/project/pyclibrary/ to parse some .h files.
Some of those .h files are unfortunately not UTF-8 encoded - Notepad++ tells me they are "ANSI" encoded (and as they originate on Windows, I guess that means CP-1252? Not sure ...)
Anyways, I can reduce the problem to this example:
mytest.h:
/*******************************************************
Just a test header file
© Copyright myself
*******************************************************/
#ifndef _MY_TEST_
#define _MY_TEST_
#endif
The tricky part here is the copyright character - and just to make sure, here is a hexdump of this:
$ hexdump -C mytest.h
00000000 2f 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |/***************|
00000010 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00000030 2a 2a 2a 2a 2a 2a 2a 2a 0d 0a 4a 75 73 74 20 61 |********..Just a|
00000040 20 74 65 73 74 20 68 65 61 64 65 72 20 66 69 6c | test header fil|
00000050 65 0d 0a a9 20 43 6f 70 79 72 69 67 68 74 20 6d |e... Copyright m|
00000060 79 73 65 6c 66 0d 0a 2a 2a 2a 2a 2a 2a 2a 2a 2a |yself..*********|
00000070 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00000090 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2f 0d |**************/.|
000000a0 0a 0d 0a 23 69 66 6e 64 65 66 20 5f 4d 59 5f 54 |...#ifndef _MY_T|
000000b0 45 53 54 5f 0d 0a 23 64 65 66 69 6e 65 20 5f 4d |EST_..#define _M|
000000c0 59 5f 54 45 53 54 5f 0d 0a 23 65 6e 64 69 66 0d |Y_TEST_..#endif.|
000000d0 0a |.|
000000d1
And then I try this Python script:
mytest.py
#!/usr/bin/env python3
import sys, os
from pyclibrary import CParser
myhfile = "mytest.h"
c_parser = CParser([myhfile])
print(c_parser)
When I run this, I get:
$ python3 mytest.py
Traceback (most recent call last):
File "mytest.py", line 7, in <module>
c_parser = CParser([myhfile])
File "/usr/lib/python3.8/site-packages/pyclibrary/c_parser.py", line 443, in __init__
self.load_file(f, replace)
File "/usr/lib/python3.8/site-packages/pyclibrary/c_parser.py", line 678, in load_file
self.files[path] = fd.read()
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 83: invalid start byte
... and I guess the "byte 0xa9 in position 83" is the copyright character. So, the way I see it:
I don't really have an option to choose the file encoding in pyclibrary - but I don't want to hack pyclibrary either
I don't really want to edit the .h files either, and make them UTF-8 compatible
... and so, the only thing I can think of, is to change the default encoding of Python (while opening files) to ANSI/CP-1252/whatever, only for the call to c_parser = CParser([myhfile]) - and then restore the default UTF-8.
Is this possible to do somehow? I have seen Changing default encoding of Python? - but most of the answers there seem to imply, that you better just change the default encoding once, at the start of the script - I cannot find any references to changing the default encoding temporarily, and then restoring the original UTF-8 default later.
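An alternative that avoids touching global state at all is to transcode the header to a UTF-8 temporary file and hand that copy to CParser. A sketch; transcode_to_utf8 and the cp1252 default are my assumptions, not part of pyclibrary:

```python
import os
import tempfile

def transcode_to_utf8(path, src_encoding='cp1252'):
    """Write a UTF-8 copy of `path` decoded from `src_encoding`
    and return the temporary file's name."""
    with open(path, 'r', encoding=src_encoding) as f:
        text = f.read()
    fd, tmp = tempfile.mkstemp(suffix='.h')
    with os.fdopen(fd, 'w', encoding='utf8') as f:
        f.write(text)
    return tmp
```

CParser([transcode_to_utf8(myhfile)]) would then read a clean UTF-8 file, at the cost of error positions referring to the copy rather than the original.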
OK, I think I got it. I found this thread: Windows Python: Changing encoding using the locale module, and got inspired to try the locale package. Note that I'm working in MSYS2 bash on Windows, and as such I use the MSYS2 Python 3. So now the file is:
mytest.py
#!/usr/bin/env python3
import sys, os
import locale
from pyclibrary import CParser
import pprint
myhfile = "mytest.h"
print( locale.getlocale() ) # ('en_US', 'UTF-8')
#pprint.pprint(locale.locale_alias)
locale.setlocale( locale.LC_ALL, 'en_US.ISO8859-1' )
c_parser = CParser([myhfile])
print(c_parser)
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
print( locale.getlocale() ) # ('en_US', 'UTF-8')
... and running this produces:
$ python3 mytest.py
('en_US', 'UTF-8')
============== types ==================
{}
============== variables ==================
{}
============== fnmacros ==================
{}
============== macros ==================
{'_MY_TEST_': ''}
============== structs ==================
{}
============== unions ==================
{}
============== enums ==================
{}
============== functions ==================
{}
============== values ==================
{'_MY_TEST_': None}
('en_US', 'UTF-8')
Well - this looks OK to me ...
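One refinement worth considering: if CParser raises, the script above never reaches the second setlocale call, so the UTF-8 locale is not restored. A small context manager (my own sketch, not part of pyclibrary) makes the change self-restoring:

```python
import locale
from contextlib import contextmanager

@contextmanager
def temp_locale(category, name):
    # Remember the current locale, switch, and always switch back,
    # even if the body raises an exception.
    saved = locale.setlocale(category, None)
    try:
        yield locale.setlocale(category, name)
    finally:
        locale.setlocale(category, saved)
```

Usage would be: with temp_locale(locale.LC_ALL, 'en_US.ISO8859-1'): c_parser = CParser([myhfile]). The original locale returns on exit, exception or not.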
I'm using the following code to generate a random number in Groovy. I can run it in, e.g., the Groovy Web Console (https://groovyconsole.appspot.com/) and it works; however, it fails when I try to run it in Mule. Here is the code I use:
log.info ">>run"
Random random = new Random()
def ranInt = random.nextInt()
def ran = Math.abs(ranInt)%200;
log.info ">>sleep counter:"+flowVars.counter+" ran: "+ran
sleep(ran)
And here is an exception that gets thrown:
Caused by:
org.codehaus.groovy.control.MultipleCompilationErrorsException:
startup failed: Script26.groovy: 9: expecting EOF, found '?' # line 9,
column 25. def ran = Math.abs(?400)?%20?0;
^
1 error
You have some extra Unicode characters in line 4. If you convert the line to hex, you get:
64 65 66 20 72 61 6e 20 3d 20 4d 61 74 68 2e 61 62 73 28 e2 80 8b 72 61 6e 49 6e 74 29 e2 80 8b 25 32 30 e2 80 8b 30 3b
Now if you convert this hex back to text, you get:
def ran = Math.abs(​ranInt)​%20​0;
There is a zero-width space (U+200B, the UTF-8 byte sequence e2 80 8b) after the first (, after ), and after the first 0. If you remove these characters, your code will compile correctly.
Here is the hex of the corrected line:
64 65 66 20 72 61 6e 20 3d 20 4d 61 74 68 2e 61 62 73 28 72 61 6e 49 6e 74 29 25 32 30 30 3b
And the line itself:
def ran = Math.abs(ranInt)%200;
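If you want to clean such a file automatically rather than by hand, stripping U+200B is a one-line transformation; a Python sketch (the sample string reproduces the broken line above):

```python
# The offending character is U+200B (ZERO WIDTH SPACE),
# encoded in UTF-8 as the bytes e2 80 8b.
line = 'def ran = Math.abs(\u200branInt)\u200b%20\u200b0;'
cleaned = line.replace('\u200b', '')
print(cleaned)  # def ran = Math.abs(ranInt)%200;
```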
I am getting a telnet response as shown below when trying it with Node.js code. It is actually in XML format. When I call telnet directly, the correct XML format is delivered. Can someone please help: why do I get a 'Buffer' response when calling via Node.js?
"Buffer 3c 6e 66 6c 2d 65 76 65 6e 74 3e 0d 0a 20 3c 67 61 6d 65 63 6f 64 65 20
3 6f 64 65 3d 22 32 30 31 37 31 31 33 30 30 30 36 22 20 67 6c 6f 62 61 6c 2d .. "
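The Buffer is not an error; Node.js returns raw socket data as bytes, and you get the XML text by decoding it (buf.toString('utf8') in Node). Decoding the first line of the dump above, sketched here in Python, shows the start of the document:

```python
# The hex bytes from the Buffer shown above, decoded as utf-8.
raw = bytes.fromhex(
    '3c 6e 66 6c 2d 65 76 65 6e 74 3e 0d 0a 20'
    '3c 67 61 6d 65 63 6f 64 65 20'
)
print(raw.decode('utf8'))  # <nfl-event> then CRLF, then " <gamecode "
```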
I have a bash file with the content
cd /var/www/path/to/folder
git pull
When I run it I get
: No such file or directorywww/path/to/folder
' is not a git command. See 'git --help'.
Did you mean this?
pull
Any idea why bash gets a truncated version of commands?
You have carriage returns (Windows text file line endings) in your bash script. Remove them.
The bash file should look like this under hexdump -C:
00000000 63 64 20 2f 76 61 72 2f 77 77 77 2f 70 61 74 68 |cd /var/www/path|
00000010 2f 74 6f 2f 66 6f 6c 64 65 72 0a 67 69 74 20 70 |/to/folder.git p|
00000020 75 6c 6c 0a |ull.|
00000024
But yours looks like this instead:
00000000 63 64 20 2f 76 61 72 2f 77 77 77 2f 70 61 74 68 |cd /var/www/path|
00000010 2f 74 6f 2f 66 6f 6c 64 65 72 0d 0a 67 69 74 20 |/to/folder..git |
00000020 70 75 6c 6c 0d 0a |pull..|
Note the extra 0d's (hex 0D = decimal 13 = ASCII carriage return, ANSI \r) in front of the 0as (hex 0A = decimal 10 = ASCII linefeed, ANSI \n, which is what bash treats as the end of a line).
A carriage return is not whitespace in bash, so it is treated as part of the last argument on the command line. You're getting errors because the folder /var/www/path/to/folder.git\r doesn't exist and pull\r isn't a valid git subcommand.
When printed, a carriage return moves the cursor to the start of the line, which is why your error messages look wrong. Bash and git are printing something like foo.bash: line 1: cd: /www/path/to/folder\r: No such file or directory and git: 'pull\r' is not a git command. See 'git --help', but after the \r moves the cursor to the start of the line, the tail end of each message overwrites its beginning.
There's a program called dos2unix that converts a text file from DOS to Unix line endings (by default it modifies the file in place):
dos2unix filename
But that conversion really consists of nothing but deleting the carriage returns, which you could also do explicitly with tr:
tr -d '\r' <filename >newfilename
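The same deletion is easy to express in other tools as well; for instance, a Python sketch equivalent to the tr command:

```python
def strip_carriage_returns(data: bytes) -> bytes:
    # Equivalent to: tr -d '\r'
    return data.replace(b'\r', b'')

print(strip_carriage_returns(b'cd /var/www/path/to/folder\r\ngit pull\r\n'))
```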