Why does python socket.send() act differently on Windows and Linux?

I'm trying to send a message with non-ASCII characters over a socket using Python 2.7 (inside a C++ program called QGIS) to a Windows machine. The following code works well from a Linux client machine, but does not work if I use a Windows client machine. Needless to say, I must make it work on both systems...
# -*- coding: utf-8 -*-
import socket
# message with non ASCII characters
message = u'Troço'
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('192.168.1.69',9991))
s.send(message)
s.close()
Like I said before, this works well on a Linux machine, and the unicode reaches the socket receiver as the right message. But if I use it on a Windows machine, I get a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
4: ordinal not in range(128).
I have read several pages and answers on this topic:
Sending UTF-8 with sockets
https://docs.python.org/2/library/socket.html#socket.socket.recv
How to handle Unicode (non-ASCII) characters in Python?
They all seem to say that I must encode my unicode basestring message to a string before sending it, but if I replace the s.send(message) line with the following:
s.send(message.encode('utf-8'))
then, although I don't get any error message on either Windows or Linux, the received message looks wrong at that particular character, which makes me think that it was incorrectly (or double) encoded somewhere, and it can only be inside the send() method.
Which makes me think: is the socket.send() method affected by the operating system, or even by the operating system's default encoding?
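For what it's worth, here is a short sketch (assuming, as the update below confirms, that the receiver decodes whatever arrives as UTF-16) of why the UTF-8 bytes show up as the wrong characters:
# -*- coding: utf-8 -*-
# Hypothetical reconstruction: reinterpreting the UTF-8 bytes of the message
# as UTF-16 code units yields unrelated characters, not 'Troço'.
message = u'Troço'
utf8_bytes = message.encode('utf-8')            # 'Tro\xc3\xa7o' (6 bytes)
print(repr(utf8_bytes.decode('utf-16-le')))     # u'\u7254\uc36f\u6fa7' -- mojibake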
UPDATE: Problem solved
The "problem" laid on the receiving code. I have had no access to it, but, after several tries, I realized that it expects an utf-16 encoded message. That's why sending a utf-8 message gave bad results. Therefore, changing line 11 did the trick:
s.send(message.encoding('utf-16'))
I still have no clue on why sending an unicode message worked on linux, but not on windows, but it does not matter, all makes a bit more sense now.
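For reference, a minimal sketch of the working client under that assumption; the host, port and message come from the question above, and the only change is the explicit encode():
# -*- coding: utf-8 -*-
import socket

message = u'Troço'

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('192.168.1.69', 9991))
# encode explicitly to the codec the receiver expects;
# never pass a unicode object straight to send()
s.send(message.encode('utf-16'))
s.close()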

Related

When do you encounter issues with encodings in Python3?

I have recently learned more in depth about ASCII, Unicode, UTF-8, UTF-16, etc. in Python 3, but I am struggling to understand when one would run into issues while reading/writing files.
So if I open a file:
with open(myfile, 'a') as f:
    f.write(stuff)
where stuff = 'Hello World!'
I have no issues writing to a file.
If I have something like:
non_latin = '娜', I can still write to the file with no problems.
So when does one run into issues regarding encodings? When does one use encode() and decode()?
You run into issues if the default encoding for your OS doesn't support the characters written. In your case the default (obtained from locale.getpreferredencoding(False)) is probably UTF-8. On Windows, the default is an ANSI encoding like cp1252 and wouldn't support Chinese. Best to be explicit and use open(myfile,'w',encoding='utf8') for example.
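A short sketch of that advice (the file name is just an example): check what the platform default would be, then bypass it by passing the encoding explicitly:
import locale

# What open() uses when no encoding is given:
print(locale.getpreferredencoding(False))   # typically 'UTF-8' on Linux, 'cp1252' on Windows

non_latin = '娜'

# Being explicit removes the dependency on the platform default
with open('myfile.txt', 'w', encoding='utf-8') as f:
    f.write(non_latin)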

How to output emoji to console in Node.js (on Windows)?

On Windows, there's some basic emoji support in the console, so that I can get a monochrome glyph if I type, e.g. ☕ or 📜. I can output a string from PowerShell or a C# console application or Python and they all show those characters fine enough.
However, from Node.js, I can only get a couple of emoji to display (e.g. ☕), but not others (instead of 📜 I see �). But if I throw a string with those characters, they display correctly.
console.log(' 📜 ☕ ');
throw ' 📜 ☕ ';
If I run the above script, the output is
� ☕
C:\Code\emojitest\emojitest.js:2
throw ' 📜 ☕ ';
^
📜 ☕
Is there any way I can output those emoji correctly without throwing an error? Or does that happen outside of what's available to me through the standard Node.js APIs?
What you want may not be possible without a change to libuv. When you (or the
console) write to stdout or stderr on Windows and the stream is a TTY,
libuv does its own conversion from UTF‑8 to UTF‑16. In doing so it explicitly
refuses to output surrogate pairs, emitting instead the replacement character
U+FFFD � for any codepoint beyond the BMP.
Here’s the culprit in uv/src/win/tty.c:
/* We wouldn't mind emitting utf-16 surrogate pairs. Too bad, the */
/* windows console doesn't really support UTF-16, so just emit the */
/* replacement character. */
if (utf8_codepoint > 0xffff) {
  utf8_codepoint = UNICODE_REPLACEMENT_CHARACTER;
}
The throw message appears correctly because Node lets Windows do the
conversion from UTF‑8 to UTF‑16 with MultiByteToWideChar() (which does emit
surrogate pairs) before writing the message to the console. (See
PrintErrorString() in src/node.cc.)
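To make the BMP cutoff concrete, here is a small sketch (in Python 3, simply because it is quick to demonstrate; the byte values are what any UTF-16 encoder would produce):
# The scroll emoji lies above U+FFFF, so UTF-16 must represent it as a
# surrogate pair; the coffee cup fits in a single 16-bit code unit.
scroll = '\U0001F4DC'   # 📜
coffee = '\u2615'       # ☕

print(hex(ord(scroll)))                   # 0x1f4dc, above the BMP, so libuv emits U+FFFD
print(scroll.encode('utf-16-le').hex())   # '3dd8dcdc' -> surrogate pair D83D DCDC
print(coffee.encode('utf-16-le').hex())   # '1526' -> single code unit 2615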
Note: A pull request has been submitted to resolve this issue.
(Disclaimer: I don't have a solution; I explored what makes exception handling special with regard to printing emoji, with the tools I have on Windows 10. With some luck that might shed some light on the issue, and perhaps someone will recognize something and come up with a solution.)
It looks like Node's exception-reporting code for Windows calls a different Windows API that happens to support Unicode better.
Let's see with Node 7.10 sources:
ReportException → AppendExceptionLine → PrintErrorString
In PrintErrorString, the Windows-specific section detects output type (tty/console or not):
- For non-tty/console context it will print to stderr (e.g. if you redirect to a file)
- In a cmd console (with no redirection), it will convert text with MultiByteToWideChar() and then pass that to WriteConsoleW().
If I run your program using ConEmu (easier than getting standard cmd to work with Unicode & emoji -- yes, I got a bit lazy here), I see something similar to what you saw: console.log fails to print the emoji, but the emoji in the exception message are printed OK (even the scroll glyph).
If I redirect all output to a file (node test.js > out.txt 2>&1, yes that works in Windows cmd too), I get "clean" Unicode in both cases.
So it seems that when a program prints to stdout or stderr in a Windows console, the console does some (bad) re-encoding work before printing. When the program uses the Windows console API directly (doing the conversion itself with MultiByteToWideChar() and then writing to the console with WriteConsoleW()), the console shows the glorious unaltered emoji.
When a JS program uses the console API to log stuff, maybe Node could try (on Windows) to detect the console and do the same thing it does when reporting exceptions. See Brian Nixon's answer, which explains what is actually happening in libuv.
The next "Windows Terminal" (from Kayla Cinnamon) and the Microsoft/Terminal project should be able to display emojis.
This will be available starting June 2019.
Through the use of the Consolas font, partial Unicode support will be provided.
The request is in progress in Microsoft/Terminal issue 387.
And Microsoft/Terminal issue 190 formally demands "Add emoji support to Windows Console".
But there are still issues (March 2019):
I updated my Win10 from 1803 to 1809 several days ago, and now all characters >= U+10000 (UTF-8 with 4 bytes or more) no longer display.
I have also tried the newest insider version(Windows 10 Insider Preview 18358.1 (19h1_release)), unfortunately, this bug still exists.

√ not recognized in Terminal

For a class of mine I have to make a very basic calculator. I want to write the code in such a way that the user can just enter what they want to do (e.g. √64), press return, and get the answer. I wrote this:
if '√' in operation:
    squareRoot = operation.replace('√', '')
    squareRootFinal = math.sqrt(int(squareRoot))
When I run this in IDLE, it works like a charm. However when I run this in Terminal I get the following error:
SyntaxError: Non-ASCII character '\xe2' in file x.py on line 50, but no encoding declared;
any suggestions?
Just declare the encoding. Python is being a bit cautious here and not guessing the encoding of your text file. IDLE is a text editor and so has already guessed the encoding and stored things internally as unicode, which it can pass directly to the Python interpreter.
Put
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
at the top of the file. (It's pretty unlikely nowadays that your encoding is not UTF-8.)
For reference, in past times files had a variety of different possible encodings. That means that the same text could be stored in different ways in binary, when written to disk. Almost all encodings have the same interpretation of bytes 0 to 127—the ASCII subset. But if any other bytes occur in the file, their meaning is potentially ambiguous.
However, in recent years, UTF-8 has become by far the most common encoding, so it's almost always a safe guess.
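A tiny illustration of that ambiguity (Python 3): the '√' character has no ASCII representation at all, and its UTF-8 form begins with the very 0xE2 byte named in the error:
s = '√'
print(s.encode('utf-8'))                              # b'\xe2\x88\x9a'
print(s.encode('ascii', errors='backslashreplace'))   # b'\\u221a' -- no ASCII byte can hold it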

How to convert Linux Python 3.4 code with national characters into executable code on Windows

My national language is Polish.
I've got a program in Python 3.4 which I wrote on Linux. The program mostly works on text, Polish text. Variable names, of course, don't have any special characters, but sometimes I put strings with Polish characters into them, the user will input strings with Polish characters from the keyboard, and my program reads from files containing strings with Polish characters.
Everything works well on Linux. I didn't think about encoding; it just worked. But now I want to make it work on Windows. Can you help me understand what I should actually do to make this transition?
Or maybe there is a workaround - I just need to end up with a Windows executable file. The perfect tool for this would be "Pyinstaller", but it works only with Python 2.7, not 3.4. That's why I want to make the program work on Windows and compile it into an executable with py2exe inside VirtualBox. But if someone knows a way to do this from Linux, without these encoding problems, that would be great.
If not, I come back to my question. I tried converting my Python scripts in gedit into ISO or CP1250 or 1252, and I wrote in the file header which coding I'm using. It actually helped a little: now the Windows error points me at the text files from which I read some data, so I converted them too... But it still didn't work.
So I decided it's no longer time for blind trials, and I need to ask for help. I need to understand which encoding is used on Windows and which on Linux, what the best way is to convert one into the other, and how to make the program read characters in the right way.
The best way would be - I guess - not to change anything in the encoding, but just to make Python on Windows understand what encoding I'm using. Is that possible?
A complete answer to my question would be great, but anything that points me in the right direction will also help a lot.
OK. I'm not sure if I understood your answer in the comments, but I tried sending the text to myself via mail, copying it in VirtualBox into Notepad and saving it as UTF-8. I still get this message:
C:\Users\python\Documents>py pytania.py
Traceback (most recent call last):
File "pytania.py", line 864, in <module>
start_probny()
File "pytania.py", line 850, in start_probny
utworzenie_danych()
File "pytania.py", line 740, in utworzenie_danych
utworzenie_pytania_piwo('a')
File "pytania.py", line 367, in utworzenie_pytania_piwo
for line in f: # read one line at a time
File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1134: character maps to <undefined>
As mentioned by Zero Piraeus in a comment: The default source encoding for Python 3.x is UTF-8, regardless of what platform it's running on...
If you have problems, that's probably because your source code has an incorrect encoding. You should stick to UTF-8 only (even though PEP 0263 -- Defining Python Source Code Encodings allows changing it).
The error message you provided is clear:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1134
Python is decoding the file with the cp1252 codec (the Windows default, as the cp1252.py frame in the traceback shows), and it encounters a byte (0x9d) that has no meaning in that code page. To diagnose the problem, use iconv(1) on a Linux machine to detect errors by doing a dummy conversion:
iconv -f utf8 -t iso8859-2 -o /dev/null < test.py
You can try to reproduce the problem by creating a very simple Python file, typically: print("test €uro encoding")
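A minimal sketch of the usual cure on the reading side (the file name here is hypothetical; the point is that passing encoding= explicitly stops open() from falling back to the Windows default cp1252):
# Works the same on Linux and Windows, provided the data files really are UTF-8
with open('pytania_dane.txt', encoding='utf-8') as f:
    for line in f:          # read one line at a time
        print(line.rstrip())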

Open a unicode-named file on Linux from Python

I was trying to open a file on Ubuntu in Python using:
open('<unicode_string>', "wb")
The unicode_string is '\u9879\u76ee\u7ba1\u7406'. It is Chinese text.
But I get the following error:-
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
I am trying to understand what limits this behavior. I went through some links. I understand that the OS's filesystem is responsible for encoding the string. On Windows, the 'mbcs' encoding handles possibly every character. What could be the problem on Linux?
It does not fail on all Linux setups. What should I be checking?
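Not an answer from the thread, but a small diagnostic sketch of what to check: the codec used for file names comes from the locale, so comparing these values on a working and a failing machine is a reasonable first step.
import locale
import sys

print(sys.getfilesystemencoding())          # codec used for file names, e.g. 'utf-8' vs 'ascii'
print(locale.getpreferredencoding(False))   # driven by the LANG / LC_* environment variables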
