How to output emoji to console in Node.js (on Windows)? - node.js

On Windows, there's some basic emoji support in the console, so that I can get a monochrome glyph if I type, e.g. ☕ or 📜. I can output a string from PowerShell or a C# console application or Python and they all show those characters fine enough.
However, from Node.js, I can only get a couple of emoji to display (e.g. ☕), but not other (instead of 📜 I see �). However, if I throw a string with those characters, they display correctly.
console.log(' 📜 ☕ ');
throw ' 📜 ☕ ';
If I run the above script, the output is
� ☕
C:\Code\emojitest\emojitest.js:2
throw ' 📜 ☕ ';
^
📜 ☕
Is there anyway that I can output those emojis correctly without throwing an error? Or is that exception happening outside of what's available to me through the standard Node.js APIs?

What you want may not be possible without a change to libuv. When you (or the
console) write to stdout or stderr on Windows and the stream is a TTY,
libuv does its own conversion from UTF‑8 to UTF‑16. In doing so it explicitly
refuses to output surrogate pairs, emitting instead the replacement character
U+FFFD � for any codepoint beyond the BMP.
Here’s the culprit in uv/src/win/tty.c:
/* We wouldn't mind emitting utf-16 surrogate pairs. Too bad, the */
/* windows console doesn't really support UTF-16, so just emit the */
/* replacement character. */
if (utf8_codepoint > 0xffff) {
utf8_codepoint = UNICODE_REPLACEMENT_CHARACTER;
}
The throw message appears correctly because Node lets Windows do the
conversion from UTF‑8 to UTF‑16 with MultiByteToWideChar() (which does emit
surrogate pairs) before writing the message to the console. (See
PrintErrorString() in src/node.cc.)
Note: A pull request has been submitted to resolve this issue.

(Disclaimer: I don't have a solution, I explored what makes exception handling special with regards to printing emoji, with the tools I have on Windows 10 -- With some luck that might sched some light on the issue, and perhaps someone will recognize something and come up with a solution)
Looks like Node's exception reporting code for Windows calls to a different Windows API, that happens to support Unicode better.
Let's see with Node 7.10 sources:
ReportException
→
AppendExceptionLine
→
PrintErrorString
In PrintErrorString, the Windows-specific section detects output type (tty/console or not):
- For non-tty/console context it will print to stderr (e.g. if you redirect to a file)
- In a cmd console (with no redirection), it will convert text with MultiByteToWideChar() and then pass that to WriteConsoleW().
If I run your program using ConEmu (easier than getting standard cmd to work with unicode & emoji -- yes I got a bit lazy here), I see something similar as what you saw: console.log fails to print emoji, but the emoji in exception message are printed OK (even the scroll glyph).
If I redirect all output to a file (node test.js > out.txt 2>&1, yes that works in Windows cmd too), I get "clean" Unicode in both cases.
So it seems when a program prints to stdout or stderr in a Windows console, the console does some (bad) reencoding work before printing. When the program uses Windows console API directly (doing the conversion itself with MultiByteToWideChar then write to console with WriteConsoleW()), the console shows the glorious unaltered emoji.
When a JS program uses console API to log stuff, maybe Node could try (on Windows) to detect console and do the same thing as it does for reporting exceptions. See #BrianNixon's answer that explains what is actually happening in libuv.

The next "Windows Terminal" (from Kayla Cinnamon) and the Microsoft/Terminal project should be able to display emojis.
This will be available starting June 2019.
Through the use of the Consolas font, partial Unicode support will be provided.
The request is in progress in Microsoft/Terminal issue 387.
And Microsoft/Terminal issue 190 formally demands "Add emoji support to Windows Console".
But there are still issues (March 2019):
I updated my Win10 from 1803 to 1809 several days ago, and now all characters >= U+10000 (UTF-8 with 4 bytes or more) no longer display.
I have also tried the newest insider version(Windows 10 Insider Preview 18358.1 (19h1_release)), unfortunately, this bug still exists.

Related

Special character and encoding handling while running azure CLI commands in PowerShell [duplicate]

I've been forcing the usage of chcp 65001 in Command Prompt and Windows Powershell for some time now, but judging by Q&A posts on SO and several other communities it seems like a dangerous and inefficient solution. Does Microsoft provide an improved / complete alternative to chcp 65001 that can be saved permanently without manual alteration of the Registry? And if there isn't, is there a publicly announced timeline or agenda to support UTF-8 in the Windows CLI in the future?
Personally I've been using chcp 949 for Korean Character Support, but the weird display of the backslash \ and incorrect/incomprehensible displays in several applications (like Neovim), as well as characters that aren't Korean not being supported via 949 seems to become more of a problem lately.
Note:
This answer shows how to switch the character encoding in the Windows console to
(BOM-less) UTF-8 (code page 65001), so that shells such as cmd.exe and PowerShell properly encode and decode characters (text) when communicating with external (console) programs with full Unicode support, and in cmd.exe also for file I/O.[1]
If, by contrast, your concern is about the separate aspect of the limitations of Unicode character rendering in console windows, see the middle and bottom sections of this answer, where alternative console (terminal) applications are discussed too.
Does Microsoft provide an improved / complete alternative to chcp 65001 that can be saved permanently without manual alteration of the Registry?
As of (at least) Windows 10, version 1903, you have the option to set the system locale (language for non-Unicode programs) to UTF-8, but the feature is still in beta as of this writing.
To activate it:
Run intl.cpl (which opens the regional settings in Control Panel)
Follow the instructions in the screen shot below.
This sets both the system's active OEM and the ANSI code page to 65001, the UTF-8 code page, which therefore (a) makes all future console windows, which use the OEM code page, default to UTF-8 (as if chcp 65001 had been executed in a cmd.exe window) and (b) also makes legacy, non-Unicode GUI-subsystem applications, which (among others) use the ANSI code page, use UTF-8.
Caveats:
If you're using Windows PowerShell, this will also make Get-Content and Set-Content and other contexts where Windows PowerShell default so the system's active ANSI code page, notably reading source code from BOM-less files, default to UTF-8 (which PowerShell Core (v6+) always does). This means that, in the absence of an -Encoding argument, BOM-less files that are ANSI-encoded (which is historically common) will then be misread, and files created with Set-Content will be UTF-8 rather than ANSI-encoded.
[Fixed in PowerShell 7.1] Up to at least PowerShell 7.0, a bug in the underlying .NET version (.NET Core 3.1) causes follow-on bugs in PowerShell: a UTF-8 BOM is unexpectedly prepended to data sent to external processes via stdin (irrespective of what you set $OutputEncoding to), which notably breaks Start-Job - see this GitHub issue.
Not all fonts speak Unicode, so pick a TT (TrueType) font, but even they usually support only a subset of all characters, so you may have to experiment with specific fonts to see if all characters you care about are represented - see this answer for details, which also discusses alternative console (terminal) applications that have better Unicode rendering support.
As eryksun points out, legacy console applications that do not "speak" UTF-8 will be limited to ASCII-only input and will produce incorrect output when trying to output characters outside the (7-bit) ASCII range. (In the obsolescent Windows 7 and below, programs may even crash).
If running legacy console applications is important to you, see eryksun's recommendations in the comments.
However, for Windows PowerShell, that is not enough:
You must additionally set the $OutputEncoding preference variable to UTF-8 as well: $OutputEncoding = [System.Text.UTF8Encoding]::new()[2]; it's simplest to add that command to your $PROFILE (current user only) or $PROFILE.AllUsersCurrentHost (all users) file.
Fortunately, this is no longer necessary in PowerShell Core, which internally consistently defaults to BOM-less UTF-8.
If setting the system locale to UTF-8 is not an option in your environment, use startup commands instead:
Note: The caveat re legacy console applications mentioned above equally applies here. If running legacy console applications is important to you, see eryksun's recommendations in the comments.
For PowerShell (both editions), add the following line to your $PROFILE (current user only) or $PROFILE.AllUsersCurrentHost (all users) file, which is the equivalent of chcp 65001, supplemented with setting preference variable $OutputEncoding to instruct PowerShell to send data to external programs via the pipeline in UTF-8:
Note that running chcp 65001 from inside a PowerShell session is not effective, because .NET caches the console's output encoding on startup and is unaware of later changes made with chcp; additionally, as stated, Windows PowerShell requires $OutputEncoding to be set - see this answer for details.
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
For example, here's a quick-and-dirty approach to add this line to $PROFILE programmatically:
'$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding' + [Environment]::Newline + (Get-Content -Raw $PROFILE -ErrorAction SilentlyContinue) | Set-Content -Encoding utf8 $PROFILE
For cmd.exe, define an auto-run command via the registry, in value AutoRun of key HKEY_CURRENT_USER\Software\Microsoft\Command Processor (current user only) or HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor (all users):
For instance, you can use PowerShell to create this value for you:
# Auto-execute `chcp 65001` whenever the current user opens a `cmd.exe` console
# window (including when running a batch file):
Set-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' AutoRun 'chcp 65001 >NUL'
Optional reading: Why the Windows PowerShell ISE is a poor choice:
While the ISE does have better Unicode rendering support than the console, it is generally a poor choice:
First and foremost, the ISE is obsolescent: it doesn't support PowerShell (Core) 7+, where all future development will go, and it isn't cross-platform, unlike the new premier IDE for both PowerShell editions, Visual Studio Code, which already speaks UTF-8 by default for PowerShell Core and can be configured to do so for Windows PowerShell.
The ISE is generally an environment for developing scripts, not for running them in production (if you're writing scripts (also) for others, you should assume that they'll be run in the console); notably, with respect to running code, the ISE's behavior is not the same as that of a regular console:
Poor support for running external programs, not only due to lack of supporting interactive ones (see next point), but also with respect to:
character encoding: the ISE mistakenly assumes that external programs use the ANSI code page by default, when in reality it is the OEM code page. E.g., by default this simple command, which tries to simply pass a string echoed from cmd.exe through, malfunctions (see below for a fix):
cmd /c echo hü | Write-Output
Inappropriate rendering of stderr output as PowerShell errors: see this answer.
The ISE dot-sources script-file invocations instead of running them in a child scope (the latter is what happens in a regular console window); that is, repeated invocations run in the very same scope. This can lead to subtle bugs, where definitions left behind by a previous run can affect subsequent ones.
As eryksun points out, the ISE doesn't support running interactive external console programs, namely those that require user input:
The problem is that it hides the console and redirects the process output (but not input) to a pipe. Most console applications switch to full buffering when a file is a pipe. Also, interactive applications require reading from stdin, which isn't possible from a hidden console window. (It can be unhidden via ShowWindow, but a separate window for input is clunky.)
If you're willing to live with that limitation, switching the active code page to 65001 (UTF-8) for proper communication with external programs requires an awkward workaround:
You must first force creation of the hidden console window by running any external program from the built-in console, e.g., chcp - you'll see a console window flash briefly.
Only then can you set [console]::OutputEncoding (and $OutputEncoding) to UTF-8, as shown above (if the hidden console hasn't been created yet, you'll get a handle is invalid error).
[1] In PowerShell, if you never call external programs, you needn't worry about the system locale (active code pages): PowerShell-native commands and .NET calls always communicate via UTF-16 strings (native .NET strings) and on file I/O apply default encodings that are independent of the system locale. Similarly, because the Unicode versions of the Windows API functions are used to print to and read from the console, non-ASCII characters always print correctly (within the rendering limitations of the console).
In cmd.exe, by contrast, the system locale matters for file I/O (with < and > redirections, but notably including what encoding to assume for batch-file source code), not just for communicating with external programs in-memory (such as when reading program output in a for /f loop).
[2] In PowerShell v4-, where the static ::new() method isn't available, use $OutputEncoding = (New-Object System.Text.UTF8Encoding).psobject.BaseObject. See GitHub issue #5763 for why the .psobject.BaseObject part is needed.
You can put the command chcp 65001 in your Powershell Profile, which will run it automatically when you open Powershell. However, this won't do anything for cmd.exe.
Microsoft is currently working on an improved terminal that will have full Unicode support. It is open source, and if you're using Windows 10 Version 1903 or later, you can already download a preview version.
Alternatively, you can use a third-party terminal emulator such as Terminus.
The Powershell ISE displays Korean perfectly fine. Here's a sample text file encoded in utf8 that would work:
PS C:\Users\js> cat .\korean.txt
The Korean language (South Korean: 한국어/韓國語 Hangugeo; North
Korean: 조선말/朝鮮말 Chosŏnmal) is an East Asian language
spoken by about 77 million people.[3]
Since the ISE comes with every version of Windows 10, I do not consider it obsolete. I disagree with whoever deleted my original answer.
The ISE has some limitations, but some scripting can be done with external commands:
echo 'list volume' | diskpart # as admin
cmd /c echo hi
EDIT:
If you have Windows 10 1903, you can download Windows Terminal from the Microsoft Store https://devblogs.microsoft.com/commandline/introducing-windows-terminal/, and Korean text would work in there. Powershell 5 would need the text format to be UTF8 with bom or UTF16.
EDIT2:
It seems like the ideals are windows terminal + powershell 7 or vscode + powershell 7, for both pasting characters and output.
EDIT3:
Even in the EDIT2 situations, some unicode characters cannot be pasted, like ⇆ (U+21C6), or unicode spaces. Only PS7 in Osx would work.

How do I configure a terminal to read UTF-8 characters?

I am working on a project which accepts user input via the command line. I am using up-to-date Windows 10 and (after much running around in circles...) I am aware that it is notoriously bad when it comes to handling UTF-8 characters. Consequently, I looked to VS Code and the integrated terminal (PowerShell) to perform input into the program. Sadly, the terminal seemed unable to accept accented UTF-8 characters such as "ë". I then did more research and configured the settings.json for VS Code for UTF-8 BOM encoding. Still, the terminal failed to read accented characters. I am certain that my program is not the issue, nor is my font. I have reduced my code to a test algorithm that simply accepts input using readline-sync (which the developers confirm is compatible with UTF-8: https://github.com/anseki/readline-sync/issues/58) and "console.log"s it.
The test case I have been using is "Hëllo". When I input "Hëllo" into the VS Code terminal, my program outputs "H�llo". When I tried converting all of my apps to UTF-8 encoding using the administrative language settings for Windows 10 and subsequently input "Hëllo" via the command terminal, it output "Hllo". I also tried forcing CMD to use Code Page 65001 with chcp 65001 for UTF-8 encoding, but it still produced "Hllo".
Here is the code I used to configure the VS Code PowerShell terminal via settings.json:
{
"[powershell]": {
"files.encoding": "utf8bom",
"files.autoGuessEncoding": true
}
}
And here is the brief code I wrote to test my input/output and whether the "ë" is being read successfully (which it is not):
const rlSync = require('readline-sync');
const name = rlSync.question('Enter Player 1 Username (Case Sensitive): ');
console.log(name);
If y'all see any issues, please let me know!
I am looking for any way to properly configure my CLI to accept accented characters for use in my program. I do not mean to restrict this question to VS Code or Powershell. If there is a way to accomplish this with the basic Windows 10 CMD, I would love that. Thank you for any help y'all can provide! <3
Is there any particular reason you're using VSCode? I think you're looking for the System.Console InputEncoding/OutputEncoding - unfortunately my default encoding just works with "Hëllo" so couldn't accurately test, and I don't know if this works with VSCode.
Try this (one line at a time)
# store current encoding settings
$i = [System.Console]::InputEncoding
$o = [System.Console]::OutputEncoding
# set encoding to UTF8
[System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
[System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8
# test
"Hëllo"
# revert (if you want. if you don't want, I would at least note the default encoding)
[System.Console]::InputEncoding = $i
[System.Console]::OutputEncoding = $o

Python ANSI Color Codes

Python 3.7, on Windows print does not work as expected for ANSI color codes until shell=True once in subprocess.call().
In the below links it appears to imply that the ANSI color codes should work using the "print" command out of the box.
How to print colour/color in python?
Print in terminal with colors using Python?
the second one mentions VT100 emulation... not sure what exactly that means. I am able to write a batch file that outputs the color fine so I would think (naively) that it should work the same way in Python.
However I am not able to use the ANSI color codes as it seems that the ESC character is being "commented out"(?) because for instance when I
print(u"\u001b[31mHelloWorld")
I am not able to see the colored output, as the ESC character seems to be necessary in Windows and prints in the python shell as "[?]" (a box with a question mark)
Is there something I am missing here?
I found myself an answer. As often happens I just did not look far enough.
the Colorama module can be installed with
py -m pip install colorama
and comes with a method definition at the root of the module called init
colorama.init()
This is a cross platform function in that it is only useful on windows (it saves the active Terminal state for reversal and writes the Terminal to preprocess ANSI codes), it does nothing for other operating systems.
I am thinking about implementing an even more lightweight solution using ctypes and setting the Interpret flags on the active terminal myself.
If you are interested in more information, see here:
https://learn.microsoft.com/en-us/windows/console/console-virtual-terminal-sequences
Output Sequences
The following terminal sequences are intercepted by the console host when written into the output stream, if the ENABLE_VIRTUAL_TERMINAL_PROCESSING flag is set on the screen buffer handle using the SetConsoleMode flag. Note that the DISABLE_NEWLINE_AUTO_RETURN flag may also be useful in emulating the cursor positioning and scrolling behavior of other terminal emulators in relation to characters written to the final column in any row.
Emphasis mine.

Debug Inspector showing strange characters?

I'm just getting into Go for the first time and finally got things running on my Win10 machine. Finally got breakpoints working inside of IntelliJ IDEA, and I'm seeing stuff like this in my debugger window. Those messes of unicode chars should actually be a 24-char HEX id that's coming from MongoDB.
My best guess is that this is a problem with mgo not properly unmarshalling ObjectId objects, but this doesn't seem to be a problem for any of the devs running linux or macOS, so maybe it's just a Windows thing?
Any input would be appreciated!
No error here. bson.ObjectId has an underlying type of string:
type ObjectId string
But it is used to store 12 "arbitrary" bytes ("arbitrary" means it is not meant to be interpreted by runes, and it's not a valid UTF-8 encoded sequence). It is usually displayed using the hex representation of its bytes, for humans.
Debuggers don't take that convenience. They see it's a string, so they attempt to display it as a string (even though it's not meant to). This is not a Windows-only thing, the Atom editor with the delve debugger does the same on Linux too. Nothing to worry about.
If you print an ObjectId, it's usually the fmt package's "thing" to use its String() method to acquire the string value to be displayed. Debuggers do not necessarily do that.

D Language fails to display german Umlaute on Windows?

As you can see, D fails to output german Umlaute. At least on Windows. On Linux or BSD the same program outputs the string as I've saved it.
I already tried wstring or dstring, but the output is the same.
What am I doing wrong?
D will output UTF-8 regardless of the operating system. How the output will be interpreted depends on how it is displayed. In this particular case, it looks like your IDE is interpreting the output as if it was encoded in the Windows-1252 encoding.
For the standard Windows console, you could change the output encoding by calling SetConsoleOutputCP(65001), but note that this may have some undesired side effects (you should restore the codepage before your progam exits, and batch files may not run while the console output codepage is set to 65001).
CyberShadows post guided me to an acceptable answer. :-)
In Eclipse it is possible to change the output-encoding without changing global settings of the OS.
Go to Run --> Run-Configurations...
There select the Common-Tab and change the encoding to UTF-8. Now german Umlaute are displayed correctly. At least in Eclipse. :-)
Another possibility is to use https://babun.github.io/ . It is a Cygwin-based Shell that ouputs UTF-8:

Resources