Different file size between powershell and cmd [duplicate] - node.js

This question already has answers here:
Using redirection within the script produces a unicode output. How to emit single-byte ASCII text?
(6 answers)
Closed 4 years ago.
I am using a little processconf.js tool to build a configuration.json file from multiple .json files.
Here is the command I am using:
node processconf.js file1.json file2.json > configuration.json
I had been using cmd for a while, but today I tried PowerShell, and somehow, with the same files and the same command, I get different results.
One file is 33 KB (cmd), the other 66 KB (PowerShell). Looking at the files, they have exactly the same lines and I can't find any visual difference. Why is that?

PowerShell defaults to UTF-16 LE for redirection, while cmd doesn't use Unicode by default for redirection (which may sometimes end up mangling your data as well).
If you don't use the redirection operator in PowerShell but Out-File instead, you can specify an encoding, e.g.
node processconf.js file1.json file2.json | Out-File -Encoding Utf8 configuration.json
I think -Encoding Oem would be roughly the same as cmd's behaviour, but the OEM code page usually doesn't support Unicode and a conversion is involved.
The redirection operator of course has no provisions for specifying any options, so it's often not the best choice when you care about the exact output format. And since PowerShell, unlike Unix shells, works with objects, text and raw binary data are very different things.
You'd get the same behaviour from cmd if you ran it with cmd /u, by the way.
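If you want to confirm what happened, you can inspect the first bytes of the two files; here's a minimal sketch, assuming the output file name from the question and PowerShell v5+:
# FF FE at the start of the file is the UTF-16 LE byte-order mark; ASCII characters
# then take two bytes each, which is why the PowerShell-produced file is roughly
# twice the size of the cmd-produced one.
$bytes = [System.IO.File]::ReadAllBytes("$PWD\configuration.json")
'{0:X2} {1:X2}' -f $bytes[0], $bytes[1]   # FF FE => UTF-16 LE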

Related

Why electron backend has different encoding and how to fix it? [duplicate]

I have a simple javascript file (let's call it index.js) with the following:
console.log('pérola');
I use VSCode on Windows 10, and it uses PowerShell as its terminal. When I execute the file using:
node index.js
I get the following output:
pérola
If I run the following:
node index.js > output.txt
I get the following on the file:
p├®rola
It seems there is some issue with PowerShell's encoding when writing to files; when I open the file in VSCode I can see at the bottom right that the encoding is UTF-16 LE.
I also already tried the following:
node index.js | out-file -encoding utf8 output.txt
The file is saved as UTF-8 with BOM but the text is still wrong, since what I see is p├®rola and not pérola.
Can someone explain to me what is wrong here?
Thank you.
What node outputs is UTF-8-encoded.
PowerShell's > operator does not pass the underlying bytes through to the output file.
Instead, PowerShell converts the bytes output by node into .NET strings based on the encoding stored in [Console]::OutputEncoding and then saves the resulting strings based on the encoding implied by the > operator, which is - effectively, not technically - an alias of the Out-File cmdlet.
In other words: for PowerShell to properly interpret node's output you must (temporarily) set [Console]::OutputEncoding to [System.Text.UTF8Encoding]::new().
Additionally, you must then decide what character encoding you want the output file to have, by using Out-File -Encoding or - preferably, if the input is text already - Set-Content -Encoding instead of >.
That is, you need to do this unless > / Out-File's default character encoding works for you: it is "Unicode" (UTF-16 LE) in Windows PowerShell, and BOM-less UTF-8 in PowerShell [Core] v6+.
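Putting it together, here's a minimal sketch of that approach, assuming PowerShell v5+ (for ::new()) and the file names from the question:
# 1. Tell PowerShell how to decode node's output (UTF-8), saving the current setting.
$prev = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
try {
  # 2. Choose the output file's encoding explicitly instead of relying on >.
  node index.js | Set-Content -Encoding utf8 output.txt
}
finally {
  [Console]::OutputEncoding = $prev   # restore the original console encoding
}
Note that -Encoding utf8 produces UTF-8 with a BOM in Windows PowerShell and BOM-less UTF-8 in PowerShell [Core] v6+.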
See also:
This answer for background information on how to make PowerShell console windows use UTF-8 consistently when communicating with external programs[1], both when sending data to external programs ($OutputEncoding) and when interpreting data from external programs ([Console]::OutputEncoding):
In short, place the following statement in your $PROFILE:
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
If you're running in the - obsolescent - Windows PowerShell ISE, you need an additional command to ensure that the ISE first allocates a hidden console behind the scenes; note that in the recommended replacement, Visual Studio Code with its PowerShell extension, this is not necessary:
$null = chcp # Run any console application to force the ISE to create a console.
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
This answer for a system-wide way to make non-Unicode (console) applications use UTF-8, available in recent versions of Windows 10. This makes both cmd.exe and PowerShell use UTF-8 by default.[1]
Caveat: This feature is still in beta as of Windows 10 20H2, and it can have unwanted side effects - see the linked answer.
[1] What encoding PowerShell's own cmdlets use is not controlled by this; PowerShell cmdlets have their own defaults, which are - unfortunately - inconsistent in Windows PowerShell, whereas in PowerShell [Core] v6+ (BOM-less) UTF-8 is the consistent default; see this answer.
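For illustration, here's a small sketch of those differing defaults (the file names are arbitrary; run it in Windows PowerShell 5.1 to see the inconsistency):
'é' | Out-File out1.txt                  # Windows PowerShell: UTF-16 LE ("Unicode")
'é' | Set-Content out2.txt               # Windows PowerShell: ANSI (active ANSI code page)
'é' | Out-File -Encoding utf8 out3.txt   # Windows PowerShell: UTF-8 with BOM
# In PowerShell [Core] v6+ all three commands default to BOM-less UTF-8 instead.
Get-Item out1.txt, out2.txt, out3.txt | Select-Object Name, Length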

Special character and encoding handling while running azure CLI commands in PowerShell [duplicate]

I've been forcing the usage of chcp 65001 in Command Prompt and Windows Powershell for some time now, but judging by Q&A posts on SO and several other communities it seems like a dangerous and inefficient solution. Does Microsoft provide an improved / complete alternative to chcp 65001 that can be saved permanently without manual alteration of the Registry? And if there isn't, is there a publicly announced timeline or agenda to support UTF-8 in the Windows CLI in the future?
Personally, I've been using chcp 949 for Korean character support, but the odd display of the backslash \, incorrect or incomprehensible output in several applications (like Neovim), and the fact that non-Korean characters aren't supported by code page 949 have become more of a problem lately.
Note:
This answer shows how to switch the character encoding in the Windows console to
(BOM-less) UTF-8 (code page 65001), so that shells such as cmd.exe and PowerShell properly encode and decode characters (text) when communicating with external (console) programs with full Unicode support, and in cmd.exe also for file I/O.[1]
If, by contrast, your concern is about the separate aspect of the limitations of Unicode character rendering in console windows, see the middle and bottom sections of this answer, where alternative console (terminal) applications are discussed too.
Does Microsoft provide an improved / complete alternative to chcp 65001 that can be saved permanently without manual alteration of the Registry?
As of (at least) Windows 10, version 1903, you have the option to set the system locale (language for non-Unicode programs) to UTF-8, but the feature is still in beta as of this writing.
To activate it:
Run intl.cpl (which opens the regional settings in Control Panel)
On the Administrative tab, click "Change system locale...", check "Beta: Use Unicode UTF-8 for worldwide language support", and reboot.
This sets both the system's active OEM and the ANSI code page to 65001, the UTF-8 code page, which therefore (a) makes all future console windows, which use the OEM code page, default to UTF-8 (as if chcp 65001 had been executed in a cmd.exe window) and (b) also makes legacy, non-Unicode GUI-subsystem applications, which (among others) use the ANSI code page, use UTF-8.
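To verify that the change took effect (after the required reboot), here's a quick sketch; the registry path is the standard NLS CodePage key:
Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' |
  Select-Object ACP, OEMCP   # both should report 65001 after the change
chcp                         # in a new console window: "Active code page: 65001"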
Caveats:
If you're using Windows PowerShell, this will also make Get-Content and Set-Content and other contexts where Windows PowerShell defaults to the system's active ANSI code page, notably reading source code from BOM-less files, default to UTF-8 (which PowerShell Core (v6+) always does). This means that, in the absence of an -Encoding argument, BOM-less files that are ANSI-encoded (which is historically common) will then be misread, and files created with Set-Content will be UTF-8 rather than ANSI-encoded; see the sketch after this list for reading such legacy files explicitly.
[Fixed in PowerShell 7.1] Up to at least PowerShell 7.0, a bug in the underlying .NET version (.NET Core 3.1) causes follow-on bugs in PowerShell: a UTF-8 BOM is unexpectedly prepended to data sent to external processes via stdin (irrespective of what you set $OutputEncoding to), which notably breaks Start-Job - see this GitHub issue.
Not all fonts speak Unicode, so pick a TT (TrueType) font, but even they usually support only a subset of all characters, so you may have to experiment with specific fonts to see if all characters you care about are represented - see this answer for details, which also discusses alternative console (terminal) applications that have better Unicode rendering support.
As eryksun points out, legacy console applications that do not "speak" UTF-8 will be limited to ASCII-only input and will produce incorrect output when trying to output characters outside the (7-bit) ASCII range. (In the obsolescent Windows 7 and below, programs may even crash).
If running legacy console applications is important to you, see eryksun's recommendations in the comments.
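As a sketch for the first caveat above: a legacy BOM-less ANSI file can still be read correctly by specifying its historical code page explicitly via .NET (the path and the Windows-1252 code page below are assumptions for illustration):
# Once the ANSI code page is 65001, Get-Content without -Encoding would misread this
# file, so decode it with the code page it was actually written in:
$text = [System.IO.File]::ReadAllText('C:\data\legacy.txt', [System.Text.Encoding]::GetEncoding(1252))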
However, for Windows PowerShell, that is not enough:
You must additionally set the $OutputEncoding preference variable to UTF-8 as well: $OutputEncoding = [System.Text.UTF8Encoding]::new()[2]; it's simplest to add that command to your $PROFILE (current user only) or $PROFILE.AllUsersCurrentHost (all users) file.
Fortunately, this is no longer necessary in PowerShell Core, which internally consistently defaults to BOM-less UTF-8.
If setting the system locale to UTF-8 is not an option in your environment, use startup commands instead:
Note: The caveat re legacy console applications mentioned above equally applies here. If running legacy console applications is important to you, see eryksun's recommendations in the comments.
For PowerShell (both editions), add the following line to your $PROFILE (current user only) or $PROFILE.AllUsersCurrentHost (all users) file, which is the equivalent of chcp 65001, supplemented with setting preference variable $OutputEncoding to instruct PowerShell to send data to external programs via the pipeline in UTF-8:
Note that running chcp 65001 from inside a PowerShell session is not effective, because .NET caches the console's output encoding on startup and is unaware of later changes made with chcp; additionally, as stated, Windows PowerShell requires $OutputEncoding to be set - see this answer for details.
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
For example, here's a quick-and-dirty approach to add this line to $PROFILE programmatically:
'$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding' + [Environment]::Newline + (Get-Content -Raw $PROFILE -ErrorAction SilentlyContinue) | Set-Content -Encoding utf8 $PROFILE
For cmd.exe, define an auto-run command via the registry, in value AutoRun of key HKEY_CURRENT_USER\Software\Microsoft\Command Processor (current user only) or HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor (all users):
For instance, you can use PowerShell to create this value for you:
# Auto-execute `chcp 65001` whenever the current user opens a `cmd.exe` console
# window (including when running a batch file):
Set-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' AutoRun 'chcp 65001 >NUL'
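To check the value (or undo it later), a quick sketch:
Get-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' AutoRun             # verify
# Remove-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' -Name AutoRun  # undo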
Optional reading: Why the Windows PowerShell ISE is a poor choice:
While the ISE does have better Unicode rendering support than the console, it is generally a poor choice:
First and foremost, the ISE is obsolescent: it doesn't support PowerShell (Core) 7+, where all future development will go, and it isn't cross-platform, unlike the new premier IDE for both PowerShell editions, Visual Studio Code, which already speaks UTF-8 by default for PowerShell Core and can be configured to do so for Windows PowerShell.
The ISE is generally an environment for developing scripts, not for running them in production (if you're writing scripts (also) for others, you should assume that they'll be run in the console); notably, with respect to running code, the ISE's behavior is not the same as that of a regular console:
Poor support for running external programs, not only due to lack of supporting interactive ones (see next point), but also with respect to:
character encoding: the ISE mistakenly assumes that external programs use the ANSI code page by default, when in reality it is the OEM code page. E.g., by default this simple command, which tries to simply pass a string echoed from cmd.exe through, malfunctions (see below for a fix):
cmd /c echo hü | Write-Output
Inappropriate rendering of stderr output as PowerShell errors: see this answer.
The ISE dot-sources script-file invocations instead of running them in a child scope (the latter is what happens in a regular console window); that is, repeated invocations run in the very same scope. This can lead to subtle bugs, where definitions left behind by a previous run can affect subsequent ones.
As eryksun points out, the ISE doesn't support running interactive external console programs, namely those that require user input:
The problem is that it hides the console and redirects the process output (but not input) to a pipe. Most console applications switch to full buffering when a file is a pipe. Also, interactive applications require reading from stdin, which isn't possible from a hidden console window. (It can be unhidden via ShowWindow, but a separate window for input is clunky.)
If you're willing to live with that limitation, switching the active code page to 65001 (UTF-8) for proper communication with external programs requires an awkward workaround:
You must first force creation of the hidden console window by running any external program from the built-in console, e.g., chcp - you'll see a console window flash briefly.
Only then can you set [console]::OutputEncoding (and $OutputEncoding) to UTF-8, as shown above (if the hidden console hasn't been created yet, you'll get a handle is invalid error).
[1] In PowerShell, if you never call external programs, you needn't worry about the system locale (active code pages): PowerShell-native commands and .NET calls always communicate via UTF-16 strings (native .NET strings) and on file I/O apply default encodings that are independent of the system locale. Similarly, because the Unicode versions of the Windows API functions are used to print to and read from the console, non-ASCII characters always print correctly (within the rendering limitations of the console).
In cmd.exe, by contrast, the system locale matters for file I/O (with < and > redirections, but notably including what encoding to assume for batch-file source code), not just for communicating with external programs in-memory (such as when reading program output in a for /f loop).
[2] In PowerShell v4-, where the static ::new() method isn't available, use $OutputEncoding = (New-Object System.Text.UTF8Encoding).psobject.BaseObject. See GitHub issue #5763 for why the .psobject.BaseObject part is needed.
You can put the command chcp 65001 in your PowerShell profile, which will run it automatically when you open PowerShell. However, this won't do anything for cmd.exe.
Microsoft is currently working on an improved terminal that will have full Unicode support. It is open source, and if you're using Windows 10 Version 1903 or later, you can already download a preview version.
Alternatively, you can use a third-party terminal emulator such as Terminus.
The PowerShell ISE displays Korean perfectly fine. Here's a sample text file encoded in UTF-8 that would work:
PS C:\Users\js> cat .\korean.txt
The Korean language (South Korean: 한국어/韓國語 Hangugeo; North
Korean: 조선말/朝鮮말 Chosŏnmal) is an East Asian language
spoken by about 77 million people.[3]
Since the ISE comes with every version of Windows 10, I do not consider it obsolete. I disagree with whoever deleted my original answer.
The ISE has some limitations, but some scripting can be done with external commands:
echo 'list volume' | diskpart # as admin
cmd /c echo hi
EDIT:
If you have Windows 10 1903, you can download Windows Terminal from the Microsoft Store https://devblogs.microsoft.com/commandline/introducing-windows-terminal/, and Korean text works in there. PowerShell 5 would need the text file to be UTF-8 with BOM or UTF-16.
EDIT2:
It seems the ideal setups are Windows Terminal + PowerShell 7 or VS Code + PowerShell 7, for both pasting characters and output.
EDIT3:
Even in the EDIT2 setups, some Unicode characters cannot be pasted, like ⇆ (U+21C6), or Unicode spaces. Only PS7 on macOS would work.

Displaying Unicode characters in local web page using Bash [duplicate]

This question already has answers here:
How do you echo a 4-digit Unicode character in Bash?
(18 answers)
Closed 7 years ago.
Hello, I am trying to display the dice Unicode characters on a web server using bash; however, I am finding it more difficult than it should be. In short, I found online that (printf '\u0026') works and prints & to my page. However, when I change the number to my desired '\u2680', nothing is displayed. Admittedly, I am not very knowledgeable in Linux or Unicode, but I am very confused about why a lower number works and a higher one does not, or what I am doing wrong.
I think I may have found the answer. I think that because I am echoing everything into an HTML file, the Unicode is being parsed as HTML rather than by Linux. (Not 100% sure about that, so correct me if I am wrong.)
Either way, by simply putting the HTML code for the dice characters into the .sh file I was able to display the characters that I wanted.
(i.e. echo '&#9856(;)' without the parentheses)
First, you need to provide more information about what you are generating using bash. Post a sample script and specify which operating system and web server you are using.
It is essential to make sure that the encoding used by the shell/bash is the same as the default encoding of the web server. The HTML page must have a proper encoding specified in its header.

I exported via mysqldump to a file. How do I find out the file encoding of the file?

Given a text file in Ubuntu (or Debian/Unix in general), how do I find out the file encoding of the file? Can I run od or hexdump on it to fingerprint its encoding? What should I be looking out for?
There are many tools to do this. Try a web search for "detect encoding". Here are some of the tools I found:
The International Components for Unicode (ICU) are a great place to start. See especially their page on Character Set Detection.
Chardet is a Python module to guess the encoding of a file. See chardet.feedparser.org
The *nix command-line tool file detects file types, but might also detect encodings if mentioned in the file (e.g. if there's a mime-type notation in the file). See man file
Perl modules Encode::Detect and Encode::Guess.
Someone asked a similar question in StackOverflow. Search for the question, PHP: Detect encoding and make everything UTF-8. That's in the context of fetching files from the net and using PHP, but you could write a command-line PHP script.
Note well what the ICU page says about character set detection: "Character set detection is ..., at best, an imprecise operation using statistics and heuristics...." In my experience the problem domain makes a big difference in how easy or difficult the job is. Don't forget that it's possible that the octets in a file can be of ambiguous encoding, i.e. sensibly interpreted using multiple different encodings. They can also be of mixed encoding, i.e. different subsets of the octets make sense interpreted in different encodings. This is why there's not a single command-line tool I can recommend which always does the job.
If you have a single file and you just want to get it into a known encoding, my trick is to open the file with a text editor which can import using a bunch of different encodings, such as TextWrangler or OpenOffice.org. First, open the file and let the editor guess the encoding. Take a look at the result. If you aren't satisfied with it, guess an encoding, open the file with the editor specifying that encoding, and take a look at the result. Then save as a known encoding, e.g. UTF-16.
You can use enca. Enca is a small command-line tool for encoding detection and conversion.
You can install it on Debian/Ubuntu with:
apt-get install enca
In order to use it, just call
enca FILENAME
Also see the manpage for more information.
