Convert a string in PowerShell (in Europe) to UTF-8

For a REST call I need the German word "Stück" in UTF-8, as read from an Access database with
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$filename;Persist Security Info=False;")
and I am trying to convert it.
I have found out that PowerShell ISE seems to encode string constants in ANSI.
So I tried as a minimum test without database and got the same result:
$Text1 = "Stück" # entered via ISE, this is also what I get from the database
# ($StringFromDatabase -eq $Text1) shows $true
$enc = [System.Text.Encoding]::GetEncoding(1252).GetBytes($Text1)
# also tried [System.Text.Encoding]::GetEncoding("ISO-8859-1") # = 28591
$Text1 = [System.Text.Encoding]::UTF8.GetString($enc)
$Text1
$Text1 = "Stück" # = UTF-8, entered here with Notepad++, encoding set to UTF-8
"must see: $Text1"
So I get two outputs: the converted one shows "St?ck", but I need to see "Stück".

As for "PowerShell ISE seems to encode string constants in ANSI": that only applies when communicating with external programs, whereas you're using in-process .NET APIs.
As an aside: this discrepancy with regular console windows, which use the active OEM code page, is one of the reasons that make the obsolescent ISE problematic - see the bottom section of this answer for more information.
String literals in memory are always .NET strings, which are UTF-16-encoded (composed of 16-bit Unicode code units), capable of representing all Unicode characters.[1]
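For example, the following quick check (runnable in any PowerShell session) shows that the in-memory string already contains the right character - no byte-level conversion is needed, or even possible, on a string that is already correct:
$Text1 = 'Stück'
$Text1.Length          # 5 - five characters, including the 'ü'
[int[]][char[]]$Text1  # 83 116 252 99 107 - 'ü' is code point U+00FC (252)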
Character encoding in web-service calls (Invoke-RestMethod, Invoke-WebRequest):
To send UTF-8 strings, specify charset=utf-8 as part of the -ContentType argument; e.g.:
Invoke-RestMethod -ContentType 'text/plain; charset=utf-8' ...
On receiving strings, PowerShell automatically decodes them based either on an explicitly specified charset field (character encoding) in the response's content header or, in its absence, on ISO-8859-1 (which is closely related to, but in effect a subset of, Windows-1252).
If a given response doesn't specify a charset but actually uses an encoding other than ISO-8859-1 - say, UTF-8 - PowerShell will misinterpret the strings received, which requires re-encoding after the fact - see this answer.
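For instance, if a response body is actually UTF-8 but was decoded as ISO-8859-1, you can undo the damage along the following lines (a sketch only; the URL is a placeholder):
# Hypothetical endpoint whose response declares no charset but is in fact UTF-8.
$response = Invoke-WebRequest -Uri 'https://example.org/api/data'
# Re-encode the misinterpreted string back to bytes using the encoding PowerShell
# assumed (ISO-8859-1), then decode those bytes as what they really are: UTF-8.
$bytes = [System.Text.Encoding]::GetEncoding(28591).GetBytes($response.Content)
$fixed = [System.Text.Encoding]::UTF8.GetString($bytes)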
Character encoding when communicating with external programs:
If you need to send a string with a particular encoding to an external program (via the pipeline, which the target program receives via stdin), set the $OutputEncoding preference variable to that encoding, and PowerShell will automatically convert your .NET strings to the specified encoding.
To send UTF-8-encoded strings to external programs via the pipeline:
$OutputEncoding = [System.Text.UTF8Encoding]::new()
Note, however, that this alone isn't sufficient to correctly receive UTF-8 output from external programs; for that, you need to set [Console]::OutputEncoding to the same encoding.
To make your PowerShell session fully UTF-8-aware (irrespective of whether in the ISE or a regular console window):
# Needed in the ISE only:
chcp >$null # Dummy console-program call that ensures that a console is allocated.
# Set all encodings relevant to communicating with external programs to UTF-8.
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding =
  [System.Text.UTF8Encoding]::new()
See this answer for more information.
[1] Note, however, that Unicode characters with a code point greater than 0xFFFF, i.e. those outside the so-called BMP (Basic Multilingual Plane), must be represented with two 16-bit code units ([char]), namely so-called surrogate pairs.
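A quick way to see this, using the 📚 emoji (U+1F4DA):
$s = '📚'
$s.Length                       # 2 - two UTF-16 code units ([char] instances)
[char]::IsHighSurrogate($s[0])  # True - the first unit is the high surrogate
[char]::ConvertToUtf32($s, 0)   # 128218 (0x1F4DA) - the actual code point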

Related

Strange characters found in XML file and PowerShell output after exporting from Excel: ​

I have an XML file that I'm trying to read with PowerShell. However, when I read it, the output of some of the XML objects has the following characters in it: ​
I simply downloaded an XML file I needed from a third-party, which opens in Excel. Then I grab the columns I need and paste them into a new Excel Workbook. Then I map the fields with an XML Schema and then export it as an XML file, which I then use for scripting.
In the Excel spreadsheet my data looks clean, but then when I export it and run the PS script, these strange characters appear in the output. The characters even appear in the actual XML file after exporting. What am I doing wrong?
I tried using -Encoding UTF8, but I'm relatively new to PowerShell and am not sure how to appropriately apply it to my script. Appreciate any help!
PowerShell
$xmlpath = 'Path\To\The\File.xml'
[xml]$xmldata = (Get-Content $xmlpath)
$xmldata.applications.application.name
Example of Output
​ABC_DEF_GHI​.com​​
​JKL_MNO_PQRS​.com​
TUV_WXY_Z.com
AB_CD_EF_GH​.com
This is a prime example of why you shouldn't use the idiom [xml]$xmldata = (Get-Content $xmlpath) - as convenient as it is.[1] The problem is indeed one of character encoding: your file is UTF-8-encoded, but Windows PowerShell's Get-Content cmdlet interprets it as ANSI-encoded in the absence of a BOM - this answer explains the encoding part in detail (thanks, choroba).
Instead, to ensure that the XML file's character encoding is interpreted correctly, use the following:
# Note: If you know that $xmlPath contains a *full*, native path,
# you don't need the Convert-Path call.
($xmlData = [xml]::new()).Load((Convert-Path -LiteralPath $xmlPath))
This delegates interpretation of the character encoding to the System.Xml.XmlDocument.Load .NET API method, which not only assumes the proper default for XML (UTF-8), but also respects any explicit encoding specification as part of the XML declaration, if present (e.g., <?xml version="1.0" encoding="iso-8859-1"?>).
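For illustration, here's a minimal repro sketch (Windows PowerShell 5.1; the file path and element names are made up) that contrasts the two approaches with a BOM-less UTF-8 file containing a zero-width space (U+200B), the character behind the question's output:
$file = Join-Path $env:TEMP 'repro.xml'
$xmlText = '<?xml version="1.0"?><apps><app>ABC' + [char]0x200B + '.com</app></apps>'
# Write the file as UTF-8 *without* a BOM.
[System.IO.File]::WriteAllText($file, $xmlText, [System.Text.UTF8Encoding]::new($false))
# Windows PowerShell's Get-Content assumes ANSI for BOM-less files:
([xml](Get-Content $file)).apps.app   # -> "ABCâ€‹.com" (U+200B's UTF-8 bytes misread as ANSI)
# .Load() applies XML's UTF-8 default and decodes correctly:
($xmlData = [xml]::new()).Load($file)
$xmlData.apps.app                     # -> the U+200B survives intact (though it renders invisibly)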
See also:
the bottom section of this answer for background information.
GitHub proposal #14505, which proposes introducing a New-Xml cmdlet that robustly parses XML files.
[1] If you happen to know the encoding of the input file ahead of time, you can get away with using Get-Content's -Encoding parameter in your original approach ([xml]$xmldata = Get-Content -Encoding utf8 $xmlpath), but the .Load()-based approach is much more robust.

Node.js + Powershell: how to encode Unicode characters when using stdin.write

This question is related to this issue: Powershell: how to execute command for a path containing Unicode characters?
I have a Node.js app that spawns a single child process with Powershell 5.1, and then re-uses it to run different commands since it's faster than spawning a separate process.
Problem
The problem is, commands containing Unicode characters are failing silently.
Code
let childProcess = require('child_process')
let testProcess = childProcess.spawn('powershell', [])
testProcess.stdin.setEncoding('utf-8')
testProcess.stdout.on('data', (data) => {
  console.log(data.toString())
})
testProcess.stdout.on('error', (error) => {
  console.log(error)
})
// This path is working, I get command output in the console:
// testProcess.stdin.write("(Get-Acl 'E:/test.txt').access\n");
// This path is not working. I get nothing in the console
testProcess.stdin.write("(Get-Acl 'E:/test 📚.txt').access\n");
Edit #1
I've tried to encode the paths to UTF-8 on the Node.js side before sending the command to Powershell and then casting it to System.Char:
const path = 'E:/test $([char]0x1f4da).txt'
const command = `Get-Acl $(${path}).access`
testProcess.stdin.write(`${command}\n`)
but I'm not sure how to do it properly. It seems like I'm not encoding it to the correct format. And it's not really a proper solution either; I just encoded the emoji to UTF manually. I would probably need to convert the whole path to UTF-16 or something to ensure there are no unsupported characters in it:
"E:/test 📚.txt".split("").reduce((hex,c) => hex += c.charCodeAt(0).toString(16).padStart(4,"0"),"")
Not sure it would even work.
Try the following:
let childProcess = require('child_process')
let testProcess = childProcess.spawn(
  'chcp 65001 >NUL & powershell.exe -NonInteractive -NoProfile -Command -',
  { shell: true }
)
testProcess.stdout.on('data', (data) => {
  console.log(data.toString())
})
testProcess.stdout.on('error', (error) => {
  console.log(error)
})
testProcess.stdin.write("Get-Item '📚.txt'\n");
While Node.js itself defaults to UTF-8, console applications spawned from it typically use the system's active OEM code page, such as code page 437 on US-English systems - usually a fixed single-byte code page limited to 256 characters that lacks support for most Unicode characters.
As an aside: there's a still-in-beta Windows 10 feature that allows setting both legacy code pages - ANSI and OEM - to UTF-8, but doing so has far-reaching consequences.
powershell.exe, the Windows PowerShell CLI, is no exception, so in order to make it interpret its stdin input as UTF-8, the OEM code page must be explicitly set to the UTF-8 code page, 65001, before powershell.exe is launched.
Thus, { shell: true } is used to ensure that powershell.exe is launched via cmd.exe, the default shell on Windows, which makes it possible to run chcp 65001 first to switch to the UTF-8 code page.
Note: This switch to UTF-8 as the OEM code page also affects subsequent calls to console applications in the same process.
Additionally:
-NonInteractive tells PowerShell that no user interactions are expected in the session; notably, this prevents loading of the PSReadLine module used for command-line editing, which can cause problems with Unicode characters outside the BMP, i.e. characters with a code point higher than 0xFFFF (such as 📚) that require two [char] instances in .NET.
-NoProfile prevents loading (dot-sourcing) of the PowerShell profile files, given that (a) they're typically only needed in interactive sessions and (b) loading them not only slows things down, but can have side effects.
-Command - tells PowerShell to read commands from stdin; omitting this parameter behaves somewhat similarly, but is the equivalent of -File -, which exhibits pseudo-interactive behavior.
As an aside: ultimately, both -Command - and the implied -File - exhibit unexpected behaviors, as discussed in GitHub issue #3223 and GitHub issue #15331.
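As a quick sanity check once the session is up, you could send a couple of diagnostic commands over the same stdin pipe (plain PowerShell, written via testProcess.stdin.write(); the path is the question's example):
[Console]::InputEncoding                # should report code page 65001 (UTF-8) once chcp 65001 has taken effect
[Console]::OutputEncoding               # likewise 65001, so Node's data.toString() decodes correctly
Get-Item -LiteralPath 'E:/test 📚.txt'  # the non-ASCII path now arrives intact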

Linux command file -i returns wrong value charset=unknown-8bit for a windows-1252 encoded file

Using Node.js and iconv-lite to create an HTTP response file in XML with charset windows-1252, the file -i command cannot identify it as windows-1252.
Server side:
r.header('Content-Disposition', 'attachment; filename=teste.xml');
r.header('Content-Type', 'text/xml; charset=iso8859-1');
r.write(ICONVLITE.encode(`<?xml version="1.0" encoding="windows-1252"?><x>€Àáção</x>`, "win1252")); // euro symbol and Portuguese accented vowels
r.end();
The browser downloads the file and then I check it in Ubuntu 20.04 LTS:
file -i teste.xml
/tmp/teste.xml: text/xml; charset=unknown-8bit
When I use gedit to open it, the accented vowels appear fine but the euro symbol does not (all characters from 128 to 159 get messed up).
I checked in a Windows 10 VM and there everything looks fine. In both Windows and Linux web browsers it also displays fine.
So, is it a problem with the file command? How can I check the right charset of a file in Linux?
Thank you
EDIT
The resulting file can be downloaded here
2nd EDIT
I found one error! The code line:
r.header('Content-Type', 'text/xml; charset=iso8859-1');
must be:
r.header('Content-Type', 'text/xml; charset=Windows-1252');
It's important to understand what a character encoding is and isn't.
A text file is actually just a stream of bits; or, since we've mostly agreed that there are 8 bits in a byte, a stream of bytes. A character encoding is a lookup table (and sometimes a more complicated algorithm) for deciding what characters to show to a human for that stream of bytes.
For instance, the character "€" encoded in Windows-1252 is the string of bits 10000000. That same string of bits will mean other things in other encodings - most encodings assign some meaning to all 256 possible bytes.
If a piece of software knows that the file is supposed to be read as Windows-1252, it can look up a mapping for that encoding and show you a "€". This is how browsers are displaying the right thing: you've told them in the Content-Type header to use the Windows-1252 lookup table.
Once you save the file to disk, that "Windows-1252" label from the Content-Type header isn't stored anywhere. So any program looking at that file can see that it contains the string of bits 10000000, but it doesn't know what mapping table to look that up in. Nothing you do in the HTTP headers is going to change that - none of those are going to affect how it's saved on disk.
In this particular case the "file" command could look at the "encoding" marker inside the XML document, and find the "windows-1252" there. My guess is that it simply doesn't have that functionality. So instead it uses its general logic for guessing an encoding: it's probably something ASCII-compatible, because it starts with the bytes that spell <?xml in ASCII; but it's not ASCII itself, because it has bytes outside the range 00000000 to 01111111; anything beyond that is hard to guess, so it outputs "unknown-8bit".
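To make the lookup-table point concrete, here is a small sketch (using .NET's encoding classes via PowerShell, as used elsewhere on this page, purely as a convenient way to consult those tables):
$byte = [byte[]](0x80)                                               # the bit string 10000000
[System.Text.Encoding]::GetEncoding(1252).GetString($byte)           # "€" - Windows-1252
[System.Text.Encoding]::GetEncoding('ISO-8859-1').GetString($byte)   # an invisible C1 control character, not "€"
[System.Text.Encoding]::UTF8.GetString($byte)                        # "�" - not valid UTF-8 on its own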

problems with CICS Request Node CCSID

I have this problem:
I have a message flow developed in WMB 7 fix pack 6, integrated with CICS. My CICS CCSID is 037. The broker is running on z/Linux with locale = en_US.UTF-8 and locale charmap = UTF-8. MQSeries is in 1208. I have problems with special characters like ñ, Ñ, á, etc.
In my message flow I have this code:
DECLARE CICSRespMsg BLOB;
DECLARE CICSRespChar CHARACTER;
DECLARE MsgOut BLOB;
DECLARE MsgOutChar CHARACTER;
--EBCDIC TO ASCII
SET CICSRespMsg = InputRoot.BLOB.BLOB;
SET CICSRespChar = CAST(CICSRespMsg AS CHARACTER CCSID 037);
SET MsgOut = CAST(CICSRespChar AS BLOB CCSID 850);
SET MsgOutChar = CAST(MsgOut AS CHARACTER CCSID 850);
I tried changing from 850 to 819 and I got the same issue. Hope you can help me. Thanks so much! ;(
I'm not allowed to ask for clarification in my "answer", so I'll show you how to debug your problem, as I can't provide you with an exact solution with the information provided.
You've shown a snippet of ESQL which is converting from ibm-037 to ibm-850 via Unicode. As ibm-850 doesn't support ñ I would expect the conversion to fail. However ibm-819, a.k.a latin-1, a.k.a iso-8859-1 does support the character and the conversion of ñ should succeed.
I don't know what you're doing after the compute node, so look at your input and output nodes, and look at the CCSID in the Properties folder. You say the MQSeries is in 1208, which I assume means the queue manager's default CCSID is set to 1208. If this is being used on the output node then you'll have a problem, as UTF-8 (ibm-1208) is incompatible with latin-1 for these characters.
Place a trace node after your input node and trace to a file with ${Root} as the trace expression, place another trace node before your output node tracing the same to a different file. Look at the bytes:
ñ in 037 is 0x49
ñ in 819 is 0xf1
ñ in 1208 is 0xc3b1
If you see 0x1A, the character has been replaced with a substitution character.
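If you want to double-check those byte values, .NET's encoding classes make a handy scratchpad (shown in PowerShell purely for convenience; on .NET Core the EBCDIC code pages additionally require the System.Text.Encoding.CodePages provider to be registered):
'{0:X2}' -f [System.Text.Encoding]::GetEncoding(37).GetBytes('ñ')[0]     # 49 - IBM-037
'{0:X2}' -f [System.Text.Encoding]::GetEncoding(28591).GetBytes('ñ')[0]  # F1 - ISO-8859-1 / 819
-join ([System.Text.Encoding]::UTF8.GetBytes('ñ') | ForEach-Object { '{0:X2}' -f $_ })  # C3B1 - UTF-8 / 1208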
If you want the output to be UTF-8 ensure that you use 1208 instead of 850/819 above and make sure that OutputRoot.Properties.CodedCharSetId is set to 1208.
If you want the output to be in latin-1, use 819 above and ensure that OutputRoot.Properties.CodedCharSetId is set to 819.
Hope this helps,
Andreas

Charset of Lotus Domino Server

I have a Java agent (running on a Linux server) that manages document attachments, but something is wrong with accented chars in their names (ò, è, ù, etc.).
I wrote this code to display the charset used:
OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
String enc = writer.getEncoding();
System.out.println("CHARSET: " + enc);
This displays:
CHARSET: ASCII
On a server where everything works fine, the same line prints:
CHARSET: UTF8
The servers have the same configuration (they work with "Internet sites", where "Use UTF-8 for output" is set to "Yes").
Any idea about a parameter to set (Domino/Linux)?
UPDATE
I'll try to explain better...
I call an agent through an Ajax call.
As a parameter, I pass the string "ààà". When I try to decode it as UTF-8 inside the agent, the string resolves to
"???"
instead of
"ààà"
This is what System.out.println() shows in the console.
On another Domino server, everything works. I don't understand whether it is a matter of server settings or OS settings.
Just a suggestion, but you could change the first line in your example to be:
OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream(),
        Charset.forName("UTF-8"));
That will force the OutputStreamWriter to UTF-8, and your sample code will show consistent output on both servers. Without knowing more details, I can't say for sure whether that's relevant to the real problem.
Although this might not directly answer your question, you might be interested in this article about encoding.
