Node.js + PowerShell: how to encode Unicode characters when using stdin.write

This question is related to this issue: Powershell: how to execute command for a path containing Unicode characters?
I have a Node.js app that spawns a single child process running PowerShell 5.1 and then reuses it to run different commands, since that's faster than spawning a separate process each time.
Problem
The problem is, commands containing Unicode characters are failing silently.
Code
let childProcess = require('child_process')
let testProcess = childProcess.spawn('powershell', [])
testProcess.stdin.setEncoding('utf-8')
testProcess.stdout.on('data', (data) => {
  console.log(data.toString())
})
testProcess.stdout.on('error', (error) => {
  console.log(error)
})
// This path is working, I get command output in the console:
// testProcess.stdin.write("(Get-Acl 'E:/test.txt').access\n");
// This path is not working. I get nothing in the console
testProcess.stdin.write("(Get-Acl 'E:/test 📚.txt').access\n");
Edit #1
I've tried to encode the path on the Node.js side before sending the command to PowerShell, by casting the emoji's code point to System.Char:
const path = 'E:/test $([char]0x1f4da).txt'
const command = `Get-Acl $(${path}).access`
testProcess.stdin.write(`${command}\n`)
but I'm not sure how to do it properly. It seems like I'm not encoding it to the correct format, and it's not really a proper solution either, since I just encoded the emoji manually. I would probably need to convert the whole path to UTF-16 or something to ensure there are no unsupported characters in it:
"E:/test 📚.txt".split("").reduce((hex,c) => hex += c.charCodeAt(0).toString(16).padStart(4,"0"),"")
I'm not sure that would even work.
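(For reference: a non-BMP code point such as 0x1F4DA cannot be cast to a single [char] at all. If the goal is to build the character from its code point on the PowerShell side, a purely illustrative sketch might look like this, though it doesn't address the stdin-encoding problem discussed below:)
# [char]::ConvertFromUtf32 returns a string of two [char] code units (a surrogate pair):
$book = [char]::ConvertFromUtf32(0x1F4DA)
(Get-Acl "E:/test $($book).txt").access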

Try the following:
let childProcess = require('child_process')
let testProcess = childProcess.spawn(
  'chcp 65001 >NUL & powershell.exe -NonInteractive -NoProfile -Command -',
  { shell: true }
)
testProcess.stdout.on('data', (data) => {
  console.log(data.toString())
})
testProcess.stdout.on('error', (error) => {
  console.log(error)
})
testProcess.stdin.write("Get-Item '📚.txt'\n");
While Node.js itself defaults to UTF-8, console applications spawned from it typically use the system's active OEM code page, such as code page 437 on US-English systems, which is typically a fixed single-byte code page limited to 256 characters and therefore lacks support for most Unicode characters.
As an aside: there's a still-in-beta Windows 10 feature that allows setting both legacy code pages - ANSI and OEM - to UTF-8, but doing so has far-reaching consequences.
powershell.exe, the Windows PowerShell CLI, is no exception, so in order to make it interpret its stdin input as UTF-8, the OEM code page must explicitly be set to the UTF-8 code page, 65001, before powershell.exe is launched.
Thus, { shell: true } is used to ensure that powershell.exe is launched via cmd.exe, the default shell on Windows, which makes it possible to run chcp 65001 first and thereby switch to the UTF-8 code page.
Note: This switch to UTF-8 as the OEM code page also affects subsequent calls to console applications in the same process.
Additionally:
-NonInteractive tells PowerShell that no user interactions are expected in the session. Notably, this prevents loading of the PSReadLine module used for command-line editing, which can cause problems with Unicode characters outside the BMP, i.e. characters with a code point higher than 0xFFFF (such as 📚), which require two [char] instances in .NET.
-NoProfile prevents loading (dot-sourcing) of the PowerShell profile files, given that they're (a) typically only needed in interactive sessions and (b) not only slow startup down but can also have side effects.
-Command - tells PowerShell to read commands from stdin; omitting this parameter behaves somewhat similarly, but is the equivalent of -File -, which exhibits pseudo-interactive behavior.
As an aside: ultimately, both -Command - and (implied) -File - exhibit unexpected behaviors, as discussed in GitHub issue #3223 and GitHub issue #15331.
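As an optional sanity check, a few extra commands could be written to the same stdin pipe to confirm that the switch took effect inside the spawned session (illustrative only; the expected values assume chcp 65001 ran first):
[Console]::InputEncoding             # should now report the UTF-8 code page (65001)
[Console]::OutputEncoding            # ditto for output
(Get-Acl 'E:/test 📚.txt').access    # the originally failing command should now produce output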

Related

Trim powershell output after a specific line

When running the sfc /scannow command from a PowerShell script, how do you trim the output to keep only the data after "Verification 100% complete."?
Simple code example:
$Result = Invoke-Command { sfc /scannow } | Out-String
A truncated version of this output is:
Verification 97% complete.
Verification 98% complete.
Verification 98% complete.
Verification 99% complete.
Verification 99% complete.
Verification 100% complete.
Windows Resource Protection found corrupt files and successfully repaired them.
For online repairs, details are included in the CBS log file located at
windir\Logs\CBS\CBS.log. For example C:\Windows\Logs\CBS\CBS.log. For offline
repairs, details are included in the log file provided by the /OFFLOGFILE flag.
I also wouldn't mind removing the extraneous line breaks so the final output when returning $Result would simply be:
Windows Resource Protection found corrupt files and successfully repaired them. For online repairs, details are included in the CBS log file located at windir\Logs\CBS\CBS.log. For example C:\Windows\Logs\CBS\CBS.log. For offline repairs, details are included in the log file provided by the /OFFLOGFILE flag.
I have a 1024 character limit for the output, and right now I'm just truncating it with the following:
if ($Result.length -gt 1024) {
    $Result = $Result.Substring($Result.Length - 1024)
}
But I'd like to get rid of all the "status" type messaging from the output and just retain the final result of the scan.
Thank you!
First and foremost, this is an encoding issue. The sfc command outputs UTF-16, but PowerShell doesn't know that (because it's a native command), so you typically get embedded null characters between the actual characters (try sfc /? | Out-String | Format-Hex to see what I mean).
Removing the embedded null characters would only partially fix the symptoms. On non-English systems, the encoding of non-ASCII characters will still be wrong.
To fix the cause, you have to tell PowerShell the output encoding used by the command (see this answer for background, albeit it focuses on UTF-8):
# Save the current encoding settings and temporarily switch to UTF-16 (aka UnicodeEncoding).
$oldOutputEncoding = $OutputEncoding; $oldConsoleEncoding = [Console]::OutputEncoding
$OutputEncoding = [console]::OutputEncoding = New-Object System.Text.UnicodeEncoding
# No need for Invoke-Command. Just call it directly.
$result = sfc /scannow | Out-String
# Restore the previous settings.
$OutputEncoding = $oldOutputEncoding; [Console]::OutputEncoding = $oldConsoleEncoding
# Encoding fixed, but duplicate newlines remain. Replace them by single newlines:
$result = $result -replace '\r?\n\r?\n', [Environment]::NewLine
# Finally, strip off everything up to and including the `100%` line:
$result = ($result -replace '(?s)^.*100\s*%.*?\r?\n').Trim()
See this RegEx101 demo for an explanation of the RegEx used by the last -replace command. Note that (?s) enables single-line mode so dot . matches newlines.
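Putting the pieces together, the steps above could be wrapped in a small helper function (the name Get-SfcSummary is made up here, purely for illustration):
function Get-SfcSummary {
    # Temporarily tell PowerShell that sfc.exe emits UTF-16 (UnicodeEncoding).
    $oldOutputEncoding = $OutputEncoding
    $oldConsoleEncoding = [Console]::OutputEncoding
    $OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UnicodeEncoding
    try {
        $result = sfc /scannow | Out-String
    }
    finally {
        # Restore the previous settings even if sfc fails.
        $OutputEncoding = $oldOutputEncoding
        [Console]::OutputEncoding = $oldConsoleEncoding
    }
    # Collapse the doubled newlines, then keep only what follows the "100%" line.
    $result = $result -replace '\r?\n\r?\n', [Environment]::NewLine
    ($result -replace '(?s)^.*100\s*%.*?\r?\n').Trim()
}
$Result = Get-SfcSummary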

Getting extra newlines with fmt in Windows

I started using fmt for printing recently. I really like the lib: it's fast and easy to use. But when I completed my conversion, I found there are ways my program can run that will render with a bunch of additional newlines. It's not every case, so this will get a bit deep.
What I have is a compiler and a build manager. The build manager (picture Ninja, although this is a custom tool) launches compile processes, buffers the output, and prints it all at once. Both programs have been converted to use fmt. The key function being called is fmt::vprint(stream, format, args). When the build manager prints directly, things are fine. But when I'm reading the child process output, any \n in the data has been prefixed with \r. Windows Terminal will render that fine, but some shells (such as the Visual Studio output window) do not, and will show a bunch of extra newlines.
fmt is open source so I was able to hack on it a bunch and see what is different between what it did and what my program was doing originally. The crux is this:
namespace detail {
FMT_FUNC void print(std::FILE* f, string_view text) {
#ifdef _WIN32
  auto fd = _fileno(f);
  if (_isatty(fd)) {
    detail::utf8_to_utf16 u16(string_view(text.data(), text.size()));
    auto written = detail::dword();
    if (detail::WriteConsoleW(reinterpret_cast<void*>(_get_osfhandle(fd)),
                              u16.c_str(), static_cast<uint32_t>(u16.size()),
                              &written, nullptr)) {
      return;
    }
    // Fallback to fwrite on failure. It can happen if the output has been
    // redirected to NUL.
  }
#endif
  detail::fwrite_fully(text.data(), 1, text.size(), f);
}
}  // namespace detail
When running as a child process, the _isatty() check comes back false, so we fall back to the fwrite() path, and that triggers the \r escaping. In my original program, I have an fwrite() fallback as well, but it only kicks in if GetStdHandle(STD_OUTPUT_HANDLE) returns nullptr. In the child-process case, there is still a console we can WriteFile() to.
The other side-effect I see happening is if I use the fmt way of injecting color, eg:
fmt::print(fmt::emphasis::bold | fg(fmt::color::red), "Elapsed time: {0:.2f} seconds", 1.23);
Again, Windows Terminal renders it correctly, but in Visual Studio's output window this turns into a soup of garbage. The native way of doing it -- SetConsoleTextAttribute(console, FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_INTENSITY) -- does not trigger that problem.
I tried hacking up the fmt source to be more like my original console-printing code. The key difference was the _isatty() check. I suspect that's too broad a test for the cases where console printing might fail.
\r is added because the file is opened in text mode. You could try (re)opening it in binary mode, or ignoring the \r on the read side.

Convert a string in PowerShell (in Europe) to UTF-8

For a REST call I need the German "Stück" in UTF-8, as read from an Access database with
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$filename;Persist Security Info=False;")
and try to convert it.
I have found out that PowerShell ISE seems to encode string constants in ANSI.
So I tried as a minimum test without database and got the same result:
$Text1 = "Stück" # entered via ISE, this is also what I get from the database
# ($StringFromDatabase -eq $Text1) shows $true
$enc = [System.Text.Encoding]::GetEncoding(1252).GetBytes($Text1)
# also tried [System.Text.Encoding]::GetEncoding("ISO-8859-1") # = 28591
$Text1 = [System.Text.Encoding]::UTF8.GetString($enc)
$Text1
$Text1 = "Stück" # = UTF-8, entered here with Notepad++, encoding set to UTF-8
"must see: $Text1"
So I get two outputs: the converted one shows "St?ck", but I need to see "Stück".
Regarding "PowerShell ISE seems to encode string constants in ANSI":
That only applies when communicating with external programs, whereas you're using in-process .NET APIs.
As an aside: this discrepancy with regular console windows, which use the active OEM code page, is one of the reasons that make the obsolescent ISE problematic - see the bottom section of this answer for more information.
String literals in memory are always .NET strings, which are UTF-16-encoded (composed of 16-bit Unicode code units), capable of representing all Unicode characters.[1]
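To see the difference in isolation, here's a quick demo of the round-trip that produces the mojibake (nothing ISE-specific about it):
$Text1 = "Stück"
$Text1.Length   # 5 - the ü is a single UTF-16 code unit in memory
# Re-interpreting the string's Windows-1252 bytes as UTF-8 is what breaks it:
$bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($Text1)
[System.Text.Encoding]::UTF8.GetString($bytes)   # the lone 0xFC byte is invalid UTF-8, hence "St?ck"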
Character encoding in web-service calls (Invoke-RestMethod, Invoke-WebRequest):
To send UTF-8 strings, specify charset=utf-8 as part of the -ContentType argument; e.g.:
Invoke-RestMethod -ContentType 'text/plain; charset=utf-8' ...
On receiving strings, PowerShell automatically decodes them based on an explicitly specified charset field (character encoding) in the response's Content-Type header or, in its absence, using ISO-8859-1 (which is closely related to, but in effect a subset of, Windows-1252).
If a given response doesn't specify a charset but actually uses an encoding other than ISO-8859-1 - say, UTF-8 - PowerShell will misinterpret the strings received, which requires re-encoding after the fact - see this answer.
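Such after-the-fact re-encoding could look roughly like this (a sketch; $response stands in for the string that Invoke-RestMethod returned):
# The body was really UTF-8, but PowerShell decoded it as ISO-8859-1; reverse that:
$bytes = [System.Text.Encoding]::GetEncoding('ISO-8859-1').GetBytes($response)
$fixed = [System.Text.Encoding]::UTF8.GetString($bytes)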
Character encoding when communicating with external programs:
If you need to send a string with a particular encoding to an external program (via the pipeline, which the target program receives via stdin), set the $OutputEncoding preference variable to that encoding, and PowerShell will automatically convert your .NET strings to the specified encoding.
To send UTF-8-encoded strings to external programs via the pipeline:
$OutputEncoding = [System.Text.UTF8Encoding]::new()
Note, however, that this alone isn't sufficient to correctly receive UTF-8 output from external programs; for that, you need to set [Console]::OutputEncoding to the same encoding.
To make your PowerShell session fully UTF-8-aware (irrespective of whether in the ISE or a regular console window):
# Needed in the ISE only:
chcp >$null # Dummy console-program call that ensures that a console is allocated.
# Set all encodings relevant to communicating with external programs to UTF-8.
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding =
  [System.Text.UTF8Encoding]::new()
See this answer for more information.
[1] Note, however, that Unicode characters with a code point greater than 0xFFFF, i.e. those outside the so-called BMP (Basic Multilingual Plane), must be represented with two 16-bit code units ([char]), namely so-called surrogate pairs.
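A quick way to see this (illustrative):
'📚'.Length                      # 2 - one visible character, but two [char] (UTF-16) code units
[char]::IsSurrogatePair('📚', 0) # True - the two code units form a surrogate pair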

I want to run a script from another script, use the same version of perl, and reroute IO to a terminal-like textbox

I am somewhat familiar with various ways of calling a script from another one. I don't really need an overview of each, but I do have a few questions. Before that, though, I should tell you what my goal is.
I am working on a Perl/Tk program that: a) gathers information and puts it in a hash, and b) fires off other scripts that use the info hash and some command-line args. Each of these other scripts is available on the command line (using another command-line script) and needs to stay that way, so I can't just put all that into a module and call it good. I do have the authority to alter the scripts, but, again, they must also be usable on the command line.
The current way of calling the other script is by using 'do', which means I can pass in the hash, and use the same version of perl (I think). But all the STDOUT (and STDERR too, I think) goes to the terminal.
Here's a simple example to demonstrate the output:
this_thing.pl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Tk;
my $mw = MainWindow->new;
my $button = $mw->Button(
    -text    => 'start other thing',
    -command => \&start,
)->pack;
my $text = $mw->Text()->pack;
MainLoop;
sub start {
    my $script_path = 'this_other_thing.pl';
    if (not my $read = do $script_path) {
        warn "couldn't parse $script_path: $@" if $@;
        warn "couldn't do $script_path: $!" unless defined $read;
        warn "couldn't run $script_path" unless $read;
    }
}
this_other_thing.pl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
print "Hello World!\n";
How can I redirect the STDOUT and STDIN (for interactive scripts that need input) to the text box using the 'do' method? Is that even possible?
If I can't use the 'do' method, what method can redirect the STDIN and STDOUT, as well as enable passing the hash in and using the same version of perl?
Edit: I posted this same question at PerlMonks, at the link in the first comment. So far, the best response seems to be to use modules and have the child script just be a wrapper for the module. Other possible solutions are: IPC::Run(3) and IPC in general, Capture::Tiny and associated modules, and Tk::Filehandle. A solution was presented that redirects the output and error streams, but seems not to affect the input stream. It's also a bit kludgy and not recommended.
Edit 2: I'm posting this here because I can't answer my own question yet.
Thanks for your suggestions and advice. I went with a suggestion on Perlmonks. The suggestion was to turn the child scripts into modules, and use wrapper scripts around them for normal use. I would then simply be able to use the modules, and all the code is in one spot. This also ensures that I am not using different perls, I can route the output from the module anywhere I want, and passing that hash in is now very easy.
To have both STDIN & STDOUT of a subprocess redirected, you should read the "Bidirectional Communication with Another Process" section of the perlipc man page: http://search.cpan.org/~rjbs/perl-5.18.1/pod/perlipc.pod#Bidirectional_Communication_with_Another_Process
Using the same version of perl works by finding out the name of your perl interpreter, and calling it explicitly. $^X is probably what you want. It may or may not work on different operating systems.
Passing a hash into a subprocess does not work easily. You can print the contents of the hash into a file, and have the subprocess read & parse it. You might get away without using a file, by using the STDIN channel between the two processes, or you could open a separate pipe() for this purpose. Anyway, printing & parsing the data back cannot be avoided when using subprocesses, because the two processes use two perl interpreters, each having its own memory space, and not being able to see each other's variables.
You might avoid using a subprocess by using fork() + eval() + require(). In that case, no separate perl interpreter will be involved; the forked interpreter will inherit the whole memory of your program with all variables, open file descriptors, sockets, etc. in it, including the hash to be passed. However, I don't see where your second perl script could get its hash from when started from the CLI.

Issue with filepath name, possible corrupt characters

Perl and HTML, CGI on Linux.
Issue with a file path name being passed in a form field to a CGI on the server.
The issue is with the Linux file path, not the PC side.
I am using 2 programs:
1) A program written years ago: dynamic HTML generated in a Perl program and presented to the user as a form. I modified it by inserting the code needed to allow the user to select a file from their PC, to be placed on the Linux machine.
Because this program already knows the file path needed on the Linux side, I pass this file path in a hidden form field to program 2.
2) A CGI program on the Linux side, run when the form from (1) is posted.
Strange issue.
The file path that I pass has a very strange issue.
I can extract it using
my $filepath = $query->param("serverfpath");
The above does populate $filepath with what looks like exactly the correct path.
But it fails, and not in a way that takes me to the file-open error block; instead, the call to the CGI script itself gives an error.
However, if I populate $filepath with EXACTLY the same string, via hard coding it, it works, and my file successfully uploads.
For example:
$fpath1 = $query->param("serverfpath");
$fpath2 = "/opt/webhost/ims/DOCURVC/data";
A comparison of $fpath1 and $fpath2 reveals that they are exactly equal.
A length check of $fpath1 and $fpath2 reveals that they are exactly the same length.
I have tried many methods of cleaning the data in $fpath1.
I chomp it.
I remove any non standard characters.
$fpath1 =~ s/[^A-Za-z0-9\-\.\/]//g;
and this:
my $safe_filepath_characters = "a-zA-Z0-9_.-/";
$fpath1 =~ s/[^$safe_filepath_characters]//g;
But no matter what I do, using $fpath1 causes an error, using $fpath2 works.
What could be wrong with the data in the $fpath1, that would cause it to successfully compare to $fpath2, yet not be equal, visually look exactly equal, show as having the exact same length, but not work the same?
For the file-open block below:
$upload_dir = $fpath1
causes complete failure of the CGI to load, as if the server cannot find the CGI (which I know is sometimes caused by a syntax error in the CGI script).
$upload_dir = $fpath2
gives me a successful file upload.
$upload_dir = ""
does not make the call to the CGI fail; it executes the else block of the if below, as expected.
Here is the file-open block:
if (open ( UPLOADFILE, ">$upload_dir/$filename" ))
{
    binmode UPLOADFILE;
    while ( <$upload_filehandle> )
    {
        print UPLOADFILE;
    }
    close UPLOADFILE;
    $msgstr = "Done with Upload: upload_dir=$upload_dir filename=$filename";
}
else
{
    $msgstr = "ERROR opening for upload: upload_dir=$upload_dir filename=$filename";
}
What other tests should I be performing on $fpath1 to find out why it does not work the same as its hard-coded equivalent $fpath2?
I did try character replacement, a single character at a time, from $fpath2 to $fpath1.
Even doing this with a single character caused $fpath1 to have the same error as $fpath2, although the character looked exactly the same.
Is your CGI perhaps running perl with the -T (taint mode) switch (e.g., #!/usr/bin/perl -T)? If so, any value coming from untrusted sources (such as user input, URIs, and form fields) is not allowed to be used in system operations, such as open, until it has been untainted by using a regex capture. Note that using s/// to modify it in-place will not untaint the value.
if ($fpath1 =~ /^([A-Za-z0-9\-\.\/]*)$/) {
    $fpath1 = $1;    # $1 from a successful match is untainted
}
else {
    die "Illegal character in fpath1";
}
should work if taint mode is your issue.
But it fails, and not in a way that takes me to the file open error block, but such that the call to the CGI script gives an error.
Premature end of script headers? Try running the CGI from the command line:
perl your_upload_script.cgi serverfpath=/opt/webhost/ims/DOCURVC/data
