Piping Text To An External Program Appends A Trailing Newline - linux

I have been comparing hash values between multiple systems and was surprised to find that PowerShell's hash values differ from those of other shells.
Linux shells (Cygwin, Bash for Windows, etc.) and the Windows Command Prompt all show the same hash, whereas PowerShell shows a different value.
This was tested using SHA-256, but the same issue appeared with other algorithms such as MD5.
Encoding Update:
I tried changing the PowerShell encoding, but it had no effect on the returned hash values.
[Console]::OutputEncoding.BodyName
iso-8859-1
[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
utf-8
GitHub PowerShell Issue
https://github.com/PowerShell/PowerShell/issues/5974

tl;dr:
When PowerShell pipes a string to an external program:
It encodes it using the character encoding stored in the $OutputEncoding preference variable
It invariably appends a trailing (platform-appropriate) newline.
Therefore, the key is to avoid PowerShell's pipeline in favor of the native shell's, so as to prevent implicit addition of a trailing newline:
If you're running your command on a Unix-like platform (using PowerShell Core):
sh -c "printf %s 'string' | openssl dgst -sha256 -hmac authcode"
printf %s is the portable alternative to echo -n. If the string contains ' chars., double them or use `"...`" quoting instead.
In case you need to do this on Windows via cmd.exe, things get even trickier, because cmd.exe doesn't directly support echoing without a trailing newline:
cmd /c "<NUL set /p =`"string`"| openssl dgst -sha256 -hmac authcode"
Note that there must be no space before | for this to work. For an explanation and the limitations of this solution, see this answer.
Encoding issues would only arise if the string contained non-ASCII characters and you're running in Windows PowerShell; in that event, first set $OutputEncoding to the encoding that the target utility expects, typically UTF-8: $OutputEncoding = [Text.Utf8Encoding]::new()
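For example, on a Unix-like platform the difference is easy to see (a minimal sketch; openssl is assumed to be on the PATH and authcode is just a placeholder key):
# PowerShell's own pipeline: the external program receives "string" plus a trailing newline
'string' | openssl dgst -sha256 -hmac authcode
# Native shell pipeline: the external program receives exactly "string", so the hash matches Bash's echo -n
sh -c "printf %s 'string' | openssl dgst -sha256 -hmac authcode"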
PowerShell, as of Windows PowerShell v5.1 / PowerShell (Core) v7.2, invariably appends a trailing newline when you send a string without one via the pipeline to an external utility, which is the reason for the difference you're observing (that trailing newline will be a LF only on Unix platforms, and a CRLF sequence on Windows).
You can keep track of efforts to address this problem in GitHub issue #5974, opened by the OP.
Additionally, PowerShell's pipeline is invariably text-based when it comes to piping data to external programs; the internally UTF-16LE-based PowerShell (.NET) strings are transcoded based on the encoding stored in the automatic $OutputEncoding variable, which defaults to ASCII-only encoding in Windows PowerShell, and to UTF-8 encoding in PowerShell Core (both on Windows and on Unix-like platforms).
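As a hedged illustration of the above (the assignment is only needed in Windows PowerShell, and only matters for non-ASCII data):
$OutputEncoding                               # ASCII by default in Windows PowerShell, UTF-8 in PowerShell (Core)
$OutputEncoding = [Text.UTF8Encoding]::new()  # have the pipeline transcode strings to UTF-8 for external programs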
In PowerShell Core, a change is being discussed for piping raw byte streams between external programs.
The fact that echo -n in PowerShell does not produce a string without a trailing newline is therefore incidental to your problem; for the sake of completeness, here's an explanation:
echo is an alias for PowerShell's Write-Output cmdlet, which - in the context of piping to external programs - writes text to the standard input of the program in the next pipeline segment (similar to Bash / cmd.exe's echo).
-n is interpreted as an (unambiguous) abbreviation for Write-Output's -NoEnumerate switch.
-NoEnumerate only applies when writing multiple objects, so it has no effect here.
Therefore, in short: in PowerShell, echo -n "string" is the same as Write-Output -NoEnumerate "string", which - because only a single string is output - is the same as Write-Output "string", which, in turn, is the same as just using "string", relying on PowerShell's implicit output behavior.
Write-Output has no option to suppress a trailing newline, and even if it did, using a pipeline to pipe to an external program would add it back in.
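In other words, all of the following send the same bytes, including the appended trailing newline, to the external program (a sketch restating the equivalences above):
echo -n "string" | openssl dgst -sha256 -hmac authcode
Write-Output -NoEnumerate "string" | openssl dgst -sha256 -hmac authcode
Write-Output "string" | openssl dgst -sha256 -hmac authcode
"string" | openssl dgst -sha256 -hmac authcode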

Linux terminals and PowerShell use different encodings, so the actual bytes produced by echo -n "string" differ. I tried it on my Linux Mint terminal and in Windows 10 PowerShell. Here is what I got:
Linux Mint:
73 74 72 69 6E 67
Windows 10:
FF FE 73 00 74 00 72 00 69 00 6E 00 67 00 0D 00 0A 00
It seems that Linux terminals use UTF-8 and Windows PowerShell uses UTF-16 with a BOM. Also, in PowerShell the -n parameter of echo does not suppress the newline, so echo places the newline characters \r\n (0D 00 0A 00) at the end of "string".
Edit: As mklement0 said below, Windows PowerShell uses ASCII by default when piping.
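For reference, one way to reproduce byte dumps like the ones above (hedged; these exact commands are assumed, not taken from the original test). Note that > redirection in Windows PowerShell writes UTF-16LE with a BOM by default, which is where the FF FE prefix, the interleaved 00 bytes and the 0D 00 0A 00 line ending come from:
# Linux (Bash):
echo -n "string" | od -An -tx1
# Windows 10 PowerShell:
echo -n "string" > out.txt; Format-Hex out.txt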

Related

display accented characters on mintty (cygwin) console?

In a nutshell, I would like to be able to type and display characters from ISO-8859-1 on my Cygwin mintty. Unfortunately I haven't figured out how to do this.
My locale:
$ locale
LANG=C.ISO-8859-1
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C
mintty is configured as an xterm (although it seems to make no difference what terminal emulation I choose), and through Options => Text I have configured the 'Locale' section as C and the character set as ISO-8859-1.
When I type any accented character on my keyboard, the character does not display on the terminal. However, if I invoke cat, the characters I type display correctly. Also, when I edit using vi (well, vim, actually), I am able to type (and display) accented characters without problems. So the problem seems to have something to do with the shell and not with the terminal emulation itself.
Furthermore, if I write a little script to make a file named, for example, être.utx, the file displays as ???tre.utx when I ls it. Looking at its hex, I get
$ ls *.utx | od -c -tx1
0000000 357 203 252 t r e . u t x \n
ef 83 aa 74 72 65 2e 75 74 78 0a
0000013
So it seems the script I wrote is creating a file whose name begins with the three-byte sequence 0xEF 0x83 0xAA, rather than the single-byte character whose encoding should be 0xEA. I don't know how to interpret this; I know it isn't UTF-8, which would be 0xC3 0xAA.
It appears there is only one character set in my Cygwin configuration that is configured to support 8859-1: Norwegian. [Of course, I suppose I could learn Norwegian, but I would prefer something a bit less strenuous, if possible...]
In any case, does anyone have an idea what I am doing wrong?
Many thanks in advance.
Just set mintty's locale to something utf8-ish.
In my case:
Window Menu (Alt+Space)
Options… (o)
Text (l.h. panel)
Locale → en_GB
Character Set → UTF-8
[Save]
Quit and restart
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=
$ echo $'\u2154'
⅔
Nice
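A quick sanity check that the shell and terminal now agree on UTF-8 (a hedged example; être is just a sample string):
$ printf 'être' | od -An -tx1
 c3 aa 74 72 65
ê is now the two-byte UTF-8 sequence c3 aa, so ls and the shell should display it correctly.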

Git messes up with non-ascii characters on Linux container

I have a .Net Core (C#) project with the following line in one of the classes:
var input = "£";
But when I do a git clone in a Docker container (microsoft/dotnet:2.2-sdk) it messes it up and displays it as � (in bash using cat).
And when I run it, its UTF-8 bytes are [239, 191, 189] = [EF, BF, BD], which is the so-called Unicode replacement character.
The Windows editor I use is VS 2017, but the character is displayed properly on other Windows machines and parsed properly by dotnet run/test, so I don't think the problem is that the character was saved incorrectly.
Any ideas why I am seeing such a mess and how to solve it?
Some details
I get bytes using Encoding.UTF8.GetBytes("£");
It works perfectly well on a Windows 10 machine
Linux version is Debian GNU/Linux 9 (stretch), according to cat /etc/os-release
locale -a returns C C.UTF-8 POSIX
On Windows, Notepad++ reports the file as ANSI when opened, and the character is displayed correctly.
Running fgrep 'var input' file.cs | od -tx1 -c
0000100 76 61 72 20 69 6e 70 75 74 20 3d 20 22 a3 22 3b
v a r i n p u t = " 243 " ;
Your file contains a single byte a3, which is the Windows-1252 encoding of the character £. Your Linux system displays � because a lone a3 byte is not valid UTF-8.
You should configure Visual Studio to use UTF-8 instead of Windows-1252.
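If you also need to repair an already-cloned copy from the Linux side, a conversion along these lines should work (a hedged sketch; re-saving the file as UTF-8 in Visual Studio remains the proper fix):
iconv -f WINDOWS-1252 -t UTF-8 file.cs > file.cs.utf8 && mv file.cs.utf8 file.cs
fgrep 'var input' file.cs | od -tx1 -c    # £ should now show up as the two bytes c2 a3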

Why is \n appended to filename in shell?

Why does this shell script add a return to the filename of the output file?
#!/bin/bash
/usr/bin/tail -n 1 /path/logchanged.csv >> "/path/logcontatenated.csv"
The filename is not called "logcontatenated.csv", but "logcontatenated.csv
"
I really can't find on the internet why this happens.
Could it be that you created that script using Windows? If the line ends in \r\n without trailing spaces the file name is interpreted as logcontatenated.csv\r. Try hd yourscript.sh to display a hexdump of your script. Line breaks should be only a single byte of 0a rather than two bytes of 0d 0a, i.e. make sure the byte before any 0a is NOT 0d. You could use dos2unix yourscript.sh to fix your script. You might need to install dos2unix first.
EDIT: Replaced 0c with 0d.
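A quick check and fix (hedged; both utilities are standard but may need to be installed):
file yourscript.sh        # reports "... with CRLF line terminators" if the script is DOS-formatted
dos2unix yourscript.sh    # convert the line endings to Unix-style LF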

Why there are written randomly null characters in some of my output files?

I have some scripts on my Red Hat server that run Micro Focus COBOL programs, which generate a huge file of approximately 3 GB in about 3 hours on average. The programs write their output files directly to the directory /my_test/files/.
The problem is that sometimes (randomly) some of the generated files contain sections of null characters in the middle of the file. When I check them and re-execute the script (with the same input parameters), the output file is generated perfectly (it doesn't contain any null characters). I've checked this many times and I'm fairly sure it is not the fault of the COBOL programs (they use quite simple operations). The folder is at 40% of its capacity.
Some programs update the database, and if they finish with return code 0 the changes are committed; I don't have any backup, which is why I'm doing this.
This is an example of a file declaration of one of the problematic COBOL programs:
FILE-CONTROL.
SELECT MYFILE
ASSIGN TO MYFILE
ORGANIZATION IS SEQUENTIAL
ACCESS MODE IS SEQUENTIAL
FILE STATUS IS FILE-STATUS.
DATA DIVISION.
FILE SECTION.
FD MYFILE
LABEL RECORD STANDARD
RECORDING MODE F.
01 REG-OUTPUT PIC X(400).
I've also checked for the nulls in the COBOL programs before the null files appear, but unfortunately no nulls were spotted.
Then I thought about creating a crontab entry which executes the following script every 5 seconds:
if [[ -f /tmp/sorry_im_working ]]; then
exit
fi
trap 'rm -rf /tmp/sorry_im_working' EXIT
touch /tmp/sorry_im_working
lsof | awk 'BEGIN{
sfiles="";
} {
if($1=="PROGRAM" && $9~/my_test\/files/){
sfiles=sfiles" "$9
}
}END{
comm="find "sfiles" -newermt \x27-2 seconds\x27 -exec env LC_ALL=C bash -c \x27grep -Pq \x22\x5Cx00{200}\x22 <(tail -c 1000 {}) && echo {}\x27 \x5C\x3B";
while(comm | getline sout){
print sout;
};
close(comm);
}' >> /home/ouhma/nullfiles.txt
Therefore, I would like to ask you the following questions:
Any idea of what's going on here?
Do you have any other way to detect the most recently modified files?
What other information of interest could I add to my log?
If you construct a file d containing only the literal text \x00:
hexdump -C d
00000000 5c 78 30 30 0a |\x00.|
00000005
and you run:
grep -Faq '\x00' d;echo $?
0
But there is no null character inside d.
It is probably better to use grep -Paq '\x00'.
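To illustrate the difference with a file that really does contain a NUL byte (a hedged sketch):
printf 'a\0b' > d2
grep -Faq '\x00' d2; echo $?    # prints 1: -F searches for the literal four characters \x00
grep -Paq '\x00' d2; echo $?    # prints 0: -P interprets \x00 as an actual NUL byte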
Depending on the configuration and the record structure used for the file, MF will pad with different characters, including hex null.
Please copy the 'ASSIGN' clause and the 'FD' clause of the COBOL program.
BTW: if your COBOL programs take three hours to do some calculations and write 3 GB of data, you should investigate the storage and/or get a COBOL programmer to check the programs; that sounds much too slow.
I suspect you have non-printable characters in your file; the null inserts can be controlled, take a look at the INSERTNULL file configuration option.

Why is ^M being added to a script.r after modifying with Meld?

Esteemed Meld and Emacs/ESS users,
What I did:
Create a script.r using Emacs/ESS.
Make some modifications to script.r by pulling some lines of code from another_script.r
Reopen another_script.r (or script.r) in Emacs/ESS to continue working.
All the lines in another_script.r which were not pushed to script.r end with ^M
Sometimes it's the other way around: only the line that was pushed/pulled ends with ^M's. So far I haven't isolated exactly which action determines where the ^M's are placed. Either way I still end up with ^M's all over the place and I'd like to avoid getting them after using Meld!
FWIW: the directory is being synced by Dropbox; in Meld, Preferences > Encoding tab, "utf8" is entered in the text box; all actions are performed under Linux (Ubuntu 12.04) with Meld v1.5.3 and Emacs v23.3.1.
My current workaround is running dos2unix /path/to/script.r in a terminal, which strips the ^Ms. But this shouldn't be necessary and I'm hoping someone here can tell me how to avoid them.
Cheers.
In a terminal I ran cat script.r | hexdump -C | head and, amongst the output returned, found a 0d 0a, which is DOS formatting for a new line (carriage return 0d immediately followed by a line feed 0a). I ran the same command on the another_script.r I was merging with, but only observed 0a, no 0d 0a, indicating Unix formatting.
To check further whether this was the source of the ^M line endings, script.r was converted to Unix formatting via dos2unix script.r, and I verified that 0d 0a was converted to 0a using hexdump -C as above. I then performed a merge using Meld, attempting to replicate the process that had yielded ^M line endings in my scripts. I re-opened both files in Emacs/ESS and found no ^M line endings. Short of converting script.r back to DOS formatting and repeating the above procedure to see whether the ^M line endings re-appear, I believe I've solved my ^M issue, which simply is that, unbeknownst to me, one of my files was DOS-formatted. My take-home message: in a Windows-dominated environment, never assume that one's personal Linux environment doesn't contain DOS bits. Or line endings.
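For reference, an equivalent check and fix that does not need dos2unix (hedged; GNU sed assumed):
grep -c $'\r' script.r      # a non-zero count means CR (^M) characters are still present
sed -i 's/\r$//' script.r   # strip the trailing carriage returns in place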
