I have a .Net Core (C#) project with the following line in one of the classes:
var input = "£";
But when I do a git clone in a Docker container (microsoft/dotnet:2.2-sdk), the character gets mangled and displays as � (in bash, using cat).
And when I run it, its UTF-8 bytes are [239, 191, 189] = [EF, BF, BD], which is the Unicode replacement character (U+FFFD).
The Windows editor I use is VS 2017, but the character is displayed properly on other Windows machines and parsed properly by dotnet run/dotnet test, so I don't think the file is being saved incorrectly.
Any ideas why I am seeing such a mess and how to solve it?
Some details
I get bytes using Encoding.UTF8.GetBytes("£");
It works perfectly well on Windows 10 machine
Linux version Debian GNU/Linux 9 (stretch) from the cat /etc/os-release
locale -a returns C C.UTF-8 POSIX
On Windows, Notepad++ reports the file as ANSI when opened, and the character is displayed correctly.
Running fgrep 'var input' file.cs | od -tx1 -c
0000100 76 61 72 20 69 6e 70 75 74 20 3d 20 22 a3 22 3b
         v   a   r       i   n   p   u   t       =       "  243   "   ;
Your file contains the single byte a3, which is the Windows-1252 encoding of the character £. Your Linux system displays � because a lone a3 byte is not valid UTF-8.
You should configure Visual Studio to save the file as UTF-8 instead of Windows-1252.
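If re-saving from Visual Studio isn't convenient, the file can also be converted on the Linux side. A minimal sketch, assuming iconv is available; input.cs is a stand-in for the real source file:

```shell
# Simulate a file saved as Windows-1252 (0xA3 is the CP1252 byte for £).
printf 'var input = "\243";\n' > input.cs
# Convert it to UTF-8.
iconv -f WINDOWS-1252 -t UTF-8 input.cs > input.utf8.cs
# Inspect the bytes: the pound sign is now the two-byte UTF-8 sequence c2 a3.
od -An -tx1 input.utf8.cs
```

After a conversion like this, cat in the container displays £ correctly, and Encoding.UTF8.GetBytes returns [C2, A3] instead of the replacement character.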
Related
In a nutshell, I would like to be able to type and display characters from ISO-8859-1 in my Cygwin mintty. Unfortunately I haven't figured out how to do this.
My locale:
$ locale
LANG=C.ISO-8859-1
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C
Mintty is configured as an xterm (although it seems to make no difference which terminal emulation I choose), and through Options => Text I have configured the 'Locale' setting as C and the character set as ISO-8859-1.
When I type any accented character on my keyboard, the character does not display in the terminal. However, if I invoke cat, the characters I type display correctly. Also, when I edit using vi (well, vim, actually), I am able to type (and display) accented characters without problems. So the problem seems to have something to do with the shell and not with the terminal emulation itself.
Furthermore, if I write a little script to make a file named, for example, être.utx, the file displays as ???tre.utx when I ls it. Looking at its hex, I get
$ ls *.utx | od -c -tx1
0000000 357 203 252 t r e . u t x \n
ef 83 aa 74 72 65 2e 75 74 78 0a
0000013
So it seems the script I wrote is creating a file whose name begins with the three-byte sequence 0xEF 0x83 0xAA, rather than the single byte 0xEA that should encode ê. I don't know how to interpret this; I know it isn't UTF-8, which would be 0xC3 0xAA.
It appears there is only one character set in my Cygwin configuration that is set up to support 8859-1: Norwegian. [Of course, I suppose I could learn Norwegian, but I would prefer something a bit less strenuous, if possible...]
In any case, does anyone have an idea what I am doing wrong?
Many thanks in advance.
Just set mintty's locale to something utf8-ish.
In my case:
Window Menu (Alt+Space)
Options… (o)
Text (l.h. panel)
Locale → en_GB
Character Set → UTF-8
[Save]
Quit and restart
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=
$ echo $'\u2154'
⅔
Nice
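As a quick sanity check after restarting mintty, you can inspect the raw bytes for a non-ASCII character; a small sketch using od from coreutils:

```shell
# Under a UTF-8 setup, £ travels as the two bytes c2 a3;
# a lone a3 would indicate Latin-1 / Windows-1252 somewhere in the chain.
printf '£' | od -An -tx1
```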
I have searched high and low for anyone asking a similar question. It does not seem to be a simple case of :set fileformat=dos or :set fileformat=unix.
Writing the file out with :set fileencoding=latin1 and :set fileformat=dos changed the file such that git diff reports all the lines as having ^M appended.
The code was originally happily existing as:
...
if (v == value32S)
{
...
I made the outrageously radical improvement to (which looks fine on the screen in vim):
...
if (v == value32S ||
v == value33)
{
...
But git diff to check for erroneous changes shows:
diff --git a/csettings.cpp b/csettings.cpp
index 1234..8901 100755
--- a/csettings.cpp
+++ b/csettings.cpp
@@ -2466,7 +2466,8 @@ bool MyClass::settingIsValid(QString s)
#if CONFIG_1 || CONFIG_2
- if (v == value32S)
+ if (v == value32S ||^M
+ v == value33)^M
{
doSomething(new_v);
where the ^M sequences are highlighted in reverse video.
I have tried several means to make the apparently spurious carriage returns go away. First was to be sure there wasn't a hidden character. View with vim :set list:
...
if (v == value32S ||$
v == value33)$
{$
...
Seems fine. Dumping the file (microdetails vary to protect NDA, and I am too lazy to make it a perfect deception):
$ hd csettings.cpp
(...)
0000eae0 xx xx xx xx xx xx xx xx xx 65 33 32 53 20 7c 7c |(v == value32S |||
0000eaf0 0d 0a 20 20 20 20 20 20 20 20 20 20 20 20 76 20 |.. v |
0000eb00 3d 3d 20 xx xx xx xx xx xx 65 33 33 29 0d 0a 20 |== ...value33).. |
All of the other lines also end in "0d 0a", so this looks fine. An interesting suggestion was to use cat -e (which was new to me):
$ cat -e c.cpp
...
if (v == value32S ||^M$
v == value33)^M$
{^M$
...
Another suggestion was to use file for clues:
$ file csettings.cpp
csettings.cpp: C source, UTF-8 Unicode text, with CRLF line terminators
Interestingly, this is the only file in this directory (of header files and cpp code) which isn't ASCII text. Some files have CRLF line terminators and some do not. Also, some show as C++ source and others as C source, which I assume isn't significant.
Deleting the file and git checkout to get a fresh copy also shows it as UTF-8, which I traced to having the degree symbol in some strings ("°F" and "°C") so UTF-8 doesn't seem to be an issue.
Still, I don't see why using vim to edit only these lines would cause this problem. Or maybe it isn't a problem? Any ideas?
----- Addendum -----
git config --get-regexp core.* shows
core.repositoryformatversion 0
core.filemode true
core.bare false
core.logallrefupdates true
By default, Git assumes that you're using Unix line endings in the repository and highlights carriage returns as trailing whitespace. However, by default, it highlights trailing whitespace only on new lines, since the goal is to avoid introducing new problems.
If you run git diff --ws-error-highlight=all, you'll see that there are also carriage returns on the lines being removed and on the context lines. If you don't want to see this, you can set core.whitespace to cr-at-eol, which will prevent it from being highlighted. There are no ill effects to this; it simply prevents carriage returns from being treated as trailing whitespace.
If you're planning on using this project on non-Windows systems, you should convert the line endings to Unix and use a .gitattributes file to specify the text attribute for text files so the line ending is automatically converted based on the operating system in use. This may be valuable even if your project is only used on Windows, since if someone has core.autocrlf set, you may end up with mixed line endings.
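A minimal .gitattributes along those lines might look like the following; the file patterns are illustrative, not taken from the project above:

```
# Let Git normalize everything it detects as text to LF in the repository,
# checking out native line endings per platform.
* text=auto
# Explicitly mark known types (illustrative patterns).
*.cpp text
*.h text
*.png binary
```

With this in place, the repository stores LF endings, so mixed CRLF/LF files stop appearing regardless of each contributor's core.autocrlf setting.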
Assuming you are using a Unix-based operating system:
Normally, ^M characters are not visible when using vi or the cat command.
You can see them using cat -v.
Eg.
cat -v <file_name>
To get rid of these characters, use the dos2unix command.
Eg.
dos2unix <file_name>
This will remove the ^M characters and save the result in the same file, so you don't have to create a temp file for the intermediate content.
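If dos2unix isn't installed, a GNU sed one-liner does the same in place; a sketch on a throwaway file:

```shell
printf 'a\r\nb\r\n' > demo.txt     # demo file with DOS (CRLF) endings
sed -i 's/\r$//' demo.txt          # strip the carriage returns in place (GNU sed's -i)
od -An -c demo.txt                 # no \r remains, only \n line feeds
```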
I have been comparing hash values between multiple systems and was surprised to find that PowerShell's hash values differ from those of other terminals.
Linux terminals (Cygwin, Bash for Windows, etc.) and Windows Command Prompt all show the same hash, whereas PowerShell shows a different hash value.
This was tested using SHA-256, but the same issue occurs with other algorithms like MD5.
Encoding Update:
Tried changing the PShell encoding but it did not have any effect on the returned hash values.
[Console]::OutputEncoding.BodyName
iso-8859-1
[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
utf-8
GitHub PowerShell Issue
https://github.com/PowerShell/PowerShell/issues/5974
tl;dr:
When PowerShell pipes a string to an external program:
It encodes it using the character encoding stored in the $OutputEncoding preference variable
It invariably appends a trailing (platform-appropriate) newline.
Therefore, the key is to avoid PowerShell's pipeline in favor of the native shell's, so as to prevent implicit addition of a trailing newline:
If you're running your command on a Unix-like platform (using PowerShell Core):
sh -c "printf %s 'string' | openssl dgst -sha256 -hmac authcode"
printf %s is the portable alternative to echo -n. If the string contains ' chars., double them or use `"...`" quoting instead.
In case you need to do this on Windows via cmd.exe, things get even trickier, because cmd.exe doesn't directly support echoing without a trailing newline:
cmd /c "<NUL set /p =`"string`"| openssl dgst -sha256 -hmac authcode"
Note that there must be no space before | for this to work. For an explanation and the limitations of this solution, see this answer.
Encoding issues would only arise if the string contained non-ASCII characters and you're running in Windows PowerShell; in that event, first set $OutputEncoding to the encoding that the target utility expects, typically UTF-8: $OutputEncoding = [Text.Utf8Encoding]::new()
PowerShell, as of Windows PowerShell v5.1 / PowerShell (Core) v7.2, invariably appends a trailing newline when you send a string without one via the pipeline to an external utility, which is the reason for the difference you're observing (that trailing newline will be a LF only on Unix platforms, and a CRLF sequence on Windows).
You can keep track of efforts to address this problem in GitHub issue #5974, opened by the OP.
Additionally, PowerShell's pipeline is invariably text-based when it comes to piping data to external programs; the internally UTF-16LE-based PowerShell (.NET) strings are transcoded based on the encoding stored in the automatic $OutputEncoding variable, which defaults to ASCII-only encoding in Windows PowerShell, and to UTF-8 encoding in PowerShell Core (both on Windows and on Unix-like platforms).
In PowerShell Core, a change is being discussed for piping raw byte streams between external programs.
The fact that echo -n in PowerShell does not produce a string without a trailing newline is therefore incidental to your problem; for the sake of completeness, here's an explanation:
echo is an alias for PowerShell's Write-Output cmdlet, which - in the context of piping to external programs - writes text to the standard input of the program in the next pipeline segment (similar to Bash / cmd.exe's echo).
-n is interpreted as an (unambiguous) abbreviation for Write-Output's -NoEnumerate switch.
-NoEnumerate only applies when writing multiple objects, so it has no effect here.
Therefore, in short: in PowerShell, echo -n "string" is the same as Write-Output -NoEnumerate "string", which - because only a single string is output - is the same as Write-Output "string", which, in turn, is the same as just using "string", relying on PowerShell's implicit output behavior.
Write-Output has no option to suppress a trailing newline, and even if it did, using a pipeline to pipe to an external program would add it back in.
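The effect of that single trailing newline on a digest is easy to demonstrate from any Unix shell; a sketch using sha256sum from coreutils (openssl dgst -sha256 yields the same digests):

```shell
printf 'string'   | sha256sum    # no trailing newline, like bash's echo -n
printf 'string\n' | sha256sum    # with a trailing LF, as PowerShell's pipeline sends on Unix
# The two digests differ, which accounts for the mismatch between shells.
```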
Linux terminals and PowerShell use different encodings, so the actual bytes produced by echo -n "string" are different. I tried it on my Linux Mint terminal and on Windows 10 PowerShell. Here is what I got:
Linux Mint:
73 74 72 69 6E 67
Windows 10:
FF FE 73 00 74 00 72 00 69 00 6E 00 67 00 0D 00 0A 00
It seems that Linux terminals use UTF-8 while Windows PowerShell uses UTF-16LE with a BOM. Also, in PowerShell the -n parameter does not make echo suppress the newline, so echo places the newline characters \r\n (0D 00 0A 00) at the end of "string".
Edit: As mklement0 explains, Windows PowerShell uses ASCII by default when piping.
I have a Perl script which works fine on Windows, but when I try to run it on Mac OS X it doesn't perform the full task. Can anyone explain CR+LF line endings? Why don't these line endings work on Unix, and what should I do with a Perl script written for Windows if I want to run it on Linux/Unix?
Say you have a file containing the following bytes:
61 62 63 0D 0A 64 65 66 0D 0A
By Windows definition of a text file, that file has two lines consisting of 61 62 63 and 64 65 66 plus the line terminator.
By unix's definition of a text file (including Mac's), that file has two lines consisting of 61 62 63 0D and 64 65 66 0D plus the line terminator.
Since the lines are different, of course they don't work the same.
If you are on a unix system and you want two lines consisting of 61 62 63 and 64 65 66 plus the line terminator, the file will need to consist of the following bytes:
61 62 63 0A 64 65 66 0A
You can use dos2unix to translate Windows text files into unix text files.
You could use perl to change windows line endings (\r\n) to line feeds (\n):
perl -pi -e 's/\r\n?/\n/g' your-script.pl
By making the newline character optional, this can also convert legacy mac line endings (\r) to line feeds. You could make it an alias:
alias fixlines="perl -pi -e 's/\r\n?/\n/g'"
Esteemed Meld and Emacs/ESS users,
What I did:
Create a script.r using Emacs/ESS.
Make some modifications to script.r by pulling some lines of code from another_script.r
Reopen another_script.r (or script.r) in Emacs/ESS to continue working.
All the lines in another_script.r which were not pushed to script.r end with ^M
Sometimes it's the other way around: only the line that was pushed/pulled ends with ^Ms. So far I haven't isolated exactly which action determines where the ^Ms are placed. Either way, I still end up with ^Ms all over the place, and I'd like to avoid getting them after using Meld!
FWIW: the directory is being synced by Dropbox; in Meld, under Preferences > Encoding, "utf8" is entered in the text box; all actions are performed under Linux (Ubuntu 12.04) with Meld v1.5.3 and Emacs v23.3.1.
My current workaround is running dos2unix /path/to/script.r in a terminal, which strips the ^Ms. But this shouldn't be necessary, and I'm hoping someone here can tell me how to avoid them.
Cheers.
In a terminal I ran cat script.r | hexdump -C | head and amongst the output found 0d 0a, which is DOS formatting for a new line (carriage return 0d immediately followed by line feed 0a). I ran the same command on the another_script.r I was merging with but observed only 0a, no 0d 0a, indicating Unix formatting.
To check further whether this was the source of the ^M line endings, I converted script.r to Unix formatting via dos2unix script.r and verified with hexdump -C, as above, that 0d 0a had become 0a. I then performed a merge using Meld, attempting to replicate the process which had yielded ^M line endings in my scripts. I re-opened both files in Emacs/ESS and found no ^M line endings. Short of converting script.r back to DOS formatting and repeating the above procedure to see if the ^Ms re-appear, I believe I've solved my ^M issue, which simply is that, unbeknownst to me, one of my files was DOS-formatted. My take-home message: in a Windows-dominated environment, never assume that one's personal Linux environment doesn't contain DOS bits. Or line endings.
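For anyone in a similar mixed environment, here is a small sketch for spotting DOS-formatted files in a tree before a merge (GNU grep; -l lists matching files, -I skips binaries):

```shell
printf 'dos line\r\n' > dos-example.txt      # a demo file with a CRLF ending
# List every file under the current directory that contains a carriage return:
grep -rlI "$(printf '\r')" .
```

Any file this lists is a candidate for dos2unix before you let Meld or Emacs touch it.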