Configure gitg to use UTF-8 - Linux

gitg doesn't correctly display the git diff of UTF-8 files, even though
git itself does (as seen on the console).
gitg displays UTF-8 files themselves correctly, just not their diffs.
Is it possible to configure it to correctly display diffs of UTF-8 encoded files? If so, how?
EDIT:
barti_ddu helped me realize that what seems to happen is that gitg guesses the encoding from the diff it receives.
I run into this when I replace badly encoded characters with correctly encoded ones: the first character in the diff is the bad one, which probably leads to a bad guess (and gives the impression that I'm replacing well-encoded characters with bad ones).
So the (less important) goal would be to force gitg to decode the diff as UTF-8 instead of guessing.

Related

How to make git/java correctly deal with UTF-8 filenames on an ISO-8859-1 Linux system?

I have a repository where several files have been checked in from Windows and have Unicode characters in the FILENAMES. For example AgêBean.java, GûBean.java, LêgbaBean.java, and XêviosoBean.java.
When these files are checked out on a CentOS 7 system, the bytes comprising the filenames are interpreted as ISO-8859-1. This breaks things like the Java compiler. For example, Java won't compile the above files, because the Unicode identifier for the class, i.e. "AgêBean", does not match the ISO-8859-1 filename, which the compiler sees as "AgêBean.java".
The short, ugly workaround is to rename the files, but if the renamed files are checked in, the same problem appears on the Windows side.
So what are some better solutions? I can imagine a few, but I don't know how to do any of them, and Google is not yet being helpful:
A) Re-configuring my CentOS filesystem so that all filenames are UTF-8 (or UTF-16) encoded.
B) Configuring git on Linux to understand that the filenames in the repository are encoded UTF-8, but the local system is ISO-8859-1, so all filenames need to be converted when checked in or out.
C) Configuring java (and terminals, and editors) on Linux to understand that the filenames under this directory are UTF-8 encoded, so each is decoded correctly.
I'd be happiest with solution "A", but so far I have not found how to do that. I hope it's not compiled into the CentOS 7 (or RHEL 8) kernel.
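For what it's worth, a quick way to see which bytes git actually wrote to disk (before deciding between A, B and C) is to list the checkout in binary mode. A minimal Python sketch, where the directory path is just a placeholder:

import os

# List the directory as raw bytes so the ISO-8859-1 locale never gets a
# chance to (mis)interpret the names.
for raw in os.listdir(b"src/main/java"):  # hypothetical path
    try:
        as_utf8 = raw.decode("utf-8")
    except UnicodeDecodeError:
        as_utf8 = "<not valid UTF-8>"
    as_latin1 = raw.decode("iso-8859-1")  # Latin-1 decoding never fails
    print(raw, "| UTF-8:", as_utf8, "| ISO-8859-1:", as_latin1)

If the names decode cleanly as UTF-8 (AgêBean.java and friends should), the bytes in the repository are fine and the problem is purely one of interpretation, which suggests that option "A" amounts to running the compiler and shell under a UTF-8 locale (e.g. LANG=en_US.UTF-8) rather than changing the filesystem or the kernel.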

Preset encoding for Search-in-Files Feature

I have a huge file dump to handle (7000+ files), all encoded in OEM-US (and I need them to remain OEM-US, or return to OEM-US, when I'm done).
The search-in-files feature of Notepad++ would actually solve all my problems. (It's a single-use job - I don't want to bore you with the details, but it's about sanitizing old code that has partially been written in foreign languages like German or French, including their notorious characters like äöüèéàç.)
The thing is: most of the time, Notepad++ detects the wrong encoding, and different encodings for different files. Usually it detects ANSI or UTF-8, but sometimes it gets exotic and all of a sudden my files are supposedly encoded in Shift-JIS or Big5, which messes up my search terms, as the misdetected encodings sometimes turn different special characters into the same set of replacement characters.
So I'm looking for a way to either
a) tell Notepad++ which encoding to use for the "search in files" job I want to run,
b) convert all files to UTF-8, run the search-and-replace job there, and restore the encoding to OEM-US afterwards (a sketch of this approach follows below),
or
c) find a different piece of software to handle this for me.
Can someone help me?
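Since Notepad++ apparently can't be pinned to a single encoding for the whole job, option b) can also be scripted outside Notepad++. Below is a minimal, hedged sketch in Python, assuming OEM-US corresponds to code page 437 and that the replacements can be expressed as plain string substitutions; the directory, file pattern and replacement table are placeholders:

from pathlib import Path

REPLACEMENTS = {"ä": "ae", "ö": "oe", "ü": "ue"}  # placeholder substitution table

for path in Path("filedump").rglob("*.txt"):  # placeholder location and pattern
    text = path.read_bytes().decode("cp437")  # decode explicitly as OEM-US
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    path.write_bytes(text.encode("cp437"))  # write back as OEM-US

Because the encoding is stated explicitly on both the read and the write, nothing is guessed, and the round trip back to OEM-US is lossless as long as the replacement text stays within code page 437.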

How to convert "binary text" to "visible text"?

I have a text file full of non-ASCII characters.
I cannot detect the encoding with either file or enca.
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Notepad++ on Windows.
Edit: the wording above is misleading. Sorry for that.
In fact, I picked some parts of the original file, put them into a new text file, and then opened it in Notepad++.
The two parts are shown below. They are decoded in two different ways by Notepad++.
Questions:
How can I detect the file's encoding under Linux?
How do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get a result from grep 'сойдя' win.txt, even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.
The file content slice as follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
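As a quick sanity check of that guess (a small aside, not part of the original answer), decoding the first dumped sequence as CP1251 in a Python shell yields exactly the word being grepped for:

>>> bytes([0xF1, 0xEE, 0xE9, 0xE4, 0xFF]).decode("cp1251")
'сойдя'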
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should now work on utf8.txt.
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
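If you want to script that detection step, chardet can be combined with an explicit decode. A minimal sketch, assuming the chardet package is installed and treating its guess as a hint to verify rather than a certainty:

import chardet

raw = open("non_ascii.txt", "rb").read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1251', 'confidence': 0.9, ...}
print("detected:", guess)

# Decode with the guessed encoding and re-encode as UTF-8 so that grep and
# other locale-aware tools can work with the text.
text = raw.decode(guess["encoding"])
open("utf8.txt", "w", encoding="utf-8").write(text)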

json.dump to a gzip file in python 3

I'm trying to write an object to a gzipped JSON file in one step (minimizing code, and possibly saving memory). My initial idea (Python 3) was this:
import gzip, json
with gzip.open("/tmp/test.gz", mode="wb") as f:
    json.dump({"a": 1}, f)
This, however, fails with TypeError: 'str' does not support the buffer interface, which I think has to do with a string not being encoded to bytes. So what is the proper way to do this?
Current solution that I'm unhappy with:
Opening the file in text-mode solves the problem:
import gzip, json
with gzip.open("/tmp/test.gz", mode="wt") as f:
    json.dump({"a": 1}, f)
However, I don't like text mode. In my mind (and maybe this is wrong, but supported by this), text mode is used to fix line endings. That shouldn't be an issue because JSON doesn't have line endings, but I don't like it (possibly) messing with my bytes, it is (possibly) slower because it's looking for line endings to fix, and (worst of all) I don't understand why something about line endings fixes my encoding problem.
Off topic: I should have dug into the docs a bit further than I initially did.
The python docs show:
Normally, files are opened in text mode, that means, you read and write strings from and to the file, which are encoded in a specific encoding (the default being UTF-8). 'b' appended to the mode opens the file in binary mode: now the data is read and written in the form of bytes objects. This mode should be used for all files that don’t contain text.
I don't fully agree that the result of a JSON encode is a string (I think it should be a sequence of bytes, since the format explicitly specifies that it uses UTF-8 encoding), but I had noticed this before. So I guess text mode it is.
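If staying in binary mode matters to you, one common alternative (not taken from the docs quoted above, just a pattern sketched here) is to serialize to a string first and encode it explicitly, so gzip only ever sees bytes:

import gzip, json

# Encode the JSON text explicitly; gzip then deals with bytes only.
payload = json.dumps({"a": 1}).encode("utf-8")

with gzip.open("/tmp/test.gz", mode="wb") as f:
    f.write(payload)

This builds the whole document in memory before writing, which is fine for small objects; for very large ones, the text-mode json.dump approach streams the output and is probably the better trade-off despite the line-ending worry.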

Troubles with text encoding

I'm having some troubles with text encoding. Parsing a website gives me a Data.Text string
"Project - Fran\195\167ois Dubois",
which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a ByteString. The problem is that this yields garbled output:
"Project - François Dubois".
What am I missing here?
If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.
That's inconvenient because Data.Text[.Lazy] is supposed to be UTF-16 encoded text, and the two code units 195 and 167 are interpreted as the Unicode code points 195 and 167, which are 'Ã' and '§' respectively. If you UTF-8-encode the text, these are converted to the byte sequences c3 83 ([195, 131]) and c2 a7 ([194, 167]) respectively.
The most likely way for getting into this situation is that the data you got from the website was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin 1) encoded (or another 8-bit encoding; 8859-15 is widespread too).
The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].
If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
If avoiding the situation is not possible, the easiest ways of fixing it are
replacing the occurrences of the offending sequence with the desired one before encoding:
encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
assuming everything else is ASCII or inadvertent UTF-8 too, interpret the Text code units as bytes:
Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.
I am not completely sure if less can show UTF-8 encoded characters properly. GVim can. You can check this link on SO to find out how you can view UTF-8 data in gVim.
And regarding the other issue of being able to pass this to graphviz, I think you need to set the encoding on the command-line as explained in the Graph NonAscii FAQ.
From what you are explaining, I think there are no issues with how the data is being persisted. If you pass the encoding properly to graphviz, I think your problem will be resolved.
P.S.: Creating an answer since it is easier to create descriptive links.

Resources