How does the Linux command `file` recognize the encoding of my files?

How does the Linux command file recognize the encoding of my files?
zell@ubuntu:~$ file examples.desktop
examples.desktop: UTF-8 Unicode text
zell@ubuntu:~$ file /etc/services
/etc/services: ASCII text

The man page is pretty clear:
The filesystem tests are based on examining the return from a stat(2)
system call...
The magic tests are used to check for files with data in particular
fixed formats. The canonical example of this is a binary executable
(compiled program) a.out file, whose format is defined in #include <a.out.h>
and possibly #include <exec.h> in the standard include
directory. These files have a 'magic number' stored in a particular
place near the beginning of the file that tells the UNIX operating
system that the file is a binary executable, and which of several
types thereof. The concept of a 'magic' has been applied by extension
to data files. Any file with some invariant identifier at a small
fixed offset into the file can usually be described in this way. The
information identifying these files is read from the compiled magic
file /usr/share/misc/magic.mgc, or the files in the directory
/usr/share/misc/magic if the compiled file does not exist. In
addition, if $HOME/.magic.mgc or $HOME/.magic exists, it will be used
in preference to the system magic files. If /etc/magic exists, it will
be used together with other magic files.
If a file does not match any of the entries in the magic file, it is
examined to see if it seems to be a text file. ASCII, ISO-8859-x,
non-ISO 8-bit extended-ASCII character sets (such as those used on
Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded
Unicode, and EBCDIC character sets can be distinguished by the
different ranges and sequences of bytes that constitute printable text
in each set. If a file passes any of these tests, its character set is
reported.
In short: for regular files, file first tests them against its magic database. If there's no match, it then checks whether the file looks like a text file, making an educated guess about the specific encoding by looking at the actual byte values in the file.
Oh, and you can also download the source code and look at the implementation for yourself.
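Incidentally, if all you want is the encoding guess, file can print just that: the --mime-encoding option (or -i for the full MIME type) reports the charset its text tests settled on. The output looks roughly like this for the files from the question:
$ file --mime-encoding examples.desktop /etc/services
examples.desktop: utf-8
/etc/services:   us-ascii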

TLDR: Magic File Doesn't Support UTF-8 BOM Markers
(and that's the main charset you need to care about)
The source code is on GitHub so anyone can search it. After doing a quick search, things like BOM, ef bb bf, and feff do not appear at all, which means UTF-8 byte-order-mark detection is not supported. Files made in other applications that use or preserve the BOM marker will all be reported as "charset=unknown" by file.
In addition, none of the config files mentioned in the magic file manpage are part of magic file v. 4.17. In fact, /etc/magicfile/ doesn't exist at all, so I don't see any way to configure it.
If you're stuck trying to get the ACTUAL charset encoding and the file command is all you have, you can check for a BOM-marked UTF-8 file at the Linux CLI with:
hexdump -n 3 -C $path_to_filename
If the above returns the byte sequence ef bb bf, then you are 99% likely in possession of a BOM-marked UTF-8 file. This is not a 100% certainty, but it is far more useful than the magic file, which has no handling whatsoever for byte order marks.
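If you want to script that check, here is a minimal sketch (check_bom.sh and its argument are hypothetical names; it uses hexdump's -e format option to print the first three bytes with no separators):
#!/bin/sh
# check_bom.sh -- report whether a file starts with the UTF-8 BOM (ef bb bf)
first_bytes=$(hexdump -n 3 -e '3/1 "%02x"' "$1")
if [ "$first_bytes" = "efbbbf" ]; then
    echo "$1: UTF-8 with BOM"
else
    echo "$1: no UTF-8 BOM found"
fi
Invoke it as sh check_bom.sh yourfile.txt.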

Related

How make git/java correctly deal with UTF-8 filenames on ISO-8859-1 Linux system?

I have a repository where several files have been checked in from Windows and have Unicode characters in the FILENAMES, for example AgêBean.java, GûBean.java, LêgbaBean.java, and XêviosoBean.java.
When these files are checked out on a CentOS 7 system, the bytes comprising the filenames are interpreted as ISO-8859-1. This breaks things like the Java compiler: it won’t compile the above files, because the Unicode identifier of the class, i.e. “AgêBean”, does not match the ISO-8859-1 filename, which the compiler sees as “AgêBean.java”.
The short, ugly workaround is to rename the files, but if they are checked in, then the same problems appear on the Windows side.
So what are some better solutions? I can imagine a few, but I don’t know how to do any of them, and google is not yet being helpful:
A) Re-configuring my CentOS filesystem so that all filenames are UTF-8 (or UTF-16) encoded.
B) Configuring git on Linux to understand that the filenames in the repository are encoded UTF-8, but the local system is ISO-8859-1, so all filenames need to be converted when checked in or out.
C) Configuring java (and terminals, and editors) on Linux to understand that the filenames under this directory are UTF-8 encoded, so each is decoded correctly.
I’d be happiest with solution “A”, but so far I have not found out how to do that. I hope it’s not compiled into the CentOS 7 (or RHEL 8) kernel.
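For what it's worth, the Linux kernel stores filenames as opaque byte strings; which encoding they are interpreted in comes from each process's locale, not from anything compiled into the kernel, so option "A" is mostly a matter of session configuration. A rough sketch under that assumption (the locale name and project path are examples only):
# run the session in a UTF-8 locale so javac, terminals and editors decode filenames as UTF-8
export LANG=en_US.UTF-8
locale    # verify that the UTF-8 locale is active
# if some names on disk really are stored as ISO-8859-1 bytes, convmv can rewrite
# them to UTF-8 (it does a dry run by default; --notest applies the renames)
convmv -f iso-8859-1 -t utf-8 -r --notest path/to/project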

Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819

All our source code is valid UTF-8; however, some users on Windows cannot build it because their system is configured for a different encoding.
Without adding a BOM to source files, is it possible to tell MSVC to treat all source as UTF-8, irrespective of the users system encoding?
See MSDN's link regarding this topic (which requires adding a BOM header).
You can try:
add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")
By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you have specified a code page by using /utf-8 or the /source-charset option.
References
Visual C++ Documentation, Build Reference: /utf-8 (Set Source and Executable character sets to UTF-8)
If you happen to write cross-platform code, solving the problem with a command-line switch such as
add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")
or adding something like /utf-8 or /source-charset to the CFLAGS might mean you'll have to do a similar thing for the other platforms as well.
If possible, it therefore might be better to avoid the problem instead of solving it, by using a \uxxxx escape (e.g. "\u00e9" rather than a literal é) instead of a raw Unicode character in strings: this way the source specifies which Unicode characters to use, but doesn't actually contain them.

How to convert "binary text" to "visible text"?

I have a text file full of non-ASCII characters.
I cannot detect the encoding with either file or enca.
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Notepad++ on Windows.
Edit: the statement above is misleading; sorry for that.
In fact, I picked some parts of the original file and put them into a new text file, then opened it in Notepad++.
The two parts are shown below. Notepad++ decodes them in two different ways.
Question:
How can I detect the file's encoding under Linux?
How do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get a result with "grep 'сойдя' win.txt" even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.
A slice of the file content follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf8.txt now.
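Alternatively, if you'd rather search the original file without converting it, you can convert the pattern instead of the data; a sketch, assuming a UTF-8 terminal:
# encode the UTF-8 search term as CP1251 bytes and grep for those bytes
grep "$(printf 'сойдя' | iconv -f utf-8 -t cp1251)" non_ascii.txt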
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
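For the record, both chardet and uchardet come with command-line front ends, so you can get a quick guess without writing any Python; a sketch, assuming one of them is installed:
chardetect non_ascii.txt    # CLI shipped with the Python chardet package
uchardet non_ascii.txt      # standalone detector, packaged by most distributions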

Display binary vtk file in ascii format

I have a program which writes its output as a binary VTK file. Since it is a binary file, I am not able to read it directly. However, I would like to look at the contents of the file for my own understanding. Is there a way to view it in ASCII format (so that I can understand the data better)?
A hex-editor or dumper will let you view the contents of the file, but it will likely still be unreadable. As a binary file, its contents are meant to be interpreted by your machine / vtk, and will be nonsensical without knowing the vtk format specifications.
Common hex dumpers include xxd, od, or hexdump.
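For example, to peek at the beginning of the file (output.vtk is a placeholder for your own filename):
xxd -l 256 output.vtk                  # hex plus an ASCII column for the first 256 bytes
od -A x -t x1z -v output.vtk | head    # the same idea with od
If it is a legacy-format VTK file, the header lines (version string, title, and the ASCII/BINARY marker) are plain text, so a dump like this at least shows which parts are binary payload.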

I exported via mysqldump to a file. How do I find out the file encoding of the file?

Given a text file on Ubuntu (or Debian, or Unix in general), how do I find out its encoding? Can I run od or hexdump on it to fingerprint the encoding? What should I be looking out for?
There are many tools to do this. Try a web search for "detect encoding". Here are some of the tools I found:
The International Components for Unicode (ICU) libraries are a great place to start. See especially their page on Character Set Detection.
Chardet is a Python module to guess the encoding of a file. See chardet.feedparser.org
The *nix command-line tool file detects file types, but might also detect encodings if mentioned in the file (e.g. if there's a mime-type notation in the file). See man file
Perl modules Encode::Detect and Encode::Guess.
Someone asked a similar question in StackOverflow. Search for the question, PHP: Detect encoding and make everything UTF-8. That's in the context of fetching files from the net and using PHP, but you could write a command-line PHP script.
Note well what the ICU page says about character set detection: "Character set detection is ..., at best, an imprecise operation using statistics and heuristics...." In my experience the problem domain makes a big difference in how easy or difficult the job is. Don't forget that it's possible that the octets in a file can be of ambiguous encoding, i.e. sensibly interpreted using multiple different encodings. They can also be of mixed encoding, i.e. different subsets of the octets make sense interpreted in different encodings. This is why there's not a single command-line tool I can recommend which always does the job.
If you have a single file and you just want to get it into a known encoding, my trick is to open the file with a text editor which can import using a bunch of different encodings, such as TextWrangler or OpenOffice.org. First, open the file and let the editor guess the encoding. Take a look at the result. If you aren't satisfied with it, guess an encoding, open the file with the editor specifying that encoding, and take a look at the result. Then save as a known encoding, e.g. UTF-16.
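A quick command-line complement to the editor approach: iconv fails on byte sequences that are illegal in the source encoding, which makes it a handy way to confirm or rule out multi-byte candidates such as UTF-8. A sketch, with dump.sql standing in for your export:
# exits non-zero if the file contains bytes that are not valid UTF-8
iconv -f utf-8 -t utf-8 dump.sql > /dev/null && echo "valid UTF-8" || echo "not valid UTF-8"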
You can use enca. Enca is a small command-line tool for encoding detection and conversion.
You can install it on Debian/Ubuntu with:
apt-get install enca
In order to use it, just call
enca FILENAME
Also see the manpage for more information.
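Enca guesses much better when you tell it which language the text is in, and it can also convert the file once the encoding is known. A sketch, assuming Russian text (substitute your own language and filename):
enca -L russian FILENAME            # -L gives enca a language hint
enca -L russian -x UTF-8 FILENAME   # -x converts the file in place to UTF-8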

Resources