I exported via mysqldump to a file. How do I find out the encoding of the file?

Given a text file on Ubuntu (or Debian, or Unix in general), how do I find out the encoding of the file? Can I run od or hexdump on it to fingerprint its encoding? What should I be looking out for?

There are many tools to do this. Try a web search for "detect encoding". Here are some of the tools I found:
The International Components for Unicode (ICU) libraries are a great place to start. See especially their page on Character Set Detection.
Chardet is a Python module to guess the encoding of a file. See chardet.feedparser.org.
The *nix command-line tool file detects file types and can also guess the character encoding of text files from their byte content. See man file.
The Perl modules Encode::Detect and Encode::Guess.
Someone asked a similar question on Stack Overflow. Search for the question "PHP: Detect encoding and make everything UTF-8". That's in the context of fetching files from the net and using PHP, but you could write a command-line PHP script.
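Of the tools above, chardet is probably the quickest to try from a shell: recent releases also install a small chardetect command-line wrapper, so you can get a guess (with a confidence score) without writing any Python. A minimal sketch, with dump.sql standing in for your mysqldump output file:
chardetect dump.sql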
Note well what the ICU page says about character set detection: "Character set detection is ..., at best, an imprecise operation using statistics and heuristics...." In my experience the problem domain makes a big difference in how easy or difficult the job is. Don't forget that it's possible that the octets in a file can be of ambiguous encoding, i.e. sensibly interpreted using multiple different encodings. They can also be of mixed encoding, i.e. different subsets of the octets make sense interpreted in different encodings. This is why there's not a single command-line tool I can recommend which always does the job.
If you have a single file and you just want to get it into a known encoding, my trick is to open the file with a text editor which can import using a bunch of different encodings, such as TextWrangler or OpenOffice.org. First, open the file and let the editor guess the encoding. Take a look at the result. If you aren't satisfied with it, guess an encoding, open the file with the editor specifying that encoding, and take a look at the result. Then save as a known encoding, e.g. UTF-16.
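The same "re-save in a known encoding" step can be done on the command line with iconv once you have settled on a guess. A minimal sketch, assuming the guess turned out to be WINDOWS-1252 and using dump.sql as a placeholder filename:
iconv -f WINDOWS-1252 -t UTF-8 dump.sql > dump.utf8.sql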

You can use enca. Enca is a small command-line tool for encoding detection and conversion.
You can install it on Debian/Ubuntu with:
apt-get install enca
In order to use it, just call
enca FILENAME
Also see the manpage for more information.
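Enca usually guesses better when told the language of the text, and it can also convert files. A rough sketch, assuming your build of enca supports the -L (language) and -x (convert to) options, with dump.sql as a placeholder filename; -L none asks for language-independent detection, and -x rewrites the file in place, so work on a copy:
enca -L none dump.sql
enca -L none -x UTF-8 dump.sql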

How does the Linux command file recognize the encoding of my files?
zell@ubuntu:~$ file examples.desktop
examples.desktop: UTF-8 Unicode text
zell@ubuntu:~$ file /etc/services
/etc/services: ASCII text
The man page is pretty clear:
The filesystem tests are based on examining the return from a stat(2)
system call...
The magic tests are used to check for files with data in particular
fixed formats. The canonical example of this is a binary executable
(compiled program) a.out file, whose format is defined in #include <a.out.h> and possibly #include <exec.h> in the standard include
directory. These files have a 'magic number' stored in a particular
place near the beginning of the file that tells the UNIX operating
system that the file is a binary executable, and which of several
types thereof. The concept of a 'magic' has been applied by extension
to data files. Any file with some invariant identifier at a small
fixed offset into the file can usually be described in this way. The
information identifying these files is read from the compiled magic
file /usr/share/misc/magic.mgc, or the files in the directory
/usr/share/misc/magic if the compiled file does not exist. In
addition, if $HOME/.magic.mgc or $HOME/.magic exists, it will be used
in preference to the system magic files. If /etc/magic exists, it will
be used together with other magic files.
If a file does not match any of the entries in the magic file, it is
examined to see if it seems to be a text file. ASCII, ISO-8859-x,
non-ISO 8-bit extended-ASCII character sets (such as those used on
Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded
Unicode, and EBCDIC character sets can be distinguished by the
different ranges and sequences of bytes that constitute printable text
in each set. If a file passes any of these tests, its character set is
reported.
In short, for regular files, their magic values are tested. If there's no match, then file checks whether it's a text file, making an educated guess about the specific encoding by looking at the actual values of bytes in the file.
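If you only want file's guess at the character set rather than the full description, reasonably recent versions of file can report just that:
$ file --mime-encoding examples.desktop
examples.desktop: utf-8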
Oh, and you can also download the source code and look at the implementation for yourself.
TLDR: Magic File Doesn't Support UTF-8 BOM Markers
(and that's the main charset you need to care about)
The source code is on GitHub so anyone can search it. After doing a quick search, things like BOM, ef bb bf, and feff do not appear at all. That means UTF-8 byte-order-mark detection is not supported: files made in other applications that use or preserve the BOM marker will be reported as "charset=unknown" by file.
In addition, none of the config files mentioned in the Magic File manpage are a part of magic file v. 4.17. In fact, /etc/magicfile/ doesn't exist at all, so I don't see any way in which I can configure it.
If you're stuck trying to get the ACTUAL charset encoding and magic file is all you have, you can determine if you have a UTF-8 file at the Linux CLI with:
hexdump -n 3 -C $path_to_filename
If the above returns the sequence ef bb bf, then you are 99% likely in possession of a BOM-marked UTF-8 file. This is not a 100% certainty, but it is far more useful than the magic file, which has no handling whatsoever for byte order marks.
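If you prefer od over hexdump, the same three-byte check (reusing the placeholder path from above) looks like this; a BOM-marked UTF-8 file prints ef bb bf:
od -An -tx1 -N3 $path_to_filename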

Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819

All our source code is valid UTF-8; however, some users on Windows cannot build it because their system is configured for a different encoding.
Without adding a BOM to source files, is it possible to tell MSVC to treat all source as UTF-8, irrespective of the users system encoding?
See the MSDN link regarding this topic (which requires adding a BOM header).
You can try:
add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")
By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you have specified a code page by using /utf-8 or the /source-charset option.
References
Visual C++ documentation, Build Reference: /utf-8 (Set Source and Executable character sets to UTF-8)
If you happen to write cross-platform code, solving the problem with a command-line switch, whether via
add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")
or by adding something like /utf-8 or /source-charset to the CFLAGS, might mean you have to do a similar thing for other platforms as well.
If possible, it might therefore be better to avoid the problem instead of solving it, by using a \uxxxx escape instead of a raw Unicode character in strings: this way the source specifies which Unicode characters to use, but doesn't actually contain them.
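If you go that route, a quick way to find the raw Unicode characters that would need to be rewritten as \uxxxx escapes is GNU grep with PCRE support (-P); the src/*.cpp glob is just an example of where your sources might live:
grep -P -n '[^\x00-\x7F]' src/*.cpp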

How to get list of programs which can open a particular file extension in Linux?

Basically I am trying to get a list of programs installed in Linux which can open a particular file extension, .jpg for example. If not all, at least the default program should be listed.
Linux (the kernel) has no knowledge of any file-type-to-application mapping. If you want to use GNOME programs you can look at https://people.gnome.org/~shaunm/admin-guide/mimetypes-7.html. For KDE there is another mechanism. Each toolkit can define it as it likes, and the programmer can use the defaults or not. So it is simply application-specific!
What do you want to achieve?
If you (double-)click on an icon or file name in an explorer/browser application, it is exactly that explorer/browser which looks up the file type. Typically this is realized via a MIME-type dictionary. But how a program determines the file type and perhaps executes another program is entirely up to the programmer who wrote it. GUI toolchains like GNOME and KDE have a lot of support for this topic, so you get basic conformity within each family of applications.
If you want to know how an application does the job, start it with strace. But it is quite hard to dig into the huge amount of data.
You can also take a look at xdg-open. Many programs use this helper to start applications. As an example: if you start Dolphin with strace, you will find a line like lstat64("/etc/xdg", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 after clicking on a file.
You can run it from the command line with:
xdg-open <file-name>
You may also want to have a look at the applications which register for file types: /usr/share/applications/*.desktop
In each desktop file you can find the MIME types which are registered for the application. E.g. for Audacity it is:
MimeType=application/x-audacity-project;audio/flac;audio/x-flac;audio/basic;audio/x-aiff;audio/x-wav;application/ogg;audio/x-vorbis+ogg;
For your example with jpg:
$ xdg-mime query filetype <any-jpg-file>
image/jpeg
$ grep 'image/jpeg' -R /usr/share/applications/*
...
/usr/share/applications/mimeinfo.cache:image/jpeg2000=kde4-kolourpaint.desktop;gimp.desktop;
So you can see that GIMP is one of the applications registered to handle JPEG files.
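To see only the default handler rather than everything registered, xdg-mime can also query that directly; it prints the .desktop file name of the default application, which will vary from system to system:
$ xdg-mime query default image/jpeg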
The place to start looking is the mailcap file (/etc/mailcap) and the MIME types, e.g. in /etc/mime.types on Debian (the filename and path will vary according to who provides it).
The mailcap file gives some rules for opening a file, while mime.types lists the known file types, each with a tag that lets multiple applications recognize those files.
Except for embedded or reduced-functionality systems (such as those based on busybox), you would find these files on almost every UNIX-like system.

How can I merge these 3,500 mixed-charset text files?

I have about 3,500 text files, of mixed character sets: ISO-8859, UTF-8, ASCII, UTF-16, and maybe others.
I want to merge them all into one Unicode text file so that I can run a Python script, which expects a single Unicode file, on it.
If I use cat, it doesn't exactly work.
What is the best way to solve this?
You could convert them up-front with a tool like iconv, or load them into Python with the correct encoding (by passing the correct encoding to open).
If you don't know what the encoding of each file is, then it is more complicated, because you'll need to detect the encoding of each file. There are many heuristics, but no absolutely standard way to do this. Again, using iconv can help a lot here.
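If you do want to lean on those heuristics, here is a rough sketch as a shell loop, assuming the inputs match *.txt, that file's --mime-encoding guesses are names iconv accepts, and that merged-utf8.txt is the output you want; misdetected files will still need manual attention:
for f in *.txt; do
  enc=$(file -b --mime-encoding "$f")
  iconv -f "$enc" -t UTF-8 "$f"
done > merged-utf8.txt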

How to convert old Japanese text encodings?

I run a MacBook Pro under 10.6.7, and I am competent in Unix. I have old Japanese text files in various encodings (EUC, SJS, New-JIS) that I can no longer read or display. The old program jconv.c does not help, since it only converts among these encodings. Is there a way to convert them (or any one of them) to the current "normal" Japanese text that can be seen in TextEdit, etc.? I have set the Terminal preferences to SJS and EUC (can't find New-JIS), among others, including UTF-8. Eleanor
I recommend you look into iconv for doing such conversions.
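For example, assuming the usual iconv names for those encodings (EUC-JP for EUC, SHIFT_JIS for SJS, and ISO-2022-JP for what is normally meant by New-JIS; run iconv -l to check the exact spellings on your system), and with placeholder filenames:
iconv -f EUC-JP -t UTF-8 old-euc.txt > old-euc.utf8.txt
iconv -f SHIFT_JIS -t UTF-8 old-sjs.txt > old-sjs.utf8.txt
iconv -f ISO-2022-JP -t UTF-8 old-jis.txt > old-jis.utf8.txt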
nkf is a Linux command-line program which can meet your requirement.
I am sure it is available in Debian, and you can also download the source code from the net.
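A rough sketch with a placeholder filename: nkf can guess the input encoding among JIS/Shift_JIS/EUC-JP by itself, -g only prints its guess, and -w asks for UTF-8 output:
nkf -g old-japanese.txt
nkf -w old-japanese.txt > old-japanese.utf8.txt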
