How to convert old Japanese text encodings? - text

I run a MacBook Pro under 10.6.7, and I am competent in Unix. I have old Japanese text files in various encodings (EUC, SJIS, New-JIS) that I can no longer read or display. The old program jconv.c does not help, since it only converts among these encodings. Is there a way to convert them (or any one of them) to the current "normal" Japanese text that can be seen in TextEdit, etc.? I have set the Terminal preferences to SJIS and EUC (I can't find New-JIS), among others, including UTF-8. Eleanor

I recommend you look into iconv for doing such conversions.

nkf is a Linux command-line program that can meet your requirement.
I am sure it is available in Debian, so you can also download the source code from the Net.
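For anyone who would rather script the conversion, here is a minimal Python sketch of the same idea as iconv or nkf. It is an illustration under assumptions, not a definitive tool: the function name to_utf8 and the candidate order are hypothetical, and the codec names are Python's own (iso2022_jp for "New-JIS", euc_jp, shift_jis). ISO-2022-JP is tried first because it is a 7-bit encoding whose escape sequences would otherwise decode without error as ASCII under the other two.

```python
# Candidate legacy encodings, tried in order (Python codec names).
# This list and its order are assumptions made for this sketch.
CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")


def to_utf8(raw, candidates=CANDIDATES):
    """Return (encoding_used, utf8_bytes) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            text = raw.decode(enc)  # strict mode: raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return enc, text.encode("utf-8")
    raise ValueError("none of the candidate encodings fit this data")


if __name__ == "__main__":
    sample = "日本語".encode("euc_jp")  # stand-in for the contents of an old file
    encoding, converted = to_utf8(sample)
    print(encoding, converted.decode("utf-8"))
```

A clean decode is only evidence, not proof, that the guess is right, so it is worth looking at the converted text before overwriting anything.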

Related

linux console how to change the codepage to dos cp437

I want to view some ANSI art on the Linux local console. (My setup: Raspberry Pi 3 / newest Raspbian, no X11.)
I've tried many different settings in raspi-config, dpkg-reconfigure console-setup, /etc files, and environment variables, but I've had no luck yet. Do I need a special PCF font to get it working?
A reliable way to enable it for remote terminals would also be great.
Thanks in advance.
It depends on what your data uses (see chart). Codes 0..31 are a problem unless you have a program that can map those codes to a printable value (as noted in Why does showconsolefont have different output in tmux?, the showconsolefont program does this mapping of 0..31).
Most of the usable fonts for the Linux console are "psf" fonts, which have a header that tells which Unicode value each glyph corresponds to. Using that, along with a known character set (cp437), you could convert the data or "play" it using an application which knows how to do this:
You could convert it using iconv or recode, or
The line-drawing (128..255) could be done using luit in a UTF-8 console.
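As a rough illustration of the "convert it" option, the sketch below does in Python what iconv or recode would do from the shell. It assumes the data really is cp437 and that the result is shown in a UTF-8 terminal; the function name cp437_to_text is made up for this example. Note that Python's cp437 codec leaves codes 0..31 as control characters, so it does not solve the 0..31 problem mentioned above.

```python
def cp437_to_text(raw):
    """Decode CP437 bytes to a Unicode string using Python's built-in codec."""
    return raw.decode("cp437")


if __name__ == "__main__":
    # Three bytes from the 128..255 line-drawing range: a box's top edge.
    box_top = bytes([0xC9, 0xCD, 0xBB])
    print(cp437_to_text(box_top))
```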

Nwjs Localization, shows Japanese characters as boxes?

I am using Node.js on SUSE. When the system language is Japanese (my locale is ja_JP.UTF-8), node shows Japanese characters as square boxes (for links to a Japanese website).
I even tried i18n localization, with properties files for the Japanese language.
Node displays all Japanese fonts as boxes. Window.Navigator.language does show "ja".
And things work great when the language is English or French.
I tried different fonts but observed a similar issue.
While researching a multilingual IDE (integrated development environment) that I want to build, I found that Japanese has a few competing encodings that are not UTF-8; kanji has far more characters than most other writing systems, and a few sites noted competing non-UTF formats in use in Japan.
As I understand it, ICU (http://icu-project.org/) gets its data from CLDR (http://cldr.unicode.org/), and a lot of software from Unix/Linux vendors, Microsoft, and other big companies uses that data. Looking at Chrome and Node.js, I saw mentions of Japanese in a license file, but I did not find much more beyond that. From skimming a handful of the CLDR core files, the amount of data for Japanese seemed thin compared with other languages.
With that said, make sure you are declaring UTF-8 correctly in the HTML header, and see whether a lang attribute helps:
<span lang="ja">some text here</span>
I do not remember where I read it, and I am not sure of the term (it was not kanji), but it concerned vertical writing (read from top to bottom rather than left to right or right to left); to get it to display correctly, the text was being placed into "square boxes".

delete special characters preceding shebang (M-oM-;M-?#!/bin/bash) [duplicate]

I have a CSV file with special accents and save it in Notepad by selecting UTF-8 encoding. When I read the file using Java, it reads the BOM characters too.
So I want to save this file in UTF-8 format without appending a BOM initially in Notepad.
Otherwise, is there a built-in class in Java that eliminates the BOM characters present at the beginning when reading the contents of a file?
Use Notepad++ - it is free and much better than Notepad. It will help to save text without a BOM using Encoding → Encode in UTF-8 without BOM (the menu differs between Notepad++ v6 and older and Notepad++ v7+).
When I encountered this problem in Java, I didn't find any library to parse these first three bytes (BOM). So my advice:
Use PushbackInputStream(in, 3).
Read the first three bytes
If it's not BOM (EF BB BF), push them back
Process the stream as UTF-8
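The four steps above can be sketched as follows. This is written in Python rather than Java, purely as an illustration of the logic: rewinding a seekable stream with seek() stands in for PushbackInputStream's unread, and the function name skip_utf8_bom is hypothetical.

```python
import codecs
import io


def skip_utf8_bom(stream):
    """Consume a leading UTF-8 BOM (EF BB BF) from a seekable binary stream, if present."""
    start = stream.tell()
    head = stream.read(3)  # read the first three bytes
    if head != codecs.BOM_UTF8:
        stream.seek(start)  # not a BOM: "push back" by rewinding
    return stream


if __name__ == "__main__":
    with_bom = io.BytesIO(codecs.BOM_UTF8 + "é,1\n".encode("utf-8"))
    # Process the rest of the stream as UTF-8.
    print(skip_utf8_bom(with_bom).read().decode("utf-8"))
```

In Python specifically, decoding with the "utf-8-sig" codec has the same effect; the explicit version is shown only because it mirrors the steps.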
I just learned from this Stack Overflow post, as #martin-geisler points out, that you can save files without the BOM in Windows Notepad, by selecting ANSI as the encoding.
I'm assuming that for more advanced uses this won't work because the resulting file is probably not the end encoding wished, but actually ANSI; but I tested and confirmed this works to save a very small .php script without BOM using only Notepad.
I learned the long, hard way that Windows' Notepad is not a true editor, although I'd like to point out for others that, despite this, it is misleadingly called up when you type "editor" on newer Windows machines, at least on one of mine.
I am currently using Emacs and other editors to solve this problem.
Use Notepad++ instead. See my personal blog post on it. From within Notepad++, choose the "Encoding" menu, then "Encode in UTF-8 without BOM".
Notepad on Windows 10 version 1903 (May 2019 update) and later versions supports saving to UTF-8 without a BOM. In fact, UTF-8 is the default file format now.
Reference: Windows 10 Notepad is Getting Better UTF-8 Encoding Support
The answer is: Not at all. Notepad can't do that.
In Java you can just skip the first three bytes (the UTF-8 BOM, EF BB BF) in your InputStream and be done.
You might want to try out Notepad2 or Notepad++. Those Notepad replacements have the option for you to choose whether to output BOM.
As for a Java solution, as far as I know, Java's UTF-8 decoder does not strip the BOM. I googled and found "Java's UTF-8 and Unicode writing is broken - Use this fix", which might be the solution.
We're using the utility BOMStripperInputStream.java to strip the BOM from our input if present.

Self Contained Linux Command line tool for converting text to doc, rtf, pdf

I'm looking for a command line tool for Linux that will allow me to convert UTF-8 plain text files to various formats. My problem is that I'm working on a secure company-specific flavour of Linux, so the tool can't rely on other packages, such as Open Office, being present. Does anyone know of such a tool?
GNU a2ps lets you convert from anything to PostScript (designed for printing). Not exactly what you want, but if you have utilities to display PostScript files, you can convert them into PDF.
Another option is GNU enscript, which "converts text to Postscript, HTML or RTF with syntax highlighting". I'm not sure if it supports UTF-8.
Conversion into doc will be harder since it's a closed format. But I have in the past cheated by creating an HTML file with inline CSS and then renaming it to .doc. That worked back in the early 2000s. I don't know about now.
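A minimal sketch of the HTML-with-inline-CSS step, assuming Python is available on the locked-down machine (it uses only the standard library). The function name text_to_html and the styling are made up for illustration, and whether a current word processor will open the renamed file is untested here.

```python
import html


def text_to_html(text, title="Document"):
    """Wrap plain text in a small HTML page, one styled <p> per blank-line-separated paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    body = "\n".join(
        '<p style="font-family: serif; font-size: 12pt;">{}</p>'.format(
            html.escape(p).replace("\n", "<br>")  # escape markup, keep line breaks
        )
        for p in paragraphs
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>{}</title></head>\n<body>\n{}\n</body></html>\n"
    ).format(html.escape(title), body)


if __name__ == "__main__":
    page = text_to_html("First paragraph.\n\nSecond paragraph, with a < sign.")
    # Write UTF-8 explicitly; the .doc extension is the "cheat" described above.
    with open("example.doc", "w", encoding="utf-8") as out:
        out.write(page)
```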

I exported via mysqldump to a file. How do I find out the file encoding of the file?

Given a text file on Ubuntu (or Debian, or Unix in general), how do I find out the file's encoding? Can I run od or hexdump on it to fingerprint its encoding? What should I be looking out for?
There are many tools to do this. Try a web search for "detect encoding". Here are some of the tools I found:
The International Components for Unicode (ICU) are a great place to start. See especially their page on Character Set Detection.
Chardet is a Python module to guess the encoding of a file. See chardet.feedparser.org
The *nix command-line tool file detects file types, but might also detect encodings if mentioned in the file (e.g. if there's a mime-type notation in the file). See man file
Perl modules Encode::Detect and Encode::Guess.
Someone asked a similar question in StackOverflow. Search for the question, PHP: Detect encoding and make everything UTF-8. That's in the context of fetching files from the net and using PHP, but you could write a command-line PHP script.
Note well what the ICU page says about character set detection: "Character set detection is ..., at best, an imprecise operation using statistics and heuristics...." In my experience the problem domain makes a big difference in how easy or difficult the job is. Don't forget that it's possible that the octets in a file can be of ambiguous encoding, i.e. sensibly interpreted using multiple different encodings. They can also be of mixed encoding, i.e. different subsets of the octets make sense interpreted in different encodings. This is why there's not a single command-line tool I can recommend which always does the job.
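To make the ambiguity point concrete, here is a small Python sketch that simply tries a list of candidate encodings and reports every one that decodes without error. The function name plausible_encodings and the candidate list are assumptions for illustration; this is trial decoding, not the statistical detection ICU or chardet perform.

```python
def plausible_encodings(raw, candidates=("utf-8", "euc_jp", "shift_jis", "latin-1")):
    """Return every candidate encoding under which the bytes decode without error."""
    fits = []
    for enc in candidates:
        try:
            raw.decode(enc)  # strict mode: raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        fits.append(enc)
    return fits


if __name__ == "__main__":
    # Pure ASCII is valid in all four candidates, so the result is ambiguous.
    print(plausible_encodings(b"hello"))
    # Non-ASCII bytes usually rule some candidates out, but rarely all but one.
    print(plausible_encodings("日本語".encode("utf-8")))
```

latin-1 accepts any byte sequence at all, which is why a clean decode on its own says little about whether the guess is right.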
If you have a single file and you just want to get it into a known encoding, my trick is to open the file with a text editor which can import using a bunch of different encodings, such as TextWrangler or OpenOffice.org. First, open the file and let the editor guess the encoding. Take a look at the result. If you aren't satisfied with it, guess an encoding, open the file with the editor specifying that encoding, and take a look at the result. Then save as a known encoding, e.g. UTF-16.
You can use enca. Enca is a small command-line tool for encoding detection and conversion.
You can install it on Debian / Ubuntu with:
apt-get install enca
In order to use it, just call
enca FILENAME
Also see the manpage for more information.
