How to determine codepage of a file (that had some codepage transformation applied to it) - text

For example if I know that ć should be ć, how can I find out the codepage transformation that occurred there?
It would be nice if there was an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but tools are not important, I'll take anything that works including python scripts)
EDIT:
Could you please be a little more verbose? You know for certain that some substring should be exactly. Or know just the language? Or just guessing? And the transformation that was applied, was it correct (i.e. it's valid in the other charset)? Or was it single transformation from charset X to Y but the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer - utf-8 text file got opened as ascii text file and then exported as csv.

It's extremely hard to do this generally. The main problem is that all the ascii-based encodings (iso-8859-*, dos and windows codepages) use the same range of codepoints, so no particular codepoint or set of codepoints will tell you what codepage the text is in.
There is one encoding that is easy to tell. If it's valid UTF-8, than it's almost certainly no iso-8859-* nor any windows codepage, because while all byte values are valid in them, the chance of valid utf-8 multi-byte sequence appearing in a text in them is almost zero.
Than it depends on which further encodings may can be involved. Valid sequence in Shift-JIS or Big-5 is also unlikely to be valid in any other encoding while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which way you get fewer errors.
If you can limit the number of transformation that may have happened, it shouldn't be too hard to put up a python script that will try them out, eliminate the obvious wrongs and uses a spell-checker to pick the most likely. I don't know about any tool that would do it.

The tools like that were quite popular decade ago. But now it's quite rare to see damaged text.
As I know it could be effectively done at least with a particular language. So, if you suggest the text language is Russian, you could collect some statistical information about characters or small groups of characters using a lot of sample texts. E.g. in English language the "th" combination appears more often than "ht".
So, then you could permute different encoding combinations and choose the one which has more probable text statistics.

Related

How to output IBM-1027-codepage-binary-file?

My output (csv/json) from my newly-created program (using .NET framework 4.6) need to be converted to a IBM-1027-codepage-binary-file (to be imported to Japanese client's IBM mainframe),
I've search the internet and know that Microsoft doesn't have equivalent to IBM-1027 code page.
So how could I output a IBM-1027-codepage-binary-file if I have an UTF-8 CSV/json file in my hand?
I'm asking around for other solutions, but for now, I think I'm going to have to suggest you do the conversion manually; I assume whichever language you're using allows you to do a hex conversion, at worst. For mainframes, the codepage is usually implicit in the dataset, it isn't something that is included in the file header.
So, what you can do is build a conversion table, from https://www.ibm.com/support/knowledgecenter/en/SSEQ5Y_5.9.0/com.ibm.pcomm.doc/reference/html/hcp_reference26.htm. Grab a character from your json/csv file, convert to the appropriate hex digits, and write those hex digits to a file. Repeat until EOF. (Note to actually write the hex data, not the ascii representation of the hex data.) Make sure that when the client transfers the file to their system, they perform a binary transfer.
If you wanted to get more complicated than that, you could look at enhancing/overriding part of the converter to CP500, which does exist on Microsoft Windows. One of the design points for EBCDIC was to make doing character conversions as simple as possible, so many of the CP500 characters hex representations are the same as the CP1027, with the exception of the Kanji characters.
This is a separate answer, from a colleague; I don't have the ability to validate it, I'm afraid.
transfer the file to the host in raw mode, just tag it as ccsid 1208
(edited)
for uss export _BPXK_AUTOCVT=ALL
oedit/obrowse handles it automatically.

Why do some files appear as partial gibberish when opened in a text editor?

I often come across the situation where I would like to read a file's original content in a human-readable way. When opening this kind of file in a text editor, why is it that it is usually gibberish with some complete and comprehensible text ? I would think that if the file is converted to something other than it's original written format, that there would be no comprehensible text remaining, yet I often find it is somewhere in between.
For example, I know that if I open a binary in a text format, there will be nothing comprehensible left that isn't purely accidental.
Example screencapture of partial gibberish text
Why is there complete text in here mixed with gibberish? Does that mean if I open the file with some sort of different encoding (I don't know what's possible), the file will come through as fully readable text? I would understand if it were all-or-nothing (either gibberish-non-readable OR human language) but I don't understand the in-between.
Please provide educational responses, rather than "because that's the way it is" type answers.
Those are formatting characters; there is no standard use and vary by the format of the file in question. You can still extract the text as needed with a fair knowledge of grep and regex, but it won't be fun. The best bet is to open the file with the software that can read it properly, as a text editor like gedit or Notepad++ will read the raw data and display that. Adobe's pdf format has text embedded, for instance, and all that gibberish is instructions for the Reader software for displaying it correctly on the screen while still allowing for relatively straightforward text extraction when required.
Editors have no real way to interpret the special formatting characters, and would need to be loaded with APIs for every conceivable program. They would also need to be updated constantly, since the formatting changes regularly for a variety of reasons. Many times, it is just to keep the files from being backward compatible with their own or other products, forcing an upgrade path. Microsoft is rather famous for that, but they are by far not the only company to do so.

Rationale of fileencoding and encoding in vim or elsewhere

I don't get the point why there are encoding and also fileencoding in VIM.
In my knowledge, a file is like an array of bytes. When we create a text file, we create an array of characters (or symbols), and encode this character-array with encoding X to an array of bytes, and save the byte-array to disk. When read in text editor, it decode the byte-array with encoding X to reconstruct the original character-array, and display each character with a graph according to the font. In this process, only one encoding involved.
In VIM set encoding and fileencoding utf-8, which refers wiki of VIM about working with unicode,
encoding sets how vim shall represent characters internally. Utf-8
is necessary for most flavors of Unicode.
fileencoding sets the encoding for a particular file (local to
buffer)
"How vim shall represent characters internally" vs "encoding for a particular file"... resambles Unicode vs UTF-8? If so, why should a user bother with the former?
Any hint?
You're right; most programs have a fixed internal encoding (speaking of C datatypes, that's either char, which mostly then uses the underlying locale and may not be able to represent all characters, or UTF-8; or wchar (wide characters) which can represent the Unicode range). The choice is mainly driven by programming language and available APIs (as having to convert back and forth is tedious and not efficient).
Vim, because it supports a large variety of platforms (starting with the old Amiga where development started) and is geared towards programmers and highly advanced users allows to configure the internal representation.
heuristics
As long as all characters are recognizable, you don't need to care.
If certain files don't look right, you have to teach Vim to recognize the encoding via 'fileencodings', or explicitly specify it.
If certain characters do not show up right, you need to switch the 'encoding'. With utf-8, you're on the safe side.
If you have problems in the terminal only, fiddle with 'termencoding'.
As you can see, though it can be confusing to the beginner, you actually have all the power available to you!
I'll preface this by saying that I'm not a vim expert by any means.
I think the flaw in your thinking is here:
When read in text editor, it decode the byte-array with encoding X to reconstruct the original character-array, and display each character with a graph according to the font.
The thing is, vim is not responsible for rendering the glyph here. vim reads bytes from a file, stores them internally and sends bytes to the terminal which renders the glyph using a font. vim itself never touches fonts and hence never really needs to understand "characters". It only needs to work with bytes internally which it moves back and forth between files, internal buffers and the terminal.
Hence, there are three possible different byte storages involved:
fileencoding
(internal) encoding
termencoding
vim will convert between those as necessary. It could read from a Shift-JIS encoded file, store the data internally as UTF-16 and send/receive I/O to/from the terminal in UTF-8. I am not sure why you'd want to change the internal byte handling of vim (again, not an expert), but in any case, you can alter that setting if you want to.
Hypothesising follows: If you set encoding to a Unicode encoding, you're safe to be able to handle any possible character you may encounter. However, in some circumstances those Unicode encodings may be too large to comfortably fit into memory in very limited systems, so in this case you may want to use a more specialised encoding if you know what you're doing.

Overlayed text?

Quick text-processing question. It's not necessarily related to programming, but this is the best place I figured I should go.
Rate down to tell me this kind of question is not welcome here. (Though, I really like my one little reputation point.)
Anyways, how can I encode text so that two characters get rendered in the same charspace?
NOTE: this is for plain-text -- nothing particularly complex.
The best you can do is put a backspace character between the two. However the outcome isn't likely to be useful to you, it will depend on what software is being used to display the text. The most likely is that the backspace will be ignored or shown as some generic "unavailable" glyph. The second most likely is that the second character will completely erase the first. You'd have to be very lucky for the two characters to be displayed one over the other in the same space.
If it's plain text to be processed by any editor, as far as I know you can't. Even if your text is encoded in Unicode, I don't think it provides combining characters for normal letters, but just for accents and similar symbols which are intended to be combined with other glyphs.
BTW, I'm not sure that stackoverflow is the right place for this kind of stuff, I'd see it better in superuser.com.

Testing for Japanese/Chinese Characters in a string

I have a program that reads a bunch of text and analyzes it. The text may be in any language, but I need to test for japanese and chinese specifically to analyze them a different way.
I have read that I can test each character on it's unicode number to find out if it is in the range of CJK characters. This is helpful, however I would like to separate them if possible to process the text against different dictionaries. Is there a way to test if a character is Japanese OR Chinese?
You won't be able to test a single character to tell with certainty that it is Japanese or Chinese because of the way the unihan code points are implemented in the Unicode standard. Basically, every Chinese character is a potential Japanese character. However, the reverse is not true. Also, there are a number of conventions that could be used to test to see if a block of text is in one language or the other.
Simplifications - if the character you are testing is a PRC simplification such as 门 is only available in main land Chinese.
Kana - if the character is one of the many Japanese kana characters such as あいうえお then the text block you are working with is definitely Japanese.
The problem arises with the sheer number of characters and words that are in common. However, if I needed a quick and dirty solution to this problem, I would check my entire blocks of text for kana - if the text contains kana then I know it is Japanese. If you need to distinguish Korean as well, I would test for Hangul. Also, if you need to distinguish what type of Chinese, testing for types of simplifications would be the best approach.
The process of developing Unicode included the Han Unification. This is because a lot of the Japanese characters are derived from, or the same as, Chinese characters; similarly with Korean. There are some characters (katakana and hiragana - see chapter 12 of the Unicode standard v5.1.0) commonly used in Japanese that would indicate that the text was Japanese rather than Chinese, but I believe it would be a statistical test rather than definitive.
Check out the O'Reilly book on CJKV Information Processing (CJKV is short for Chinese, Japanese, Korean, Vietnamese; I have the CJK predecessor lurking somewhere). There's also the O'Reilly book on Unicode Explained which may be some help, though probably not for this question (I don't recall a discussion of how to identify Japanese and Chinese text).
You probably can't do that reliably. Japanese uses a lot of the same characters as Chinese. I think the best you could do is to look at a block of text. If you see any uniquely Japanese characters, then you can assume the whole block is Japanese. If not, then it's probably Chinese.
However, I'm just learning Chinese, so I'm not an expert.
testing for characters in the katakana or hiragana ranges should be a very reliable means of determining whether or not the text is Japanese, especially if you are dealing with 'regular' user-generated text. if you are looking at legal documents or other more official fare it might be slightly more difficult, as there will be a much greater preponderance of complex chinese characters - but it should still be pretty reliable.
A workaround is to check the encoding before it is converted to Unicode.
There are many characters which are only (commonly) used in Japanese or only used in Chinese.
Japan and China both simplified many characters but often in different ways. You can check for Japanese Shinjitai and Simplified Chinese characters. There are many more of the latter than the former. If there are none of either then you probably have Traditional Chinese.
Of course if you're dealing with Unicode text you may find occasional rare characters or mixed languages which could throw off a heuristic so you're better off going with counting the types of characters to make a judgement.
A good way to find out which characters are common in one language and not in the others is to compare the legacy encodings against each other. You can find mappings of each to Unicode easily on the internet.
I used to have some code I wrote which did a binary search by codepoint and it was extremely fast even in JavaScript - I may have lost it in my travels though (-:

Resources