How do text editors store data above 1 byte? - text

The basic question is, how does notepad (or other basic text editors) store data. I ran into this because I was trying to compare file size of different compression techniques, and realized something isn't quite right.
To elaborate..
If I save a text file with the following contents:
a
The file is 1 byte. This one happens to be 97, or 0x61.
I create a text file with the following contents:
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Which is all the characters from 0-255, or 0x00 to 0xFF.
This file is 256 bytes. 1 byte for each character. This makes sense to me.
Then I append the following character to the end of the above string.
†
A character not contained in the above string. All 8 bit characters were already used. This character is 8224, or 0x2020. A 2 bytes character.
And yet, the file size has only changed from 256 to 257 bytes. In fact, the above character saved by itself only shows 1 byte.
What am I missing?
Edit: Please note that in the second text block, many of the characters do not show up on here.

In ANSI encoding (This 8-bit Microsoft-specific encoding), you save each character in one byte (8-bit).
ANSI also called Windows-1252, or Windows Latin-1
You should have a look at ANSI table in ANSI Character Codes Chart or Windows-1252
So for † character, its code is 134, byte 0x86.

Using one byte to encode a character only makes sense on the surface. Works okay if you speak English, it is a fair disaster is you speak Chinese or Japanese. Unicode today has definitions for 110,187 typographic symbols with room to grow up to 1.1 million. A byte is not a good way to store a Unicode symbol since it can encode only 256 distinct values.
Accordingly, text editors must always encode text when they store it to a file. Encoding is required to map 110,187 values onto a byte-oriented storage medium. Inevitably that takes more than 1 byte per character if you speak Chinese.
There have been lots and lots of encoding schemes in common use. Popular in the previous century were code pages, a scheme that uses a character set. A language-specific mapping that tries as hard as it can to need only 1 byte of storage per character by picking 256 characters that are likely to be needed in the language. Japanese, Korean and Chinese used a multi-byte mapping because they had to, other languages used 1.
Code pages have been an enormous disaster, a program cannot properly read a text file that was encoded in another language's code page. It worked when text files stayed close to the machine that created it, the Internet in particular broke that usage. Japanese was particularly prone to this disaster since it had more than one code page in common use. The result is called mojibake, the user looks at gibberish in the text editor. Unicode came around in 1992 to try solve this disaster. One new standard to replace all the other ones, tends to invoke another kind of disaster.
You are subjected to that kind of disaster, particularly if you use Notepad. A program that tries to be compatible with text files that were created in the past 30 years. Google "bush hid the facts" for a hilarious story about that. Note the dialog you get when you use File > Save As, the dialog has an extra combobox titled "Encoding". The default is ANSI, a broken name from the previous century that means "code page". As you found out, that character indeed only needed 1 byte in your machine's default code page. Depends where you live, it is 1252 in Western Europe and the Americas. You'd get 0x86 if you look at the file with a hex viewer.
Given that the dialog gives you a choice and you should not favor ANSI's mojibake anymore, always favor UTF-8 instead. Maybe they'll update Notepad some day so it uses a better default, very hard to do.

Related

How to output IBM-1027-codepage-binary-file?

My output (csv/json) from my newly-created program (using .NET framework 4.6) need to be converted to a IBM-1027-codepage-binary-file (to be imported to Japanese client's IBM mainframe),
I've search the internet and know that Microsoft doesn't have equivalent to IBM-1027 code page.
So how could I output a IBM-1027-codepage-binary-file if I have an UTF-8 CSV/json file in my hand?
I'm asking around for other solutions, but for now, I think I'm going to have to suggest you do the conversion manually; I assume whichever language you're using allows you to do a hex conversion, at worst. For mainframes, the codepage is usually implicit in the dataset, it isn't something that is included in the file header.
So, what you can do is build a conversion table, from https://www.ibm.com/support/knowledgecenter/en/SSEQ5Y_5.9.0/com.ibm.pcomm.doc/reference/html/hcp_reference26.htm. Grab a character from your json/csv file, convert to the appropriate hex digits, and write those hex digits to a file. Repeat until EOF. (Note to actually write the hex data, not the ascii representation of the hex data.) Make sure that when the client transfers the file to their system, they perform a binary transfer.
If you wanted to get more complicated than that, you could look at enhancing/overriding part of the converter to CP500, which does exist on Microsoft Windows. One of the design points for EBCDIC was to make doing character conversions as simple as possible, so many of the CP500 characters hex representations are the same as the CP1027, with the exception of the Kanji characters.
This is a separate answer, from a colleague; I don't have the ability to validate it, I'm afraid.
transfer the file to the host in raw mode, just tag it as ccsid 1208
(edited)
for uss export _BPXK_AUTOCVT=ALL
oedit/obrowse handles it automatically.

Rationale of fileencoding and encoding in vim or elsewhere

I don't get the point why there are encoding and also fileencoding in VIM.
In my knowledge, a file is like an array of bytes. When we create a text file, we create an array of characters (or symbols), and encode this character-array with encoding X to an array of bytes, and save the byte-array to disk. When read in text editor, it decode the byte-array with encoding X to reconstruct the original character-array, and display each character with a graph according to the font. In this process, only one encoding involved.
In VIM set encoding and fileencoding utf-8, which refers wiki of VIM about working with unicode,
encoding sets how vim shall represent characters internally. Utf-8
is necessary for most flavors of Unicode.
fileencoding sets the encoding for a particular file (local to
buffer)
"How vim shall represent characters internally" vs "encoding for a particular file"... resambles Unicode vs UTF-8? If so, why should a user bother with the former?
Any hint?
You're right; most programs have a fixed internal encoding (speaking of C datatypes, that's either char, which mostly then uses the underlying locale and may not be able to represent all characters, or UTF-8; or wchar (wide characters) which can represent the Unicode range). The choice is mainly driven by programming language and available APIs (as having to convert back and forth is tedious and not efficient).
Vim, because it supports a large variety of platforms (starting with the old Amiga where development started) and is geared towards programmers and highly advanced users allows to configure the internal representation.
heuristics
As long as all characters are recognizable, you don't need to care.
If certain files don't look right, you have to teach Vim to recognize the encoding via 'fileencodings', or explicitly specify it.
If certain characters do not show up right, you need to switch the 'encoding'. With utf-8, you're on the safe side.
If you have problems in the terminal only, fiddle with 'termencoding'.
As you can see, though it can be confusing to the beginner, you actually have all the power available to you!
I'll preface this by saying that I'm not a vim expert by any means.
I think the flaw in your thinking is here:
When read in text editor, it decode the byte-array with encoding X to reconstruct the original character-array, and display each character with a graph according to the font.
The thing is, vim is not responsible for rendering the glyph here. vim reads bytes from a file, stores them internally and sends bytes to the terminal which renders the glyph using a font. vim itself never touches fonts and hence never really needs to understand "characters". It only needs to work with bytes internally which it moves back and forth between files, internal buffers and the terminal.
Hence, there are three possible different byte storages involved:
fileencoding
(internal) encoding
termencoding
vim will convert between those as necessary. It could read from a Shift-JIS encoded file, store the data internally as UTF-16 and send/receive I/O to/from the terminal in UTF-8. I am not sure why you'd want to change the internal byte handling of vim (again, not an expert), but in any case, you can alter that setting if you want to.
Hypothesising follows: If you set encoding to a Unicode encoding, you're safe to be able to handle any possible character you may encounter. However, in some circumstances those Unicode encodings may be too large to comfortably fit into memory in very limited systems, so in this case you may want to use a more specialised encoding if you know what you're doing.

How to determine codepage of a file (that had some codepage transformation applied to it)

For example if I know that ć should be ć, how can I find out the codepage transformation that occurred there?
It would be nice if there was an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but tools are not important, I'll take anything that works including python scripts)
EDIT:
Could you please be a little more verbose? You know for certain that some substring should be exactly. Or know just the language? Or just guessing? And the transformation that was applied, was it correct (i.e. it's valid in the other charset)? Or was it single transformation from charset X to Y but the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer - utf-8 text file got opened as ascii text file and then exported as csv.
It's extremely hard to do this generally. The main problem is that all the ascii-based encodings (iso-8859-*, dos and windows codepages) use the same range of codepoints, so no particular codepoint or set of codepoints will tell you what codepage the text is in.
There is one encoding that is easy to tell. If it's valid UTF-8, than it's almost certainly no iso-8859-* nor any windows codepage, because while all byte values are valid in them, the chance of valid utf-8 multi-byte sequence appearing in a text in them is almost zero.
Than it depends on which further encodings may can be involved. Valid sequence in Shift-JIS or Big-5 is also unlikely to be valid in any other encoding while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which way you get fewer errors.
If you can limit the number of transformation that may have happened, it shouldn't be too hard to put up a python script that will try them out, eliminate the obvious wrongs and uses a spell-checker to pick the most likely. I don't know about any tool that would do it.
The tools like that were quite popular decade ago. But now it's quite rare to see damaged text.
As I know it could be effectively done at least with a particular language. So, if you suggest the text language is Russian, you could collect some statistical information about characters or small groups of characters using a lot of sample texts. E.g. in English language the "th" combination appears more often than "ht".
So, then you could permute different encoding combinations and choose the one which has more probable text statistics.

How would one store German text in an embedded system?

I've created a memory mapped 1 bit interface to an LCD in an embedded system, along with 4 or 5 bit mapped fonts for the 90+ printable ASCII characters. Writing to the screen is as simple as using an echo like statement (it's embedded Linux).
Other than something strictly proprietory, what recommendations can people make for storing German (or Spanish, or French for that mattter)? Unicode seems to be a pretty heavy hitter.
If I understand you right, you are searching a lightwight encoding for german characters? In Europe, you normaly use Latin-1 or better ISO 8859-15. This is a 8-Bit ASCII extension containing most of the characters used by western languages.
Well, UTF-8 isn't that big. I recommend it if you want to be able to use one or more languages where you don't find a matching char in Latin-1.

Confusion on Unicode and Multibyte Articles

By referring Joel's Article
Some people are under the
misconception that Unicode is simply a
16-bit code where each character takes
16 bits and therefore there are 65,536
possible characters. This is not,
actually, correct.
After reading the whole article, my point is that, if someone told you, his text is in unicode, you will have no idea how much memory space taken up by every of his character. He have to tell you, "My unicode text is encoded in UTF-8", then only you will have idea how much memory space is taken up by every of his character.
Unicode = not necessary 2 byte for each character
However, when comes to Code Project's Article and Microsoft's Help, this confused me :
Microsoft :
Unicode is a 16-bit character
encoding, providing enough encodings
for all languages. All ASCII
characters are included in Unicode as
"widened" characters.
Code Project :
The Unicode character set is a "wide
character" (2 bytes per character) set
that contains every character
available in every language, including
all technical symbols and special
publishing characters. Multibyte
character set (MBCS) uses either 1 or
2 bytes per character
Unicode = 2 byte for each character ?
Is 65536 possible characters able to represent all language in this world?
Why the concept seems different among web developer community and desktop developer community?
Once upon a time,
Unicode had only as many characters as fit in 16 bits, and
UTF-8 did not exist or was not the de facto encoding to use.
These factors led to UTF-16 (or rather, what is now called UCS-2) to be considered synonymous with “Unicode”, because it was after all the encoding which supported all of Unicode.
Practically, you will see “Unicode” being used where “UTF-16” or “UCS-2” is meant. This is a historical confusion and should be ignored and not propagated. Unicode is a set of characters; UTF-8, UTF-16, and UCS-2 are different encodings.
(The difference between UTF-16 and UCS-2 is that UCS-2 is a true 16-bits-per-“character” encoding, and therefore encodes only the “BMP” (Basic Multilingual Plane) portion of Unicode, whereas UTF-16 uses “surrogate pairs” (for a total of 32 bits) to encode above-BMP characters.)
To expand on #Kevin's answer:
The description is Microsoft's Help is quite out of date, describing the state of the world in the NT 3.5/4.0 timeline.
You'll also occasionally see UTF-32 and UCS-4 mentioned as well, most often in the *nix world. UTF-32 is a 32-bit encoding of Unicode, a subset of UCS-4. The Unicode Standard Annex #19 describes the differences between them.
The best reference I've found describing the various encoding models is the Unicode Technical Report #17 Unicode Character Encoding Model, especially the tables in section 4.
Is 65536 possible characters able to represent all language in this world?
No.
Why the concept seems different among web developer community and desktop developer community?
Because Windows documentation is wrong. It took me a while to figure this out. MSDN says in at least two places that Unicode is a 16-bit encoding:
http://www.microsoft.com/typography/unicode/cscp.htm
http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx
One reason for the confusion is that at one point Unicode was a 16-bit encoding. From Wikipedia:
“Originally, both Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16 bit”
The other problem is that today in Windows APIs strings containing utf-16 encoded string data is usually represented using an array of wide characters, each one being 16-bits long. Despite that that Windows APIs support surrogate pairs of two 16-bit character types, to represent one Unicode code point.
Check out this blog post for more detailed information on the source of the confusion.

Resources