How much data can you encode in a single character?

If I were creating a videogame level editor in AS3 or .NET with a string-based level format that can be copied, pasted and emailed, how much data could I encode into each character? What matters is getting the maximum amount of data into the minimum number of characters displayed on the screen, regardless of how many bytes the computer actually uses to store those characters.
For example, if I wanted to store the horizontal position of an object in 1 string character, how many possible values could that have? Are there any characters that can't be sent over the Internet, or that can't be copied and pasted? What difference would things like UTF-8 make? Answers please for either AS3 or C#/.NET, or both.
2nd update: OK, so Flash uses UTF-16 for its String class. There are lots of control characters that I cannot use. How can I manage which characters are OK to use? Just a big lookup table? And can operating systems and browsers handle UTF-16 well enough that you can safely copy and paste a UTF-16 string into an email, Notepad, etc.?

You can store 8 bits in a single character with ANSI or extended-ASCII encoding (or in one byte of UTF-8).
But, for example, with ASCII encoding you shouldn't use the first 32 values (0x00-0x1F) or the character 0x7F: these represent control characters ("escape", "null", "start of text", "end of text", ...) that cannot reliably be copied and pasted. That leaves 223 (256 - 33) distinct values you can store in one single character.
If you use UTF-16 you have 2 bytes = 16 bits, minus the system characters, in which to store your information.
A in UTF-8 encoding: 0x41 (a single byte; only characters above U+007F need more)
A in UTF-16 encoding: 0x0041 (two bytes; the high byte can be non-zero for other characters)
A in ASCII encoding: 0x41
A in ANSI encoding: 0x41
See the ASCII table referenced at the end of this post.
update 1:
If you don't need to modify the values by hand, without a tool (a C# utility, a JavaScript-based web page, ...), you can alternatively Base64-encode (or zip + Base64) your information. This avoids the problem you describe in your 2nd update: "There are lots of control characters that I cannot use. How can I manage which characters are OK to use?"
If that is not an option, you cannot avoid some kind of lookup table.
The shortest form of a lookup table is:
var illegalCharCodes = new byte[]{0x00, 0x01, 0x02, ..., 0x1f, 0x7f};
or you can compute the mapping like this:
//The example is based on ANSI encoding, but in principle it is the same with UTF-16
var value = 0;
if (charcode > 0x7f)
    value = charcode - 0x1f - 1; //-1 because 0x7f is the first illegal char code above 0x1f
else
    value = charcode - 0x1f;
value -= 1; //shift down so that the first legal char code maps to 0
//charcode: 0x20 (' ') -> value: 0
//charcode: 0x21 ('!') -> value: 1
//charcode: 0x22 ('"') -> value: 2
//charcode: 0x7e ('~') -> value: 94
//charcode: 0x80 ('€') -> value: 95
//charcode: 0x81 (unassigned in ANSI) -> value: 96
//...
update 2:
For Unicode (UTF-16) you can use this table: http://www.tamasoft.co.jp/en/general-info/unicode.html
Any character that renders as a placeholder symbol (an empty box) or as a blank should not be used.
So you cannot store 50,000 possible values in one UTF-16 character if you want the result to survive copy and paste. You need a special encoder, and you must use 2 UTF-16 characters, like:
//charcodes: 0x0020 0x0020 ('  ') -> value: 0
//charcodes: 0x0020 0x0021 (' !') -> value: 1
//charcodes: 0x0021 0x0041 ('!A') -> value: something over 40,000; I don't know exactly because I haven't counted the illegal characters in UTF-16 :D
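As a concrete sketch of that two-character scheme in C# (my own illustration, not a standard API; the 95-symbol printable-ASCII alphabet is an assumption for demonstration, a real UTF-16 safe set would be much larger):
using System;
static class PairEncoder
{
    // Hypothetical safe alphabet: printable ASCII 0x20..0x7E (95 symbols),
    // so values 0..95*95-1 (9024) fit into two characters.
    const int Base = 95;
    const char First = ' '; // 0x20
    public static string Encode(int value) =>
        new string(new[] { (char)(First + value / Base), (char)(First + value % Base) });
    public static int Decode(string s) => (s[0] - First) * Base + (s[1] - First);
}
// PairEncoder.Encode(0) == "  ", Encode(1) == " !", Decode(Encode(1234)) == 1234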
[ASCII table image: asciitable.com]

Confusingly, a char is not the same thing as a character. In C and C++, a char is virtually always an 8-bit type. In Java and C#, a char is a UTF-16 code unit and thus a 16-bit type.
But in Unicode, a character is represented by a "code" point that ranges from 0 to 0x10FFFF, for which a 16-bit type is inadequate. So a character must either be represented by a 21-bit type (in practice, a 32-bit type), or use multiple "code units". Specifically,
In UTF-32, all characters require 32 bits.
In UTF-16, characters U+0000 to U+FFFF (the "basic multilingual plane"), except for U+D800 to U+DFFF which cannot be represented, require 16 bits, and all other characters require 32 bits.
In UTF-8, characters U+0000 to U+007F (the ASCII repertoire) require 8 bits, U+0080 to U+07FF require 16 bits, U+0800 to U+FFFF require 24 bits, and all other characters require 32 bits.
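You can check those widths from C# directly (a quick sketch; the sample characters are arbitrary):
using System;
using System.Text;
class Widths
{
    static void Main()
    {
        foreach (var s in new[] { "A", "\u00E9", "\u20AC", "\U0001D11E" }) // A, é, €, 𝄞
            Console.WriteLine($"U+{char.ConvertToUtf32(s, 0):X4}: " +
                $"UTF-8={Encoding.UTF8.GetByteCount(s)}, " +
                $"UTF-16={Encoding.Unicode.GetByteCount(s)}, " +
                $"UTF-32={Encoding.UTF32.GetByteCount(s)} bytes");
    }
}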
If I were creating a videogame level editor with a string-based level format, how much data could I encode into each char? For example if I wanted to store the horizontal position of an object in 1 char, how many possible values could that have?
Since you wrote char rather than "character", the answer is 256 for C and 65,536 for C#.
But char isn't designed to be a binary data type. byte or short would be more appropriate.
Are there any characters that can't be sent over the Internet, or that can't be copied and pasted?
There aren't any characters that can't be sent over the Internet, but you have to be careful using "control characters" or non-ASCII characters.
Many Internet protocols (especially SMTP) are designed for text rather than binary data. If you want to send binary data, you can Base64 encode it. That gives you 6 bits of information for each byte of the message.
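For example, in C# (a minimal sketch; the level bytes are made up):
using System;
class Base64Demo
{
    static void Main()
    {
        byte[] levelData = { 0x00, 0x1F, 0x7F, 0xFF };       // arbitrary binary level data
        string mailable = Convert.ToBase64String(levelData); // "AB9//w==" - plain ASCII, safe to email
        byte[] roundTrip = Convert.FromBase64String(mailable);
        Console.WriteLine(mailable);
    }
}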

In C, a char is a type of integer, and it's most typically one byte wide. One byte is 8 bits so that's 2 to the power 8, or 256, possible values (as noted in another answer).
In other languages, a 'character' is a completely different thing from an integer (as it should be), and has to be explicitly encoded to turn it into a byte. Java, for example, makes this relatively simple by storing characters internally in a UTF-16 encoding (forgive me some details), so they take up 16 bits, but that's just implementation detail. Different encodings such as UTF-8 mean that a character, when encoded for transmission, could occupy anything from one to four bytes.
Thus your question is slightly malformed (which is to say it's actually several distinct questions in one).
How many values can a byte have? 256.
What characters can be sent in emails? Mostly those ASCII characters from space (32) to tilde (126).
What bytes can be sent over the internet? Any you like, as long as you encode them for transmission.
What can be cut-and-pasted? If your platform can do Unicode, then all of unicode; if not, not.
Does UTF-8 make a difference? UTF-8 is a standard way of encoding a string of characters into a string of bytes, and probably not much to do with your question (Joel Spolsky has a very good account of The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)).
So pick a question!
Edit, following edit to question Aha! If the question is: 'how do I encode data in such a way that it can be mailed?', then the answer is probably 'Use base64'. That is, if you have some purely binary format for your levels, then base64 is the 'standard' (very much quotes-standard) way of encoding that binary blob in a way that will make it through mail. The things you want to google for are 'serialization' and 'deserialization'. Base64 is probably close to the practical maximum of information-per-mailable-character.
(Another answer is 'use XML', but the question seems to imply some preference for compactness, and that a basically binary format is desirable).

The number of different states a variable can hold is two to the power of the number of bits it has. How many bits a variable has is something that is likely to vary according to the compiler and machine used. But in most cases a char will have eight bits and two to the power eight is two hundred and fifty six.
Modern screen resolutions being what they are, you will most likely need more than one char for the horizontal position of anything.

Related

Encoding binary strings into arbitrary alphabets

If you have a set of binary strings that are limited to some normally-small size, such as 256 or up to 512 bits like the outputs of some of the hashing algorithms, and you want to encode those 1's and 0's into, say, hex (a 16-character alphabet), then you take the whole string at once into memory and convert it into hex. At least that's what I think it means.
I don't have this question fully formulated, but what I'm wondering is if you can convert an arbitrarily long binary string into some alphabet, without needing to read the whole string into memory. The reason this isn't fully formed question is because I'm not exactly sure if you typically do read the whole string into memory to create the encoded version.
So if you have something like this:
1011101010011011011101010011011011101010011011110011011110110110111101001100101010010100100000010111101110101001101101110101001101101110101001101111001101111011011011110100110010101001010010000001011110111010100110110111010100110110111010100110111100111011101010011011011101010011011011101010100101010010100100000010111101110101001101101110101001101101111010011011110011011110110110111101001100101010010100100000010111101110101001101101101101101101101111010100110110111010100110110111010100110111100110111101101101111010011001010100101001000000101111011101010011011011101010011011011101010011011110011011110110110111101001100 ... 10^50 longer
Something like the whole genetic code or a million billion times that, it would be too large to read into memory and too slow to wait to dynamically create an encoding of it into hex if you have to stream the whole thing through memory before you can figure out the final encoding.
So I'm wondering three things:
If you do have to read something fully in order to encode it into some other alphabet.
If you do, then why that is the case.
If you don't, then how it works.
The reason I'm asking is because looking at a string like 1010101, if I were to encode it as hex there are a few ways:
One character at a time, so it would essentially stay 1010101 unless the alphabet was {a, b}, in which case it would be abababa. This is the best case because you never have to read more than 1 character into memory to figure out the encoding. But it limits you to a 2-character alphabet. (Anything beyond a 2-character alphabet and I start getting confused.)
By turning it into an integer, then converting that into a hex value. But this would require reading the whole value to compute the final (big)integer size. So that's where I get confused.
I feel like the third way (3) would be to read partial chunks of the input bits somehow, like 1010 then 010, but that would not work if the chunks are encoded as integers, because 1010 010 would encode as A 2 in hex, yet hex 2 decodes back to 10, not 010, so the leading zero is lost. So it's like you would need to break it by having a 1 at the beginning of each chunk. But then what if you wanted each chunk no longer than 10 hex characters and you have a long string of 1000 0's? Then you need some other trick, perhaps having the encoded hex value tell you how many preceding zeroes there are, etc. So it seems like it gets complicated, and I'm wondering if there are already established systems that have figured out how to do this. Hence the above questions.
For an example, say I wanted to encode the above binary string into an 8-bit alphabet, so like ASCII. Then I might have aBc?D4*&((!.... But then to deserialize this into the bits is one part, and to serialize the bits into this is another (these characters aren't the actual characters mapped to the above bit example).
But then what if you wanted each chunk no longer than 10 hex characters and you have a long string of 1000 0's? Then you need some other trick, perhaps having the encoded hex value tell you how many preceding zeroes there are, etc. So it seems like it gets complicated, and I'm wondering if there are already established systems that have figured out how to do this.
Yes you're way over-complicating it. To start simple, consider bit strings whose length is by definition a multiple of 4. They can be represented in hexadecimal by just grouping the bits up by 4 and remapping that to hexadecimal digits:
raw: 11011110101011011011111011101111
group: 1101 1110 1010 1101 1011 1110 1110 1111
remap: D E A D B E E F
So 11011110101011011011111011101111 -> DEADBEEF. That all the nibbles had their top bit set was a coincidence resulting from choosing an example that way. By definition the input is divided up into groups of four, and every hexadecimal digit is later decoded to a group of four bits, including leading zeroes if applicable. This is all that you need for typical hash codes which have a multiple of 4 bits.
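A streaming version of that grouping might look like this in C# (a sketch; it assumes the total bit count is a multiple of 4 and that the bits arrive lazily, so nothing beyond the current group is held in memory):
using System;
using System.Collections.Generic;
static class BitsToHex
{
    // Consumes bits (most significant first) and yields hex digits, 4 bits at a time.
    public static IEnumerable<char> Encode(IEnumerable<bool> bits)
    {
        int nibble = 0, count = 0;
        foreach (bool bit in bits)
        {
            nibble = (nibble << 1) | (bit ? 1 : 0);
            if (++count == 4)
            {
                yield return "0123456789ABCDEF"[nibble];
                nibble = 0;
                count = 0;
            }
        }
        if (count != 0)
            throw new ArgumentException("bit count was not a multiple of 4");
    }
}
Feeding it the bits of 11011110101011011011111011101111 yields D, E, A, D, B, E, E, F in turn, no matter how long the input stream is.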
The problems start when we want to encode bit strings that are of variable length and not necessarily a multiple of 4 long; then there has to be some padding somewhere, and the decoder needs to know how much padding there was (and where, but the location is a convention that you choose). This is why your example seemed so ambiguous: it is. Extra information needs to be added to tell the decoder how many bits to discard.
For example, leaving aside the mechanism that transmits the number of padding bits, we could encode 1010101 as A5 or AA or 55 (and more!) depending on the location we choose for the padding; whichever convention we choose, the decoder needs to know that there is 1 bit of padding. To put that back in terms of bits, 1010101 could be encoded as any of these:
x101 0101
101x 0101
1010 x101
1010 101x
Where x marks the bit which is inserted in the encoder and discarded in the decoder. The value of that bit doesn't actually matter because it is discarded, so D5 is also a fine encoding of the first form, and so on.
All of the choices of where to put the padding still enable the bit string to be encoded incrementally, without storing the whole bit string in memory, though putting the padding in the first hexadecimal digit requires knowing the length of the bit string up front.
If you are asking this in the context of Huffman coding, you wouldn't want to calculate the length of the bit string in advance, so the padding has to go at the end. Often an extra symbol is added to the alphabet that signals the end of the stream, which usually makes it unnecessary to explicitly store how many padding bits there are (there might be any number of them, but as they appear after the STOP symbol, the decoder automatically disregards them).
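One concrete end-padding convention, sketched in C# (my own choice of convention, not the only one): pad the last group with zero bits, then append one extra hex digit recording how many padding bits were added, so the decoder knows how many bits to drop.
using System.Collections.Generic;
using System.Text;
static class PaddedHex
{
    public static string Encode(IEnumerable<bool> bits)
    {
        var sb = new StringBuilder();
        int nibble = 0, count = 0;
        foreach (bool bit in bits)
        {
            nibble = (nibble << 1) | (bit ? 1 : 0);
            if (++count == 4) { sb.Append("0123456789ABCDEF"[nibble]); nibble = 0; count = 0; }
        }
        int pad = (4 - count) % 4;              // 0..3 padding bits
        if (count != 0) sb.Append("0123456789ABCDEF"[nibble << pad]);
        sb.Append("0123456789ABCDEF"[pad]);     // trailing digit = pad count
        return sb.ToString();
    }
}
// 1010101 -> "AA1": groups 1010 and 101x with x=0, plus a final '1' saying one bit is padding.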

How does UTF16 encode characters?

EDIT
Since it seems I'm not going to get an answer to the general question, I'll restrict it to one detail: is my understanding of the following correct?
That surrogates work as follows:
1. If the first pair of bytes is not between D800 and DBFF, there will not be a second pair.
2. If it is between D800 and DBFF: a) there will be a second pair; b) the second pair will be in the range DC00 to DFFF.
3. There is no single-pair UTF16 character with a value between D800 and DBFF.
4. There is no single-pair UTF16 character with a value between DC00 and DFFF.
Is this right?
Original question
I've tried reading about UTF16 but I can't seem to understand it. What are "planes" and "surrogates" etc.? Is a "plane" the first 5 bits of the first byte? If so, then why not 32 planes since we're using those 5 bits anyway? And what are surrogates? Which bits do they correspond to?
I do understand that UTF16 is a way to encode Unicode characters, and that it sometimes encodes characters using 16 bits, and sometimes 32 bits, no more no less. I assume that there is some list of values for the first 2 bytes (which are the most significant ones?) which indicates that a second 2 bytes will be present.
But instead of me going on about what I don't understand, perhaps someone can make some order in this?
Yes on all four.
To clarify, the term "pair" in UTF-16 refers to two UTF-16 code units, the first in the range D800-DBFF, the second in DC00-DFFF.
A code unit is 16-bits (2 bytes), typically written as an unsigned integer in hexadecimal (0x000A). The order of the bytes (0x00 0x0A or 0x0A 0x00) is specified by the author or indicated with a BOM (0xFEFF) at the beginning of the file or stream. (The BOM is encoded with the same algorithm as the text but is not part of the text. Once the byte order is determined and the bytes are reordered to the native ordering of the system, it typically is discarded.)
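In C#, combining a surrogate pair back into a code point looks like this (a small sketch; char.ConvertToUtf32 performs the same arithmetic for you):
using System;
class SurrogateDemo
{
    static void Main()
    {
        char hi = '\uD834', lo = '\uDD1E';  // surrogate pair for U+1D11E (musical G clef)
        // Each unit contributes 10 bits of the offset above U+FFFF.
        int codePoint = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        Console.WriteLine($"U+{codePoint:X}");                   // U+1D11E
        Console.WriteLine($"U+{char.ConvertToUtf32(hi, lo):X}"); // same result via the library
    }
}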

Proper encoding for fixed-length storage of Unicode strings?

I'm going to be working on software (in C#) that needs to read/write Unicode strings (specifically English, German, Spanish and Arabic) to a hardware device. The firmware developer tells me that his code expects to store each string as a fixed-length byte array in one binary file so he can quickly access any string using an index (index * length = starting offset, then read the fixed-length number of bytes). I understand that .NET internally uses a UTF-16 encoding, which I believe is technically a variable-length encoding (depending upon the number of the Unicode code point). I'm fairly certain that English, German and Spanish would all use two bytes/character when encoded using UTF-16, but I'm not so sure about Arabic. It looks like there might be some Arabic characters that could possibly require three bytes each in UTF-16, and that would seem to break the firmware developer's plan to store the strings as a fixed length.
First, can anyone confirm my understanding of the variable-length nature of UTF-8/UTF-16 encodings? And second, although it would waste a lot of space, is UTF-32 (fixed-size, each character represented using 4 bytes) the best option for ensuring that each string could be stored as a fixed length? Thanks!
Unicode terminology:
Each entry in the Unicode character set is a code point
Encoded code points consist of one or more code units in a transformation format (UTF-8 uses 8 bit code units; UTF-16 uses 16 bit code units)
The user-visible grapheme might consist of a sequence of code points
So:
A code point in UTF-8 is 1, 2, 3 or 4 octets wide
A code point in UTF-16 is 2 or 4 octets wide
A code point in UTF-32 is 4 octets wide
The number of graphemes rendered on the screen might be less than the number of code points
So, if you want to support the entire Unicode range, you need to make the fixed-length strings a multiple of 32 bits regardless of which of these UTFs you choose as the encoding. (I'm assuming unused bytes will be set to 0x0 and appended or trimmed during I/O.)
In terms of communicating length restrictions via a user interface you'll probably want to decide on some compromise based on a code unit size and the typical customer rather than try to find the width of the most complicated grapheme you can build.
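A minimal sketch of such a fixed-length record in C# (the 16-code-point slot size and zero padding are assumptions for illustration, not the firmware's actual format):
using System;
using System.Text;
static class FixedSlots
{
    const int MaxCodePoints = 16;              // assumed per-string limit
    const int SlotBytes = MaxCodePoints * 4;   // UTF-32: 4 bytes per code point
    public static byte[] Pack(string s)
    {
        byte[] encoded = Encoding.UTF32.GetBytes(s);
        if (encoded.Length > SlotBytes) throw new ArgumentException("string too long for slot");
        byte[] slot = new byte[SlotBytes];     // zero-padded, so index * SlotBytes seeks to any string
        Buffer.BlockCopy(encoded, 0, slot, 0, encoded.Length);
        return slot;
    }
    public static string Unpack(byte[] slot) => Encoding.UTF32.GetString(slot).TrimEnd('\0');
}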

Encoding a 5 character string into a unique and repeatable 32bit Integer

I've not given this much thought yet, so this might turn out to be a silly question.
How can I take a unique 5-character ASCII string and convert it into a unique and reproducible (i.e. it needs to be the same every time) 32-bit integer?
Any ideas?
Assuming it is in fact ASCII (i.e., no characters with ordinal values greater than 127), you have five characters of 7 bits, or 35 bits of information. There is no way to generate a 32-bit code from 35 bits that is guaranteed to be unique; you're missing three bits, so each code will also represent 7 other valid ASCII strings. However, you can make it very, very unlikely that you will ever see a collision by being careful in how you calculate the code so that input strings that are very similar have very different codes. I see another answer has suggested CRC-32. You could also use a hash function such as MD5 or SHA-1 and use only the first 32 bits; this is probably best because hash functions are specifically designed for this purpose.
If you can further constrain the values of the input string (say, only alphanumeric, no lowercase, no control characters, or something of the sort), you can probably eliminate that extra data and generate guaranteed unique 32-bit codes for each string.
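For instance, hashing down to 32 bits in C# (a sketch; as noted above, uniqueness is not guaranteed):
using System;
using System.Security.Cryptography;
using System.Text;
class HashTo32
{
    static uint Code(string s)
    {
        using var md5 = MD5.Create();
        byte[] digest = md5.ComputeHash(Encoding.ASCII.GetBytes(s));
        return BitConverter.ToUInt32(digest, 0); // keep only the first 32 bits
    }
    static void Main() => Console.WriteLine(Code("ABCDE"));
}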
If they're guaranteed to be alphanumeric only, and case-insensitive ([A-Z][0-9]) you can treat it as a base-36 number.
If all five characters will belong to a set of 84 or fewer distinct characters, then you can squish five of them into a longword. Convert each character into a value 0..83, then
intvalue = (((char4*84 + char3)*84 + char2)*84 + char1)*84 + char0;
char0 = intvalue % 84;
char1 = (intvalue / 84) % 84;
char2 = (intvalue / (84*84)) % 84;
char3 = (intvalue / (84*84L*84)) % 84;
char4 = (intvalue / (84*84L*84*84L)) % 84;
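The same squeeze written out as runnable C# (uint is needed because 84^5 - 1, about 4.18 billion, exceeds a signed 32-bit int but just fits in 32 bits):
static class Base84Pack
{
    // Packs five symbol values, each 0..83, into one 32-bit integer and back.
    public static uint Pack(int c0, int c1, int c2, int c3, int c4) =>
        ((((uint)c4 * 84 + (uint)c3) * 84 + (uint)c2) * 84 + (uint)c1) * 84 + (uint)c0;
    public static int[] Unpack(uint v)
    {
        var c = new int[5];
        for (int i = 0; i < 5; i++) { c[i] = (int)(v % 84); v /= 84; }
        return c;
    }
}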
BTW, I wonder if anyone uses base-84 encoding as a standard; on many platforms it could be easier to handle than base-64, and the results would be more compact.
If you need to handle extended ASCII you are out of luck, as you would need 5 full chars which is 40 bits. Even with non-extended chars (top bit not used), you are still out of luck as you are trying to encode 35 bits of ASCII data into 32 bits of integer.
Extended ASCII goes from 0-255, which takes 8 bits. In 32 bits, you have room for 4 of those, not 5. So, to make it short and sweet, you can't do this.
Even if you are willing to ignore the high-order values (128-255) and use only ASCII characters 0-127 at 7 bits per character, you are still 3 bits short (7*5 = 35 and you only have 32 available).
One way is to treat the 5 characters as numerals in base N, where N is the number of characters in your alphabet (the set of allowed characters). From there on, it's just simple base conversion.
Given that you have 32 bits available and 5 characters to store, that means you can have floor((2^32)^(1/5)) = 84 characters in your alphabet.
Assuming you only include basic ASCII, not extended ASCII (>127), you have 7 bits of information in a single character, so that's a bit of a problem - there are too many possibilities to create unique values for every string. However, the first 32 characters, as well as the last character, are control characters, and if you exclude those, you're down to 95 characters.
You still have to cut 11 characters, though. Wikipedia has a nice chart of the characters in ASCII which you can use to determine which characters you need.

Does a string's length equal the byte size?

Exactly that: does a string's length equal its byte size? Does it depend on the language?
I think it is, but I just want to make sure.
Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need know.
Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi shortstring) has an extra byte for the length. And Unicode strings have more than one byte per character.
With Unicode it depends on the encoding: it could be 2 or 4 bytes per character, or even a mix of 1, 2 and 4 bytes.
It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code unit. However, surrogate pairs require two UTF-16 code units for a full Unicode character in the range U+10000 to U+10FFFF. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
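A quick C# illustration of the difference (the sample strings are arbitrary; all three have the same Length but different encoded sizes):
using System;
using System.Text;
class LengthVsBytes
{
    static void Main()
    {
        foreach (var s in new[] { "hello", "h\u00E9llo", "h\u20ACllo" }) // hello, héllo, h€llo
            Console.WriteLine($"\"{s}\": Length={s.Length} chars, " +
                $"UTF-8={Encoding.UTF8.GetByteCount(s)} bytes, " +
                $"UTF-16={Encoding.Unicode.GetByteCount(s)} bytes");
    }
}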
It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
Not always, it depends on the encoding.
There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode (UTF-16) strings in various languages use two bytes per code unit, so at least two bytes per character.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? Getting the nth char will be linear in n rather than constant time, as it would be with one byte per character.

Resources