How many bytes does one Unicode character take? - string

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?
I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?
And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?
I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.

Strangely enough, nobody pointed out how to calculate how many bytes one Unicode character takes. Here is the rule for UTF-8 encoded strings:
Binary    Hex         Comments
0xxxxxxx  0x00..0x7F  Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF  Continuation byte: one of 1-3 bytes following the first
110xxxxx  0xC0..0xDF  First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF  First byte of a 3-byte character encoding
11110xxx  0xF0..0xF7  First byte of a 4-byte character encoding
So the quick answer is: a character takes 1 to 4 bytes, and the first byte tells you how many bytes the sequence occupies.
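For illustration, here is a small Python sketch (my addition, not part of the rule itself) that walks a UTF-8 byte string using the lead-byte table above:

def utf8_char_length(lead_byte):
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead_byte <= 0x7F:            # 0xxxxxxx - single-byte (ASCII)
        return 1
    if 0xC0 <= lead_byte <= 0xDF:    # 110xxxxx - first of 2 bytes
        return 2
    if 0xE0 <= lead_byte <= 0xEF:    # 1110xxxx - first of 3 bytes
        return 3
    if 0xF0 <= lead_byte <= 0xF7:    # 11110xxx - first of 4 bytes
        return 4
    raise ValueError("0x80..0xBF is a continuation byte, not a lead byte")

data = "a\u00a9\u20ac\U0001f4a9".encode("utf-8")   # a, copyright sign, euro sign, pile of poo
i = 0
while i < len(data):
    n = utf8_char_length(data[i])
    print(data[i:i + n].decode("utf-8"), "->", n, "byte(s)")
    i += n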

You won't see a simple answer because there isn't one.
First, Unicode doesn't contain "every character from every language", although it sure does try.
Unicode itself is a mapping: it defines codepoints, and a codepoint is a number, usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. Those can be combined with another character, such as an a or a u, to create a new logical character. A character can therefore consist of one or more codepoints.
To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32, etc. They are distinguished largely by the size of their code units. UTF-32 is the simplest encoding: its code unit is 32 bits, so an individual codepoint fits comfortably into a single code unit. The other encodings have situations where a codepoint needs multiple code units, or where a particular codepoint can't be represented at all (this is a problem, for instance, with UCS-2).
Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. Normalization is a protocol for dealing with characters that have more than one representation (you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining character, or "accented 'a'", which is one codepoint).
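To see the normalization point in practice, here is a short Python sketch (my addition; it uses only the standard unicodedata module):

import unicodedata

precomposed = "\u00E1"      # 'a with acute' as one codepoint (LATIN SMALL LETTER A WITH ACUTE)
combined = "a\u0301"        # 'a' plus COMBINING ACUTE ACCENT: two codepoints, same logical character

print(len(precomposed), len(combined))                                  # 1 vs 2 codepoints
print(len(precomposed.encode("utf-8")), len(combined.encode("utf-8")))  # 2 vs 3 bytes in UTF-8
print(unicodedata.normalize("NFC", combined) == precomposed)            # True: NFC composes
print(unicodedata.normalize("NFD", precomposed) == combined)            # True: NFD decomposes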

I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it'll be useful to someone).
As far as I know old ASCII characters took one byte per character.
Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so it only uses half of a byte's 256 possible values (if that makes any sense).
How many bytes does a Unicode character require?
Unicode just maps characters to codepoints. It doesn't define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.
I assume that one Unicode character can contain every possible
character from any language - am I correct?
No. But almost. So basically yes. But still no.
So how many bytes does it need per character?
Same as your 2nd question.
And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they some kind of Unicode
versions?
No, those are encodings. They define how bytes/octets should represent Unicode characters.
A couple of examples. If some of those cannot be displayed in your browser (probably because the font doesn't support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.
U+0061 LATIN SMALL LETTER A: a
Nº: 97
UTF-8: 61
UTF-16: 00 61
U+00A9 COPYRIGHT SIGN: ©
Nº: 169
UTF-8: C2 A9
UTF-16: 00 A9
U+00AE REGISTERED SIGN: ®
Nº: 174
UTF-8: C2 AE
UTF-16: 00 AE
U+1337 ETHIOPIC SYLLABLE PHWA: ጷ
Nº: 4919
UTF-8: E1 8C B7
UTF-16: 13 37
U+2014 EM DASH: —
Nº: 8212
UTF-8: E2 80 94
UTF-16: 20 14
U+2030 PER MILLE SIGN: ‰
Nº: 8240
UTF-8: E2 80 B0
UTF-16: 20 30
U+20AC EURO SIGN: €
Nº: 8364
UTF-8: E2 82 AC
UTF-16: 20 AC
U+2122 TRADE MARK SIGN: ™
Nº: 8482
UTF-8: E2 84 A2
UTF-16: 21 22
U+2603 SNOWMAN: ☃
Nº: 9731
UTF-8: E2 98 83
UTF-16: 26 03
U+260E BLACK TELEPHONE: ☎
Nº: 9742
UTF-8: E2 98 8E
UTF-16: 26 0E
U+2614 UMBRELLA WITH RAIN DROPS: ☔
Nº: 9748
UTF-8: E2 98 94
UTF-16: 26 14
U+263A WHITE SMILING FACE: ☺
Nº: 9786
UTF-8: E2 98 BA
UTF-16: 26 3A
U+2691 BLACK FLAG: ⚑
Nº: 9873
UTF-8: E2 9A 91
UTF-16: 26 91
U+269B ATOM SYMBOL: ⚛
Nº: 9883
UTF-8: E2 9A 9B
UTF-16: 26 9B
U+2708 AIRPLANE: ✈
Nº: 9992
UTF-8: E2 9C 88
UTF-16: 27 08
U+271E SHADOWED WHITE LATIN CROSS: ✞
Nº: 10014
UTF-8: E2 9C 9E
UTF-16: 27 1E
U+3020 POSTAL MARK FACE: 〠
Nº: 12320
UTF-8: E3 80 A0
UTF-16: 30 20
U+8089 CJK UNIFIED IDEOGRAPH-8089: 肉
Nº: 32905
UTF-8: E8 82 89
UTF-16: 80 89
U+1F4A9 PILE OF POO: 💩
Nº: 128169
UTF-8: F0 9F 92 A9
UTF-16: D8 3D DC A9
U+1F680 ROCKET: 🚀
Nº: 128640
UTF-8: F0 9F 9A 80
UTF-16: D8 3D DE 80
Okay I'm getting carried away...
Fun facts:
If you're looking for a specific character, you can copy&paste it on http://codepoints.net/.
I wasted a lot of time on this useless list (but it's sorted!).
MySQL has a charset called "utf8" which actually does not support characters longer than 3 bytes. So you can't insert a pile of poo, the field will be silently truncated. Use "utf8mb4" instead.
There's a snowman test page (unicodesnowmanforyou.com).
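If you want to reproduce entries like the ones above for your own characters, something like this Python sketch works (my addition, not the code I originally used):

import unicodedata

for ch in "a\u00a9\u00ae\u20ac\u2603\U0001f4a9\U0001f680":   # a, copyright, registered, euro, snowman, poo, rocket
    print("U+%04X %s: %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"), ch))
    print("  Nº:", ord(ch))
    print("  UTF-8:", ch.encode("utf-8").hex(" ").upper())
    print("  UTF-16:", ch.encode("utf-16-be").hex(" ").upper())   # big-endian, no BOM, to match the list above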

Simply speaking, Unicode is a standard which assigns one number (called a code point) to every character of the world (it's still a work in progress).
Now you need to represent these code points using bytes; that's called character encoding. UTF-8, UTF-16 and UTF-32 are ways of representing those characters.
UTF-8 is a multibyte character encoding. Characters take 1 to 4 bytes (the original design allowed sequences of up to 6 bytes, but those are no longer needed since Unicode stops at U+10FFFF).
In UTF-32 each character takes exactly 4 bytes.
UTF-16 uses 16-bit code units; a single unit covers the part of Unicode called the BMP (for all practical purposes it's enough), and characters outside the BMP take two units. Java uses this encoding in its strings.

In UTF-8:
1 byte: 0 - 7F (ASCII)
2 bytes: 80 - 7FF (all European plus some Middle Eastern)
3 bytes: 800 - FFFF (the rest of the Basic Multilingual Plane, incl. private-use)
4 bytes: 10000 - 10FFFF
In UTF-16:
2 bytes: 0 - D7FF and E000 - FFFF (the Basic Multilingual Plane except the surrogate range)
4 bytes: 10000 - 10FFFF (encoded as a surrogate pair)
In UTF-32:
4 bytes: 0 - 10FFFF
10FFFF is the last Unicode codepoint by definition, and it's defined that way because it's UTF-16's technical limit.
It is also the largest codepoint UTF-8 can encode in 4 bytes, but the idea behind UTF-8's encoding also works for 5- and 6-byte sequences to cover codepoints up to 7FFFFFFF, i.e. half of what UTF-32 can.
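Those boundaries are easy to spot-check, e.g. with a few lines of Python (my addition):

for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    ch = chr(cp)
    print("U+%06X  UTF-8: %d bytes  UTF-16: %d bytes  UTF-32: %d bytes" % (
        cp,
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # the -le/-be variants avoid counting a BOM
        len(ch.encode("utf-32-le")),
    ))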

In Unicode the answer is not easily given. The problem, as you already pointed out, are the encodings.
Given any English sentence without diacritic characters, the answer for UTF-8 would be as many bytes as characters, and for UTF-16 it would be the number of characters times two.
The only encoding where (as of now) we can make that statement about the size is UTF-32. There it's always 32 bits per character, even though I imagine that code points are prepared for a future UTF-64 :)
What makes it so difficult are at least two things:
composed characters, where instead of using the precomposed accented/diacritic character entity (À), a user decided to combine the accent and the base character (`A).
code points. Code points are the means by which the UTF encodings encode more values than the number of bits in their name would usually allow. E.g. UTF-8 designates certain bytes which on their own are invalid, but which, when followed by a valid continuation byte, describe a character beyond the 8-bit range of 0..255. See the Examples and Overlong Encodings sections in the Wikipedia article on UTF-8.
The excellent example given there is that the € character (code point U+20AC) can be represented either as the three-byte sequence E2 82 AC or the four-byte sequence F0 82 82 AC.
Both are valid, and this shows how complicated the answer is when talking about "Unicode" and not about a specific encoding of Unicode, such as UTF-8 or UTF-16. Strictly speaking, as pointed out in a comment, this no longer seems to be the case (or was based on a misunderstanding on my part). The quote from the updated Wikipedia article reads: "Longer encodings are called overlong and are not valid UTF-8 representations of the code point."
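As a quick illustration of that last point (my own addition), a strict UTF-8 decoder such as Python's built-in codec accepts only the canonical form and rejects the overlong one:

print(b"\xe2\x82\xac".decode("utf-8"))     # the canonical 3-byte form decodes to the euro sign

try:
    b"\xf0\x82\x82\xac".decode("utf-8")    # overlong 4-byte form of U+20AC
except UnicodeDecodeError as err:
    print("rejected as invalid UTF-8:", err)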

There is a great tool for calculating the bytes of any string in UTF-8: http://mothereff.in/byte-counter
Update: @mathias has made the code public: https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js

Well I just pulled up the Wikipedia page on it too, and in the intro portion I saw "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)"
As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually multiple encoding forms of Unicode, and, again in that quote, one of them (UTF-8) even uses a single byte per character for ASCII text, just like what you are used to.
So your simple answer that you want is that it varies.

Unicode is a standard which provides a unique number for every character. These unique numbers are called code points, and they are assigned to (nearly) all characters existing in the world (some are still being added).
For different purposes, you might need to represent these code points in bytes (most programming languages do so), and that's where character encoding kicks in.
UTF-8, UTF-16, UTF-32 and so on are all character encodings, and Unicode's code points are represented in these encodings in different ways.
UTF-8 is a variable-width encoding; characters encoded in it occupy 1 to 4 bytes inclusive;
UTF-16 is also variable-width; characters take one or two 16-bit code units, i.e. 2 or 4 bytes. A single code unit covers the part of Unicode called the BMP (Basic Multilingual Plane), which is enough for almost all cases. Java uses UTF-16 encoding for its strings and characters;
UTF-32 is fixed-width; each character takes exactly 4 bytes (32 bits).

For UTF-16, a character needs four bytes (two code units) if its code point is U+10000 or above; such a character is encoded as a "surrogate pair." More specifically, a surrogate pair has the form:
[0xD800 - 0xDBFF] [0xDC00 - 0xDFFF]
where [...] indicates a two-byte code unit in the given range. Any code point up to 0xD7FF, or from 0xE000 to 0xFFFF, is a single code unit (two bytes); a lone code unit in the 0xD800 - 0xDFFF range is invalid outside a surrogate pair.
See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.
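If you want to see the arithmetic behind a surrogate pair, here is a small sketch (my addition, in Python; the function name is made up for illustration):

def to_surrogate_pair(cp):
    """Split a code point at or above U+10000 into its two UTF-16 code units."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000                 # 20 bits of payload
    high = 0xD800 + (offset >> 10)        # top 10 bits -> lead surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> trail surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F4A9)])   # ['0xd83d', '0xdca9'], i.e. 4 bytes total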

Check out this Unicode code converter. For example, enter 0x2009, where 2009 is the Unicode number for thin space, in the "0x... notation" field, and click Convert. The hexadecimal byte sequence E2 80 89 (3 bytes) appears in the "UTF-8 code units" field.

From Wiki:
UTF-8, an 8-bit variable-width encoding which maximizes compatibility with ASCII;
UTF-16, a 16-bit, variable-width encoding;
UTF-32, a 32-bit, fixed-width encoding.
These are the three most popular encodings.
In UTF-8, each character is encoded into 1 to 4 bytes (it is the dominant encoding),
in UTF-16 each character is encoded into one or two 16-bit words, and
in UTF-32 every character is encoded as a single 32-bit word.

Related

Using hexdump and how to find associated character?

I execute hexdump on a data file and it prints out the following:
> hexdump myFile.data
a4c3
After switching byte order I have the following:
c3a4
Do I assume those HEX values are actual Unicode values?
If so, the values are U+A4C3 and U+C3A4.
Or do I take the c3a4 and treat it as UTF-8 data (since my Putty session is set to UTF-8) then convert it to Unicode?
If so, it results in U+00E4, which then is ä.
Which is the proper interpretation?
You cannot assume those hex values are Unicode values. In fact, hexdump will never (well, see below...) give you Unicode values.
Those hex values represent the binary data as it was written to disk when the file was created. But in order to translate that data back to any specific characters/symbols/glyphs, you need to know what specific character encoding was used when the file was created (ASCII, UTF-8, and so on).
Also, I recommend using hexdump with the -C option (that's the uppercase C) to give the so-called "canonical" representation of the hex data:
c3 a4 0a
In my case, there is also a 0a representing a newline character.
So, in the above example we have 0xc3 followed by 0xa4 (I added the 0x part to indicate we are dealing with hex values). I happen to know that this file used UTF-8 when it was created. I can therefore determine that the character in the file is ä (also referred to by Unicode U+00e4).
But the key point is: you must know how the file was encoded, to know with certainty how to interpret the bytes provided by hexdump.
Unicode is (amongst other things) an abstract numbering system for characters, separate from any specific encoding. That is one of the reasons why it is so useful. But it just so happens that its designers used the same encoding as ASCII for the initial set of characters. So that is why ASCII letter a has the same code value as Unicode a. As you can see with Unicode vs. UTF-8, the encodings are not the same, once you get beyond that initial ASCII code range.
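To make the key point concrete, here is the same pair of bytes interpreted under two different assumed encodings (a small Python sketch of my own, not part of the original exchange):

raw = bytes([0xC3, 0xA4])

print(raw.decode("utf-8"))      # 'ä' (U+00E4) - correct, because the file really was UTF-8
print(raw.decode("latin-1"))    # 'Ã¤' - the classic mojibake you get if you guess the encoding wrong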

Reading 2 byte utf8 binary data

I have binary files that contain utf8 strings, for example:
4F 00 4B 00
I'm trying to read this data and write it out to a text file but when I do the following:
data.toString('utf8');
I get an output of:
O K
Take note of the two spaces being produced from the 00 bytes. Is there any way to specify that I'm using 2-byte little-endian characters? I imagine if this didn't contain ASCII characters it would actually break and produce garbage data instead of extra spaces.
The problem is likely in how you're reading the string, not in how you're writing it. The data you shared is not UTF-8, it's UTF-16. So what you want is to read the string as UTF-16 and write it out as UTF-8.
Specifically this is UTF-16LE.
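A short sketch of the difference (in Python rather than the asker's Node.js, just to keep it self-contained):

data = bytes([0x4F, 0x00, 0x4B, 0x00])

print(data.decode("utf-16-le"))   # 'OK' - read as UTF-16 little-endian, as intended
print(data.decode("utf-8"))       # 'O\x00K\x00' - the stray NUL bytes behind the "extra spaces"

In Node.js the reading side would be data.toString('utf16le') instead of data.toString('utf8').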

YOaf/MrA - What character encoding?

What kind of character encoding are the strings below?
KDLwuq6IC
YOaf/MrAT
0vGzc3aBN
SQdLlM8G7
https://en.wikipedia.org/wiki/Character_encoding
Character encoding is the encoding of strings to bytes (or numbers). You are only showing us the characters themselves. They don't have any encoding by themselves.
Some character encodings have a different range of characters that they can encode. Your characters are all in the ASCII range at least. So they would also be compatible with any scheme that incorporates ASCII as a subset such as Windows-1252 and of course Unicode (UTF-8, UTF-16LE, UTF-16BE etc).
Note that your code looks a lot like base 64. Base 64 is not a character encoding though, it is the encoding of bytes into characters (so the other way around). Base 64 can usually be recognized by looking for / and + characters in the text, as well as the text consisting of blocks that are a multiple of 4 characters (as 4 characters encode 3 bytes).
Looking at the text, you are probably dealing with an encoding scheme (bytes to characters) rather than a character-encoding scheme.
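To illustrate the 3-bytes-to-4-characters ratio, a tiny Python sketch (my addition):

import base64, os

blob = os.urandom(9)                                            # 9 arbitrary bytes
text = base64.b64encode(blob).decode("ascii")                   # encodes bytes INTO characters

print(len(blob), "bytes ->", len(text), "characters:", text)    # 9 bytes -> 12 characters
print(base64.b64decode(text) == blob)                           # True: it round-trips exactly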

How much data can you encode in a single character?

If I were creating a videogame level editor in AS3 or .NET with a string-based level format that can be copied, pasted and emailed, how much data could I encode into each character? What is important is getting the maximum amount of data for the minimum amount of characters displayed on the screen, regardless of how many bytes the computer is actually using to store these characters.
For example, if I wanted to store the horizontal position of an object in 1 string character, how many possible values could that have? Are there any characters that can't be sent over the internet, or that can't be copied and pasted? What difference would things like UTF-8 make? Answers please for either AS3 or C#/.NET, or both.
2nd update: OK, so Flash uses UTF-16 for its String class. There are lots of control characters that I cannot use. How could I manage which characters are ok to use? Just a big lookup table? And can operating systems and browsers handle UTF-16 to the extent that you can safely copy and paste a UTF-16 string into an email, notepad, etc.?
Updated: "update 1", "update 2"
You can store 8 bits in a single character with ANSI, ASCII or UTF-8 encoding.
But, for example, if you want to use an ASCII/ANSI-style single-byte encoding, you shouldn't use the first 32 codes (0x00..0x1F) or 0x7F, because they represent control characters (escape, null, start of text, end of text, ...) that cannot reliably be copied and pasted. That leaves 223 different values you can store in one single character.
If you use UTF-16 you have 2 bytes = 16 bits, minus the control and unusable characters, to store your information.
'A' in UTF-8 encoding: 0x41 (a single byte)
'A' in UTF-16 encoding: 0x0041 (the first two hex digits can be non-zero for other characters)
'A' in ASCII encoding: 0x41
'A' in ANSI encoding: 0x41
See the ASCII table at the end of this post!
update 1:
If you don't need to modify the values without a tool (a C# tool, a JavaScript-based web page, ...), you can alternatively Base64-encode (or zip + Base64) your information. This avoids the problem you describe in your 2nd update: "there are lots of control characters that I cannot use. How could I manage which characters are ok to use?"
If that is not an option, you cannot avoid using some kind of lookup table.
The shortest way to express such a lookup table is:
var illegalCharCodes = new byte[]{0x00, 0x01, 0x02, ..., 0x1f, 0x7f};
Or you can code it without a table, like this:
// The example is based on ANSI encoding, but in principle it's the same with UTF-16
var value = 0;
if (charcode > 0x7f)
    value = charcode - 0x1f - 1; // -1 because 0x7f is the first illegal char code above 0x1f
else
    value = charcode - 0x1f;
value -= 1; // shift down so that ' ' (0x20), the first usable char code, maps to 0
// charcode: 0x20 (' ') -> value: 0
// charcode: 0x21 ('!') -> value: 1
// charcode: 0x22 ('"') -> value: 2
// charcode: 0x7e ('~') -> value: 94
// charcode: 0x80 ('€') -> value: 95
// charcode: 0x81 ('�', undefined in ANSI) -> value: 96
// ...
update 2:
For Unicode (UTF-16) you can use this table: http://www.tamasoft.co.jp/en/general-info/unicode.html
Any character that the table shows as a placeholder box or as blank should not be used.
So you cannot store 50,000 possible values in one UTF-16 character if you want to allow copying and pasting them. You need a special encoder and you must use 2 UTF-16 characters, like:
// charcode: 0x0020 + 0x0020 ('  ') -> value: 0
// charcode: 0x0020 + 0x0021 (' !') -> value: 1
// charcode: 0x0021 + 0x0041 ('!A') -> value: something higher than 40,000; I don't know exactly because I haven't counted the illegal characters in UTF-16 :D
[ASCII table image omitted] (source: asciitable.com)
Confusingly, a char is not the same thing as a character. In C and C++, a char is virtually always an 8-bit type. In Java and C#, a char is a UTF-16 code unit and thus a 16-bit type.
But in Unicode, a character is represented by a "code point" that ranges from 0 to 0x10FFFF, for which a 16-bit type is inadequate. So a character must either be represented by a 21-bit type (in practice, a 32-bit type), or use multiple "code units". Specifically,
In UTF-32, all characters require 32 bits.
In UTF-16, characters U+0000 to U+FFFF (the "basic multilingual plane"), except for U+D800 to U+DFFF which cannot be represented, require 16 bits, and all other characters require 32 bits.
In UTF-8, characters U+0000 to U+007F (the ASCII repertoire) require 8 bits, U+0080 to U+07FF require 16 bits, U+0800 to U+FFFF require 24 bits, and all other characters require 32 bits.
If I were creating a videogame level
editor with a string-based level
format, how much data could I encode
into each char? For example if I
wanted to store the horizontal
position of an object in 1 char, how
many possible values could that have?
Since you wrote char rather than "character", the answer is 256 for C and 65,536 for C#.
But char isn't designed to be a binary data type. byte or short would be more appropriate.
Are there any characters that can't
be sent over the internet, or that
can't be copied and pasted?
There aren't any characters that can't be sent over the Internet, but you have to be careful using "control characters" or non-ASCII characters.
Many Internet protocols (especially SMTP) are designed for text rather than binary data. If you want to send binary data, you can Base64 encode it. That gives you 6 bits of information for each byte of the message.
In C, a char is a type of integer, and it's most typically one byte wide. One byte is 8 bits so that's 2 to the power 8, or 256, possible values (as noted in another answer).
In other languages, a 'character' is a completely different thing from an integer (as it should be), and has to be explicitly encoded to turn it into a byte. Java, for example, makes this relatively simple by storing characters internally in a UTF-16 encoding (forgive me some details), so they take up 16 bits, but that's just implementation detail. Different encodings such as UTF-8 mean that a character, when encoded for transmission, could occupy anything from one to four bytes.
Thus your question is slightly malformed (which is to say it's actually several distinct questions in one).
How many values can a byte have? 256.
What characters can be sent in emails? Mostly those ASCII characters from space (32) to tilde (126).
What bytes can be sent over the internet? Any you like, as long as you encode them for transmission.
What can be cut-and-pasted? If your platform can do Unicode, then all of unicode; if not, not.
Does UTF-8 make a difference? UTF-8 is a standard way of encoding a string of characters into a string of bytes, and probably not much to do with your question (Joel Spolsky has a very good account of The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)).
So pick a question!
Edit, following the edit to the question: Aha! If the question is 'how do I encode data in such a way that it can be mailed?', then the answer is probably 'use base64'. That is, if you have some purely binary format for your levels, then base64 is the 'standard' (very much quotes-standard) way of encoding that binary blob in a way that will make it through mail. The things you want to google for are 'serialization' and 'deserialization'. Base64 is probably close to the practical maximum of information-per-mailable-character.
(Another answer is 'use XML', but the question seems to imply some preference for compactness, and that a basically binary format is desirable).
The number of different states a variable can hold is two to the power of the number of bits it has. How many bits a variable has is something that is likely to vary according to the compiler and machine used. But in most cases a char will have eight bits and two to the power eight is two hundred and fifty six.
Modern screen resolutions being what they are, you will most likely need more than one char for the horizontal position of anything.

example of a utf-8 format octet string

I'm working w/ a function that expects a string formatted as a utf-8 encoded octet string. Can someone give me an example of what a utf-8 encoded octet string would look like?
Put another way, if I convert 'foo' to bytes, I get 112, 111, 111. What would these char codes look like as a utf-8 encoded octet string? Would it be "0x70 0x6f 0x6f"?
The context of my question is the process of generating an openid signature as described in the openid spec: "The message MUST be encoded in UTF-8 to produce a byte string." I'm looking for an example of what this would look like.
Thanks
No. UTF-8 characters can span multiple bytes. If you want to learn about UTF-8, you should start with its article on Wikipedia, which has a good description.
I think you may have made some mistakes in encoding your example, but in any case, my guess is that the answer you really need is that UTF-8 is a superset of ASCII (the standard way to encode characters into bytes).
So, if you give an ASCII encoded string into a function that expects a UTF-8 encoded string, it should work just fine.
However, the opposite isn't true at all. UTF-8 can represent a lot of characters that ASCII cannot, so giving a UTF-8 encoded string to a function that expects an ASCII (i.e. 'normal') string is dangerous (unless you're positive that all the characters are part of the ASCII subset).
The string "foo" gets encoded as 66 6F 6F, but it's like that in nearly all ASCII derivatives. That's one of the biggest features of UTF-8: Backwards compatibility with 7-bit ASCII. If you're only dealing with ASCII, you don't have to do anything special.
Other characters are encoded with up to 4 bytes. Specifically, the bits of the Unicode code point are broken up into one of the patterns:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
with the requirement of using the shortest sequence that fits. So, for example, the Euro sign ('€' = U+20AC = binary 10 000010 101100) gets encoded as 1110 0010, 10 000010, 10 101100 = E2 82 AC.
So, it's just a simple matter of going through the Unicode code points in a string and encoding each one in UTF-8.
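For example, the euro-sign calculation above can be checked like this (a Python sketch of my own; the bit masks mirror the 1110xxxx / 10xxxxxx pattern):

cp = 0x20AC                                # the euro sign's code point
byte1 = 0b11100000 | (cp >> 12)            # 1110xxxx
byte2 = 0b10000000 | ((cp >> 6) & 0x3F)    # 10xxxxxx
byte3 = 0b10000000 | (cp & 0x3F)           # 10xxxxxx
print("%02X %02X %02X" % (byte1, byte2, byte3))          # E2 82 AC
print("\u20ac".encode("utf-8").hex(" ").upper())         # E2 82 AC - same bytes from the built-in encoder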
The hard part is figuring out what encoding your string is in to begin with. Most modern languages (e.g., Java, C#, Python 3.x) have distinct types for "byte array" and "string", where "strings" always have the same internal encoding (UTF-16 or UTF-32), and you have to call an "encode" function if you want to convert it to an array of bytes in a specific encoding.
Unfortunately, older languages like C conflate "characters" and "bytes". (IIRC, PHP is like this too, but it's been a few years since I used it.) And even if your language does support Unicode, you still have to deal with disk files and web pages with unspecified encodings. For more details, search for "chardet".

Resources