Does a strings length equal the byte size? - string

Exactly that: Does a strings length equal the byte size? Does it matter on the language?
I think it is, but I just want to make sure.
Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need know.

Nope. A zero terminated string has one extra byte. A pascal string (the Delphi shortstring) has an extra byte for the length. And unicode strings has more than one byte per character.
By unicode it depends on the encoding. It could be 2 or 4 bytes per character or even a mix of 1,2 and 4 bytes.

It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code point. However, surrogate pairs require two UTF-16 values for a full Unicode character in the range U+100000 to U+10FFFF. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.

It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.

Not always, it depends on the encoding.

There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode strings (in various languages) use two bytes per char.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.

You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this effects substring operations. If you don't have one byte per character, does s[n] get the nth byte or nth char? Getting the nth char will be inefficient for large n instead of constant, as it is with a one byte per character.

Related

Can lowercasing a UTF-8 string cause it to grow?

I'm helping out with someone writing some code to compare UTF-8 strings in a case-insensitive way. The scheme they are using is to uppercase the strings and then compare. The input strings can all fit in a 255 byte array. The output string similarly must fit in a 255 byte array.
I'm not a UTF-8 or Unicode expert, but I think this this scheme can't work for all strings. My understanding is that either lower casing or upper casing a UTF-8 string can result in the output string being longer (byte array wise), and as such changing case is probably not the best way to attack this problem. I'm trying to demonstrate the difficulty by giving a few strings that will not work with this design.
For example, take a string of the character U+0587 repeated 100 times. U+0587 takes two bytes in UTF-8, so the overall length of the byte array for the string is 200 bytes (ignoring the trailing null for now). If that string is uppercased, however, it becomes U+0535 U+0552, and each of those takes two bytes, for a total of 4 bytes. The 200 byte array is now 400 bytes, and cannot be stored in the limited space available.
So here's my question: I gave an example of a lowercase character needing more space to store when uppercased. Are there any examples of an uppercase character needing more space to store when lowercased? The locale is always en_US.UTF-8 in this case.
Thanks for any help.
Yes. Examples from my environment:
U+023A Ⱥ U+023E Ⱦ
There are several related factors that could cause variation. You already pointed out one:
The locale that you specify will affect the casing of characters that that locale is concerned with.
The version of the Unicode Common Locale Data Repository that your library uses.
The version of the Unicode Character Database that your library uses.
These aren't fixed targets because we can expect that there will be future versions and that there will be users using characters from them.
Ultimately, this comes down to your environment and to practical purpose this has.

What exactly is encoding-independent means

While reading the Strings and Characters chapter of the official Swift document I found the following sentence
"Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations"
Question What exactly do encoding-independent mean?
From my reading on Advanced Swift By Chris and other experiences, the thing that this sentence is trying to convey can be 2 folds.
First, what are various unicode representations:
UTF-8 : compatible with ASCII
UTF-16
UTF-32
The number on the right hand side means how many bits a Character will take when it represented or stored.
For a character, UTF-8 requires 8 bits while UTF-32 requires 32 bits.
However, a chinese character which can be represented by 1 UTF-32 memory might not always fit in 1 block of UTF-16 memory. If the character aquires all 32 bits then in UTF-8 it will have a count of 4.
Then comes the storing part. When you store a character in the String, it doesn't matter how you want to read it later.
For example:
Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations
This means, you can compose String by any way you like. And this wont effect the representation when reading on various unicode encoding formats like UTF-8 or 16 or 32.
This is seen clearly in the above example, When i try to load a Japanese Character which takes up 24 bit to store. The same character is displayed irrespective of my choice of encoding.
However, count value will differ. There are other points to consider like Code Unit and Code Point that make up this Strings.
For Unicode Encoding variants
I would highly recommend reading this article which goes way deeper into String api in swift.
Detail View of String API in swift

What is the difference between a unicode and binary string?

I am in python3.3.
What is the difference between a unicode string and a binary string?
b'\\u4f60'
u'\x4f\x60'
b'\x4f\x60'
u'4f60'
The concept of Unicode and binary string is confusing. How can i change b'\\u4f60' into b'\x4f\x60' ?
First - there is no difference between unicode literals and string literals in python 3. They are one and the same - you can drop the u up front. Just write strings. So instantly you should see that the literal u'4f60' is just like writing actual '4f60'.
A bytes literal - aka b'some literal' - is a series of bytes. Bytes between 32 and 127 (aka ASCII) can be displayed as their corresponding glyph, the rest are displayed as the \x escaped version. Don't be confused by this - b'\x61' is the same as b'a'. It's just a matter of printing.
A string literal is a string literal. It can contain unicode codepoints. There is far too much to cover to explain how unicode works here, but basically a codepoint represents a glyph (essentially, a character - a graphical representation of a letter/digit), it does not specify how the machine needs to represent it. In fact there are a great many different ways.
Thus there is a very large difference between bytes literals and str literals. The former describe the machine representation, the latter describe the alphanumeric glyphs that we are reading right now. The mapping between the two domains is encoding/decoding.
I'm skipping over a lot of vital information here. That should get us somewhere though. I highly recommend reading more since this is not an easy topic.
How can i change b'\\u4f60' into b'\x4f\x60' ?
Let's walk through it:
b'\u4f60'
Out[101]: b'\\u4f60' #note, unicode-escaped
b'\x4f\x60'
Out[102]: b'O`'
'\u4f60'
Out[103]: '你'
So, notice that \u4f60 is that Han ideograph glyph. \x4f\x60 is, if we represent it in ascii (or utf-8, actually), the letter O (\x4f) followed by backtick.
I can ask python to turn that unicode-escaped bytes sequence into a valid string with the according unicode glyph:
b'\\u4f60'.decode('unicode-escape')
Out[112]: '你'
So now all we need to do is to re-encode to bytes, right? Well...
Coming around to what I think you're wanting to ask -
How can i change '\\u4f60' into its proper bytes representation?
There is no 'proper' bytes representation of that unicode codepoint. There is only a representation in the encoding that you want. It so happens that there is one encoding that directly matches the transformation to b'\x4f\x60' - utf-16be.
b'\\u4f60'.decode('unicode-escape').encode('utf-16-be')
Out[47]: 'O`'
The reason this works is that utf-16 is a variable-length encoding. For code points below 16 bits it just directly uses the codepoint as the 2-byte encoding, and for points above it uses something called "surrogate pairs", which I won't get into.

What is the difference between binary safe strings and binary unsafe strings?

I was reading redis manifesto[1] and it seems redis accepts only binary safe strings as keys but I don't know the difference between the two. Can anyone explain with an example?
[1] http://oldblog.antirez.com/post/redis-manifesto.html
According to Redis documentation, simple Redis strings have syntax "+redis_response\r\n" whereas bulk Redis strings have syntax "$str_len\r\nbinary_safe_string\r\n".
In other words, binary safe string in Redis can contain any data as simple as "foo" to any binary data upto 512MB say a JEPG image. Binary safe string has its length encoded in it and does not terminate with any particular character such as a NULL terminating string in C which ends with '\0.
HTH,
Swanand
I'm not familiar with the system in question, but the term "binary safe string" might be used either to describe certain string-storage types or to describe particular string instances. In a binary-safe string type, a string of length N may be used to encapsulate any sequence of N values in the range either 0-255 or 0-65535 (for 8- or 16-bit types, respectively). A binary-safe string instance might be one whose representation may be subdivided into uniformly-sized pieces, with each piece representing one character, as distinct from a string instance in which different characters require different amounts of storage space.
Some string types (which are not binary safe) will use variable-length representations for certain characters, and will behave oddly if asked to act upon e.g. a string which contains the code for "first half of a multi-part character" followed by something other than a "second half of multi-part character". Further, some code which works with strings will assume that it the Nth character will be stored in either the Nth byte or the Nth pair of bytes, and will malfunction if given a string in which, e.g. the 8th character is stored in the 12th and 13th pairs of bytes.
Looking only briefly at the link provided, I would guess that it's saying that the redis does not expect to only work with strings that use different numbers of bytes to hold different characters, though I'm not quite clear whether it's assuming that a string type will be able to handle any possible sequence of bytes, or whether it's assuming that any string instance which it's given may be safely regarded as a sequence of bytes. I think the fundamental concepts of interest, though, are (1) some string types use variable-length encodings and others do not; (2) even in types that use variable-length encodings, a useful subset of string instances will consist only of fixed-length characters.
Binary-safe means that a string can contain any character, while binary-unsafe can not, such as '\0' in C language. '\0' is the ending of a string, which means characters after '\0' and before '\0' will be considered as two different strings.

Proper encoding for fixed-length storage of Unicode strings?

I'm going to be working on software (in c#) that needs to read/write Unicode strings (specifically English, German, Spanish and Arabic) to a hardware device. The firmware developer tells me that his code expects to store each string as fixed-length byte array in one binary file so he can quickly access any string using an index (index * length = starting offset and then read the fixed-length number of bytes). I understand that .NET internally uses a UTF-16 encoding which I believe is technically a variable-length encoding (depending upon the number of the Unicode code point). I'm fairly certain that English, German and Spanish would all use two bytes/character when encoded using UTF-16 but I'm not so sure about Arabic. It looks like there might be some Arabic characters that could possibly require three bytes each in UTF-16 and that would seem to break the firmware developers plan to store the strings as a fixed length.
First, can anyone confirm my understanding of the variable-length nature of UTF-8/UTF-16 encodings? And second, although it would waste a lot of space, is UTF-32 (fixed-size, each character represented using 4 bytes) the best option for ensuring that each string could be stored as a fixed length? Thanks!
Unicode terminology:
Each entry in the Unicode character set is a code point
Encoded code points consist of one or more code units in a transformation format (UTF-8 uses 8 bit code units; UTF-16 uses 16 bit code units)
The user-visible grapheme might consist of a sequence of code points
So:
A code point in UTF-8 is 1, 2, 3 or 4 octets wide
A code point in UTF-16 is 2 or 4 octets wide
A code point in UTF-32 is 4 octets wide
The number of graphemes rendered on the screen might be less than the number of code points
So, if you want to support the entire Unicode range you need to make the fixed-length strings a multiple of 32 bits regardless of which of these UTFs you choose as the encoding (I'm assuming unused bytes will be set to 0x0 and that these will be appended, trimmed during I/O.)
In terms of communicating length restrictions via a user interface you'll probably want to decide on some compromise based on a code unit size and the typical customer rather than try to find the width of the most complicated grapheme you can build.

Resources