What's the difference between a "bit" and "octet"? - python-3.x

What's the difference between a "bit" and "octet"? Some python books, depending on the author, seem to use the terms interchangeably. I asked a PHD level guy and he said there was a difference but didn't explain what the difference was.

A bit is a single binary digit.
An octet is a collection 8 bits, sometimes called a "byte". There is no formal definition of a byte as 8 bits (though it is the generally accepted standard). The term octet is used when it is necessary to unambiguously specify that there are only 8 bits in the collection.

An octet is always eight bits. A byte is typically eight bits, or the width of a character in a given architecture. Some older computers represented characters in six bits. See http://en.wikipedia.org/wiki/Byte .

An octet and a byte is the same.
A bit is just one 0 or 1. Eight bits make a byte.

Related

How to use leading_zeros/trailing_zeros in platform independent way?

I want find the first non-zero bit in the binary representation of a u32. leading_zeros/trailing_zeros looks like what I want:
let x: u32 = 0b01000;
println!("{}", x.trailing_zeros());
This prints 3 as expected and described in the docs. What will happen on big-endian machines, will it be 3 or some other number?
The documentation says
Returns the number of trailing zeros in the binary representation
is it related to machine binary representation (so the result of trailing_zeros depends on architecture) or base-2 numeral system (so result will be always 3)?
The type u32 respresents binary numbers with 32 bits as an abstract concept. You can imagine them as abstract, mathematical numbers in the range from 0 to 232-1. The binary representation of these numbers is written in the usual convention of starting with the most significant bit (MSB) and ending with the least significant bit (LSB), and the trailing_zeros() method returns the number of trailing zeros in that representation.
Endianness only comes into play when serializing such an integer to bytes, e.g. for writing it to a bytes buffer, a file or the network. You are not doing any of this in your code, so it doesn't matter here.
As mentioned above, writing a number starting with the MSB is also just a convention, but this convention is pretty much universal today for numbers written in positional notation. For programming, this convention is only relevant when formatting a number for display, parsing a number from a string, and maybe for naming methods like trailing_zeros(). When storing an u32 in a register, the bits don't have any defined order.

How does UTF16 encode characters?

EDIT
Since it seems I'm not going to get an answer to the general question. I'll restrict it to one detail: Is my understanding of the following, correct?
That surrogates work as follows:
If the first pair of bytes is not between D800 and DBFF - there
will not be a second pair.
If it is between D800 and DBFF - a) there will be a second pair b)
the second pair will be in the range of DC00 and DFFF.
There is no single pair UTF16 character with a value between D800
and DBFF.
There is no single pair UTF16 character with a value between DC00
and DFFF.
Is this right?
Original question
I've tried reading about UTF16 but I can't seem to understand it. What are "planes" and "surrogates" etc.? Is a "plane" the first 5 bits of the first byte? If so, then why not 32 planes since we're using those 5 bits anyway? And what are surrogates? Which bits do they correspond to?
I do understand that UTF16 is a way to encode Unicode characters, and that it sometimes encodes characters using 16 bits, and sometimes 32 bits, no more no less. I assume that there is some list of values for the first 2 bytes (which are the most significant ones?) which indicates that a second 2 bytes will be present.
But instead of me going on about what I don't understand, perhaps someone can make some order in this?
Yes on all four.
To clarify, the term "pair" in UTF-16 refers to two UTF-16 code units, the first in the range D800-DBFF, the second in DC00-DFFF.
A code unit is 16-bits (2 bytes), typically written as an unsigned integer in hexadecimal (0x000A). The order of the bytes (0x00 0x0A or 0x0A 0x00) is specified by the author or indicated with a BOM (0xFEFF) at the beginning of the file or stream. (The BOM is encoded with the same algorithm as the text but is not part of the text. Once the byte order is determined and the bytes are reordered to the native ordering of the system, it typically is discarded.)

Security: longer keys versus more available characters

I apologize if this has been answered before, but I was not able to find anything. This question was inspired by a comment on another security-related question here on SO:
How to generate a random, long salt for use in hashing?
The specific comment is as follows (sixth comment of accepted answer):
...Second, and more importantly, this will only return hexadecimal
characters - i.e. 0-9 and A-F. It will never return a letter higher
than an F. You're reducing your output to just 16 possible characters
when there could be - and almost certainly are - many other valid
characters.
– AgentConundrum Oct 14 '12 at 17:19
This got me thinking. Say I had some arbitrary series of bytes, with each byte being randomly distributed over 2^(8). Let this key be A. Now suppose I transformed A into its hexadecimal string representation, key B (ex. 0xde 0xad 0xbe 0xef => "d e a d b e e f").
Some things are readily apparent:
len(B) = 2 len(A)
The symbols in B are limited to 2^(4) discrete values while the symbols in A range over 2^(8)
A and B represent the same 'quantities', just using different encoding.
My suspicion is that, in this example, the two keys will end up being equally as secure (otherwise every password cracking tool would just convert one representation to another for quicker attacks). External to this contrived example, however, I suspect there is an important security moral to take away from this; especially when selecting a source of randomness.
So, in short, which is more desirable from a security stand point: longer keys or keys whose values cover more discrete symbols?
I am really interested in the theory behind this, so an extra bonus gold star (or at least my undying admiration) to anyone who can also provide the math / proof behind their conclusion.
If the number of different symbols usable in your password is x, and the length is y, then the number of different possible passwords (and therefore the strength against brute-force attacks) is x ** y. So you want to maximize x ** y. Both adding to x or adding to y will do that, Which one makes the greater total depends on the actual numbers involved and what your practical limits are.
But generally, increasing x gives only polynomial growth while adding to y gives exponential growth. So in the long run, length wins.
Let's start with a binary string of length 8. The possible combinations are all permutations from 00000000 and 11111111. This gives us a keyspace of 2^8, or 256 possible keys. Now let's look at option A:
A: Adding one additional bit.
We now have a 9-bit string, so the possible values are between 000000000 and 111111111, which gives us a keyspace size of 2^9, or 512 keys. We also have option B, however.
B: Adding an additional value to the keyspace (NOT the keyspace size!):
Now let's pretend we have a trinary system, where the accepted numbers are 0, 1, and 2. Still assuming a string of length 8, we have 3^8, or 6561 keys...clearly much higher.
However! Trinary does not exist!
Let's look at your example. Please be aware I will be clarifying some of it, which you may have been confused about. Begin with a 4-BYTE (or 32-bit) bitstring:
11011110 10101101 10111110 11101111 (this is, btw, the bitstring equivalent to 0xDEADBEEF)
Since our possible values for each digit are 0 or 1, the base of our exponent is 2. Since there are 32 bits, we have 2^32 as the strength of this key. Now let's look at your second key, DEADBEEF. Each "digit" can be a value from 0-9, or A-F. This gives us 16 values. We have 8 "digits", so our exponent is 16^8...which also equals 2^32! So those keys are equal in strength (also, because they are the same thing).
But we're talking about REAL passwords, not just those silly little binary things. Consider an alphabetical password with only lowercase letters of length 8: we have 26 possible characters, and 8 of them, so the strength is 26^8, or 208.8 billion (takes about a minute to brute force). Adding one character to the length yields 26^9, or 5.4 trillion combinations: 20 minutes or so.
Let's go back to our 8-char string, but add a character: the space character. now we have 27^8, which is 282 billion....FAR LESS than adding an additional character!
The proper solution, of course, is to do both: for instance, 27^9 is 7.6 trillion combinations, or about half an hour of cracking. An 8-character password using upper case, lower case, numbers, special symbols, and the space character would take around 20 days to crack....still not nearly strong enough. Add another character, and it's 5 years.
As a reference, I usually make my passwords upwards of 16 characters, and they have at least one Cap, one space, one number, and one special character. Such a password at 16 characters would take several (hundred) trillion years to brute force.

Word, Doubleword, Quadword

It's my second question, one after another. That's the problem with assembly (x86 - 32bit) too.
"Programming from the Ground Up" says that 4bytes are 32bits and that's a word.
But Intel's "Basic Architecture" guide says, that word is 16bits (2 bytes) and 4 bytes is a dualword.
Memory uses 4bytes words, to get to another word I have to skip next 4 bytes, on each word I can make 4 offsets (0-3) to read a byte, so it's wrong with Intel's name, but this memory definition goes from Intel, so what's there bad?
And how to operate on words, dualword, quadwords in assembly? How to define the number as quadword?
To answer your first question, the processor word size is a function of the architecture. Thus, a 32-processor has a 32-bit word. In software types, including assembly, usually there is need to identify the size unambigously, so word type for historical reasons is 16-bits. So probably both sources are correct, if you read them in context: the first one is referring to the processor word, while the Intel guide is referring to the word type.
We've got different "word"s: program words, memory words, OS-specific words, architecture-specific words (program space word, flash word, eeprom word), even address words.
It's just a matter of convention what size the word word refers to.
I usually find the size of the word by looking at the number of hex digits the context is using to show them. Intel's most common type, 4 digits (0x0000), is two bytes.
And for further information, even byte is a convention. In many systems in the past bytes have been 7 or 9 bits. Most architectures nowadays have 8-bit bytes. The correct name for an always-8-bit structure is an octet.

Does a strings length equal the byte size?

Exactly that: Does a strings length equal the byte size? Does it matter on the language?
I think it is, but I just want to make sure.
Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need know.
Nope. A zero terminated string has one extra byte. A pascal string (the Delphi shortstring) has an extra byte for the length. And unicode strings has more than one byte per character.
By unicode it depends on the encoding. It could be 2 or 4 bytes per character or even a mix of 1,2 and 4 bytes.
It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code point. However, surrogate pairs require two UTF-16 values for a full Unicode character in the range U+100000 to U+10FFFF. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
Not always, it depends on the encoding.
There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode strings (in various languages) use two bytes per char.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this effects substring operations. If you don't have one byte per character, does s[n] get the nth byte or nth char? Getting the nth char will be inefficient for large n instead of constant, as it is with a one byte per character.

Resources