While reading up on RFC 1035 (a.k.a. DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION) I found the following:
4.1.4. Message compression
In order to reduce the size of messages, the domain system utilizes a
compression scheme which eliminates the repetition of domain names in a
message. In this scheme, an entire domain name or a list of labels at
the end of a domain name is replaced with a pointer to a prior occurance
of the same name.
The pointer takes the form of a two octet sequence:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| 1  1|                OFFSET                   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The first two bits are ones. This allows a pointer to be distinguished
from a label, since the label must begin with two zero bits because
labels are restricted to 63 octets or less. (The 10 and 01 combinations
are reserved for future use.) The OFFSET field specifies an offset from
the start of the message (i.e., the first octet of the ID field in the
domain header). A zero offset specifies the first byte of the ID field,
etc.
Why does the "63 octets or less" restriction on labels guarantee that pointers, which begin with two one (1) bits, can be distinguished from labels?
The explanation is in the same RFC 1035
Each label is represented as a one octet length field followed by that
number of octets.
So the leftmost (first) octet contains the label length, and its maximum value is 63, or 00111111 in binary. Since the two highest bits of a length octet are therefore always zero, an octet with ones in its two highest bits can only be a pointer, which is what distinguishes the two cases.
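To make that concrete, here is a minimal sketch in Python (the function name and return convention are mine, not from the RFC) that classifies the first octet of a name field by its top two bits:

    def read_name_field(message: bytes, offset: int):
        """Classify the octet at `offset` in a raw DNS message.

        Returns ("pointer", target_offset) if the top two bits are 11,
        or ("label", length) if they are 00. The 01 and 10 prefixes are
        reserved for future use per RFC 1035.
        """
        first = message[offset]
        top_two = first >> 6                    # the two most significant bits
        if top_two == 0b11:
            # 14-bit offset measured from the start of the message (the ID field)
            target = ((first & 0x3F) << 8) | message[offset + 1]
            return ("pointer", target)
        if top_two == 0b00:
            # ordinary label: `first` is the label length, 0..63
            return ("label", first)
        raise ValueError("reserved 01/10 prefix")

For example, the common two-octet sequence 0xC0 0x0C is classified as a pointer to offset 12 (the first name right after the 12-byte header), while 0x03 at the start of a label like "www" is classified as a label of length 3.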
The fb_var_screeninfo struct has several fields that I can use to determine the pixel format, most notably bits_per_pixel and the length / offset fields for the red / green / blue / alpha ("transp") channels.
Now I noticed some apparent redundancy. If I analyze the length fields, I already know the bits per pixel, so checking bits_per_pixel should be unnecessary. But there's theory and there's practice. I can imagine, for example, that in some fringe cases the length values would not be populated properly, while bits_per_pixel would always be valid.
So. My question is: Can I rely on the length and offset fields to always be valid, and just ignore the bits_per_pixel field?
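For context, here is the kind of cross-check I have in mind, as a rough Python sketch (the class below is just a hypothetical stand-in for the values read out of fb_var_screeninfo, not the real struct), keeping in mind that the channel lengths may legitimately sum to less than bits_per_pixel when there are padding bits (e.g. a 32 bpp XRGB layout with only 24 colour bits):

    from dataclasses import dataclass

    @dataclass
    class VarScreenInfo:
        # hypothetical stand-in for the relevant fb_var_screeninfo fields
        bits_per_pixel: int
        red_length: int
        green_length: int
        blue_length: int
        transp_length: int

    def channel_bits(info: VarScreenInfo) -> int:
        """Total number of bits described by the channel length fields."""
        return info.red_length + info.green_length + info.blue_length + info.transp_length

    def looks_consistent(info: VarScreenInfo) -> bool:
        """Channel bits should never exceed bits_per_pixel; they may be fewer (padding)."""
        return 0 < channel_bits(info) <= info.bits_per_pixel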
I'm in the process of writing a JPEG file decoder to learn about the workings of JPEG files. In working through ITU-T T.81, which specifies JPEG, I ran into the following regarding the DQT segment for quantization tables:
In many of JPEG's segments, there is an n parameter which you read from the segment, which then indicates how many iterations of the following item there are. However, in the DQT case it just says "multiple", and it's not defined how many multiples there are. One can possibly infer it from Lq, but the way this multiple is defined is a bit of an anomaly compared to the other segments.
For anyone who is familiar with this specification, what is the right way to determine how many multiples, or n, of (Pq, Tq, Q0..Q63) there should be?
Use the length field (Lq). Lq counts its own two bytes plus the tables that follow, and each table occupies one byte for Pq/Tq plus 64 bytes of entries (Pq = 0, 8-bit precision) or 128 bytes (Pq = 1, 16-bit precision). Keep reading tables until Lq bytes have been consumed; the number of tables you read is n.
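As an illustration, a sketch of that consume-until-Lq approach in Python (the function name is mine; the layout follows the DQT syntax in T.81, section B.2.4.1):

    def parse_dqt(segment: bytes):
        """Parse a DQT segment, where `segment` starts at the two-byte Lq field
        (the 0xFFDB marker has already been consumed).
        Returns a list of (Pq, Tq, table) tuples, one per quantization table.
        """
        lq = (segment[0] << 8) | segment[1]     # Lq counts its own two bytes
        pos = 2
        tables = []
        while pos < lq:
            pq = segment[pos] >> 4              # 0 = 8-bit entries, 1 = 16-bit entries
            tq = segment[pos] & 0x0F            # destination table id, 0..3
            pos += 1
            if pq == 0:
                table = list(segment[pos:pos + 64])
                pos += 64
            else:
                table = [(segment[pos + 2 * i] << 8) | segment[pos + 2 * i + 1]
                         for i in range(64)]
                pos += 128
            tables.append((pq, tq, table))
        return tables                           # n is simply len(tables)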
I'm new to P2P networking and I'm currently trying to understand some basic things specified by the Kademlia papers. The main thing that I cannot understand is the Kademlia distance metric. All papers define the distance as the XOR of two IDs. The ID size is 160 bits, so the result also has 160 bits.
The question: what is a convenient way to represent this distance as an integer?
Some implementations that I checked use the following:
distance = 160 - prefix length (where prefix length is the number of leading zeros).
Is this a correct approach?
Some implementations that I checked use the following: distance = 160 - prefix length (where prefix length is the number of leading zeros). Is this a correct approach?
That approach is based on an early revision of the Kademlia paper and is insufficient to implement some of the later chapters of the final paper.
A full-fledged implementation should use a tree-like routing table that orders buckets by their absolute position in the keyspace and can be resized when bucket splitting happens.
The ID size is 160 bits, so the result also has 160 bits. The question: what is a convenient way to represent this distance as an integer?
The distance metrics are 160-bit integers. You can use a big-integer class or roll your own based on arrays. To get shared-prefix bit counts you just have to count the leading zeroes; those counts scale logarithmically with the network size and should normally fit in much smaller integers once you're done.
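As a sketch of what that looks like in Python (whose built-in integers are arbitrary precision, so no separate big-integer class is needed; the helper names are mine):

    def xor_distance(a: bytes, b: bytes) -> int:
        """XOR distance between two 160-bit (20-byte) node IDs, as one big integer."""
        return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

    def shared_prefix_length(a: bytes, b: bytes, id_bits: int = 160) -> int:
        """Number of leading bits the two IDs share,
        i.e. the number of leading zero bits of their XOR distance."""
        d = xor_distance(a, b)
        return id_bits if d == 0 else id_bits - d.bit_length()

The prefix length is then a small number (at most 160) even though the distance itself may be a huge integer.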
I watched a video on YouTube about bits. After watching it I am confused about how much space a number, string, or any character actually takes. What I have understood from the video is:
1= 00000001 // 1 bit
3= 00000011 // 2 bits
511 = 111111111 // 9bits
4294967295= 11111111111111111111111111111111 //32 bit
1.5 = ? // ?
I just want to know whether the statements given above (except the one with the decimal point) are correct, or whether every number, string, or character takes 8 bytes. I am using a 64-bit operating system.
And what is the binary representation of a decimal value such as 1.5?
If I understand correctly, you're asking how many bits/bytes are used to represent a given number or character. I'll try to cover the common cases:
Integer (whole number) values
Since most systems use 8 bits per byte, integer numbers are usually represented as a multiple of 8 bits:
8 bits (1 byte) is typical for the C char datatype.
16 bits (2 bytes) is typical for int or short values.
32 bits (4 bytes) is typical for int or long values.
Each successive bit is used to represent a value twice the size of the previous one, so the first bit represents one, the second bit represents two, the third represents four, and so on. If a bit is set to 1, the value it represents is added to the "total" value of the number as a whole, so the 4-bit value 1101 (8, 4, 2, 1) is 8+4+1 = 13.
Note that the zeroes are still counted as bits, even for numbers such as 3, because they're still necessary to represent the number. For example:
00000011 represents a decimal value of 3 as an 8-bit binary number.
00000111 represents a decimal value of 7 as an 8-bit binary number.
The zero in the first number is used to distinguish it from the second, even if it's not "set" as 1.
An "unsigned" 8-bit variable can represent 2^8 (256) values, in the range 0 to 255 inclusive. "Signed" values (i.e. numbers which can be negative) are often described as using a single bit to indicate whether the value is positive (0) or negative (1), which would give an effective range of 2^7 (-127 to +127) either way, but since there's not much point in having two different ways to represent zero (-0 and +0), two's complement is commonly used to allow a slightly greater storage range: -128 to +127.
Decimal (fractional) values
Numbers such as 1.5 are usually represented as IEEE floating point values. A 32-bit IEEE floating point value uses 32 bits like a typical int value, but will use those bits differently. I'd suggest reading the Wikipedia article if you're interested in the technical details of how it works - I hope that you like mathematics.
Alternatively, non-integer numbers may be represented using a fixed point format; this was a fairly common occurrence in the early days of DOS gaming, before FPUs became a standard feature of desktop machines, and fixed point arithmetic is still used today in some situations, such as embedded systems.
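To see this concretely for 1.5, here is a small Python sketch using the struct module to reveal the 32-bit IEEE 754 bit pattern, plus a 16.16 fixed-point conversion as an example of the alternative:

    import struct

    # 1.5 as a 32-bit IEEE 754 float: sign 0, exponent 01111111, mantissa 100...0
    bits = struct.unpack(">I", struct.pack(">f", 1.5))[0]
    print(format(bits, "032b"))        # 00111111110000000000000000000000 (0x3FC00000)

    # 1.5 in 16.16 fixed point: the integer 1.5 * 2**16
    fixed = int(1.5 * 2**16)
    print(hex(fixed))                  # 0x18000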
Text
Simple ASCII or Latin-1 text is usually represented as a series of 8-bit bytes - in other words it's a series of integers, with each numeric value representing a single character code. For example, an 8-bit value of 00100000 (32) represents the ASCII space (' ') character.
Alternative 8-bit encodings (such as JIS X 0201) map those 2^8 number values to different visible characters, whilst yet other encodings may use 16-bit or 32-bit values for each character instead.
Unicode encodings (such as the 8-bit UTF-8 or 16-bit UTF-16) are more complicated; a single UTF-16 character might be represented as a single 16-bit value or a pair of 16-bit values, whilst UTF-8 characters can be anywhere from one 8-bit byte to four 8-bit bytes!
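A small Python sketch showing how the storage size varies per character under the encodings mentioned above:

    for ch in ("A", " ", "é", "€", "😀"):
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")
        print(repr(ch), "code point", ord(ch),
              "- UTF-8 bytes:", len(utf8),
              "- UTF-16 code units:", len(utf16) // 2)
    # 'A' and ' ' take one UTF-8 byte; 'é' takes 2, '€' takes 3, '😀' takes 4
    # and also needs two 16-bit UTF-16 code units (a surrogate pair).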
Endian-ness
You should also be aware that values spanning more than a single 8-bit byte are typically byte-ordered in one of two ways: little endian, or big endian.
Little Endian: A 16-bit value of 511 would be stored as the bytes 11111111 00000001 (i.e. the least significant byte comes first).
Big Endian: A 16-bit value of 511 would be stored as the bytes 00000001 11111111 (i.e. the most significant byte comes first).
You may also hear of mixed-endian, middle-endian, or bi-endian representations - see the Wikipedia article for further information.
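For example, Python's struct module can show both byte orders for the 16-bit value 511:

    import struct

    value = 511                              # 0x01FF
    print(struct.pack("<H", value).hex())    # 'ff01' - little endian, low byte first
    print(struct.pack(">H", value).hex())    # '01ff' - big endian, high byte first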
There is a difference between a bit and a byte.
1 bit is a single digit 0 or 1 in base 2.
1 byte = 8 bits.
Yes, the statements you gave are correct.
The binary code of 1.5 would be 001.100 (i.e. 1.1 in binary). However, this is just how we interpret binary on paper; the way a computer actually stores numbers is different and depends on the compiler and platform. For example, C uses the IEEE 754 format. Google it to learn more.
Your OS being 64-bit means your CPU architecture is 64-bit.
A bit is one binary digit, e.g. 0 or 1.
A byte is eight bits, or two hex digits.
A nibble is half a byte, or 1 hex digit.
Words get a bit more complex. Originally a word was the number of bytes required to cover the range of addresses available in memory, but because there have been a lot of hybrid architectures and memory schemes, a word is now usually taken to be two bytes, a double word 4 bytes, and so on.
In computing everything comes down to binary, a combination of 0s and 1s. Characters, decimal numbers etc. are representations.
So the character 0 in 7-bit (or 8-bit) ASCII is 00110000 in binary, 30 in hex, and 48 as a decimal (base 10) number. It's only '0' if you choose to 'see' it as a single-byte character.
Representing numbers with decimal points is even more varied. There are many accepted ways of doing that, but they are conventions, not rules.
Have a look at 1's and 2's complement, Gray code, BCD, floating point representation and such to get more of an idea.
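A quick Python check of that '0' example:

    code = ord("0")                               # the ASCII code of the character '0'
    print(code, hex(code), format(code, "08b"))   # 48 0x30 00110000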
How do we shrink/encode a 20-letter string into 6 letters? I found a few algorithms that address data compression, like RLE, arithmetic coding and universal codes, but none of them guarantees 6 letters.
The original string can contain the characters A-Z (upper case), 0-9 and a dash.
If your goal is to losslessly compress (or reversibly hash) a random input string of 20 characters (each character can be [A-Z], [0-9] or -) into an output string of 6 characters, it's theoretically impossible.
In information theory, given a discrete random variable X = {x_1, ..., x_n}, the Shannon entropy H(X) is defined as:

    H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

where p(x_i) is the probability of X = x_i. In your case, X is a string of 20 characters, each drawn from 37 possible characters, so it has n = 37^20 possible values. Supposing the 37 characters are equally likely (i.e. the input string is random), then p(x_i) = 1/37^20. So the Shannon entropy of the input is:

    H(X) = -\sum_{i=1}^{37^{20}} \frac{1}{37^{20}} \log_2 \frac{1}{37^{20}} = 20 \log_2 37 \approx 104.2 bits
A char on a common computer holds 8 bits, so 6 chars can hold 48 bits. There's no way to hold about 104 bits of information in 6 chars. You need at least 14 chars to hold it instead.
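A quick back-of-the-envelope check of that figure in Python:

    import math

    alphabet = 26 + 10 + 1                 # A-Z, 0-9 and '-': 37 symbols
    length = 20
    entropy_bits = length * math.log2(alphabet)
    print(entropy_bits)                    # ~104.2 bits of entropy
    print(entropy_bits / 8)                # ~13.0 bytes, far more than 6 chars (48 bits)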
If you do allow loss and just have to hash the 20 chars into 6 chars, then you are trying to hash 37^20 values to 128^6 keys. It can be done, but you will get plenty of hash collisions.
In your case, supposing you hash them with the most uniformity (otherwise it would be worse), each input value would share its hash key with an average of about 5.26 × 10^18 other input values. By a birthday attack, we could expect to find a collision within a few million random trials (on the order of the square root of 128^6), which could be done in less than 10 seconds on a common laptop. So I don't think this would be a safe hash.
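A rough Python sketch of those two estimates (the 1.1774 factor is the usual 50%-probability birthday-bound constant, sqrt(2 ln 2)):

    import math

    inputs = 37 ** 20                      # possible 20-character input strings
    keys = 128 ** 6                        # possible 6-character outputs (7-bit chars)

    print(inputs / keys)                   # ~5.26e18 inputs per hash key on average
    print(1.1774 * math.sqrt(keys))        # ~2.5e6 trials for a 50% chance of a collision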
However, if you insist on doing that, you might want to read Hash function algorithms; it lists a lot of algorithms for your choice. Good luck!