How to use leading_zeros/trailing_zeros in a platform-independent way? - rust

I want to find the first non-zero bit in the binary representation of a u32. leading_zeros/trailing_zeros look like what I want:
let x: u32 = 0b01000;
println!("{}", x.trailing_zeros());
This prints 3, as expected and as described in the docs. What will happen on big-endian machines: will it still be 3, or some other number?
The documentation says
Returns the number of trailing zeros in the binary representation
Is it related to the machine's binary representation (so the result of trailing_zeros depends on the architecture) or to the base-2 numeral system (so the result will always be 3)?

The type u32 represents 32-bit binary numbers as an abstract concept. You can imagine them as abstract, mathematical numbers in the range from 0 to 2^32 - 1. The binary representation of these numbers is written in the usual convention of starting with the most significant bit (MSB) and ending with the least significant bit (LSB), and the trailing_zeros() method returns the number of trailing zeros in that representation.
Endianness only comes into play when serializing such an integer to bytes, e.g. for writing it to a bytes buffer, a file or the network. You are not doing any of this in your code, so it doesn't matter here.
As mentioned above, writing a number starting with the MSB is also just a convention, but this convention is pretty much universal today for numbers written in positional notation. For programming, this convention is only relevant when formatting a number for display, parsing a number from a string, and maybe for naming methods like trailing_zeros(). When storing a u32 in a register, the bits don't have any defined order.
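A minimal sketch (plain standard-library Rust, nothing beyond the integer methods shown) that makes the distinction concrete: trailing_zeros()/leading_zeros() are computed on the abstract value and give the same answer on every architecture, while byte order only appears once you explicitly serialize the value:

fn main() {
    let x: u32 = 0b01000;

    // Computed on the abstract 32-bit value, so these print the same
    // numbers on little- and big-endian targets alike.
    println!("{}", x.trailing_zeros()); // 3
    println!("{}", x.leading_zeros());  // 28

    // Endianness only shows up when the value is serialized to bytes:
    println!("{:?}", x.to_le_bytes()); // [8, 0, 0, 0]
    println!("{:?}", x.to_be_bytes()); // [0, 0, 0, 8]
}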

Related

Encoding binary strings into arbitrary alphabets

If you have a set of binary strings limited to some normally small size, such as 256 or up to 512 bits like some of the hashing algorithms produce, and you want to encode those 1's and 0's into, say, hex (a 16-character alphabet), then you take the whole string into memory at once and convert it into hex. At least that's what I think it means.
I don't have this question fully formulated, but what I'm wondering is whether you can convert an arbitrarily long binary string into some alphabet without needing to read the whole string into memory. The reason this isn't a fully formed question is that I'm not exactly sure whether you typically do read the whole string into memory to create the encoded version.
So if you have something like this:
1011101010011011011101010011011011101010011011110011011110110110111101001100101010010100100000010111101110101001101101110101001101101110101001101111001101111011011011110100110010101001010010000001011110111010100110110111010100110110111010100110111100111011101010011011011101010011011011101010100101010010100100000010111101110101001101101110101001101101111010011011110011011110110110111101001100101010010100100000010111101110101001101101101101101101101111010100110110111010100110110111010100110111100110111101101101111010011001010100101001000000101111011101010011011011101010011011011101010011011110011011110110110111101001100 ... 10^50 longer
Something like the whole genetic code, or a million billion times that: it would be too large to read into memory, and too slow to dynamically create an encoding of it into hex if you have to stream the whole thing through memory before you can figure out the final encoding.
So I'm wondering three things:
If you do have to read something fully in order to encode it into some other alphabet.
If you do, then why that is the case.
If you don't, then how it works.
The reason I'm asking is that, looking at a string like 1010101, if I were to encode it as hex there are a few ways:
One character at a time, so it would essentially stay 1010101, unless the alphabet was {a, b}, in which case it would be abababa. This is the best case because you never have to read more than 1 character into memory to figure out the encoding. But it limits you to a 2-character alphabet. (With anything more than a 2-character alphabet I start getting confused.)
By turning it into an integer, then converting that into a hex value. But this would require reading the whole value to compute the final (big)integer size. So that's where I get confused.
I feel like the third way (3) would be to read partial chunks of the input bits somehow, like 1010 then 010, but that would not work if the encoding was integers, because 1010 010 = A 2 in hex, but 2 = 10, not 2 = 010. So it's like you would need to break it up by having a 1 at the beginning of each chunk. But then what if you wanted each chunk to be no longer than 10 hex characters, yet you have a long run of 1000 0's? Then you need some other trick, perhaps having the encoded hex value tell you how many preceding zeroes there are, etc. So it seems like it gets complicated, and I'm wondering if there are already established systems that have figured out how to do this. Hence the above questions.
For example, say I wanted to encode the above binary string into an 8-bit alphabet, like ASCII. Then I might have aBc?D4*&((!.... But then deserializing this into the bits is one part, and serializing the bits into it is another (these characters aren't the actual characters mapped to the bit example above).
But then what if you wanted each chunk to be no longer than 10 hex characters, yet you have a long run of 1000 0's? Then you need some other trick, perhaps having the encoded hex value tell you how many preceding zeroes there are, etc. So it seems like it gets complicated, and I'm wondering if there are already established systems that have figured out how to do this
Yes, you're way over-complicating it. To start simple, consider bit strings whose length is by definition a multiple of 4. They can be represented in hexadecimal by just grouping the bits into fours and remapping each group to a hexadecimal digit:
raw: 11011110101011011011111011101111
group: 1101 1110 1010 1101 1011 1110 1110 1111
remap: D E A D B E E F
So 11011110101011011011111011101111 -> DEADBEEF. That all the nibbles had their top bit set is a coincidence resulting from the way the example was chosen. By definition the input is divided up into groups of four, and every hexadecimal digit is later decoded back to a group of four bits, including leading zeroes if applicable. This is all you need for typical hash codes, whose length is a multiple of 4 bits.
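As a rough illustration of that grouping, here is a small Rust sketch (the helper name and the character-stream input are just assumptions for the example) that maps each group of four bits to one hex digit without ever holding the whole bit string in memory:

/// Encode a stream of '0'/'1' characters whose length is a multiple of 4
/// into hex, one nibble at a time.
fn bits_to_hex(bits: impl Iterator<Item = char>) -> String {
    let mut out = String::new();
    let mut nibble = 0u8;
    let mut count = 0;
    for b in bits {
        nibble = (nibble << 1) | (b == '1') as u8; // shift the next bit in
        count += 1;
        if count == 4 {
            // A full group of four bits maps to exactly one hex digit.
            out.push(char::from_digit(nibble as u32, 16).unwrap().to_ascii_uppercase());
            nibble = 0;
            count = 0;
        }
    }
    assert_eq!(count, 0, "length must be a multiple of 4 for this simple case");
    out
}

fn main() {
    assert_eq!(
        bits_to_hex("11011110101011011011111011101111".chars()),
        "DEADBEEF"
    );
}

Only one nibble of state is needed between characters; each hex digit could be written straight to output instead of accumulated in a String.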
The problems start when we want to encode bit strings that are of variable length and not necessarily a multiple of 4 bits long. Then there has to be some padding somewhere, and the decoder needs to know how much padding there was (and where, but the location is a convention that you choose). This is why your example seemed so ambiguous: it is. Extra information needs to be added to tell the decoder how many bits to discard.
For example, leaving aside the mechanism that transmits the number of padding bits, we could encode 1010101 as 55 or A5 or AA (and more!) depending on where we choose to put the padding; whichever convention we choose, the decoder needs to know that there is 1 bit of padding. To put that back in terms of bits, 1010101 could be encoded as any of these:
x101 0101
101x 0101
1010 x101
1010 101x
Where x marks the bit that is inserted by the encoder and discarded by the decoder. The value of that bit doesn't actually matter because it is discarded, so D5 (the first variant with x = 1 instead of 0) is also a fine encoding, and so on.
All of the choices of where to put the padding still allow the bit string to be encoded incrementally, without storing the whole bit string in memory, though putting the padding in the first hexadecimal digit requires knowing the length of the bit string (at least modulo 4) up front.
If you are asking this in the context of Huffman coding, you wouldn't want to calculate the length of the bit string in advance, so the padding has to go at the end. Often an extra symbol is added to the alphabet that signals the end of the stream, which usually makes it unnecessary to explicitly store how many padding bits there are (there might be any number of them, but as they appear after the STOP symbol, the decoder automatically disregards them).
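Sticking with the pad-at-the-end convention, here is a hedged Rust sketch of how the encoder can report the padding count alongside the digits (again assuming a stream of '0'/'1' characters as input; in a real format the count would be carried in a header field or implied by a STOP symbol):

/// Same grouping as above, but for arbitrary lengths: pad the last group
/// with zero bits at the end and report how many padding bits were added,
/// so the decoder knows how many bits to discard.
fn bits_to_hex_padded(bits: impl Iterator<Item = char>) -> (String, u32) {
    let (mut out, mut nibble, mut count) = (String::new(), 0u8, 0u32);
    for b in bits {
        nibble = (nibble << 1) | (b == '1') as u8;
        count += 1;
        if count == 4 {
            out.push(char::from_digit(nibble as u32, 16).unwrap().to_ascii_uppercase());
            nibble = 0;
            count = 0;
        }
    }
    let pad = if count == 0 { 0 } else { 4 - count };
    if count != 0 {
        // The discarded bits are zeros here, but their value doesn't matter.
        nibble <<= pad;
        out.push(char::from_digit(nibble as u32, 16).unwrap().to_ascii_uppercase());
    }
    (out, pad)
}

fn main() {
    // 1010101 (7 bits) -> "AA" plus 1 padding bit at the end.
    assert_eq!(bits_to_hex_padded("1010101".chars()), ("AA".to_string(), 1));
}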

Is it possible to represent a 64-bit number as a string when the hardware doesn't support 64-bit numbers?

I want to show a 64-bit number as a string. The problem is that my hardware doesn't support 64-bit numbers, just 32-bit.
So I have the 64-bit number split into two 32-bit numbers (high and low parts).
Example: 64-bit number : 12345678987654321 (002B DC54 6291 F4B1h)
32-bit low part: 1653732529 (6291 F4B1h)
32-bit high part: 2874452 (002B DC54h)
I think the solution to my problem would be to show this number as a string.
Is it possible?
Thanks.
Yes, you can use an array of 32-bit uints, or even a lower bit-width ...
For printing you can use this:
hex to dec
So first print a hex string, which is easy at any bit-width (you just stack the prints of the lower bit-width words together, from MSW to LSW), and then convert the hex text to dec text ...
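As a rough sketch of that approach in Rust (rather than the asker's actual MCU toolchain; the two u32 halves are the ones from the question): the hex string is just the two words printed MSW-first, and the decimal string comes from repeated division by 10 over 16-bit limbs, so no intermediate value ever needs more than 32 bits:

// Hex: just print the two words MSW-first.
fn to_hex(high: u32, low: u32) -> String {
    format!("{:08X}{:08X}", high, low)
}

// Decimal: repeated division by 10 over 16-bit limbs, so every intermediate
// value fits comfortably in 32 bits.
fn to_decimal(high: u32, low: u32) -> String {
    // Most significant limb first: four 16-bit pieces of the 64-bit value.
    let mut limbs = [high >> 16, high & 0xFFFF, low >> 16, low & 0xFFFF];
    let mut digits = Vec::new();
    while limbs.iter().any(|&l| l != 0) {
        let mut rem = 0u32;
        for limb in limbs.iter_mut() {
            let cur = (rem << 16) | *limb; // rem < 10 and limb < 2^16, so this fits in u32
            *limb = cur / 10;
            rem = cur % 10;
        }
        digits.push(char::from_digit(rem, 10).unwrap());
    }
    if digits.is_empty() {
        digits.push('0');
    }
    digits.iter().rev().collect()
}

fn main() {
    // The value from the question, split into its high and low 32-bit parts.
    let (high, low) = (0x002B_DC54u32, 0x6291_F4B1u32);
    assert_eq!(to_hex(high, low), "002BDC546291F4B1");
    assert_eq!(to_decimal(high, low), "12345678987654321");
}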
With this chained array of uints you can do the math operations like this:
Cant make value propagate through carry
Doing operations on an array of uints is much, much faster than on strings ...
But if you insist, yes, you can use a string representation too ...
There are also hybrid representations like BCD that are suitable for this, but your MCU would need to have support for it ...
Depending on your language of choice, the language may allow you to use greater-than-32-bit integers even on 32-bit architectures (like Python).
If that is the case, the problem becomes trivial: compute the value, then compute the corresponding hex string.
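Rust is in the same boat: u64 (and u128) arithmetic is available even when compiling for a 32-bit target, because the compiler lowers it to 32-bit instructions; a minimal example with the numbers from the question:

fn main() {
    // u64 works even on a 32-bit target; the compiler synthesizes the arithmetic.
    let (high, low) = (0x002B_DC54u32, 0x6291_F4B1u32);
    let value = ((high as u64) << 32) | low as u64;

    println!("{}", value);   // 12345678987654321
    println!("{:X}", value); // 2BDC546291F4B1
}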

How can I combine nom parsers to get a more bit-oriented interface to the data?

I'm working on decoding AIS messages in Rust using nom.
AIS messages are made up of a bit vector; the various fields in each message are an arbitrary number of bits long, and they don't always align on byte boundaries.
This bit vector is then ASCII encoded, and embedded in an NMEA sentence.
From http://catb.org/gpsd/AIVDM.html:
The data payload is an ASCII-encoded bit vector. Each character represents six bits of data. To recover the six bits, subtract 48 from the ASCII character value; if the result is greater than 40 subtract 8. According to [IEC-PAS], the valid ASCII characters for this encoding begin with "0" (64) and end with "w" (87); however, the intermediate range "X" (88) to "_" (95) is not used.
Example
!AIVDM,1,1,,A,D03Ovk1T1N>5N8ffqMhNfp0,0*68 is the NMEA sentence
D03Ovk1T1N>5N8ffqMhNfp0 is the encoded AIS data
010100000000000011011111111110110011000001100100000001011110001110000101011110001000101110101110111001011101110000011110101110111000000000 is the decoded AIS data as a bit vector
Problems
I list these together because I think they may be related...
1. Decoding ASCII to bit vector
I can do this manually, by iterating over the characters, subtracting the appropriate values, and building up a byte array with a lot of bit shifting, and so on. That's fine, but it seems like I should be able to do this inside nom, and chain it with the actual AIS bit parser, eliminating the interim byte array.
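For reference, the manual decode is only a few lines outside of nom. This is a hedged sketch (the helper names are made up and input validation is omitted) of the subtract-48/subtract-8 rule quoted above:

/// Decode one AIS payload character to its 6-bit value, per the rule quoted
/// above: subtract 48, then subtract another 8 if the result is greater than 40.
fn ais_char_to_6bits(c: u8) -> u8 {
    let mut v = c - 48;
    if v > 40 {
        v -= 8;
    }
    v & 0x3F
}

/// Expand the whole payload into one bool per bit, most significant bit
/// of each 6-bit group first.
fn ais_payload_to_bits(payload: &str) -> Vec<bool> {
    payload
        .bytes()
        .flat_map(|c| {
            let v = ais_char_to_6bits(c);
            (0..6).rev().map(move |shift| (v >> shift) & 1 == 1)
        })
        .collect()
}

fn main() {
    let bits = ais_payload_to_bits("D03Ovk1T1N>5N8ffqMhNfp0");
    let s: String = bits.iter().map(|&b| if b { '1' } else { '0' }).collect();
    println!("{}", s); // matches the decoded bit vector shown above
}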
2. Reading arbitrary number of bits
It's possible to read, say, 3 bits from a byte array in nom. But, each call to bits! seems to consume a full byte at once (if reading into a u8).
For example:
named!(take_3_bits<u8>, bits!(take_bits!(u8, 3)));
will read 3 bits into a u8. But if I run take_3_bits twice, I'll have consumed 16 bits of my stream.
I can combine reads:
named!(get_field_1_and_2<(u8, u8)>, bits!(pair!(take_bits!(u8, 2), take_bits!(u8, 3))));
Calling get_field_1_and_2 will get me a (u8, u8) tuple, where the first item contains the first 2 bits, and the second item contains the next 3 bits, but nom will then still advance a full byte after that read.
I can use peek to keep nom's read pointer from advancing, and then manage it manually, but again, that seems like unnecessary extra work.

Proper encoding for fixed-length storage of Unicode strings?

I'm going to be working on software (in C#) that needs to read/write Unicode strings (specifically English, German, Spanish and Arabic) to a hardware device. The firmware developer tells me that his code expects to store each string as a fixed-length byte array in one binary file so he can quickly access any string using an index (index * length = starting offset, then read the fixed-length number of bytes). I understand that .NET internally uses a UTF-16 encoding, which I believe is technically a variable-length encoding (depending on the Unicode code point). I'm fairly certain that English, German and Spanish would all use two bytes per character when encoded using UTF-16, but I'm not so sure about Arabic. It looks like there might be some Arabic characters that could require more than two bytes each in UTF-16, and that would seem to break the firmware developer's plan to store the strings as a fixed length.
First, can anyone confirm my understanding of the variable-length nature of UTF-8/UTF-16 encodings? And second, although it would waste a lot of space, is UTF-32 (fixed-size, each character represented using 4 bytes) the best option for ensuring that each string could be stored as a fixed length? Thanks!
Unicode terminology:
Each entry in the Unicode character set is a code point
Encoded code points consist of one or more code units in a transformation format (UTF-8 uses 8-bit code units; UTF-16 uses 16-bit code units)
The user-visible grapheme might consist of a sequence of code points
So:
A code point in UTF-8 is 1, 2, 3 or 4 octets wide
A code point in UTF-16 is 2 or 4 octets wide
A code point in UTF-32 is 4 octets wide
The number of graphemes rendered on the screen might be less than the number of code points
So, if you want to support the entire Unicode range, you need to make the fixed-length strings a multiple of 32 bits regardless of which of these UTFs you choose as the encoding. (I'm assuming unused bytes will be set to 0x0 and that these will be appended/trimmed during I/O.)
In terms of communicating length restrictions via a user interface you'll probably want to decide on some compromise based on a code unit size and the typical customer rather than try to find the width of the most complicated grapheme you can build.
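To make the fixed-width idea concrete, here is a hedged sketch (in Rust rather than the asker's C#; the record width of 16 code points is an arbitrary assumption) that stores each string as a zero-padded block of UTF-32 code units, so record i starts at byte offset i * width * 4:

/// Pack a string into a fixed-width record of `width` UTF-32 code points,
/// zero-padded, so that record i starts at byte offset i * width * 4.
/// Note: `width` counts code points, not user-visible graphemes.
fn to_fixed_utf32(s: &str, width: usize) -> Option<Vec<u8>> {
    let code_points: Vec<u32> = s.chars().map(|c| c as u32).collect();
    if code_points.len() > width {
        return None; // doesn't fit in the fixed-length slot
    }
    let mut out = Vec::with_capacity(width * 4);
    for &cp in &code_points {
        out.extend_from_slice(&cp.to_le_bytes());
    }
    out.resize(width * 4, 0); // zero padding, trimmed again when reading back
    Some(out)
}

fn from_fixed_utf32(record: &[u8]) -> String {
    record
        .chunks_exact(4)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .take_while(|&cp| cp != 0)
        .filter_map(char::from_u32)
        .collect()
}

fn main() {
    let rec = to_fixed_utf32("مرحبا", 16).unwrap(); // 5 Arabic code points
    assert_eq!(rec.len(), 64);                      // 16 slots * 4 bytes each
    assert_eq!(from_fixed_utf32(&rec), "مرحبا");
}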

Does a string's length equal the byte size?

Exactly that: does a string's length equal the byte size? Does it depend on the language?
I think it does, but I just want to make sure.
Additional info: I'm just wondering in general. My specific situation was PHP with MySQL.
As the answer is no, that's all I need to know.
Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi shortstring) has an extra byte for the length. And Unicode strings have more than one byte per character.
With Unicode it depends on the encoding. It could be 2 or 4 bytes per character, or even a mix of 1, 2 and 4 bytes.
It entirely depends on the platform and representation.
For example, in .NET a string takes two bytes in memory per UTF-16 code unit. However, surrogate pairs require two UTF-16 code units for a full Unicode character in the range U+10000 to U+10FFFF. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.
As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.
Not always, it depends on the encoding.
There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)
Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)
Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.
Unicode strings (in various languages) often use two bytes per char.
Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.
In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? Getting the nth char will be inefficient for large n, taking linear time instead of the constant time you get with one byte per character.
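To see the difference concretely, a small Rust example (Rust just happens to expose all three counts directly; other languages report different ones by default):

fn main() {
    let s = "héllo"; // 'é' is one code point but two bytes in UTF-8

    println!("{}", s.len());                  // 6 - bytes in the UTF-8 representation
    println!("{}", s.chars().count());        // 5 - Unicode code points
    println!("{}", s.encode_utf16().count()); // 5 - UTF-16 code units

    // There is no O(1) "nth character" indexing; you have to walk the string.
    println!("{:?}", s.chars().nth(1));       // Some('é')
}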
