bitshift large strings for encoding QR Codes - string

As an example, suppose a QR Code data stream contains 55 data words (each one byte in length) and 15 error correction words (again one byte). The data stream begins with a 12 bit header and ends with four 0 bits. So, 12 + 4 bits of header/footer and 15 bytes of error correction, leaves me 53 bytes to hold 53 alphanumeric characters. The 53 bytes of data and 15 bytes of ec are supplied in a string of length 68 (str68). The problem seems simple enough - concatenate 2 bytes of (right-shifted) header data with str68 and then left shift the entire 70 bytes by 4 bits.
This is the first time in many years of programming that I have ever needed to do something like this, I am a c and bit shifting noob, so please be gentle... I have done a little investigation and so far have not been able to figure out how to bitshift 70 bytes of data; any help would be greatly appreciated.
Larger QR codes can hold 2000 bytes of data...

You need to look at this 4 bits at a time.
The first 4 bits you need to worry about are the lower bits of the first byte. Fortunately this is an easy case because they need to end up in the upper bits of the first byte.
The next 4 bits you need to worry about are the upper bits of the second byte. These need to end up as the lower bits of the first byte.
The next 4 bits you need to worry about are the lower bits of the second byte. But fortunately you already know how to do this because you already did it for the first byte.
You continue in this vein until you have dealt with the lower bytes of the 70th byte.

Related

How can I shorten a hexadecimal string further?

I am using MongoDB's build in id fields to label products and for ease of usage/typability, I would like to compress the _id field down from a hexadecimal string that looks like 5b69c35ac2cc78c8979a8a9b to something shorter and involving all letters of the alphabet (both uppercase and lowercase) and numbers. preferably it would involve no more than 10 or 12 characters. Are there any common methods of accomplishing this in Node.JS/MongoDB?
You could convert them to base64, that would make them 16 characters long.
Example:
Buffer.from('5b69c35ac2cc78c8979a8a9b', 'hex').toString('base64') // W2nDWsLMeMiXmoqb
It's better if you can directly access the Buffer - converting many ObjectIds from string could be costly.
The code 5b69c35ac2cc78c8979a8a9b is 24 bytes long (in hex), which means the absolute minimum number of bytes needed to represent this value without losing information is 12, ranging from 0-255 which is not what we want.
If we take a look at the ObjectId we could (maybe) eliminate some bytes:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
Removing machine identifier and process id (if all id's are generated by the same process) would leave us with 7 bytes (0-255), which is still not ideal to encode in base64 or even base32.
So it would probably be better to just use a 32 bit unsigned integer for the product codes and display it as hex using 8 bytes (the leading zeros could be removed).
Encoding those 4 bytes in base64 wouldn't help much (every 3 bytes become 4 bytes), and personally I would prefer case insensitive id's for use in url's which would leave us only with base32.
For better ease of usage/typability than hexadecimal, those 4 bytes could be encoded in z-base-32 and would fit in 7 bytes without padding (7 * 5 bits = 35 bits).

Problems with SHA 2 Hashing and Java

I am working on following the SHA-2 cryptographically functions as stated in https://en.wikipedia.org/wiki/SHA-2.
I am examining the lines that say:
begin with the original message of length L bits append a single '1' bit;
append K '0' bits, where K is the minimum number >= 0 such that L + 1 + K + 64 is a multiple of 512
append L as a 64-bit big-endian integer, making the total post-processed length a multiple of 512 bits.
I do not understand the last two lines. If my string is short can its length after adding K '0' bits be 512. How should I implement this in Java code?
First of all, it should be made clear that the "string" that is talked about is not a Java String but a bit string. These algorithms are binary/bit based. The implementation will generally not handle bits but bytes. So there is a translation phase where you should see bytes instead of bits.
SHA-512 is operated on in blocks of 512 bits (SHA-224/256) or 1024 bits (SHA-384/512). So basically you have a 64 or 128 byte buffer that you are filling before operating on it. You could also directly cache the data in 32 bit int fields (SHA-224/256) or 64 bit long fields, as that is the word size that is operated on.
Now the padding is relatively simple procedure. The padding is called bit-padding. As it is used in big-endian mode (SHA-2 fortunately uses this instead of the braindead little endian mode in SHA-3) the padding consists of a single bit set on the highest order bit in a byte, with the rest filled by zero's. That makes for a value of (byte) 0x80 which must be put in the buffer.
If you cannot create this padding because the buffer is full then you will have to process the previous block, and then set the first bit of the now available buffer to (byte) 0x80. In the newer Java you can also use (byte) 0b1_0000000 byte the way, which is more explicit.
Now you simply add zero's until you have 8 to 16 bytes left, again depending on the hash output size used. If there aren't enough bytes then fill till the end, process the block, and re-start filling with zero bytes until you have 8 or 16 bytes left again.
Now finally you have to encode the number of bits in those 8 or 16 bytes you've left. So multiply your input by eight, and make sure you encode those bytes in the same way as you'd expect in Java with the least significant bits as much to the right as possible. You might want to use https://docs.oracle.com/javase/8/docs/api/java/nio/ByteBuffer.html#putLong-long- for this if you don't want to program it yourself. You may probably forget about anything over 2^56 bytes anyway, so if you have SHA-384/SHA-512 then simply set the first eight bytes to zero.
And that's it, except that you still need to process that last block and then use as many bytes from the left as required for your particular output size.

How to determine the byte length of a long

Is there a fast way to determine the number of bytes used in a long? I'm looking for something like this:
len((1000**1000).to_bytes())
(The problem of course is that to_bytes wants the number of bytes as input.)
(x.bit_length() + 7) // 8 will do what you want. Number of bits, converted to bytes and rounded up.

Understanding the spec of the ogg header format

For writing my own ogg-container-class (not using libogg), I try to understand the needed header format. According to the spec, at byte 27 of the stream (starting to count at 0) starts the "segment_table (containing packet lacing values)". This is the red marked byte 13. Concerning the Opus-data that I want to include, the Opus data must start with OpusHead (4F 70 75 73) on its beginning. Why doesn't it start on position 27 where the red 13 is placed? A 13 is a "device control 3" symbol that neither occurs in the Ogg spec, nor in the Opus spec.
EDIT: I found this link that describes the spec a little. There it becomes clear (which it is not from the first link imho) that the 13 (byte 27) is the size of the following segment.
That appears to be a single byte giving the length of the following segment_table data. So there is 13(hex) bytes (16 decimal) bytes of segment_table data.
RFC 3533 is a more verbose description of the format header.
Byte 26 says how many bytes the segment table occupies, so you read that, add 27, and that tells you where the first packet starts (or continues).
The segment table tells you the length(s) of the encapsulated packet(s). Basically you read through the table, adding together the values in each successive byte. If the value you just added is < 255 then that marks a packet boundary, so record the current value of the accumulator, reset it to zero, then continue until you reach the end of the table.
In your example, the segment table size in byte 26 is 1, so the data starts at 27+1 or byte 28, which is the start of the 'OpusHead' string. The value in the 1 byte segment table is 0x13, so the packet is 19 bytes long. 28+19 is 47 (or 0x2f) which is the start of the 'OggS' capture pattern at the start of the next header.
This slightly complicated algorithm is designed to store framing data for many small packets with bounded overhead while still allowing arbitrarily large packets. Note also that packets can be continued between pages, spanning 2 or more segment tables.

Efficient binary-to-string formatting (like base64, but for UTF8/UTF16)?

I have many bunches of binary data, ranging from 16 to 4096 bytes, which need to be stored to a database and which should be easily comparable as a unit (e.g. two bunches of data batch only if the lengths match and all bytes match). Strings are nice for that, but converting binary data blindly to a string is apt to cause problems due to character encoding/reinterpretation issues.
Base64 was a common method for storing strings in an era when 7-bit ASCII was the norm; its 33% space penalty was a little annoying, but not horrible. Unfortunately, if one is using UTF-16, the space penalty is 166% (8 bytes to store 3) which seems pretty icky.
Is there any common storage method for storing binary data in a valid Unicode string which will allow better efficiency in UTF-16 (and hopefully not be too horrible in UTF-8)? A base-32768 coding would store 240 bits in sixteen characters, which would take 32 bytes of UTF-16 or 48 bytes of UTF-8. By comparison, base64 coding would use 40 characters, which would take 80 bytes of UTF-16 or 40 bytes of UTF-8. An approach which was designed to take the same space in UTF-8 or UTF-16 might store 48 bits in three characters that would take eight bytes in either UTF-8 or UTF-16, thus storing 240 bits in 40 bytes of either UTF-8 or UTF-16.
Are there any standards for anything like that?
Base32768 does exactly what you wanted. Sorry it took five years to exist.
Usage (this is JavaScript, although porting the base32768 module to another programming language is eminently practical):
var base32768 = require("base32768");
var buf = new Buffer("d41d8cd98f00b204e9800998ecf842", "hex"); // 15 bytes
var str = base32768.encode(buf);
console.log(str); // "迎裶垠⢀䳬Ɇ垙鸂", 8 code points
var buf2 = base32768.decode(str);
console.log(buf.equals(buf2)); // true
Base32768 selects 32,768 characters from the Basic Multilingual Plane. Each character takes 2 bytes when represented as UTF-16 or 3 bytes when represented as UTF-8, giving exactly the efficiency characteristics you describe: 240 bits can be stored in 16 characters i.e. 32 bytes of UTF-16 or 48 bytes of UTF-8. (Except for the occasional padding character, analogous to the = padding seen in Base64.)
This is done by dicing the input bytes (i.e. 8-bit unsigned numbers) into 15-bit unsigned numbers and assigning each resulting 15-bit number to one of the 32,768 characters.
Note that the characters chosen are also "safe" - no whitespace, control characters, combining diacritics or susceptibility to normalization corruption.

Resources