How do I properly truncate a 64 bit float to a 32 bit truncated float and back (dropping precision) with Node.js? - node.js

So obviously there's no 32 bit float, but if we're trying to store big data efficiently and many of our values are floats no greater than 100,000 with exactly 2 decimal places, it makes sense to store the 64 bit value in 32 bits by dropping the bits representing the precision that we don't need.
I tried doing this by simply writing to a 64 bit BE float buffer like so and slicing the first 4 bytes:
// float32 = Number between 0.00 and 100000.00
const setFloat32 = (float32) => {
b64.writeDoubleBE(float32, 0) // b64 = 64 bit buffer
b32 = b64.slice(0, 4)
return b32;
}
And reading it by adding on 4 empty bytes:
// b32 = the 32 bit buffer from the previous func
const readFloat32 = (b32) => {
// b32Empty = empty 32 bit buffer
return Buffer.concat([b32, b32Empty]).readDoubleBE(0);
}
But this approach mangled simple decimal values like:
1.85 => 1.8499994277954102
2.05 => 2.049999237060547
How can I fix my approach to do this correctly, and do so in the most efficient manner for read speed?

If you only want to keep two decimals of precision, you can convert your value to a shifted integer and store that:
function shiftToInteger(val, places) {
// multiply by a constant to shift the decimals you want to keep into
// integer positions, then use Math.round() or Math.floor()
// to truncate the rest of the decimals - depending upon which behavior you want
// then return the shifted integer that will fit into a U32 for storage
return Math.round(val * (10 ** places));
}
This creates a shifted integer that can then be stored in a 32-bit value (with the value limits you described), such as a Uint32Array or Int32Array. To use it when you retrieve it from storage, you would then divide it by 10 ** places (100 for two decimal places) to convert it back to a standard JavaScript float.
The key is to convert whatever decimal precision you want to keep to an integer so you can store it in a non-float type of value that is just large enough for your max anticipated value. You gain storage efficiency because you're using all the storage bits for the desired precision rather than wasting unnecessary bits on decimal precision that you don't need to keep.
Here's an example:
function shiftToInteger(val, places) {
return Math.round(val * (10 ** places));
}
function shiftToFloat(integer, places) {
return integer / (10 ** places);
}
let x = new Uint32Array(10);
x[0] = shiftToInteger(1.85, 2);
console.log(x[0]); // output shifted integer value
console.log(shiftToFloat(x[0], 2)); // convert back to decimal value
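If you want to keep the raw Buffer workflow from the question rather than a typed array, the same trick works with the fixed-width unsigned-integer helpers. A minimal sketch, assuming values stay in the 0.00 to 100,000.00 range so the shifted integer fits in 32 bits:
const PLACES = 2;

// Shift the decimals into the integer part and store as an unsigned 32-bit BE integer.
const writeScaled = (buf, value, offset = 0) => {
  buf.writeUInt32BE(Math.round(value * (10 ** PLACES)), offset);
};

// Read the unsigned 32-bit integer back and undo the shift.
const readScaled = (buf, offset = 0) => buf.readUInt32BE(offset) / (10 ** PLACES);

const b32 = Buffer.alloc(4);
writeScaled(b32, 1.85);
console.log(readScaled(b32)); // 1.85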

Related

Why are secp256k1 privateKeys not always 32 bytes in nodejs?

I was generating a lot of secp256k1 keys using node's crypto module when I ran into a problem: some generated private keys were not 32 bytes in length. I wrote a test script, and it clearly shows that this happens quite often.
What is the reason for that and is there a fix or do I have to check for length and then regenerate until I get 32 bytes?
This is the test script for reproducing the issue:
const { createECDH, ECDH } = require("crypto");
const privateLens = {};
const publicLens = {};
for(let i = 0; i < 10000; i++){
const ecdh = createECDH("secp256k1");
ecdh.generateKeys();
const privateKey = ecdh.getPrivateKey("hex");
const publicKey = ecdh.getPublicKey("hex");
privateLens[privateKey.length+""] = (privateLens[privateKey.length+""] || 0) + 1;
publicLens[publicKey.length+""] = (publicLens[publicKey.length+""] || 0) + 1;
}
console.log(privateLens);
console.log(publicLens);
The output (of multiple runs) looks like this:
% node test.js
{ '62': 32, '64': 9968 }
{ '130': 10000 }
% node test.js
{ '62': 40, '64': 9960 }
{ '130': 10000 }
% node test.js
{ '62': 39, '64': 9961 }
{ '130': 10000 }
I just don't get it... if I encode it in base64 it's always the same length, but decoding that back to a buffer shows 31 bytes for some keys again.
Thanks, any insights are highly appreciated!
For EC cryptography the key is not fully random over the bytes; it's a random number in the range [1, N), where N is the order of the curve. Generally the number generated will be in the same ballpark as the 256-bit order. This is especially true since N has been (deliberately) chosen to be very close to 2^256, i.e. the high-order bits are all set to 1 for secp256k1.
However, about once in 256 times, the first eight bits of the chosen private key s are all zero. That means it takes 31 or fewer bytes instead of 32. Once in 65,536 times it will even be 30 bytes or fewer, and so on; even smaller keys occur with correspondingly smaller probability (28 bytes or fewer only about once in 2^32 times, i.e. somewhere over 4 billion, short scale).
Base64 uses one character per 6 bits, excluding overhead. However, it generally just encodes blocks of 3 bytes to 4 characters at a time (possibly including padding with = characters). That means that 32 bytes will take ceil(32 / 3) * 4 = 44 characters. Now since ceil(31 / 3) * 4 = 44 as well, you won't notice anything. However, once in 65,536 times you'll get ceil(30 / 3) * 4 = 40. After that, going to 36 characters becomes extremely unlikely (although not negligibly small cryptographically speaking, "just" once in 2^40 times - there are lotteries that do worse, I suppose)...
So no, you don't have to regenerate the keys - for the algorithm they are perfectly valid after all. For private keys you don't generally have many compatibility requirements; however, you would usually try to encode such keys at a static size (32 bytes, possibly using 00-valued bytes at the left). Re-encoding them as statically sized keys might be a good idea...
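In Node.js terms, that re-encoding could look like the sketch below - just left-padding the hex form to 64 characters (32 bytes); this is an illustration of the padding idea, not a requirement of the crypto module:
const { createECDH } = require("crypto");

const ecdh = createECDH("secp256k1");
ecdh.generateKeys();

// Left-pad with zero bytes so the private key is always 32 bytes (64 hex characters).
const privateKeyHex = ecdh.getPrivateKey("hex").padStart(64, "0");
const privateKey = Buffer.from(privateKeyHex, "hex");
console.log(privateKey.length); // always 32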

Estimating max payload size on a compressed list of integers

I have 1 million rows in an application. It makes a request to a server such as the following:
/search?q=hello
And the search returns a sorted list of integers, representing the rows that have been matched in the input dataset (which the user has in their browser). How would I go about estimating the maximum size that the payload would return? For example, to start we have:
# ~7 MB if we stored "all results" uncompressed
6888887
# ~ 3.5MB if we stored "all results" relative to 0 or ALL matches (cuts it down by two)
3444443
And then we would want to compress these integers using some sort of compression (Elias-Fano?). What would be the "worst-case" scenario for the size of 1M sorted integers? And how would that calculation be made?
The application has one million rows of data, so let's say R1 --> R1000000, or if zero-indexing, range(int(1e6)). The server will respond with something like: [1,2,3], indicating that (only) rows 1, 2, and 3 were matched.
There are 2^(10^6) different sorted (duplicate-free) lists of integers < 10^6. Mapping each such list, say [0, 4, ...], to the corresponding bit array (say 10001...) yields 10^6 bits, i.e. 125 kB of information. As each bit array corresponds to a unique possible sorted list, and vice versa, this is the most compact (in the sense of: having the smallest max. size) representation.
Of course, if some results are more probable than others, there may be more efficient (in the sense of: having a smaller average size) representations.
For example, if most result sets are small, a simple run-length encoding may generally yield smaller encodings.
Inevitably, in that case the maximal size of the encoding (the max. payload size you were asking about) will be more than 125 kB.
Compressing the above-mentioned 125 kB bit array with e.g. zlib will yield an acceptably compact encoding for small result sets. Moreover, zlib has a function deflateBound() that, given the uncompressed size, will calculate the max payload size (which, in your case, will definitely be larger than 125 kB, but not by much).
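As a rough illustration (this does not compute deflateBound() itself), here is a sketch in Node.js that packs a hypothetical sparse result set into the 125,000-byte bitmap and deflates it with zlib:
const zlib = require("zlib");

const NUM_ROWS = 1000000;
const bitmap = Buffer.alloc(NUM_ROWS / 8); // 125,000 bytes, all zero

// Hypothetical sparse result set: only a handful of matching rows.
for (const row of [1, 2, 3, 500000, 999999]) {
  bitmap[row >> 3] |= 1 << (row & 7);
}

const compressed = zlib.deflateSync(bitmap, { level: 9 });
console.log(bitmap.length, compressed.length); // 125000 vs. a tiny compressed size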
Input specification:
Row numbers between 0 and 999999 (if you need 1-indexing, you can apply offset)
Each row number appears only once
The numbers are sorted in ascending order (useful, we'd want to sort them anyway)
A great idea you had is to invert the meaning of the result when the number of matches is more than half the possible values. Let us retain that, and assume we are given a flag and a list of matches/misses.
Your initial attempt at coding this encoded the numbers as text with comma separation. That means that for 90% of the possible values you need 6 characters + 1 separator -- so 7 bytes on average. However, since the maximum value is 999999, you really only need 20 bits to encode each entry.
Hence, the first idea to reducing the size is to use binary encoding.
Binary Encoding
The simplest approach is to write the number of values sent, followed by a stream of 32-bit integers.
A more efficient approach would be to pack two 20-bit values into each 5 bytes written. In case of an odd count, you would just pad the 4 excess bits with zeros.
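A sketch of that 20-bit packing in JavaScript, assuming every value is below 2^20 and the final odd value is padded with zero bits:
// Pack pairs of 20-bit values into 5 bytes each (big-endian).
function pack20(values) {
  const out = Buffer.alloc(Math.ceil(values.length / 2) * 5);
  for (let i = 0; i < values.length; i += 2) {
    const a = values[i];
    const b = i + 1 < values.length ? values[i + 1] : 0; // pad the odd tail with zeros
    const off = (i / 2) * 5;
    out.writeUIntBE((a << 4) | (b >>> 16), off, 3); // a plus the top 4 bits of b
    out.writeUIntBE(b & 0xffff, off + 3, 2);        // the low 16 bits of b
  }
  return out;
}

// Unpack the first `count` values again.
function unpack20(buf, count) {
  const values = [];
  for (let i = 0; i < count; i++) {
    const off = Math.floor(i / 2) * 5;
    const hi = buf.readUIntBE(off, 3);
    const lo = buf.readUIntBE(off + 3, 2);
    values.push(i % 2 === 0 ? hi >>> 4 : ((hi & 0xf) << 16) | lo);
  }
  return values;
}

console.log(unpack20(pack20([999999, 12, 345678]), 3)); // [ 999999, 12, 345678 ]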
Those approaches may be good for small amounts of matches (or misses). However, the important thing to note is that for each row, we only need to track 1 bit of information -- whether it's present or not. That means that we can encode the results as a bitmap of 1000000 bits.
Combining those two approaches, we can use a bitmap when there are many matches or misses, and switch to binary coding when it's more efficient - roughly whenever fewer than 50,000 rows need to be listed, since 50,000 values at 20 bits each already take the full 1,000,000 bits of the bitmap.
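For example, a hypothetical helper to pick the cheaper representation (the threshold and names are illustrative, not part of any existing format):
// Decide between a 20-bit-per-value list (of matches or misses, whichever is
// smaller) and a flat 1-bit-per-row bitmap.
function chooseEncoding(matchCount, numValues = 1000000) {
  const listed = Math.min(matchCount, numValues - matchCount);
  return listed * 20 < numValues ? "list" : "bitmap";
}

console.log(chooseEncoding(3));      // "list"
console.log(chooseEncoding(400000)); // "bitmap"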
Range Reduction
The next potential improvement to use when coding sorted sequences of integers is to use range reduction.
The idea is to code the values from largest to smallest, reducing the number of bits per value as they get smaller; a small sketch follows the steps below.
First, we encode the number of bits N necessary to represent the first value.
We encode the first value using N bits
For each following value
Encode the value using N bits
If the value requires fewer bits to encode, reduce N appropriately
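One way to read that scheme (a sketch under the assumption that after each value is written, N shrinks to the bit length of that value, which the decoder can mirror because the values are strictly decreasing): the following size-only calculation counts the bits such an encoding would use, ignoring the match-count header.
// Bit length of a non-negative integer (treat 0 as 1 bit).
function bitLength(v) {
  return v === 0 ? 1 : 32 - Math.clz32(v);
}

// Total bits for a range-reduced encoding of a strictly decreasing list of row numbers.
function rangeReducedBits(descending) {
  if (descending.length === 0) return 5; // just the 5-bit field that holds the initial N
  let total = 5 + bitLength(descending[0]); // initial N, then the first value in N bits
  for (let i = 1; i < descending.length; i++) {
    total += bitLength(descending[i - 1]); // each later value uses the previous value's bit length
  }
  return total;
}

console.log(rangeReducedBits([999999, 500000, 70, 3])); // 5 + 20 + 20 + 19 + 7 = 71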
Entropy Coding
Let's go back to the bitmap encoding. Based on Shannon entropy theory, the worst case is when we have 50% matches. The further the probabilities are skewed, the fewer bits we need on average to code each entry.
Matches | Bits
--------+-----------
0 | 0
1 | 22
2 | 41
3 | 60
4 | 78
5 | 96
10 | 181
100 | 1474
1000 | 11408
10000 | 80794
100000 | 468996
250000 | 811279
500000 | 1000000
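The numbers in that table can be reproduced from the binary entropy of the match probability p = matches / 1,000,000: the ideal size is roughly 1,000,000 * (-p*log2(p) - (1-p)*log2(1-p)) bits. A quick sketch:
// Ideal (entropy-coded) bitmap size in bits for a given number of matches.
function idealBits(matches, numValues = 1000000) {
  if (matches === 0 || matches === numValues) return 0;
  const p = matches / numValues;
  const bitsPerRow = -p * Math.log2(p) - (1 - p) * Math.log2(1 - p);
  return Math.ceil(numValues * bitsPerRow);
}

console.log(idealBits(1));      // 22
console.log(idealBits(100000)); // 468996
console.log(idealBits(500000)); // 1000000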
To do this, we need to use an entropy coder that can code fractional bits -- something like arithmetic or range coder or some of the new ANS based coders like FSE. Alternatively, we could group symbols together and use Huffman coding.
Prototypes and Measurements
I've written a test using a 32-bit implementation of FastAC by Amir Said, which limits the model to 4 decimal places.
(This is not really a problem, since we shouldn't be feeding such data to the codec directly. This is just a demonstration.)
First some common code:
typedef std::vector<uint8_t> match_symbols_t;
typedef std::vector<uint32_t> match_list_t;
typedef std::set<uint32_t> match_set_t;
typedef std::vector<uint8_t> buffer_t;
// ----------------------------------------------------------------------------
static uint32_t const NUM_VALUES(1000000);
// ============================================================================
size_t symbol_count(uint8_t bits)
{
size_t count(NUM_VALUES / bits);
if (NUM_VALUES % bits > 0) {
return count + 1;
}
return count;
}
// ----------------------------------------------------------------------------
void set_symbol(match_symbols_t& symbols, uint8_t bits, uint32_t match, bool state)
{
size_t index(match / bits);
size_t offset(match % bits);
if (state) {
symbols[index] |= 1 << offset;
} else {
symbols[index] &= ~(1 << offset);
}
}
// ----------------------------------------------------------------------------
bool get_symbol(match_symbols_t const& symbols, uint8_t bits, uint32_t match)
{
size_t index(match / bits);
size_t offset(match % bits);
return (symbols[index] & (1 << offset)) != 0;
}
// ----------------------------------------------------------------------------
match_symbols_t make_symbols(match_list_t const& matches, uint8_t bits)
{
assert((bits > 0) && (bits <= 8));
match_symbols_t symbols(symbol_count(bits), 0);
for (auto match : matches) {
set_symbol(symbols, bits, match, true);
}
return symbols;
}
// ----------------------------------------------------------------------------
match_list_t make_matches(match_symbols_t const& symbols, uint8_t bits)
{
match_list_t result;
for (uint32_t i(0); i < 1000000; ++i) {
if (get_symbol(symbols, bits, i)) {
result.push_back(i);
}
}
return result;
}
The first, simpler variant is to write the number of matches, determine the probability of a match/miss, and clamp it to the supported range.
Then simply encode each value of the bitmap using this static probability model.
class arithmetic_codec_v1
{
public:
buffer_t compress(match_list_t const& matches)
{
uint32_t match_count(static_cast<uint32_t>(matches.size()));
arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
codec.start_encoder();
// Store the number of matches (1000000 needs only 20 bits)
codec.put_bits(match_count, 20);
if (match_count > 0) {
// Initialize the model
static_bit_model model;
model.set_probability_0(get_probability_0(match_count));
// Create a bitmap and code all the bitmap entries
// NB: This is lazy and inefficient, but simple
match_symbols_t symbols = make_symbols(matches, 1);
for (auto entry : symbols) {
codec.encode(entry, model);
}
}
uint32_t compressed_size = codec.stop_encoder();
return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
}
match_list_t decompress(buffer_t& compressed)
{
arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
codec.start_decoder();
// Read number of matches (20 bits)
uint32_t match_count(codec.get_bits(20));
match_list_t result;
if (match_count > 0) {
static_bit_model model;
model.set_probability_0(get_probability_0(match_count));
result.reserve(match_count);
for (uint32_t i(0); i < NUM_VALUES; ++i) {
uint32_t entry = codec.decode(model);
if (entry == 1) {
result.push_back(i);
}
}
}
codec.stop_decoder();
return result;
}
private:
double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
{
double probability_0(double(num_values - match_count) / num_values);
// Limit probability to match FastAC limitations...
return std::max(0.0001, std::min(0.9999, probability_0));
}
};
The second approach is to adapt the model based on the symbols we code.
After each match is encoded, reduce the probability of the next match.
Once all matches are coded, stop.
The second variation compresses slightly better, but at a noticeable performance cost.
class arithmetic_codec_v2
{
public:
buffer_t compress(match_list_t const& matches)
{
uint32_t match_count(static_cast<uint32_t>(matches.size()));
uint32_t total_count(NUM_VALUES);
arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
codec.start_encoder();
// Store the number of matches (1000000 needs only 20 bits)
codec.put_bits(match_count, 20);
if (match_count > 0) {
static_bit_model model;
// Create a bitmap and code all the bitmap entries
// NB: This is lazy and inefficient, but simple
match_symbols_t symbols = make_symbols(matches, 1);
for (auto entry : symbols) {
model.set_probability_0(get_probability_0(match_count, total_count));
codec.encode(entry, model);
--total_count;
if (entry) {
--match_count;
}
if (match_count == 0) {
break;
}
}
}
uint32_t compressed_size = codec.stop_encoder();
return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
}
match_list_t decompress(buffer_t& compressed)
{
arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
codec.start_decoder();
// Read number of matches (20 bits)
uint32_t match_count(codec.get_bits(20));
uint32_t total_count(NUM_VALUES);
match_list_t result;
if (match_count > 0) {
static_bit_model model;
result.reserve(match_count);
for (uint32_t i(0); i < NUM_VALUES; ++i) {
model.set_probability_0(get_probability_0(match_count, NUM_VALUES - i));
if (codec.decode(model) == 1) {
result.push_back(i);
--match_count;
}
if (match_count == 0) {
break;
}
}
}
codec.stop_decoder();
return result;
}
private:
double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
{
double probability_0(double(num_values - match_count) / num_values);
// Limit probability to match FastAC limitations...
return std::max(0.0001, std::min(0.9999, probability_0));
}
};
Practical Approach
Practically, it's probably not worth designing a new compression format.
In fact, it might not even be worth writing the results as bits; just make an array of bytes with values 0 or 1.
Then use an existing compression library -- zlib is very common, or you could try lz4 or snappy, bzip2, lzma... the choices are plentiful.
ZLib Example
class zlib_codec
{
public:
zlib_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}
buffer_t compress(match_list_t const& matches)
{
match_symbols_t symbols(make_symbols(matches, bits_per_symbol));
z_stream defstream;
defstream.zalloc = nullptr;
defstream.zfree = nullptr;
defstream.opaque = nullptr;
deflateInit(&defstream, Z_BEST_COMPRESSION);
size_t max_compress_size = deflateBound(&defstream, static_cast<uLong>(symbols.size()));
buffer_t compressed(max_compress_size);
defstream.avail_in = static_cast<uInt>(symbols.size());
defstream.next_in = &symbols[0];
defstream.avail_out = static_cast<uInt>(max_compress_size);
defstream.next_out = &compressed[0];
deflate(&defstream, Z_FINISH);
deflateEnd(&defstream);
compressed.resize(defstream.total_out);
return compressed;
}
match_list_t decompress(buffer_t& compressed)
{
z_stream infstream;
infstream.zalloc = nullptr;
infstream.zfree = nullptr;
infstream.opaque = nullptr;
inflateInit(&infstream);
match_symbols_t symbols(symbol_count(bits_per_symbol));
infstream.avail_in = static_cast<uInt>(compressed.size());
infstream.next_in = &compressed[0];
infstream.avail_out = static_cast<uInt>(symbols.size());
infstream.next_out = &symbols[0];
inflate(&infstream, Z_FINISH);
inflateEnd(&infstream);
return make_matches(symbols, bits_per_symbol);
}
private:
uint32_t bits_per_symbol;
};
BZip2 Example
class bzip2_codec
{
public:
bzip2_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}
buffer_t compress(match_list_t const& matches)
{
match_symbols_t symbols(make_symbols(matches, bits_per_symbol));
uint32_t compressed_size = symbols.size() * 2;
buffer_t compressed(compressed_size);
int err = BZ2_bzBuffToBuffCompress((char*)&compressed[0]
, &compressed_size
, (char*)&symbols[0]
, symbols.size()
, 9
, 0
, 30);
if (err != BZ_OK) {
throw std::runtime_error("Compression error.");
}
compressed.resize(compressed_size);
return compressed;
}
match_list_t decompress(buffer_t& compressed)
{
match_symbols_t symbols(symbol_count(bits_per_symbol));
uint32_t decompressed_size = symbols.size();
int err = BZ2_bzBuffToBuffDecompress((char*)&symbols[0]
, &decompressed_size
, (char*)&compressed[0]
, compressed.size()
, 0
, 0);
if (err != BZ_OK) {
throw std::runtime_error("Compression error.");
}
if (decompressed_size != symbols.size()) {
throw std::runtime_error("Size mismatch.");
}
return make_matches(symbols, bits_per_symbol);
}
private:
uint32_t bits_per_symbol;
};
Comparison
The code repository, including dependencies for 64-bit Visual Studio 2015, is at https://github.com/dan-masek/bounded_sorted_list_compression.git
Storing a compressed list of sorted integers is extremely common in data retrieval and database applications, and a variety of techniques have been developed.
I'm pretty sure that an unguessably random selection of about half of the items in your list is going to be your worst case.
Many popular integer-list-compression techniques, such as Roaring bitmaps, fall back to using (with such worst-case input data) a 1-bit-per-index bitmap.
So in your case, with 1 million rows, the maximum size payload returned would be (in the worst case) a header with the "using a bitmap" flag set,
followed by a bitmap of 1 million bits (125,000 bytes), where for example the 700th bit of the bitmap is set to 1 if the 700th row in the database is a match, or set to 0 if the 700th row in the database does not match. (Thanks, Dan Mašek!)
My understanding is that, while quasi-succinct Elias-Fano compression and other techniques are very useful for compressing many "naturally-occurring" sets of sorted integers, for this worst-case data set, none of them give better compression, and most of them give far worse "compression", than a simple bitmap.
(This is analogous to the way most general-purpose data compression algorithms, such as DEFLATE, when fed "worst-case" data such as indistinguishable-from-random encrypted data, create "compressed" files with a few bytes of overhead with the "stored/raw/literal" flag set, followed by a simple copy of the uncompressed file).
Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression"
https://en.wikipedia.org/wiki/Bitmap_index#Compression
https://en.wikipedia.org/wiki/Inverted_index#Compression

c# - byte array improper conversion to MB

The file is about 24 MB, and it's held in a database, so I convert it to a byte array and then, after multiple suggestions, I use BitConverter.ToSingle(,), and this is giving me bad results. Here's my code:
byte[] imgData = prod.ImageData;
float myFloat = BitConverter.ToSingle(imgData, 0);
float mb = (myFloat / 1024f) / 1024f;
When I debug, I get these results:
byte[24786273]
myFloat = 12564.0361
mb = 0.0119819986
What is weird is that the size of the array is exactly what the file should be. How do I correctly convert this to a float so that it shows as MB?
EDIT: I tried setting myFloat to imgData.Length, and then the size is correct. However, is this the correct way to do it, and can it cause a problem in the future with bigger values?
You are taking the first four bytes out of the image and converting them to an IEEE floating-point number. I'm not an expert on image files, so I'm not sure if the first four bytes are always the length; even if that were the case, it would still not be correct (see the specification). However, the length of the file is already known through the length of the array, so an easier way to get the size is:
byte[] imgData = prod.ImageData;
float mb = (imgData.Length / 1024f) / 1024f;
To address your concerns: this will still work for large files; consider a 24 TB example.
var bytes = 24L * 1024 * 1024 * 1024 * 1024;
var inMb = (bytes / 1024.0F / 1024.0F);

What NSNumber (Integer 16, 32, 64) in Core Data should I use to keep NSUInteger

I want to store an NSUInteger in my Core Data store, and I don't know which type I should use (Integer 16, 32, or 64) to suit the space needed.
From my understanding:
Integer 16 can hold values from -32,768 to 32,767
Integer 32 can hold values from -2,147,483,648 to 2,147,483,647
Integer 64 can hold values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
and NSUInteger is a typedef of unsigned long, which is equal to unsigned int (Types in objective-c on iPhone)
so if I convert my NSUInteger to NSNumber with numberWithUnsignedInteger: and save it as NSNumber (Integer 32), can I retrieve my data back safely?
Do you really need the entire range of an NSUInteger? On a 32-bit iOS device that's an unsigned 32-bit value, which can get very large. It will fit into a signed 64-bit integer.
But you probably don't need that much range anyway. The maximum for a uint32_t is UINT32_MAX, which is 4,294,967,295 (4 billion). If you increment once a second, it'll take you more than 136 years to reach that value. Your user's iPhone won't be around by then... :)
If at all possible, when writing data to disk or across a network, it's best to be explicit about the size of the value. Instead of using NSUInteger as the datatype, use uint16_t, uint32_t, or uint64_t depending on the range you need. This then translates naturally to Integer 16, 32, and 64 in Core Data.
To understand why, consider this scenario:
You opt to use Integer 64 type to store your value.
On a 64-bit iOS device (eg iPhone 6) it stores the value 5,000,000,000.
On a 32-bit iOS device this value is fetched from the store into an NSUInteger (using NSNumber's unsignedIntegerValue).
Now because NSUInteger is only 32 bits on the 32-bit device, the number is no longer 5,000,000,000 because there aren't enough bits to represent 5 billion. If you had swapped the NSUInteger in step 3 for uint64_t, then the value would still be 5 billion.
If you absolutely must use NSUInteger, then you'll just need to be wary about the issues described above and code defensively for it.
As far as storing unsigned values into the seemingly signed Core Data types, you can safely store them and retrieve them:
NSManagedObject *object = // create object
object.valueNumber = @(4000000000); // Store 4 billion in an Integer 32 Core Data type
[managedObjectContext save:NULL]; // Save value to store
// Later on
NSManagedObject *object = // fetch object from store
uint32_t value = object.valueNumber.unsignedIntegerValue; // value will be 4 billion

strtod() and sprintf() inconsistency under GCC and MSVC

I'm working on a cross-platform app for Windows and Mac OS X, and I have a problem with two standard C library functions:
strtod() - string-to-double conversion
sprintf() - when used for outputting double-precision floating point numbers
Their GCC and MSVC versions return different results in some digits of the mantissa. But this plays a crucial role if the exponent value is large. An example:
MSVC: 9,999999999999999500000000000000e+032
GCC: 9,999999999999999455752309870428e+32
MSVC: 9,999999999999999500000000000000e+033
GCC: 9,999999999999999455752309870428e+33
MSVC: 9,999999999999999700000000000000e+034
GCC: 9,999999999999999686336610791798e+34
The input test numbers have an identical binary representation under MSVC and GCC.
I'm looking for a well-tested cross-platform open-source implementation of those functions, or just for a pair of functions that would correctly and consistently convert double to string and back.
I've already tried the GCC C library (glibc) implementation, but the code is too long and too dependent on other source files, so I expect the adaptation to be difficult.
What implementations of string-to-double and double-to-string functions would you recommend?
Converting between floating point numbers and strings is hard - very hard. There are numerous papers on the subject, including:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
How to Print Floating-Point Numbers Accurately
How to Read Floating-Point Numbers Accurately
General Decimal Arithmetic
The last of those is a treasure trove of information on floating point decimal arithmetic.
The GNU glibc implementation is likely to be about as good as it gets - but it won't be short or simple.
Addressing the examples
A double normally stores 16 (some might argue 17) significant decimal digits. MSVC is processing 17 digits. Anything beyond that is noise. GCC is doing as you ask it, but there aren't enough bits in a double to warrant the extra 14 digits you are requesting. If you had 16-byte 'long double' values (SPARC, PPC, Intel x86_64 for Mac), then you might warrant 32 significant figures. However, the differences you are showing are a matter of QoI (quality of implementation); I might even argue that MS is doing a better job than GCC/glibc here (and I don't often say that!).
The only algorithm I know for printing the exact value of a floating point number in decimal is as follows:
Convert the mantissa to a decimal integer. You can either do this by pulling apart the bits to read the mantissa directly, or you can write a messy floating point loop that first multiplies the value by a power of two to put it in the range 1<=x<10, then pulls off a digit at a time by casting to int, subtracting, and multiplying by 10.
Apply the exponent by repeatedly multiplying or dividing by 2. This is an operation on the string of decimal digits you generated. Every ~3 multiplications will add an extra digit to the left. Every single division will add an extra digit to the right.
It's slow and ugly but it works...
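An alternative route to the same exact decimal expansion, sketched here in JavaScript with BigInt rather than the digit-string loop described above (this is my own illustration, not the glibc or MSVC approach): since 2^-k = 5^k / 10^k, the mantissa only ever needs to be multiplied by a power of 5 and then have a decimal point placed.
// Exact decimal expansion of a finite, positive double, via BigInt arithmetic.
function exactDecimal(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  const expField = Number((bits >> 52n) & 0x7ffn);
  let mantissa = bits & 0xfffffffffffffn;
  let exp;                              // power of two multiplying the mantissa
  if (expField === 0) {
    exp = -1074;                        // subnormal
  } else {
    mantissa |= 0x10000000000000n;      // normal: restore the implicit leading 1 bit
    exp = expField - 1075;
  }
  if (exp >= 0) {
    return (mantissa << BigInt(exp)).toString();  // exact integer
  }
  const k = -exp;                       // x = mantissa / 2^k = mantissa * 5^k / 10^k
  const digits = (mantissa * 5n ** BigInt(k)).toString().padStart(k + 1, "0");
  return digits.slice(0, -k) + "." + digits.slice(-k);
}

console.log(exactDecimal(1.85)); // the full exact value of the double closest to 1.85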
The following function dtoa returns a string that losslessly converts back into the same double.
If you rewrite aisd to test all of your string-to-float implementations, you'll have portable output among them.
// Return whether a string represents the given double.
int aisd(double f, char* s) {
double r;
sscanf(s, "%lf", &r);
return r == f;
}
// Return the shortest lossless string representation of an IEEE double.
// Guaranteed to fit in 23 characters (including the final '\0').
char* dtoa(char* res, double f) {
int i, j, lenF = 1e9;
char fmt[8];
int e = floor(log10(f)) + 1;
if (f > DBL_MAX) { sprintf(res, "1e999"); return res; } // converts to Inf
if (f < -DBL_MAX) { sprintf(res, "-1e999"); return res; } // converts to -Inf
if (isnan(f)) { sprintf(res, "NaN"); return res; } // NaNs don't work under MSVCRT
// compute the shortest representation without exponent ("123000", "0.15")
if (!f || e>-4 && e<21) {
for (i=0; i<=20; i++) {
sprintf(fmt, "%%.%dlf", i);
sprintf(res, fmt, f);
if (aisd(f, res)) { lenF = strlen(res); break; }
}
}
if (!f) return res;
// compute the shortest representation with exponent ("123e3", "15e-2")
for (i=0; i<19; i++) {
sprintf(res, "%.0lfe%d", f * pow(10,-e), e); if (aisd(f, res)) break;
j = strlen(res); if (j >= lenF) break;
while (res[j] != 'e') j--;
res[j-1]--; if (aisd(f, res)) break; // try mantissa -1
res[j-1]+=2; if (aisd(f, res)) break; // try mantissa +1
e--;
}
if (lenF <= strlen(res)) sprintf(res, fmt, f);
return res;
}
See Can't get a NaN from the MSVCRT strtod/sscanf/atof functions for the MSVCRT NaN problem. If you don't need to recognize NaNs, you can output infinity ("1e999") when you get one.
