Discrepancies in Python: hard-coded strings vs the str() method

Okay. Here is my minimal working example. When I type this into Python 3.6.2:
foo = '0.670'
str(foo)
I get
>>>'0.670'
but when I type
foo = 0.670
str(foo)
I get
>>>'0.67'
What gives? It is stripping off the zero, which I believe has to do with how floats are represented on a computer in general. But why does the str() method retain the extra 0 in the first case?

You are mixing strings and floats. A string is a sequence of code points (one code point represents one character) representing some text, and the interpreter processes it as text. A string is always written inside single quotes or double quotes (e.g. 'Hello'). A float is a number, and Python knows that, so it also knows that 1.0000 is the same as 1.0.
In the first case you saved a string into foo. Calling str() on a string just takes the string and returns it as is.
In the second case you saved 0.670 as a float (because it's not wrapped in quotes). When Python converts a float into a string, it always tries to create the shortest string possible.
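You can check that the two literals really denote the same number:
print(0.670 == 0.67)  # True -- the same float, spelled two ways
print(str(0.670))     # 0.67 -- the shortest string that round-trips to this float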
Why does Python automatically truncate the trailing zero?
When you want to save a real number in a computer's memory, you have to convert it into a binary representation. Usually (though there are some exceptions) it is saved in the format described by the IEEE 754 standard, and Python uses that format for floats too.
Let's look at an example:
from struct import pack
x = -1.53
y = -1.53000
# ">d" = big-endian IEEE 754 double; .hex() shows the raw bytes
print("X:", pack(">d", x).hex())
print("Y:", pack(">d", y).hex())
The pack() function takes its input and, based on the given format (">d"), converts it into bytes. In this case it takes a float and shows us how it is stored in memory. If you run the code you will see that x and y are stored in memory in exactly the same way. The memory doesn't contain any information about the original formatting of the number.
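A compact way to confirm it:
from struct import pack
# both literals denote the same double, so the byte patterns match exactly
assert pack(">d", -1.53) == pack(">d", -1.53000)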
Of course you could store some extra information about the formatting, but:
It would take additional memory, and it's good practice to use only as much memory as you actually need and not waste it.
What would be the result of 0.10 + 0.1: should it be 0.2 or 0.20?
For scientific purposes and significant figures, shouldn't it leave the value as the user defined it?
It doesn't matter how you defined the input number. What matters is the format you want to use for presentation. As I said, str() always tries to create the shortest string possible. str() is fine for simple scripts or tests. For scientific purposes (or wherever a particular representation is required) you can convert your numbers to strings however you want or need.
For example:
x = -1655484.4584631
y = 42.0
# always print the number with its sign and exactly 5 digits in the fractional part
print("{:+.5f}".format(x)) # -1655484.45846
print("{:+.5f}".format(y)) # +42.00000
# always print the number in scientific format (the sign is shown only when the number is negative)
print("{:-.2e}".format(x)) # -1.66e+06
print("{:-.2e}".format(y)) # 4.20e+01
For more information about formatting numbers and other types, look at Python's documentation.

Related

Python float() limitation on scientific notation

python 3.6.5
numpy 1.14.3
scipy 1.0.1
cerberus 1.2
I'm trying to convert a string '6.1e-7' to a float 0.00000061 so I can save it in a MongoDB field.
My problem here is that float('6.1e-7') doesn't seem to work (it works for float('6.1e-4'), but not for float('6.1e-5') and smaller).
Python float
I can't seem to find any information about why this happens, or on float limitations, and every example I found shows a conversion on e-3, never smaller than that.
Numpy
I installed NumPy to try float96()/float128()... float96() doesn't exist and float128() returns a float '6.09999999999999983e-07'.
Format
I tried format(6.1E-07, '.8f'), which works, as it returns a string '0.00000061', but when I convert the string to a float (so it can pass cerberus validation) it reverts back to 6.1E-7.
Any help on this subject would be greatly appreciated.
Thanks
'6.1e-7' is a string:
>>> type('6.1e-7')
<class 'str'>
While 6.1e-7 is a float:
>>> type(6.1e-7)
<class 'float'>
0.00000061 is the same as 6.1e-7
>>> 0.00000061 == 6.1e-7
True
And, internally, this float is represented by 0's and 1's. That's just yet another representation of the same float.
However, when converted into a string, they're no longer compared as numbers, they are just characters:
>>> '0.00000061' == '6.1e-7'
False
And you can't compare strings with numbers either:
>>> 0.00000061 == '6.1e-7'
False
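In other words, float('6.1e-7') almost certainly already works; the value only displays in scientific notation. A quick check:
x = float('6.1e-7')
print(x == 0.00000061)  # True: the conversion worked, only the default repr is scientific
print("%.8f" % x)       # 0.00000061 -- fixed-point text, if that's what should be stored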
Your problem description is too twisted to be precisely understood, but I'll try to apply some telepathy here.
In their internal format, numbers don't keep any formatting information, neither integers nor floats do. For the integer 123, you can't recover whether it was written as "123", " 123 " (with tons of spaces before and after it), 000000123 or +0123. For a floating-point number, 0.1, +0.0001e00003, 1.000000e-1 and myriads of other forms could have been used. Internally, they all result in the same number.
(There are some specifics when you use IEEE 754 "decimal floating point", but I am sure that is not your case.)
When saving to a database, the internal representation stops mattering much. Instead, the database's specifics start playing a role, and they can be quite different. For example, SQL suggests column types like numeric(10,4), and each value will be converted to the decimal format corresponding to the column type (typically saved on disk as a text string, with or without a decimal point). In MongoDB, you can keep a floating-point value either as a JSON number (an IEEE 754 double) or as text. Each variant has its own specifics, but if you choose text, it is your own responsibility to apply the proper formatting every time you produce that text. You want a fixed-point decimal number with 8 digits after the point? OK, no problem: just format according to %.8f each time you prepare such a representation.
The issues with representation selection are:
Uniqueness: no different forms should exist for the same value. Otherwise you could, for example, store the same contents under multiple keys, and then mistake an older one for the latest one.
Ordering awareness: the DB should be able to provide the natural order of values, for requests like "ceiling key-value pair".
If you always format values using %.8f, you will achieve uniqueness, but not ordering. The same goes for %g, %e and really any other text format, except special (not human-readable) ones constructed precisely to keep such ordering. If you need ordering, just store numbers as numbers, and don't worry about what they look like in text form.
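(A quick Python illustration, since fixed-point text sorts by characters, not by value:)
values = [9.0, 10.0, 6.1e-7]
as_text = ["%.8f" % v for v in values]
print(sorted(values))   # [6.1e-07, 9.0, 10.0] -- numeric order
print(sorted(as_text))  # ['0.00000061', '10.00000000', '9.00000000'] -- '1' < '9', so 10 sorts before 9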
(And, your problem is not tied to numpy.)

Why doesn't this comparison work as I expect? PL/I

This comparison prints '0'b, and I don't understand why... As far as I know, strings are converted automatically to float in PL/I if needed.
put skip list('-2.34e-1'=-2.34e-1);
I have tested this in our environment (Enterprise PL/I V4.5 on z/OS) and found the same behaviour - under certain compile-options.
Using the option FLOAT(NODFP) (i.e. do not use native support for decimal floating point; I think the option was introduced with Enterprise PL/I V4.4) the following happens:
the literal -2.34e-1 is converted to its internal representation as bin float(6), i.e. a short binary floating-point value
the literal '-2.34e-1' is compared with a bin float(6) value, so it has to be converted to a bin float as well
since -0.234 does not have an exact representation as a binary fraction, it seems the compiler converts it to bin float(54), i.e. an extended binary floating-point value, to get maximum precision
So since -0.234 has an infinite number of digits after the point in its binary representation, but the two converted values preserve different numbers of digits, the values do not compare equal.
Under FLOAT(DFP) (i.e. when using the machine's DFP support):
the internal representation of the literal -2.34e-1 is an actual decimal floating-point value and thus exact
as is the representation of '-2.34e-1'
so under this compile option both compare equal and the output of your program is '1'b
So your problem is a combination of the compiler's differing choices of data representation and the resulting rounding errors from using binary floating point of different precisions.
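(Not PL/I, but the same effect is easy to reproduce in any language with two binary floating-point widths. A minimal Python sketch that squeezes the same literal through single precision:)
from struct import pack, unpack
d = -0.234                          # the literal, held as a 64-bit binary float
f = unpack(">f", pack(">f", d))[0]  # the same value rounded to 32 bits, then widened back
print(f == d)                       # False: the two precisions keep different binary digits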

JsonSlurper avoid trimming last zero in a string

I am using JsonSlurper in Groovy to convert JSON text to a map.
def slurper = new JsonSlurper();
def parsedInput = slurper.parseText("{amount=10.00}");
Result is
[amount:10.0]
I need the result without the last zero trimmed, like
[amount:10.00]
I have checked various solutions, but nothing converts it without trimming the last zero. Am I missing something here?
One of the ways I have found is to give input as:
{amount="10.00"}
In numbers and maths, 10.00 IS 10.0
They are exactly the same number.
They just have different String representations.
If you need to display 10.0 to the user as 10.00, that is a conversion concern: you will need to convert it to a String with 2 decimal places.
Something like:
def stringRepresentation = String.format("%.02f", 10.0)
But for any calculations, 10.0 and 10.00 are the same thing
Edit -- Try again...
Right so when you have the json:
{"amount"=10.00}
The value on the right is a floating point number.
To keep the extra zero (which is normally dropped by any sane representation of numbers), you will need to convert it to a String.
To do this, you can use the String.format above (other methods are available).
You cannot keep it as a floating point number with an extra zero.
Numbers don't work like that in any language I can think of... They might in COBOL, from the back of my memory, but that's way off track.
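(For comparison, Python's json module makes the same tradeoff: a JSON number parses to a float and loses its scale, unless you route it into a decimal type. A minimal sketch, using strict JSON syntax:)
import json
from decimal import Decimal
# parse_float hands the raw literal text to Decimal, which preserves the trailing zero
parsed = json.loads('{"amount": 10.00}', parse_float=Decimal)
print(parsed)            # {'amount': Decimal('10.00')}
print(parsed['amount'])  # 10.00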
The issue (GROOVY-6922) was fixed in Groovy version 2.4.6. With 2.4.6 the scale of the number should be retained.

In Python, how can I print a dictionary containing large numbers without an 'L' being inserted after the large numbers?

I have a dictionary as follows
d={'apples':1349532000000, 'pears':1349532000000}
Doing either str(d) or repr(d) results in the following output
{'apples': 1349532000000L, 'pears': 1349532000000L}
How can I get str, repr, or print to display the dictionary without it adding an L to the numbers?
I am using Python 2.7
You can't, because the L suffix denotes a Python 2 long (an arbitrary-precision integer). Those numbers are too large to fit into a plain int on your build, so they are longs, and if the L suffix were omitted the result would not be valid Python; the whole point of repr() is to emit valid Python.
Well this is a little embarrassing, I just found a solution after quite a few attempts :-P
One way to do this (which suits my purpose) is to use json.dumps() to convert the dictionary to a string.
d={'apples':1349532000000, 'pears':1349532000000}
import json
json.dumps(d)
Outputs
'{"apples": 1349532000000, "pears": 1349532000000}'

Encoding name strings into a unique number

I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a last name. I need to encode these names into a number that uniquely represents each name. The encoding should be one-to-one: a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each letter of a name by its position in the alphabet (a→1, b→2, and so on), so a name like Deepa would become 455161, but then I cannot tell whether the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the output numeral for any name should have a fixed number of digits, i.e., independent of the name's length. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
Some options:
1. Sort them. Count them. The 10th name is number 10.
2. Treat each character as a digit in a base-26 (case-insensitive, no digits), base-52 (case-sensitive, no digits), base-36 (case-insensitive with digits) or base-62 (case-sensitive with digits) number, and compute the value in an int. E.g., for a name of "abc", you'd have 0 * 26^2 + 1 * 26^1 + 2 * 26^0. Sometimes Chinese names may use digits to indicate tonality.
3. Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
4. This one's mostly suggested in fun: use Gödel numbering :). So "abc" would be 2^1 * 3^2 * 5^3 - a product of powers of primes (map a→1, b→2, ... so that every character shows up as a factor). Factoring the number gives you back the characters. The numbers could get quite large, though.
5. Convert to ASCII, if you aren't already using it. Then treat each character's ordinal as a digit in a base-256 numbering system. So "abc" is 97 * 256^2 + 98 * 256^1 + 99 * 256^0 (the ASCII ordinals of a, b, c).
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proof, though you may find you need Unicode at some point.
I believe you could handle Unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
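A minimal Python sketch of option #5 (the function names are mine, not from the answer):
def name_to_number(name):
    # treat the UTF-8 bytes of the name as the digits of a base-256 number
    return int.from_bytes(name.encode("utf-8"), "big")

def number_to_name(n):
    # peel the bytes back off to invert the encoding
    return n.to_bytes((n.bit_length() + 7) // 8, "big").decode("utf-8")

num = name_to_number("Deepa")
print(num)                  # 293758922849 -- one integer per name
print(number_to_name(num))  # Deepa
print("%040d" % num)        # zero-padded on the left for a fixed 40-digit width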
What you are trying to do here is actually hashing (at least if you want a fixed number of digits). There are good hashing algorithms with few collisions. Try SHA-1, for example; it is well tested and available in modern languages (see http://en.wikipedia.org/wiki/Sha1) -- it seems to be good enough for git, so it might work for you.
There is of course a small possibility of identical hash values for two different names, but that's always the case with hashing and can be taken care of. With SHA-1 and the like you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really need guaranteed-unique IDs, you will have to do something like NealB suggested: create the IDs yourself and connect names and IDs in a database (you could create them randomly and check for collisions, or increment them, starting at 0000000000001 or so).
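A sketch of the hashing route (the helper name and the '\x00' separator are my choices):
import hashlib

def name_to_id(first, middle, last):
    # join with a separator that can't occur in names, so that
    # ("ab", "c") and ("a", "bc") don't collide
    key = "\x00".join((first, middle or "", last)).encode("utf-8")
    return int.from_bytes(hashlib.sha1(key).digest(), "big")

# 2**160 has 49 decimal digits, so the padded width is fixed
print("%049d" % name_to_id("Deepa", None, "S"))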
(improved answer after giving it some thought and reading the first comments)
You can use BigInteger to encode arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And to get the string back, use:
String str = new String(bi.toByteArray());
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
    score = 0
    depth = 1
    for char in value:
        score += ord(char) * depth
        depth /= 256.
    return score
If you are unfamiliar with Python, here's what it does.
The score is initially 0 and the depth is set to 1.
For every character, add its ord value * the depth.
The ord function returns the character's Unicode code point (0-255 for single-byte characters).
Then it's multiplied by the 'depth'.
Finally the depth is divided by 256.
Essentially, the way it works is that the initial characters add more to the score while later characters contribute less and less. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0 and 256. This encoding scheme also works for binary data, as there are only 256 possible values in a byte/char.
This method works great for smaller string values; however, for longer strings you will notice that the decimal value requires more precision than a regular double (64-bit) can provide. In Java you can use BigDecimal, and in Python the 'decimal' module, for added precision. A bonus of using this method is that the returned values are in sorted order, so they can be searched 'efficiently'.
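For instance, a Python variant using the decimal module might look like this (a sketch; the precision setting is my choice and needs roughly 8 more digits per character of input):
from decimal import Decimal, getcontext

getcontext().prec = 80  # enough exact digits for strings of ~10 characters

def hash_string_exact(value):
    score = Decimal(0)
    depth = Decimal(1)
    for char in value:
        score += ord(char) * depth
        depth /= 256
    return score

print(hash_string_exact("abc") < hash_string_exact("abd"))  # True -- order-preserving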
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it if every character (plus blank, at least) occupies a position. So ABC, which is 1,2,3, has to be translated to
1*53² + 2*53 + 3
(where 53 = 2*26+1 is the number of symbols: blank plus both cases of 26 letters). This way you can encode arbitrary strings, but if the length of the input isn't limited (and why should it be?), you aren't guaranteed an upper limit on the size of the number.
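A sketch of this scheme (the mapping blank→0, A→1 ... Z→26, a→27 ... z→52 is my assumption):
def positional_encode(s):
    alphabet = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"  # 53 symbols
    n = 0
    for ch in s:
        n = n * 53 + alphabet.index(ch)  # Horner's rule, base 53
    return n

print(positional_encode("ABC"))  # 1*53**2 + 2*53 + 3 = 2918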
