JsonSlurper avoid trimming last zero in a string - groovy

I am using JsonSlurper in groovy to convert a json text to a map.
def slurper = new JsonSlurper();
def parsedInput = slurper.parseText("{amount=10.00}");
Result is
[amount:10.0]
I need result without trimming last zero. Like
[amount:10.00]
Have checked various solutions but this is not getting converted without trimming last zero. Am I missing something here.
One of the ways I have found is to give input as:
{amount="10.00"}

In numbers and maths, 10.00 IS 10.0
They are exactly the same number.
They just have different String representations.
If you need to display 10.0 to the user as 10.00 then that is a conversion thing, as you will need to convert it to a String with 2 decimal places
Something like:
def stringRepresentation = String.format("%.02f", 10.0)
But for any calculations, 10.0 and 10.00 are the same thing
Edit -- Try again...
Right so when you have the json:
{"amount"=10.00}
The value on the right is a floating point number.
To keep the extra zero (which is normally dropped by every sane representation of numbers), you will need to convert it to a String.
To do this, you can use the String.format above (other methods are available).
You cannot keep it as a floating point number with an extra zero.
Numbers don't work like that in every language I can think of... They might do in COBOL from the back of my memory, but that's way off track

The issue (GROOVY-6922) was fixed in Groovy version 2.4.6. With 2.4.6 the scale of the number should be retained.

Related

Python float() limitation on scientific notation

python 3.6.5
numpy 1.14.3
scipy 1.0.1
cerberus 1.2
I'm trying to convert a string '6.1e-7' to a float 0.00000061 so I can save it in a mongoDb field.
My problem here is that float('6.1e-7') doesn't work (it will work for float('6.1e-4'), but not float('6.1e-5') and more).
Python float
I can't seem to find any information about why this happen, on float limitations, and every examples I found shows a conversion on e-3, never up to that.
Numpy
I installed Numpy to try the float96()/float128() ...float96() doesn't exist and float128() return a float '6.09999999999999983e-07'
Format
I tried 'format(6.1E-07, '.8f')' which works, as it return a string '0.00000061' but when I convert the string to a float (so it can pass cerberus validation) it revert back to '6.1E-7'.
Any help on this subject would be greatly appreciated.
Thanks
'6.1e-7' is a string:
>>> type('6.1e-7')
<class 'str'>
While 6.1e-7 is a float:
>>> type(6.1e-7)
<class 'float'>
0.00000061 is the same as 6.1e-7
>>> 0.00000061 == 6.1e-7
True
And, internally, this float is represented by 0's and 1's. That's just yet another representation of the same float.
However, when converted into a string, they're no longer compared as numbers, they are just characters:
>>> '0.00000061' == '6.1e-7'
False
And you can't compare strings with numbers either:
>>> 0.00000061 == '6.1e-7'
False
Your problem description is too twisted to be precisely understood but I'll try to get some telepathy for this.
In an internal format, numbers don't keep any formatting information, neither integers nor floats do. For an integer 123, you can't restore whether it was presented as "123", " 123 " (with tons of spaces before and after it), 000000123 or +0123. For a floating number, 0.1, +0.0001e00003, 1.000000e-1 and myriads of other forms can be used. Internally, all they shall result in the same number.
(There are some specifics with it when you use IEEE754 "decimal floating", but I am sure it is not your case.)
When saving to a database, internal representation stops having much sense. Instead, the database specifics starts playing role, and it can be quite different. For example, SQL suggests using column types like numeric(10,4), and each value will be converted to decimal format corresponding to the column type (typically, saved on disk as text string, with or without decimal point). In MongoDB, you can keep a floating value either as JSON number (IEEE754 double) or as text. Each variant has its own specifics, but, if you choose text, it is your own responsibility to provide proper formatting each time you form this text. You want to see a fixed-point decimal number with 8 digits after point? OK, no problems: you just shall format according to %.8f on each preparing of such representation.
The issues with representation selection are:
Uniqueness: no different forms should be available for the same value. Otherwise you can, for example, store the same contents under multiple keys, and then mistake older one for a last one.
Ordering awareness: DB should be able to provide natural order of values, for requests like "ceiling key-value pair".
If you always format values using %.8f, you will reach uniqueness, but not ordering. The same for %.g, %.e and really other text format except special (not human readable) ones that are constructed to keep such ordering. If you need ordering, just use numbers as numbers, and don't concentrate on how they look like in text forms.
(And, your problem is not tied with numpy.)

Discrepencies in Python hard coding string vs str() methods

Okay. Here is my minimal working example. When I type this into python 3.6.2:
foo = '0.670'
str(foo)
I get
>>>'0.670'
but when I type
foo = 0.670
str(foo)
I get
>>>'0.67'
What gives? It is stripping off the zero, which I believe has to do with representing a float on a computer in general. But by using the str() method, why can it retain the extra 0 in the first case?
You are mixing strings and floats. The string is sequence of code points (one code point represents one character) representing some text and interpreter processing it as a text. The string is always inside single-quotes or double-quotes (e.g. 'Hello'). The float is a number and Python know it so it also know that 1.0000 is the same as 1.0.
In the first case you saved into foo a string. The str() call on string just take the string and return it as is.
In the second case you saved 0.670 as a float (because it's not wrapped in quotes). When Python converting float into a string it always tries create the shortest string possible.
Why Python automatically truncates the trailing zero?
When you try save some real number into computer's memory you have to convert it into binary representation. Usually (but there some exceptions) it's saved in format described in the standard IEEE 754 and Python uses it for floats too.
Let's go to the some example:
from struct import pack
x = -1.53
y = -1.53000
print("X:", pack(">d", x).hex())
print("Y:", pack(">d", y).hex())
The pack() function takes input and based on given format (>d) convert it into bytes. In this case it takes float number and give as how it is saved in memory. If you run the code you will see the x and y are saved in the memory in the same way. The memory doesn't contain information about the format of saved number.
Of course you can add some information about it but:
It would take another memory and it's good practice to use as much memory as you actually need and don't waste it.
What would be result of 0.10 + 0.1 should it be 0.2 or 0.20?
For scientific purposes and significant figures shouldn't it leave the value as the user defined it?
It doesn't matter how you defined the input number. The important is what format you want to use for presenting. As I said the str() always tries create the shortest string possible. str() is good for some simple scripts or tests. For scientific purposes (or for uses where some representation is required) you can convert your numbers to string as you want or need.
For example:
x = -1655484.4584631
y = 42.0
# always print number with sign and exactly 5 numbers from fractional part
print("{:+.5f}".format(x)) # -1655484.45846
print("{:+.5f}".format(y)) # +42.00000
# always print number in scientific format (sign is showed only when the number is negative)
print("{:-2e}".format(x)) # -1.66e+06
print("{:-2e}".format(y)) # 4.20e+01
For more information about formatting numbers and others types look at the Python's documentation.

SUM not working 'Invalid or missing field format'

I have an input file in this format: (length 20, 10 chars and 10 numerics)
jname1 0000500006
bname1 0000100002
wname1 0000400007
yname1 0000000006
jname1 0000100001
mname1 0000500012
mname2 0000700013
In my jcl I have defined my sysin data as such:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,FD)
DATAEND
*
It works fine as long as I don't add the sum fields so I'm wondering if I'm using the wrong format for my numerics seeing as I know they start at field 11 and have a length of 10 the format is the only thing that could be wrong.
As you might have already realised the point of this JCL is to just list the values but grouped by the first letter of the name (so for the example data and JCL I have given it would group the numeric for mname1 and mname2 together but leave the other records untouched).
I'm kind of new at this so I was wonder what I need for the format if my numerics are like that in the input file.
If new to DFSORT, get hold of the DFSORT Getting Started guide for your version of DFSORT (http://www-01.ibm.com/support/docview.wss?uid=isg3T7000080).
This takes your through all the basic operations with many examples.
The DFSORT Application Programming Guide describes everything you need to know, in detail. Again with examples. Appendix C of that document contains all the data-types available (note, when you tried to use FD, FD is not valid data-type, so probably a typo). There are Tables throughout the document listing what data-types are available where, if there is a particular limit.
For advanced techniques, consult the DFSORT Smart Tricks publication here: http://www-01.ibm.com/support/docview.wss?uid=isg3T7000094
You need to understand a bit more the way data is stored on a Mainframe as well.
Decimals (which can be "packed-decimal" or "zoned-decimal") do not contain a decimal-point. The decimal-point is implied. In high-level languages you tell the compiler where the decimal-point is (in a fixed position) and the compiler does the alignments for you. In Assembler, you do everything yourself.
Decimals are 100% accurate, as there are machine-instructions which act directly on packed-decimal data giving packed-decimal results.
A field which actually contains a decimal-point, cannot be directly used in arithmetic.
An unsigned field is treated as positive when used in any arithmetic.
The SUM statement supports a limited number of numeric definitions, and you have chosen the correct one. It does not matter that your data is unsigned.
If the format of the output from SUM is not what you want, look at OPTION ZDPRINT (or NOZDPRINT).
If you want further formatting, you can use OUTREC or OUTFIL.
As an option to using SUM, you can use OUTFIL reporting functions (especially, although not limited to, if you want a report). You can use SECTIONS and TRAILER3 with TOT/TOTAL.
Something to watch for with SUM (which is not a problem with the reporting features) is if any given one (or more) of your SUMmed fields exceed the field size. To continue to use SUM if that happens, you need to extend the field in INREC and then get SUM to use the new, sufficient, size.
After some trial and error I finally found it, appearantly the format I needed to use was the ZD format (zoned decimal, signed), so my sysin becomes this:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,ZD)
DATAEND
*
even though my records don't contain any decimals and they are unsigned, I don't really get it so if someone knows why it's like that please go ahead and explain it to me.
For now the way I'm going to remember it is this: Z = symbol for real (meaning integers so no decimals)

Vectorize string concatenation in matlab

I am doing a project where I would like to vectorize (if possible), these lines of code in Matlab:
for j=1:length(image_feature(i,:))
string1b=strcat(num2str(j),':',num2str(image_feature(i,j)));
write_file1b=[write_file1b string1b ' '];
end
Basically, what I want to get is an output string in the following way:
1:Value1 2:Value2 3:Value3 ....
Note that ValueX is a number so a real example would be an output like this:
1:23.2 2:34.3 3:110.8
Is this possible? I was thinking about doing something like creating another vector with values from 1 to j, and another j-long vector with only ":", and a num2str(image_feature(i,:)), and then hopefully there is a function f (like a vectorized strcat) that if I do:
f(num2str(1:j),colon_vector,num2str(image_feature(i,:)))
will give me the output I mention above.
Im not sure I understood your question but perhaps this might help
val=[23.2 34.3 110.8]
output = [1:length(val); val]
sprintf('%i: %f ',output)
As output I get
1: 23.200000 2: 34.300000 3: 110.800000
You can vectorize all your array operations to create an array or matrix of numbers very efficiently, but by the very nature of strings in MATLAB, you cannot vectorize the creation of your output string. Theoretically, if you have a fixed length string like in C++, you could concurrently write to different memory locations within the string, but that is not something that is supported by MATLAB. Even if it were, it looks like you have variable-length numbers, so even that would be difficult (unless you were to allocate a specific amount of space per number pair, leading to variable-length spaces between number pairs. It doesn't look like you want to do that, since your examples have exactly one space between all number pairs).
If you'd be interested in efficiently creating a vector, the answer provided by twerdster would accomplish that, but even in that code, the sprintf statement is not concurrent. His code does get rid of the for-loop, which improves efficiency, so I prefer his code to yours.

Encoding name strings into an unique number

I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a lastname. I need to encode these names into a number that uniquely represents the names. The encoding should be one-one, that is a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each alphabet of the name according to its position in the alphabet set (a-> 1, b->2.. and so on) and so a name like Deepa would get -> 455161, but again here I cannot make out if the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the encoding should be such that the number of digits in the output numeral for any name should have fixed number of digits, i.e., it should be independent of the length. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
Some options:
Sort them. Count them. The 10th name is number 10.
Treat each character as a digit in a base 26 (case insensitive, no
digits) or 52 (case significant, no digits) or 36 (case insensitive
with digits) or 62 (case sensitive with digits) number. Compute the
value in an int. EG, for a name of "abc", you'd have 0 * 26^2 + 1 *
26^1 + 2 * 20^0. Sometimes Chinese names may use digits to indicate tonality.
Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
This one's mostly suggested in fun: use goedel numbering :). So
"abc" would be 2^0 * 3^1 * 5^2 - it's a product of powers of primes.
Factoring the number gives you back the characters. The numbers
could get quite large though.
Convert to ASCII, if you aren't already using it. Then treat each
ordinal of a character as a digit in a base-256 numbering system.
So "abc" is 0*256^2 + 1*256^1 + 2*256^0.
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need unicode at some point.
I believe you could do unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
What you are trying to do there is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try out sha1 for example, that one is well tested and available for modern languages (see http://en.wikipedia.org/wiki/Sha1) -- it seems to be good enough for git, so it might work for you.
There is of course a small possibility for identical hash values for two different names, but that's always the case with hashing and can be taken care of. With sha1 and such you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really want unique ids for sure, you will need to do something like NealB suggested, create IDs yourself and connect names and IDs in a Database (you could create them randomly and check for collisions or increment them, starting at 0000000000001 or so).
(improved answer after giving it some thought and reading the first comments)
You can use the BigInteger for encoding arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And for getting the string back use:
String str = new String(bi.toByteArray());
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
score = 0
depth = 1
for char in value:
score += (ord(char)) * depth
depth /= 256.
return score
If you are unfamiliar with Python, here's what it does.
The score is initially 0 and the depth are set to 1
For every character add the ord value * the depth
The ord function returns the UTF-8 value (0-255) for each character
Then it's multiplied by the 'depth'.
Finally the depth is divided by 256.
Essentially, the way that it works is that the initial characters add more to the score while later characters contribute less and less. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0-256. This encoding scheme works for binary data as well as there are only 256 possible values in a byte/char.
This method works great for smaller string values, however, for longer strings you will notice that the decimal value requires more precision than a regular double (64-bit) can provide. In Java, you can use the 'BigDecimal' and in Python use the 'decimal' module for added precision. A bonus to using this method is that the values returned are in sorted order so they can be searched 'efficiently'.
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it, if every character (plus blank, at least) will occupy a position.
Therefore ABC, which is 1,2,3 has to be translated to
1*(2*26+1)² + 2*(53) + 3
This way, you could encode arbitrary strings, but if the length of the input isn't limited (and how should it?), you aren't guaranteed to have an upper limit for the length.

Resources