extraneous digits formatting within dataframe - python-3.x

I am running into a formatting / precision issue which I'm hoping to control.
I obtain a list of numbers such as:
x = [0.009947, 0.009447, 0.008947]
The finished product I'm after is a DataFrame with a column whose value is this list but multiplied by 100 with 3 decimal places, e.g.
[0.995, 0.945, 0.895]
I proceed as follows:
x = 100*np.around([0.009947, 0.009447, 0.008947],5)
this displays as
array([0.995, 0.945, 0.895])
When I build the DataFrame:
pd.DataFrame({'test':[x]})
I get for the value in the 'test' column:
[0.9950000000000001, 0.9450000000000001, 0.895]
This does not happen in other examples and I'm not sure how to control the behavior. I'd appreciate any suggestions.

This is a general issue with how computers represent floating-point numbers; check this out from the docs.
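A minimal sketch of one way to control this, assuming the goal is three decimals after scaling: round last, after the multiplication by 100, so that no later arithmetic can reintroduce the extra digits. The variable names here are illustrative, not from the question.

```python
import numpy as np

raw = np.array([0.009947, 0.009447, 0.008947])

# Rounding first and scaling afterwards: the multiplication by 100 can
# reintroduce binary floating-point error (the question reports
# 0.9950000000000001 for the first value).
scaled_after_rounding = 100 * np.around(raw, 5)

# Scaling first and rounding last leaves each value at the double
# closest to 0.995, 0.945 and 0.895.
rounded_last = np.round(100 * raw, 3)

print(rounded_last.tolist())
```

The NumPy array only looked clean because NumPy prints arrays with 8 digits of precision by default; the stored values were the same ones the DataFrame showed in full.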

Related

Counting if part of string is within interval

I am currently trying to check if a number in a comma-separated string is within a number interval. What I am trying to do is to check if an area code (from the comma-separated string) is within the interval of an area.
The data:
AREAS
Area interval | Name   | Number of locations
1000-1499     | Area 1 | ?
1500-1799     | Area 2 | ?
1800-1999     | Area 3 | ?
GEOLOCATIONS
Name       | Areas List
Location A | 1200, 1400
Location B | 1020, 1720
Location C | 1700, 1920
Location D | 1940, 1950, 1730
The result I want here is the number of unique locations whose "Areas List" contains a value within the area interval. So Location D should only count ONCE in the 1800-1999 interval, and likewise Location A only once in the 1000-1499 interval. But Location B should count as one in both 1000-1499 and 1500-1799 (because a number from each interval is in the comma-separated string in "Areas List"):
Area interval | Name   | Number of locations
1000-1499     | Area 1 | 2
1500-1799     | Area 2 | 3
1800-1999     | Area 3 | 2
How is this possible?
I have tried with COUNTIFS, but it doesn't seem to do the job.
Here is one option using FILTERXML():
Formula in C2:
=SUM(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>","//t[count(.//*[.>="&SUBSTITUTE(A2,"-","][.<=")&"])>0]"))
Where:
"<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>" - This is the part where we construct a valid piece of XML. The idea is to use three levels of nodes. Each t-node is given a literal 1 as its value so that, once the matching t-nodes are returned with XPath, we can sum the result. The outer x-node is there to make sure Excel handles the inner nodes correctly. If you are curious what the XML looks like in the end, it's best to step through using the 'Evaluate Formula' function on the Formulas tab;
//t[count(.//*[.>="&SUBSTITUTE(A2,"-","][.<=")&"])>0]")) - This basically means that we collect all t-nodes where the count of child s-nodes that are >= the leftmost number and <= the rightmost number is larger than zero. For A2 the XPath looks like //t[count(.//*[.>=1000][.<=1499])>0] after substitution. In short: //t selects the t-nodes for which count(.//*[.>=1000][.<=1499]), the number of child nodes that fulfill both requirements, is larger than zero;
Since all t-nodes equal the number 1, the SUM() of these t-nodes equals the number of unique locations that have at least one area within the interval in their Areas List;
It is important to note that FILTERXML() will result in an error if no t-nodes are found. In that case we need to wrap the FILTERXML() in an IFERROR(...., 0) so that the SUM() still works correctly.
Or, wrap the above in BYROW():
Formula in C2:
=BYROW(A2:A4,LAMBDA(a,SUM(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>","//t[count(.//*[.>="&SUBSTITUTE(a,"-","][.<=")&"])>0]"))))
Using MMULT and TEXTSPLIT:
=LET(rng,TEXTSPLIT(D2,"-"),
tarr,IFERROR(--TRIM(TEXTSPLIT(TEXTJOIN(";",,$B$2:$B$5),",",";")),0),
SUM(--(MMULT((tarr>=--TAKE(rng,,1))*(tarr<=--TAKE(rng,,-1)),SEQUENCE(COLUMNS(tarr),,1,0))>0)))
I am in very distinguished company but will add my version anyway, as BYROW is probably a slightly different approach.
=LET(range,B$2:B$5,
lowerLimit,--INDEX(TEXTSPLIT(E2,"-"),1),
upperLimit,--INDEX(TEXTSPLIT(E2,"-"),2),
counts,BYROW(range,LAMBDA(r,SUM((--TEXTSPLIT(r,",")>=lowerLimit)*(--TEXTSPLIT(r,",")<=upperLimit)))),
SUM(--(counts>0))
)
Here is the ugly way to do it, with A LOT of helper columns. But it's not so complicated 🙂
F4= =TRANSPOSE(FILTERXML("<m><r>"&SUBSTITUTE(B4;",";"</r><r>")&"</r></m>";"//r"))
F11= =TRANSPOSE(FILTERXML("<m><r>"&SUBSTITUTE(A11;"-";"</r><r>")&"</r></m>";"//r"))
F16= =SUM(F18:F21)
F18= =IF(SUM(($F4:$O4>=$F$11)*($F4:$O4<=$G$11))>0;1;"")
G18= =IF(SUM(($F4:$O4>=$F$12)*($F4:$O4<=$G$12))>0;1;"")
H18= =IF(SUM(($F4:$O4>=$F$13)*($F4:$O4<=$G$13))>0;1;"")
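The counting rule that the formulas in this thread implement can be cross-checked outside Excel. The following is a rough Python sketch, not a replacement for the formulas; the dictionary layout and the helper name `count_locations` are assumptions made for illustration, while the data comes from the question.

```python
# Area intervals and location lists copied from the question's tables.
areas = {
    "Area 1": (1000, 1499),
    "Area 2": (1500, 1799),
    "Area 3": (1800, 1999),
}
locations = {
    "Location A": "1200, 1400",
    "Location B": "1020, 1720",
    "Location C": "1700, 1920",
    "Location D": "1940, 1950, 1730",
}


def count_locations(low, high):
    """Count locations with at least one area code in [low, high].

    any() makes each location count at most once per interval, no matter
    how many of its codes fall inside it.
    """
    return sum(
        any(low <= int(code) <= high for code in codes.split(","))
        for codes in locations.values()
    )


counts = {name: count_locations(low, high) for name, (low, high) in areas.items()}
print(counts)  # {'Area 1': 2, 'Area 2': 3, 'Area 3': 2}
```

The result matches the expected table in the question.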

How to add trailing zeros to a pandas dataframe column?

I have a pandas dataframe column that I would like to be formatted to two decimal places.
Such that:
10.1
Appears as:
10.10
How can I do that? I have already tried rounding.
This can be accomplished by mapping a string format object to the column of floats:
df.colName.map('{:.2f}'.format)
(Credit to exp1orer)
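A small sketch of the mapping approach on made-up data (the DataFrame contents are hypothetical; `colName` is the placeholder column name used above):

```python
import pandas as pd

df = pd.DataFrame({"colName": [10.1, 2.5, 3.14159]})

# map() applies the bound str.format method to every value,
# producing strings with exactly two decimals.
formatted = df["colName"].map("{:.2f}".format)

print(formatted.tolist())  # ['10.10', '2.50', '3.14']
```

Note that the result holds strings, not floats, so it is best applied as a final presentation step after any calculations.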
You can use:
pd.options.display.float_format = '{:,.2f}'.format
Note that this only changes how values are displayed (two decimals), and it applies to every float in all of your dataframes.
To go back to normal:
pd.reset_option('display.float_format')
From pyformat.info
Padding numbers
For floating points the padding value represents the length of the complete output. In the example below we want our output to have at least 6 characters with 2 after the decimal point.
'{:06.2f}'.format(3.141592653589793)
The :06 is the minimum length of your output (zero-padded), regardless of how many digits are in your input. The .2 indicates you want 2 places after the decimal point. The f indicates you want a float output.
Output
003.14
If you are using Python 3.6 or later you can use f strings. Check out this other answer: https://stackoverflow.com/a/45310389/12229158
>>> a = 10.1234
>>> f'{a:.2f}'
'10.12'

Python Array rounding

unable to get one column of an array with only 1 decimal
Hi. I am a newbie in Python 3.7 so I apologize beforehand. I have two arrays, both with a shape of (1000,1) (Predict and Target). The first is a list of predicted values which I rounded to 1 decimal place, and I verified that the numbers have only 1 decimal. When I concatenate both arrays into one to compare the two, the first column (the one from Predict) reverts to its original format and shows 8 decimals.
predict = model.predict_on_batch(training_data['inputs']).round(1)
targets = training_data['targets'].round(1)
test=np.concatenate((predict,targets),axis = 1)
I get the following
array([[-5.69999981, -6.1 ],
[ 5.30000019, 5.5 ],
[37.59999847, 37.4 ],
I try adding
np.round(test[[0]],1)
but that doesn't work. Can anyone guide me on how to round the first column to just 1 decimal place?
Many thanks
I think you are selecting just the first row. To select the first column, you have to index it as below.
np.round(test[:,0],1)
Let me know if it worked
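To make the row-versus-column difference concrete, here is a short sketch using the three rows shown in the question as stand-in data (the real array has 1000 rows):

```python
import numpy as np

test = np.array([[-5.69999981, -6.1],
                 [5.30000019, 5.5],
                 [37.59999847, 37.4]])

first_row = test[[0]]    # shape (1, 2): the first ROW, kept two-dimensional
first_col = test[:, 0]   # shape (3,): the first COLUMN

# np.round returns a new array, so assign the result back
# if the rounded values should replace the originals.
test[:, 0] = np.round(test[:, 0], 1)

print(test[:, 0].tolist())  # [-5.7, 5.3, 37.6]
```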

Excel: How to find closest number in table, many times

Excel
Need to find nearest float in a table, for each integer 0..99
https://www.excel-easy.com/examples/closest-match.html explains a great technique for finding the CLOSEST number from an array to a constant cell.
I need to perform this for many values (specifically, find nearest to a vertical list of integers 0..99 from within a list of floats).
Array formulas don't allow the compare-to value (integers) to change as we move down the list of integers, it treats it like a constant location.
I tried Tables, referring to the integers (works) but the formula from the above web site requires an Array operation (F2, control shift Enter), which are not permitted in Tables. Correction: You can enter the formula, control-enter the array function for one cell, copy the formulas, then insert table. Don't change the search cell reference!
Update:
I can still use array operations, but I have to manually copy the desired function into each of the 100 target cells. No biggie.
Fixed typo in formula. See end of question for details about "perfection".
Example code:
AI4=some integer
AJ4=MATCH(MIN(ABS(Table[float_column]-AI4)), ABS(Table[float_column]-AI4), 0)
repeat for subsequent integers in AI5...AI103
Example data:
0.1 <= matches 0
0.5
0.95 <= matches 1
1.51 <= matches 2
2.89
Consider the case where target=5, and 4.5, 5.5 exist in the list. One gives -0.5 and the other +0.5. Searching for ABS(-.5) will give the first one. Either one is decent, unless your data is non-monotonic.
This still needs a better solution.
Thanks in advance!
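For comparison, the MATCH(MIN(ABS(...))) lookup can be sketched in Python. This is only an illustration of the logic, using the example data above; the function name `closest_position` is made up for this sketch.

```python
floats = [0.1, 0.5, 0.95, 1.51, 2.89]
targets = [0, 1, 2]


def closest_position(values, target):
    """Return the 1-based position of the value nearest to target.

    min() keeps the first minimum it encounters, so a tie resolves to
    the earlier entry, the same way MATCH(..., 0) returns the first hit.
    """
    index = min(range(len(values)), key=lambda i: abs(values[i] - target))
    return index + 1


positions = [closest_position(floats, t) for t in targets]
print(positions)  # [1, 3, 4]
```

The positions correspond to 0.1, 0.95 and 1.51, as marked in the example data.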
I had another problem, which pushed me to a better solution.
Specifically, since the Y values for the X that I am interested in can be at varying distances in X, I will interpolate X between the X points before and after. I.e., search for less than or equal, also greater than or equal, interpolate the desired X, then interpolate the Y values.
I could go a step further and interpolate N - 1 to N + 1, which will give cleaner results for noisy data.
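As a rough sketch of the interpolation idea, assuming the X values are sorted in increasing order: NumPy's `np.interp` does linear interpolation between the neighbouring points on either side of each target. The Y values below are hypothetical, chosen only so the result is easy to check.

```python
import numpy as np

xs = np.array([0.1, 0.5, 0.95, 1.51, 2.89])   # X values from the example data
ys = np.array([1.0, 5.0, 9.5, 15.1, 28.9])    # hypothetical Y values (Y = 10 * X)
targets = np.array([0.0, 1.0, 2.0])           # integer X positions of interest

# Linear interpolation between the points just below and just above
# each target. Targets outside the data range are clamped to the end
# values, so 0.0 simply returns ys[0].
interpolated = np.interp(targets, xs, ys)

print(interpolated)
```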

VBA rounding problem

I have this obscure rounding problem in VBA.
a = 61048.4599674847
b = 154553063.208822
c = a + b
debug.print c
Result:
154614111.66879
Here is the question: why did VBA round off variable c? I didn't call any rounding function. The value I was expecting was 154614111.6687894847. Even if I round or format variable c to 15 decimal places I still don't get my expected result.
Any explanation would be appreciated.
Edit:
Got the expected results using cDec. I have read this in Jonathan Allen's reply in Why does CLng produce different results?
Here is the result to the test:
a = cDec(61048.4599674847)
b = cDec(154553063.208822)
c = a + b
?c
154614111.6687894847
The reason is the limited precision that can be stored in a floating point variable.
For a complete explanation you should read the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March, 1991 issue of Computing Surveys.
In VBA the default floating point type is Double, which is an IEEE 64-bit (8-byte) floating-point number.
There is another type available: Decimal, which is a 96-bit (12-byte) signed integer scaled by a variable power of 10.
Put simply, this provides floating point numbers with 28-digit precision.
To use in your example:
a = CDec(61048.4599674847)
b = CDec(154553063.208822)
c = a + b
debug.print c
Result:
154614111.6687894847
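The same effect is not specific to VBA. As an illustration only (Python is used here for the sketch; it is not part of the original answers), a binary double and a decimal type behave the same way on these two numbers:

```python
from decimal import Decimal

# Binary doubles carry roughly 15-17 significant decimal digits,
# so the trailing digits of the exact sum cannot be stored.
float_sum = 61048.4599674847 + 154553063.208822

# Decimal values built from strings keep every digit that was typed in.
decimal_sum = Decimal("61048.4599674847") + Decimal("154553063.208822")

print(float_sum)
print(decimal_sum)  # 154614111.6687894847
```

The Decimal values are created from strings rather than from floats so that they do not inherit the binary rounding of the float literals.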
It's not obscure, but it's not necessarily obvious.
I think you've sort of answered it, but the basic problem is one of the "size" of the values, that is, how much data can be stored in a variable of a given type.
If (and this is very crude) you count the number of digits in each of the numbers in your first example you will see that you have 15, so whilst the range of values that a float (the default type) can represent is huge, the precision is limited to 15 digits (I'm sure someone will be along to correct this, I'll tick the wiki box...)
So when you add the two numbers together it loses the least significant digits in order to remain within the allowable precision for a float.
By doing a CDec you're converting to a different type of variable (Decimal) that is capable of greater precision.
