How to handle rounding of numbers in systems - decimal

We can take Java as a perspective. Suppose we have a system that has items with a price. The price goes through several operations, let's say 15: the items' prices are divided, multiplied, summed, and subtracted with decimals over and over. Now let's say that our system talks to another system. That other system also performs operations on item prices. In the end, the price values of the two systems have to match exactly, down to the cent. We are hypothetically talking about accounting systems. In my experience, the chance of the two matching is very small. How can we handle such a situation? Is there a rule for rounding?

I'd say always calculate with the raw numbers (i.e. a lot of decimals) and transfer that number as well. Only for the comparison, round to a previously agreed-upon, less precise degree of precision (which is still precise enough for the purpose). That way you have matching results while maintaining high enough precision.
A factor to consider is the rounding method, in case the two sides differ. There are three I know about (a short sketch follows the list):
Half up, the one taught in school: 0.5 --> 1.0
Half towards zero: 1.5 --> 1.0 but -2.5 --> -2.0
Half to even, also called "banker's rounding": 1.5 --> 2.0 but 2.5 --> 2.0
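Here is a minimal sketch of the three modes using Python's decimal module (the question mentions Java, where java.math.RoundingMode HALF_UP, HALF_DOWN and HALF_EVEN are the direct equivalents):
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_DOWN, ROUND_HALF_EVEN

for mode in (ROUND_HALF_UP, ROUND_HALF_DOWN, ROUND_HALF_EVEN):
    # Round 0.5, 1.5, 2.5 and -2.5 to whole units under each tie-breaking rule.
    rounded = [Decimal(x).quantize(Decimal("1"), rounding=mode)
               for x in ("0.5", "1.5", "2.5", "-2.5")]
    print(mode, *rounded)

# ROUND_HALF_UP 1 2 3 -3      (school rounding: ties away from zero)
# ROUND_HALF_DOWN 0 1 2 -2    (ties toward zero)
# ROUND_HALF_EVEN 0 2 2 -2    (banker's rounding: ties to the even neighbour)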

Related

How can I better optimize a search in possible Fantasyland constructions in Pineapple poker?

So, a bit of explanation to preface the question. In variants of Open Face Chinese poker, you are dealt cards one at a time, which are to be placed into three different rows; the goal is to make each row successively better, and of course to get the best hands possible. A difference from normal poker is that the top row only contains three cards, so three of a kind is the best possible hand you can get there. In a variant of this called Pineapple, which is what I'm working on a bot for, you are dealt three cards at a time after the initial 5, and you discard one of those three cards each round.
Now, there's a special rule called Fantasyland: if you get a pair of queens or better in the top row, and still manage to get successively better hands in the middle and bottom rows, your next round becomes a Fantasyland round. This is a round where you are dealt 15 cards at once and are free to construct the best three rows possible (rows of 3, 5, and 5 cards, discarding 2 of them). Each row yields a certain number of points (royalties, as they're called) depending on which hand is constructed, and each successive row needs better and better hands to yield the same amount of points.
Trying to optimize solutions for this seemed like a natural starting point, and one of the most interesting parts as well, so I started working on it. My first attempt, which is also where I'm stuck, was to use Simulated Annealing to do local search optimization. The energy/evaluation function is the number of points, and at first I tried a move/neighbor function that simply swaps two cards at random, having first placed them as they were drawn. This worked decently, managing a mean of around 6 points per hand, which isn't bad, but I often noticed that I could spot better solutions by swapping more than one pair of cards at the same time. So I changed the move/neighbor function to swap several pairs of cards at once, and also tried swapping a random number of pairs (between 1 and 3, and later up to 5), which yielded slightly better results, but I still often spot better solutions just by looking.
If anyone is reading this and understands the problem, any idea on how to better optimize this search? Should I use a different move/neighbor function, different Annealing parameters, or perhaps a different local search method, or even some kind of non-local search? All ideas are welcome and deeply appreciated.
You haven't indicated a performance requirement, so I'll assume that this should work quickly enough to be usable in a game with human players. It can't take an hour to find the solution, but you don't need it in a millisecond, either.
I'm wondering whether simulated annealing is the right method. This might be an opportunity for brute force.
One can make a very fast algorithm for evaluating poker hands. Consider an encoding of the cards where 13 bits encode the card value and 4 bits encode the suit. OR together the cards in the hand and you can quickly identify pairs, triples, straights, and flushes.
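A rough sketch of that encoding idea (the card-string format and helper names below are my own, just for illustration):
RANKS = "23456789TJQKA"
SUITS = "cdhs"

def encode(card):
    # e.g. "Ah" -> one of 13 rank bits (0-12) plus one of 4 suit bits (13-16)
    return (1 << RANKS.index(card[0])) | (1 << (13 + SUITS.index(card[1])))

def rank_bits(cards):
    # OR the rank bits together: five consecutive set bits indicate a straight
    # (the ace-low straight needs a special check), and fewer set bits than
    # cards means duplicated ranks (pairs, trips, ...).
    mask = 0
    for c in cards:
        mask |= encode(c) & 0x1FFF
    return mask

def is_flush(cards):
    # OR the full encodings: a flush leaves exactly one of the four suit bits set.
    mask = 0
    for c in cards:
        mask |= encode(c)
    suits = mask >> 13
    return suits & (suits - 1) == 0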
At first glance, there would seem to be 15! (1,307,674,368,000) possible arrangements of all the cards which are dealt, but there are symmetries and restrictions that reduce the meaningful combinations and limit the space that must be explored.
One important constraint is that the bottom row must have a higher score than the middle row and that the middle row must have a higher score than the top row.
There are 3003 sets of bottom cards: COMBINATIONS(15 cards, 5 at a time) = (15!)/(5!(15-5)!) = 3003. For each set of possible bottom cards, there are COMBINATIONS(10 cards, 5 at a time) = (10!)/(5!(10-5)!) = 252 sets of middle cards. The top row has COMBINATIONS(5 cards, 3 at a time) = (5!)/(3!(5-3)!) = 10. With no further optimization, a brute force approach would require evaluating 3003*252*10 = 7,567,560 positions. I suspect that this can be evaluated within an acceptable response time.
A further optimization uses the constraint that each row must be worth less than the row below. If the middle row is worth more than the bottom row, the top row can be ignored by pruning the tree at that point, which removes a factor of 10 for those cases.
Also, since the bottom row must be worth more than the middle and top rows, there may be some minimum score the bottom row must achieve before it is worth trying middle rows. Rejecting a bottom row prunes 2,520 cases from the tree.
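Here is a minimal sketch of that pruned brute force, assuming hypothetical evaluate() and royalties() helpers for hand strength and point scoring (neither is specified above):
from itertools import combinations

def best_arrangement(cards, evaluate, royalties):
    # cards: the 15 dealt cards; evaluate(row) returns a comparable hand strength;
    # royalties(bottom, middle, top) returns the points for a legal arrangement.
    # The two cards not placed in any row are implicitly the discards.
    best, best_points = None, -1
    for bottom in combinations(cards, 5):
        rest = [c for c in cards if c not in bottom]
        bottom_strength = evaluate(bottom)
        for middle in combinations(rest, 5):
            if evaluate(middle) > bottom_strength:
                continue  # prune: middle may not outrank bottom
            remaining = [c for c in rest if c not in middle]
            for top in combinations(remaining, 3):
                if evaluate(top) > evaluate(middle):
                    continue  # prune: top may not outrank middle
                points = royalties(bottom, middle, top)
                if points > best_points:
                    best, best_points = (bottom, middle, top), points
    return best, best_points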
I understand that there is a way to use simulated annealing for estimating solutions for discrete problems. My use of simulated annealing has been limited to continuous problems with edge constraints. I don't have a good intuition for how to apply SA to discrete problems. Many discrete problems lend themselves to an exhaustive search, provided the search space can be trimmed by exploiting symmetries and constraints in the particular problem.
I'd love to know the solution you choose and your results.

How to deal with violation of proportional hazards assumption in Cox PH, R 3.1.3 survfit

I'm performing survival analysis in R using the 'survival' package and coxph. My goal is to compare survival between individuals with different chronic diseases. My data are structured like this:
id, time, event, disease, age.at.dx
1, 342, 0, A, 8247
2, 2684, 1, B, 3879
3, 7634, 1, A, 3847
where 'time' is the number of days from diagnosis to event, 'event' is 1 if the subject died, 0 if censored, 'disease' is a factor with 8 levels, and 'age.at.dx' is the age in days when the subject was first diagnosed. I am new to using survival analysis. Looking at the cox.zph output for a model like this:
combi.age<-coxph(Surv(time,event)~disease+age.at.dx,data=combi)
Two of the disease levels violate the PH assumption, having p-values <0.05. Plotting the Schoenfeld residuals over time shows that for one disease the hazard falls steadily over time, and with the second, the line is predominantly parallel, but with a small upswing at the extreme left of the graph.
My question is how to deal with these disease levels? I'm aware from my reading that I should attempt to add a time interaction to the disease whose hazard drops steadily, but I'm unsure how to do this, given that most examples of coxph I've come across only compare two groups, whereas I am comparing 8. Also, can I safely ignore the assumption violation of the disease level with the high hazard at early time points?
I wonder whether this is an inappropriate way to structure my data, because it does not preclude a single individual appearing multiple times in the data - is this a problem?
Thanks for any help, please let me know if more information is needed to answer these questions.
I'd say you have a fairly good understanding of the data already and should present what you found. This sounds like a descriptive study rather than one where you will be presenting to the FDA with a request to honor your p-values. Since your audience will (or should) be expecting that the time-course of risk for different diseases will be heterogeneous, I think you can just describe these results and discuss the biological/medical reasons why the first "non-conformist" disease becomes less important with time while the other non-conforming condition might become more potent over time. You have already done a more thorough analysis than most descriptive articles in the medical literature exhibit; I rarely see a description of the nature of non-proportionality.
The last question, regarding the data structure not precluding a single individual from appearing multiple times, may require some more thorough discussion. The first approach would be to account for the within-subject correlation by adding the patient ID to the model with the cluster() function.

Determine coefficients for some function

I have a task that is probably related to data analysis or even neural networks.
We have a data source from one of our partners, a job portal. The source values are arrays of different attributes related to a particular employee:
His/her gender,
Age,
Years of experience,
Portfolio (number of the projects done),
Profession and specialization (web design, web programming, management etc.),
Many others (around 20-30 in total)
Every employee has their own (hourly) salary rate. So, mathematically, we have some function
F(attr1, attr2, attr3, ...) = A*attr1 + B*attr2 + C*attr3 + ...
with unknown coefficients. But we know the result of the function for specified arguments (let's say, we know that a male programmer with 20 years of experience and 10 works in his portfolio has a rate of $40 per hour).
So we have to find somehow these coefficients (A, B, C...), so we can predict the salary of any employee. This is the most important goal.
Another goal is to find which arguments are most important - in other words, which of them cause significant changes to the result of the function. So in the end we have to have something like this: "The most important attributes are years of experience; then portfolio; then age etc.".
There may be a situation when different professions vary too much from each other - for example, we simply may not be able to compare web designers with managers. In this case, we have to split them by groups and calculate these ratings for every group separately. But in the end we need to find 'shared' arguments that will be common for every group.
I'm thinking about neural networks because it's something they may deal with. But I'm completely new to them and have totally no idea what to do.
I'd very much appreciate any help - which instruments to use, what algorithms, or even pseudo-code samples, etc.
Thank you very much.
That is the most basic example of (linear) regression. You are using a linear function to model your data, and need to estimate the parameters.
Note that this is actually part of classical mathematical statistics, not data mining; it is much, much older.
There are various methods. Given that there likely will be outliers, I would suggest to use RANSAC.
As for the importance, doesn't this boil down to "which is largest, A, B, or C"?
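For a minimal sketch of the plain least-squares version, with synthetic placeholder data (scikit-learn's RANSACRegressor would be one way to get the outlier-robust fit suggested above):
import numpy as np

# Synthetic stand-in data: one row per employee, one column per attribute
# (e.g. age, years of experience, portfolio size), plus an intercept column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = np.hstack([X, np.ones((200, 1))])
true_coef = np.array([0.5, 3.0, 1.5, 20.0])           # the "unknown" A, B, C and intercept
y = X @ true_coef + rng.normal(scale=2.0, size=200)   # observed hourly rates

# Ordinary least squares estimate of the coefficients.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)

# For "importance", fit on standardized columns (zero mean, unit variance)
# first, so that coefficient magnitudes are comparable across attributes.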

Rounding Standards - Financial Calculations

I am curious about the existence of any "rounding standards" when it comes to the calculation of financial data. My initial thought is to perform rounding only when the data is being presented to the user (the presentation layer).
If "rounded" data is then used for further calculations, should we use the "rounded" figure or the "raw" figure? Does anyone have any advice?
Please note that I am aware of different rounding methods, i.e. Bankers Rounding etc.
The first and most important rule: use a decimal data type, never ever binary floating-point types.
When exactly rounding should be performed can be mandated by regulations, as it was for the conversion between the Euro and the national currencies it replaced.
If there are no such rules, I'd do all calculations with high precision, and round only for presentation, i.e. not use rounded values for further calculations. This should yield the best overall precision.
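To illustrate the first rule, here is a quick comparison of binary floating point and a decimal type in Python (the amounts are arbitrary):
from decimal import Decimal

# Binary floating point cannot represent 0.10 or 0.30 exactly.
print(0.10 + 0.20 == 0.30)                                   # False
print(0.10 + 0.20)                                           # 0.30000000000000004

# A decimal type keeps the cents exact.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True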
I just asked a greybeard mainframe programmer at the financial software company I work for, and he said there is no well-known standard and it's up to programmer practice.
While statisticians have been aware of the rounding issue since at least 1906, it's difficult to find a financial standard endorsing it.
According to this site, the "European Commission report The Introduction of the Euro and the Rounding of Currency Amounts suggests that there had previously been no standard approach to rounding in banking."
In general, use a symmetric rounding mode no matter what base you are working in (base-2 or base-10).
This will avoid systematic bias during calculations.
Such a mode is Round-Half-To-Even, otherwise known as "bankers rounding".
Use language tools that allow you to specify the numeric context explicitly, including the rounding and truncation modes. For example, Python's decimal module. The implicit assumptions made by the C library might not be appropriate for your computations.
http://en.wikipedia.org/wiki/Rounding#Rounding_to_integer
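For instance, a small sketch with Python's decimal module, setting the rounding mode explicitly in a local context (the 2.675 price and quantity of three are made-up values):
from decimal import Decimal, localcontext, ROUND_HALF_EVEN

with localcontext() as ctx:
    ctx.rounding = ROUND_HALF_EVEN          # banker's rounding for this block
    total = (Decimal("2.675") * 3).quantize(Decimal("0.01"))

print(total)   # 8.02 -- the 8.025 tie goes to the even cent (half-up would give 8.03)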
It's frustrating that there aren't clear standards on this, both to guide the programmer, and as a defense in court. Just doing "regular" rounding toward nearest for payroll can lead to underpayment by a few pennies on a paycheck here and there, which is something labor lawyers eat up like crack.
Though a base pay rate may well only be specified to two decimal places ("You're hired at $22.71/hour"), things like blended overtime (determined by averaging multiple pay rates in a period) can end up with an effective hourly rate of $23.37183475/hr.
How do you pay overtime on that?
15 hours x 23.37183475 x 1.5 = $525.87 rounded from $525.86628187
15 hours x 23.37 x 1.5 = $525.82
WHY DID YOU STEAL FIVE CENTS FROM MY CLIENT? Sadly, I'm not joking about this.
This gets even more uncomfortable when you calculate at the full precision value but display a truncated version: you do the first calculation above, but only display $23.37 for the rate on the pay stub.
Now the pay stub calculations don't tie out to the penny, and you have to explain it; even if the difference is in the employee's favor, it can be enough for a labor lawyer to smell blood in the water and start looking for other stuff.
One approach is to always round in favor of the employee, not in the natural direction, so there cannot ever be an accusation of systematic wage theft.
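The two figures above can be reproduced with a decimal type; note that applying banker's rounding to the truncated-rate case is my assumption (half-up would give $525.83 instead):
from decimal import Decimal, ROUND_HALF_EVEN

rate = Decimal("23.37183475")                  # full-precision blended rate
cent = Decimal("0.01")

full = (Decimal(15) * rate * Decimal("1.5")).quantize(cent, ROUND_HALF_EVEN)
displayed = (Decimal(15) * rate.quantize(cent) * Decimal("1.5")).quantize(cent, ROUND_HALF_EVEN)

print(full, displayed, full - displayed)       # 525.87 525.82 0.05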
I've not seen "the one standard to rule them all" - there are any number of rounding rules (as you have referenced), and they seem to come into play based on industry, customer, and currency code (http://en.wikipedia.org/wiki/ISO_4217). Since not everyone uses 2 places after the decimal, the problem becomes even more complicated. At the end of the day, your customer needs to specify the rules they want to implement...
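As a small sketch of why the currency code matters, rounding to each currency's ISO 4217 minor unit (the table is a tiny illustrative subset, and the banker's rounding choice is an assumption):
from decimal import Decimal, ROUND_HALF_EVEN

MINOR_UNITS = {"USD": 2, "JPY": 0, "BHD": 3}     # digits after the decimal point

def round_for_currency(amount, currency):
    exponent = Decimal(1).scaleb(-MINOR_UNITS[currency])   # 0.01, 1, 0.001, ...
    return amount.quantize(exponent, rounding=ROUND_HALF_EVEN)

print(round_for_currency(Decimal("1234.5678"), "USD"))   # 1234.57
print(round_for_currency(Decimal("1234.5678"), "JPY"))   # 1235
print(round_for_currency(Decimal("1234.5678"), "BHD"))   # 1234.568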
Consider using scaled integers.
In other words, store whole numbers of pennies instead of fractional numbers of dollars.
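For example, a sketch of the scaled-integer idea (the 8.25% tax rate and the truncation choice are purely illustrative):
# Keep money as whole cents in integers; scale rates so the arithmetic stays integral.
price_cents = 1999                         # $19.99
tax_cents = price_cents * 825 // 10_000    # 8.25% tax, truncated toward zero
total_cents = price_cents + tax_cents

print(f"${total_cents // 100}.{total_cents % 100:02d}")   # $21.63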

How was non-decimal money represented in software?

A lot of the answers to the questions about the accuracy of float and double recommend the use of decimal for monetary amounts. This works because today all currencies are decimal except MGA and MRO, and those have subunits of 1/5 so are still decimal-friendly.
But what about the software used in U.S. stock markets when prices were in 1/16ths of dollar? The accuracy of binary data types wouldn't have been an issue, right?
Going further back, how did pre-1971 British accounting software deal with pounds, shillings, and pence? Did their versions of COBOL have a special PIC clause for it? Were all amounts stored in pence? How was decimalisation handled?
PL/I had a type specifically for British currency - I don't know about COBOL. The British currency at one time incorporated farthings, or a quarter of a penny; I'm not sure though that computers had to deal with those, just with half pennies or ha'pennies.
Accurate accounting usually uses special types that represent decimals exactly. The 2008 revision of IEEE 754 added support for decimal floating point, and some chips (notably IBM pSeries) have such support in hardware.
COBOL could do it, e.g. PICTURE 9(4)D88D6 DISPLAY-ST; see http://www.computinghistory.org.uk/downloads/10924, page 117.
1/16 can be represented in four digits as .0625. For fractions of that type you just add some additional decimal places.
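As a sketch of the "store everything in pence" idea the question raises (an illustration, not a claim about how historical COBOL or PL/I systems actually did it):
# Pre-decimal British currency: 1 pound = 20 shillings = 240 pence, 1 shilling = 12 pence.
def to_pence(pounds, shillings, pence):
    return pounds * 240 + shillings * 12 + pence

def to_lsd(total_pence):
    pounds, rest = divmod(total_pence, 240)
    shillings, pence = divmod(rest, 12)
    return pounds, shillings, pence

# 3 pounds 7s 6d plus 15s 9d comes to 4 pounds 3s 3d.
print(to_lsd(to_pence(3, 7, 6) + to_pence(0, 15, 9)))   # (4, 3, 3)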

Resources