Decimal Point Normalization in Python - python-3.x

I am trying to apply normalization to my data and I have tried the Conventional scaling techniques using sklearn packages readily available for this kind of requirement. However, I am looking to implement something called Decimal scaling.
I read about it in this research paper and looks like a technique which can improve results of a neural network regression. As per my understanding, this is what I believe needs to be done -
Suppose the range of attribute X is −4856 to 28. The maximum absolute value of X is 4856.
To normalize by decimal scaling I will need to divide each value by 10000 (c = 4). In this case, −4856 becomes −0.4856 while 28 becomes 0.0028.
So for all values: new value = old value/ 10^c
How can I reproduce this as a function in Python so as to normalize all the features(column by column) in my data set?
Input:
A B C
30 90 75
56 168 140
28 84 70
369 1107 922.5
485 1455 1212.5
4856 14568 12140
40 120 100
56 168 140
45 135 112.5
78 234 195
899 2697 2247.5
Output:
A B C
0.003 0.0009 0.0075
0.0056 0.00168 0.014
0.0028 0.00084 0.007
0.0369 0.01107 0.09225
0.0485 0.01455 0.12125
0.4856 0.14568 1.214
0.004 0.0012 0.01
0.0056 0.00168 0.014
0.0045 0.00135 0.01125
0.0078 0.00234 0.0195
0.0899 0.02697 0.22475

Thank you guys for asking questions which led me to think about my problem more clearly and break it into steps. I have arrived to a solution. Here's how my solution looks like:
def Dec_scale(df):
for x in df:
p = df[x].max()
q = len(str(abs(p)))
df[x] = df[x]/10**q
I hope this solution looks agreeable!

def decimal_scaling (df):
df_abs = abs(df)
max_valus= df_abs.max()
log_num=[]
for i in range(max_valus.shape[0]):
log_num.append(int(math.log10(max_valus[i]))+1)
log_num = np.array(log_num)
log_num = [pow(10, number) for number in log_num]
X_full =df/log_num
return X_full

Related

Calculating AUDPC Using Spotfire

I was following this question to address a similar situation:
How to Calculate Area Under the Curve in Spotfire?
My data is in the following format:
PLANT
OBS_DATE_RECORDED
TRAIT_VALUE
period
A
3/16/2021
225
A3/16/2021
A
3/23/2021
227
A3/23/2021
A
3/30/2021
220
A3/30/2021
A
4/7/2021
240
A4/7/2021
A
4/13/2021
197
A4/13/2021
A
4/20/2021
197
A4/20/2021
A
4/27/2021
218
A4/27/2021
B
3/16/2021
253
B3/16/2021
B
3/23/2021
274
B3/23/2021
B
3/30/2021
271
B3/30/2021
B
4/7/2021
257
B4/7/2021
B
4/13/2021
250
B4/13/2021
A
4/20/2021
241
A4/20/2021
B
4/27/2021
255
B4/27/2021
Following the answer's formula as a calculated column:
([TRAIT_VALUE] + Avg([TRAIT_VALUE]) over (Intersect(NextPeriod([period]),[PLANT]))) / 2 * (Avg([OBS_DATE_RECORDED]) over (Intersect(NextPeriod([period]),[PLANT])) - [OBS_DATE_RECORDED])
However, the results don't appear correct.
AUDPC
1603.19:59:59.928
1608.17:59:59.956
2924.20:0:0.100
7732.21:0:0.000
1395.14:41:44.404
1461.23:30:0.050
-4393.7:59:59.712
I think the problem might be the date format but don't understand the formula well enough to troubleshoot. In Excel I usually compute the AUDPC by using the SUMPRODUCTS multiplying the days between two dates by the average TRAIT_VALUE between those two dates.

SAS Proc IML Simulate from empirical data with limits

This might sound bonkers, but looking to see if there are any ideas on how to do this.
I have N categories (say 7) where a set number of people (say 1000) have to be allocated. I know from historical data the minimum and maximum for each category (there is limited historical data, say 15 samples, so I have data that looks like this - if I had a larger sample, I would try to generate a distribution for each category from all the samples, but there isn't.
-Year 1: [78 97 300 358 132 35 0]
-Year 2: [24 74 346 300 148 84 22]
-.
-.
-Year 15:[25 85 382 302 146 52 8]
The min and max for each category over these 15 years of data is:
Min: [25 74 252 278 112 27 0 ]
Max: [132 141 382 360 177 84 22]
I am trying to scale this using simulation - by allocating 1000 to each category within the min and max limits, and repeating it. The only condition is that the sum of the allocation across the seven categories in each simulation has to sum to 1000.
Any ideas would be greatly appreciated!
The distribution you want is called the multinomial distribution. You can use the RandMultinomial function in SAS/IML to produce random samples from the multinomial distribution. To use the multinomial distribution, you need to know the probability of an individual in each category. If this probability has not changed over time, the best estimate of this probability is to take the average proportion in each category.
Thus, I would recommend using ALL the data to estimate the probability, not just max and min:
proc iml;
X = {...}; /* X is a 15 x 7 matrix of counts, each row is a year */
mean = mean(X);
p = mean / sum(mean);
/* simulate new counts by using the multinomial distribution */
numSamples = 10;
SampleSize = 1000;
Y = randmultinomial(numSamples, SampleSize, p);
print Y;
Now, if you insist on using the max/min, you could use the midrange to estimate the most likely value and use that to estimate the probabilty, as follows:
Min = {25 74 252 278 112 27 0};
Max = {132 141 382 360 177 84 22};
/* use midrange to estimate probabilities */
midrange = (Min + Max)/2;
p = midrange / sum(midrange);
/* now use RandMultinomial, as before */
If you use the second method, there is no guarantee that the simulated values will not exceed the Min/Max values, although in practice many of the samples will obey that criterion.
Personally, I advocate the first method, which uses the average count. Or you can use a time-weighted count, if you think recent observations are more relevant than observations from 15 years ago.

How to calculate confidence intervals for crude survival rates?

Let's assume that we have a survfit object as follows.
fit = survfit(Surv(data$time_12m, data$status_12m) ~ data$group)
fit
Call: survfit(formula = Surv(data$time_12m, data$status_12m) ~ data$group)
n events median 0.95LCL 0.95UCL
data$group=HF 10000 3534 NA NA NA
data$group=IGT 70 20 NA NA NA
fit object does not show CI-s. How to calculate confidence intervals for the survival rates? Which R packages and code should be used?
The print result of survfit gives confidnce intervals by group for median survivla time. I'm guessing the NA's for the estimates of median times is occurring because your groups are not having enough events to actually get to a median survival. You should show the output of plot(fit) to see whether my guess is correct.
You might try to plot the KM curves, noting that the plot.survfit function does have a confidence interval option constructed around proportions:
plot(fit, conf.int=0.95, col=1:2)
Please read ?summary.survfit. It is the class of generic summary functions which are typically used by package authors to deliver the parameter estimates and confidence intervals. There you will see that it is not "rates" which are summarized by summary.survfit, but rather estimates of survival proportion. These proportions can either be medians (in which case the estimate is on the time scale) or they can be estimates at particular times (and in that instance the estimates are of proportions.)
If you actually do want rates then use a functions designed for that sort of model, perhaps using ?survreg. Compare what you get from using survreg versus survfit on the supplied dataset ovarian:
> reg.fit <- survreg( Surv(futime, fustat)~rx, data=ovarian)
> summary(reg.fit)
Call:
survreg(formula = Surv(futime, fustat) ~ rx, data = ovarian)
Value Std. Error z p
(Intercept) 6.265 0.778 8.05 8.3e-16
rx 0.559 0.529 1.06 0.29
Log(scale) -0.121 0.251 -0.48 0.63
Scale= 0.886
Weibull distribution
Loglik(model)= -97.4 Loglik(intercept only)= -98
Chisq= 1.18 on 1 degrees of freedom, p= 0.28
Number of Newton-Raphson Iterations: 5
n= 26
#-------------
> fit <- survfit( Surv(futime, fustat)~rx, data=ovarian)
> summary(fit)
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
rx=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 13 1 0.923 0.0739 0.789 1.000
115 12 1 0.846 0.1001 0.671 1.000
156 11 1 0.769 0.1169 0.571 1.000
268 10 1 0.692 0.1280 0.482 0.995
329 9 1 0.615 0.1349 0.400 0.946
431 8 1 0.538 0.1383 0.326 0.891
638 5 1 0.431 0.1467 0.221 0.840
rx=2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
353 13 1 0.923 0.0739 0.789 1.000
365 12 1 0.846 0.1001 0.671 1.000
464 9 1 0.752 0.1256 0.542 1.000
475 8 1 0.658 0.1407 0.433 1.000
563 7 1 0.564 0.1488 0.336 0.946
Might have been easier if I had used "exponential" instead of "weibull" as the distribution type. Exponential fits have a single parameter that is estimated and are more easily back-transformed to give estimates of rates.
Note: I answered an earlier question about survfit, although the request was for survival times rather than for rates. Extract survival probabilities in Survfit by groups

How does Excel evaluate FACT(170)/FACT(169) correctly?

170! approaches the limit of a floating point double: 171! will overflow.
However 170! is over 300 digits long.
There is, therefore, no way that 170! can be represented precisely in floating point.
Yet Excel returns the correct answer for 170! / 169!.
Why is this? I'd expect some error to creep in, but it returns an integral value. Does Excel somehow know how to optimise this calculation?
If you find the closest doubles to 170! and 169!, they are
double oneseventy = 5818033100654137.0 * 256;
double onesixtynine = 8761273375102700.0;
times the same power of two. The closest double to the quotient of these is exactly 170.0.
Also, Excel may compute 170! by multiplying 169! by 170.
William Kahan has a paper called "How Futile are Mindless Assessments of Roundoff in Floating-Point Computation?" where he discusses some of the insanity that goes on in Excel. It may be that Excel is not computing 170 exactly, but rather it's hiding an ulp of reality from you.
The answer of tmyklebu is already perfect. But I wanted to know more.
What if implementation of n! was something trivial as return double(n)*(n-1)!...
Here is a Smalltalk snippet, but you can translate in many other languages, that's not the point:
(2 to: 170) count: [:n |
| num den |
den := (2 to: n - 1) inject: 1.0 into: [:p :e | p*e].
num := n*den.
num / den ~= n].
And the answer is 12
So you have not been particulary lucky, due to good properties of round to nearest even rounding mode, out of these 169 numbers, only 12 don't behave as expected.
Which ones? Replace count: by select: and you get:
#(24 47 59 61 81 96 101 104 105 114 122 146)
If I had an Excel handy, I would ask to evaluate 146!/145!.
Curiously (only apparently curiously), a less naive solution that computes the exact factorial with large integer arithmetic, then convert to nearest float, does not perform better !
(2 to: 170) reject: [:n |
n factorial asFloat / (n-1) factorial asFloat = n]
leads to:
#(24 31 34 40 41 45 46 57 61 70 75 78 79 86 88 92 93 111 115 116 117 119 122 124 141 144 147 164)

Extrapolate Linear Regression [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
I am new with excel, but how can i get an estimate for the values in 2013 of something like this:
I need an estimate which is the extrapolation of the value according to the linear regression the counterparts observed in recent years.
Thanks
To answer this, I plotted data in two ways: (a) showing each year separately, and (b) showing all the data as one line through time. The graphs are as follows:
Looking at the first graph, if there is any seasonality in the data, it's not very strong. However, looking at all the data plotted on one line through time, it looks as though there is an upward trend. So my suggestion is to do the most basic regression and fit a straight line to the data. The graph with the trend line added is as follows:
In numbers, the results are:
Data Best fit straight line
Jan-10 218 232.7
Feb-10 251 235.0
Mar-10 221 237.1
Apr-10 241 239.4
May-10 261 241.7
Jun-10 227 244.0
Jul-10 253 246.3
Aug-10 266 248.6
Sep-10 238 250.9
Oct-10 255 253.2
Nov-10 238 255.5
Dec-10 219 257.7
Jan-11 263 260.0
Feb-11 239 262.4
Mar-11 255 264.5
Apr-11 297 266.8
May-11 299 269.0
Jun-11 256 271.4
Jul-11 292 273.6
Aug-11 247 275.9
Sep-11 254 278.2
Oct-11 258 280.5
Nov-11 264 282.8
Dec-11 301 285.1
Jan-12 319 287.4
Feb-12 314 289.7
Mar-12 274 291.9
Apr-12 325 294.2
May-12 319 296.4
Jun-12 339 298.8
Jul-12 339 301.0
Aug-12 271 303.3
Sep-12 310 305.7
Oct-12 291 307.9
Nov-12 259 310.2
Dec-12 286 312.5
Jan-13 314.8
Feb-13 317.1
Mar-13 319.2
Apr-13 321.5
May-13 323.8
Jun-13 326.1
Jul-13 328.4
Aug-13 330.7
Sep-13 333.0
Oct-13 335.2
Nov-13 337.6
Dec-13 339.8
There are different ways you can apply linear regression. You could, for example, use all your data points to create an equation to calculate for all the subsequent months. However, if there are yearly cycles, you might just want to use the data for each January to estimate the next January; each month of February to estimate February; etc. To keep it simple, let's just work with January for now. In order to keep the numbers smaller, I'm just going to use the last two digits of the year:
X Y
10 218
11 263
12 319
Next calculate 4 different sums:
S[x] = Sum of all Xs = 33
S[y] = Sum of all Ys = 800
S[xx] = Sum of X squared = 100 + 121 + 144 = 365
S[xy] = Sum of X*Y = 2180 + 2893 + 3828 = 8901
Calculate slope and intercept:
N = Number of data points sampled = 3
M = Slope = (N*S[xy] - S[x]*S[y])/(N*S[xx] - S[x]^2)
M = (3*8901 - 33*800)/(3*365 - 33^2) = 303/6 = 50.5
B = Intercept = (S[y] - M*S[x])/N
B = (800 - 50.5*33)/3 = -866.5/3 = -289
Therefore the equation for January would be:
Y = M*X + B
Y = 50.5*X - 289
Calculate for the year 2013:
Y = 50.5*13 -289 = 368
Start by plotting your data. Decide what kind of function will be a good fit.
You can either create a fit for each month or try to create one that has both year and month as independent variables.
Let's assume that a polynomial fit for each month will work for you:
y = c0 + c1*m + c2*m^2
So for January:
218 = c0 + c1*2010 + c2*2010^2
263 = c0 + c1*2011 + c2*2011^2
319 = c0 + c1*2012 + c2*2012^2
So now you have three equations for three unknowns. Solve for (c0, c1, c2) and the substitute m = 2013 for your extrapolation.
Here are the results I get:
Month 2010 2011 2012 2013
1 218 263 319 386
2 251 239 314 476
3 221 255 274 278
4 241 297 325 325
5 261 299 319 321
6 227 256 339 476
7 253 292 339 394
8 266 247 271 338
9 238 254 310 406
10 255 258 291 354
11 238 264 259 223
12 219 301 286 174
See how you do.

Resources