How can I interpret the P-value of the nonlinear term under rms::anova()?

Here I fit a survival model with an rcs() term. Is anova() from the rms package the right way to test the linearity of the association? And how should I interpret the P-value of the Nonlinear term (0.094 here): does it support adding an rcs() term to the Cox model?
library(rms)
data(pbc)
d <- pbc
rm(pbc, pbcseq)
d$status <- ifelse(d$status != 0, 1, 0)
dd = datadist(d)
options(datadist='dd')
# rcs model
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
Wald Statistics          Response: Surv(time, status)

Factor      Chi-Square  d.f.  P
albumin          82.80     3  <.0001
 Nonlinear        4.73     2  0.094
TOTAL            82.80     3  <.0001

The proper way to test is with a model comparison of the log-likelihoods (i.e., the deviance) across two models, a full and a reduced one:
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
m <- cph(Surv(time, status) ~ albumin, data=d)
p.val <- 1- pchisq( (m2$loglik[2]- m$loglik[2]), 2 )
You can see the difference in inference between the less accurate Wald statistic (which in your case was not significant anyway, since the p-value was > 0.05) and this more accurate method in the example Harrell uses on his ?cph help page. Using his example:
> anova(f)
Wald Statistics          Response: S

Factor      Chi-Square  d.f.  P
age              57.75     3  <.0001
 Nonlinear        8.17     2  0.0168
sex              18.75     1  <.0001
TOTAL            75.63     4  <.0001
You would incorrectly conclude that the nonlinear term was "significant" at the conventional 0.05 level, despite the fact that the code creating the model was constructed to be entirely linear in age (on the log-hazard scale):
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
Create a reduced model and compare:
f0 <- cph(S ~ age + sex, x=TRUE, y=TRUE)
anova(f0)
#-------------
Wald Statistics          Response: S

Factor      Chi-Square  d.f.  P
age              56.64     1  <.0001
sex              16.26     1  1e-04
TOTAL            75.85     2  <.0001
The difference in deviance is not significant with 2 degrees of freedom difference:
1-pchisq((f$loglik[2]- f0$loglik[2]),2)
[1] 0.1243212
I don't know why Harrell leaves this example in; I've taken his RMS course and know that he endorses the cross-model comparison of deviance as the more accurate approach.

Related

Ryan Joiner Normality Test P-value

I have tried hard to calculate the Ryan-Joiner (RJ) p-value. Is there any method or formula to calculate the RJ p-value?
I found how to calculate the RJ statistic, but I am unable to find how to calculate the RJ p-value manually. Minitab calculates it by some strategy, and I want to know how to calculate it by hand.
Please help me with this.
The test statistic RJ needs to be compared to a critical value CV in order to make a determination of whether to reject or fail to reject the null hypothesis.
The value of CV depends on the sample size and the desired confidence level, and the values are empirically derived: generate a large number of normally distributed datasets for each sample size n, calculate the RJ statistic for each, and then the CV for a=0.10 is the 10th percentile of the simulated RJ values.
Sidenote: For some reason I'm seeing a 90% confidence level used many places for Ryan-Joiner, when a 95% confidence is commonly used for other normality tests. I'm not sure why.
I recommend reading the original Ryan-Joiner 1976 paper:
https://www.additive-net.de/de/component/jdownloads/send/70-support/236-normal-probability-plots-and-tests-for-normality-thomas-a-ryan-jr-bryan-l-joiner
In that paper, the following critical value equations were empirically derived (I wrote them out in Python for convenience):
from math import sqrt

def rj_critical_value(n, a=0.10):
    """Empirical critical-value equations from Ryan & Joiner (1976)."""
    if a == 0.10:
        return 1.0071 - (0.1371 / sqrt(n)) - (0.3682 / n) + (0.7780 / n**2)
    elif a == 0.05:
        return 1.0063 - (0.1288 / sqrt(n)) - (0.6118 / n) + (1.3505 / n**2)
    elif a == 0.01:
        return 0.9963 - (0.0211 / sqrt(n)) - (1.4106 / n) + (3.1791 / n**2)
    else:
        raise ValueError("a must be one of [0.10, 0.05, 0.01]")
The RJ test statistic then needs to be compared to that critical value:
If RJ < CV, then the determination is NOT NORMAL.
If RJ > CV, then the determination is NORMAL.
Minitab is going one step further - working backwards to determine the value of a at which CV == RJ. This value would be the p-value you're referencing in your original question.
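For completeness, here is a minimal sketch of how the test could be run end to end. It assumes the Minitab-style formulation of the RJ statistic as the correlation between the ordered data and the normal scores b_i = Phi^-1((i - 3/8)/(n + 1/4)); rj_critical_value is the function defined above, and the data x are made up for illustration.
import numpy as np
from scipy.stats import norm

def rj_statistic(x):
    # Correlation between the ordered data and the normal scores b_i
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    b = norm.ppf((i - 3/8) / (n + 1/4))
    return np.sum(x * b) / np.sqrt(np.var(x, ddof=1) * (n - 1) * np.sum(b**2))

x = np.random.default_rng(0).normal(size=50)   # illustrative data
rj = rj_statistic(x)
cv = rj_critical_value(len(x), a=0.05)
print(rj, cv, "normal" if rj > cv else "not normal")
To approximate the p-value that Minitab reports, you could evaluate rj_critical_value at the three tabulated alpha levels and interpolate to find the alpha at which CV equals your observed RJ, or simulate the null distribution of rj_statistic for your n and take the proportion of simulated values below your observed RJ.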

Retrieve original factor loadings using Factor Analysis in scikit-learn

So I am trying a toy example where I know the factors in advance, and I want to back them out using FactorAnalysis or PCA from scikit-learn.
Let's say I have defined 4 random X factors and 10 Y dependent variables:
import numpy as np
import pandas as pd

# number of obs
N = 10000
n_factors = 4
n_variables = 10
# 4 random factors ~ N(0,1)
X = np.random.normal(size=(N, n_factors))
# loadings for the 10 Y dependent variables
loadings = pd.DataFrame(np.round(np.random.normal(0, 2, size=(n_factors, n_variables)), 2))
# Y without unique variance
Y_hat = X.dot(loadings)
There is no random noise here, so if I run PCA it shows that 4 components explain all the variance, as one would expect:
from sklearn.decomposition import PCA

pca = PCA(n_components=n_factors)
pca.fit(Y_hat)
np.cumsum(pca.explained_variance_ratio_)
array([0.47940185, 0.78828548, 0.93573719, 1. ])
So far so good. In the next step I ran the factor analysis and reconstituted Y from the estimated loadings and factor scores:
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=n_factors, random_state=0, rotation=None)
X_fa = fa.fit_transform(Y_hat)
loadings_fa = pd.DataFrame(fa.components_)
Y_hat_fa = X_fa.dot(loadings_fa) + np.mean(Y_hat, axis=0)
print((Y_hat_fa-Y_hat).max())
print((Y_hat_fa-Y_hat).min())
6.039613253960852e-13
-5.577760475716786e-13
So the original variables and the variables reconstituted from FA match almost exactly.
However, the loadings don't match at all, and neither do the factors:
loadings_fa-loadings
0 1 2 3 4 5 6 7 8 9
0 1.70402 -3.37357 3.62861 -0.85049 -6.10061 11.63636 3.06843 -6.89921 4.17525 3.90106
1 -1.38336 5.00735 0.04610 1.50830 0.84080 -0.44424 -1.52718 3.53620 3.06496 7.13725
2 1.63517 -1.95932 2.71208 -2.34872 -2.10633 4.50955 3.45529 -1.44261 0.03151 0.37575
3 0.27463 3.89216 2.00659 -2.18016 1.99597 -1.85738 2.34128 6.40504 -0.55935 4.13107
From quick calculations, the factors from FA are not even well correlated with the original factors.
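For concreteness, such a check could look like this (a sketch reusing X and X_fa from the code above):
# Cross-correlations between the 4 original factors (rows) and the
# 4 recovered factors (columns); a clean recovery would show one
# entry near +/-1 per row.
corr = np.corrcoef(X.T, X_fa.T)[:n_factors, n_factors:]
print(pd.DataFrame(np.round(corr, 2)))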
I am looking for a good theoretical explanation of why I can't back out the original factors and loadings, and not necessarily a code example.

Why can't I fit a Poisson distribution using a chi-square test? What's wrong in the fitting? [duplicate]

I want to fit a Poisson distribution to my data points and decide, based on a chi-square test, whether I should accept or reject this proposed distribution. I only used 10 observations. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.stats import chisquare

# x is the data array (not shown in the question); x_data is a plotting grid

# Fitting function:
def Poisson_fit(x, a):
    return a * np.exp(-x)

# Code
hist, bins = np.histogram(x, bins=10, density=True)
print("hist: ", hist)
#hist: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
XX = np.arange(len(hist))
print("XX: ", XX)
#XX: [0 1 2 3 4 5 6 7 8 9]
plt.scatter(XX, hist, marker='.', color='red')
popt, pcov = optimize.curve_fit(Poisson_fit, XX, hist)
plt.plot(x_data, Poisson_fit(x_data, *popt), linestyle='--', color='red',
         label='Fit')
print("hist: ", hist)
plt.xlabel('s')
plt.ylabel('P(s)')

# Chi-square test:
f_obs = hist
#f_obs: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
f_exp = Poisson_fit(XX, *popt)
#f_exp: [6.76613820e-01, 2.48912314e-01, 9.15697229e-02, 3.36866185e-02,
# 1.23926144e-02, 4.55898806e-03, 1.67715798e-03, 6.16991940e-04,
# 2.26978650e-04, 8.35007789e-05]
chi, p_value = chisquare(f_obs, f_exp)
print("chi: ", chi)
print("p_value: ", p_value)
#chi: 0.4588956658201067
#p_value: 0.9999789643475111
I am using 10 observations, so the degrees of freedom would be 9. For 9 degrees of freedom I can't find my p-value and chi-square value in a chi-square distribution table. Is there anything wrong in my code, or are my input values too small so that the test fails? If the p-value is > 0.05 the distribution is accepted. The p-value is large (0.999), but I can't find the chi-square value 0.4588 in the table. I think there is something wrong in my code. How can I fix this error?
Is the returned chi value the critical value of the tail? How do I check the proposed hypothesis?
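As a side note on the table-lookup part of the question: you don't need a printed table, since scipy.stats.chi2 gives both the critical value and the p-value directly (a small sketch using the df = 9 and statistic 0.4588 quoted above):
from scipy.stats import chi2

df = 9
stat = 0.4588
print(chi2.ppf(0.95, df))   # upper 5% critical value (about 16.92)
print(chi2.sf(stat, df))    # p-value for the observed statistic (about 0.99998)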

JAGS: node inconsistent with parents

I am trying to apply Benedict Escoto's method from the paper "Bayesian Claim Severity with Mixed Distributions," published in Variance. I seem to be running into a JAGS simulation problem. When I run the code, JAGS gives me the following error:
Error in node ones[1] Node inconsistent with parents
The data is two-fold. First, there is a list of ground-up insurance claims. Information on their age in years, deductibles (truncation) and whether they have been capped (censoring, True or False) is provided. Second, there is a list of prior means and corresponding probability weights for a mixed exponential distribution.
One thing I noticed is that it works for one set of priors for the mixed exponential distribution but fails for another.
It works with:
Mean Weight
50 0.3
100 0.25
500 0.25
1500 0.1
5000 0.07
20000 0.03
But it fails with:
Mean Weight
3 0.72
14 0.19
42 0.05
138 0.02
503 0.01
1501 0.01
So it may have to do with parameter requirements for mixed exponential.
The data packet for JAGS is assembled as below:
jags.data <- list(claims = (claim.df$x[!claim.df$capped]
                            - claim.df$truncation[!claim.df$capped]),
                  capped.claims = (claim.df$x[claim.df$capped]
                                   - claim.df$truncation[claim.df$capped]),
                  alpha = alpha,
                  means = actual.means,
                  ones = rep(1, length(claim.df$x[claim.df$capped])),
                  ages = claim.df$age[!claim.df$capped],
                  capped.ages = claim.df$age[claim.df$capped],
                  trend.shape = trend.shape,
                  trend.rate = 1/trend.scale)
Notice that object "ones" is given values of 1 for each capped claim.
The initial values are supplied as below:
jags.init <- list(means = list(weights = prior.weights),
                  equal = list(weights = rep(1/m, m)))
Some miscellaneous values are provided as follows:
m <- length(actual.means)
alpha0 <- 20
alpha <- prior.weights * alpha0
trend.prior.mu <- .05
trend.prior.sigma <- .01
trend.scale <- trend.prior.sigma^2 / (1+trend.prior.mu)
trend.shape <- (1+trend.prior.mu)/trend.scale
The JAGS model is coded as below:
model <- "model {
weights ~ ddirch(alpha)
trend.factor ~ dgamma(trend.shape, trend.rate)
for (i in 1:length(claims)) {
buckets[i] ~ dcat(weights)
mu[i] <- means[buckets[i]] / trend.factor^ages[i]
claims[i] ~ dexp(1/mu[i])
}
for (i in 1:length(capped.claims)) {
capped.buckets[i] ~ dcat(weights)
capped.mu[i] <- means[capped.buckets[i]]/trend.factor^capped.ages[i]
prob.capped[i] <- exp(-capped.claims[i]/capped.mu[i])
ones[i] ~ dbern(prob.capped[i])
}
}"
Dirichlet, categorical, and gamma distributions are used for the priors. The ones vector is given a Bernoulli distribution so that the capped (censored) claims contribute their survival probability to the likelihood.
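To spell out why that works: for an observed ones[i] = 1, the dbern(prob.capped[i]) term contributes exactly prob.capped[i] to the likelihood, and

prob.capped[i] = exp(-capped.claims[i] / capped.mu[i]) = P(claim exceeds its cap | mu_i),

which is the usual likelihood contribution of a right-censored exponential observation (the standard "ones trick").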
Finally, the model is run in JAGS with the following:
model.out <- autorun.jags(model, data=jags.data, inits=jags.init,
                          monitor=c("weights", "trend.factor"),
                          startburnin=1000, startsample=5000,
                          n.chains=n.chains, interactive=FALSE, thin=thin.factor)
Would anyone have an idea of what is going wrong? Thanks.

How to avoid impression bias when calculating the CTR?

When we train a CTR (click-through rate) model, sometimes we need to calculate the real CTR from the historical data, like this:
ctr = #(clicks) / #(impressions)
We know that if the number of impressions is too small, the calculated CTR is not reliable, so we usually set a threshold and keep only items with enough impressions.
But we also know that the more impressions, the higher the confidence in the CTR. So my question is: is there an impressions-normalized statistical method to calculate the CTR?
Thanks!
You probably need a representation of the confidence interval for your estimated CTR. The Wilson score interval is a good one to try.
You need the following quantities to calculate it:
p̂ is the observed CTR (#clicks / #impressions)
n is the total number of impressions
z_{α/2} is the (1 − α/2) quantile of the standard normal distribution
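For reference, the Wilson score interval itself (which the code below implements) is

(lower, upper) = ( p̂ + z²/(2n) ∓ z·sqrt( p̂(1 − p̂)/n + z²/(4n²) ) ) / ( 1 + z²/n )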
A simple implementation in Python is shown below. I use z = 1.96, which corresponds to a 95% confidence interval. Three test results are attached at the end of the code.
# clicks   # impressions   # conf interval
2          10              (0.07, 0.45)
20         100             (0.14, 0.27)
200        1000            (0.18, 0.22)
Now you can set a threshold based on the calculated confidence interval.
from math import sqrt

def confidence(clicks, impressions):
    n = impressions
    if n == 0:
        return 0
    z = 1.96  # 1.96 -> 95% confidence
    phat = float(clicks) / n
    denorm = 1. + (z * z / n)
    enum1 = phat + z * z / (2 * n)
    enum2 = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (enum1 - enum2) / denorm, (enum1 + enum2) / denorm

def wilson(clicks, impressions):
    if impressions == 0:
        return 0
    else:
        return confidence(clicks, impressions)

if __name__ == '__main__':
    print(wilson(2, 10))
    print(wilson(20, 100))
    print(wilson(200, 1000))
"""
--------------------
results:
(0.07048879557839793, 0.4518041980521754)
(0.14384999046998084, 0.27112660859398174)
(0.1805388068716823, 0.22099327100894336)
"""
If you treat this as a binomial parameter, you can do Bayesian estimation. If your prior on the CTR is uniform (a Beta distribution with parameters (1, 1)), then your posterior is Beta(1 + #clicks, 1 + #impressions − #clicks). Your posterior mean is (#clicks + 1) / (#impressions + 2) if you want a single summary statistic of this posterior, but you probably don't, and here's why:
I don't know what your method is for determining whether the CTR is high enough, but let's say you're interested in everything with ctr > 0.9. You can then use the cumulative distribution function of the Beta distribution to look at what proportion of the probability mass is above the 0.9 threshold (this is just 1 minus the CDF at 0.9). In this way, your threshold naturally incorporates the uncertainty about the estimate due to the limited sample size.
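A minimal sketch of that calculation with scipy, using made-up counts and the illustrative 0.9 threshold from the text:
from scipy.stats import beta

clicks, impressions = 200, 1000              # made-up example counts
posterior = beta(1 + clicks, 1 + impressions - clicks)
print(posterior.mean())    # posterior mean, (clicks + 1) / (impressions + 2)
print(posterior.sf(0.9))   # P(ctr > 0.9 | data) = 1 - CDF at 0.9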
There are many ways to calculate this confidence interval. An alternative to the Wilson score is the Clopper-Pearson interval, which I found useful in spreadsheets.
Upper bound = B(1 − alpha/2; x + 1, n − x)
Lower bound = B(alpha/2; x, n − x + 1)
Where
B(p; v, w) is the inverse Beta distribution (the p-th quantile of a Beta(v, w) distribution)
alpha is the confidence level error (e.g., for a 95% confidence level, alpha is 5%)
n is the number of samples (e.g., impressions)
x is the number of successes (e.g., clicks)
In Excel an implementation for B() is provided by the BETA.INV formula.
There is no equivalent formula for B() in Google Sheets, but a Google Apps Script custom function can be adapted from a JavaScript statistical library (e.g., search GitHub for jstat).
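In Python, a minimal sketch of the same interval, using scipy's beta.ppf as the inverse Beta distribution B() above:
from scipy.stats import beta

def clopper_pearson(x, n, alpha=0.05):
    # Exact (Clopper-Pearson) interval for a binomial proportion
    lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lower, upper

print(clopper_pearson(20, 100))   # e.g. 20 clicks out of 100 impressions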
