How to calculate Log Hazard Ratio in SAS

proc phreg data=analysis;
   class c diabetes / descending;
   model days*ind(0) = c diabetes id / rl;
   id = c*diabetes;
   hazardratio id;
run;
I am trying to run a simple Cox regression in SAS. I can't seem to find a way to calculate the log-hazard ratio for the variables in my model. The HAZARDRATIO statement and the /RL option give the hazard ratio with a 95% CI, but I want the log-hazard ratio with 95% confidence limits.
Please help.

You can get the log-hazard ratio just by taking the log of the hazard ratio: the computations are done with the log hazards, and the exponential is taken only in the last step of the estimation (please see the example here: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_phreg_sect030.htm). The same applies to the Wald and profile-likelihood confidence limits (https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_phreg_a0000000688.htm), e.g.
"The profile-likelihood confidence limits for the hazard ratio are obtained by exponentiating these confidence limits."
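For example, a quick arithmetic sketch in Python (the hazard ratio and confidence limits below are hypothetical numbers, not actual SAS output):
import math

# hypothetical values read off a PROC PHREG hazard ratio table
hr, ci_lower, ci_upper = 1.85, 1.20, 2.85

# the log-hazard ratio and its 95% CI are just the natural logs of these values,
# because the model is fit on the log-hazard scale and exp() is applied last
log_hr = math.log(hr)
log_ci = (math.log(ci_lower), math.log(ci_upper))
print(log_hr, log_ci)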

Related

outlier detection using 2D spatial information

I have a list of sensor measurements for air quality with geo-coordinates, and I would like to implement outlier detection. The list of sensors is relatively small (~50).
The air quality can gradually change with the distance, but abrupt local spikes are likely outliers. If one sensor in the group of closely located sensors shows a higher value it could be an outlier. If the same higher value is shown by more distant sensors it might be OK.
Of course, I can ignore coordinates and do simple outlier detection assuming the normal distribution, but I was hoping to do something more sophisticated. What would be a good statistical way to model this and implement outlier detection?
The above statement ("If one sensor in the group of closely located sensors shows a higher value it could be an outlier. If the same higher value is shown by more distant sensors it might be OK.") indicates that sensors that are closer to each other tend to have values that are more alike.
Tobler’s first law of geography - “everything is related to everything else, but near things are more related than distant things”
You can quantify an answer to this question. The focus should not be on the location and values of outlier sensors. Use global spatial autocorrelation to measure the degree to which sensors that are near each other tend to be more alike.
As a start, you will first need to define neighbors for each sensor.
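One common choice of global spatial autocorrelation statistic is Moran's I (my suggestion for a concrete measure; the answer does not name one). A bare-bones sketch, assuming you have already built a neighbor weight matrix:
import numpy as np

def morans_i(values, weights):
    # values: length-n array of sensor readings
    # weights: (n, n) spatial weight matrix, e.g. 1 for pairs of neighboring sensors,
    #          0 otherwise, with a zero diagonal
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()
    n = len(x)
    # values near +1 indicate strong positive spatial autocorrelation,
    # values near 0 indicate spatial randomness
    return (n / w.sum()) * (z @ w @ z) / (z @ z)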
I'd calculate a cost function, consisting of two costs:
1: cost_neighbors: the deviation of the sensor's value from an expected value. The expected value is calculated by summing the neighbors' values weighted by the inverse of their distance.
2: cost_previous_step: how much the value of the sensor changed compared to the last time step. A large change in value leads to a large cost.
Here is some pseudo code describing how to calculate the costs:
expected_value = ((value_neighbor_0 / distance_neighbor_0) + (value_neighbor_1 / distance_neighbor_1) + ...) / nb_neighbors
cost_neighbors = abs(expected_value - value)
cost_previous_step = abs(value_t - value_t_minus_1)
total_cost = a*cost_neighbors + b*cost_previous_step
a and b are parameters that can be tuned to give each of the costs more or less weight. The total cost is then used to decide whether a sensor value is an outlier: the larger it is, the more likely the value is an outlier.
To figure out the threshold and weights, you can plot the costs of some labeled data points for which you know whether they are outliers or not.
cost_neighbors
| X
| X X
|
|o o
|o o o
|___o_____________ cost_previous_step
X= outlier
o= non-outlier
You can now either set the threshold by hand or create a small dataset with the labels and costs, and apply any sort of classifier function (e.g. SVM).
If you use Python, an easy way to find neighbors and their distances is scipy.spatial.cKDTree.
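For example, a minimal sketch of the whole cost computation using cKDTree; the array names (coords, values, prev_values), the choice of k nearest neighbors, and the assumption that sensor locations are distinct are mine, not part of the original description:
import numpy as np
from scipy.spatial import cKDTree

def outlier_costs(coords, values, prev_values, k=5, a=1.0, b=1.0):
    # coords: (n, 2) sensor coordinates; values / prev_values: current and previous readings
    values = np.asarray(values, dtype=float)
    prev_values = np.asarray(prev_values, dtype=float)
    tree = cKDTree(coords)
    # query k+1 neighbors because the nearest neighbor of each point is the point itself
    dists, idx = tree.query(coords, k=k + 1)
    dists, idx = dists[:, 1:], idx[:, 1:]
    # inverse-distance weighted expectation, normalized by the number of neighbors as in the pseudo code
    expected = (values[idx] / dists).sum(axis=1) / k
    cost_neighbors = np.abs(expected - values)
    cost_previous_step = np.abs(values - prev_values)
    return a * cost_neighbors + b * cost_previous_step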

Correlation Coefficient over Coefficient of Determination in linear regression

I am new to machine learning and I am using the housing price dataset from kaggle.com to solve a regression problem. I want to know the difference between the correlation coefficient and the coefficient of determination, and why people use one over the other. For instance, I can see the relation between YearBuild and SalePrice with the correlation coefficient.
Now, what is the use of the coefficient of determination, and why is it used?
If R = correlation coefficient,
then coefficient of determination = R x R.
Is it the percentage view of the correlation coefficient?
Is it the relation of an individual feature with the rest of the features?
The coefficient R squared tells you how much of the variance the regression model explains. If it is equal to 0.01, for example, it means that you have explained one percent of the variance. This is useful to know for obvious reasons. Unlike the correlation coefficient, R squared is always positive, so it only tells you how strongly the variables are linearly related, not the direction of the relationship.
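As a toy illustration (the numbers below are made up and only stand in for something like YearBuild vs SalePrice):
import numpy as np

year_built = np.array([1950, 1965, 1980, 1995, 2005, 2010])
sale_price = np.array([120000, 135000, 150000, 180000, 210000, 225000])

r = np.corrcoef(year_built, sale_price)[0, 1]   # correlation coefficient: sign gives the direction
r_squared = r ** 2                              # coefficient of determination: share of variance explained
print(r, r_squared)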

How do I calculate confidence interval with only sample size and confidence level

I'm writing a program that lets users run simulations on a subset of data, and as part of this process, the program allows a user to specify what sample size they want based on a confidence level and confidence interval. Assuming a p value of .5 to maximize the sample size, and given that I know the population size, I can calculate the sample size. For example, if I have:
Population = 54213
Confidence Level = .95
Confidence Interval = 8
I get Sample Size 150. I use the formula outlined here:
https://www.surveysystem.com/sample-size-formula.htm
What I have been asked to do is reverse the process, so that the confidence interval is calculated from a given sample size and confidence level (and I know the population). I'm having a horrible time trying to reverse this equation and was wondering if there is a formula. More importantly, does this seem like an intelligent thing to do? Because it seems like a weird request to me.
I should mention (just to be clear) that the CI is estimated for the mean, not for the population. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as
sample_mean ± z * SD / sqrt(n)
From this formula you would also get your formula, where you are estimating n.
If the population SD is not known then you need to replace the z-value with a t-value.
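If you just need to invert the sample-size calculation from the linked page, a rough sketch is below. It assumes the usual p = 0.5 proportion formula with a finite population correction, which is what that calculator appears to use, so treat it as a guide rather than a drop-in implementation:
import math

def margin_of_error(n, population, confidence_level=0.95, p=0.5):
    # z-values for common confidence levels; 1.96 corresponds to 95%
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence_level]
    # undo the finite population correction to recover the uncorrected sample size
    ss = n * (population - 1) / (population - n)
    # margin of error as a proportion; multiply by 100 for percentage points
    return z * math.sqrt(p * (1 - p) / ss)

# with the numbers from the question this gives roughly 0.08, i.e. a confidence interval of 8
print(round(100 * margin_of_error(150, 54213), 1))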

Log transforming predictor variables in survival analysis

I am running shared gamma frailty models (i.e., Coxph survival analysis models with a random effect) and want to know if it is "acceptable" to log transform one of your continuous predictor variables. I found a website (http://www.medcalc.org/manual/cox_proportional_hazards.php) that said "The Cox proportional regression model assumes ... there should be a linear relationship between the endpoint and predictor variables. Predictor variables that have a highly skewed distribution may require logarithmic transformation to reduce the effect of extreme values. Logarithmic transformation of a variable var can be obtained by entering LOG(var) as predictor variable".
I would really appreciate a second opinion from someone with more statistical knowledge on this topic. In a nutshell: is it OK/commonplace/etc. to transform (specifically, log transform) predictor variables in a survival analysis model (e.g., a Cox PH model)?
Thanks.
You can log transform any predictor in Cox regression. This is frequently necessary but has some drawbacks.
Why log transform? There are a number of good reasons: you decrease the extent and effect of outliers, the data becomes more normally distributed, and so on.
When is it possible? I doubt that there are circumstances in which you cannot do it. I find it hard to believe that it would compromise the precision of your estimates.
Why not always do it? Because it becomes difficult to interpret the results for a predictor that has been log transformed. If you don't log transform and your predictor is, for example, blood pressure, then a hazard ratio of 1.05 means a 5% increase in the risk of the event per one-unit increase in blood pressure. If you log transform blood pressure, a hazard ratio of 1.05 (it would most likely not land on 1.05 again after the log transform, but we'll stick with 1.05 for simplicity) means a 5% increase per one-log-unit increase in blood pressure. That is more difficult to grasp.
But if you are not interested in the particular variable you are thinking about log transforming (i.e., you just need to adjust for it as a covariate), go ahead and do it.
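If interpretability is the main worry, one small trick (my addition, illustrated with the answer's hypothetical 1.05) is to restate the hazard ratio per doubling of the original predictor instead of per log unit:
import math

hr_per_log_unit = 1.05                    # hypothetical HR for log(blood pressure)
beta = math.log(hr_per_log_unit)          # the corresponding coefficient (log hazard ratio)

# doubling blood pressure adds log(2) to log(blood pressure),
# so the hazard ratio per doubling is exp(beta * log(2))
hr_per_doubling = math.exp(beta * math.log(2))
print(hr_per_doubling)                    # about 1.034, i.e. roughly a 3.4% higher hazard per doubling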

How to identify data points that are significantly smaller than the others in a data set?

I have an array of data points with real values. I wish to identify those data points whose values are significantly smaller than the others. Are there any well-known algorithms?
For example, the data set can be {0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0}. I can manually tell that 0.01 is significantly smaller than the others. However, I would like to know whether there is an analysis method for this purpose in statistics. I tried outlier detection on my data set, but it could not find any outliers (such as detecting 0.01 as an outlier).
I deleted a segment I wrote explaining the use of z-scores for your problem because it was incorrect. I hope the information below is accurate; just in case, use it as a guide only...
The idea is to build a z-distribution from the scores you are testing, minus the test score, and then use that distribution to get a z-score for the test score. Any score with |z| greater than 1.96 is unlikely to belong to your test population.
I am not sure that this works properly, because you remove the test score's influence from the distribution: extreme scores will have inflated z-scores, since their contribution to the variance (the denominator in the z-score equation) is removed.
This could be a start until someone with more expertise sets us straight :)
e.g.
% leave-one-out z-score for each point (MATLAB/Octave)
for i = 1:length(data_set)
    test_score = data_set(i);
    sample_pop = data_set(data_set ~= test_score);    % all points except the tested one
    sample_mean = mean(sample_pop);
    sample_stdev = std(sample_pop);
    test_z(i) = (test_score - sample_mean) / sample_stdev;
end
This can be done for higher dimensions by using the dim input for mean.
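A Python version of the same leave-one-out idea, run on the example data from the question (using the 1.96 cutoff mentioned above):
import numpy as np

data = np.array([0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0])   # example from the question

for i, x in enumerate(data):
    rest = np.delete(data, i)                  # distribution without the tested point
    z = (x - rest.mean()) / rest.std(ddof=1)   # ddof=1 matches MATLAB's std
    if abs(z) > 1.96:
        print(f"{x} looks like an outlier (z = {z:.2f})")   # only 0.01 is flagged here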
