Calculate risk using Cox model coefficients and mean values - survival-analysis

I'm trying to understand the example presented in Appendix C here
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6481149/
Equation C1 is clear to me.
But in Equation C2 they use the mean values.
Such mean values are clear to me in the case of categorical variables: for example, 1.548 is the mean value of the Sex variable (as shown in Table 3). Please correct me if I'm wrong.
But for the numerical variables I don't understand which mean values they are using. For example, for the Age variable they use 3.768; if I understand correctly, that value is the log of the mean age, which should be log(44.15)=1.64. Instead, the value used is 3.768.
Could anybody please clarify where this value comes from?

In statistics log often means the natural logarithm, sometimes denoted ln. The four values they take the logarithms of are:
Variable     Reported Mean   ln(Mean)   Reported Value
Age          44.15           3.788      3.768
BMI          25.61           3.243      3.230
BP Syst      138.6           4.932      4.913
Pulse Rate   75.61           4.326      4.311
The calculated values are not exactly equal to the reported values. But it looks close enough that this is probably the calculation they used. Without the data and/or code they used it's hard to say why the results are different. The study mentions excluding 130 participants because of ethics protections. So, perhaps one table was calculated using a slightly different group of participants than the other table?
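If you want to reproduce the comparison yourself, here is a quick Python check of the natural logs of the reported means against the values used in Appendix C (the numbers are the ones quoted above):

    # Take the natural log of each reported mean and compare it with the
    # value used in the paper's Appendix C.
    import math

    reported = {          # variable: (reported mean, value used in Appendix C)
        "Age":        (44.15, 3.768),
        "BMI":        (25.61, 3.230),
        "BP Syst":    (138.6, 4.913),
        "Pulse Rate": (75.61, 4.311),
    }

    for name, (mean, used) in reported.items():
        print(f"{name:10s}  ln({mean}) = {math.log(mean):.3f}   (paper uses {used})")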

Related

Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)

I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR   COLOR    NAME
1900   Green    David
1901   Yellow   Sarah
1902   Green    ???
1902   Red      Sarah
...    ...      ...
2020   Purple   John
Any value in any field can be repeated in the dataset (including Year values).
In the first two columns we don't have missing values, but we only have Name values for around 20% of the rows in the third column. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name (for example, in a boxplot).
I have imagined a process like the following, although I am not sure it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' tends to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen when the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences of each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
However, I don't know which model is best for implementing an idea like this. I have drawn the process in a simple paint drawing:
[Image: possible output for the task]
What do you think would be a good approach to this task? I appreciate any help.
I think you have the process down; converting the data may be the first hurdle.
I would look at using OrdinalEncoder from sklearn.preprocessing to encode the data, converting it from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour prior to building out your code. From there you would just provide bands, for example if year > 1985, etc., within your for loop to specify the names.
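Putting those pieces together, here is a minimal sketch of one way to do the repeated conditional sampling. It uses group-frequency sampling rather than the OrdinalEncoder route described above, and the tiny data frame plus the decade-based year bands are illustrative assumptions, not the only valid choices:

    # Sketch: impute each missing NAME by sampling from the names seen in a
    # similar (year band, COLOR) group, repeat the imputation 30 times, and
    # collect per-name counts for a boxplot.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    df = pd.DataFrame({
        "YEAR":  [1900, 1901, 1902, 1902, 2020],
        "COLOR": ["Green", "Yellow", "Green", "Red", "Purple"],
        "NAME":  ["David", "Sarah", None, "Sarah", "John"],
    })
    df["BAND"] = (df["YEAR"] // 10) * 10          # crude decade bands (an assumption)

    known = df[df["NAME"].notna()]
    missing_idx = df.index[df["NAME"].isna()]

    def sample_name(row):
        # Prefer names seen with the same band and colour; fall back to the
        # overall name distribution when that group is empty.
        group = known[(known["BAND"] == row["BAND"]) & (known["COLOR"] == row["COLOR"])]
        pool = group["NAME"] if len(group) else known["NAME"]
        probs = pool.value_counts(normalize=True)
        return rng.choice(probs.index.to_numpy(), p=probs.to_numpy())

    runs = []
    for _ in range(30):                           # 30 repeated imputations, as in the question
        imputed = df["NAME"].copy()
        for i in missing_idx:
            imputed.loc[i] = sample_name(df.loc[i])
        runs.append(imputed.value_counts())

    counts = pd.DataFrame(runs).fillna(0)         # one row per run, one column per name
    print(counts.describe())                      # counts.boxplot() would draw the spread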

What test can I use to calculate significance of optimization lift?

Given the following data for 12 users:
username, number of deals for control, revenue from test, revenue from control
Here's an example of what the data looks like.
Can you help me figure out how to calculate the significance of the hypothesis that the test is more profitable (preferably using Excel)?
The measure I was thinking of using was the % of lift in revenues for each customer.
P.S. I have a background in statistics but I'm not an expert, so please keep it as simple as possible.
Since each pair of incomes refers to the same individual, you can perform a paired t-test.
Variable 1: Control income
Variable 2: Deals income
Then follow these instructions (copied here for posterity):
1. In Excel, click Data Analysis on the Data tab.
2. From the Data Analysis popup, choose t-Test: Paired Two Sample for Means.
3. Under Input, select the ranges for both Variable 1 and Variable 2.
4. In Hypothesized Mean Difference, you'll typically enter zero. This value is the null hypothesis value, which represents no effect. In this case, a mean difference of zero represents no difference between the two methods, which is no effect.
5. Check the Labels checkbox if you have meaningful variable labels in row 1. This option helps make the output easier to interpret. Ensure that you include the label row in step #3.
6. Excel uses a default Alpha value of 0.05, which is usually a good value. Alpha is the significance level. Change this value only when you have a specific reason for doing so.
7. Click OK.
Alternatively, you can indeed calculate the difference between the two incomes and then perform a one-sample t-test against a hypothesized mean difference of zero. However, such a test is not available out of the box in Excel; the procedure is described here.
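If Excel ever becomes limiting, the same paired t-test takes only a few lines in Python with SciPy; the revenue figures below are made-up placeholders, not the asker's data:

    from scipy import stats

    control = [120, 95, 143, 80, 77, 150, 101, 98, 130, 88, 92, 115]   # revenue, control
    test    = [130, 99, 150, 85, 70, 162, 110, 97, 141, 95, 99, 121]   # revenue, test

    # Paired test: the same 12 users appear in both lists, in the same order.
    t_stat, p_value = stats.ttest_rel(test, control)
    print(f"t = {t_stat:.3f}, two-sided p = {p_value:.4f}")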

Removing Lower/Upper Fence of outliers from input data to then be evaluated

What I have attempted:
AVERAGEIF(B11:V11,">+MEDIAN(B11:V11)")
What I am trying to do:
I would like to take the average of the upper half of the given data. Elaborating a bit more: I would like a formula that removes a given lower fence of outliers and then works with the data that remains. I would greatly prefer to keep this within a single-cell formula (not pulling results from formulas spread across multiple cells).
Update:
Following through, I found the solution... I think.
One thing I should have explained further:
The data coming in replicates a typical sqrt function.
What I wanted to achieve is to capture the mean of the "plateau" of the data.
The equation I used was:
=AVERAGEIF(B3:B62,">"&TRIMMEAN(B3:B62,0.8),B3:B62)
This was something I just copied and pasted; of course, "B3" and "B62" are significant only for my application.
My rough explanation of the equation:
TRIMMEAN(B3:B62,0.8) trims 80% of the data points (40% from each end) and returns the mean of the remaining middle 20%; AVERAGEIF then averages only the values greater than that trimmed mean. So for my application, this SHOULD give me a rough mean of the "plateau" of the data I would like to find the mean for.
This formula calculates the MEDIAN() of the range, then AVERAGEIF() only grabs the values that are greater than or equal to (>=) the median, giving you the average of the 'top half' of your values.
AVERAGEIF(A1:A10,">="&MEDIAN(A1:A10))
Hope this helps!
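For completeness, here is the same "average of the upper half" idea outside Excel, sketched with NumPy/SciPy; the sample values are placeholders, not the asker's data:

    import numpy as np
    from scipy import stats

    values = np.array([3, 5, 6, 8, 9, 10, 10, 11, 12, 12], dtype=float)

    median = np.median(values)
    print(values[values >= median].mean())   # mirrors AVERAGEIF(range,">="&MEDIAN(range))

    # Excel's TRIMMEAN(range, 0.8) keeps the middle 20% (trims 40% from each end);
    # averaging everything above that trimmed mean mirrors the update's formula.
    trimmed = stats.trim_mean(values, 0.4)
    print(values[values > trimmed].mean())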

Time-Series Process, Definition

The definition of a time series I was given is, roughly, "a sequence of random variables indexed by time".
Let's say, for example, there is data (say, monthly sales) collected per month over 20 years. How many random variables are there? Is it 12?
The phrase "sequence of random variables" is very confusing. Could someone please explain?
A sequence is just an ordered, countable set. As such, it can be indexed (i.e., mapped one-to-one and onto) by the integers. So a "sequence of FOO" is just a set of FOO such that you have FOO[1], FOO[2], FOO[3], ..., FOO[n] (and potentially extending through negative indices too).
A reasonable model for monthly sales is that each month's sales is a random variable, so in 20 years you have 240 variables. However, bear in mind that identifying each month as a separate variable is a modeling choice, not a mathematical choice. Depending on the problem to be solved, maybe each calendar month is a variable, so you have 12 variables and 20 observations for each variable. Or maybe all that matters is the annual sum, so effectively you have 20 variables. Whether or not any of these is appropriate, or none, can only be answered by considering the problem you are trying to solve -- there is no way to prove one modeling choice is better than another.
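To make the three indexings concrete, here is a small illustration with a synthetic 20-year monthly sales series (the numbers are random placeholders, not real data):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    months = pd.date_range("2000-01", periods=240, freq="MS")   # 240 months = 20 years
    sales = pd.Series(rng.poisson(100, size=240), index=months)

    print(len(sales))                     # 240 variables: one per month, observed once each
    by_month = sales.groupby(sales.index.month)
    print(by_month.ngroups, int(by_month.size().iloc[0]))   # 12 variables, 20 observations each
    by_year = sales.groupby(sales.index.year).sum()
    print(len(by_year))                   # 20 variables: one annual total per year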

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it would be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or a 95% confidence interval)".
The Wikipedia article explains exactly what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task, but I failed to find a suitable function.
Any advice about a built-in Excel function or function snippets is appreciated.
Thanks.
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are simply real numbers between 1 and 100 (they are percentage values). Each row represents a different parameter, and the values in a row represent each algorithm's result for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of all results Algorithm A produced is 10% higher than Algorithm B's. But I don't know whether this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B while Algorithm B has higher scores for all the rest, and the 10% difference in the averages is due to that one result alone.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want to do an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function, TTEST, which is what you need.
For your example you should probably use two tails and type 2.
The formula outputs a probability value known as the alpha error probability (a p-value). This is the error you would make if you assumed the two datasets are different when in fact they aren't. The lower the alpha error probability, the higher the chance your sets are different.
You should only accept that the two datasets differ if this value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per dataset to be reliable enough, and that the type 2 test assumes equal variances for the two datasets. If equal variances cannot be assumed, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
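For reference, here is an equivalent of Excel's =TTEST(range_A, range_B, 2, 2) sketched with SciPy; the scores below are invented placeholders, not the asker's results, and equal_var switches between Excel's type 2 and type 3 tests:

    from scipy import stats

    algo_a = [78, 85, 69, 90, 88, 74, 81, 93, 77, 86]   # % score per parameter setting
    algo_b = [70, 80, 65, 84, 82, 70, 75, 88, 72, 79]

    # equal_var=True matches Excel's "type 2" (pooled variance);
    # equal_var=False matches "type 3" (Welch's test for unequal variances).
    t_stat, p_value = stats.ttest_ind(algo_a, algo_b, equal_var=True)
    print(f"two-tailed p = {p_value:.4f}")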
