Kaplan Meier Estimator with a second dimension - statistics

I managed to implement the Kaplan-Meier estimator in a line chart in Qlik Sense,
like this.
To do that, I wrote this expression, which is a direct transcription of the KM estimator:
= if(RowNo() = 1, 1,
(1 - (count({<Analyse_Type = {'Churn'}>}%Key_Contract) /
count({<Analyse_Type = {'Parc'}>}%Key_Contract))) * above(Column(1))
)
Everything works fine, but I'd like to add a second dimension to the graph, and when I do that, the recursive above() seems to get muddled up.
I tried to aggregate the above() by my second dimension, but it is not working.
Does anyone have an idea how to do that? Or is there another way to write the Kaplan-Meier estimator without using recursion?

I found a solution to my issue.
I replaced the accumulated product (the recursive above()) with the mathematically equivalent
exp(rangeSum(log())). I aggregate the rangeSum by my second dimension, ordered by my first dimension (the interval), and everything works fine.
Here is the final expression of the Kaplan-Meier estimator:
exp(aggr(Rangesum(Above(log(fabs(
(1 - (count({<Analyse_Type = {'Churn'}>}%Key_Contract) / count({<Analyse_Type =
{'SurvivalParc'}>}%Key_Contract)))) ),0, Rowno()))
, REGION, (Delivered_Days_5, NUMERIC, ASCENDING)))
And here is the visual result:
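The identity behind this rewrite is that a running product equals the exponential of a running sum of logarithms, which can then be computed independently for each value of the second dimension. Below is a minimal Python sketch of that idea, not the Qlik implementation itself. The region names and the churn and at-risk counts are made up for illustration, and the sketch assumes every factor (1 - churn / at_risk) is strictly positive so that the logarithm is defined.

```python
import numpy as np

def km_survival(churn, at_risk):
    """Kaplan-Meier curve as exp(cumsum(log(1 - churn / at_risk)))."""
    churn = np.asarray(churn, dtype=float)
    at_risk = np.asarray(at_risk, dtype=float)
    factors = 1.0 - churn / at_risk           # per-interval survival factor
    return np.exp(np.cumsum(np.log(factors)))  # running product without recursion

# Hypothetical counts per interval, one series per region (second dimension).
data = {
    "North": ([2, 1, 3], [100, 98, 97]),
    "South": ([5, 0, 2], [50, 45, 45]),
}

# Each region gets its own accumulation, ordered by interval.
curves = {region: km_survival(c, n) for region, (c, n) in data.items()}

for region, curve in curves.items():
    print(region, np.round(curve, 4))
```

Because each region is accumulated separately, adding a second dimension does not disturb the running product of the first.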


importing data and fitting survival model

I am trying to import data from Stata to R and fit a survival model. I did the following:
library(haven)
data <- read_dta("C:/Users/user/Desktop/data.dta")
View(data)
install.packages(c("survival", "survminer"))
library("survival")
library("survminer")
It worked well. However, I got errors:
data("data")
Warning message:
In data("data") : data set ‘data’ not found
fit <- survfit(Surv(data$finaltime, data$GSTATUS_DTHCNS_KI) , data = data)
Error in survfit.Surv(Surv(data$finaltime, data$GSTATUS_DTHCNS_KI), data = data) :
the survfit function requires a formula as its first argument
I wonder if you can tell me how to fix this.
The issue is you aren't supplying a formula. As noted in the documentation for survfit one must now supply a formula:
Older releases of the code also allowed the specification for a
single curve to omit the right hand of the formula, i.e.,
survfit(Surv(time, status)), in which case the formula argument is not
actually a formula. Handling this case required some non-standard and
fairly fragile manipulations, and this case is no longer supported.
Here is an example of a fix, where ~ 1 would be replaced by the formula that fits your research question:
fit <- survfit(Surv(data$finaltime, data$GSTATUS_DTHCNS_KI) ~ 1, data = data)
summary(fit)
See help("survfit.formula") for more information.

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'A' to 'C'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow because each iteration scans the whole data frame. Running it once is manageable, but not inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as | rather than & operator is used).
Note also that I am aware of the existence of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators, however, I only want this comparison to be inclusive at the lower end.
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join(
    f'((A == {point["a"]}) & (B == {point["b"]}) '
    f'& (C >= {point["c"]}) & (C < {point["d"]}))'
    for point in points
)
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
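One possible direction, offered only as a sketch and not benchmarked against the loop above, is to put the selected points into their own DataFrame, join it to df on the two equality keys, and then apply the half-open range test on C in a single vectorized step. The helper name filter_by_points and the temporary column _pos are hypothetical. The sketch assumes each point carries the keys 'a', 'b', 'c' and 'd' and that the key dtypes match those of columns A and B. Whether it is actually faster depends on how many rows share each (A, B) pair, so it would need profiling on the real data.

```python
import numpy as np
import pandas as pd

def filter_by_points(df, points, columns):
    """Keep rows matching ANY point: A == a, B == b, c <= C < d."""
    if len(points) == 0:
        return df.iloc[:0][columns]
    # Only the four keys used by the filter are kept from each point dict.
    pts = pd.DataFrame(points, columns=["a", "b", "c", "d"])
    # Remember each row's position so matches can be mapped back to df.
    left = df[["A", "B", "C"]].assign(_pos=np.arange(len(df)))
    merged = left.merge(pts, left_on=["A", "B"], right_on=["a", "b"], how="inner")
    in_range = (merged["C"] >= merged["c"]) & (merged["C"] < merged["d"])
    # A row matched by several points appears several times; the mask dedupes it.
    mask = np.zeros(len(df), dtype=bool)
    mask[merged.loc[in_range, "_pos"].to_numpy()] = True
    return df.loc[mask, columns]

# Small made-up example.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [1, 1, 2, 3],
    "C": [0.5, 1.5, 2.5, 3.5],
    "X": [10, 20, 30, 40],
})
points = [
    {"a": 1, "b": 1, "c": 1.0, "d": 2.0},
    {"a": 2, "b": 2, "c": 2.5, "d": 3.0},
]
filtered = filter_by_points(df, points, ["A", "X"])
print(filtered)
```

The join replaces the per-point scan of the whole frame, and the result keeps the original row order and supports column selection, which the query-string approach above does not.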

How to create a first derivative calculated column in spotfire

I am trying to use calculated columns in Spotfire to calculate the first derivative of y with respect to x for individual time series IDs (z).
My data looks like this:
x,y,z
0,0,A
1,1,A
2,1.5,A
3,1,A
4,1,A
5,.9,A
6,.5,A
7,.1,A
8,1.1,A
9,11,A
1,1,B
2,1.5,B
3,1,B
4,1,B
5,.9,B
6,.5,B
7,.1,B
8,1.1,B
9,11,B
10,12,B
I was using this:
([y] - Min([y]) OVER (Previous([x])))
/
([x] - Min([x]) OVER (Previous([x])))
but (1) it doesn't seem right, and (2) how do I then do this OVER every [z]?
This should work:
([y] - min([y]) Over (Intersect([z],Previous([x])))) / ([x] - min([x]) Over (Intersect([z],Previous([x]))))
However, the first point is going to be blank for each z, and the result might not be very stable if your data has lots of oscillations. For more sophisticated options, you could look into splines (see a number of SO answers) and using a TERR function (not a data function, but the functions starting with TERR_, if they are available to you) for the calculated column.
Gaia
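For comparison outside Spotfire, the same per-series finite difference can be sketched in Python with pandas. This only illustrates the calculation on a subset of the sample data above and is not a Spotfire expression. As in the answer, the first point of each series has no previous row, so its derivative is blank (NaN).

```python
import pandas as pd

# A subset of the sample data from the question.
df = pd.DataFrame({
    "x": [0, 1, 2, 3, 1, 2, 3],
    "y": [0, 1, 1.5, 1, 1, 1.5, 1],
    "z": ["A", "A", "A", "A", "B", "B", "B"],
})

# Order by series and x so that "previous" means the previous x within each z.
df = df.sort_values(["z", "x"])
grouped = df.groupby("z")

# (y - previous y) / (x - previous x), restarted for every z.
df["dydx"] = grouped["y"].diff() / grouped["x"].diff()
print(df)
```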

code produces a 2d histogram but the results dont match with hist2d

I am trying to write a histogram builder to construct a 2D histogram for my assignment work. This is my code:
import numpy as np

def Build2DHistogramClassifier(X1, X2, T, B, x1min, x1max, x2min, x2max):
    HF = np.zeros((B, B), dtype='int')  # initialising an empty array of integer type
    HM = np.zeros((B, B), dtype='int')
    # this logic decides which bin the value goes into
    bin_row_indices = (np.round((B - 1) * (X1 - x1min) / (x1max - x1min))).astype('int32')
    # np.round --> applies the formula to all the values in the array
    bin_column_indices = (np.round((B - 1) * (X2 - x2min) / (x2max - x2min))).astype('int32')
    # enumerate --> given an array or list, it yields each item with its index/count i
    for i, (r, c) in enumerate(zip(bin_row_indices, bin_column_indices)):
        if T[i] == 'Female':
            HF[r, c] += 1
        else:
            HM[r, c] += 1
    return [HF, HM]
The problem is that the results (the count in each bin) I am getting do not match what I get from using the hist2d function in numpy (I passed the same bin size).
I am sorry if my code is not in the right format. Please click on the hyperlink to a gist I created with the same code.
What is the mistake in my code?
How do I correct it?
Thanks
By rounding when assigning to bins you are treating the bins as bin centers. The numpy convention is to use them as bin edges.
Remove the two calls to round() from your code and change B-1 to B. You should now get the same results with your function and with np.histogram2d.
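A minimal sketch of the edge-based binning described above, checked against np.histogram2d on made-up random data. The helper name bin_indices is hypothetical. One detail goes beyond the two changes listed: a value exactly equal to the maximum would land in bin B, so the index is clipped to B - 1, which matches numpy's convention of closing the last bin on the right. The sketch assumes all data lie within [min, max]; np.histogram2d would drop values outside the range, whereas the clip would not.

```python
import numpy as np

def bin_indices(x, xmin, xmax, B):
    """Map values to bins 0..B-1, treating the B+1 grid points as bin edges."""
    idx = np.floor(B * (x - xmin) / (xmax - xmin)).astype(int)
    return np.clip(idx, 0, B - 1)  # x == xmax belongs to the last bin

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 1000)
x2 = rng.uniform(-5, 5, 1000)
B = 8

# Count occurrences of each (row, column) bin pair.
H = np.zeros((B, B), dtype=int)
np.add.at(H, (bin_indices(x1, 0, 10, B), bin_indices(x2, -5, 5, B)), 1)

H_ref, _, _ = np.histogram2d(x1, x2, bins=B, range=[[0, 10], [-5, 5]])
print(np.array_equal(H, H_ref.astype(int)))
```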

Tensorflow clamp values outside specific range

I have been using TensorFlow to implement a convolutional neural network.
I have a requirement that the output values be less than a given value MAX_VAL.
I tried creating a matrix filled with MAX_VAL and then using tf.select and tf.greater:
filled = tf.fill(output.get_shape(),MAX_VAL)
modoutput = tf.select(tf.greater(output, filled), filled, output)
But this doesn't work because the shape of output is not known statically:
it is [?, 30], and tf.fill requires an explicit shape.
Any idea how I can implement this?
There is an alternative solution that uses tf.fill() like your initial version. Instead of using Tensor.get_shape() to get the static shape of output, use the tf.shape() operator to get the dynamic shape of output when the step runs:
output = ...
filled = tf.fill(tf.shape(output), MAX_VAL)
modoutput = tf.select(tf.greater(output, filled), filled, output)
(Note also that the tf.clip_by_value() operator might be useful for your purposes.)
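As a rough illustration of the clipping route, here is a sketch that assumes a TensorFlow 2.x installation running in eager mode, which is newer than the API used in the snippets above. MAX_VAL and the tensor values are made up. tf.minimum caps only the upper end, while tf.clip_by_value needs both a lower and an upper bound.

```python
import tensorflow as tf

MAX_VAL = 5.0  # hypothetical upper bound
output = tf.constant([[1.0, 7.5, 5.0], [9.0, -8.0, 4.5]])

# Element-wise cap at MAX_VAL; the scalar is broadcast, so no filled tensor is needed.
capped = tf.minimum(output, MAX_VAL)

# clip_by_value bounds both ends; the lower bound here is an arbitrary choice.
clipped = tf.clip_by_value(output, clip_value_min=-MAX_VAL, clip_value_max=MAX_VAL)

print(capped.numpy())
print(clipped.numpy())
```

Both calls broadcast the scalar bound against output, so they do not depend on the static shape being known.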
I figured out a way to do it.
Instead of using tf.fill, I used tf.ones_like:
filled = MAX_VAL*tf.ones_like(output)
modoutput = tf.select(tf.greater(output, filled), filled, output)
Please mention it if there is a faster or better way to do this.
