Calibrate with cph function (with external validation) - survival-analysis

I have two questions about using calibrate with the cph function.
My data have 5 independent variables (BMI through RT) and 2 dependent variables (Time, Event).
> head(data)
BMI Taxanes Surgery LND RT Event Time
1 19 0 0 2 5 0 98
2 20 0 0 3 3 0 97
3 21 0 0 8 2 0 17
4 18 0 0 1 3 0 35
5 20 1 0 3 1 0 27
6 20 1 0 2 3 1 2
> str(data)
$ BMI : num 19 20 21 18 20 20 20 ...
$ Taxanes: int 0 0 0 0 1 1 1 0 0 0 ...
$ Surgery: num 0 0 0 0 0 0 1 0 0 0 ...
$ LND : int 2 3 8 1 3 2 2 2 5 2 ...
$ RT : Factor w/ 7 levels "0","1","2","3",..: 5 3 2 3 1 3 ...
$ Event : int 0 0 0 0 0 1 0 0 0 0 ...
$ Time : num 98 97 17 35 27 2 22 ...
(1) I fitted a survival model to this data with cph, and I want to make a calibration plot from it. But I get the error 'Error in x(x) : argument "y" is missing, with no default'. I have searched through a lot of material, but I don't know the reason for this error. I looked up the calibrate function online, but I can't find any argument named 'y'. Please help me with this question.
> ddist <- datadist(data)
> options(datadist='ddist')
>
> fit = cph(Surv(Time,Event) ~ BMI + Surgery + Taxanes + RT + LND, data=data, x=TRUE, y=TRUE, surv=TRUE, dxy=TRUE, time.inc=36)
> plot(calibrate(fit))
Using Cox survival estimates at 36 Days
**Error in x(x) : argument "y" is missing, with no default**
(2) Eventually I want to do external validation of this cph model (fit).
If the new data set is called dat2 (with the same variables as data), what are the observed and predicted survival? I know that the predicted values can be calculated with this code:
val<-val.surv(fit, newdata=dat2, S=Surv(dat2$Time,dat2$Event))
But how do I get the actual (observed) survival in the new data (dat2)? Please help with this problem. Thank you so much in advance!

Related

Writing Function on Data Frame in Pandas

I have data in Excel with two columns, 'Peak Value' and 'Label'. I want to fill in the 'Label' column based on the 'Peak Value' column.
The input looks like this:
Peak Value 0 0 0 88 0 0 88 0 0 88 0
Label 0 0 0 0 0 0 0 0 0 0 0
Input
Whenever a value in 'Peak Value' is greater than zero, 'Label' should be set to 1 for that row and for all the zero rows below it. At the next value greater than zero, the label should increment to 2 and the zero rows below it should also get 2.
So, the output will look like this:
Peak Value 0 0 0 88 0 0 88 0 0 88 0
Label 0 0 0 1 1 1 2 2 2 3 3
Output
and so on....
I tried writing a function, but I am only able to set 1 where the value in 'Peak Value' is greater than 0.
def funct(row):
    if row['Peak Value'] > 0:
        val = 1
    else:
        val = 0
    return val
df['Label'] = df.apply(funct, axis=1)
Maybe you could try using cumsum and ffill:
import numpy as np
df['Labels'] = (df['Peak Value'] > 0).groupby(df['Peak Value']).cumsum()
df['Labels'] = df['Labels'].replace(0, np.nan).ffill().replace(np.nan, 0).astype(int)
Output:
Peak Value Labels
0 0 0
1 0 0
2 0 0
3 88 1
4 0 1
5 0 1
6 88 2
7 0 2
8 0 2
9 88 3
10 0 3
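For comparison, here is a minimal sketch of a shorter variant that takes a running count of the positive rows directly. It is not the answerer's code. One difference worth noting: because the version above groups by 'Peak Value', its counter restarts for each distinct peak value, whereas this sketch counts every positive row regardless of its value. The sample frame below simply reproduces the question's input.

```python
import pandas as pd

# Sample input copied from the question.
df = pd.DataFrame({"Peak Value": [0, 0, 0, 88, 0, 0, 88, 0, 0, 88, 0]})

# True wherever a peak occurs; the cumulative sum of those flags is the
# number of peaks seen so far, which is exactly the label for each row.
df["Label"] = (df["Peak Value"] > 0).cumsum()

print(df["Label"].tolist())
```

Rows before the first peak keep the label 0, so no ffill or replace step is needed.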

Splitting each column value into different columns [duplicate]

This question already has answers here:
Convert pandas DataFrame column of comma separated strings to one-hot encoded
(3 answers)
Closed 4 years ago.
I have a survey response sheet which has questions which can have multiple answers, selected using a set of checkboxes.
When I get the data from the response sheet and import it into pandas I get this:
Timestamp Sports you like Age
0 23/11/2013 13:22:30 Football, Chess, Cycling 15
1 23/11/2013 13:22:34 Football 25
2 23/11/2013 13:22:39 Swimming,Football 22
3 23/11/2013 13:22:45 Chess, Soccer 27
4 23/11/2013 13:22:48 Soccer 30
There can be any number of sport values in the sports column (further rows have basketball, volleyball, etc.), and there are still some other columns. I'd like to do statistics on the results of the question (how many people liked Football, etc.). The problem is that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.
Is there a simple way within Pandas to convert this sort of data frame into one where there are multiple columns called Sports-Football, Sports-Volleyball, Sports-Basketball, and each of those is boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this.
What I need is a new dataframe that looks like this (along with Age column) -
Timestamp Sports-Football Sports-Chess Sports-Cycling ....
0 23/11/2013 13:22:30 1 1 1
1 23/11/2013 13:22:34 1 0 0
2 23/11/2013 13:22:39 1 0 0
3 23/11/2013 13:22:45 0 1 0
I got this far but can't proceed further:
df['Sports you like'].str.split(r',\s*')
This splits the values apart, but the first position may hold any sport; I need a 1 in the first column only if the user likes Football, and a 0 otherwise.
The problem is the separator ,\s*, so the solution is to use str.split followed by str.join before str.get_dummies:
df1 = (df.pop('Sports you like').str.split(r',\s*')
         .str.join('|')
         .str.get_dummies()
         .add_prefix('Sports-'))
df = df.join(df1)
print (df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
Or use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = df.pop('Sports you like').str.split(r',\s*')
df1 = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index).add_prefix('Sports-')
print (df1)
Sports-Chess Sports-Cycling Sports-Football Sports-Soccer \
0 1 1 1 0
1 0 0 1 0
2 0 0 1 0
3 1 0 0 1
4 0 0 0 1
Sports-Swimming
0 0
1 0
2 1
3 0
4 0
df = df.join(df1)
print (df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
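Once the indicator columns exist, the count the question asks for (how many people liked each sport) is a column sum. The following is a minimal sketch under the assumption that the data looks like the five sample rows in the question; the frame is rebuilt here so that the snippet runs on its own.

```python
import pandas as pd

# Sample rows copied from the question (Timestamp omitted for brevity).
df = pd.DataFrame({
    "Sports you like": [
        "Football, Chess, Cycling",
        "Football",
        "Swimming,Football",
        "Chess, Soccer",
        "Soccer",
    ],
    "Age": [15, 25, 22, 27, 30],
})

# Same split / join / get_dummies chain as in the answer above.
dummies = (df["Sports you like"].str.split(r",\s*")
             .str.join("|")
             .str.get_dummies()
             .add_prefix("Sports-"))

# Each column holds 0/1 flags, so summing a column counts the respondents.
counts = dummies.sum().sort_values(ascending=False)
print(counts)
```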

Parsing table with streaks of binaries to select larger group element

I have a table like the following (only much longer):
# time binary frequency
0 2.1 0 0.65
1 3.2 1 0.72
2 5.8 0 0.64
3 7.1 0 0.63
4 9.5 1 0.72
5 14.1 1 0.74
6 21.5 0 0.62
7 27.3 0 0.61
8 29.5 1 1.00
9 32.1 1 1.12
10 35.5 1 0.99
I want to collect the times that correspond to binary == 1 only and, within each small group of consecutive ones, the time whose corresponding frequency value is highest. In the table above, this would result in:
times = 3.2, 14.1, 32.1
I am not sure how to handle the sequential structure of the table in the first place, and then how to compare the values within each group so that only the corresponding time is returned (and not, for example, the largest frequency). Time hides a periodicity, so I would rather avoid building another table with only the binary == 1 elements.
Having my time, binary, and frequency arrays, I can isolate relevant elements by:
condition = (binary == 1)
time1 = time[condition]
frequency1 = frequency[condition]
but I do not know how to proceed to isolate the various streaks. What useful functions can I use?
I don't know of any clever functions to use for this. Here's some code that will do the job. Please note that I removed the headers from your file.
binary is either zero or one, depending on whether the row's other values are to be included in a group. Initially in_group is set to False to indicate that no group has started.
As rows are read, when binary is zero and the code has been reading rows for a group (so in_group is True), in_group is set to False, because the zero means that group has come to an end. Since processing of the group has ended, it's time to print the result for it.
When binary is one and in_group is True, the code has already started processing rows for a group, so it checks whether the newest frequency is greater than what has been seen before. If so, it updates both rep_time and rep_frequency. If in_group is False, this is the first row of a new group, so in_group is set to True and the initial values of rep_time and rep_frequency are set.
with open('pyser.txt') as pyser:
    in_group = False
    for line in pyser:
        _, time, binary, frequency = [float(_) for _ in line.rstrip().split()]
        if binary == 0:
            if in_group:
                in_group = False
                print (rep_time)
        else:
            if in_group:
                if frequency > rep_frequency:
                    rep_time, rep_frequency = time, frequency
            else:
                in_group = True
                rep_time, rep_frequency = time, frequency
    if in_group:
        print (rep_time)
Output:
3.2
14.1
32.1
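The same streak logic can also be sketched without an explicit loop. The snippet below is an illustration rather than part of the original answer: it assumes the table is already in a DataFrame with the column names from the question, labels each run of equal binary values by counting value changes, and then picks the row with the highest frequency in each run of ones.

```python
import pandas as pd

# Table copied from the question.
df = pd.DataFrame({
    "time": [2.1, 3.2, 5.8, 7.1, 9.5, 14.1, 21.5, 27.3, 29.5, 32.1, 35.5],
    "binary": [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    "frequency": [0.65, 0.72, 0.64, 0.63, 0.72, 0.74, 0.62, 0.61, 1.00, 1.12, 0.99],
})

# A new streak starts wherever binary differs from the previous row;
# the cumulative sum of those change flags gives each streak its own id.
streak_id = (df["binary"] != df["binary"].shift()).cumsum()

# Keep only the rows inside streaks of ones, then find, per streak,
# the index label of the row with the largest frequency.
ones = df[df["binary"] == 1]
best_rows = ones.groupby(streak_id.loc[ones.index])["frequency"].idxmax()

# Look the times up in the original frame, so no separate table is needed.
times = df.loc[best_rows, "time"].tolist()
print(times)
```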
Edit: We seem to be using different definitions of the problem.
In the first group, we agree. But, in the second group, the maximum amplitude is about 4.07E-01, which corresponds to a time of about 5.4740E+04.
I've also written code in Pandas:
>>> import pandas as pd
>>> df = pd.read_csv('Gyd9P1rb.txt', sep=r'\s+', skiprows=2, header=None, names='Row TSTOP PSRTIME DETECTED FDOTMAX AMPLITUDE AMPLITUDE_ERR'.split())
>>> del df['Row']
>>> del df['TSTOP']
>>> del df['FDOTMAX']
>>> del df['AMPLITUDE_ERR']
>>> groups = []
>>> in_group = False
>>> group_number = 1
>>> for b in df['DETECTED']:
...     if b:
...         if not in_group:
...             group_number +=1
...             in_group = True
...         groups.append(group_number)
...     else:
...         groups.append(0)
...         in_group = False
...
>>> df['groups'] = pd.Series(groups, index=df.index)
>>> df.head()
PSRTIME DETECTED AMPLITUDE groups
0 54695.471283 1 0.466410 2
1 54698.532412 1 0.389607 2
2 54701.520814 1 0.252858 2
3 54704.557583 0 0.103460 0
4 54707.557563 0 0.088215 0
>>> gb = df.groupby(by=df['groups'])
>>> def f(x):
...     the_max = x['AMPLITUDE'].idxmax()
...     print ( x['groups'][the_max], x['PSRTIME'][the_max])
...
>>> gb.apply(f)
0 58064.3656376
0 58064.3656376
2 54695.4712834
3 54740.4917137
4 54788.477571
5 54836.472922
6 54881.4605511
7 54926.4664883
8 54971.4932866
9 55019.5021472
10 55064.5029133
11 55109.4948108
12 55154.414381
13 55202.488766
14 55247.4721132
15 55292.5301332
16 55340.4728542
17 55385.5229596
18 55430.5332147
19 55478.4812671
20 55523.4894451
21 55568.4626766
22 55616.4630348
23 55661.4969604
24 55709.4504634
25 55754.4711994
26 55799.4736923
27 55844.5050404
28 55892.4699313
29 55937.4721754
30 55985.4677572
31 56030.5119765
32 56075.5517149
33 56168.4447074
34 56213.507484
35 56306.5133063
36 56351.4943058
37 56396.579122
38 56441.5683651
39 56489.5321173
40 56534.4838082
41 56582.469025
42 56627.4135202
43 56672.4926625
44 56720.582296
45 56768.5232469
46 56813.4997925
47 56858.3890558
48 56903.5182596
49 56951.4892721
50 56996.5787435
51 57086.3948136
52 57179.5421833
53 57272.5059448
54 57362.452523
55 57635.5013047
56 57728.4925251
57 57773.5235416
58 57821.5390364
59 57866.5205882
60 57911.5590132
61 57956.5699637
62 58001.4331976
Empty DataFrame
Columns: []
Index: []
The results of the two methods are the same, up to differences in presentation precision.
I also created a small set of data that would give easily calculable results. This is it. The original program performed correctly.
0 -1 0 -1
1 0 1 2
2 -1 0 -1
3 -1 0 -1
4 0 1 0
5 1 1 1
6 -1 0 -1
7 -1 0 -1
8 -1 0 -1
9 0 1 4
10 1 1 3
11 2 1 2
12 -1 0 -1
13 -1 0 -1
14 -1 0 -1
15 -1 0 -1
16 0 1 0
17 1 1 1
18 2 1 2
19 3 1 3
20 -1 0 -1
21 -1 0 -1
22 -1 0 -1
23 -1 0 -1
24 -1 0 -1
25 0 1 6
26 1 1 5
27 2 1 4
28 3 1 3
29 4 1 2
30 -1 0 -1
31 -1 0 -1
32 -1 0 -1
33 -1 0 -1
34 -1 0 -1
35 -1 0 -1
36 0 1 0
37 1 1 1
38 2 1 2
39 3 1 3
40 4 1 4
41 5 1 5
41 -1 0 -1
41 -1 0 -1

Gnuplot draw logical gate output in time

I am working on a school project, which is a simulation of logic gates. I can implement and run the simulation with ease, but I need help with showing the output.
Right now, I print everything to the console, like this:
sample frequency: 50
###############################################
IN NOT(1) OUT
IN1:0 IN1:3 IN1:5
IN2:0 IN2:0 IN2:0
OUT:3 OUT:5 OUT:0
0 1 -1 -1
50 1 -1 -1
100 1 0 0
150 0 0 0
200 1 1 1
250 1 0 0
300 1 0 0
350 1 0 0 (IN = 1, delay is 1 so we can see
400 0 0 0 the correct output of NOT element in line 400 <-> 350 + 1*50)
450 1 1 1
500 1 0 0
550 1 0 0
600 1 0 0
650 0 0 0
700 0 1 1
750 1 1 1
800 1 0 0
850 1 0 0
900 1 0 0
950 1 0 0
1000 1 0 0
On the left is the simulation time (step). In each step, the values are printed out and a new set of inputs is generated.
A value of -1 means undefined output.
The 3rd row ( IN NOT(1) OUT ) means that there are 3 elements: 1 input, 1 NOT gate and an output. The value in brackets is the delay of the element, so an element with a delay value of X will show the correct output after X*sample_freq (excluding the 0 time).
The rows after that mean:
IN1 - the index of the node that is read as input 1
IN2 - the index of the node that is read as input 2
OUT - the index of the output node
In this situation, the IN is giving its output to node #3. The NOT element reads its input from node #3 and gives some output to node #5. The overall output of this system is the OUT element, which reads from #5.
Here is the file that specifies the topology:
3 (number of elems)
IN 0 0 3 (no inputs for input element obviously)
NOT 3 0 5 (reads from #3 and outputs to #5)
OUT 5 0 0 (reads from #5 and this is the end point of the system)
There can obviously be more elements, INs and OUTs, but let's stick to this for the sake of simplicity.
What I want to see as the result is: the x-axis shows the simulation time (0 - 1000, step is 50), the y-axis shows the output value of each element in the system, and the elements draw their output one above the other; see this picture as an example.
Can you tell me how to create a gnuplot script that transforms the output of my application into the desired plot?
Thank you!
OK, I have found a solution myself; here it is:
First, I had to transform the output of the app a bit, so that it looks like this:
0 1 2 4
49 1 2 4
50 1 2 4
99 1 2 4
100 0 2 4
149 0 2 4
150 0 3 5
199 0 3 5
200 1 3 5
249 1 3 5
250 1 2 4
299 1 2 4
300 0 2 5
349 0 2 5
350 1 3 5
399 1 3 5
400 0 2 4
449 0 2 4
450 1 3 5
499 1 3 5
The extra sim time steps make the edges look almost square. I also offset each column by 2 (added 0 to column #2, added 2 to column #3, added 4 to column #4 and so on), so that the signals are drawn one above the other. The simple command to plot this is:
plot 'out.txt' using 1:2 with lines, 'out.txt' using 1:3 with lines, 'out.txt' using 1:4 with lines
plus some set xtics, set ytics and other cosmetic stuff.
Now I have to deal with naming the lines with the names of the elements, and voila.
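The transformation described above (one extra row just before the next sample, plus a vertical offset of 2 per signal) can be sketched in a few lines. This is an illustration under assumptions, not the code used in the post: the function name to_step_rows and the (time, values) input shape are hypothetical, and the handling of undefined (-1) values is left out.

```python
def to_step_rows(samples, step, spacing=2):
    """Turn (time, [v1, v2, ...]) samples into rows suitable for plotting.

    Each sample is emitted twice, at t and at t + step - 1, so that the
    plotted line stays flat until just before the next sample. Signal i
    is shifted up by spacing * i so the traces are stacked.
    """
    rows = []
    for t, values in samples:
        shifted = [v + spacing * i for i, v in enumerate(values)]
        rows.append([t] + shifted)
        rows.append([t + step - 1] + shifted)
    return rows


# Hypothetical samples: (time, [IN, NOT, OUT]) with a sample step of 50.
samples = [(0, [1, 0, 0]), (50, [1, 0, 0]), (100, [0, 0, 0]), (150, [0, 1, 1])]
rows = to_step_rows(samples, step=50)

# Print whitespace-separated rows; redirect to out.txt for the plot command.
for row in rows:
    print(" ".join(str(v) for v in row))
```

Redirecting the printed rows to out.txt gives a file in the layout that the plot command above reads.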

String reformatting with levels, I am stuck

I have the string below that needs to be reformatted:
"itemStockDetailsMap_506540 = {\"506540_Navy\":{\"24DUMMY\":{\"count\":0.0,\"type\":2},\"18DUMMY\":{\"count\":0.0,\"type\":2},\"16DUMMY\":{\"count\":0.0,\"type\":2},\"8DUMMY\":{\"count\":0.0,\"type\":2},\"20DUMMY\":{\"count\":0.0,\"type\":2},\"10DUMMY\":{\"count\":0.0,\"type\":2},\"12DUMMY\":{\"count\":0.0,\"type\":2},\"22DUMMY\":{\"count\":0.0,\"type\":2},\"14DUMMY\":{\"count\":0.0,\"type\":2}},
\"506540_Mocha\":{\"20DUMMY\":{\"count\":0.0,\"type\":2},\"22DUMMY\":{\"count\":0.0,\"type\":2},\"10DUMMY\":{\"count\":0.0,\"type\":2},\"8DUMMY\":{\"count\":0.0,\"type\":2},\"12DUMMY\":{\"count\":0.0,\"type\":2},\"14DUMMY\":{\"count\":0.0,\"type\":2},\"16DUMMY\":{\"count\":0.0,\"type\":2},\"24DUMMY\":{\"count\":0.0,\"type\":2},\"18DUMMY\":{\"count\":0.0,\"type\":2}}
,\"506540_Grey\":{\"18DUMMY\":{\"count\":0.0,\"type\":2},\"12DUMMY\":{\"count\":0.0,\"type\":2},\"10DUMMY\":{\"count\":0.0,\"type\":2},\"20DUMMY\":{\"count\":0.0,\"type\":2},\"14DUMMY\":{\"count\":0.0,\"type\":2},\"22DUMMY\":{\"count\":0.0,\"type\":2},\"24DUMMY\":{\"count\":0.0,\"type\":2},\"16DUMMY\":{\"count\":0.0,\"type\":2},\"8DUMMY\":{\"count\":0.0,\"type\":2}}}"
I want to represent it as:
colour size count
Navy 18 0.0
Navy 8 0.0
......
Grey 10 0.0
........
Please guide me if there are any cool tricks to get this reformatted.
Many thanks,
Let's say your string is in variable a:
require(rjson)
a <- sub(".*?(\\{.*\\})", "\\1", a)
a <- fromJSON(a)
a <- do.call(rbind, a)
cbind.data.frame(
    colour = rep(sub("\\d+_(\\w+)", "\\1", rownames(a)), each=dim(a)[2]),
    size = as.numeric(rep(sub("(\\d+)\\w+", "\\1", colnames(a)), dim(a)[1])),
    count = as.numeric(unname(unlist(sapply(a, "[", "count")))))
# colour size count
# 1 Navy 24 2
# 2 Navy 18 1
# 3 Navy 16 0
# 4 Navy 8 0
Try the following:
install.packages("rjson")
# 'yourData' is the JSON string in the OP (without the leading "itemStockDetailsMap_506540 = " part, which is not valid JSON)
do.call(rbind, lapply(rjson::fromJSON(yourData), function(xx) do.call(rbind, xx)))
count type
24DUMMY 0 2
18DUMMY 0 2
16DUMMY 0 2
8DUMMY 0 2
20DUMMY 0 2
10DUMMY 0 2
12DUMMY 0 2
22DUMMY 0 2
14DUMMY 0 2
20DUMMY 0 2
22DUMMY 0 2
10DUMMY 0 2
8DUMMY 0 2
12DUMMY 0 2
14DUMMY 0 2
16DUMMY 0 2
24DUMMY 0 2
18DUMMY 0 2
18DUMMY 0 2
12DUMMY 0 2
10DUMMY 0 2
20DUMMY 0 2
14DUMMY 0 2
22DUMMY 0 2
24DUMMY 0 2
16DUMMY 0 2
8DUMMY 0 2
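For readers working in Python rather than R, the same reshaping can be sketched with the standard library. This is an illustration, not a translation of either answer: the function name stock_rows is hypothetical, and the input below is a shortened version of the string in the question with the backslash escapes already resolved.

```python
import json
import re

# Shortened sample in the same shape as the string in the question.
raw = ('itemStockDetailsMap_506540 = {"506540_Navy":{"24DUMMY":{"count":0.0,"type":2},'
       '"18DUMMY":{"count":0.0,"type":2}},'
       '"506540_Grey":{"10DUMMY":{"count":0.0,"type":2}}}')


def stock_rows(text):
    """Return (colour, size, count) tuples from the assignment-style string."""
    # Drop everything before the first "{" so that only valid JSON remains.
    data = json.loads(text[text.index("{"):])
    rows = []
    for key, sizes in data.items():
        colour = key.split("_", 1)[1]          # "506540_Navy" -> "Navy"
        for size_key, details in sizes.items():
            size = int(re.match(r"\d+", size_key).group())  # "24DUMMY" -> 24
            rows.append((colour, size, details["count"]))
    return rows


rows = stock_rows(raw)
for colour, size, count in rows:
    print(colour, size, count)
```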
