This is odd . . .
I have a Chart with ten sets of "x-y" data.
But when I try to add two more, Excel ignores the first column and instead treats it as a sequential index, i.e.
Instead of
4.41 1103
6.91 1262
7.06 1275
7.89 1329
it treats it as
1 1103
2 1262
3 1275
4 1329
How do I prevent this?
I was following this question to address a similar situation:
How to Calculate Area Under the Curve in Spotfire?
My data is in the following format:
PLANT  OBS_DATE_RECORDED  TRAIT_VALUE  period
A      3/16/2021          225          A3/16/2021
A      3/23/2021          227          A3/23/2021
A      3/30/2021          220          A3/30/2021
A      4/7/2021           240          A4/7/2021
A      4/13/2021          197          A4/13/2021
A      4/20/2021          197          A4/20/2021
A      4/27/2021          218          A4/27/2021
B      3/16/2021          253          B3/16/2021
B      3/23/2021          274          B3/23/2021
B      3/30/2021          271          B3/30/2021
B      4/7/2021           257          B4/7/2021
B      4/13/2021          250          B4/13/2021
A      4/20/2021          241          A4/20/2021
B      4/27/2021          255          B4/27/2021
Following the answer's formula as a calculated column:
([TRAIT_VALUE] + Avg([TRAIT_VALUE]) over (Intersect(NextPeriod([period]),[PLANT]))) / 2 * (Avg([OBS_DATE_RECORDED]) over (Intersect(NextPeriod([period]),[PLANT])) - [OBS_DATE_RECORDED])
However, the results don't appear correct.
AUDPC
1603.19:59:59.928
1608.17:59:59.956
2924.20:0:0.100
7732.21:0:0.000
1395.14:41:44.404
1461.23:30:0.050
-4393.7:59:59.712
I think the problem might be the date format, but I don't understand the formula well enough to troubleshoot. In Excel I usually compute the AUDPC with SUMPRODUCT, multiplying the days between two dates by the average TRAIT_VALUE between those two dates.
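The trailing h:mm:ss fragments in those AUDPC values suggest the date subtraction is returning a TimeSpan rather than a number of days, so the result mixes the trait average with a time residue. As a reference for what the numbers should look like, here is a minimal Python sketch of the same trapezoidal AUDPC computed in Excel with SUMPRODUCT (illustration only, not Spotfire syntax):

```python
from datetime import date

def audpc(dates, values):
    """Trapezoidal AUDPC: for each pair of consecutive observations,
    the average trait value times the number of days between them."""
    total = 0.0
    for (d0, v0), (d1, v1) in zip(zip(dates, values), zip(dates[1:], values[1:])):
        total += (v0 + v1) / 2 * (d1 - d0).days
    return total

# Plant A, first three observations from the table above
print(audpc([date(2021, 3, 16), date(2021, 3, 23), date(2021, 3, 30)],
            [225, 227, 220]))  # (225+227)/2*7 + (227+220)/2*7 = 3146.5
```

If the TimeSpan really is the culprit, converting the date difference to a whole-day count before multiplying should bring the Spotfire column in line with values like these.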
I need to make a diagram showing the curves of different ceramic firing schedules. I want them plotted in one diagram on a common time axis, so that the different durations are represented correctly. I haven't been able to achieve this.
What I have is the following:
First table:
Pendelen  Temp. per uur  Stooktemp.  Stooktijd 4  Stooktijd Cum.4
          95             120         1:15:47      1,26
          205            537         2:02:03      3,30
          80             620         1:02:15      4,33
          150            1075        3:02:00      7,37
          50             1196        2:25:12      9,79
          10             1196        0:10:00      9,95
Total                                9:57:17
Second table:
Pendelen  Temp. per uur  Stooktemp.  Stooktijd 5  Stooktijd Cum.5
          140            540         3:51:26      3,86
          65             650         1:41:32      5,55
          140            1095        3:10:43      8,73
          50             1222        2:32:24      11,27
Total                                11:16:05
The lines in the diagram should represent the 'stooktijd cum.' (cumulative firing time) for both programs 4 and 5, i.e. the accumulated time the kiln needs to step up from each previous temperature in the schedule. The diagram should make it visible that program 5 takes more time to reach its end temperature.
What I achieved is nothing more than a diagram with two lines, but plotted only against the 'stooktijd cum.4' points of program 4. The image shows a screenshot of this diagram.
But as you can see, this doesn't show that program 5 takes more time to reach its end. I would like it to show something like this:
Create this table:
p4            p5
0     0       0      10
1.26  120     3.86   540
3.3   537     5.55   650
4.33  620     8.73   1095
7.37  1075    11.27  1222
9.79  1196
9.95  1196
Select all > F11 > Design > Change Chart Type > Scatter with Straight Lines and Markers
Here's my try-out:
Please share whether it works. (:
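Whichever tool draws the chart, the key point is that the x-values must be cumulative hours, not row indices. A small Python sketch (assuming h:mm:ss step durations, as in the tables above) shows how the 'stooktijd cum.' x-values are derived:

```python
def to_hours(hms):
    """Convert an 'h:mm:ss' duration string to decimal hours."""
    h, m, s = (int(p) for p in hms.split(":"))
    return h + m / 60 + s / 3600

def cumulative(steps):
    """Running total of step durations, starting at 0 -- these are the
    x-values each schedule's line should be plotted against."""
    out, total = [0.0], 0.0
    for step in steps:
        total += to_hours(step)
        out.append(total)
    return out

p4 = cumulative(["1:15:47", "2:02:03", "1:02:15", "3:02:00", "2:25:12", "0:10:00"])
p5 = cumulative(["3:51:26", "1:41:32", "3:10:43", "2:32:24"])
print(round(p4[-1], 2), round(p5[-1], 2))  # 9.95 11.27 -- matches the tables
```

Pairing these x-values with the corresponding temperatures and plotting both series as an XY scatter chart, as in the answer above, makes the longer duration of program 5 visible.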
Let's assume that we have a survfit object as follows.
fit = survfit(Surv(data$time_12m, data$status_12m) ~ data$group)
fit
Call: survfit(formula = Surv(data$time_12m, data$status_12m) ~ data$group)
n events median 0.95LCL 0.95UCL
data$group=HF 10000 3534 NA NA NA
data$group=IGT 70 20 NA NA NA
The fit object does not show CIs. How can I calculate confidence intervals for the survival rates? Which R packages and code should be used?
The print result of survfit gives confidence intervals by group for median survival time. I'm guessing the NAs for the estimated median times occur because your groups don't have enough events to actually reach a median survival. You should show the output of plot(fit) to see whether my guess is correct.
You might try to plot the KM curves, noting that the plot.survfit function does have a confidence interval option constructed around proportions:
plot(fit, conf.int=0.95, col=1:2)
Please read ?summary.survfit. It is one of the generic summary functions typically used by package authors to deliver parameter estimates and confidence intervals. There you will see that it is not "rates" that are summarized by summary.survfit, but rather estimates of survival proportion. These estimates can either be medians (in which case the estimate is on the time scale) or estimates at particular times (in which case they are proportions).
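For intuition, the per-time standard errors that summary.survfit reports come from Greenwood's variance formula. A minimal sketch (in Python rather than R, and with plain-scale intervals; note that survfit's default confidence intervals are computed on the log scale):

```python
import math

def km_greenwood(steps):
    """Kaplan-Meier estimates with Greenwood standard errors.
    steps: (time, n_at_risk, n_events) tuples in time order.
    Returns (time, survival, std_err) tuples."""
    surv, gw, out = 1.0, 0.0, []
    for t, n, d in steps:
        surv *= 1 - d / n          # KM product-limit estimate
        gw += d / (n * (n - d))    # Greenwood variance accumulator
        out.append((t, surv, surv * math.sqrt(gw)))
    return out

# First two event times of rx=1 from the ovarian output below
for t, s, se in km_greenwood([(59, 13, 1), (115, 12, 1)]):
    print(t, round(s, 3), round(se, 4))  # 59 0.923 0.0739 / 115 0.846 0.1001
```

The survival and std.err columns reproduce the rx=1 rows of the summary(fit) output shown later in this answer.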
If you actually do want rates then use a function designed for that sort of model, perhaps ?survreg. Compare what you get from survreg versus survfit on the supplied dataset ovarian:
> reg.fit <- survreg( Surv(futime, fustat)~rx, data=ovarian)
> summary(reg.fit)
Call:
survreg(formula = Surv(futime, fustat) ~ rx, data = ovarian)
Value Std. Error z p
(Intercept) 6.265 0.778 8.05 8.3e-16
rx 0.559 0.529 1.06 0.29
Log(scale) -0.121 0.251 -0.48 0.63
Scale= 0.886
Weibull distribution
Loglik(model)= -97.4 Loglik(intercept only)= -98
Chisq= 1.18 on 1 degrees of freedom, p= 0.28
Number of Newton-Raphson Iterations: 5
n= 26
#-------------
> fit <- survfit( Surv(futime, fustat)~rx, data=ovarian)
> summary(fit)
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
rx=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 13 1 0.923 0.0739 0.789 1.000
115 12 1 0.846 0.1001 0.671 1.000
156 11 1 0.769 0.1169 0.571 1.000
268 10 1 0.692 0.1280 0.482 0.995
329 9 1 0.615 0.1349 0.400 0.946
431 8 1 0.538 0.1383 0.326 0.891
638 5 1 0.431 0.1467 0.221 0.840
rx=2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
353 13 1 0.923 0.0739 0.789 1.000
365 12 1 0.846 0.1001 0.671 1.000
464 9 1 0.752 0.1256 0.542 1.000
475 8 1 0.658 0.1407 0.433 1.000
563 7 1 0.564 0.1488 0.336 0.946
Might have been easier if I had used "exponential" instead of "weibull" as the distribution type. Exponential fits have a single estimated parameter and are more easily back-transformed to give estimates of rates.
Note: I answered an earlier question about survfit, although the request was for survival times rather than for rates. Extract survival probabilities in Survfit by groups
I have a huge data file, and I hope to swap some numbers in the 2nd column only, in the file format below. The file contains 25,000,000 datasets with 8,768 lines each.
%% Edited: shorter 10-line example. Sorry for the inconvenience. This is one typical data block.
# Dataset 1
#
# Number of lines 10
#
# header lines
5 11 3 10 120 90 0 0.952 0.881 0.898 2.744 0.034 0.030
10 12 3 5 125 112 0 0.952 0.897 0.905 2.775 0.026 0.030
50 10 3 48 129 120 0 1.061 0.977 0.965 3.063 0.001 0.026
120 2 4 5 50 186 193 0 0.881 0.965 0.899 0.917 3.669 0.000 -0.005
125 3 4 10 43 186 183 0 0.897 0.945 0.910 0.883 3.641 0.000 0.003
186 5 4 120 125 249 280 0 0.899 0.910 0.931 0.961 3.727 0.000 -0.001
193 6 4 120 275 118 268 0 0.917 0.895 0.897 0.937 3.799 0.000 0.023
201 8 4 278 129 131 280 0 0.921 0.837 0.870 0.934 3.572 0.000 0.008
249 9 4 186 355 179 317 0 0.931 0.844 0.907 0.928 3.615 0.000 0.008
280 10 4 186 201 340 359 0 0.961 0.934 0.904 0.898 3.700 0.000 0.033
#
# Dataset 1
#
# Number of lines 10
...
As you can see, there are 7 repeating header lines at the top and 1 trailing line at the end of each dataset, all beginning with #. As a result, each data block has 7 header lines, 8768 data lines, and 1 trailing line, 8776 lines in total. The trailing line contains only a single '#'.
I want to swap some numbers in 2nd columns only. First, I want to replace
1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222
of the 2nd column, and then,
666 => 6
333 => 3
222 => 2
of the 2nd column. I want to apply this replacement to every repeating dataset.
I tried this in Python, but the file is so big that I ran into a memory error. How can I perform this swap with Linux commands like sed, awk, or cat?
Thanks
Best,
This might work for you, but you'd have to use GNU awk, as it uses the gensub function and $0 reassignment.
Put the following into an executable awk file ( like script.awk ):
#!/usr/bin/awk -f
BEGIN {
a[1] = a[9] = a[10] = a[11] = 6
a[2] = a[6] = a[7] = a[8] = 3
a[3] = a[4] = a[5] = 2
}
function swap( c2, val ) {
val = a[c2]
return( val=="" ? c2 : val )
}
/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }
47 # print the line
Here's the breakdown:
BEGIN - set up an array a with mappings of the new values.
create a user-defined function swap to provide values for the 2nd column from the a array, or the value itself when there is no mapping. The c2 element is passed in, while val is a local variable ( because no 2nd argument is passed in ).
when a line starts with a space followed by a number and a space (the pattern), use gensub to replace the first occurrence of the first number pattern with itself concatenated with a space and the return from swap (the action). In this case, I'm using gensub's replacement text to preserve the first column data. The second column is passed to swap using the field identifier $2. Using gensub should preserve the formatting of the data lines.
47 - an expression that evaluates to true provides the default action of printing $0, which for data lines might have been modified. Any line that wasn't "data" will be printed out here w/o modifications.
The provided data doesn't show all the cases, so I made up my own test file:
# 2 skip me
9 2 not going to process me
1 1 don't change the for matting
2 2 4 23242.223 data
3 3 data that's formatted
4 4 7 that's formatted
5 5 data that's formatted
6 6 data that's formatted
7 7 data that's formatted
8 8 data that's formatted
9 9 data that's formatted
10 10 data that's formatted
11 11 data that's formatted
12 12 data that's formatted
13 13 data that's formatted
14 s data that's formatted
# some other data
Running the executable awk (like ./script.awk data) gives the following output:
# 2 skip me
9 2 not going to process me
1 6 don't change the for matting
2 3 4 23242.223 data
3 2 data that's formatted
4 2 7 that's formatted
5 2 data that's formatted
6 3 data that's formatted
7 3 data that's formatted
8 3 data that's formatted
9 6 data that's formatted
10 6 data that's formatted
11 6 data that's formatted
12 12 data that's formatted
13 13 data that's formatted
14 s data that's formatted
# some other data
which looks alright to me, but I'm not the one with 25 million datasets.
You'd also most definitely want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file, perhaps like:
head -n 26328 data | ./script.awk - > tempfile
You can learn more about the elements used in this script here:
awk basics (the man page)
Arrays
User defined functions
String functions - gensub()
And of course, you should spend some quality time reviewing awk related questions and answers on Stack Overflow ;)
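As an aside on the original Python attempt: the memory error almost certainly came from reading the whole file at once. Iterating over the file line by line keeps memory use constant. A sketch of the same column-2 swap (my own regex-based take, assuming whitespace-separated columns as in the sample, and collapsing the two-step 666/333/222 indirection into its net effect):

```python
import re

# Net mapping for the 2nd column: 1,9,10,11 -> 6; 2,6,7,8 -> 3; 3,4,5 -> 2.
SWAP = {**dict.fromkeys(("1", "9", "10", "11"), "6"),
        **dict.fromkeys(("2", "6", "7", "8"), "3"),
        **dict.fromkeys(("3", "4", "5"), "2")}

def transform(line):
    """Replace the 2nd whitespace-separated field on data lines,
    leaving '#' header/trailer lines and surrounding spacing untouched."""
    if line.lstrip().startswith("#"):
        return line
    m = re.match(r"^(\s*\S+\s+)(\S+)", line)
    if not m or m.group(2) not in SWAP:
        return line
    return m.group(1) + SWAP[m.group(2)] + line[m.end():]

def process(src, dst):
    for line in src:          # streams one line at a time
        dst.write(transform(line))
```

Called as process(sys.stdin, sys.stdout), this behaves much like the awk script above without ever loading the file into memory.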
I have a bunch of data, ordered below one another in Excel (actually OpenOffice). I need it consolidated by the first column, but so that all data are still shown:
From (actually goes until 100):
1 283.038 244
1 279.899 494
1 255.139 992
1 254.606 7329
1 254.5 17145
1 251.008 23278
1 250.723 28758
1 247.753 92703
1 243.43 315278
1 242.928 485029
1 237.475 1226549
1 233.851 2076295
1 232.833 9826327
1 229.656 15965410
1 229.656 30000235
2 286.535 231
2 275.968 496
2 267.927 741
2 262.647 2153
2 258.925 3130
2 253.954 4857
2 249.551 9764
2 244.725 36878
2 243.825 318455
2 242.86 921618
2 238.401 1405028
2 234.984 3170031
2 233.168 4403799
2 229.317 8719139
2 224.395 26986035
2 224.395 30000056
3 269.715 247
3 268.652 469
3 251.214 957
3 249.04 30344
3 245.883 56115
3 241.753 289668
3 241.707 954750
3 240.684 1421766
3 240.178 1865750
3 235.09 2626524
3 233.579 5129755
3 232.517 7018880
3 232.256 18518741
3 228.75 19117443
3 228.75 30000051
to:
1 2 3
283.038 244 286.535 231 269.715 247
279.899 494 275.968 496 268.652 469
255.139 992 267.927 741 251.214 957
254.606 7329 262.647 2153 249.04 30344
254.5 17145 258.925 3130 245.883 56115
251.008 23278 253.954 4857 241.753 289668
250.723 28758 249.551 9764 241.707 954750
247.753 92703 244.725 36878 240.684 1421766
243.43 315278 243.825 318455 240.178 1865750
242.928 485029 242.86 921618 235.09 2626524
237.475 1226549 238.401 1405028 233.579 5129755
233.851 2076295 234.984 3170031 232.517 7018880
232.833 9826327 233.168 4403799 232.256 18518741
229.656 15965410 229.317 8719139 228.75 19117443
229.656 30000235 224.395 26986035 228.75 30000051
224.395 30000056
This must be really simple, but I couldn't find it. I tried a pivot table, but that only allows me to summarise or count the fields, while I want them all to be displayed. Any ideas?
To elaborate on the pivot table: I put column 1 as row, column 2 as column and column 3 in the middle, but that comes out summarised and with a lot of empty cells.
I am not sure which search terms to use; searching for unconsolidated pivot tables hasn't provided an answer.
After some discussion with my colleagues, this seemed undoable in Excel. So I just created a short script to save each "run" (the first column) in a separate file, from csv:
awk -F, '{print > $1}' rw.txt
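If the per-run files are only an intermediate step, the side-by-side layout can also be built directly. A small Python sketch (a hypothetical helper, assuming the rows are already split into (run, value1, value2) tuples):

```python
from itertools import zip_longest

def consolidate(rows):
    """Group (run, x, y) rows by run and lay the groups out side by side,
    padding shorter runs with blanks -- the layout asked for above."""
    groups = {}                      # dict preserves insertion order (3.7+)
    for run, x, y in rows:
        groups.setdefault(run, []).append((x, y))
    table = [list(groups)]           # header row: the run ids
    for combo in zip_longest(*groups.values(), fillvalue=("", "")):
        table.append([v for pair in combo for v in pair])
    return table

rows = [("1", "283.038", "244"), ("1", "279.899", "494"),
        ("2", "286.535", "231"), ("2", "275.968", "496"),
        ("3", "269.715", "247")]
for r in consolidate(rows):
    print("\t".join(r))
```

zip_longest pads runs of unequal length (like run 2's extra row in the example data) with empty cells instead of dropping them.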