How to convert PPER format to SPELL format for a CoxPH model - survival-analysis

CoxPH Survival Analysis
I have a data set in PPER (person-period) format like:
Machine_id,Timestamp,Event,TDV1,TDV2,TDV3,TDV4
TDV1/2 are factors (brand, location); TDV3/4 are continuous (temperature, humidity).
I need to convert it to SPELL format like:
Machine_id,start.time,stop.time,event,TDV1,TDV2,TDV3,TDV4
I was able to convert from SPELL to PPER by using seqdef() and toPersonPeriod() in TraMineRextras.
Help is needed to do the reverse. Also, how should continuous variables be treated when going from PPER to SPELL format?

Person-period data is just a special case of spell data in which the start and end times are identical. You get spell data by duplicating your Timestamp variable and renaming the first copy start.time and the second stop.time.
You can then aggregate your records, passing a list such as list(Machine_id, event, TDV1, TDV2, TDV3, TDV4) as the by argument. Proceeding twice, once with FUN="min" and once with FUN="max", you get the start and end times of the spells over which event and the covariates (including the continuous ones) stay unchanged.
I illustrate with an example:
## creating example data
p.df <- data.frame(scan(what=list(Id=0, timestamp=0, event="", work="", income=0)))
1 2000 S working 100
1 2001 S working 100
1 2002 M working 100
1 2003 M working 100
1 2004 M jobless 80
1 2005 M jobless 70
2 2000 S jobless 10
2 2001 S working 100
2 2002 S working 100

## (the blank line above ends scan)
p.df$start <- p.df$timestamp
p.df$end <- p.df$timestamp
p.df <- p.df[,-2] ## deleting timestamp variable
bylist <- list(id = p.df$Id, event = p.df$event,
               work = p.df$work, income = p.df$income)
spell1 <- aggregate(p.df[,c("start","end")], by=bylist, FUN="min")
spell2 <- aggregate(p.df[,c("start","end")], by=bylist, FUN="max")
## reordering columns
spell <- spell1[,c(1,5,6,2,3,4)]
spell[,3] <- spell2[,6] ## taking end value from spell2
spell <- spell[order(spell$id,spell$start),] ## sorting rows
spell
## id start end event work income
## 5 1 2000 2001 S working 100
## 4 1 2002 2003 M working 100
## 3 1 2004 2004 M jobless 80
## 2 1 2005 2005 M jobless 70
## 1 2 2000 2000 S jobless 10
## 6 2 2001 2002 S working 100
Hope this helps.

Related

Trying to sort two different columns of a text file (one asc, one desc) in the same awk script

I have tried to do it separately, and I am getting the right result, but I need help to combine the two.
This is the data file:
maruti swift 2007 50000 5
honda city 2005 60000 3
maruti dezire 2009 3100 6
chevy beat 2005 33000 2
honda city 2010 33000 6
chevy tavera 1999 10000 4
toyota corolla 1995 95000 2
maruti swift 2009 4100 5
maruti esteem 1997 98000 1
ford ikon 1995 80000 1
honda accord 2000 60000 2
fiat punto 2007 45000 3
I am using this script to sort by first field:
BEGIN { print "========Sorted Cars by Maker========" }
{ arr[$1] = $0 }
END {
    PROCINFO["sorted_in"] = "#val_str_desc"
    for (i in arr) print arr[i]
}
I also want to sort on the year ($3), ascending, in the same script.
I have tried many ways but to no avail.
A little help to do that would be appreciated.
One in GNU awk:
$ gawk '
{
    a[$1][$3][++c[$1,$3]] = $0
}
END {
    PROCINFO["sorted_in"] = "#ind_str_desc"
    for (i in a) {
        PROCINFO["sorted_in"] = "#ind_str_asc"
        for (j in a[i]) {
            PROCINFO["sorted_in"] = "#ind_num_asc"
            for (k in a[i][j])
                print a[i][j][k]
        }
    }
}' file
Output:
toyota corolla 1995 95000 2
maruti esteem 1997 98000 1
maruti swift 2007 50000 5
...
Assumptions:
individual fields do not contain white space
primary sort: 1st field in descending order
secondary sort: 3rd field in ascending order
no additional sorting requirements are provided in case there's a duplicate of the 1st + 3rd fields (e.g., maruti + 2009), so we'll maintain the input ordering
One idea using sort:
sort -k1,1r -k3,3n auto.dat
Another idea using GNU awk (for arrays of arrays and PROCINFO["sorted_in"]):
awk '
{ cars[$1][$3][n++] = $0 }                 # "n" used to distinguish between duplicates of $1+$3
END {
    PROCINFO["sorted_in"] = "#ind_str_desc"
    for (make in cars) {
        PROCINFO["sorted_in"] = "#ind_num_asc"
        for (yr in cars[make])
            for (n in cars[make][yr])
                print cars[make][yr][n]
    }
}
' auto.dat
Both of these generate:
toyota corolla 1995 95000 2
maruti esteem 1997 98000 1
maruti swift 2007 50000 5
maruti dezire 2009 3100 6
maruti swift 2009 4100 5
honda accord 2000 60000 2
honda city 2005 60000 3
honda city 2010 33000 6
ford ikon 1995 80000 1
fiat punto 2007 45000 3
chevy tavera 1999 10000 4
chevy beat 2005 33000 2

How to split data and assign it into designated variables?

I have data in Stata on feelings about the current situation. There are seven types of feeling. The data is stored in the following format (note that the data type is a string, and one person can give more than one answer):
feeling
4,7
1,3,4
2,5,6,7
1,2,3,4,5,6,7
Since the data is a string, I tried to separate it with
split feeling, parse(,)
and I got the result
feeling1   feeling2   feeling3   feeling4   feeling5   feeling6   feeling7
4          7
1          3          4
2          5          6          7
1          2          3          4          5          6          7
However, this is not the result I want: the number representing each feeling should go into the matching variable. For instance:
feeling1   feeling2   feeling3   feeling4   feeling5   feeling6   feeling7
                                 4                                7
1                     3          4
           2                                5          6          7
1          2          3          4          5          6          7
I am not sure if there is any built-in command or function for this kind of problem. I am thinking about using forval to loop through every value in each variable and juggle the values into the correct variables.
A loop over the distinct values would be enough here. I give your example in a form explained in the Stata tag wiki as more helpful and then give code to get the variables you want as numeric variables.
* Example generated by -dataex-. For more info, type help dataex
clear
input str13 feeling
"4,7"
"1,3,4"
"2,5,6,7"
"1,2,3,4,5,6,7"
end
forval j = 1/7 {
    gen wanted`j' = `j' if strpos(feeling, "`j'")
    gen better`j' = strpos(feeling, "`j'") > 0
}

l feeling wanted1-better3
+---------------------------------------------------------------------------+
| feeling wanted1 better1 wanted2 better2 wanted3 better3 |
|---------------------------------------------------------------------------|
1. | 4,7 . 0 . 0 . 0 |
2. | 1,3,4 1 1 . 0 3 1 |
3. | 2,5,6,7 . 0 2 1 . 0 |
4. | 1,2,3,4,5,6,7 1 1 2 1 3 1 |
+---------------------------------------------------------------------------+
If you wanted a string result, that would be yielded by
gen wanted`j' = "`j'" if strpos(feeling, "`j'")
Had the number of feelings been 10 or more, you would have needed more careful code, as (for example) a search for "1" would also find it within "10".
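To make that pitfall concrete, here is a quick illustration in Python rather than Stata (the two-digit codes are hypothetical); the usual fix in any language, Stata included, is to match whole tokens, for example by splitting on the delimiter or padding both strings with commas before searching.
import ast  # not needed; standard library only

# Hypothetical feelings coded 1-12; plain substring search is fooled by "10".
feeling = "10,12"

print("1" in feeling)              # True  -- false positive: "1" occurs inside "10"
print("1" in feeling.split(","))   # False -- exact token match after splitting on the delimiter
print("12" in feeling.split(","))  # True  -- "12" really is one of the answers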
Indicator (some say dummy) variables with distinct values 1 or 0 are immensely more valuable for most analyses of this kind of data.
Note Stata-related sources such as this FAQ, this paper, and this paper.

VBA solution for VIF factors [Excel]

I have several multiple linear regressions to carry out, and I am wondering if there is a VBA solution for getting the VIF (variance inflation factor) of the regression outputs for the different equations.
My current data format:
i=1
Year DependentVariable Variable2 Variable3 Variable4 Variable5 ...
2009 100 10 20 -
2010 110 15 25 -
2011 115 20 30 -
2012 125 25 35 -
2013 130 25 40 -
I have the above table, with the value of i determining the values of the variables (essentially, a different regression input table for every value of i).
I am looking for a VBA macro that will check every value of i (stored in a column), calculate the VIFs for every value of i, and output something like the below:
ivalue variable1VIF variable2VIF ...
1 1.1 1.3
2 1.2 10.1
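For reference, the VIF of predictor j is 1/(1 - R²_j), where R²_j is the R-squared from regressing predictor j on the remaining predictors. The sketch below spells that definition out in Python/NumPy rather than VBA, purely to show the calculation; the data and variable names are illustrative.
import numpy as np

def vif(X):
    """VIF for each column of a 2-D predictor array: 1 / (1 - R^2 of column j vs. the rest)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])    # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Illustrative data: v3 is nearly collinear with v2, so its VIF comes out large.
rng = np.random.default_rng(0)
v2 = rng.normal(size=50)
v3 = v2 + rng.normal(scale=0.1, size=50)
v4 = rng.normal(size=50)
print(vif(np.column_stack([v2, v3, v4])))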

Remove n rows and iterate n times in a dataframe

I have 31 million values in a txt file. I need to remove the values between 21600 and 61200, which I did with the code below. Now I have to apply the same logic to every block of 86400 values: remove the values between 21600+86400 and 61200+86400, then between 21600+2*86400 and 61200+2*86400, and so on until the end of the data. I tried many options, even a linked list, but I could not apply them to my large dataset. How should this be done?
Visual example for values 1 to 24, removing values from 6 to 17:
1 2 3 4 5 6 - - - - - - - - - - 17 18 19 20 21 22 23 24
then apply it to the next set of rows that follows the same structure (start 6+24=30 and stop 17+24=41):
25 26 27 28 29 30 - - - - - - - - - - 41 42 43 44 45 46 47 48
and so on until the end of the data (remove between 30+24 and 41+24 for the next set).
I limited the code below to the first 250000 values for simplicity.
import numpy as np
import pandas as pd
sample = np.arange(0, 259201, 1).tolist()
df = pd.DataFrame(sample)
df = df.drop(df.index[21601:61200])
Basically, I need to apply something like this below, but I am not sure how to do it for my case.
for day in reversed(range(366)):
    df = df.drop(df.index[21601+day*86400:61200+day*86400])
You can use the modulo operator to do so (the % operator in Python and pandas).
Here is how your last piece of code can be re-written:
df[~(df.index.to_series() % 86400).between(21601, 61200)]
I used to_series() because between() is not defined for Index objects.
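As a quick sanity check, the same modulo-plus-between idea reproduces the small 1-to-24 example from the question (period 24, dropping 6 through 17 in every block); here the filter is applied to the values themselves just to make the pattern visible.
import pandas as pd

s = pd.Series(range(1, 49))            # two periods of 24: values 1..48
kept = s[~(s % 24).between(6, 17)]     # drops 6..17, 30..41, ...
print(kept.tolist())
# [1, 2, 3, 4, 5, 18, 19, 20, 21, 22, 23, 24,
#  25, 26, 27, 28, 29, 42, 43, 44, 45, 46, 47, 48]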

Count Occurrences for Objects in a Column of Lists for Really Large CSV File

I have a huge CSV file (8 GB) containing multiple columns. One of the columns is a column of lists that looks like this:
YEAR WIN_COUNTRY_ISO3
200 2017 ['BEL', 'FRA', 'ESP']
201 2017 ['BEL', 'LTU']
202 2017 ['POL', 'BEL']
203 2017 ['BEL']
204 2017 ['GRC', 'DEU', 'FRA', 'LVA']
205 2017 ['LUX']
206 2017 ['BEL', 'SWE', 'LUX']
207 2017 ['BEL']
208 2017 []
209 2017 []
210 2017 []
211 2017 ['BEL']
212 2017 ['SWE']
213 2017 ['LUX', 'LUX']
214 2018 ['DEU', 'LUX']
215 2018 ['ESP', 'PRT']
216 2018 ['AUT']
217 2018 ['DEU', 'BEL']
218 2009 ['ESP']
219 2009 ['BGR']
Each 3-letter code represents a country. I would like to create a frequency table for each country so I can count the occurrences of each country over the entire column. Since the file is really large and my PC can't load the whole CSV as a dataframe, I try to read the file lazily, iterate through the lines, take the last column, and add the object in each row of the WIN_COUNTRY_ISO3 column (which happens to be the last column) to a dictionary of counts.
import sys
from itertools import islice

n = 100
i = 0
col_dict = {}
with open(r"filepath.csv") as file:
    for nline in iter(lambda: tuple(islice(file, n)), ()):
        row = nline.splitline
        WIN_COUNTRY_ISO3 = row[-1]
        for iso3 in WIN_COUNTRY_ISO3:
            if iso3 in col_dict.keys():
                col_dict[iso3] += 1
            else:
                col_dict[iso3] = 1
        i += 1
        sys.stdout.write("\rDoing thing %i" % i)
        sys.stdout.flush()
print(col_dict)
However, this process takes a really long time. I tried to iterate through multiple lines at a time by using
for nline in iter(lambda: tuple(islice(file, n)), ())
Q1:
However, this doesn't seem to work and Python still processes the file line by line. Does anybody know the most efficient way to generate the count of each country for a really large file like mine?
The resulting table would look like this:
Country Freq
BEL 4543
FRA 4291
ESP 3992
LTU 3769
POL 3720
GRC 3213
DEU 3119
LVA 2992
LUX 2859
SWE 2802
PRT 2584
AUT 2374
BGR 1978
RUS 1770
TUR 1684
I would also like to create the frequency table by year (using the YEAR column), if anybody can help me with that. Thank you.
Try this:
from collections import defaultdict
import re

result = defaultdict(int)
with open(r"filepath.csv") as f:
    next(f)                                        # skip the header row
    for row in f:
        # drop whitespace, digits, quotes and brackets, leaving only ISO3 codes and commas
        data = re.sub(r'[\s\d\'\[\]]', '', row)
        if data:
            for x in data.split(','):
                if x:                              # ignore empty pieces (e.g. rows with empty lists)
                    result[x] += 1
print(result)
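For the per-year breakdown asked about at the end of the question, a chunked pandas pass keeps memory bounded. This is only a sketch, assuming the file has a header with YEAR and WIN_COUNTRY_ISO3 columns and that the list column holds Python-style list literals; the path and the chunk size are placeholders.
import ast
from collections import Counter

import pandas as pd

year_counts = Counter()                            # (year, iso3) -> number of occurrences
for chunk in pd.read_csv("filepath.csv",
                         usecols=["YEAR", "WIN_COUNTRY_ISO3"],
                         chunksize=10**6):
    for year, countries in zip(chunk["YEAR"], chunk["WIN_COUNTRY_ISO3"]):
        if not isinstance(countries, str):         # skip missing cells
            continue
        for iso3 in ast.literal_eval(countries):   # "['BEL', 'FRA']" -> ['BEL', 'FRA']
            year_counts[(year, iso3)] += 1

freq = pd.DataFrame(
    [(year, iso3, n) for (year, iso3), n in year_counts.items()],
    columns=["YEAR", "Country", "Freq"],
)
print(freq.sort_values(["YEAR", "Freq"], ascending=[True, False]))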
If you can handle awk, here's one:
$ cat program.awk
{
    while (match($0, /'[A-Z]{3}'/)) {
        a[substr($0, RSTART+1, RLENGTH-2)]++
        $0 = substr($0, RSTART+RLENGTH)
    }
}
END {
    for (i in a)
        print a[i], i
}
Execute it:
$ awk -f program.awk file
Output:
1 AUT
3 DEU
3 ESP
1 BGR
1 LTU
2 FRA
1 PRT
5 LUX
8 BEL
1 POL
1 GRC
1 LVA
2 SWE
The program scans the whole record ($0), so it might pick up false hits from elsewhere in the row. You could tighten that with proper field separation, but since the exact file format wasn't provided I can't help any further. See GNU awk's FS and maybe FPAT.
