How to deal with violation of proportional hazards assumption in Cox PH, R 3.1.3 survfit - survival-analysis

I'm performing survival analysis in R using the 'survival' package and coxph. My goal is to compare survival between individuals with different chronic diseases. My data are structured like this:
id, time, event, disease, age.at.dx
1, 342, 0, A, 8247
2, 2684, 1, B, 3879
3, 7634, 1, A, 3847
where 'time' is the number of days from diagnosis to event, 'event' is 1 if the subject died, 0 if censored, 'disease' is a factor with 8 levels, and 'age.at.dx' is the age in days when the subject was first diagnosed. I am new to using survival analysis. Looking at the cox.zph output for a model like this:
combi.age<-coxph(Surv(time,event)~disease+age.at.dx,data=combi)
Two of the disease levels violate the PH assumption, having p-values <0.05. Plotting the Schoenfeld residuals over time shows that for one disease the hazard falls steadily over time, and with the second, the line is predominantly parallel, but with a small upswing at the extreme left of the graph.
My question is how to deal with these disease levels? I'm aware from my reading that I should attempt to add a time interaction to the disease whose hazard drops steadily, but I'm unsure how to do this, given that most examples of coxph I've come across only compare two groups, whereas I am comparing 8. Also, can I safely ignore the assumption violation of the disease level with the high hazard at early time points?
I wonder whether this is an inappropriate way to structure my data, because it does not preclude a single individual appearing multiple times in the data - is this a problem?
Thanks for any help, please let me know if more information is needed to answer these questions.

I'd say you have a fairly good understanding of the data already and should present what you found. This sounds like a descriptive study rather than one where you will be presenting to the FDA with a request to honor your p-values. Since your audience will (or should) be expecting that the time-course of risk for different diseases will be heterogeneous, I'd think you can just describe these results and talk about the biological/medical reasons why the first "non-conformist" disease becomes less important with time and the other non-conforming condition might become more potent over time. You already done a more thorough analysis than most descriptive articles in the medical literature exhibit. I rarely see description of the nature of non-proportionality.
The last question regarding data "does not preclude a single individual appearing multiple times in the data" may require some more thorough discussion. The first approach would be to stratify by patient ID with the cluster()-function.

Related

ANOVA test on time series data

In below post of Analytics Vidya, ANOVA test has been performed on COVID data, to check whether the difference in posotive cases of denser region is statistically significant.
I believe ANOVA test can’t be performed on this COVID time series data, atleast not in way as it has been done in this post.
Sample data has been consider randomly from different groups(denser1, denser2…denser4). The data is time series so it is more likely that number of positive cases in random sample of groups will be from different point of time.
There might be the case denser1 has random data from early covid time and another region has random data from another point of time. If this is the case, then F-Statistics will high certainly.
Can anyone explain if you have other opinions?
https://www.analyticsvidhya.com/blog/2020/06/introduction-anova-statistics-data-science-covid-python/
ANOVA should not be applied to time-series data, as the independence assumption is violated. The issue with independence is that days tend to correlate very highly. For example, if you know that today you have 1400 positive cases, you would expect tomorrow to have a similar number of positive cases, regardless of any underlying trends.
It sounds like you're trying to determine causality of different treatments (ie mask mandates or other restrictions etc) and their effects on positive cases. The best way to infer causality is usually to perform A-B testing, but obviously in this case it would not be reasonable to give different populations different treatments. One method that is good for going back and retro-actively inferring causality is called "synthetic control".
https://economics.mit.edu/files/17847
Above is linked a basic paper on the methodology. The hard part of this analysis will be in constructing synthetic counterfactuals or "controls" to test your actual population against.
If this is not what you're looking for, please reply with a clarifying question, but I think this should be an appropriate method that is well-suited to studying time-series data.

Procedural modelling classical Chinese visions of political order

The problem I’m dealing with at the moment involves a system described in the Guanzi. A large section of the book is about how governments should work to extract a surplus from the economy which they can redistribute to ensure the loyalty of existing followers and gain new ones. Under this system, whoever can redistribute the most wealth becomes the overall leader. However, he also has to out-compete the other individuals in the system: they are all busy trying to establish their own redistribution networks.
The result is a series of pyramid-shaped redistribution networks, both independent and nested.
Simplified visual representation of the expected outcome
These are dynamic across time and space. Gaining resources lets you acquire more followers, which in turn gives you access to more resources. There is also a random component involved: a bad harvest or a war may wipe out your resources. If one leader runs out of resources (whether as a result of a disaster or because he redistributed them too generously among his followers), he will either be supplanted by a follower or his network will collapse and its members leave to join other networks.
I think it is possible to model this algorithmically.
We can assume that willingness to share resources is innate.
Generosity = propensity score
An individual acquires followers as a function of both the surplus resources he possesses and his willingness to share them.
Followers[tn] = Surplus[t-1] * Generosity
It is worth noting that growth is endogenous in this model. It is a product of whatever economic growth coefficient is deemed realistic given technology and natural resources (a), as well as of the previous cycle’s surplus and the number of followers an individual has, on the basis that these constitute factors of production. (Note: I'm not interested in getting actual monetary values out of this, just modelling the relationships. I understand that if you plugged real numbers into it people would end up redistributing more than they own.)
Growth = a (Surplus[t-1] * Followers[t-1])
At T=0 the surplus enjoyed by each individual in the system must be generated randomly.
Surplus[t0] = randomly generated number
Followers generate additional resources for their leader, but they also need to be remunerated, meaning that they simultaneously deplete their leader’s resources, proportional to his generosity propensity score. A random component must also be included, as mentioned above, to account for famines, bumper crops, wars etc.
Surplus[tn] = Random Component (Surplus[t-1] + Growth) – (Followers[t-1] * Generosity)
Once these relationships have been defined, then the algorithm is relatively simple:
T1:
Each individual checks the Surplus*Generosity score of the nearest individual who is not already following him. If Individual A’s SG > Individual B’s SG, then Individual B moves closer to Individual A and becomes his follower. (Note: If individual B has followers of his own, he carries them with him. Also: Followers automatically re-check their leader's SG in every round, since he is the closest individual to them. They will leave his network to become free agents once more if his SG drops below their own.)
Otherwise, he does nothing.
T2 :
Each individual’s stats (Followers, Surplus) are recalculated based on the new situation.
Step 1 is repeated.
T3 :
Repeat previous step
One would expect the individuals with the optimal generosity score to build the biggest networks, as they acquire followers without completely depleting their resources.
I suspect – but am not sure – that this model’s characteristics are similar to those of an L-system model.
Individuals are programmed with a simple instruction: “If the person closest to you has a higher S*G score than you do, approach and follow him.”
On the basis of this the individuals form structures (from the perspective of the individual with the optimal S*G score, they appear to cluster around him in a semi-structured way)
These structures grow with every successive time period
They collapse after depleting their own resources, or when a random disaster strikes.
After a collapse, the process automatically begins again.
However, I'm not a maths or a computing guy (I'm a Chinese philosophy guy) so I'm not sure if I'm just being fooled by a superficial resemblance or not. Is this a genuine example of string rewriting or am I just convincing myself it is because you get tree-like structures out of it? Is this even a model that can work at all? Have I totally messed up my equations? (I haven't done this since high school, so it's highly probable.)
All help is gratefully received.

Finding probabilities of patterns in asset price movements based on multiple variables

I am seeking a method to allow me to analyse/search for patterns in asset price movements using 5 variables that move and change with price (from historical data).
I'd like to be able to assign a probability to a forecasted price move when for example, var1 and var2 do this and var3..5 do this, then price should do this with x amount of certainty.
Q1: Could someone point me in the right direction as to what framework / technique can help me achieve this?
Q2: Would this be a multivariate continuous random series analysis?
Q3: A Hidden Markov modelling?
Q4: Or perhaps is it a data-mining problem?
I'm looking for what rather then how.
One may opt to use Machine-Learning tools to build a learner to either
both classify of what kind the said "asset price movement" will beand serve also statistical probability measures for such a Classifier prediction
both regress a real target value, to which the asset price will moveandserve also statistical probability measures for such a Regressor prediction
A1: ( while StackOverflow strongly discourages users to ask about an opinion about a tool or a particular framework ) there would be not much damages or extra time to be spent, if one performs academia papers research and there would be quite a remarkable list of repeatedly used tools, used for ML in the context of academic R&D. For a reason, there would not be a surprise to meet scikit-learn ML-classes a lot, some other papers may work with R-based quantitative finance / statistical libraries. The tools, however, with all due respect, are not the core to answer all the doubts and inital confusion present in a mix of your questions. The subject confusion is.
A2: No, it would not. Well, unless you beat all the advanced quantitative research and happen to prove that the Market exhibits a random behaviour ( which it is not and for which it would be waste of time to re-cite remarkable research published about why it is not indeed a random process ).
A3: Do not try to jump on any wagon just because of it's attractive Tag or "contemporary popularity" in marketing minded texts. With all due respect, understanding HMM is outside of your sight while you now appear to move just to the nearest horizons to first understand what to look for.
A4: This is a nice proof of a missed target. Your question shows in this particular point better than in others, how small amount of own research efforts were put into covering the problem-domain and acquiring at least some elementary knowledge before typing the last two questions.
StackOverflow encourages users to ask high quality questions, so do not hesitate to re-edit your post to add some polishing efforts to this subject.
If in a need for an inspiration, try to review a nice and a powerful approach for a fast Machine Learning process, where both Classification and Regression tasks obtain also probability estimates for each predicted target value.
To have some idea about highly performant ML-predictors, these typically operate on much more than a set of 5 variables ( called in the ML-domain "features" ) . ( Think rather about some large hundreds to small thousands features, typically heavily non-linear transformations from the original TimeSeries' data ).
There you go, if indeed willing to master ML for algorithmic trading.
May like to read about a state-of-art research in this direction:
[1] Mondrian Forests: Efficient Online Random Forests
>>> arXiv:1406.2673v2 [stat.ML] 16 Feb 2015
[2] Mondrian Forests for Large-Scale Regression when Uncertainty Matters
>>> arXiv:1506.03805v4 [stat.ML] 27 May 2016 >>>
May also enjoy other posts on subject: >>> StackOverflow Algorithmic-Trading >>>

Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to it's tendency to over-fit in the training phase.
However, I still thought I could take a few precautions and and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day the GA only buys one from the list and it chooses this one randomly. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring,shouldn't it be absent in the initial generations of the GA since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem which demonstrates ( I believe) that there is no perfect set of parameters which will produce an optimal output for two different datasets. If we take this further, does this no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
->The blue line is the GA output.
->The red line is the training data (slightly different because of the aforementioned randomness)
-> The yellow line is the stubborn test data which shows no generalization. In fact this is the most flattering graph I could produce..
The y-axis is profit, the x axis is the trading strategies sorted from worst to best ( left to right) according to there respective profits (on the y axis)
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no free lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem that we are working with tends produce data that cluster relatively closely together, relative to the entire space of possible data. So, within that set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.

Determine coefficients for some function

I have a task that is probably related to data analysis or even neural networks.
We have a data source of our partners, job portal. The source values are arrays of different attributes related to the particular employee:
His\her gender,
Age,
Years of experience,
Portfolio (number of the projects done),
Profession and specialization (web design, web programming, management etc.),
many other (around 20-30 totally)
Every employee has it's own salary (hourly) rate. So, mathematically, we have some function
F(attr1, attr2, attr3, ...) = A*attr1 + B*attr2 + C*attr3 + ...
With unknown coefficient. But we know the result of the function for the specified arguments (let's say, we know that a male programmer with 20 years of experience and 10 works in portfolio has a rate of $40 per hour).
So we have to find somehow these coefficients (A, B, C...), so we can predict the salary of any employee. This is the most important goal.
Another goal is to find which arguments are most important - in other words, which of them cause significant changes to the result of the function. So in the end we have to have something like this: "The most important attributes are years of experience; then portfolio; then age etc.".
There may be a situation when different professions vary too much from each other - for example, we simply may not be able to compare web designers with managers. In this case, we have to split them by groups and calculate these ratings for every group separately. But in the end we need to find 'shared' arguments that will be common for every group.
I'm thinking about neural networks because it's something they may deal with. But I'm completely new to them and have totally no idea what to do.
I'd very appreciate any help - which instruments to use, what algorithms, or even pseudo-code samples etc.
Thank you very much.
That is the most basic example of (linear) regression. You are using a linear function to model your data, and need to estimate the parameters.
Note that this is actually a part of classic mathematical statistics; not data mining yet but much much older.
There are various methods. Given that there likely will be outliers, I would suggest to use RANSAC.
As for the importance, doesn't this boil down to "which is largest, A B or C"?

Resources