Statistical method to compare urban vs rural matched siblings - statistics

I am writing a study protocol for my masters thesis. The study seeks to compare the rates of Non Communicable Diseases and risk factors and determine the effects of rural to urban migration. Sibling pairs will be identified from a rural area. One of the siblings should have participated in the rural NCD survey which is currently on going in the area. The other sibling should have left the area and reported moving to a city.Data will collected by completing a questionnaire on demographics, family history,medical history, diet,alcohol consumption, smoking ,physical activity.This will be done for both the rural and urban sibling, with data on the amount of time spent in urban areas fur
The outcomes which are binary (whether one has a condition or not) are : 1.diabetic, 2.hypertensive, 3.obese
What statistical method can I use to compare the outcomes (stated above) between the two groups, considering that the siblings were matched (one urban sibling for every rural sibling)?
What statistical methods can also be used to explore associations between amount spent in urban residence and the outcomes?

Given that your main aim is to compare quantities of two nominal distributions, a chi-square test seems to be the method of choice with regard to your first question. However, it should be mentioned that a chi-square test is somehow "the smallest" test for answering differences in samples. If you are studying medicine (or related) a chi-square test is fine because it is also frequently applied by researchers of this field. If you are studying psychology or sociology (or related) I'd advise to highlight limitations of the test in the discussions section since it mostly tests your distributions against randomly expected distributions.
Regarding your second question, a logistic regression would be applicable since it allows binomial distributed variables both for independent variables (predictors) and dependent variables. However, if you have other interval scaled variables (e.g. age, weight etc.) you could also use t-tests or ANOVAs to investigate differences between these variables with respect to the existence of specific diseases (i.e. is diabetic or not).
Overall, this matter strongly depends on what you mean by "association". Classically, "association" refers to correlations or linear regression (for which you need interval scaled variables on "both sides") but given your data structure, the aforementioned methods possess a better fit.
How you actually calculate these tests depends on the statistics software used.

Related

2*2 repeated measures ANOVA with an unbalanced number of observations

sorry for this very basic question. I've trawled through previous pages and cannot quite find a case that corresponds to our situation.
320 individuals rated two types of films. The rating was provided on a 1-11 scale.There are many films of each type. In short the DV is a continuous variable.
20 individuals have a particular disease that we now consider of interest. We would like to examine the effect of the disease on the rating.
We conducted a 2-way repeated measures ANOVA, using 'situation type' as a within-subject factor, and 'disease status' as a between-subject factor, using SPSS. The design is obviously unbalanced with more observations in the healthy group. The data appeared to be normally distributed. Levine test suggested equality of variance. Does that mean it is appropriate to use ANOVA for this analysis?

Violation of PH assumption

Running a survival analysis, assume the p-value regarding a variable is statistically significant - let's say with a positive association with the outcome. However, according to the Schoenfeld residuals, the proportional hazard (PH) assumption has is violated.
Which scenario among below could possibly happen after correcting for PH violations?
The p-value may not be significant anymore.
p-value still significant, but the size of HR may change.
p-value still significant, but the direction of association may be altered (i. e. a positive association may end up being negative).
The PH assumption violation usually means that there is an interaction effect that needs to be included in the model. In the simple linear regression, including a new variable may alter the direction of the existing variables' coefficients due to the collinearity. Can we use the same rationale in the case above?
Therneau and Gramsch have written a very useful text, "Modeling Survival Data" that has an entire chapter on testing proportionality. At the end of the chapter is a section on causes and modeling alternatives, which I think can be used for answering this question. Since you mention interactions it makes your question about a particular p-value rather ambiguous and vague.
1) Certainly if you have chosen a particular measurement as the subject of your interest and it turns out the all of the effects are due to its interaction with another variable that you happened to also measure, then you may be in a position where the variable-of-interest's p-value will decrease, possibly to zero.
2) It's almost certain that modification of a model with a different structure (say will the addition of time-varying covariates or a different treatment of time) will result in a different estimated HR for a particular covariate and I think it would be impossible to predict the direction of the change.
3) As to whether to sign of the coefficient could change, I'm quite sure that would be possible as well. The scenario I'm thinking of would have a mixture of two groups say men and women and one of the groups had a sub-group whose early mortality was greatly increased, e.g. breast cancer, while the surviving members of that group would have a more favorable survival expectation. The base model might show a positive coefficient (high risk) while a model that was capable of identifying the subgroup at risk would then allow the gender-related coefficient to become negative (lower risk).

Bayesian t-test assumptions

Good afternoon,
I know that the traditional independent t-test assumes homoscedasticity (i.e., equal variances across groups) and normality of the residuals.
They are usually checked by using levene's test for homogeneity of variances, and the shapiro-wilk test and qqplots for the normality assumption.
Which statistical assumptions do I have to check with the bayesian independent t test? How may I check them in R with coda and rjags?
For whichever test you want to run, find the formula and plug in using the posterior draws of the parameters you have, such as the variance parameter and any regression coefficients that the formula requires. Iterating the formula over the posterior draws will give you a range of values for the test statistic from which you can take the mean to get an average value and the sd to get a standard deviation (uncertainty estimate).
And boom, you're done.
There might be non-parametric Bayesian t-tests. But commonly, Bayesian t-tests are parametric, and as such they assume equality of relevant population variances. If you could obtain a t-value from a t-test (just a regular t-test for your type of t-test from any software package you're comfortable with), use levene's test (do not think this in any way is a dependable test, remember it uses p-value), then you can do a Bayesian t-test. But remember the point that the Bayesian t-test, requires a conventional modeling of observations (Likelihood), and an appropriate prior for the parameter of interest.
It is highly recommended that t-tests be re-parameterized in terms of effect sizes (especially standardized mean difference effect sizes). That is, you focus on the Bayesian estimation of the effect size arising from the t-test not other parameter in the t-test. If you opt to estimate Effect Size from a t-test, then a very easy to use free, online Bayesian t-test software is THIS ONE HERE (probably one of the most user-friendly package available, note that this software uses a cauchy prior for the effect size arising from any type of t-test).
Finally, since you want to do a Bayesian t-test, I would suggest focusing your attention on picking an appropriate/defensible/meaningful prior rather then levenes' test. No test could really show that the sample data may have come from two populations (in your case) that have had equal variances or not unless data is plentiful. Note that the issue that sample data may have come from populations with equal variances itself is an inferential (Bayesian or non-Bayesian) question.

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
gives you a visual representation of the entire
classification/regression analysis (in the form of a binary tree),
which distinguishes it from any other analytical/statistical
technique that i am aware of;
decision tree algorithms require very little pre-processing on your data, no normalization, no rescaling, no conversion of discrete variables into integers (eg, Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, the tree itself is a visual summary of feature importance ranking
(ie, significant variables)--the most significant variable is the
root node, and is more significant than the two child nodes, which in
turn are more significant than their four combined children. "significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable' or the thing
you are trying to predict). One proviso: from a visual inspection of
a decision tree you cannot distinguish variable significance from
among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees handle both categorical and discrete data usually without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case in would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your r/v is comprised of 'continuous values' from 1 to 100, you might sensibly bin them into 5 bins, 0-20, 21-40, 41-60, and so on.
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note, I have not used the decision tree module in R/M, nor have i used RapidMiner at all.)
The other set of techniques i would consider is usually grouped under the rubric Dimension Reduction. Feature Extraction and Feature Selection are two perhaps the most common terms after D/R. The most widely used is PCA, or principal-component analysis, which is based on an eigen-vector decomposition of the covariance matrix (derived from to your data matrix).
One direct result from this eigen-vector decomp is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course let's you access R inside RapidMiner.R has plenty of PCA libraries (Packages). The ones i mention below are all available on CRAN, which means any of the PCA Packages there satisfy the minimum Package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, i can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but i mentioned it here because it's an excellent tutorial and the two techniques are used for the similar purposes.

Supervised Learning for User Behavior over Time

I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats which represent the intensity on a continuous range of that user's interaction with features 1-6 on a daily basis so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
What machine using algorithms would help me identify which are the most common patterns in feature usage which lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
Would mean on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting highly with features 3 through 6.
N-gram Style
I could make these states words and the lifetime of a user a sentence. (Would probably need to add a "conversion" word to the vocabulary as well)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few state which is somewhat interesting. But, what I really want to know the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) which would be likely to lead to the word.
Amaç Herdağdelen suggests identifying n-grams to practical n and then counting how many n-gram states each user has. Then correlating with conversion data (I guess no conversion word in this example). My concern is that there would be too many n-grams to make this method practical. (if each state has 729 possibilities, and we're using trigrams, thats a lot of possible trigrams!)
Alternatively, could I just go thru the data logging the n-grams which led to the conversion word and then run some type of clustering on them to see what the common paths are to a conversion?
Survival Style
Suggested by Iterator, I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events which leads to death. Further, when looking up the Cox Proportional Hazard model, I found that it does not event accommodate variables which change over time (its good for differentiating between static attributes like gender and ethnicity)- so it seems very much geared toward a different question than mine.
Decision Tree Style
This seems promising though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line and when it leads to conversion or not? This is very different than the decision tree data literature I've been able to find.
Also, need clarity on how to identify patterns which lead to conversion instead a models predicts likely hood of conversion after a given sequence.
Theoretically, hidden markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you can use the sequence of interactions as positive or negative instances depending on whether a user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider to limit your data to sufficiently old users.
I would also consider converting the data to fixed-length vectors and apply conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming that the interaction sequence of a given user ise "f1,f3,f5", "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). In order to represent each user as a vector, you would identify all n-grams up to a practical n, and count how many times the user employed a given n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
Then -- probably with the help of some suitable normalization techniques such as pointwise mutual information or tf-idf -- you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbor, support machine or naive Bayes to build a predictive model.
This is rather like a survival analysis problem: over time the user will convert or will may drop out of the population, or will continue to appear in the data and not (yet) fall into neither camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical models perspective, then a Kalman Filter may be more appealing. It is a generalization of HMMs, suggested by #AmaçHerdağdelen, which work for continuous spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you a likely hood of a user converting after a sequence.
I forgot to mention this earlier. You may also want to take a look at the Google PageRank algorithm. It would help you account for the user completely disappearing [not subscribing]. The results of that would help you to encourage certain features to be used. [Because they're more likely to give you a sale]
I think Ngramm is most promising approach, because all sequnce in data mining are treated as elements depndent on few basic steps(HMM, CRF, ACRF, Markov Fields) So I will try to use classifier based on 1-grams and 2 -grams.

Resources