Acquiring probabilities from a frequency distribution to test an individual

Suppose I have a frequency distribution describing how often normal patients' hearts present with a given degree of a symptom: say 50% of them present with a "3", 20% with a "2", and 30% with a "4".
How would I use this data to test whether a new patient is healthy or unhealthy? Relative frequency would say that if a patient presents with a "3" there is a 50% chance they are healthy, but isn't it the case that they are more likely to be healthy if they sit in the middle of the distribution?
Also, how would you then go on to combine this probability with 24 other tests on the same patient to acquire an overall probability?
Many thanks

I'm assuming by "normal patients", you mean healthy.
To use p(3|healthy) to get p(healthy|3), you could invoke Bayes' Rule:
p(A|B) = p(B|A)p(A)/p(B)
in which p( ) denotes probability, and | denotes conditionality, such that "A|B" means "A given B", or "A conditional on B".
This would necessitate knowing the overall proportion of healthy patients and the overall proportion of patients presenting with a "3". Putting it all together gives
p(healthy|3) = p(3|healthy)p(healthy)/p(3)
In the simplest case, you could use Bayes' rule with B in the form above denoting the intersection of all test results (a given result in test 1 AND a given result in test 2, etc), but it would depend on what data you have available.
Letting X, Y, and Z denote given results from three tests gives
p(healthy|(X&Y&Z)) = p((X&Y&Z)|healthy)p(healthy)/p(X&Y&Z)
This can be simplified to the following IF AND ONLY IF results X, Y, and Z are independent, as are X|healthy, Y|healthy, and Z|healthy
p(healthy|(X&Y&Z)) = p(X|healthy)p(Y|healthy)p(Z|healthy)p(healthy)/(p(X)p(Y)p(Z))
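To make this concrete, here is a minimal R sketch. The prevalence p(healthy) and the symptom distribution among unhealthy patients are made-up numbers for illustration; they are not given in the question.
p_healthy           <- 0.9    # assumed prevalence of healthy patients
p_3_given_healthy   <- 0.5    # from the stated frequency distribution
p_3_given_unhealthy <- 0.2    # assumed for illustration
# Law of total probability gives the marginal p(3), then Bayes' rule:
p_3 <- p_3_given_healthy * p_healthy + p_3_given_unhealthy * (1 - p_healthy)
p_3_given_healthy * p_healthy / p_3    # p(healthy|3), about 0.957
# Combining several tests under the conditional-independence assumption above
# (the naive Bayes form): multiply the per-test likelihoods for each class.
lik_healthy   <- c(0.50, 0.30, 0.60)   # p(X|healthy), p(Y|healthy), p(Z|healthy), assumed
lik_unhealthy <- c(0.20, 0.10, 0.40)   # assumed
p_healthy * prod(lik_healthy) /
  (p_healthy * prod(lik_healthy) + (1 - p_healthy) * prod(lik_unhealthy))   # posterior p(healthy)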

Estimate the standard error of the difference between sample means

Random samples of 143 girls and 127 boys were selected from a large population. A measurement was taken of the haemoglobin level (in g/dl) of each child, with the following results:
Girls: n = 143, mean = 11.35, sd = 1.41
Boys: n = 127, mean = 11.01, sd = 1.32
Estimate the standard error of the difference between the sample means.
In essence, we pool the standard errors by adding their squares (the variances of the sample means) and taking the square root. This answers the question: how much variation does the sampling distribution of the difference show, given both samples?
SE = sqrt( sd₁²/n₁ + sd₂²/n₂ )
SE = sqrt( 1.41²/143 + 1.32²/127 ) ≈ 0.1662
Notice that the standard deviation squared is simply the variance of each sample. As you can see, in our case the standard error is quite small, which means the difference between the sample means does not need to be very large before it stands out against the variation we would expect by chance.
We calculate the difference between the means as 0.34 (or -0.34, depending on the direction of the comparison) and divide this difference by the standard error to get a t-value. In our case, t = 0.34 / 0.1662 ≈ 2.046 (or -2.046), indicating that the observed difference is about 2.046 times larger than the average difference we would expect given the variation we measured AND the size of our samples.
However, we need to check whether this observation is statistically significant by determining the critical t-value. This can be read from a t-table: you need the alpha level (typically 0.05 unless otherwise stated), the degrees of freedom, and the form of the original alternative hypothesis (if it was along the lines of "there is a difference between genders", apply a two-tailed test; if it was along the lines of "gender X has a haemoglobin level larger/smaller than gender Y", use a one-tailed test).
If |t-value| > t-critical, we claim that the difference between the means is statistically significant and we have sufficient evidence to reject the null hypothesis. Alternatively, if |t-value| < t-critical, we do not have statistically significant evidence against the null hypothesis, and so we fail to reject it.
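For reference, a short R sketch of the same calculation from the summary statistics; the Welch degrees of freedom used for the critical value are one common choice, not the only one.
n1 <- 143; m1 <- 11.35; s1 <- 1.41   # girls
n2 <- 127; m2 <- 11.01; s2 <- 1.32   # boys
se_diff <- sqrt(s1^2 / n1 + s2^2 / n2)   # ~ 0.1662
t_value <- (m1 - m2) / se_diff           # ~ 2.046
df <- (s1^2 / n1 + s2^2 / n2)^2 /
  ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))   # Welch-Satterthwaite df
t_crit  <- qt(0.975, df)                 # two-tailed critical value at alpha = 0.05
p_value <- 2 * pt(-abs(t_value), df)
c(se = se_diff, t = t_value, t_crit = t_crit, p = p_value)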

Flipping a three-sided coin

I have two related questions on population statistics. I'm not a statistician, but would appreciate pointers to learn more.
I have a process that results from flipping a three-sided coin (results: A, B, C), and I compute the statistic t = (A - C)/(A + B + C). In my problem, I have a set that randomly divides itself into sets X and Y, maybe uniformly, maybe not. I compute t for X and Y. I want to know whether the difference I observe in those two t values is likely due to chance or not.
Now if this were a simple binomial distribution (i.e., I'm just counting who ends up in X or Y), I'd know what to do: I compute n=|X|+|Y|, σ=sqrt(np(1-p)) (and I assume my p=.5), and then I compare to the normal distribution. So, for example, if I observed |X|=45 and |Y|=55, I'd say σ=5 and so I expect to have this variation from the mean μ=50 by chance 68.27% of the time. Alternately, I expect greater deviation from the mean 31.73% of the time.
There's an intermediate problem, which also interests me and which I think may help me understand the main problem, where I measure some property of members of A and B. Let's say 25% in A measure positive and 66% in B measure positive. (A and B aren't the same cardinality -- the selection process isn't uniform.) I would like to know if I expect this difference by chance.
As a first draft, I computed t as though it were measuring coin flips, but I'm pretty sure that's not actually right.
Any pointers on what the correct way to model this is?
First problem
For the three-sided coin problem, have a look at the multinomial distribution. It's the distribution to use for a "binomial"-type problem with more than 2 outcomes.
Here is the example from Wikipedia (https://en.wikipedia.org/wiki/Multinomial_distribution):
Suppose that in a three-way election for a large country, candidate A received 20% of the votes, candidate B received 30% of the votes, and candidate C received 50% of the votes. If six voters are selected randomly, what is the probability that there will be exactly one supporter for candidate A, two supporters for candidate B and three supporters for candidate C in the sample?
Note: Since we’re assuming that the voting population is large, it is reasonable and permissible to think of the probabilities as unchanging once a voter is selected for the sample. Technically speaking this is sampling without replacement, so the correct distribution is the multivariate hypergeometric distribution, but the distributions converge as the population grows large.
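For reference, that probability can be computed directly with R's multinomial density:
dmultinom(x = c(1, 2, 3), prob = c(0.20, 0.30, 0.50))   # = 0.135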
Second problem
The second problem looks like a job for a cross-tab (contingency table). Use the chi-squared test for association to test whether there is a significant association between your variables, and use the standardized residuals of your cross-tab to identify which associations occur more often than expected and which occur less often.
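A rough R sketch of that approach; the cell counts below are made up from the percentages in the question purely for illustration (the real counts are what matter).
tab <- matrix(c(25, 75,    # group A: positive, negative (assumed counts)
                66, 34),   # group B: positive, negative (assumed counts)
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), result = c("pos", "neg")))
test <- chisq.test(tab)   # chi-squared test for association
test
test$stdres               # standardized residuals: which cells drive the association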

Is the Akaike information criterion (AIC) unit-dependent?

One formula for AIC is:
AIC = 2k + n*log(RSS/n)
Intuitively, if you add a parameter to your model, your AIC will decrease (and hence you should keep the parameter), if the increase in the 2k term due to the new parameter is offset by the decrease in the n*Log(RSS/n) term due to the decreased residual sum of squares. But isn't this RSS value unit-specific? So if I'm modeling money, and my units are in millions of dollars, the change in RSS with adding a parameter might be very small, and won't offset the increase in the 2k term. Conversely, if my units are pennies, the change in RSS would be very large, and could greatly offset the increase in the 2k term. This arbitrary change in units would lead to a change in my decision whether to keep the extra parameter.
So: does the RSS have to be in standardized units for AIC to be a useful criterion? I don't see how it could be otherwise.
No, I don't think so (partially rowing back from what I said in my earlier comment). For the simplest possible case (least-squares regression for y = ax + b), Wikipedia gives RSS = Syy - a*Sxy.
From the definitions given in that article, both a and Sxy grow by a factor of 100 and Syy grows by a factor of 100² if you change the unit of y from dollars to cents. So, after rescaling, the new RSS for that model will be 100² times the old one. I'm quite sure that the same result holds for models with k ≠ 2 parameters.
Hence nothing changes for the AIC difference, where the key part is n*log(RSS_B/RSS_A). After rescaling, both RSS values have grown by the same factor, and you get exactly the same AIC difference between models A and B as before.
Edit:
I've just found this one:
"It is correct that the choice of units introduces a multiplicative
constant into the likelihood. Thence the log likelihood has an
additive constant which contributes (after doubling) to the AIC. The difference of AICs is unchanged."
Note that this comment even talks about the general case where the exact log-likelihood is used.
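A quick numerical check in R with simulated (made-up) data: rescaling y from dollars to cents multiplies the RSS by 100², but the AIC difference between two models is unchanged. R's AIC() uses the full Gaussian log-likelihood, which is the general case the quoted comment refers to.
set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100)
y_dollars <- 3 + 2 * x1 + 0.5 * x2 + rnorm(100)
y_cents   <- 100 * y_dollars
modelA_d <- lm(y_dollars ~ x1);  modelB_d <- lm(y_dollars ~ x1 + x2)
modelA_c <- lm(y_cents ~ x1);    modelB_c <- lm(y_cents ~ x1 + x2)
sum(resid(modelA_c)^2) / sum(resid(modelA_d)^2)   # = 100^2: RSS scales with the unit
AIC(modelB_d) - AIC(modelA_d)                     # AIC difference in dollars
AIC(modelB_c) - AIC(modelA_c)                     # identical difference in cents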
I had the same question, and I felt like the existing answer above could have been clearer and more direct. Hopefully the following clarifies it a bit for others as well.
When using the AIC to compare models, it is the difference that is of interest, and the models being compared are fitted to the same data, so n1 = n2 = n. The portion in question here is the n*log(RSS/n) term. The difference AIC1 - AIC2 is:
2k1 - 2k2 + n*log(RSS1/n) - n*log(RSS2/n)
From our logarithmic identities, we know that log(a) - log(b) = log(a/b). AIC1 - AIC2 therefore simplifies to:
2k1 - 2k2 + n*log(RSS1/RSS2)
If a change of units scales every residual by a gain factor G, each RSS is multiplied by G², and the difference becomes:
2k1 - 2k2 + n*log(G²*RSS1/(G²*RSS2)) = 2k1 - 2k2 + n*log(RSS1/RSS2)
As you can see, we are left with the same AIC difference, regardless of which units we choose.
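The same invariance can be checked numerically by plugging made-up RSS values into AIC = 2k + n*log(RSS/n):
n <- 100
rss1 <- 50; k1 <- 2     # model 1 (illustrative values)
rss2 <- 40; k2 <- 3     # model 2 (illustrative values)
G <- 100^2              # factor by which each RSS grows when y is rescaled
aic <- function(rss, k) 2 * k + n * log(rss / n)
aic(rss1, k1) - aic(rss2, k2)             # difference in the original units
aic(G * rss1, k1) - aic(G * rss2, k2)     # same difference after rescaling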

intensity of point process - weights with covariate - spatstat

I am trying spatstat for a specific case. In my shapefile of roads, I have the attributes speed and % of heavy vehicles on each road. The observation is that severe accidents are more likely on roads with high speeds and more heavy vehicles (because the road is not properly access-controlled and pedestrians cross it). We know that accidents occur at a given rate (per 5 km stretch).
I would like to generate a random Poisson point pattern with that rate, but weighted so that points occur more often on roads with high speed (or a high % of trucks), and if possible also to include the second variable, % of trucks.
What is the best way to model the two aspects to make a small proof of concept? I have read (portions of) the spatstat book and the section on the influence of covariates on intensity, but this is still unclear to me.
Thanks
The spatstat function rpoislpp generates a Poisson random point pattern on the network with a given intensity. In this case, you want a spatially-varying intensity, which can be specified by a function of spatial location. That is, you want something like rpoislpp(f, L) where L is the linear network and f is the intensity function.
I assume you have obtained values of the covariate (like speed limit and fraction of trucks) for each road. Then you need to build a function that looks up these values at any spatial location on the network. Once you have this, you can write the intensity function in terms of it.
To start, suppose you have a network L (object of class linnet). The segments of the network can be indexed in the original order given when you specified them: or you can extract these segments by S <- as.psp(L). We need a vector z giving the covariate values for each of these segments (so this will be a numeric vector of length n=nsegments(S)). Then z[i] is the covariate value along segment i. (Note: if you have covariate values for each road, where a road consists of multiple segments of L, then you first need to figure out which segments of L belong to each road, and construct z.)
Next do the following:
Zfun <- linfun(function(x,y,seg,tp) { z[seg] }, L)
This creates a function on the linear network (class linfun) that evaluates the covariate at any spatial location on L. To check it's built correctly, type plot(Zfun).
Now suppose you want the point process intensity to be lambda = exp(3*Z+2). Then do
lam <- function(x,y,seg,tp) { exp(3 * z[seg] + 2) }
lambda <- linfun(lam, L)
(Needless to say, you can write any mathematical expression in the braces; and you can have more than one covariate, etc.)
Finally generate the random points:
X <- rpoislpp(lambda, L)
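Putting the pieces together, here is an end-to-end sketch using the simplenet example network that ships with spatstat; the random covariate values and the intensity exp(3*Z + 2) are placeholders for the real speed/truck data.
library(spatstat)
L <- simplenet                   # example linear network (class linnet)
S <- as.psp(L)                   # its segments
z <- runif(nsegments(S))         # one covariate value per segment (placeholder)
Zfun <- linfun(function(x, y, seg, tp) { z[seg] }, L)
plot(Zfun)                       # check the covariate is attached correctly
lambda <- linfun(function(x, y, seg, tp) { exp(3 * z[seg] + 2) }, L)
X <- rpoislpp(lambda, L)         # Poisson points, denser where z is large
plot(X)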

Standardization for PCA/FA

I did a PCA/FA analysis with and without standardization and ended up with different results. For standardization, I just divided each input variable by its corresponding standard deviation; I did not subtract the mean (as one would for z-scores). My question is: how important is it to subtract the mean in the case of PCA/FA?
I found on another blog that dividing by the standard deviation is another way of standardizing the data set. Is this superior to z-scores in any sense? Thanks.
By definition, principal components try to capture the highest variation in the data. The important point is that "variation" here is measured by the 2nd norm, not by the variance and not by the standard deviation.
For example, the first principal component is the linear combination of the data in the direction given by w₁ = argmax over unit vectors w of ||Xw||, where X is the data matrix.
This matters a lot because:
unlike the variance, the 2nd norm is sensitive to location: if you add a constant to a vector, its variance will not change but its 2nd norm will;
unlike the standard deviation, the 2nd norm is sensitive to scale: if a vector is multiplied by a constant factor, its 2nd norm scales by that factor.
There are at least two problems if an analysis is affected by the location and scale of the explanatory factors:
In reality, observations represent different phenomena, so they have different and incomparable scales and averages; for example, the variation and average of income in a sample population are not comparable with the variation and average of age.
You do not want the model results to change conceptually if, for example, incomes are quoted in cents as opposed to dollars, or measurements are taken in inches and feet as opposed to meters.
But plain PCA is sensitive to scale and location. For example, consider a PCA analysis of two-dimensional standard normal variables with correlation 0.4.
The red lines represent the directions of the loading vectors: the first principal component captures the highest variation in the joint data and correctly gives equal shares to each vector.
But things change dramatically if we move the population 2 units to the right (equivalent to increasing the mean of the first vector by 2 units).
Technically we have the same data as before, but now the first principal component is basically capturing the fact that the first vector has a non-zero mean.
Similarly, if the first vector is scaled by a factor of 2, it gets 4 times more weight than the second vector, driven simply by the fact that it has higher variance.
This shows the importance of normalizing the scale and removing the mean from the data before doing PCA.
That said, one can still come up with situations where the relative location and scale of the explanatory factors carry useful information for the analysis and should not be wiped out of the data.
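A short R sketch of the effect described above, using simulated two-dimensional data with correlation 0.4; the shift of 2 units and the scale factor of 2 mirror the examples in the answer.
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.4 * x1 + sqrt(1 - 0.4^2) * rnorm(n)   # correlation ~ 0.4 with x1
# Centered (and scaled) PCA: the loadings give roughly equal weight to both variables.
prcomp(cbind(x1, x2), center = TRUE, scale. = TRUE)$rotation
# Shift the first variable by 2 and skip centering: PC1 mostly tracks the non-zero mean.
prcomp(cbind(x1 + 2, x2), center = FALSE)$rotation
# Scale the first variable by 2 and skip scaling: PC1 is dominated by it.
prcomp(cbind(2 * x1, x2), center = TRUE, scale. = FALSE)$rotation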
