Evaluating Models that Return Percentage Present of Multiple Classes

If a model returns a vector of the proportions of the different classes present in the data (as percentages), what would be a good way to evaluate it (with charts and/or statistics)?
Say, for example, that a batch of pond water contains 30% Bacteria1 and 70% Bacteria2 (the data is [0.3, 0.7]). Our model returns 35% Bacteria1 and 65% Bacteria2 (the output is [0.35, 0.65]). How would we evaluate the accuracy of this model?
Am I right in thinking that we can't use things like confusion matrices or ROC/AUC curves because this isn't a classification problem? I'm not sure if there exist other metrics like these ones for this kind of problem though.
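For concreteness, here is a small, hedged sketch (not taken from any answer) of a few distance measures one could compute between the true and predicted proportion vectors; the helper name composition_errors and the particular choice of metrics are purely illustrative:

```python
import numpy as np

def composition_errors(true_p, pred_p, eps=1e-12):
    """A few common distances between two class-proportion vectors."""
    true_p = np.asarray(true_p, dtype=float)
    pred_p = np.asarray(pred_p, dtype=float)
    mae = np.mean(np.abs(true_p - pred_p))                          # mean absolute error per class
    rmse = np.sqrt(np.mean((true_p - pred_p) ** 2))                 # root mean squared error
    kl = np.sum(true_p * np.log((true_p + eps) / (pred_p + eps)))   # KL divergence (true || predicted)
    return {"mae": mae, "rmse": rmse, "kl": kl}

# The example from the question: true [0.3, 0.7] vs. predicted [0.35, 0.65]
print(composition_errors([0.3, 0.7], [0.35, 0.65]))
```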

Related

Under what conditions can two classes have different average values, yet be indistinguishable to a SVM?

I am asking because I have observed in neuroimaging that a brain region can have different average activation between two experimental conditions, yet an SVM classifier sometimes cannot distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet no single threshold (linear decision boundary) would perfectly separate A from B. I believe this logic would hold even if our data consisted of vectors rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.
As described in the comments, there are no such conditions, because an SVM will separate the training data just fine (I am not talking about generalisation here, just about separating the training data). For the rest of the answer I assume there are no two identical points with different labels.
Non-linear case
In the kernel case, using something like an RBF kernel, an SVM will always perfectly separate any training set, provided C is large enough.
Linear case
If the data is linearly separable then again, with a large enough C, it will separate the data just fine. If the data is not linearly separable, cranking C up as far as possible will lead to smaller and smaller training error (of course it will not reach 0, since the data is not linearly separable).
In particular, for the data you provided a kernelized SVM will get 100% training accuracy, while a linear SVM will end up around 50%, but this has nothing to do with the means being different or with the relation between the variances: the dataset simply is not linearly separable (the best possible single threshold, around 0.5 or 10.5, only reaches 75% training accuracy), so the limitation applies to any linear model, not to SVMs in particular. The linear SVM will put its decision point roughly in the middle of the range, somewhere around 5, which misclassifies half of the points.
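A minimal sketch of this behaviour with scikit-learn; the particular C and gamma values below are just one reasonable setting, not values from the original answer:

```python
import numpy as np
from sklearn.svm import SVC

# Class A: 0s and 10s; class B: 1s and 11s (the example from the question)
X = np.array([0]*5 + [10]*5 + [1]*5 + [11]*5, dtype=float).reshape(-1, 1)
y = np.array([0]*10 + [1]*10)

# Linear SVM: the hinge loss pushes the boundary towards the middle of the
# range, so training accuracy should land around 50%.
linear = SVC(kernel="linear", C=100).fit(X, y)
print("linear SVM training accuracy:", linear.score(X, y))

# RBF kernel with a sufficiently large gamma and C separates the (distinct)
# points perfectly, so training accuracy should be 100%.
rbf = SVC(kernel="rbf", gamma=1.0, C=100).fit(X, y)
print("RBF SVM training accuracy:   ", rbf.score(X, y))
```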

Scale before PCA

I'm using PCA from scikit-learn and I'm getting some results that I'm trying to interpret, so I ran into a question: should I subtract the mean (or perform standardization) before using PCA, or is this somehow embedded in the sklearn implementation?
Moreover, which of the two should I perform, if so, and why is this step needed?
I will try to explain it with an example. Suppose you have a dataset with a lot of features about housing, and your goal is to classify whether a purchase is good or bad (a binary classification). The dataset includes some categorical variables (e.g. location of the house, condition, access to public transportation, etc.) and some float or integer variables (e.g. market price, number of bedrooms, etc.). The first thing you might do is encode the categorical variables. For instance, if you have 100 locations in your dataset, a common way is to encode them from 0 to 99. You might even end up one-hot encoding these variables (i.e. a column of 1s and 0s for each location), depending on the classifier you plan to use.

Now if the price is expressed in dollars (values in the hundreds of thousands or millions), the price feature will have a much higher variance and thus a higher standard deviation than the other features. Remember that the variance is computed from squared differences from the mean, and the square of a large value grows quickly. But that does not mean the price carries significantly more information than, for instance, the location. In this example, however, PCA would give a very high weight to the price feature, and the weights of the categorical features would drop almost to 0. If you normalize your features, you get a fair comparison of the explained variance in the dataset. So it is good practice to mean-normalize and scale the features before using PCA (a small illustration follows the list below).
Before PCA, you should:
1. Mean normalize (ALWAYS)
2. Scale the features (if required)
Note: please remember that steps 1 and 2 are not technically the same thing.
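Here is the small illustration of the scale effect mentioned above; the housing-style numbers and feature names are made up:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: price in dollars (huge scale) vs. number of bedrooms (tiny scale)
price = rng.normal(500_000, 150_000, size=200)
bedrooms = rng.integers(1, 6, size=200).astype(float)
X = np.column_stack([price, bedrooms])

# Without scaling, the first principal component is essentially the price axis.
pca_raw = PCA(n_components=2).fit(X)
print("PC1 loadings, raw data:   ", pca_raw.components_[0])

# After standardization, both features can contribute to the components.
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print("PC1 loadings, scaled data:", pca_std.components_[0])
```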
This is a really non-technical answer but my method is to try both and then see which one accounts for more variation on PC1 and PC2. However, if the attributes are on different scales (e.g. cm vs. feet vs. inch) then you should definitely scale to unit variance. In every case, you should center the data.
Here's the iris dataset with centering only and with centering + scaling. In this case, centering led to higher explained variance, so I would go with that one. The data came from sklearn.datasets.load_iris. Then again, with centering only, PC1 carries most of the weight, so I wouldn't treat patterns I find in PC2 as significant. On the other hand, with centering + scaling the weight is split between PC1 and PC2, so both axes should be considered.
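A minimal sketch of that comparison; note that scikit-learn's PCA centers the data internally but does not scale it:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# "Center" only: PCA subtracts the column means itself, so fitting on the raw
# data corresponds to centered-but-unscaled PCA.
pca_center = PCA(n_components=2).fit(X)

# "Center + scale": standardize each feature to unit variance first.
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("centered only:    ", pca_center.explained_variance_ratio_)
print("centered + scaled:", pca_scaled.explained_variance_ratio_)
```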

Log transforming predictor variables in survival analysis

I am running shared gamma frailty models (i.e., Coxph survival analysis models with a random effect) and want to know if it is "acceptable" to log transform one of your continuous predictor variables. I found a website (http://www.medcalc.org/manual/cox_proportional_hazards.php) that said "The Cox proportional regression model assumes ... there should be a linear relationship between the endpoint and predictor variables. Predictor variables that have a highly skewed distribution may require logarithmic transformation to reduce the effect of extreme values. Logarithmic transformation of a variable var can be obtained by entering LOG(var) as predictor variable".
I would really appreciate a second opinion from someone with more statistical knowledge on this topic. In a nutshell: is it OK/commonplace/etc. to transform (specifically log transform) predictor variables in a survival analysis model (e.g., a Coxph model)?
Thanks.
You can log transform any positive-valued predictor in Cox regression. This is frequently necessary, but it has some drawbacks.
Why log transform? There are a number of good reasons: you decrease the extent and effect of outliers, the data becomes more normally distributed, and so on.
When is it possible? I doubt there are circumstances in which you cannot do it, and I find it hard to believe that it would compromise the precision of your estimates.
Why not always do it? Because it becomes difficult to interpret the results for a predictor that has been log transformed. If you don't log transform and your predictor is, for example, blood pressure, a hazard ratio of 1.05 means a 5% increase in the risk of the event for each 1-unit increase in blood pressure. If you log transform blood pressure, a hazard ratio of 1.05 (it would most likely not land on 1.05 again after the transformation, but we'll stick with 1.05 for simplicity) means a 5% increase in risk for each 1-unit increase in log blood pressure. That is more difficult to grasp.
But if you are not interested in interpreting the particular variable you are thinking about log transforming (i.e., you just need to adjust for it as a covariate), go ahead and do it.
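As a hedged illustration (a plain Cox model, without the frailty term from the question), this is roughly what the transformation could look like in Python with the lifelines package and its bundled Rossi recidivism dataset, using prio (number of prior arrests) as a stand-in for a skewed positive predictor:

```python
import numpy as np
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                      # columns include week, arrest, prio, age, ...

# log1p handles the zeros in 'prio'; note this changes the interpretation of
# the hazard ratio to "per 1-unit increase in log(1 + prio)".
df["log_prio"] = np.log1p(df["prio"])
df = df.drop(columns=["prio"])

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()                    # hazard ratios appear in the exp(coef) column
```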

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale the weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in SPSS Statistics is treated as a frequency weight.
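For intuition (not SPSS syntax), here is a minimal Python sketch of treating the weights as sampling weights rather than frequency counts; the variance formula is the simple single-stage, with-replacement linearization approximation, so proper survey software is still preferable when there is clustering or stratification:

```python
import numpy as np

def weighted_mean_se(x, w):
    """Weighted mean and an approximate standard error, treating w as
    sampling (probability) weights rather than frequency counts.
    Single-stage, with-replacement Taylor linearization approximation."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(x)
    mean = np.sum(w * x) / w.sum()
    # Linearized contributions; these are unchanged if every weight is
    # multiplied by a constant, so dividing by the mean weight does not
    # alter this estimate.
    z = w * (x - mean) / w.sum()
    se = np.sqrt(n / (n - 1) * np.sum(z ** 2))
    return mean, se

# Made-up survey-style data to show the scale invariance
rng = np.random.default_rng(1)
x = rng.normal(50, 10, size=200)
w = rng.uniform(5_000, 30_000, size=200)
print(weighted_mean_se(x, w))
print(weighted_mean_se(x, w / w.mean()))   # identical result after rescaling to mean weight 1
```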

AIC values in model comparison

I was comparing two models using the AIC. However, I realized that both AIC values are strongly negative (one is -4752.66 and the other is close to it). I was wondering whether that is normal or whether I did something wrong while calculating it.
It's OK to have negative AIC values (see https://stats.stackexchange.com/questions/84076/negative-values-for-aic-in-general-mixed-model).
Since the two values are so close to each other, the model with the smaller AIC (delta AIC of 0) is your choice. If both deltas are smaller than two, look at the evidence ratio: an evidence ratio smaller than about 2.7 indicates that both models are good (search for "comparing models" at http://theses.ulaval.ca/archimede/fichiers/21842/apa.html#d0e5831).
If this is your case, you can use model averaging (Symonds, Matthew R. E., and Adnan Moussalli. "A brief guide to model selection, multimodel inference and model averaging in behavioural ecology using Akaike's information criterion." Behavioral Ecology and Sociobiology 65.1 (2011): 13-21).
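A small sketch of those quantities; the second AIC value below is made up, since the question only says it is close to the first:

```python
import numpy as np

def aic_comparison(aics):
    """Delta AIC, Akaike weights, and evidence ratios relative to the best model."""
    aics = np.asarray(aics, dtype=float)
    delta = aics - aics.min()
    w = np.exp(-0.5 * delta)
    w /= w.sum()
    evidence_ratio = w.max() / w       # equals exp(delta / 2)
    return delta, w, evidence_ratio

# Hypothetical second value; only -4752.66 comes from the question.
delta, w, er = aic_comparison([-4752.66, -4751.20])
print("delta AIC:      ", delta)
print("Akaike weights: ", w.round(3))
print("evidence ratios:", er.round(2))   # below 2.7 here, so both models remain plausible
```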
