Dataset properties for acceptable confirmatory factor analysis - statistics

I don't exactly figure out the data requirement for confirmatory factor analysis. what kind of dataset gives acceptable performance metrics for the confirmatory factor analysis? For example: if there are three factors, all the items that belong the same factor has high covariance is enough for good results? Thanks.

Related

logistic regression, model performance

I am interested to understand in which scenarios person should use sensitivity, specificity, and when should person opt for precision recall.
On a high level I understand for a balanced data set we should use precision, recall and if dataset is imbalanced we should use sensitivity and specificity. but I am not sure why they say it.
If you people have different perspective, pls throw some light on how to perceive these.
Thanks

Multi-features modeling based on one binary-feature which is rarely 1 (imbalanced data) when there is a cost

I need to model a multi-variate time-series data to predict a binary-target which is rarely 1 (imbalanced data).
This means that we want to model based on one feature is binary (outbreak), rarely 1?
All of the features are binary and rarely 1.
What is the suggested solution?
This features has an effect on cost function based on the following cost function. We want to know prepared or not prepared if the cost is the same as following.
Problem Definition:
Model based on outbreak which is rarely 1.
Prepared or not prepared to avoid the outbreak of a disease and the cost of outbreak is 20 times of preparation
cost of each day(next day):
cost=20*outbreak*!prepared+prepared
Model:prepare(prepare for next day)for outbreak for which days?
Questions:
Build a model to predict outbreaks?
Report the cost estimation for every year
csv file is uploaded and data is for end of the day
The csv file contains rows which each row is a day with its different features some of them are binary and last feature is outbreak which is rarely 1 and a main features considering in the cost.
You are describing class imbalance.
Typical approach is to generate balanced training data
by repeatedly running through examples containing
your (rare) positive class,
and each time choosing a new random sample
from the negative class.
Also, pay attention to your cost function.
You wouldn't want to reward a simple model
for always choosing the majority class.
My suggestions:
Supervised Approach
SMOTE for upsampling
Xgboost by tuning scale_pos_weight
replicate minority class eg:10 times
Try to use ensemble tree algorithms, trying to generate a linear surface is risky for your case.
Since your data is time series you can generate days with minority class just before real disease happened. For example you have minority class at 2010-07-20. Last observations before that time is 2010-06-27. You can generate observations by slightly changing variance as 2010-07-15, 2010-07-18 etc.
Unsupervised Approach
Try Anomaly Detection algorithms. Such as IsolationForest (try extended version of it also).
Cluster your observations check minority class becomes a cluster itself or not. If its successful you can label your data with cluster names (cluster1, cluster2, cluster3 etc) then train a decision tree to see split patterns. (Kmeans + DecisionTreeClassifier)
Model Evaluation
Set up a cost matrix. Do not use confusion matrix precision etc directly. You can find further information about cost matrix in here: http://mlwiki.org/index.php/Cost_Matrix
Note:
According to OP's question in comments groupby year could be done like this:
df["date"] = pd.to_datetime(df["date"])
df.groupby(df["date"].dt.year).mean()
You can use other aggregators also (mean, sum, count, etc)

Sensitivity Vs Positive Predicted Value - which is best?

I am trying to build a model on a class imbalanced dataset (binary - 1's:25% and 0's 75%). Tried with Classification algorithms and ensemble techniques. I am bit confused on below two concepts as i am more interested in predicting more 1's.
1. Should i give preference to Sensitivity or Positive Predicted Value.
Some ensemble techniques give maximum 45% of sensitivity and low Positive Predicted Value.
And some give 62% of Positive Predicted Value and low Sensitivity.
2. My dataset has around 450K observations and 250 features.
After power test i took 10K observations by Simple random sampling. While selecting
variable importance using ensemble technique's the features
are different compared to the features when i tried with 150K observations.
Now with my intuition and domain knowledge i felt features that came up as important in
150K observation sample are more relevant. what is the best practice?
3. Last, can i use the variable importance generated by RF in other ensemple
techniques to predict the accuracy?
Can you please help me out as am bit confused on which w
The preference between Sensitivity and Positive Predictive value depends on your ultimate goal of the analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
Altogether, these are two measures that look at the results from two different perspectives. Sensitivity gives you a probability that a test will find a "condition" among those you have it. Positive Predictive value looks at the prevalence of the "condition" among those who is being tested.
Accuracy is depends on the outcome of your classification: it is defined as (true positive + true negative)/(total), not variable importance's generated by RF.
Also, it is possible to compensate for the imbalances in the dataset, see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test

Linear SVM vs Nonlinear SVM high dimensional data

I am working on a project where I use Spark Mllib Linear SVM to classify some data (l2 regularization). I have like 200 positive observation, and 150 (generated) negative observation, each with 744 features, which represent the level of activity of a person in different region of a house.
I have run some tests and the "areaUnderROC" metric was 0.991 and it seems that the model is quite good in classify the data that I provide to it.
I did some research and I found that the linear SVM is good in high dimensional data, but the problem is that I don't understand how something linear can divide my data so well.
I think in 2D, and maybe this is the problem but looking at the bottom image, I am 90% sure that my data looks more like a non linear problem
So it is normal that I have good results on the tests? Am I doing something wrong? Should I change the approach?
I think you question is about 'why linear SVM could classfy my hight Dimensions data well even the data should be non-linear'
some data set look like non-linear in low dimension just like you example image on right, but it is literally hard to say the data set is definitely non-linear in high dimension because a nD non-linear may be linear in (n+1)D space.So i dont know why you are 90% sure your data set is non-linear even it is a high Dimension one.
At the end, I think it is normal that you have a good test result in test samples, because it indicates that your data set just is linear or near linear in high Dimension or it wont work so well.Maybe cross-validation could help you comfirm that your approach is suitable or not.

Similarity score for mixed (binary & numerical) vectors

I have a dataset which the instances are of about 200 features, about 11 of these features are numerical (integer) and the rest are binary (1/0) , these features may be correlated and they are of different probability distributions ,
It's been a while that I've been for a good similarity score which works for a mixed vector and takes into account the correlation between the features,
Do you know such similarity score?
Thanks,
Arian
In your case, the similarity function relies heavily on the input data patterns. You might benefit from learning a distance metric for the input space of data from a given collection
of pair of similar/dissimilar points that preserves the distance relation among the
training data.
Here is a nice survey paper.
The numerous types of distance measures, Euclidean, Manhattan, etc are going provide different levels of accuracy depending on the dataset. Best to read papers covering your method of data fitting and see what heuristics they use. Not to mention that some methods require only homogeneous data that scale accordingly. Here is a paper that talks about a whole host of measures that you might find attractive.
And as always, test and cross validate to see if there really is an impact from the mixing of feature types.

Resources