PCA results on imbalanced data with duplicates - statistics

I am using sklearn IPCA decomposition and was surprised that if I delete duplicates from my dataset, the result differs from the one I get on the "unclean" data.
What is the reason? As I understand it, the variance should be the same.

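The answer is simple: duplicates in the dataset change the variance. Each duplicated row gives its point extra weight, which shifts the mean and the covariance matrix that PCA decomposes.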
https://stats.stackexchange.com/a/381983/230117
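To make this concrete, here is a minimal sketch in plain Python (not sklearn). The four 2-D points and the helper names `covariance_2d` and `first_axis_angle` are made up for illustration. It compares the covariance and the direction of the first principal axis with and without repeated rows.

```python
import math

def covariance_2d(points):
    """Return (var_x, var_y, cov_xy) with the n-1 sample normalisation."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return var_x, var_y, cov_xy

def first_axis_angle(points):
    """Angle in radians of the first principal axis of 2-D data."""
    var_x, var_y, cov_xy = covariance_2d(points)
    # Closed form for the leading eigenvector of a 2x2 covariance matrix.
    return 0.5 * math.atan2(2 * cov_xy, var_x - var_y)

# Hypothetical data: four distinct points, then the last one repeated 5 more times.
unique_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (3.0, 3.0)]
with_duplicates = unique_points + [(3.0, 3.0)] * 5

cov_unique = covariance_2d(unique_points)
cov_dup = covariance_2d(with_duplicates)
angle_unique = first_axis_angle(unique_points)
angle_dup = first_axis_angle(with_duplicates)

print("covariance without duplicates:", cov_unique)
print("covariance with duplicates:   ", cov_dup)
print("first axis angle (degrees):",
      math.degrees(angle_unique), "vs", math.degrees(angle_dup))
```

The repeated point pulls the mean towards itself, so both the variances and the principal direction change even though the set of distinct points is the same.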

Related

Confusion matrix 5x5 formula for finding accuracy, precision, recall, and f1-score

I am trying to study the confusion matrix. I know about the 2x2 confusion matrix, but I still don't understand how to use a 5x5 confusion matrix to compute accuracy, precision, recall, and F1 score. Can anyone help me with this? I appreciate any help.
See my answer here: Calculating Equal error rate(EER) for a multi class classification problem
In short, one strategy is to split the multiclass problem into a set of binary classification problems, one "one vs. all others" problem per class. For each binary problem you can then calculate F1, precision and recall, and if you want you can average the per-class scores (uniformly or weighted) to get a single F1 score that represents the multiclass problem.
As for a confusion matrix larger than 2x2: the rows are the true labels and the columns are the predicted labels. The number in cell (i,j) is the number of samples from class i that were classified as class j (note that i=j corresponds to a correct prediction). The accuracy is the trace of the confusion matrix divided by the number of samples.
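The one-vs-all calculation described above can be sketched in plain Python as follows. The 5x5 matrix is invented for illustration, and `per_class_metrics` is a hypothetical helper name, not a library function. Rows are true labels and columns are predicted labels, as in the answer.

```python
def per_class_metrics(cm):
    """Accuracy plus one-vs-all precision, recall and F1 for each class.

    cm[i][j] = number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total  # trace / N

    metrics = []
    for k in range(n_classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                                # row k, off-diagonal
        fp = sum(cm[i][k] for i in range(n_classes)) - tp   # column k, off-diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": tp + fn})

    macro_f1 = sum(m["f1"] for m in metrics) / n_classes
    weighted_f1 = sum(m["f1"] * m["support"] for m in metrics) / total
    return accuracy, metrics, macro_f1, weighted_f1

# Hypothetical 5x5 confusion matrix (10 samples per true class).
cm = [
    [8, 1, 0, 1, 0],
    [2, 7, 1, 0, 0],
    [0, 1, 9, 0, 0],
    [1, 0, 0, 8, 1],
    [0, 0, 1, 1, 8],
]

accuracy, metrics, macro_f1, weighted_f1 = per_class_metrics(cm)
print("accuracy:", accuracy)
for k, m in enumerate(metrics):
    print("class", k, m)
print("macro F1:", macro_f1, "weighted F1:", weighted_f1)
```

Here the uniform ("macro") and the support-weighted averages coincide because every class has the same number of samples; with imbalanced classes they generally differ.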

Performing a Chi-square test on the logarithm of counts for large count data

I am just curious-- the chi square test-statistic can be sensitive to large count data. Has anyone ever seen/heard of someone:
1. Counting the raw frequencies
2. Taking the logarithm of the frequencies
3. Rounding that to be an integer
4. Performing the chi-square test on the modified count data
?
I think a better approach would be through an ANOVA model or linear regression with interactions, but still am curious.
Thanks!
I've tried the problem using the above approach.
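For reference, the four steps in the question can be sketched in plain Python like this. The 2x2 table is made up and `chi_square_statistic` is a hypothetical helper that computes the Pearson statistic by hand. It only shows what the transformation does to the statistic and is not a recommendation of the procedure.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Step 1: raw frequencies (hypothetical counts).
raw = [[1000, 1200],
       [1500, 1300]]

# Steps 2 and 3: natural log of each count, rounded to an integer.
logged = [[round(math.log(count)) for count in row] for row in raw]

# Step 4: chi-square statistic on raw and on transformed counts.
stat_raw = chi_square_statistic(raw)
stat_log = chi_square_statistic(logged)

# Multiplying every count by 10 multiplies the statistic by 10,
# which is the sensitivity to large counts mentioned in the question.
stat_raw_x10 = chi_square_statistic([[10 * c for c in row] for row in raw])

print("raw counts:", raw, "statistic:", stat_raw)
print("rounded log counts:", logged, "statistic:", stat_log)
print("raw counts x10 statistic:", stat_raw_x10)
```

In this made-up table all four rounded log counts are 7, so the transformed statistic is 0 and the difference visible in the raw counts disappears.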

Remove outliers in multiple columns from a spark dataframe

I have a dataset of around 10 integer features and I wish to remove outliers from my dataset, from each feature.
What I have done in the past is compute the average and standard deviation for each feature and then do a pass over the dataset, discarding rows that qualify as outliers. Doing this for each column/feature gets rid of every row that has at least one outlier feature.
Since parsing the dataset multiple times is not optimal, I am looking for a more computationally efficient way to do this. Can someone propose a better way, so that the dataset is parsed once and all outlier rows are removed?
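One way to avoid a separate pass per column is to collect the statistics for all columns in a single aggregation and then apply one combined filter. The sketch below shows that shape in plain Python, not in Spark. The function names, the 3-standard-deviation threshold and the sample rows are assumptions for illustration. The same structure, one aggregation over all columns followed by one filter, should carry over to a Spark DataFrame.

```python
import math

def column_stats(rows):
    """Mean and population standard deviation of every column in one pass."""
    n = len(rows)
    n_cols = len(rows[0])
    sums = [0.0] * n_cols
    sums_of_squares = [0.0] * n_cols
    for row in rows:                      # single pass over the data
        for j, value in enumerate(row):
            sums[j] += value
            sums_of_squares[j] += value * value
    means = [s / n for s in sums]
    stds = [math.sqrt(max(q / n - m * m, 0.0))
            for q, m in zip(sums_of_squares, means)]
    return means, stds

def drop_outlier_rows(rows, threshold=3.0):
    """Keep rows whose every feature lies within threshold stds of its mean."""
    means, stds = column_stats(rows)
    return [row for row in rows
            if all(abs(v - m) <= threshold * s
                   for v, m, s in zip(row, means, stds))]

# Hypothetical data: 20 ordinary rows and one row with an extreme second feature.
rows = [(10 + i % 3, 100 + i % 5) for i in range(20)] + [(11, 1000)]
cleaned = drop_outlier_rows(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

This still reads the data twice in total (once for the statistics, once for the filter), but the number of passes no longer grows with the number of features.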

Is the loss in keras in percentage?

I am trying to implement VGGNet-16 for depth map prediction from single image. In the training the RMSE loss comes out to be 0.1599.
That loss value, is it in percentage or not?
No. If you want the percentage of correctly classified data, look at the accuracy instead.
Definition of RMSE from Wikipedia:
The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed.
It's always non-negative, and values closer to zero are better.
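A small sketch of the definition, using made-up depth values and a hypothetical `rmse` helper (this is not the Keras implementation), shows that the value is expressed in the units of the target rather than as a percentage:

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical depths in metres.
depth_true = [2.0, 4.0, 6.0]
depth_pred = [2.0, 4.0, 9.0]
error = rmse(depth_pred, depth_true)

# The same data expressed in centimetres: the RMSE scales with the units.
error_scaled = rmse([100 * v for v in depth_pred],
                    [100 * v for v in depth_true])

print("RMSE in metres:", error)
print("RMSE in centimetres:", error_scaled)
```

So a loss of 0.1599 would mean a typical error of about 0.16 in whatever units (or normalised range) the depth maps use.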

Why does k=1 in KNN give the best accuracy?

I am using Weka IBk for text classification. Each document is basically a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy. How can this be explained?
If you are querying your learner with the same dataset you trained on and k=1, the outputs should be perfect, unless your data contains points with the same parameters but different outcome values. Do some reading on overfitting as it applies to KNN learners.
In the case where you are querying with the same dataset as you trained with, the query will come in for each learner with some given parameter values. Because that point exists in the learner from the dataset you trained with, the learner will match that training point as closest to the parameter values and therefore output whatever Y value existed for that training point, which in this case is the same as the point you queried with.
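The effect can be reproduced with a tiny hand-written nearest-neighbour classifier. This is a sketch in plain Python, not Weka's IBk, and the 1-D data with deliberately mixed labels is made up.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def accuracy(train, samples, k):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in samples)
    return correct / len(samples)

# Hypothetical training set: (features, label) pairs with some label noise.
train = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "b"),
         ((3.0,), "a"), ((4.0,), "b"), ((5.0,), "b")]

# Querying with the training set itself: with k=1 every point finds itself.
train_acc_k1 = accuracy(train, train, k=1)
train_acc_k3 = accuracy(train, train, k=3)
print("training accuracy, k=1:", train_acc_k1)
print("training accuracy, k=3:", train_acc_k3)
```

With k=1 each query is at distance zero from its own copy in the training set, so the training accuracy is perfect regardless of how noisy the labels are.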
The possibilities are:
The training data and the test data are the same data
The test data is very similar to the training data
The boundaries between classes are very clear
The optimal value of k depends on the data. In general, a larger k reduces the effect of noise on the classification, but makes the boundaries between classes more blurred.
If your result variable contains values of 0 or 1 - then make sure you are using as.factor, otherwise it might be interpreting the data as continuous.
Accuracy is generally calculated on points that are not in the training dataset, that is, on unseen data. Only accuracy measured on unseen values lets you claim that this is how accurate your model really is.
If you calculate accuracy on the training dataset for KNN with k=1, you get 100%, because the model has already seen those values and k=1 forms a rough decision boundary around them. When you then calculate the accuracy on unseen data, the model can perform really badly: the training error is very low but the actual error is very high. So it is better to choose an optimal k. To do that, plot the error against k for the unseen data (the test data) and choose the value of k where the error is lowest.
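A minimal sketch of that selection procedure, again in plain Python with an invented 1-D dataset and a hypothetical `knn_predict` helper, computes the held-out error for several values of k and picks the smallest one. The resulting `errors` dictionary is what you would plot.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: class "a" near 0..3, class "b" near 6..9,
# with one mislabelled point at x=2.
train = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "b"), ((3.0,), "a"),
         ((6.0,), "b"), ((7.0,), "b"), ((8.0,), "b"), ((9.0,), "b")]

# Held-out points that were not used for training.
test = [((1.9,), "a"), ((2.2,), "a"), ((7.5,), "b"), ((0.5,), "a")]

errors = {}
for k in (1, 3, 5, 7):
    wrong = sum(knn_predict(train, x, k) != y for x, y in test)
    errors[k] = wrong / len(test)

best_k = min(errors, key=errors.get)  # first k with the lowest held-out error
print("held-out error per k:", errors)
print("chosen k:", best_k)
```

In this toy data k=1 follows the mislabelled point, a moderate k votes it down, and a very large k lets the bigger class swamp the smaller one, which matches the trade-off between noise and blurred boundaries described above.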
To answer your question now,
1) you might have trained on the entire dataset and then chosen a subset of that same dataset as the test dataset,
(or)
2) you might have measured accuracy on the training dataset.
If neither of these is the case, then please check the accuracy values for higher k; you should get even better accuracy for k>1 on the unseen data, that is, the test data.
