Performing a chi-square test on the logarithm of counts for large count data

I am just curious: the chi-square test statistic can be sensitive to large count data. Has anyone ever seen/heard of someone:
1. Counting the raw frequencies
2. Taking the logarithm of the frequencies
3. Rounding that to be an integer
4. Performing the chi-square test on the modified count data?
I think a better approach would be an ANOVA model or a linear regression with interactions, but I am still curious.
Thanks!
I've tried the problem using the above approach.
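For concreteness, a minimal sketch of steps 1-4, shown here as a test of independence on a hypothetical 2x3 contingency table (the table and the use of SciPy's chi2_contingency are illustrative assumptions, not part of the original question):

# Steps 1-4 from the question: raw counts, log them, round to integers, run the chi-square test.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of large counts (step 1).
raw = np.array([[120000,  95000, 143000],
                [118500, 101000, 139000]])

# Steps 2-3: natural log of each cell, rounded to the nearest integer.
logged = np.round(np.log(raw)).astype(int)

# Step 4: chi-square test on both the raw and the modified tables.
for name, table in [("raw", raw), ("log-rounded", logged)]:
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{name:12s}  chi2 = {chi2:9.3f}   p = {p:.4f}   dof = {dof}")

One reason this is rarely done: after the log transform the cells no longer represent counts of independent events, so the usual chi-square sampling distribution is not justified for the modified table, which is one argument for the ANOVA or regression-with-interactions approach mentioned above.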

Related

Why is there a need for an approximate solution to 0-1 Knapsack if input values are high?

On the GeeksforGeeks link, it is mentioned that "if input values are high, then the solution for 0-1 Knapsack becomes infeasible and there is a need of approximate solution."
In the approximate solution, i.e. the FPTAS solution, the values corresponding to the weights are modified in this way:
k = (maxVal * ε) / n
val'[i] = floor(val[i] / k)
Then the same DP-based solution is applied.
My doubt is this: the complexity of the actual DP-based solution of 0-1 Knapsack depends on the knapsack's weight capacity and the number of items. It doesn't involve the values, so why do we need an approximate solution when the values are high? And does doing this make the complexity of the approximate solution better than that of the actual solution?
You're right. I think the correct sentence is "if input weights are high, then the solution for 0-1 Knapsack becomes infeasible and there is a need of an approximate solution."
In my opinion, in the GeeksforGeeks post "input values" means "all numbers which are given in the input"; it includes both weights and values.
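As a concrete illustration (not from the GeeksforGeeks post), a minimal sketch of the value-scaling step with hypothetical values and ε. The FPTAS is built on the alternative DP that is indexed by total value rather than by weight capacity, so its running time is pseudo-polynomial in the values; the scaling caps the largest scaled value at roughly n/ε, which is what makes the overall running time polynomial in n and 1/ε.

# Sketch of the FPTAS value-scaling step: k = maxVal*eps/n, val'[i] = floor(val[i]/k).
import math

def scale_values(values, eps):
    n = len(values)
    k = max(values) * eps / n
    return [math.floor(v / k) for v in values]

values = [6000000, 1000000, 4500000]   # hypothetical, deliberately large values
eps = 0.5
print(scale_values(values, eps))       # [6, 1, 4]; no scaled value exceeds n/eps = 6

The scaled values are then fed to the value-indexed DP, so its table size no longer depends on how large the original values were.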

Goodness of fit for Gaussian process output using matlab?

I used fitrgp from the Gaussian process MATLAB toolbox and calculated the predicted values for a given observation. I did this in three different cases and got three arrays of predicted values, say ypred1, ypred2 and ypred3. Now I want to test the goodness of fit of these outputs in order to judge which algorithm gives more accurate results. The details of fitrgp are given at the link below:
https://uk.mathworks.com/help/stats/gaussian-process-regression-models.html
I would be grateful if anyone could help me in this regard. Thank you in advance.
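One common way to judge which run gives more accurate results is to score each prediction array against the observed target values with goodness-of-fit metrics such as RMSE and R^2. A minimal sketch (in Python rather than MATLAB; y_true and the ypred arrays below are hypothetical placeholders for the observed values and the three fitrgp outputs):

# Compare prediction arrays against observed values using RMSE and R^2.
import numpy as np

def goodness_of_fit(y_true, y_pred):
    resid = y_true - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, r2

y_true = np.array([1.0, 2.1, 2.9, 4.2])                      # observed values (hypothetical)
preds = {"ypred1": np.array([1.1, 2.0, 3.0, 4.0]),
         "ypred2": np.array([0.8, 2.5, 2.7, 4.5]),
         "ypred3": np.array([1.0, 2.1, 3.1, 4.1])}
for name, y_pred in preds.items():
    rmse, r2 = goodness_of_fit(y_true, y_pred)
    print(f"{name}: RMSE = {rmse:.3f}, R^2 = {r2:.3f}")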

Convert GMM-UBM scores to an equivalent accuracy percentage

I have constructed a GMM-UBM model for speaker recognition. The output of the models adapted for each speaker is a set of scores calculated by log likelihood ratio. Now I want to convert these likelihood scores to an equivalent number between 0 and 100. Can anybody guide me, please?
There is no straightforward formula. You can do simple things like
prob = exp(logratio_score)
but those might not reflect the true distribution of your data. The computed probability percentage of your samples will not be uniformly distributed.
Ideally you need to take a large dataset and collect statistics on what acceptance/rejection rate you have for each score. Then, once you build a histogram, you can normalize the score difference by that histogram to make sure that, say, 30% of your subjects are accepted when you see a certain score difference. That normalization will allow you to create uniformly distributed probability percentages. See for example How to calculate the confidence intervals for likelihood ratios from a 2x2 table in the presence of cells with zeroes
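As a rough sketch of the histogram idea above (an illustration, not part of the original answer): collect scores on a development set, then map a new score to the percentage of development scores it exceeds, i.e. its empirical percentile. The development scores below are simulated placeholders.

# Empirical-percentile calibration of LLR scores to a 0-100 scale.
import numpy as np

rng = np.random.default_rng(0)
dev_scores = rng.normal(loc=0.5, scale=1.2, size=10000)   # hypothetical development-set LLR scores

def score_to_percent(score, dev_scores):
    # Percentage of development scores that the given score exceeds (0..100).
    return 100.0 * np.mean(dev_scores <= score)

for s in (-2.0, 0.5, 3.0):
    print(f"LLR {s:+.1f}  ->  {score_to_percent(s, dev_scores):5.1f}")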
This problem is rarely solved in speaker identification systems because confidence intervals are not what you actually want to display. You need a simple accept/reject decision, and for that you need to know the false reject and false accept rates. So it is enough to find just a threshold, not to build the whole distribution.

Why does k=1 in KNN give the best accuracy?

I am using Weka IBk for text classification. Each document is basically a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy. How can this be explained?
If you are querying your learner with the same dataset you trained on, then with k=1 the output values should be perfect, unless you have data points with the same parameters but different outcome values. Do some reading on overfitting as it applies to KNN learners.
In the case where you are querying with the same dataset you trained with, each query comes in with some given parameter values. Because that point exists in the learner from the training dataset, the learner will match it as the closest training point and therefore output whatever Y value existed for that training point, which in this case is the same as the point you queried with.
The possibilities are:
The training data and the test data are the same
The test data are highly similar to the training data
The boundaries between classes are very clear
The optimal value of k depends on the data. In general, a larger value of k reduces the effect of noise on the classification, but it makes the boundaries between classes more blurred.
If your result variable contains values of 0 or 1, make sure you are using as.factor; otherwise the data might be interpreted as continuous.
Accuracy is generally calculated for points that are not in the training dataset, i.e. unseen data points, because only the accuracy calculated on unseen values (values not in the training dataset) can be claimed as your model's accuracy.
If you calculate accuracy on the training dataset with KNN and k=1, you get 100%, as the values have already been seen by the model and a rough decision boundary is formed for k=1. When you calculate the accuracy on unseen data it performs really badly; that is, the training error is very low but the actual (test) error is very high. So it is better to choose an optimal k. To choose an optimal k, plot a graph of error against k for the unseen (test) data and choose the value of k where the error is lowest.
To answer your question now:
1) You might have taken the entire dataset as the training dataset and chosen a subpart of that dataset as the test dataset,
(or)
2) you might have calculated accuracy on the training dataset.
If neither of these is the case, please check the accuracy values for higher k; you will get even better accuracy for k > 1 on the unseen (test) data.
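A minimal sketch of that check: measure accuracy on a held-out test split for several values of k and pick the one with the lowest test error. scikit-learn and the 20 newsgroups data are used here only as stand-ins for Weka IBk and the 15,000-document dataset.

# Test accuracy of KNN on short text documents as a function of k.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer().fit_transform(data.data)
X_tr, X_te, y_tr, y_te = train_test_split(X, data.target, test_size=0.3, random_state=0)

for k in (1, 3, 5, 10, 20):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"k = {k:2d}  test accuracy = {acc:.3f}")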

K-means text document clustering. How to calculate intra- and inter-cluster similarity?

I cluster thousands of documents whose vector components are calculated from tf-idf. I use cosine similarity. I did a frequency analysis of words in the clusters to check the differences in the top words, but I'm not sure how to calculate the similarity numerically for this sort of document collection.
I compute the internal similarity of a cluster as the average similarity of each document to the centroid of the cluster. If I averaged over pairs of documents instead, it would be based on only a small number of pairs.
External similarity is calculated as the average similarity over all pairs of cluster centroids.
Am I calculating this correctly? My inter-cluster similarity values average from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the documents covering a wide range of computer science topics. Intra-cluster similarity is in the 0.3-0.7 range. Can the result look like that? On the Internet I found various ways of measuring this and do not know which one to use other than the one that was my idea. I am quite desperate.
Thank you so much for your advice!
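For concreteness, a minimal sketch of the two quantities described above, assuming L2-normalized tf-idf vectors (so the dot product equals cosine similarity) and scikit-learn's KMeans; the documents and parameters are illustrative only.

# Intra-cluster and inter-cluster cosine similarity for tf-idf document vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["neural networks for image recognition",
        "convolutional neural network tutorial",
        "database index tuning tips",
        "sql query optimization guide"]               # hypothetical documents

X = TfidfVectorizer().fit_transform(docs).toarray()   # rows are L2-normalized by default
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids = normalize(km.cluster_centers_)            # re-normalize so dot product = cosine

# Intra-cluster: average cosine similarity of each document to its own cluster centroid.
intra = np.mean(np.sum(X * centroids[km.labels_], axis=1))

# Inter-cluster: average cosine similarity over all pairs of distinct centroids.
sims = centroids @ centroids.T
inter = sims[np.triu_indices_from(sims, k=1)].mean()

print(f"intra-cluster similarity: {intra:.3f}")
print(f"inter-cluster similarity: {inter:.3f}")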
Using k-means with anything but squared Euclidean distance is risky. It may stop converging, as the convergence proof relies on both the mean update and the assignment step optimizing the same criterion. K-means minimizes squared deviations, not distances!
For a k-means variant that can handle arbitrary distance functions (and has guaranteed convergence), you will need to look at k-medoids.