For independent Bernoulli random variables $X_1, X_2, \dots, X_n$, the distribution of the maximum is
$$Y = \max\{X_1, X_2, \dots, X_n\} \sim \mathrm{Bernoulli}\left(1 - \prod_{i=1}^{n} (1 - p_i)\right).$$
But for correlated Bernoulli random variables, what is the distribution of the maximum? Any advice? Many thanks.
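For reference, a quick Monte Carlo sketch confirming the independent-case formula above; the simulation size and the probabilities in p are illustrative choices, not taken from the question.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.7])                 # marginal success probabilities
X = rng.random((100000, p.size)) < p          # independent Bernoulli(p_i) draws
empirical = X.max(axis=1).mean()              # simulated P(max = 1)
theoretical = 1 - np.prod(1 - p)              # 1 - prod_i (1 - p_i)
print(empirical, theoretical)                 # both should be close to 0.88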
The K-means method cannot deal with anisotropic points. According to scikit-learn, DBSCAN and the Gaussian mixture model should be able to handle this. I have tried both approaches, but they are not working for my dataset.
DBSCAN
I used the following code:
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.1, min_samples=5).fit(X_train, Y_train)
labels_train = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels_train)) - (1 if -1 in labels_train else 0)
print('Estimated number of clusters: %d' % n_clusters_)
and only 1 cluster (Estimated number of clusters: 1) was detected as shown here.
Gaussian Mixture model
The code was as follows:
from sklearn import mixture

gmm = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmm.fit(X_train, Y_train)
labels_train = gmm.predict(X_train)
print(gmm.bic(X_train))
The two clusters could not be distinguished as shown here.
How can I detect the two clusters?
Read the documentation.
fit(X, y=None, sample_weight=None)
X : array or sparse (CSR) matrix of shape (n_samples, n_features) [...]
...
y : Ignored
So your invocation ignores the y coordinate.
Don't we all love python/sklearn, because it doesn't even warn you of this, but silently ignores y?
X should be the entire data, not just the x coordinates.
The notion of "train" and "predict" does not make sense for clustering. Don't use it. Only use fit_predict.
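A minimal sketch of the corrected invocation, assuming (as above) that X_train and Y_train hold the x and y coordinates of the points; the stacking into a single array is part of that assumption:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.column_stack([X_train, Y_train])                   # the entire data, not just the x coordinates
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)    # fit_predict, no train/predict split
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters)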
I have a histogram of sorted random numbers and a Gaussian overlay. The histogram represents observed values per bin (applying this base case to a much larger dataset) and the Gaussian is an attempt to fit the data. Clearly, this Gaussian does not represent the best fit to the histogram. The code below is the formula for a Gaussian.
from numpy import exp

normc, mu, sigma = 30.845, 50.5, 7  # normalization constant, mean, stdev
gauss = lambda x: normc * exp(-(x - mu)**2 / (2 * sigma**2))
I calculated the expectation values per bin (the area under the curve) and the number of observed values per bin. There are several methods to find the 'best' fit; I am concerned with the best fit possible by minimizing Chi-Squared. In this formula for Chi-Squared, the expectation value is the area under the curve per bin and the observed value is the number of occurrences of sorted data values per bin. So I want to fluctuate normc, mu, and sigma near their given values to find the combination of normc, mu, and sigma that produces the smallest Chi-Squared, as these will be the parameters I can plug into the code above to overlay the best-fit Gaussian on my histogram. I am trying to use the scipy module to minimize my Chi-Squared as done in this example. Since I need to fluctuate parameters, I will use the function gauss (defined above) to plot the Gaussian overlay, and will define a new function to find the minimum Chi-Squared.
def gaussmin(var, data):
    # var[0] = normc
    # var[1] = mu
    # var[2] = sigma
    # data is the sorted random numbers, represents unbinned observed values
    for index in range(len(data)):
        return var[0] * exp(-(data[index] - var[1])**2 / (2 * (var[2]**2)))
    # I realize this will return a new value for each index of data, any guidelines to fix?
After this, I am stuck. How can I fluctuate the parameters to find the normc, mu, and sigma that produce the best fit? My last attempt at a solution is below:
import scipy.optimize as opt

var = [normc, mu, sigma]
result = opt.minimize(chi2, [normc, mu, sigma])
# chi2 is the chi-square value obtained via scipy
# chisquare input (a, b)
# where a is the number of occurrences per bin, b is the expected value per bin
# b is dependent upon normc, mu, sigma
print(result)
# data is a list; can I keep it as a constant and only fluctuate the parameters in var?
There are plenty of examples online for functions of a single scalar variable, but I cannot find any for functions of several parameters.
PS - I can post my full code so far, but it's a bit lengthy. If you would like to see it, just ask and I can post it here or provide a Google Drive link.
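One way to set this up, as a minimal sketch rather than a definitive implementation: keep the binned counts fixed and let scipy.optimize.minimize vary only the parameter vector. The helper name chi2_objective, the choice of 20 bins, and the Nelder-Mead method are assumptions; data, normc, mu, and sigma are the objects already defined above.

import numpy as np
from scipy import optimize as opt

def chi2_objective(params, counts, bin_edges):
    # params = [normc, mu, sigma]; counts and bin_edges stay constant during the fit
    normc, mu, sigma = params
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    widths = np.diff(bin_edges)
    # expected counts per bin: Gaussian height at the bin centre times the bin width
    expected = normc * np.exp(-(centers - mu)**2 / (2 * sigma**2)) * widths
    return np.sum((counts - expected)**2 / expected)

counts, bin_edges = np.histogram(data, bins=20)           # same binning as the histogram
result = opt.minimize(chi2_objective, x0=[normc, mu, sigma],
                      args=(counts, bin_edges), method='Nelder-Mead')
print(result.x)                                           # fitted normc, mu, sigma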
A Gaussian distribution is completely characterized by its mean and variance (or std deviation). Under the hypothesis that your data are normally distributed, the best fit will be obtained by using x-bar as the mean and s-squared as the variance. But before doing so, I'd check whether normality is plausible using, e.g., a q-q plot.
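A minimal sketch of that moment-based fit plus the suggested normality check; data is the unbinned sample from the question, and the use of scipy.stats.probplot for the q-q plot is an illustrative choice:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

mu_hat = np.mean(data)               # x-bar, the sample mean
sigma_hat = np.std(data, ddof=1)     # s, the sample standard deviation
print(mu_hat, sigma_hat)

stats.probplot(data, dist="norm", plot=plt)   # q-q plot against a normal distribution
plt.show()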
I am trying to implement the Expectation-Maximization algorithm (Gaussian mixture model) on a data set data = [[x, y], ...]. I am using the mv_norm.pdf(data, mean, cov) function to calculate the cluster responsibilities. But after 6-7 iterations of computing new covariance matrices, a cov matrix becomes singular, i.e. the determinant of cov is 0 (a very small value), and hence it gives errors such as
ValueError: the input matrix must be positive semidefinite
and
raise np.linalg.LinAlgError('singular matrix')
Can someone suggest any solution for this?
import copy
import numpy as np
from scipy.stats import multivariate_normal as mv_norm

# E-step: compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data, centroids, cov_m):
    pdfmain = [[] for i in range(0, len(data))]
    for i in range(0, len(data)):
        sum1 = 0
        pdfeach = [[] for m in range(0, len(centroids))]
        pdfeach[0] = 1/3. * mv_norm.pdf(data[i], mean=centroids[0],
                                        cov=[[cov_m[0][0][0], cov_m[0][0][1]], [cov_m[0][1][0], cov_m[0][1][1]]])
        pdfeach[1] = 1/3. * mv_norm.pdf(data[i], mean=centroids[1],
                                        cov=[[cov_m[1][0][0], cov_m[1][0][1]], [cov_m[1][1][0], cov_m[1][1][1]]])
        pdfeach[2] = 1/3. * mv_norm.pdf(data[i], mean=centroids[2],
                                        cov=[[cov_m[2][0][0], cov_m[2][0][1]], [cov_m[2][1][0], cov_m[2][1][1]]])
        sum1 += pdfeach[0] + pdfeach[1] + pdfeach[2]
        pdfeach[:] = [x / sum1 for x in pdfeach]
        pdfmain[i] = pdfeach

    global old_pdfmain
    if old_pdfmain == pdfmain:
        return
    old_pdfmain = copy.deepcopy(pdfmain)
    softcounts = [sum(i) for i in zip(*pdfmain)]
    calculate_cluster_weights(data, centroids, pdfmain, softcounts)
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
The problem is that your data lie on a manifold of dimension strictly smaller than the input dimension. For example, your data may lie on a circle while you have 3-dimensional points. As a consequence, when your method tries to estimate a 3-dimensional ellipsoid (covariance matrix) that fits your data, it fails, since the optimal one is a 2-dimensional ellipse (the third dimension is 0).
How to fix it? You will need some regularization of your covariance estimator. There are many possible solutions, all in the M-step, not the E-step; the problem is with computing the covariance:
Simple solution: instead of doing something like cov = np.cov(X), add a regularizing term, like cov = np.cov(X) + eps * np.identity(X.shape[1]) with a small eps, as in the sketch below.
Use a nicer estimator, like the LedoitWolf estimator from scikit-learn.
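A minimal sketch of both options; the helper name regularized_cov, the eps value, and the array X (the points weighted into one cluster during the M-step) are illustrative assumptions:

import numpy as np
from sklearn.covariance import LedoitWolf

def regularized_cov(X, eps=1e-6):
    # X: (n_samples, n_features) points belonging to one cluster
    cov = np.cov(X, rowvar=False)
    return cov + eps * np.identity(X.shape[1])   # keeps the estimate positive definite

# Alternative: a shrinkage estimator that is well-conditioned by construction
# cov = LedoitWolf().fit(X).covariance_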
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
This makes no sense; the covariance matrix values have nothing to do with the number of clusters. You can initialize it with anything more or less reasonable.
I was searching for applications for random forests, and I found the following knowledge competition on Kaggle:
https://www.kaggle.com/c/forest-cover-type-prediction.
Following the advice at
https://www.kaggle.com/c/forest-cover-type-prediction/forums/t/8182/first-try-with-random-forests-scikit-learn,
I used sklearn to build a random forest with 500 trees.
The .oob_score_ was ~2%, but the score on the holdout set was ~75%.
There are only seven classes to classify, so 2% is really low. I also consistently got scores near 75% when I cross validated.
Can anyone explain the discrepancy between the .oob_score_ and the holdout/cross validated scores? I would expect them to be similar.
There's a similar question here:
https://stats.stackexchange.com/questions/95818/what-is-a-good-oob-score-for-random-forests
Edit: I think it might be a bug, too.
The code is given by the original poster in the second link I posted. The only change is that you have to set oob_score = True when you build the random forest.
I didn't save the cross validation testing I did, but I could redo it if people need to see it.
Q: Can anyone explain the discrepancy ...
A: The sklearn.ensemble.RandomForestClassifier object and its observed .oob_score_ attribute value are not a bug-related issue.
First, RandomForest-based predictors { Classifier | Regressor } belong to a rather specific corner of the so-called ensemble methods, so be informed that typical approaches, incl. Cross-Validation, do not work the same way as for other AI/ML learners.
The RandomForest "inner" logic works heavily with a RANDOM-PROCESS, by which the samples (DataSET X) with known y == { labels (for Classifier) | targets (for Regressor) } get split throughout the forest generation, where each tree gets bootstrapped on a RANDOMLY split DataSET: a part that the tree can see and a part that the tree will not see (forming thus an inner oob-subSET).
Besides other effects on sensitivity to overfitting et al., the RandomForest ensemble does not need to be Cross-Validated, because it does not over-fit by design. Many papers, and also Breiman's (Berkeley) empirical proofs, have provided support for this statement, as they brought evidence that a CV-ed predictor will have the same .oob_score_.
import sklearn.ensemble
aRF_PREDICTOR = sklearn.ensemble.RandomForestRegressor( n_estimators = 10, # The number of trees in the forest.
criterion = 'mse', # { Regressor: 'mse' | Classifier: 'gini' }
max_depth = None,
min_samples_split = 2,
min_samples_leaf = 1,
min_weight_fraction_leaf = 0.0,
max_features = 'auto',
max_leaf_nodes = None,
bootstrap = True,
oob_score = False, # SET True to get inner-CrossValidation-alike .oob_score_ attribute calculated right during Training-phase on the whole DataSET
n_jobs = 1, # { 1 | n-cores | -1 == all-cores }
random_state = None,
verbose = 0,
warm_start = False
)
aRF_PREDICTOR.estimators_ # aList of <DecisionTreeRegressor> The collection of fitted sub-estimators.
aRF_PREDICTOR.feature_importances_ # array of shape = [n_features] The feature importances (the higher, the more important the feature).
aRF_PREDICTOR.oob_score_ # float Score of the training dataset obtained using an out-of-bag estimate.
aRF_PREDICTOR.oob_prediction_ # array of shape = [n_samples] Prediction computed with out-of-bag estimate on the training set.
aRF_PREDICTOR.apply( X ) # Apply trees in the forest to X, return leaf indices.
aRF_PREDICTOR.fit( X, y[, sample_weight] ) # Build a forest of trees from the training set (X, y).
aRF_PREDICTOR.fit_transform( X[, y] ) # Fit to data, then transform it.
aRF_PREDICTOR.get_params( [deep] ) # Get parameters for this estimator.
aRF_PREDICTOR.predict( X ) # Predict regression target for X.
aRF_PREDICTOR.score( X, y[, sample_weight] ) # Returns the coefficient of determination R^2 of the prediction.
aRF_PREDICTOR.set_params( **params ) # Set the parameters of this estimator.
aRF_PREDICTOR.transform( X[, threshold] ) # Reduce X to its most important features.
One shall also be informed that default values do not serve best, still less do they serve well under all circumstances. One shall pay attention to the problem domain so as to propose a reasonable ensemble parametrisation before moving further.
Q: What is a good .oob_score_ ?
A: .oob_score_ is RANDOM! ... Yes, it MUST (be random).
While this may sound like a provocative epilogue, do not throw your hopes away.
The RandomForest ensemble is a great tool. Some problems may come with categorical values in the features (DataSET X); however, the costs of processing the ensemble remain adequate once you need not struggle with either bias or overfitting. That's great, isn't it?
Due to the need to reproduce the same results on subsequent re-runs, it is a recommendable practice to (re-)set numpy.random and .set_params( random_state = ... ) to a known state before the RANDOM-PROCESS (embedded into every bootstrapping of the RandomForest ensemble). Doing that, one may observe a "de-noised" progression of the RandomForest-based predictor towards a better .oob_score_ that is due to truly improved predictive powers, introduced by more ensemble members ( n_estimators ) and less constrained tree construction ( max_depth, max_leaf_nodes et al. ), and not just stochastically, by "better luck" during the RANDOM-PROCESS of splitting the DataSET.
Going closer towards better solutions typically involves more trees in the ensemble (RandomForest decisions are based on a majority vote, so 10 estimators is not a big basis for making good decisions on highly complex DataSETs). Numbers above 2000 are not uncommon. One may iterate over a range of sizings (with the RANDOM-PROCESS kept under state-full control) to demonstrate the ensemble "improvements"; see the sketch after the sample outputs below.
If the initial values of .oob_score_ fall somewhere around 0.51 - 0.53, your ensemble is only 1% - 3% better than a RANDOM GUESS.
Only after you have made your ensemble-based predictor into something better may you move on to additional tricks on feature engineering et al.
aRF_PREDICTOR.oob_score_ Out[79]: 0.638801 # n_estimators = 10
aRF_PREDICTOR.oob_score_ Out[89]: 0.789612 # n_estimators = 100
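A minimal sketch of such a sweep over n_estimators with the random state held fixed; the names X, y and the grid of sizes are placeholders, not taken from the answer above:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

np.random.seed( 0 )                                  # known state for reproducible re-runs
for n in ( 10, 100, 500, 2000 ):
    aRF_PREDICTOR = RandomForestClassifier( n_estimators = n,
                                            oob_score    = True,
                                            bootstrap    = True,
                                            random_state = 0
                                            )
    aRF_PREDICTOR.fit( X, y )                        # X, y == the full training DataSET
    print( n, aRF_PREDICTOR.oob_score_ )             # .oob_score_ ought to rise as the forest grows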