Sklearn.mixture.dpgmm not functioning correctly - scikit-learn

I'm having trouble with sklearn.mixture.dpgmm. The main issue is that it is not returning correct covariances for synthetic data (2 separated 2D gaussians), where it really should have no issue. In particular, when I do dpgmm._get_covars(), the covariance matrices have diagonal elements that are always exactly 1.0 too large, regardless of the input data distributions. This seems like a bug, as gmm works perfectly (when limiting to known exact number of groups)
Another issue is that dpgmm.weights_ makes no sense, they sum to one but the values appear meaningless.
Does anyone have a solution to this or see something clearly wrong with my example?
Here is the exact script I'm running:
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
import pdb
from sklearn import mixture
# Generate 2D random sample, two gaussians each with 10000 points
rsamp1 = np.random.multivariate_normal(np.array([5.0,5.0]),np.array([[1.0,-0.2],[-0.2,1.0]]),10000)
rsamp2 = np.random.multivariate_normal(np.array([0.0,0.0]),np.array([[0.2,-0.0],[-0.0,3.0]]),10000)
X = np.concatenate((rsamp1,rsamp2),axis=0)
# Fit a mixture of Gaussians with EM using 2
gmm = mixture.GMM(n_components=2, covariance_type='full',n_iter=10000)
gmm.fit(X)
# Fit a Dirichlet process mixture of Gaussians using 10 components
dpgmm = mixture.DPGMM(n_components=10, covariance_type='full',min_covar=0.5,tol=0.00001,n_iter = 1000000)
dpgmm.fit(X)
print("Groups With data in them")
print(np.unique(dpgmm.predict(X)))
##print the input and output covars as example, should be very similar
correct_c0 = np.array([[1.0,-0.2],[-0.2,1.0]])
print "Input covar"
print correct_c0
covars = dpgmm._get_covars()
c0 = np.round(covars[0],decimals=1)
print "Output Covar"
print c0
print("Output Variances Too Big by 1.0")

According to the dpgmm docs this Class is Deprecated in version 0.18 and will be removed in version 0.20
You should use BayesianGaussianMixture Class instead, with parameter weight_concentration_prior_type set with option "dirichlet_process"
Hope it helps

instead of writing
from sklearn.mixture import GMM
gmm = GMM(2, covariance_type='full', random_state=0)
you should write:
from sklearn.mixture import BayesianGaussianMixture
gmm = BayesianGaussianMixture(2, covariance_type='full', random_state=0)

Related

A function to insert data in dataset using python

I create a program that predict digits from in a dataset. I want when it predict data their should be two cases if it predict right then data should added automatically in dataset otherwise it takes right answer throw user and insert to dataset.
code
import numpy as np
import pandas as pd
import matplotlib.pyplot as pt
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("train.csv").values
clf = DecisionTreeClassifier()
xtrain = data[0:21000,1:]
train_label=data[0:21000,0]
clf.fit(xtrain,train_label)
xtest = data[21000: ,1:]
actual_label=data[21000:,0]
d = xtest[9]
d.shape = (28,28)
pt.imshow(d,cmap='gray')
print(clf.predict([xtest[9]]))
pt.show()
I'm not sure I'm following your question, but if you want to distinguish between good and wrong predictions and take different ways, you should specific do that.
predictions = clf.predict(xtest)
good_predictions = xtest[pd.Series(predictions == actual_label)]
bad_predictions = xtest[pd.Series(predictions != actual_label)]
So, in good_predictions will be all the rows in xtest that where predicted right.

randomly sample a vector-ARMA model

I intend to randomly sample a VARMA model but I cannot seem to see a function in statsmodels for this, I studied the example on the ARMA and can replicate this successfully for a 1 variable.
# for the ARMA
import numpy as np
from statsmodels.tsa.arima_model import ARMA
import statsmodels.api as sm
arparams=np.array([.9,-.7])
maparams=np.array([.5,.8])
ar=np.r_[1,-arparams]
ma=np.r_[1,maparams]
obs=10000
sigma=1
# for the VARMA
import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX
# generate a a 2-D correlated normal series
mean = [0,0]
cov = [[1,0.9],[0.9,1]]
data = np.random.multivariate_normal(mean,cov,100)
# fit the data into a VARMA model
model = VARMAX(data, order=(1,1)).fit()
`enter code here`
# I cant seem to find a way to randomly sample the VARMA
Results objects from fitting a VARMAX model have a simulate method which can be used to generate a random sample. For example:
mod = VARMAX(data, order=(1,1))
res = mod.fit()
# to generate a time series of length 100 following the VARMAX process described by `res`:
sample = res.simulate(100)
This is true of any state space model, including SARIMAX, UnobservedComponents, VARMAX, and DynamicFactor.
(Also, the model class has a simulate method. The main difference is that since model objects don't have associated parameter values, you need to pass a particular parameter vector in that case).

Effects of channel_shift_range in ImageDataGenerator (Keras image augmentation)

Maybe I'm misunderstanding. If I implement channel_shift_range in my ImageDataGenerator, the output should have "scrambled" color values, right? I would like to use it to make my model more robust to variance in color.
However, when I test it, I'm not seeing any effect. Am I using it wrong? Here's my code:
from keras.preprocessing.image import ImageDataGenerator
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
path = '/mnt/Project/Imaging/samples'
datagen = ImageDataGenerator(channel_shift_range=0.9)
genObject = datagen.flow_from_directory(path,
batch_size=1)
augs = []
i = 0
for batch in genObject:
augs.append(batch)
i += 1
if i > 10:
break
for item in augs:
plt.imshow(item[0][0].astype('uint8'))
plt.show()
Environment:
Jupyter Lab
Python 3.6.6
Keras==2.2.4
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
tensorboard==1.9.0
tensorflow-gpu==1.9.0
Thanks in advance for the help!
The values of the image data are in the range [0..255], so a shift of 0.. 0.9 is hardly visible. Try a much large shift to see any effect.
Note that using rescale=1./255. does not help as the rescaling is applied after the transformation.

Unstable behavior of OneClassSVM by changing 'nu'

In the example above, I'm using my dataset to identify outliers. After making slight changes to the nu parameter, there is a huge difference in the number of anomalies identified.
Could this be just a particularity of the dataset? Or a bug in scikit-learn?
P.S. Unfortunately I cannot share the dataset.
If you decrease the value of the tol parameter of the OneClassSVM the result is better although not completely as expected for low values of nu.
import numpy as np
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
X = np.random.rand(100, 1)
nus = np.geomspace(0.0001, 0.5, num=100)
outlier_fraction = np.zeros(len(nus))
for i, nu in enumerate(nus):
outlier_fraction[i] = (OneClassSVM(nu=nu, tol=1e-12).fit_predict(X) == -1).mean()
plt.plot(nus, outlier_fraction)
plt.xlabel('nu')
plt.ylabel('Outlier fraction')
plt.show()
With the default tol you obtain the following
NOTE: not an answer. Offering a MCVE.
I also recently came across this. I would like to understand the inflection point at the low values
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM
X = np.random.rand(100, 1)
nu = np.geomspace(0.0001, 1, num=100)
df = pd.DataFrame(data={'nu': nu})
for i in range(0, len(X)):
df.loc[i, 'anom_count'] = (OneClassSVM(nu=df.loc[i, 'nu']).fit_predict(X) == -1).sum()
df.set_index('nu').plot();
df.set_index('nu').plot(xlim=(0, 0.2));
df.anom_count.min() # 3
df.anom_count.idxmin() # 62
df.loc[df.anom_count.idxmin(), 'nu'] # 0.031

How to visualize an sklearn GradientBoostingClassifier?

I've trained a gradient boost classifier, and I would like to visualize it using the graphviz_exporter tool shown here.
When I try it I get:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'tree_'
this is because the graphviz_exporter is meant for decision trees, but I guess there's still a way to visualize it, since the gradient boost classifier must have an underlying decision tree.
Does anybody know how to do that?
The attribute estimators contains the underlying decision trees. The following code displays one of the trees of a trained GradientBoostingClassifier. Notice that although the ensemble is a classifier as a whole, each individual tree computes floating point values.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_graphviz
import numpy as np
# Ficticuous data
np.random.seed(0)
X = np.random.normal(0,1,(1000, 3))
y = X[:,0]+X[:,1]*X[:,2] > 0
# Classifier
clf = GradientBoostingClassifier(max_depth=3, random_state=0)
clf.fit(X[:600], y[:600])
# Get the tree number 42
sub_tree_42 = clf.estimators_[42, 0]
# Visualization
# Install graphviz: https://www.graphviz.org/download/
from pydotplus import graph_from_dot_data
from IPython.display import Image
dot_data = export_graphviz(
sub_tree_42,
out_file=None, filled=True, rounded=True,
special_characters=True,
proportion=False, impurity=False, # enable them if you want
)
graph = graph_from_dot_data(dot_data)
Image(graph.create_png())
Tree number 42:

Resources