KMeans clustering - Value error: n_samples=1 should be >= n_cluster - python-3.x

I am doing an experiment with three time-series datasets with different characteristics for my experiment whose format is as the following.
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
The first column is a timestamp. For reproducibility reasons, I am sharing the data here. From column 2, I wanted to read the current row and compare it with the value of the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I want to divide the current value (smaller) by the previous value (larger). Accordingly, here is the code:
import numpy as np
import matplotlib.pyplot as plt
protocols = {}
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
for protname, fname in types.items():
col_time,col_window = np.loadtxt(fname,delimiter=',').T
trailing_window = col_window[:-1] # "past" values at a given index
leading_window = col_window[1:] # "current values at a given index
decreasing_inds = np.where(leading_window < trailing_window)[0]
quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
quotient_times = col_time[decreasing_inds]
protocols[protname] = {
"col_time": col_time,
"col_window": col_window,
"quotient_times": quotient_times,
"quotient": quotient,
}
plt.figure(); plt.clf()
plt.plot(quotient_times,quotient, ".", label=protname, color="blue")
plt.ylim(0, 1.0001)
plt.title(protname)
plt.xlabel("time")
plt.ylabel("quotient")
plt.legend()
plt.show()
And this produces the following three points - one for each dataset I shared.
As we can see from the points in the plots based on the code given above, data1 is pretty consistent whose value is around 1, data2 will have two quotients (whose values will concentrate either around 0.5 or 0.8) and the values of data3 are concentrated around two values (either around 0.5 or 0.7). This way, given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to by building each dataset stacking these two transformed features quotient and quotient_times. I am trying it with KMeans clustering as the following
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
But this is giving me an error: ValueError: n_samples=1 should be >= n_clusters=3. How can we fix this error?
Update: samlpe quotient data = array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129,
0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 ,
0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581,
0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])

As is, your quotient variable is now one single sample; here I get a different error message, probably due to different Python/scikit-learn version, but the essence is the same:
import numpy as np
quotient = np.array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
quotient.shape
# (20,)
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
This gives the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[0.7 0.7 0.4973262 0.7008547 0.71287129 0.704
0.49723757 0.49723757 0.70676692 0.5 0.5 0.70754717
0.5 0.49723757 0.70322581 0.5 0.49723757 0.49723757
0.5 0.49723757].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
which, despite the different wording, is not different from yours - essentially it says that your data look like a single sample.
Following the first advice(i.e. considering that quotient contains a single feature (column) resolves the issue:
k_means.fit(quotient.reshape(-1,1))
# result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)

Please try the code below. A brief explanation on what I've done:
First I built the dataset sample = np.vstack((quotient_times, quotient)).T and standardized it, so it would become easier to cluster. Following, I've applied DBScan with multiple hyperparameters (eps and min_samples) until I've found the one that separated the points better. Finally, I've plotted the data with its respective labels, since you are working with 2 dimensional data, it's easy to visualize how good the clustering is.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
dataset = np.empty((0, 2))
for protname, fname in types.items():
col_time,col_window = np.loadtxt(fname,delimiter=',').T
trailing_window = col_window[:-1] # "past" values at a given index
leading_window = col_window[1:] # "current values at a given index
decreasing_inds = np.where(leading_window < trailing_window)[0]
quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
quotient_times = col_time[decreasing_inds]
sample = np.vstack((quotient_times, quotient)).T
dataset = np.append(dataset, sample, axis=0)
scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)
k_means = DBSCAN(eps=0.6, min_samples=1)
k_means.fit(dataset)
colors = [i for i in k_means.labels_]
plt.figure();
plt.title('Dataset 1,2,3')
plt.xlabel("time")
plt.ylabel("quotient")
plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)
plt.legend()
plt.show()

You are trying to make 3 clusters, while you have only 1 np.array i.e n_samples.
Try increasing the no. of arrays.
Decreasing no. of clusters.
Reshaping the array (not sure)

Related

How to statistically compare the intercept and slope of two different linear regression models in Python?

I have two series of data as below. I want to create an OLS linear regression model for df1 and another OLS linear regression model for df2. And then statistically test if the y-intercepts of these two linear regression models are statistically different (p<0.05), and also test if the slopes of these two linear regression models are statistically different (p<0.05). I did the following
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
np.inf == float('inf')
data1 = [1, 3, 45, 0, 25, 13, 43]
data2 = [1, 1, 1, 1, 1, 1, 1]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
fig, ax = plt.subplots()
df1.plot(figsize=(20, 10), linewidth=5, fontsize=18, ax=ax, kind='line')
df2.plot(figsize=(20, 10), linewidth=5, fontsize=18, ax=ax, kind='line')
plt.show()
model1 = sm.OLS(df1, df1.index)
model2 = sm.OLS(df2, df2.index)
results1 = model1.fit()
results2 = model2.fit()
print(results1.summary())
print(results2.summary())
Results #1
OLS Regression Results
=======================================================================================
Dep. Variable: 0 R-squared (uncentered): 0.625
Model: OLS Adj. R-squared (uncentered): 0.563
Method: Least Squares F-statistic: 10.02
Date: Mon, 01 Mar 2021 Prob (F-statistic): 0.0194
Time: 20:34:34 Log-Likelihood: -29.262
No. Observations: 7 AIC: 60.52
Df Residuals: 6 BIC: 60.47
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 5.6703 1.791 3.165 0.019 1.287 10.054
==============================================================================
Omnibus: nan Durbin-Watson: 2.956
Prob(Omnibus): nan Jarque-Bera (JB): 0.769
Skew: 0.811 Prob(JB): 0.681
Kurtosis: 2.943 Cond. No. 1.00
==============================================================================
Results #2
OLS Regression Results
=======================================================================================
Dep. Variable: 0 R-squared (uncentered): 0.692
Model: OLS Adj. R-squared (uncentered): 0.641
Method: Least Squares F-statistic: 13.50
Date: Mon, 01 Mar 2021 Prob (F-statistic): 0.0104
Time: 20:39:14 Log-Likelihood: -5.8073
No. Observations: 7 AIC: 13.61
Df Residuals: 6 BIC: 13.56
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.2308 0.063 3.674 0.010 0.077 0.384
==============================================================================
Omnibus: nan Durbin-Watson: 0.148
Prob(Omnibus): nan Jarque-Bera (JB): 0.456
Skew: 0.000 Prob(JB): 0.796
Kurtosis: 1.750 Cond. No. 1.00
==============================================================================
This is as far I have got, but I think something is wrong. Neither of these regression outcome seems to show the y-intercept. Also, I expect the coef in results #2 to be 0 since I expect the slope to be 0 when all the values are 1, but the result shows 0.2308. Any suggestions or guiding material will be greatly appreciated.
In statsmodels an OLS model does not fit an intercept by default (see the docs).
exog array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
The documentation on the exog argument of the OLS constructor suggests using this feature of the tools module in order to add an intercept to the data.
To perform a hypothesis test on the values of the coefficients this question provides some guidance. This unfortunately only works if the variances of the residual errors is the same.
We can start by looking at whether the residuals of each distribution have the same variance (using Levine's test) and ignore coefficients of the regression model for now.
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.tools import add_constant
from statsmodels.formula.api import ols ## use formula api to make the tests easier
np.inf == float('inf')
data1 = [1, 3, 45, 0, 25, 13, 43]
data2 = [1, 1, 1, 1, 1, 1, 1]
df1 = add_constant(pd.DataFrame(data1)) ## add a constant column so we fit an intercept
df1 = df1.reset_index() ## just doing this to make the index a column of the data frame
df1 = df1.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df2 = add_constant(pd.DataFrame(data2)) ## this does nothing because the y column is already a constant
df2 = df2.reset_index()
df2 = df2.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
formula1 = 'y ~ x + const' ## define formulae
formula2 = 'y ~ x'
model1 = ols(formula1, df1).fit()
model2 = ols(formula2, df2).fit()
print(levene(model1.resid, model2.resid))
The output of the levene test looks like this:
LeveneResult(statistic=7.317386741297884, pvalue=0.019129208414097015)
So we can reject the null hypothesis that the residual distributions have the same variance at alpha=0.05.
There is no point to testing the linear regression coefficients now because the residuals don't have don't have the same distributions. It is important to remember that in a regression problem it doesn't make sense to compare the regression coefficients independent of the data they are fit on. The distribution of the regression coefficients depends on the distribution of the data.
Lets see what happens when we try the proposed test anyways. Combining the instructions above with this method from the OLS package yields the following code:
## stack the data and addd the indicator variable as described in:
## stackexchange question:
df1['c'] = 1 ## add indicator variable that tags the first groups of points
df_all = df1.append(df2, ignore_index=True).drop('const', axis=1)
df_all = df_all.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df_all = df_all.fillna(0) ## a bunch of the values are missing in the indicator columns after stacking
df_all['int'] = df_all['x'] * df_all['c'] # construct the interaction column
print(df_all) ## look a the data
formula = 'y ~ x + c + int' ## define the linear model using the formula api
result = ols(formula, df_all).fit()
hypotheses = '(c = 0), (int = 0)'
f_test = result.f_test(hypotheses)
print(f_test)
The result of the f-test looks like this:
<F test: F=array([[4.01995453]]), p=0.05233934453138028, df_denom=10, df_num=2>
The result of the f-test means that we just barely fail to reject any of the null hypotheses specified in the hypotheses variable namely that the coefficient of the indicator variable 'c' and interaction term 'int' are zero.
From this example it is clear that the f test on the regression coefficients is not very powerful if the residuals do not have the same variance.
Note that the given example has so few points it is hard for the statistical tests to clearly distinguish the two cases even though to the human eye they are very different. This is because even though the statistical tests are designed to make few assumptions about the data but those assumption get better the more data you have. When testing statistical methods to see if they accord with your expectations it is often best to start by constructing large samples with little noise and then see how well the methods work as your data sets get smaller and noisier.
For the sake of completeness I will construct an example where the Levene test will fail to distinguish the two regression models but f test will succeed to do so. The idea is to compare the regression of a noisy data set with its reverse. The distribution of residual errors will be the same but the relationship between the variables will be very different. Note that this would not work reversing the noisy dataset given in the previous example because the data is so noisy the f test cannot distinguish between the positive and negative slope.
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.tools import add_constant
from statsmodels.formula.api import ols ## use formula api to make the tests easier
n_samples = 6
noise = np.random.randn(n_samples) * 5
data1 = np.linspace(0, 30, n_samples) + noise
data2 = data1[::-1] ## reverse the time series
df1 = add_constant(pd.DataFrame(data1)) ## add a constant column so we fit an intercept
df1 = df1.reset_index() ## just doing this to make the index a column of the data frame
df1 = df1.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df2 = add_constant(pd.DataFrame(data2)) ## this does nothing because the y column is already a constant
df2 = df2.reset_index()
df2 = df2.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
formula1 = 'y ~ x + const' ## define formulae
formula2 = 'y ~ x'
model1 = ols(formula1, df1).fit()
model2 = ols(formula2, df2).fit()
print(levene(model1.resid, model2.resid))
## stack the data and addd the indicator variable as described in:
## stackexchange question:
df1['c'] = 1 ## add indicator variable that tags the first groups of points
df_all = df1.append(df2, ignore_index=True).drop('const', axis=1)
df_all = df_all.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df_all = df_all.fillna(0) ## a bunch of the values are missing in the indicator columns after stacking
df_all['int'] = df_all['x'] * df_all['c'] # construct the interaction column
print(df_all) ## look a the data
formula = 'y ~ x + c + int' ## define the linear model using the formula api
result = ols(formula, df_all).fit()
hypotheses = '(c = 0), (int = 0)'
f_test = result.f_test(hypotheses)
print(f_test)
The result of Levene test and the f test follow:
LeveneResult(statistic=5.451203655948632e-31, pvalue=1.0)
<F test: F=array([[10.62788052]]), p=0.005591319998324387, df_denom=8, df_num=2>
A final note since we are doing multiple comparisons on this data and stopping if we get a significant result, i.e. if the Levene test rejects the null we quit, if it doesn't then we do the f test, this is a stepwise hypothesis test and we are actually inflating our false positive error rate. We should correct our p-values for multiple comparisons before we report our results. Note that the f test is already doing this for the hypotheses we test about the regression coefficients. I am a bit fuzzy on the underlying assumptions of these testing procedures so I am not 100% sure that you are better off making the following correction but keep it in mind in case you feel you are getting false positives too often.
from statsmodels.sandbox.stats.multicomp import multipletests
print(multipletests([1, .005591], .05)) ## correct out pvalues given that we did two comparisons
The output looks like this:
(array([False, True]), array([1. , 0.01115074]), 0.025320565519103666, 0.025)
This means we rejected the second null hypothesis under the correction and that the corrected p-values looks like [1., 0.011150]. The last two values are corrections to your significance level under two different correction methods.
I hope this helps anyone trying to do this type of work. If anyone has anything to add I would welcome comments. This isn't my area of expertise so I could be making some mistakes.

Why does kmeans give exactly the same results everytime?

I have re-run kmeans 4 times and get
From other answers, I got that
Everytime K-Means initializes the centroid, it is generated randomly.
Could you please explain why the results are exactly the same each time?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
for row in ax:
for col in row:
kmeans = KMeans(n_clusters = 4)
kmeans.fit(don)
y_kmeans = kmeans.predict(don)
col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
centers = kmeans.cluster_centers_
col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);
plt.show()
They are not the same. They are similar. K-means is an algorithm that is in a way moving centroids iteratively so that they become better and better at splitting data and while this process is deterministic, you have to pick initial values for those centroids and this is usually done at random. Random start, doesn't mean that final centroids will be random. They will converge to something relatively good and often similar.
Have a look at your code with this simple modification:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
cc = []
for row in ax:
for col in row:
kmeans = KMeans(n_clusters = 4)
kmeans.fit(don)
cc.append(kmeans.cluster_centers_)
y_kmeans = kmeans.predict(don)
col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
centers = kmeans.cluster_centers_
col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);
plt.show()
cc
if you have a look at exact values of those centroids, they will look like that:
[array([[ 4.97975722, 4.93316461],
[ 5.21715504, -0.18757547],
[ 0.31141141, 0.06726803],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162],
[ 4.97975722, 4.93316461]]),
array([[ 0.30066361, 0.06804847],
[ 4.97975722, 4.93316461],
[ 5.21017831, -0.18735444],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 4.97975722, 4.93316461],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162]])]
Similar, but different sets of values.
Also:
Have a look at default arguments to KMeans. There is one called n_init:
Number of time the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
By default it is equal to 10. Which means every time you run k-means it actually run 10 times and picked the best result. Those best results will be even more similar, than results of a single run of k-means.
I post #AEF's comment to remove this question from unanswered list.
Random initialziation does not necessarily mean random result. Easiest example: k-means with k=1 always finds the mean in one step, regardless of where the center is initialised.
Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.
The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means). random_state’s value may be:
for reference :
https://scikit-learn.org/stable/glossary.html#term-random_state

How to bin a netcdf data using xarray

I have some spatiotemporal data derived from the CHIRPS Database. It is a NetCDF that contains daily precipitation for all over the world with a spatial resolution of 1x1km2. The DataSet possesses 3 dimensions ('time', 'longitude', 'latitude').
I would like to bin this precipitation data according to each pixel's coordinate ('latitude' & 'longitude') temporal distribution. Therefore, the dimension I wish to apply the binnarization is the 'time' domain.
This is a similar question already discussed in StackOverflow (see in here). The difference between their Issue and mine is that, in my case, I need to binnarize the data according to each specific pixel's temporal distribution, instead of applying a single range of values for binnarization for all my coordinates (pixels). As a consequence, I expect to have different binning thresholds ('n' sets of thresholds), one for each of the 'n' pixels in my dataset.
As far as I understand, the simplest and fastest way to apply a function over each of the coordinates (pixels) of a Xarray's DataArray/DataSet is to use the xarray.apply_ufunc.
For the binnarization, I am using the pandas qcut method, which only requires an array of values and some given relative frequency (i.e.: [0.1%, 0.5%, 25%, 99%]) in order for it to work.
Since pandas binning function requires an array of data, and it also returns another array of binnarized data, I understand that I have to use the argument "vectorize"=True in the U_function (described in here).
Finally, when I run the analysis, The resulted Xarray DataSet ends up losing the 'time' dimension after the processing. Also, I get unsure whether that processing truly returned an Xarray DataSet with data properly classified.
Here is a reproducible snippet code. Notice that the 'time' dimension of the "ds_binned" is lost. Therefore, I have to later insert the binned data back to the original xarray dataset (ds). Also notice that the dimensions are not set in proper order. That also is causing problems for my analysis.
import pandas as pd
pd.set_option('display.width', 50000)
pd.set_option('display.max_rows', 50000)
pd.set_option('display.max_columns', 5000)
import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar
ds = xr.tutorial.open_dataset('rasm').load()
def parse_datetime(time):
return pd.to_datetime([str(x) for x in time])
ds.coords['time'] = parse_datetime(ds.coords['time'].values)
def binning_function(x, distribution_type='Positive', b=False):
y = np.where(np.abs(x)==np.inf, 0, x)
y = np.where(np.isnan(y), 0, y)
if np.all(y) == 0:
return x
else:
Classified = pd.qcut(y, np.linspace(0.01, 1, 10))
return Classified.codes
def xarray_parse_extremes(ds, dim=['time'], dask='allowed', new_dim_name=['classes'], kwargs={'b': False, 'distribution_type':'Positive'}):
filtered = xr.apply_ufunc(binning_function,
ds,
dask=dask,
vectorize=True,
input_core_dims=[dim],
#exclude_dims = [dim],
output_core_dims=[new_dim_name],
kwargs=kwargs,
output_dtypes=[float],
join='outer',
dataset_fill_value=np.nan,
).compute()
return filtered
with ProgressBar():
da_binned = xarray_parse_extremes(ds['Tair'] ,
['time'],
dask='allowed')
da_binned.name = 'classes'
ds_binned = da_binned.to_dataset()
ds['classes'] = (('y', 'x', 'time'), ds_binned['classes'].values)
mask = (ds['classes'] >= 5) & (ds['classes'] != 0)
ds.where(mask, drop=True).resample({'time':'Y'}).count('time')['Tair'].isel({'time':-1}).plot()
print(ds)
(ds.where(mask, drop=True).resample({'time':'Y'}).count('time')['Tair']
.to_dataframe().dropna().sort_values('Tair', ascending=False)
)
delayed_to_netcdf = ds.to_netcdf(r'F:\Philipe\temp\teste_tutorial.nc',
engine='netcdf4',
compute =False)
print('saving data classified')
with ProgressBar():
delayed_to_netcdf.compute()

plot_decision_regions with error "Filler values must be provided when X has more than 2 training features."

I am plotting 2D plot for SVC Bernoulli output.
converted to vectors from Avg word2vec and standerdised data
split data to train and test.
Through grid search found the best C and gamma(rbf)
clf = SVC(C=100,gamma=0.0001)
clf.fit(X_train1,y_train)
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X_train, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
Receive error :-
ValueError: y must be a NumPy array. Found
also tried to convert the y to numpy. Then it prompts error
ValueError: y must be an integer array. Found object. Try passing the array as y.astype(np.integer)
finally i converted it to integer array.
Now it is prompting of error.
ValueError: Filler values must be provided when X has more than 2 training features.
You can use PCA to reduce your data multi-dimensional data to two dimensional data. Then pass the obtained result in plot_decision_region and there will be no need of filler values.
from sklearn.decomposition import PCA
from mlxtend.plotting import plot_decision_regions
clf = SVC(C=100,gamma=0.0001)
pca = PCA(n_components = 2)
X_train2 = pca.fit_transform(X_train)
clf.fit(X_train2, y_train)
plot_decision_regions(X_train2, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
I've spent some time with this too as plot_decision_regions was then complaining ValueError: Column(s) [2] need to be accounted for in either feature_index or filler_feature_values and there's one more parameter needed to avoid this.
So, say, you have 4 features and they come unnamed:
X_train_std.shape[1] = 4
We can refer to each feature by their index 0, 1, 2, 3. You only can plot 2 features at a time, say you want 0 and 2.
You'll need to specify one additional parameter (to those specified in #sos.cott's answer), feature_index, and fill the rest with fillers:
value=1.5
width=0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
feature_index=[0,2], #these one will be plotted
filler_feature_values={1: value, 3:value}, #these will be ignored
filler_feature_ranges={1: width, 3: width})
You can just do (Assuming X_train and y_train are still panda dataframes) for the numpy array problem.
plot_decision_regions(X_train.values, y_train.values, clf=clf, legend=2)
For the filler_feature issue, you have to specify the number of features so you do the following:
value=1.5
width=0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
filler_feature_values={2: value, 3:value, 4:value},
filler_feature_ranges={2: width, 3: width, 4:width},
legend=2, ax=ax)
You need to add one filler feature for each feature you have.

Python Shape function for k-means clustering

I have one geotiff grey scale image which gave me the (4377, 6172) 2D array. In the first part, I am considering (:1024, :1024) values(Total values are -> 1024 * 1024 = 1048576) for my compression algorithm. Through this algorithm, I am getting total 4 values in finalmatrix list var through the algorithm. After this, I am applying K-means algorithm on that values. A program is below :
import numpy as np
from osgeo import gdal
from sklearn import cluster
import matplotlib.pyplot as plt
dataset =gdal.Open("1.tif")
band = dataset.GetRasterBand(1)
img = band.ReadAsArray()
finalmat = [255, 0, 2, 2]
#Converting list to array for dimensional change
ay = np.asarray(finalmat).reshape(-1,1)
fig = plt.figure()
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(ay)
cluster_means = k_means.cluster_centers_.squeeze()
a_clustered = k_means.labels_
print('# of observation :',ay.shape)
print('Cluster Means : ', cluster_means)
a_clustered.shape= img.shape
fig=plt.figure(figsize=(125,125))
ax = plt.subplot(2,4,8)
plt.axis('off')
xlabel = str(1) , ' clusters'
ax.set_title(xlabel)
plt.imshow(a_clustered)
plt.show()
fig.savefig('kmeans-1 clust ndvi08jan2010_guj 12 .png')
In the above Program I am getting error in the line a_clustered.shape= img.shape. The error which I am getting is below:
Error line:
a_clustered.shape= img.shape
ValueError: cannot reshape array of size 4 into shape (4377,6172)
<matplotlib.figure.Figure at 0x7fb7c63975c0>
Actually, I want to visualize the clustering on Original image through compressed value which I am getting. Can you please give suggestion what to do
It does not make a lot of sense to use KMeans on 1 dimensional data.
And it makes even less sense to use it on a 4 x 1 array!
Your site then comes from the fact that you can't just resize a 4 x 1 integer array into a large picture.
Just print the array a_clustered you are trying to plot. It probably contains [0, 1, 1, 1].

Resources