I have a single NumPy array (x) and I want to cluster it in an unsupervised way with DBSCAN and hierarchical clustering using scikit-learn. Is clustering possible for single-array data? Additionally, I need to plot the clusters and show which cluster each point of the input data belongs to.
I tried
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy import stats
import scipy.cluster.hierarchy as hac
#my data
x = np.linspace(0, 500, 10000)
x = 1.5 * np.sin(x)
#dbscan
clustering = DBSCAN(eps=3).fit(x)
# this is where I run into a problem
# hierarchical
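Yes, DBSCAN can cluster "1-D" arrays, as long as the data is passed as a 2-D array of shape (n_samples, 1), e.g. via x.reshape(-1, 1). See the time-series example further down, although I don't know the significance of clustering just the waveform.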
For example,
import numpy as np
rng = np.random.default_rng(42)
x = rng.normal(loc=[-10, 0, 0, 0, 10], size=(200, 5)).reshape(-1, 1)
rng.shuffle(x)
print(x[:10])
# [[-10.54349551]
# [ -0.32626201]
# [ 0.22359555]
# [ -0.05841124]
# [ -0.11761086]
# [ -1.0824272 ]
# [ 0.43476607]
# [ 11.40382139]
# [ 0.70166365]
# [ 9.79889535]]
from sklearn.cluster import DBSCAN
dbs=DBSCAN()
clusters = dbs.fit_predict(x)
import matplotlib.pyplot as plt
plt.scatter(x,np.zeros(len(x)), c=clusters)
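The failure in the question most likely comes from passing a 1-D array: scikit-learn estimators expect a 2-D array of shape (n_samples, n_features). A minimal sketch on the sine data from the question, assuming a smaller sample count (2000) and illustrative eps and min_samples values that are not from the original post:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Sine data as in the question, with fewer samples to keep the example light
x = 1.5 * np.sin(np.linspace(0, 500, 2000))

# A 1-D array is rejected: estimators expect shape (n_samples, n_features)
try:
    DBSCAN(eps=0.1).fit(x)
    raised = False
except ValueError:
    raised = True

# Reshape to a single-feature column, then cluster
X = x.reshape(-1, 1)
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)  # -1 marks noise points
```

The labels array has one entry per input sample, so it can be passed straight to the c= argument of plt.scatter as in the example above.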
You can use AgglomerativeClustering for hierarchical clustering.
Here's an example using the data from above.
from sklearn.cluster import AgglomerativeClustering
aggC = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0, linkage="single")
clusters = aggC.fit_predict(x)
plt.scatter(x,np.zeros(len(x)), c=clusters)
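Since the question imports scipy.cluster.hierarchy as hac, the same idea can also be sketched with SciPy. This is only an illustration: the three synthetic groups, the "ward" method and the choice of three clusters are assumptions, not part of the original post.

```python
import numpy as np
import scipy.cluster.hierarchy as hac

rng = np.random.default_rng(0)
# Three well-separated 1-D groups, stacked into a single feature column
x = np.concatenate([
    rng.normal(-10, 0.5, 50),
    rng.normal(0, 0.5, 50),
    rng.normal(10, 0.5, 50),
]).reshape(-1, 1)

# Build the merge tree, then cut it so that at most three clusters remain
Z = hac.linkage(x, method="ward")
labels = hac.fcluster(Z, t=3, criterion="maxclust")
```

hac.dendrogram(Z) draws the merge tree if a visual check of the cut is wanted.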
Time Series / Waveform (no other features)
You can do it, but with no features other than time and signal amplitude, I don't know if this has any meaning.
import numpy as np
from scipy import signal
t = 2 * np.pi * np.linspace(0, 2, 200, endpoint=False)  # two periods, 200 samples
gap = np.zeros(100)
# square, triangle and sine bursts separated by flat gaps
y = np.hstack((gap, signal.square(t), gap, signal.sawtooth(t + np.pi/2, width=0.5), gap, np.sin(t), gap))
import datetime
start = datetime.datetime.fromisoformat("2022-12-01T12:00:00.000000")
times = np.array([(start+datetime.timedelta(microseconds=_)).timestamp() for _ in range(1000)])
my_sig = np.hstack((times.reshape(-1,1),y.reshape(-1,1)))
print(my_sig[:5,:])
# [[1.6698924e+09 0.0000000e+00]
# [1.6698924e+09 0.0000000e+00]
# [1.6698924e+09 0.0000000e+00]
# [1.6698924e+09 0.0000000e+00]
# [1.6698924e+09 0.0000000e+00]]
from sklearn.cluster import AgglomerativeClustering
aggC = AgglomerativeClustering(n_clusters=None, distance_threshold=4.0)
clusters = aggC.fit_predict(my_sig)
import matplotlib.pyplot as plt
plt.scatter(my_sig[:,0], my_sig[:,1], c=clusters)
Related
I have working code that is utilizing dbscan to find tight groups of sparse spatial data imported with pd.read_csv.
I am maintaining the original spatial data locations and would like to annotate the labels returned by dbscan for each data point to the original dataframe and then write a csv with the same information.
So the code below is doing exactly what I would expect it to at this point, I would just like to extend it to import the label for each row in the original dataframe.
import argparse
import string
import os, subprocess
import pathlib
import glob
import gzip
import re
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from sklearn.cluster import DBSCAN
X = pd.read_csv(tmp_csv_name)
X = X.drop('Name', axis = 1)
X = X.drop('Type', axis = 1)
X = X.drop('SomeValue', axis = 1)
# only columns 'x' and 'y' now remain
db=DBSCAN(eps=EPS, min_samples=minSamples, metric='euclidean', algorithm='auto', leaf_size=30).fit(X)
labels = db.labels_  # use the estimator fitted on the line above
unique_labels = set(labels)
# maxX , maxY are manual inputs temporarily
while sizeX > 16 or sizeY > 16:
    sizeX = sizeX * 0.8
    sizeY = sizeY * 0.8
fig, ax = plt.subplots(figsize=(sizeX,sizeY))
plt.xlim(0,maxX)
plt.ylim(0,maxY)
plt.scatter(X['x'], X['y'], c=colors, marker="o", picker=True)
# hackX , hackY are manual inputs temporarily
# which represent the boundaries defined in the original dataset
poly = patches.Polygon(xy=list(zip(hackX,hackY)), fill=False)
ax.add_patch(poly)
plt.show()
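One way to keep the labels with the original rows is to avoid overwriting the dataframe and to cluster on a selection of the coordinate columns instead. A minimal sketch, assuming a small made-up dataframe in place of the CSV and illustrative eps and min_samples values:

```python
import io
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for the CSV read in the question
df = pd.DataFrame({
    "Name": list("abcdef"),
    "Type": ["t"] * 6,
    "SomeValue": [1, 2, 3, 4, 5, 6],
    "x": [0.0, 0.1, 0.2, 5.0, 5.1, 9.0],
    "y": [0.0, 0.1, 0.0, 5.0, 5.1, 9.0],
})

# Cluster on the coordinate columns only; df itself keeps every column
db = DBSCAN(eps=0.5, min_samples=2).fit(df[["x", "y"]])

# labels_ is ordered like the input rows, so it can be assigned directly
df["cluster"] = db.labels_

buffer = io.StringIO()  # a file path such as "labelled.csv" works the same way
df.to_csv(buffer, index=False)
```

Rows that DBSCAN treats as noise get the label -1 in the new column.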
I drop all columns except the two I am interested in. When I convert the resulting two-column dataframe to a 2-D NumPy array, it comes out with object dtype and contains strings. I believe this is because Data_Value holds values such as "23.6". Is there any way to get rid of the decimal point and trailing digits in this data, given that the values all differ?
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from kneed import KneeLocator
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances_argmin
data = pd.read_csv('Alzheimer_s_Disease_and_Healthy_Aging_Data.csv', engine='python', header=None)
data.columns = ['RowId', 'YearStart', 'YearEnd', 'LocationAbbr', 'LocationDesc', 'Datasource', 'Class', 'Topic', 'Question', 'Response',
'Data_Value_Unit', 'DataValueTypeID', 'Data_Value_Type', 'Data_Value', 'Data_Value_Alt', 'Data_Value_Footnote_Symbol',
'Data_Value_Footnote', 'Low_Confidence_Limit', 'High_Confidence_Limit', 'Sample_Size', 'StratificationCategory1',
'Stratification1', 'StratificationCategory2', 'Stratification2', 'StratificationCategory3', 'Stratification3', 'Geolocation',
'ClassID', 'TopicID', 'QuestionID', 'ResponseID', 'LocationID', 'StratificationCategoryID1', 'StratificationID1',
'StratificationCategoryID2', 'StratificationID2', 'StratificationCategoryID3', 'StratificationID3', 'Report']
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 10)
data1 = data.iloc[1:]
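df = data1[data1["Data_Value_Type"].str.contains("Mean") == False]
# filter the already-filtered frame; starting again from data1 would discard the first filter
df = df[df["Data_Value"].str.contains("NaN") == False]
df = df.dropna(subset=["Data_Value"])  # dropna returns a new frame, so assign the result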
df = df.drop(columns=['RowId', 'YearStart', 'YearEnd', 'LocationAbbr', 'LocationDesc', 'Datasource', 'Class', 'Topic', 'Question', 'Response',
'Data_Value_Unit', 'DataValueTypeID', 'Data_Value_Type', 'Data_Value_Alt', 'Data_Value_Footnote_Symbol',
'Data_Value_Footnote', 'Low_Confidence_Limit', 'High_Confidence_Limit', 'Sample_Size', 'StratificationCategory1',
'Stratification1', 'StratificationCategory2', 'Stratification2', 'StratificationCategory3', 'Stratification3', 'Geolocation',
'ClassID', 'TopicID', 'QuestionID', 'ResponseID', 'StratificationCategoryID1', 'StratificationID1',
'StratificationCategoryID2', 'StratificationID2', 'StratificationCategoryID3', 'StratificationID3', 'Report'])
x = df.to_numpy()
print(x.dtype)
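Because the file is read with header=None, the header row ends up in the data and every column is parsed as strings, which is a likely reason for the object dtype. A minimal sketch of converting the remaining columns to numbers, assuming a tiny made-up frame in place of the real file:

```python
import pandas as pd

# Hypothetical frame mimicking two columns that were read in as strings
df = pd.DataFrame({
    "Data_Value": ["23.6", "7.0", "15.25"],
    "LocationID": ["1", "2", "59"],
})

# Convert every column to a numeric dtype; unparseable entries become NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

x = numeric.to_numpy()      # float array instead of object
truncated = x.astype(int)   # drops the fractional part; only safe when no NaN is present
```

For clustering it is usually preferable to keep the float values in x; the truncated version is shown only because the question asks for it.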
File "D:\Users\Watson Rockstar\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError:
Found input variables with inconsistent numbers of samples: [2883, 1236]
The dataset has 4119 rows in total; Xtrain has shape (2883, 18) and Xtest has shape (1236, 18).
I have tried LabelEncoder and OneHotEncoder to solve the problem, but that did not help:
# Ignore the warnings
import warnings
warnings.filterwarnings('always')
warnings.filterwarnings('ignore')
# data visualisation and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
import missingno as msno
#configure
# sets matplotlib to inline and displays graphs below the corressponding cell.
#import the necessary modelling algos.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
#preprocessing
from sklearn.preprocessing import LabelEncoder
telebanking = pd.read_csv('bank-additional.csv')
telebank = telebanking.drop(['duration','default'],axis =1)
def transform(feature):
    le = LabelEncoder()
    telebank[feature] = le.fit_transform(telebank[feature])
    print(le.classes_)
cat_telebank=telebank.select_dtypes(include='object')
cat_telebank.columns
for col in cat_telebank.columns:
    transform(col)
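from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
scaler=StandardScaler()
scaled_telebank=scaler.fit_transform(telebank.drop('y',axis=1))
X=scaled_telebank
Y=telebank['y'].to_numpy()  # as_matrix() was removed in pandas 1.0
Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,Y,test_size=0.3)
def compare(model):
    clf = model
    clf.fit(Xtrain,Ytrain)
    pred = clf.predict(Xtrain)
    acc.append(accuracy_score(pred,Ytest))
    prec.append(precision_score(pred,Ytest))
    rec.append(recall_score(pred,Ytest))
    auroc.append(roc_auc_score(pred,Ytest))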
acc=[]
prec=[]
rec=[]
auroc=[]
models=[RandomForestClassifier(),DecisionTreeClassifier()]
model_names=['RandomForestClassifier','DecisionTreeClassifier']
for model in models:
    compare(model)
d={'Modelling Algo':model_names,'Accuracy':acc,'Precision':prec,'Recall':rec,'Area Under ROC Curve':auroc}
met_telebank=pd.DataFrame(d)
met_telebank
It is the first warning's detail.
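The unpacking order in
Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,Y,test_size=0.3)
is already correct: train_test_split returns the splits as X_train, X_test, y_train, y_test. Changing it to
Xtrain,Ytrain,Xtest,Ytest = train_test_split(X,Y,test_size=0.3)
would bind the test features to Ytrain and the training labels to Xtest.
The error is raised inside compare(): pred = clf.predict(Xtrain) produces 2883 predictions, which are then scored against the 1236 labels in Ytest. Predict on Xtest instead so that both arrays have the same length.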
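For reference, train_test_split returns the pieces in the order X_train, X_test, y_train, y_test, and the metric functions need predictions and labels of equal length. A minimal sketch of a consistent split, fit and score, assuming synthetic data from make_classification in place of the bank dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled bank data (assumption, not the real file)
X, Y = make_classification(n_samples=400, n_features=18, random_state=0)

# Order of the returned splits: X_train, X_test, y_train, y_test
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(Xtrain, Ytrain)

pred = clf.predict(Xtest)          # predict on the test split, not on Xtrain
acc = accuracy_score(Ytest, pred)  # y_true first, then y_pred
```

With predictions made on Xtest, pred and Ytest have the same number of samples and the length check in the metrics passes.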
I want to plot the distribution of each feature in the breast cancer dataset using ggplot, but it is giving me an error.
#pip install plotnine
from plotnine import ggplot
from plotnine import *
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
for i in cancer.feature_names:
    ggplot(cancer.data)+aes(x=i)+geom_bar(size=10)
This is the error message I got:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I would recommend using seaborn for that. Here is an example of plotting the distribution of each feature in the cancer dataset, split by target:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
# loading data
cancer = load_breast_cancer()
data = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
columns= np.append(cancer['feature_names'], ['target']))
df = data.melt(['target'], var_name='cols', value_name='vals')
g = sns.FacetGrid(df, col='cols', hue="target", palette="Set1", col_wrap=4)
g = g.map(sns.histplot, "vals", kde=True, stat="density")  # histplot replaces the deprecated distplot
from plotnine import ggplot
from plotnine import *
from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()
import pandas as pd
import matplotlib.pyplot as plt
data=pd.DataFrame(cancer.data,columns=cancer.feature_names)
for i in data.columns:
    print(ggplot(data)+aes(x=i)+geom_density(size=1))
    print(ggplot(data)+aes(x=i)+geom_bar(size=10))
In sklearn 0.17.1 there was grid_scores_: a list of named tuples (https://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV)
In sklearn 0.21.2 it has been replaced with cv_results_: a dict of numpy (masked) ndarrays (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
Previously, with sklearn 0.17.1, I was able to plot all grid parameters on a single plot using grid_scores_, but now I am unable to aggregate the values obtained from cv_results_ because there is no "mean_validation_score" in the newer version.
I have existing code that plotted the score for every parameter in sklearn 0.17.1 (https://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV), where grid_scores_ was used, and it plotted all the values perfectly on one plot.
In the newer version of sklearn, grid_scores_ has been replaced with cv_results_. I have tried to append all the values, because I want to plot all the parameters on one plot, but currently I am unable to add the correct values to plot on the graph.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from sklearn.metrics import accuracy_score
import sklearn
import itertools
from pandas.plotting import scatter_matrix
import os
import datetime as dt
from operator import itemgetter
from itertools import chain
import graphviz
from sklearn.metrics import precision_recall_fscore_support
import scikitplot as skplt
X_train = np.random.randint(0, 2, size=[500, 5000])  # upper bound is exclusive; (0, 1) would yield only zeros
y_train = np.random.randint(0, 2, size=500)
print(X_train.shape, y_train.shape)
# (500, 5000) (500,)
#grid_search = GridSearchCV(clf, param_grid, cv=3) # 10 fold cross validation
### hyperparameter estimator
param_grid = {"criterion": ["gini", "entropy"],
"splitter": ["best", "random"],
"max_depth": np.arange(1,9,7),
"min_samples_split": np.arange(2,150,90),
"min_samples_leaf": np.arange(1,60,45),
"min_weight_fraction_leaf": np.arange(0.1,0.4, 0.3),
"max_features": [1000, 500, 5000],
"max_leaf_nodes": np.arange(2,60,45),
"min_impurity_decrease": [0.0, 0.5],
}
def evaluate_param(parameter, param_range, index):
    grid_search = GridSearchCV(clf, param_grid = {parameter: param_range}, cv=3) # 3 fold cross validation
    grid_search.fit(X_train, y_train) ### grid_search.fit(X_train[features], y_train)
    df = {}
    #for i, score in enumerate(grid_search.grid_scores_): # previously used methods
    for i, score in enumerate(grid_search.cv_results_["params"]):
        ## How do we save the correct values here for plotting
        df[parameter] = grid_search.cv_results_["params"][i][parameter]
        #df[parameter].update(grid_search.cv_results_["params"][i][parameter])
        #print("df : ", df)
        #df[parameter].append(grid_search.cv_results_["params"][i][parameter])
        #print("df : ", df) # the values are not appended to the keys
    df = pd.DataFrame.from_dict(df, orient='index')
    df.reset_index(level=0, inplace=True)
    df = df.sort_values(by='index')
    plt.subplot(5,2,index) # Change here according to the number of parameters
    plt.xlabel(parameter, color = "red")
    plt.ylabel("GridSearchCV Score", color= "blue")
    plot = plt.plot(df['index'], df[0])
    plt.title(parameter.capitalize(), color = "red")
    plt.savefig('DT_GridSearchCV_Score_Hyperparameter.png')
    return plot, df
clf = tree.DecisionTreeClassifier(random_state=99) # verbose=True, n_jobs=-1 :: Dt does not support it
### hyperparameter estimator
index = 1
plt.figure(figsize=(30,30))
for parameter, param_range in param_grid.items():
    evaluate_param(parameter, param_range, index) ## 120 features
    index += 1
This image is not filled as there is no "mean_validation_score" which can be filled for each subplot now:
https://ibb.co/Z6jwnMr
## Keys() gives the list of keys that gridsearchcv has:
grid_search.cv_results_.keys()
# output
# dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_criterion', 'param_max_depth', 'param_max_features', 'param_max_leaf_nodes', 'param_min_impurity_decrease', 'param_min_samples_leaf', 'param_min_samples_split', 'param_min_weight_fraction_leaf', 'param_splitter', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'mean_train_score', 'std_train_score'])
grid_search.best_estimator_
# output
# DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
#max_features=1000, max_leaf_nodes=2, min_impurity_decrease=0.0,
#min_impurity_split=None, min_samples_leaf=1,
#min_samples_split=2, min_weight_fraction_leaf=0.1,
#presort=False, random_state=99, splitter='best')
Expected Result (should be filled): https://ibb.co/Z6jwnMr
However, each subplot on the plot should have a curve depicting the best value for the parameter. The keys do not include a "mean_validation_score" for plotting the actual test score; it was there in sklearn 0.17.1 but is not in sklearn 0.20.2.
Kindly let me know if there is still a way to plot all test scores on subplots of a single plot. Thanks in advance!!
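In cv_results_, the per-candidate average that grid_scores_ exposed as mean_validation_score is available under the key "mean_test_score", in the same order as the "params" list. A minimal sketch of one subplot per parameter, assuming synthetic data, a reduced two-parameter grid and a simplified evaluate_param signature (an axis is passed in instead of a subplot index); none of these choices come from the original code:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"max_depth": [1, 2, 4, 8], "min_samples_leaf": [1, 5, 20]}
clf = DecisionTreeClassifier(random_state=99)

def evaluate_param(parameter, param_range, ax):
    """Grid-search a single parameter and plot its mean CV test score."""
    search = GridSearchCV(clf, param_grid={parameter: param_range}, cv=3)
    search.fit(X, y)
    # "params" and "mean_test_score" are aligned entry by entry
    results = pd.DataFrame({
        parameter: [p[parameter] for p in search.cv_results_["params"]],
        "mean_test_score": search.cv_results_["mean_test_score"],
    }).sort_values(parameter)
    ax.plot(results[parameter], results["mean_test_score"], marker="o")
    ax.set_xlabel(parameter)
    ax.set_ylabel("mean_test_score")
    return results

fig, axes = plt.subplots(1, len(param_grid), figsize=(10, 4))
scores = {name: evaluate_param(name, values, ax)
          for (name, values), ax in zip(param_grid.items(), axes)}
fig.tight_layout()
```

fig.savefig(...) can then write all subplots to a single image, as the original code did with plt.savefig.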