Why does the kernel restart when I try sklearn PCA? - scikit-learn

I use IPython Notebook, and when I run the following code:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
I receive a notice that the kernel has died and has restarted. What is going on?
Also my data is in this format:
array([[ 0.00000000e+00, 3.13000000e+02, 3.10000000e+02, ...,
         9.00000000e+00, 6.00000000e+00, 2.00000000e+01],
       [ 3.00000000e+00, 2.06900000e+03, 2.06700000e+03, ...,
         1.90000000e+01, 7.00000000e+00, 3.20000000e+01],
       [ 4.00000000e+00, 2.54200000e+03, 2.54000000e+03, ...,
         1.10000000e+01, 1.10000000e+01, 1.10000000e+01],
EDIT:
The data itself is not that large (~3 MB). If it helps, I am using IPython Notebook.
I tried a simple 3x3 test matrix as input and got the same problem, so it's probably not the data size either:
data = np.array([[1,2,3],[1,4,6],[2,8,11]])
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
I also tried sklearn's PCA in a plain Python session in the terminal:
>>> from sklearn.decomposition import PCA
>>> pca = PCA()
>>> import numpy as np
>>> X = np.array([[1,2,3],[1,5,7],[2,6,10]])
>>> y = np.array([1,2,3])
>>> pca.fit(X, y)
And got:
Illegal instruction (core dumped)

It seems that sklearn does not run properly on a 32-bit machine: when I later ran the same code on a 64-bit server, it worked!
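If you want to confirm whether the interpreter you are running is 32-bit or 64-bit, here is a minimal check (a sketch; should work in any recent Python):
import struct, platform
print(struct.calcsize("P") * 8)  # pointer size in bits: 32 or 64
print(platform.machine())        # e.g. 'x86_64' on a 64-bit machine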

Related

clustering for a single timeseries

I have a single NumPy array (x) and I want to cluster it in an unsupervised way using DBSCAN and hierarchical clustering with scikit-learn. Is clustering possible for single-array data? Additionally, I need to plot the clusters and their corresponding representation on the input data.
I tried
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy import stats
import scipy.cluster.hierarchy as hac
#my data
x = np.linspace(0, 500, 10000)
x = 1.5 * np.sin(x)
#dbscan
clustering = DBSCAN(eps=3).fit(x)
# here I am facing a problem
# hierarchical
Yes, DBSCAN can cluster "1-D" arrays; see the time-series example below, although I don't know how meaningful it is to cluster just the waveform.
For example,
import numpy as np
rng = np.random.default_rng(42)
x = rng.normal(loc=[-10, 0, 0, 0, 10], size=(200, 5)).reshape(-1, 1)
rng.shuffle(x)
print(x[:10])
# [[-10.54349551]
# [ -0.32626201]
# [ 0.22359555]
# [ -0.05841124]
# [ -0.11761086]
# [ -1.0824272 ]
# [ 0.43476607]
# [ 11.40382139]
# [ 0.70166365]
# [ 9.79889535]]
from sklearn.cluster import DBSCAN
dbs = DBSCAN()
clusters = dbs.fit_predict(x)
import matplotlib.pyplot as plt
plt.scatter(x, np.zeros(len(x)), c=clusters)
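As a quick follow-up check (DBSCAN marks noise points with the label -1), you can count the clusters and noise points it found:
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = int(np.sum(clusters == -1))
print(n_clusters, n_noise)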
You can use AgglomerativeClustering for hierarchical clustering.
Here's an example using the data from above.
from sklearn.cluster import AgglomerativeClustering
aggC = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0, linkage="single")
clusters = aggC.fit_predict(x)
plt.scatter(x, np.zeros(len(x)), c=clusters)
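Since the question also imports scipy.cluster.hierarchy, here is a minimal sketch of a dendrogram for the same x as above (assuming a dendrogram is what you want for visualizing the hierarchy); scipy's linkage expects a 2-D array of observations:
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt
Z = hac.linkage(x, method="single")  # same single linkage as the AgglomerativeClustering call above
hac.dendrogram(Z, no_labels=True)
plt.show()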
Time Series / Waveform (no other features)
You can do it, but with no features other than time and signal amplitude, I don't know if this has any meaning.
import numpy as np
from scipy import signal
y = np.hstack((
    np.zeros(100),
    signal.square(2*np.pi*np.linspace(0, 2, 200, endpoint=False)),
    np.zeros(100),
    signal.sawtooth(2*np.pi*np.linspace(0, 2, 200, endpoint=False) + np.pi/2, width=0.5),
    np.zeros(100),
    np.sin(2*np.pi*np.linspace(0, 2, 200, endpoint=False)),
    np.zeros(100),
))
import datetime
start = datetime.datetime.fromisoformat("2022-12-01T12:00:00.000000")
times = np.array([(start+datetime.timedelta(microseconds=_)).timestamp() for _ in range(1000)])
my_sig = np.hstack((times.reshape(-1,1),y.reshape(-1,1)))
print(my_sig[:5,:])
# [[1.6698924e+09 0.0000000e+00]
# [1.6698924e+09 0.0000000e+00]
# [1.6698924e+09 0.0000000e+00]
# [1.6698924e+09 0.0000000e+00]
# [1.6698924e+09 0.0000000e+00]]
from sklearn.cluster import AgglomerativeClustering
aggC = AgglomerativeClustering(n_clusters=None, distance_threshold=4.0)
clusters = aggC.fit_predict(my_sig)
import matplotlib.pyplot as plt
plt.scatter(my_sig[:,0], my_sig[:,1], c=clusters)
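One design note: here the timestamps only span about a millisecond, so the pairwise distances are dominated by the signal amplitude anyway. If the time axis were on a comparable or larger scale, you would normally standardize both columns first, e.g. with the StandardScaler the question already imports (a sketch using my_sig from above; the distance_threshold value is only illustrative and would need re-tuning on the scaled data):
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
scaled = StandardScaler().fit_transform(my_sig)
aggC_scaled = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5)  # threshold is illustrative
clusters_scaled = aggC_scaled.fit_predict(scaled)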

kernel dies when computing DBSCAN in scikit-learn after dimensionality reduction

I have some data after using ColumnTransformer(), which looks like this:
>>> X_trans
<197431x6040 sparse matrix of type '<class 'numpy.float64'>'
        with 3553758 stored elements in Compressed Sparse Row format>
I transform the data using TruncatedSVD(), which seems to work:
>>> from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=3, random_state=0)
>>> X_trans_svd = svd.fit_transform(X_trans)
>>> X_trans_svd
array([[ 1.72326526,  1.85499833, -1.41848742],
       [ 1.67802434,  1.81705149, -1.25959756],
       [ 1.70251936,  1.82621935, -1.33124505],
       ...,
       [ 1.5607798 ,  0.07638707, -1.11972714],
       [ 1.56077981,  0.07638652, -1.11972728],
       [ 1.91659627, -0.12081577, -0.84551125]])
Now I want to feed the transformed data to DBSCAN:
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(X_trans_svd)
but my kernel crashes.
I also tried converting it back to a DataFrame and applying that to DBSCAN:
>>> d = {'1st_component': X_trans_svd[:, 0],
...      '2nd_component': X_trans_svd[:, 1],
...      '3rd_component': X_trans_svd[:, 2]}
>>> df = pd.DataFrame(data=d)
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(df)
But the kernel keeps crashing. Any idea why that is? I'd appreciate a hint.
EDIT: If I use just part of my 197431x3 array, it works up to X_trans_svd[0:170000] and starts crashing at X_trans_svd[0:180000]. Furthermore, the size of the array is
>>> X_trans_svd.nbytes
4738344
EDIT2: Sorry for not doing this earlier. Here's an example to reproduce it. I tried two machines, with 16 GB and 64 GB of RAM. Data is here: original data
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.cluster import DBSCAN
s = np.loadtxt('data.txt', dtype='float')
elapsed = datetime.now()
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(s)
elapsed = datetime.now() - elapsed
print(elapsed)

program is not working "TypeError: fit() missing 1 required positional argument: 'y'"

from sklearn import tree
from sklearn.datasets import load_iris
iris=load_iris()
dir(iris)
#output data to train setosa, versicolor and virginica
x=iris.data
#fetching data
x=np.delete(x, np.s_[::50], 0)
#print(x)
y=iris.target
#fetching output
y=np.delete(y, np.s_[::50], 0)
algo=tree.DecisionTreeClassifier
When I try to use fit, it does not work:
train=algo.fit(x,y)
res=train.pridict([test_setosa])
print(res)
You need to change something in your code. DecisionTreeClassifier is a class, and the way you call it is wrong: you assign the class itself instead of creating an instance, so algo.fit(x,y) treats x as self and y as the training data X, which is why fit() complains that the positional argument 'y' is missing.
Replace
algo=tree.DecisionTreeClassifier
with
algo=tree.DecisionTreeClassifier()
Full code
from sklearn import tree
from sklearn.datasets import load_iris
import numpy as np
iris=load_iris()
dir(iris)
#output data to train setosa, versicolor and virginica
x=iris.data
#fetching data
x=np.delete(x, np.s_[::50], 0)
#print(x)
y=iris.target
#fetching output
y=np.delete(y, np.s_[::50], 0)
algo=tree.DecisionTreeClassifier()
train=algo.fit(x,y)
test_setosa = [5.1, 3.5, 1.4, 0.2]  # example setosa measurements; substitute your own test sample
res=train.predict([test_setosa])
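To see the predicted class name rather than the numeric label, you can index into iris.target_names (using the iris object loaded above):
print(res)                     # e.g. [0]
print(iris.target_names[res])  # e.g. ['setosa']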

Get random gamma distribution in tensorflow like numpy.random.gamma

Hi, I am new to TensorFlow and I am trying to generate a random gamma distribution in TensorFlow, just like numpy.random.gamma.
My numpy code is:
self._lambda = 1 * np.random.gamma(100., 1. / 100, (self.n_topic, self.n_voca))
where n_topic=240 and n_voca=198.
My TensorFlow code is:
self._tf_lambda = tf.random_gamma((self.n_topic, self.n_voca),1, dtype=tf.float32, seed=0, name='_tf_lambda')
Is this a correct implementation? I believe I have misunderstood the parameters of tf.random_gamma, because self._lambda and self._tf_lambda do not match.
You are setting different shape parameters in the two calls (100 in the numpy version, 1 in the TensorFlow version), so it is expected that the results differ.
One thing to watch out for is that numpy has a "scale" parameter while TF has an "inverse scale" parameter. So one has to be inverted to get the same distribution.
Jupyter notebook example with matching distributions:
%matplotlib inline
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
size = (50000,)
shape_parameter = 1.5
scale_parameter = 0.5
bins = np.linspace(-1, 5, 30)
np_res = np.random.gamma(shape=shape_parameter, scale=scale_parameter, size=size)
# Note the 1/scale_parameter here
tf_op = tf.random_gamma(shape=size, alpha=shape_parameter, beta=1/scale_parameter)
with tf.Session() as sess:
    tf_res = sess.run(tf_op)
plt.hist(tf_res, bins=bins, alpha=0.5);
plt.hist(np_res, bins=bins, alpha=0.5);
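As a quick sanity check, both samples should have roughly the same mean, which for a gamma distribution is shape * scale = 1.5 * 0.5 = 0.75:
print(np_res.mean())  # ~0.75
print(tf_res.mean())  # ~0.75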

Sklearn kmeans equivalent of elbow method

Let's say I'm examining up to 10 clusters. With scipy I usually generate the 'elbow' plot as follows:
from scipy import cluster
from matplotlib import pyplot
cluster_array = [cluster.vq.kmeans(my_matrix, i) for i in range(1,10)]
pyplot.plot([var for (cent,var) in cluster_array])
pyplot.show()
I have since become motivated to use sklearn for clustering; however, I'm not sure how to create the array needed for the plot, as in the scipy case. My best guess was:
from sklearn.cluster import KMeans
km = [KMeans(n_clusters=i) for i range(1,10)]
cluster_array = [km[i].fit(my_matrix)]
That unfortunately resulted in a syntax error. What is the best sklearn way to go about this?
Thank you
You can use the inertia_ attribute of the KMeans class.
Assuming X is your dataset:
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
X = # <your_data>
distortions = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distortions)
plt.grid(True)
plt.title('Elbow curve')
You had some syntax problems in the code. They should be fixed now:
Ks = range(1, 10)
km = [KMeans(n_clusters=i) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]
The fit method just returns the estimator itself (self). So in this line of the original code
cluster_array = [km[i].fit(my_matrix)]
the cluster_array would end up having the same contents as km.
You can use the score method to get an estimate of how well the clustering fits; for KMeans, score(X) returns the negative of the inertia, so higher (closer to zero) is better. To see the score for each number of clusters, simply plot Ks against score.
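A minimal sketch of that plotting step (assuming my_matrix and the Ks/score lists from above):
import matplotlib.pyplot as plt
plt.plot(Ks, score)
plt.xlabel('Number of clusters')
plt.ylabel('Score (negative inertia)')
plt.show()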
You can also use the mean Euclidean distance between each data point and its nearest cluster center to evaluate how many clusters to choose. Here is a code example.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
iris = load_iris()
x = iris.data
res = list()
n_cluster = range(2,20)
for n in n_cluster:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(x)
    res.append(np.average(np.min(cdist(x, kmeans.cluster_centers_, 'euclidean'), axis=1)))
plt.plot(n_cluster, res)
plt.title('elbow curve')
plt.show()
