kernel dies when computing DBSCAN in scikit-learn after dimensionality reduction - python-3.x

I have some data after using ColumnTransformer(), like:
>>> X_trans
<197431x6040 sparse matrix of type '<class 'numpy.float64'>'
with 3553758 stored elements in Compressed Sparse Row format>
I transform the data using TruncatedSVD(), which seems to work:
>>> from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=3, random_state=0)
>>> X_trans_svd = svd.fit_transform(X_trans)
>>> X_trans_svd
array([[ 1.72326526,  1.85499833, -1.41848742],
       [ 1.67802434,  1.81705149, -1.25959756],
       [ 1.70251936,  1.82621935, -1.33124505],
       ...,
       [ 1.5607798 ,  0.07638707, -1.11972714],
       [ 1.56077981,  0.07638652, -1.11972728],
       [ 1.91659627, -0.12081577, -0.84551125]])
Now I want to run DBSCAN on the transformed data:
>>> from sklearn.cluster import DBSCAN
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(X_trans_svd)
but my kernel crashes.
I also tried converting it back to a DataFrame and passing that to DBSCAN:
>>> d = {'1st_component': X_trans_svd[:, 0],
...      '2nd_component': X_trans_svd[:, 1],
...      '3rd_component': X_trans_svd[:, 2]}
>>> df = pd.DataFrame(data=d)
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(df)
But the kernel keeps crashing. Any idea why that is? I'd appreciate a hint.
EDIT: If I use just part of my 197431x3 array, it works up to X_trans_svd[0:170000] and starts crashing at X_trans_svd[0:180000]. Furthermore, the size of the array is
>>> X_trans_svd.nbytes
4738344
EDIT2: Sorry for not providing this earlier. Here's an example to reproduce. I tried two machines with 16 and 64 GB of RAM. Data is here: original data
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.cluster import DBSCAN
s = np.loadtxt('data.txt', dtype='float')
elapsed = datetime.now()
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(s)
elapsed = datetime.now() - elapsed
print(elapsed)
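Not part of the original question, but a hedged sketch that may help localize the crash: scikit-learn's DBSCAN materializes the eps-neighborhood of every point, and when many points are nearly identical (as the SVD output above suggests) those neighbor lists can grow roughly quadratically and exhaust RAM even though the 4.7 MB input array is tiny. Checking the average neighborhood size on a random sample is cheap:
import numpy as np
from sklearn.neighbors import NearestNeighbors
s = np.loadtxt('data.txt', dtype='float')  # same file as in the repro above
# Fit a radius-neighbors index with the same eps used for DBSCAN, then
# measure the neighborhood size on a random sample of 1000 points. A very
# large average means DBSCAN's precomputed neighbor lists will not fit in RAM.
nn = NearestNeighbors(radius=0.5).fit(s)
rng = np.random.default_rng(0)
sample = s[rng.choice(len(s), size=1000, replace=False)]
neighborhoods = nn.radius_neighbors(sample, return_distance=False)
print('mean neighborhood size:', np.mean([len(n) for n in neighborhoods]))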

Related

Data Cleaning Error in Classification KNN Algorithm Problem

I believe the error is telling me I have null values in my data. I've tried fixing it, but the error keeps appearing. I don't want to delete the null data because I consider it relevant to my analysis.
The columns of my data are in this order: 'Titulo', 'Autor', 'Género', 'Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas', **'Estado'**. The ones in bold are string data.
Code:
import numpy as np
#Load Data
import pandas as pd
dataset = pd.read_excel(r"C:\Users\renat\Documents\Data Science Projects\Classification\Book Purchases\Biblioteca.xlsx")
#print(dataset.columns)
#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
#Handling missing values
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')
#Convert X and y to NumPy arrays
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,8].values
print(X.shape, y.shape)
# Create an instance of LabelEncoder
labelEncoderTitulo = LabelEncoder()
X[:, 0] = labelEncoderTitulo.fit_transform(X[:, 0])
labelEncoderAutor = LabelEncoder()
X[:, 1] = labelEncoderAutor.fit_transform(X[:, 1])
labelEncoderGenero = LabelEncoder()
X[:, 2] = labelEncoderGenero.fit_transform(X[:, 2])
labelEncoderEstado = LabelEncoder()
X[:, -1] = labelEncoderEstado.fit_transform(X[:, -1])
#Instantiate our KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)
y_pred = knn.predict(X)
print(y_pred)
Error Message:
ValueError: Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
You have to fit and transform the data with the SimpleImputer you created. From the documentation:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean') # Here the imputer is created
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]) # Here the imputer is fitted, i.e. learns the mean
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X)) # Here the imputer is applied, i.e. fills in the mean
The crucial parts here are imp_mean.fit() and imp_mean.transform(X).
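A small aside (my note, using only the SimpleImputer API already shown): when the imputer should learn its means from the same data it fills, the two calls collapse into one:
X = imp_mean.fit_transform(X)  # learns the column means from X and fills its NaNs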
Additionally I'd use another technique to handle categorical data since LabelEncoder is not suitable here:
This transformer should be used to encode target values, i.e. y, and not the input X.
For alternatives see here: How to consider categorical variables in distance based algorithms like KNN or SVM?
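For illustration only (a sketch added here, not from the original answer; the column split is an assumption based on the question's description, and dataset is the frame loaded in the question), the string columns could be one-hot encoded so that KNN does not treat LabelEncoder's arbitrary integer codes as ordinal distances:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Hypothetical split of the question's columns into categorical and numeric
categorical = ['Titulo', 'Autor', 'Género']
numeric = ['Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas']
encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical)],
    remainder='passthrough')  # numeric columns are passed through unchanged
X_encoded = encoder.fit_transform(dataset[categorical + numeric])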
You need SimpleImputer to impute the missing values in X. We fit the imputer on X and then transform X to replace the NaN values with the mean of each column. After imputing the missing values, we encode the target variable using LabelEncoder.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)
# Encode target variable
labelEncoderEstado = LabelEncoder()
y = labelEncoderEstado.fit_transform(y)

While creating dummy variables getting memory error

I am working on a project, and while creating the dummy variables I got a memory exception.
I have tried using .astype(np.int8), and I have also tried writing exception-handling code by importing psutil.
I am using the code below:
dummy_cols = ['emp_title','grade','home_ownership','verification_status','addr_state','pub_rec','application_type']
df_dummies = pd.get_dummies(df[dummy_cols], drop_first = True)
It's not working and keeps throwing a memory error.
pandas.get_dummies creates a dense representation of the dummy variables, which may require a lot of memory depending on the number of levels in the categorical features.
I would prefer sklearn.preprocessing.OneHotEncoder, which outputs sparse matrices.
The code would look like this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Create a fake dataframe
df = pd.DataFrame(
    {
        "df1": np.random.choice(["a", "b"], 100),
        "df2": np.random.choice(["c", "d"], 100),
    }
)
dummy_cols = ["df1", "df2"]
# LabelEncode categoricals
for f in dummy_cols:
    df[f] = LabelEncoder().fit_transform(df[f])
# Transform to dummies in sparse representation (csr_matrix)
df_dummies = OneHotEncoder().fit_transform(df[dummy_cols])
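As a usage note (my addition, assuming scikit-learn >= 0.20, where OneHotEncoder accepts string categories directly), the LabelEncoder loop can even be dropped and the result stays sparse:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Rebuild the fake dataframe with its original string values
df = pd.DataFrame({"df1": np.random.choice(["a", "b"], 100),
                   "df2": np.random.choice(["c", "d"], 100)})
# OneHotEncoder handles the string columns itself and returns a
# scipy.sparse.csr_matrix, so memory scales with the stored nonzeros only.
df_dummies = OneHotEncoder().fit_transform(df[["df1", "df2"]])
print(type(df_dummies), df_dummies.shape)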

How to extract rows and columns from a 3D array in Tensorflow

I wanted to do the following indexing operation on a TensorFlow tensor.
What should the equivalent operations in TensorFlow be to get b and c as output? Although the tf.gather_nd documentation has several examples, I could not construct an equivalent indices tensor to get these results.
import tensorflow as tf
import numpy as np
a=np.arange(18).reshape((2,3,3))
idx=[2,0,1] # it can be any valid re-ordering index list
#These are the two numpy operations that I want to do in Tensorflow
b=a[:,idx,:]
c=a[:,:,idx]
# TensorFlow operations
aT=tf.constant(a)
idxT=tf.constant(idx)
# what should be these two indices
idx1T=tf.reshape(idxT, (3,1))
idx2T=tf.reshape(idxT, (1,1,3))
bT=tf.gather_nd(aT, idx1T ) #does not work
cT=tf.gather_nd(aT, idx2T) #does not work
with tf.Session() as sess:
    b1, c1 = sess.run([bT, cT])
print(np.allclose(b, b1))
print(np.allclose(c, c1))
I am not restricted to tf.gather_nd. Any other suggestion to achieve the same operations on a GPU would be helpful.
Edit: I have updated the question to fix a typo:
old statement: c=a[:,idx]
new statement: c=a[:,:,idx]
What I wanted to achieve was re-ordering of the columns as well.
That can be done with tf.gather, using the axis parameter:
import tensorflow as tf
import numpy as np
a = np.arange(18).reshape((2,3,3))
idx = [2,0,1]
b = a[:, idx, :]
c = a[:, :, idx]
aT = tf.constant(a)
idxT = tf.constant(idx)
bT = tf.gather(aT, idxT, axis=1)
cT = tf.gather(aT, idxT, axis=2)
with tf.Session() as sess:
    b1, c1 = sess.run([bT, cT])
print(np.allclose(b, b1))
print(np.allclose(c, c1))
Output:
True
True
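A side note beyond the original answer (a sketch assuming TensorFlow 2.x): tf.gather keeps the same axis semantics there, and with eager execution no Session is needed:
import numpy as np
import tensorflow as tf
a = np.arange(18).reshape((2, 3, 3))
idx = [2, 0, 1]
# Eager execution: the gathered tensors can be compared to NumPy directly.
bT = tf.gather(a, idx, axis=1)  # re-orders the rows of each 2D slice
cT = tf.gather(a, idx, axis=2)  # re-orders the columns of each 2D slice
print(np.allclose(a[:, idx, :], bT.numpy()))
print(np.allclose(a[:, :, idx], cT.numpy()))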

How do I map df column values to hex color in one go?

I have a pandas dataframe with two columns. One column's values need to be mapped to colors in hex. Another graphing process takes over from there.
This is what I have tried so far. Part of the toy code is taken from here.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mapper.to_rgba(x))
df
Which outputs a dataframe whose 'some_value_color' column contains RGBA tuples rather than hex strings.
How do I convert 'some_value' df column values to hex in one go?
Ideally using sns.cubehelix_palette(light=1).
I am not opposed to using something other than matplotlib.
Thanks in advance.
You may use matplotlib.colors.to_hex() to convert a color to hexadecimal representation.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
df
Efficiency
The above method is easy to use, but may not be very efficient. In the following, let's compare some alternatives.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def create_df(n=10):
    # Create dataframe
    df = pd.DataFrame(np.random.randint(0, 21, size=(n, 2)),
                      columns=['some_value', 'another_value'])
    # Add a nan to handle realworld
    df.iloc[-1] = np.nan
    return df
The following is the solution from above. It applies the conversion to the dataframe row by row. This is quite inefficient.
def apply1(df):
    # map values to colors in hex via
    # matplotlib to_hex by pandas apply
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
    return df
That's why we might choose to calculate the values into a numpy array first and just assign this array as the newly created column.
def apply2(df):
    # map values to colors in hex via
    # matplotlib to_hex by assigning numpy array as column
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    a = mapper.to_rgba(df['some_value'])
    df['some_value_color'] = np.apply_along_axis(mcolors.to_hex, 1, a)
    return df
Finally, we may use a look-up table (LUT) which is created from the matplotlib colormap, and index the LUT by the normalized data. Because this solution needs to create the LUT first, it is rather inefficient for dataframes with fewer entries than the LUT has colors, but it will pay off for large dataframes.
def apply3(df):
    # map values to colors in hex via
    # creating a hex look-up table and indexing it with the normalized data
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    lut = plt.cm.viridis(np.linspace(0, 1, 256))
    lut = np.apply_along_axis(mcolors.to_hex, 1, lut)
    a = (norm(df['some_value'].values) * 255).astype(np.int16)
    df['some_value_color'] = lut[a]
    return df
Compare the timings
Let's take a dataframe with 10000 rows.
df = create_df(10000)
Original solution (apply1)
%timeit apply1(df)
2.66 s per loop
Array solution (apply2)
%timeit apply2(df)
240 ms per loop
LUT solution (apply3)
%timeit apply3(df)
7.64 ms per loop
In this case the LUT solution gives a speedup of roughly a factor of 350 (2.66 s vs. 7.64 ms).

Why does the kernel restart when I try sklearn PCA?

I use Ipython Notebook and when I input the code:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
I receive a notice that the kernel has died and has restarted. What is going on?
Also my data is in this format:
array([[ 0.00000000e+00,  3.13000000e+02,  3.10000000e+02, ...,
         9.00000000e+00,  6.00000000e+00,  2.00000000e+01],
       [ 3.00000000e+00,  2.06900000e+03,  2.06700000e+03, ...,
         1.90000000e+01,  7.00000000e+00,  3.20000000e+01],
       [ 4.00000000e+00,  2.54200000e+03,  2.54000000e+03, ...,
         1.10000000e+01,  1.10000000e+01,  1.10000000e+01],
EDIT:
The data itself is not that large (~3 MB). If it helps, I am using ipython notebook.
I tried a simple 3x3 test matrix as input and same problem, so it's probably not something with the data size either:
data = np.array([[1,2,3],[1,4,6],[2,8,11]])
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
I tried the sklearn's pca in the terminal with python as well:
>>> from sklearn.decomposition import PCA
>>> pca = PCA()
>>> import numpy as np
>>> X = np.array([[1,2,3],[1,5,7],[2,6,10]])
>>> y = np.array[1,2,3]
>>> y = np.array([1,2,3])
>>> pca.fit(X, y)
And got:
Illegal instruction (core dumped)
It seems that sklearn will not run nicely on a 32-bit machine; when I ran this later on a 64-bit server, it worked.
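A quick way to confirm that diagnosis (my addition, standard library only) is to check the interpreter's architecture before blaming the data:
import platform
import struct
print(platform.architecture()[0])   # '32bit' or '64bit' for this interpreter
print(struct.calcsize('P') * 8)     # pointer size in bits: 32 or 64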
