I'm trying to impute in scikit-learn but I get an error

I tried the code below, but I get an error:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values='NaN', strategy="mean")
col = veriler.iloc[:, 1:4].values
type(col)  # numpy.ndarray
imp = imp.fit(col)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

You need to convert the infinity values to bounded values before applying imputation; np.nan_to_num clips nan, inf and -inf to workable finite values. Note also that SimpleImputer expects missing_values=np.nan rather than the string 'NaN', which is why the example below passes np.nan.
For example:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X = [[7, np.inf, 3], [4, np.nan, 6], [10, 5, 9]]
X = np.nan_to_num(X, nan=-9999, posinf=33333333, neginf=-33333333)
imp_mean.fit(X)
>>> SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                  missing_values=nan, strategy='mean', verbose=0)
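As a quick side check (not shown in the original answer), the column means learned by fit can be inspected afterwards; continuing from the snippet above:
print(imp_mean.statistics_)  # per-column means used for imputation, here roughly [7.0, 1.1e+07, 6.0]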
The same preprocessing can be applied before transform:
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9], [np.nan, np.inf, -np.inf]]
X = np.nan_to_num(X, nan=-9999, posinf=33333333, neginf=-33333333)
print(imp_mean.transform(X))
>>>
[[-9.9990000e+03  2.0000000e+00  3.0000000e+00]
 [ 4.0000000e+00 -9.9990000e+03  6.0000000e+00]
 [ 1.0000000e+01 -9.9990000e+03  9.0000000e+00]
 [-9.9990000e+03  3.3333333e+07 -3.3333333e+07]]
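Applied to the code in the question, a minimal sketch could look like the following (the small veriler DataFrame here is a hypothetical stand-in for the one in the question). Keeping nan=np.nan in nan_to_num clips only the infinities, so the remaining NaNs are still imputed by the mean strategy:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the `veriler` DataFrame from the question
veriler = pd.DataFrame({'a': [1, 2, 3],
                        'b': [1.0, np.nan, np.inf],
                        'c': [4.0, 5.0, np.nan],
                        'd': [7.0, -np.inf, 9.0]})

col = veriler.iloc[:, 1:4].values.astype(float)

# Clip only the infinities to finite sentinels; keep the NaNs so they can be imputed
col = np.nan_to_num(col, nan=np.nan, posinf=33333333, neginf=-33333333)

imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # np.nan, not the string 'NaN'
print(imp.fit_transform(col))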

Related

Extract lower off-diagonal elements from a numpy array

I have the array below:
import numpy as np
a = np.array([[7412,   33,    2],
              [   2, 7304,   83],
              [   3,  101, 7237]])
I would like to extract only the lower off-diagonal elements from this array and put them in a vector.
I tried np.extract(~a, a), but it extracts all elements.
The desired output for the above example is [2, 3, 101].
Any insight would be helpful.
You can use np.tril_indices or np.tri:
import numpy as np

a = np.array([[7412,   33,    2],
              [   2, 7304,   83],
              [   3,  101, 7237]])
n, m = a.shape

# Option 1
out = a[np.tril_indices(n=n, k=-1, m=m)]

# Option 2 (should have equivalent output)
out = a[np.tri(N=n, M=m, k=-1, dtype=bool)]

out:
array([  2,   3, 101])
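For reference (a small illustration, not part of the original answer), the boolean mask that np.tri builds with k=-1 selects everything strictly below the main diagonal, and boolean indexing reads the True positions in row-major order:
import numpy as np

mask = np.tri(N=3, M=3, k=-1, dtype=bool)
print(mask)
# [[False False False]
#  [ True False False]
#  [ True  True False]]
# a[mask] then walks the True entries row by row: a[1, 0], a[2, 0], a[2, 1] -> [2, 3, 101]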

Print multiple columns from a matrix

I have a list of column indices and I want to print only those columns from a matrix.
Note: the list can be of any length, and the indices can be arbitrary.
For instance, the following does what I want:
import numpy as np

column_list = [2, 3]
a = np.array([[1, 2, 6, 1],
              [4, 5, 8, 2],
              [8, 3, 5, 3],
              [6, 5, 4, 4],
              [5, 2, 8, 8]])

new_matrix = []
for i in column_list:
    new_matrix.append(a[:, i])
new_matrix = np.array(new_matrix)
new_matrix = new_matrix.transpose()
print(new_matrix)
However, I was wondering if there is a shorter method?
Yes, there's a shorter way: you can pass a list (or NumPy array) of indices directly when indexing an array. So you can pass column_list as the column index of a:
>>> a[:, column_list]
array([[6, 1],
       [8, 2],
       [5, 3],
       [4, 4],
       [8, 8]])
# This is the new_matrix produced by your original code:
>>> new_matrix
array([[6, 1],
       [8, 2],
       [5, 3],
       [4, 4],
       [8, 8]])
>>> np.all(a[:, column_list] == new_matrix)
True
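As a side note (not from the original answer), this kind of integer-array indexing returns a copy rather than a view, and a NumPy index array works just as well as a Python list:
import numpy as np

column_list = np.array([2, 3])   # an index array behaves the same as a list here
a = np.array([[1, 2, 6, 1],
              [4, 5, 8, 2],
              [8, 3, 5, 3],
              [6, 5, 4, 4],
              [5, 2, 8, 8]])

sub = a[:, column_list]          # fancy indexing returns a new array (a copy)
sub[0, 0] = 99
print(a[0, 2])                   # still 6; the original matrix is untouched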

What does the ordering/index of cluster_centers_ represent in KMeans clustering SKlearn

I have implemented the following code
k_mean = KMeans(n_clusters=5,init=centroids,n_init=1,random_state=SEED).fit(X_input)
k_mean.cluster_centers_.shape
>>
(5, 50)
I have 5 clusters of the data.
How are the clusters ordered? Do the indices of the cluster centres correspond to the labels?
That is, does the row of cluster_centers_ at index 0 represent the cluster with label 0, or not?
In the docs you have a similar example:
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
Yes, the indices correspond: row i of cluster_centers_ is the centre of the cluster with label i. By the way, k_mean.cluster_centers_.shape only returns the shape of the array, not its values; so in your case you have 5 clusters and your features have dimension 50.
To get the nearest point, you can have a look here.
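A quick way to convince yourself of this correspondence (a small check, not from the original answer) is to recompute each centre as the mean of the points assigned to that label and compare it with the matching row of cluster_centers_:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

for label in range(kmeans.n_clusters):
    manual_centre = X[kmeans.labels_ == label].mean(axis=0)
    print(label, np.allclose(manual_centre, kmeans.cluster_centers_[label]))
# Prints True for both labels: row `label` of cluster_centers_ is the centre of cluster `label`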

How to convert a grouped pandas dataframe into a numpy 3d array and apply right-padding?

In order to feed data into an LSTM network to predict remaining useful life (RUL), I need to create a 3D numpy array (number of machines, number of sequences, number of variables).
I already tried to combine solutions from stackoverflow and managed to create a prototype (which you can see below).
import numpy as np
import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 3, 3, 3, 3],
                   'V1': [1, 2, 2, 3, 3, 4, 2],
                   'V2': [4, 2, 3, 2, 1, 5, 1],
                   })

df_desired_result = np.array([[[1, 4], [2, 2], [-99, -99]],
                              [[2, 3], [-99, -99], [-99, -99]],
                              [[3, 2], [3, 1], [4, 5]]])

max_len = df['ID'].value_counts().max()

def pad_df(df, cols, max_seq, group_col='ID'):
    array_for_pad = np.array(list(df[cols].groupby(df[group_col]).apply(pd.DataFrame.as_matrix)))
    padded_array = tf.keras.preprocessing.sequence.pad_sequences(array_for_pad,
                                                                 padding='post',
                                                                 maxlen=max_seq,
                                                                 value=-99)
    return padded_array

# testing the prototype
pad_df(df, ['V1', 'V2'], max_len)
But when I apply the code above to my data, the right-padding is applied correctly, but all values are set to 0.0.
I can't fully figure out this behaviour. I noticed that the first line of my function returns an array of nested arrays for array_for_pad.
(A screenshot of the resulting padded array was attached to the original question.)
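One possible explanation (an assumption here, since the poster's real data is not shown) is that pad_sequences defaults to dtype='int32', which truncates float-valued features, and that pd.DataFrame.as_matrix has been removed from recent pandas releases. Below is a minimal sketch of the same padding, using to_numpy() and an explicit float dtype:
import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({'ID': [1, 1, 2, 3, 3, 3, 3],
                   'V1': [1, 2, 2, 3, 3, 4, 2],
                   'V2': [4, 2, 3, 2, 1, 5, 1]})

cols = ['V1', 'V2']
max_len = df['ID'].value_counts().max()

# One (sequence_length, n_features) array per machine ID
sequences = [g[cols].to_numpy() for _, g in df.groupby('ID')]

padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=max_len, padding='post', value=-99, dtype='float64')

print(padded.shape)  # (3, 4, 2) for this toy frame: 3 IDs, longest sequence of length 4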

How to get the row/column labels of a Confusion Matrix from scikit-learn?

How would I confirm the columns/rows of an outputted confusion matrix if I didn't specify them when creating the matrix, as in the code below:
from sklearn.metrics import confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
cm = confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
From the docs I know it says "If none is given, those that appear at least once in y_true or y_pred are used in sorted order", so I would assume the columns/rows would be ("ant", "bird", "cat"), but how do I confirm that?
I tried something like cm.labels but that doesn't work.
In the source code of the confusion_matrix:
if labels is None:
    labels = unique_labels(y_true, y_pred)
What is unique_labels and where is it imported from?
from sklearn.utils.multiclass import unique_labels
unique_labels(y_true, y_pred)
Returns
array(['ant', 'bird', 'cat'], dtype='<U4')
unique_labels extracts an ordered array of unique labels.
Examples:
>>> from sklearn.utils.multiclass import unique_labels
>>> unique_labels([3, 5, 5, 5, 7, 7])
array([3, 5, 7])
>>> unique_labels([1, 2, 3, 4], [2, 2, 3, 4])
array([1, 2, 3, 4])
>>> unique_labels([1, 2, 10], [5, 11])
array([ 1, 2, 5, 10, 11])
Maybe a more intuitive example:
unique_labels(['z', 'x', 'y'], ['a', 'z', 'c'], ['e', 'd', 'y'])
Returns:
array(['a', 'c', 'd', 'e', 'x', 'y', 'z'], dtype='<U1')
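To attach these labels to the matrix explicitly (a small illustration, not part of the original answer), you can pass them to confusion_matrix yourself, or wrap the result in a labelled pandas DataFrame:
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]

labels = unique_labels(y_true, y_pred)           # array(['ant', 'bird', 'cat'], ...)
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are the true labels, columns are the predicted labels
print(pd.DataFrame(cm, index=labels, columns=labels))
#       ant  bird  cat
# ant     2     0    0
# bird    0     0    1
# cat     1     0    2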
