scikit-learn: Get selected features for prediction data - scikit-learn

I have a training set of data. The Python script that creates the model also computes the attributes as a NumPy array (a bit vector). I then want to use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0 or all 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length len(x_predict) but the wrong shape: x_predict.shape[1] is still the original number of features. My classifier then raises an error because of the wrong shape:
prediction = gbc.predict(x_predict)
File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1032, in _init_decision_function
self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?

You can do it like this:
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
[1, 4],
[1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
[1, 4],
[1, 1]])
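Applied to the original question, the same idea carries over to the prediction data: index the columns (not the rows), or reuse the fitted selector. The sketch below is only illustrative; X_train and X_predict_all are made-up arrays standing in for the asker's training and prediction features.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical stand-ins for the training and prediction feature matrices.
X_train = np.array([[0, 2, 0, 3],
                    [0, 1, 4, 3],
                    [0, 1, 1, 3]])
X_predict_all = np.array([[1, 5, 2, 3],
                          [0, 7, 9, 8]])

# Fit the selector on the training data only.
selector = VarianceThreshold().fit(X_train)
idxs = selector.get_support(indices=True)

# Option A: select columns explicitly (note the leading ':' for all rows).
X_predict_a = X_predict_all[:, idxs]

# Option B: let the fitted selector apply the same column selection.
X_predict_b = selector.transform(X_predict_all)
```

Writing x_predict_all[indices] without the leading colon selects rows rather than columns, which would explain why shape[1] stayed at the original feature count.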

Related

Using numba to randomly sample possible combinations of categories

I am trying to speed up a function that randomly samples a number of records with the possible combinations of a number of categories for a number of records and ensures they are unique (i.e. let's assume there's 3 records, any of them can be either 0 or 1 and I want 10 random samples of unique possible combinations of records).
If I did not use numba, I might do something like this:
import numpy as np
def myfunc(categories, NumberOfRecords, maxsamples):
    return np.unique(np.random.choice(np.arange(categories), size=(maxsamples*10, NumberOfRecords), replace=True), axis=0)[0:maxsamples]
Annoyingly, numba does not support axis in np.unique, so I can do something like this, but some of the records may turn out to be non-unique.
from numba import njit, int64
import numpy as np
#njit(int64[:,:](int64, int64, int64), cache=True)
def myfunc(categories, NumberOfRecords, maxsamples):
    return np.random.choice(np.arange(categories), size=(maxsamples, NumberOfRecords), replace=True)
myfunc(categories=2, NumberOfRecords=3, maxsamples=10)
E.g. in one call (obviously there's some randomness here), I got the below (for which the indices 1 and 6, and 3 and 4, and 7 and 9 are identical rows):
array([[0, 1, 1],
[1, 1, 0],
[0, 1, 0],
[1, 0, 1],
[1, 0, 1],
[1, 1, 1],
[1, 1, 0],
[1, 0, 0],
[0, 0, 0],
[1, 0, 0]])
My questions are:
Is this something where I would even expect a speed up from numba?
If so, how can I get unique rows (this seems rather difficult with numba, but presumably there's a way)?
Perhaps there's a way to get at this more efficiently (perhaps without creating more random samples than I need in the end)?
In the following, I don't use numba, but all the operations use vectorized numpy functions.
Each row of the result that you generate can be interpreted as an integer expressed in base N, where N is the number of categories. With that interpretation, what you want is to sample without replacement from the integers [0, 1, ... N**R-1], where R is the number of "records". You can use the choice function for that, with the argument replace=False. Once you have that, you need to convert the chosen integers to base N. For that, I use the function int2base, which is a pared down version of a function that I wrote in a different answer.
Here's the code:
import numpy as np
def int2base(x, base, ndigits):
    # x = np.asarray(x) # Uncomment this line for general purpose use.
    powers = base ** np.arange(ndigits)
    digits = (x.reshape(x.shape + (1,)) // powers) % base
    return digits

def makesample(ncategories, nrecords, nsamples, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = ncategories ** nrecords
    choices = rng.choice(n, replace=False, size=nsamples)
    return int2base(choices, ncategories, nrecords)
In makesample, I included the optional argument rng. It allows you to specify the object that holds the choice function. If not provided, it uses np.random.default_rng().
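As a small illustration of the rng argument, the sketch below passes a seeded generator so the draw is reproducible, then checks that every returned row is distinct. The seed value and the sizes (3 categories, 4 records, 20 samples) are arbitrary choices for the example; the two functions are repeated so the block runs on its own.

```python
import numpy as np

def int2base(x, base, ndigits):
    # Digit k of each integer in x, written in the given base
    # (least significant digit first).
    powers = base ** np.arange(ndigits)
    return (x.reshape(x.shape + (1,)) // powers) % base

def makesample(ncategories, nrecords, nsamples, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = ncategories ** nrecords
    # Sampling integers without replacement guarantees distinct rows.
    choices = rng.choice(n, replace=False, size=nsamples)
    return int2base(choices, ncategories, nrecords)

# Arbitrary seed, chosen only to make the example repeatable.
sample = makesample(3, 4, 20, rng=np.random.default_rng(12345))
repeat = makesample(3, 4, 20, rng=np.random.default_rng(12345))

# Number of distinct rows; should equal the number of samples requested.
n_unique = np.unique(sample, axis=0).shape[0]
```

Because each integer below ncategories**nrecords has exactly one base-N representation, distinct integers always map to distinct rows.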
Example:
In [118]: makesample(2, 3, 6)
Out[118]:
array([[0, 1, 1],
[0, 0, 1],
[1, 0, 1],
[0, 0, 0],
[1, 1, 0],
[1, 1, 1]])
In [119]: makesample(5, 4, 12)
Out[119]:
array([[3, 4, 0, 1],
[2, 0, 2, 0],
[4, 2, 4, 3],
[0, 1, 0, 4],
[0, 2, 0, 1],
[1, 2, 0, 1],
[0, 3, 0, 4],
[3, 3, 0, 3],
[3, 4, 1, 4],
[2, 4, 1, 1],
[3, 4, 1, 0],
[1, 1, 4, 4]])
makesample will raise an exception if you ask for too many samples:
In [120]: makesample(2, 3, 10)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-120-80044e78a60a> in <module>
----> 1 makesample(2, 3, 10)
~/code_snippets/python/numpy/random_samples_for_so_question.py in makesample(ncategories, nrecords, nsamples, rng)
17 rng = np.random.default_rng()
18 n = ncategories ** nrecords
---> 19 choices = rng.choice(n, replace=False, size=nsamples)
20 return int2base(choices, ncategories, nrecords)
_generator.pyx in numpy.random._generator.Generator.choice()
ValueError: Cannot take a larger sample than population when 'replace=False'

What does the ordering/index of cluster_centers_ represent in KMeans clustering SKlearn

I have implemented the following code
k_mean = KMeans(n_clusters=5,init=centroids,n_init=1,random_state=SEED).fit(X_input)
k_mean.cluster_centers_.shape
>>
(5, 50)
I have 5 clusters of the data.
How are the clusters ordered? Are the indices of the clusters centres representing the labels?
In other words, does the cluster center at index 0 correspond to label 0?
In the docs you have a similar example:
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10., 2.],
[ 1., 2.]])
Yes, the indices are ordered: row i of cluster_centers_ is the center of the cluster with label i. By the way, k_mean.cluster_centers_.shape only returns the shape of the array, not the values. So in your case you have 5 clusters, and your features have 50 dimensions.
To get the nearest point, you can have a look here.
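One way to convince yourself of the label-to-row correspondence is to check it directly. This is a minimal sketch on the toy data from the docs example; n_init=10 is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Predicting the centers themselves should return 0, 1, ..., k-1,
# i.e. the center stored in row i is assigned label i.
center_labels = kmeans.predict(kmeans.cluster_centers_)

# On this small, well-separated data set each center is also the mean
# of the points carrying the matching label.
means_match = all(
    np.allclose(X[kmeans.labels_ == label].mean(axis=0),
                kmeans.cluster_centers_[label])
    for label in range(kmeans.n_clusters)
)
```

Which cluster ends up as label 0 depends on the initialization, so only the correspondence between row index and label is checked, not the actual label values.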

regarding transform a one way list into a two-dimensional array representing a mesh grid

I have a data set saved as following. Generally speaking, it is a list, each element of this master list is a sublist. Each sublist includes two elements, where the first one is a value, and the second one is ID.
[[0.089, 0],
[0.075, 1],
[0.588, 2],
[0.906, 3],
[0.332, 4],
[0.707, 5],
[0.668, 6],
[0.426, 7],
[0.034, 8]]
The above test data set can be generated using the following code segment
import numpy as np
testlist=[]
for i in range(9):
    temp=[]
    x1 = np.random.rand()
    temp.append(x1)
    temp.append(i)
    testlist.append(temp)
How can I transform this list into a two-dimensional array representing a mesh? For instance, the values would be arranged in the two-dimensional array like this:
0.089 0.075 0.588
0.906 0.332 0.707
0.668 0.426 0.034
Is this what you want?
reshape changes the array's shape; a -1 in the shape means that dimension is inferred by NumPy from the array's size.
arr = np.array([[0.089, 0],
[0.075, 1],
[0.588, 2],
[0.906, 3],
[0.332, 4],
[0.707, 5],
[0.668, 6],
[0.426, 7],
[0.034, 8]])
arr[:,0].reshape(-1,3).copy()
Result
array([[0.089, 0.075, 0.588],
[0.906, 0.332, 0.707],
[0.668, 0.426, 0.034]])
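The reshape above relies on the rows already being sorted by ID. If that is not guaranteed, a possible variation is to use the ID column as the position in the flattened grid. This is only a sketch, assuming each ID is the row-major position in a 3x3 mesh and that IDs 0 to 8 each occur exactly once; the shuffled list is made up for the example.

```python
import numpy as np

# Same values as in the question, deliberately out of order.
testlist = [[0.588, 2], [0.089, 0], [0.034, 8],
            [0.906, 3], [0.075, 1], [0.707, 5],
            [0.668, 6], [0.332, 4], [0.426, 7]]

arr = np.asarray(testlist)
values = arr[:, 0]
ids = arr[:, 1].astype(int)

# Place each value at the flat position given by its ID, then reshape.
flat = np.empty(len(testlist))
flat[ids] = values
grid = flat.reshape(-1, 3)
```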

Numpy matrix addition vs ndarrays, convenient oneliner

How does numpy's matrix class work? I understand it will likely be removed in the future, so I am trying to understand how it works, so I can do the same with ndarrays.
>>> x=np.matrix([[1,1,1],[2,2,2],[3,3,3]])
>>> x[:,0] + x[0,:]
matrix([[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
Seems like a row of ones got added to every row.
>>> x=np.matrix([[1,2,3],[1,2,3],[1,2,3]])
>>> x[0,:] + x[:,0]
matrix([[2, 3, 4],
[2, 3, 4],
[2, 3, 4]])
Now it seems like a column of ones got added to every column. What it does with the identity is even weirder,
>>> x=np.matrix([[1,0,0],[0,1,0],[0,0,1]])
>>> x[0,:] + x[:,0]
matrix([[2, 1, 1],
[1, 0, 0],
[1, 0, 0]])
EDIT:
It seems that if you take an (N,1) shape matrix and add it to a (1,N) shape matrix, then one of these is replicated to form an (N,N) matrix and the other is added to every row or column of this new matrix. It seems to be a convenience restricted to vectors of the right sizes. A nice use case was networkx's implementation of Floyd-Warshall.
Is there an equivalently convenient one-liner for this using standard numpy ndarrays?
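With plain ndarrays, x[:, 0] and x[0, :] are both one-dimensional, so the column has to be given an explicit second axis before broadcasting produces the same (N,N) result. The following is a sketch of two ways to do that, using the identity example from above; it is not taken from any answer to the question.

```python
import numpy as np

x = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

# Keep the column two-dimensional, shape (3, 1), and the row as shape (1, 3);
# broadcasting then expands both to (3, 3) before adding.
col = x[:, 0][:, np.newaxis]
row = x[0, :][np.newaxis, :]
result = col + row

# Equivalent spelling: outer sum of the two 1-D vectors.
alt = np.add.outer(x[:, 0], x[0, :])
```

Slicing with x[:, :1] + x[:1, :] gives the same shapes without np.newaxis, since range slices preserve the number of dimensions.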

Differences between index-assignment in Numpy and Theano's set_subtensor()

I am trying to do index-assignment in Theano using set_subtensor(), but it is giving different results to Numpy's index-assignment. Am I doing something wrong, or is this a difference in how set_subtensor and Numpy's index-assignment work?
What I want to do:
X = np.zeros((2, 2))
X[[[0, 1], [0, 1]]] = np.array([1, 2])
X is now:
[[ 1. 0.]
[ 0. 2.]]
Trying to do the same thing in Theano:
X = theano.shared(value=np.zeros((2, 2)))
X = T.set_subtensor(X[[[0, 1], [0, 1]]], np.array([1, 2]))
X.eval()
Raises this error
ValueError: array is not broadcastable to correct shape
This highlights a subtle difference between numpy and Theano but it can be worked around easily.
Advanced indexing can be enabled in numpy by using a list of positions or a tuple of positions. In Theano, one can only use a tuple of positions.
So changing
X = T.set_subtensor(X[[[0, 1], [0, 1]]], np.array([1, 2]))
to
X = T.set_subtensor(X[([0, 1], [0, 1])], np.array([1, 2]))
solves the problem in Theano.
One continues to get the same result in numpy if one changes
X[[[0, 1], [0, 1]]] = np.array([1, 2])
to
X[([0, 1], [0, 1])] = np.array([1, 2])
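For the NumPy side, the tuple form is also the safer spelling today. As far as I know, recent NumPy releases no longer treat a nested list as a tuple of per-axis indices, so the list-of-lists form may behave differently there than it did when the question was asked. A minimal NumPy-only sketch of the tuple form:

```python
import numpy as np

# Tuple of index lists: pairs up (0, 0) and (1, 1), i.e. the diagonal.
X = np.zeros((2, 2))
X[([0, 1], [0, 1])] = np.array([1, 2])

# The same thing written with the usual comma syntax.
Y = np.zeros((2, 2))
Y[[0, 1], [0, 1]] = np.array([1, 2])
```

Both spellings pass a tuple to the indexing machinery, one index sequence per axis.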
