Remove variable with zero variance - python-3.x

Could somebody help me in writing the code to remove the variable whose variance is zero in the data frame using python?

Removing features with low variance
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
There are 3 boolean features here, each with 6 instances. Suppose we wish to remove those that are constant in at least 80% of the instances. A boolean feature is a Bernoulli random variable with variance p(1 - p), so these are the features whose variance is below 0.8 * (1 - 0.8). Consequently, we can use VarianceThreshold (reference: the scikit-learn feature selection documentation):
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
Output will be:
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
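For the zero-variance case the question actually asks about, a minimal sketch on a pandas DataFrame could look like the following. The column names and values are made up for illustration. It relies on VarianceThreshold's default threshold of 0.0, which drops only constant columns, and on get_support(), which returns the boolean mask of the columns that were kept.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical example frame: column "b" is constant, so its variance is zero.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [5, 5, 5, 5],
    "c": [0.1, 0.2, 0.1, 0.3],
})

# threshold=0.0 (the default) removes only features with zero variance.
selector = VarianceThreshold(threshold=0.0)
selector.fit(df)

# get_support() is a boolean mask over the original columns.
kept_columns = df.columns[selector.get_support()]
df_reduced = df[kept_columns]

# A pandas-only alternative: keep columns with more than one distinct value.
df_alt = df.loc[:, df.nunique() > 1]

print(list(df_reduced.columns))
```

Selecting through the mask, rather than using the array returned by fit_transform, keeps the result a DataFrame with its original column names.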

Related

How to convert the following processing using numpy

I am trying to improve a part of code that is slowing down the whole script significantly, right to the point of making it unfeasible. In particular the piece of code is:
for vectors1 in EC1:
    for vectors2 in EC2:
        r = np.add(vectors1, vectors2)
        for vectors3 in CDC:
            result = np.add(r, vectors3).tolist()
            if result not in states:  # This is what makes it very slow
                states.append(result)
EC1, EC2 and CDC are lists whose elements are lists of lists. As an example, one iteration gives:
vectors1: [[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0]]
vectors2: [[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
vectors3: [[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [2, 1, 0], [2, 1, 0]]
result: [[2, 0, 0], [2, 0, 0], [2, 1, 0], [2, 0, 0], [2, 0, 0], [2, 0, 0], [2, 0, 0], [4, 1, 0], [2, 1, 0]]
Notice how vectors1, vectors2 and vectors3 each correspond to one element of EC1, EC2 and CDC respectively, and how 'result' is the element-wise sum of vectors1, vectors2 and vectors3. The vectors therefore cannot be altered or sorted in any way, otherwise the expected value of 'result' would change.
The first two loops sum each item in EC1 with each item in EC2, and the third loop adds each item in CDC to that partial sum ('r'). I use numpy.add() for both sums and finally convert 'result' back to a list. So basically I am handling lists of lists as the elements of EC1, EC2 and CDC.
The problem is that I must deal with hundreds of thousands (close to 1M) of results, and checking whether a result already exists in the states list slows things down drastically, especially since the list grows as more results are processed.
I've tried to keep inside the numpy world by managing everything as numpy arrays. First declaring states as:
states = np.empty([9, 3], int)
Then I concatenate the result array to the states array, after first checking whether it already exists in states:
for vectors1 in EC1:
    for vectors2 in EC2:
        r = np.add(vectors1, vectors2)
        for vectors3 in CDC:
            result = np.add(r, vectors3)
            if not np.isin(states, result).any():
                np.concatenate(states, result, axis=0)
But I am definitely doing something wrong, because result is not being concatenated to states. I've also tried, without success:
np.append(states, result, axis=0)
Could this be parallelized in some way?
You can do the sums solely in numpy by using broadcasting
res = ((EC1[:,None,:] + EC2).reshape(-1, 1, 3) + CDC).reshape(-1, 3)
given that EC1, EC2 and CDC are arrays.
Afterwards you can filter out the duplicates with
np.unique(res, axis=0)
But like Lucas, I would strongly advise you to filter the arrays beforehand. For your example arrays that would shrink the number of rows in res from 729 to 8.
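As a rough, runnable sketch of the broadcasting approach above, here it is applied to the example vectors from the question, treated as 9x3 arrays. The helper name all_sums is made up for illustration. It compares the unfiltered result with the one obtained by de-duplicating each input first.

```python
import numpy as np

EC1 = np.array([[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0],
                [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0]])
EC2 = np.array([[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0],
                [2, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]])
CDC = np.array([[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0],
                [0, 0, 0], [0, 0, 0], [2, 1, 0], [2, 1, 0]])


def all_sums(a, b, c):
    """Sum every row of a with every row of b and every row of c."""
    pair_sums = (a[:, None, :] + b).reshape(-1, 1, 3)  # shape (len(a)*len(b), 1, 3)
    return (pair_sums + c).reshape(-1, 3)              # one row per (a, b, c) triple


# Without filtering: 9 * 9 * 9 rows, most of them duplicates.
res = all_sums(EC1, EC2, CDC)
states = np.unique(res, axis=0)

# Filtering the inputs beforehand: 2 * 2 * 2 rows.
res_small = all_sums(np.unique(EC1, axis=0),
                     np.unique(EC2, axis=0),
                     np.unique(CDC, axis=0))
states_small = np.unique(res_small, axis=0)

print(res.shape, res_small.shape)
print(states)
```

Both routes give the same set of unique sums; the pre-filtered one just builds a much smaller intermediate array.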
I'm not sure how large the data are that you are working with but this may speed things up somewhat:
import numpy as np
EC1 = [[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0]]
EC2 = [[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
CDC = [[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [2, 1, 0], [2, 1, 0]]
EC1.sort()
EC2.sort()
CDC.sort()
unique_triples = dict()
for v1 in EC1:
    for v2 in EC2:
        for v3 in CDC:
            key = str(v1) + str(v2) + str(v3)  # lists are not hashable, but strings are
            if key not in unique_triples:
                # .tolist() gives plain Python ints instead of NumPy scalars
                unique_triples[key] = np.add(np.add(v1, v2), v3).tolist()
The basic idea is to remove duplicate (EC1, EC2, CDC) triples and only do the additions on the unique ones. The lists are sorted first so that they are ordered lexicographically.
A dictionary has O(1) average-case lookups, so these membership checks should be faster than searching a list.
Whether this is faster overall depends on how large the data are and how many unique triples they contain.
The 3-vector sums are the values of the dictionary. For me, list(unique_triples.values()) gives:
>>> list(unique_triples.values())
[[0, 0, 0], [2, 1, 0], [2, 0, 0], [4, 1, 0], [2, 0, 0], [4, 1, 0], [4, 0, 0], [6, 1, 0]]
I did not remove the duplicates in the original lists of lists here. If the application you are looking at allows, it is also likely beneficial to remove these duplicates in EC1, EC2, and CDC before iterating over the values.
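If the elements of EC1, EC2 and CDC really are lists of lists, as in the question, and the loops need to stay, one possible sketch is to keep the loop structure but replace the slow `result not in states` list search with a set of hashable keys. The function name unique_states and the tiny 2x2 inputs below are made up for illustration.

```python
import numpy as np


def unique_states(EC1, EC2, CDC):
    """Collect each distinct element-wise sum once, in first-seen order."""
    seen = set()   # hashable keys, O(1) average-case membership test
    states = []    # the results themselves, as lists of lists
    for vectors1 in EC1:
        for vectors2 in EC2:
            r = np.add(vectors1, vectors2)
            for vectors3 in CDC:
                result = np.add(r, vectors3).tolist()
                # Nested lists are not hashable, nested tuples are.
                key = tuple(map(tuple, result))
                if key not in seen:
                    seen.add(key)
                    states.append(result)
    return states


# Hypothetical small inputs: each element is a 2x2 list of lists.
EC1 = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
EC2 = [[[0, 0], [0, 1]]]
CDC = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]

states = unique_states(EC1, EC2, CDC)
print(states)
```

The additions are still done one triple at a time, so this only removes the cost of the duplicate check, not the cost of the loops.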

Multiclass vs. multilabel fitting

In scikit-learn tutorials, I found the following paragraphs in the section 'Multiclass vs. multilabel fitting'.
I couldn't understand why the following code generates the given results.
First
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]
classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Next
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
Label binarization in scikit-learn will transform your targets and represent them in a label indicator matrix. This label indicator matrix has the shape (n_samples, n_classes) and is composed as follows:
each row represents a sample
each column represents a class
each element is 1 if the sample is labeled with the class and 0 if not
In your first example, you have a target collection with 5 samples and 3 classes. That's why transforming y with LabelBinarizer results in a 5x3 matrix. In your case, [1, 0, 0] corresponds to class 0, [0, 1, 0] corresponds to class 1 and so forth. Notice that in each row there is only one element set to 1, since each sample can have one label only.
In your next example, you have a target collection with 5 samples and 5 classes. That's why transforming y with MultiLabelBinarizer results in a 5x5 matrix. In your case, [1, 1, 0, 0, 0] corresponds to the multilabel [0, 1], [0, 1, 0, 1, 0] corresponds to the multilabel [1, 3] and so forth. The key difference to the first example is that each row can have multiple elements set to 1, because each sample can have multiple labels/classes.
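To see the two label indicator matrices side by side, here is a small sketch that only runs the binarizers on the targets from the question, with no classifier involved. The variable names are chosen for illustration.

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Multiclass targets: exactly one label per sample.
y_multiclass = [0, 0, 1, 1, 2]
lb = LabelBinarizer()
Y_multiclass = lb.fit_transform(y_multiclass)

# Multilabel targets: a collection of labels per sample.
y_multilabel = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
mlb = MultiLabelBinarizer()
Y_multilabel = mlb.fit_transform(y_multilabel)

print(Y_multiclass.shape)        # 5 samples x 3 classes
print(Y_multiclass.sum(axis=1))  # one 1 per row
print(Y_multilabel.shape)        # 5 samples x 5 classes
print(Y_multilabel.sum(axis=1))  # as many 1s per row as the sample has labels
```

The row sums make the difference visible: always 1 in the multiclass case, and equal to the number of labels of each sample in the multilabel case.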
The predicted values you get follow the very same pattern. They are however not equivalent to the original values in y since your classification model has obviously predicted different values. You can check this with the inverse_transform() of the binarizers:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# a plain list of label lists (a ragged np.array raises an error in recent NumPy)
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y_bin = mlb.fit_transform(y)
# direct transformation
[[1 1 0 0 0]
[1 0 1 0 0]
[0 1 0 1 0]
[1 0 1 1 0]
[0 0 1 0 1]]
# prediction of your classifier
y_pred = np.array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
# inverting the binarized values to the original classes
y_inv = mlb.inverse_transform(y_pred)
# output
[(0, 1), (0, 2), (1, 3), (0, 2), (0, 2)]

cv2.Laplacian vs cv2.filter2d - Different results

I am trying to convolve my grayscale image with various filters. I have used the
cv2.Laplacian(gray, cv2.CV_64F)
and
kernel =np.array([[0, 1, 0] , [1, -4, 1] , [0, 1, 0]])
dst = cv2.filter2D(gray, -1, kernel)
But the results are different.
Can someone elaborate why I am getting different results?
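Since cv2.Laplacian in that case does exactly what you do, namely convolve with the [[0, 1, 0], [1, -4, 1], [0, 1, 0]] filter, the likely culprit is the datatype you are asking cv2.filter2D to produce. With ddepth=-1 the output has the same depth as the input, so for a uint8 image the negative responses are clipped to 0, whereas cv2.Laplacian(gray, cv2.CV_64F) keeps them.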
By using this code
kernel = np.array([[0, 1, 0] , [1, -4, 1] , [0, 1, 0]])
dst1 = cv2.filter2D(im, ddepth=cv2.CV_64F, kernel=kernel)
dst2 = cv2.Laplacian(im, cv2.CV_64F)
you should get
>>> np.all(dst1==dst2)
True
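To illustrate why the output datatype matters, here is a NumPy-only sketch that imitates the effect on a tiny made-up image. It is not OpenCV's implementation: the helper laplacian_interior is hypothetical, handles interior pixels only, and the clipping step merely mimics what happens when signed filter responses are stored in a uint8 result.

```python
import numpy as np


def laplacian_interior(img):
    """Apply the [[0, 1, 0], [1, -4, 1], [0, 1, 0]] kernel to interior pixels only."""
    f = img.astype(np.float64)
    return (f[:-2, 1:-1] + f[2:, 1:-1]      # pixels above and below
            + f[1:-1, :-2] + f[1:-1, 2:]    # pixels left and right
            - 4.0 * f[1:-1, 1:-1])          # centre pixel


# Hypothetical 4x4 grayscale image with one bright pixel.
gray = np.array([[10, 10, 10, 10],
                 [10, 50, 10, 10],
                 [10, 10, 10, 10],
                 [10, 10, 10, 10]], dtype=np.uint8)

as_float = laplacian_interior(gray)                     # keeps negative responses
as_uint8 = np.clip(as_float, 0, 255).astype(np.uint8)   # negative responses become 0

print(as_float)
print(as_uint8)
```

The bright pixel produces a response of -160 in the floating-point result, which turns into 0 once the result is forced into the uint8 range, so the two outputs no longer agree.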

Using numba to randomly sample possible combinations of categories

I am trying to speed up a function that randomly samples a number of records with the possible combinations of a number of categories for a number of records and ensures they are unique (i.e. let's assume there's 3 records, any of them can be either 0 or 1 and I want 10 random samples of unique possible combinations of records).
If I did not use numba, I might do something like this:
import numpy as np
def myfunc(categories, NumberOfRecords, maxsamples):
    return np.unique( np.random.choice(np.arange(categories), size=(maxsamples*10, NumberOfRecords), replace=True), axis=0 )[0:maxsamples]
Annoyingly, numba does not support axis in np.unique, so I can do something like this, but some of the records may turn out to be non-unique.
from numba import njit, int64
import numpy as np
@njit(int64[:,:](int64, int64, int64), cache=True)
def myfunc(categories, NumberOfRecords, maxsamples):
    return np.random.choice(np.arange(categories), size=(maxsamples, NumberOfRecords), replace=True)
myfunc(categories=2, NumberOfRecords=3, maxsamples=10)
E.g. in one call (obviously there's some randomness here), I got the below (for which the indices 1 and 6, and 3 and 4, and 7 and 9 are identical rows):
array([[0, 1, 1],
[1, 1, 0],
[0, 1, 0],
[1, 0, 1],
[1, 0, 1],
[1, 1, 1],
[1, 1, 0],
[1, 0, 0],
[0, 0, 0],
[1, 0, 0]])
My questions are:
Is this something where I would even expect a speed up from numba?
If so, how can I get unique rows (this seems rather difficult with numba, but presumably there's a way)?
Perhaps there's a way to get at this more efficiently (perhaps without creating more random samples than I need in the end)?
In the following, I don't use numba, but all the operations use vectorized numpy functions.
Each row of the result that you generate can be interpreted as an integer expressed in base N, where N is the number of categories. With that interpretation, what you want is to sample without replacement from the integers [0, 1, ... N**R-1], where R is the number of "records". You can use the choice function for that, with the argument replace=False. Once you have that, you need to convert the chosen integers to base N. For that, I use the function int2base, which is a pared down version of a function that I wrote in a different answer.
Here's the code:
import numpy as np
def int2base(x, base, ndigits):
    # x = np.asarray(x) # Uncomment this line for general purpose use.
    powers = base ** np.arange(ndigits)
    digits = (x.reshape(x.shape + (1,)) // powers) % base
    return digits
def makesample(ncategories, nrecords, nsamples, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = ncategories ** nrecords
    choices = rng.choice(n, replace=False, size=nsamples)
    return int2base(choices, ncategories, nrecords)
In makesample, I included the optional argument rng. It allows you to specify the object that holds the choice function. If not provided, it uses np.random.default_rng().
Example:
In [118]: makesample(2, 3, 6)
Out[118]:
array([[0, 1, 1],
[0, 0, 1],
[1, 0, 1],
[0, 0, 0],
[1, 1, 0],
[1, 1, 1]])
In [119]: makesample(5, 4, 12)
Out[119]:
array([[3, 4, 0, 1],
[2, 0, 2, 0],
[4, 2, 4, 3],
[0, 1, 0, 4],
[0, 2, 0, 1],
[1, 2, 0, 1],
[0, 3, 0, 4],
[3, 3, 0, 3],
[3, 4, 1, 4],
[2, 4, 1, 1],
[3, 4, 1, 0],
[1, 1, 4, 4]])
makesample will raise an exception if you ask for too many samples:
In [120]: makesample(2, 3, 10)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-120-80044e78a60a> in <module>
----> 1 makesample(2, 3, 10)
~/code_snippets/python/numpy/random_samples_for_so_question.py in makesample(ncategories, nrecords, nsamples, rng)
17 rng = np.random.default_rng()
18 n = ncategories ** nrecords
---> 19 choices = rng.choice(n, replace=False, size=nsamples)
20 return int2base(choices, ncategories, nrecords)
_generator.pyx in numpy.random._generator.Generator.choice()
ValueError: Cannot take a larger sample than population when 'replace=False'
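To make the "row as an integer in base N" idea concrete, here is a small check that repeats the two functions from above so that it runs on its own. The specific integers and the seed are arbitrary choices for illustration. Note that int2base puts the least significant digit first.

```python
import numpy as np


def int2base(x, base, ndigits):
    # Digits of each integer in x, least significant digit first.
    powers = base ** np.arange(ndigits)
    return (x.reshape(x.shape + (1,)) // powers) % base


def makesample(ncategories, nrecords, nsamples, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = ncategories ** nrecords
    choices = rng.choice(n, replace=False, size=nsamples)
    return int2base(choices, ncategories, nrecords)


# 5 is 101 in binary and 1 is 001, written here least significant digit first.
digits = int2base(np.array([0, 1, 5, 7]), base=2, ndigits=3)
print(digits)

# Distinct integers always map to distinct rows, so the sample has no duplicates.
sample = makesample(5, 4, 12, rng=np.random.default_rng(0))
n_unique_rows = np.unique(sample, axis=0).shape[0]
print(n_unique_rows)
```

Because the integers are drawn without replacement and the base-N conversion is one-to-one, no de-duplication step is needed afterwards.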

Is there a way to form sparse n-dimensional array in Python3?

I am pretty new to Python and have been wondering if there is an easy way to form a sparse n-dimensional array M in Python 3 that meets the following 2 conditions (along the lines of SciPy's coo_matrix):
1. M[dim1,dim2,dim3,...] = 1.0
2. As with M.row and M.col on a SciPy coo_matrix M, I should be able to get all the row and column indices for which non-zero entries exist in the matrix. In N dimensions, this generalizes to calling something like M.1 for the 1st dimension, M.2 for the 2nd dimension and so on.
For 2 dimensions (the 2 conditions):
1.
for u, i in data:
    mat[u, i] = 1.0
2.
def get_triplets(mat):
    return mat.row, mat.col
Can these 2 conditions be generalized in N-dimensions? I searched and came across this:
sparse 3d matrix/array in Python?
But there the 2nd condition is not satisfied: in other words, I can't get all the indices along the nth dimension in a vectorized format.
Also this:
http://www.janeriksolem.net/sparray-sparse-n-dimensional-arrays-in.html works for Python 2 and not Python 3.
Is there a way to implement n-dimensional arrays with the above 2 conditions satisfied? Or am I over-complicating things? I appreciate any help with this :)
In the spirit of coo format I could generate a 3d sparse array representation:
In [106]: dims = 2,4,6
In [107]: data = np.zeros((10,4),int)
In [108]: data[:,-1] = 1
In [112]: for i in range(3):
     ...:     data[:,i] = np.random.randint(0,dims[i],10)
In [113]: data
Out[113]:
array([[0, 2, 3, 1],
[0, 3, 4, 1],
[0, 0, 1, 1],
[0, 3, 0, 1],
[1, 1, 3, 1],
[1, 0, 2, 1],
[1, 1, 2, 1],
[0, 2, 5, 1],
[0, 1, 5, 1],
[0, 1, 2, 1]])
Does that meet your requirements? It's possible there are some duplicates. sparse.coo sums duplicates before it converts the array to dense for display, or to csr for calculations.
The corresponding dense array is:
In [130]: A=np.zeros(dims, int)
In [131]: for row in data:
     ...:     A[tuple(row[:3])] += row[-1]
In [132]: A
Out[132]:
array([[[0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]],
[[0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]])
(no duplicates in this case).
A 2d sparse matrix using a subset of this data is
In [118]: sparse.coo_matrix((data[:,3],(data[:,1],data[:,2])),(4,6)).toarray()
Out[118]:
array([[0, 1, 1, 0, 0, 0],
[0, 0, 2, 1, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]])
That's in effect the sum over the first dimension.
I'm assuming that
M[dim1,dim2,dim3,...] = 1.0
means the non-zero elements of the array must have a data value of 1.
Pandas has a sparse data series and data frame format. That allows for a non-zero 'fill' value. I don't know if the multi-index version can be thought of as higher than 2d or not. There have been a few SO questions about converting the Pandas sparse arrays to/from the scipy sparse.
Convert Pandas SparseDataframe to Scipy sparse csc_matrix
http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse
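If a small hand-rolled container is acceptable, the two conditions from the question could be sketched with a dict keyed by index tuples. The class SparseND and its method names are hypothetical, not part of SciPy or pandas, and nothing here is optimized.

```python
import numpy as np


class SparseND:
    """Hypothetical dict-backed sparse N-dimensional array (illustration only)."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        self._data = {}  # maps an index tuple to its stored value

    def __setitem__(self, index, value):
        index = tuple(int(i) for i in index)
        if len(index) != len(self.shape):
            raise IndexError("expected one index per dimension")
        if any(not 0 <= i < n for i, n in zip(index, self.shape)):
            raise IndexError("index out of bounds")
        self._data[index] = value

    def __getitem__(self, index):
        # Entries that were never set read as 0.0.
        return self._data.get(tuple(int(i) for i in index), 0.0)

    def coords(self):
        """Return one index array per dimension, like coo_matrix's row and col."""
        if not self._data:
            return tuple(np.empty(0, dtype=int) for _ in self.shape)
        idx = np.array(list(self._data.keys()), dtype=int)
        return tuple(idx.T)

    def todense(self):
        dense = np.zeros(self.shape)
        for index, value in self._data.items():
            dense[index] = value
        return dense


M = SparseND((2, 4, 6))
M[0, 2, 3] = 1.0          # condition 1: M[dim1, dim2, dim3] = 1.0
M[1, 1, 2] = 1.0
d0, d1, d2 = M.coords()   # condition 2: indices along each dimension
print(d0, d1, d2)
```

Unlike coo_matrix, assigning to the same index twice overwrites the value here instead of summing duplicates.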
