Remove columns that contain only zeros from a matrix in Python - python-3.x

I am building a matrix in Python, and once it's built I would like to remove all the columns that contain only zeros. (Some columns contain zeros but not exclusively, so I want to keep those.)
def remove_column_with_all_zeros(matrix):
    zero_columns = []
    for i in range(len(matrix[0])):
        column = [row[i] for row in matrix]
        if all(val == 0 for val in column):
            zero_columns.append(i)
    for i in sorted(zero_columns, reverse=True):
        for row in matrix:
            del row[i]
    return matrix
I tried this function but it doesn't work.
Thank you

First convert your matrix to a NumPy array if it is not one already. Here is an example; in this case you would want to remove the first column:
array([[0, 0, 1],
       [0, 2, 3],
       [0, 1, 4]])
If your matrix looks like the example above, you can then do:
import numpy as np

matrixT = matrix.T
# Boolean array with True wherever all the elements of a column are 0
all_zeros = (matrixT == 0).all(1)
updated_matrix = np.delete(matrix, all_zeros, axis=1)
Output for my example:
array([[0, 1],
       [2, 3],
       [1, 4]])
Let me know if it works for you!!
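As a follow-up, an equivalent one-liner keeps only the columns that are not entirely zero via boolean indexing. This is a minimal sketch assuming matrix is already a 2-D NumPy array, and it sidesteps any version differences in how np.delete handles boolean masks:
import numpy as np

matrix = np.array([[0, 0, 1],
                   [0, 2, 3],
                   [0, 1, 4]])
# keep the columns where not every entry is zero
updated_matrix = matrix[:, ~(matrix == 0).all(axis=0)]
print(updated_matrix)
# [[0 1]
#  [2 3]
#  [1 4]]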

Related

Sort python dictionary with value as list

I want to compare on the basis of all indices of the value list, not just the first element. If two lists are identical, then sort by key. Also, the length of the lists is not known in advance. How do I sort the keys in that case? Below is an example:
{'A': [5, 0, 0], 'B': [0, 2, 3], 'C': [0, 3, 2]}
output:
[A, C, B]
Explanation: A is in 1st position because at index 0 its value is 5, the highest, while the rest are 0. C is in 2nd position because at index 1 C has 3, compared to B's 2. As you can see, we need to compare all positions to sort, and we don't know the list length beforehand.
I tried below code:
countPos = {'A': [5, 0, 0], 'B': [0, 2, 3], 'C': [0, 3, 2]}
res = sorted(countPos.items(), key=lambda x: ((-x[1][i]) for i in range(3)))
I get an error for the above code. Could someone help me with this?
I think I got a solution, which worked. It might be naive; I encourage gurus to correct me.
r = sorted(countPos.items(), key=lambda x: x[0])
r = dict(r)
res = sorted(r.items(), key=lambda x: x[1], reverse=True)
So, I first sorted based on the keys, and then I sorted based on the values in reverse order.
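A single call to sorted can also do this in one pass, handling ties and value lists of any length; a minimal sketch:
countPos = {'A': [5, 0, 0], 'B': [0, 2, 3], 'C': [0, 3, 2]}
# sort by the element-wise negated value list (i.e. descending values), breaking ties by key
res = sorted(countPos, key=lambda k: ([-v for v in countPos[k]], k))
print(res)  # ['A', 'C', 'B']
The two-pass version above also works because Python's sort is stable: the first pass establishes key order, and the second pass by value preserves that order for equal values.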

How to make permutation matrix for two lists of str, Python3

I have two lists.
a_num = [1, 3, 2, 4]
b_num = [1, 2, 3, 4]
I want to find a permutation matrix that converts a into b. Mathematically, a permutation matrix is a square matrix whose elements are either 1 or 0; multiplying a vector by it changes the order of the vector's elements.
In this particular example, the permutation matrix is:
p = [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1]]
# check whether p is correct
b_num == np.dot(np.array(p), np.array(a_num).reshape(4, 1))
Could you please show me how to build that matrix p? In my real application there can be tens of elements in the lists, in arbitrary order, and the two lists always contain str instead of int.
How do I build p when a and b are lists of str?
a_str = ['c1', 'c2', 's1', 's2']
b_str = ['c1', 's1', 'c2', 's2']
In pure Python you can do:
a_str = ['c1', 'c2', 's1', 's2']
b_str = ['c1', 's1', 'c2', 's2']
from collections import defaultdict
dd = defaultdict(lambda: [0, []])
for i, x in enumerate(b_str):  # collect indexes of target chars
    dd[x][1].append(i)
matrix = [[0] * len(a_str) for x in b_str]
for i, a in enumerate(a_str):
    # set cell at row (src index) and col (next tgt index) to 1
    matrix[i][dd[a][1][dd[a][0]]] = 1
    # increment index for looking up next tgt index
    dd[a][0] += 1
matrix
# [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
This assumes that a_str and b_str are in fact permutations of each other.
Edit: works only with numbers.
For arrays with the same length using NumPy:
import numpy as np
a_num = [1, 3, 2, 4]
b_num = [4, 2, 3, 1]
a = np.array(a_num)
b = np.array(b_num)
len_v = len(a) # = len(b)
A = np.zeros((len_v, len_v))
A[np.argsort(a), np.arange(len_v)] = 1
B = np.zeros((len_v, len_v))
B[np.argsort(b), np.arange(len_v)] = 1
np.dot(np.linalg.inv(B), np.dot(A, a)) # array([4., 2., 3., 1.])
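For the string case specifically, a direct way is to build p from a value-to-position lookup. This is a sketch that assumes the entries of a_str are unique (pos_in_a is just an illustrative helper name):
import numpy as np

a_str = ['c1', 'c2', 's1', 's2']
b_str = ['c1', 's1', 'c2', 's2']

n = len(a_str)
pos_in_a = {v: i for i, v in enumerate(a_str)}  # where each value sits in a
p = np.zeros((n, n), dtype=int)
for row, v in enumerate(b_str):
    p[row, pos_in_a[v]] = 1          # row = position in b, column = position in a

# applying p to the positions 0..n-1 of a reproduces the order of b
order = p @ np.arange(n)
print([a_str[j] for j in order])     # ['c1', 's1', 'c2', 's2']
For the numeric example in the question this produces exactly the p shown there, and np.dot(p, a_num) gives b_num.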

Concatenating two 1-column DataFrames doesn't return both columns

I'm using Python 3.6 and I'm a newbie so thanks in advance for your patience.
I have a function that sums the differences between 3 points. It should then take those 'differences' and concatenate them with another DataFrame called labels. k and length are integers. I expected the resulting DataFrame to have two columns, but it only has one.
Sample Code:
def distance(df1, df2, labels, k, length):
    total_dist = 0
    for i in range(length):
        dist_dif = df1.iloc[:, i] - df2.iloc[:, i]
        sq_dist = dist_dif ** 2
        root_dist = sq_dist ** 0.5
        total_dist = total_dist + root_dist
    return total_dist
    distance_df = pd.concat([total_dist, labels], axis=1)
    distance_df.sort(ascending=False, axis=1, inplace=True)
    top_knn = distance_df[:k]
    return top_knn.value_counts().index.values[0]
Sample Data:
d1 = {'Z_Norm_Age': [1.20, 2.58,2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24,0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
lbl = {'Survived': [0, 1,1]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data=lbl)
I expected the data to look something like this:
   total_dist  labels
0    1.715349       0
1    2.872991       1
2    4.344087       1
but instead it looks like this:
0 1.715349
1 4.344087
2 2.872991
dtype: float64
The output doesn't do the following:
1. Return the labels column data
2. Sort the data in descending order
If someone could point me in the right direction, I'd truly appreciate it.
Given two DataFrames, df1 - df2 performs the subtraction element-wise. Use abs() to take the absolute value of that difference, and finally sum each row. That explains the first command in the following function; the other lines are similar to your code.
import numpy as np
import pandas as pd
def calc_abs_distance_between_rows_then_add_labels_and_sort(df1, df2, labels):
    diff = np.sum(np.abs(df1 - df2), axis=1)  # np.sum(..., axis=1) sums the rows
    diff.name = 'total_abs_distance'  # not strictly necessary, but lets us refer to it later
    diff = pd.concat([diff, labels], axis=1)
    diff.sort_values(by='total_abs_distance', axis=0, ascending=True, inplace=True)
    return diff
So for your example data:
d1 = {'Z_Norm_Age': [1.20, 2.58,2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24,0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
lbl = {'Survived': ['a', 'b', 'c']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data=lbl)
calc_abs_distance_between_rows_then_add_labels_and_sort(df1, df2, labels)
We get hopefully what you wanted:
   total_abs_distance Survived
0                1.71        a
2                3.87        c
1                4.34        b
A few notes:
Did you really want the L1-norm? If you wanted the L2-norm (Euclidean distance), then replace the first command in that function above by np.sqrt(np.sum(np.square(df1-df2),axis=1)).
What's the purpose of those labels? Consider using the index of the DataFrames instead. Maybe it will fit your purposes better? For example:
# lbl_series = pd.Series(['a','b','c'], name='Survived') # Try this later instead of lbl_list, to further explore the wonders of Pandas indexes :)
lbl_list = ['a', 'b', 'c']
df1.index = lbl_list
df2.index = lbl_list
# Then the L1-norm is simply this:
np.sum(np.abs(df1 - df2), axis=1).sort_values()
# Whose output is the Series: (with the labels as its index)
a 1.71
c 3.87
b 4.34
dtype: float64
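For reference, this is what the function looks like with the Euclidean (L2) distance from the first note swapped in; a minimal sketch, otherwise identical to the function above (total_l2_distance is just an illustrative column name):
import numpy as np
import pandas as pd

def calc_l2_distance_between_rows_then_add_labels_and_sort(df1, df2, labels):
    diff = np.sqrt(np.sum(np.square(df1 - df2), axis=1))  # Euclidean distance per row
    diff.name = 'total_l2_distance'  # hypothetical column name, used for the sort below
    diff = pd.concat([diff, labels], axis=1)
    diff.sort_values(by='total_l2_distance', ascending=True, inplace=True)
    return diff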

Returning the N largest values' indices in a multidimensional array (can find solutions for one dimension but not multi-dimension)

I have a numpy array X, and I'd like to return another array Y whose entries are the indices of the n largest values of X. For example, suppose I have:
a = np.array([[1, 3, 5], [4, 5, 6], [9, 1, 7]])
and I want the indices of the first 5 "maxes". Here 9, 7, 6, 5, 5 are the maxes, and their indices are:
b = np.array([[2, 0], [2, 2], [1, 2], [1, 1], [0, 2]])
I've been able to find solutions that work for a one-dimensional array like c = np.array([1, 2, 3, 4, 5, 6]):
def f(a, N):
    return np.argsort(a)[::-1][:N]
but I have not been able to come up with something that works in more than one dimension. Thanks!
Approach #1
Get the argsort indices on its flattened version and select the last N indices. Then, get the corresponding row and column indices -
N = 5
idx = np.argsort(a.ravel())[-N:][::-1]  # single-slice equivalent: [:-N-1:-1]
topN_val = a.ravel()[idx]
row_col = np.c_[np.unravel_index(idx, a.shape)]
Sample run -
# Input array
In [39]: a = np.array([[1,3,5],[4,5,6],[9,1,7]])
In [40]: N = 5
...: idx = np.argsort(a.ravel())[-N:][::-1]
...: topN_val = a.ravel()[idx]
...: row_col = np.c_[np.unravel_index(idx, a.shape)]
...:
In [41]: topN_val
Out[41]: array([9, 7, 6, 5, 5])
In [42]: row_col
Out[42]:
array([[2, 0],
       [2, 2],
       [1, 2],
       [1, 1],
       [0, 2]])
Approach #2
For performance, we can use np.argpartition to get top N indices without keeping sorted order, like so -
idx0 = np.argpartition(a.ravel(), -N)[-N:]
To get the sorted order, we need one more round of argsort -
idx = idx0[a.ravel()[idx0].argsort()][::-1]
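Putting approach #2 together with the unravel step from approach #1 gives a complete snippet (same sample array as above):
import numpy as np

a = np.array([[1, 3, 5], [4, 5, 6], [9, 1, 7]])
N = 5
idx0 = np.argpartition(a.ravel(), -N)[-N:]       # top-N flat indices, unordered
idx = idx0[a.ravel()[idx0].argsort()][::-1]      # order them by descending value
row_col = np.c_[np.unravel_index(idx, a.shape)]  # (row, col) pairs
print(row_col)
# [[2 0]
#  [2 2]
#  [1 2]
#  [1 1]
#  [0 2]]
# note: rows with tied values (the two 5s here) may appear in either order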

scikit-learn: Get selected features for prediction data

I have a training set of data. The Python script that creates the model also computes the attributes into a NumPy array (it's a bit vector). I then want to use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0 or all 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all the features and then use array indexing, but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length, len(x_predict), but the wrong shape: x_predict.shape[1] is still the original number of features. My classifier then throws an error because of the wrong shape:
prediction = gbc.predict(x_predict)
File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", li
ne 1032, in _init_decision_function
self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
You can do it like this:
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
       [1, 4],
       [1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])
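To apply the same selection to the prediction data from the question, reuse the fitted selector rather than indexing rows; a minimal sketch, where x_new stands in for the asker's x_predict_all and must have the same original column layout as X:
x_new = np.array([[0, 5, 2, 3]])            # hypothetical prediction row with all original features
x_new_selected = selector.transform(x_new)  # keeps only the columns selected during fit
# equivalently: x_new[:, idxs]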