Get top-n items of every row in a scipy sparse matrix - python-3.x

After reading this similar question, I still can't fully understand how to implement the solution I'm looking for. I have a sparse matrix, e.g.:
import numpy as np
from scipy import sparse
arr = np.array([[0,5,3,0,2],[6,0,4,9,0],[0,0,0,6,8]])
arr_csc = sparse.csc_matrix(arr)
I would like to efficiently get the top n items of each row, without converting the sparse matrix to dense.
The end result should look like this (assuming n=2):
top_n_arr = np.array([[0,5,3,0,0],[6,0,0,9,0],[0,0,0,6,8]])
top_n_arr_csc = sparse.csc_matrix(top_n_arr)

What is wrong with the linked answer? Does it not work in your case? Do you just not understand it? Or is it not efficient enough?
I was going to suggest working out a means of finding the top values for a row of an lil format matrix, and apply that row by row. But I would just be repeating my earlier answer.
OK, my previous answer was a start, but lacked some details on iterating through the lil format. Here's a start; it probably could be cleaned up.
Make the array, and a lil version:
In [42]: arr = np.array([[0,5,3,0,2],[6,0,4,9,0],[0,0,0,6,8]])
In [43]: arr_sp=sparse.csc_matrix(arr)
In [44]: arr_ll=arr_sp.tolil()
The row function from the previous answer:
def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i]  # do the sparse indices matter?
    return top_values, top_indices, i
Iterate over the rows of arr_ll, apply this function and replace the elements:
In [46]: for i in range(arr_ll.shape[0]):
    d,r=max_n(np.array(arr_ll.data[i]),np.array(arr_ll.rows[i]),2)[:2]
    arr_ll.data[i]=d.tolist()
    arr_ll.rows[i]=r.tolist()
....:
In [47]: arr_ll.data
Out[47]: array([[3, 5], [6, 9], [6, 8]], dtype=object)
In [48]: arr_ll.rows
Out[48]: array([[2, 1], [0, 3], [3, 4]], dtype=object)
In [49]: arr_ll.tocsc().A
Out[49]:
array([[0, 5, 3, 0, 0],
       [6, 0, 0, 9, 0],
       [0, 0, 0, 6, 8]])
In the lil format, the data is stored in 2 object type arrays, as sublists, one with the data numbers, the other with the column indices.
Viewing the data attributes of a sparse matrix is handy when doing new things. Changing those attributes carries some risk, since it can mess up the whole array. But it looks like the lil format can be tweaked like this safely.
The csr format is better for accessing rows than csc. Its data is stored in 3 arrays: data, indices and indptr. The lil format effectively splits 2 of those arrays into sublists based on the information in indptr. csr is great for math (multiplication, addition etc.), but not so good when changing the sparsity (turning nonzero values into zeros).
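The csr layout described above can also be used directly instead of going through lil. The following is only a sketch under stated assumptions: the helper name top_n_per_row is hypothetical, n is assumed to be at least 1, the matrix is assumed to have at least one row, and only stored values are compared (implicit zeros are ignored, which matters if a row holds negative values). It builds new data, indices and indptr arrays rather than modifying the existing ones in place.

```python
import numpy as np
from scipy import sparse

def top_n_per_row(mat, n):
    """Keep the n largest stored values of each row (hypothetical helper, n >= 1)."""
    csr = sparse.csr_matrix(mat)  # converts csc (or dense) input to csr
    data, indices, indptr = [], [], [0]
    # indptr[i]:indptr[i+1] delimits row i inside data and indices
    for start, end in zip(csr.indptr[:-1], csr.indptr[1:]):
        row_data = csr.data[start:end]
        row_cols = csr.indices[start:end]
        if len(row_data) > n:
            keep = np.argpartition(row_data, -n)[-n:]
            keep.sort()  # keep the surviving entries in their stored order
            row_data, row_cols = row_data[keep], row_cols[keep]
        data.append(row_data)
        indices.append(row_cols)
        indptr.append(indptr[-1] + len(row_data))
    return sparse.csr_matrix(
        (np.concatenate(data), np.concatenate(indices), indptr),
        shape=csr.shape,
    )

arr = np.array([[0, 5, 3, 0, 2], [6, 0, 4, 9, 0], [0, 0, 0, 6, 8]])
top2 = top_n_per_row(sparse.csc_matrix(arr), 2)
print(top2.toarray())
```

If csc output is needed, as in the question, top2.tocsc() converts the result back. Ties between equal values are broken arbitrarily by argpartition.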

Related

Equivalent of np.multiply.at in Pytorch

Is there an equivalent of np.multiply.at in PyTorch? I have two 4d arrays and one 2d index array:
base = torch.ones((2, 3, 5, 5))
to_multiply = torch.arange(120).view(2, 3, 4, 5)
index = torch.tensor([[0, 2, 4, 2], [0, 3, 3, 2]])
As shown in this question I asked earlier (in NumPy), the row index of the index array corresponds to the 1st dimension of base and to_multiply, and the values of the index array correspond to the 3rd dimension of base. I want to take the slices from base selected by index and multiply them by to_multiply. In NumPy this can be achieved as follows:
np.multiply.at(base1, (np.arange(2)[:,None,None],np.arange(3)[:,None],index[:,None,:]), to_multiply)
However, when I try to translate this to PyTorch, I cannot find an equivalent of np.multiply.at. I can only find the index_add_ method; there is no "index_multiply". I also want to avoid an explicit for loop.
So how can I achieve the above in PyTorch? Thanks!
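No PyTorch answer is recorded here. As a reference for checking any translation, the NumPy-only sketch below spells out what the np.multiply.at call computes, using NumPy stand-ins for the tensors in the question. The names base_at and base_loop are introduced purely for illustration. The explicit loop is exactly what the question wants to avoid, so it serves only as a specification of the expected result.

```python
import numpy as np

# NumPy stand-ins for the tensors in the question
base = np.ones((2, 3, 5, 5))
to_multiply = np.arange(120).reshape(2, 3, 4, 5)
index = np.array([[0, 2, 4, 2], [0, 3, 3, 2]])

# Unbuffered in-place multiply: a repeated index (e.g. 2 in the first row
# of index) is applied once per occurrence instead of only once.
base_at = base.copy()
np.multiply.at(
    base_at,
    (np.arange(2)[:, None, None], np.arange(3)[:, None], index[:, None, :]),
    to_multiply,
)

# The same computation written as explicit loops
base_loop = base.copy()
for b in range(2):
    for c in range(3):
        for k in range(4):
            base_loop[b, c, index[b, k], :] *= to_multiply[b, c, k, :]

print(np.array_equal(base_at, base_loop))
```

The repeated index is the reason plain fancy-index assignment (base[idx] *= to_multiply) is not a substitute: with buffered assignment only one of the duplicate updates would survive.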

Getting Concordance result of lifelines CoxPH model in a dataframe

I am using the CoxPH implementation from the lifelines package in Python. Currently, the results are shown as a table of coefficients and related statistics, which can be viewed with print_summary(). Here is an example:
df = pd.DataFrame({'duration': [4, 6, 5, 5, 4, 6],
'event': [0, 0, 0, 1, 1, 1],
'cat': [0, 1, 0, 1, 0, 1]})
cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event', show_progress=True)
cph.print_summary()
out[]
[Table of results from print_summary()][1]
How can I get only the concordance index as a dataframe or list? cph.summary returns a dataframe of the main results, i.e. p-values and coefficients, but it does not include the concordance index and other surrounding information.
You can access the c-index with cph.concordance_index_, and you could put this into a list or dataframe if you wish.
You can also compute the concordance index for Cox model using a small script available at this link. The code is given below.
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index
# column names follow the df defined in the question
cph = CoxPHFitter().fit(df, 'duration', 'event')
Cindex = concordance_index(df['duration'], -cph.predict_partial_hazard(df), df['event'])
This code gives the C-index value, which also matches cph.concordance_index_.

harmonic mean for nested list that contains some negative values

I have to find the harmonic mean of each row of a nested list that contains some negative values. I know the harmonic mean is only used for positive values, so what can I do to compute the harmonic mean of my list?
I tried this:
x=[['a', 1, -3, 5], ['b', -2, 6, 8], ['c', 3, 7, -9]]
import statistics as s
y=[s.harmonic_mean(i[1:]) for i in x]
but I get statistics.StatisticsError for the negative values.
You probably want to use filter
filter iterates over a list, or anything else that's iterable, and yields only the elements that satisfy a specific condition. It doesn't mutate the iterable you pass to it. In Python 3 it returns a lazy iterator, so wrap it in list() to see the values.
For example:
>>> numbers = [-1, 2, 3]
>>> list(filter(lambda i: i >= 0, numbers))
[2, 3]
Or, if you just want absolute values, you can use map, which likewise iterates over any iterable and applies a function to each element:
>>> list(map(abs, numbers))
[1, 2, 3]
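Applied to the nested list from the question, the two options above might look like the sketch below. The names hm_positive and hm_abs are illustrative only. Which option is statistically appropriate depends on the data, and a row with no positive values at all would still raise StatisticsError under the first option.

```python
import statistics

x = [['a', 1, -3, 5], ['b', -2, 6, 8], ['c', 3, 7, -9]]

# Option 1: drop the non-positive values before averaging
hm_positive = [
    statistics.harmonic_mean([v for v in row[1:] if v > 0]) for row in x
]

# Option 2: average the absolute values instead
hm_abs = [
    statistics.harmonic_mean([abs(v) for v in row[1:]]) for row in x
]

print(hm_positive)
print(hm_abs)
```

The comprehension uses v > 0 rather than the i >= 0 shown earlier because a zero in the data makes the harmonic mean 0.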

python numpy stack matrices and add specific corner/column entries

Say we have two matrices A and B, each of size 2 by 2. Is there a command that can stack them horizontally and add A[:,1] to B[:,0], so that the resulting matrix C is 2 by 3, with C[:,0] = A[:,0], C[:,1] = A[:,1] + B[:,0], C[:,2] = B[:,1]? One step further: stacking them on the diagonal so that C[0:2,0:2] = A, C[1:3,1:3] = B, and the overlapping entry is C[1,1] = A[1,1] + B[0,0]. C is 3 by 3 in this case. Hard coding this routine is not hard, but I'm just curious, since MATLAB has a similar function if my memory serves me well.
A straightforward approach is to copy or add the two arrays to a target:
In [882]: A=np.arange(4).reshape(2,2)
In [883]: C=np.zeros((2,3),int)
In [884]: C[:,:-1]=A
In [885]: C[:,1:]+=A # or B
In [886]: C
Out[886]:
array([[0, 1, 1],
       [2, 5, 3]])
Another approach is to pad A at the end, pad B at the start, and sum; while there is a convenient pad function, it won't be any faster.
And for the diagonal
In [887]: C=np.zeros((3,3),int)
In [888]: C[:-1,:-1]=A
In [889]: C[1:,1:]+=A
In [890]: C
Out[890]:
array([[0, 1, 0],
       [2, 3, 1],
       [0, 2, 3]])
Again, the 2 arrays could be padded and added.
I'm not aware of any specialized function to do this; even if there were, it probably would do the same thing. This isn't a common enough operation to justify a compiled version.
I have built up finite element sparse matrices by adding overlapping element matrices. The sparse formats for both MATLAB and scipy facilitate this (duplicate coordinates are summed).
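The duplicate-summing behaviour can be sketched with scipy's coo format. This is only an illustration of the idea, not the finite element code itself: each block contributes its values at shifted coordinates, and the shared coordinate (1, 1) appears twice and is summed on conversion. As in the session above, the same array is reused for both blocks; the offsets are assumptions that fit the 2 by 2 diagonal case.

```python
import numpy as np
from scipy import sparse

A = np.arange(4).reshape(2, 2)
B = A  # same array reused for both blocks, as in the session above

rows, cols, vals = [], [], []
for block, offset in ((A, 0), (B, 1)):
    r, c = np.indices(block.shape)   # local row/column coordinates
    rows.append(r.ravel() + offset)  # shift into the global matrix
    cols.append(c.ravel() + offset)
    vals.append(block.ravel())

C = sparse.coo_matrix(
    (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
    shape=(3, 3),
)
print(C.toarray())  # duplicate (1, 1) entries are summed here
```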
============
In [896]: np.pad(A,[[0,0],[0,1]],mode='constant')+np.pad(A,[[0,0],[1,0]],mode='constant')
Out[896]:
array([[0, 1, 1],
       [2, 5, 3]])
In [897]: np.pad(A,[[0,1],[0,1]],mode='constant')+np.pad(A,[[1,0],[1,0]],mode='constant')
Out[897]:
array([[0, 1, 0],
       [2, 3, 1],
       [0, 2, 3]])
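The copy-and-add approach generalizes to two different arrays. The helpers below are a sketch only; the names overlap_hstack and overlap_diag are hypothetical and not part of NumPy. They assume a one-column (or one-corner) overlap and, for the horizontal case, that A and B have the same number of rows.

```python
import numpy as np

def overlap_hstack(A, B):
    """Place A and B side by side, summing A's last column with B's first."""
    n_cols = A.shape[1] + B.shape[1] - 1
    C = np.zeros((A.shape[0], n_cols), dtype=np.result_type(A, B))
    C[:, :A.shape[1]] = A
    C[:, A.shape[1] - 1:] += B
    return C

def overlap_diag(A, B):
    """Place A and B along the diagonal, summing A's last corner with B's first."""
    n_rows = A.shape[0] + B.shape[0] - 1
    n_cols = A.shape[1] + B.shape[1] - 1
    C = np.zeros((n_rows, n_cols), dtype=np.result_type(A, B))
    C[:A.shape[0], :A.shape[1]] = A
    C[A.shape[0] - 1:, A.shape[1] - 1:] += B
    return C

A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20], [30, 40]])
print(overlap_hstack(A, B))
print(overlap_diag(A, B))
```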
What's the special MATLAB code for doing this?
In Octave I found:
prepad(A,3,0,axis=2)+postpad(A,3,0,axis=2)

Python declaring a numpy matrix of lists of lists

I would like to have a numpy matrix that looks like this
[int, [[int,int]]]
I receive an error that looks like this "ValueError: setting an array element with a sequence."
Below is the declaration:
def __init__(self):
    self.path=np.zeros((1, 2))
I attempt to assign a value to this in the line below:
routes_traveled.path[0, 1]=[loc]
Here loc is a list and routes_traveled is the object.
Do you want a higher dimensional array, say 3d, or do you really want a 2d array whose elements are Python lists? Real lists, not numpy arrays?
One way to put lists into an array is to use dtype=object:
In [71]: routes=np.zeros((1,2),dtype=object)
In [72]: routes[0,1]=[1,2,3]
In [73]: routes[0,0]=[4,5]
In [74]: routes
Out[74]: array([[[4, 5], [1, 2, 3]]], dtype=object)
One element of this array is a 2-element list, the other a 3-element list.
I could have created the same thing directly (recent NumPy versions require dtype=object for ragged input like this):
In [76]: np.array([[[4,5],[1,2,3]]], dtype=object)
Out[76]: array([[[4, 5], [1, 2, 3]]], dtype=object)
But if I'd given it 2 lists of the same length, I'd get a 3d array:
In [77]: routes1=np.array([[[4,5,6],[1,2,3]]])
Out[77]:
array([[[4, 5, 6],
        [1, 2, 3]]])
I could index the last, routes1[0,1], and get an array: array([1, 2, 3]), whereas routes[0,1] gives the list [1, 2, 3].
In this case you need to be clear about when you are talking about arrays, subarrays, and Python lists.
With dtype=object, the elements can be anything: lists, dictionaries, numbers, strings.
In [84]: routes[0,0]=3
In [85]: routes
Out[85]: array([[3, [1, 2, 3]]], dtype=object)
Just beware that such an array loses a lot of the functionality that a purely numeric array has. What the array actually contains is pointers to Python objects, which makes it just a slight generalization of Python lists.
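Applied to the class from the question, the object-dtype approach might look like the sketch below. The class name Routes and the initial values are assumptions made for illustration; only path, routes_traveled and loc come from the question.

```python
import numpy as np

class Routes:  # hypothetical class name; the question only shows __init__
    def __init__(self):
        # object dtype lets each cell hold an arbitrary Python object
        self.path = np.empty((1, 2), dtype=object)
        self.path[0, 0] = 0   # the int slot
        self.path[0, 1] = []  # the list-of-pairs slot

routes_traveled = Routes()
loc = [3, 4]
routes_traveled.path[0, 1] = [loc]         # no ValueError with dtype=object
routes_traveled.path[0, 1].append([5, 6])  # the cell is a real Python list
print(routes_traveled.path)
```

Because the cell holds an ordinary list, list methods such as append work on it, but vectorized NumPy arithmetic on the array as a whole does not.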
Did you want to create an array of zeros with shape (1, 2)? In that case use np.zeros((1, 2)).
In [118]: np.zeros((1, 2))
Out[118]: array([[ 0., 0.]])
In contrast, np.zeros(1, 2) raises TypeError:
In [117]: np.zeros(1, 2)
TypeError: data type not understood
because the second argument to np.zeros is supposed to be the dtype, and 2 is not a valid dtype.
Or, to create a 1-dimensional array with a custom dtype consisting of an int and a pair of ints, you could use
In [120]: np.zeros((2,), dtype=[('x', 'i4'), ('y', '2i4')])
Out[120]:
array([(0, [0, 0]), (0, [0, 0])],
      dtype=[('x', '<i4'), ('y', '<i4', (2,))])
I wouldn't recommend this though. If the values are all ints, I think you would be better off with a simple ndarray with homogeneous integer dtype, perhaps of shape (nrows, 3):
In [121]: np.zeros((2, 3), dtype='<i4')
Out[121]:
array([[0, 0, 0],
       [0, 0, 0]], dtype=int32)
Generally I find using an array with a simple dtype makes many operations from building the array to slicing and reshaping easier.
