Get index value from array with condition - python-3.x
I have a NumPy array like this:
a = [ [0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 1. 0.]
]
I want to get the indices of all rows whose value in the 3rd column is equal to 1:
a[:,2:2+1]==1
In that case my result would be
index = [0 3 4]
Is there any function that I can use for that?
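import numpy as np
a = np.array([[0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0]])
# a[:, 2:3] keeps the 3rd column as a 2-D slice, so np.where returns
# two arrays: the row indices and the column indices within the slice.
index, value_first_at_index = np.where(a[:, 2:3] == 1)
print(index)  # [0 3]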
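Because only a single column is tested, a 1-D slice is enough and avoids the extra column-index array. A minimal sketch using the five-row array from the question (the name `rows` is only illustrative):

```python
import numpy as np

a = np.array([[0., 0., 1., 0.],
              [0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 1., 0.]])

# a[:, 2] is the 3rd column as a 1-D array; flatnonzero returns the
# positions where the boolean mask is True, i.e. the matching row indices.
rows = np.flatnonzero(a[:, 2] == 1)
print(rows)  # [0 3 4]
```

np.where(a[:, 2] == 1)[0] gives the same result; flatnonzero just skips the tuple unpacking.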
Related
What average precision is the plot_precision_recall_curve() function plotting?
After using plot_precision_recall_curve() from scikit-learn I was wondering which average precision this function reports. When looking in the docs, this is what I find for a binary target:
# %%
# Compute the average precision score
# ...................................
from sklearn.metrics import average_precision_score
average_precision = average_precision_score(y_test, y_score)
print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))
This is my data:
clf_4 = svm.SVC()
clf_4.fit(X_train, y_train)
y_clf_4 = clf_4.predict(X_test)
y1_test = np.array([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1])
y1_clf4 = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
average_precision_5 = average_precision_score(y1_test, y1_clf4)
average_precision_5
Out: 0.5625
Now we use plot_precision_recall_curve with X_test being this (same as above):
X_test = np.array([
[0.01167537, 0.04676259, 0.02145552, 0.015625 , 0. , 0. , 0. , 0.5 , 0.01020408, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.00478415, 0.01258993, 0.06759886, 0.09375 , 0. , 0. , 0. , 0.43421053, 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.01503446, 0.04136691, 0.02600806, 0.015625 , 0. , 0. , 1. , 0.13157895, 0.02721088, 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.017396 , 0.04856115, 0.07737383, 0.046875 , 0. , 0. , 0. , 0.44736842, 0.04421769, 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.0072882 , 0.01079137, 0.07866155, 0.078125 , 1. , 0. , 0. , 0.63157895, 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.00733909, 0.0323741 , 0.0487578 , 0.046875 , 0. , 0. , 0. , 0.44736842, 0.02040816, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. ],
[0.02579371, 0.11151079, 0.03639438, 0.0625 , 0. , 0. , 0. , 0.53947368, 0.02380952, 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.00203581, 0.03417266, 0.12611863, 0.125 , 0. , 0. , 0. , 0.05263158, 0.00680272, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.00527275, 0.03057554, 0.0344563 , 0.03125 , 0. , 0. , 1. , 0.09210526, 0.00680272, 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.00590385, 0.02158273, 0.05135926, 0.046875 , 0. , 0. , 0. , 0.43421053, 0.00340136, 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.01910608, 0.16366906, 0.05917014, 0.03125 , 1. , 0. , 1. , 0.28947368, 0.12244898, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. ],
[0.12737045, 0.13669065, 0.07280827, 0.078125 , 1. , 0. , 0. , 0.46052632, 0.07823129, 0. , 0. , 1. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. ],
[0.0537861 , 0.17446043, 0.14109651, 0.078125 , 0. , 0. , 0. , 0.32894737, 0.08843537, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.01027066, 0.05755396, 0.06110172, 0.078125 , 1. , 0. , 0. , 0.30263158, 0.01360544, 1. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.0085504 , 0.01978417, 0.03185484, 0.03125 , 1. , 1. , 0. , 0.51315789, 0.00340136, 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. ],
[0.02224122, 0.05215827, 0.06370968, 0.0625 , 0. , 0. , 0. , 0.47368421, 0.04081633, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.00896774, 0.05035971, 0.00974896, 0.015625 , 0. , 0. , 0. , 0.5 , 0.02721088, 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.03302084, 0.07014388, 0.00779787, 0.015625 , 1. , 1. , 0. , 0.25 , 0.03741497, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.00630083, 0.06115108, 0.01495838, 0. , 0. , 0. , 0. , 0.10526316, 0.00340136, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.00951741, 0.03776978, 0.13261576, 0.140625 , 1. , 1. , 0. , 0.47368421, 0.0170068 , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 1. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
Now we can use the plot_precision_recall_curve function and print the two results, and they differ:
disp = plot_precision_recall_curve(clf_4, X_test, y1_test)
disp.ax_.set_title(f'2-class Precision-Recall curve:{average_precision_5}')
So where does the difference come from?
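The y_score parameter of average_precision_score needs to be probability estimates (or a similar continuous score, such as the output of decision_function), not the hard classification results. Your average_precision_5 was computed from the output of predict(), so the precision-recall curve behind it has only a single threshold, and the value is not the average precision of the curve that the plot shows.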
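The difference can be reproduced in a few lines. The sketch below first recomputes the question's value from its hard 0/1 predictions, then scores a classifier with decision_function. The question's training data is not shown, so the second half uses hypothetical stand-in data from make_classification; its numbers are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.svm import SVC

# Labels and hard predictions copied from the question.
y1_test = np.array([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1])
y1_clf4 = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# With 0/1 predictions there is only one useful threshold on the curve.
ap_hard = average_precision_score(y1_test, y1_clf4)
print(ap_hard)  # 0.5625, as in the question

# Hypothetical stand-in data (assumption: the real X_train is unavailable).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

clf = SVC().fit(X_train, y_train)
# decision_function returns one continuous score per sample, which is
# the kind of input average_precision_score expects as y_score.
ap_scores = average_precision_score(y_test, clf.decision_function(X_test))
print(ap_scores)
```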
Replace python numpy matrix value based on a condition, without using a for loop
I have a symmetric numpy matrix, for example:
matrix([[0.   , 0.125, 0.75 , 0.   , 0.   ],
        [0.125, 0.   , 0.   , 0.   , 0.   ],
        [0.75 , 0.   , 0.   , 0.   , 0.375],
        [0.   , 0.   , 0.   , 0.   , 1.2  ],
        [0.   , 0.   , 0.375, 1.2  , 0.   ]])
If a value in the array is greater than zero, is it possible to replace it with the product of the sum of its row and the sum of its column? For example, 0.125 would be replaced by 0.109375, as row_sum * col_sum = 0.125 * (0.125 + 0.75) = 0.109375. I know it can be done with a for loop, but I would like to avoid loops and use only standard numpy functions.
Declaring the given matrix:
import numpy as np
arr = np.array([[0.   , 0.125, 0.75 , 0.   , 0.   ],
                [0.125, 0.   , 0.   , 0.   , 0.   ],
                [0.75 , 0.   , 0.   , 0.   , 0.375],
                [0.   , 0.   , 0.   , 0.   , 1.2  ],
                [0.   , 0.   , 0.375, 1.2  , 0.   ]])
Using a list comprehension and np.argwhere for the conditional indices:
# The sums are default arguments, so they are computed once from the
# original matrix, before any element is overwritten.
def replace(x, y, arr=arr, column_sums=arr.sum(axis=0), row_sum=arr.sum(axis=1)):
    arr[x][y] = row_sum[x] * column_sums[y]
_ = [replace(x, y) for x, y in np.argwhere(arr > 0)]
The output:
array([[0.      , 0.109375, 0.984375, 0.      , 0.      ],
       [0.109375, 0.      , 0.      , 0.      , 0.      ],
       [0.984375, 0.      , 0.      , 0.      , 1.771875],
       [0.      , 0.      , 0.      , 0.      , 1.89    ],
       [0.      , 0.      , 1.771875, 1.89    , 0.      ]])
Note that the code could be optimized further; it is laid out this way to make it easier to follow.
What about using numpy's indexing features? arr[arr > 0] = x
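For the indexing idea to work, x has to hold the row-sum times column-sum product for every position. One possible loop-free sketch builds that table with np.outer and selects from it with np.where (the names `products` and `result` are illustrative, not from the original posts):

```python
import numpy as np

arr = np.array([[0.   , 0.125, 0.75 , 0.   , 0.   ],
                [0.125, 0.   , 0.   , 0.   , 0.   ],
                [0.75 , 0.   , 0.   , 0.   , 0.375],
                [0.   , 0.   , 0.   , 0.   , 1.2  ],
                [0.   , 0.   , 0.375, 1.2  , 0.   ]])

row_sum = arr.sum(axis=1)
col_sum = arr.sum(axis=0)

# products[i, j] == row_sum[i] * col_sum[j] for every position.
products = np.outer(row_sum, col_sum)

# Take the product where the original value is positive, else keep it.
result = np.where(arr > 0, products, arr)
print(result)
```

Unlike the in-place version, this leaves arr untouched; arr[arr > 0] = products[arr > 0] would overwrite it instead.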
Compute TF-IDF word score with relevant and random corpus
Given a corpus of relevant documents (CORPUS) and a corpus of random documents (ran_CORPUS), I want to compute TF-IDF scores for all words in CORPUS, using ran_CORPUS as a baseline. In my project, ran_CORPUS has approximately 10 times as many documents as CORPUS.
CORPUS = ['this is a relevant document', 'this one is a relevant text too']
ran_CORPUS = ['the sky is blue', 'my cat has a furry tail']
My plan is to normalize the documents and merge all documents in CORPUS into one document (CORPUS then being a list with one long string element). To CORPUS I append all ran_CORPUS documents. Using sklearn's TfidfTransformer I would then compute the TF-IDF matrix for the combined corpus (consisting now of CORPUS and ran_CORPUS), and finally select the first row of that matrix to get the TF-IDF scores for my initial relevant CORPUS. Does anybody know whether this approach could work and if there is a simple way to code it?
When you say "whether this approach could work", I presume you mean: does merging all the relevant documents into one and vectorising it give a valid model? I would guess it depends on what you are going to do with that model.
I'm not much of a mathematician, but I imagine that this is like averaging the scores for all your documents into one vector, so you lose some of the shape of the space occupied by the individual relevant documents. So you have tried to make a "master" or "prototype" document which is meant to represent a topic?
If you are then going to do something like similarity matching with test documents, or classification by distance comparison, then you may have lost some of the subtlety of the original documents' vectorisation. There may be more facets to the overall topic than the averages represent.
More specifically, imagine your original "relevant corpus" has two clusters of documents because there are actually two main sub-topics represented by different groups of important features. Later, while doing classification, test documents could match either of those clusters individually, because they are close to one of the two sub-topics. By averaging the whole "relevant corpus" in this case you would end up with a single document that is half-way between both of these clusters, but not accurately representing either. Therefore the test documents might not match at all, depending on the classification technique.
I think it's hard to say without trialling it on proper, specific corpora.
Regardless of the validity, below is how it could be implemented.
Note you can also use TfidfVectorizer to combine the vectorising and TF-IDF steps in one. The results are not always exactly the same, but they are in this case.
Also, you say normalise the documents - typically you might normalise a vector representation before feeding it into a classification algorithm that expects normalised input (like SVM). However, TfidfTransformer and TfidfVectorizer already apply L2 normalisation to each row by default (norm='l2'), so a further normalisation step has no effect here.
import logging
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
# Debug messages are hidden by default; enable them so the matrices are shown.
logging.basicConfig(level=logging.DEBUG, format='%(message)s')
CORPUS = ['this is a relevant document', 'this one is a relevant text too']
ran_CORPUS = ['the sky is blue', 'my cat has a furry tail']
# Merge the relevant documents into one and append it as the last document.
doc_CORPUS = ' '.join([str(x) for x in CORPUS])
ran_CORPUS.append(doc_CORPUS)
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(ran_CORPUS)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
logging.debug("\nCount + TfidfTransform \n%s" % X_tfidf.toarray())
# or do it in one pass with TfidfVectorizer
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(ran_CORPUS)
logging.debug("\nTfidfVectorizer \n%s" % X_tfidf.toarray())
# normalising doesn't achieve much as tfidf is already normalised.
normalizer = preprocessing.Normalizer()
X_tfidf = normalizer.transform(X_tfidf)
logging.debug("\nNormalised:\n%s" % X_tfidf.toarray())
The output (the merged relevant document is the last row, because it was appended last):
Count + TfidfTransform
[[0.52863461 0. 0. 0. 0. 0.40204024 0. 0. 0. 0.52863461 0. 0. 0.52863461 0. 0. ]
 [0. 0.4472136 0. 0.4472136 0.4472136 0. 0.4472136 0. 0. 0. 0.4472136 0. 0. 0. 0. ]
 [0. 0. 0.2643173 0. 0. 0.40204024 0. 0.2643173 0.52863461 0. 0. 0.2643173 0. 0.52863461 0.2643173 ]]
TfidfVectorizer
[[0.52863461 0. 0. 0. 0. 0.40204024 0. 0. 0. 0.52863461 0. 0. 0.52863461 0. 0. ]
 [0. 0.4472136 0. 0.4472136 0.4472136 0. 0.4472136 0. 0. 0. 0.4472136 0. 0. 0. 0. ]
 [0. 0. 0.2643173 0. 0. 0.40204024 0. 0.2643173 0.52863461 0. 0. 0.2643173 0. 0.52863461 0.2643173 ]]
Normalised:
[[0.52863461 0. 0. 0. 0. 0.40204024 0. 0. 0. 0.52863461 0. 0. 0.52863461 0. 0. ]
 [0. 0.4472136 0. 0.4472136 0.4472136 0. 0.4472136 0. 0. 0. 0.4472136 0. 0. 0. 0. ]
 [0. 0. 0.2643173 0. 0. 0.40204024 0. 0.2643173 0.52863461 0. 0. 0.2643173 0. 0.52863461 0.2643173 ]]
Why is the '\n' symbol in an empty numpy array?
While debugging I see this: It is just the creation of an empty array... Why is there a '\n'? How can I make the array without it?
Program to create a numpy array of zeros # Python 3.x import numpy as np y1 = np.zeros(100) print(y1) print("Shape of y1") print(y1.shape) I have directed the output of this into a 'tester.txt' file python3 numpynewline.py >> tester.txt The output although has a newline for displays, the shape of the array is not effected by it [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] Shape of y1 (100,) Just 100 elements The output just looks to be having a newline for display, there is no actual '\n' in the array, you must be reading the y1 through a terminal or something, otherwise the normal python is not having any such characteristic of creating a numpy array with '\n' character Implemented on Ubuntu 17.04, and Python 3.6
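The distinction between the array and its printed form can be checked directly. A small sketch (the names `text` and `one_line` are illustrative): the string produced for display contains newlines because NumPy wraps long output, while the array holds only floats; np.array2string with a large max_line_width gives an unwrapped string.

```python
import numpy as np

y1 = np.zeros(100)

# The printed form is wrapped to the configured line width, so the
# *string* contains newline characters...
text = str(y1)
print('\n' in text)        # True

# ...but the array itself holds 100 floats and nothing else.
print(y1.shape, y1.dtype)  # (100,) float64

# A large max_line_width yields a single-line representation.
one_line = np.array2string(y1, max_line_width=10_000)
print('\n' in one_line)    # False
```

np.set_printoptions(linewidth=...) changes the wrapping globally, if that is preferable to formatting one array at a time.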
Normalizing sparse.csc_matrix by its diagonals
I have a scipy.sparse.csc_matrix with dtype = np.int32. I want to efficiently divide each column (or row, whichever is faster for csc_matrix) of the matrix by the diagonal element in that column, so that mnew[:,i] = m[:,i]/m[i,i]. Note that I need to convert my matrix to np.double (since the mnew elements will be in [0,1]), and since the matrix is massive and very sparse I wonder whether I can do it in an efficient way, without for loops and without ever going dense. Best, Ilya
Make a sparse matrix:
In [379]: M = sparse.random(5,5,.2, format='csr')
In [380]: M
Out[380]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
In [381]: M.diagonal()
Out[381]: array([ 0., 0., 0., 0., 0.])
Too many 0s in the diagonal; let's add a nonzero diagonal:
In [382]: D=sparse.dia_matrix((np.random.rand(5),0),shape=(5,5))
In [383]: D
Out[383]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements (1 diagonals) in DIAgonal format>
In [384]: M1 = M+D
In [385]: M1
Out[385]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
In [387]: M1.A
Out[387]:
array([[ 0.35786668, 0.81754484, 0.        , 0.        , 0.        ],
       [ 0.        , 0.41928992, 0.        , 0.01371273, 0.        ],
       [ 0.        , 0.        , 0.4685924 , 0.        , 0.35724102],
       [ 0.        , 0.        , 0.77591294, 0.95008721, 0.16917791],
       [ 0.        , 0.        , 0.        , 0.        , 0.16659141]])
Now it's trivial to divide each column by its diagonal (this is a matrix 'product'):
In [388]: M1/M1.diagonal()
Out[388]:
matrix([[ 1.        , 1.94983185, 0.        , 0.        , 0.        ],
        [ 0.        , 1.        , 0.        , 0.01443313, 0.        ],
        [ 0.        , 0.        , 1.        , 0.        , 2.1444144 ],
        [ 0.        , 0.        , 1.65583764, 1.        , 1.01552603],
        [ 0.        , 0.        , 0.        , 0.        , 1.        ]])
Or divide the rows (multiply by a column vector):
In [391]: M1/M1.diagonal()[:,None]
Oops, these are dense; let's make the diagonal sparse:
In [408]: md = sparse.csr_matrix(1/M1.diagonal()) # do the inverse here
In [409]: md
Out[409]:
<1x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
In [410]: M.multiply(md)
Out[410]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
In [411]: M.multiply(md).A
Out[411]:
array([[ 0.        , 1.94983185, 0.        , 0.        , 0.        ],
       [ 0.        , 0.        , 0.        , 0.01443313, 0.        ],
       [ 0.        , 0.        , 0.        , 0.        , 2.1444144 ],
       [ 0.        , 0.        , 1.65583764, 0.        , 1.01552603],
       [ 0.        , 0.        , 0.        , 0.        , 0.        ]])
(Note that In [410] multiplies M, not M1, which is why the diagonal of this result is zero.)
md.multiply(M) for the column version.
Division of sparse matrix - similar except it is using the sum of the rows instead of the diagonal. Deals a bit more with the potential 'divide-by-zero' issue.
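Another way to stay sparse throughout is to right-multiply by a sparse diagonal matrix holding the reciprocals of the diagonal. This is a sketch of a swapped-in technique, not the method shown in the session above; the 3x3 matrix is made-up example data, and treating a zero diagonal entry as "scale that column to zero" is an assumption chosen here to avoid division by zero.

```python
import numpy as np
from scipy import sparse

# Made-up int32 example in CSC format, standing in for the real matrix.
m = sparse.csc_matrix(np.array([[2, 1, 0],
                                [0, 4, 0],
                                [1, 0, 5]], dtype=np.int32))

d = m.diagonal().astype(np.float64)

# Reciprocal of the diagonal; entries with a zero diagonal stay 0
# (assumption: such columns are simply zeroed out).
inv_d = np.zeros_like(d)
nonzero = d != 0
inv_d[nonzero] = 1.0 / d[nonzero]

# Right-multiplying by a diagonal matrix scales column i by inv_d[i],
# i.e. mnew[:, i] = m[:, i] / m[i, i], without creating a dense array.
mnew = (m.astype(np.float64) @ sparse.diags(inv_d)).tocsc()
print(mnew.toarray())
```

Left-multiplying instead (sparse.diags(inv_d) @ m) would scale the rows rather than the columns.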