I have a symmetric numpy matrix, for example:
matrix([[0.   , 0.125, 0.75 , 0.   , 0.   ],
        [0.125, 0.   , 0.   , 0.   , 0.   ],
        [0.75 , 0.   , 0.   , 0.   , 0.375],
        [0.   , 0.   , 0.   , 0.   , 1.2  ],
        [0.   , 0.   , 0.375, 1.2  , 0.   ]])
If a value in the array is greater than zero, is it possible to replace that value with the product of the sums of that value's row and column? For example, 0.125 would be replaced by 0.109375, since row_sum * col_sum = (0.125 + 0.75) * 0.125 = 0.109375.
I know it can be done using a for loop, but is it possible using the standard numpy library? I want to avoid for loops.
Declaring the given matrix
import numpy as np

arr = np.array([[0.   , 0.125, 0.75 , 0.   , 0.   ],
                [0.125, 0.   , 0.   , 0.   , 0.   ],
                [0.75 , 0.   , 0.   , 0.   , 0.375],
                [0.   , 0.   , 0.   , 0.   , 1.2  ],
                [0.   , 0.   , 0.375, 1.2  , 0.   ]])
Using a list comprehension and np.argwhere for the conditional indices:
def replace(x, y, arr=arr, column_sums=arr.sum(axis=0), row_sum=arr.sum(axis=1)):
    arr[x][y] = row_sum[x] * column_sums[y]

_ = [replace(x, y) for x, y in np.argwhere(arr > 0)]
The output:
array([[0.      , 0.109375, 0.984375, 0.      , 0.      ],
       [0.109375, 0.      , 0.      , 0.      , 0.      ],
       [0.984375, 0.      , 0.      , 0.      , 1.771875],
       [0.      , 0.      , 0.      , 0.      , 1.89    ],
       [0.      , 0.      , 1.771875, 1.89    , 0.      ]])
Note that the code could be optimized further; it is laid out this way for better understanding.
What about using numpy's indexing features?
arr[arr > 0] = x
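For instance, x could be the matching entries of the outer product of the row sums and column sums. A minimal fully vectorised sketch of that idea (the key point is that the sums and the mask are computed from the original matrix before anything is overwritten):

import numpy as np

arr = np.array([[0.   , 0.125, 0.75 , 0.   , 0.   ],
                [0.125, 0.   , 0.   , 0.   , 0.   ],
                [0.75 , 0.   , 0.   , 0.   , 0.375],
                [0.   , 0.   , 0.   , 0.   , 1.2  ],
                [0.   , 0.   , 0.375, 1.2  , 0.   ]])

products = np.outer(arr.sum(axis=1), arr.sum(axis=0))  # products[i, j] = row_sum[i] * col_sum[j]
mask = arr > 0
arr[mask] = products[mask]
print(arr)  # matches the output shown above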
Given a corpus of relevant documents (CORPUS) and a corpus of random documents (ran_CORPUS), I want to compute TF-IDF scores for all words in CORPUS, using ran_CORPUS as a baseline. In my project, ran_CORPUS has approximately 10 times as many documents as CORPUS.
CORPUS = ['this is a relevant document',
          'this one is a relevant text too']
ran_CORPUS = ['the sky is blue',
              'my cat has a furry tail']
My plan is to normalize the documents, merge all documents in CORPUS into one document (CORPUS now being a list with a single long string element), and append all the ran_CORPUS documents to CORPUS. Using sklearn's TfidfTransformer I would then compute the TF-IDF matrix for the combined corpus (now consisting of CORPUS and ran_CORPUS), and finally select the first row of that matrix to get the TF-IDF scores for my initial relevant CORPUS.
Does anybody know whether this approach could work and if there is a simple way to code it?
When you say "whether this approach could work", I presume you mean: does merging all the relevant documents into one and vectorising it produce a valid model? I would guess it depends on what you are going to try to do with that model.
I'm not much of a mathematician, but I imagine that this is like averaging the scores for all your documents into one vector space, so you have lost some of the shape of the space that the individual relevant documents originally occupied. So you have tried to make a "master" or "prototype" document which is meant to represent a topic?
If you are then going to do something like similarity matching with test documents, or classification by distance comparison then you may have lost some of the subtlety of the original documents' vectorisation. There may be more facets to the overall topic than the averages represent.
More specifically, imagine your original "relevant corpus" has two clusters of documents, because there are actually two main sub-topics represented by different groups of important features. Later, while doing classification, test documents could match either of those clusters individually, because they are close to one of the two sub-topics. By averaging the whole "relevant corpus" in this case, you would end up with a single document that sits half-way between both of these clusters but does not accurately represent either. Therefore the test documents might not match at all, depending on the classification technique.
I think it's hard to say without trialling it on proper, specific corpora.
Regardless of the validity, below is how it could be implemented.
Note you can also use TfidfVectorizer to combine the vectorising and TF-IDF steps in one. The results are not always exactly the same, but they are in this case.
Also, you say normalise the documents. Typically you might normalise a vector representation before feeding it into a classification algorithm that requires a normalised distribution (like SVM). However, I think TF-IDF naturally normalises, so it doesn't appear to have any further effect (I may be wrong here).
import logging
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

# show the logging.debug() output below
logging.basicConfig(level=logging.DEBUG)
CORPUS = ['this is a relevant document',
          'this one is a relevant text too']
ran_CORPUS = ['the sky is blue',
              'my cat has a furry tail']

doc_CORPUS = ' '.join([str(x) for x in CORPUS])
ran_CORPUS.append(doc_CORPUS)
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(ran_CORPUS)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
logging.debug("\nCount + TdidfTransform \n%s" % X_tfidf.toarray())
# or do it in one pass with TfidfVectorizer
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(ran_CORPUS)
logging.debug("\nTdidfVectoriser \n%s" % X_tfidf.toarray())
# normalising doesn't achieve much as tfidf is already normalised.
normalizer = preprocessing.Normalizer()
X_tfidf = normalizer.transform(X_tfidf)
logging.debug("\nNormalised:\n%s" % X_tfidf.toarray())
Count + TfidfTransform
[[0.52863461 0. 0. 0. 0. 0.40204024
0. 0. 0. 0.52863461 0. 0.
0.52863461 0. 0. ]
[0. 0.4472136 0. 0.4472136 0.4472136 0.
0.4472136 0. 0. 0. 0.4472136 0.
0. 0. 0. ]
[0. 0. 0.2643173 0. 0. 0.40204024
0. 0.2643173 0.52863461 0. 0. 0.2643173
0. 0.52863461 0.2643173 ]]
TfidfVectoriser
[[0.52863461 0. 0. 0. 0. 0.40204024
0. 0. 0. 0.52863461 0. 0.
0.52863461 0. 0. ]
[0. 0.4472136 0. 0.4472136 0.4472136 0.
0.4472136 0. 0. 0. 0.4472136 0.
0. 0. 0. ]
[0. 0. 0.2643173 0. 0. 0.40204024
0. 0.2643173 0.52863461 0. 0. 0.2643173
0. 0.52863461 0.2643173 ]]
Normalised:
[[0.52863461 0. 0. 0. 0. 0.40204024
0. 0. 0. 0.52863461 0. 0.
0.52863461 0. 0. ]
[0. 0.4472136 0. 0.4472136 0.4472136 0.
0.4472136 0. 0. 0. 0.4472136 0.
0. 0. 0. ]
[0. 0. 0.2643173 0. 0. 0.40204024
0. 0.2643173 0.52863461 0. 0. 0.2643173
0. 0.52863461 0.2643173 ]]
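To then pull out the scores for the merged relevant document, note that doc_CORPUS was appended to ran_CORPUS, so it corresponds to the last row of X_tfidf. A minimal sketch, assuming the code above has just run (get_feature_names_out needs a recent sklearn; older versions use get_feature_names):

# map each feature name to the TF-IDF score of the merged relevant document
feature_names = vectorizer.get_feature_names_out()
relevant_row = X_tfidf.toarray()[-1]           # last row = merged CORPUS document
scores = dict(zip(feature_names, relevant_row))
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))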
I work with NetworkX to generate some classes of graphs.
Now I would like to permute nodes and rotate the graph (by 80°, 90°, or 120°).
How can I apply permutation and rotation to graphs with NetworkX?
Edit_1:
Given an adjacency matrix of a graph, I would like to rotate the graph in a way that preserves the edges and vertex links. The only thing that changes is the position of the nodes.
What I would like to do is rotate my graph by 90 degrees.
Input:
Adjacency matrix of graph G
Process:
Apply a 90 degree rotation to G
Output:
Rotated adjacency matrix
That is, the graph preserves its topology; only the indices of the adjacency matrix change position.
For example, node 1 at index 0 might be at index 4 after the rotation.
What have I tried?
1) I looked at numpy.random.permutation(), but it doesn't seem to accept a rotation parameter.
2) In NetworkX I didn't find any function that allows rotation.
EDIT2
Given a 5x5 adjacency matrix (5 nodes):
adj = [[0, 1, 0, 0, 1],
       [1, 0, 1, 1, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 0, 1],
       [1, 1, 1, 1, 0]]
I would like to permute the indices.
Say that node 1 takes the place of node 3, node 3 takes the place of node 4, and node 4 takes the place of node 1.
It's just a permutation of the nodes (preserving their edges).
I would like to keep, in a dictionary, the mapping between the original index and the new index after permutation.
Secondly, I would like to apply a permutation or rotation of this adjacency matrix with an angle of 90° (like rotating an image). I'm not sure how this can be done.
Take a look at the networkx command relabel_nodes.
Given a graph G, if we want to relabel node 0 as 1, 1 as 3, and 3 as 0 [so a permutation of the nodes, leaving 2 in place], we create the dict mapping = {0:1, 1:3, 3:0}. Then we do
H = nx.relabel_nodes(G, mapping)
And H is now the permuted graph.
import networkx as nx
G = nx.path_graph(4) #0-1-2-3
mapping = {0:1, 1:3, 3:0}
H = nx.relabel_nodes(G, mapping) #1-3-2-0
#check G's adjacency matrix
print(nx.to_numpy_matrix(G,nodelist=[0,1,2,3]))
> [[ 0. 1. 0. 0.]
[ 1. 0. 1. 0.]
[ 0. 1. 0. 1.]
[ 0. 0. 1. 0.]]
#check H's adjacency matrix
print(nx.to_numpy_matrix(H,nodelist=[0,1,2,3]))
> [[ 0. 0. 1. 0.]
[ 0. 0. 0. 1.]
[ 1. 0. 0. 1.]
[ 0. 1. 1. 0.]]
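If you prefer to work on the adjacency matrix directly (and keep the old-to-new mapping in a dictionary, as asked for in EDIT2), here is a minimal numpy-only sketch of the same permutation; the graph and mapping are the ones from the relabel_nodes example above:

import numpy as np

# adjacency matrix of the path graph 0-1-2-3 used above
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

perm = [1, 3, 2, 0]                # new label of node i is perm[i]
mapping = dict(enumerate(perm))    # {0: 1, 1: 3, 2: 2, 3: 0}

inv = np.argsort(perm)             # inv[new] = old
A_perm = A[np.ix_(inv, inv)]       # A_perm[new_i, new_j] = A[old_i, old_j]
print(A_perm)                      # same matrix as nx.to_numpy_matrix(H) above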
During debugging I see this:
It is just the creation of an empty array... Why the '\n'? How do I make the array without it?
Program to create a numpy array of zeros
# Python 3.x
import numpy as np
y1 = np.zeros(100)
print(y1)
print("Shape of y1")
print(y1.shape)
I have directed the output of this into a 'tester.txt' file
python3 numpynewline.py >> tester.txt
Although the output has newlines for display purposes, the shape of the array is not affected by them:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Shape of y1
(100,)
Just 100 elements
The output just has newlines for display; there is no actual '\n' in the array. You must be viewing y1 through a terminal or debugger, because normal Python does not create a numpy array with '\n' characters in it.
Tested on Ubuntu 17.04 and Python 3.6.
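A quick way to confirm this: the '\n' lives only in the printed string representation, not in the array data itself. A minimal sketch:

import numpy as np

y1 = np.zeros(100)
print('\n' in str(y1))        # True - print() wraps long rows for display
print(y1.dtype, y1.shape)     # float64 (100,) - the data is purely numeric
print(np.array2string(y1, max_line_width=10000))  # force a single line, no wrapping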
I have a scipy.sparse.csc_matrix with dtype = np.int32. I want to efficiently divide each column (or row, whichever is faster for csc_matrix) of the matrix by the diagonal element in that column, so mnew[:,i] = m[:,i]/m[i,i]. Note that I need to convert my matrix to np.double (since the mnew elements will be in [0,1]), and since the matrix is massive and very sparse, I wonder if I can do it efficiently, with no for loops and without ever going dense.
Make a sparse matrix:
In [379]: M = sparse.random(5,5,.2, format='csr')
In [380]: M
Out[380]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [381]: M.diagonal()
Out[381]: array([ 0., 0., 0., 0., 0.])
Too many 0s in the diagonal - let's add a nonzero diagonal:
In [382]: D=sparse.dia_matrix((np.random.rand(5),0),shape=(5,5))
In [383]: D
Out[383]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements (1 diagonals) in DIAgonal format>
In [384]: M1 = M+D
In [385]: M1
Out[385]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [387]: M1.A
Out[387]:
array([[ 0.35786668, 0.81754484, 0. , 0. , 0. ],
[ 0. , 0.41928992, 0. , 0.01371273, 0. ],
[ 0. , 0. , 0.4685924 , 0. , 0.35724102],
[ 0. , 0. , 0.77591294, 0.95008721, 0.16917791],
[ 0. , 0. , 0. , 0. , 0.16659141]])
Now it's trivial to divide each column by its diagonal element (note the result comes back as a dense np.matrix):
In [388]: M1/M1.diagonal()
Out[388]:
matrix([[ 1. , 1.94983185, 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0.01443313, 0. ],
[ 0. , 0. , 1. , 0. , 2.1444144 ],
[ 0. , 0. , 1.65583764, 1. , 1.01552603],
[ 0. , 0. , 0. , 0. , 1. ]])
Or divide the rows (divide by a column vector):
In [391]: M1/M1.diagonal()[:,None]
oops, these are dense; let's make the diagonal sparse
In [408]: md = sparse.csr_matrix(1/M1.diagonal()) # do the inverse here
In [409]: md
Out[409]:
<1x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [410]: M.multiply(md)
Out[410]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [411]: M.multiply(md).A
Out[411]:
array([[ 0. , 1.94983185, 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0.01443313, 0. ],
[ 0. , 0. , 0. , 0. , 2.1444144 ],
[ 0. , 0. , 1.65583764, 0. , 1.01552603],
[ 0. , 0. , 0. , 0. , 0. ]])
md.multiply(M) for the column version.
Division of sparse matrix is a similar question, except it uses the sum of the rows instead of the diagonal, and it deals a bit more with the potential divide-by-zero issue.
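Note that the multiply example above uses M (without the added diagonal), which is why the diagonal of the result is zero; M1.multiply(md) would reproduce the unit diagonal of the earlier division. For the original csc_matrix question, a minimal sketch along the same lines, assuming the diagonal contains no zeros and converting the int32 data to double first:

import numpy as np
from scipy import sparse

# hypothetical int32 csc matrix with a nonzero diagonal
m = sparse.random(5, 5, density=0.3, format='csc')
m.setdiag(np.arange(1, 6))
m = (10 * m).astype(np.int32)

mf = m.astype(np.double)                            # convert to float64 first
inv_diag = sparse.csc_matrix(1.0 / mf.diagonal())   # 1 x n row of 1/m[i, i]
mnew = mf.multiply(inv_diag).tocsc()                # scales column i by 1/m[i, i]
print(mnew.toarray())                               # unit diagonal, mnew[:, i] = m[:, i] / m[i, i]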