I want to represent each text-based item in my system as a vector in a vector space model. The value for each term can be positive or negative, reflecting the frequency of that term in the positive or negative class; a value of zero means neutral.
For example:
Item1 (-1,0,-5,4.5,2)
Item2 (2,6,0,-4,0.5)
My questions are:
1- How can I normalize my vectors to the range [0, 1] so that:
a value of 0 before normalization maps to 0.5,
positive values map to above 0.5,
and negative values map to below 0.5?
I want to know if there is a mathematical formula to do such a thing.
2- Will the choice of similarity measure be different after the normalization? For example, can I use cosine similarity?
3- Will it be difficult to perform dimensionality reduction after the normalization?
Thanks in advance
One solution could be to use MinMaxScaler, which scales each feature to the [0, 1] range, and then to divide each row by the sum of that row. In Python, using sklearn, you can do something like this:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

X = np.array([[-1, 0, -5, 4.5, 2], [2, 6, 0, -4, 0.5]])  # the two items from the question
scaler = MinMaxScaler()             # scales each column to [0, 1]
scaled_X = scaler.fit_transform(X)
normalized_X = normalize(scaled_X, norm='l1', axis=1, copy=True)  # each row now sums to 1
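Note that MinMaxScaler alone only maps a pre-normalization value of 0 to exactly 0.5 when a column's minimum and maximum are symmetric about zero. If that property must hold for every value, a logistic sigmoid is one formula that does it directly: 0 maps to exactly 0.5, positives land above 0.5, and negatives below. A minimal sketch, assuming plain NumPy and that any monotone squashing function is acceptable:

import numpy as np

def sigmoid_normalize(X):
    # maps each value into (0, 1): 0 -> 0.5, positive -> above 0.5, negative -> below 0.5
    return 1.0 / (1.0 + np.exp(-X))

X = np.array([[-1, 0, -5, 4.5, 2], [2, 6, 0, -4, 0.5]])
print(sigmoid_normalize(X))  # the zeros become exactly 0.5

Since all transformed components are positive, cosine similarity between the transformed vectors will always fall in [0, 1].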
I have a (relatively sparse) 2d tensor U of shape (B, I) of 1s and 0s. Each row represents a user and each column an item where the cell is 1 if the user has interacted with said item and 0 if not.
I want to apply dropout (or a similar tensor operation) to it so that, at random, p% of the 1s in each row (i.e. per user) are set to 0.
How can I go about doing that efficiently without a for-loop along the B dimension (where I would just use pytorch's dropout on the row 1d tensors, after accounting for the 0s)?
If I understand the question correctly, you want to build the network out manually? One way to do this would be to create a boolean array (the same size as your weights) each run, then multiply it with the weights before using them.
import torch

dropout = torch.randint(2, (10,))  # random 0/1 mask: each entry is 0 or 1 with probability 0.5
weights = torch.randn(10)
dr_wt = dropout * weights          # zeroes out roughly half of the weights
Edit
You can create an array with 10% 0s and the rest 1s, then shuffle it every run and multiply it with the weights, so that a random 10% of them are dropped each run.
import numpy as np
import torch

a = np.ones(10)
a[0] = 0                 # one zero out of ten = 10% dropped
np.random.shuffle(a)     # move the zero to a random position each run
a = torch.as_tensor(a)
Correct me if I'm wrong, but if you want p% of the 1s to turn into 0s per row, then row 0 might have ten 1s and row 1 might have a hundred. With p = 10%, on average only one of the 1s in the first row gets affected by the dropout mask, while about ten get affected in the second row.
import torch
from torch import Tensor

def dropout(input: Tensor, p: float = 0.5) -> Tensor:
    mask = torch.rand_like(input) > p  # bool tensor: True with probability 1 - p
    return input * mask                # each element survives with probability 1 - p
I don't know how you would be able to guarantee that exactly 10% get nulled without using some sort of row-based sampling of nonzero indices, which in turn requires a for loop.
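If the in-expectation version is acceptable (each 1 dropped independently with probability p, so roughly p% per row on average rather than exactly), a single element-wise mask over the whole tensor avoids the loop entirely, because masking a 0 cell leaves it at 0 anyway. A minimal sketch, assuming the interaction tensor U from the question:

import torch

B, I, p = 4, 6, 0.5
U = torch.randint(2, (B, I)).float()      # stand-in interaction matrix of 0s and 1s
U_dropped = U * (torch.rand_like(U) > p)  # each 1 survives with probability 1 - p; 0s stay 0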
I'm implementing an efficient PageRank algorithm so I'm using sparse matrices. I'm close, but there's one problem. I have a matrix where I want the sum of each column to be one. This is easy to implement, but the problem occurs when I get a matrix with a zero column.
In this case, I want to set each element in the column to be 1/(n-1) where n is the dimension of the matrix. I divide by n-1 and not n because I wish to keep the diagonals zero, always.
How can I implement this efficiently? My naive solution is to determine the sum of each column, find the column indices that sum to zero, and replace each such column with a 1/(n-1) value, like so:
# naive approach (too slow!)
# M is my nxn sparse matrix where each column sums to one
col_sums = M.sum(axis=0)
for i in range(n):
    if col_sums[0, i] == 0:
        # set entire column to 1/(n-1)
        M[:, i] = 1 / (n - 1)
        # make sure diagonal is zeroed
        M[i, i] = 0
My M matrix is very very very large and this method simply doesn't scale. How can I do this efficiently?
You can't add new nonzero values without reallocating and copying the underlying data structure. If you expect these zero columns to be very common (> 25% of the data) you should handle them in some other way, or you're better off with a dense array.
Otherwise try this:
import scipy.sparse

M = scipy.sparse.rand(1000, 1000, density=0.001, format='csr')

# empty sparse matrix that will hold the replacement weights
nz_col_weights = scipy.sparse.csr_matrix(M.shape, dtype=M.dtype)
# fill every all-zero column of M with 1/(n-1)
nz_col_weights[:, M.getnnz(axis=0) == 0] = 1 / (M.shape[0] - 1)
# keep the diagonal zeroed
nz_col_weights.setdiag(0)
M += nz_col_weights
This has only two allocation operations.
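As a quick sanity check (a sketch that assumes the snippet above has already run): each previously-zero column should now sum to (n-1) * 1/(n-1) = 1, with its diagonal entry still zero.

import numpy as np

filled = np.asarray(nz_col_weights.sum(axis=0) > 0).ravel()  # the columns that were all-zero
col_sums = np.asarray(M.sum(axis=0)).ravel()
assert np.allclose(col_sums[filled], 1.0)      # filled columns now sum to one
assert np.allclose(M.diagonal()[filled], 0.0)  # and their diagonal entries stay zero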
I'd like to calculate document similarity using word embedding models (w2v, glove), so one document can be represented as a 257×300 matrix (257 = maximum number of words per document, 300 = dimension of the pretrained embedding model).
Now I am trying to calculate the distance between all documents. When I use cosine similarity, Euclidean distance, or other vector calculation methods from scikit-learn, these methods return a similarity matrix.
Is there any method to get a single number out of the matrix distance calculation?
Or should I calculate the average of all values in the similarity matrix? (I think this is not the proper way to solve this problem.)
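One common alternative to averaging the similarity matrix (a sketch, not the only option; it assumes mean-pooling is acceptable for your documents) is to collapse each document's word-vector matrix into a single 300-dimensional vector by averaging its rows, then take one cosine similarity between the two pooled vectors:

import numpy as np

def doc_similarity(doc_a, doc_b):
    # doc_a, doc_b: (num_words, 300) arrays of word embeddings (padding rows excluded)
    a = doc_a.mean(axis=0)  # mean-pool the words into one 300-d document vector
    b = doc_b.mean(axis=0)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # one number

# hypothetical example with random stand-in embeddings
sim = doc_similarity(np.random.randn(257, 300), np.random.randn(257, 300))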
All,
I ran a logistic regression on a set of variables, both categorical and continuous, with a binary event as the dependent variable.
Now, post-modelling, I observe that a set of categorical variables shows a negative sign, which I presume means that when such a categorical variable occurs, the probability of the dependent variable occurring is low.
But when I look at the % of occurrence of the event against that independent variable, I see the reverse trend, so the result seems counterintuitive. Any reason why this could happen? I have tried to explain below with a pseudo-example.
Dependent Variable - E
Predictors:
1. Categorical Var - Cat1 with 2 levels (0,1)
2. Continuous Var - Con1
3. Categorical Var - Cat2 with 2 levels (0,1)
Post Modelling:
Say all are significant and the coefficients are as below:
Cat1 - (-0.6)
Con1- (0.3)
Cat2 - (-0.4)
But when I calculate the % of occurrence of event E by Cat1, I observe that the % of occurrence is higher when Cat1 is 1, which I think is counterintuitive.
Please help me understand this.
Coefficients of logistic regression are not directly related to the change in the probability of the event; rather, each is a relative measure of the change in the odds of the event, holding the other predictors fixed. This article has a detailed derivation of how to interpret the coefficients of logistic regression. In your context, the coefficient of -0.6 for Cat1 means p(E | Cat1 = 1) < p(E | Cat1 = 0) at fixed values of Con1 and Cat2; it says nothing about how big the marginal p(E | Cat1 = 1) is. If Cat1 = 1 tends to co-occur with high values of Con1 (which has a positive coefficient), the raw occurrence rate for Cat1 = 1 can easily be the higher one, which is exactly the reversal you observe.
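A tiny synthetic illustration of this (all numbers invented here, not from the question): in the fitted model Cat1 gets a negative coefficient, yet the raw event rate is higher when Cat1 = 1, because Cat1 drags a strongly positive Con1 along with it.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
cat1 = rng.integers(0, 2, n)
con1 = rng.normal(loc=3.0 * cat1, scale=1.0)   # confounder: Con1 is much higher when Cat1 = 1
logits = -0.6 * cat1 + 1.5 * con1 - 2.0        # Cat1 lowers the log-odds, Con1 raises them
e = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))

model = LogisticRegression().fit(np.column_stack([cat1, con1]), e)
print(model.coef_)                              # Cat1's coefficient comes out negative...
print(e[cat1 == 1].mean(), e[cat1 == 0].mean()) # ...yet the raw event rate is higher for Cat1 = 1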
I was reading this question:
How to understand Locality Sensitive Hashing?
But then I found that the equation to calculate the cosine similarity is as follows:
cos(v1, v2) = cos(theta), where theta = (hamming distance / signature length) * pi = (h / b) * pi
Which means that if the vectors are fully similar, the Hamming distance will be zero and the cosine value will be 1. But when the vectors are totally dissimilar, the Hamming distance will equal the signature length, so we get cos(pi), which is -1. Shouldn't the similarity always be between 0 and 1?
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it's entirely possible to have a negative value for the angle's cosine; for example, if you have unit vectors pointing in opposite directions, you want the value to be -1. I think what's confusing you is the nature of the representation: the other post talks about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is customarily much greater than 2 and the value in each dimension is non-negative (e.g., whether a word occurs in a document or not), which restricts the cosine to the 0 to 1 range.
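A quick numeric check of both cases (a minimal sketch using NumPy): opposite unit vectors give -1, while non-negative count vectors can never go below 0.

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))           # -1.0: opposite unit vectors
print(cosine(np.array([3.0, 0.0, 1.0]), np.array([1.0, 2.0, 0.0])))  # in [0, 1]: non-negative counts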