I have the following vectors in my toy example:
data = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'a': [55, 2123, -19.3, 9, -8],
'b': [21, -0.1, 0.003, 4, 2.1]
})
I have calculated the similarity matrix (excluding the id column).
from sklearn.metrics.pairwise import cosine_similarity
# Calculate the pairwise cosine similarities
S = cosine_similarity(data.drop('id', axis=1))
T = S.tolist()
df = pd.DataFrame.from_records(T)
This returns a matrix/DataFrame with all entries, including self-similarities and duplicates.
Is there an efficient method to calculate similarity without self-similarities (a vector is 100% similar to itself) and duplicates (if vectors 1 and 2 have 89% similarity, I don't need the similarity of vectors 2 and 1, as it's the same)?
The best solution I found so far is to take the upper triangle above the diagonal:
[In] S[np.triu_indices_from(S, k=1)]
[Out] array([ 0.93420158, -0.93416293, 0.99856978, -0.81303909, -0.99999999,
0.91379242, -0.96724292, -0.91374841, 0.96727042, -0.78074903])
What this does is take only the values above the main diagonal, so it excludes the ones (self-similarities) and the repeated values. This gives you a numpy array, too.
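If you also need to know which pair of ids each value in that condensed array refers to, a minimal sketch (reusing S and data from above; the pairs name is mine) can combine the values with their upper-triangle indices:
import numpy as np
iu = np.triu_indices_from(S, k=1)       # row/column indices above the diagonal
pairs = pd.DataFrame({
    'id_1': data['id'].values[iu[0]],   # first vector of each pair
    'id_2': data['id'].values[iu[1]],   # second vector of each pair
    'similarity': S[iu]                 # their cosine similarity
})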
I am trying to use numpy to multiply two matrices:
import numpy as np
A = np.array([[1, 3, 2], [4, 0, 1]])
B = np.array([[1, 0, 5], [3, 1, 2]])
I tested the process and ran the calculations manually, using the formula for matrix multiplication. In this case, I would first multiply A by [1, 0, 5] (the first row of B), which gives [11, 9], and then multiply A by [3, 1, 2] (the second row of B), which gives [10, 14]. Stacking these as columns, the product is [[11, 10], [9, 14]].
Nevertheless, when I use numpy to multiply these matrices, I get an error:
ValueError: shapes (2,3) and (2,3) not aligned: 3 (dim 1) != 2 (dim 0)
Is there a way to do this with python, successfully?
Read the docs on matrix multiplication in numpy (np.matmul), specifically the section on behavior:
The behavior depends on the arguments in the following way.
- If both arguments are 2-D they are multiplied like conventional matrices.
- If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
- If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.
- If the second argument is 1-D, it is promoted to a matrix by appending a 1 to its dimensions. After matrix multiplication the appended 1 is removed.
To get your expected output, transpose one of the matrices before multiplying:
c = np.matmul(A, B.transpose())
array([[11, 10],
[ 9, 14]])
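Equivalently, since Python 3.5 you can use the @ operator and the .T shorthand for the transpose:
c = A @ B.T  # same result as np.matmul(A, B.transpose())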
I'm working on a small project to better understand RNNs, in particular LSTM and GRU. I'm not at all an expert, so please bear that in mind.
The data I'm working with comes in the form of:
>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame([[1, 2, 3],[1, 2, 1], [1, 3, 2],[2, 3, 1],[3, 1, 1],[3, 3, 2],[4, 3, 3]], columns=['person', 'interaction', 'group'])
person interaction group
0 1 2 3
1 1 2 1
2 1 3 2
3 2 3 1
4 3 1 1
5 3 3 2
6 4 3 3
This is just for explanation. We have different persons interacting with different groups in different ways. I've already encoded the various features. The last interaction of a person is always a 3, which means selecting a certain group. In the short example above, person 1 chooses group 2, person 2 chooses group 1, and so on.
My whole data set is much bigger, but I would like to understand the conceptual part first before throwing models at it. The task I would like to learn is: given a sequence of interactions, which group is chosen by the person? A bit more concretely, I would like the output to be a list of all groups (there are 3 groups: 1, 2, 3) sorted by the most likely choice, followed by the second and third most likely groups. The loss function is therefore a mean reciprocal rank.
I know that in Keras GRUs/LSTMs can handle variable-length input. So my three questions are:
The input is of the format:
(samples, timesteps, features)
writing high level code:
import keras.layers as L
import keras.models as M
model_input = L.Input(shape=(?, None, 2))
timesteps=None should imply the varying sequence length, and 2 is for the two features, interaction and group. But what about the samples? How do I define the batches?
For the output I'm a bit puzzled about how this should look in this example. I think for each last interaction of a person I would like to have a list of length 3. Assuming I've set up the output as
model_output = L.LSTM(3, return_sequences=False)
I then want to compile it. Is there a way of using the mean reciprocal rank?
model.compile('adam', '?')
I know the questions are fairly high level, but I would like to understand first the big picture and start to play around. Any help would therefore be appreciated.
The concept you've drawn in your question is a pretty good start already. I'll add a few things to make it work, as well as a code example below:
You can specify LSTM(n_hidden, input_shape=(None, 2)) directly, instead of inserting an extra Input layer; the batch dimension is to be omitted for the definition.
Since your model is going to perform some kind of classification (based on time series data), the final layer is what we'd expect from "normal" classification as well: a Dense(num_classes, activation='softmax'). Chaining the LSTM and the Dense layer together will first pass the time series input through the LSTM layer and then feed its output (determined by the number of hidden units) into the Dense layer. activation='softmax' computes a class score for each class (we're going to use one-hot encoding in a data preprocessing step, see the code example below). Note that the class scores are not ordered, but you can always order them via np.argsort or np.argmax (see the sketch after this list).
Categorical crossentropy loss is suited for comparing the classification score, so we'll use that one: model.compile(loss='categorical_crossentropy', optimizer='adam').
Since the number of interactions, i.e. the length of the model input, varies from sample to sample, we'll use a batch size of 1 and feed in one sample at a time.
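For instance, to turn one softmax output into the ranked list of groups the question asks for, a minimal sketch (the probs vector is a made-up stand-in for what model.predict returns for one sample):
import numpy as np
probs = np.array([0.1, 0.7, 0.2])      # hypothetical class scores for one sample
ranking = np.argsort(probs)[::-1] + 1  # group labels from most to least likely
print(ranking)                         # [2 3 1]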
The following is a sample implementation with respect to the above considerations. Note that I modified your sample data a bit in order to provide more "reasoning" behind the group choices. Also, each person needs to perform at least one interaction before choosing a group (i.e. the input sequence cannot be empty); if this is not the case for your data, then introducing an additional no-op interaction (e.g. 0) can help.
import pandas as pd
import tensorflow as tf
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(10, input_shape=(None, 2))) # LSTM for arbitrary length series.
model.add(tf.keras.layers.Dense(3, activation='softmax')) # Softmax for class probabilities.
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Example interactions:
# * 1: Likes the group,
# * 2: Dislikes the group,
# * 3: Chooses the group.
df = pd.DataFrame([
[1, 1, 3],
[1, 1, 3],
[1, 2, 2],
[1, 3, 3],
[2, 2, 1],
[2, 2, 3],
[2, 1, 2],
[2, 3, 2],
[3, 1, 1],
[3, 1, 1],
[3, 1, 1],
[3, 2, 3],
[3, 2, 2],
[3, 3, 1]],
columns=['person', 'interaction', 'group']
)
data = [person[1][['interaction', 'group']].values for person in df.groupby('person')]
x_train = [x[:-1] for x in data]
y_train = tf.keras.utils.to_categorical([x[-1, 1]-1 for x in data]) # Expects class labels from 0 to n (-> subtract 1).
print(x_train)
print(y_train)
class TrainGenerator(tf.keras.utils.Sequence):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        # Need to expand arrays to have batch size 1.
        return self.x[index][None, :, :], self.y[index][None, :]
model.fit_generator(TrainGenerator(x_train, y_train), epochs=1000)
pred = [model.predict(x[None, :, :]).ravel() for x in x_train]
for p, y in zip(pred, y_train):
print(p, y)
And the corresponding sample output:
[...]
Epoch 1000/1000
3/3 [==============================] - 0s 40ms/step - loss: 0.0037
[0.00213619 0.00241093 0.9954529 ] [0. 0. 1.]
[0.00123938 0.99718493 0.00157572] [0. 1. 0.]
[9.9632275e-01 7.5039308e-04 2.9268670e-03] [1. 0. 0.]
Using custom generator expressions: according to the documentation, we can use any generator to yield the data. The generator is expected to yield batches of the data and to loop over the whole data set indefinitely. When using tf.keras.utils.Sequence we do not need to specify the parameter steps_per_epoch, as this defaults to len(train_generator). Hence, when using a custom generator, we have to provide this parameter ourselves:
import itertools as it
model.fit_generator(((x_train[i % len(x_train)][None, :, :],
                      y_train[i % len(y_train)][None, :]) for i in it.count()),
                    epochs=1000,
                    steps_per_epoch=len(x_train))
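If you still want to evaluate with the mean reciprocal rank, a minimal sketch (reusing pred and y_train from above; this rank computation is mine, not a built-in Keras metric) can be run after training:
import numpy as np
# Rank of the true class = 1 + number of classes scored higher than it.
ranks = [1 + np.sum(p > p[np.argmax(y)]) for p, y in zip(pred, y_train)]
mrr = np.mean([1.0 / r for r in ranks])
print(mrr)  # 1.0 if the model always ranks the true group first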
I am trying to reduce the dimensions of the MNIST dataset using PCA. The trick is, I have to preserve a certain percentage of the variance (say 80%) while reducing the dimensions. I am using scikit-learn. I am looking at pca.explained_variance_ratio_, but it gives me the same digits with the decimal point in different places, like 9.7, .97 or .097. I also tried pca.explained_variance_, but I assume that's not the answer. My question is: how do I ensure that I have reduced the dimensions while preserving a certain percentage of the variance?
If you apply PCA without passing the n_components argument, then the explained_variance_ratio_ attribute of the PCA object will give you the information you need. This attribute indicates the fraction of total variance associated with the corresponding eigenvector. Here is an example copied directly from the current stable PCA documentation:
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
>>> print(pca.explained_variance_ratio_)
[ 0.99244... 0.00755...]
In your case, if you apply np.cumsum to the explained_variance_ratio_ attribute, then the number of principal components you need to keep corresponds to the position of the first element in np.cumsum(pca.explained_variance_ratio_) that is greater than or equal to 0.8.
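As a minimal sketch (assuming X holds your flattened MNIST samples), that looks like:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X)                              # keep all components
cum = np.cumsum(pca.explained_variance_ratio_)  # cumulative explained variance
n_components = np.argmax(cum >= 0.8) + 1        # first position reaching 80%
print(n_components)
Note that recent versions of scikit-learn also accept a float between 0 and 1 for n_components, e.g. PCA(n_components=0.8), which directly selects the number of components needed to explain that fraction of the variance.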
After reading this similar question, I still can't fully understand how to implement the solution I'm looking for. I have a sparse matrix, i.e.:
import numpy as np
from scipy import sparse
arr = np.array([[0,5,3,0,2],[6,0,4,9,0],[0,0,0,6,8]])
arr_csc = sparse.csc_matrix(arr)
I would like to efficiently get the top n items of each row, without converting the sparse matrix to dense.
The end result should look like this (assuming n=2):
top_n_arr = np.array([[0,5,3,0,0],[6,0,0,9,0],[0,0,0,6,8]])
top_n_arr_csc = sparse.csc_matrix(top_n_arr)
What is wrong with the linked answer? Does it not work in your case? Do you just not understand it? Or is it not efficient enough?
I was going to suggest working out a means of finding the top values for a row of an lil format matrix and applying that row by row. But I would just be repeating my earlier answer.
OK, my previous answer was a start, but lacked some details on iterating through the lil format. Here's a start; it probably could be cleaned up.
Make the array, and a lil version:
In [42]: arr = np.array([[0,5,3,0,2],[6,0,4,9,0],[0,0,0,6,8]])
In [43]: arr_sp=sparse.csc_matrix(arr)
In [44]: arr_ll=arr_sp.tolil()
The row function from the previous answer:
def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i]  # do the sparse indices matter?
    return top_values, top_indices, i
Iterate over the rows of arr_ll, apply this function and replace the elements:
In [46]: for i in range(arr_ll.shape[0]):
   ....:     d, r = max_n(np.array(arr_ll.data[i]), np.array(arr_ll.rows[i]), 2)[:2]
   ....:     arr_ll.data[i] = d.tolist()
   ....:     arr_ll.rows[i] = r.tolist()
   ....:
In [47]: arr_ll.data
Out[47]: array([[3, 5], [6, 9], [6, 8]], dtype=object)
In [48]: arr_ll.rows
Out[48]: array([[2, 1], [0, 3], [3, 4]], dtype=object)
In [49]: arr_ll.tocsc().A
Out[49]:
array([[0, 5, 3, 0, 0],
[6, 0, 0, 9, 0],
[0, 0, 0, 6, 8]])
In the lil format, the data is stored in 2 object-type arrays, as sublists: one with the data numbers, the other with the column indices.
Viewing the data attributes of a sparse matrix is handy when doing new things. Changing those attributes carries some risk, since it can mess up the whole array. But it looks like the lil format can be tweaked like this safely.
The csr format is better for accessing rows than csc. Its data is stored in 3 arrays: data, indices and indptr. The lil format effectively splits 2 of those arrays into sublists based on the information in indptr. csr is great for math (multiplication, addition, etc.), but not so good when changing the sparsity (turning nonzero values into zeros).
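Building on that, here is a rough csr-based sketch of the same idea (the top_n_csr helper is mine, not part of scipy), using indptr to slice each row's stored data:
import numpy as np
from scipy import sparse

def top_n_csr(m, n):
    # Zero out all but the n largest stored entries of each row of a copy.
    m = m.tocsr().copy()
    for i in range(m.shape[0]):
        row = m.data[m.indptr[i]:m.indptr[i + 1]]  # view into m.data
        if row.size > n:
            row[np.argsort(row)[:-n]] = 0          # drop everything below the top n
    m.eliminate_zeros()                            # remove the explicit zeros
    return m

arr = np.array([[0,5,3,0,2],[6,0,4,9,0],[0,0,0,6,8]])
print(top_n_csr(sparse.csc_matrix(arr), 2).A)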
I don't understand how coverage_error, available in the sklearn.metrics module, is calculated in scikit-learn. The explanation in the docs is as follows:
The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted.
For example:
import numpy as np
from sklearn.metrics import coverage_error
y_true = np.array([[1, 0, 0], [0, 1, 1]])
y_score = np.array([[1, 0, 0], [0, 1, 1]])
print coverage_error(y_true, y_score)
1.5
As per my understanding, here we need to include 3 labels from the prediction to get all the labels in y_true, so the coverage error is 3/2, i.e. 1.5. But I am not able to understand what happens in the cases below:
>>> y_score = np.array([[1, 0, 0], [0, 0, 1]])
>>> print coverage_error(y_true, y_score)
2.0
>>> y_score = np.array([[1, 0, 1], [0, 1, 1]])
>>> print coverage_error(y_true, y_score)
2.0
How come the error is same in both the cases?
You can have a look at the User Guide, section 3.3.3, Multilabel ranking metrics. The coverage error is defined there as

\text{coverage}(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max_{j : y_{ij} = 1} \text{rank}_{ij}

with

\text{rank}_{ij} = \left| \{ k : \hat{f}_{ik} \ge \hat{f}_{ij} \} \right|

One thing you need to take care of is how ranks are computed and how ties in y_score are broken.
To be specific, the first case:
In [4]: y_true
Out[4]:
array([[1, 0, 0],
[0, 1, 1]])
In [5]: y_score
Out[5]:
array([[1, 0, 0],
[0, 0, 1]])
For the 1st sample, the 1st label is the true one, and its score has rank 1.
For the 2nd sample, the 2nd and 3rd labels are true, and the ranks of their scores are 3 and 1 respectively, so the max rank is 3.
The average is (1+3)/2=2.
The second case:
In [7]: y_score
Out[7]:
array([[1, 0, 1],
[0, 1, 1]])
For the 1st sample, the 1st label is the true one, and its score has rank 2.
For the 2nd sample, the 2nd and 3rd labels are true, and the ranks of their scores are 2 and 2 respectively, so the max rank is 2.
The average is (2+2)/2=2.
Edit:
The rank is computed within one sample of y_score. The formula says the rank of a label is the number of labels (including itself) whose score is greater than or equal to its score.
It is just like sorting the labels by y_score: the label with the largest score is ranked 1, the second largest is ranked 2, the third largest is ranked 3, etc. But if the second and third largest labels have the same score, they are both ranked 3.
Notice that y_score is
Target scores, can either be probability estimates of the positive class, confidence values, or binary decisions.
The goal is to have all true labels predicted, so we need to include all the labels with scores higher than or equal to that of each true label.
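To make the rank computation concrete, here is a small sketch (the coverage_error_manual helper is mine, not part of sklearn) that reproduces both results:
import numpy as np

def coverage_error_manual(y_true, y_score):
    covs = []
    for yt, ys in zip(y_true, y_score):
        # Rank of label j = number of labels scoring >= ys[j] (ties share the worst rank).
        ranks = np.array([np.sum(ys >= s) for s in ys])
        covs.append(ranks[yt == 1].max())  # worst rank among the true labels
    return np.mean(covs)

y_true = np.array([[1, 0, 0], [0, 1, 1]])
print(coverage_error_manual(y_true, np.array([[1, 0, 0], [0, 0, 1]])))  # 2.0
print(coverage_error_manual(y_true, np.array([[1, 0, 1], [0, 1, 1]])))  # 2.0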