How to convert a PySpark RDD into a sparse matrix

I have a key/value pair RDD
{(("a", "b"), 1), (("a", "c"), 3), (("c", "d"), 5)}
How can I get the following (symmetric) sparse matrix from it?
0 1 3 0
1 0 0 0
3 0 0 5
0 0 5 0
i.e.
from pyspark.mllib.linalg import Matrices
Matrices.sparse(4, 4, [0, 2, 3, 5, 6], [1, 2, 0, 0, 3, 2], [1, 3, 1, 3, 5, 5])
or
import numpy as np
from scipy.sparse import csc_matrix
data = [1, 3, 1, 3, 5, 5]
indices = [1, 2, 0, 0, 3, 2]
indptr = [0, 2, 3, 5, 6]
csc_matrix((data, indices, indptr), shape=(4, 4), dtype=np.float64)

Could you apply pivot to dataframe then convert to matrix?
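One possible approach, as a minimal sketch: it assumes the pairs are few enough to bring to the driver (for example with rdd.collect()), that labels map to row/column positions in sorted order, and that the matrix should be symmetric as in the question. The helper name pairs_to_sparse is hypothetical, and a plain list stands in for the collected RDD.

```python
import numpy as np
from scipy.sparse import coo_matrix


def pairs_to_sparse(pairs):
    """Build a symmetric CSC matrix from ((row_label, col_label), value) pairs."""
    # Assumption: labels are ordered alphabetically to get matrix positions.
    labels = sorted({label for (a, b), _ in pairs for label in (a, b)})
    index = {label: i for i, label in enumerate(labels)}

    rows, cols, vals = [], [], []
    for (a, b), value in pairs:
        i, j = index[a], index[b]
        rows.append(i)
        cols.append(j)
        vals.append(value)
        if i != j:
            # Mirror the off-diagonal entry so the matrix is symmetric.
            rows.append(j)
            cols.append(i)
            vals.append(value)

    n = len(labels)
    matrix = coo_matrix((vals, (rows, cols)), shape=(n, n), dtype=np.float64)
    return labels, matrix.tocsc()


# Stand-in for rdd.collect() on the RDD from the question.
pairs = [(("a", "b"), 1), (("a", "c"), 3), (("c", "d"), 5)]
labels, m = pairs_to_sparse(pairs)
print(labels)
print(m.toarray())
```

For data too large to collect, the same label-to-index mapping would have to be built in a distributed way; that is outside this sketch.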

Related

Multiclass vs. multilabel fitting

In the scikit-learn tutorials, I found the following examples in the section 'Multiclass vs. multilabel fitting'.
I can't understand why the following code generates the given results.
First
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]
classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Next
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
Label binarization in scikit-learn will transform your targets and represent them in a label indicator matrix. This label indicator matrix has the shape (n_samples, n_classes) and is composed as follows:
each row represents a sample
each column represents a class
each element is 1 if the sample is labeled with the class and 0 if not
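To make the layout concrete, here is a hand-rolled sketch of a label indicator matrix using only NumPy. It is not how scikit-learn implements its binarizers; the function name indicator_matrix is hypothetical, and classes are assumed to be ordered by sorting.

```python
import numpy as np


def indicator_matrix(label_sets):
    """Return (classes, matrix) where matrix[i, j] is 1 if sample i has class j."""
    classes = sorted({c for labels in label_sets for c in labels})
    column = {c: j for j, c in enumerate(classes)}
    matrix = np.zeros((len(label_sets), len(classes)), dtype=int)
    for i, labels in enumerate(label_sets):
        for c in labels:
            matrix[i, column[c]] = 1
    return classes, matrix


# Multilabel targets: each sample may carry several classes.
classes, y_bin = indicator_matrix([[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]])
print(classes)
print(y_bin)
```

A single-label target such as [0, 0, 1, 1, 2] can be passed as [[0], [0], [1], [1], [2]], which yields exactly one 1 per row.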
In your first example, you have a target collection with 5 samples and 3 classes. That's why transforming y with LabelBinarizer results in a 5x3 matrix. In your case, [1, 0, 0] corresponds to class 0, [0, 1, 0] corresponds to class 1 and so forth. Notice that in each row there is only one element set to 1, since each sample can have one label only.
In your next example, you have a target collection with 5 samples and 5 classes. That's why transforming y with MultiLabelBinarizer results in a 5x5 matrix. In your case, [1, 1, 0, 0, 0] corresponds to the multilabel [0, 1], [0, 1, 0, 1, 0] corresponds to the multilabel [1, 3] and so forth. The key difference to the first example is that each row can have multiple elements set to 1, because each sample can have multiple labels/classes.
The predicted values you get follow the very same pattern. They are however not equivalent to the original values in y since your classification model has obviously predicted different values. You can check this with the inverse_transform() of the binarizers:
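import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# a plain list of lists: the rows have different lengths, so they do not form a regular array
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y_bin = mlb.fit_transform(y)
print(y_bin)
# direct transformation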
[[1 1 0 0 0]
[1 0 1 0 0]
[0 1 0 1 0]
[1 0 1 1 0]
[0 0 1 0 1]]
# prediction of your classifier
y_pred = np.array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
# inverting the binarized values to the original classes
y_inv = mlb.inverse_transform(y_pred)
print(y_inv)
# output
[(0, 1), (0, 2), (1, 3), (0, 2), (0, 2)]

How to initialize columns in hybrid sparse tensor

How do I initialize a hybrid torch.sparse_coo_tensor in PyTorch (one dimension is sparse and the other is not) that has the following dense representation?
array([[1, 0, 5, 0],
[2, 0, 6, 0],
[3, 0, 7, 0],
[4, 0, 8, 0]])
What should I put into the indices argument?
How to initialize
Something like this:
import torch
indices = torch.tensor([[0, 0, 1, 1, 2, 2, 3, 3], [0, 2, 0, 2, 0, 2, 0, 2]])
tensor = torch.sparse_coo_tensor(
indices, torch.tensor([1, 2, 3, 4, 5, 6, 7, 8]), size=(4, 4)
)
Given above:
indices - first dimension specifies row, second column, where non-zero value(s) will be located. Those become pairs, in this case: (0, 0), (0, 2), (1, 0), (1, 2)... and so on
values - values located at those pairs, so 1 will be under (0, 0) coordinate, 2 under (0, 2) and so it goes.
size - total size of the matrix, optional, might be inferred in this case from your input
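That makes 8 pairs and 8 values. There are other ways to specify it, but the idea holds. Values are assigned to the pairs in order, so the values above fill the matrix row by row; to reproduce the matrix in the question exactly, pass them as [1, 5, 2, 6, 3, 7, 4, 8].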
And a quick check:
print(tensor)
print(tensor.to_dense())
Gives us:
tensor(indices=tensor([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 2, 0, 2, 0, 2, 0, 2]]),
values=tensor([1, 2, 3, 4, 5, 6, 7, 8]),
size=(4, 4), nnz=8, layout=torch.sparse_coo)
tensor([[1, 0, 2, 0],
[3, 0, 4, 0],
[5, 0, 6, 0],
[7, 0, 8, 0]])
Why to initialize
If your actual data is only about 50% sparse, as it is here, you probably shouldn't use a COO tensor.
It will save some memory, but operations will be much slower, so keep that in mind.
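The tensor above is an ordinary COO tensor with two sparse dimensions. For a hybrid layout, where one dimension is sparse and the other is dense, a sketch could look like the following. It assumes the columns are the sparse dimension; because PyTorch places sparse dimensions before dense ones, the tensor is built as the transpose of the matrix in the question and transposed back after densifying.

```python
import torch

# One sparse dimension: only columns 0 and 2 of the target matrix are non-zero.
indices = torch.tensor([[0, 2]])  # shape (sparse_dim=1, nnz=2)

# Each stored entry is a dense slice of length 4 (one full column).
values = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8]])  # shape (nnz=2, 4)

# Shape is (sparse dimension, dense dimension) = (columns, rows).
hybrid = torch.sparse_coo_tensor(indices, values, size=(4, 4))

print(hybrid.sparse_dim(), hybrid.dense_dim())

# Transpose the dense result to get rows and columns as in the question.
dense = hybrid.to_dense().t()
print(dense)
```

Here indices holds one coordinate per stored slice rather than one per scalar, which is what distinguishes the hybrid form from the fully sparse one.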

How to reshape an array with numpy like this:

I have this:
array([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 0, 1, 1, 2, 2, 3, 3]])
And I would like to reshape my array like this:
array([[0, 0, 1, 1],
[0, 0, 1, 1],
[2, 2, 3, 3],
[2, 2, 3, 3]])
How do I do it using python numpy?
You can just split and concatenate:
import numpy as np
a = np.array([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 0, 1, 1, 2, 2, 3, 3]])
cols = a.shape[1] // 2  # width of each half
# stack the left half on top of the right half
np.concatenate((a[:,:cols], a[:,cols:]))
#[[0 0 1 1]
# [0 0 1 1]
# [2 2 3 3]
# [2 2 3 3]]
You can also simply swap rows after reshaping it.
a = np.array([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 0, 1, 1, 2, 2, 3, 3]]).reshape(4, 4)
a[[1, 2]] = a[[2, 1]]  # swap the two middle rows
Output:
array([[0, 0, 1, 1],
[0, 0, 1, 1],
[2, 2, 3, 3],
[2, 2, 3, 3]])
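A more general variant, sketched here as an alternative to the two answers above, splits each row into blocks with reshape and reorders the axes with transpose. The block width of 4 is taken from the example; other shapes would need different numbers.

```python
import numpy as np

a = np.array([[0, 0, 1, 1, 2, 2, 3, 3],
              [0, 0, 1, 1, 2, 2, 3, 3]])

# (2, 8) -> (row, block, column-within-block) = (2, 2, 4)
blocks = a.reshape(2, 2, 4)

# Put the block axis first, then flatten block and row into the new row axis.
result = blocks.transpose(1, 0, 2).reshape(4, 4)
print(result)
```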

Numpy is not inserting the right array into multidimensional array

I have a matrix
M = np.array([
[1, -2, -2, -2, 1, 2],
[0, 3, -2, -3, 1, 3],
[3, 0, 0, 1, -1, 2],
[3, -3, -2, 0, 1, 1],
[0, -3, 3, -3, -3, 2]
])
and I'm trying to replace the first row by itself modulo some number N = 2497969412496091.
I've been playing around with this in the IDE for a while. The modulo itself gives the expected result:
>>> M[0] % N
array([1, 2497969412496089, 2497969412496089, 2497969412496089, 1, 2], dtype=int64)
But after I perform M[0] = M[0] % N and print the matrix M, I get
>>> M[0]
array([1, -746726695, -746726695, -746726695, 1, 2])
I've also tried to copy the intermediate step M[0] % N in a temporary variable and then setting it equal to M[0] but the problem still persists. What is going on here?
Your array is np.int32:
print(type(M[0][0])) # <class 'numpy.int32'>
Create the original array as np.int64 to avoid the integer overflow:
import numpy as np
M = np.array([
[1, -2, -2, -2, 1, 2],
[0, 3, -2, -3, 1, 3],
[3, 0, 0, 1, -1, 2],
[3, -3, -2, 0, 1, 1],
[0, -3, 3, -3, -3, 2]
], dtype=np.int64)
N = 2497969412496091
M[0] = M[0] % N
print(M)
Output:
[[ 1 2497969412496089 2497969412496089 2497969412496089 1 2]
[ 0 3 -2 -3 1 3]
[ 3 0 0 1 -1 2]
[ 3 -3 -2 0 1 1]
[ 0 -3 3 -3 -3 2]]
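The wrapped value from the question can be reproduced in isolation. This sketch sets the dtypes explicitly so that it does not depend on the platform's default integer type, and it uses astype to stand in for the cast that happens when an int64 result is stored into an int32 array.

```python
import numpy as np

N = 2497969412496091

# The first row of M, held as int64 so the residues fit.
row64 = np.array([1, -2, -2, -2, 1, 2], dtype=np.int64)
reduced = row64 % N

# Casting down to int32 keeps only the low 32 bits, so large values wrap around.
wrapped = reduced.astype(np.int32)

print(np.iinfo(np.int32).max)  # largest value an int32 can hold
print(reduced)
print(wrapped)
```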

How to use Numpy .tobytes() to serialize objects

How do you serialize/deserialize a NumPy array?
A = np.random.randint(0, 10, 40).reshape(8, 5)
print(A)
print (A.dtype)
snapshot = A
serialized = snapshot.tobytes()
[[9 5 5 7 4]
[3 8 8 1 0]
[5 7 1 0 2]
[2 2 7 1 2]
[2 6 3 5 4]
[7 5 4 8 3]
[2 4 2 4 7]
[3 4 2 6 2]]
int64
Returns
deserialized = np.frombuffer(serialized).astype(np.int64)
print (deserialized)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0]
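There is a mismatch between the dtype of A and the default dtype of np.frombuffer. np.frombuffer defaults to float64, so the integer bytes are reinterpreted as floats; small integers turn into tiny subnormal values, which astype(np.int64) then truncates to 0. It works as expected when you pass the array's actual dtype (which one that is may depend on the machine / Python / NumPy version):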
# Python 3.6 64-bits with numpy 1.12.1 64-bits
A = np.random.randint(0, 10, 40).reshape(8, 5)
print(A)
>>> array([[3, 3, 5, 3, 9],
[1, 4, 7, 1, 8],
[1, 7, 4, 3, 0],
[9, 2, 9, 1, 2],
[2, 8, 9, 1, 1],
[3, 3, 5, 2, 6],
[5, 0, 2, 7, 6],
[2, 8, 8, 0, 7]])
A.dtype
>>> dtype('int32')
deserialized = np.frombuffer(A.tobytes(), dtype=np.int32).reshape(A.shape)
print(deserialized)
>>> array([[3, 3, 5, 3, 9],
[1, 4, 7, 1, 8],
[1, 7, 4, 3, 0],
[9, 2, 9, 1, 2],
[2, 8, 9, 1, 1],
[3, 3, 5, 2, 6],
[5, 0, 2, 7, 6],
[2, 8, 8, 0, 7]])
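Because tobytes() keeps neither the dtype nor the shape, one way to avoid hard-coding them is to carry both alongside the bytes. This is a minimal sketch; the helper names serialize and deserialize are hypothetical.

```python
import numpy as np


def serialize(arr):
    """Return the raw bytes together with the metadata needed to rebuild the array."""
    return arr.tobytes(), arr.dtype.str, arr.shape


def deserialize(payload, dtype, shape):
    """Rebuild the array; np.frombuffer returns a read-only view of the bytes."""
    return np.frombuffer(payload, dtype=np.dtype(dtype)).reshape(shape)


A = np.random.randint(0, 10, 40).reshape(8, 5)
restored = deserialize(*serialize(A))
print(np.array_equal(A, restored))
```

Call restored.copy() if a writable array is needed.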
