How to convert pyspark rdd into sparse matrix

I have a key/value pair RDD:
{(("a", "b"), 1), (("a", "c"), 3), (("c", "d"), 5)}
How can I get the following symmetric sparse matrix from it?
0 1 3 0
1 0 0 0
3 0 0 5
0 0 5 0
i.e.
from pyspark.mllib.linalg import Matrices
Matrices.sparse(4, 4, [0, 2, 3, 5, 6], [1, 2, 0, 0, 3, 2], [1, 3, 1, 3, 5, 5])
or
import numpy as np
from scipy.sparse import csc_matrix
data = [1, 3, 1, 3, 5, 5]
indices = [1, 2, 0, 0, 3, 2]
indptr = [0, 2, 3, 5, 6]
csc_matrix((data, indices, indptr), shape=(4, 4), dtype=np.float64)

Could you apply pivot to a DataFrame and then convert the result to a matrix?
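If the pairs are few enough to collect on the driver, another option is to build the three CSC arrays by hand. The following is only a sketch under that assumption: `pairs` is a stand-in for `rdd.collect()`, `pairs_to_csc` is a hypothetical helper, and the keys are assumed to map to indices in sorted order (a=0, b=1, c=2, d=3).

```python
# Hypothetical stand-in for rdd.collect(); assumes the data fits on the driver.
pairs = [(("a", "b"), 1), (("a", "c"), 3), (("c", "d"), 5)]


def pairs_to_csc(pairs):
    """Return (n, col_ptrs, row_indices, values) for the symmetric matrix."""
    # Give every distinct key a row/column index, in sorted order.
    labels = sorted({key for (a, b), _ in pairs for key in (a, b)})
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)

    # Mirror each entry across the diagonal so the matrix is symmetric.
    entries = {}
    for (a, b), value in pairs:
        entries[(index[a], index[b])] = value
        entries[(index[b], index[a])] = value

    # CSC stores entries column by column, with rows ascending inside a column.
    col_ptrs, row_indices, values = [0], [], []
    for col in range(n):
        for row in sorted(r for (r, c) in entries if c == col):
            row_indices.append(row)
            values.append(entries[(row, col)])
        col_ptrs.append(len(row_indices))
    return n, col_ptrs, row_indices, values


n, col_ptrs, row_indices, values = pairs_to_csc(pairs)
print(n, col_ptrs, row_indices, values)
```

The resulting lists match the arguments in the question, so they could be passed on as Matrices.sparse(n, n, col_ptrs, row_indices, values) or as csc_matrix((values, row_indices, col_ptrs), shape=(n, n)).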

Related

Multiclass vs. multilabel fitting

In scikit-learn tutorials, I found the following paragraphs in the section 'Multiclass vs. multilabel fitting'.
I couldn't understand why the following code generates the given results.
First
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]
classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Next
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])

Label binarization in scikit-learn will transform your targets and represent them in a label indicator matrix. This label indicator matrix has the shape (n_samples, n_classes) and is composed as follows:
each row represents a sample
each column represents a class
each element is 1 if the sample is labeled with the class and 0 if not
In your first example, you have a target collection with 5 samples and 3 classes. That's why transforming y with LabelBinarizer results in a 5x3 matrix. In your case, [1, 0, 0] corresponds to class 0, [0, 1, 0] corresponds to class 1 and so forth. Notice that in each row there is only one element set to 1, since each sample can have one label only.
In your next example, you have a target collection with 5 samples and 5 classes. That's why transforming y with MultiLabelBinarizer results in a 5x5 matrix. In your case, [1, 1, 0, 0, 0] corresponds to the multilabel [0, 1], [0, 1, 0, 1, 0] corresponds to the multilabel [1, 3] and so forth. The key difference to the first example is that each row can have multiple elements set to 1, because each sample can have multiple labels/classes.
The predicted values you get follow the very same pattern. They are however not equivalent to the original values in y since your classification model has obviously predicted different values. You can check this with the inverse_transform() of the binarizers:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# keep y as a list of lists: the rows have different lengths, so np.array would be ragged
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y_bin = mlb.fit_transform(y)
print(y_bin)
# direct transformation
# [[1 1 0 0 0]
#  [1 0 1 0 0]
#  [0 1 0 1 0]
#  [1 0 1 1 0]
#  [0 0 1 0 1]]
# prediction of your classifier
y_pred = np.array([[1, 1, 0, 0, 0],
                   [1, 0, 1, 0, 0],
                   [0, 1, 0, 1, 0],
                   [1, 0, 1, 0, 0],
                   [1, 0, 1, 0, 0]])
# inverting the binarized values to the original classes
y_inv = mlb.inverse_transform(y_pred)
print(y_inv)
# output
# [(0, 1), (0, 2), (1, 3), (0, 2), (0, 2)]
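To make the layout of the label indicator matrix concrete, here is a small pure-Python sketch of the same idea. It is not how scikit-learn implements its binarizers; `to_indicator` and `from_indicator` are hypothetical helpers written only for illustration.

```python
def to_indicator(label_sets):
    """Turn a list of label collections into (classes, 0/1 matrix)."""
    # One column per distinct class, in sorted order.
    classes = sorted({c for labels in label_sets for c in labels})
    column = {c: j for j, c in enumerate(classes)}
    matrix = []
    for labels in label_sets:
        row = [0] * len(classes)
        for c in labels:
            row[column[c]] = 1  # 1 means "this sample carries this class"
        matrix.append(row)
    return classes, matrix


def from_indicator(classes, matrix):
    """Map each 0/1 row back to the tuple of classes it marks."""
    return [tuple(c for c, flag in zip(classes, row) if flag) for row in matrix]


# Multiclass: exactly one label per sample, so exactly one 1 per row.
print(to_indicator([[0], [0], [1], [1], [2]])[1])

# Multilabel: several labels per sample, so several 1s per row.
classes, y_bin = to_indicator([[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]])
print(y_bin)
print(from_indicator(classes, y_bin))
```

Decoding the classifier's prediction with `from_indicator` in the same way shows where it differs from the original targets, for example (0, 2) instead of (0, 2, 3) in the fourth row.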

How to initialize columns in hybrid sparse tensor

How do I initialize a hybrid tensor in PyTorch with torch.sparse_coo_tensor (one dimension is sparse and the other is not) that has the following dense representation?
array([[1, 0, 5, 0],
[2, 0, 6, 0],
[3, 0, 7, 0],
[4, 0, 8, 0]])
What should I put into the indices argument?

How to initialize
Something like this:
import torch
indices = torch.tensor([[0, 0, 1, 1, 2, 2, 3, 3], [0, 2, 0, 2, 0, 2, 0, 2]])
tensor = torch.sparse_coo_tensor(
indices, torch.tensor([1, 2, 3, 4, 5, 6, 7, 8]), size=(4, 4)
)
Given above:
indices - first dimension specifies row, second column, where non-zero value(s) will be located. Those become pairs, in this case: (0, 0), (0, 2), (1, 0), (1, 2)... and so on
values - values located at those pairs, so 1 will be under (0, 0) coordinate, 2 under (0, 2) and so it goes.
size - total size of the matrix, optional, might be inferred in this case from your input
8 pairs, 8 values, there are also other ways to specify it, but the idea holds.
And a quick check:
print(tensor)
print(tensor.to_dense())
Gives us:
tensor(indices=tensor([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 2, 0, 2, 0, 2, 0, 2]]),
values=tensor([1, 2, 3, 4, 5, 6, 7, 8]),
size=(4, 4), nnz=8, layout=torch.sparse_coo)
tensor([[1, 0, 2, 0],
[3, 0, 4, 0],
[5, 0, 6, 0],
[7, 0, 8, 0]])
Why to initialize
If your actual data is only about 50% sparse, a COO tensor is probably not worth it.
It may save some memory, but operations will be much slower, so keep that in mind.
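For the hybrid layout the question asks about, one possible sketch is shown below. In a hybrid COO tensor the sparse dimensions come first and the dense dimensions last, so this version stores the transpose (sparse rows 0 and 2, each holding a dense vector) and transposes after densifying. The variable names are only illustrative.

```python
import torch

# One sparse dimension (first) and one dense dimension (last).
# Only the entries at sparse index 0 and 2 are stored; each is a dense vector.
indices = torch.tensor([[0, 2]])            # shape (sparse_dim=1, nnz=2)
values = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8]])       # shape (nnz=2, dense size 4)
hybrid = torch.sparse_coo_tensor(indices, values, size=(4, 4))

print(hybrid.sparse_dim(), hybrid.dense_dim())

# The stored tensor is the transpose of the matrix in the question,
# so transpose the dense result to get columns 0 and 2 filled.
dense = hybrid.to_dense().t()
print(dense)
```

Here indices has a single row because there is only one sparse dimension, and values carries one extra dimension for the dense part.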

How to reshape an array with numpy like this:

I have this:
array([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 0, 1, 1, 2, 2, 3, 3]])
And I would like to reshape my array like this:
array([[0, 0, 1, 1],
[0, 0, 1, 1],
[2, 2, 3, 3],
[2, 2, 3, 3]])
How do I do it using python numpy?

You can just split and concatenate:
a = np.array([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 0, 1, 1, 2, 2, 3, 3]])
cols = a.shape[1] // 2
np.concatenate((a[:,:cols], a[:,cols:]))
#[[0 0 1 1]
# [0 0 1 1]
# [2 2 3 3]
# [2 2 3 3]]

You can simply swap rows after reshaping it.
a= np.array([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 0, 1, 1, 2, 2, 3, 3]]).reshape(4,4)
a[[1,2]] = a[[2,1]]
Output:
array([[0, 0, 1, 1],
[0, 0, 1, 1],
[2, 2, 3, 3],
[2, 2, 3, 3]])
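If the array needs to be cut into more than two column blocks, a reshape followed by a transpose generalizes both answers. This is a sketch only; `stack_column_blocks` is a hypothetical helper name, and it assumes the number of columns is divisible by the number of blocks.

```python
import numpy as np


def stack_column_blocks(a, n_blocks):
    """Cut a 2-D array into n_blocks equal column blocks and stack them vertically."""
    rows, cols = a.shape
    if cols % n_blocks:
        raise ValueError("number of columns must be divisible by n_blocks")
    width = cols // n_blocks
    # (rows, n_blocks, width) -> (n_blocks, rows, width) -> (n_blocks * rows, width)
    return a.reshape(rows, n_blocks, width).transpose(1, 0, 2).reshape(-1, width)


a = np.array([[0, 0, 1, 1, 2, 2, 3, 3],
              [0, 0, 1, 1, 2, 2, 3, 3]])
print(stack_column_blocks(a, 2))
```

With n_blocks=2 this gives the same 4x4 result as the split-and-concatenate version.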

Numpy is not inserting the right array into multidimensional array

I have a matrix
M = np.array([
[1, -2, -2, -2, 1, 2],
[0, 3, -2, -3, 1, 3],
[3, 0, 0, 1, -1, 2],
[3, -3, -2, 0, 1, 1],
[0, -3, 3, -3, -3, 2]
])
and I'm trying to replace the first row by itself modulo some number N = 2497969412496091.
I've been playing around with this in the IDE for a while. Evaluating M[0] % N gives the values I expect:
>>> M[0] % N
array([1, 2497969412496089, 2497969412496089, 2497969412496089, 1, 2], dtype=int64)
But after I perform M[0] = M[0] % N and print the row, I get
>>> M[0]
array([1, -746726695, -746726695, -746726695, 1, 2])
I've also tried to copy the intermediate step M[0] % N in a temporary variable and then setting it equal to M[0] but the problem still persists. What is going on here?

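Your array is np.int32:
print(type(M[0][0])) # <class 'numpy.int32'>
The remainder 2497969412496089 does not fit into 32 bits, so it wraps around when it is written back into the int32 array. Create the original array as np.int64 to avoid the integer overflow: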
import numpy as np
M = np.array([
[1, -2, -2, -2, 1, 2],
[0, 3, -2, -3, 1, 3],
[3, 0, 0, 1, -1, 2],
[3, -3, -2, 0, 1, 1],
[0, -3, 3, -3, -3, 2]
], dtype=np.int64)
N = 2497969412496091
M[0] = M[0] % N
print(M)
Output:
[[ 1 2497969412496089 2497969412496089 2497969412496089 1 2]
[ 0 3 -2 -3 1 3]
[ 3 0 0 1 -1 2]
[ 3 -3 -2 0 1 1]
[ 0 -3 3 -3 -3 2]]
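The wrapped value in the question can be reproduced with plain Python integers. The sketch below only mimics what truncation to a signed 32-bit integer does; `wrap_to_int32` is a hypothetical helper, not a NumPy function.

```python
def wrap_to_int32(x):
    """Return the value a signed 32-bit integer holds after x is truncated to 32 bits."""
    return (x + 2**31) % 2**32 - 2**31


N = 2497969412496091
remainder = -2 % N                  # what M[0] % N computes for the entry -2
stored = wrap_to_int32(remainder)   # what ends up in an int32 array

print(remainder)  # 2497969412496089
print(stored)     # -746726695
```

The second number is exactly the value printed for M[0] in the question, which confirms that the array was silently truncating to 32 bits.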

How to use Numpy .tobytes() to serialize objects

How do you serialize and deserialize a NumPy array?
A = np.random.randint(0, 10, 40).reshape(8, 5)
print(A)
print (A.dtype)
snapshot = A
serialized = snapshot.tobytes()
[[9 5 5 7 4]
[3 8 8 1 0]
[5 7 1 0 2]
[2 2 7 1 2]
[2 6 3 5 4]
[7 5 4 8 3]
[2 4 2 4 7]
[3 4 2 6 2]]
int64
Deserializing returns only zeros:
deserialized = np.frombuffer(serialized).astype(np.int64)
print (deserialized)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0]

There is a mismatch between the dtype of A and the default dtype of np.frombuffer, which is float64. It works as expected when you pass the correct dtype (which may depend on the machine and the Python / NumPy version):
# Python 3.6 64-bits with numpy 1.12.1 64-bits
A = np.random.randint(0, 10, 40).reshape(8, 5)
print(A)
>>> array([[3, 3, 5, 3, 9],
[1, 4, 7, 1, 8],
[1, 7, 4, 3, 0],
[9, 2, 9, 1, 2],
[2, 8, 9, 1, 1],
[3, 3, 5, 2, 6],
[5, 0, 2, 7, 6],
[2, 8, 8, 0, 7]])
A.dtype
>>> dtype('int32')
deserialized = np.frombuffer(A.tobytes(), dtype=np.int32).reshape(A.shape)
print(deserialized)
>>> array([[3, 3, 5, 3, 9],
[1, 4, 7, 1, 8],
[1, 7, 4, 3, 0],
[9, 2, 9, 1, 2],
[2, 8, 9, 1, 1],
[3, 3, 5, 2, 6],
[5, 0, 2, 7, 6],
[2, 8, 8, 0, 7]])
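Since tobytes() keeps only the raw values, a round trip that does not depend on the platform's default integer type has to carry the dtype and shape along with the bytes. The following is a minimal sketch of that idea; `serialize` and `deserialize` are hypothetical helper names.

```python
import numpy as np


def serialize(arr):
    """Return the raw bytes plus the metadata needed to rebuild the array."""
    # dtype.str (for example '<i8') records both the type and the byte order.
    return arr.tobytes(), arr.dtype.str, arr.shape


def deserialize(payload, dtype, shape):
    """Rebuild the array; the result is a read-only view over the bytes."""
    return np.frombuffer(payload, dtype=dtype).reshape(shape)


A = np.arange(40, dtype=np.int64).reshape(8, 5)
restored = deserialize(*serialize(A))
print(restored.dtype, restored.shape)
```

Call .copy() on the result if the restored array needs to be modified.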
