Spark matrix multiplication with Python - apache-spark

I am trying to do matrix multiplication using Apache Spark and Python.
Here is my data:
from pyspark.mllib.linalg.distributed import RowMatrix
My RDDs of vectors:
rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])
My matrices:
mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)
I would like to do something like this:
mat = mat1 * mat2
I wrote a function to do the matrix multiplication, but I'm afraid it will have a long processing time. Here is my function:
def matrix_multiply(df1, df2):
    nb_row = df1.count()
    mat = []
    for i in range(0, nb_row):
        row = list(df1.filter(df1['index'] == i).take(1)[0])
        row_out = []
        for r in range(0, len(row)):
            r_value = 0
            col = df2.select(df2[list_col[r]]).collect()
            col = [list(c)[0] for c in col]
            for c in range(0, len(col)):
                r_value += row[c] * col[c]
            row_out.append(r_value)
        mat.append(row_out)
    return mat
My function makes a lot of Spark actions (take, collect, etc.). Will the function take a lot of processing time?
If someone has another idea, it would be helpful for me.

You cannot. Since RowMatrix has no meaningful row indices, it cannot be used for multiplication. Even ignoring that, the only distributed matrix which supports multiplication with another distributed structure is BlockMatrix.
from pyspark.mllib.linalg.distributed import *
def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)
as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))
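A minimal usage sketch with the sample data above, assuming an active SparkContext sc (multiply, toLocalMatrix and toArray are standard BlockMatrix/DenseMatrix methods):
result = as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))
print(result.toLocalMatrix().toArray())
# expected 3 x 2 result:
# [[ 9. 12.]
#  [24. 33.]
#  [39. 54.]]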

Related

Matrix Multiplication Python

I am trying to multiply two matrices and came across the code below. Can someone please help me understand the logic of the second 'for' loop, i.e. why it is range(len(B[0]))? I am quite a newbie to the programming world, so I am unable to understand the logic. Please help.
for i in range(r1):
    print("i=", i)
    for j in range(len(B[0])):
        print("j=", j)
        for k in range(r2):
            print("k=", k)
            result[i][j] += A[i][k] * B[k][j]
return(result)
Here r1 and r2 are the lengths (row counts) of the two matrices.
The second loop uses range(len(B[0])) because len(B[0]) is the number of columns of B, which is also the number of columns of the result, so j indexes the output's columns. That said, a simple way to do matrix multiplication is to use the numpy dot product:
import numpy as np
result = np.dot([[2, 5], [5, 8]],[[2, 1], [5, 9]])
#result = np.dot(matrix1, matrix2)
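For reference, here is a self-contained, commented sketch of the loop version from the question; the concrete A, B, r1, r2 and result initialisation below are assumptions following the question's conventions, and the comments make the role of len(B[0]) explicit:
A = [[2, 5], [5, 8]]
B = [[2, 1], [5, 9]]
r1, r2 = len(A), len(B)                        # rows of A, rows of B
result = [[0] * len(B[0]) for _ in range(r1)]  # r1 x len(B[0]) matrix of zeros
for i in range(r1):             # each row of A gives a row of the result
    for j in range(len(B[0])):  # each column of B gives a column of the result
        for k in range(r2):     # walk along row i of A and down column j of B
            result[i][j] += A[i][k] * B[k][j]
print(result)  # [[29, 47], [50, 77]], the same as np.dot(A, B).tolist()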

dask array map_blocks, with differently shaped dask array as argument

I'm trying to use dask.array.map_blocks to process a dask array, using a second dask array with a different shape as an argument. The use case is first running some peak finding on a 2-D stack of images (4 dimensions in total), which is returned as a 2-D dask array of np.objects. Ergo, the first two dimensions of the two dask arrays are the same. The peaks are then used to extract intensities from the 4-dimensional dataset. In the code below, I've omitted the peak-finding part. Dask version: 1.0.0.
import numpy as np
import dask.array as da
def test_processing(data_chunk, position_chunk):
    output_array = np.empty(data_chunk.shape[:-2], dtype='object')
    for index in np.ndindex(data_chunk.shape[:-2]):
        islice = np.s_[index]
        intensity_list = []
        data = data_chunk[islice]
        positions = position_chunk[islice]
        for x, y in positions:
            intensity_list.append(data[x, y])
        output_array[islice] = np.array(intensity_list)
    return output_array
data = da.random.random(size=(4, 4, 10, 10), chunks=(2, 2, 10, 10))
positions = np.empty(data.shape[:-2], dtype='object')
for index in np.ndindex(positions.shape):
    positions[index] = np.arange(10).reshape(5, 2)
data_output = da.map_blocks(test_processing, data, positions, dtype=np.object,
                            chunks=(2, 2), drop_axis=(2, 3))
data_output.compute()
This gives the error ValueError: Can't drop an axis with more than 1 block. Please use atop instead., which I'm guessing is due to positions having 3 dimensions, while data has 4 dimensions.
The same function, but without the positions dask array, works fine.
import numpy as np
import dask.array as da
def test_processing(data_chunk):
    output_array = np.empty(data_chunk.shape[:-2], dtype='object')
    for index in np.ndindex(data_chunk.shape[:-2]):
        islice = np.s_[index]
        intensity_list = []
        data = data_chunk[islice]
        positions = [[5, 2], [1, 3]]
        for x, y in positions:
            intensity_list.append(data[x, y])
        output_array[islice] = np.array(intensity_list)
    return output_array
data = da.random.random(size=(4, 4, 10, 10), chunks=(2, 2, 10, 10))
data_output = da.map_blocks(test_processing, data, dtype=np.object,
                            chunks=(2, 2), drop_axis=(2, 3))
data_computed = data_output.compute()
This has been fixed in more recent versions of dask: running the same code on version 2.3.0 of dask works fine.
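A quick sanity check before rerunning the first snippet is to confirm which dask version is installed (a trivial sketch; the exact release containing the fix isn't pinned down here, only that 2.3.0 is known to work as reported above):
import dask
print(dask.__version__)  # 1.0.0 reproduces the error, 2.3.0 runs fine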

Tensorflow map_fn Out of Memory Issues

I am having issues with my code running out of memory on large data sets. I attempted to chunk the data to feed it into the calculation graph but I eventually get an out of memory error. Would setting it up to use the feed_dict functionality get around this problem?
My code is set up like the following, with a nested map_fn call resulting from the tf_itertools_product_2D_nest function.
The tf_itertools_product_2D_nest function is from Cartesian Product in Tensorflow.
I also tried a variation where I made a list of tensor-lists, which was significantly slower than doing it purely in TensorFlow, so I'd prefer to avoid that method.
import tensorflow as tf
import numpy as np
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.9
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess = tf.Session()
sess.run(tf.global_variables_initializer())
tensorboard_log_dir = "../log/"
def tf_itertools_product_2D_nest(a, b):  # does not work on nested tensors
    a, b = a[None, :, None], b[:, None, None]
    # print(sess.run(tf.shape(a)))
    # print(sess.run(tf.shape(b)))
    n_feat_dimension_in_common = tf.shape(a)[-1]
    c = tf.concat([a + tf.zeros_like(b), tf.zeros_like(a) + b], axis=2)
    return c
def do_calc(arr_pair):
    arr_1 = arr_pair[0]
    arr_binary = arr_pair[1]
    return tf.reduce_max(tf.cumsum(arr_1 * arr_binary))
def calc_row_wrapper(row):
    return tf.map_fn(do_calc, row)
for i in range(0, 10):
    a = tf.constant(np.random.random((7, 10)) * 10, tf.float64)
    b = tf.constant(np.random.randint(2, size=(3, 10)), tf.float64)
    a_b_itertools_product = tf_itertools_product_2D_nest(a, b)
    '''Creates an array like this:
    [ [[arr_a0,arr_b0], [arr_a1,arr_b0],...],
      [[arr_a0,arr_b1], [arr_a1,arr_b1],...],
      [[arr_a0,arr_b2], [arr_a1,arr_b2],...],
      ...]
    '''
    with tf.summary.FileWriter(tensorboard_log_dir, sess.graph) as writer:
        result_array = sess.run(tf.map_fn(calc_row_wrapper, a_b_itertools_product),
                                options=run_options, run_metadata=run_metadata)
        writer.add_run_metadata(run_metadata, "iteration {}".format(i))
    print(result_array.shape)
    print(result_array)
    print("")
# result_array should be an array with 3 rows (1 for each binary vector in b) and 7 columns (1 for each row in a)
I can imagine that this is unnecessarily consuming memory due to the extra dimension added. Is there a way to mimic the outcome of the standard itertools.product() function and output one long list of every possible combination of items in the two input iterables? Like the result of:
itertools.product([[1,2],[3,4]],[[5,6],[7,8]])
# [([1, 2], [5, 6]), ([1, 2], [7, 8]), ([3, 4], [5, 6]), ([3, 4], [7, 8])]
That would eliminate the need to call map_fn twice.
When map_fn is called within a loop, as in my code, will it keep spawning graphs for every iteration? There appears to be a big "map_" node for every iteration cycle in this code's TensorBoard graph.
[Screenshot: TensorBoard default view]
When I select a particular iteration based on its tag in TensorBoard, only the map node corresponding to that iteration is highlighted, with all the others grayed out. Does that mean that for that cycle only that cycle's map node is present (and the others, if from a previous cycle, no longer exist in memory)?
[Screenshot: TensorBoard single-iteration view]

Efficient way to get nearest point [duplicate]

Let's say I have the following numpy matrix (simplified):
matrix = np.array([[1, 1],
                   [2, 2],
                   [5, 5],
                   [6, 6]])
And now I want to get the vector from the matrix closest to a "search" vector:
search_vec = np.array([3, 3])
What I have done is the following:
min_dist = None
result_vec = None
for ref_vec in matrix:
distance = np.linalg.norm(search_vec-ref_vec)
distance = abs(distance)
print(ref_vec, distance)
if min_dist == None or min_dist > distance:
min_dist = distance
result_vec = ref_vec
It works, but is there a native numpy solution to do it more efficiently?
My problem is that the bigger the matrix becomes, the slower the entire process gets.
Are there other solutions that handle this problem in a more elegant and efficient way?
Approach #1
We can use a Cython-powered kd-tree for quick nearest-neighbor lookup, which is very efficient both memory-wise and performance-wise -
In [276]: from scipy.spatial import cKDTree
In [277]: matrix[cKDTree(matrix).query(search_vec, k=1)[1]]
Out[277]: array([2, 2])
Approach #2
With SciPy's cdist -
In [286]: from scipy.spatial.distance import cdist
In [287]: matrix[cdist(matrix, np.atleast_2d(search_vec)).argmin()]
Out[287]: array([2, 2])
Approach #3
With Scikit-learn's Nearest Neighbors -
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=1).fit(matrix)
closest_vec = matrix[nbrs.kneighbors(np.atleast_2d(search_vec))[1][0,0]]
Approach #4
With Scikit-learn's kdtree -
from sklearn.neighbors import KDTree
kdt = KDTree(matrix, metric='euclidean')
cv = matrix[kdt.query(np.atleast_2d(search_vec), k=1, return_distance=False)[0,0]]
Approach #5
From the eucl_dist package (disclaimer: I am its author), and following the wiki contents, we could leverage matrix multiplication -
M = matrix.dot(search_vec)
d = np.einsum('ij,ij->i',matrix,matrix) + np.inner(search_vec,search_vec) -2*M
closest_vec = matrix[d.argmin()]
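As a point of comparison, the loop from the question can also be vectorized directly with broadcasting. This is just a minimal sketch of that baseline, assuming the matrix and search_vec defined above (it is not one of the approaches above, only the straightforward NumPy version of the original loop):
dists = np.linalg.norm(matrix - search_vec, axis=1)  # distance from every row to the search vector
closest_vec = matrix[dists.argmin()]                 # -> array([2, 2])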

Using Theano.scan with multidimensional arrays

To speed up my code I am converting a multidimensional sumproduct function from Python to Theano. My Theano code reaches the same result, but only calculates the result for one dimension at a time, so that I have to use a Python for-loop to get the end result. I assume that makes the code slow, because Theano cannot optimize memory usage and transfers (for the GPU) between multiple function calls. Or is this a wrong assumption?
So how can I change the Theano code so that the sumprod is calculated in one function call?
The original Python function:
def sumprod(a1, a2):
    """Sum the element-wise products of the `a1` and `a2`."""
    result = numpy.zeros_like(a1[0])
    for i, j in zip(a1, a2):
        result += i * j
    return result
For the following input
a1 = ([1, 2, 4], [5, 6, 7])
a2 = ([1, 2, 4], [5, 6, 7])
the output would be [ 26. 40. 65.], that is, 1*1 + 5*5, 2*2 + 6*6 and 4*4 + 7*7.
The Theano version of the code:
import theano
import theano.tensor as T
import numpy
a1 = ([1, 2, 4], [5, 6, 7])
a2 = ([1, 2, 4], [5, 6, 7])
# wanted result: [ 26. 40. 65.]
# that is 1*1 + 5*5, 2*2 + 6*6 and 4*4 + 7*7
Tk = T.iscalar('Tk')
Ta1_shared = theano.shared(numpy.array(a1).T)
Ta2_shared = theano.shared(numpy.array(a2).T)
outputs_info = T.as_tensor_variable(numpy.asarray(0, 'float64'))
Tsumprod_result, updates = theano.scan(fn=lambda Ta1_shared, Ta2_shared, prior_value:
prior_value + Ta1_shared * Ta2_shared,
outputs_info=outputs_info,
sequences=[Ta1_shared[Tk], Ta2_shared[Tk]])
Tsumprod_result = Tsumprod_result[-1]
Tsumprod = theano.function([Tk], outputs=Tsumprod_result)
result = numpy.zeros_like(a1[0])
for i in range(len(a1[0])):
result[i] = Tsumprod(i)
print result
First, there are more people who will answer your questions on the Theano mailing list than on Stack Overflow. But I'm here :)
Your function isn't a good fit for the GPU. Even if everything was well optimized, the transfer of the input to the GPU just to add and sum the result will take more time to run than the Python version.
Your Python code is slow; here is a version that should be faster:
def sumprod(a1, a2):
    """Sum the element-wise products of the `a1` and `a2`."""
    a1 = numpy.asarray(a1)
    a2 = numpy.asarray(a2)
    result = (a1 * a2).sum(axis=0)
    return result
For the Theano code, here is the equivalent of this faster Python version (no need for scan):
m1 = theano.tensor.matrix()
m2 = theano.tensor.matrix()
f = theano.function([m1, m2], (m1 * m2).sum(axis=0))
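A minimal usage sketch of the compiled function above, using the question's example data (the cast to theano.config.floatX is only there to make the input dtype match the symbolic matrices):
a1 = numpy.asarray([[1, 2, 4], [5, 6, 7]], dtype=theano.config.floatX)
a2 = numpy.asarray([[1, 2, 4], [5, 6, 7]], dtype=theano.config.floatX)
print(f(a1, a2))  # expected: [ 26.  40.  65.]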
The thing to remember from this is that you need to "vectorize" your code. "Vectorize" here is used in the NumPy sense: it means using numpy.ndarray and functions that work on the full tensor at a time. This is always faster than doing it with loops (Python loops or Theano scan). Also, Theano optimizes some of those cases by moving the computation outside the scan, but it doesn't always do it.
