I am trying to implement a distributed algorithm using Spark. It is a computer vision algorithm with tens of thousands of images. The images are divided into "partitions" that are processed in a distributed fashion, with the help of a master process. The pseudocode goes something like this:
# Iterate
for t = 1 ... T
    # Each partition
    for p = 1 ... P
        d[p] = f1(b[p], z[p], u[p])
    # Master
    y = f2(d)
    # Each partition
    for p = 1 ... P
        u[p] = f3(u[p], y)
    # Each partition
    for p = 1 ... P
        # Iterate
        for t = 1 ... T
            z[p] = f4(b[p], y, v[p])
            v[p] = f5(z[p])
Here b[p] contains the pth partition of the images as a numpy ndarray, z[p] contains some function of b[p] and is also a numpy ndarray, y is computed on the master from all the partitions of d, and u[p] is then updated on each partition using y. In my attempted implementation, b, z, and u are separate RDDs with corresponding keys (e.g. (1, b[1]), (1, z[1]) and (1, u[1]) correspond to the first partition, etc.).
The problem with using Spark is that b and z are extremely large, on the order of GBs. Since RDDs are immutable, whenever I want to "join" them (e.g. bring z[1] and b[1] onto the same machine for processing) they are replicated, i.e. new copies of the numpy arrays are created. This multiplies the amount of memory needed and limits the number of images that can be processed.
I thought a way to avoid the joins would be to have a single RDD that combines all of the variables, e.g. (p, (z[p], b[p], u[p], v[p])), but then the immutability problem is still there.
So my question: is there a workaround to update an RDD in place? For example, if I have the RDD as (p, (z[p], b[p], u[p], v[p])), can I update z[p] in memory?
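To make the layout concrete, here is a minimal PySpark sketch of one outer iteration using the combined (p, (b[p], z[p], u[p], v[p])) RDD. It is only a sketch: f1..f5, T, T_inner (the inner iteration count), sc, and the initial state RDD are stand-ins for the pseudocode above. Since RDDs are immutable, each step still produces a new RDD; caching the new state and unpersisting the old one keeps a single live copy, plus the transient copies made during the update.

# Sketch only: f1..f5, T, T_inner, sc and the initial `state` RDD are assumptions
# standing in for the pseudocode above.
# state: RDD of (p, (b_p, z_p, u_p, v_p)), cached so the arrays stay in executor memory.
for t in range(T):
    # Each partition: d[p] = f1(b[p], z[p], u[p])
    d = dict(state.mapValues(lambda s: f1(s[0], s[1], s[2])).collect())
    # Master: y = f2(d), shipped back to the partitions as a broadcast variable
    y = sc.broadcast(f2(d))

    def update(s):
        b_p, z_p, u_p, v_p = s
        u_p = f3(u_p, y.value)            # u[p] = f3(u[p], y)
        for _ in range(T_inner):          # inner iteration from the pseudocode
            z_p = f4(b_p, y.value, v_p)
            v_p = f5(z_p)
        return (b_p, z_p, u_p, v_p)

    new_state = state.mapValues(update).cache()
    new_state.count()                     # materialize before dropping the old copy
    state.unpersist()
    state = new_state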
Related
I have a Spark RDD; let's suppose it has 1000 elements and can be grouped into 10 groups. What I want to do is select 2 elements which meet my special requirement in each group, and then get a new RDD with 20 elements.
Suppose the RDD data is like:
((1,a1),
(1,a2),
(1,a3),
...
(1,a100),
(2,b1),
(2,b2),
(2,b3)
...
(2,b100))
what I want is
((1,a1),
(1,a99),
(2,b1),
(2,b99)
)
and I select a1, a99, b1, b99 with a function called my_func
I think the code may be something like:
myrdd.groupBy(x => x._1)....(my_func)...
Not convinced you need groupBy. Not sure of structure of RDD.
This uses my own contrived data, so you will need to adapt:
// Gen some data. My data. Adapt to yours.
val rdd = spark.sparkContext.parallelize(Seq((1, "x"), (2, "y"), (3, "z"), (4, "z"), (5, "bbb") ))
// Compare list.
val l = List("x", "y", "z")
// Function to filter, could be inline or via mapPartitions.
import org.apache.spark.rdd.RDD

def my_function(l: List[String], r: RDD[(Int, String)]) = {
  r.map(x => (x._1, x._2)).filter(x => l.contains(x._2))
}
// Run it all.
val rdd2 = my_function(l,rdd)
rdd2.collect
returns:
res24: Array[(Int, String)] = Array((1,x), (2,y), (3,z), (4,z))
I strongly discourage you from using groupBy() or even mapPartitions() for big datasets when you need to subsequently aggregate your data. The purpose of the RDD and MapReduce programming model is to distribute computations: computing the max/min/sum etc. in the driver or on a single node means using only the HDFS part of Spark.
Besides, there are many ways to perform your task, but trying to find one pattern that fits every type of aggregation you need is just wrong and will inevitably make your code inefficient.
Here is a possible PySpark solution for the problem you have:
rdd.reduceByKey(lambda x, y: x if x < y else y)\
.union(rdd.reduceByKey(lambda x, y: x if x > y else y)).sortByKey().collect()
In the first reduceByKey I find the smallest value for each key and in the second one the biggest value for each key. Then I can union them and, if necessary, sort the resulting RDD to obtain the result you showed us.
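For completeness, here is a minimal runnable sketch with toy data of my own (only the shape of the data matches yours):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Toy (group, value) pairs standing in for your data
rdd = sc.parallelize([(1, "a1"), (1, "a50"), (1, "a99"),
                      (2, "b1"), (2, "b50"), (2, "b99")])

smallest = rdd.reduceByKey(lambda x, y: x if x < y else y)   # min per key
largest = rdd.reduceByKey(lambda x, y: x if x > y else y)    # max per key

print(smallest.union(largest).sortByKey().collect())
# e.g. [(1, 'a1'), (1, 'a99'), (2, 'b1'), (2, 'b99')] (order within a key may vary)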
I have a list of spheres with some known characteristics (ids, radii, masses, and positions), where ids, radii, and masses are 1D arrays with shape (511,) and positions is a 2D array with shape (511, 3). They sit inside a big spherical volume with known center (0, 0, 0) and radius distance_max.
hal_ids_data = np.array([19895, 19896, ..., 24249])
hal_radiuss_data = np.array([1.047, 1.078, ..., 3.263])
hal_masss_data = np.array([2.427e+06, 8.268e+06, ..., 8.954e+07])
hal_positions_data = np.array([np.array([-33.78, 10.4, 33.83]), np.array([-33.61, 6.34, 35.64]), ..., np.array([-0.4014, 4.121, 33.05])])
I would like to randomly place these tiny spheres throughout the volume of the big sphere while keeping their individual characteristics intact, meaning only their positions need to be shuffled, subject to the two constraints shown below.
for hal_id, hal_position, hal_radius, hal_mass in zip(hal_ids_data, hal_positions_data, hal_radiuss_data, hal_masss_data):
    # check if the small sphere is 1) above some mass threshold AND 2) inside the big sphere
    if ((np.sqrt(pow(hal_position[0], 2) + pow(hal_position[1], 2) + pow(hal_position[2], 2)) < distance_max)
            and (np.log10(hal_mass) >= 8)):  # i.e. hal_mass >= 1e8
        # if so, then do the following stuff down here, but on the shuffled population of small
        # spheres meeting the conditions above rather than on the original population
What is the fastest and shortest way to shuffle my spheres under the last if statement before doing some stuff on them? (I do need my original population info for later use, so I cannot discard it.)
The best approach would be to compute your constraints in a vectorized format (which is very efficient in numpy) instead of using a for loop. Then generate an array of indexes that match your constraints, and then shuffle those indexes.
So using your example data above:
import numpy as np
distance_max = 49 #I chose this so that we have some matching items
hal_ids_data = np.array([19895, 19896, 24249])
hal_radius_data = np.array([1.047, 1.078, 3.263])
hal_mass_data = np.array([2.427e+06, 8.268e+06, 8.954e+07])
hal_positions_data = np.array([np.array([-33.78, 10.4, 33.83]), np.array([-33.61, 6.34, 35.64]), np.array([-0.4014, 4.121, 33.05])])
# Compute the conditions for every sphere at the same time instead of for loop
within_max = np.sqrt(pow(hal_positions_data[:,0],2) + pow(hal_positions_data[:,1],2) + pow(hal_positions_data[:,2],2)) < distance_max
mass_constraint = np.log10(hal_mass_data) >= 1 # I chose this so that we have some matching items
matched_spheres = within_max & mass_constraint
# Get indexes of matching spheres
idx = np.where(matched_spheres)[0] # create array of indexes
np.random.shuffle(idx) #shuffle array of indexes in place
# Generate shuffled data by applying the idx to the original arrays and saving to new 's_' arrays
s_hal_ids_data = hal_ids_data[idx]
s_hal_radius_data = hal_radius_data[idx]
s_hal_mass_data = hal_mass_data[idx]
s_hal_positions_data = hal_positions_data[idx]
# Do stuff with shuffled population of small spheres
Let Q be a distributed RowMatrix in Spark. I want to calculate the cross product of Q with its transpose Q'.
However, although a RowMatrix does have a multiply() method, it can only accept local matrices as an argument.
Code illustration (Scala):
val phi = new RowMatrix(phiRDD) // phiRDD is an instance of RDD[Vector]
val phiTranspose = transposeRowMatrix(phi) // transposeRowMatrix()
// returns the transpose of a RowMatrix
val crossMat = ? // phi * phiTranspose
Note that I want to perform the dot product of two distributed RowMatrices, not a distributed one with a local one.
One solution is to use an IndexedRowMatrix, as follows:
val phi = new IndexedRowMatrix(phiRDD) // phiRDD is an instance of RDD[IndexedRow]
val phiTranspose = transposeMatrix(phi) // transposeMatrix()
// returns the transpose of a Matrix
val crossMat = phi.toBlockMatrix().multiply(phiTranspose.toBlockMatrix()).toIndexedRowMatrix()
However, I want to use RowMatrix methods such as tallSkinnyQR(), and this means that I should transform crossMat to a RowMatrix using the .toRowMatrix() method:
val crossRowMat = crossMat.toRowMatrix()
and finally I can apply
crossRowMat.tallSkinnyQR()
but this process includes many transformations between the types of distributed matrices, and according to what I understood from the MLlib Programming Guide this is expensive:
It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive.
Would someone please elaborate?
The only distributed matrices which support matrix-matrix multiplication are BlockMatrices. You have to convert your data accordingly; artificial indices are good enough:
new IndexedRowMatrix(
rowMatrix.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))
).toBlockMatrix match { case m => m.multiply(m.transpose) }
I used the algorithm listed on this page, which turns the multiplication from a dot-product problem into a distributed scalar-product problem by using vector outer products:
The outer product between two vectors is the scalar product of the
second vector with all the elements in the first vector, resulting in
a matrix
My own multiplication function for RowMatrices (which could be optimized further) ended up like this:
import org.apache.spark.mllib.linalg.{DenseVector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

def multiplyRowMatrices(m1: RowMatrix, m2: RowMatrix)(implicit ctx: SparkSession): RowMatrix = {
  // Zip m1 columns with m2 rows
  val m1Cm2R = transposeRowMatrix(m1).rows.zip(m2.rows)

  // Apply the scalar product between each entry of the m1 column and the m2 row
  val scalar = m1Cm2R.map {
    case (column: DenseVector, row: DenseVector) => column.toArray.map {
      columnValue => row.toArray.map {
        rowValue => columnValue * rowValue
      }
    }
  }

  // Add all the resulting matrices point-wise
  val sum = scalar.reduce {
    case (matrix1, matrix2) => matrix1.zip(matrix2).map {
      case (array1, array2) => array1.zip(array2).map {
        case (value1, value2) => value1 + value2
      }
    }
  }

  new RowMatrix(ctx.sparkContext.parallelize(sum.map(array => Vectors.dense(array))))
}
After that I tested both approaches (my own function and the BlockMatrix conversion) using a 300x10 matrix on a single machine.
Using my own function:
val PhiMat = new RowMatrix(phi)
val TphiMat = transposeRowMatrix(PhiMat)
val product = multiplyRowMatrices(PhiMat,TphiMat)
Using matrix transformation:
val MatRow = new RowMatrix(phi)
val MatBlock = new IndexedRowMatrix(MatRow.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))).toBlockMatrix()
val TMatBlock = MatBlock.transpose
val productMatBlock = MatBlock.multiply(TMatBlock)
val productMatRow = productMatBlock.toIndexedRowMatrix().toRowMatrix()
The first approach spanned 1 job with 5 stages and took 2 s in total, while the second approach spanned 4 jobs, three with one stage and one with two stages, and took 0.323 s in total. The second approach also outperformed the first with respect to shuffle read/write size.
Yet I am still confused by the MLlib Programming Guide statement:
It is very important to choose the right format to store large and
distributed matrices. Converting a distributed matrix to a different
format may require a global shuffle, which is quite expensive.
I have a use case where I want to count the elements in an RDD that match some filter and those that do not.
e.g. RDD.filter(F1) and RDD.filter(!F1)
I have 2 options
Use accumulators: e.g.
LongAccumulator l1 = sparkContext.longAccumulator("Count1")
LongAccumulator l2 = sparkContext.longAccumulator("Count2")
RDD.foreachPartition(it -> {
    while (it.hasNext()) {
        if (F1(it.next())) l1.add(1);
        else l2.add(1);
    }
});
Use Count
RDD.filter(F1).count(); RDD.filter(!F1).count()
One benefit of the first approach is that we only need to iterate over the data once (useful since my data set is tens of TB).
What is the use of count() if the same effect can be achieved by using accumulators?
The major difference is that if your code fails inside a transformation, accumulators may still have been updated (Spark does not roll back accumulator updates from failed tasks), whereas the count() result will not be affected.
Another option is to use pure map-reduce:
val counts = rdd.map(x => (F1(x), 1)).reduceByKey(_ + _).collectAsMap()
The network cost should also be low, as only a few numbers will be sent. It creates pairs of (whether F1(x) is true or false, 1) and then sums all the ones; this gives you the number of items for both F1(x) and !F1(x) in the counts map.
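For reference, the same single-pass count in PySpark might look like this (rdd and F1 are the RDD and predicate from the question):

from operator import add

# counts both F1(x) == True and F1(x) == False in a single pass over the data
counts = rdd.map(lambda x: (F1(x), 1)).reduceByKey(add).collectAsMap()
# counts is a dict like {True: n_matching, False: n_not_matching}

# equivalently:
# counts = rdd.map(F1).countByValue()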
Given 1 Billion records containing following information:
ID x1 x2 x3 ... x100
1 0.1 0.12 1.3 ... -2.00
2 -1 1.2 2 ... 3
...
For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).
What's the best way to compute this?
As it happens, I have a solution to this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/
The gist of it is:
Use sklearn’s k-NN fit() method centrally
But then use sklearn’s k-NN kneighbors() method distributedly
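A minimal sketch of that idea, assuming an RDD of (id, feature_vector) pairs called vectorized_rdd and a SparkContext sc (both names are illustrative, not from the linked post):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fit centrally on the driver (only feasible if the feature matrix fits in memory)
pairs = vectorized_rdd.collect()
ids = [p[0] for p in pairs]
data = np.array([p[1] for p in pairs])

nbrs = NearestNeighbors(n_neighbors=11).fit(data)   # 11 = the point itself + 10 neighbours
bc_model = sc.broadcast(nbrs)
bc_ids = sc.broadcast(ids)

def query_partition(rows):
    model, id_lookup = bc_model.value, bc_ids.value
    for rec_id, features in rows:
        dist, idx = model.kneighbors([features])
        # drop the first hit (usually the record itself) and map matrix indices back to IDs
        yield rec_id, [(id_lookup[i], d) for i, d in zip(idx[0][1:], dist[0][1:])]

# Query distributedly: each executor works through its own partition of records
top10 = vectorized_rdd.mapPartitions(query_partition)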
Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of the k-Nearest Neighbors algorithm, such as the one provided by scikit-learn, then broadcast the resulting arrays of indices and distances and go from there.
Steps in this case would be:
1- Vectorize the features as Bryce suggested, and let your vectorizing method return a list (or numpy array) of floats with as many elements as your features.
2- Fit your scikit-learn NearestNeighbors model to your data:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
3- Run the trained model on your vectorized data (training and query data are the same in your case):
distances, indices = nbrs.kneighbors(qpa)
Steps 2 and 3 will run on your pyspark node and are not parallelizable in this case. You will need to have enough memory on this node. In my case with 1.5 Million records and 4 features, it took a second or two.
Until we get a good implementation of NN for Spark, I guess we will have to stick to these workarounds. If you would rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn
You haven't provided a lot of detail, but the general approach I would take to this problem would be to:
Convert the records to a data structure like a LabeledPoint, with ID as the label and (x1..x100) as the features
Map over each record and compare that record to all the other records (lots of room for optimization here)
Create some cutoff logic so that once you start comparing ID = 5 with ID = 1 you interrupt the computation because you have already compared ID = 1 with ID = 5
Some reduce step to get a data structure like {id_pair: [1,5], distance: 123}
Another map step to find the 10 closest neighbors of each record
You've identified pyspark, and I generally do this type of work using Scala, but some pseudocode for each step might look like:
# 1. vectorize the features
def vectorize_raw_data(record):
    arr_of_features = record[1:]        # x1 .. x100
    return LabeledPoint(record[0], arr_of_features)

# 2, 3 + 4: map over each record for comparison
broadcast_var = []

def calc_distance(record, comparison):
    # here you want to keep a broadcast variable with a list or dictionary of
    # already compared IDs and break if the key pair already exists;
    # then calc the Euclidean distance by mapping over the features of
    # the record, subtracting the values, squaring the result, keeping
    # a running sum of those squares and square-rooting that sum
    return {"id_pair": [1, 5], "distance": 123}

for record in allRecords:
    for comparison in allRecords:
        broadcast_var.append(calc_distance(record, comparison))

# 5. map for the 10 closest neighbors
def closest_neighbors(record, n=10):
    return broadcast_var.filter(lambda x: record.id in x["id_pair"]).takeOrdered(n, lambda x: x["distance"])
The pseudocode is terrible, but I think it communicates the intent. There will be a lot of shuffling and sorting here as you are comparing all records with all other records. IMHO, you want to store the keypair/distance in a central place (like a broadcast variable that gets updated, though this is dangerous) to reduce the total Euclidean distance calculations you perform.
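For something closer to runnable PySpark, a brute-force sketch of the same all-pairs idea (without the pair-deduplication or cutoff logic described above) might look like this; records, the (id, vector) layout and the heap-based top 10 are my assumptions:

import heapq

import numpy as np

# records: hypothetical RDD of (id, numpy array of the 100 features)
pairs = records.cartesian(records).filter(lambda p: p[0][0] != p[1][0])

# (id, (distance to other record, other id)) for every ordered pair
dists = pairs.map(lambda p: (p[0][0], (float(np.linalg.norm(p[0][1] - p[1][1])), p[1][0])))

# keep only the 10 smallest distances per id, without collecting all pairs per key
top10 = dists.aggregateByKey(
    [],
    lambda acc, v: heapq.nsmallest(10, acc + [v]),
    lambda a, b: heapq.nsmallest(10, a + b))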