Element-wise addition of RDDs in PySpark - apache-spark

Suppose you have two vectors of the same size that are stored as rdd1 and rdd2. Please write a function where the inputs are rdd1 and rdd2, and the output is an RDD that is the element-wise addition of rdd1 and rdd2. You should not load all data to the driver program.
Hint: You may use zip() in Spark, not the zip() in Python.
I do not understand what is wrong with the code below, or whether it is correct at all. When I run it, it takes forever. Would you be able to help me with this? Thanks.
spark = SparkSession(sc)
numPartitions = 10
rdd1 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[0]))
rdd2 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[1]))
def ele_wise_add(rdd1, rdd2):
    rdd3 = rdd1.zip(rdd2).map(lambda x,y: x + y)
    return rdd3
rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.collect())
rdd1 and rdd2 have 10000 numbers each, and below are the first 10 numbers in each.
rdd1 = [47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057]
rdd2 = [30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315]
expected output = [78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]

rdd1.zip(rdd2) creates a single tuple for each pair, so in the lambda function you only have x, not x and y. You'd want sum(x) or x[0] + x[1], not x + y.
rdd1 = spark.sparkContext.parallelize((47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057))
rdd2 = spark.sparkContext.parallelize((30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315))
rdd1.zip(rdd2).map(lambda x: sum(x)).collect()
[78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
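Applied back to the textFile-based RDDs from the question, the only change needed is in the lambda. A minimal sketch (variable and path names taken from the question); note that RDD.zip() requires both RDDs to have the same number of partitions and the same number of elements per partition, which holds here because both are derived from the same file with the same numPartitions:
def ele_wise_add(rdd1, rdd2):
    # each zipped record is a (v1, v2) tuple, so index into it
    return rdd1.zip(rdd2).map(lambda pair: pair[0] + pair[1])

rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.take(10))  # prefer take() over collect() to avoid pulling all data to the driver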

Related

How to map RDD function over each RDD in iterator returned by mapPartitions

I have a DataFrame with document ids doc_id, line ids for the set of lines in each document line_id, and a dense vector representation of each line in vectors. For each document (doc_id), I want to convert the set of vectors representing its lines into a mllib.linalg.distributed.BlockMatrix
It is relatively straightforward to convert the vectors of the entire DataFrame, or of the DataFrame filtered by doc_id, into a BlockMatrix by first converting the vectors into an RDD of ((numRows, numCols), DenseMatrix) tuples. A coded example of that is below.
However, I am having trouble converting the RDD of Iterator[((numRows, numCols), DenseMatrix)] returned by mapPartitions, which converted the vectors column for each doc_id partition, into a separate BlockMatrix for each doc_id partition.
My cluster has 3 worker nodes with 16 cores and 62 GB of memory each.
Imports and start spark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Row
from pyspark.sql import Window as W
from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import VectorUDT
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg import MatrixUDT
from pyspark.mllib.linalg.distributed import BlockMatrix
spark = (
    SparkSession.builder
    .master('yarn')
    .appName("linalg_test")
    .getOrCreate()
)
sc = spark.sparkContext  # used by RandomRDDs below
Create test dataframe
nRows = 25000
""" Create ids dataframe """
win = (W
    .partitionBy(F.col('doc_id'))
    .rowsBetween(W.unboundedPreceding, W.currentRow)
)
df_ids = (
    spark.range(0, nRows, 1)
    .withColumn('rand1', (F.rand(seed=12345) * 50).cast(T.IntegerType()))
    .withColumn('doc_id', F.floor(F.col('rand1') / 3).cast(T.IntegerType()))
    .withColumn('int', F.lit(1))
    .withColumn('line_id', F.sum(F.col('int')).over(win))
    .select('id', 'doc_id', 'line_id')
)
""" Create vector dataframe """
df_vecSchema = T.StructType([
    T.StructField('vectors', T.StructType([T.StructField('vectors', VectorUDT())])),
    T.StructField('id', T.LongType())
])
vecDim = 50
df_vec = (
    spark.createDataFrame(
        RandomRDDs.normalVectorRDD(sc, numRows=nRows, numCols=vecDim, seed=54321)
        .map(lambda x: Row(vectors=Vectors.dense(x),))
        .zipWithIndex(), schema=df_vecSchema)
    .select('id', 'vectors.*')
)
""" Create final test dataframe """
df_SO = (
    df_ids.join(df_vec, on='id', how='left')
    .select('doc_id', 'line_id', 'vectors')
    .orderBy('doc_id', 'line_id')
)
numDocs = df_SO.agg(F.countDistinct(F.col('doc_id'))).collect()[0][0]
# numDocs = df_SO.groupBy('doc_id').agg(F.count(F.col('line_id'))).count()
df_SO = df_SO.repartition(numDocs, 'doc_id')
RDD functions to create matrices out of Vector column
def vec2mat(row):
    return (
        (row.line_id - 1, 0),
        Matrices.dense(1, vecDim, row.vectors.toArray().tolist()),
    )
create dense matrix out of each line_id vector
mat = df_SO.rdd.map(vec2mat)
create distributed BlockMatrix from RDD of DenseMatrix
blk_mat = BlockMatrix(mat, 1, vecDim)
check output
blk_mat
<pyspark.mllib.linalg.distributed.BlockMatrix at 0x7fe1da370a50>
blk_mat.blocks.take(1)
[((273, 0),
DenseMatrix(1, 50, [1.749, -1.4873, -0.3473, 0.716, 2.3916, -1.5997, -1.7035, 0.0105, ..., -0.0579, 0.3074, -1.8178, -0.2628, 0.1979, 0.6046, 0.4566, 0.4063], 0))]
Problem
I cannot get the same thing to work after converting each partition of doc_id with mapPartitions. The mapPartitions function works, but I cannot get the RDD that it returns converted into a BlockMatrix.
RDD function to create dense matrix out of each line_id vector separately for each doc_id partition
def vec2mat_p(iter):
    yield [((row.line_id - 1, 0),
            Matrices.dense(1, vecDim, row.vectors.toArray().tolist()),)
           for row in iter]
create dense matrix out of each line_id vector separately for each doc_id partition
mat_doc = df_SO.rdd.mapPartitions(vec2mat_p, preservesPartitioning=True)
Check
mat_doc
PythonRDD[4991] at RDD at PythonRDD.scala:48
mat_doc.take(1)
[[((0, 0),
DenseMatrix(1, 50, [1.814, -1.1681, -2.1887, -0.5371, -0.7509, 2.3679, 0.2795, 1.4135, ..., -0.3584, 0.5059, -0.6429, -0.6391, 0.0173, 1.2109, 1.804, -0.9402], 0)),
((1, 0),
DenseMatrix(1, 50, [0.3884, -1.451, -0.0431, -0.4653, -2.4541, 0.2396, 1.8704, 0.8471, ..., -2.5164, 0.1298, -1.2702, -0.1286, 0.9196, -0.7354, -0.1816, -0.4553], 0)),
((2, 0),
DenseMatrix(1, 50, [0.1382, 1.6753, 0.9563, -1.5251, 0.1753, 0.9822, 0.5952, -1.3924, ..., 0.9636, -1.7299, 0.2138, -2.5694, 0.1701, 0.2554, -1.4879, -1.6504], 0)),
...]]
Check types
(mat_doc
    .filter(lambda p: len(p) > 0)
    .map(lambda mlst: [(type(m[0]), (type(m[0][0]), type(m[0][1])), type(m[1])) for m in mlst])
    .first()
)
[(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
(tuple, (int, int), pyspark.mllib.linalg.DenseMatrix),
...]
Seems correct, however, running:
(mat_doc
    .filter(lambda p: len(p) > 0)
    .map(lambda mlst: [BlockMatrix((m[0], m[1])[0], 1, vecDim) for m in mlst])
    .first()
)
results in the following type error:
TypeError: blocks should be an RDD of sub-matrix blocks as ((int, int), matrix) tuples, got
Unfortunately, the error stops short and does not tell me what it 'got'.
Also, I cannot call sc.parallelize() inside of a map() call.
How do I convert each item in the RDD iterator that mapPartitions returns into an RDD that BlockMatrix will accept?
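One possible direction (a sketch of my own, not from the original post, reusing df_SO, vecDim and the imports above): BlockMatrix needs an RDD of ((int, int), matrix) tuples, not an RDD whose elements are lists of such tuples, so one workaround is to emit the blocks individually, keyed by doc_id, and then build one BlockMatrix per document by filtering that flat RDD:
def vec2mat_doc(iter):
    # sketch: emit one ((row, col), DenseMatrix) block per line, tagged with its doc_id
    for row in iter:
        yield (row.doc_id,
               ((row.line_id - 1, 0),
                Matrices.dense(1, vecDim, row.vectors.toArray().tolist())))

tagged_blocks = df_SO.rdd.mapPartitions(vec2mat_doc, preservesPartitioning=True)

doc_ids = [r.doc_id for r in df_SO.select('doc_id').distinct().collect()]
blk_mats = {
    d: BlockMatrix(tagged_blocks.filter(lambda kv, d=d: kv[0] == d).values(), 1, vecDim)
    for d in doc_ids
}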

Spark RDD: lookup from other RDD

I am trying to perform the quickest lookup possible in Spark, as part of some practice rolling-my-own association rules module. Please note that I know the metric below, confidence, is supported in PySpark. This is just an example -- another metric, lift, is not supported, yet I intend to use the results from this discussion to develop that.
As part of calculating the confidence of a rule, I need to look at how often the antecedent and consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd).
from itertools import combinations, chain
def powerset(iterable, no_empty=True):
    ''' Produce the powerset for a given iterable '''
    s = list(iterable)
    combos = (combinations(s, r) for r in range(len(s)+1))
    powerset = chain.from_iterable(combos)
    return (el for el in powerset if el) if no_empty else powerset
# Set-up transaction set
rdd = sc.parallelize(
    [
        ('a',),
        ('a', 'b'),
        ('a', 'b'),
        ('b', 'c'),
        ('a', 'c'),
        ('a', 'b'),
        ('b', 'c'),
        ('c',),
        ('b',),
    ]
)
# Create an RDD with the counts of each
# possible itemset
counts = (
    rdd
    .flatMap(lambda x: powerset(x))
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .map(lambda x: (frozenset(x[0]), x[1]))
)
# Function to calculate confidence of a rule
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))
confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
For those familiar with this type of lookup problem, you'll know that this type of Exception is raised:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
One way to get around this exception is to convert counts to a dictionary:
counts = dict(counts.collect())
confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])
confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
This gives me my result, but running counts.collect() is very expensive, since in reality I have a dataset with 50M+ records. Is there a better option for performing this type of lookup?
If your target metric can be independently calculated on each RDD partition and then combined to achieve the target result, you can use mapPartitions instead of map when calculating your metric.
The generic flow should be something like:
from functools import reduce

partials = (
    rdd
    # apply your metric calculation independently on each partition
    .mapPartitions(confidence_partial)
    # collect the per-partition results into a single list on the driver
    .collect()
)
# reduce the list to combine the results calculated on each partition
metric_result = reduce(confidence_combine, partials)
Both confidence_partial and confidence_combine are regular Python functions that take an iterator/list as input.
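A rough illustration of what those two functions could look like (my own sketch, not from the answer, reusing the powerset helper from the question): count itemset frequencies within each partition, merge the partial counts, and compute the confidences afterwards on the driver with plain dictionary lookups.
from collections import Counter

def confidence_partial(transactions):
    # count itemset occurrences within this partition only
    local = Counter()
    for t in transactions:
        for itemset in powerset(t):
            local[frozenset(itemset)] += 1
    yield local

def confidence_combine(c1, c2):
    # merge two partial Counters into one
    c1.update(c2)
    return c1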
As an aside, you would probably get a huge performance boost by using the DataFrame API and native expression functions to calculate your metric.

Creating a new rdd from two different rdds

I have two RDDs as follows:
rdd1=sc.parallelize([(('a','b'),10),(('c','d'),20)])
rdd2=sc.parallelize([('a',2),('b',3),('c',4)])
I need to make a new RDD as follows: the value for ('a', 'b') => value(a, b) / value(a) => 10 / 2
[(('a','b'), 5.0), (('c','d'), 5.0)]
Your requirement says that you want the value in rdd1 divided by the value from rdd2 whose key matches the first element of the rdd1 key.
If my understanding is correct, then your requirement can be fulfilled by doing the following, where rdd1 is transformed to make that first element the key so that a join between the two RDDs can be performed.
rdd1.map(lambda x: (x[0][0], x)).join(rdd2).map(lambda x: (x[1][0][0], float(x[1][0][1]) / x[1][1]))
#[(('a', 'b'), 5.0), (('c', 'd'), 5.0)]

Spark RDD Sampling, faster with or without replacement?

In general, all other things being equal, which would be expected to run more quickly?
val a = myRDD.sample(true, 0.01)
val b = myRDD.sample(false, 0.01)

Find sum of second values in key/value pair

I just started with PySpark.
I have key/value pairs like the following: (key, (value1, value2)).
I'd like to find the sum of value2 for each key.
example of input data
(22, (33, 17.0)),(22, (34, 15.0)),(20, (3, 5.5)),(20, (11, 0.0))
Thanks !
In the end I created a new RDD containing only (key, value2), then just summed the values of the new RDD:
sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1]))\
    .groupByKey().mapValues(sum).collect()
If you would like to benefit from a combiner, this would be a better choice.
from operator import add
sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1])).reduceByKey(add)
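A quick check against the sample input from the question (a sketch; the order of the collected pairs may vary by partitioning):
rdd = sc.parallelize([(22, (33, 17.0)), (22, (34, 15.0)), (20, (3, 5.5)), (20, (11, 0.0))])
sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1])).reduceByKey(add)
print(sumRdd.collect())  # e.g. [(22, 32.0), (20, 5.5)]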
