Creating a new rdd from two different rdds - python-3.x

I have two RDDs as follows:
rdd1=sc.parallelize([(('a','b'),10),(('c','d'),20)])
rdd2=sc.parallelize([('a',2),('b',3),('c',4)])
I need to make a new RDD as follows, where the value for ('a', 'b') is value(a, b) / value(a) => 10 / 2:
[(('a','b'), 5.0), (('c','d'), 5.0)]

Your requirement says that you want the value in rdd1 divided by the value from rdd2 whose key matches the first element of the rdd1 key.
If my understanding is correct, this can be achieved by transforming rdd1 so that the first element becomes the key, which allows a join between the two RDDs to be performed.
rdd1.map(lambda x: (x[0][0], x)).join(rdd2).map(lambda x: (x[1][0][0], float(x[1][0][1]/x[1][1])))
#[(('a', 'b'), 5.0), (('c', 'd'), 5.0)]
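For readability, here is the same pipeline as a step-by-step sketch (the intermediate structures are shown in the comments; the order of the collected pairs may vary):
# Key rdd1 by the first element of its composite key: ('a', (('a', 'b'), 10)), ...
keyed = rdd1.map(lambda x: (x[0][0], x))

# Join with rdd2 on that key: ('a', ((('a', 'b'), 10), 2)), ...
joined = keyed.join(rdd2)

# Restore the composite key and divide the two values
result = joined.map(lambda x: (x[1][0][0], float(x[1][0][1]) / x[1][1]))

print(result.collect())  # [(('a', 'b'), 5.0), (('c', 'd'), 5.0)]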

Related

Element-wise addition of RDDs in PySpark

Suppose you have two vectors of the same size that are stored as rdd1 and rdd2. Please write a function where the inputs are rdd1 and rdd2, and the output is an RDD which is the element-wise addition of rdd1 and rdd2. You should not load all data to the driver program.
Hint: You may use zip() in Spark, not the zip() in Python.
I do not understand what is wrong with the code below, or whether it is correct. When I run it, it takes forever. Would you be able to help me with this? Thanks.
spark = SparkSession(sc)
numPartitions = 10
rdd1 = sc.textFile('./dataSet/points.txt', numPartitions).map(lambda x: int(x.split()[0]))
rdd2 = sc.textFile('./dataSet/points.txt', numPartitions).map(lambda x: int(x.split()[1]))

def ele_wise_add(rdd1, rdd2):
    rdd3 = rdd1.zip(rdd2).map(lambda x, y: x + y)
    return rdd3

rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.collect())
rdd1 and rdd2 have 10,000 numbers each; below are the first 10 numbers in each.
rdd1 = [47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057]
rdd2 = [30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315]
expected output = [78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
rdd1.zip(rdd2) creates a single tuple for each pair, so when writing the lambda function you only have x, not y. You'd want sum(x) or x[0] + x[1], not x + y.
rdd1 = spark.sparkContext.parallelize((47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057))
rdd2 = spark.sparkContext.parallelize((30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315))
rdd1.zip(rdd2).map(lambda x: sum(x)).collect()
[78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
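Applied to the original function from the question, a minimal corrected sketch (note that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which should hold here since both are derived from the same file with the same partition count):
def ele_wise_add(rdd1, rdd2):
    # zip pairs up corresponding elements, so the mapped function
    # receives a single tuple (x1, x2) rather than two arguments
    return rdd1.zip(rdd2).map(lambda pair: pair[0] + pair[1])

rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.take(10))  # [78075, 154137, 184577, ...]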

Can I chain groupByKey calls on pair_rdd in Pyspark?

Is it possible in Pyspark to chain a groupByKey() call on a pair_rdd twice?
I have two levels of keys I want to group by before I aggregate by creating a special list of all values.
Here's my code. The first groupByKey() call groups by the outer key; the result is then given to a map function in which I hope to turn the ResultIterable object into a pair_rdd again so I can do the second groupByKey() and map my function over it.
(Since I'm reducing I guess I could also use reduceByKey() there?)
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("test")\
    .master("local")\
    .config('spark.sql.shuffle.partitions', '4')\
    .getOrCreate()
sc = spark.sparkContext

def group_by(ws):
    L = ws[0]
    E = ...ws[1]...  # <-- Do something here to turn this from ResultIterable to pair RDD
    rr = E.groupByKey().map(output_lists)
    return (L, rr)

def output_lists(ws):
    el = [e[0] for e in ws[1]]
    res = [ws[0]] + el
    return (ws[0], res)

input_data = (('A', ('G', ('xyz',))),
              ('A', ('G', ('xys',))),
              ('A', ('H', ('asd',))),
              ('B', ('K', ('qwe',))),
              ('B', ('K', ('wer',))))
data = sc.parallelize(input_data)
data = data.groupByKey().map(group_by)
print(data.take(5))
Now, is this even doable, or do I need a different approach?
I know of two other workarounds:
Concatenate both keys into one.
Use a SparkSQL dataframe.
But I'm curious if there is a way with the above approach as I'm still learning Spark.
I found out I can use tuples as keys in pair RDDs. Remapping my input data like this means only one groupByKey() is needed and the problem can be solved:
input_data = ((('A', 'G'), 'xyz'),
              (('A', 'G'), 'xys'),
              (('A', 'H'), 'asd'),
              (('B', 'K'), 'qwe'),
              (('B', 'K'), 'wer'))
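A minimal sketch of the single-groupByKey() version built on this remapped input (assuming the goal is simply to collect all values per (outer, inner) key pair):
data = sc.parallelize(input_data)
# One shuffle: tuple keys let a single groupByKey handle both levels
grouped = data.groupByKey().mapValues(list)
print(grouped.collect())
# e.g. [(('A', 'G'), ['xyz', 'xys']), (('A', 'H'), ['asd']), (('B', 'K'), ['qwe', 'wer'])]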

PySpark split DataFrame into multiple frames based on a column key and train an ML lib model on each

I have a PySpark dataframe with a column "group". I also have feature columns and a label column. I want to split the dataframe for each group and then train a model and end up with a dictionary where the keys are the "group" names and the values are the trained models.
This question essentially gives an answer to this problem, but the method is inefficient.
The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation.
The answer is old and I am hoping there have been improvements in PySpark since then. For my use case I have 10k groups, with heavy skew in the data sizes. The largest group can have 1 Billion records and the smallest group can have 1 record.
Edit: As suggested, here is a small reproducible example.
df = spark.createDataFrame(
    [
        ('A', 1, 0, True),
        ('A', 3, 0, False),
        ('B', 2, 2, True),
        ('B', 3, 3, True),
        ('B', 5, 2, False)
    ],
    ('group', 'feature_1', 'feature_2', 'label')
)
I can split the data as suggested in the above link:
from itertools import chain
from pyspark.sql.functions import col

groups = chain(*df.select("group").distinct().collect())
df_by_group = {group: train_model(df.where(col("group").eqNullSafe(group)))
               for group in groups}
Where train_model is a function that takes a dataframe with columns=[feature_1, feature_2, label] and returns a trained model on that dataframe.
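For completeness, a hypothetical train_model could look like the sketch below (the question does not specify a model, so LogisticRegression is only a placeholder; the boolean label is cast to double so pyspark.ml accepts it):
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

def train_model(group_df):
    # Hypothetical trainer: assemble the two feature columns and fit a classifier
    assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
    prepared = assembler.transform(group_df.withColumn("label", col("label").cast("double")))
    return LogisticRegression(featuresCol="features", labelCol="label").fit(prepared)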

Spark RDD: lookup from other RDD

I am trying to perform the quickest lookup possible in Spark, as part of some practice rolling-my-own association rules module. Please note that I know the metric below, confidence, is supported in PySpark. This is just an example -- another metric, lift, is not supported, yet I intend to use the results from this discussion to develop that.
As part of calculating the confidence of a rule, I need to look at how often the antecedent and consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd).
from itertools import combinations, chain

def powerset(iterable, no_empty=True):
    ''' Produce the powerset for a given iterable '''
    s = list(iterable)
    combos = (combinations(s, r) for r in range(len(s) + 1))
    powerset = chain.from_iterable(combos)
    return (el for el in powerset if el) if no_empty else powerset
# Set up the transaction set
rdd = sc.parallelize(
    [
        ('a',),
        ('a', 'b'),
        ('a', 'b'),
        ('b', 'c'),
        ('a', 'c'),
        ('a', 'b'),
        ('b', 'c'),
        ('c',),
        ('b',),
    ]
)
# Create an RDD with the counts of each possible itemset
counts = (
    rdd
    .flatMap(lambda x: powerset(x))
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .map(lambda x: (frozenset(x[0]), x[1]))
)
# Function to calculate the confidence of a rule
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))

confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
Those familiar with this type of lookup problem will know that the following exception is raised:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
One way to get around this exception is to convert counts to a dictionary:
counts = dict(counts.collect())
confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])
confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
This gives me my result, but running counts.collect() is very expensive, since in reality I have a dataset with 50M+ records. Is there a better option for performing this type of lookup?
If your target metric can be independently calculated on each RDD partition and then combined to achieve the target result, you can use mapPartitions instead of map when calculating your metric.
The generic flow should be something like:
from functools import reduce

partial_results = (
    rdd
    # apply your metric calculation independently on each partition
    .mapPartitions(confidence_partial)
    # collect the per-partition results into a single list on the driver
    .collect()
)
# reduce the list to combine the metrics calculated on each partition
metric_result = reduce(confidence_combine, partial_results)
Both confidence_partial and confidence_combine are regular Python functions that take an iterator/list as input.
As an aside, you would probably get a huge performance boost by using the DataFrame API and native expression functions to calculate your metric.
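For example, a minimal sketch in which only the itemset counting is distributed, reusing the powerset() helper and rdd from the question; the final confidence ratios would then be computed on the driver from the merged counts (confidence_partial and confidence_combine here are the hypothetical helpers named above):
from collections import Counter
from functools import reduce

def confidence_partial(partition):
    # Count every itemset produced by powerset() within a single partition
    local_counts = Counter()
    for transaction in partition:
        for itemset in powerset(transaction):
            local_counts[frozenset(itemset)] += 1
    yield local_counts  # mapPartitions expects an iterable of results

def confidence_combine(left, right):
    # Merge two per-partition counters into one
    left.update(right)
    return left

global_counts = reduce(confidence_combine, rdd.mapPartitions(confidence_partial).collect())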

Find sum of second values in key/value pair

I just started with PySpark.
I have key/value pairs like the following: (key, (value1, value2)).
I'd like to find the sum of value2 for each key.
Example input data:
(22, (33, 17.0)),(22, (34, 15.0)),(20, (3, 5.5)),(20, (11, 0.0))
Thanks!
In the end I created a new RDD containing only (key, value2), then just summed the values of the new RDD:
sumRdd = rdd.map(lambda x: (x[0], x[1][1]))\
    .groupByKey().mapValues(sum).collect()
If you would like to benefit from a combiner, this would be a better choice:
from operator import add

sumRdd = rdd.map(lambda x: (x[0], x[1][1])).reduceByKey(add)
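For the example input above, a quick check of the reduceByKey version (ordering of the collected pairs may vary):
rdd = sc.parallelize([(22, (33, 17.0)), (22, (34, 15.0)), (20, (3, 5.5)), (20, (11, 0.0))])
sumRdd = rdd.map(lambda x: (x[0], x[1][1])).reduceByKey(add)
print(sumRdd.collect())  # e.g. [(22, 32.0), (20, 5.5)]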
