Can I chain groupByKey calls on pair_rdd in Pyspark? - apache-spark

Is it possible in Pyspark to chain a groupByKey() call on a pair_rdd twice?
I have two levels of keys that I want to group by before aggregating the values into a custom list.
Here's my code. The first groupByKey() call groups by the outer key; its result is then passed to a map function in which I hope to turn the ResultIterable object back into a pair RDD so that I can call groupByKey() a second time and map my aggregation function over it.
(Since I'm reducing I guess I could also use reduceByKey() there?)
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("test")\
    .master("local")\
    .config('spark.sql.shuffle.partitions', '4')\
    .getOrCreate()
sc = spark.sparkContext

def group_by(ws):
    L = ws[0]
    E = ...ws[1]...  # <-- Do something here to turn this from ResultIterable to a pair RDD
    rr = E.groupByKey().map(output_lists)
    return (L, rr)

def output_lists(ws):
    el = [e[0] for e in ws[1]]
    res = [ws[0]] + el
    return (ws[0], res)

input_data = (('A', ('G', ('xyz',))),
              ('A', ('G', ('xys',))),
              ('A', ('H', ('asd',))),
              ('B', ('K', ('qwe',))),
              ('B', ('K', ('wer',))))

data = sc.parallelize(input_data)
data = data.groupByKey().map(group_by)
print(data.take(5))
Now, is this even doable, or do I need a different approach?
I know of two other workarounds:
Concatenate both keys into one.
Use a SparkSQL dataframe.
But I'm curious if there is a way with the above approach as I'm still learning Spark.

I found out I can use tuples as keys in pair RDDs. Remapping my input data like this means only one groupByKey() is needed and the problem can be solved:
input_data = ((('A', 'G'), 'xyz'),
              (('A', 'G'), 'xys'),
              (('A', 'H'), 'asd'),
              (('B', 'K'), 'qwe'),
              (('B', 'K'), 'wer'))
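For completeness, here is a minimal sketch of the single grouping pass with the tuple keys; mapValues(list) is just one possible aggregation, and the order of the collected results may vary:

grouped = sc.parallelize(input_data).groupByKey().mapValues(list)
print(grouped.collect())
# e.g. [(('A', 'G'), ['xyz', 'xys']), (('A', 'H'), ['asd']), (('B', 'K'), ['qwe', 'wer'])]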

Related

PySpark split DataFrame into multiple frames based on a column key and train an ML lib model on each

I have a PySpark dataframe with a column "group". I also have feature columns and a label column. I want to split the dataframe for each group and then train a model and end up with a dictionary where the keys are the "group" names and the values are the trained models.
This question essentially gives an answer to this problem, but the method is inefficient.
The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation.
The answer is old and I am hoping there have been improvements in PySpark since then. For my use case I have 10k groups, with heavy skew in the data sizes. The largest group can have 1 billion records and the smallest group can have a single record.
Edit: As suggested, here is a small reproducible example.
df = spark.createDataFrame(
    [
        ('A', 1, 0, True),
        ('A', 3, 0, False),
        ('B', 2, 2, True),
        ('B', 3, 3, True),
        ('B', 5, 2, False)
    ],
    ('group', 'feature_1', 'feature_2', 'label')
)
I can split the data as suggested in the above link:
from itertools import chain
from pyspark.sql.functions import col

groups = chain(*df.select("group").distinct().collect())
df_by_group = {group: train_model(df.where(col("group").eqNullSafe(group)))
               for group in groups}
where train_model is a function that takes a dataframe with the columns [feature_1, feature_2, label] and returns a model trained on that dataframe.
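train_model is not shown in the question; purely for illustration, a hypothetical version might look like the sketch below, with logistic regression standing in as a placeholder estimator:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import col

def train_model(group_df):
    # pyspark.ml expects a single vector column of features and a numeric label
    assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
    prepared = assembler.transform(group_df).withColumn("label_num", col("label").cast("double"))
    # placeholder estimator; swap in whatever model the use case actually needs
    lr = LogisticRegression(featuresCol="features", labelCol="label_num")
    return lr.fit(prepared)

Note this still does one filtered pass over the data per group, so it does not by itself remove the scalability problem described above.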

Using reduceByKey method in Pyspark to update a dictionary

I have the following rdd data.
[(13, 'Munich#en'), (13, 'Munchen#de'), (14, 'Vienna#en'), (14, 'Wien#de'),(15, 'Paris#en')]
I want to combine the above rdd using the reduceByKey method so that it produces the following output, i.e. the entries are joined into a dictionary keyed by each entry's language.
[
    (13, {'en': 'Munich', 'de': 'Munchen'}),
    (14, {'en': 'Vienna', 'de': 'Wien'}),
    (15, {'en': 'Paris', 'de': ''})
]
The examples I have seen for reduceByKey all use numerical operations such as addition, so I am not sure how to go about updating a dictionary in each reduce step.
This is my code:
rd0 = sc.parallelize(
    [(13, 'munich#en'), (13, 'munchen#de'), (14, 'Vienna#en'), (14, 'Wien#de'), (15, 'Paris#en')]
)

def updateDict(x, xDict):
    xDict[x[:-3]] = x[-2:]

rd0.map(lambda x: (x[0], (x[1], {'en': '', 'de': ''}))).reduceByKey(updateDict).collect()
I am getting the following error message but not sure what I am doing wrong.
return f(*args, **kwargs)
File "<ipython-input-209-16cfa907be76>", line 2, in ff
TypeError: 'tuple' object does not support item assignment
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
There are some problems with your code - for instance, updateDict does not return a value, and the function passed to reduceByKey must accept two values of the same type and return a value of that same type. Here is a different approach:
First, map the values into dictionaries. One way is to split on "#", reverse, and pass the result into the dict constructor.
rd1 = rd0.mapValues(lambda x: dict([reversed(x.split("#"))]))
print(rd1.collect())
#[(13, {'en': 'munich'}),
# (13, {'de': 'munchen'}),
# (14, {'en': 'Vienna'}),
# (14, {'de': 'Wien'}),
# (15, {'en': 'Paris'})]
Now you can call reduceByKey and merge the two dictionaries. Finally add in the missing keys with a dictionary comprehension over the required keys, defaulting to empty string if the key is missing.
def merge_two_dicts(x, y):
    # from https://stackoverflow.com/a/26853961/5858851
    # works for python 2 and 3
    z = x.copy()    # start with x's keys and values
    z.update(y)     # modifies z with y's keys and values & returns None
    return z

rd2 = rd1.reduceByKey(merge_two_dicts)\
    .mapValues(lambda x: {k: x.get(k, '') for k in ['en', 'de']})
print(rd2.collect())
#[(14, {'de': 'Wien', 'en': 'Vienna'}),
# (13, {'de': 'munchen', 'en': 'munich'}),
# (15, {'de': '', 'en': 'Paris'})]
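If the copy made on every merge ever becomes a concern, a hedged alternative (not part of the original answer) is aggregateByKey, which starts each key from an empty dict and folds the raw strings in directly:

def add_entry(d, v):
    # fold one 'name#lang' string into the per-key dictionary
    name, lang = v.split('#')
    d[lang] = name
    return d

def merge_dicts(d1, d2):
    d1.update(d2)
    return d1

rd3 = rd0.aggregateByKey({}, add_entry, merge_dicts)\
    .mapValues(lambda d: {k: d.get(k, '') for k in ['en', 'de']})
print(rd3.collect())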

Spark RDD: lookup from other RDD

I am trying to perform the quickest lookup possible in Spark, as part of some practice rolling-my-own association rules module. Please note that I know the metric below, confidence, is supported in PySpark. This is just an example -- another metric, lift, is not supported, yet I intend to use the results from this discussion to develop that.
As part of calculating the confidence of a rule, I need to look at how often the antecedent and consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd); that is, confidence(A => B) = count(A and B together) / count(A).
from itertools import combinations, chain

def powerset(iterable, no_empty=True):
    ''' Produce the powerset for a given iterable '''
    s = list(iterable)
    combos = (combinations(s, r) for r in range(len(s)+1))
    powerset = chain.from_iterable(combos)
    return (el for el in powerset if el) if no_empty else powerset

# Set-up transaction set
rdd = sc.parallelize(
    [
        ('a',),
        ('a', 'b'),
        ('a', 'b'),
        ('b', 'c'),
        ('a', 'c'),
        ('a', 'b'),
        ('b', 'c'),
        ('c',),
        ('b',),
    ]
)

# Create an RDD with the counts of each
# possible itemset
counts = (
    rdd
    .flatMap(lambda x: powerset(x))
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .map(lambda x: (frozenset(x[0]), x[1]))
)

# Function to calculate confidence of a rule
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))

confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
For those familiar with this type of lookup problem, you'll know that this type of Exception is raised:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
One way to get around this exception is to convert counts to a dictionary:
counts = dict(counts.collect())
confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])
confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
This gives me my result, but the process of running counts.collect() is very expensive, since in reality I have a dataset with 50m+ records. Is there a better option for performing this type of lookup?
If your target metric can be independently calculated on each RDD partition and then combined to achieve the target result, you can use mapPartitions instead of map when calculating your metric.
The generic flow should be something like:
from functools import reduce

# apply your metric calculation independently on each partition,
# then collect the per-partition results into a single list
partition_results = (
    rdd
    .mapPartitions(confidence_partial)
    .collect()
)

# reduce the list to combine the metrics calculated on each partition
metric_result = reduce(confidence_combine, partition_results)
Both confidence_partial and confidence_combine are regular Python functions that take an iterator/list as input.
As an aside, you would probably get a huge performance boost by using the DataFrame API and native expression functions to calculate your metric.
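As a rough, illustrative sketch of that idea (not a drop-in implementation): reuse the powerset() defined above through a UDF for the itemset expansion, keep the counting itself native, and assume spark is an active SparkSession; the variable names are made up for the example.

from pyspark.sql import functions as F

# one row per transaction, with the items held in an array column
baskets = spark.createDataFrame(rdd.map(lambda t: (list(t),)), ['items'])

# expand each basket into its non-empty itemsets, then count them natively
powerset_udf = F.udf(lambda items: [list(s) for s in powerset(items)],
                     'array<array<string>>')
itemset_counts = (
    baskets
    .select(F.explode(powerset_udf('items')).alias('itemset'))
    .groupBy('itemset')
    .count()
)
itemset_counts.show()

The confidence of a rule could then be derived by joining itemset_counts with itself rather than by per-row lookups.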

Creating a new rdd from two different rdds

I have two rdds as follows:
rdd1=sc.parallelize([(('a','b'),10),(('c','d'),20)])
rdd2=sc.parallelize([('a',2),('b',3),('c',4)])
I need to make a new rdd as follows: the value for ('a', 'b') becomes value(('a', 'b')) / value('a'), i.e. 10 / 2:
[(('a','b'), 5.0), (('c','d'), 5.0)]
Your requirement says that you want the number in rdd1 divided by the value from rdd2 whose key matches the first element of the rdd1 key.
If my understanding is correct, then your requirement can be fulfilled by doing the following, where rdd1 is transformed to make that first element the key so that a join between the two rdds can be performed.
rdd1.map(lambda x: (x[0][0], x)).join(rdd2).map(lambda x: (x[1][0][0], float(x[1][0][1]/x[1][1])))
#[(('a', 'b'), 5.0), (('c', 'd'), 5.0)]
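The same join, split into named steps purely to make the intermediate shapes visible (a restatement of the one-liner above, not a different method):

keyed = rdd1.map(lambda x: (x[0][0], x))    # ('a', (('a', 'b'), 10))
joined = keyed.join(rdd2)                   # ('a', ((('a', 'b'), 10), 2))
result = joined.map(lambda x: (x[1][0][0], float(x[1][0][1]) / x[1][1]))
print(result.collect())
# [(('a', 'b'), 5.0), (('c', 'd'), 5.0)]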

Find sum of second values in key/value pair

I just started with PySpark.
I have key/value pairs like the following: (key, (value1, value2)).
I'd like to find the sum of value2 for each key.
Example of input data:
(22, (33, 17.0)),(22, (34, 15.0)),(20, (3, 5.5)),(20, (11, 0.0))
Thanks!
In the end I created a new RDD containing only (key, value2) pairs, then just summed the values of the new RDD:
sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1]))\
    .groupByKey().mapValues(sum).collect()
If you would like to benefit from a combiner, this would be a better choice:
from operator import add

sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1])).reduceByKey(add)
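As a quick check, assuming the sample input above is parallelized into rdd (the order of the collected pairs may vary):

rdd = sc.parallelize([(22, (33, 17.0)), (22, (34, 15.0)), (20, (3, 5.5)), (20, (11, 0.0))])
sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1])).reduceByKey(add)
print(sumRdd.collect())
# [(22, 32.0), (20, 5.5)]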
