Is it possible in PySpark to chain a groupByKey() call on a pair RDD twice?
I have two levels of keys I want to group by before I aggregate by creating a special list of all values.
Here's my code. The first groupByKey() call groups by the outer key; the result is then passed to a map function in which I hope to turn the ResultIterable object into a pair RDD again, so I can do the second groupByKey() and map my function over it.
(Since I'm reducing I guess I could also use reduceByKey() there?)
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("test")\
    .master("local")\
    .config('spark.sql.shuffle.partitions', '4')\
    .getOrCreate()
sc = spark.sparkContext
def group_by(ws):
    L = ws[0]
    E = ...ws[1]...  # <-- Do something here to turn this from ResultIterable to pair RDD
    rr = E.groupByKey().map(output_lists)
    return (L, rr)

def output_lists(ws):
    el = [e[0] for e in ws[1]]
    res = [ws[0]] + el
    return (ws[0], res)
input_data = (('A', ('G', ('xyz',))),
('A', ('G', ('xys',))),
('A', ('H', ('asd',))),
('B', ('K', ('qwe',))),
('B', ('K', ('wer',))))
data = sc.parallelize(input_data)
data = data.groupByKey().map(group_by)
print(data.take(5))
Now, is this even doable, or do I need a different approach?
I know of two other workarounds:
Concatenate both keys into one.
Use a SparkSQL dataframe.
But I'm curious if there is a way with the above approach as I'm still learning Spark.
I found out I can use tuples as keys in pair RDDs. Remapping my input data like this means only one groupByKey() is needed and the problem can be solved:
input_data = ((('A', 'G'), 'xyz'),
(('A', 'G'), 'xys'),
(('A', 'H'), 'asd'),
(('B', 'K'), 'qwe'),
(('B', 'K'), 'wer'))
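For completeness, here is a hedged sketch of the single-groupByKey version (the lambda and variable names are mine, not from the original code), assuming the goal is the same list that output_lists built, i.e. the inner key followed by its values:
data = (
    sc.parallelize(input_data)
      .groupByKey()  # the key is now the (outer, inner) tuple
      .map(lambda kv: (kv[0], [kv[0][1]] + list(kv[1])))
)
print(data.take(5))
# e.g. [(('A', 'G'), ['G', 'xyz', 'xys']), (('A', 'H'), ['H', 'asd']), (('B', 'K'), ['K', 'qwe', 'wer'])]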
I am trying to perform the quickest lookup possible in Spark, as part of practice rolling my own association-rules module. Please note that I know the metric below, confidence, is supported in PySpark. This is just an example -- another metric, lift, is not supported, yet I intend to use the results from this discussion to develop that.
As part of calculating the confidence of a rule, I need to look at how often the antecedent and consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd).
from itertools import combinations, chain
def powerset(iterable, no_empty=True):
    ''' Produce the powerset for a given iterable '''
    s = list(iterable)
    combos = (combinations(s, r) for r in range(len(s)+1))
    powerset = chain.from_iterable(combos)
    return (el for el in powerset if el) if no_empty else powerset
# Set-up transaction set
rdd = sc.parallelize(
[
('a',),
('a', 'b'),
('a', 'b'),
('b', 'c'),
('a', 'c'),
('a', 'b'),
('b', 'c'),
('c',),
('b',),
]
)
# Create an RDD with the counts of each
# possible itemset
counts = (
rdd
.flatMap(lambda x: powerset(x))
.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x + y)
.map(lambda x: (frozenset(x[0]), x[1]))
)
# Function to calculate confidence of a rule
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))
confidence_result = (
rdd
# Must be applied to length-two and greater itemsets
.filter(lambda x: len(x) > 1)
.map(confidence)
)
Those familiar with this type of lookup problem will know that this raises the following exception:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
One way to get around this exception is to convert counts to a dictionary:
counts = dict(counts.collect())
confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])
confidence_result = (
rdd
# Must be applied to length-two and greater itemsets
.filter(lambda x: len(x) > 1)
.map(confidence)
)
Which gives me my result. But running counts.collect() is very expensive, since in reality I have a dataset with 50M+ records. Is there a better option for performing this type of lookup?
If your target metric can be independently calculated on each RDD partition and then combined to achieve the target result, you can use mapPartitions instead of map when calculating your metric.
The generic flow should be something like:
from functools import reduce

partial_results = (
    rdd
    # apply your metric calculation independently on each partition
    .mapPartitions(confidence_partial)
    # collect the per-partition results into a single list on the driver
    .collect()
)
# reduce that list to combine the metrics calculated on each partition
metric_result = reduce(confidence_combine, partial_results)
Both confidence_partial and confidence_combine are regular Python functions that take an iterator/list as input.
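For concreteness, here is one possible sketch of the two helpers for the itemset-counting problem above (the Counter-based representation is my assumption, not part of the original answer); note that the confidence itself is only computed after the partials are combined:
from collections import Counter

def confidence_partial(partition):
    ''' Count every itemset occurring in one partition (reuses powerset from the question) '''
    local_counts = Counter()
    for transaction in partition:
        for itemset in powerset(transaction):
            local_counts[frozenset(itemset)] += 1
    yield local_counts  # one Counter per partition

def confidence_combine(left, right):
    ''' Merge two per-partition Counters by summing their counts '''
    return left + right

# With the flow above, metric_result then holds the combined counts; e.g. the
# confidence of the rule ('a',) -> ('b',) would be
# metric_result[frozenset(('a', 'b'))] / metric_result[frozenset(('a',))]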
As an aside, you would probably get a huge performance boost by using the DataFrame API and native expression functions to calculate your metric.
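To illustrate that aside, here is a hypothetical sketch using the DataFrame API (the column names, the comma-joined itemset key, and the single example rule are all my assumptions, and it assumes an active SparkSession so that toDF is available), not something from the original answer:
from pyspark.sql import functions as F

# itemset counts as a DataFrame, keyed by a canonical comma-joined string
counts_df = (
    counts
    .map(lambda kv: (",".join(sorted(kv[0])), kv[1]))
    .toDF(["itemset", "cnt"])
)

# one illustrative rule: antecedent ('a',) -> consequent ('b',)
rules_df = sc.parallelize([("a", "b", "a,b")]).toDF(["antecedent", "consequent", "rule_items"])

confidence_df = (
    rules_df
    .join(counts_df.select(F.col("itemset").alias("rule_items"),
                           F.col("cnt").alias("union_cnt")),
          on="rule_items")
    .join(counts_df.select(F.col("itemset").alias("antecedent"),
                           F.col("cnt").alias("antecedent_cnt")),
          on="antecedent")
    .withColumn("confidence", F.col("union_cnt") / F.col("antecedent_cnt"))
)
confidence_df.show()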
I'm trying to learn Spark in Python, and am stuck with combineByKey for averaging the values in key-value pairs. In fact, my confusion is not with the combineByKey syntax, but with what comes afterward. The typical example (from the O'Reilly 2015 Learning Spark book) can be seen on the web in many places; here's one.
The problem is with the sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count)).collectAsMap() statement. Using Spark 2.0.1 and IPython 3.5.2, this throws a syntax error exception. Simplifying it to something that should work (and is what's in the O'Reilly book): sumCount.map(lambda key,vals: (key, vals[0]/vals[1])).collectAsMap() causes Spark to go bats**t crazy with Java exceptions, but I do note a TypeError: <lambda>() missing 1 required positional argument: 'v' error.
Can anyone point me to an example of this functionality that actually works with a recent version of Spark & Python? For completeness, I've included my own minimum working (or rather, non-working) example:
In: pRDD = sc.parallelize([("s",5),("g",3),("g",10),("c",2),("s",10),("s",3),("g",-1),("c",20),("c",2)])
In: cbk = pRDD.combineByKey(lambda x:(x,1), lambda x,y:(x[0]+y,x[1]+1),lambda x,y:(x[0]+y[0],x[1]+y[1]))
In: cbk.collect()
Out: [('s', (18, 3)), ('g', (12, 3)), ('c', (24, 3))]
In: cbk.map(lambda key,val:(k,val[0]/val[1])).collectAsMap() <-- errors
It's easy enough to compute [(e[0],e[1][0]/e[1][1]) for e in cbk.collect()], but I'd rather get the "Sparkic" way working.
Step by step:
lambda (key, (totalSum, count)): ... is so-called tuple parameter unpacking, which has been removed in Python 3.
RDD.map takes a function which expects a single argument. The function you are trying to use:
lambda key, vals: ...
is a function which expects two arguments, not one. A valid translation from the 2.x syntax would be:
lambda key_vals: (key_vals[0], key_vals[1][0] / key_vals[1][1])
or:
def get_mean(key_vals):
    key, (total, cnt) = key_vals
    return key, total / cnt

cbk.map(get_mean)
You can also make this much simpler with mapValues:
cbk.mapValues(lambda x: x[0] / x[1])
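Assuming the cbk RDD from the question, collecting that should give the expected per-key means:
cbk.mapValues(lambda x: x[0] / x[1]).collectAsMap()
# {'s': 6.0, 'g': 4.0, 'c': 8.0}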
Finally a numerically stable solution would be:
from pyspark.statcounter import StatCounter
(pRDD
.combineByKey(
lambda x: StatCounter([x]),
StatCounter.merge,
StatCounter.mergeStats)
.mapValues(StatCounter.mean))
Averaging the values per group defined by a specific column can be done by using a Window. Consider the following code:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([('a', 2), ('b', 3), ('a', 6), ('b', 5)],
['a', 'i'])
win = Window.partitionBy('a')
df.withColumn('avg', F.avg('i').over(win)).show()
Would yield:
+---+---+---+
|  a|  i|avg|
+---+---+---+
|  b|  3|4.0|
|  b|  5|4.0|
|  a|  2|4.0|
|  a|  6|4.0|
+---+---+---+
The average aggregation is done on each worker separately, requires no extra round trip to the driver, and is therefore efficient.
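If you only need one averaged row per key, rather than the average repeated on every row, a plain groupBy aggregation (just a sketch of the alternative, not part of the original answer) would also work:
df.groupBy('a').agg(F.avg('i').alias('avg')).show()
# one row per key, with avg 4.0 for both 'a' and 'b' in this example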
My list of tuples looks like this:
Tup = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
I want to sum all values up, in this case, 2+1+2+2=7
I can use Tup.reduceByKey() in Spark if the keys are the same. But which function can I use in Spark to sum all the values up regardless of the key?
I've tried Tup.sum() but it gives me (u'X45', 2, u'W80', 1, u'F03', 2, u'X61', 2)
BTW, due to the large dataset, I want to do the sum in the RDD, rather than using Tup.collect() and summing it up outside of Spark.
This is pretty easy.
Conceptually, you should first map over your original RDD to extract the second value, and then sum those values.
In Scala
val x = List(("X45", 2), ("W80", 1), ("F03", 2), ("X61", 2))
val rdd = sc.parallelize(x)
rdd.map(_._2).sum()
In Python
x = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
rdd = sc.parallelize(x)
y = rdd.map(lambda x : x[1]).sum()
In both cases, the sum of 7 is returned.
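Since this is a pair RDD, the same sum can also be written a little more directly with values():
y = rdd.values().sum()  # 7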
I am converting a user-input list of strings into tuples. The user inputs a list of fractions, e.g. (please, no "import fractions" suggestions):
fractions = ["1/2","3/5","4/3","3/8","1/9","4/7"]
I normally would use the following code that works:
user_input = 0
list_frac = []
print('Enter fractions into a list until you type "stop" in lower case:')
while user_input != 'stop':
    user_input = input('Enter a fraction ie: "1/2" >>>')
    list_frac.append(user_input)
list_frac.pop()  # pop "stop" off the list
result = []
for i in list_frac:
    result.append(tuple(i.split('/')))
print(result)
The result is a list of tuples:
fractions = [('1','2'),('3','5'),('4','3'),('3','8'),('1','9'),('4','7')]
I want to change the values in the tuples to integers as well, and I don't know how.
However, I also wish to learn lambda functions, so I am practicing on simple code like this. This is my attempt at the same code using lambda syntax:
tup_result = tuple(map(lambda i: result.append(i.split('/')), result))
But the result is an empty list, with no errors to help me.
The question: how do I change the strings in the list of tuples to ints, and then accomplish all of this with a one-line lambda function?
Any suggestions? I have the general concept of a lambda function down, but actually implementing it is a little confusing. Thanks for the help, folks!
I used comprehensions to solve the task:
fractions = ["1/2","3/5","4/3","3/8","1/9","4/7"]
print([(int(x),int(y)) for (x,y) in [k.split('/') for k in fractions]])
>>>[(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]
I started with Python not long ago myself and was confused about how to use lambda in the beginning as well. Then I read that Guido van Rossum had suggested that lambda forms would disappear in Python 3.0 (AlternateLambdaSyntax); since then I have not used lambda at all and have had no problem with it. You have to understand how it works when you see it in some code, but you can almost always write more readable code without using lambda (though I could be wrong). I hope it helps.
Update
There is a solution with map() and lambda, though I would not wish it on my worst enemy:
print([(int(x),int(y)) for [x,y] in list(map(lambda frac: frac.split('/'),fractions))])
>>>[(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]
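For comparison, a map/lambda version that produces the integer tuples directly (plain Python, nothing Spark-specific) could be:
print(list(map(lambda frac: tuple(map(int, frac.split('/'))), fractions)))
>>>[(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]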