Making data ready for FP growth in pyspark

I am trying to implement the FP-growth algorithm. I have data in the following format:
Food rank
apple 1
caterpillar 1
banana 2
monkey 2
dog 3
bone 3
oath 3
How do I transform it into [[apple,caterpillar],[banana,monkey],[dog,bone,oath]]?

Assuming your data is a DataFrame, we first convert it to an RDD, then define the keys, use them to group your data, and finally map the values into a list and extract them. We can do this in two ways: either use groupByKey():
(df.rdd
 .map(lambda x: (x[1], x[0]))   # (rank, Food)
 .groupByKey()
 .mapValues(list)
 .values())
Or use reduceByKey(), which is going to be more efficient:
(df.rdd
 .map(lambda x: (x[1], [x[0]]))  # (rank, [Food])
 .reduceByKey(lambda x, y: x + y)
 .values())
Data:
df = sc.parallelize([("apple", 1),
                     ("caterpillar", 1),
                     ("banana", 2),
                     ("monkey", 2),
                     ("dog", 3),
                     ("bone", 3),
                     ("oath", 3)]).toDF(["Food", "rank"])

Related

PySpark: how to aggregate over column arrays with variable width?

I am attempting to aggregate and create an array of means thus (this is a Minimal Working Example):
n = len(allele_freq_total.select("alleleFrequencies").first()[0])
allele_freq_by_site = allele_freq_total.groupBy("contigName", "start", "end", "referenceAllele").agg(
    array(*[mean(col("alleleFrequencies")[i]) for i in range(n)]).alias("mean_alleleFrequencies")
)
using a solution that I got from Aggregate over column arrays in DataFrame in PySpark?, but the problem is that n is variable. How do I alter
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)])
so that it takes the variable length into consideration?
With arrays of unequal size in the different groups (for you, a group is ("contigName", "start", "end", "referenceAllele"), which I'll simply rename to group), you could consider exploding the array column (the alleleFrequencies), with introduction of the position the values had within the arrays. That will give you an additional column you can use in grouping to compute the average you had in mind. At this point you might actually have enough for further computations (see df3.show() below).
If you really must have it back as an array, that's harder, but I have an idea: one must keep track of the order, and I believe that's easy with a map (a dictionary, if you like). To do so, I use the aggregation function collect_list on two columns. While collect_list isn't deterministic (you don't know the order in which values will be returned in the list, because rows are shuffled), the aggregation over both arrays will preserve their order, as the rows get shuffled in their entirety (see df4.show(), below). From there, you can create a mapping of the position to the average with map_from_arrays.
Example:
>>> from pyspark.sql.functions import mean, col, posexplode, collect_list, map_from_arrays
>>>
>>> df = spark.createDataFrame([
... ("A", [0, 1, 2]),
... ("A", [0, 3, 6]),
... ("B", [1, 2, 4, 5]),
... ("B", [1, 2, 6, 1])],
... schema=("group", "values"))
>>> df2 = df.select(df.group, posexplode(df.values)) # adds the "pos" and "col" columns
>>> df3 = (df2
... .groupBy("group", "pos")
... .agg(mean(col("col")).alias("avg_of_positions"))
... )
>>> df4 = (df3
... .groupBy("group")
... .agg(
... collect_list("pos").alias("pos"),
... collect_list("avg_of_positions").alias("avgs")
... )
... )
>>> df5 = df4.select(
... "group",
... map_from_arrays(col("pos"), col("avgs")).alias("positional_averages")
... )
>>> df5.show(truncate=False)
+-----+----------------------------------------+
|group|positional_averages |
+-----+----------------------------------------+
|B |[0 -> 1.0, 1 -> 2.0, 3 -> 3.0, 2 -> 5.0]|
|A |[0 -> 0.0, 1 -> 2.0, 2 -> 4.0] |
+-----+----------------------------------------+

Spark RDD: lookup from other RDD

I am trying to perform the quickest lookup possible in Spark, as part of some practice rolling-my-own association rules module. Please note that I know the metric below, confidence, is supported in PySpark. This is just an example -- another metric, lift, is not supported, yet I intend to use the results from this discussion to develop that.
As part of calculating the confidence of a rule, I need to look at how often the antecedent and consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd).
from itertools import combinations, chain

def powerset(iterable, no_empty=True):
    ''' Produce the powerset for a given iterable '''
    s = list(iterable)
    combos = (combinations(s, r) for r in range(len(s)+1))
    powerset = chain.from_iterable(combos)
    return (el for el in powerset if el) if no_empty else powerset
# Set-up transaction set
rdd = sc.parallelize(
    [
        ('a',),
        ('a', 'b'),
        ('a', 'b'),
        ('b', 'c'),
        ('a', 'c'),
        ('a', 'b'),
        ('b', 'c'),
        ('c',),
        ('b',),
    ]
)
# Create an RDD with the counts of each
# possible itemset
counts = (
    rdd
    .flatMap(lambda x: powerset(x))
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .map(lambda x: (frozenset(x[0]), x[1]))
)
# Function to calculate confidence of a rule
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))
confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
For those familiar with this type of lookup problem, you'll know that this type of Exception is raised:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
One way to get around this exception is to convert counts to a dictionary:
counts = dict(counts.collect())
confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])
confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
This gives me my result, but the process of running counts.collect() is very expensive, since in reality I have a dataset with 50M+ records. Is there a better option for performing this type of lookup?
If your target metric can be independently calculated on each RDD partition and then combined to achieve the target result, you can use mapPartitions instead of map when calculating your metric.
The generic flow should be something like:
from functools import reduce

partial_results = (
    rdd
    # apply your metric calculation independently on each partition
    .mapPartitions(confidence_partial)
    # collect results from the partitions into a single list of results
    .collect()
)
# reduce the list to combine the metrics calculated on each partition
metric_result = reduce(confidence_combine, partial_results)
Both confidence_partial and confidence_combine are regular Python functions that take an iterator/list as input.
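For illustration (not the author's exact implementation), here is a minimal sketch of that flow using per-itemset counts as the partition-wise metric, since counts combine cleanly across partitions; confidence itself would still need the global counts. The names counts_partial and counts_combine are hypothetical, while powerset and rdd are the objects defined in the question:
from functools import reduce
from collections import Counter

def counts_partial(transactions):
    '''Count every itemset occurring within a single partition.'''
    local = Counter()
    for t in transactions:
        for itemset in powerset(t):
            local[frozenset(itemset)] += 1
    yield local                     # mapPartitions expects an iterable

def counts_combine(a, b):
    '''Merge two per-partition Counters.'''
    a.update(b)
    return a

partials = rdd.mapPartitions(counts_partial).collect()  # one Counter per partition
itemset_counts = reduce(counts_combine, partials)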
As an aside, you would probably get a huge performance boost by using the DataFrame API and native expression functions to calculate your metric.

Python Spark combineByKey Average

I'm trying to learn Spark in Python, and am stuck with combineByKey for averaging the values in key-value pairs. In fact, my confusion is not with the combineByKey syntax, but with what comes afterward. The typical example (from the O'Reilly 2015 Learning Spark book) can be seen on the web in many places; here's one.
The problem is with the sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count)).collectAsMap() statement. Using Spark 2.0.1 and iPython 3.5.2, this throws a syntax error exception. Simplifying it to something that should work (and is what's in the O'Reilly book): sumCount.map(lambda key,vals: (key, vals[0]/vals[1])).collectAsMap() causes Spark to go bats**t crazy with Java exceptions, but I do note a TypeError: <lambda>() missing 1 required positional argument: 'v' error.
Can anyone point me to an example of this functionality that actually works with a recent version of Spark & Python? For completeness, I've included my own minimum working (or rather, non-working) example:
In: pRDD = sc.parallelize([("s",5),("g",3),("g",10),("c",2),("s",10),("s",3),("g",-1),("c",20),("c",2)])
In: cbk = pRDD.combineByKey(lambda x:(x,1), lambda x,y:(x[0]+y,x[1]+1),lambda x,y:(x[0]+y[0],x[1]+y[1]))
In: cbk.collect()
Out: [('s', (18, 3)), ('g', (12, 3)), ('c', (24, 3))]
In: cbk.map(lambda key,val:(k,val[0]/val[1])).collectAsMap() <-- errors
It's easy enough to compute [(e[0],e[1][0]/e[1][1]) for e in cbk.collect()], but I'd rather get the "Sparkic" way working.
Step by step:
lambda (key, (totalSum, count)): ... is so-called tuple parameter unpacking, which has been removed in Python 3 (PEP 3113).
RDD.map takes a function which expects a single argument. The function you try to use:
lambda key, vals: ...
is a function which expects two arguments, not one. A valid translation of the 2.x syntax would be:
lambda key_vals: (key_vals[0], key_vals[1][0] / key_vals[1][1])
or:
def get_mean(key_vals):
    key, (total, cnt) = key_vals
    return key, total / cnt

cbk.map(get_mean)
You can also make this much simpler with mapValues:
cbk.mapValues(lambda x: x[0] / x[1])
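For reference, applying this one-liner to the cbk from the question should give the per-key means; the expected values follow directly from the cbk.collect() output shown above:
cbk.mapValues(lambda x: x[0] / x[1]).collectAsMap()
# {'s': 6.0, 'g': 4.0, 'c': 8.0}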
Finally a numerically stable solution would be:
from pyspark.statcounter import StatCounter
(pRDD
 .combineByKey(
     lambda x: StatCounter([x]),
     StatCounter.merge,
     StatCounter.mergeStats)
 .mapValues(StatCounter.mean))
Averaging over a specific column value can be done by using the Window concept. Consider the following code:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([('a', 2), ('b', 3), ('a', 6), ('b', 5)],
                           ['a', 'i'])
win = Window.partitionBy('a')
df.withColumn('avg', F.avg('i').over(win)).show()
Would yield:
+---+---+---+
| a| i|avg|
+---+---+---+
| b| 3|4.0|
| b| 5|4.0|
| a| 2|4.0|
| a| 6|4.0|
+---+---+---+
The average aggregation is done on each worker separately, requires no round trip to the driver, and is therefore efficient.

Find sum of second values in key/value pair

I just started with PySpark.
I have key/value pairs like the following: (key, (value1, value2)).
I'd like to find the sum of value2 for each key.
Example of input data:
(22, (33, 17.0)), (22, (34, 15.0)), (20, (3, 5.5)), (20, (11, 0.0))
Thanks!
In the end I created a new RDD that contains (key, value2) only, then just summed the values of the new RDD:
sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1]))\
    .groupByKey().mapValues(sum).collect()
If you would like to benefit from a combiner, this would be a better choice:
from operator import add

sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1])).reduceByKey(add)
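For reference, a quick check against the example input from the question; the expected sums follow directly from that data (17.0 + 15.0 and 5.5 + 0.0):
rdd = sc.parallelize([(22, (33, 17.0)), (22, (34, 15.0)),
                      (20, (3, 5.5)), (20, (11, 0.0))])

sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1])).reduceByKey(add)
sumRdd.collect()
# [(22, 32.0), (20, 5.5)]   (order may vary)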

Spark sum up values regardless of keys

My list of tuples looks like this:
Tup = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
I want to sum all values up, in this case, 2+1+2+2=7
I can use Tup.reduceByKey() in Spark if the keys are the same, but which function can I use in Spark to sum all values up regardless of the key?
I've tried Tup.sum() but it gives me (u'X45', 2, u'W80', 1, u'F03', 2, u'X61', 2).
BTW, due to the large dataset, I want to sum it up within the RDD, so I don't want to use Tup.collect() and sum it up outside of Spark.
This is pretty easy.
Conceptually, you should first map over your original RDD to extract the 2nd value, and then sum those values.
In Scala
val x = List(("X45", 2), ("W80", 1), ("F03", 2), ("X61", 2))
val rdd = sc.parallelize(x)
rdd.map(_._2).sum()
In Python
x = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
rdd = sc.parallelize(x)
y = rdd.map(lambda x : x[1]).sum()
In both cases, the sum of 7 is returned.
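As a side note (not something the answer mentions), for a key-value RDD PySpark's values() performs the same extraction as the map above, so an equivalent shorthand in Python would be:
y = rdd.values().sum()   # 7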
