I've just started with PySpark.
I have key/value pairs of the form (key, (value1, value2)).
I'd like to find the sum of value2 for each key.
Example of input data:
(22, (33, 17.0)),(22, (34, 15.0)),(20, (3, 5.5)),(20, (11, 0.0))
Thanks!
In the end I created a new RDD containing only (key, value2) pairs, then summed the values per key:
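# each record kv is (key, (value1, value2)); index into it rather than
# unpacking in the lambda signature, which Python 3 rejects
sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1]))\
    .groupByKey().mapValues(sum).collect()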
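If you would like to benefit from a map-side combiner, reduceByKey is the better choice:
from operator import add
# index into the record rather than unpacking in the lambda signature (Python 3 rejects that)
sumRdd = rdd.map(lambda kv: (kv[0], kv[1][1])).reduceByKey(add)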
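To see why the combiner matters, here is a minimal plain-Python sketch (no Spark involved) of the two strategies. The two-partition layout and the helper names `group_then_sum` and `combine_then_merge` are made up for illustration; Spark's real shuffle machinery is more involved.

```python
from collections import defaultdict
from operator import add

# Hypothetical layout: the sample records split across two partitions.
partitions = [
    [(22, (33, 17.0)), (22, (34, 15.0))],
    [(20, (3, 5.5)), (20, (11, 0.0))],
]

def group_then_sum(parts):
    """groupByKey-style: every (key, value2) record crosses the shuffle."""
    shuffled = [(k, v[1]) for part in parts for k, v in part]
    grouped = defaultdict(list)
    for k, b in shuffled:
        grouped[k].append(b)
    return {k: sum(vs) for k, vs in grouped.items()}, len(shuffled)

def combine_then_merge(parts, op=add):
    """reduceByKey-style: pre-aggregate inside each partition, then merge."""
    partials = []
    for part in parts:
        local = {}
        for k, v in part:
            local[k] = op(local[k], v[1]) if k in local else v[1]
        partials.append(local)
    # Only one record per key per partition is left to shuffle.
    shuffled = [(k, b) for local in partials for k, b in local.items()]
    merged = {}
    for k, b in shuffled:
        merged[k] = op(merged[k], b) if k in merged else b
    return merged, len(shuffled)

grouped_sums, grouped_records = group_then_sum(partitions)
combined_sums, combined_records = combine_then_merge(partitions)
```

Both strategies give the same per-key sums. The difference is how many records have to move between partitions before the final result is known.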
Suppose you have two vectors of the same size that are stored as rdd1 and rdd2. Please write a function where the inputs are rdd1 and rdd2, and the output is a rdd which is the element-wise addition of rdd1 and rdd2. You should not load all data to the driver program.
Hint: You may use zip() in Spark, not the zip() in Python.
I do not understand what is wrong with the code below, or whether it is correct at all. When I run it, it takes forever. Would you be able to help me with this? Thanks.
spark = SparkSession(sc)
numPartitions = 10
rdd1 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[0]))
rdd2 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[1]))
def ele_wise_add(rdd1, rdd2):
    rdd3 = rdd1.zip(rdd2).map(lambda x,y: x + y)
    return rdd3
rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.collect())
rdd1 and rdd2 have 10000 numbers each; below are the first 10 numbers of each.
rdd1 = [47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057]
rdd2 = [30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315]
expected output = [78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
rdd1.zip(rdd2) creates a single tuple for each pair, so the lambda receives one argument, x, and no y. You therefore want sum(x) or x[0] + x[1], not x + y.
rdd1 = spark.sparkContext.parallelize((47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057))
rdd2 = spark.sparkContext.parallelize((30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315))
rdd1.zip(rdd2).map(lambda x: sum(x)).collect()
[78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
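The same mistake can be reproduced without Spark. The sketch below uses plain Python lists and Python's built-in zip purely as a stand-in for RDD.zip (the names `rdd1_sample` and `rdd2_sample` are illustrative, holding the first three values from the question); it is meant to show the shape of the data, not to replace the Spark code.

```python
rdd1_sample = [47461, 93033, 92255]
rdd2_sample = [30614, 61104, 92322]

# Like RDD.zip, Python's zip yields one 2-tuple per position.
pairs = list(zip(rdd1_sample, rdd2_sample))

# A two-argument function fails when it is handed a single tuple.
try:
    broken = [(lambda x, y: x + y)(pair) for pair in pairs]
except TypeError:
    broken = None

# Index into the tuple (or call sum on it) instead.
fixed = [pair[0] + pair[1] for pair in pairs]
```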
I want to plot each list of tuples generated by the grouping command below.
import pandas as pd
import more_itertools as mit
df=pd.DataFrame({'a': [0,1,2,0,1,2,3], 'b':[2,10,24,56,90,1,3]})
for group in mit.consecutive_groups(zip(df['a'],df['b']),ordering=lambda t:t[0]):
    print(list(group))
output:
[(0, 2), (1, 10), (2, 24)]
[(0, 56), (1, 90), (2, 1), (3, 3)]
I want to plot the first group, [(0, 2), (1, 10), (2, 24)], taking the first element of each tuple as x and the second as y (for example x=0, y=2). The same applies to the following list of tuples. I am still trying, but have not figured it out yet.
You are looking for:
df.assign(grp = df.a.diff().ne(1).cumsum()).groupby('grp').plot('a','b')
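For readers unsure what diff().ne(1).cumsum() computes, here is a rough plain-Python equivalent of the labelling step. The helper name `run_labels` is made up for this sketch; it starts a new group whenever the step from the previous value is not +1, which is what the pandas expression does on column a.

```python
a = [0, 1, 2, 0, 1, 2, 3]
b = [2, 10, 24, 56, 90, 1, 3]

def run_labels(values):
    """Label consecutive runs: a new group starts whenever the step is not +1."""
    labels, current, previous = [], 0, None
    for v in values:
        if previous is None or v - previous != 1:
            current += 1
        labels.append(current)
        previous = v
    return labels

grp = run_labels(a)

# Collect the (x, y) points of each group, ready to hand to a plotting call.
groups = {}
for label, x, y in zip(grp, a, b):
    groups.setdefault(label, []).append((x, y))
```

Each value of `groups` is one of the lists of tuples from the question, so the groupby in the answer plots one line per run.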
I am trying to perform the quickest lookup possible in Spark, as part of some practice rolling-my-own association rules module. Please note that I know the metric below, confidence, is supported in PySpark. This is just an example -- another metric, lift, is not supported, yet I intend to use the results from this discussion to develop that.
As part of calculating the confidence of a rule, I need to look at how often the antecedent and consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd).
from itertools import combinations, chain
def powerset(iterable, no_empty=True):
    ''' Produce the powerset for a given iterable '''
    s = list(iterable)
    combos = (combinations(s, r) for r in range(len(s)+1))
    powerset = chain.from_iterable(combos)
    return (el for el in powerset if el) if no_empty else powerset
# Set-up transaction set
rdd = sc.parallelize(
    [
        ('a',),
        ('a', 'b'),
        ('a', 'b'),
        ('b', 'c'),
        ('a', 'c'),
        ('a', 'b'),
        ('b', 'c'),
        ('c',),
        ('b',),
    ]
)
# Create an RDD with the counts of each
# possible itemset
counts = (
    rdd
    .flatMap(lambda x: powerset(x))
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .map(lambda x: (frozenset(x[0]), x[1]))
)
# Function to calculate confidence of a rule
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))
confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
Those familiar with this type of lookup problem will know that it raises the following exception:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
One way to get around this exception is to convert counts to a dictionary:
counts = dict(counts.collect())
confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])
confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)
This gives me my result, but running counts.collect() is very expensive, since in reality I have a dataset with 50m+ records. Is there a better option for performing this type of lookup?
If your target metric can be independently calculated on each RDD partition and then combined to achieve the target result, you can use mapPartitions instead of map when calculating your metric.
The generic flow should be something like:
# collect() returns a plain Python list, which has no .reduce method,
# so the list is passed to confidence_combine directly
metric_result = confidence_combine(
    rdd
    # apply your metric calculation independently on each partition
    .mapPartitions(confidence_partial)
    # collect results from the partitions into a single list of results
    .collect()
)
# confidence_combine reduces that list to combine the metrics calculated on each partition
Both confidence_partial and confidence_combine are regular Python functions that take an iterator/list as input.
As an aside, you would probably get a huge performance boost by using the DataFrame API and native expression functions to calculate your metric.
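As a rough illustration of that flow, here is a plain-Python sketch with no Spark involved. It assumes the metric decomposes into per-partition itemset counts that can be merged afterwards. The bodies of `confidence_partial` and `confidence_combine` and the two-partition split are guesses for illustration, not the answerer's actual functions.

```python
from collections import Counter
from itertools import chain, combinations

def powerset(items):
    """Non-empty subsets of items, as frozensets."""
    s = list(items)
    subsets = chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))
    return (frozenset(c) for c in subsets)

def confidence_partial(partition):
    """Per-partition step: count every itemset seen in this partition."""
    counts = Counter()
    for transaction in partition:
        counts.update(powerset(transaction))
    yield counts

def confidence_combine(partials):
    """Driver-side step: merge the per-partition counters."""
    total = Counter()
    for counts in partials:
        total.update(counts)
    return total

# Hypothetical split of the question's transactions into two partitions.
partitions = [
    [('a',), ('a', 'b'), ('a', 'b'), ('b', 'c'), ('a', 'c')],
    [('a', 'b'), ('b', 'c'), ('c',), ('b',)],
]

# Stand-in for rdd.mapPartitions(confidence_partial).collect()
partials = [c for part in partitions for c in confidence_partial(part)]
counts = confidence_combine(partials)

# Same ratio as in the question: count(itemset) / count({second item}).
conf_ab = counts[frozenset(('a', 'b'))] / counts[frozenset(['b'])]
```

Only one counter per partition reaches the driver here. Whether that is small enough depends on how many distinct itemsets the real data contains.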
I am trying to implement the FP-growth algorithm. I have data in the following format:
Food rank
apple 1
caterpillar 1
banana 2
monkey 2
dog 3
bone 3
oath 3
How do I transform it into [[apple,caterpillar],[banana,monkey],[dog,bone,oath]]?
Assuming your data is a DataFrame, we first convert it to an RDD, then define the keys, use them to group the data, and finally map the values into lists and extract them. We can do this in two ways. Either use groupByKey():
(df.rdd
.map(lambda x: (x[1],x[0]))
.groupByKey()
.mapValues(list)
.values())
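Or use reduceByKey(). It is usually the more efficient choice for aggregations that shrink the data, though when the values are merely concatenated into lists the advantage over groupByKey() is less clear: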
(df.rdd
.map(lambda x: (x[1],[x[0]]))
.reduceByKey(lambda x,y: x+y)
.values())
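The three steps of either variant (key by rank, gather per key, drop the keys) can be traced in plain Python. This is only a sketch of the logic on a local list called `rows`, a name chosen for illustration. Spark makes no promise about the order of the resulting groups, so the sketch sorts by rank to get a stable result.

```python
from collections import defaultdict

rows = [("apple", 1), ("caterpillar", 1), ("banana", 2), ("monkey", 2),
        ("dog", 3), ("bone", 3), ("oath", 3)]

# Step 1: key each row by rank, like .map(lambda x: (x[1], x[0])).
keyed = [(rank, food) for food, rank in rows]

# Step 2: gather the foods of each key, like .groupByKey().mapValues(list).
by_rank = defaultdict(list)
for rank, food in keyed:
    by_rank[rank].append(food)

# Step 3: drop the keys, like .values() (sorted here only for determinism).
baskets = [by_rank[rank] for rank in sorted(by_rank)]
```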
Data:
df = sc.parallelize([("apple", 1),
("caterpillar", 1),
("banana", 2),
("monkey", 2),
("dog", 3),
("bone", 3),
("oath", 3)]).toDF(["Food", "rank"])
My list of tuples looks like this:
Tup = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
I want to sum all values up, in this case, 2+1+2+2=7
I can use Tup.reduceByKey() in Spark if the keys are the same. But which function can I use in Spark to sum all the values up regardless of the key?
I've tried Tup.sum() but it gives me (u'X45', 2, u'W80', 1, u'F03', 2, u'X61', 2)
BTW, because the dataset is large, I want to sum it up inside the RDD rather than calling Tup.collect() and summing outside of Spark.
This is pretty easy.
Conceptually, you first map over your original RDD to extract the second element of each tuple, and then sum those.
In Scala
val x = List(("X45", 2), ("W80", 1), ("F03", 2), ("X61", 2))
val rdd = sc.parallelize(x)
rdd.map(_._2).sum()
In Python
x = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
rdd = sc.parallelize(x)
y = rdd.map(lambda x : x[1]).sum()
In both cases the sum comes out as 7.
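As a mental model of why this avoids pulling the whole dataset to the driver, here is a plain-Python sketch of a distributed sum: each partition is summed locally and only one number per partition is combined at the end. The two-partition split is hypothetical, and this is an approximation of the idea rather than Spark's actual implementation.

```python
from functools import reduce
from operator import add

pairs = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
partitions = [pairs[:2], pairs[2:]]  # hypothetical two-partition split

# map step: keep only the value of each (key, value) pair
values = [[kv[1] for kv in part] for part in partitions]

# each partition is summed where it lives ...
partial_sums = [sum(part) for part in values]

# ... and only the per-partition totals are combined
total = reduce(add, partial_sums, 0)
```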