Efficient way to reduceByKey ignoring keys not in another RDD? - apache-spark

I have a large collection of data in PySpark. The format is key-value pairs, and I need to do a reduceByKey operation while ignoring all data whose key isn't in an RDD of 'interesting' keys that I also have.
I found a solution on SO that uses the subtractByKey operation to achieve this. It works, but crashes due to low memory on my cluster. I have not been able to fix this by tweaking the settings, so I'm hoping there's a more efficient solution.
Here's my solution that works on smaller datasets:
# The keys I'm interested in
edges = sc.parallelize([("a", "b"), ("b", "c"), ("a", "d")])
# Data containing both interesting and uninteresting stuff
data1 = sc.parallelize([(("a", "b"), [42]), (("a", "c"), [60]), (("a", "d"), [13, 37])])
data2 = sc.parallelize([(("a", "b"), [43]), (("b", "c"), [23, 24]), (("a", "c"), [13, 37])])
all_data = [data1, data2]
mask = edges.map(lambda t: (tuple(t), None))
rdds = []
for datum in all_data:
    combined = datum.reduceByKey(lambda a, b: a + b)
    unwanted = combined.subtractByKey(mask)
    wanted = combined.subtractByKey(unwanted)
    rdds.append(wanted)
edge_alltimes = sc.union(rdds).reduceByKey(lambda a,b: a+b)
edge_alltimes.collect()
As desired, this outputs [(('a', 'd'), [13, 37]), (('a', 'b'), [42, 43]), (('b', 'c'), [23, 24])]
(i.e. data for the 'interesting' key tuples have been combined and the rest has been dropped).
The reason I have the data in several RDDs is to mimic behavior on my cluster where I can't load all the data simultaneously due to its size.
Any help would be great.

Example with join. A small drawback is that you need an RDD of pairs before the join, and you need to strip the extra data after the join.
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val goodKeys = sc.parallelize(Seq(1, 2))
    val allData = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val goodPairs = goodKeys.map(v => (v, 0))
    val goodData = allData.join(goodPairs).mapValues(p => p._1)
    goodData.collect().foreach(println)
  }
}
Output:
(1,a)
(2,b)
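For the asker's PySpark setup, the join-based filter might look roughly like this (a sketch reusing the example RDDs from the question; it is not part of the original answer):
mask = edges.map(lambda t: (tuple(t), None))
rdds = []
for datum in all_data:
    combined = datum.reduceByKey(lambda a, b: a + b)
    # the inner join keeps only keys present in the mask; mapValues strips the None placeholder
    wanted = combined.join(mask).mapValues(lambda v: v[0])
    rdds.append(wanted)
edge_alltimes = sc.union(rdds).reduceByKey(lambda a, b: a + b)
This replaces the two subtractByKey passes with a single join per RDD.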

Related

Spark RDD find ratio for key-value pairs

My rdd contains key-value pairs such as this:
(key1, 5),
(key2, 10),
(key3, 20),
I want to perform a map operation that associates each key with its respective ratio in the entire RDD, such as this:
(key1, 5/35),
(key2, 10/35),
(key3, 20/35),
I am struggling to find a method to do this using standard functions; any help will be appreciated.
You can calculate the sum and divide each value by the sum:
from operator import add
rdd = sc.parallelize([('key1', 5), ('key2', 10), ('key3', 20)])
total = rdd.values().reduce(add)
rdd2 = rdd.mapValues(lambda x: x/total)
rdd2.collect()
# [('key1', 0.14285714285714285), ('key2', 0.2857142857142857), ('key3', 0.5714285714285714)]
In Scala it would be
val rdd = sc.parallelize(List(("key1", 5), ("key2", 10), ("key3", 20)))
val total = rdd.values.reduce(_+_)
val rdd2 = rdd.mapValues(1.0*_/total)
rdd2.collect
// Array[(String, Double)] = Array((key1,0.14285714285714285), (key2,0.2857142857142857), (key3,0.5714285714285714))

How to subtract two DataFrames keeping duplicates in Spark 2.3.0

Spark 2.4.0 introduces the handy function exceptAll, which allows subtracting two DataFrames while keeping duplicates.
Example
val df1 = Seq(
("a", 1L),
("a", 1L),
("a", 1L),
("b", 2L)
).toDF("id", "value")
val df2 = Seq(
("a", 1L),
("b", 2L)
).toDF("id", "value")
df1.exceptAll(df2).collect()
// will return
Seq(("a", 1L),("a", 1L))
However I can only use Spark 2.3.0.
What is the best way to implement this using only functions from Spark 2.3.0?
One option is to use row_number to generate a sequential number column and use it in a left join to identify the rows of df1 that have no counterpart in df2.
A PySpark solution is shown here.
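The snippet below assumes df1 and df2 have been recreated as PySpark DataFrames with the same example data (this setup is not part of the original answer):
df1 = spark.createDataFrame([("a", 1), ("a", 1), ("a", 1), ("b", 2)], ["id", "value"])
df2 = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "value"])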
from pyspark.sql.functions import row_number
from pyspark.sql import Window

w1 = Window.partitionBy(df1.id).orderBy(df1.value)
w2 = Window.partitionBy(df2.id).orderBy(df2.value)
df1 = df1.withColumn("rnum", row_number().over(w1))
df2 = df2.withColumn("rnum", row_number().over(w2))

# Rows with no match on (id, value, rnum) are the ones exceptAll would keep
res_like_exceptAll = df1.join(df2, (df1.id == df2.id) & (df1.value == df2.value) & (df1.rnum == df2.rnum), 'left') \
                        .filter(df2.id.isNull()) \
                        .select(df1.id, df1.value)
res_like_exceptAll.show()
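A variant of the same idea (my addition, not from the original answer) is to use a left_anti join, which directly keeps the df1 rows that have no match on (id, value, rnum) and avoids the explicit null filter:
res = df1.join(df2,
               (df1.id == df2.id) & (df1.value == df2.value) & (df1.rnum == df2.rnum),
               'left_anti') \
         .select("id", "value")
res.show()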

Wordcount Program for Apache PySpark

I have been given a task to create a word count program in Python Spark. I am supposed to count the number of words starting with each letter of the alphabet.
Here's the code I have written, but I can't seem to get the result. Could anyone help me with troubleshooting?
in.txt content:
people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
import re
import sys
from pyspark import SparkConf, SparkContext
conf = SparkConf()
sc = SparkContext(conf=conf)
inRDD = sc.textFile("in.txt")
words = inRDD.flatMap(lambda l: re.split(" ",l))
LetterCount = words.map(lambda s : (s[0],1))
result = LetterCount.reduceByKey(lambda n1, n2 : n1 + n2)
Your code is OK.
You just need to add a collect at the end:
result.collect()
[('s', 1),
('l', 2),
('a', 10),
('n', 1),
('t', 8),
('c', 1),
('p', 1),
('b', 2),
('w', 1),
('o', 2)]
And you can replace
import re
words = inRDD.flatMap(lambda l: re.split(" ",l))
with
words = inRDD.flatMap(str.split)
Word count for Apache PySpark using Spark SQL functions (easiest way):
import pyspark.sql.functions as f
wordsDF = spark.read.text("path/log.txt")
df = wordsDF.withColumn('wordCount', f.size(f.split(f.col('value'), ' ')))
df.createOrReplaceTempView("wc")
spark.sql("SELECT SUM(wordCount) AS Total FROM wc").show()
+-----+
|Total|
+-----+
| 147|
+-----+
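Note that this second answer counts the total number of words, while the original task was to count words per starting letter. A DataFrame-based sketch for the per-letter count (assuming the same in.txt; this is not part of the original answers) could look like:
import pyspark.sql.functions as f
# split each line into words, drop empty strings, then group by the first character
words = (spark.read.text("in.txt")
         .select(f.explode(f.split(f.col("value"), " ")).alias("word"))
         .filter(f.col("word") != ""))
letter_counts = words.withColumn("letter", f.substring("word", 1, 1)).groupBy("letter").count()
letter_counts.show()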

Add incrementing variable in RDD

Assuming that I have the following RDD:
test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)
How can I create the following rdd:
((1,'trial1',[1,2]),(2,'trial2',[3,4]))
I tried with accumulators, but it doesn't work, as accumulators cannot be accessed in tasks:
def increm(keyvalue):
    global acc
    acc += 1
    return (acc.value, keyvalue[0], keyvalue[1])
acc = sc.accumulator(0)
test1RDD.map(lambda x: increm(x)).collect()
Any idea how can this be done?
You can use zipWithIndex
zipWithIndex()
Zips this RDD with its element indices.
The ordering is first based on the partition index and then the
ordering of items within each partition. So the first item in the
first partition gets index 0, and the last item in the last partition
receives the largest index.
This method needs to trigger a spark job when this RDD contains more
than one partitions.
>>> sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
and use map to transform the RDD so that the index comes first in the new RDD.
This is untested as I don't have an environment at hand:
test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)
test1RDD.zipWithIndex().map(lambda x : (x[1],x[0]))
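Note that zipWithIndex starts at 0 and keeps the original pair nested, so to match the exact output asked for in the question (1-based index, flattened tuple) the map can be adjusted, for example (also untested):
test1RDD.zipWithIndex().map(lambda x: (x[1] + 1, x[0][0], x[0][1])).collect()
# should give [(1, 'trial1', [1, 2]), (2, 'trial2', [3, 4])]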

Spark Multiple Joins

Using the Spark context, I would like to perform multiple joins between RDDs, where the number of RDDs to be joined should be dynamic.
I would like the result to be unfolded, for example:
val rdd1 = sc.parallelize(List((1,1.0),(11,11.0), (111,111.0)))
val rdd2 = sc.parallelize(List((1,2.0),(11,12.0), (111,112.0)))
val rdd3 = sc.parallelize(List((1,3.0),(11,13.0), (111,113.0)))
val rdd4 = sc.parallelize(List((1,4.0),(11,14.0), (111,114.0)))
val rdd11 = rdd1.join(rdd2).join(rdd3).join(rdd4)
.foreach(println)
generates the following output:
(11,(((11.0,12.0),13.0),14.0))
(111,(((111.0,112.0),113.0),114.0))
(1,(((1.0,2.0),3.0),4.0))
I would like to:
Unfold the values, e.g. the first line should read:
(11, 11.0, 12.0, 13.0, 14.0).
Do it dynamically, so that it works for any number of RDDs to be joined.
Any ideas would be appreciated,
Eli.
Instead of using join, I would use union followed by groupByKey to achieve what you desire.
Here is what I would do -
val emptyRdd = sc.emptyRDD[(Int, Double)]
val listRdds = List(rdd1, rdd2, rdd3, rdd4) // satisfy your dynamic number of rdds
val unioned = listRdds.fold(emptyRdd)(_.union(_))
val grouped = unioned.groupByKey
grouped.collect().foreach(println(_))
This yields the result:
(1,CompactBuffer(1.0, 2.0, 3.0, 4.0))
(11,CompactBuffer(11.0, 12.0, 13.0, 14.0))
(111,CompactBuffer(111.0, 112.0, 113.0, 114.0))
Updated:
If you would still like to use join, this is how to do it with a somewhat more involved foldLeft -
val joined = listRdds match {
  case head :: tail =>
    tail.foldLeft(head.mapValues(Array(_)))(_.join(_).mapValues {
      case (arr: Array[Double], d: Double) => arr :+ d
    })
  case Nil => sc.emptyRDD[(Int, Array[Double])]
}
And joined.collect will yield
res14: Array[(Int, Array[Double])] = Array((1,Array(1.0, 2.0, 3.0, 4.0)), (11,Array(11.0, 12.0, 13.0, 14.0)), (111,Array(111.0, 112.0, 113.0, 114.0)))
Others with this problem may find groupWith helpful. From the docs:
>>> w = sc.parallelize([("a", 5), ("b", 6)])
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> z = sc.parallelize([("b", 42)])
>>> [(x, tuple(map(list, y))) for x, y in sorted(list(w.groupWith(x, y, z).collect()))]
[('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]
