Spark: How to compare two RDDs by key

I want to compare two RDDs by their common keys: first I filter each RDD by key, then compare the sub-RDDs.
For example,
def compare(rdd1, rdd2):
    do_something()

rdd = sc.textFile(path1)   # each record is a dict
rdd2 = sc.textFile(path2)
pair_rdd = rdd.flatMap(lambda x: x.keys()).zip(rdd.flatMap(lambda x: x.values()))
pair_rdd2 = rdd2.flatMap(lambda x: x.keys()).zip(rdd2.flatMap(lambda x: x.values()))
for feat in set(pair_rdd.keys().distinct().collect()) & \
        set(pair_rdd2.keys().distinct().collect()):
    pair_rdd_filter = pair_rdd.filter(lambda x: x[0] == feat).map(lambda x: x[1])
    pair_rdd_filter2 = pair_rdd2.filter(lambda x: x[0] == feat).map(lambda x: x[1])
    compare(pair_rdd_filter, pair_rdd_filter2)
For convenience, here is an example of the RDDs:
rdd = sc.parallelize([{'f':[1,2,3]},{'f':[1,20],'a':[1]}])
rdd2 = sc.parallelize([{'f':[3,4],'a':[23]},{'f':[2,100,10,2],'a':[3,10,3],'b':[3]}])
But I find that using collect() to get the common keys triggers evaluation of the RDDs, which costs a lot of time.
How can I make this code run efficiently?

The issue here is that calling .collect() moves all the data to the driver, which then does the set intersection. To keep execution distributed, use join instead:
pair_rdd.join(pair_rdd2)
This outputs an RDD keyed by the common keys, with values as the tuple (pair_rdd element, pair_rdd2 element).
It can also be used to, e.g., get the common keys:
pair_rdd.join(pair_rdd2).keys().distinct()

Related

Union with an existing RDD which is a set in pyspark

Given a set U, stored in an RDD named rdd:
What is the recommended way to merge any given RDD rdd_not_set with rdd such that the resulting rdd is also a set?
rdd = sc.union([rdd, rdd_not_set])
rdd = rdd.reduceByKey(reduce_func)
Ex: rdd = sc.parallelize([(1,2), (2,3)]), rdd_not_set = sc.parallelize([(1,4), (3,4)]), and the resulting final_rdd = sc.parallelize([(1,4), (2,3), (3,4)]).
The naive solution is to perform the union and then reduceByKey, which would be very inefficient as rdd will be huge in size.

Avoiding a shuffle in Spark by pre-partitioning files (PySpark)

I have a dataset dataset which is partitioned on values 00-99 and want to create an RDD first_rdd to read in the data.
I then want to count how many times the word "foo" occurs in the second element of each partition and store the records of each partition in a list. My output would be final_rdd where each record is of the form (partition_key, (count, record_list)).
def to_list(a):
    return [a]

def append(a, b):
    a.append(b)
    return a

def extend(a, b):
    a.extend(b)
    return a

first_rdd = sqlContext.sql("select * from dataset").rdd
kv_rdd = first_rdd.map(lambda x: (x[4], x))  # x[4] is the partition value
# Group each partition to (partition_key, [list_of_records])
grouped_rdd = kv_rdd.combineByKey(to_list, append, extend)

def count_foo(x):
    count = 0
    for record in x:
        if record[1] == "foo":
            count = count + 1
    return (count, x)

final_rdd = grouped_rdd.mapValues(count_foo)
print("Counted 'foo' for %s partitions" % (final_rdd.count()))
Since each partition of the dataset is totally independent from one another computationally, Spark shouldn't need to shuffle, yet when I look at the SparkUI, I notice that the combineByKey is resulting in a very large shuffle.
I have the correct number of initial partitions, and have also tried reading from the partitioned data in HDFS. Each way I try it, I still get a shuffle. What am I doing wrong?
I've solved my problem by using the mapPartitions function and passing it my own reduce function so that it "reduces" locally on each node and will never perform a shuffle.
In the scenario where data are isolated between each partition, it works perfectly. When the same key exists on more than one partition, this is where a shuffle would be necessary, but this case needs to be detected and handled separately.

Use Spark groupByKey to dedup RDD which causes a lot of shuffle overhead

I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split the original RDD into two RDDs: one stores elements with unique keys, and the other stores the rest of the elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique keys RDD (stores one element per key; for multiple elements with the same key, any element is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated keys RDD (stores the remaining elements, those with duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together, giving a sequence of elements per key. However, the performance of groupByKey() is poor because the element values are very large, which leads to a very large shuffle write.
So I was wondering if there is any better solution, or a way to reduce the amount of data being shuffled when using groupByKey().
EDIT: given the new information in the edit, I would first create the unique RDD, and then the duplicate RDD using the unique one and the original:
val inputRdd: RDD[(K,V)] = ...
val uniqueRdd: RDD[(K,V)] = inputRdd.reduceByKey((x, y) => x) // keep just a single value for each key
val duplicateRdd = inputRdd
  .join(uniqueRdd)
  .filter { case (k, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, v2)) => (k, v1) } // v2 came from the unique rdd
There is also some room for optimization. The solution above incurs two shuffles (reduceByKey and join).
If we partition inputRdd by key from the start, the join will not need an additional shuffle, and using this code should produce much better performance:
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions = 200))
Original Solution:
You can try the following approach: first count the number of occurrences of each pair, and then split into the two RDDs.
val inputRdd: RDD[(K,V)] = ...
val countRdd: RDD[((K,V), Int)] = inputRdd
  .map((_, 1))
  .reduceByKey(_ + _)
  .cache
val uniqueRdd = countRdd.map(_._1)
val duplicateRdd = countRdd
  .filter(_._2 > 1)
  .flatMap { case (kv, count) =>
    (1 to count - 1).map(_ => kv)
  }
Please use combineByKey, which applies a combiner in the map task and hence reduces the data shuffled.
The combiner logic depends on your business logic.
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
There are multiple ways to reduce shuffle data:
1. Write less from the map task by using a combiner.
2. Send aggregated, serialized objects from map to reduce.
3. Use combined input formats to improve the efficiency of combiners.

Randomly shuffle column in Spark RDD or dataframe

Is there any way I can shuffle a column of an RDD or dataframe such that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish such a task.
What about selecting the column to shuffle, ordering it by rand, and zipping it by index back to the existing dataframe?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def addIndex(df: DataFrame) = spark.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)

case class Entry(name: String, salary: Double)

val r1 = Entry("Max", 2001.21)
val r2 = Entry("Zhang", 3111.32)
val r3 = Entry("Bob", 1919.21)
val r4 = Entry("Paul", 3001.5)

val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))
val df_shuffled = addIndex(df
  .select(col("salary").as("salary_shuffled"))
  .orderBy(rand))

df.join(df_shuffled, Seq("_index"))
  .drop("_index")
  .show(false)
+-----+-------+---------------+
|name |salary |salary_shuffled|
+-----+-------+---------------+
|Max |2001.21|3001.5 |
|Zhang|3111.32|3111.32 |
|Paul |3001.5 |2001.21 |
|Bob |1919.21|1919.21 |
+-----+-------+---------------+
If you don't need a global shuffle across your data, you can shuffle within partitions using the mapPartitions method.
rdd.mapPartitions(Random.shuffle(_))
For a PairRDD (RDDs of type RDD[(K, V)]), if you are interested in shuffling the key-value mappings (mapping an arbitrary key to an arbitrary value):
pairRDD.mapPartitions(iterator => {
  val (keySequence, valueSequence) = iterator.toSeq.unzip
  val shuffledValueSequence = Random.shuffle(valueSequence)
  keySequence.zip(shuffledValueSequence).toIterator
}, true)
The boolean flag at the end denotes that partitioning is preserved (keys are not changed) for this operation so that downstream operations e.g. reduceByKey can be optimized (avoid shuffles).
While one cannot shuffle a single column directly, it is possible to permute the records in an RDD via RandomRDDs: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/random/RandomRDDs.html
A potential approach to permuting only a single column might be:
1. use mapPartitions to do some setup/teardown on each worker task
2. suck all of the records into memory, i.e. iterator.toList; make sure you have many (small) partitions of data to avoid an OOME
3. using the Row object, rewrite all records back out as the original except for the given column
4. within the mapPartitions, create an in-memory sorted list
5. for the desired column, drop its values into a separate collection and randomly sample the collection to replace each record's entry
6. return the result as list.toIterator from the mapPartitions
You can add one additional randomly generated column, and then sort the records based on this random column. This way, you randomly shuffle your destined column.
You do not need to have all data in memory, which can easily cause OOM: Spark takes care of sorting and memory limits by spilling to disk if necessary.
If you don't want the extra column, you can remove it after sorting.
In case someone is looking for a PySpark equivalent of Sascha Vetter's post, you can find it below:
import numpy as np

from pyspark.sql import Row, functions as F
from pyspark.sql.types import IntegerType, StructField, StructType

def add_index_to_row(row, index):
    row_dict = row.asDict()
    row_dict["index"] = index
    return Row(**row_dict)

def add_index_to_df(df):
    df_with_index = df.rdd.zipWithIndex().map(lambda x: add_index_to_row(x[0], x[1]))
    new_schema = StructType(df.schema.fields + [StructField("index", IntegerType(), True)])
    return spark.createDataFrame(df_with_index, new_schema)

def shuffle_single_column(df, column_name):
    df_cols = df.columns
    # select the desired column and shuffle it (i.e. order it by a column of random numbers)
    shuffled_col = df.select(column_name).orderBy(F.rand())
    # add an explicit index to the shuffled column
    shuffled_col_index = add_index_to_df(shuffled_col)
    # add an explicit index to the original dataframe
    df_index = add_index_to_df(df)
    # drop the desired column from df, join it with the shuffled column on the created index, and drop the index column
    df_shuffled = df_index.drop(column_name).join(shuffled_col_index, "index").drop("index")
    # reorder columns so that the shuffled column comes back to its initial position instead of the last position
    df_shuffled = df_shuffled.select(df_cols)
    return df_shuffled

# initialize a random array
z = np.random.randint(20, size=(10, 3)).tolist()
# create the pyspark dataframe
example_df = sc.parallelize(z).toDF(("a", "b", "c"))
# shuffle one column of the dataframe
example_df_shuffled = shuffle_single_column(df=example_df, column_name="a")

How to filter dstream using transform operation and external RDD?

I used transform method in a similar use case as described in Transform Operation section of Transformations on DStreams:
spamInfoRDD = sc.pickleFile(...) # RDD containing spam information
# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))
My code is as follows:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[4]", "myapp")
ssc = StreamingContext(sc, 5)
ssc.checkpoint('hdfs://localhost:9000/user/spark/checkpoint/')

lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" "))\
              .map(lambda word: (word, 1))\
              .reduceByKey(lambda a, b: a + b)

filter_rdd = sc.parallelize([(u'A', 1), (u'B', 1)], 2)
filtered_count = counts.transform(
    lambda rdd: rdd.join(filter_rdd).filter(lambda kv: kv[1][0] and not kv[1][1])
)
filtered_count.pprint()

ssc.start()
ssc.awaitTermination()
But I get the following error
It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation.
RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
How should I be using my external RDD to filter elements out of a dstream?
The difference between the Spark doc example and your code is the use of ssc.checkpoint().
Although the specific code example you provided will work without checkpoint, I guess you actually require it. But the concept of introducing an external RDD into the scope of a checkpointed DStream is potentially invalid: when recovering from a checkpoint, the external RDD may have changed.
I tried to checkpoint the external RDD, but I had no luck with it either.