Spark splitting a DStream into several RDDs - apache-spark

The same question also applies to splitting an RDD into several new RDDs.
A DStream or RDD contains several different case classes and I need to turn them into separate RDDs based on case class type.
I'm aware of
val newRDD = rdd.filter { a => a.getClass.getSimpleName == "CaseClass1" }
or
val newRDD = rdd.filter {
a => a match {
case _: CC1 => true
case _ => false
}
}
But this requires many runs through the original RDD, one per case class type.
There must be a more concise way to do the above matching filter?
Is there a way to split an rdd into several by the element type with one parallel pass?

1) A more concise way of filtering for a given type is to use rdd.collect(PartialFunction[T,U])
The equivalent of
val newRDD = rdd.filter { a => a.getClass.getSimpleName == "CaseClass1" }
would be:
val newRDD = rdd.collect{case c:CaseClass1 => c}
It could even be combined with additional filtering and transformation:
val budgetRDD = rdd.collect{case c:CaseClass1 if (c.customer == "important") => c.getBudget}
rdd.collect(p:PartialFunction[T,U]) should not be confused with rdd.collect() which delivers data back to the driver.
2) To split an RDD (or a DStream for that matter), filter is the way to go. One must remember that an RDD is a distributed collection. Filter will let you apply a function to a subset of that distributed collection, in parallel, over the cluster.
A structural creation of 2 or more RDDs from an original RDD would incur a 1-to-many shuffle stage, which will be substantially more expensive.
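As a rough sketch of the two points above: each per-type RDD gets its own collect pass, and caching the parent first lets the later passes read from memory instead of recomputing the source. The Event trait, the CC1 and CC2 case classes, and the SplitByType object are hypothetical names used only for illustration.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical element types standing in for the question's case classes.
sealed trait Event
case class CC1(id: Int) extends Event
case class CC2(name: String) extends Event

object SplitByType {
  // One narrow (shuffle-free) pass per type; the cached parent is reused by each pass.
  def split(events: RDD[Event]): (RDD[CC1], RDD[CC2]) = {
    events.cache()
    val cc1s: RDD[CC1] = events.collect { case c: CC1 => c }
    val cc2s: RDD[CC2] = events.collect { case c: CC2 => c }
    (cc1s, cc2s)
  }
}
```

Both results are lazy, so nothing runs until an action is called on one of them. The same collect calls should also work on a DStream via transform, though that is not shown here.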

Looks like with rdd.filter I was on the right track with the long form. A slightly more concise version is:
val newRDD = rdd.filter { case _: CC1 => true ; case _ => false }
You can't leave out the case _ => false: without it the match is not exhaustive, and any element that is not a CC1 throws a scala.MatchError at runtime. I couldn't get the collect to work correctly.
@maasg gets credit for the right answer about doing separate filter passes rather than hacking a way to split input in one pass.

Related

Splitting a pipeline in spark?

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform1)
.distinct().collect().toSet
I'm adding a similar pipeline:
val foos2 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform2)
.distinct().collect().toSet
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?
First option - cache intermediate results:
val cached = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.cache
val foos1 = cached.map(transform1)
.distinct().collect().toSet
val foos2 = cached.map(transform2)
.distinct().collect().toSet
Second option - use RDD and make single pass:
val foos = spark_session.read(foo_file)
.flatMap(toFooRecord)
.map(someComplicatedProcessing)
.rdd
.flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
.distinct
.collect
.groupBy(_._1)
.mapValues(_.map(_._2))
val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.
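One possible way to handle incompatible return types in the single-pass option is to tag the two outputs with Either instead of string keys. This is only a sketch: the SinglePass object and its both method are hypothetical names, and the two transform functions are passed in as parameters.

```scala
import org.apache.spark.rdd.RDD

object SinglePass {
  // Applies both transforms in one pass over the input, then separates the
  // results on the driver. Left carries transform1 output, Right transform2 output.
  def both[T, A, B](input: RDD[T])(transform1: T => A, transform2: T => B): (Set[A], Set[B]) = {
    val tagged: Array[Either[A, B]] = input
      .flatMap(x => Seq[Either[A, B]](Left(transform1(x)), Right(transform2(x))))
      .distinct()
      .collect()
    val firsts = tagged.iterator.collect { case Left(a) => a }.toSet
    val seconds = tagged.iterator.collect { case Right(b) => b }.toSet
    (firsts, seconds)
  }
}
```

As in the original version, everything is collected to the driver, so this only suits results that fit in driver memory.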

Use Spark groupByKey to dedup RDD which causes a lot of shuffle overhead

I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split the original RDD into two RDDs: one stores elements with unique keys, and the other stores the remaining elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique keys RDD (stores elements with unique keys; for multiple elements with the same key, any one element is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated keys RDD (stores the remaining elements with duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together. For each key, there is a sequence of elements. However, groupByKey() performs poorly here because each element's value is very large, which makes the shuffle write very large.
So I was wondering if there is any better solution. Or is there a way to reduce the amount of data being shuffled when using groupByKey()?
EDIT: given the new information in the edit, I would first create the unique RDD, and then the duplicate RDD using the unique RDD and the original one:
val inputRdd: RDD[(K,V)] = ...
val uniqueRdd: RDD[(K,V)] = inputRdd.reduceByKey((x,y) => x) //keep just a single value for each key
val duplicateRdd = inputRdd
.join(uniqueRdd)
.filter {case(k, (v1,v2)) => v1 != v2}
.map {case(k,(v1,v2)) => (k, v1)} //v2 came from unique rdd
There is also some room for optimization.
In the solution above there will be 2 shuffles (reduceByKey and join).
If we partition inputRdd by key from the start, the reduceByKey and the join can reuse that partitioning, so no additional shuffles are needed beyond the partitionBy itself.
Using this code should produce much better performance:
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions=200) )
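Put together, the pre-partitioned variant might look like the sketch below. It passes the same partitioner explicitly to reduceByKey and join so that both can reuse the partitioning, and persists the partitioned RDD because it is read twice. The DedupSplit object, the split method, and the numPartitions parameter are hypothetical names.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object DedupSplit {
  // Returns (unique, duplicates). Only partitionBy should shuffle; the later
  // steps are given the same partitioner so they can stay within partitions.
  def split[K: ClassTag, V: ClassTag](
      input: RDD[(K, V)],
      numPartitions: Int): (RDD[(K, V)], RDD[(K, V)]) = {
    val partitioner = new HashPartitioner(numPartitions)
    val partitioned = input.partitionBy(partitioner).persist()
    // Keep a single (arbitrary) value for each key.
    val unique = partitioned.reduceByKey(partitioner, (x: V, _: V) => x)
    val duplicates = partitioned
      .join(unique, partitioner)
      .filter { case (_, (v1, v2)) => v1 != v2 } // v2 came from the unique RDD
      .map { case (k, (v1, _)) => (k, v1) }
    (unique, duplicates)
  }
}
```

Note that, as in the code above, the v1 != v2 filter also drops duplicates whose value equals the kept value, so this assumes values under one key are distinct.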
Original Solution:
you can try the following approach:
first count the number of occurrences of each pair, and then split into the 2 rdds
val inputRdd: RDD[(K,V)] = ...
val countRdd: RDD[((K,V), Int)] = inputRdd
.map((_, 1))
.reduceByKey(_ + _)
.cache
val uniqueRdd = countRdd.map(_._1)
val duplicateRdd = countRdd
.filter(_._2>1)
.flatMap { case(kv, count) =>
(1 to count-1).map(_ => kv)
}
Use combineByKey so that a combiner runs on the map side, which reduces the amount of data shuffled.
The combiner logic depends on your business logic.
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
There are multiple ways to reduce shuffle data.
1. Write less from the map task by using a combiner.
2. Send aggregated, serialized objects from map to reduce.
3. Use combineInputFormats to enhance the efficiency of combiners.

How to write DataFrame (built from RDD inside foreach) to Kafka?

I'm trying to write a DataFrame from Spark to Kafka and I couldn't find any solution out there. Can you please show me how to do that?
Here is my current code:
activityStream.foreachRDD { rdd =>
val activityDF = rdd
.toDF()
.selectExpr(
"timestamp_hour", "referrer", "action",
"prevPage", "page", "visitor", "product", "inputProps.topic as topic")
val producerRecord = new ProducerRecord(topicc, activityDF)
kafkaProducer.send(producerRecord) // <--- this shows an error
}
type mismatch; found : org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.DataFrame] (which expands to) org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] required: org.apache.kafka.clients.producer.ProducerRecord[Nothing,String] Error occurred in an application involving default arguments.
Do collect on the activityDF to get the records (not Dataset[Row]) and save them to Kafka.
Note that you'll end up with a collection of records after collect so you probably have to iterate over it, e.g.
val activities = activityDF.collect()
// the following is pure Scala and has nothing to do with Spark
activities.foreach { a: Row =>
val pr: ProducerRecord = // map a to pr
kafkaProducer.send(pr)
}
Use pattern matching on Row to destructure it to fields/columns, e.g.
activities.foreach { case Row(timestamp_hour, referrer, action, prevPage, page, visitor, product, topic) =>
// ...transform a to ProducerRecord
kafkaProducer.send(pr)
}
PROTIP: I'd strongly suggest using a case class and transforming the DataFrame (= Dataset[Row]) to Dataset[YourCaseClass].
See Spark SQL's Row and Kafka's ProducerRecord docs.
As Joe Nate pointed out in the comments:
If you do "collect" before writing to any endpoint, it's going to make all the data aggregate at the driver and then make the driver write it out. (1) Can crash the driver if there is too much data; (2) no parallelism in the write.
That's 100% correct. I wish I had said it :)
You may want to use the approach as described in Writing Stream Output to Kafka instead.
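One common way to avoid the collect is to write from the executors with foreachPartition, creating one producer per partition. The sketch below is an assumption-laden illustration, not the method from the linked article: KafkaSink, rowToValue, and the brokers and topic parameters are hypothetical, and each Row is serialized as a plain comma-separated string.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{DataFrame, Row}

object KafkaSink {
  // Hypothetical serialization: join the column values with commas.
  def rowToValue(row: Row): String = row.mkString(",")

  // Sends every row from the executor that holds it; nothing is collected to the driver.
  def write(df: DataFrame, brokers: String, topic: String): Unit = {
    df.rdd.foreachPartition { rows =>
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      // The producer is created inside the partition, so it is never serialized.
      val producer = new KafkaProducer[String, String](props)
      try rows.foreach(r => producer.send(new ProducerRecord[String, String](topic, rowToValue(r))))
      finally producer.close()
    }
  }
}
```

Inside the question's foreachRDD block this would be called as KafkaSink.write(activityDF, brokers, topic), with both arguments supplied by the application.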

Spark Streaming - how to use reduceByKey within a partition on the Iterator

I am trying to consume a Kafka DirectStream, process the RDDs for each partition and write the processed values to a DB. When I try to perform reduceByKey (per partition, that is, without the shuffle), I get the following error. Usually on the driver node, we can use sc.parallelize(Iterator) to solve this issue. But I would like to solve it in Spark Streaming.
value reduceByKey is not a member of Iterator[((String, String), (Int, Int))]
Is there a way to perform transformations on Iterator within the partition?
myKafkaDS
.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
val commonIter = rdd.mapPartitionsWithIndex ( (i,iter) => {
val offset = offsetRanges(i)
val records = iter.filter(item => {
(some_filter_condition)
}).map(r1 => {
// Some processing
((field2, field2), (field3, field4))
})
val records.reduceByKey((a,b) => (a._1+b._1, a._2+b._2)) // Getting reduceByKey() is not a member of Iterator
// Code to write to DB
Iterator.empty // I just want to store the processed records in DB. So returning empty iterator
})
}
Is there a more elegant way to do this (process Kafka RDDs for each partition and store them in a DB)?
So... we cannot use Spark transformations within mapPartitionsWithIndex. However, using Scala collection methods such as groupBy and reduce on the iterator helped me solve this issue.
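A minimal sketch of such a per-partition reduction in plain Scala, folding the iterator into a Map so no Spark transformation is needed. The PartitionReduce object and reduceByKeyLocal are hypothetical names; the key and value shapes follow the error message in the question.

```scala
object PartitionReduce {
  type Key = (String, String)
  type Counts = (Int, Int)

  // Merges values that share a key by summing both components,
  // mirroring reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) for one partition.
  def reduceByKeyLocal(records: Iterator[(Key, Counts)]): Map[Key, Counts] =
    records.foldLeft(Map.empty[Key, Counts]) { case (acc, (k, v)) =>
      val merged = acc.get(k).map(a => (a._1 + v._1, a._2 + v._2)).getOrElse(v)
      acc.updated(k, merged)
    }
}
```

Inside mapPartitionsWithIndex this would be applied to the filtered and mapped iterator, and the resulting Map written to the DB. The Map holds one entry per distinct key in the partition, so it assumes those fit in executor memory.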
Your records value is an Iterator and not an RDD. Hence you are unable to invoke reduceByKey on it.
Syntax issues:
1)reduceByKey logic looks ok, please remove val before statement(if not typo) & attach reduceByKey() after map:
.map(r1 => {
// Some processing
((field2, field2), (field3, field4))
}).reduceByKey((a,b) => (a._1+b._1, a._2+b._2))
2)Add iter.next after end of each iteration.
3)iter.empty is wrongly placed. Put after coming out of mapPartitionsWithIndex()
4)Add iterator condition for safety:
val commonIter = rdd.mapPartitionsWithIndex ((i,iter) => if (i == 0 && iter.hasNext){
....
}else iter),true)

Apache Spark: stepwise execution

Due to a performance measurement I want to execute my Scala program written for Spark stepwise, i.e.
execute first operator; materialize result;
execute second operator; materialize result;
...
and so on. The original code:
var filename = new String("<filename>")
var text_file = sc.textFile(filename)
var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
counts.saveAsTextFile("file://result")
So I want the execution of var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) to be stepwise.
Is calling counts.foreachPartition(x => {}) after every operator the right way to do it?
Or is writing to /dev/null with saveAsTextFile() a better alternative? And does spark actually have something like a NullSink for that purpose? I wasn't able to write to /dev/null with saveAsTextFile() because /dev/null already exists. Is there a way to overwrite the spark result folder?
And should the temporary result after each operation be cached with cache()?
What is the best way to separate the execution?
Spark supports two types of operations: actions and transformations. Transformations, as the name implies, turn datasets into new ones through the combination of the transformation operator and (in some cases, optionally) a function provided to the transformation. Actions, on the other hand, run through a dataset with some computation to provide a value to the driver.
There are two things that Spark does that make your desired task a little difficult: it bundles non-shuffling transformations into execution blocks called stages, and stages in the scheduling graph must be triggered through actions.
For your case, provided your input isn't massive, I think it would be easiest to trigger your transformations with a dummy action (e.g. count(), collect()) as the RDD will be materialized. During RDD computation, you can check the Spark UI to gather any performance statistics about the steps/stages/jobs used to create it.
This would look something like:
val text_file = sc.textFile(filename)
val words = text_file.flatMap(line => line.split(" "))
words.count()
val wordCount = words.map(word => (word, 1))
wordCount.count()
val wordCounts = wordCount.reduceByKey(_ + _)
wordCounts.count()
Some notes:
Since RDDs are, for all intents and purposes, immutable, they should be stored in vals
You can shorten your reduceByKey() syntax with underscore notation
Your approach with foreachPartition() could work since it is an action, but it would require a change in your functions since you are operating over an iterator on your partition
Caching only makes sense if you either create multiple RDDs from a parent RDD (branching out) or run iterated computation over the same RDD (perhaps in a loop)
You can also simply invoke RDD.persist() or RDD.cache() after every transformation, but make sure you choose an appropriate StorageLevel.
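For the measurement itself, a small timing wrapper around each dummy action may help. In this sketch every intermediate RDD is cached, because it is used twice (once by count() and once by the next step); without the cache, each count() would recompute all earlier steps and the timings would be cumulative. The Stepwise object, timed, and wordCountStepwise are hypothetical names.

```scala
import org.apache.spark.rdd.RDD

object Stepwise {
  // Runs body, prints and returns the elapsed wall-clock time in milliseconds.
  def timed[T](label: String)(body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"$label: $elapsedMs ms")
    (result, elapsedMs)
  }

  // Materializes each operator with count() before moving on to the next one.
  def wordCountStepwise(lines: RDD[String]): RDD[(String, Int)] = {
    val words = lines.flatMap(_.split(" ")).cache()
    timed("flatMap")(words.count()) // also includes reading the input
    val pairs = words.map(word => (word, 1)).cache()
    timed("map")(pairs.count())
    val counts = pairs.reduceByKey(_ + _).cache()
    timed("reduceByKey")(counts.count())
    counts
  }
}
```

The printed numbers are wall-clock times measured on the driver, so the Spark UI remains the better source for per-stage detail.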
