Spark shuffle when double repartition - apache-spark

I am trying to join some datasets in Spark and I trying to do that without shuffle.
Unfortunately after running some test I saw something odd.
Let's say I have dataset A in S3 as plaintext (Json string).
First, I read the dataset A and repartition it by a specific field. This is a dummy code as example:
val rddA = sc.textFile(s3InputPath)
.map(x => objectMapper.readValue(x, classOf[APojo]))
.map(x => (x.getId(), x)) // "id" field is type of String
.partitionBy(new HashPartitioner(10))
.map(x => objectMapper.writeValueAsString(x._2))
rddA.saveAsTextFile(s3OutputPath)
Then I try to read the previous dataset and run the exact same repartition:
val rddAClone = sc.textFile(s3OutputPath)
.map(x => objectMapper.readValue(x, classOf[APojo]))
.map(x => (x.getId(), x))
.partitionBy(new HashPartitioner(10))
.map(x => objectMapper.writeValueAsString(x._2))
rddAClone.collect //any operation to force the spark to process the RDD
My expectation was that, because I priorly repartitioned my dataset by "id", when I try to repartition it again, the operation will finish without any shuffle. Unfortunately, I still see some shuffle in the Spark's Application Monitor (<5% of the input dataset as shuffle)
Do any of you have confronted with the same problem? Is this behaviour intended or is something wrong with my assumption/expectancy?
Thanks!

Related

Spark converting dataframe to RDD takes a huge amount of time, lazy execution or real issue?

In my spark application, I am loading data from Solr into a dataframe, running an SQL query on it, and then writing the resulting dataframe to MongoDB.
I am using spark-solr library to read data from Solr and mongo-spark-connector to write results to MongoDB.
The problem is that it is very slow, for datasets as small as 90 rows in an RDD, the spark job takes around 6 minutes to complete (4 nodes, 96gb RAM, 32 cores each).
I am sure that reading from Solr and writing to MongoDB is not slow because outside Spark they perform very fast.
When I inspect running jobs/stages/tasks on application master UI, it always shows a specific line in this function as taking 99% of the time:
override def exportData(spark: SparkSession, result: DataFrame): Unit = {
try {
val mongoWriteConfig = configureWriteConfig
MongoSpark.save(result.withColumn("resultOrder", monotonically_increasing_id())
.rdd
.map(row => {
implicit val formats: DefaultFormats.type = org.json4s.DefaultFormats
val rMap = Map(row.getValuesMap(row.schema.fieldNames.filterNot(_.equals("resultOrder"))).toSeq: _*)
val m = Map[String, Any](
"queryId" -> queryId,
"queryIndex" -> opIndex,
"resultOrder" -> row.getAs[Long]("resultOrder"),
"result" -> rMap
)
Document.parse(Serialization.write(m))
}), mongoWriteConfig);
} catch {
case e: SparkException => handleMongoException(e)
}
}
The line .rdd is shown to take most of the time to execute. Other stages take a few seconds or less.
I know that converting a dataframe to an rdd is not an inexpensive call but for 90 rows it should not take this long. My local standalone spark instance can do it in a few seconds.
I understand that Spark executes transformations lazily. Does it mean that operations before .rdd call is taking a long time and it's just a display issue on application master UI? Or is it really the dataframe to rdd conversion taking too long? What can cause this?
By the way, SQL queries run on the dataframe are pretty simple ones, just a single group by etc.

RDD toDF() : Erroneous Behavior

I built a SparkStreaming App that fetches content from A Kafka Queue and intends to put the data into a MySQL table after some pre-processing and structuring.
I call the 'foreachRDD' method on the SparkStreamingContext. The issue that I'm facing is that there's dataloss between when I call saveAsTextFile on the RDD and DataFrame's write method with format("csv"). I can't seem to pin point why this is happening.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
stream.foreachRDD {
rdd => {
rdd.saveAsTextFile("/Users/jarvis/rdds/"+new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date)+"_rdd")
import spark.implicits._
val messagesDF = rdd.map(_.split("\t")).map( w => { Record ( w(0), autoTag( w(1),w(4) ) , w(2), w(3), w(4), w(5).substring(w(5).lastIndexOf("http://")), w(6).split("\n")(0) )}).toDF("recordTS","tag","channel_url","title","description","link","pub_TS")
messagesDF.write.format("csv").save(dumpPath+new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date)+"_DF")
}
}
ssc.start()
ssc.awaitTermination()
There's data loss ie Many rows don't make it to the DataFrame from the RDD.
There's also replication: Many rows that do reach the Dataframe are replicated many times.
Found the error. Actually there was a wrong understanding about the ingested data format.
The intended data was "\t\t\t..." and hence the Row was supposed be split at "\n".
However the actual data was :
"\t\t\t...\n\t\t\t...\n"
So the rdd.map(...) operation needed another map for splitting at every "\n"

Spark Streaming - how to use reduceByKey within a partition on the Iterator

I am trying to consume Kafka DirectStream, process the RDDs for each partition and write the processed values to DB. When I try to perform reduceByKey(per partition, that is without the shuffle), I get the following error. Usually on the driver node, we can use sc.parallelize(Iterator) to solve this issue. But I would like to solve it in spark streaming.
value reduceByKey is not a member of Iterator[((String, String), (Int, Int))]
Is there a way to perform transformations on Iterator within the partition?
myKafkaDS
.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
val commonIter = rdd.mapPartitionsWithIndex ( (i,iter) => {
val offset = offsetRanges(i)
val records = iter.filter(item => {
(some_filter_condition)
}).map(r1 => {
// Some processing
((field2, field2), (field3, field4))
})
val records.reduceByKey((a,b) => (a._1+b._1, a._2+b._2)) // Getting reduceByKey() is not a member of Iterator
// Code to write to DB
Iterator.empty // I just want to store the processed records in DB. So returning empty iterator
})
}
Is there a more elegant way to do this(process kafka RDDs for each partition and store them in a DB)?
So... We can not use spark transformations within mapPartitionsWithIndex. However using scala transform and reduce methods like groupby helped me solve this issue.
yours records value is a iterator and Not a RDD. Hence you are unable to invoke reduceByKey on records relation.
Syntax issues:
1)reduceByKey logic looks ok, please remove val before statement(if not typo) & attach reduceByKey() after map:
.map(r1 => {
// Some processing
((field2, field2), (field3, field4))
}).reduceByKey((a,b) => (a._1+b._1, a._2+b._2))
2)Add iter.next after end of each iteration.
3)iter.empty is wrongly placed. Put after coming out of mapPartitionsWithIndex()
4)Add iterator condition for safety:
val commonIter = rdd.mapPartitionsWithIndex ((i,iter) => if (i == 0 && iter.hasNext){
....
}else iter),true)

Is foreachRDD executed on the Driver?

I am trying to process some XML data received on a JMS queue (QPID) using Spark Streaming. After getting xml as DStream I convert them to Dataframes so I can join them with some of my static data in form of Dataframes already loaded.
But as per API documentation for foreachRdd method on DStream:
it gets executed on Driver, so does that mean all processing logic will only run on Driver and not get distributed to workers/executors.
API Documentation
foreachRDD(func)
The most generic output operator that applies a
function, func, to each RDD generated from the stream. This function
should push the data in each RDD to an external system, such as saving
the RDD to files, or writing it over the network to a database. Note
that the function func is executed in the driver process running the
streaming application, and will usually have RDD actions in it that
will force the computation of the streaming RDDs.
so does that mean all processing logic will only run on Driver and not
get distributed to workers/executors.
No, the function itself runs on the driver, but don't forget that it operates on an RDD. The inner functions that you'll use on the RDD, such as foreachPartition, map, filter etc will still run on the worker nodes. This won't cause all the data to be sent back over the network to the driver, unless you call methods like collect, which do.
To make this clear, if you run the following, you will see "monkey" on the driver's stdout:
myDStream.foreachRDD { rdd =>
println("monkey")
}
If you run the following, you will see "monkey" on the driver's stdout, and the filter work will be done on whatever executors the rdd is distributed across:
myDStream.foreachRDD { rdd =>
println("monkey")
rdd.filter(element => element == "Save me!")
}
Let's add the simplification that myDStream only ever receives one RDD, and that this RDD is spread across a set of partitions that we'll call PartitionSetA that exist on MachineSetB where ExecutorSetC are running. If you run the following, you will see "monkey" on the driver's stdout, you will see "turtle" on the stdouts of all executors in ExecutorSetC ("turtle" will appear once for each partition -- many partitions could be on the machine where an executor is running), and the work of both the filter and addition operations will be done across ExecutorSetC:
myDStream.foreachRDD { rdd =>
println("monkey")
rdd.filter(element => element == "Save me!")
rdd.foreachPartition { partition =>
println("turtle")
val x = 1 + 1
}
}
One more thing to note is that in the following code, y would end up being sent across the network from the driver to all of ExecutorSetC for each rdd:
val y = 2
myDStream.foreachRDD { rdd =>
println("monkey")
rdd.filter(element => element == "Save me!")
rdd.foreachPartition { partition =>
println("turtle")
val x = 1 + 1
val z = x + y
}
}
To avoid this overhead, you can use broadcast variables, which send the value from the driver to the executors just once. For example:
val y = 2
val broadcastY = sc.broadcast(y)
myDStream.foreachRDD { rdd =>
println("monkey")
rdd.filter(element => element == "Save me!")
rdd.foreachPartition { partition =>
println("turtle")
val x = 1 + 1
val z = x + broadcastY.value
}
}
For sending more complex things over as broadcast variables, such as objects that aren't easily serializable once instantiated, you can see the following blog post: https://allegro.tech/2015/08/spark-kafka-integration.html

microbatch operation in foreachRDD block of spark streaming

How can we microbatch operations in foreachRDD block . For example , I read logs from HDFS and perform operation in foreachRDD
val lines = ssc.textFileStream(hadoopPath)
lines.foreachRDD{ rdd =>
val newRDD = rdd.map(line =>
ScalaObject.process(line)
}
The code will call ScalaObject.process for each line in logs. Is is possbile to call ScalaObject.process for a batch of lines ?
Thanks
In the context of your example, what you're asking to do will not make use of Spark's parallelism. If you really need to do this for some reason, you can run the processing only once per partition. The parallelism you get with this will only be N, where N is the number of partitions.
Use rdd.foreachPartition:
val lines = ssc.textFileStream(hadoopPath)
lines.foreachRDD { rdd =>
val newRDD = rdd.foreachPartition(lines =>
ScalaObject.process(lines)
)
}
The parameter lines will be of type Iterable[String].

Resources