How to get a record count for different DataFrame writes, without an action, using SparkListener? - apache-spark

I need to know the record count of a DataFrame after a write, without invoking an additional action.
I know we can calculate it with a Spark listener, as below. But the code below gets called for every completed task. Say I have dataframe1 and dataframe2: onTaskEnd fires for the tasks of both writes, so I need a flag to tell the dataframe1 calls apart from the dataframe2 calls before incrementing the right counter.
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

var dataFrame_1_counter = 0L
var dataFrame_2_counter = 0L

sparkSession.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    synchronized {
      if (`isDataFrame1Call`) { // any way to get isDataFrame1Call?
        dataFrame_1_counter += taskEnd.taskMetrics.outputMetrics.recordsWritten
      } else {
        dataFrame_2_counter += taskEnd.taskMetrics.outputMetrics.recordsWritten
      }
    }
  }
})
I need an isDataFrame1Call flag. Is there any way to get one?

This was solved by setting a job group for each thread in Spark; a sketch of that approach follows.
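A minimal sketch of the job-group approach (class name, group names and output paths below are illustrative, not from the original post): each write is tagged with sparkContext.setJobGroup, the listener records which job group each stage belongs to via the "spark.jobGroup.id" local property carried by SparkListenerJobStart, and onTaskEnd then attributes recordsWritten to that group.

import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerTaskEnd}

class PerJobGroupRecordCounter extends SparkListener {
  private val stageToGroup = TrieMap.empty[Int, String]
  private val counters = TrieMap.empty[String, Long]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // "spark.jobGroup.id" is the local property that sparkContext.setJobGroup sets.
    val group = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.jobGroup.id")))
      .getOrElse("unknown")
    jobStart.stageIds.foreach(stageId => stageToGroup.put(stageId, group))
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val group = stageToGroup.getOrElse(taskEnd.stageId, "unknown")
    val written = Option(taskEnd.taskMetrics).map(_.outputMetrics.recordsWritten).getOrElse(0L)
    counters.update(group, counters.getOrElse(group, 0L) + written)
  }

  def recordsWritten(group: String): Long = counters.getOrElse(group, 0L)
}

Usage: tag each write with its own job group before triggering it (setJobGroup is thread-local, which is why a job group per thread works here). Listener events are delivered asynchronously, so read the counters only after the write jobs have finished.

val counter = new PerJobGroupRecordCounter()
sparkSession.sparkContext.addSparkListener(counter)

sparkSession.sparkContext.setJobGroup("dataframe1", "write of dataframe1")
dataFrame1.write.parquet("/tmp/out1") // placeholder path

sparkSession.sparkContext.setJobGroup("dataframe2", "write of dataframe2")
dataFrame2.write.parquet("/tmp/out2") // placeholder path

val dataFrame_1_counter = counter.recordsWritten("dataframe1")
val dataFrame_2_counter = counter.recordsWritten("dataframe2")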

Related

Spark transformation (map) is not called even after calling the action (count)

I have a map function defined on the DataFrame, and when I invoke an action (count() in this case) I do not see the logic inside the map function being called for each row.
Here is the code I have
def copyFilesToArchive(recordDF: DataFrame, s3Util: S3Client): Unit = {
  if (s3Util != null) {
    // Copy all the objects to the new path
    logger.info(".copyFilesToArchive() : Before Copying the Files to Archive and no.of RDD Partitions ={}", recordDF.rdd.partitions.length)
    recordDF.rdd.map(row => {
      val key = row.getAs[String]("object_key")
      val bucketName = row.getAs[String]("bucket_name")
      val targetBucketName = row.getAs[String]("target_bucket_name")
      val targetKey = "archive/" + "/" + key
      val copyObjectRequest = new CopyObjectRequest(bucketName, key, targetBucketName, targetKey)
      logger.info(".copyFilesToArchive() : Copying the File from [" + key + "] to [" + targetKey + "]")
      s3Util.getS3Client.copyObject(copyObjectRequest)
    })
    logger.info(".copyFilesToArchive() : Copying the Files to Archive Folder. No.of Files to Copy ={}", recordDF.count())
  }
  else {
    logger.info(".copyFilesToArchive() : Skipping Moving the Files as S3 Util is null")
  }
}
And when I run my unit tests, I do not see the per-file "Copying the File" logging statement from inside the map:
INFO ArchiveProcessor - .copyFilesToArchive() : Before Copying the Files to Archive and no.of RDD Partitions =200
INFO ArchiveProcessor - .copyFilesToArchive() : Copying the Files to Archive Folder. No.of Files to Copy =3000000
If I use collect() I can see the logging output, but then I get an OOM error:
recordDF.collect().map(row => {
  ...
})
Thanks
Sateesh
Spark DataFrames are immutable: a transformation does not change the original DataFrame variable.
You are calling the action count() on recordDF, but not on the transformed version of recordDF, i.e. recordDF.rdd.map(/* operations */). Since you never call an action on that mapped RDD, that particular code block is never executed.
collect() is an action, which is why recordDF.collect().map(..) works for you. But collect() brings all the records to the driver, and if driver memory is not enough (the default is 1 GB) you get an OOM error.
You can use foreach or foreachPartition on the DataFrame, e.g. recordDF.foreach(row => /* copy logic goes here */), or call an action on the mapped RDD:
val outRDD = recordDF.rdd.map(row => /* ... */)
logger.info("--<your message>--", outRDD.count)

Does RDD's .first() method shuffle?

Imagine we have small_table and big_table, and need to do this:
small_table.join(big_table, "left_outer")
Could it be faster if I do this instead:
small_table.map(row => {
  val find = big_table.filter('id === row.id)
  if (find.isEmpty) return Smth(row.id, null)
  return Smth(row.id, find.first().name)
})
If you were able to access the data of one RDD inside a mapping of another RDD, you could run some performance tests here and see the difference. Unfortunately, the following line:
val find = big_table.filter('id === row.id)
is not possible, because it attempts to access the data of one RDD (big_table) from inside a transformation of another RDD.
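For illustration, a hedged sketch of one way to get the same result without nesting RDDs, assuming small_table really is small, ids in big_table are unique, and both tables are RDDs of simple case classes (the case classes and types below are assumptions; the question mixes RDD and DataFrame syntax):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class Small(id: Long)
case class Big(id: Long, name: String)
case class Smth(id: Long, name: String)

def lookupNames(sc: SparkContext, small_table: RDD[Small], big_table: RDD[Big]): RDD[Smth] = {
  // Ship the (small) set of wanted ids to the executors once...
  val wantedIds = sc.broadcast(small_table.map(_.id).collect().toSet)
  // ...reduce big_table to only those rows, which is at most |small_table| rows...
  val names: Map[Long, String] = big_table
    .filter(row => wantedIds.value.contains(row.id))
    .map(row => row.id -> row.name)
    .collect()
    .toMap
  // ...and finish the left-outer-style lookup without touching one RDD inside another.
  small_table.map(row => Smth(row.id, names.getOrElse(row.id, null)))
}

Whether this beats the plain left_outer join depends entirely on the sizes involved; the join is usually the simpler and safer choice.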

Child thread not seeing updates made by main thread

I'm implementing a SparkHealthListener by extending the SparkListener class.
@Component
class ClusterHealthListener extends SparkListener with Logging {
  val appRunning = new AtomicBoolean(false)
  val executorCount = new AtomicInteger(0)

  override def onApplicationStart(applicationStart: SparkListenerApplicationStart) = {
    logger.info("Application Start called .. ")
    this.appRunning.set(true)
    logger.info(s"[appRunning = ${appRunning.get}]")
  }

  override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded) = {
    logger.info("Executor add called .. ")
    this.executorCount.incrementAndGet()
    logger.info(s"[executorCount = ${executorCount.get}]")
  }
}
appRunning and executorCount are two variables declared in the ClusterHealthListener class. ClusterHealthReporterThread only reads their values.
@Component
class ClusterHealthReporterThread @Autowired() (healthListener: ClusterHealthListener) extends Logging {
  new Thread {
    override def run(): Unit = {
      while (true) {
        Thread.sleep(10 * 1000)
        logger.info("Checking range health")
        logger.info(s"[appRunning = ${healthListener.appRunning.get}] [executorCount=${healthListener.executorCount.get}]")
      }
    }
  }.start()
}
ClusterHealthReporterThread always reports the initialized values, regardless of the changes made to the variables by the main thread. What am I doing wrong? Is this because I inject healthListener into ClusterHealthReporterThread?
Update
I played around a bit and it looks like it has something to do with the way I register the Spark listener.
If I add the spark listener like this
val sparkContext = SparkContext.getOrCreate(sparkConf)
sparkContext.addSparkListener(healthListener)
The parent thread always shows appRunning as 'false' but shows the executor count correctly. The child thread (health reporter) also shows the proper executor counts, but appRunning always reports 'false', just like in the main thread.
Then I stumbled across "Why is SparkListenerApplicationStart never fired?" and tried setting the listener at the Spark config level:
.set("spark.extraListeners", "HealthListener class path")
If I do this, the main thread reports 'true' for appRunning and the correct executor counts, but the child thread always reports 'false' and an executor count of 0.
I can't immediately see what's wrong here, you might have found an interesting edge case.
I think @m4gic's comment might be correct: perhaps the logging library is caching that interpolated string? It looks like you are using https://github.com/lightbend/scala-logging, which claims that this interpolation "has no effect on behavior", so maybe not. Please could you follow his suggestion to retry without using that feature and report back?
A second possibility: I wonder if there is only one ClusterHealthListener in the system? Perhaps the autowiring is causing a second instance to be created? Can you log the object ids of the ClusterHealthListener reference in both locations and verify that they are the same object?
If neither of those suggestions fix this, are you able to post a working example that I can play with?
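A quick way to test the second suggestion (the logger calls below are only illustrative): log the identity hash of the listener in both places and compare.

// Next to wherever the listener is registered (e.g. the addSparkListener call):
logger.info(s"listener registered with Spark = ${System.identityHashCode(healthListener)}")

// Inside ClusterHealthListener, e.g. at the top of onApplicationStart:
logger.info(s"listener receiving events = ${System.identityHashCode(this)}")

If the two numbers differ, Spark and Spring are holding two different ClusterHealthListener instances, which would explain why one of them never sees the updates.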

Spark kafka streaming - how to determine end of a batch

I am consuming from a Kafka topic using Spark Streaming (a Kafka direct stream).
The data in this topic arrives every 5 minutes from another source.
I need to process each 5-minute load and convert it into a Spark DataFrame.
A stream, however, is a continuous flow of data.
My issue is: how do I determine that I am done reading the first set of data that was loaded into the Kafka topic, so that I can convert it into a DataFrame and start my work?
I know I can set the batch interval (in JavaStreamingContext) to a certain value, but even then I can never be sure how much time the source will take to push the data to the topic.
Any suggestions are welcome.
If I understand your question correctly, you would like to avoid creating a batch until all of the data for a given 5-minute load has been read.
Spark does not provide an API for that out of the box.
You can, however, use a sliding window on the received stream to achieve part of what you want (see the sketch below and the mapWithState example at the end).
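A rough sketch of the windowing idea (kafkaStream and spark below stand in for the direct stream and SparkSession from the question, and the record values are assumed to be strings):

import org.apache.spark.streaming.Minutes

// Window length == slide interval, so the stream is cut into non-overlapping 5-minute chunks,
// each of which arrives in foreachRDD as a single RDD.
val fiveMinuteChunks = kafkaStream.map(record => record.value()).window(Minutes(5), Minutes(5))

fiveMinuteChunks.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    import spark.implicits._
    val df = rdd.toDF("value") // the whole 5-minute chunk as a DataFrame
    // ... process df here ...
  }
}

This only approximates "all the data for the 5 minutes has arrived", because the window is aligned to the streaming clock rather than to when the upstream source finished writing.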
The other (harder) way is to use org.apache.spark.streaming.util.ManualClock and advance it yourself to control when batches are created.
ManualClock is a private class, so the wrapper below has to live inside the org.apache.spark.streaming namespace.
package org.apache.spark.streaming

import org.apache.spark.util.ManualClock

object ClockWrapper {
  def advance(ssc: StreamingContext, timeToAdd: Duration): Unit = {
    val manualClock = ssc.scheduler.clock.asInstanceOf[ManualClock]
    manualClock.advance(timeToAdd.milliseconds)
  }
}
Then in your own class
import org.apache.spark.streaming.{ClockWrapper, Duration, Seconds, StreamingContext}

//elided.

override def sparkConfig: Map[String, String] = {
  super.sparkConfig + ("spark.streaming.clock" -> "org.apache.spark.streaming.util.ManualClock")
}

def ssc: StreamingContext = _ssc

def advanceClock(timeToAdd: Duration): Unit = {
  // Only if some other conditions are met..
  ClockWrapper.advance(_ssc, timeToAdd)
}

def advanceClockOneBatch(): Unit = {
  advanceClock(Duration(batchDuration.milliseconds))
}
State based stream management can be done by using mapWithState API.
object StatefulStreamOperation {
  val sparkConf = new SparkConf().setAppName("StatefulStreamOperation")
  // Create the context with a 1 second batch size
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  ssc.checkpoint(".")

  // dataDstream: DStream[(BatchKey, Int)], keyed by whatever identifies the 5-minute
  // batch a record belongs to (building it from the Kafka stream is elided here).
  val mappingFunc = (key: BatchKey, incoming: Option[Int], state: State[UserClass]) => {
    // Do whatever you need to do to the data, e.g.
    val (result, newState) = someCoolOperation(incoming, state.getOption)
    state.update(newState)
    result
  }

  val stateDstream = dataDstream.mapWithState(
    StateSpec.function(mappingFunc).initialState(initialRDD))
  // Do something with the result.

  ssc.start()
  ssc.awaitTermination()
}

Remove duplicates from a Spark JavaPairDStream / JavaDStream

I'm building a Spark Streaming application which receives data via a socketTextStream. The problem is that the data sent contains some duplicates, and I would like to remove them on the Spark side (without pre-filtering on the sender side). Can I use JavaPairRDD's distinct function from the DStream somehow (I can't find a way to do that)? I need the "filtered" Java(Pair)DStream for later actions.
Thank you!
The .transform() method can be used to perform arbitrary RDD-to-RDD operations on each time slice of a DStream, and it returns a new DStream you can keep using for later actions. Assuming your data are just strings:
someDStream.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
  @Override
  public JavaRDD<String> call(JavaRDD<String> rows) throws Exception {
    return rows.distinct();
  }
});
