Spark: Call a custom method before processing an RDD on each executor

I am working on a Spark Streaming application. I have a requirement where I need to verify a certain condition (by reading a file present on the local FS) before processing each RDD.
I tried doing:
lines.foreachRDD { rdd =>
  verifyCondition()
  rdd.map(...)
}

def verifyCondition() {
  ...
}
But verifyCondition is executed only on the driver. Is there any way to have it executed by each executor?
Thanks

You can move the verifyCondition call inside rdd.map(), like:
rdd.map { x =>
  verifyCondition()
  ...
}
The function passed to map is a closure (a closure is a record storing a function together with its environment); Spark serializes it and distributes it to the executors, so it will be executed by each executor.

lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    verifyCondition(...) // Executed on the executors, once per partition
    partition.foreach(...)
  }
}
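For completeness, here is a minimal, self-contained sketch of that pattern inside a streaming job. The socket source, the batch interval and the /tmp/condition.flag path are hypothetical; verifyCondition here just checks for a marker file on whichever executor runs the partition, so adapt both to your actual source and condition.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PerExecutorCheck {

  // Hypothetical check: reads a file on the local FS of the executor that runs it.
  def verifyCondition(): Boolean =
    new java.io.File("/tmp/condition.flag").exists()

  def main(args: Array[String]): Unit = {
    val conf  = new SparkConf().setAppName("per-executor-check").setMaster("local[2]")
    val ssc   = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Runs on an executor, once per partition, before the records are processed.
        if (verifyCondition()) {
          partition.foreach(line => println(line.toUpperCase))
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}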

Related

Spark transformation (map) is not called even after calling the action (count)

I have a map function defined on the DataFrame, and when I invoke an action (count() in this case) I don't see the calls inside the map function being executed for each row.
Here is the code I have
def copyFilesToArchive(recordDF: DataFrame, s3Util: S3Client): Unit = {
  if (s3Util != null) {
    // Copy all the objects to the new path
    logger.info(".copyFilesToArchive() : Before Copying the Files to Archive and no.of RDD Partitions ={}", recordDF.rdd.partitions.length);
    recordDF.rdd.map(row => {
      var key = row.getAs("object_key")
      var bucketName = row.getAs("bucket_name")
      var targetBucketName = row.getAs("target_bucket_name")
      var targetKey = "archive/" + "/" + key
      var copyObjectRequest = new CopyObjectRequest(bucketName, key, targetBucketName, targetKey)
      logger.info(".copyFilesToArchive() : Copying the File from [" + key + "] to [" + targetKey + "]");
      s3Util.getS3Client.copyObject(copyObjectRequest)
    })
    logger.info(".copyFilesToArchive() : Copying the Files to Archive Folder. No.of Files to Copy ={}", recordDF.count());
  }
  else {
    logger.info(".copyFilesToArchive() : Skipping Moving the Files as S3 Util is null");
  }
}
And when I run my unit tests I don't see the logging statements for copying the files:
INFO ArchiveProcessor - .copyFilesToArchive() : Before Copying the Files to Archive and no.of RDD Partitions =200
INFO ArchiveProcessor - .copyFilesToArchive() : Copying the Files to Archive Folder. No.of Files to Copy =3000000
If I use collect() I can see the logging output, but then I get an OOM error:
recordDF.collect().map(row => {
...
})
Thanks
Sateesh
Spark DataFrames are immutable; a transformation does not change the original DataFrame variable, it returns a new one.
You are calling the action count() on recordDF, but not on the transformed version of recordDF, i.e. recordDF.rdd.map(/* operations */). Since you never call an action on that transformed RDD, that code block never executes (transformations are lazy).
collect() is an action, which is why recordDF.collect().map(..) works for you. But collect() brings all the records to the driver, and if driver memory is not enough (the default is 1 GB) you get an OOM error.
You can instead use foreach or foreachPartition on the DataFrame, e.g. recordDF.foreach(row => /* copy logic goes here */), or call an action on the mapped RDD:
val outRDD = recordDF.rdd.map(row => /* ... */)
logger.info("--<your message>--", outRDD.count)

How to count records for different DataFrame writes, without an extra action, using a SparkListener?

I need to know the record count of a DataFrame after a write, without invoking an additional action.
I know that with a Spark listener I can calculate it as below. But the code below is called for every completed task: say I have dataframe1 and dataframe2, then onTaskEnd fires for the write tasks of both, so I need a flag to tell the dataframe1 calls apart from the dataframe2 calls in order to increment the right counter.
var dataFrame_1_counter = 0L
var dataFrame_2_counter = 0L

sparkSession.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    synchronized {
      if (`isDataFrame1Call`) { // any way to get isDataFrame1Call?
        dataFrame_1_counter += taskEnd.taskMetrics.outputMetrics.recordsWritten
      } else {
        dataFrame_2_counter += taskEnd.taskMetrics.outputMetrics.recordsWritten
      }
    }
  }
})
I need the isDataFrame1Call flag. Is there any way to get it?
This was solved by setting a job group for each thread in Spark.
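A sketch of that idea, assuming each DataFrame is written from its own thread under its own job group id; the group names "df1"/"df2" and the stage-to-group bookkeeping are illustrative, only the listener events and the spark.jobGroup.id property come from Spark itself:

import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerTaskEnd}

class PerGroupRecordCounter extends SparkListener {
  private val stageToGroup = TrieMap.empty[Int, String]
  val recordsWritten = TrieMap.empty[String, Long]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // The job group set via sparkContext.setJobGroup(...) travels with the job's properties.
    val group = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.jobGroup.id")))
      .getOrElse("unknown")
    jobStart.stageIds.foreach(stageId => stageToGroup.put(stageId, group))
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val group   = stageToGroup.getOrElse(taskEnd.stageId, "unknown")
    val written = Option(taskEnd.taskMetrics).map(_.outputMetrics.recordsWritten).getOrElse(0L)
    recordsWritten.synchronized {
      recordsWritten.put(group, recordsWritten.getOrElse(group, 0L) + written)
    }
  }
}

// In each writer thread (hypothetical usage):
//   sparkSession.sparkContext.setJobGroup("df1", "write of dataframe 1")
//   dataFrame1.write.parquet("...")
//   // afterwards, listener.recordsWritten("df1") holds the count for that write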

Does RDD's .first() method shuffle?

Imagine we have small_table and big_table, and we need to do this:
small_table.join(big_table, "left_outer")
Could it be faster if I do this instead:
small_table.map(row => {
  val find = big_table.filter('id === row.id)
  if (find.isEmpty) return Smth(row.id, null)
  return Smth(row.id, find.first().name)
})
If you were able to access the data of one RDD inside a mapping of another RDD, you could run some performance tests here to see the difference. Unfortunately, the following code:
val find = big_table.filter('id === row.id)
is not possible, because it attempts to access the data of one RDD from inside a transformation of another RDD, which Spark does not allow. A common workaround is sketched below.
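Here is a sketch of one conventional workaround (not taken from the answer above): since small_table is small, collect its keys to the driver, broadcast them, pre-filter big_table on the executors, and only then run an ordinary left outer join. The case classes and column names below are hypothetical stand-ins for the real schema.

import org.apache.spark.sql.SparkSession

case class Small(id: Long)
case class Big(id: Long, name: String)

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val small_table = Seq(Small(1L), Small(2L)).toDS()
    val big_table   = Seq(Big(1L, "a"), Big(3L, "c")).toDS()

    // small_table fits in memory by assumption, so its ids can be collected and broadcast.
    val smallIds = spark.sparkContext.broadcast(small_table.map(_.id).collect().toSet)

    // Only the matching slice of big_table takes part in the join and its shuffle.
    val bigSlice = big_table.filter(b => smallIds.value.contains(b.id))
    val joined   = small_table.join(bigSlice, Seq("id"), "left_outer")

    joined.show() // id=1 keeps name "a", id=2 gets null

    spark.stop()
  }
}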

Spark - ignoring corrupted files

In the ETL process that we manage, we sometimes receive corrupted files.
We tried this Spark configuration and it seems to work (the Spark job does not fail, because the corrupted files are discarded):
spark.sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
But I don't know if there is any way to find out which files were ignored. Is there any way to get those filenames?
Thanks in advance
One way is to look through your executor logs, provided you have set the following configurations to true in your Spark configuration:
RDD: spark.files.ignoreCorruptFiles
DataFrame: spark.sql.files.ignoreCorruptFiles
Then Spark will log each corrupted file as a WARN message in your executor logs.
Here is code snippet from Spark that does that:
if (ignoreCorruptFiles) {
  currentIterator = new NextIterator[Object] {
    // The readFunction may read some bytes before consuming the iterator, e.g.,
    // vectorized Parquet reader. Here we use lazy val to delay the creation of
    // iterator so that we will throw exception in `getNext`.
    private lazy val internalIter = readCurrentFile()

    override def getNext(): AnyRef = {
      try {
        if (internalIter.hasNext) {
          internalIter.next()
        } else {
          finished = true
          null
        }
      } catch {
        // Throw FileNotFoundException even `ignoreCorruptFiles` is true
        case e: FileNotFoundException => throw e
        case e @ (_: RuntimeException | _: IOException) =>
          logWarning(
            s"Skipped the rest of the content in the corrupted file: $currentFile", e)
          finished = true
          null
      }
    }
    ...
  }
}
Did you solve it?
If not, maybe you can try the approach below (see the sketch after this answer):
1. Read everything from the location with the ignoreCorruptFiles setting enabled.
2. Get the file name each record belongs to using the input_file_name function, and take the distinct names.
3. Separately, list all the objects in the respective directory.
4. Compute the difference between the two sets.
Did you use a different approach?
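A rough sketch of that approach, assuming Parquet input under a hypothetical directory /data/incoming. Note that it can only flag files that yielded no readable records at all (a file corrupted halfway through still shows up as read), and the path formats returned by input_file_name and by the Hadoop FileSystem API may need normalizing before the diff.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("ignored-files").getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val inputDir = "/data/incoming" // hypothetical location

// Files Spark actually read at least one record from.
val readFiles = spark.read.parquet(inputDir)
  .select(input_file_name().as("file"))
  .distinct()
  .collect()
  .map(_.getString(0))
  .toSet

// Every file physically present in the directory, via the Hadoop FileSystem API.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val allFiles = fs.listStatus(new Path(inputDir))
  .filter(_.isFile)
  .map(_.getPath.toString)
  .toSet

// Candidates for "silently ignored as corrupted".
val possiblyIgnored = allFiles.diff(readFiles)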

How to insert (not save or update) RDD into Cassandra?

I am working with Apache Spark and Cassandra, and I want to save my RDD to Cassandra with the spark-cassandra-connector.
Here's the code:
def saveToCassandra(step: RDD[(String, String, Date, Int, Int)]) = {
  step.saveToCassandra("keyspace", "table")
}
This works fine most of the time, but it overrides data that is already present in the database. I would like not to override any existing data. Is that somehow possible?
What I do is this:
rdd.foreachPartition(partition => connector.withSessionDo(session => {
  someUpdater.UpdateEntries(partition, session)
  // or
  partition.foreach(entry => someUpdater.UpdateEntry(entry, session))
}))
The connector above is CassandraConnector(sparkConf).
It's not as nice as a plain saveToCassandra, but it allows for fine-grained control.
I think it's better to call withSessionDo once outside the per-record loop instead; there's overhead involved in that call that need not be repeated.
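Along the same lines, here is a minimal sketch of insert-only writes using Cassandra's IF NOT EXISTS lightweight transaction through withSessionDo. The keyspace, table and column names are placeholders, it assumes the DataStax Java driver session exposed by the connector, and lightweight transactions carry a noticeable performance cost.

import java.util.Date
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.rdd.RDD

def insertIfAbsent(step: RDD[(String, String, Date, Int, Int)], connector: CassandraConnector): Unit = {
  step.foreachPartition { partition =>
    connector.withSessionDo { session =>
      // The row is written only if its primary key is not already present.
      val stmt = session.prepare(
        "INSERT INTO keyspace.table (c1, c2, c3, c4, c5) VALUES (?, ?, ?, ?, ?) IF NOT EXISTS")
      partition.foreach { case (c1, c2, c3, c4, c5) =>
        session.execute(stmt.bind(c1, c2, c3, Int.box(c4), Int.box(c5)))
      }
    }
  }
}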
