Processing a stream batch without checkpointing - apache-spark

I have an application which has to execute two steps during processing:
1. Calculate statistics from the batch that is being processed.
2. Use those statistics while processing the same batch.
The trick is that between those two steps I need to recreate the SparkContext. This is because, in step 2, the library reads spark.default.parallelism from the SparkConf that was used to initialize the SparkContext, and the value of spark.default.parallelism is derived in step 1.
Overriding the value in the SparkSession doesn't work and the only way for me to enforce a specific value is to recreate the entire SparkContext.
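For illustration, recreating the context with a value derived at runtime could look roughly like the sketch below. The object and method names are hypothetical, and it assumes that nothing created from the old session (DataFrames, streaming queries) is reused afterwards, since those stay tied to the stopped SparkContext.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the original application.
object RecreateContextSketch {
  // Stops the current session (and its SparkContext), then builds a new
  // session whose SparkConf carries the requested default parallelism.
  def recreateWithParallelism(current: SparkSession, parallelism: Int): SparkSession = {
    // Read what we need before stopping the old context.
    val appName = current.sparkContext.appName
    val master = current.sparkContext.master
    current.stop()

    val conf = new SparkConf()
      .setAppName(appName)
      .setMaster(master)
      .set("spark.default.parallelism", parallelism.toString)

    // The old context is stopped, so getOrCreate builds a fresh one from conf.
    SparkSession.builder().config(conf).getOrCreate()
  }
}
```

Any input DataFrame for step 2 would have to be re-created from the returned session.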
Right now this is done with something similar to the following in step 1:
class Step1 extends BatchProcessor {
  ....
  def run(): Unit = {
    val query = df
      .writeStream
      .foreachBatch(foreachBatch _)
      .trigger(Trigger.Once())
      .option("checkpointLocation", <location>)
    try {
      val querySession = query.start()
      querySession.awaitTermination()
      querySession.stop()
    } catch {
      case e: ExpectedException =>
        LOGGER.info("Ignoring exception, calculation of statistics done.")
    }
  }

  override def foreachBatch(dataFrame: Dataset[Row], batchId: Long): Unit = {
    // calculate statistics and store into an Object
    throw new ExpectedException("Exiting stats calculation")
  }
}
In the Step2 class we have basically the same thing, except that its foreachBatch implementation doesn't throw the exception and finishes normally.
Is it possible to use a checkpoint without it being updated automatically?
Can throwing the exception be avoided in some other way?

Related

How to create dataframe inside ForeachWriter[Row]

I have a streaming query that reads from Kafka as the source. I want to perform some logic on each batch that I receive from the stream. Here's how I have done it so far:
val streamDF = spark
  .readStream
  ...
  .load()
//val bc = spark.sparkContext.broadcast(spark)
streamDF
  .writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = { true }
    def process(record: Row): Unit = {
      val aRDD = spark.sparkContext.parallelize(Seq('a', 'b', 'C'))
      val aDF = spark.createDataFrame(aRDD)
      //val aDF = bc.value.createDataFrame(aRDD)
      // do something with aDF
    }
    def close(errorOrNull: Throwable): Unit = {}
  })
  .start()
I'm using Spark 2.3.2, so I'm stuck with ForeachWriter (I cannot use foreachBatch, which would have made my life simpler). I'm also aware that foreach() runs on the executors.
So, keeping that in mind, I broadcast the SparkSession to all the executors, but that did not help either. This is the commented-out part of the code snippet.
I'm looking for a way to process the data as a DataFrame inside foreach in Spark 2.3.2 (I have to use DataFrames/Datasets because the operations are pretty heavy, and they include actions as well).
I found a similar question, but it has no answers.
Unfortunately, it is not possible to create a DataFrame on an executor.
A DataFrame is a distributed collection in Spark. It can only be created on the driver node, or derived from an existing DataFrame through transformations in your Spark application.
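To make the constraint concrete, here is a minimal sketch of a ForeachWriter that stays within what an executor can do: it buffers the rows of one partition and summarizes them with plain Scala collections instead of a DataFrame. The class name, the lastCounts field, and the assumption that column 0 is a string key are all hypothetical, and this only helps when the per-partition logic can be expressed without the DataFrame API.

```scala
import org.apache.spark.sql.{ForeachWriter, Row}
import scala.collection.mutable.ArrayBuffer

// Hypothetical writer: counts rows per key within one partition.
// It never touches SparkSession, which is not usable on executors.
class KeyCountingWriter extends ForeachWriter[Row] {
  private var buffer: ArrayBuffer[Row] = _
  // Result of the last partition handled by this instance. In a real job this
  // lives on the executor, so it would be written to an external store instead.
  var lastCounts: Map[String, Int] = Map.empty

  override def open(partitionId: Long, version: Long): Boolean = {
    buffer = ArrayBuffer.empty[Row]
    true // accept every partition
  }

  override def process(record: Row): Unit = buffer.append(record)

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      // Plain Scala aggregation over the buffered rows (column 0 assumed to be a String key).
      lastCounts = buffer.groupBy(_.getString(0)).map { case (key, rows) => key -> rows.size }.toMap
    }
  }
}
```

It would be attached with streamDF.writeStream.foreach(new KeyCountingWriter).start(). For logic that truly needs DataFrames, the options remain expressing it as transformations on the streaming DataFrame itself or moving to a Spark version that offers foreachBatch.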

Training ML models on Spark per partition, such that there is one trained model per partition of the DataFrame

How can I train models in parallel, one per partition, in Spark using Scala?
The solution given here is in PySpark; I'm looking for a solution in Scala.
How can you efficiently build one ML model per partition in Spark with foreachPartition?
1. Get the distinct partition values using the partition column.
2. Create a thread pool of, say, 100 threads.
3. Create a Future for each partition value and run it on the pool.
Sample code may look as follows:
// Get an ExecutorService
val threadPoolExecutorService = getExecutionContext("name", 100)
// check https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/HasParallelism.scala#L50
val uniquePartitionValues: List[String] = ...//getDistinctPartitionsUsingPartitionCol
// Asynchronous invocation of training. The results will be collected from the futures.
val uniquePartitionValuesFutures = uniquePartitionValues.map(partitionValue => {
  Future[Double] {
    try {
      // get dataframe where partitionCol=partitionValue (quoted, since the values are strings)
      val partitionDF = mainDF.where(s"partitionCol='$partitionValue'")
      // do preprocessing and training using any algo with an input partitionDF and return accuracy
    } catch {
      ....
    }
  }(threadPoolExecutorService)
})
// Wait for metrics to be calculated
val foldMetrics = uniquePartitionValuesFutures.map(Await.result(_, Duration.Inf))
println(s"output::${foldMetrics.mkString(" ### ")}")
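The thread-pool-plus-futures pattern above can be tried without a cluster. The sketch below is self-contained and replaces the actual training with a hypothetical stand-in, trainAndScore, which just returns a number derived from the partition value; the object name and the 10-minute timeout are also assumptions made for the example.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object PerPartitionTraining {
  // Hypothetical stand-in for "filter mainDF to one partition value, train, return a metric".
  def trainAndScore(partitionValue: String): Double = partitionValue.length.toDouble

  // Runs one training task per partition value on a fixed-size thread pool
  // and returns the metric collected for each value.
  def run(partitionValues: List[String], threads: Int): Map[String, Double] = {
    val pool: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(threads))
    try {
      // Submit all tasks first so they can run concurrently.
      val futures = partitionValues.map { value =>
        Future(value -> trainAndScore(value))(pool)
      }
      // Then block until every result is available.
      futures.map(f => Await.result(f, 10.minutes)).toMap
    } finally {
      pool.shutdown()
    }
  }
}
```

In a real job, trainAndScore would filter the DataFrame and fit a model. Spark accepts jobs submitted from several driver threads at once, so the pool size bounds how many trainings run concurrently.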

Spark Structured Streaming - testing one batch at a time

I'm trying to create a test for a custom MicroBatchReadSupport DataSource which I've implemented.
For that, I want to invoke one batch at a time, which will read the data using this DataSource (I've created appropriate mocks). I want to invoke a batch, verify that the correct data was read (currently by saving it to a memory sink and checking the output), and only then invoke the next batch and verify its output.
I couldn't find a way to invoke the batches one after the other under my control.
If I use streamingQuery.processAllAvailable(), the batches are invoked one after the other, without allowing me to verify the output of each one separately. Using trigger(Trigger.Once()) doesn't help either, because it executes one batch and I can't continue to the next one.
Is there any way to do what I want?
Currently this is my basic code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw: DataStreamWriter[Row] = dataFrame.writeStream
  .format("memory")
  .queryName("test_output")
val streamingQuery = dsw
  .start()
streamingQuery.processAllAvailable()
What I've ended up doing is setting up the test with a DataStreamWriter which runs once but saves its progress to a checkpoint. Each time we invoke dsw.start(), the next batch resumes from the latest offset recorded in the checkpoint. I'm also saving the data into a global temp view, so that I can query it in a similar way to the memory sink. For that, I'm using foreachBatch (which is only available since Spark 2.4).
In code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw = getNewDataStreamWriter(dataFrame)
testFirstBatch(dsw)
testSecondBatch(dsw)

private def getNewDataStreamWriter(dataFrame: DataFrame) = {
  val checkpointTempDir = Files.createTempDirectory("tests").toAbsolutePath.toString
  val dsw: DataStreamWriter[Row] = dataFrame.writeStream
    .trigger(Trigger.Once())
    .option("checkpointLocation", checkpointTempDir)
    .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
      batchDF.createOrReplaceGlobalTempView("input_data")
    }
  dsw
}
And the actual test code for each batch (e.g. testFirstBatch) is:
val rows = processNextBatch(dsw)
assertResult(10)(rows.length)

private def processNextBatch(dsw: DataStreamWriter[Row]) = {
  val streamingQuery = dsw
    .start()
  streamingQuery.processAllAvailable()
  sparkSession.sql("select * from global_temp.input_data").collect()
}

Nullability in Spark SQL schemas is advisory by default. What is the best way to strictly enforce it?

I am working on a simple ETL project which reads CSV files, performs
some modifications on each column, then writes the result out as JSON.
I would like downstream processes which read my results
to be confident that my output conforms to
an agreed schema, but my problem is that even if I define
my input schema with nullable=false for all fields, nulls can sneak
in and corrupt my output files, and there seems to be no (performant) way I can
make Spark enforce 'not null' for my input fields.
This seems to be a feature, as stated below in Spark, The Definitive Guide:
when you define a schema where all columns are declared to not have
null values , Spark will not enforce that and will happily let null
values into that column. The nullable signal is simply to help Spark
SQL optimize for handling that column. If you have null values in
columns that should not have null values, you can get an incorrect
result or see strange exceptions that can be hard to debug.
I have written a little check utility to go through each row of a dataframe and
raise an error if nulls are detected in any of the columns (at any level of
nesting, in the case of fields or subfields like map, struct, or array.)
I am wondering, specifically: DID I RE-INVENT THE WHEEL WITH THIS CHECK UTILITY ? Are there any existing libraries, or
Spark techniques that would do this for me (ideally in a better way than what I implemented) ?
The check utility and a simplified version of my pipeline appear below. As presented, the call to the
check utility is commented out. If you run without the check utility enabled, you will see this result in
/tmp/output.csv:
cat /tmp/output.csv/*
(one + 1),(two + 1)
3,4
"",5
The second line after the header should be a number, but it is an empty string
(which is how spark writes out the null, I guess.) This output would be problematic for
downstream components that read my ETL job's output: these components just want integers.
Now, I can enable the check by uncommenting the line
//checkNulls(inDf)
When I do this I get an exception that informs me of the invalid null value and prints
out the entirety of the offending row, like this:
java.lang.RuntimeException: found null column value in row: [null,4]
One Possible Alternate Approach Given in Spark/Definitive Guide
Spark, The Definitive Guide mentions the possibility of doing this:
<dataframe>.na.drop()
But this would (AFAIK) silently drop the bad records rather than flagging the bad ones.
I could then do a "set subtract" on the input before and after the drop, but that seems like
a heavy performance hit to find out what is null and what is not. At first glance, I'd
prefer my method.... But I am still wondering if there might be some better way out there.
The complete code is given below. Thanks !
package org

import java.io.PrintWriter
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// before running, do: rm -rf /tmp/out* /tmp/foo*
object SchemaCheckFailsToExcludeInvalidNullValue extends App {
  import NullCheckMethods._

  //val input = "2,3\n\"xxx\",4" // this will be dropped as malformed
  val input = "2,3\n,4" // BUT.. this will be let through
  new PrintWriter("/tmp/foo.csv") { write(input); close }

  lazy val sparkConf = new SparkConf()
    .setAppName("Learn Spark")
    .setMaster("local[*]")
  lazy val sparkSession = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()
  val spark = sparkSession

  val schema = new StructType(
    Array(
      StructField("one", IntegerType, nullable = false),
      StructField("two", IntegerType, nullable = false)
    )
  )

  val inDf: DataFrame =
    spark.
      read.
      option("header", "false").
      option("mode", "dropMalformed").
      schema(schema).
      csv("/tmp/foo.csv")

  //checkNulls(inDf)

  val plusOneDf = inDf.selectExpr("one+1", "two+1")
  plusOneDf.show()
  plusOneDf.
    write.
    option("header", "true").
    csv("/tmp/output.csv")
}
object NullCheckMethods extends Serializable {

  def checkNull(columnValue: Any): Unit = {
    if (columnValue == null)
      throw new RuntimeException("got null")
    columnValue match {
      case item: Seq[_] =>
        item.foreach(checkNull)
      case item: Map[_, _] =>
        item.values.foreach(checkNull)
      case item: Row =>
        item.toSeq.foreach {
          checkNull
        }
      case default =>
        println(
          s"bad object [ $default ] of type: ${default.getClass.getName}")
    }
  }

  def checkNulls(row: Row): Unit = {
    try {
      row.toSeq.foreach {
        checkNull
      }
    } catch {
      case err: Throwable =>
        throw new RuntimeException(
          s"found null column value in row: ${row}")
    }
  }

  def checkNulls(df: DataFrame): Unit = {
    df.foreach { row => checkNulls(row) }
  }
}
You can use the built-in Row method anyNull to split the dataframe and process both splits differently:
val plusOneNoNulls = plusOneDf.filter(!_.anyNull)
val plusOneWithNulls = plusOneDf.filter(_.anyNull)
If you don't plan to have a manual null-handling process, using the built-in DataFrame.na methods is simpler, since they already implement the usual ways of handling nulls automatically (i.e., dropping them or filling them with default values).
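As a rough sketch of how anyNull could be used to flag bad rows rather than drop them silently, the hypothetical helper below fails with a sample of the offending rows. The object name, method name, and sample size are assumptions made for the example. Note that Row.anyNull only looks at top-level fields, so nulls nested inside structs, arrays, or maps would still need a recursive check like the one in the question.

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical guard, not an existing Spark API.
object NullGuard {
  // Returns df unchanged if no row has a null top-level field;
  // otherwise throws, quoting up to `sample` offending rows.
  def requireNoNulls(df: DataFrame, sample: Int = 5): DataFrame = {
    val offending: Array[Row] = df.filter((row: Row) => row.anyNull).take(sample)
    if (offending.nonEmpty) {
      throw new IllegalStateException(
        s"found null column values, e.g. in rows: ${offending.mkString(", ")}")
    }
    df
  }
}
```

With the pipeline from the question, this would be called as NullGuard.requireNoNulls(inDf) before computing plusOneDf. It triggers one extra Spark job over the input.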

reading from several topics

I'm trying to develop an application that reads four different topics from a Kafka server and takes specific actions for each topic.
I have created a class that receives a DStream and has a method that should transform the DStream.
For example, the handler class:
class StreamHandler(val stream: DStream[String]) {
  def doActions(): DStream[String] = {
    // Do smth. to DStream (placeholder: returns the stream unchanged)
    stream
  }
}
And now, if I call doActions() from the main class for each handler I want, will it be repeated for each arriving DStream, or run only once?
val topicHandler1 = new StreamHandler(KafkaUtils.createStream(ssc, zkQuorum, "myGroup", Map("topic1" -> 1)).map(_._2))
val topicHandler2 = new OtherStreamHandler(KafkaUtils.createStream(ssc, zkQuorum, "myGroup", Map("topic2" -> 1)).map(_._2))
topicHandler1.doActions()
topicHandler2.doActions()
ssc.start()
Is there a better approach?
The transformations declared on the StreamHandler will be applied to each batch of the DStream. The code shown is too incomplete to give a definitive answer, though. In the DStream transformation pipeline you will need an output operation (an action) that materializes the DStream; otherwise nothing will happen.
Regarding the approach, a function that takes a DStream and applies transformations to it would be sufficient and easy to test:
val pipeline: DStream[Data] => Unit = dstream =>
  dstream.map(...).filter(...).print()
As it stands, it doesn't look like the class construction is buying much.
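A minimal sketch of the function-based approach follows. It uses queueStream as a stand-in for the Kafka receiver streams, and the object name, the cleanLines function, and the trim/filter logic are made up for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import scala.collection.mutable

object PipelineSketch {
  // Transformation-only function: declared once, applied by Spark to every batch.
  val cleanLines: DStream[String] => DStream[String] =
    dstream => dstream.map(_.trim).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("pipeline-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Stand-in for KafkaUtils.createStream(...).map(_._2): one queued RDD is one batch.
    val queue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(Seq(" a ", "", "b")))
    val input = ssc.queueStream(queue)

    // print() is the output operation; without one, nothing is executed.
    cleanLines(input).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

One such function per topic can be applied to the corresponding stream before ssc.start(), and each function can be tested on its own with queued RDDs.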
