How to speed up spark streaming unit-tests - apache-spark

I have a Spark application using Structured Streaming, written in Scala. For unit tests I'm using MemoryStream (https://github.com/apache/spark/blob/v3.1.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala#L43).
To simulate the "real world" I want the tests to produce multiple micro-batches, so I use the foreachBatch streaming sink and, in each batch, add the next set of rows of data. Currently it's just one row per batch.
def foreachBatch(dataset: Dataset[MyRow], batchId: Long): Unit = {
  log.info(s"Processing batch id $batchId / ${batches.size}")
  val batch = dataset.collect()
  log.info(s"${batch.length} rows in batch")
  if (batchIter.hasNext) {
    memStream.addData(batchIter.next())
  }
}
val query = df.writeStream
  .trigger(Trigger.ProcessingTime(100.millis))
  .queryName("testQuery")
  .outputMode(OutputMode.Append())
  .foreachBatch(foreachBatch _)
  .start()
Unfortunately, even when I implement flatMapGroupsWithState as a no-op that only returns an empty iterator, this takes ~10-15 seconds for each batch on an otherwise idle, recent MacBook Pro 16".
So testing this with only 10 rows is quite painful.
I played with Spark's driver/executor memory settings and the number of executors (from local[1] to local[*]), all to pretty much no avail.
Is there any way I might be missing to speed things up?
Writing one input file per batch also didn't seem faster.
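For reference, a large part of per-batch latency in local tests often comes from the default 200 shuffle partitions and general session overhead rather than from the query itself. Here is a minimal sketch of the session settings usually worth trying first; the values are assumptions to experiment with, not a confirmed fix for the 10-15 second batches above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("streaming-unit-tests")
  .config("spark.sql.shuffle.partitions", "1")  // avoid 200 tiny shuffle tasks per micro-batch
  .config("spark.default.parallelism", "1")
  .config("spark.ui.enabled", "false")          // skip starting the web UI in tests
  .getOrCreate()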

Related

Spark 3 structured streaming use maxOffsetsPerTrigger in Kafka source with Trigger.Once

We need to use maxOffsetsPerTrigger in the Kafka source with Trigger.Once() in Structured Streaming, but based on this issue it seems Spark 3 reads allAvailable instead. Is there a way to achieve rate limiting in this situation?
Here is sample code in Spark 3:
def options: Map[String, String] = Map(
  "kafka.bootstrap.servers" -> conf.getStringSeq("bootstrapServers").mkString(","),
  "subscribe" -> conf.getString("topic")
) ++
  Option(conf.getLong("maxOffsetsPerTrigger")).map("maxOffsetsPerTrigger" -> _.toString)

val streamingQuery = sparkSession.readStream
  .format("kafka")
  .options(options)
  .load()
  .writeStream
  .trigger(Trigger.Once)
  .start()
There is no other way around it to properly set a rate limit. If maxOffsetsPerTrigger is not applied for streaming jobs with the Once trigger, you could do one of the following to achieve an identical result:
Choose another trigger and use maxOffsetsPerTrigger to limit the rate, then kill the job manually after it has finished processing all the data.
Use the options startingOffsets and endingOffsets and make the job a batch job (see the sketch below these options). Repeat until you have processed all data within the topic. However, there is a reason why "Streaming in RunOnce mode is better than Batch", as detailed here.
The last option would be to look into the linked pull request and compile Spark on your own.
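A rough sketch of the second option - a bounded batch read over an explicit offset range. The topic name, partition numbers and offset values below are placeholders only; you would repeat the read with an advancing range until the topic is drained:

val batchDF = sparkSession.read
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.getStringSeq("bootstrapServers").mkString(","))
  .option("subscribe", "my-topic")
  .option("startingOffsets", """{"my-topic":{"0":0,"1":0}}""")
  .option("endingOffsets", """{"my-topic":{"0":10000,"1":10000}}""")
  .load()

batchDF.write.mode("append").parquet("path/to/sink")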
Here is how we "solved" this. This is basically the approach mike wrote about in the accepted answer.
In our case, the message size varied very little, so we knew how long processing a batch takes. In a nutshell we:
replaced Trigger.Once() with Trigger.ProcessingTime(<ms>), since maxOffsetsPerTrigger works with this mode
killed the running query by calling awaitTermination(<ms>) to mimic Trigger.Once()
set the processing interval to be larger than the termination interval, so that exactly one "batch" fits in before the query is terminated
val kafkaOptions = Map[String, String](
  "kafka.bootstrap.servers" -> "localhost:9092",
  "failOnDataLoss" -> "false",
  "subscribePattern" -> "testTopic",
  "startingOffsets" -> "earliest",
  "maxOffsetsPerTrigger" -> "10" // "batch" size
)

val streamWriterOptions = Map[String, String](
  "checkpointLocation" -> "path/to/checkpoints"
)

val processingInterval = 30000L
val terminationInterval = 15000L

sparkSession
  .readStream
  .format("kafka")
  .options(kafkaOptions)
  .load()
  .writeStream
  .options(streamWriterOptions)
  .format("console")
  .trigger(Trigger.ProcessingTime(processingInterval))
  .start()
  .awaitTermination(terminationInterval)
This works because the first batch will be read and processed within the maxOffsetsPerTrigger limit - say, in 10 seconds. The second batch then starts processing, but it is terminated in the middle of the operation after ~5 s and never reaches the 30 s mark. The offsets are stored correctly, though, so Spark picks up and processes this "killed" batch in the next run.
A downside of this approach is that you have to know approximately how long it takes to process one "batch" - if you set the terminationInterval too low, the job's output will constantly be nothing.
Of course, if you don't care about the exact number of batches you process in one run, you can easily make the processingInterval many times smaller than the terminationInterval. In that case you may process a varying number of batches in one go, while still respecting the value of maxOffsetsPerTrigger.
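The same idea can be packaged as a small helper that stops the query explicitly after the timeout, instead of relying on the process exiting mid-batch. This is only a sketch of the pattern described above; the helper name, parameters and console sink are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Run the rate-limited query for a bounded wall-clock window, then stop it so the
// next scheduled run resumes from the checkpointed offsets.
def runBoundedStream(spark: SparkSession,
                     kafkaOptions: Map[String, String],
                     checkpointDir: String,
                     triggerIntervalMs: Long,
                     runForMs: Long): Unit = {
  val query = spark.readStream
    .format("kafka")
    .options(kafkaOptions)
    .load()
    .writeStream
    .option("checkpointLocation", checkpointDir)
    .format("console")
    .trigger(Trigger.ProcessingTime(triggerIntervalMs))
    .start()

  query.awaitTermination(runForMs) // returns false if the timeout elapsed before termination
  query.stop()                     // stop cleanly; committed offsets are picked up next run
}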

Is it possible to ignore failed tasks in Spark

I have some large datasets where a few records cause a UDF to crash. Once such a record is processed, the task fails, which leads to the whole job failing. The problems here are native (we use a native Fortran library via JNA), so I cannot catch them in the UDF.
What I'd like to have is a fault-tolerance mechanism that would allow me to skip/ignore/blacklist bad partitions/tasks so that my Spark app does not fail.
Is there a way to do this?
The only workaround I could come up with is to process small chunks of data in a foreach loop:
val dataFilters: Seq[Column] = ???
val myUDF: UserDefinedFunction = ???

dataFilters.foreach { filter =>
  try {
    ss.table("sourcetable")
      .where(filter)
      .withColumn("udf_result", myUDF($"inputcol"))
      .write.insertInto("targettable")
  } catch {
    case e: Exception => // log the failure and move on to the next chunk
  }
}
This is not ideal because Spark is relatively slow at processing small amounts of data, e.g. the input table is read many times.

Optimal (low-latency) spark settings for small datasets

I'm aware that Spark is designed for large datasets, for which it's great. But under certain circumstances I don't need this scalability, e.g. for unit tests or for data exploration on small datasets. Under these conditions Spark performs relatively badly compared to a pure Scala/Python/MATLAB/R implementation.
Note that I don't want to drop Spark entirely; I want to keep the framework for larger workloads without re-implementing everything.
How can I reduce Spark's overhead as much as possible on small datasets (say tens to thousands of records)? I tried using only 1 partition in local mode (setting spark.sql.shuffle.partitions=1 and spark.default.parallelism=1). Even with these settings, simple queries on 100 records take on the order of 1-2 seconds.
Note that I'm not trying to reduce the time for SparkSession instantiation, just the execution time given that a SparkSession already exists.
Operations in Spark have the same signatures as the Scala collections.
You could implement something like:
import org.apache.spark.rdd.RDD

val useSpark = false
val rdd: RDD[String] = ??? // built from the SparkContext when Spark is actually needed
val list: List[String] = Nil

def mapping: String => Int = s => s.length

if (useSpark) {
  rdd.map(mapping)
} else {
  list.map(mapping)
}
I think this code could be abstracted even more.
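One way such an abstraction could look, sketched under the assumption that only map is needed: a tiny wrapper that hides whether the data lives in an RDD or a plain List. The trait and class names are made up for illustration.

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

sealed trait SmallOrBig[A] {
  def map[B: ClassTag](f: A => B): SmallOrBig[B]
  def toList: List[A]
}

// Local path: plain Scala collections, no Spark job is ever launched.
final case class Local[A](xs: List[A]) extends SmallOrBig[A] {
  def map[B: ClassTag](f: A => B): SmallOrBig[B] = Local(xs.map(f))
  def toList: List[A] = xs
}

// Distributed path: delegates to the RDD API.
final case class Distributed[A](rdd: RDD[A]) extends SmallOrBig[A] {
  def map[B: ClassTag](f: A => B): SmallOrBig[B] = Distributed(rdd.map(f))
  def toList: List[A] = rdd.collect().toList
}

Unit tests and small explorations would wrap a List in Local, while production code wraps an RDD in Distributed; the mapping logic itself is written only once.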

Spark Structured Streaming source retention policy

Consider a continuous flow of JSON data on a Kafka topic that we want to handle with Structured Streaming like this:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
I was wondering: if the program runs for a long time, will the df variable become huge - in my case something like 100 TB over a week? Is there any configuration available to eliminate earlier data from df, or simply to dequeue the earliest rows?
In Spark the execution will not start until an action is triggered.
This concept is called Lazy Evaluation in Apache Spark.
“Transformations are lazy in nature meaning when we call some operation in RDD, it does not execute immediately”
Having said that, the load operation is a transformation and no data will be read upon executing this line of code.
In order to kick off a streaming job you need to provide the following 4 logical components and call start (a minimal sketch follows the list):
The input (Kafka, file, socket, ..)
The trigger (how often the input gets updated)
The result table (that is created by the query after each trigger update)
Output (define what part of the result will be written)
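A minimal sketch tying the four components together, using the Kafka df defined in the question; the sink format, paths and trigger interval below are illustrative assumptions:

import org.apache.spark.sql.streaming.Trigger

val query = df                                       // input: the Kafka source above
  .selectExpr("CAST(value AS STRING) AS json")       // the query behind the result table
  .writeStream
  .format("parquet")                                 // output sink
  .option("path", "/tmp/stream-out")
  .option("checkpointLocation", "/tmp/stream-chk")
  .outputMode("append")                              // which part of the result is written
  .trigger(Trigger.ProcessingTime("1 minute"))       // trigger: how often the input is polled
  .start()

query.awaitTermination()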
The memory consumption depends on what is done in the query that will be triggered. From the Spark documentation:
"Since Spark is updating the Result Table, it has full control over
updating old aggregates when there is late data, as well as cleaning
up old aggregates to limit the size of intermediate state data. Since
Spark 2.1, we have support for watermarking which allows the user to
specify the threshold of late data, and allows the engine to
accordingly clean up old state."
So you have to determine the amount of data needed to compute the result table in order to estimate the amount of required memory.
It is possible that an executor will crash with an OOM exception if you do something like mapGroupsWithState, …
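For completeness, this is roughly what the watermarking mentioned in the quoted passage looks like when applied to the Kafka df from the question. It bounds the streaming state kept by Spark, not the Kafka topic itself; the threshold and window duration are assumptions for illustration:

import org.apache.spark.sql.functions.{col, window}

val windowedCounts = df
  .select(col("value"), col("timestamp"))
  .withWatermark("timestamp", "10 minutes")          // state older than the watermark is dropped
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()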

Spark Streaming appends to S3 as Parquet format, too many small partitions

I am building an app that uses Spark Streaming to receive data from Kinesis streams on AWS EMR. One of the goals is to persist the data into S3 (EMRFS), and for this I am using a 2-minute non-overlapping window.
My approaches:
Kinesis Stream -> Spark Streaming with a batch duration of about 60 seconds, using a non-overlapping window of 120 s, and saving the streamed data into S3 as:
val rdd1 = kinesisStream.map(rdd => /* decode the data */)

rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
  val spark = SparkSession...
  import spark.implicits._

  // convert the RDD to a DataFrame
  val df = rdd.toDF(columnNames: _*)
  df.write.parquet("s3://bucket/20161211.parquet")
}
Here is what s3://bucket/20161211.parquet ends up looking like after a while: lots of fragmented small partition files (which is horrendous for read performance). The question is: is there any way to control the number of small partitions as I stream data into this S3 parquet location?
Thanks
What I am thinking of doing is, each day, something like this:
val df = spark.read.parquet("s3://bucket/20161211.parquet")
df.coalesce(4).write.parquet("s3://bucket/20161211_4parition.parquet")
where I kind of repartition the DataFrame into 4 partitions and save them back.
It works, but I feel that doing this every day is not an elegant solution...
That's actually pretty close to what you want to do: each partition gets written out as an individual file in Spark. However, coalesce is a bit confusing since it can (effectively) apply upstream of where the coalesce is called. The warning from the Scala doc is:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).
With Datasets it's a bit easier to persist and count to force wide evaluation, since the default coalesce function doesn't take a shuffle flag as input (although you could construct an instance of Repartition manually).
Another option is to have a second periodic batch job (or even a second streaming job) that cleans up/merges the results, but this can be a bit complicated as it introduces a second moving part to keep track of.
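A sketch of what the shuffle-backed variant could look like applied directly in the streaming write from the question, so that each 120 s window lands as a handful of files instead of one file per upstream partition. The mode("append") call and the target of 4 files per window are assumptions; rdd1 and columnNames are as in the question:

rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val df = rdd.toDF(columnNames: _*)
  df.repartition(4)            // DataFrame equivalent of coalesce(4, shuffle = true)
    .write
    .mode("append")
    .parquet("s3://bucket/20161211.parquet")
}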
