I have the following process in Spark. I build a DataFrame, then split it into two pieces by sampling. I perform a few more operations on the two halves and then save both. Right now I run the two save jobs sequentially: while the first half is finishing up, the next job does not start processing, so my executors sit idle. Is there a way to submit both jobs asynchronously so that they run in parallel instead of sequentially?
val split = 0.99
val frac1 = 0.7
val frac2 = (1 - frac1)
val frac3 = split
val validated = sqlContext.sql("Select * from Validated")
val splitDF = validated.selectExpr("srch_id").distinct.cache.randomSplit(
  Array(frac1 * frac3, frac2 * frac3, 1 - frac1 * frac3 - frac2 * frac3),
  1122334455
)
val traindf = validated.as("val").join(splitDF(0).as("train"), $"val.srch_id" === $"train.srch_id", "inner").selectExpr(sel.toSeq: _*)
val scoredf = validated.as("val").join(splitDF(1).as("score"), $"val.srch_id" === $"score.srch_id", "inner").selectExpr(sel.toSeq: _*)
traindf.write.mode(SaveMode.Overwrite).saveAsTable("train") //Holds until finished saving
scoredf.write.mode(SaveMode.Overwrite).saveAsTable("score")
I am considering using threads, but I am unsure whether DataFrames are thread-safe. With the approach below, it seems that sometimes one of the tables does not get saved.
import scala.concurrent._
val pool = new forkjoin.ForkJoinPool(8)
val ectx = ExecutionContext.fromExecutorService(pool)
sqlContext.sql("use testpx")
ectx.execute(new Runnable {
  def run(): Unit = {
    traindf.write.mode(SaveMode.Overwrite).saveAsTable("train")
  }
})
ectx.execute(new Runnable {
  def run(): Unit = {
    scoredf.write.mode(SaveMode.Overwrite).saveAsTable("test")
  }
})
ectx.shutdown()
ectx.awaitTermination(Long.MaxValue, java.util.concurrent.TimeUnit.DAYS)
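For reference, this is the kind of Future-based variant I have in mind. It is only a sketch, assuming the same sqlContext, traindf and scoredf as above, and assuming that issuing the two writes from separate threads is safe:
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// dedicated pool so both saves can be submitted at the same time
implicit val saveEc = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

val saveTrain = Future { traindf.write.mode(SaveMode.Overwrite).saveAsTable("train") }
val saveScore = Future { scoredf.write.mode(SaveMode.Overwrite).saveAsTable("score") }

// block until both writes finish and propagate any failure
Await.result(Future.sequence(Seq(saveTrain, saveScore)), Duration.Inf)
saveEc.shutdown()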
Related
In my Spark application, I see the same task being executed in multiple stages, even though the corresponding statements are defined only once in the code. Moreover, the same tasks take different times to execute in different stages. I understand that when an RDD is lost, its lineage is used to recompute it. How can I tell whether that is what is happening here? The same phenomenon occurs in every run of this application. Can someone please explain what is going on and under what conditions a task can get scheduled in multiple stages?
The code very much looks like the following:
val events = getEventsDF()
events.cache()
metricCounter.inc("scec", events.count())

val scEvents = events.filter(_.totalChunks == 1)
  .repartition(NUM_PARTITIONS, lit(col("eventId")))

val sortedEvents = events.filter(e => e.totalChunks > 1 && e.totalChunks <= maxNumberOfChunks)
  .map(PartitionUtil.createKeyValueTuple)
  .rdd
  .repartitionAndSortWithinPartitions(new EventDataPartitioner(NUM_PARTITIONS))

val largeEvents = events.filter(_.totalChunks > maxNumberOfChunks).count()

val mcEvents = sortedEvents.mapPartitionsWithIndex[CFEventLog](
  (index: Int, iter: Iterator[Tuple2]) => doSomething())
val mcEventsDF = session.sqlContext.createDataset[CFEventLog](mcEvents)
metricCounter.inc("mcec", mcEventsDF.count())

val currentDf = scEvents.unionByName(mcEventsDF)

val distinctDateHour = currentDf.select(col("eventDate"), col("eventHour"))
  .distinct
  .collect

val prevEventsDF = getAnotherDF(distinctDateHour)
val finalDf = currentDf.unionByName(prevEventsDF).dropDuplicates(Seq("eventId"))

finalDf
  .write.mode(SaveMode.Overwrite)
  .partitionBy("event_date", "event_hour")
  .saveAsTable("table")

val finalEventsCount = finalDf.count()
Is every count() action resulting in re-execution of the RDD transformations that precede it?
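In other words, would explicitly persisting the result change this? A minimal sketch of what I mean, using the same finalDf as above:
// persist before running more than one action on the same result
finalDf.persist()

finalDf
  .write.mode(SaveMode.Overwrite)
  .partitionBy("event_date", "event_hour")
  .saveAsTable("table")

// reuses the cached partitions instead of recomputing the lineage
val finalEventsCount = finalDf.count()

finalDf.unpersist()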
Thanks,
Devj
This is a well-known limitation[1] of Structured Streaming that I'm trying to get around using a custom sink.
In what follows, modelsMap is a map of string keys to org.apache.spark.mllib.stat.KernelDensity models
and
streamingData is a streaming dataframe org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]
I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enhance each row with a prediction, and write to Kafka.
An obvious way would be .withColumn with a UDF that does the prediction, then writing with the Kafka sink.
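Roughly, what I tried looks like this; the column names beyond id1 and id2 are only illustrative:
import org.apache.spark.sql.functions.{col, udf}

// look up the per-key KernelDensity model and evaluate it for the row's time value
// (id3 and time_0 are placeholder column names)
val predict = udf { (id1: String, id2: String, id3: String, time0: Double) =>
  modelsMap(s"$id1/$id2/$id3").estimate(Array(time0))(0)
}

val scored = streamingData
  .withColumn("prediction", predict(col("id1"), col("id2"), col("id3"), col("time_0")))

// ... then write `scored` out with the built-in kafka sink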
But this is illegal because:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases: (1) RDD transformations and
actions are NOT invoked by the driver, but inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I get the same error with a custom sink that implements ForeachWriter, which was a bit unexpected:
import org.apache.spark.sql.ForeachWriter
import java.util.Properties
import kafkashaded.org.apache.kafka.clients.producer._
class customSink(topic: String, servers: String) extends ForeachWriter[org.apache.spark.sql.Row] {
  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
  kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")

  val results = new scala.collection.mutable.HashMap[String, String]
  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    producer = new KafkaProducer(kafkaProperties)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    var prediction = Double.NaN
    try {
      val id1 = value(0)
      val id2 = value(3)
      val id3 = value(5)
      val time_0 = value(6).asInstanceOf[Double]
      val key = f"$id1/$id2/$id3"
      val model = modelsMap(key)
      println(s"Looking up key: $key")
      // assign to the outer prediction so the record sent to Kafka carries the real value
      prediction = model.estimate(Array[Double](time_0))(0)
      println(prediction)
    } catch {
      case e: NoSuchElementException =>
        // no model for this key; prediction stays NaN
        println(prediction)
    }
    producer.send(new ProducerRecord(topic, value.mkString(",") + "," + prediction.toString))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.Trigger

// the constructor takes (topic, servers), so the topic comes first
val writer = new customSink("<topic>", "<broker>")

val query = streamingData
  .writeStream
  .foreach(writer)
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()
model.estimate is implemented under the hood using aggregate in mllib.stat, and there's no way to get around it.
What changes should I make? (I could collect each batch and run a for loop on the driver, but then I'm not using Spark the way it's intended.)
References:
[1] https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose (slide 11 mentions the limitation)
[2] https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
[3] https://github.com/holdenk/spark-structured-streaming-ml (proposed solution)
[4] https://issues.apache.org/jira/browse/SPARK-16454
[5] https://issues.apache.org/jira/browse/SPARK-16407
I know that in Spark we can set the scheduler to FAIR or FIFO and the behavior differs. However, inside fairscheduler.xml we can also set each individual pool to FAIR or FIFO, and I have tested this several times; the behavior of the two settings seems to be the same. Then I took a look at the Spark source code, and the scheduling algorithm looks like this:
/**
 * An interface for sort algorithm
 * FIFO: FIFO algorithm between TaskSetManagers
 * FS: FS algorithm between Pools, and FIFO or FS within Pools
 */
private[spark] trait SchedulingAlgorithm {
  def comparator(s1: Schedulable, s2: Schedulable): Boolean
}

private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val priority1 = s1.priority
    val priority2 = s2.priority
    var res = math.signum(priority1 - priority2)
    if (res == 0) {
      val stageId1 = s1.stageId
      val stageId2 = s2.stageId
      res = math.signum(stageId1 - stageId2)
    }
    res < 0
  }
}

private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val minShare1 = s1.minShare
    val minShare2 = s2.minShare
    val runningTasks1 = s1.runningTasks
    val runningTasks2 = s2.runningTasks
    val s1Needy = runningTasks1 < minShare1
    val s2Needy = runningTasks2 < minShare2
    val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0)
    val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0)
    val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble
    var compare = 0
    if (s1Needy && !s2Needy) {
      return true
    } else if (!s1Needy && s2Needy) {
      return false
    } else if (s1Needy && s2Needy) {
      compare = minShareRatio1.compareTo(minShareRatio2)
    } else {
      compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
    }
    if (compare < 0) {
      true
    } else if (compare > 0) {
      false
    } else {
      s1.name < s2.name
    }
  }
}
In the FairSchedulingAlgorithm, if s1 and s2 belong to the same pool, the minShare, runningTasks and weight should have the same values, and in that case the comparator would always return false. So within a pool they are not FAIR but FIFO. My fairscheduler.xml looks like this:
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="cubepublishing">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
And spark.scheduler.mode is:
# job scheduler
spark.scheduler.mode FAIR
spark.scheduler.allocation.file conf/fairscheduler.xml
Thanks for your help!
When you submit your jobs to the cluster, whether with spark-submit or by any other means, they are handed to the Spark scheduler, which is responsible for materializing the logical plan of your jobs. In Spark, there are two modes.
1. FIFO
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into stages (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
2. FAIR
The fair scheduler also supports grouping jobs into pools and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a high-priority pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares. This approach is modeled after the Hadoop Fair Scheduler.
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool "local property" to the SparkContext in the thread that’s submitting them.
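For example, a thread can route its jobs to one of the pools defined in fairscheduler.xml before triggering an action. A minimal sketch, assuming sc is the SparkContext and using the pool name from the configuration above:
// all jobs submitted from this thread go to the "cubepublishing" pool
sc.setLocalProperty("spark.scheduler.pool", "cubepublishing")

// ... trigger actions here ...

// reset this thread back to the default pool
sc.setLocalProperty("spark.scheduler.pool", null)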
In Spark Structured Streaming, we can do window operations on event time with groupBy, like this:
import spark.implicits._
val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
Does groupByKey also support window operations?
Thanks.
It is possible to write a helper function that makes it easier to generate a time-windowing function to give to groupByKey.
object windowing {
  import java.sql.Timestamp
  import java.time.Instant

  /** given:
   *    a row type R
   *    a function from R to the Timestamp
   *    a windowing width in seconds
   *  return: a function that allows groupByKey to do windowing
   */
  def windowBy[R](f: R => Timestamp, width: Int) = {
    val w = width.toLong * 1000L
    (row: R) => {
      val tsCur = f(row)
      val msCur = tsCur.getTime()
      val msLB = (msCur / w) * w
      val instLB = Instant.ofEpochMilli(msLB)
      val instUB = Instant.ofEpochMilli(msLB + w)
      (Timestamp.from(instLB), Timestamp.from(instUB))
    }
  }
}
And in your example, it might be used like this:
case class MyRow(timestamp: Timestamp, word: String)
val windowBy60 = windowing.windowBy[MyRow](_.timestamp, 60)
// count words by time window
words.as[MyRow]
.groupByKey(windowBy60)
.count()
Or counting by (window, word) pairs:
words.as[MyRow]
.groupByKey(row => (windowBy60(row), row.word))
.count()
Yes and no. The window function cannot be used directly with groupByKey, as it is applicable only to the SQL / DataFrame API, but you can always extend the record with a window field:
val dfWithWindow = df.withColumn("window", window(...))

case class Window(start: java.sql.Timestamp, end: java.sql.Timestamp)
case class MyRecordWithWindow(..., window: Window)
and use it for grouping:
dfWithWindow.as[MyRecordWithWindow].groupByKey(_.window).mapGroups(...)
What is the best way to maintain application state in a Spark Streaming application?
I know of two ways:
use the union operation to append to the lookup RDD and persist it after each union.
save the state in a file or database and load it at the start of each batch.
My question is, from a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))

// checkpointing is mandatory
ssc.checkpoint("_checkpoints")

val rdd = sc.parallelize(0 to 9).map(n => (n, (n % 2).toString))

import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)

import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
  println(s">>> batchTime = $batchTime")
  println(s">>> key = $key")
  println(s">>> value = $value")
  println(s">>> state = $state")
  val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)

val mappedStatefulStream = sessions.mapWithState(spec)

mappedStatefulStream.print()
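Nothing is computed until the streaming context is started, so the example finishes with:
// start the streaming computation and block until it is stopped
ssc.start()
ssc.awaitTermination()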