I have tried loading a DataFrame from an Avro file inside a thread. It doesn't seem to error out, but I am unable to access that DataFrame from other threads. My understanding is that all threads share the same SparkContext.
Has anyone accomplished this?
I tried:
val thread = new Thread {
  override def run {
    var a = 1
    while (a > 0) {
      val someDF = Seq(
        (8, "bat"),
        (64, "mouse"),
        (-27, "horse")
      ).toDF("number", "word")
    }
  }
}
thread.start
someDF.show
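For reference, someDF above is a local val inside run, so it disappears when the loop body ends, even though the SparkSession and SparkContext are indeed shared across threads. A minimal sketch of one way to hand the DataFrame to other threads, assuming spark is a SparkSession held in a val (sharedDF, ready and the view name some_df are illustrative names, not from the original post):
import java.util.concurrent.CountDownLatch
import org.apache.spark.sql.DataFrame

@volatile var sharedDF: Option[DataFrame] = None
val ready = new CountDownLatch(1)

val loader = new Thread {
  override def run(): Unit = {
    import spark.implicits._
    val df = Seq((8, "bat"), (64, "mouse"), (-27, "horse")).toDF("number", "word")
    df.createOrReplaceTempView("some_df") // session-scoped view, visible to any thread using the same session
    sharedDF = Some(df)                   // publish the reference for other threads
    ready.countDown()
  }
}
loader.start()

ready.await() // wait until the loader thread has published the DataFrame
sharedDF.foreach(_.show())
spark.sql("SELECT * FROM some_df").show()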
I have a Spark program with a metrics source defined like this:
object CustomESMetrics {
  lazy val metrics = new CustomESMetrics
}

class CustomESMetrics extends Source with Serializable {
  lazy val metricsPrefix = "dscc_harmony_sync_handlers"
  override lazy val sourceName: String = "CustomMetricSource"
  override lazy val metricRegistry: MetricRegistry = new MetricRegistry
  lazy val inputKafkaRecord: Counter =
    metricRegistry.counter(MetricRegistry.name("input_kafka_record"))
}
Now I have Spark executor code that operates on a Dataset:
val kafkaDs: Dataset[A] = ............. // get from kafka
kafkaDs.map { ka =>
  // executor code starts
  SparkEnv.get.metricsSystem.registerSource(CustomESMetrics.metrics) // line X
  CustomESMetrics.metrics.inputKafkaRecord.inc() // line Y
  // do something else
  ........
}
For line Y to increment properly, line X has to be executed for every element of the Dataset, which is not efficient. It also produces this warning:
{"timestamp":"18/10/2022 16:05:05","logLevel":"INFO","class":"MetricsSystem","thread":"Executor task launch worker for task 2.0 in stage 0.0 (TID 2)","message":"Metrics already registered"}
Is there a way to execute line X only once per executor?
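One hedged option (not from the original post): because metrics is a lazy val in the companion object, its initializer runs at most once per JVM, i.e. once per executor, so the registration can live there instead of in the map body. A minimal sketch under that assumption:
import org.apache.spark.SparkEnv

object CustomESMetrics {
  // The lazy val initializer runs at most once per JVM (so once per executor),
  // which makes it a natural place for the registerSource call from line X.
  lazy val metrics: CustomESMetrics = {
    val m = new CustomESMetrics
    SparkEnv.get.metricsSystem.registerSource(m)
    m
  }
}

// Executor code: only the increment (line Y) stays in the per-record path.
kafkaDs.map { ka =>
  CustomESMetrics.metrics.inputKafkaRecord.inc()
  // do something else
  ka
}
The first record processed on each executor triggers the initialization and therefore the registration; every later record only touches the counter.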
I am new to Spark.
I want to keep receiving messages from Kafka and then save them to S3 once the accumulated message count goes over 100000.
I implemented this with Dataset.collectAsList(), but it throws an error: Total size of serialized results of 3 tasks (1389.3 MiB) is bigger than spark.driver.maxResultSize.
So I switched to foreach, but then it throws a NullPointerException when the SparkSession is used in createDataFrame.
Any ideas? Thanks.
---Code---
SparkSession spark = generateSparkSession();
registerUdf4AddPartition(spark);
Dataset<Row> dataset = spark.readStream().format("kafka")
        .option("kafka.bootstrap.servers", args[0])
        .option("kafka.group.id", args[1])
        .option("subscribe", args[2])
        .option("kafka.security.protocol", SecurityProtocol.SASL_PLAINTEXT.name)
        .load();
DataStreamWriter<Row> console = dataset.toDF().writeStream().foreachBatch((rawDataset, time) -> {
    Dataset<Row> rowDataset = rawDataset.selectExpr("CAST(value AS STRING)");
    // using foreach
    rowDataset.foreach(row -> {
        List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0))).withType(Span.class).build().parse();
        spans.addAll(rawDataList);
        batchSave(spark);
    });
    // using collectAsList
    List<Row> rows = rowDataset.collectAsList();
    for (Row row : rows) {
        List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0))).withType(Span.class).build().parse();
        spans.addAll(rawDataList);
        batchSave(spark);
    }
});
StreamingQuery start = console.start();
start.awaitTermination();

public static void batchSave(SparkSession spark){
    synchronized (spans){
        if(spans.size() == 100000){
            System.out.println(spans.isEmpty());
            Dataset<Row> spanDataSet = spark.createDataFrame(spans, Span.class);
            Dataset<Row> finalResult = addCustomizedTimeByUdf(spanDataSet);
            StringBuilder pathBuilder = new StringBuilder("s3a://fwk-dataplatform-np/datalake/log/FWK/ART2/test/leftAndRight");
            finalResult.repartition(1).write().partitionBy("year","month","day","hour").format("csv").mode("append").save(pathBuilder.toString());
            spans.clear();
        }
    }
}
Since the main SparkSession runs on the driver, while the tasks inside foreach run distributed on the executors, the captured spark reference is not available (it is null) on the executors.
By the way, there is no point in using synchronized inside the foreach task, since everything runs distributed anyway.
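A minimal sketch (in Scala, for brevity) of the batch-level alternative: the function passed to foreachBatch runs on the driver, so the SparkSession is usable there and neither collectAsList nor a per-row foreach is needed. Here dataset is the streaming Dataset read from Kafka as above, while parseSpans and the Scala form of addCustomizedTimeByUdf are stand-ins for the CSV parsing and UDF logic from the original code:
import org.apache.spark.sql.{Dataset, Row}

dataset.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
    // Runs on the driver: plain Dataset transformations, no driver-side collect.
    val spans  = parseSpans(batch)            // hypothetical helper returning a Dataset[Span]
    val result = addCustomizedTimeByUdf(spans.toDF())
    result.repartition(1)
      .write
      .partitionBy("year", "month", "day", "hour")
      .format("csv")
      .mode("append")
      .save("s3a://fwk-dataplatform-np/datalake/log/FWK/ART2/test/leftAndRight")
  }
  .start()
  .awaitTermination()
Accumulating exactly 100000 rows across micro-batches would still need its own state handling (or simply writing every batch), which this sketch leaves out.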
In Kotlin, we can use coroutines to offload network API calls from the main thread. I wonder which is better:
val r1 = launch(Dispatchers.IO) {
    api1()
}
val r2 = launch(Dispatchers.IO) {
    api2(r1)
}
val r3 = launch(Dispatchers.IO) {
    api3(r2)
}
or
launch(Dispatchers.IO) {
    val r1 = api1()
    val r2 = api2(r1)
    val r3 = api3(r2)
}
I am using an onTaskEnd Spark listener to get the number of records written to files, like this:
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

var recordsWritten: Long = 0L

val rowCountListener: SparkListener = new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    synchronized {
      recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
    }
  }
}

def rowCountOf(proc: => Unit): Long = {
  recordsWritten = 0L
  spark.sparkContext.addSparkListener(rowCountListener)
  try {
    proc
  } finally {
    spark.sparkContext.removeSparkListener(rowCountListener)
  }
  recordsWritten
}
val rc = rowCountOf { (1 to 100).toDF.write.csv(s"test.csv") }
println(rc)
=> 100
However, when running multiple actions from multiple threads, this obviously breaks:
Seq(1, 2, 3).par.foreach { i =>
  val rc = rowCountOf { (1 to 100).toDF.write.csv(s"test${i}.csv") }
  println(rc)
}
=> 600
=> 700
=> 750
I can have each thread declare its own variable, but the SparkContext is still shared, and I am unable to tell which thread a specific SparkListenerTaskEnd event belongs to. Is there any way to make this work?
(Right, maybe I could just run them as separate Spark jobs. But it's just a single piece of the program, so for the sake of simplicity I would prefer to stay with threads. In the worst case I'll just execute it serially or forget about counting records...)
A bit hackish, but you could use accumulators as a filtering side effect:
val acc = spark.sparkContext.longAccumulator("write count")
df.filter { _ =>
  acc.add(1)
  true
}.write.csv(...)
println(s"rows written ${acc.count}")
The pseudo-code:
object myApp {
  var myStaticRDD: RDD[Int]

  def main() {
    ... // init streaming context, and get two DStreams (streamA and streamB) from two HDFS paths

    // complex transformation using the two DStreams
    val new_stream = streamA.transformWith(streamB, (a, b, t) => {
      a.join(b).map(...)
    })

    // merge each new_stream RDD into myStaticRDD
    new_stream.foreachRDD(rdd =>
      myStaticRDD = myStaticRDD.union(rdd)
    )

    // do complex model training every two hours
    if (hour is 0, 2, 4, 6...) {
      model_training(myStaticRDD) // will take 1 hour
    }
  }
}
I don't know how to write the code so that the model is trained every two hours using that moment's myStaticRDD.
While the model training is running, the streaming job should keep running normally at the same time, and streamA, streamB, new_stream and myStaticRDD should keep being updated in real time. Maybe I need to use multiple threads?
One possible solution might be:
object myApp {
  var myStaticRDD: RDD[Int]

  def main() {
    ... // init streaming context, and get two DStreams (streamA and streamB) from two HDFS paths

    // complex transformation using the two DStreams
    val new_stream = streamA.transformWith(streamB, (a, b, t) => {
      a.join(b).map(...)
    })

    new_stream.foreachRDD { rdd =>
      // merge each new_stream RDD into myStaticRDD
      myStaticRDD = myStaticRDD.union(rdd)

      // do complex model training every two hours
      if (hour is 0, 2, 4, 6...) {
        val tmp_rdd = myStaticRDD
        // start a child thread to train the model, so the streaming job keeps running
        new Thread(new Runnable {
          override def run(): Unit = model_training(tmp_rdd)
        }).start()
      }
    }
  }
}
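A hedged sketch of that threaded variant, assuming a model_training(rdd: RDD[Int]): Unit function exists: the training is handed to a single-thread executor so the streaming batches keep flowing, and it fires at most once per two-hour window:
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicLong

// Single background worker, so at most one training run is in flight at a time.
val trainer = Executors.newSingleThreadExecutor()
val lastTrainingMillis = new AtomicLong(0L)
val twoHoursMillis = 2L * 60 * 60 * 1000

new_stream.foreachRDD { rdd =>
  myStaticRDD = if (myStaticRDD == null) rdd else myStaticRDD.union(rdd)

  val now = System.currentTimeMillis()
  if (now - lastTrainingMillis.get() >= twoHoursMillis) {
    lastTrainingMillis.set(now)
    val snapshot = myStaticRDD        // capture this moment's RDD for the trainer
    snapshot.cache()
    trainer.submit(new Runnable {
      override def run(): Unit = model_training(snapshot)
    })
  }
}
Note that the ever-growing union lineage would also need periodic checkpointing in a real job; that is outside this sketch.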