I'm new to Spark and Kafka, using pyspark (spark 2.4.8).
Assume we have a Kafka streaming source and we want to stream at least N records to our database. What is the best way to guarantee the desired number of records is written and to stop once it is reached?
I thought of counting the number of micro-batches with a global variable and limiting the number of offsets per micro-batch, but I suspect that isn't the right way to solve the problem.
My code in general:
raw_stream_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_server) \
.option("subscribe", "topic1, topic2") \
.option("startingOffsets", "earliest") \
.option("maxOffsetsPerTrigger", offsets_number) \
.load()
...
# define schema (not relevant)
...
counter = 0
def foreach_batch_function(df, epoch_id):
global counter
counter += 1
query = streaming_df \
.writeStream \
.outputMode("append") \
.format("memory") \
.queryName("query1") \
.foreachBatch(foreach_batch_function) \
.start()
But it didn't work. I tried to stop the query after reaching a constant number of micro-batches, but the counter did not even increase.
Back to my question: what is the right way to pass a lower bound on the number of requested records and then just stop?
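For what it's worth, here is a minimal sketch of that counting idea in Scala (the structure is the same in PySpark). It assumes streamingDf is the parsed stream and writeBatchToDatabase is a hypothetical function that writes one micro-batch to the target database; since the check happens between micro-batches, the query stops after at least N records rather than exactly N:
import org.apache.spark.sql.DataFrame

val targetCount = 1000L        // N, the lower bound of records to deliver
@volatile var written = 0L     // foreachBatch runs on the driver, so a driver-side var works

def writeAndCount(batch: DataFrame, epochId: Long): Unit = {
  batch.persist()
  writeBatchToDatabase(batch)  // hypothetical sink call
  written += batch.count()     // records delivered by this micro-batch
  batch.unpersist()
}

val query = streamingDf.writeStream
  .outputMode("append")
  .foreachBatch(writeAndCount _)
  .start()

// Poll the counter between micro-batches and stop once at least N records are written.
while (query.isActive && written < targetCount) {
  query.awaitTermination(1000) // wait up to one second, then re-check
}
if (query.isActive) query.stop()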
I am trying to read all the data in a Kafka topic in batches (reading between two offset values) and load it into Spark DataFrames, without using readStream from Spark Structured Streaming.
My idea is:
I first get the total number of records in the topic by finding the maximum offset value.
I define step, namely the number of records per batch.
With a for loop I read each data batch from the Kafka topic by setting the startingOffsets and endingOffsets options.
This is my code (for a topic with a single partition) to print the count in each batch:
val maxOffsetValue = {
Process(s"kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic topicname")
.!!
.split(":")
.last
.trim
.toInt
}
val step = 1000
for (i <- 0 until maxOffsetValue by step) {
val df: DataFrame = {
spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicname")
.option("startingOffsets", s"""{"topicname":{"0":${i}}}""")
.option("endingOffsets", s"""{"topicname":{"0":${i+step}}}""")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
.select(from_json(col("value"), dataSchema) as "data")
.select("data.*")
}
println(s"i: ${i}, i+step: ${i+step}, count: ${df.count()}")
}
However, it seems that the JSON format for startingOffsets and endingOffsets is not flexible, as apparently all offset indices need to be specified for each partition, e.g. something like {"0":${i}, "1":${i}} if there are two partitions.
My questions are:
Is there a better way to achieve the same result, ideally one that can be extended directly to a multi-partition topic?
Is there a way to read the maximum offset without using a shell command?
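On the second question (and the all-partitions JSON from the first), the Kafka consumer API can report end offsets directly, so the shell command is not needed. A sketch with a hypothetical helper, endOffsetsJson, that asks the brokers for the end offset of every partition and builds the JSON string the Kafka source expects:
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

def endOffsetsJson(bootstrapServers: String, topic: String): String = {
  val props = new Properties()
  props.put("bootstrap.servers", bootstrapServers)
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  try {
    // one TopicPartition per partition of the topic
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(topic, p.partition()))
    // end offset (i.e. the "maximum offset") of each partition
    val endOffsets = consumer.endOffsets(partitions.asJava).asScala
    val perPartition = endOffsets
      .map { case (tp, offset) => s""""${tp.partition()}":$offset""" }
      .mkString(",")
    s"""{"$topic":{$perPartition}}"""
  } finally {
    consumer.close()
  }
}

// e.g. endOffsetsJson("localhost:9092", "topicname") might return {"topicname":{"0":52400,"1":51730}}
The same loop as above can then step each partition from 0 up to its end offset and build the startingOffsets / endingOffsets JSON for every batch.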
We tried two MLlib transformers for consuming from Kafka: one using a Structured Streaming batch query, like here, and one using normal Kafka consumers. The normal consumers read into a list that gets converted to a DataFrame.
We created two MLlib pipelines, each starting from an empty DataFrame. The first transformer of each pipeline did the reading from Kafka, so we had one pipeline per Kafka transformer type.
We then ran the pipelines three times: the 1st run with normal Kafka consumers, the 2nd with the Spark consumer, and the 3rd with normal consumers again.
The spark config had 4 executors, with 1 core each:
spark.executor.cores": "1", "spark.executor.instances": 4,
Questions are:
A. Where does the consumption happen, on the executors or on the driver? According to the driver UI, it looks like in both cases the executors did all the work: the driver did not pass any data and 4 executors got created.
B. Why do we get a different number of executors running? In the 1st run, with normal consumers, we see 4 executors working; in the 2nd run, with the Spark Kafka connector, 1 executor; and in the 3rd run, with normal consumers again, 1 executor but 2 cores.
You'll see the driver's UI attached at the bottom.
This is the relevant code:
Normal Consumer:
// Plain Kafka consumer driven from the transformer code (consumer construction elided here)
var kafkaConsumer: KafkaConsumer[String, String] = null
val recordList = scala.collection.mutable.ListBuffer[String]()
val readMessages = () => {
  // the original snippet omits the poll; something along these lines fetches the records
  val records = kafkaConsumer.poll(java.time.Duration.ofMillis($(timeoutMs))).asScala
  for (record <- records) {
    recordList.append(record.value())
  }
}
kafkaConsumer.subscribe(util.Arrays.asList($(topic)))
readMessages()
var df = recordList.toDF   // collected locally, then turned into a single-column DataFrame ("value")
kafkaConsumer.close()
val json_schema =
df.sparkSession.read.json(df.select("value").as[String]).schema
df = df.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
Spark consumer:
val records = dataset
.sparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", $(url))
.option("subscribe", $(this.topic))
.option("kafkaConsumer.pollTimeoutMs", s"${$(timeoutMs)}")
.option("startingOffsets", $(startingOffsets))
.option("endingOffsets", $(endingOffsets))
.load
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
OmnixLogger.warn(uid, "Executor ID AFTER polling: " + SparkEnv.get.executorId)
val json_schema = records.sparkSession.read.json(records.select("value").as[String]).schema
var df: DataFrame = records.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
I have written a Scala program for loading data from an MS SQL Server and writing it to BigQuery. I execute this in a Spark cluster (Google Dataproc). My issue is that even though I have a cluster with 64 cores, and I specify the executor parameters when running the job, and I partition the data I'm reading, Spark only reads data from a single executor. When I start the job I can see all the executors firing up and on the SQL Server I can see connections from all 4 workers, but within a minute, they all shut down again, leaving only one, which then runs for over an hour before finishing.
The data set is 65 million records, and I'm trying to partition it into 60 partitions.
This is my cluster:
gcloud dataproc clusters create my-cluster \
--properties dataproc:dataproc.conscrypt.provider.enable=false,spark:spark.executor.userClassPathFirst=true,spark:spark.driver.userClassPathFirst=true \
--region europe-north1 \
--subnet my-subnet \
--master-machine-type n1-standard-4 \
--worker-machine-type n1-highmem-16 \
--master-boot-disk-size 15GB \
--worker-boot-disk-size 500GB \
--image-version 1.4 \
--master-boot-disk-type=pd-ssd \
--worker-boot-disk-type=pd-ssd \
--num-worker-local-ssds=1 \
--num-workers=4
This is how I run the job:
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region europe-north1 \
--jars gs://mybucket/mycode.jar,gs://hadoop-lib/bigquery/bigquery-connector-hadoop3-latest.jar \
--class Main \
--properties \
spark.executor.memory=19g, \
spark.executor.cores=4, \
spark.executor.instances=11 \
-- yarn
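As an aside, gcloud expects --properties as one comma-separated list passed as a single argument, without spaces; the same submit with the property list written that way would look roughly like this (same values, only the formatting differs):
gcloud dataproc jobs submit spark \
  --cluster my-cluster \
  --region europe-north1 \
  --jars gs://mybucket/mycode.jar,gs://hadoop-lib/bigquery/bigquery-connector-hadoop3-latest.jar \
  --class Main \
  --properties spark.executor.memory=19g,spark.executor.cores=4,spark.executor.instances=11 \
  -- yarn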
This is the code I use to read the data:
val data = sqlQuery(ss,
serverName,
portNumber,
databaseName,
userName,
password,
tableName)
writeToBigQuery(
bqConfig,
data,
dataSetName,
replaceInvalidCharactersInTableName(r.getAs[String]("TableName")),
"WRITE_TRUNCATE")
def sqlQuery(ss: SparkSession,
hostName: String,
port: String,
databaseName: String,
user: String,
password: String,
query: String): DataFrame = {
val result = ss.read.format("jdbc")
.option("url", getJdbcUrl(hostName, port, databaseName))
.option("dbtable", query)
.option("user", user)
.option("password", password)
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("numPartitions", 60)
.option("partitionColumn", "entityid")
.option("lowerBound", 1)
.option("upperBound", 198012).load()
result
}
def writeToBigQuery(bqConf: Configuration,
df: DataFrame,
dataset: String,
table: String,
writeDisposition: String = "WRITE_APPEND"): Unit = {
//Convert illegal characters in column names
var legalColumnNamesDf = df
for (col <- df.columns) {
legalColumnNamesDf = legalColumnNamesDf.withColumnRenamed(
col,
col
.replaceAll("-", "_")
.replaceAll("\\s", "_")
.replaceAll("æ", "ae")
.replaceAll("ø", "oe")
.replaceAll("å", "aa")
.replaceAll("Æ", "AE")
.replaceAll("Ø", "OE")
.replaceAll("Å", "AA")
)
}
val outputGcsPath = s"gs://$bucket/" + HardcodedValues.SparkTempFolderRelativePath + UUID
.randomUUID()
.toString
val outputTableId = s"$projectId:$dataset.$table"
//Apply an explicit schema to avoid the creativity of BigQuery's automatic schema detection
val uniqBqConf = new Configuration(bqConf)
BigQueryOutputConfiguration.configure(
uniqBqConf,
outputTableId,
s"""{"fields":${Json(DefaultFormats).write(
legalColumnNamesDf.schema.map(
f =>
Map(
"name" -> f.name,
"type" -> f.dataType.sql
.replace("BIGINT", "INT")
.replace("INT", "INT64")
.replaceAll("DECIMAL\\(\\d+,\\d+\\)", "NUMERIC"),
"mode" -> (if (f.nullable) "NULLABLE"
else "REQUIRED")
))
)} }""",
outputGcsPath,
BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
classOf[TextOutputFormat[_, _]]
)
uniqBqConf.set(
BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY,
if (Array("WRITE_APPEND", "WRITE_TRUNCATE") contains writeDisposition)
writeDisposition
else "WRITE_APPEND"
)
//Save to BigQuery
legalColumnNamesDf.rdd
.map(
row =>
(null,
Json(DefaultFormats).write(
ListMap(row.schema.fieldNames.toSeq.zip(row.toSeq): _*))))
.saveAsNewAPIHadoopDataset(uniqBqConf)
}
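A quick diagnostic that may help narrow this down: right after sqlQuery(...), check whether the JDBC read really produced 60 partitions and how the rows are spread across them (spark_partition_id is a built-in function; note that this scans the table once, so it is for troubleshooting only):
import org.apache.spark.sql.functions.spark_partition_id

// Did the numPartitions/partitionColumn options take effect, and is the data skewed?
println("partitions: " + data.rdd.getNumPartitions)   // expected: 60
data.groupBy(spark_partition_id().as("partition")).count().show(60, false)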
Any ideas would be appreciated.
If you look at the Spark UI, is there a lot of skew where one task is reading most of the data? My guess is that you're picking a poor partition key, so most of the data ends up in one partition.
This stackoverflow answer provides a detailed explanation: What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?. I think your entity ids would need to be evenly distributed between 1 and 198012 for it to be a good column to partition on.
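To make that concrete, this is roughly how Spark turns those four options into per-partition JDBC queries with the values from the question (the clause text is approximate and version-dependent):
val lowerBound = 1L
val upperBound = 198012L
val numPartitions = 60
val stride = (upperBound - lowerBound) / numPartitions   // = 3300

// The 60 JDBC queries then look roughly like:
//   WHERE entityid < 3301 OR entityid IS NULL     -- partition 0
//   WHERE entityid >= 3301 AND entityid < 6601    -- partition 1
//   ...
//   WHERE entityid >= 194701                      -- partition 59 (also takes everything above upperBound)
// If most rows share a narrow band of entityid values, most of the 65M rows land in a few partitions,
// the executors whose ranges are empty finish immediately, and one executor does all the work.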
In the end I stopped telling Spark how many executors to run and just let it do dynamic allocation, and now it works. I asked for 24 partitions and it dynamically allocates 8 executors with 3 cores each, running 24 tasks in parallel.
I have a very simple Spark + Kafka application. I'm reading from Kafka and printing to the console. I have 2 lines in the code below, i.e. a Good-Line and a Bad-Line.
Initially I process with the good line, then I switch to the bad line for a while; when I change back to the good line I expect processing to resume from where it left off. Surprisingly, it starts from the latest offset, so the output looks like this (the records from the bad-line period are missing):
1
2
3
missing
missing
7
8
9
In the code below, how can I ensure I read all the messages? I did not find a place in the code where I can control the offsets. Even if there is duplicate processing I'm fine, because I'll have a unique id in my messages.
public static void main(String[] args) throws Exception {
String brokers = "quickstart:9092";
String topics = "simple_topic_1";
String master = "local[*]";
SparkSession sparkSession = SparkSession
.builder().appName(SimpleKafkaProcessor.class.getName())
.master(master).getOrCreate();
SQLContext sqlContext = sparkSession.sqlContext();
SparkContext context = sparkSession.sparkContext();
context.setLogLevel("ERROR");
Dataset<Row> rawDataSet = sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
//.option("enable.auto.commit", "false")
.option("auto.offset.reset", "earliest")
.option("group.id", "safe_message_landing_app_2")
.option("subscribe", topics).load();
rawDataSet.printSchema();
rawDataSet.createOrReplaceTempView("basicView");
// Good-Line
sqlContext.sql("select string(Value) as StrValue from basicView").writeStream()
// Bad-Line
//sqlContext.sql("select fieldNotFound as StrValue from basicView").writeStream()
.format("console")
.option("checkpointLocation", "cp/" + UUID.randomUUID().toString())
.trigger(ProcessingTime.create("15 seconds"))
.start()
.awaitTermination();
}
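For reference, the Kafka source in Structured Streaming manages its own consumer group, and the starting position is controlled by the startingOffsets option rather than auto.offset.reset; resuming from where a query left off requires reusing the same checkpointLocation across runs instead of a fresh random one. A minimal Scala sketch of those two changes (the checkpoint path is a placeholder):
import org.apache.spark.sql.streaming.Trigger

val raw = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", topics)
  .option("startingOffsets", "earliest")   // only consulted when no checkpoint exists yet
  .load()

raw.selectExpr("CAST(value AS STRING) AS StrValue")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/simple_topic_1")  // fixed path, not a random UUID
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .start()
  .awaitTermination()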