Databricks Streaming Jobs Mysteriously Terminating after 5 minutes - apache-spark

I'm trying to use Databricks on Azure with a Spark Structured Streaming job and am running into a very mysterious issue.
I boiled the job down to its basics for testing: reading from a Kafka topic and writing to the console in a foreachBatch.
Locally, everything runs fine indefinitely.
On Databricks, the task terminates after just over 5 minutes with a "Cancelled" status.
There are no errors in the log, just the lines below, which look like some kind of graceful shutdown request, but I don't know where it's coming from:
22/11/04 18:31:30 INFO DriverCorral$: Cleaning the wrapper ReplId-1ea30-8e4c0-48422-a (currently in status Running(ReplId-1ea30-8e4c0-48422-a,ExecutionId(job-774316032912321-run-84401-action-5645198327600153),RunnableCommandId(9102993760433650959)))
22/11/04 18:31:30 INFO DAGScheduler: Asked to cancel job group 2207618020913201706_9102993760433650959_job-774316032912321-run-84401-action-5645198327600153
22/11/04 18:31:30 INFO ScalaDriverLocal: cancelled jobGroup:2207618020913201706_9102993760433650959_job-774316032912321-run-84401-action-5645198327600153
22/11/04 18:31:30 INFO ScalaDriverWrapper: Stopping streams for commandId pattern: CommandIdPattern(2207618020913201706,None,Some(job-774316032912321-run-84401-action-5645198327600153)).
22/11/04 18:31:30 INFO DatabricksStreamingQueryListener: Stopping the stream [id=d41eff2a-4de6-4f17-8d1c-659d1c1b8d98, runId=5bae9fb4-b5e1-45a0-af1e-a2f2553592c9]
22/11/04 18:31:30 INFO DAGScheduler: Asked to cancel job group 5bae9fb4-b5e1-45a0-af1e-a2f2553592c9
22/11/04 18:31:30 INFO TaskSchedulerImpl: Cancelling stage 366
22/11/04 18:31:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 366: Stage cancelled
22/11/04 18:31:30 INFO MicroBatchExecution: QueryExecutionThread.interruptAndAwaitExecutionThreadTermination called with streaming query exit timeout=15000 ms
For reference, here is the code:
val incomingStream = spark.readStream
  .format("kafka")
  .option("subscribe", ehName)
  .option("kafka.bootstrap.servers", topicUriWithPort)
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.jaas.config", jaas)
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .option("maxOffsetsPerTrigger", 1) // TODO: make configurable
  .load()

val processedWriteStream = incomingStream
  .writeStream
  .queryName("query2")
  .foreachBatch((d: DataFrame, b: Long) => {
    d.show()
  })
  .start()
processedWriteStream.awaitTermination()

Structured Streaming provides fault tolerance and data consistency for streaming queries; using Databricks workflows, you can easily configure your Structured Streaming queries to restart automatically on failure.
You can restart a query after a failure by enabling checkpointing for it.
The restarted query continues where the failed one left off.
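A minimal sketch of the same write side with checkpointing enabled; the checkpoint path is a placeholder, everything else mirrors the code from the question:

val processedWriteStream = incomingStream
  .writeStream
  .queryName("query2")
  .option("checkpointLocation", "dbfs:/checkpoints/query2") // placeholder path, pick a durable location
  .foreachBatch((d: DataFrame, b: Long) => {
    d.show()
  })
  .start()
processedWriteStream.awaitTermination()

With a checkpoint location set, a restart of the job (manual or via a workflow retry policy) resumes from the last committed offsets instead of starting again from earliest.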

Related

Cosmos Changefeed Spark Streaming Stops randomly

I have a Spark streaming job which reads Cosmos Changefeed data as below, running in a Databricks cluster with DBR 8.2.
cosmos_config = {
  "spark.cosmos.accountEndpoint": cosmos_endpoint,
  "spark.cosmos.accountKey": cosmos_key,
  "spark.cosmos.database": cosmos_database,
  "spark.cosmos.container": collection,
  "spark.cosmos.read.partitioning.strategy": "Default",
  "spark.cosmos.read.inferSchema.enabled": "false",
  "spark.cosmos.changeFeed.startFrom": "Now",
  "spark.cosmos.changeFeed.mode": "Incremental"
}

df_read = (spark.readStream
  .format("cosmos.oltp.changeFeed")
  .options(**cosmos_config)
  .schema(cosmos_schema)
  .load())

df_write = (df_read.withColumn("partition_date", current_date())
  .writeStream
  .partitionBy("partition_date")
  .format('delta')
  .option("path", master_path)
  .option("checkpointLocation", f"{master_path}_checkpointLocation")
  .queryName("cosmosStream")
  .trigger(processingTime='10 seconds')
  .start()
)
The job ordinarily works well, but occasionally the stream stops all of a sudden and the lines below appear in a loop in the log4j output. Restarting the job processes all the data in the 'backlog'. Has anyone experienced something like this before? I'm not sure what could be causing this. Any ideas?
22/02/27 00:57:58 INFO HiveMetaStore: 1: get_database: default
22/02/27 00:57:58 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
22/02/27 00:57:58 INFO DriverCorral: Metastore health check ok
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Starting...
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Start completed.
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Shutdown initiated...
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Shutdown completed.
22/02/27 00:58:07 INFO MetastoreMonitor: Metastore healthcheck successful (connection duration = 88 milliseconds)
22/02/27 00:58:50 INFO RxDocumentClientImpl: Getting database account endpoint from https://<cosmosdb_endpoint>.documents.azure.com:443
Which version of the Cosmos Spark connector are you using? Between 4.3.0 and 4.6.2 there were several bug fixes made in the bulk ingestion code path.
See https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/CHANGELOG.md for more details.
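If you are on an older release, upgrading the cluster library should pick up those fixes; assuming the Spark 3.1 / Scala 2.12 build referenced in that changelog, the Maven coordinate would be along the lines of com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:4.6.2.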

Unable to process sample word count as Spark job

I have spark-master and spark-worker running in an SAP Kyma environment (a flavour of Kubernetes), along with JupyterLab, with ample CPU and RAM allocated.
I can access the Spark Master UI and see that the workers are registered as well (screenshot below).
I am using Python 3 to submit the job (snippet below):
import pyspark
conf = pyspark.SparkConf()
conf.setMaster('spark://spark-master:7077')
sc = pyspark.SparkContext(conf=conf)
sc
and I can see the Spark context as the output of sc. After this, I prepare the data to submit to the spark-master (snippet below):
words = 'the quick brown fox jumps over the lazy dog the quick brown fox jumps over the lazy dog'
seq = words.split()
data = sc.parallelize(seq)
counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
dict(counts)
sc.stop()
but it starts logging warning messages in the notebook (snippet below) and runs forever until I kill the process from the Spark Master UI.
22/01/27 19:42:39 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/27 19:42:54 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I am new to Kyma (Kubernetes) and Spark. Any help would be much appreciated.
Thanks
For those who stumble upon the same question:
Check your infrastructure certificates. It turned out that Kubernetes was issuing a wrong internal certificate which was not recognised by the pods.
After we fixed the certificate, everything started working.

How to subscribe to a new topic with subscribePattern?

I am using Spark Structured Streaming with Kafka, with the topic subscribed as a pattern:
option("subscribePattern", "topic.*")
// Subscribe to a pattern
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
Once I start the job and a new topic matching the pattern is created, say topic.new_topic, the job doesn't automatically start listening to it; it requires a restart.
Is there a way to automatically subscribe to new topics matching the pattern without restarting the job?
Spark: 3.0.0
The default behavior of a KafkaConsumer is to check every 5 minutes whether there are new partitions to be consumed. This interval is set through the consumer config
metadata.max.age.ms: The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions.
According to the Spark + Kafka Integration Guide on Kafka Specific Configuration you can set this configuration by using the prefix kafka. as shown below:
.option("kafka.metadata.max.age.ms", "1000")
Through this setting the newly created topic will be consumed 1 second after its creation.
(Tested with Spark 3.0.0 and Kafka Broker 2.5.0)
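A minimal sketch of the reader from the question with this refresh applied; the broker addresses are the placeholders from the question and the 1-second interval is purely illustrative:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("kafka.metadata.max.age.ms", "1000") // re-fetch topic metadata every second so new topic.* topics are picked up
  .load()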

Driver stops executors without a reason

I have an application based on Spark Structured Streaming 3 with Kafka, which processes some user logs, and after some time the driver starts to kill the executors and I don't understand why.
The executor logs don't contain any errors. I'm leaving below the logs from the executors and the driver.
On the executor 1:
20/08/31 10:01:31 INFO executor.Executor: Finished task 5.0 in stage 791.0 (TID 46411). 1759 bytes result sent to driver
20/08/31 10:01:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
On the executor 2:
20/08/31 10:14:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
20/08/31 10:14:34 INFO memory.MemoryStore: MemoryStore cleared
20/08/31 10:14:34 INFO storage.BlockManager: BlockManager stopped
20/08/31 10:14:34 INFO util.ShutdownHookManager: Shutdown hook called
On the driver:
20/08/31 10:01:33 ERROR cluster.YarnScheduler: Lost executor 3 on xxx.xxx.xxx.xxx: Executor heartbeat timed out after 130392 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Lost executor 2 on xxx.xxx.xxx.xxx: Executor heartbeat timed out after 125773 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129308 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129314 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129311 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129305 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
Has anyone had the same problem and solved it?
Looking at the information at hand:
no errors
Driver commanded a shutdown
Yarn logs showing "state FINISHED"
this seems to be expected behavior.
This typically happens if you forget to await the termination of the Spark streaming query. If you do not conclude your code with
query.awaitTermination()
your streaming application will simply shut down after all data has been processed.
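A minimal sketch of the intended shape of the driver code, assuming a streaming DataFrame named processedStream built earlier in the application (the console sink is just illustrative):

val query = processedStream.writeStream
  .format("console")
  .start()

// Block the main thread until the query stops or fails; without this,
// main() returns once the code above has run and the application shuts
// down cleanly, which is exactly what "Driver commanded a shutdown" looks like.
query.awaitTermination()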

Spark - How to identify a failed Job through 'SparkLauncher'

I am using Spark 2.0 and sometimes my job fails due to problems with the input. For example, I am reading CSV files from an S3 folder based on the date, and if there's no data for the current date, my job has nothing to process, so it throws an exception as follows. This gets printed in the driver's logs.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: s3n://data/2016-08-31/*.csv;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
...
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/09/03 10:51:54 INFO SparkContext: Invoking stop() from shutdown hook
16/09/03 10:51:54 INFO SparkUI: Stopped Spark web UI at http://192.168.1.33:4040
16/09/03 10:51:54 INFO StandaloneSchedulerBackend: Shutting down all executors
16/09/03 10:51:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
Spark App app-20160903105040-0007 state changed to FINISHED
However, despite this uncaught exception, my Spark Job status is 'FINISHED'. I would expect it to be in 'FAILED' status because there was an exception. Why is it marked as FINISHED? How can I find out whether the job failed or not?
Note: I am spawning the Spark jobs using SparkLauncher, and listening to state changes through AppHandle. But the state change I receive is FINISHED whereas I am expecting FAILED.
The FINISHED you see is for the Spark application, not a job. It is FINISHED because the Spark context was able to start and stop properly.
You can see any job information using JavaSparkStatusTracker.
For active jobs nothing additional is needed, since the tracker has a .getActiveJobIds method.
To get finished/failed jobs you will need to set the job group ID in the thread from which you trigger the Spark execution:
JavaSparkContext sc;
...
sc.setJobGroup(MY_JOB_ID, "Some description");
Then, whenever you need to, you can read the status of each job within the specified job group:
JavaSparkStatusTracker statusTracker = sc.statusTracker();
for (int jobId : statusTracker.getJobIdsForGroup(JOB_GROUP_ALL)) {
    final SparkJobInfo jobInfo = statusTracker.getJobInfo(jobId);
    final JobExecutionStatus status = jobInfo.status();
    // status == JobExecutionStatus.FAILED indicates this job failed
}
The JobExecutionStatus can be one of RUNNING, SUCCEEDED, FAILED, or UNKNOWN; the last one covers the case where a job has been submitted but not actually started.
Note: all of this is available from the Spark driver, i.e. the jar you are launching using SparkLauncher, so the code above should be placed inside that jar.
If you want to check from the SparkLauncher side whether there were any failures, you can make the application started from the jar exit with a non-zero code (for example via System.exit(1)) when a job failure is detected. The Process returned by SparkLauncher::launch has an exitValue method, so you can detect whether it failed or not.
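A minimal sketch of that launcher-side check, in Scala; the jar path and main class are placeholders:

import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setAppResource("/path/to/driver-app.jar") // placeholder: the jar whose main method calls System.exit(1) on failure
  .setMainClass("com.example.MyDriver")      // placeholder main class
  .setMaster("yarn")
  .launch()

val exitCode = process.waitFor() // blocks until the launched driver JVM exits
if (exitCode != 0) {
  // a non-zero exit code set by the driver signals a detected job failure
  println(s"Spark application failed with exit code $exitCode")
}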
You can always go to the Spark History Server and click on your job ID to get the job details.
If you are using YARN, you can go to the Resource Manager web UI to track your job status.
