Zeppelin TimeoutLifecycleManager leaves the Spark application with final status FAILED

Why I use TimeoutLifecycleManager:
The default Spark interpreter does not stop on its own after my notebook run finishes, so I use TimeoutLifecycleManager to stop it once it is idle.
The interpreter now stops as expected, but the final status of the Spark application is "FAILED". I believe this is because the interpreter is killed by the TimeoutLifecycleManager.
So I want to know how to get the correct final status for the Spark application when the job has finished.
The Zeppelin version is 0.10.1, and I submit Spark jobs in "yarn-cluster" mode.
I set this Spark interpreter option:
The interpreter will be instantiated Per Note in isolated process
I have read some of the TimeoutLifecycleManager source code, but not much of it. I only found its shutdown code:
public TimeoutLifecycleManager(ZeppelinConfiguration zConf,
                               RemoteInterpreterServer remoteInterpreterServer) {
  super(zConf, remoteInterpreterServer);
  long checkInterval = zConf.getTime(ZeppelinConfiguration.ConfVars
      .ZEPPELIN_INTERPRETER_LIFECYCLE_MANAGER_TIMEOUT_CHECK_INTERVAL);
  long timeoutThreshold = zConf.getTime(
      ZeppelinConfiguration.ConfVars.ZEPPELIN_INTERPRETER_LIFECYCLE_MANAGER_TIMEOUT_THRESHOLD);
  ScheduledExecutorService checkScheduler = ExecutorFactory.singleton()
      .createOrGetScheduled("TimeoutLifecycleManager", 1);
  checkScheduler.scheduleAtFixedRate(() -> {
    if ((System.currentTimeMillis() - lastBusyTimeInMillis) > timeoutThreshold) {
      LOGGER.info("Interpreter process idle time exceed threshold, try to stop it");
      try {
        remoteInterpreterServer.shutdown();
      } catch (TException e) {
        LOGGER.error("Fail to shutdown RemoteInterpreterServer", e);
      }
    } else {
      LOGGER.debug("Check idle time of interpreter");
    }
  }, checkInterval, checkInterval, MILLISECONDS);
  LOGGER.info("TimeoutLifecycleManager is started with checkInterval: {}, timeoutThreshold: {}", checkInterval,
      timeoutThreshold);
}
Is there any way to set the correct status on the Spark application before shutdown?
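The check scheduled by the constructor quoted above reduces to comparing the time since the last activity against a threshold. Here is a minimal Python sketch of that comparison, for illustration only: the class name IdleTimeout, its methods, and the injectable clock are hypothetical and are not Zeppelin APIs.

```python
import time


class IdleTimeout:
    """Tracks the last busy time and reports when the idle threshold is exceeded."""

    def __init__(self, threshold_ms, clock_ms=None):
        # clock_ms is injectable so the check can be exercised without waiting
        self._clock_ms = clock_ms or (lambda: time.monotonic() * 1000)
        self._threshold_ms = threshold_ms
        self._last_busy_ms = self._clock_ms()

    def mark_busy(self):
        """Record activity, which restarts the idle period."""
        self._last_busy_ms = self._clock_ms()

    def is_expired(self):
        """Same comparison as the Java code: now - lastBusy > threshold."""
        return (self._clock_ms() - self._last_busy_ms) > self._threshold_ms
```

Nothing in this comparison looks at whether the last job succeeded; in the quoted code the shutdown is triggered by idle time alone.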

Related

Getting LeaseExpiredException in spark streaming randomly

I have a Spark Streaming job (Spark 2.1.1 on Cloudera 5.12) with Kafka as input and HDFS as output (in Parquet format).
The problem is that I get a LeaseExpiredException at random (not in every mini-batch):
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/qoe_fixe/data_tv/tmp/cleanData/_temporary/0/_temporary/attempt_20180629132202_0215_m_000000_0/year=2018/month=6/day=29/hour=11/source=LYO2/part-00000-c6f21a40-4088-4d97-ae0c-24fa463550ab.snappy.parquet (inode 135532024): File does not exist. Holder DFSClient_attempt_20180629132202_0215_m_000000_0_-1048963677_900 does not have any open files.
I am using the Dataset API to write to HDFS:
if (!InputWithDatePartition.rdd.isEmpty() ) InputWithDatePartition.repartition(1).write.partitionBy("year", "month", "day","hour","source").mode("append").parquet(cleanPath)
My job fails after a few hours because of this error.
Two jobs that write to the same directory share the same _temporary folder.
So when the first job finishes, this code is executed (FileOutputCommitter class):
public void cleanupJob(JobContext context) throws IOException {
  if (hasOutputPath()) {
    Path pendingJobAttemptsPath = getPendingJobAttemptsPath();
    FileSystem fs = pendingJobAttemptsPath
        .getFileSystem(context.getConfiguration());
    // if job allow repeatable commit and pendingJobAttemptsPath could be
    // deleted by previous AM, we should tolerate FileNotFoundException in
    // this case.
    try {
      fs.delete(pendingJobAttemptsPath, true);
    } catch (FileNotFoundException e) {
      if (!isCommitJobRepeatable(context)) {
        throw e;
      }
    }
  } else {
    LOG.warn("Output Path is null in cleanupJob()");
  }
}
It deletes pendingJobAttemptsPath (_temporary) while the second job is still running.
This may be helpful:
Multiple spark jobs appending parquet data to same base path with partitioning
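The race described above can be reproduced on a local filesystem. The sketch below is a simulation only, not Hadoop code: the function names and the simplified directory layout are made up for illustration, and the recursive delete stands in for the fs.delete(pendingJobAttemptsPath, true) call in cleanupJob.

```python
import shutil
import tempfile
from pathlib import Path


def start_attempt(output_dir: Path, attempt_id: str) -> Path:
    """Create an in-flight task file under the shared _temporary directory."""
    attempt_dir = output_dir / "_temporary" / "0" / attempt_id
    attempt_dir.mkdir(parents=True, exist_ok=True)
    part = attempt_dir / "part-00000.parquet"
    part.write_text("partial data")
    return part


def cleanup_job(output_dir: Path) -> None:
    """Mimic cleanupJob: recursively delete the whole _temporary directory."""
    shutil.rmtree(output_dir / "_temporary", ignore_errors=True)


def simulate() -> bool:
    """Return True if job B's in-flight file survives job A's cleanup."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "cleanData"
        start_attempt(out, "attempt_A")
        part_b = start_attempt(out, "attempt_B")
        cleanup_job(out)  # job A finishes first and cleans up
        return part_b.exists()


print(simulate())  # False: job B's file disappeared with the shared folder
```

Job B's next write to that file would then fail, which matches the "File does not exist" part of the LeaseExpiredException message.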

How to set the Spark streaming receiver frequency?

My requirement is to process hourly stock market data,
i.e. fetch the data from the source once per streaming interval and process it via a DStream.
I have implemented a custom receiver that scrapes/monitors the website by implementing the onStart() and onStop() methods, and it is working.
Challenges encountered:
The receiver thread fetches the data continuously, i.e. multiple times per interval.
I am unable to coordinate the receiver with the DStream execution interval.
Options I tried:
Making the receiver thread sleep for a few seconds (equal to the streaming interval).
In this case the data is no longer the latest when it is processed.
class CustomReceiver(interval: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart() {
    new Thread("Website Scrapper") {
      override def run() { receive() }
    }.start()
  }
  def onStop() {
  }
  /** Create a socket connection and receive data until receiver is stopped */
  private def receive() {
    println("Entering receive:" + new Date());
    try {
      while (!isStopped) {
        val scriptsLTP = StockMarket.getLiveStockData()
        for ((script, ltp) <- scriptsLTP) {
          store(script + "," + ltp)
        }
        println("sent data")
        System.out.println("going to sleep:" + new Date());
        Thread.sleep(3600 * 1000);
        System.out.println("awaken from sleep:" + new Date());
      }
      println("Stopped receiving")
      restart("Trying to connect again")
    } catch {
      case t: Throwable =>
        restart("Error receiving data", t)
    }
    println("Exiting receive:" + new Date());
  }
}
How can I keep the Spark Streaming receiver in sync with DStream processing?
This use case does not seem to be a good fit for Spark Streaming. The interval is long enough to treat this as a regular batch job instead, which makes better use of the cluster resources.
I would rewrite it as a Spark job: parallelize the target tickers, use mapPartitions so that the executors act as distributed web scrapers, and then process the results as intended.
Then schedule the Spark job to run each hour, at the exact times wanted, with cron or a more advanced alternative such as Chronos.
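The batch approach described above could look roughly like the following PySpark sketch. It assumes an existing SparkContext is passed in; the function names, the fetch callable (standing in for the actual scraping code), and the num_slices default are hypothetical.

```python
def scrape_partition(tickers, fetch):
    """Fetch a quote for every ticker in one partition; runs on an executor."""
    for ticker in tickers:
        yield ticker, fetch(ticker)


def scrape_all(spark_context, tickers, fetch, num_slices=4):
    """Distribute the tickers and scrape them in parallel with mapPartitions."""
    return (
        spark_context.parallelize(tickers, num_slices)
        .mapPartitions(lambda part: scrape_partition(part, fetch))
        .collect()
    )
```

Each hourly run started by the scheduler would call scrape_all once and then exit, so no receiver thread has to sleep between fetches.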

Limit apache spark job running duration

I want to submit a job to a cluster with a timeout parameter. Is there a way to make Spark kill a running job if it exceeds the allowed duration?
As of Spark 2.1.0, there is no built-in solution (it would be a very good feature to add!).
You can play with the speculation feature to re-launch long tasks, and with spark.task.maxFailures to kill tasks that have been re-launched too many times.
But this is absolutely not clean; Spark is missing a real "circuit breaker" to stop long tasks (such as a naive SELECT * FROM DB).
On the other hand, you could use the Spark web UI REST API:
1) Get running jobs: GET http://SPARK_CLUSTER_PROD/api/v1/applications/application_1502112083252_1942/jobs?status=running
(this returns an array whose entries have a submissionTime field that you can use to find long-running jobs)
2) Kill the job: POST http://SPARK_CLUSTER_PROD/stages/stage/kill/?id=23881&terminate=true for each of the job's stages.
I believe Spark also has a hidden API that you can try to use.
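Step 1 above can be sketched as a small filter over the decoded JSON array. This is an illustration under assumptions: the function name is hypothetical, each entry is assumed to carry jobId and submissionTime fields, and the timestamp format (for example 2017-08-07T10:00:00.000GMT) is assumed and should be checked against what your Spark version returns.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout of submissionTime, e.g. "2017-08-07T10:00:00.000GMT"
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fGMT"


def long_running_jobs(jobs, now, max_duration):
    """Return the jobId of every job submitted more than max_duration before now."""
    overdue = []
    for job in jobs:
        submitted = datetime.strptime(job["submissionTime"], TIME_FORMAT)
        submitted = submitted.replace(tzinfo=timezone.utc)
        if now - submitted > max_duration:
            overdue.append(job["jobId"])
    return overdue
```

The returned ids identify the jobs whose stages would then be killed in step 2.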
You can use the YARN REST API to kill the Spark application from your service. I use the following code to stop a long-running Spark application. It uses the httpclient library.
def killApplication(applicationId: String): Boolean = {
  // PUT the new state to the ResourceManager's application state endpoint
  val appKillPut = new HttpPut(s"http://xx.xx.xx.xx:8088/ws/v1/cluster/apps/$applicationId/state")
  val json = new JSONObject(Map("state" -> "KILLED"))
  val params = new StringEntity(json.toString(), "UTF-8")
  params.setContentType("application/json")
  appKillPut.addHeader("Content-Type", "application/json")
  appKillPut.addHeader("Accept", "*/*")
  appKillPut.setEntity(params)
  println(s"Request payload ${json.toString}")
  val client: CloseableHttpClient = HttpClientBuilder.create().build()
  val response: CloseableHttpResponse = client.execute(appKillPut)
  val responseBody = EntityUtils.toString(response.getEntity)
  println(s"Response payload ${responseBody}")
  val statusCode: Int = response.getStatusLine.getStatusCode
  if (statusCode == 200 || statusCode == 201 || statusCode == 202) {
    println(s"Successfully stopped the application : ${applicationId}")
    true
  } else {
    false
  }
}
Hope this helps.
Ravi

What is the best way to restart spark streaming application?

I basically want to write an event callback in my driver program that restarts the Spark Streaming application when that event arrives.
My driver program sets up the streams and the execution logic by reading configuration from a file.
Whenever the file changes (new configs are added), the driver program has to perform the following steps in sequence:
Restart,
Read the config file (as part of the main method), and
Set up the streams
What is the best way to achieve this?
In some cases you may want to reload the streaming context dynamically (for example, to reload the streaming operations).
In that case you can do the following (Scala example):
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkContext = new SparkContext()
// declared as vars because other code flips these flags at runtime
var stopEvent = false
var shouldReload = false
var streamingContext = Option.empty[StreamingContext]
val processThread = new Thread {
  override def run(): Unit = {
    while (!stopEvent) {
      if (streamingContext.isEmpty) {
        // new context
        streamingContext = Option(new StreamingContext(sparkContext, Seconds(1)))
        // create DStreams
        val lines = streamingContext.get.socketTextStream(...)
        // your transformations and actions
        // and decision to reload streaming context
        // ...
        streamingContext.get.start()
      } else {
        if (shouldReload) {
          // stop only the streaming context; keep the SparkContext for the next one
          streamingContext.get.stop(stopSparkContext = false, stopGracefully = true)
          streamingContext.get.awaitTermination()
          streamingContext = Option.empty[StreamingContext]
          shouldReload = false
        } else {
          Thread.sleep(1000)
        }
      }
    }
    // stop requested: shut down the streaming context (if any) and the SparkContext
    streamingContext.foreach { ssc =>
      ssc.stop(stopSparkContext = true, stopGracefully = true)
      ssc.awaitTermination()
    }
  }
}
// and start it in separate thread
processThread.start()
processThread.join()
Or in Python:
import threading
import time
from threading import Event

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

spark_context = SparkContext()
stop_event = Event()
spark_streaming_context = None
should_reload = False

def process():
    global spark_streaming_context, should_reload
    while not stop_event.is_set():
        if spark_streaming_context is None:
            # new context
            spark_streaming_context = StreamingContext(spark_context, 0.5)
            # create DStreams
            lines = spark_streaming_context.socketTextStream(...)
            # your transformations and actions
            # and decision to reload streaming context
            # ...
            spark_streaming_context.start()
        else:
            # TODO move to config
            if should_reload:
                # stop only the streaming context; keep the SparkContext
                spark_streaming_context.stop(stopSparkContext=False, stopGraceFully=True)
                spark_streaming_context.awaitTermination()
                spark_streaming_context = None
                should_reload = False
            else:
                time.sleep(1)
    else:
        # stop_event was set: shut down whatever is still running
        if spark_streaming_context is not None:
            spark_streaming_context.stop(stopGraceFully=True)
            spark_streaming_context.awaitTermination()

# and start it in separate thread
process_thread = threading.Thread(target=process)
process_thread.start()
process_thread.join()
If you want to protect your code against crashes and restart the streaming context from where it left off, use the checkpointing mechanism.
It allows you to restore your job state after a failure.
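The examples above leave the "decision to reload streaming context" as a comment. Since the question restarts whenever the config file changes, one possible way to drive that decision is to poll the file's modification time. The sketch below is only one option; the ConfigWatcher class and the demo function are hypothetical names introduced for illustration.

```python
import os
import tempfile


class ConfigWatcher:
    """Reports whether a file's modification time changed since the last check."""

    def __init__(self, path):
        self.path = path
        self._last_mtime_ns = os.stat(path).st_mtime_ns

    def changed(self):
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._last_mtime_ns:
            self._last_mtime_ns = mtime_ns
            return True
        return False


def demo():
    """Touch a temporary file once and record what the watcher reports."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        watcher = ConfigWatcher(path)
        results = [watcher.changed()]
        stat = os.stat(path)
        # simulate an edit by moving the modification time two seconds forward
        os.utime(path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 2_000_000_000))
        results.append(watcher.changed())
        results.append(watcher.changed())
        return results
    finally:
        os.remove(path)
```

In the polling loop, the branch that currently sleeps could call watcher.changed() and set the reload flag when it returns True.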
The best way to restart Spark really depends on your environment, but using the spark-submit console is always advisable.
You can background the spark-submit process like any other Linux process by putting it into the background in the shell. In your case, the spark-submit job then actually runs the driver on YARN, so it is baby-sitting a process that is already running asynchronously on another machine via YARN.
Cloudera blog
One way that we explored recently (in a Spark meetup here) is to use ZooKeeper in tandem with Spark. In a nutshell, this uses Apache Curator to watch for changes in ZooKeeper (a config change in ZK, which your external event can trigger) that then cause a listener to restart.
The referenced code base is here; you will find that a change in config causes the Watcher (a Spark Streaming app) to restart after a graceful shutdown and reload the changes. Hope this is a pointer!
I am currently solving this issue as follows:
Listen to external events by subscribing to an MQTT topic
In the MQTT callback, stop the streaming context with ssc.stop(true, true), which gracefully shuts down the streams and the underlying SparkContext
Start the Spark application again by creating a SparkConf and setting up the streams by reading the config file
// Contents of startSparkApplication() method
sparkConf = new SparkConf().setAppName("SparkAppName")
ssc = new StreamingContext(sparkConf, Seconds(1))
val myStream = MQTTUtils.createStream(ssc,...) //provide other options
myStream.print()
ssc.start()
The application is built as a Spring Boot application.
In Scala, stopping the StreamingContext may involve stopping the SparkContext. I have found that when a receiver hangs, it is best to restart both the SparkContext and the StreamingContext.
I am sure the code below can be written much more elegantly, but it allows the SparkContext and StreamingContext to be restarted programmatically. Once this is done, you can restart your receivers programmatically as well.
package coname.utilobjects
import com.typesafe.config.ConfigFactory
import grizzled.slf4j.Logging
import coname.conameMLException
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
object SparkConfProviderWithStreaming extends Logging
{
val sparkVariables: mutable.HashMap[String, Any] = new mutable.HashMap
}
trait SparkConfProviderWithStreaming extends Logging{
private val keySSC = "SSC"
private val keyConf = "conf"
private val keySparkSession = "spark"
lazy val packagesversion=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.packagesversion")
lazy val sparkcassandraconnectionhost=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparkcassandraconnectionhost")
lazy val sparkdrivermaxResultSize=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparkdrivermaxResultSize")
lazy val sparknetworktimeout=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparknetworktimeout")
@throws(classOf[conameMLException])
def intitializeSpark(): Unit =
{
getSparkConf()
getSparkStreamingContext()
getSparkSession()
}
@throws(classOf[conameMLException])
def getSparkConf(): SparkConf = {
try {
if (!SparkConfProviderWithStreaming.sparkVariables.get(keyConf).isDefined) {
logger.info("\n\nLoading new conf\n\n")
val conf = new SparkConf().setMaster("local[4]").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
SparkConfProviderWithStreaming.sparkVariables.put(keyConf, conf)
logger.info("Loaded new conf")
getSparkConf()
}
else {
logger.info("Returning initialized conf")
SparkConfProviderWithStreaming.sparkVariables.get(keyConf).get.asInstanceOf[SparkConf]
}
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
@throws(classOf[conameMLException])
def killSparkStreamingContext
{
try
{
if(SparkConfProviderWithStreaming.sparkVariables.get(keySSC).isDefined)
{
SparkConfProviderWithStreaming.sparkVariables -= keySSC
SparkConfProviderWithStreaming.sparkVariables -= keyConf
}
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
@throws(classOf[conameMLException])
def getSparkStreamingContext(): StreamingContext = {
try {
if (!SparkConfProviderWithStreaming.sparkVariables.get(keySSC).isDefined) {
logger.info("\n\nLoading new streaming\n\n")
SparkConfProviderWithStreaming.sparkVariables.put(keySSC, new StreamingContext(getSparkConf(), Seconds(6)))
logger.info("Loaded streaming")
getSparkStreamingContext()
}
else {
SparkConfProviderWithStreaming.sparkVariables.get(keySSC).get.asInstanceOf[StreamingContext]
}
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
def getSparkSession():SparkSession=
{
if(!SparkSession.getActiveSession.isDefined)
{
SparkSession.builder.config(getSparkConf()).getOrCreate()
}
else
{
SparkSession.getActiveSession.get
}
}
}

How can we run different tasks in different threads in Gradle?

I am trying to find a way to run different tasks in different threads (whether dependent or independent).
I have a scenario where I need to run one task (which internally runs a server) in a separate thread before running another task (a test that depends on that server) in Gradle; after the second task completes, I need to kill the first task.
Then, as in the scenario above, I need to run another set of server/test/kill tasks.
task exp {
  doFirst {
    run1stServerTask.execute()
  }
  def pool = Executors.newFixedThreadPool(5)
  try {
    def defer = { closure -> pool.submit(closure as Callable) }
    defer {
      run1stTest.execute()
      // After tests are finished, kill 1st server tasks
    }
    defer {
      run2ndServerTask.execute()
    }
    defer {
      run2ndTest.execute()
      // After tests are finished, kill 2nd server tasks
    }
  }
  finally {
    pool.shutdown()
  }
}
I hope all of the above makes sense. I am open to another approach if one is possible in build.gradle.
