My Spark application sometimes stops due to issues like HDFS failure, OutOfMemoryError or some other issues.
I know we can regularly store the data for the history server, but that may affect the space and performance.
I wish to record only the relevant error messages (not all INFO messages) in the history server.
Is it possible to control which messages will be printed by the history server?
You can set this property
log4j.logger.org.apache.spark.deploy.history=ERROR
in the log4j.properties file.
Related
I am working on a Spark Structured Streaming job with a fairly simple use case.
It reads data from Kafka, parses the JSON, and persists the result to an HDFS sink.
That part is almost complete. The problem now is that we need a good way of shutting down the streaming job without killing it abruptly (ctrl+c or yarn -kill).
I have tried the option below
sparkConf.set("spark.streaming.stopGracefullyOnShutdown","true") but it made no difference.
My requirement is that the running streaming job should stop when a touch file is created at some HDFS or Linux EN path.
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-demo-StreamingQueryManager-awaitAnyTermination-resetTerminated.html
In the link above, they create a thread that runs for a fixed duration. I need something similar that exits when a dummy file is created.
I am a newbie, so I would appreciate your help with this.
Thanks in advance.
I am not sure whether sparkConf.set("spark.streaming.stopGracefullyOnShutdown","true") actually works at present. Some claim it does, others that it does not.
In any event, that setting is about a direct kill versus a graceful stop.
You still need to kill the JVM, or, if you are on Databricks, use one of the many utilities they provide.
But you will not lose data, thanks to the checkpointing and write-ahead logs that Spark Structured Streaming provides. That is, Spark can recover state and offsets without any issues because it maintains its own offset management. So how you stop the job matters less, which may explain the confusion and the "but no use".
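For the touch-file requirement in the question, one possible approach is to poll for the marker between short waits on the query. The following is only a minimal sketch in PySpark-style Python: the function name, the marker_exists parameter and the marker path are made up for illustration, and query is assumed to be the StreamingQuery returned by writeStream...start(). It relies on awaitTermination(timeout) returning True once the query has terminated, and on stop() ending the query. Since the query is stopped from the driver, recovery on restart still depends on the checkpoint location.

```python
import os


def stop_when_marker_appears(query, marker_path, poll_seconds=10.0,
                             marker_exists=os.path.exists):
    """Block until the query ends on its own or the marker file shows up.

    Returns True if the marker triggered the stop, False if the query
    terminated by itself first.
    """
    while True:
        # awaitTermination(timeout) returns True once the query has terminated.
        if query.awaitTermination(poll_seconds):
            return False
        # Hypothetical hook: defaults to a local path check on the driver host.
        if marker_exists(marker_path):
            query.stop()
            return True
```

With the default, the marker is looked up on the local file system of the driver. For a marker on HDFS, a different marker_exists callable would have to be passed in, for example one that shells out to hdfs dfs -test -e and checks for exit code 0.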
We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
This looks like two questions, each with lots of possible answers and detail. Assuming you are not asking about Spark Streaming, here is a limited response based on my own reading and research, with references to other sources:
On logging and checking the progress of stages, tasks, and jobs, the options are:
Global logging via log4j, tailored by starting from the log4j.properties.template file in the SPARK_HOME/conf folder. The template serves as a basis for defining your own logging requirements, but at the Spark level.
Programmatically, by using Logger via import org.apache.log4j.{Level, Logger}.
The REST API to get the status of Spark jobs. See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
There is also a Spark Listener that can be used.
The Web UI (http://:8080) to see progress.
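On the "log statement inside each map" idea from the question: one line per record on a 500+GB load would probably swamp the logs, so a periodic message per partition may be more practical. Here is a rough sketch in Python, assuming PySpark and its rdd.mapPartitions. The helper name with_progress, the logger name etl.progress and the default interval are all invented for this example.

```python
import logging


def with_progress(transform, every=100000):
    """Wrap a per-record function for use with rdd.mapPartitions.

    Logs one line every `every` records and one at the end of the partition,
    instead of one line per record.
    """
    def run_partition(records):
        # Look the logger up inside the task so nothing has to be shipped with it.
        log = logging.getLogger("etl.progress")  # hypothetical logger name
        count = 0
        for record in records:
            yield transform(record)
            count += 1
            if count % every == 0:
                log.info("mapped %d records in this partition", count)
        log.info("partition finished after %d records", count)
    return run_partition
```

It would be used as rdd.mapPartitions(with_progress(to_json)), where to_json stands for whatever per-record function the job already has. The messages are emitted on the executors, not the driver, so they only show up if Python logging is configured there at INFO level.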
It depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects, memory usage issues, or things like serious database duplicate-key errors, depending on the API used.
See "How does Apache Spark handles system failure when deployed in YARN?" Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. All of this falls under fault tolerance, for which you do not need to do anything.
Things outside Spark's domain and control mean the job is over. Examples are memory issues from exceeding various parameters in large-scale computations, a DataFrame JDBC write against a store that raises a duplicate-key error, and JDBC connection outages. In these cases the job has to be re-executed.
As an aside, some aspects are not logged as failures even though they are, e.g. duplicate key inserts on some Hadoop Storage Managers.
I am creating a long-running Spark application. After the Spark session has been created and the application starts to run, I cannot see it when I click "show incomplete applications" on the Spark history server. However, if I force my application to close, I can see it under the "completed applications" page.
I have the Spark parameters configured correctly on both client and server, as follows:
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://10.18.51.117:8020/history/ (a hdfs path on my spark history server)
I also configured the same on the server side, so configuration shouldn't be a concern (completed applications do show up after I force my application to stop).
Do you have any thoughts on this behavior?
Looking at the HDFS files on the Spark history server, I see a very small .inprogress file associated with my running application (close to empty, see the picture below). It seems that the results get flushed to the file only when the application stops, which is not ideal for my long-running application. Is there any way, or any parameter we can tweak, to force flushing the log?
Very small size .inprogress file shown on hdfs during application is running
I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and e2e exactly once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some of the sources suggest I can start the updated application parallel to the old application until it catches up with the old application in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets into something in addition to the checkpoint directory, for example in zookeeper/hdfs. And then, on an update of the application, I command Kafka readstream() to start reading from the offsets stored in this new location (zookeeper/hdfs) - since the updated application can't read from the checkpoint directory which is now deemed incompatible.
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, how can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or output schema?
Any guidance, example or information source is appreciated.
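The offset hand-over described above could look roughly like the sketch below. It assumes PySpark, where query.lastProgress is a dict whose sources entries carry an endOffset, and it assumes every source is a Kafka source with offsets shaped like {"topic": {"partition": offset}}. Both function names are hypothetical, and where the offsets get stored (zookeeper, hdfs) is left out. The result is the JSON string form accepted by the Kafka source's startingOffsets option.

```python
import json


def kafka_end_offsets(progress):
    """Collect the Kafka end offsets from one streaming progress dict.

    Assumes every source is Kafka, with endOffset shaped like
    {"topic": {"partition": offset}}.
    """
    merged = {}
    for source in progress.get("sources", []):
        end_offset = source.get("endOffset") or {}
        if isinstance(end_offset, str):
            # Handled defensively in case the offsets arrive as a JSON string.
            end_offset = json.loads(end_offset)
        for topic, partitions in end_offset.items():
            merged.setdefault(topic, {}).update(partitions)
    return merged


def to_starting_offsets(offsets):
    """Render stored offsets as a JSON string for the `startingOffsets` option."""
    return json.dumps(
        {topic: {str(p): int(o) for p, o in parts.items()}
         for topic, parts in offsets.items()},
        sort_keys=True,
    )
```

The upgraded application would then pass to_starting_offsets(saved) as the startingOffsets option of its Kafka reader and use a fresh checkpoint directory, since startingOffsets only applies when a query starts without an existing checkpoint. Offsets saved this way are not written atomically with the sink output, so the exactly-once guarantee may not carry across the upgrade.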
What is the process followed to make some changes on production in Spark-Streaming without any downtime?
If you are looking for "Upgrading Application Code", please refer to the Spark Streaming documentation.
Upgrading Application Code
If a running Spark Streaming application
needs to be upgraded with new application code, then there are two
possible mechanisms.
The upgraded Spark Streaming application is started and run in
parallel to the existing application. Once the new one (receiving the
same data as the old one) has been warmed up and is ready for prime
time, the old one can be brought down. Note that this can be done
for data sources that support sending the data to two destinations
(i.e., the earlier and upgraded applications).
The existing application is shutdown gracefully (see
StreamingContext.stop(...) or JavaStreamingContext.stop(...) for
graceful shutdown options) which ensure data that has been received is
completely processed before shutdown. Then the upgraded application
can be started, which will start processing from the same point where
the earlier application left off. Note that this can be done only with
input sources that support source-side buffering (like Kafka, and
Flume) as data needs to be buffered while the previous application was
down and the upgraded application is not yet up. And restarting from
earlier checkpoint information of pre-upgrade code cannot be done. The
checkpoint information essentially contains serialized
Scala/Java/Python objects and trying to deserialize objects with new,
modified classes may lead to errors. In this case, either start the
upgraded app with a different checkpoint directory, or delete the
previous checkpoint directory.
https://spark.apache.org/docs/latest/streaming-programming-guide.html
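The second mechanism in the quoted documentation might be sketched as follows for a PySpark StreamingContext. This is an illustration under assumptions, not the documented procedure: both helper names and the versioned directory layout are invented here, and ssc is assumed to be the running StreamingContext.

```python
def checkpoint_dir_for(base_dir, app_version):
    """Derive a checkpoint directory that changes with the application version."""
    return "{}/v{}".format(base_dir.rstrip("/"), app_version)


def shut_down_for_upgrade(ssc):
    """Stop a StreamingContext gracefully before deploying new code."""
    # PySpark spells this keyword `stopGraceFully`.
    ssc.stop(stopSparkContext=True, stopGraceFully=True)
```

The upgraded application would then be started with something like checkpoint_dir_for("hdfs:///checkpoints/etl", "2") as its checkpoint directory (the path is only an example), so that it never tries to deserialize objects written by the old code.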