Spark Structured Streaming - Graceful shutdown using a touch file - apache-spark

I am working on a Spark Structured Streaming job with a pretty simple use case:
I read data from Kafka, parse the JSON, and persist it to an HDFS sink.
I have almost completed that part. The problem now is that we need a good way of shutting down the streaming job without killing it abruptly (Ctrl+C or yarn application -kill).
I have tried the following option:
sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")
but it was of no use.
My requirement is that while the streaming job is running, it should stop as soon as a touch file is created in HDFS or on a Linux (edge node) path.
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-demo-StreamingQueryManager-awaitAnyTermination-resetTerminated.html
In the link above, they create a thread that waits for a fixed duration. I need something similar, but one that exits when a dummy file is created.
I am a newbie, so I would really appreciate your help with this.
Thanks in advance.

I am not sure whether sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true") actually works at the moment. Some claim it does; others say it does not.
In any event, the choice is between a direct kill and a graceful stop.
You need to kill the JVM one way or another; if you are on Databricks, they provide a whole set of utilities for this.
But you will not lose data, thanks to the checkpointing and write-ahead logs that Spark Structured Streaming provides. That is to say, it can recover state and offsets without any issues, since Spark maintains its own offset management. So how you stop the job is less of an issue, which may explain the confusion and the "but no use".
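For the touch-file requirement in the question, here is a minimal sketch of one way to do it (not an official Spark feature): poll HDFS for a marker file from the driver and call stop() on the query when it appears. The broker, topic, paths and marker file name below are placeholders.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object TouchFileShutdown {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")    // placeholder broker
      .option("subscribe", "events")                        // placeholder topic
      .load()

    val query = df.selectExpr("CAST(value AS STRING) AS json")
      .writeStream
      .format("parquet")
      .option("path", "/data/events")                       // placeholder sink path
      .option("checkpointLocation", "/checkpoints/events")  // placeholder checkpoint dir
      .start()

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val marker = new Path("/tmp/stop_streaming_job")        // hypothetical touch file

    // Poll for the marker file from the driver; stop() asks the query to terminate,
    // and the checkpoint lets the job resume from the last committed offsets on restart.
    while (query.isActive && !fs.exists(marker)) {
      Thread.sleep(10000)
    }
    if (query.isActive) query.stop()
    query.awaitTermination()
    spark.stop()
  }
}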

Related

How do I monitor progress and recover in a long-running Spark map job?

We're using Spark to run an ETL process in which data gets loaded from a massive (500+ GB) MySQL database, converted into aggregated JSON files, and then written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
These seem like two questions with lots of possible answers and detail. Anyway, assuming a non-Spark-Streaming answer and referencing other material based on my own reading and research, a limited response:
The following cover logging and progress checking of stages, tasks, and jobs:
Global logging via log4j, tailored by editing a copy of the log4j.properties.template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining logging requirements for your own purposes, at the Spark level.
Programmatically, using Logger, via import org.apache.log4j.{Level, Logger}.
The REST API, to get the status of Spark jobs. See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
There is also the SparkListener interface that can be used (a short sketch follows this list).
http://<host>:8080 to see progress via the Web UI.
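Combining the programmatic Logger option and the listener idea above, a minimal sketch; the class name and log messages are illustrative, not taken from any of the linked sources.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

object ProgressMonitoring {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-with-progress").getOrCreate()

    // Programmatic logging: quieten Spark internals, keep our own logger at INFO.
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    val log = Logger.getLogger(getClass)
    log.setLevel(Level.INFO)

    // Coarse progress reporting via a listener, instead of a log statement inside every
    // map call (which would emit one line per record on the executors).
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val runtime = Option(taskEnd.taskMetrics).map(_.executorRunTime).getOrElse(-1L)
        log.info(s"Task ${taskEnd.taskInfo.taskId} of stage ${taskEnd.stageId} " +
          s"ended with status ${taskEnd.taskInfo.status} in $runtime ms")
      }
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        log.info(s"Stage ${stage.stageInfo.stageId} completed (${stage.stageInfo.numTasks} tasks)")
    })

    // ... run the ETL job here ...

    spark.stop()
  }
}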
It depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects vs. memory-usage issues, and things like serious database duplicate-key errors, depending on the API used.
See How does Apache Spark handle system failure when deployed in YARN? Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done.
Things outside of Spark's domain and control mean the run is over, e.g. memory issues that may result from exceeding various parameters in large-scale computations, a DataFrame JDBC write against a store that hits a duplicate-key error, or JDBC connection outages. These require re-execution.
As an aside, some things are not logged as failures even though they are, e.g. duplicate-key inserts on some Hadoop storage managers.
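Not part of the answer above, but one concrete reading of the caching/checkpointing hint in the original question is a sketch like the following, where the JDBC connection details, partitioning bounds and S3 bucket are all placeholders. checkpoint() materializes the staged data to reliable storage and truncates the lineage, so task and stage retries later in the job recompute from HDFS instead of re-reading MySQL.

import org.apache.spark.sql.SparkSession

object CheckpointedEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mysql-to-s3").getOrCreate()
    spark.sparkContext.setCheckpointDir("hdfs:///checkpoints/etl")     // placeholder

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/source")               // placeholder
      .option("dbtable", "events")                                     // placeholder
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .option("partitionColumn", "id")                                 // spread the read over tasks
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "64")
      .load()

    // Materialize to HDFS and cut the lineage; a retry after this point re-reads
    // the checkpointed data instead of hitting MySQL again.
    val staged = df.checkpoint()

    staged.write.mode("overwrite").json("s3a://my-bucket/aggregated/") // placeholder bucket
    spark.stop()
  }
}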

Avoiding re-processing of data during Spark Structured Streaming application updates

I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and end-to-end exactly-once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some sources suggest, I can start the updated application in parallel with the old application until it catches up with it in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets to something in addition to the checkpoint directory, for example ZooKeeper/HDFS. Then, on an update of the application, I tell the Kafka readStream() to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application cannot read from the checkpoint directory, which is now deemed incompatible.
So a couple of questions:
Is the above-stated solution valid?
If yes, how can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or the output schema?
Any guidance, example or information source is appreciated.
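A minimal sketch of the offset bookkeeping described in the question, using a StreamingQueryListener to copy each finished batch's Kafka end offsets to an external HDFS file and startingOffsets to pick them up after a redeploy. The offsets path, broker, topic and sink paths are assumptions, error handling is omitted, and whether this counts as a valid solution is exactly the open question.

import java.nio.charset.StandardCharsets

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

object OffsetBookkeeping {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-app-v2").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val offsetsFile = new Path("hdfs:///streaming/offsets/kafka-app.json")  // external store (assumption)

    // After every micro-batch, copy the batch's end offsets somewhere the *next*
    // version of the application can read, independently of the checkpoint directory.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val endOffsets = event.progress.sources.head.endOffset   // JSON such as {"events":{"0":1234}}
        val out = fs.create(offsetsFile, true)
        out.write(endOffsets.getBytes(StandardCharsets.UTF_8))
        out.close()
      }
    })

    // On an upgrade, the redeployed application starts from the externally stored offsets
    // instead of the (now incompatible) checkpoint.
    val startingOffsets =
      if (fs.exists(offsetsFile)) {
        val in = fs.open(offsetsFile)
        val json = scala.io.Source.fromInputStream(in).mkString
        in.close()
        json
      } else "earliest"

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")        // placeholder
      .option("subscribe", "events")                           // placeholder
      .option("startingOffsets", startingOffsets)
      .load()

    df.writeStream
      .format("parquet")
      .option("path", "/data/events-v2")                       // placeholder
      .option("checkpointLocation", "/checkpoints/events-v2")  // fresh checkpoint for the new version
      .start()
      .awaitTermination()
  }
}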

Shutdown spark structured streaming gracefully

There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with kill -SIGTERM. However, I don't see such an option available for Structured Streaming (SQLContext.scala).
Is the shutdown process different in structured streaming? Or is it simply not implemented yet?
This feature is not implemented yet. However, the write-ahead logs of Spark Structured Streaming are claimed to recover state and offsets without any issues.
Try this GitHub example code:
https://github.com/kali786516/FraudDetection/blob/master/src/main/scala/com/datamantra/spark/jobs/RealTimeFraudDetection/StructuredStreamingFraudDetection.scala#L76
https://github.com/kali786516/FraudDetection/blob/master/src/main/scala/com/datamantra/spark/GracefulShutdown.scala#L26
This feature is not implemented yet, and killing the job from the Resource Manager while a batch is running will also give you duplicates.
Correction: the duplicates will only be in the output directory; the write-ahead logs handle everything beautifully, so you don't need to worry about anything. Feel free to kill it any time.
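A sketch of a commonly used workaround rather than a built-in feature: register a JVM shutdown hook so that kill -SIGTERM triggers query.stop() on the driver. The hook can race with Spark's own shutdown, so this relies on the checkpoint for correctness; treat it as a best-effort orderly stop.

import org.apache.spark.sql.SparkSession

object SigtermStop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stream-with-hook").getOrCreate()

    // ... define readStream / writeStream and start() the query as usual ...

    // kill -SIGTERM <driver pid> runs JVM shutdown hooks; attempt an orderly stop here.
    sys.addShutdownHook {
      spark.streams.active.foreach(_.stop())
    }

    spark.streams.awaitAnyTermination()
  }
}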

Storing Data in Spark In Memory

I have a requirement to keep data in Spark's memory in table format even after the SparkContext object dies, so that Tableau can access it.
I have used registerTempTable, but the data gets removed once the SparkContext object dies.
Is it possible to store data like this? If not, what possible way can I look into to feed data to Tableau without it reading from an HDFS location?
You will need to do one of the below:
Run your Spark application as a long-running application. Spark Streaming usually does that out of the box (when you call StreamingContext.awaitTermination()). I have never tried it myself, but I think YARN and Mesos have support for long-running tasks. As you mentioned, whenever your SparkContext dies, all the data is lost (because all the information is stored in the context). I consider spark-shell a long-running application; that's why most Tableau/Spark demos use it, because the context never dies.
Store it in a data store (HDFS, a database, etc.).
Try to use a distributed in-memory framework/file system like Tachyon; I'm not sure whether it has Tableau connectors, though.
Does Tableau read data from a custom Spark application?
I use Power BI (instead of Tableau), and it queries Spark through the Thrift client, so each time it dies and restarts I send it a "cache table myTable" query through the ODBC/JDBC driver.
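A sketch of that Thrift-server pattern with Spark 2.x APIs, assuming the spark-hive-thriftserver module is on the classpath; the table name and source path are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object LongRunningSqlServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tableau-backend")
      .enableHiveSupport()
      .getOrCreate()

    // Load once, register, and pin in memory so BI tools can query it over JDBC/ODBC.
    val df = spark.read.parquet("hdfs:///data/myTable")   // placeholder source
    df.createOrReplaceTempView("myTable")
    spark.sql("CACHE TABLE myTable")

    // Expose this session's tables over the HiveServer2 protocol (port 10000 by default),
    // which Tableau / Power BI reach through the Spark SQL ODBC/JDBC driver.
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Keep the driver, and therefore the cached table, alive.
    Thread.currentThread().join()
  }
}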
I came across a very interesting answer to the question asked above:
Tachyon.
http://ampcamp.berkeley.edu/5/exercises/tachyon.html

stack for loading log files into cassandra

I would like to periodically (hourly) load my application logs into Cassandra for analysis using pig.
How is this typically done? Are there project(s) that focus on this?
I see mumakil is commonly used to bulk-load data. I could write a cron job built around that, but was hoping for something more robust than the job I would whip up.
I'm also willing to modify the applications to store the data in another format (like syslog or directly to Cassandra) if that is preferable. Though in that case I would be worried about data loss should Cassandra be unavailable.
If you are set on using Flume, you'll need to write a custom Flume sink (not hard). You can model it on https://github.com/geminitech/logprocessing.
If you want to use Pig, I agree with the other poster that you should use HDFS (or S3). Hadoop is designed to work very well with block storage where the blocks are huge; this avoids the terrible I/O performance you get from doing lots of disk seeks and network I/O. While you CAN use Pig with Cassandra, you're going to have trouble with the Cassandra data model and you're going to get much worse performance.
However, if you really want to use Cassandra and you aren't dead set on Flume, I would recommend using Kafka and Storm.
My workflow for loading log files into Cassandra with Storm is:
Kafka collects the logs (e.g. with the log4j appender)
Logs enter the Storm cluster using storm-kafka
Each log line is parsed and inserted into Cassandra using custom Storm bolts (it's extremely easy to write Storm bolts; a minimal sketch follows this list). There is also a storm-cassandra bolt already available.
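A minimal bolt sketch in Scala (not the storm-cassandra bolt mentioned above), assuming Storm 1.x package names (older releases used backtype.storm), the DataStax Java driver, and a toy all-text table; the contact point, keyspace and log format are placeholders.

import com.datastax.driver.core.{Cluster, Session}
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.Tuple

class CassandraLogBolt extends BaseBasicBolt {

  // Created lazily on the worker: the Cluster/Session objects are not serializable,
  // so they must not travel with the bolt from the topology submitter.
  @transient private lazy val session: Session =
    Cluster.builder().addContactPoint("cassandra-host").build().connect("logs")

  @transient private lazy val insert =
    session.prepare("INSERT INTO raw_logs (host, ts, line) VALUES (?, ?, ?)")

  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    // storm-kafka hands the message over as a single field; adjust parsing to your format.
    val Array(host, ts, msg) = tuple.getString(0).split("\t", 3)   // placeholder log format
    session.execute(insert.bind(host, ts, msg))
  }

  // Terminal bolt: nothing to emit downstream.
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
}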
You should consider loading them into HDFS using Flume, since these projects were designed for this purpose. You can then use Pig directly against your unstructured/semi-structured log data.
