How to start and stop SparkContext manually - apache-spark

I am new to Spark. In my current Spark application script, I can send queries to an in-memory saved table and get the desired result using spark-submit. The problem is that the SparkContext stops automatically each time after returning the result. I want to send multiple queries sequentially, and for that I need to keep the SparkContext alive. How could I do that? My point is:
manual start and stop of the SparkContext by the user.
Kindly suggest something. I am using PySpark 2.1.0. Thanks in advance.

To answer your question, this works
import pyspark
# start
sc = pyspark.SparkContext()
#stop
sc.stop()
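With the context kept alive, you can register your table once and send queries one after another from the same script. A minimal sketch of that pattern (the table name, data, and queries are hypothetical):
from pyspark.sql import SparkSession

# start the session (this creates the SparkContext under the hood)
spark = SparkSession.builder.appName("multi-query").getOrCreate()

# register an in-memory table once (hypothetical data and name)
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.createOrReplaceTempView("my_table")

# send multiple queries sequentially against the same, still-alive context
for query in ["SELECT COUNT(*) FROM my_table",
              "SELECT * FROM my_table WHERE id = 1"]:
    spark.sql(query).show()

# stop only when you are completely done
spark.stop()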

Try this code:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("RatingsHistogram").setMaster("local")
sc = SparkContext.getOrCreate(conf)
This way you don't always have to stop your context; at the same time, if an existing SparkContext is available, it will be reused.
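As a quick illustration of that reuse behaviour (a minimal sketch, assuming a local master):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("RatingsHistogram").setMaster("local")
sc1 = SparkContext.getOrCreate(conf)
sc2 = SparkContext.getOrCreate(conf)

# both calls return the same live context, so nothing has to be stopped in between
print(sc1 is sc2)  # True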

Related

Spark SQL - org.apache.spark.sql.AnalysisException

The error described below occurs when I run a Spark job on Databricks for the second time (it happens less often the first time).
The SQL query just performs CREATE TABLE AS SELECT from a temp view registered from a DataFrame.
My first idea was spark.catalog.clearCache() at the end of the job (it didn't help).
I also found some posts on the Databricks forum about using object ... extends App (Scala) instead of a main method (that didn't help either).
P.S. current_date() is a built-in function and should be provided automatically (that was my expectation).
Spark 2.4.4, Scala 2.11, Databricks Runtime 6.2
org.apache.spark.sql.AnalysisException: Undefined function: 'current_date'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 21 pos 4
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1317)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:76)
Solution: ensure Spark is initialized every time the job is executed.
TL;DR,
I had a similar issue, and the object ... extends App suggestion pointed me in the right direction. In my case I was creating the Spark session outside of "main", but inside the object. When the job was executed the first time, the cluster/driver loaded the jar and initialised the spark variable; once the job finished successfully, the jar was kept in memory but the link to spark was lost for some reason. Any subsequent execution did not reinitialize spark, because the jar was already loaded and, in my case, the Spark initialisation sat outside main and hence was not re-run. I think this is not an issue for Databricks jobs that create a cluster (or start one) before execution, since those behave like the first-time case; it only affects clusters that are already up and running, because jars are loaded either during cluster start-up or during job execution.
So I moved the Spark session creation, i.e. SparkSession.builder()...getOrCreate(), into "main", so that whenever the job is called the Spark session gets reinitialized.
current_date() is the built-in function and it should be provided
automatically (expected)
This expectation is wrong; you have to import the functions.
For Scala:
import org.apache.spark.sql.functions._
where the current_date function is available.
For PySpark:
from pyspark.sql import functions as F
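For illustration, a minimal PySpark sketch using the imported functions module (the DataFrame here is hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("current_date_example").getOrCreate()

# hypothetical DataFrame; add today's date as a new column
df = spark.createDataFrame([(1,), (2,)], ["id"])
df.withColumn("load_date", F.current_date()).show()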

Spark create new spark session/context and pick up from failure

The Spark platform where I work is not stable and keeps failing my jobs for a different reason each time. The jobs do not die on the Hadoop manager but linger as Running, so I want to kill them.
In the same Python script, I would like to kill the current Spark session once there is a failure, create another SparkContext/session, and pick up from the last checkpoint. I do checkpoint frequently to avoid the DAG getting too long. The part that tends to fail is a while loop, so I can afford to pick up with the current df.
Any idea how I can achieve that?
My initial thought is
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName("test_Terminal") \
    .config("spark.sql.broadcastTimeout", "36000").getOrCreate()
flag_finish = False
flag_fail = False
while not flag_finish:
    if flag_fail:  # kill the current erroneous session
        sc.stop()
        conf = pyspark.SparkConf().setAll([('spark.executor.memory', '60g'),
                                           ('spark.driver.memory', '30g'),
                                           ('spark.executor.cores', '16'),
                                           ('spark.driver.cores', '24'),
                                           ('spark.cores.max', '32')])
        sc = pyspark.SparkContext(conf=conf)
        spark = SparkSession(sc)
        df = ...  # read back from checkpoint or disk
    # process with the current df or the df picked up
    while ...:  # this is where the server tends to fail my job after some time
        try:
            # df processing and update
            ...
            df = df.checkpoint()  # checkpoint() returns the checkpointed DataFrame
            df.count()  # action to materialize the checkpoint
            if complete:
                flag_finish = True
        except Exception as e:
            flag_fail = True
            continue
Another question is how to explicitly read from checkpoint (which has been done by df.checkpoint())
Checkpointing in non-streaming mode is used to sever lineage. It is not designed for sharing data between different applications or different SparkContexts.
What you would like to do is, in fact, not possible.
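If the goal is only to pick up the latest intermediate result after recreating the context, a common workaround (not checkpointing proper; the path below is a hypothetical placeholder) is to write the DataFrame to durable storage and read it back:
# after each successful iteration, persist the intermediate result
df.write.mode("overwrite").parquet("/tmp/job_state/latest")  # hypothetical path

# ... later, after sc.stop() and creating a fresh SparkContext/SparkSession ...
df = spark.read.parquet("/tmp/job_state/latest")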

Why does SparkContext randomly close, and how do you restart it from Zeppelin?

I am working in Zeppelin writing spark-sql queries and sometimes I suddenly start getting this error (after not changing code):
Cannot call methods on a stopped SparkContext.
Then the output says further down:
The currently active SparkContext was created at:
(No active SparkContext.)
This obviously doesn't make sense. Is this a bug in Zeppelin? Or am I doing something wrong? How can I restart the SparkContext?
Thank you
I have faced this problem a couple of times.
If you are setting your master to yarn-client, it might be due to a stop/restart of the Resource Manager: the interpreter process may still be running, but the SparkContext (which is a YARN application) no longer exists.
You could check whether the SparkContext is still running by consulting your Resource Manager web interface and checking whether there is an application named Zeppelin running.
Sometimes restarting the interpreter process from within Zeppelin (Interpreter tab --> spark --> restart) will solve the problem.
Other times you need to (see the example commands below):
kill the Spark interpreter process from the command line,
remove the Spark interpreter PID file,
and the next time you start a paragraph it will start a new SparkContext.
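A rough sketch of those two steps from the shell; the PID file location and name pattern are assumptions and depend on how Zeppelin is configured (typically via ZEPPELIN_PID_DIR):
# find the PID of the Spark interpreter started by Zeppelin
ps -ef | grep zeppelin | grep spark

# kill it (replace <PID> with the number from the output above)
kill <PID>

# remove the stale PID file (path and file name are assumptions; check your ZEPPELIN_PID_DIR)
rm $ZEPPELIN_HOME/run/zeppelin-interpreter-spark-*.pid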
I am facing the same problem when running multiple jobs in PySpark. It seems that in Spark 2.0.0, with SparkSession, when I call spark.stop(), SparkSession runs the following trace:
# SparkSession
self._sc.stop()
# SparkContext.stop()
self._jsc = None
Then, when I try to create a new job with a new SparkContext, SparkSession returns the same SparkContext as before, with self._jsc = None.
I solved it by setting SparkSession._instantiatedContext = None after spark.stop(), forcing SparkSession to create a new SparkContext the next time I request one.
It's not the best option, but for now it solves my issue.
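A minimal sketch of that workaround, following the answer's description (note that SparkSession._instantiatedContext is a private attribute and may differ between Spark versions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-1").getOrCreate()
# ... run the first job ...
spark.stop()

# clear the cached context reference so a fresh one can be created
SparkSession._instantiatedContext = None

spark = SparkSession.builder.appName("job-2").getOrCreate()
# ... run the next job with a brand-new SparkContext ...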
I've noticed this issue more when running PySpark commands, even with trivial variable declarations, where a cell execution hangs in the running state.
As mentioned above by user1314742, just killing the relevant PID solves this issue for me.
e.g.:
ps -ef | grep zeppelin
This is for the cases where restarting the Spark interpreter and restarting the Zeppelin notebook do not solve the issue, I guess because Zeppelin cannot control the hung PID itself.
Could you check whether your driver memory is sufficient? I solved this issue by:
enlarging the driver memory
tuning GC:
--conf spark.cleaner.periodicGC.interval=60
--conf spark.cleaner.referenceTracking.blocking=false
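For example, these settings could be passed when submitting the application (or set in the corresponding interpreter settings); the memory value and application file name below are placeholders:
spark-submit \
  --driver-memory 8g \
  --conf spark.cleaner.periodicGC.interval=60 \
  --conf spark.cleaner.referenceTracking.blocking=false \
  my_app.py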

How is a Spark Streaming application loaded and run?

Hi, I am new to Spark and Spark Streaming.
From the official documentation I could understand how to manipulate input data and save it.
The problem is that the Spark Streaming quick example confused me.
I understood that the job should get data from the DStream you have set up and do something with it, but since it runs 24/7, how will the application be loaded and run?
Will it run every n seconds, or run once at the beginning and then enter a [read-process-loop] cycle?
BTW, I am using Python, so I checked the Python code of that example. If it is the latter case, how does Spark's executor know which code snippet is the loop part?
Spark Streaming is actually micro-batch processing. That means a new batch is executed each interval, which you can customize.
Look at the code of the example you mentioned:
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc,1)
You define a streaming context with a micro-batch interval of 1 second.
The subsequent code, which uses the streaming context,
lines = ssc.socketTextStream("localhost", 9999)
...
gets executed every second.
The streaming process is initially triggered by this line:
ssc.start() # Start the computation
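Putting the pieces together, the full quick example looks roughly like this. Everything between creating the DStream and ssc.start() only declares the computation; Spark then re-runs it on every one-second batch until the context is stopped:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# declare the computation on the DStream
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()             # start the computation
ssc.awaitTermination()  # keep running until stopped or an error occurs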

How do I stop a spark streaming job?

I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations of attaching a shutdown hook in the job monitoring and sending a SIGTERM to the job.
sys.ShutdownHookThread {
logger.info("Gracefully stopping Application...")
ssc.stop(stopSparkContext = true, stopGracefully = true)
logger.info("Application stopped gracefully")
}
It seems to work but does not look like the cleanest way to stop the job. Am I missing something here?
From a code perspective it may make sense, but how do you use this in a cluster environment? If we start a Spark Streaming job (jobs are distributed across all the nodes in the cluster), we have to keep track of the PID of the job and the node it is running on. Finally, when we have to stop the process, we need to know which node the job was running on and its PID. I was just hoping there would be a simpler way of job control for streaming jobs.
You can stop your streaming context in cluster mode by running the following command, without needing to send a SIGTERM. This will stop the streaming context without you having to explicitly stop it using a shutdown hook.
$SPARK_HOME_DIR/bin/spark-submit --master $MASTER_REST_URL --kill $DRIVER_ID
$MASTER_REST_URL is the REST URL of the Spark master, i.e. something like spark://localhost:6066
$DRIVER_ID is something like driver-20150915145601-0000
If you want spark to stop your app gracefully, you can try setting the following system property when your spark app is initially submitted (see http://spark.apache.org/docs/latest/submitting-applications.html on setting spark configuration properties).
spark.streaming.stopGracefullyOnShutdown=true
This is not officially documented, and I gathered this from looking at the 1.4 source code. This flag is honored in standalone mode. I haven't tested it in clustered mode yet.
I am working with spark 1.4.*
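For example, the property can be passed at submit time like this (the application file name is a placeholder):
spark-submit \
  --conf spark.streaming.stopGracefullyOnShutdown=true \
  my_streaming_app.py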
It depends on the use case and on how the driver can be used.
Consider the case where you want to collect some N records (tweets) from Spark Structured Streaming, store them in PostgreSQL, and stop the stream once the count crosses N records.
One way of doing this is to use an accumulator and Python threading:
Create a Python thread with the stream query object and the accumulator, and stop the query once the count is crossed.
While starting the stream query, pass the accumulator variable and update its value for each batch of the stream.
Sharing the code snippet for understanding/illustration purposes...
import threading
import time

# print_info and the self._postgresql_* attributes come from the author's surrounding class
def check_n_stop_streaming(query, acc, num_records=3500):
    while True:
        if acc.value > num_records:
            print_info(f"Number of records received so far {acc.value}")
            query.stop()
            break
        else:
            print_info(f"Number of records received so far {acc.value}")
            time.sleep(1)

...

count_acc = spark.sparkContext.accumulator(0)

...

def postgresql_all_tweets_data_dump(df,
                                    epoch_id,
                                    raw_tweet_table_name,
                                    count_acc):
    print_info("Raw Tweets...")
    df.select(["text"]).show(50, False)
    count_acc += df.count()

    mode = "append"
    url = "jdbc:postgresql://{}:{}/{}".format(self._postgresql_host,
                                              self._postgresql_port,
                                              self._postgresql_database)
    properties = {"user": self._postgresql_user,
                  "password": self._postgresql_password,
                  "driver": "org.postgresql.Driver"}
    df.write.jdbc(url=url, table=raw_tweet_table_name, mode=mode, properties=properties)

...

query = tweet_stream.writeStream.outputMode("append"). \
    foreachBatch(lambda df, id:
                 postgresql_all_tweets_data_dump(df=df,
                                                 epoch_id=id,
                                                 raw_tweet_table_name=raw_tweet_table_name,
                                                 count_acc=count_acc)).start()

# pass the query object and the accumulator to the watcher thread
stop_thread = threading.Thread(target=check_n_stop_streaming, args=(query, count_acc))
stop_thread.setDaemon(True)
stop_thread.start()
query.awaitTermination()
stop_thread.join()
If all you need is to stop a running streaming application, then the simplest way is via the Spark admin UI (you can find its URL in the startup logs of the Spark master).
There is a section in the UI that shows running streaming applications, and there are tiny (kill) URL buttons next to each application ID.
It is official now; please look at the original Apache documentation here:
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
