Luigi doesn't work as expected with Spark & Redshift - apache-spark

I'm running an EMR Spark cluster (using YARN) and I'm running Luigi tasks directly from the EMR master. I have a chain of jobs that depends on data in S3 and, after a few SparkSubmitTasks, eventually ends up in Redshift.
import luigi
import luigi.format
from luigi.contrib.spark import SparkSubmitTask
from luigi.contrib.redshift import RedshiftTarget

class SomeSparkTask(SparkSubmitTask):
    # Stored in /etc/luigi/client.cfg
    host = luigi.Parameter(default='host')
    database = luigi.Parameter(default='database')
    user = luigi.Parameter(default='user')
    password = luigi.Parameter(default='password')
    table = luigi.Parameter(default='table')
    <add-more-params-here>

    app = '<app-jar>.jar'
    entry_class = '<path-to-class>'

    def app_options(self):
        return <list-of-options>

    def output(self):
        return RedshiftTarget(host=self.host, database=self.database, user=self.user,
                              password=self.password, table=self.table,
                              update_id=<some-unique-identifier>)

    def requires(self):
        return AnotherSparkSubmitTask(<params>)
I'm running into two main problems:
1) Sometimes Luigi isn't able to determine when a SparkSubmitTask is done. For example, Luigi submits a job and YARN reports the application as running, but once the application finishes, Luigi just hangs and never detects that the job is done.
2) If the SparkSubmitTasks do run and the Task above finishes its Spark job, the output task is never run and the marker table is never created or populated. The actual table, however, is created by the Spark job itself. Am I misunderstanding how I'm supposed to call the RedshiftTarget?
In the meantime I'm trying to get acquainted with the source code.
Thanks!

I dropped the use of Luigi in my Spark applications because all my data is now streamed into S3, and I only need one large monolithic application to run all my Spark aggregations so I can take advantage of intermediate results/caching.

Related

pyspark - How to submit Spark SQL in parallel?

Hi, I have more than 1200 SQL queries and want to submit them in parallel, storing each result in a CSV file.
Since Python has the GIL limit, how do I submit them in parallel?
The demos I've seen are all Scala-based Spark apps.
# return about 61K records
SQL = """
SELECT * FROM TEMP_VIEW WHERE index>=1 and index<=10;
"""
# return about 60K records
SQL2 = """
SELECT * FROM TEMP_VIEW WHERE index>=11 and index<=20;
"""
....
# currently these are submitted one by one in a for loop
Any suggestion will be super helpful! Thanks in advance!
According to Scheduling Within an Application, you can submit jobs from different threads. Example:
from multiprocessing import Process  # Process follows the API of threading.Thread

def submit_job(query):
    ...

for job in jobs:
    Process(target=submit_job, args=(job,)).start()
But keep in mind that, by default, Spark's scheduler runs jobs in FIFO fashion, so I think you should consider switching the scheduler to FAIR so you can run multiple jobs in parallel.
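For illustration only, here is a minimal PySpark sketch of that idea, using a thread pool rather than separate processes (threads can share the single SparkSession) and writing each result out as CSV; the output path, pool size, and query list are placeholders, and it assumes TEMP_VIEW is already registered and that spark.scheduler.mode=FAIR is set when the session is created:
from multiprocessing.pool import ThreadPool

from pyspark.sql import SparkSession

# Assumption: the scheduler mode must be set before the session/context is created.
spark = (SparkSession.builder
         .appName("parallel-sql")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

queries = [
    "SELECT * FROM TEMP_VIEW WHERE index >= 1 AND index <= 10",
    "SELECT * FROM TEMP_VIEW WHERE index >= 11 AND index <= 20",
    # ... the remaining query strings ...
]

def submit_job(args):
    idx, sql = args
    # Each call triggers its own Spark job; the output path is a placeholder.
    spark.sql(sql).write.mode("overwrite").csv(f"/tmp/query_result_{idx}")

# A pool size of 10 is arbitrary; tune it to your cluster.
ThreadPool(10).map(submit_job, list(enumerate(queries)))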

Spark SQL - org.apache.spark.sql.AnalysisException

The error described below occurs when I run the Spark job on Databricks a second time (less often on the first run).
The SQL query just performs CREATE TABLE AS SELECT from a temp view registered from a DataFrame.
My first idea was to call spark.catalog.clearCache() at the end of the job (it didn't help).
I also found a post on the Databricks forum about using object ... extends App (Scala) instead of a main method (that didn't help either).
P.S. current_date() is a built-in function and should be provided automatically (that was my expectation).
Spark 2.4.4, Scala 2.11, Databricks Runtime 6.2
org.apache.spark.sql.AnalysisException: Undefined function: 'current_date'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 21 pos 4
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1317)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:76)
Solution: make sure Spark is initialized every time the job is executed.
TL;DR:
I had a similar issue, and the object ... extends App suggestion pointed me in the right direction. In my case I was creating the Spark session outside of "main" but inside the object. When the job ran for the first time, the cluster/driver loaded the jar and initialized the spark variable; once the job finished successfully, the jar was kept in memory but the link to spark was lost for some reason, and subsequent executions did not reinitialize spark, because the jar was already loaded and my Spark initialization lived outside "main", so it never ran again. I think this is not an issue for Databricks jobs that create a cluster, or start one before execution (those behave like the first run); it only affects clusters that are already up and running, since jars are loaded either at cluster start-up or at job execution.
So I moved the Spark session creation, i.e. SparkSession.builder()...getOrCreate(), into "main", so the session gets reinitialized every time the job is called.
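The original fix is in Scala, but the same idea in a rough PySpark sketch (the query is a placeholder): create the session inside the entry point that each run invokes, rather than capturing it at module/object initialization time.
from pyspark.sql import SparkSession

def main():
    # Re-resolve the session on every run instead of caching it at import time.
    spark = SparkSession.builder.getOrCreate()
    spark.sql("SELECT current_date() AS load_date").show()

if __name__ == "__main__":
    main()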
current_date() is the built-in function and it should be provided automatically (expected)
This expectation is wrong; you have to import the functions.
For Scala:
import org.apache.spark.sql.functions._
which is where the current_date function is available.
For PySpark:
from pyspark.sql import functions as F
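For illustration (not from the original answer), a minimal PySpark snippet that uses the imported functions module:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# current_date() is available as F.current_date() once the module is imported.
df = spark.range(3).withColumn("load_date", F.current_date())
df.show()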

How to do parallel processing in pyspark

I want to do parallel processing in a for loop using pyspark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('myAppName').getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

data = [a, b, c]

for i in data:
    try:
        df = spark.read.parquet('gs://' + i + '-data')
        df.createOrReplaceTempView("people")
        df2 = spark.sql("""select * from people """)
        df.show()
    except Exception as e:
        print(e)
        continue
The above script works fine, but I want to do the processing in parallel in pyspark, the way it is possible in Scala.
Spark itself runs jobs in parallel, but if you still want parallel execution in the driver code you can use plain Python multiprocessing to do it (this was tested on Databricks only; see link).
data = ["a", "b", "c"]

from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)

def fun(x):
    try:
        df = sqlContext.createDataFrame([(1, 2, x), (2, 5, "b"), (5, 6, "c"), (8, 19, "d")], ("st", "end", "ani"))
        df.show()
    except Exception as e:
        print(e)

pool.map(fun, data)
I have changed your code a bit, but this is basically how you can run parallel tasks.
If you have some flat files that you want to process in parallel, just make a list of their names and pass it into pool.map(fun, data); a sketch applying this to the loop from the question follows below.
Change the function fun as needed.
For more details on the multiprocessing module check the documentation.
Similarly, if you want to do it in Scala you will need the following imports:
import scala.concurrent.{Future, Await}
For a more detailed understanding, check this out.
The code is for Databricks, but with a few changes it will work in your environment.
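As promised above, a rough sketch of the same thread-pool approach applied to the parquet-reading loop from the question (bucket names and pool size are placeholders, spark is the session created in the question, and each thread gets its own view name to avoid clashing registrations):
from multiprocessing.pool import ThreadPool

buckets = ["a", "b", "c"]  # placeholder bucket-name fragments from the question

def process(name):
    try:
        df = spark.read.parquet("gs://" + name + "-data")
        # Register a per-thread view name so concurrent calls don't overwrite each other.
        df.createOrReplaceTempView("people_" + name)
        spark.sql("SELECT * FROM people_" + name).show()
    except Exception as e:
        print(e)

ThreadPool(3).map(process, buckets)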
Here's a parallel loop in pyspark using Azure Databricks.
import socket

def getsock(i):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))
    return s.getsockname()[0]

rdd1 = sc.parallelize(list(range(10)))
parallel = rdd1.map(getsock).collect()
On platforms other than Azure you may need to create the Spark context sc yourself; on Azure Databricks the variable exists by default.
Coding it up like this only makes sense if the code that is executed in parallel (getsock here) is not itself already parallel. For instance, had getsock contained code that walks through a pyspark DataFrame, that code would already be parallel, so it would probably not make sense to also "parallelize" that loop.

How to start multiple streaming queries in a single Spark application?

I have built a few Spark Structured Streaming queries to run on EMR. They are long-running, ETL-type queries and need to run at all times, but when I submit a job to the YARN cluster on EMR I can only submit a single Spark application, so that one application has to host multiple streaming queries.
I am confused about how to build/start multiple streaming queries within the same submit programmatically.
For ex: I have this code:
case class SparkJobs(prop: Properties) extends Serializable {
  def run() = {
    Type1SparkJobBuilder(prop).build().awaitTermination()
    Type1SparkJobBuilder(prop).build().awaitTermination()
  }
}
I fire this in my main class with SparkJobs(new Properties()).run()
When I look in the Spark history server, only the first Spark streaming job (Type1SparkJob) is running.
What is the recommended way to fire multiple streaming queries within the same spark-submit programmatically? I could not find proper documentation on this either.
Since you're calling awaitTermination on the first query, it will block until that query completes before starting the second one. So you want to kick off both queries and then use StreamingQueryManager.awaitAnyTermination.
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
In addition to the above, by default Spark uses the FIFO scheduler, which means the first query gets all the resources in the cluster while it's executing. Since you're trying to run multiple queries concurrently you should switch to the FAIR scheduler.
If you have some queries that should have more resources than the others then you can also tune the individual scheduler pools.
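As a rough PySpark sketch of the pool idea (the pool names here are hypothetical; it assumes spark.scheduler.mode=FAIR is set on the session, a fairscheduler.xml defines the named pools, df1/df2 are streaming DataFrames, and how the local property is picked up by each query's execution thread can vary with the Spark version):
# Hypothetical pools "important" and "default"; FAIR mode and fairscheduler.xml assumed.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "important")
query1 = df1.writeStream.format("console").start()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "default")
query2 = df2.writeStream.format("console").start()

# Block on whichever query terminates first, rather than only on the first one started.
spark.streams.awaitAnyTermination()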
val query1 = ds.writeStream.{...}.start()
val query2 = ds.writeStream.{...}.start()
val query3 = ds.writeStream.{...}.start()

query3.awaitTermination()
awaitTermination() will block your process until the query finishes, which never happens in a streaming app; calling it only on your last query should fix your problem.

How do I stop a spark streaming job?

I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations of attaching a shutdown hook in the job monitoring and sending a SIGTERM to the job.
sys.ShutdownHookThread {
  logger.info("Gracefully stopping Application...")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  logger.info("Application stopped gracefully")
}
It seems to work but does not look like the cleanest way to stop the job. Am I missing something here?
From a code perspective that may make sense, but how do you use this in a cluster environment? If we start a Spark streaming job (distributing the jobs across all the nodes in the cluster), we have to keep track of the PID of the job and the node it is running on, and when we finally want to stop the process we need to know both of those. I was just hoping there would be a simpler way of job control for streaming jobs.
You can stop your streaming context in cluster mode by running the following command, without needing to send a SIGTERM. This stops the streaming context without you having to stop it explicitly via a shutdown hook.
$SPARK_HOME_DIR/bin/spark-submit --master $MASTER_REST_URL --kill $DRIVER_ID
- $MASTER_REST_URL is the REST URL of the Spark master, i.e. something like spark://localhost:6066
- $DRIVER_ID is something like driver-20150915145601-0000
If you want Spark to stop your app gracefully, you can try setting the following system property when your Spark app is initially submitted (see http://spark.apache.org/docs/latest/submitting-applications.html for how to set Spark configuration properties).
spark.streaming.stopGracefullyOnShutdown=true
This is not officially documented; I gathered it from looking at the 1.4 source code. The flag is honored in standalone mode. I haven't tested it in cluster mode yet.
I am working with Spark 1.4.*.
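For illustration only, a minimal PySpark sketch of setting that flag when the app is created (the app name and socket source are placeholders; the same flag can also be passed to spark-submit as --conf spark.streaming.stopGracefullyOnShutdown=true):
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("my-streaming-app")  # hypothetical app name
        .set("spark.streaming.stopGracefullyOnShutdown", "true"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=10)

# Placeholder source/sink so the context has at least one output operation.
ssc.socketTextStream("localhost", 9999).pprint()

ssc.start()
ssc.awaitTermination()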
It depends on the use case and on how the driver can be used.
Consider the case where you want to collect some N records (tweets) from Spark Structured Streaming, store them in PostgreSQL, and stop the stream once the count crosses N records.
One way of doing this is to use an accumulator and Python threading: create a Python thread with the stream query object and the accumulator, and stop the query once the count is crossed; when starting the stream query, pass the accumulator variable and update its value for each batch of the stream.
Sharing the code snippet for understanding/illustration purposes...
import threading
import time

def check_n_stop_streaming(query, acc, num_records=3500):
    while True:
        if acc.value > num_records:
            print_info(f"Number of records received so far {acc.value}")
            query.stop()
            break
        else:
            print_info(f"Number of records received so far {acc.value}")
            time.sleep(1)

...

count_acc = spark.sparkContext.accumulator(0)

...

def postgresql_all_tweets_data_dump(df,
                                    epoch_id,
                                    raw_tweet_table_name,
                                    count_acc):
    print_info("Raw Tweets...")
    df.select(["text"]).show(50, False)
    count_acc += df.count()

    mode = "append"
    url = "jdbc:postgresql://{}:{}/{}".format(self._postgresql_host,
                                              self._postgresql_port,
                                              self._postgresql_database)
    properties = {"user": self._postgresql_user,
                  "password": self._postgresql_password,
                  "driver": "org.postgresql.Driver"}
    df.write.jdbc(url=url, table=raw_tweet_table_name, mode=mode, properties=properties)

...

query = tweet_stream.writeStream.outputMode("append"). \
    foreachBatch(lambda df, id:
                 postgresql_all_tweets_data_dump(df=df,
                                                 epoch_id=id,
                                                 raw_tweet_table_name=raw_tweet_table_name,
                                                 count_acc=count_acc)).start()

stop_thread = threading.Thread(target=check_n_stop_streaming, args=(query, count_acc))
stop_thread.setDaemon(True)
stop_thread.start()

query.awaitTermination()
stop_thread.join()
If all you need is just to stop a running streaming application, then the simplest way is via the Spark admin UI (you can find its URL in the startup logs of the Spark master).
There is a section in the UI that shows running streaming applications, with tiny (kill) URL buttons next to each application ID.
It is officially documented now; please see the original Apache documentation here:
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
