pyspark - How to submit Spark SQL queries in parallel? - apache-spark

Hi, I have more than 1200 SQL queries and want to submit them in parallel, storing each result set in its own CSV file. Since Python has the GIL limit, how can I submit them in parallel? The demos I've seen are all Scala-based Spark apps.
# return about 61K records
SQL = """
SELECT * FROM TEMP_VIEW WHERE index>=1 and index<=10;
"""
# return about 60K records
SQL2 = """
SELECT * FROM TEMP_VIEW WHERE index>=11 and index<=20;
"""
....
# currently these are submitted one by one in a for loop
Any suggestion will be super helpful! Thanks in advance!

According to Scheduling Within an Application, you can submit jobs from different threads. Example:
from multiprocessing import Process  # Process follows the API of threading.Thread

def submit_job(query):
    ...

for query in queries:
    Process(target=submit_job, args=(query,)).start()
But keep in mind that by default Spark's scheduler runs jobs in FIFO fashion, so I think you should consider switching the scheduler to FAIR so you can run multiple jobs in parallel.
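Tying this back to the original question, here is a minimal sketch, assuming an existing TEMP_VIEW and hypothetical output paths; a thread pool on the driver is enough, because the GIL only serializes the lightweight job-submission calls while the actual work runs on the executors:
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallel-sql-export")
    .config("spark.scheduler.mode", "FAIR")  # let concurrent jobs share executors
    .getOrCreate()
)

# hypothetical split of the 1200+ queries into index ranges of 10
queries = [
    (f"part_{lo}_{hi}", f"SELECT * FROM TEMP_VIEW WHERE index >= {lo} AND index <= {hi}")
    for lo, hi in ((n, n + 9) for n in range(1, 12000, 10))
]

def submit_job(name, sql):
    # each thread drives one Spark job; the executors do the heavy lifting
    spark.sql(sql).write.mode("overwrite").csv(f"/tmp/output/{name}")  # hypothetical path

with ThreadPoolExecutor(max_workers=8) as pool:
    for name, sql in queries:
        pool.submit(submit_job, name, sql)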

Related

How to run parallel threads in AWS Glue PySpark?

I have a Spark job that just pulls data from multiple tables with the same transforms. Basically a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, then shoves the result into Redshift (example below).
This job takes around 30 minutes to complete. Is there a way to run these in parallel under the same Spark/Glue context? I don't want to create separate Glue jobs if I can avoid it.
import datetime
import os
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql.functions import *

# query the runtime arguments
args = getResolvedOptions(
    sys.argv,
    ["JOB_NAME", "redshift_catalog_connection", "target_database", "target_schema"],
)

# build the job session and context
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# set the job execution timestamp
job_execution_timestamp = datetime.datetime.utcnow()

tables = []

for table in tables:
    catalog_table = glueContext.create_dynamic_frame.from_catalog(
        database="test", table_name=table, transformation_ctx=table
    )
    data_set = catalog_table.toDF().withColumn(
        "batchLoadTimestamp", lit(job_execution_timestamp)
    )
    # convert back to a Glue dynamic frame
    export_frame = DynamicFrame.fromDF(data_set, glueContext, "export_frame")
    # remove null rows from the dynamic frame
    non_null_records = DropNullFields.apply(
        frame=export_frame, transformation_ctx="non_null_records"
    )
    temp_dir = os.path.join(args["TempDir"], redshift_table_name)
    stores_redshiftSink = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=non_null_records,
        catalog_connection=args["redshift_catalog_connection"],
        connection_options={
            "dbtable": f"{args['target_schema']}.{redshift_table_name}",
            "database": args["target_database"],
            "preactions": f"truncate table {args['target_schema']}.{redshift_table_name};",
        },
        redshift_tmp_dir=temp_dir,
        transformation_ctx="stores_redshiftSink",
    )
You can do the following things to make this process faster:
Enable concurrent execution of the job.
Allot a sufficient number of DPUs.
Pass the list of tables as a job parameter.
Execute the job in parallel using Glue workflows or Step Functions.
Now suppose you have 100 tables to ingest; you can divide the list into batches of 10 tables and run the job concurrently 10 times, each run taking its own batch (see the parameter-parsing sketch below).
Since your data will be loaded in parallel, the Glue job run time will decrease, and hence less cost will be incurred.
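A minimal sketch of the table-list parameter idea, assuming a hypothetical --table_list job argument that carries one comma-separated batch, e.g. orders,customers,stores:
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_list"])
tables = [t.strip() for t in args["table_list"].split(",") if t.strip()]

for table in tables:
    # each concurrent job run processes only its own batch, using the same
    # per-table logic shown in the question
    print(f"loading {table}")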
An alternate approach that will be much faster is to use the Redshift COPY utility directly:
Create the table in Redshift and give the batchLoadTimestamp column a default of current_timestamp.
Build a COPY command that loads the data into the table directly from S3.
Run the COPY command from a Glue Python shell job using pg8000.
Why is this approach faster?
Because the Spark Redshift JDBC connector first unloads the Spark dataframe to S3 and then issues a COPY command into the Redshift table. By running the COPY command directly you remove the overhead of the unload step and of reading the data into a Spark dataframe in the first place.
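A minimal sketch of that Python shell job, assuming hypothetical connection details, table name, S3 prefix, and IAM role:
import pg8000

conn = pg8000.connect(
    host="my-cluster.example.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    database="target_database",
    user="etl_user",
    password="secret",
)
cur = conn.cursor()
cur.execute("TRUNCATE TABLE target_schema.orders;")
cur.execute("""
    COPY target_schema.orders
    FROM 's3://my-bucket/exports/orders/'
    IAM_ROLE '<redshift-copy-role-arn>'
    FORMAT AS CSV;
""")
conn.commit()
conn.close()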

how to submit to spark for many jobs in one application

I have a report stats project which uses Spark 2.1 (Scala); here is how it works:
object PtStatsDayApp extends App {
Stats A...
Stats B...
Stats C...
.....
}
Someone put many stat computations (mostly unrelated) in one class and submits it using a shell script. I find it has two problems:
if one stat gets stuck, the other stats below it cannot run
if one stat fails, the whole application has to rerun from the beginning
I have two refactoring options:
put every stat in its own class, but then many more submit scripts are needed. Does submitting so many applications add a lot of overhead?
run these stats in parallel. Does this cause resource stress, or can Spark handle it appropriately?
Any other idea or best practice? Thanks
There are several free third-party Spark schedulers like Airflow, but I suggest using the Spark Launcher API and writing the launching logic programmatically. With this API you can run your jobs in parallel, sequentially, or however you want.
Link to the docs: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/launcher/package-summary.html
How efficiently your jobs run in parallel mostly depends on your Spark cluster configuration. In general Spark supports this kind of workload.
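The Launcher API is JVM-side; if you drive things from Python instead, a rough analogue (not the Launcher API itself) is to fan out spark-submit processes yourself. A sketch with hypothetical class and jar names:
import subprocess

# hypothetical main classes, one per stat
stats_jobs = ["com.example.StatsA", "com.example.StatsB", "com.example.StatsC"]

procs = [
    subprocess.Popen(["spark-submit", "--class", main_class, "/opt/jobs/report-stats.jar"])
    for main_class in stats_jobs
]

# wait for all of them; a failure in one stat no longer forces a rerun of the rest
for main_class, proc in zip(stats_jobs, procs):
    if proc.wait() != 0:
        print(f"{main_class} failed; only this stat needs a rerun")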
First you can set the scheduler mode to FAIR. Then you can use parallel collections to launch simultaneous Spark jobs from a multithreaded driver.
A parallel collection, let's say a parallel sequence (ParSeq) of ten of your Stats queries, can use a foreach to fire off each of the Stats queries. How many threads you can use simultaneously depends on how many cores the driver has; by default, the global execution context has that many threads.
Check out these posts; they are examples of launching concurrent Spark jobs with parallel collections.
Cache and Query a Dataset In Parallel Using Spark
Launching Apache Spark SQL jobs from multi-threaded driver

How to start multiple streaming queries in a single Spark application?

I have built a few Spark Structured Streaming queries to run on EMR. They are long-running queries and need to run at all times, since they are all ETL-type queries. When I submit a job to the YARN cluster on EMR, I can submit a single Spark application, so that Spark application should contain multiple streaming queries.
I am confused about how to build/start multiple streaming queries within the same submit programmatically.
For ex: I have this code:
case class SparkJobs(prop: Properties) extends Serializable {
  def run() = {
    Type1SparkJobBuilder(prop).build().awaitTermination()
    Type2SparkJobBuilder(prop).build().awaitTermination()
  }
}
I fire this in my main class with SparkJobs(new Properties()).run().
When I look at the Spark history server, only the first streaming job (Type1SparkJob) is running.
What is the recommended way to fire multiple streaming queries within the same spark-submit programmatically? I could not find proper documentation either.
Since you're calling awaitTermination on the first query, it blocks until that query completes before the second query ever starts. So you want to kick off both queries and then use StreamingQueryManager.awaitAnyTermination:
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
In addition to the above, by default Spark uses the FIFO scheduler, which means the first query gets all resources in the cluster while it's executing. Since you're trying to run multiple queries concurrently you should switch to the FAIR scheduler.
If some queries should get more resources than others, you can also tune the individual scheduler pools, as sketched below.
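A minimal sketch of the pool tuning, shown in PySpark for brevity and assuming a hypothetical fairscheduler.xml that defines the pools:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.scheduler.mode", "FAIR")
    # hypothetical allocation file defining e.g. "critical" and "default" pools
    .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
    .getOrCreate()
)

df = spark.readStream.format("rate").load()  # stand-in streaming source

# queries started from a thread inherit its scheduler pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "critical")
query1 = df.writeStream.format("console").start()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "default")
query2 = df.writeStream.format("console").start()

spark.streams.awaitAnyTermination()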
val query1 = ds.writeStream.{...}.start()
val query2 = ds.writeStream.{...}.start()
val query3 = ds.writeStream.{...}.start()
query3.awaitTermination()
awaitTermination() blocks your process until the query finishes, which never happens in a streaming app; calling it only on your last query should fix your problem.

Finding Scheduler Delay for Spark

I want to be able to generate a table of these metrics for each task, like the one in the Spark UI when you visit a particular stage.
One of the columns is Scheduler Delay, which I cannot find in any of the REST APIs provided by Spark.
All the other columns exist (when I browse /api/v1/applications/[app-id]/stages/[stage-id]/[attempt]/taskList).
How is scheduler delay calculated, and is there a way for me to pull that data out without scraping the Spark UI webpage?
The scheduler delay is indeed not provided by the history API. For the UI, it is calculated as follows:
private[ui] def getSchedulerDelay(
    info: TaskInfo, metrics: TaskMetricsUIData, currentTime: Long): Long = {
  if (info.finished) {
    val totalExecutionTime = info.finishTime - info.launchTime
    val executorOverhead = (metrics.executorDeserializeTime +
      metrics.resultSerializationTime)
    math.max(0, totalExecutionTime - metrics.executorRunTime - executorOverhead -
      getGettingResultTime(info, currentTime))
  } else {
    // The task is still running and the metrics like executorRunTime are not available.
    0L
  }
}
See https://github.com/apache/spark/blob/branch-2.0/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala around line 770.
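If you want to reproduce that formula from the REST API, a rough sketch follows, assuming a hypothetical history server address and a Spark version whose task objects expose duration and the taskMetrics fields used here (field names vary between releases, and getting-result time is not exposed, so it is ignored):
import requests

BASE = "http://localhost:18080/api/v1"                 # hypothetical history server
APP, STAGE, ATTEMPT = "app-20230101000000-0000", 3, 0  # hypothetical ids

tasks = requests.get(
    f"{BASE}/applications/{APP}/stages/{STAGE}/{ATTEMPT}/taskList",
    params={"length": 1000},
).json()

for t in tasks:
    m = t.get("taskMetrics", {})
    overhead = m.get("executorDeserializeTime", 0) + m.get("resultSerializationTime", 0)
    scheduler_delay = max(0, t.get("duration", 0) - m.get("executorRunTime", 0) - overhead)
    print(t["taskId"], scheduler_delay)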
At least for Spark 1.6, if you are looking for the scheduling delay of a Spark Streaming batch, you can look at the Spark Streaming UI source code.
It uses a class BatchUIData, in which the scheduling delay is defined:
/**
* Time taken for the first job of this batch to start processing from the time this batch
* was submitted to the streaming scheduler. Essentially, it is
* `processingStartTime` - `submissionTime`.
*/
def schedulingDelay: Option[Long] = processingStartTime.map(_ - submissionTime)
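If you would rather pull that number than read it off the UI, recent releases also expose it through the streaming REST endpoints; a small sketch, assuming a hypothetical driver UI address and that your version reports schedulingDelay per batch:
import requests

BASE = "http://driver-host:4040/api/v1"  # hypothetical running driver UI
APP = "app-20230101000000-0000"          # hypothetical application id

batches = requests.get(f"{BASE}/applications/{APP}/streaming/batches").json()
for b in batches:
    # schedulingDelay is reported in milliseconds for each batch
    print(b["batchId"], b.get("schedulingDelay"))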

Luigi doesn't work as expected with Spark & Redshift

I'm running an EMR Spark cluster (which uses YARN) and I'm running Luigi tasks directly from the EMR master. I have a chain of jobs that depends on data in S3 and, after a few SparkSubmitTasks, eventually ends up in Redshift.
import luigi
import luigi.format
from luigi.contrib.spark import SparkSubmitTask
from luigi.contrib.redshift import RedshiftTarget

class SomeSparkTask(SparkSubmitTask):
    # Stored in /etc/luigi/client.cfg
    host = luigi.Parameter(default='host')
    database = luigi.Parameter(default='database')
    user = luigi.Parameter(default='user')
    password = luigi.Parameter(default='password')
    table = luigi.Parameter(default='table')
    <add-more-params-here>

    app = '<app-jar>.jar'
    entry_class = '<path-to-class>'

    def app_options(self):
        return <list-of-options>

    def output(self):
        return RedshiftTarget(host=self.host, database=self.database, user=self.user,
                              password=self.password, table=self.table,
                              update_id=<some-unique-identifier>)

    def requires(self):
        return AnotherSparkSubmitTask(<params>)
I'm running into two main problems:
1) Sometimes Luigi isn't able to determine when a SparkSubmitTask is done. For example, I'll see that Luigi submits a job, then check YARN, which says the application is running; but once it's done, Luigi just hangs and never notices that the job has finished.
2) If the SparkSubmitTasks do run and the task shown above finishes its Spark job, the output task is never run and the marker table is never created or populated, even though the actual table is created by the Spark job. Am I misunderstanding how I'm supposed to call the RedshiftTarget?
In the meantime I'm trying to get acquainted with the source code.
Thanks!
I dropped the use of Luigi in my Spark applications because all my data is now streamed into S3 and I only need one large monolithic application to run all my Spark aggregations, so I can take advantage of intermediate results/caching.
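As an aside on problem 2 above: with luigi.contrib.redshift, a RedshiftTarget only reports exists() as true once a row has been written to its marker table, and SparkSubmitTask does not write that row by itself. A minimal sketch of one way to record completion, with hypothetical class and jar names:
import luigi
from luigi.contrib.spark import SparkSubmitTask
from luigi.contrib.redshift import RedshiftTarget

class MarkedSparkTask(SparkSubmitTask):
    host = luigi.Parameter()
    database = luigi.Parameter()
    user = luigi.Parameter()
    password = luigi.Parameter()
    table = luigi.Parameter()

    app = 'app.jar'                  # hypothetical
    entry_class = 'com.example.Job'  # hypothetical

    def output(self):
        return RedshiftTarget(host=self.host, database=self.database, user=self.user,
                              password=self.password, table=self.table,
                              update_id=self.task_id)

    def run(self):
        super().run()            # runs spark-submit and waits for it to finish
        self.output().touch()    # insert a row into the marker table so exists() passes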
