How to run parallel threads in AWS Glue PySpark? - apache-spark

I have a spark job that will just pull data from multiple tables with the same transforms. Basically a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, then shoves into Redshift (example below).
This job takes around 30 minutes to complete. Is there a way to run these in parallel under the same Spark/Glue context? I don't want to create separate Glue jobs if I can avoid it.
import datetime
import os
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql.functions import *

# query the runtime arguments
args = getResolvedOptions(
    sys.argv,
    ["JOB_NAME", "TempDir", "redshift_catalog_connection", "target_database", "target_schema"],
)

# build the job session and context
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# set the job execution timestamp
job_execution_timestamp = datetime.datetime.utcnow()

tables = []  # list of catalog table names to process

for table in tables:
    catalog_table = glueContext.create_dynamic_frame.from_catalog(
        database="test", table_name=table, transformation_ctx=table
    )
    data_set = catalog_table.toDF().withColumn(
        "batchLoadTimestamp", lit(job_execution_timestamp)
    )
    # convert back to a Glue dynamic frame
    export_frame = DynamicFrame.fromDF(data_set, glueContext, "export_frame")
    # remove null fields from the dynamic frame
    non_null_records = DropNullFields.apply(
        frame=export_frame, transformation_ctx="non_null_records"
    )
    # target Redshift table name, assumed here to match the catalog table name
    redshift_table_name = table
    temp_dir = os.path.join(args["TempDir"], redshift_table_name)
    stores_redshiftSink = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=non_null_records,
        catalog_connection=args["redshift_catalog_connection"],
        connection_options={
            "dbtable": f"{args['target_schema']}.{redshift_table_name}",
            "database": args["target_database"],
            "preactions": f"truncate table {args['target_schema']}.{redshift_table_name};",
        },
        redshift_tmp_dir=temp_dir,
        transformation_ctx="stores_redshiftSink",
    )

You can do the following things to make this process faster:
Enable concurrent execution of the job.
Allocate a sufficient number of DPUs.
Pass the list of tables as a parameter (see the sketch after this list).
Execute the job in parallel using Glue workflows or Step Functions.
Now suppose you have 100 tables to ingest; you can divide the list into batches of 10 tables each and run the job concurrently 10 times.
Since the data is loaded in parallel, the Glue job run time decreases, and so less cost is incurred.
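As a rough sketch of passing the table list as a parameter (the --table_list argument name and its comma-separated format are assumptions, not part of the original job), each concurrent run would process only its own slice:
import sys
from awsglue.utils import getResolvedOptions

# hypothetical job argument, e.g. --table_list "stores,orders,customers"
args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_list"])
tables = [t.strip() for t in args["table_list"].split(",") if t.strip()]

for table in tables:
    # same per-table read/transform/write logic as in the question
    ...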
An alternative approach that will be much faster is to use the Redshift COPY utility directly.
Create the table in Redshift with the batchLoadTimestamp column defaulting to current_timestamp.
Now build the COPY command and load the data into the table directly from S3.
Run the COPY command from a Glue Python shell job using pg8000.
Why is this approach faster?
Because the Spark Redshift JDBC connector first unloads the Spark dataframe to S3 and then issues a COPY command against the Redshift table. By running the COPY command directly, you remove the overhead of the unload step and of reading the data into a Spark dataframe in the first place.
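A minimal sketch of that Python shell + pg8000 approach (the connection details, table, column names, S3 path, and IAM role below are placeholders):
import pg8000

# placeholder Redshift connection details
conn = pg8000.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="target_db",
    user="etl_user",
    password="secret",
)
cursor = conn.cursor()
# columns not listed in the COPY (e.g. batchLoadTimestamp) fall back to their DEFAULT
cursor.execute("""
    COPY target_schema.stores (store_id, store_name)
    FROM 's3://my-bucket/exports/stores/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS CSV;
""")
conn.commit()
conn.close()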

Related

pyspark - How to submit Spark SQL queries in parallel?

Hi, I have more than 1200 SQL queries and want to submit them in parallel, storing each result in a CSV file.
Since Python has the GIL, how can I submit them in parallel?
I have seen other demos, and they are all Scala-based Spark apps.
# return about 61K records
SQL = """
SELECT * FROM TEMP_VIEW WHERE index>=1 and index<=10;
"""
# return about 60K records
SQL2 = """
SELECT * FROM TEMP_VIEW WHERE index>=11 and index<=20;
"""
....
# this will use for loop to submit
Any suggestion will be super helpful! Thanks in advance!
According to Scheduling Within an Application you can submit from different threads. Example:
from multiprocessing import Process  # Process follows the API of threading.Thread

def submit_job(query):
    ...

for job in jobs:
    Process(target=submit_job, args=(job,)).start()
But keep in mind that, by default, Spark's scheduler runs jobs in FIFO fashion, so I think you should consider switching the scheduler to FAIR so you can run multiple jobs in parallel.
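A sketch of that combination, using threads so that all queries share one SparkSession, assuming the FAIR scheduler is set when the session is built (spark.scheduler.mode cannot be changed after the context has started) and that TEMP_VIEW is already registered; the output paths are placeholders:
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallel-sql")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

queries = {
    "part_01": "SELECT * FROM TEMP_VIEW WHERE index >= 1 AND index <= 10",
    "part_02": "SELECT * FROM TEMP_VIEW WHERE index >= 11 AND index <= 20",
}

def submit_job(name, sql):
    # each thread triggers its own Spark job and writes its result as CSV
    spark.sql(sql).write.mode("overwrite").csv("/tmp/output/" + name)

with ThreadPoolExecutor(max_workers=4) as executor:
    for name, sql in queries.items():
        executor.submit(submit_job, name, sql)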

AWS Glue (Spark) very slow

I've inherited some code that runs incredibly slowly on AWS Glue.
Within the job it creates a number of dynamic frames that are then joined using spark.sql. Tables are read from MySQL and Postgres databases, and Glue is then used to join them and write a resulting table back to Postgres.
Example (note dbs etc have been renamed and simplified as I can't paste my actual code directly)
jobName = args['JOB_NAME']
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(jobName, args)
# MySQL
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "trans").toDF().createOrReplaceTempView("trans")
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "types").toDF().createOrReplaceTempView("types")
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "currency").toDF().createOrReplaceTempView("currency")
# DB2 (Postgres)
glueContext.create_dynamic_frame.from_catalog(database = "db2", table_name = "watermark").toDF().createOrReplaceTempView("watermark")
# transactions
new_transactions_df = spark.sql("[SQL CODE HERE]")
# Write to DB
conf_g = glueContext.extract_jdbc_conf("My DB")
url = conf_g["url"] + "/reporting"
new_transactions_df.write.option("truncate", "true").jdbc(url, "staging.transactions", properties=conf_g, mode="overwrite")
The [SQL CODE HERE] is literally a simple select statement joining the three tables together to produce an output that is then written to the staging.transactions table.
When I last ran this it only wrote 150 rows but took 9 minutes to do so. Can somebody please point me in the direction of how to optimise this?
Additional info:
Maximum capacity: 6
Worker type: G.1X
Number of workers: 6
Generally, when reading/writing data in Spark through JDBC drivers, the common issue is that the operations aren't parallelized. Here are some optimizations you might want to try:
Specify parallelism on read
From the code you provided, it seems that each table's data is read using a single query on a single Spark executor.
If you use the Spark DataFrame reader directly, you can set the options partitionColumn, lowerBound, upperBound, and fetchsize to read multiple partitions in parallel using multiple workers, as described in the docs. Example:
spark.read.format("jdbc") \
    .option("partitionColumn", "partition_key") \
    .option("lowerBound", "<lb>") \
    .option("upperBound", "<ub>") \
    .option("numPartitions", "<np>") \
    .option("fetchsize", "<fs>")
# ... plus the usual url, dbtable, and credential options
When using read partitioning, note that Spark will issue multiple queries in parallel, so make sure the database engine supports this, and optimize indexes (especially on the partition column) to avoid full table scans.
In AWS Glue, this can be done by passing extra options through the additional_options parameter:
To use a JDBC connection that performs parallel reads, you can set the hashfield, hashexpression, or hashpartitions options:
glueContext.create_dynamic_frame.from_catalog(
    database = "db1",
    table_name = "trans",
    additional_options = {"hashfield": "transID", "hashpartitions": "10"}
).toDF().createOrReplaceTempView("trans")
This is described in the Glue docs: Reading from JDBC Tables in Parallel
Using batchsize option when writing:
In your particular case, not sure if this will help since you only write 150 rows, but you can specify this option to improve write performance:
new_transactions_df.write.format("jdbc") \
    .option("batchsize", "10000") \
    .save()
# ... plus the usual url, dbtable, and credential options
Push down optimizations
You can also optimize reads by pushing down parts of the query (filters, column selection) directly to the database engine, instead of loading the entire table into a dynamic frame and filtering afterwards.
In Glue, this can be done using the push_down_predicate parameter:
glueContext.create_dynamic_frame.from_catalog(
    database = "db1",
    table_name = "trans",
    push_down_predicate = "(transDate > '2021-01-01' and transStatus='OK')"
).toDF().createOrReplaceTempView("trans")
See Glue programming ETL partitions pushdowns
Using database utilities to bulk insert / export tables
In some cases, you could consider exporting tables to files using the database engine and then reading from those files. The same applies when writing: first write to a file, then use the database's bulk insert command. This can avoid the bottleneck of going through Spark's JDBC path.
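For the write side, a rough local-mode sketch of that idea (assuming psycopg2 is available; the connection details and local path are placeholders, and on Glue you would normally export to S3 and bulk-load from there instead):
import glob
import psycopg2

# export the joined result to CSV first (coalesced to one file for simplicity)
new_transactions_df.coalesce(1).write.mode("overwrite").csv("/tmp/transactions_csv")
csv_path = glob.glob("/tmp/transactions_csv/part-*.csv")[0]

# placeholder connection details
conn = psycopg2.connect(host="db-host", port=5432, dbname="reporting",
                        user="loader", password="secret")
with conn, conn.cursor() as cur, open(csv_path) as f:
    cur.execute("TRUNCATE staging.transactions;")
    cur.copy_expert("COPY staging.transactions FROM STDIN WITH (FORMAT csv)", f)
conn.close()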
The Glue Spark cluster usually takes around 10 minutes just for startup, so that time (9 minutes) seems reasonable (unless you run Glue 2.0, but you didn't specify the Glue version you are using).
https://aws.amazon.com/es/about-aws/whats-new/2020/08/aws-glue-version-2-featuring-10x-faster-job-start-times-1-minute-minimum-billing-duration/#:~:text=With%20Glue%20version%202.0%2C%20job,than%20a%2010%20minute%20minimum.
Enable Metrics:
AWS Glue provides Amazon CloudWatch metrics that can be used to get information about the executors and the amount of work done by each executor. You can enable CloudWatch metrics on your AWS Glue job by doing one of the following:
Using a special parameter: Add the following argument to your AWS Glue job. This parameter allows you to collect metrics for job profiling for your job run. These metrics are available on the AWS Glue console and the CloudWatch console.
Key: --enable-metrics
Using the AWS Glue console: To enable metrics on an existing job, do the following:
Open the AWS Glue console.
In the navigation pane, choose Jobs.
Select the job that you want to enable metrics for.
Choose Action, and then choose Edit job.
Under Monitoring options, select Job metrics.
Choose Save.
Courtesy: https://softans.com/aws-glue-etl-job-running-for-a-long-time/

Is it possible to limit resources assigned to a Spark session?

I'm launching pySpark sessions with the following code:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import *
spark = SparkSession.builder.getOrCreate()
I've noticed that if a notebook is running a pySpark query, and a second notebook tries to start a Spark session, the second Spark session will not start until the first one has finished (i.e. the first session is taking all the resources).
Is there some way to limit the resources of a Spark session or parallelize multiple sessions somehow?
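One possible direction (a sketch only; the exact options depend on your cluster manager, and the values below are illustrative, not tested against your setup) is to cap each session's resources when it is built so two notebooks can run side by side:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("limited-session")
         .config("spark.cores.max", "4")                      # standalone/Mesos: total core cap
         .config("spark.executor.memory", "2g")
         .config("spark.dynamicAllocation.enabled", "true")   # YARN/K8s: bound executor count
         .config("spark.dynamicAllocation.maxExecutors", "2")
         .getOrCreate())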

How to do parallel processing in pyspark

I want to do parallel processing in for loop using pyspark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('myAppName').getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

data = ["a", "b", "c"]  # bucket name prefixes
for i in data:
    try:
        df = spark.read.parquet('gs://' + i + '-data')
        df.createOrReplaceTempView("people")
        df2 = spark.sql("""select * from people """)
        df.show()
    except Exception as e:
        print(e)
        continue
The above script works fine, but I want to do the processing in parallel in PySpark, the way it is possible in Scala.
Spark itself runs jobs in parallel, but if you still want parallel execution in the driver code, you can use simple Python code for parallel processing to do it (this was tested on Databricks only).
from multiprocessing.pool import ThreadPool

data = ["a", "b", "c"]
pool = ThreadPool(10)

def fun(x):
    try:
        df = sqlContext.createDataFrame([(1, 2, x), (2, 5, "b"), (5, 6, "c"), (8, 19, "d")], ("st", "end", "ani"))
        df.show()
    except Exception as e:
        print(e)

pool.map(fun, data)
I have changed your code a bit, but this is basically how you can run parallel tasks.
If you have some flat files that you want to process in parallel, just make a list of their names and pass it into pool.map(fun, data).
Change the function fun as needed.
For more details on the multiprocessing module, check the documentation.
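For example, a rough adaptation of fun to the original gs:// parquet loop could look like this (the bucket name prefixes are placeholders):
from multiprocessing.pool import ThreadPool

buckets = ["a", "b", "c"]  # placeholder bucket name prefixes

def load_bucket(name):
    try:
        df = spark.read.parquet("gs://" + name + "-data")
        df.createOrReplaceTempView("people_" + name)  # one temp view per source
        df.show()
    except Exception as e:
        print(e)

ThreadPool(3).map(load_bucket, buckets)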
Similarly, if you want to do it in Scala you will need the following import:
import scala.concurrent.{Future, Await}
For a more detailed understanding, check this out.
The code is for Databricks, but with a few changes it will work in your environment.
Here's a parallel loop in PySpark using Azure Databricks.
import socket

def getsock(i):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))
    return s.getsockname()[0]

rdd1 = sc.parallelize(list(range(10)))
parallel = rdd1.map(getsock).collect()
On platforms other than Azure you may need to create the Spark context sc yourself; on Azure Databricks the variable exists by default.
Coding it up like this only makes sense if the code that is executed in parallel (getsock here) does not itself contain code that is already parallel. For instance, had getsock contained code that iterates over a PySpark DataFrame, that code would already be parallel, so it would probably not make sense to also "parallelize" the surrounding loop.

Why does a single structured query run multiple SQL queries per batch?

Why does the following structured query run multiple SQL queries as can be seen in web UI's SQL tab?
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val rates = spark.
readStream.
format("rate").
option("numPartitions", 1).
load.
writeStream.
format("console").
option("truncate", false).
option("numRows", 10).
trigger(Trigger.ProcessingTime(10.seconds)).
queryName("rate-console").
start
