AWS Glue (Spark) very slow - apache-spark

I've inherited some code that runs incredibly slowly on AWS Glue.
Within the job it creates a number of dynamic frames that are then joined using spark.sql. Tables are read from MySQL and Postgres databases, Glue is used to join them together, and the result is finally written to another table back in Postgres.
Example (note: DBs etc. have been renamed and simplified, as I can't paste my actual code directly):
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
jobName = args['JOB_NAME']
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(jobName, args)
# MySQL
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "trans").toDF().createOrReplaceTempView("trans")
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "types").toDF().createOrReplaceTempView("types")
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "currency").toDF().createOrReplaceTempView("currency")
# DB2 (Postgres)
glueContext.create_dynamic_frame.from_catalog(database = "db2", table_name = "watermark").toDF().createOrReplaceTempView("watermark")
# transactions
new_transactions_df = spark.sql("[SQL CODE HERE]")
# Write to DB
conf_g = glueContext.extract_jdbc_conf("My DB")
url = conf_g["url"] + "/reporting"
new_transactions_df.write.option("truncate", "true").jdbc(url, "staging.transactions", properties=conf_g, mode="overwrite")
The [SQL CODE HERE] is literally a simple select statement joining the three tables together to produce an output that is then written to the staging.transactions table.
When I last ran this it only wrote 150 rows but took 9 minutes to do so. Can somebody please point me in the direction of how to optimise this?
Additional info:
Maximum capacity: 6
Worker type: G.1X
Number of workers: 6

Generally, when reading/writing data in Spark using JDBC drivers, the common issue is that the operations aren't parallelized. Here are some optimizations you might want to try:
Specify parallelism on read
From the code you provided, it seems that all the tables' data is read using one query and one Spark executor.
If you use the Spark DataFrame reader directly, you can set the options partitionColumn, lowerBound, upperBound, fetchsize to read multiple partitions in parallel using multiple workers, as described in the docs. Example:
(spark.read.format("jdbc")
    # ... url, dbtable, user, password options ...
    .option("partitionColumn", "partition_key")
    .option("lowerBound", "<lb>")
    .option("upperBound", "<ub>")
    .option("numPartitions", "<np>")
    .option("fetchsize", "<fs>")
    .load())
When using read partitioning, note that Spark will issue multiple queries in parallel, so make sure the DB engine supports it, and also optimize indexes, especially on the partition column, to avoid full table scans.
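For a rough illustration of what this means (the bounds here are made-up example values): with partitionColumn=partition_key, lowerBound=0, upperBound=1000 and numPartitions=4, Spark splits the read into four concurrent JDBC queries along the lines of:
# SELECT ... FROM trans WHERE partition_key < 250
# SELECT ... FROM trans WHERE partition_key >= 250 AND partition_key < 500
# SELECT ... FROM trans WHERE partition_key >= 500 AND partition_key < 750
# SELECT ... FROM trans WHERE partition_key >= 750
Each of these should be able to seek on an index on partition_key rather than scan the whole table.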
In AWS Glue, this can be done by passing additional options using the parameter additional_options:
To use a JDBC connection that performs parallel reads, you can set the
hashfield, hashexpression, or hashpartitions options:
glueContext.create_dynamic_frame.from_catalog(
    database = "db1",
    table_name = "trans",
    additional_options = {"hashfield": "transID", "hashpartitions": "10"}
).toDF().createOrReplaceTempView("trans")
This is described in the Glue docs: Reading from JDBC Tables in Parallel
Using batchsize option when writing:
In your particular case, I'm not sure this will help since you only write 150 rows, but you can specify this option to improve write performance:
(new_transactions_df.write.format('jdbc')
    # ... url, dbtable and other connection options ...
    .option("batchsize", "10000")
    .save())
Push down optimizations
You can also optimize reads by pushing parts of the query (filters, column selection) down to the DB engine instead of loading the entire table into a dynamic frame and then filtering.
In Glue, this can be done using the push_down_predicate parameter:
glueContext.create_dynamic_frame.from_catalog(
    database = "db1",
    table_name = "trans",
    push_down_predicate = "(transDate > '2021-01-01' and transStatus='OK')"
).toDF().createOrReplaceTempView("trans")
See Glue programming ETL partitions pushdowns
Using database utilities to bulk insert / export tables
In some cases, you could consider exporting tables into files using the DB engine and then reading from those files. The same applies when writing: first write to a file, then use the DB's bulk insert command. This can avoid the bottleneck of going through Spark's JDBC writer.
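A minimal sketch of the write side (bucket, paths, credentials and table names are placeholders; it assumes psycopg2 is available and that the staged CSV is accessible to the loader):
import psycopg2

# 1. Stage the result as CSV instead of writing row-by-row over JDBC
new_transactions_df.coalesce(1).write.mode("overwrite") \
    .csv("s3://my-bucket/staging/transactions/", header=False)

# 2. Bulk-load the staged file with COPY (after fetching it locally, or by
#    using an RDS S3 import extension instead of this local COPY)
conn = psycopg2.connect(host="my-db-host", dbname="reporting",
                        user="my-user", password="my-password")
with conn, conn.cursor() as cur, open("/tmp/transactions.csv") as f:
    cur.copy_expert("COPY staging.transactions FROM STDIN WITH (FORMAT csv)", f)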

A Glue Spark cluster often takes around 10 minutes just to start up, so that time (9 minutes) seems reasonable (unless you run Glue 2.0, but you didn't specify which Glue version you are using).
https://aws.amazon.com/es/about-aws/whats-new/2020/08/aws-glue-version-2-featuring-10x-faster-job-start-times-1-minute-minimum-billing-duration/#:~:text=With%20Glue%20version%202.0%2C%20job,than%20a%2010%20minute%20minimum.
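If upgrading is an option, the Glue version is set on the job itself. Below is a hedged boto3 sketch (job name, role and script location are placeholders, not from the question) that creates a job pinned to Glue 2.0:
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="my-transactions-job",          # placeholder name
    Role="my-glue-service-role",         # placeholder IAM role
    GlueVersion="2.0",                   # 2.0 advertises ~1 minute startup
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/scripts/job.py",
             "PythonVersion": "3"},
    WorkerType="G.1X",
    NumberOfWorkers=6,
)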

Enable Metrics:
AWS Glue provides Amazon CloudWatch metrics that can be used to provide information about the executors and the amount of work done by each executor. You can enable CloudWatch metrics on your AWS Glue job by doing one of the following:
Using a special parameter: Add the following argument to your AWS Glue job. This parameter allows you to collect metrics for job profiling for your job run. These metrics are available on the AWS Glue console and the CloudWatch console.
Key: --enable-metrics
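As an illustration (not from the original post), the same flag can also be supplied as a run argument via boto3; the job name here is a placeholder:
import boto3

glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-transactions-job",       # placeholder
    Arguments={"--enable-metrics": ""},  # special parameter, takes no value
)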
Using the AWS Glue console: To enable metrics on an existing job, do the following:
Open the AWS Glue console.
In the navigation pane, choose Jobs.
Select the job that you want to enable metrics for.
Choose Action, and then choose Edit job.
Under Monitoring options, select Job metrics.
Choose Save.
Courtesy: https://softans.com/aws-glue-etl-job-running-for-a-long-time/

Related

Spark Structured Stream Scalability and Duplicates Issue

I am using Spark Structured Streaming on Databricks Cluster to extract data from Azure Event Hub, process it, and write it to snowflake using ForEachBatch with Epoch_Id/ Batch_Id passed to the foreach batch function.
My code looks something like below:
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(EVENT_HUB_CONNECTION_STRING)
ehConf['eventhubs.consumerGroup'] = consumergroup
# Read stream data from event hub
spark_df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
Some transformations...
Write to Snowflake
def foreach_batch_function(df, epoch_id):
    df.write \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", snowflake_table) \
        .mode('append') \
        .save()

processed_df.writeStream.outputMode('append') \
    .trigger(processingTime='10 seconds') \
    .option("checkpointLocation", f"checkpoint/P1") \
    .foreachBatch(foreach_batch_function) \
    .start()
Currently I am facing 2 issues:
When a node failure occurs: although the official Spark docs mention that when one uses foreachBatch along with the epoch_id/batch_id there shouldn't be any duplicates during recovery from a node failure, I do find duplicates getting populated in my Snowflake tables. Link for reference: [Spark Structured Streaming ForEachBatch With Epoch Id][1].
I am encountering the errors a) TransportClient: Failed to send RPC RPC 5782383376229127321 to /30.62.166.7:31116: java.nio.channels.ClosedChannelException and b) TaskSchedulerImpl: Lost executor 1560 on 30.62.166.7: worker decommissioned: Worker Decommissioned very frequently on my Databricks cluster. No matter how many executors I allocate or how much executor memory I increase, the cluster reaches the max worker limit and I receive one of the two errors, with duplicates being populated in my Snowflake table after recovery.
Any solution/ suggestion to any of the above points would be helpful.
Thanks in advance.
foreachBatch is by definition not idempotent: when the currently executing batch fails it is retried, and partial results may already have been written, which matches your observations. Idempotent writes in foreachBatch are applicable only to Delta Lake tables, not to all sink types (in some cases, like Cassandra, it could work as well). I'm not so familiar with Snowflake, but maybe you can implement something similar to what is done for other databases - write the data into a temporary table (each batch does an overwrite) and then merge from that temporary table into the target table.
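A minimal sketch of that pattern, assuming the Snowflake Spark connector's Utils.runQuery is reachable through the JVM bridge, and with placeholder table/column names (STG_EVENTS, EVENTS, ID, AMOUNT):
def foreach_batch_function(df, epoch_id):
    # 1. Overwrite a staging table with just this micro-batch
    df.write \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", "STG_EVENTS") \
        .mode("overwrite") \
        .save()
    # 2. Merge staging into the target so a retried batch updates rather than duplicates
    merge_sql = """
        MERGE INTO EVENTS t
        USING STG_EVENTS s ON t.ID = s.ID
        WHEN MATCHED THEN UPDATE SET t.AMOUNT = s.AMOUNT
        WHEN NOT MATCHED THEN INSERT (ID, AMOUNT) VALUES (s.ID, s.AMOUNT)
    """
    spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions, merge_sql)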
Regarding the 2nd issue - it looks like you're using an autoscaling cluster; in that case workers can be decommissioned because the cluster manager detects that the cluster isn't fully loaded. To avoid that, you can disable autoscaling and use a fixed-size cluster.

SQL query taking too long in azure databricks

I want to execute a SQL query on a DB in an Azure SQL managed instance using Azure Databricks. I have connected to the DB using the Spark connector.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "queryCustom"  -> "SELECT TOP 100 * FROM dbo.Clients WHERE PostalCode = 98074", // SQL query
  "user"         -> "username",
  "password"     -> "*********"
))

// Run the custom query above and load the result
val collection = sqlContext.read.sqlDB(config)
collection.show()
I am using the above method to fetch the data (example from the MSFT docs). Table sizes are over 10M in my case. My question is: how does Databricks process the query here?
Below is the documentation:
The Spark master node connects to databases in SQL Database or SQL Server and loads data from a specific table or using a specific SQL query.
The Spark master node distributes data to worker nodes for transformation.
The Worker node connects to databases that connect to SQL Database and SQL Server and writes data to the database. User can choose to use row-by-row insertion or bulk insert.
It says the master node fetches the data and later distributes the work to the worker nodes. In the above code, while fetching the data, what if the query itself is complex and takes time? Does it spread the work to the worker nodes? Or do I have to fetch the table's data into Spark first and then run the SQL query to get the result? Which method do you suggest?
So using the above method uses a single JDBC connection to pull the table into the Spark environment.
And if you want to push the query down to the database as a predicate, you can use it in this way:
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query,
properties=connectionProperties)
display(df)
If you want to improve the performance, then you need to manage parallelism while reading.
You can provide split boundaries based on the dataset’s column values.
These options specify the parallelism on read. These options must all be specified if any of them is specified. lowerBound and upperBound decide the partition stride, but do not filter the rows in table. Therefore, Spark partitions and returns all rows in the table.
The following example splits the table read across executors on the emp_no column using the columnName, lowerBound, upperBound, and numPartitions parameters.
val df = (spark.read.jdbc(url=jdbcUrl,
    table="employees",
    columnName="emp_no",
    lowerBound=1L,
    upperBound=100000L,
    numPartitions=100,
    connectionProperties=connectionProperties))
display(df)
For more details, use this link

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark dataframes by adding Ignite on top of them. The following code is how we currently read the dataframe:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark dataframe from Ignite following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), CONFIG) //Ignite config.
.load();
df.createOrReplaceTempView("person");
SQL queries (like select a, b, c from table where x) on the Ignite dataframe work, but the performance is much slower than Spark alone (i.e. without Ignite, querying the Spark DF directly); a SQL query often takes 5 to 30 seconds, and it's commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query. Queries with the same "where" but a smaller result are processed faster. Overall, I feel the Ignite dataframe support is a simple wrapper on top of Spark, hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark". So I couldn't change any cache configuration in XML (because I need to specify the cache name in XML/code to configure it, and Ignite will complain that it already exists). Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. In the case of the Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will drop significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
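For illustration only (not part of the original answer), the index DDL can be issued through any SQL-capable client; here is a rough sketch using the pyignite thin client, assuming the Spark-created table is person and x is the column filtered on:
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)   # assumed Ignite node address and thin-client port

# Create a secondary index on the column used in the WHERE clause
client.sql("CREATE INDEX IF NOT EXISTS person_x_idx ON person (x)")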

Spark SQL - Options for deploying SQL queries on Spark Streams

I'm new to Spark and would like to run a Spark SQL query over Spark streams.
My current understanding is that I would need to define my SQL query in the code of my Spark job, as this snippet lifted from the Spark SQL home page shows:
spark.read.json("s3n://...")
  .registerTempTable("json")
results = spark.sql(
  """SELECT *
     FROM people
     JOIN json ...""")
What I want to do is define my query on its own somewhere - eg. .sql file - and then deploy it over a Spark cluster.
Can anyone tell me if Spark currently has any support for this architecture? eg. some API?
You can use Python's built-in open to achieve this:
with open('filepath/filename.sql') as fr:
    query = fr.read()

x = spark.sql(query)
x.show(5)
You could pass filename.sql as an argument when submitting your job and pick it up with sys.argv, as sketched below.
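A minimal sketch of that approach (file path and argument position are assumptions):
import sys
from pyspark.sql import SparkSession

# e.g. submitted as: spark-submit job.py queries/report.sql
spark = SparkSession.builder.getOrCreate()
sql_path = sys.argv[1]          # first argument after the script name

with open(sql_path) as fr:
    query = fr.read()

spark.sql(query).show(5)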
Please refer to this link for more help: Spark SQL question

Apache Spark distributed sql

I use Spark DataFrameReader to perform a SQL query against a database. For each query performed, a SparkSession is required. What I would like to do is: for each of my JavaPairRDDs, perform a map which would invoke a SQL query with parameters from that RDD. This means I would need to pass the SparkSession into each lambda, which seems like bad design. What is the common approach to such problems?
It could look like:
roots.map(r -> DBLoader.getData(sparkSession, r._1));
How I load data now:
JavaRDD<Row> javaRDD = sparkSession.read().format("jdbc")
.options(options)
.load()
.javaRDD();
The purpose of Big Data is to have data locality and to be able to execute your code where your data resides. It is OK to do one big load of a table into memory or local disk (cache/persist), but continuous remote JDBC queries will defeat the purpose.
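To illustrate the point (a PySpark sketch with made-up table/column names, not the asker's Java code): load the table once over JDBC, persist it, and join against your keys instead of issuing one query per RDD element.
# Load the remote table a single time over JDBC (options as in the question)
lookup_df = (spark.read.format("jdbc")
    .options(**options)
    .option("dbtable", "lookup")       # placeholder table name
    .load()
    .persist())

# Join locally instead of querying the database inside a map()
keys_df = roots_df.select("key")       # placeholder for the keys you would have mapped over
result_df = keys_df.join(lookup_df, on="key", how="inner")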
