Performance issues in loading data from Databricks to Azure SQL - apache-spark

I am trying to load 1 million records from a Delta table in Databricks to an Azure SQL database using the recently released Microsoft connector that supports the Python API and Spark 3.0.
The performance does not look great to me: it takes 19 minutes to load 1 million records. Below is the code I am using. Do you think I am missing something here?
Configurations:
8 worker nodes with 28 GB memory and 8 cores.
Azure SQL database is a 4 vCore Gen5.
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", "lending_club_acc_loans") \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .option("batchsize", "200000") \
        .option("reliabilityLevel", "BEST_EFFORT") \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Is there something I can do to boost the performance?

Repartition the data frame. Earlier I had a single partition on my source data frame; repartitioning it into 8 partitions improved the performance. A sketch of the change is below.
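A minimal sketch of that change, reusing df and the write options from the question (the repartition count of 8 simply matches the number of worker nodes):

# Sketch: spread the source rows across 8 partitions so that 8 tasks write to
# Azure SQL in parallel instead of a single task doing all of the work.
df = df.repartition(8)  # match the number of workers / desired parallel writers

df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", "lending_club_acc_loans") \
    .option("user", username) \
    .option("password", password) \
    .option("tableLock", "true") \
    .option("batchsize", "200000") \
    .option("reliabilityLevel", "BEST_EFFORT") \
    .save()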

Related

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data this puts a lot of pressure on the data warehouse, and it is becoming less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark). First of all, what do you think of this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Process the SQL scripts that were previously done in Redshift
Transfer the newly created table from Spark to Redshift
Is this a use case that someone here has already put into production?
What would, in your opinion, be the best way to interact with the Spark cluster: EmrAddStepsOperator vs PythonOperator + PySpark?
You can use one of two connectors:
spark-redshift connector: an open-source connector developed and maintained by Databricks.
EMR spark-redshift connector: developed by AWS and based on the first one, but with some improvements (github).
To load data from Redshift into Spark, you can read a whole table and process it in Spark:
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
Or take advantage of Redshift for part of your processing by reading from a query result (you can filter, join, or aggregate your data in Redshift before loading it into Spark):
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
You can do whatever you want with the loaded dataframe, and you can store the result in another data store if needed. You can use the same connector to write the result (or any other dataframe) back to Redshift:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3n://path/for/temp/data") \
    .mode("error") \
    .save()
P.S.: the connector is fully supported by Spark SQL, so you can add the dependencies to your EMR cluster and then use the SparkSqlOperator to extract, transform and re-load your Redshift tables (SQL syntax example), or the SparkSubmitOperator if you prefer Python/Scala/Java jobs.
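For illustration, a minimal MWAA sketch that submits such a job with the SparkSubmitOperator (the DAG id, connection id, script path and package coordinates below are assumptions, not values from the question):

# Hypothetical DAG: submit a PySpark job that reads from Redshift, transforms
# the data in Spark, and writes the result back (all names are placeholders).
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="redshift_offload_to_spark",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = SparkSubmitOperator(
        task_id="transform_redshift_table",
        application="s3://my-bucket/jobs/transform.py",  # hypothetical PySpark script
        conn_id="spark_default",                         # connection pointing at the EMR/Spark cluster
        packages="io.github.spark-redshift-community:spark-redshift_2.12:6.2.0",  # assumed connector coordinates
    )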

Optimizing Spark JDBC connection read time by adding query parameter

I am connecting SQL Server to Spark using the following package: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16. At the moment I am reading the entire table, which is bad for performance. To optimize performance I want to pass a query to the spark.read config below, for example select * from my_table where record_time > timestamp. Is this possible? How would I do this?
DF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .load()
You can just filter the data frame that you are creating. Spark supports predicate pushdown, which means the filter will most likely be executed directly in the database. You can verify that this works by looking at the Spark UI / the explain plan.
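A minimal sketch of that approach, reusing the reader from the question (the column name record_time and the cutoff value are assumptions based on the example filter in the question):

from pyspark.sql import functions as F

# Same reader as in the question.
DF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .load()

# Express the filter on the DataFrame; with predicate pushdown Spark sends it
# to SQL Server as part of the generated query instead of filtering in memory.
filtered = DF.filter(F.col("record_time") > "2024-01-01 00:00:00")  # hypothetical cutoff

# Inspect the physical plan: the condition should appear under PushedFilters.
filtered.explain(True)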

Spark MySQL JDBC hangs without errors

I have a simple application ingesting MySQL tables into S3 running on Spark 3.0.0 (EMR 6.1).
The MySQL table is loaded using a single large executor with 48G of memory as follows:
spark.read \
    .option("url", jdbc_url) \
    .option("dbtable", query) \
    .option("driver", driver_class) \
    .option("user", secret['username']) \
    .option("password", secret['password']) \
    .format("jdbc") \
    .load()
The Spark jobs work without any problem on small tables where the MySQL query takes less than ~6 minutes to complete. However, in two jobs where the query goes beyond this time the Spark job gets stuck RUNNING and does not trigger any error. The stderr and stdout logs don't show any progress and the executor is completely healthy.
The DAG is very simple.
In MySQL (Aurora RDS) the query seems to have completed, but the connection is still open; checking the thread state shows 'cleaned up'.
I have tried MySQL connector versions 5 and 8, but both show the same behaviour. I guess this could be related to a default Spark timeout configuration, but I would like some guidance.
A thread dump of the single executor was also captured (not reproduced here).
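If the timeout hypothesis is worth testing, a hedged sketch of the knobs involved (neither is a confirmed fix for this hang): the Spark JDBC source accepts a queryTimeout option, and MySQL Connector/J accepts socket-level timeouts as JDBC URL parameters, so a stalled statement fails loudly instead of hanging.

# Sketch only: add explicit timeouts so a stalled statement surfaces an error.
# queryTimeout is a Spark JDBC option (seconds, 0 = no limit);
# connectTimeout / socketTimeout are MySQL Connector/J URL parameters (milliseconds).
jdbc_url_with_timeouts = jdbc_url + "?connectTimeout=60000&socketTimeout=1800000"  # assumes the URL has no parameters yet

df = spark.read \
    .option("url", jdbc_url_with_timeouts) \
    .option("dbtable", query) \
    .option("driver", driver_class) \
    .option("user", secret['username']) \
    .option("password", secret['password']) \
    .option("queryTimeout", "1800") \
    .format("jdbc") \
    .load()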

Very, very slow write from dataframe to SQL Server table

I'm running the code below and it works fine, but it's super, super, super slow.
df.write.format('jdbc') \
    .options(url='jdbc:sqlserver://server_name.database.windows.net:1433;databaseName=db_name',
             dbtable='dbo.my_table',
             user='usr',
             password='pwd',
             batchsize=500000) \
    .mode('append') \
    .save()
I thought it would load records in batches of 500k at a time, but when I run the code and do record counts in SQL Server after the job kicks off, it's updating about 50 records per second. Hopefully there is an easy fix for this. Thanks!
Check whether the read is configured with the parameters below. If the source is read as a single partition, the write runs as a single task regardless of batchsize; partitioning the read gives Spark numPartitions tasks that insert in parallel.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
spark.read.format("jdbc") \
#...
.option("partitionColumn", "partition_key") \
.option("lowerBound", "<lb>") \
.option("upperBound", "<ub>") \
.option("numPartitions", "<np>") \
.option("fetchsize", "<fs>")

Apache Spark SQL results not spilling to disk, exhausting Java heap space

According to the Spark FAQ:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
I'm querying a big table of 50 million entries. The initial data download won't fit in RAM, so Spark should spill to disk, right? And I filter out a small number of entries from those, which will fit in RAM.
SPARK_CLASSPATH=postgresql-9.4.1208.jre6-2.jar ./bin/pyspark --num-executors 4
url = "jdbc:postgresql://localhost:5432/mydatabase?user=postgres"
df = sqlContext \
    .read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "accounts") \
    .option("partitionColumn", "id") \
    .option("numPartitions", 10) \
    .option("lowerBound", 1) \
    .option("upperBound", 50000000) \
    .option("password", "password") \
    .load()
# get the small number of accounts whose names contain "taco"
results = df.map(lambda row: row["name"]).filter(lambda name: "taco" in name).collect()
I see some queries run on the Postgres server, then they finish, and pyspark crashes because the Java backend crashes with java.lang.OutOfMemoryError: Java heap space.
Is there something else I need to do?
