Optimizing Spark JDBC connection read time by adding query parameter - apache-spark

I am connecting SQL Server to Spark using the following package: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16. At the moment I am reading the entire table, which is bad for performance. To optimize this, I want to pass a query to the following spark.read config, for example: select * from my table where record time > timestamp. Is this possible? How would I do this?
DF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .load()

You can just filter the DataFrame that you are creating. Spark supports predicate pushdown, which means the filter will most likely be executed in the database itself rather than in Spark. You can verify that this works by checking the Spark UI or the explain plan.
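If you would rather hand the database an explicit query, the generic Spark JDBC source accepts a parenthesised, aliased subquery as the dbtable value. The sketch below assumes the Microsoft connector accepts the same form, since it is documented as building on the JDBC source. The helper names and the example table and column names are hypothetical, and the values are formatted straight into the SQL string, so they must come from trusted input.

```python
from datetime import datetime


def incremental_subquery(table, ts_column, since):
    """Build a parenthesised, aliased subquery usable as the JDBC ``dbtable`` value."""
    # ISO 8601 with a 'T' separator; only trusted values should be formatted in.
    cutoff = since.isoformat(timespec="seconds")
    return f"(SELECT * FROM {table} WHERE {ts_column} > '{cutoff}') AS src"


def read_since(spark, jdbc_url, username, password, table, ts_column, since):
    """Read only rows newer than ``since``; the WHERE clause is part of the SQL sent to the server."""
    return (
        spark.read
        .format("com.microsoft.sqlserver.jdbc.spark")
        .option("url", jdbc_url)
        .option("dbtable", incremental_subquery(table, ts_column, since))
        .option("user", username)
        .option("password", password)
        .load()
    )
```

The generic JDBC source also has a query option (Spark 2.4 and later) that takes the bare SELECT statement without parentheses or an alias; whether the Microsoft connector forwards it is worth checking before relying on it.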

Related

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data, this puts a lot of pressure on the data warehouse, and it is becoming less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark). First of all, what do you think about this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Process the SQL scripts that were previously run in Redshift
Transfer the newly created table from Spark to Redshift
Is this a use case that someone here has already put in production?
What would, in your opinion, be the best way to interact with the Spark cluster: EmrAddStepsOperator or PythonOperator + PySpark?
You can use one of these two connectors:
spark-redshift connector: an open source connector developed and maintained by Databricks
EMR spark-redshift connector: developed by AWS and based on the first one, with some improvements (github).
To load data from Redshift into Spark, you can read the table and process it in Spark:
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
Or you can let Redshift do part of the processing by reading from a query result (you can filter, join, or aggregate your data in Redshift before loading it into Spark):
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
You can do whatever you want with the loaded DataFrame and store the result in another data store if needed. You can use the same connector to write the result (or any other DataFrame) back to Redshift:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .mode("error") \
    .save()
P.S.: The connector is fully supported by Spark SQL, so you can add the dependencies to your EMR cluster and then use the SparkSqlOperator to extract, transform, and reload your Redshift tables (SQL syntax example), or the SparkSubmitOperator if you prefer Python/Scala/Java jobs.

Performance issues in loading data from Databricks to Azure SQL

I am trying to load 1 million records from a Delta table in Databricks into an Azure SQL database using the recently released connector by Microsoft, which supports the Python API and Spark 3.0.
The performance does not look great to me: it takes 19 minutes to load 1 million records. Below is the code I am using. Do you think I am missing something here?
Configuration:
8 worker nodes with 28 GB memory and 8 cores.
The Azure SQL database is a 4 vCore Gen5.
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", "lending_club_acc_loans") \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .option("batchsize", "200000") \
        .option("reliabilityLevel", "BEST_EFFORT") \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Is there something I can do to boost the performance?
Repartition the DataFrame. Earlier my source DataFrame had a single partition; repartitioning it to 8 partitions improved the performance.
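As a rough sketch of that change: Spark writes each partition in a separate task, so a single-partition DataFrame is written by one task no matter how many cores the cluster has. The helper name write_repartitioned is hypothetical, and 8 is simply the value that worked in this case, not a general recommendation.

```python
def write_repartitioned(df, url, table, username, password, num_partitions=8):
    """Repartition ``df`` and write it through the SQL Server connector."""
    (
        df.repartition(num_partitions)  # gives Spark one write task per partition
        .write
        .format("com.microsoft.sqlserver.jdbc.spark")
        .mode("overwrite")
        .option("url", url)
        .option("dbtable", table)
        .option("user", username)
        .option("password", password)
        .save()
    )
```

The other options from the question (tableLock, batchsize, reliabilityLevel) can be chained in the same way; they are left out here only to keep the sketch short.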

Parallel execution of read and write API calls in PySpark SQL

I need to load the incremental records from a set of tables in MySQL to Amazon S3 in Parquet format. These tables are common across several databases/schemas in the AWS managed MySQL instance. The code should copy data from each of the schemas (each of which has the same set of tables) in parallel.
I'm using the PySpark SQL read API to connect to the MySQL instance and read each table of a schema, and I'm writing the resulting DataFrame to S3 as a Parquet file using the write API. I'm running this in a loop for each table in a database, as shown in the code below:
def load_data_to_s3(databases_df):
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for row in databases_df.collect():
        for table in db_query_properties.keys():
            last_recorded_id_value = auto_id_values[table]
            # "dbtable" must be a table name or a parenthesised, aliased subquery
            select_sql = "(select * from {}.{} where id>{}) as t".format(row.database_name, table, last_recorded_id_value)
            df = spark.read.format("jdbc") \
                .option("driver", mysql_db_properties['driver']) \
                .option("url", row.database_connection_url) \
                .option("dbtable", select_sql) \
                .option("user", username) \
                .option("password", password) \
                .load()
            s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
            df.write.parquet(s3_path, mode="append")
I would like to know how I can scale this code to multiple databases running in parallel in an EMR cluster. Please suggest a suitable approach. Let me know if any more details are required.
I can propose two solutions:
1. Easy way
Submit multiple jobs to your EMR cluster at once (one job per DB). If monitoring is the problem, just have the logs for the failed ones written to S3 or HDFS.
2. Bit of code change required
You could try using threading to parallelize the data pulls from each DB. I can show a sample of how to do it, but you might need to make more changes to suit your use case.
Sample implementation:
import threading

def load_data_to_s3(row):
    # Copy every configured table of one database (one row of databases_df) to S3.
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for table in db_query_properties.keys():
        last_recorded_id_value = auto_id_values[table]
        # "dbtable" must be a table name or a parenthesised, aliased subquery
        select_sql = "(select * from {}.{} where id>{}) as t".format(row.database_name, table, last_recorded_id_value)
        df = spark.read.format("jdbc") \
            .option("driver", mysql_db_properties['driver']) \
            .option("url", row.database_connection_url) \
            .option("dbtable", select_sql) \
            .option("user", username) \
            .option("password", password) \
            .load()
        s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
        df.write.parquet(s3_path, mode="append")

# One thread per database; args must be a tuple, hence the trailing comma.
threads = [threading.Thread(target=load_data_to_s3, args=(row,)) for row in databases_df.collect()]
for t in threads:
    t.start()
for t in threads:
    t.join()
Also, please make sure to change the scheduler to FAIR using the set('spark.scheduler.mode', 'FAIR') property. The sample above creates one thread for each of your DBs. If you want to control the number of threads running in parallel, modify the for loop accordingly.
Additionally, if you want to create new jobs from within the program, pass your SparkSession along with the arguments.
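One way to cap the number of concurrent threads, as a sketch only, is a bounded pool from Python's standard concurrent.futures module instead of one raw thread per database. The names run_per_database and load_one are hypothetical; load_one stands for whatever function loads a single database, and each database identifier is assumed to be hashable (for example, the database name).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_per_database(databases, load_one, max_workers=4):
    """Call ``load_one(db)`` for every database, at most ``max_workers`` at a time."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the database it was submitted for.
        futures = {pool.submit(load_one, db): db for db in databases}
        for future in as_completed(futures):
            db = futures[future]
            try:
                results[db] = future.result()
            except Exception as exc:  # keep loading the others if one database fails
                failures[db] = exc
    return results, failures
```

Returning the failures separately makes it possible to log or retry only the databases that failed, which also helps with the monitoring concern mentioned for the first option.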
Your list_of_databases is not parallelized. To process it in parallel, you should parallelize the list and run the job using foreach or a similar construct provided by Spark.
Alternatively, turn on the concurrency option in EMR and submit one EMR step per table, or use Spark's fair scheduler, which can run the jobs in parallel internally with a small modification to your code.

How to call oracle stored procedure in queries?

In my Spark project, I am using spark-sql-2.4.1v.
As part of my code, I need to call Oracle stored procs in my Spark job.
We are converting an old project to Spark that has a lot of logic based on Oracle stored procs. We are converting the middleware logic to Spark, but we want to keep the procs' logic as is, since other applications use them. Hence we need to call the existing procs from Spark code.
How do I call Oracle stored procs?
The cx_Oracle module in Python can be used to call an Oracle stored procedure from Python / PySpark scripts.
Documentation is here - https://cx-oracle.readthedocs.io/en/latest/user_guide/plsql_execution.html
If for any reason cx_Oracle does not work in the Hadoop environment (it requires the Oracle client binaries to be installed), we can use the Spark JDBC method below.
The PySpark JDBC options include a property called sessionInitStatement, which can be used to execute a custom SQL statement or a PL/SQL block before the JDBC process starts reading the data. This option in a Spark JDBC read can be used to call a stored proc as below.
Here we first execute the PL/SQL proc using sessionInitStatement and then read the final data set produced by the stored proc using a Spark JDBC read.
from pyspark.sql import SparkSession
spark = (SparkSession.builder.enableHiveSupport().getOrCreate())
# Provide PL/SQL code here - call the stored proc within BEGIN and END block.
plsql_block = """
BEGIN
SCHEMA.STORED_PROC_NAME;
END;
"""
# Read the final table that is created / updated within the stored proc.
count_query = """
(
    select count(*) from SCHEMA.TABLE_NAME
) t1
"""
df = spark.read \
    .format("jdbc") \
    .option("url", "JDBC_URL") \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .option("oracle.jdbc.timezoneAsRegion", "false") \
    .option("sessionInitStatement", plsql_block) \
    .option("dbtable", count_query) \
    .option("user", "USER_ID") \
    .option("password", "PASSWORD") \
    .load()
print("Total Records")
df.show(10, False)
Spark SQL can be used to read from and write to an Oracle table; in other words, you can do select, insert, and delete. You can do an update as a delete plus an insert. You can also call a function as part of the SQL. I don't think you can call a stored procedure using Spark SQL, but you can do that using plain old Java/Scala/Python code. The strategy I use is to populate a table with Spark SQL and then run a stored procedure on that table using a standard JDBC connection and Java code. The driver executes this stored procedure in a single thread, so you obviously shouldn't expect scalability here.
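import java.sql.DriverManager
import org.apache.spark.sql.SaveMode

// df is the DataFrame that populates the table the procedure works on
df.write.mode(SaveMode.Append).jdbc(jdbcURL, tableName, jdbcProperties)
val con = DriverManager.getConnection(jdbcURL)
// execute the stored procedure using the JDBC connection
con.close()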
You can try doing something like this, though I have never tried this personally in any implementation
query = "exec SP_NAME"
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
    .option("dbtable", query) \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

Filtering from phoenix when loading a table

I would like to know exactly how this works:
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE") \
    .option("zkUrl", "10.0.0.11:2181:/hbase-unsecure") \
    .load()
Does this load the whole table, or is loading delayed until Spark knows whether a filter will be applied?
In the first case, how can I tell Phoenix to filter the table before it is loaded into the Spark DataFrame?
Thanks
Data is not loaded until you execute an action that requires it. Any filter applied in between, for example:
df.where($"foo" === "bar").count
will be pushed down by Spark if possible. You can check the result of predicate pushdown by running explain().
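Since the question uses PySpark and the snippet above is Scala, here is a rough PySpark equivalent as a sketch. The helper name filtered_count is hypothetical, and "foo" and "bar" are just the placeholder column and value from the answer.

```python
def filtered_count(df, column, value):
    """Filter lazily, print the physical plan, then run the count action."""
    filtered = df.where(df[column] == value)  # nothing is read yet
    filtered.explain()  # the plan shows whether the filter reached the source
    return filtered.count()  # the action that actually triggers the read
```

Calling filtered_count(df, "foo", "bar") on the DataFrame from the question prints the plan before the count runs, so you can see whether the predicate appears in the scan of the Phoenix table or as a separate filter step in Spark.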
