How to call oracle stored procedure in queries? - apache-spark

In my Spark project I am using spark-sql-2.4.1v.
As part of my code I need to call Oracle stored procedures from my Spark job.
We are converting an old project that has a lot of logic in Oracle stored procedures to Spark. Only the middleware logic is being converted to Spark, so we want to keep the procedure logic as-is, since other applications use those procedures. Hence we need to call the existing procedures from the Spark code.
How do I call Oracle stored procedures?

The cx_Oracle module can be used to call an Oracle stored procedure from Python / PySpark scripts.
Documentation is here - https://cx-oracle.readthedocs.io/en/latest/user_guide/plsql_execution.html
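For example, a minimal cx_Oracle sketch could look like the following (the connection details and procedure name are placeholders; cursor.callproc does the actual call):
import cx_Oracle

# Placeholder connection details - adjust to your environment
conn = cx_Oracle.connect(user="USER_ID", password="PASSWORD",
                         dsn="hostname:1521/SERVICE_NAME")
cur = conn.cursor()

# callproc executes the stored procedure; pass IN/OUT arguments as a list if needed
cur.callproc("SCHEMA.STORED_PROC_NAME")

conn.commit()
cur.close()
conn.close()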
If for any reason cx_Oracle does not work in the Hadoop environment (it requires the Oracle client binaries to be installed), we can use the Spark JDBC method below.
The PySpark JDBC reader has an option called sessionInitStatement, which executes a custom SQL statement or PL/SQL block before the JDBC reader starts fetching the data. This option can be used to call a stored procedure as shown below.
Here we first execute the PL/SQL block via sessionInitStatement and then read the final data set produced by the stored procedure with a Spark JDBC read.
from pyspark.sql import SparkSession
spark = (SparkSession.builder.enableHiveSupport().getOrCreate())
# Provide PL/SQL code here - call the stored proc within BEGIN and END block.
plsql_block = """
BEGIN
SCHEMA.STORED_PROC_NAME;
END;
"""
# Read the final table that is created / updated within the stored proc.
count_query = """
(
select count(*) from SCHEMA.TABLE_NAME
) t1
"""
df = spark.read \
.format("jdbc") \
.option("url", "JDBC_URL") \
.option('driver', 'oracle.jdbc.driver.OracleDriver') \
.option("oracle.jdbc.timezoneAsRegion", "false") \
.option("sessionInitStatement", plsql_block) \
.option("dbtable", count_query) \
.option("user", "USER_ID") \
.option("password", "PASSWORD") \
.load()
print("Total Records")
df.show(10, False)

Spark SQL can be used to read from and write to an Oracle table; in other words, you can do select, insert and delete, and you can emulate an update as delete + insert. You can also call a function as part of the SQL. But I don't think you can call a stored procedure through Spark SQL itself; you can, however, do that with plain old Java/Scala/Python JDBC code. The strategy I use is to populate a table with Spark SQL and then run a stored procedure against that table over a standard JDBC connection. The driver executes that stored procedure in a single thread, so obviously you should not expect any scalability from this step.
import java.sql.DriverManager
import org.apache.spark.sql.SaveMode
// populate the table with Spark first
df.write.mode(SaveMode.Append).jdbc(jdbcURL, tableName, jdbcProperties)
// then call the stored procedure over a plain JDBC connection on the driver
val con = DriverManager.getConnection(jdbcURL)
con.prepareCall("{call SCHEMA.STORED_PROC_NAME}").execute()
con.close()

You can try doing something like this, though I have never tried it personally in any implementation:
query = "exec SP_NAME"
empDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:oracle:thin:username/password#//hostname:portnumber/SID") \
.option("dbtable", query) \
.option("user", "db_user_name") \
.option("password", "password") \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()

Related

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data this puts a lot of pressure on the data warehouse, and it is becoming less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark). First of all, what do you think about this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Process the SQL scripts that were previously done in Redshift
Transfer the newly created table from Spark to Redshift
Is it a use case that someone here has already put in production?
What would, in your opinion, be the best way to interact with the Spark cluster? EmrAddStepsOperator vs PythonOperator + PySpark?
You can use one of the two drivers:
spark-redshift connector: an open-source connector developed and maintained by Databricks.
EMR spark-redshift connector: developed by AWS and based on the first one, but with some improvements (github).
To load data from Redshift into Spark, you can read a table and process it in Spark:
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "my_table") \
.option("tempdir", "s3a://path/for/temp/data") \
.load()
Or take advantage of Redshift for a part of your processing by reading from a query result (you can filter, join or aggregate your data in Redshift before loading it into Spark):
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("query", "select x, count(*) my_table group by x") \
.option("tempdir", "s3a://path/for/temp/data") \
.load()
You can do whatever you want with the loaded dataframe, and you can store the result in another data store if needed. You can use the same connector to load the result (or any other dataframe) into Redshift:
df.write \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "my_table_copy") \
.option("tempdir", "s3n://path/for/temp/data") \
.mode("error") \
.save()
P.S.: the connector is fully supported by Spark SQL, so you can add the dependencies to your EMR cluster and then use the SparkSqlOperator to extract, transform and re-load your Redshift tables (SQL syntax), or the SparkSubmitOperator if you prefer Python/Scala/Java jobs; a sketch of the latter follows.
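For illustration, a minimal Airflow sketch using SparkSubmitOperator might look like this (it assumes Airflow 2.x with the apache-spark provider installed on MWAA; the job script path, connection id and jar path are placeholders you would define yourself):
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical DAG with one task that submits the PySpark job doing the Redshift -> Spark -> Redshift flow
with DAG(
    dag_id="redshift_spark_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = SparkSubmitOperator(
        task_id="transform_redshift_tables",
        application="s3://my-bucket/jobs/redshift_etl.py",  # placeholder PySpark script
        conn_id="spark_default",                            # Spark connection defined in Airflow
        jars="/path/to/spark-redshift-connector.jar",       # placeholder path to the connector jar
    )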

Optimizing Spark JDBC connection read time by adding query parameter

I am connecting SQL Server to Spark using the following package: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16. At the moment I am reading the entire table, which is bad for performance. To optimize performance I want to pass a query to the spark.read config below, for example select * from my table where record time > timestamp. Is this possible? How would I do this?
DF = spark.read \
.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", jdbcUrl) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password).load()
You can just filter the data frame that you are creating. Spark supports predicate pushdown, which means the filter will most likely be executed directly in the database. You can verify that this works by looking at the Spark UI / explain plan.
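As an illustration, here is a sketch of both approaches, assuming the same jdbcUrl, table_name, username and password variables as in the question, a hypothetical record_time column, and that the Microsoft connector accepts a parenthesised subquery for dbtable like the generic JDBC source does:
# Option 1: filter the DataFrame; the predicate is pushed down to SQL Server where possible
DF = (spark.read
      .format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", jdbcUrl)
      .option("dbtable", table_name)
      .option("user", username)
      .option("password", password)
      .load()
      .filter("record_time > '2023-01-01 00:00:00'"))  # hypothetical column and timestamp

DF.explain()  # the filter should show up as a pushed filter in the JDBC scan

# Option 2: push the query yourself by passing a parenthesised subquery as dbtable
query = "(select * from {} where record_time > '2023-01-01 00:00:00') as t".format(table_name)
DF2 = (spark.read
       .format("com.microsoft.sqlserver.jdbc.spark")
       .option("url", jdbcUrl)
       .option("dbtable", query)
       .option("user", username)
       .option("password", password)
       .load())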

Parallel execution of read and write API calls in PySpark SQL

I need to load the incremental records from a set of tables in MySQL to Amazon S3 in Parquet format. These tables are common across several databases/schemas in the AWS MySQL managed instance. The code should copy data from each of the schemas (which has a set of common tables) in parallel.
I'm using the PySpark SQL read API to connect to the MySQL instance and read each table's data for a schema, and I'm writing the resulting dataframe to S3 as a Parquet file using the write API. I'm running this in a loop for each table in a database, as shown in the code below:
def load_data_to_s3(databases_df):
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for row in databases_df.collect():
        for table in db_query_properties.keys():
            last_recorded_id_value = auto_id_values[table]
            # dbtable must be a table name or a parenthesised subquery with an alias
            select_sql = "(select * from {}.{} where id > {}) as t".format(
                row.database_name, table, last_recorded_id_value)
            df = spark.read.format("jdbc") \
                .option("driver", mysql_db_properties['driver']) \
                .option("url", row.database_connection_url) \
                .option("dbtable", select_sql) \
                .option("user", username) \
                .option("password", password) \
                .load()
            s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
            df.write.parquet(s3_path, mode="append")
I would like to know how I can scale this code to process multiple databases in parallel on an EMR cluster. Please suggest a suitable approach, and let me know if any more details are required.
I can propose two solutions:
1. Easy way
Submit multiple jobs to your EMR cluster at once (one job per DB). If monitoring is the problem, just have the logs for the failed ones written to S3 or HDFS.
2. A bit of code change required
You could try using threading to parallelize the data pulls from each DB. I can show a sample of how to do it, but you might need to make more changes to suit your use case.
Sample implementation:
import threading

def load_data_to_s3(db_row):
    # Load all configured tables for a single database (one thread per DB)
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for table in db_query_properties.keys():
        last_recorded_id_value = auto_id_values[table]
        # dbtable must be a table name or a parenthesised subquery with an alias
        select_sql = "(select * from {}.{} where id > {}) as t".format(
            db_row.database_name, table, last_recorded_id_value)
        df = spark.read.format("jdbc") \
            .option("driver", mysql_db_properties['driver']) \
            .option("url", db_row.database_connection_url) \
            .option("dbtable", select_sql) \
            .option("user", username) \
            .option("password", password) \
            .load()
        s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
        df.write.parquet(s3_path, mode="append")

threads = [threading.Thread(target=load_data_to_s3, args=(row,))
           for row in databases_df.collect()]
for t in threads:
    t.start()
for t in threads:
    t.join()
Also, please make sure to switch the scheduler to FAIR by setting the spark.scheduler.mode property to FAIR (see the sketch below for where to set it). The code above creates a thread for each of your DBs; if you want to control the number of threads running in parallel, modify the loop accordingly.
Additionally, if you want to create new jobs from within the program, pass your SparkSession along with the arguments.
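For reference, a minimal sketch of where the FAIR scheduler mode can be set when building the session (the application name is just a placeholder):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallel-mysql-to-s3")          # placeholder app name
         .config("spark.scheduler.mode", "FAIR")   # let concurrent jobs share executors fairly
         .getOrCreate())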
Your list_of_databases is not parallelized. To do the processing in parallel, you should parallelize the list and run the jobs in parallel using foreach or something similar provided by Spark.
Turn on the concurrency option in EMR and submit an EMR step for each table, or use Spark's fair scheduler, which can run the jobs in parallel internally with a small modification of your code.

How to distribute JDBC jar on Cloudera cluster?

I've just installed a new Spark 2.4 from CSD on my CDH cluster (28 nodes) and am trying to install a JDBC driver in order to read data from a database from within a Jupyter notebook.
I downloaded the driver and copied it to the /jars folder on one node; however, it seems that I have to do the same on each and every host (!). Otherwise I get the following error from one of the workers:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Is there an easy way (without writing bash scripts) to distribute the jar files and packages across the whole cluster? I wish Spark could distribute them itself (or maybe it does and I just don't know how).
Spark has a jdbc format reader you can use.
Launch a Scala shell to confirm your MS SQL Server driver is on your classpath.
example
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it in your classpath when you initialize your session.
example
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Connect to your MS SQL Server via Spark JDBC.
example via spark python
# option1
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
# option2
jdbcDF2 = spark.read \
.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
properties={"user": "username", "password": "password"})
Specifics and additional ways to build connection strings can be found here:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You mentioned Jupyter ... if you still cannot get the above to work, try setting some env vars via this post (I cannot confirm whether this works though):
https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6
At the end of the day, all you really need is the driver jar placed on an edge node (the client where you launch Spark) and appended to your classpath. Then make the connection and partition your dataframe to scale read performance, since a plain JDBC read from an RDBMS runs in a single thread and hence produces one partition; see the sketch below.
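For illustration, here is a minimal sketch of a partitioned JDBC read (the URL, column and bounds are placeholders you would tune for your own table; partitionColumn must be a numeric, date or timestamp column):
# Partitioned read: Spark issues numPartitions parallel queries, each covering a slice of id
jdbcDF = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://dbserver:1433;databaseName=mydb")  # placeholder URL
          .option("dbtable", "schema.tablename")
          .option("user", "username")
          .option("password", "password")
          .option("partitionColumn", "id")   # hypothetical numeric column
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "8")
          .load())

print(jdbcDF.rdd.getNumPartitions())  # should report 8 instead of 1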

Data validation for Oracle to Cassandra Data Migration

We are migrating data from Oracle to Cassandra as part of a daily ETL process. I would like to perform data validation between the two databases once the Spark jobs are complete, to ensure that both databases are in sync. We are using DSE 5.1. Could you please provide your valuable inputs on how to ensure the data has migrated properly?
I assume you have DSE Max with Spark support.
Spark SQL should suit this best.
First, connect to Oracle with JDBC:
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#jdbc-to-other-databases
I have no Oracle DB, so the following code is not tested; check the JDBC URL and drivers before running it:
dse spark --driver-class-path ojdbc7.jar --jars ojdbc7.jar
scala> val oData = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:hr/hr#//localhost:1521/pdborcl")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
The C* data is already mapped to a Spark SQL table, so:
scala> val cData = spark.sql("select * from keyspace.table")
You will need to check the schemas of both and the data conversion details to compare the tables properly. A simple integrity check is that all data from Oracle exists in C*:
scala> oData.except(cData).count
res0: Long = 0
