I need to load the incremental records from a set of tables in MySQL to Amazon S3 in Parquet format. These tables are common across several databases/schemas in the AWS MySQL managed instance. The code should copy data from each of the schemas (which has a set of common tables) in parallel.
I'm using the PySpark SQL read API to connect to the MySQL instance and read each table of a schema, and I write the resulting DataFrame to S3 as Parquet using the write API. I run this in a loop for each table in a database, as shown in the code below:
def load_data_to_s3(databases_df):
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for row in databases_df.collect():
        for table in db_query_properties.keys():
            last_recorded_id_value = auto_id_values[table]
            select_sql = "select * from {}.{} where id>{}".format(row.database_name, table, last_recorded_id_value)
            df = spark.read.format("jdbc") \
                .option("driver", mysql_db_properties['driver']) \
                .option("url", row.database_connection_url) \
                .option("dbtable", select_sql) \
                .option("user", username) \
                .option("password", password) \
                .load()
            s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
            df.write.parquet(s3_path, mode="append")
I would like to know how I can scale this code so that multiple databases are processed in parallel on an EMR cluster. Please suggest a suitable approach, and let me know if any more details are required.
I can propose two solutions:
1. Easy way
Submit multiple jobs to your EMR cluster at once (one job per DB). If monitoring is the problem, write only the logs of the failed jobs to S3 or HDFS.
2. Bit of code change required
You could try using threading to parallelize the data pulls from each DB. I can show a sample for how to do it, but you might need to do more changes to suit your use case.
Sample implementation:
import threading

def load_database_to_s3(row):
    # Copy every configured table of one database (one row of databases_df) to S3.
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for table in db_query_properties.keys():
        last_recorded_id_value = auto_id_values[table]
        # The JDBC "dbtable" option expects a table name or a parenthesized subquery with an alias.
        select_sql = "(select * from {}.{} where id > {}) as t".format(row.database_name, table, last_recorded_id_value)
        df = spark.read.format("jdbc") \
            .option("driver", mysql_db_properties['driver']) \
            .option("url", row.database_connection_url) \
            .option("dbtable", select_sql) \
            .option("user", username) \
            .option("password", password) \
            .load()
        s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
        df.write.parquet(s3_path, mode="append")

# One thread per database. args must be a tuple, hence the trailing comma.
threads = [threading.Thread(target=load_database_to_s3, args=(row,)) for row in databases_df.collect()]
for t in threads:
    t.start()
# Wait for every database to finish.
for t in threads:
    t.join()
The code above creates one thread for each of your DBs. Also make sure to change the scheduler mode to FAIR using the set('spark.scheduler.mode', 'FAIR') property, so that jobs submitted from different threads share the cluster's resources instead of being served first-in, first-out. If you want to control the number of threads running in parallel, modify the for loop accordingly.
Additionally, if you want to create new jobs from within the program, pass your SparkSession along with the arguments.
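One way to cap the number of threads without hand-editing the for loop is a thread pool. This is only a minimal sketch and not part of the original answer: `run_per_database`, `load_one_database`, and `max_workers` are hypothetical names, and `load_one_database` stands for whatever function loads a single database.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_per_database(databases, load_one_database, max_workers=4):
    """Run load_one_database(db) for every db, at most max_workers at a time.

    Returns a dict mapping each database to None on success or to the
    exception it raised, so one failing database does not hide the others.
    The items in databases must be hashable (for example database names).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per database; the pool queues the excess.
        futures = {pool.submit(load_one_database, db): db for db in databases}
        for future in as_completed(futures):
            db = futures[future]
            # exception() returns None when the task finished cleanly.
            results[db] = future.exception()
    return results
```

With this shape, the caller can log or retry only the databases whose entry in the returned dict is not None.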
Your list_of_databases is not parallelized. To process it in parallel, you should parallelize the list and run the work with foreach or another parallel operation that Spark provides.
Alternatively, turn on the concurrent option in EMR and send one EMR step per table, or use Spark's fair scheduler, which can run the jobs in parallel internally with a small modification to your code.
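Submitting one EMR step per table (or per database) could look roughly like the sketch below, which uses boto3's `add_job_flow_steps`. The script path, the `--target` argument, and the function names are hypothetical, and it assumes the script reads its target from the command line; steps only overlap if concurrent steps are enabled on the cluster.

```python
def build_emr_steps(targets, script_s3_path):
    """Build one spark-submit step definition per target (table or database)."""
    steps = []
    for target in targets:
        steps.append({
            "Name": "load-{}".format(target),
            # Keep running the other steps if this one fails.
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # --target is a hypothetical argument understood by the script.
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_path, "--target", target],
            },
        })
    return steps


def submit_steps(cluster_id, steps):
    """Send the steps to a running EMR cluster and return the API response."""
    import boto3  # imported lazily so build_emr_steps works without boto3

    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
```

Building the step list separately from sending it makes the definitions easy to inspect or log before anything is submitted.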
Related
We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data, this puts a lot of pressure on the data warehouse and makes it less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark). First of all, what do you think about this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Process the SQL scripts that were previously done in Redshift
Transfer the newly created table from Spark to Redshift
Is it a use case that someone here has already put in production?
In your opinion, what would be the best way to interact with the Spark cluster: EmrAddStepsOperator, or PythonOperator + PySpark?
You can use one of the two drivers:
spark-redshift connector: an open source connector developed and maintained by Databricks
EMR spark-redshift connector: developed by AWS and based on the first one, but with some improvements (github).
To load data from Redshift into Spark, you can read a table and process it in Spark:
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
Or let Redshift do part of the processing by reading from a query result (you can filter, join, or aggregate your data in Redshift before loading it into Spark):
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
You can do what you want with the loaded DataFrame, and you can store the result in another data store if needed. You can use the same connector to write the result (or any other DataFrame) to Redshift:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .mode("error") \
    .save()
P.S.: the connector is fully supported by Spark SQL, so you can add the dependencies to your EMR cluster and then use the SparkSqlOperator to extract, transform, and reload your Redshift tables (SQL syntax example), or the SparkSubmitOperator if you prefer Python/Scala/Java jobs.
I am connecting SQL Server to Spark using the following package: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16. At the moment I am reading the entire table, which is bad for performance. To optimize performance, I want to pass a query to the spark.read config below, for example select * from my table where record time > timestamp. Is this possible, and how would I do it?
DF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .load()
You can just filter the DataFrame that you are creating. Spark supports predicate pushdown, which means that the filtering will most likely run in the database itself. You can verify that this works by looking at the Spark UI or the explain plan.
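If you prefer to make the filter explicit instead of relying on pushdown, one option is to pass a parenthesized subquery with an alias as `dbtable`, the form Spark's generic JDBC source accepts. The sketch below assumes the Microsoft connector accepts the same form, which is worth checking on your version; `pushdown_subquery`, `read_filtered`, and the `record_time` column are hypothetical names.

```python
def pushdown_subquery(table_name, predicate, alias="src"):
    """Wrap a filtered SELECT so it can be passed as the JDBC "dbtable" value.

    The predicate is inserted as-is, so it must come from trusted code,
    never from user input.
    """
    return "(SELECT * FROM {} WHERE {}) AS {}".format(table_name, predicate, alias)


def read_filtered(spark, jdbc_url, table_name, predicate, user, password):
    """Read only the rows matching predicate; spark is an existing SparkSession."""
    return (
        spark.read.format("com.microsoft.sqlserver.jdbc.spark")
        .option("url", jdbc_url)
        .option("dbtable", pushdown_subquery(table_name, predicate))
        .option("user", user)
        .option("password", password)
        .load()
    )
```

For the DataFrame-filter route, calling `explain()` on the filtered DataFrame and looking for the predicate under `PushedFilters` in the physical plan is a quick way to confirm that the database does the filtering.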
In my Spark project, I am using spark-sql-2.4.1v.
As part of my code, I need to call Oracle stored procs in my Spark job.
We are converting an old project to Spark that has a lot of logic based on Oracle stored procs. We are converting the middleware logic to Spark, but we want to keep the procs' logic as is because other applications use them, so we need to call the existing procs from Spark code.
How do I call Oracle stored procs?
The cx_Oracle module in Python can be used to call an Oracle stored procedure from Python / PySpark scripts.
The documentation is here: https://cx-oracle.readthedocs.io/en/latest/user_guide/plsql_execution.html
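A minimal sketch of the cx_Oracle route, assuming a connection opened on the Spark driver with `cx_Oracle.connect(user=..., password=..., dsn=...)`. The helper name `call_stored_proc` is hypothetical; it relies only on the cursor's `callproc` method and on `commit`, so the procedure name and parameters are up to you.

```python
def call_stored_proc(connection, proc_name, params=()):
    """Call a stored procedure on an open connection, then commit.

    Returns whatever cursor.callproc returns (for cx_Oracle, the list of
    parameters after the call, including any OUT values).
    """
    cursor = connection.cursor()
    try:
        result = cursor.callproc(proc_name, list(params))
        connection.commit()
        return result
    finally:
        # Always release the cursor, even if the procedure raised.
        cursor.close()
```

The call runs once, on the driver, so it does not scale with the number of executors.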
If for any reason cx_Oracle does not work in the Hadoop environment (it requires the Oracle client binaries to be installed), we can use the Spark JDBC method below.
The PySpark JDBC options include a property called sessionInitStatement, which can be used to execute a custom SQL statement or a PL/SQL block before the JDBC process starts reading the data. This option in a Spark JDBC read can be used to call a stored proc as shown below.
Here we first execute the PL/SQL proc using sessionInitStatement and then read the final data set produced by the stored proc using a Spark JDBC read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Provide PL/SQL code here - call the stored proc within a BEGIN and END block.
plsql_block = """
BEGIN
    SCHEMA.STORED_PROC_NAME;
END;
"""

# Read the final table that is created / updated within the stored proc.
count_query = """
(
    select count(*) from SCHEMA.TABLE_NAME
) t1
"""

df = spark.read \
    .format("jdbc") \
    .option("url", "JDBC_URL") \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .option("oracle.jdbc.timezoneAsRegion", "false") \
    .option("sessionInitStatement", plsql_block) \
    .option("dbtable", count_query) \
    .option("user", "USER_ID") \
    .option("password", "PASSWORD") \
    .load()

print("Total Records")
df.show(10, False)
Spark SQL can be used to read from and write to an Oracle table; in other words, you can do select, insert, and delete. You can do an update as a delete plus an insert. You can call a function as part of the SQL too. I don't think you can call a stored procedure using Spark SQL, but you can do that using plain old Java/Scala/Python code. The strategy I use is to populate a table with Spark SQL and then run a stored procedure based on that table using a standard JDBC connection and Java code. The driver will execute this stored procedure in a single thread, so you shouldn't expect scalability here.
import java.sql.DriverManager
import org.apache.spark.sql.SaveMode

// df is the DataFrame that populates the table the stored procedure works on.
df.write.mode(SaveMode.Append).jdbc(jdbcURL, tableName, jdbcProperties)
val con = DriverManager.getConnection(jdbcURL)
// Execute the stored procedure using the JDBC connection.
con.close()
You can try something like this, though I have never tried it personally in any implementation:
query = "exec SP_NAME"
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
    .option("dbtable", query) \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
I need to write my final DataFrame to HDFS and to an Oracle database.
Currently, writing to the RDBMS only starts once the save to HDFS is done. Is there any way to use Java threads to save the same DataFrame to HDFS and to the RDBMS in parallel?
finalDF.write().option("numPartitions", "10").jdbc(url, exatable, jdbcProp);
finalDF.write().mode("OverWrite").insertInto(hiveDBWithTable);
Thanks.
Cache finalDF before writing to HDFS and the RDBMS. Then make sure that enough executors are available for both writes to run simultaneously. If finalDF has p partitions and each executor has c cores, then you need at least ceil(p/c) + ceil(10/c) executors, where 10 is the numPartitions value used for the JDBC write.
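The executor estimate above is simple arithmetic and can be sketched as a small helper. `min_executors` and its parameter names are hypothetical; the default of 10 mirrors the numPartitions value used in the question.

```python
import math


def min_executors(df_partitions, cores_per_executor, jdbc_partitions=10):
    """Estimate executors needed to run both writes at the same time.

    One term covers the HDFS write (one task per DataFrame partition), the
    other covers the JDBC write (one task per JDBC partition).
    """
    hdfs_part = math.ceil(df_partitions / cores_per_executor)
    jdbc_part = math.ceil(jdbc_partitions / cores_per_executor)
    return hdfs_part + jdbc_part
```

For example, with 40 partitions and 4 cores per executor this gives ceil(40/4) + ceil(10/4) = 10 + 3 = 13 executors.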
df.show and df.write are actions, and actions submitted from a single thread run sequentially in Spark. So the answer is no, this is not possible by default unless you use threads.
We can use the code below to append DataFrame values to a table:
DF.write
  .mode("append")
  .format("jdbc")
  .option("driver", driverProp)
  .option("url", urlDbRawdata)
  .option("dbtable", TABLE_NAME)
  .option("user", userName)
  .option("password", password)
  .option("numPartitions", maxNumberDBPartitions)
  .option("batchsize", batchSize)
  .save()
Env: Spark 1.6, Scala
Hi
I need to run two processes in parallel: the first one receives data, and the second one transforms it and saves it in a Hive table. I want to repeat the first process at an interval of 1 minute and the second process at an interval of 2 minutes.
==========First Process=== executes once per minute=============
val DFService = hivecontext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://xx.x.x.xx:xxxx;database=myDB")
  .option("dbtable", "(select Service_ID,ServiceIdentifier from myTable ) tmp")
  .option("user", "userName")
  .option("password", "myPassword")
  .load()
DFService.registerTempTable("memTBLService")
DFService.write.mode("append").saveAsTable("hiveTable")
=============Second Process === executes once per 2 minute =========
var DF2 = hivecontext.sql("select * from hiveTable")
var data=DF2.select(DF2("Service_ID")).distinct
data.show()
How can I run these two processes in parallel, each at its desired interval, in Scala?
Thanks
Hossain
Write two separate Spark applications.
Then use cron to execute each app on your desired schedule. Alternatively, you can use Apache Airflow to schedule your Spark apps.
Refer to the following question for how to use cron with Spark: How to schedule my Apache Spark application to run everyday at 00.30 AM(night) in IBM Bluemix?