save spark dataframe to multiple targets parallel - apache-spark

I need to write my final dataframe to hdfs and oracle database.
currently once saving to hdfs done, it start writing to rdbms. is there any way to use java threads to save same dataframe to hdfs as well as rdbms parallel.
finalDF.write().option("numPartitions", "10").jdbc(url, exatable, jdbcProp);
finalDF.write().mode("OverWrite").insertInto(hiveDBWithTable);
Thanks.

Cache finalDF before writing to hdfs and rdbms. Then make sure that enough executors are available for writing simultaneously. If number of partitions in finalDF are p and cores per executors are c, then you need minimum ceilof(p/c)+ceilof(10/c) executors.

df.show and df.write are Actions. Actions occur sequentially in Spark. So, answer is No, not possible standardly unless threads used.

We can use below code to append dataframe values to table
DF.write
.mode("append")
.format("jdbc")
.option("driver", driverProp)
.option("url", urlDbRawdata)
.option("dbtable", TABLE_NAME)
.option("user", userName)
.option("password", password)
.option("numPartitions", maxNumberDBPartitions)
.option("batchsize",batchSize)
.save()

Related

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and huge volume of Data in our Redshift data warehouse. We currently process the data directly in Redshift (w/ SQL) but given the amount of data, this puts a lot of pressure in the data warehouse and it is less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark), first of all, what do you think about this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Process the SQL scripts that were previously done in Redshift
Transfer the newly created table from Spark to Redshift
Is it a use case that someone here has already put in production?
What would in your opinion be the best way to interact with the Spark Cluster ? EmrAddStepsOperator vs PythonOperator + PySpark?
You can use one of the two drivers:
spark-redshift connector: open source connector developed and maintained by databricks
EMR spark-redshift connector: it is developed by AWS and based on the first one, but with some improvements (github).
To load data from Redshift to spark, you can read the data table and process them in spark:
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "my_table") \
.option("tempdir", "s3a://path/for/temp/data") \
.load()
Or take advantage of Redshift in a part of your processing by reading from a query result (you can filter, join or aggregate your data in Redshift before load them in spark)
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("query", "select x, count(*) my_table group by x") \
.option("tempdir", "s3a://path/for/temp/data") \
.load()
You can do what you want with the loaded dataframe, and you can store the result to another data store if needed. You can use the same connector to load the result (or any other dataframe) in Redshift:
df.write \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "my_table_copy") \
.option("tempdir", "s3n://path/for/temp/data") \
.mode("error") \
.save()
P.S: the connector is fully supported by spark SQL, so you can add the dependencies to your EMR cluster, then use the operator SparkSqlOperator to extract, transform then re-load your Redshift tables (SQL syntax example), or the operator SparkSubmitOperator if you prefer Python/Scala/JAVA jobs.

Optimizing Spark JDBC connection read time by adding query parameter

Connecting sql server to spark using the following package https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16. At the moment am reading the entire table however this is bad for performance. To optimize performance I want to pass a query to the following spark.read config. for example select * from my table where record time > timestamp. is this possible? how would I do this?
DF = spark.read \
.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", jdbcUrl) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password).load()
You can just filter the data frame that you are creating. Spark supports predicate pushdown, which means that the filtering will most likely run on top of the database directly. You can make sure that that works by looking at the SparkUI / Explain Plan

Parallel execution of read and write API calls in PySpark SQL

I need to load the incremental records from a set of tables in MySQL to Amazon S3 in Parquet format. These tables are common across several databases/schemas in the AWS MySQL managed instance. The code should copy data from each of the schemas (which has a set of common tables) in parallel.
I'm using read API PySpark SQL to connect to MySQL instance and read data of each table for a schema and am writing the result dataframe to S3 using write API as a Parquet file. I'm running this in a loop for each table in a database as shown in the code below:
def load_data_to_s3(databases_df):
db_query_properties = config['mysql-query']
auto_id_values = config['mysql-auto-id-values']
for row in databases_df.collect():
for table in db_query_properties.keys():
last_recorded_id_value = auto_id_values[table]
select_sql = "select * from {}.{} where id>{}".format(row.database_name, table, last_recorded_id_value)
df = spark.read.format("jdbc") \
.option("driver", mysql_db_properties['driver']) \
.option("url", row.database_connection_url) \
.option("dbtable", select_sql) \
.option("user", username) \
.option("password", password) \
.load()
s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
df.write.parquet(s3_path, mode="append")
I would like to know how I can scale this code to multiple databases running in parallel in an EMR cluster. Please suggest me a suitable approach. Let me know if any more details required.
I can propose two solutions:
1. Easy way
Submit multiple jobs to your EMR at once(one job per DB). If monitoring is the problem, just have the logs for failed ones only written to S3 or HDFS.
2. Bit of code change required
You could try using threading to parallelize the data pulls from each DB. I can show a sample for how to do it, but you might need to do more changes to suit your use case.
Sample implementaion:
import threading
def load_data_to_s3(databases_df):
db_query_properties = config['mysql-query']
auto_id_values = config['mysql-auto-id-values']
for row in databases_df.collect():
for table in db_query_properties.keys():
last_recorded_id_value = auto_id_values[table]
select_sql = "select * from {}.{} where id>{}".format(row.database_name, table, last_recorded_id_value)
df = spark.read.format("jdbc") \
.option("driver", mysql_db_properties['driver']) \
.option("url", row.database_connection_url) \
.option("dbtable", select_sql) \
.option("user", username) \
.option("password", password) \
.load()
s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
df.write.parquet(s3_path, mode="append")
threads = [threading.Thread(target=load_data_to_s3, args=(db) for db in databases_df]
for t in threads:
t.start()
for t in threads:
t.join()
Also, please make sure to change the scheduler to FAIR using the set('spark.scheduler.mode', 'FAIR') property. This will create a thread for each of your DBs. If you want to control the number of threads running parallelly, modify the for loop accordingly.
Additionally, if you want to create new jobs from within the program, pass your SparkSession along with the arguments.
Your list_of_databases is not parallelized. To do the parallel processing, you should parallelize the list and do the parallel job by using foreach or something that is given by spark.
Turn on the concurrent option in EMR and send EMR step for each table, or you can use the fair scheduler of the Spark which can internally proceed the job in parallel with a small modification of your code.

Does the Spark Shell JDBC read numPartitions value depend on the number of executors?

I have Spark set up in standalone mode on a single node with 2 cores and 16GB of RAM to make some rough POCs.
I want to load data from a SQL source using val df = spark.read.format('jdbc')...option('numPartitions',n).load(). When I tried to measure the time taken to read a table for different numPartitions values by calling a df.rdd.count, I saw the the time was the same regardless of the value I gave. I also noticed one the context web UI that the number of Active executors was 1, even though I gave SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=1in my spark_env.sh file.
I have 2 questions:
Do the numPartitions actually created depend on the number of executors?
How do I start spark-shell with multiple executors in my current setup?
Thanks!
Number of partitions doesn't depend on your number of executors - althaugh there is best practice (partitions per cores), but it doesn't determined by the executors instances.
In case of reading from JDBC, to make it parallelize reading you need a partition column, e.g:
spark.read("jdbc")
.option("url", url)
.option("dbtable", "table")
.option("user", user)
.option("password", password)
.option("numPartitions", numPartitions)
.option("partitionColumn", "<partition_column>")
.option("lowerBound", 1)
.option("upperBound", 10000)
.load()
That will parallel the queries from the databases to 10,000/numPartitions results of each query.
About your second question, you can find all over spark configuration over here: https://spark.apache.org/docs/latest/configuration.html , (spark2-shell --num-executors, or the configuration --conf spark.executor.instances).
Specifing the number of the executors meaning dynamic allocation will be off so be aware of that.

Dataset shows in Hive is inconsistent with JDBC

I am trying to query data from Hive in spark. According the spark-sql, there're two ways to do this:
The first way is Init session with enableHiveSupport
SparkSession session = SparkSession.builder().enableHiveSupport().getOrCreate();
session.sql("select dw_date from tfdw.dwd_dim_date limit 10").show();
the dataset shows the correct result
The second way is through JDBC
Dataset<Row> ds = session.read()
.format("jdbc")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.option("url", "jdbc:hive2://iZ11syxr6afZ:21050/;auth=noSasl")
.option("dbtable", "tfdw.dwd_dim_date")
.load();
ds.select("dw_date").limit(10).show();
But the dataset only show the column name in the result rather than the data of the column.
The two pictures should be consistent I think. Any outstanding I missed ? Many thanks!

Resources