Spark MySQL JDBC hangs without errors - apache-spark

I have a simple application ingesting MySQL tables into S3 running on Spark 3.0.0 (EMR 6.1).
The MySQL table is loaded using a single large executor with 48G of memory as follows:
spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", query) \
    .option("driver", driver_class) \
    .option("user", secret['username']) \
    .option("password", secret['password']) \
    .load()
The Spark jobs work without any problem on small tables, where the MySQL query completes in under ~6 minutes. However, in two jobs where the query runs longer than this, the Spark job gets stuck in RUNNING and never raises an error. The stderr and stdout logs show no progress and the executor is completely healthy.
The DAG is very simple.
In MySQL (Aurora RDS) the query appears to have completed, but the connection is still open; checking the thread state shows 'cleaned up'.
I have tried MySQL Connector versions 5 and 8, and both show the same behaviour. I suspect this is related to a default Spark timeout configuration, but I would like some guidance.
A thread dump of the single executor was also captured (not reproduced here).
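One direction worth checking before digging further into Spark itself: Connector/J's socketTimeout defaults to 0 ("wait forever"), so if Aurora silently drops or idles out the connection after a long-running query, the executor blocks on the socket indefinitely, which matches the symptoms above. A minimal sketch, assuming the `jdbc_url`, `query`, `driver_class` and `secret` variables from the snippet; the timeout values are illustrative, not recommendations:

```python
def with_timeouts(jdbc_url, connect_ms=10000, socket_ms=900000):
    """Append Connector/J timeout parameters to a JDBC URL.

    socketTimeout=0 (the Connector/J default) means 'wait forever',
    which matches the silent hang described above.
    """
    sep = "&" if "?" in jdbc_url else "?"
    return f"{jdbc_url}{sep}connectTimeout={connect_ms}&socketTimeout={socket_ms}"

# Usage against the reader from the question:
# df = (spark.read.format("jdbc")
#       .option("url", with_timeouts(jdbc_url))
#       .option("dbtable", query)
#       .option("driver", driver_class)
#       .option("user", secret["username"])
#       .option("password", secret["password"])
#       .option("queryTimeout", 900)   # Spark-side statement timeout, in seconds
#       .load())
```

With these set, a stalled read should fail fast with a visible exception instead of hanging silently.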

Related

spark hive to sql write job hanging

I have a Spark app that writes data from Hive to a SQL Server db. The problem is that, although the data gets written into the db, the job never completes in Spark. Sometimes only part of the data gets written, sometimes all of it; I have not been able to identify a pattern.
Here is the write stage:
finalReordered_DF
  .sample(jdbc_df_sample) // using sample method to test whether job hangs at some particular dataset size
  .coalesce(8)
  .write
  .format(jdbc_format)
  .mode(jdbc_write_save_mode)
  .option("url", url)
  .option("user", jdbc_user)
  .option("password", secretKey)
  .option("driver", jdbc_write_driver) // com.microsoft.sqlserver.jdbc.SQLServerDriver
  .option("sessionInitStatement", "SET XACT_ABORT ON;")
  .option("numPartitions", "8")
  .option("UseInternalTransaction", "true")
  .option("queryTimeout", jdbc_write_query_timeout) // 300
  .option("dbtable", jdbc_write_tgt_tbl)
  .option("tableLock", jdbc_write_tableLock) // false
  .option("batchsize", jdbc_write_batch_size) // 200,000
  .option("cancelQueryTimeout", jdbc_write_cancel_query_timeout) // 1,800
  .save()
So there's no pattern I can identify. The more critical issue is that if the job hangs for too long I have to kill it, which leaves the session open on the db, and the target table becomes inaccessible (I don't have permission to kill a running session, by the way), so I'm stuck. The DBA team says nothing is wrong at the DB level and that the only anomaly is the open sessions (I am aware that killing the job means the session is not closed properly).
My question is: what is the proper way to manage the session from within Spark? How do I stop the application gracefully so that the JDBC driver can close the session cleanly?
Thank you!
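On the graceful-shutdown half of the question: the JDBC sink closes its connections when tasks finish or fail, but a hard kill of the driver never gives the executors that chance. One hedged approach is to guarantee that SparkSession.stop() always runs, so a failure or controlled interrupt tears the session down (and lets the driver-side teardown propagate) instead of orphaning db sessions. A minimal sketch, shown in Python for brevity; the wrapper name is illustrative:

```python
import contextlib

@contextlib.contextmanager
def spark_session_scope(spark):
    """Run a job body and always call spark.stop() afterwards, so that a
    failure (or a controlled shutdown signal) tears the session down
    cleanly instead of leaving orphaned JDBC sessions on the database."""
    try:
        yield spark
    finally:
        spark.stop()

# Usage (sketch):
# with spark_session_scope(spark):
#     run_jdbc_write(spark)   # the write stage shown above (hypothetical wrapper)
```

Combined with the `queryTimeout` / `cancelQueryTimeout` options already in the snippet, this at least bounds how long a hung statement can hold a server-side session.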

How and where are JDBC connections created in a Spark Structured Streaming job's foreachBatch loop?

I have a Spark Structured Streaming job which reads from Kafka and writes the DataFrame to Oracle inside a foreachBatch loop. My code is below. I understand that the number of parallel connections depends on the numPartitions configuration, but I am confused about how connections are reused across executors, tasks, and micro-batches.
1. If a connection is made once on each executor, will it remain open for future micro-batches, or will a new connection be established on each iteration?
2. If a connection is made for each task inside an executor (e.g. 10 tasks, 10 connections), does that mean new connections are established for every loop iteration and its tasks?
StreamingDF.writeStream
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .foreachBatch { (microBatchDF: DataFrame, batchID: Long) =>
    microBatchDF.write
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "my_schema.my_table")
      .option("user", username)
      .option("password", password)
      .option("batchsize", 10000)
      .mode(SaveMode.Overwrite)
      .save()
  }
  .start()
What is the best way to reuse the same connections to minimise batch execution time?
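To answer the reuse part directly: the built-in jdbc sink opens one connection per task (partition) for each micro-batch and closes it when that batch's write finishes; nothing survives into the next micro-batch. The usual way to reuse connections is to write the partitions yourself against a pool that lives for the lifetime of the executor worker process. A minimal sketch, in Python for brevity; `ConnectionPool`, `open_oracle_connection` and `insert_rows` are illustrative stand-ins, not Spark or Oracle APIs:

```python
import queue

class ConnectionPool:
    """Minimal illustrative pool: hands out pre-opened connections and
    takes them back instead of closing them."""
    def __init__(self, connect, size=4):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(connect())

    def acquire(self):
        return self._q.get()

    def release(self, conn):
        self._q.put(conn)

_pool = None  # module-level => one pool per executor worker process

def get_pool(connect):
    """Create the pool lazily, once per worker process; later tasks and
    later micro-batches on the same executor reuse it."""
    global _pool
    if _pool is None:
        _pool = ConnectionPool(connect)
    return _pool

# Inside foreachBatch, instead of microBatchDF.write.format("jdbc")...:
# def write_partition(rows):
#     conn = get_pool(open_oracle_connection).acquire()  # hypothetical factory
#     try:
#         insert_rows(conn, rows)                        # hypothetical batched insert
#     finally:
#         get_pool(open_oracle_connection).release(conn)
# microBatchDF.foreachPartition(write_partition)
```

The design point is that the pool is keyed to the worker process, not to the batch, which is what makes reuse across micro-batches possible at all.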

How to distribute JDBC jar on Cloudera cluster?

I've just installed a new Spark 2.4 from CSD on my CDH cluster (28 nodes) and am trying to install a JDBC driver in order to read data from a database from within a Jupyter notebook.
I downloaded it and copied it to the /jars folder on one node, but it seems I would have to do the same on every host (!). Otherwise I get the following error from one of the workers:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Is there an easy way (without writing bash scripts) to distribute the jar files across the whole cluster? I wish Spark could distribute them itself (or maybe it can and I don't know how).
Spark has a JDBC format reader you can use.
Launch a Scala shell to confirm your MS SQL Server driver is on your classpath.
example
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it on your classpath when you initialize your session.
example
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Connect to your MS SQL Server via Spark JDBC
example via Spark Python (shown with a PostgreSQL URL; substitute your SQL Server JDBC URL and driver)
# option1
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
# option2
jdbcDF2 = spark.read \
.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
properties={"user": "username", "password": "password"})
Specifics and additional ways to build connection strings can be found here:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You mentioned Jupyter... if you still cannot get the above to work, try setting some env vars via this post (I cannot confirm whether this works, though):
https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6
At the end of the day, all you really need is the driver class placed on an edge node (the client where you launch Spark) and appended to your classpath. Then make the connection and partition your DataFrame to scale read performance, since a plain JDBC read from an RDBMS is single-threaded, i.e. one partition.
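That last point about single-threaded reads can be addressed with Spark's built-in partitioned-read options: given a numeric partitionColumn (ideally indexed) plus bounds, the reader issues numPartitions parallel queries, one per slice of the range. A sketch in PySpark; the column name and bounds are illustrative:

```python
def partitioned_read_options(column, lower, upper, num_partitions):
    """Options that make Spark issue num_partitions parallel JDBC queries,
    each covering a slice of [lower, upper] on a numeric column, instead
    of one single-threaded read."""
    return {
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }

# Usage with the reader from option 1 above (URL and column are assumptions):
# jdbcDF = (spark.read.format("jdbc")
#           .option("url", "jdbc:postgresql:dbserver")
#           .option("dbtable", "schema.tablename")
#           .options(**partitioned_read_options("id", 1, 10_000_000, 16))
#           .option("user", "username")
#           .option("password", "password")
#           .load())
```

Note that lowerBound/upperBound only control how the range is split across partitions; rows outside the bounds are still read, just by the edge partitions.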

Spark dropping executors while reading HDFS file

I'm observing a behavior where spark job drops executors while reading data from HDFS. Below is the configuration for spark shell.
spark-shell \
  --executor-cores 5 \
  --conf spark.shuffle.compress=true \
  --executor-memory 4g \
  --driver-memory 4g \
  --num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query spins up ~40,000 tasks. During execution, the number of running tasks starts at 500, slowly drops to ~0 (even though I have enough resources), and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and find possible ways to avoid it. The drop and spike happen only in the read stage; all intermediate stages run in parallel without such huge swings.
I'll be happy to provide any missing information.

How to run 2 processes in parallel in Scala and Spark?

Env: Spark 1.6, Scala
Hi
I need to run two processes in parallel. The first one receives data, and the second one transforms it and saves it into a Hive table. I want to repeat the first process at an interval of 1 minute and the second at an interval of 2 minutes.
========== First process: executes once per minute ==========
val DFService = hivecontext.read
.format("jdbc")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("url", "jdbc:sqlserver://xx.x.x.xx:xxxx;database=myDB")
.option("dbtable", "(select Service_ID,ServiceIdentifier from myTable ) tmp")
.option("user", "userName")
.option("password", "myPassword")
.load()
DFService.registerTempTable("memTBLService")
DFService.write.mode("append").saveAsTable("hiveTable")
========== Second process: executes once per 2 minutes ==========
var DF2 = hivecontext.sql("select * from hiveTable")
var data=DF2.select(DF2("Service_ID")).distinct
data.show()
How can I run these two processes in parallel, at the desired intervals, in Scala?
Thanks
Hossain
Write two separate Spark applications.
Then use cron to execute each app on your desired schedule. Alternatively, you can use Apache Airflow to schedule your Spark apps.
Refer to the following question for how to use cron with Spark: How to schedule my Apache Spark application to run everyday at 00.30 AM(night) in IBM Bluemix?
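For completeness, a hedged sketch of what the crontab entries might look like; all paths, class names and jar names here are hypothetical:

```
# m    h dom mon dow  command
# First process: every minute
*    * *   *   *      /opt/spark/bin/spark-submit --class IngestApp    /apps/ingest.jar    >> /var/log/ingest.log    2>&1
# Second process: every 2 minutes
*/2  * *   *   *      /opt/spark/bin/spark-submit --class TransformApp /apps/transform.jar >> /var/log/transform.log 2>&1
```

Be aware that each cron firing pays the full JVM and cluster-submission startup cost; at a one-minute cadence that overhead can dominate, which is when a single long-running app with an internal timer, or a scheduler like Airflow as mentioned above, becomes the better fit.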
