spark hive to sql write job hanging - apache-spark

I have a Spark app that writes data from Hive to a SQL Server database. The problem is that although data gets written into the database, the job never completes in Spark. Sometimes only part of the data gets written, sometimes all of it; I have not been able to identify a pattern.
Here is the write stage:
finalReordered_DF
  .sample(jdbc_df_sample) // using sample() to test whether the job hangs at a particular dataset size
  .coalesce(8)
  .write
  .format(jdbc_format)
  .mode(jdbc_write_save_mode)
  .option("url", url)
  .option("user", jdbc_user)
  .option("password", secretKey)
  .option("driver", jdbc_write_driver) // com.microsoft.sqlserver.jdbc.SQLServerDriver
  .option("sessionInitStatement", "SET XACT_ABORT ON;")
  .option("numPartitions", "8")
  .option("UseInternalTransaction", "true")
  .option("queryTimeout", jdbc_write_query_timeout) // 300
  .option("dbtable", jdbc_write_tgt_tbl)
  .option("tableLock", jdbc_write_tableLock) // false
  .option("batchsize", jdbc_write_batch_size) // 200,000
  .option("cancelQueryTimeout", jdbc_write_cancel_query_timeout) // 1,800
  .save()
So there's no pattern I can identify. The more critical aspect is that if the job hangs for too long I have to kill it, which leaves the session open on the database and makes the target table inaccessible (I don't have permissions to kill a running session, by the way), so I'm stuck. The DBA team says there's nothing wrong at the DB level and that the only thing off is the open sessions (I am aware that killing the job means the session does not get closed properly).
And my question is: what's the proper way to handle the session from within Spark? How do I gracefully stop the application in a way that lets the JDBC driver close the session cleanly?
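For illustration, the only idea I have so far is a JVM shutdown hook that stops the SparkSession before the process exits (a rough sketch below, assuming the session variable is named spark); I don't know whether that is actually enough for the JDBC driver to close the server-side session.
// Hypothetical cleanup hook: runs on SIGTERM or a normal exit, but a kill -9 bypasses it.
sys.addShutdownHook {
  // Stopping the SparkContext tears down the executors, which is the point at
  // which the JDBC connections they hold can be closed in an orderly way.
  spark.stop()
}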
Thank you!

Related

Consistent SQL database snapshot using Spark

I am trying to export a snapshot of a PostgreSQL database to Parquet files using Spark.
I am dumping each table in the database to a separate Parquet file.
tables_names = ["A", "B", "C", ...]

for table_name in tables_names:
    table = (spark.read
             .format("jdbc")
             .option("driver", driver)
             .option("url", url)
             .option("dbtable", table_name)
             .option("user", user)
             .load())
    table.write.mode("overwrite").saveAsTable(table_name)
The problem, however, is that I need the tables to be consistent with each other.
Ideally, the table loads would be executed in a single transaction so they all see the same version of the database.
The only solution I can think of is to select all tables in a single query using UNION/JOIN, but then I would need to identify each table's columns, which is something I am trying to avoid.
Unless you force all future connections to the database (not the instance) to be read-only and terminate those in flight, by setting the PostgreSQL configuration parameter default_transaction_read_only to true, then no, you cannot get a consistent snapshot with the per-table approach in your code.
Note that a session can override the global setting.
That means your second option (a single query with UNION/JOIN) will work thanks to MVCC, but it is not elegant, and how will it perform from a Spark context over JDBC?
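If you do go the read-only route, a rough sketch of flipping that flag over plain JDBC before starting the Spark reads might look like this (URL, database name and credentials are placeholders):
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/mydb", "user", "password")
try {
  val stmt = conn.createStatement()
  // Affects new sessions on this database only, not the whole instance;
  // an existing session can still override it with SET transaction_read_only = off.
  stmt.execute("ALTER DATABASE mydb SET default_transaction_read_only = true")
} finally {
  conn.close()
}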

Processing an entire SQL table via JDBC by streaming or batching on a constrained environment

I am trying to set up a pipeline for processing entire SQL tables one by one, with the initial ingestion happening through JDBC. I need higher-level processing capabilities such as the ones available in Apache Spark or Flink, and I would like to use existing capabilities rather than write my own, although that may turn out to be inevitable. I need to be able to execute this pipeline on a constrained setup (potentially a single laptop). Please note that I am not talking about CDC here; I just want to batch-process an existing table in a way that won't OOM a single machine.
As a trivial example, I have a table in SQL Server that's 500GB. I want to break it down into smaller chunks that fit into the 16GB-32GB of memory available on a reasonably modern laptop, apply a transformation function to each row, and then forward the results into a sink.
Some of the available solutions that seem close to doing what I need:
Apache Spark partitioned reads:
spark.read
  .format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("partitionColumn", id)
  .option("lowerBound", min)
  .option("upperBound", max)
  .option("numPartitions", 10)
  .option("fetchsize", 1000)
  .option("dbtable", query)
  .option("user", "username")
  .option("password", "password")
  .load()
It looks like I can even repartition the datasets further after the initial read.
Problem is, in a local execution mode I expect the entire table to be partitioned across multiple CPU cores which will all try to load their respective chunk into memory, OOMing the whole business.
Is there a way to throttle the reading jobs so that only as many execute as can fit in memory? Can I force jobs to run sequentially?
Could I perhaps partition the table into much smaller chunks, many more than there are cores, so that only a small amount is processed at once (see the sketch after these questions)? Wouldn't that hamper everything with endless task scheduling, etc.?
If I wanted to write my own source for streaming into Spark, would that alleviate my memory woes? Does something like this help me?
Does Spark's memory management kick into play here at all? Why does it need to load the entire partition into memory at once during the read?
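To make the many-small-chunks idea concrete, here is roughly what I mean (a sketch with placeholder connection details, table name and bounds; the local[2] master is what does the throttling, since at most two partitions are fetched and processed at a time):
import org.apache.spark.sql.SparkSession

// Sketch only: placeholder URL, driver, table and bounds.
val url    = "jdbc:sqlserver://host;databaseName=mydb"
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

val spark = SparkSession.builder()
  .master("local[2]")                  // at most two partitions in flight at once
  .appName("chunked-jdbc-read")
  .getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("dbtable", "dbo.big_table")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")           // placeholder bounds covering the id range
  .option("upperBound", "1000000000")
  .option("numPartitions", "500")      // many more chunks than cores
  .option("fetchsize", "1000")         // rows pulled per JDBC round trip
  .option("user", "username")
  .option("password", "password")
  .load()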
I looked at Apache Flink as an alternative as the streaming model is perhaps a little more appropriate here. Here's what it offers in terms of JDBC:
JDBCInputFormat.buildJDBCInputFormat()
  .setDrivername("com.mysql.jdbc.Driver")
  .setDBUrl("jdbc:mysql://localhost/log_db")
  .setUsername("username")
  .setPassword("password")
  .setQuery("select id, something from SOMETHING")
  .setRowTypeInfo(rowTypeInfo)
  .finish()
However, it seems like this is also designed for batch processing and still attempts to load everything into memory.
How would I go about wrangling Flink to stream micro-batches of SQL data for processing?
Could I potentially write my own streaming source that wraps the JDBC input format?
Is it safe to assume that OOMs do not happen with Flink unless some state/accumulators become too big?
I also saw that Kafka has JDBC connectors but it looks like it is not really possible to run it locally (i.e. same JVM) like the other streaming frameworks. Thank you all for the help!
It's true that with Flink, input formats are only intended to be used for batch processing, but that shouldn't be a problem. Flink does batch processing one event at a time, without loading everything into memory. I think what you want should just work.

Spark MySQL JDBC hangs without errors

I have a simple application ingesting MySQL tables into S3 running on Spark 3.0.0 (EMR 6.1).
The MySQL table is loaded using a single large executor with 48G of memory as follows:
spark.read \
    .option("url", jdbc_url) \
    .option("dbtable", query) \
    .option("driver", driver_class) \
    .option("user", secret['username']) \
    .option("password", secret['password']) \
    .format("jdbc").load()
The Spark jobs work without any problem on small tables where the MySQL query takes less than ~6 minutes to complete. However, in two jobs where the query goes beyond this time, the Spark job gets stuck in the RUNNING state and never raises an error. The stderr and stdout logs don't show any progress, and the executor is completely healthy.
The DAG is very simple:
In MySQL (Aurora RDS) the query seems to have completed, but the connection is still open, and the thread state shows 'cleaned up'.
I have tried MySQL connector versions 5 and 8, but both show the same behaviour. I guess this could be related to some default timeout configuration, but I would like some guidance.
This is a Thread Dump of the single executor:

What is the purpose of StreamingQuery.awaitTermination?

I have a Spark Structured Streaming job; it reads offsets from a Kafka topic and writes them to an Aerospike database. Currently I am in the process of making this job production-ready and implementing a SparkListener.
While going through the documentation I stumbled upon this example:
StreamingQuery query = wordCounts.writeStream()
    .outputMode("complete")
    .format("console")
    .start();

query.awaitTermination();
After this code is executed, the streaming computation will have started in the background. The query object is a handle to that active streaming query, and we have decided to wait for the termination of the query using awaitTermination() to prevent the process from exiting while the query is active.
I understand that it waits for the query to complete before terminating the process, but what does that mean exactly? Does it help to avoid losing data written by the query?
How is it helpful when the query is writing millions of records every day?
My code looks pretty simple though:
dataset.writeStream()
    .option("startingOffsets", "earliest")
    .outputMode(OutputMode.Append())
    .format("console")
    .foreach(sink)
    .trigger(Trigger.ProcessingTime(triggerInterval))
    .option("checkpointLocation", checkpointLocation)
    .start();
There are quite a few questions here, but answering just the one below should answer all.
I understand that it waits for query to complete before terminating the process. What does it mean exactly?
A streaming query runs in a separate daemon thread. In Java, daemon threads are used to allow for parallel processing until the main thread of your Spark application finishes (dies). Right after the last non-daemon thread finishes, the JVM shuts down and the entire Spark application finishes.
That's why you need to keep the main non-daemon thread waiting for the other daemon threads so they can do their work.
Read up on daemon threads in What is a daemon thread in Java?
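A minimal Scala sketch of the difference (assuming a streaming DataFrame named wordCounts already exists): without the final blocking call, main would return right after start() and the JVM would shut down, taking the daemon threads, and the query with them.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()               // returns immediately; the query runs on daemon threads

query.awaitTermination() // blocks the main (non-daemon) thread until the query stops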
I understand that it waits for query to complete before terminating the process.
What does it mean exactly
Nothing more, nothing less. Since the query is started in the background, without an explicit blocking instruction your code would simply reach the end of the main function and exit immediately.
How is it helpful when query is writing millions of records every day?
It really doesn't. It instead ensures that the query executes at all.

How to control worker transactions with jdbc data source?

When I use Spark to delete (or update) and then insert, I want either everything to succeed or everything to fail.
Since a Spark application is distributed across many JVMs, how can I keep the transaction synchronized across every worker?
// DELETE: BEGIN
Class.forName("oracle.jdbc.OracleDriver");   // Oracle JDBC driver class
Connection conn = DriverManager.getConnection(DB_URL, USER, PASS);
String query = "delete from users where id = ?";
PreparedStatement preparedStmt = conn.prepareStatement(query);
preparedStmt.setInt(1, 3);
preparedStmt.execute();
// DELETE: END

val jdbcDF = spark
  .read
  .jdbc("DB_URL", "schema.tablename", connectionProperties)
  .write
  .format("jdbc")
  .option("url", "DB_URL")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()
tl;dr You can't.
Spark is a fast and general engine for large-scale data processing (i.e. a multi-threaded, distributed computing platform), and its main selling point is that you can, and will, run many tasks simultaneously to process your massive datasets faster (and perhaps even cheaper).
JDBC is not a very suitable data source for Spark, as you are limited by the capacity of your JDBC database. That's why many people migrate from JDBC databases to HDFS, Cassandra or similar data stores, where thousands of connections are not much of an issue (not to mention other benefits like partitioning your datasets before Spark ever touches the data).
You can tune the JDBC source with configuration parameters (e.g. partitionColumn, lowerBound, upperBound, numPartitions, fetchsize, batchsize or isolationLevel) that give you some flexibility, but wishing to "synchronize transactions" across workers is outside the scope of Spark.
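For example, a write tuned with a couple of those options might look like the sketch below (assuming df is the DataFrame to be written); note that each task still commits its own partition independently, which is why cross-worker atomicity stays out of reach.
df.write
  .format("jdbc")
  .option("url", "DB_URL")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .option("batchsize", "10000")                // rows per batched INSERT
  .option("isolationLevel", "READ_COMMITTED")  // isolation level of each partition's transaction
  .save()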
Use JDBC directly instead (just like you did for DELETE).
Note that the code between DELETE: BEGIN and DELETE: END is executed on the driver (on a single thread).
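A rough sketch of what "use JDBC directly" could look like for the delete-plus-insert case, with both statements in a single driver-side transaction (table and column names are placeholders):
import java.sql.DriverManager

val conn = DriverManager.getConnection(DB_URL, USER, PASS)
try {
  conn.setAutoCommit(false)                    // open one transaction for both statements

  val del = conn.prepareStatement("delete from users where id = ?")
  del.setInt(1, 3)
  del.execute()

  val ins = conn.prepareStatement("insert into users (id, name) values (?, ?)")
  ins.setInt(1, 3)
  ins.setString(2, "replacement row")
  ins.execute()

  conn.commit()                                // both changes become visible together...
} catch {
  case e: Exception =>
    conn.rollback()                            // ...or neither is applied
    throw e
} finally {
  conn.close()
}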
