Consistent SQL database snapshot using Spark - apache-spark

I am trying to export a snapshot of a postgresql database to parquet files using Spark.
I am dumping each table in the database to a separate parquet file.
tables_names = ["A", "B", "C", ...]

for table_name in tables_names:
    table = (spark.read
             .format("jdbc")
             .option("driver", driver)
             .option("url", url)
             .option("dbtable", table_name)
             .option("user", user)
             .load())
    table.write.mode("overwrite").saveAsTable(table_name)
The problem, however, is that I need the tables to be consistent with each other.
Ideally, the table loads should be executed in a single transaction so they see the same version of the database.
The only solution I can think of is to select all tables in a single query using UNION/JOIN, but then I would need to identify each table's columns, which is something I am trying to avoid.

Unless you force all future connections to that database (not the whole instance) to be read-only and terminate those already in flight, e.g. by setting the PostgreSQL configuration parameter default_transaction_read_only to true, then no, you cannot get a consistent snapshot with the discrete per-table approach in your code.
Note that a session can override that global setting.
This means your second option (a single query/transaction) would work thanks to MVCC, but it is not elegant, and how would it perform through Spark's JDBC source?
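For completeness, a rough sketch of the "force the database read-only" route, assuming psycopg2 (any PostgreSQL client would do) and sufficient privileges to alter the database and terminate sessions; url, driver, user and tables_names are the names from the question, while the admin connection details are placeholders:

import psycopg2

admin = psycopg2.connect(host="db-host", dbname="mydb", user="admin")  # hypothetical admin connection
admin.autocommit = True
with admin.cursor() as cur:
    # make every NEW transaction in this database read-only by default
    cur.execute("ALTER DATABASE mydb SET default_transaction_read_only = on")
    # terminate in-flight sessions so no pre-existing writer survives
    cur.execute("""
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE datname = 'mydb' AND pid <> pg_backend_pid()
    """)

try:
    # the per-table Spark loop from the question, now against a database that rejects writes
    for table_name in tables_names:
        table = (spark.read.format("jdbc")
                 .option("driver", driver)
                 .option("url", url)
                 .option("dbtable", table_name)
                 .option("user", user)
                 .load())
        table.write.mode("overwrite").saveAsTable(table_name)
finally:
    with admin.cursor() as cur:
        cur.execute("ALTER DATABASE mydb SET default_transaction_read_only = off")
    admin.close()

As noted above, a session can still run SET transaction_read_only = off locally, so this keeps out well-behaved writers rather than giving a hard guarantee.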

Related

Write data to specific partitions in Azure Dedicated SQL pool

At the moment, we are using the steps in the article below to do a full load of the data from one of our Spark data sources (a Delta Lake table) and write it to a table in SQL DW.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Specifically, the write is carried out using,
df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "<your-table-name>") \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
  .option("maxStrLength", 4000) \
  .mode("overwrite") \
  .save()
Now, our source data, by virtue of being a Delta Lake table, is partitioned on countryid, and we would like to load/refresh only certain partitions in the SQL DWH instead of the full drop-table-and-load (caused by the "overwrite" mode) that is happening now. I tried adding an additional option (partitionBy, countryid) to the above script, but that doesn't seem to work.
Also the above article doesn't mention partitioning.
How do I work around this?
There might be better ways to do this, but this is how I achieved it. If the target Synapse table is partitioned, then we can leverage the "preActions" option provided by the Synapse connector to delete the existing data at that partition, and then append the new data for that partition (read as a dataframe from the source) instead of overwriting the whole table.
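A minimal sketch of that approach, reusing the com.databricks.spark.sqldw options from the question; the countryid value and the exact DELETE statement are illustrative assumptions:

country_id = 5  # the partition being refreshed (hypothetical value)

(df.filter(df.countryid == country_id)  # only the rows of the partition we are refreshing
   .write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "<your-table-name>")
   .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
   # delete the existing rows of that partition first, then append the fresh ones
   .option("preActions", "DELETE FROM <your-table-name> WHERE countryid = {}".format(country_id))
   .option("maxStrLength", 4000)
   .mode("append")
   .save())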

Reading Big Query using Spark BigQueryConnector

I want to read a BigQuery table using the Spark BigQuery connector and pass partition information to it. This works fine, but it reads the full table. I want to filter the data on some partition value. How can I do that? I don't want to read the full table and then apply a filter on the Spark dataset; I want to pass the partition information at read time. Is that even possible?
Dataset<Row> testDS = session.read().format("bigquery")
        .option("table", <TABLE>)
        //.option("partition", <PARTITION>)
        .option("project", <PROJECT_ID>)
        .option("parentProject", <PROJECT_ID>)
        .load();
The filter option works for this: .option("filter", "_PARTITIONTIME = '2020-11-23 13:00:00'")
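For reference, a PySpark sketch of the same read with the filter pushed down at load time; the table and project identifiers are the placeholders from the question:

testDS = (spark.read.format("bigquery")
          .option("table", "<TABLE>")
          .option("project", "<PROJECT_ID>")
          .option("parentProject", "<PROJECT_ID>")
          # prune partitions at read time instead of filtering the full table in Spark
          .option("filter", "_PARTITIONTIME = '2020-11-23 13:00:00'")
          .load())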

Parallel execution of read and write API calls in PySpark SQL

I need to load the incremental records from a set of tables in MySQL to Amazon S3 in Parquet format. These tables are common across several databases/schemas in the AWS managed MySQL instance. The code should copy data from each of the schemas (which share a common set of tables) in parallel.
I'm using the PySpark SQL read API to connect to the MySQL instance and read the data of each table for a schema, and I'm writing the resulting dataframe to S3 as a Parquet file using the write API. I'm running this in a loop for each table in a database, as shown in the code below:
def load_data_to_s3(databases_df):
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for row in databases_df.collect():
        for table in db_query_properties.keys():
            last_recorded_id_value = auto_id_values[table]
            select_sql = "select * from {}.{} where id>{}".format(row.database_name, table, last_recorded_id_value)
            df = spark.read.format("jdbc") \
                .option("driver", mysql_db_properties['driver']) \
                .option("url", row.database_connection_url) \
                .option("dbtable", select_sql) \
                .option("user", username) \
                .option("password", password) \
                .load()
            s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
            df.write.parquet(s3_path, mode="append")
I would like to know how I can scale this code to handle multiple databases in parallel on an EMR cluster. Please suggest a suitable approach, and let me know if more details are required.
I can propose two solutions:
1. Easy way
Submit multiple jobs to your EMR cluster at once (one job per DB). If monitoring is the problem, just have the logs for the failed ones written to S3 or HDFS.
2. Bit of code change required
You could try using threading to parallelize the data pulls from each DB. I can show a sample of how to do it, but you might need to make further changes to suit your use case.
Sample implementation:
import threading

def load_data_to_s3(db_row):
    # one database (one row of databases_df) per thread
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for table in db_query_properties.keys():
        last_recorded_id_value = auto_id_values[table]
        # the JDBC dbtable option expects a table name or a parenthesised subquery with an alias
        select_sql = "(select * from {}.{} where id > {}) as t".format(
            db_row.database_name, table, last_recorded_id_value)
        df = spark.read.format("jdbc") \
            .option("driver", mysql_db_properties['driver']) \
            .option("url", db_row.database_connection_url) \
            .option("dbtable", select_sql) \
            .option("user", username) \
            .option("password", password) \
            .load()
        s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
        df.write.parquet(s3_path, mode="append")

threads = [threading.Thread(target=load_data_to_s3, args=(row,))
           for row in databases_df.collect()]
for t in threads:
    t.start()
for t in threads:
    t.join()
Also, please make sure to change the scheduler to FAIR by setting the spark.scheduler.mode property to FAIR. This will create a thread for each of your DBs. If you want to control the number of threads running in parallel, modify the loop accordingly.
Additionally, if you want to create new jobs from within the program, pass your SparkSession along with the arguments.
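For reference, a minimal sketch of enabling the FAIR scheduler mentioned above; the property has to be set before the SparkContext is created, and the application name here is illustrative:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallel-mysql-to-s3")          # hypothetical app name
         .config("spark.scheduler.mode", "FAIR")   # must be set before the context starts
         .getOrCreate())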
Your list of databases is not parallelized. To do the parallel processing, you should parallelize the list and run the job in parallel using foreach or something similar provided by Spark.
Turn on concurrency for EMR steps and submit one step per table, or use Spark's FAIR scheduler, which can run the jobs in parallel internally with a small modification to your code.

Does spark saveAsTable really create a table?

This may be a dumb question, since I lack some fundamental knowledge of Spark. I tried this:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("foo");
This creates a table under the 'default' database in Hive, and of course, I can fetch data from the table anytime I want.
I updated the above code to get rid of enableHiveSupport:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("bar");
The code runs fine without any error, but when I try "select * from bar", Spark says:
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'bar' not found in database 'default';
So I have 2 questions here,
1) Is it possible to create a 'raw' Spark table, not a Hive table? I know Hive maintains its metadata in a database like MySQL; does Spark have a similar mechanism?
2) In the second code snippet, what does Spark actually create when saveAsTable is called?
Many thanks.
If you want to create a raw table only in Spark, createOrReplaceTempView could help you. For the second part, check the next answer.
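A minimal PySpark sketch of the temp-view suggestion; the view lives only in the current session's catalog and is never written to a metastore:

df = spark.range(10)
df.createOrReplaceTempView("foo")
spark.sql("select * from foo").show()   # resolvable only for the lifetime of this session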
By default, calling saveAsTable on your dataframe persists the table into the Hive metastore if you use enableHiveSupport. If you don't enableHiveSupport, the table is managed by Spark and its data lands under the spark-warehouse location, and you will lose these tables after restarting the Spark session.
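To see this for yourself, a small PySpark sketch (the paths in the comments are the defaults and may differ in your setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()   # note: no enableHiveSupport()
print(spark.conf.get("spark.sql.warehouse.dir"))             # default: file:/<working-dir>/spark-warehouse

spark.range(10).write.saveAsTable("bar")    # parquet files land under the warehouse dir
print(spark.catalog.listTables())           # "bar" is visible now, but the catalog entry is
                                            # kept in memory and is gone after a restart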

How to control worker transactions with jdbc data source?

When I use Spark to delete (or update) and then insert, I need either all of it to succeed or all of it to fail.
And since a Spark application is distributed across many JVMs, how can I synchronize the transactions of every worker?
// DELETE: BEGIN
Class.forName("oracle.jdbc.OracleDriver");
Connection conn = DriverManager.getConnection(DB_URL, USER, PASS);
String query = "delete from users where id = ?";
PreparedStatement preparedStmt = conn.prepareStatement(query);
preparedStmt.setInt(1, 3);
preparedStmt.execute();
// DELETE: END

val jdbcDF = spark
  .read
  .jdbc("DB_URL", "schema.tablename", connectionProperties)

jdbcDF
  .write
  .format("jdbc")
  .option("url", "DB_URL")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()
tl;dr You can't.
Spark is a fast and general engine for large-scale data processing (i.e. a multi-threaded, distributed computing platform), and its main selling point is that you may, and surely will, execute multiple simultaneously running tasks to process your massive datasets faster (and perhaps even cheaper).
JDBC is not a very suitable data source for Spark, as you are limited by the capacity of your JDBC database. That's why many people migrate from JDBC databases to HDFS, Cassandra or similar data stores, where thousands of connections are not much of an issue (not to mention other benefits like partitioning your datasets before Spark even touches the data).
You can control the JDBC source with some configuration parameters (e.g. partitionColumn, lowerBound, upperBound, numPartitions, fetchsize, batchsize or isolationLevel) that give you some flexibility, but wishing to "synchronize transactions" is outside the scope of Spark.
Use JDBC directly instead (just like you did for DELETE).
Note that the code between DELETE: BEGIN and DELETE: END is executed on the driver (on a single thread).
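If the changed rows are small enough to collect to the driver, the advice above could look roughly like the following PySpark sketch; cx_Oracle, the users columns and the DSN connection details are illustrative assumptions, not part of the original code:

import cx_Oracle

# read the source rows with Spark, then apply them in ONE driver-side transaction
rows = (spark.read.jdbc(DB_URL, "schema.tablename", properties=connectionProperties)
        .select("id", "name")        # illustrative columns
        .collect())                  # only sensible for small result sets

conn = cx_Oracle.connect(USER, PASS, DSN)   # autocommit is off by default; DSN is a placeholder
try:
    cur = conn.cursor()
    cur.execute("delete from users where id = :1", [3])
    cur.executemany("insert into users (id, name) values (:1, :2)",
                    [(r["id"], r["name"]) for r in rows])
    conn.commit()        # either everything commits...
except Exception:
    conn.rollback()      # ...or everything is rolled back
    raise
finally:
    conn.close()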
