When using Spark to run a DELETE (or UPDATE) followed by an INSERT, I want them to either all succeed or all fail.
And since a Spark application is distributed across many JVMs, how can I make every worker participate in one synchronized transaction?
// DELETE: BEGIN
// (runs on the driver over a single JDBC connection; java.sql imports assumed)
Class.forName("oracle.jdbc.OracleDriver");
Connection conn = DriverManager.getConnection(DB_URL, USER, PASS);
String query = "delete from users where id = ?";
PreparedStatement preparedStmt = conn.prepareStatement(query);
preparedStmt.setInt(1, 3);
preparedStmt.execute();
// DELETE: END
val jdbcDF = spark
  .read
  .jdbc("DB_URL", "schema.tablename", connectionProperties)
  .write
  .format("jdbc")
  .option("url", "DB_URL")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()
tl;dr You can't.
Spark is a fast and general engine for large-scale data processing (i.e. a multi-threaded, distributed computing platform), and its main selling point is that you can (and surely will) run many tasks simultaneously to process your massive datasets faster (and perhaps even cheaper).
JDBC is not a very suitable data source for Spark, because you are limited by the capacity of your JDBC database. That's why many people migrate from JDBC databases to HDFS, Cassandra, or similar data stores, where thousands of connections are not much of an issue (not to mention other benefits like partitioning your datasets before Spark even touches the data).
You can control the JDBC source with some configuration parameters (e.g. partitionColumn, lowerBound, upperBound, numPartitions, fetchsize, batchsize or isolationLevel) that give you some flexibility, but wishing to "synchronize transactions" across workers is outside the scope of Spark.
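For illustration, a minimal sketch of how those knobs appear on a JDBC write, assuming df is the DataFrame to be written and DB_URL and the credentials are placeholders. Note that each task still commits its own batch independently, so these options tune throughput and per-task isolation, not atomicity across tasks:
df.write
  .format("jdbc")
  .option("url", DB_URL)
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .option("batchsize", "10000")               // rows per JDBC batch insert
  .option("isolationLevel", "READ_COMMITTED") // isolation level of each task's own transaction
  .mode("append")
  .save()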
Use JDBC directly instead (just like you did for DELETE).
Note that the code between DELETE: BEGIN and DELETE: END is executed on the driver (on a single thread).
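For the atomic delete-plus-insert itself, here is a minimal sketch on the driver, assuming the same DB_URL/USER/PASS as above and an illustrative users(id, name) table: wrap both statements in one JDBC transaction so they either both commit or both roll back.
import java.sql.DriverManager

val conn = DriverManager.getConnection(DB_URL, USER, PASS)
try {
  conn.setAutoCommit(false)                    // open a single transaction

  val del = conn.prepareStatement("delete from users where id = ?")
  del.setInt(1, 3)
  del.executeUpdate()

  val ins = conn.prepareStatement("insert into users (id, name) values (?, ?)")
  ins.setInt(1, 3)
  ins.setString(2, "someName")                 // illustrative values
  ins.executeUpdate()

  conn.commit()                                // both changes become visible together
} catch {
  case e: Exception =>
    conn.rollback()                            // neither change is applied
    throw e
} finally {
  conn.close()
}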
Related
I am trying to export a snapshot of a PostgreSQL database to Parquet files using Spark.
I am dumping each table in the database to a separate Parquet file.
tables_names = ["A", "B", "C", ...]

for table_name in tables_names:
    table = (spark.read
             .format("jdbc")
             .option("driver", driver)
             .option("url", url)
             .option("dbtable", table_name)
             .option("user", user)
             .load())

    table.write.mode("overwrite").saveAsTable(table_name)
The problem, however, is that I need the tables to be consistent with each other.
Ideally, the table loads should be executed in a single transaction so they see the same version of the database.
The only solution I can think of is to select all tables in a single query using UNION/JOIN, but then I would need to identify each table's columns, which is something I am trying to avoid.
Unless you force all future connections to the database (not just your Spark instance) to be read-only and terminate those already in flight, by setting the PostgreSQL configuration parameter default_transaction_read_only to true, then no, you cannot get a consistent snapshot with the discrete per-table approach in your code.
Note that a session can override that global setting.
This means your second option (a single query with UNION/JOIN) would work thanks to MVCC, but it is not elegant, and JDBC performance from a Spark context is questionable.
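If you do go the database-level route, here is a hedged sketch (in Scala; the database name, URL and credentials are placeholders) of flipping default_transaction_read_only for new connections before the export. Sessions already connected are unaffected, and any session can still turn it back off:
import java.sql.DriverManager

val admin = DriverManager.getConnection(
  "jdbc:postgresql://dbhost:5432/mydb", "admin_user", "admin_password")
try {
  // New sessions to mydb will now start their transactions read-only by default.
  admin.createStatement()
    .execute("ALTER DATABASE mydb SET default_transaction_read_only = on")
} finally {
  admin.close()
}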
I've inherited some code that runs incredibly slowly on AWS Glue.
Within the job it creates a number of dynamic frames that are then joined using spark.sql. Tables are read from MySQL and Postgres databases, and then Glue is used to join them together and finally write another table back to Postgres.
Example (note: DBs etc. have been renamed and simplified, as I can't paste my actual code directly):
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
jobName = args['JOB_NAME']
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(jobName, args)

# MySQL
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "trans").toDF().createOrReplaceTempView("trans")
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "types").toDF().createOrReplaceTempView("types")
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "currency").toDF().createOrReplaceTempView("currency")

# DB2 (Postgres)
glueContext.create_dynamic_frame.from_catalog(database = "db2", table_name = "watermark").toDF().createOrReplaceTempView("watermark")

# transactions
new_transactions_df = spark.sql("[SQL CODE HERE]")

# Write to DB
conf_g = glueContext.extract_jdbc_conf("My DB")
url = conf_g["url"] + "/reporting"
new_transactions_df.write.option("truncate", "true").jdbc(url, "staging.transactions", properties=conf_g, mode="overwrite")
The [SQL CODE HERE] is literally a simple select statement joining the three tables together to produce an output that is then written to the staging.transactions table.
When I last ran this it only wrote 150 rows but took 9 minutes to do so. Can somebody please point me in the direction of how to optimise this?
Additional info:
Maximum capacity: 6
Worker type: G.1X
Number of workers: 6
Generally, when reading/writing data in Spark using JDBC drivers, the common issue is that the operations aren't parallelized. Here are some optimizations you might want to try:
Specify parallelism on read
From the code you provided, it seems that all the table data is read using one query and one Spark executor.
If you use the Spark DataFrame reader directly, you can set the options partitionColumn, lowerBound, upperBound and fetchsize to read multiple partitions in parallel using multiple workers, as described in the docs. Example:
(spark.read.format("jdbc")
    # ...
    .option("partitionColumn", "partition_key")
    .option("lowerBound", "<lb>")
    .option("upperBound", "<ub>")
    .option("numPartitions", "<np>")
    .option("fetchsize", "<fs>"))
When using read partitioning, note that Spark will issue multiple queries in parallel, so make sure the database engine supports it, and optimize indexes (especially on the partition column) to avoid full table scans.
In AWS Glue, this can be done by passing additional options using the parameter additional_options:
To use a JDBC connection that performs parallel reads, you can set the
hashfield, hashexpression, or hashpartitions options:
glueContext.create_dynamic_frame.from_catalog(
    database = "db1",
    table_name = "trans",
    additional_options = {"hashfield": "transID", "hashpartitions": "10"}
).toDF().createOrReplaceTempView("trans")
This is described in the Glue docs: Reading from JDBC Tables in Parallel
Using the batchsize option when writing:
In your particular case I am not sure this will help, as you write only 150 rows, but you can specify this option to improve writing performance:
(new_transactions_df.write.format('jdbc')
    # ...
    .option("batchsize", "10000")
    .save())
Push down optimizations
You can also optimize reading by pushing down part of the query (filters, column selection) directly to the database engine, instead of loading the entire table into a dynamic frame and then filtering.
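With the plain Spark JDBC reader, one way to do this (a sketch in Scala, with illustrative table and column names) is to read from a subquery instead of the whole table, so the filter and column selection run inside the database:
val trans = spark.read
  .format("jdbc")
  .option("url", url)
  .option("user", user)
  .option("password", password)
  // The database executes the subquery, so only matching rows/columns cross the wire.
  .option("dbtable",
    "(select transID, transDate, transStatus from trans " +
    "where transDate > '2021-01-01' and transStatus = 'OK') t")
  .load()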
In Glue, this can be done using push_down_predicate parameter:
glueContext.create_dynamic_frame.from_catalog(
    database = "db1",
    table_name = "trans",
    push_down_predicate = "(transDate > '2021-01-01' and transStatus='OK')"
).toDF().createOrReplaceTempView("trans")
See Glue programming ETL partitions pushdowns
Using database utilities to bulk insert / export tables
In some cases, you could consider exporting tables into files using the database engine and then reading from those files. The same applies when writing: first write to a file, then use the database's bulk insert command. This can avoid the bottleneck of using Spark with JDBC.
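As a hedged sketch of that idea for the Postgres side (in Scala; the URL, credentials, table and output path are placeholders), the PostgreSQL JDBC driver exposes the COPY command through CopyManager, which you could use to dump a table to CSV and then read the file with Spark:
import java.io.FileWriter
import java.sql.DriverManager
import org.postgresql.PGConnection

val conn = DriverManager.getConnection(
  "jdbc:postgresql://dbhost:5432/reporting", "user", "password")
try {
  val copy = conn.unwrap(classOf[PGConnection]).getCopyAPI
  val out = new FileWriter("/tmp/transactions.csv")
  try {
    // COPY streams the table out server-side, bypassing row-by-row JDBC reads.
    copy.copyOut("COPY staging.transactions TO STDOUT WITH (FORMAT csv, HEADER)", out)
  } finally {
    out.close()
  }
} finally {
  conn.close()
}

// Then read it back, e.g.: spark.read.option("header", "true").csv("/tmp/transactions.csv")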
The Glue Spark cluster usually takes about 10 minutes just for startup, so that time (9 minutes) seems reasonable (unless you run Glue 2.0, but you didn't specify the Glue version you are using).
https://aws.amazon.com/es/about-aws/whats-new/2020/08/aws-glue-version-2-featuring-10x-faster-job-start-times-1-minute-minimum-billing-duration/#:~:text=With%20Glue%20version%202.0%2C%20job,than%20a%2010%20minute%20minimum.
Enable Metrics:
AWS Glue provides Amazon CloudWatch metrics that can be used to provide information about the executors and the amount of work done by each executor. You can enable CloudWatch metrics on your AWS Glue job by doing one of the following:
Using a special parameter: Add the following argument to your AWS Glue job. This parameter allows you to collect metrics for job profiling for your job run. These metrics are available on the AWS Glue console and the CloudWatch console.
Key: --enable-metrics
Using the AWS Glue console: To enable metrics on an existing job, do the following:
Open the AWS Glue console.
In the navigation pane, choose Jobs.
Select the job that you want to enable metrics for.
Choose Action, and then choose Edit job.
Under Monitoring options, select Job metrics.
Choose Save.
Courtesy: https://softans.com/aws-glue-etl-job-running-for-a-long-time/
Greetings,
I have created a Spark 2.1.1 cluster in Amazon EC2 with instance type m4.large, with 1 master and 5 slaves to start. My PostgreSQL 9.5 database (t2.large) has a table of over 2 billion rows and 7 columns that I would like to process. I have followed the directions from the Apache Spark website and various other sources on how to connect to and process these data.
My problem is that Spark SQL performance is way slower than my database. My SQL statement (see below in the code) takes about 21 minutes in PSQL, but Spark SQL takes about 42 minutes to finish. My main goal is to measure the performance of PSQL vs Spark SQL, and so far I am not getting the desired results. I would appreciate the help.
Thank you
I have tried increasing fetchsize from 10000 to 100000, caching the dataframe, increasing numPartitions to 100, setting spark.sql.shuffle.partitions to 2000, doubling my cluster size, and using a larger instance type, and so far I have not seen any improvements.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL")
  .getOrCreate()

val jdbcDF = spark.read.format("jdbc")
  .option("url", DBI_URL)
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "ghcn_all")
  .option("fetchsize", 10000)
  .load()
  .createOrReplaceTempView("ghcn_all")

val sqlStatement = """
  SELECT ghcn_date, element_value/10.0
  FROM ghcn_all
  WHERE station_id = 'USW00094846'
    AND (ghcn_date >= '2015-01-01' AND ghcn_date <= '2015-12-31')
    AND qflag IS NULL
    AND element_type = 'PRCP'
  ORDER BY ghcn_date"""

val sqlDF = spark.sql(sqlStatement)

val start: Long = System.nanoTime
val num_rows: Long = sqlDF.count()
val end: Long = System.nanoTime

println("Total Row : " + num_rows)
println("Total Collect Time Lapse : " + ((end - start) / 1000000) + " ms")
There is no good reason for this code to ever run faster on Spark than on the database alone. First of all, it is not even distributed, as you made the same mistake as many before you and don't partition the data.
But more important is that you actually load data from the database: as a result, the database has to do at least as much work (and in practice more), then send the data over the network, and then the data has to be parsed by Spark and processed. You are basically doing far more work and expecting things to be faster; that's not going to happen.
If you want to reliably improve performance on Spark you should at least:
Extract data from the database.
Write to efficient distributed storage (not S3, for example).
Use proper bucketing and partitioning to enable partition pruning and predicate pushdown.
Then you might have better luck. But again, proper indexing of your data on the cluster should improve performance as well, likely at a lower overall cost.
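A sketch of that pipeline for this table (bounds, bucket count and column choices are illustrative; note that on older Spark versions the partition column must be numeric, so ghcn_date may need to be mapped to a numeric column): pull the table once in parallel over JDBC, then persist it as partitioned and bucketed Parquet so later queries can prune partitions and push filters down to the files rather than the database.
val ghcn = spark.read.format("jdbc")
  .option("url", DBI_URL)
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "ghcn_all")
  .option("partitionColumn", "ghcn_date")   // column used to split the read into parallel queries
  .option("lowerBound", "2010-01-01")
  .option("upperBound", "2019-12-31")
  .option("numPartitions", 16)
  .load()

ghcn.write
  .partitionBy("element_type")              // enables partition pruning on element_type
  .bucketBy(32, "station_id")               // bucketing requires saveAsTable
  .sortBy("ghcn_date")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("ghcn_all_parquet")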
It is very important to set partitionColumn when you read from a SQL database; it is used to run the read as parallel queries, so you should decide which column to use as your partitionColumn.
In your case, for example:
val jdbcDF = spark.read.format("jdbc")
  .option("url", DBI_URL)
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "ghcn_all")
  .option("fetchsize", 10000)
  .option("partitionColumn", "ghcn_date")
  .option("lowerBound", "2015-01-01")
  .option("upperBound", "2015-12-31")
  .option("numPartitions", 16)
  .load()
  .createOrReplaceTempView("ghcn_all")
More Reference:
How Apache Spark Makes Your Slow MySQL Queries 10x Faster (or More)
Tips for using JDBC in Apache Spark SQL
I use Spark DataFrameReader to perform SQL queries against a database. For each query performed, a SparkSession is required. What I would like to do is: for each of the JavaPairRDDs, perform a map that invokes a SQL query with parameters from that RDD. This means I would need to pass the SparkSession into each lambda, which seems like bad design. What is the common approach to such problems?
It could look like:
roots.map(r -> DBLoader.getData(sparkSession, r._1));
How I load data now:
JavaRDD<Row> javaRDD = sparkSession.read().format("jdbc")
.options(options)
.load()
.javaRDD();
The purpose of Big Data is data locality: being able to execute your code where your data resides. It is fine to do one big load of a table into memory or local disk (cache/persist), but continuous remote JDBC queries will defeat the purpose.
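A sketch of that approach in Scala (column names are illustrative, and it assumes roots is an RDD of pairs keyed by id, as in the question): load the JDBC table once, cache it, and join it with the keys instead of issuing one remote query per element.
import spark.implicits._

val lookup = spark.read.format("jdbc")
  .options(options)        // same JDBC options as above
  .load()
  .cache()                 // keep the table on the cluster after the first load

// One distributed join instead of one JDBC round trip per key.
val keys   = roots.map(_._1).toDF("id")
val joined = keys.join(lookup, "id")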
We are migrating data from Oracle to Cassandra as part of a daily ETL process. I would like to perform data validation between the two databases once the Spark jobs are complete, to ensure that both databases are in sync. We are using DSE 5.1. Could you please provide your valuable inputs on how to ensure the data has been properly migrated?
I assume you have DSE Max with Spark support.
SparkSQL should suit this best.
First you connect to Oracle with JDBC
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#jdbc-to-other-databases
I have no Oracle DB, so the following code is not tested; check the JDBC URL and drivers before running it:
dse spark --driver-class-path ojdbc7.jar --jars ojdbc7.jar
scala> val oData = spark.read
         .format("jdbc")
         .option("url", "jdbc:oracle:thin:hr/hr@//localhost:1521/pdborcl")
         .option("dbtable", "schema.tablename")
         .option("user", "username")
         .option("password", "password")
         .load()
C* data is already mapped to a SparkSQL table, so:
scala> val cData = spark.sql("select * from keyspace.table")
You will need to check the schemas of both and the data conversion details to compare the tables properly. A simple integrity check that all data from Oracle exists in C*:
scala> oData.except(cData).count
res0: Long = 0
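For a two-way comparison you may also want to check the opposite direction (rows present in C* but absent from Oracle); both counts should be zero when the tables match:
scala> cData.except(oData).count
res1: Long = 0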