I'm running the code below and it works fine, but it's super, super, super slow.
I thought it would load records in batches of 500k at a time, but when I run the code and do record counts in SQL Server after the job kicks off, it's updating about 50 records per second. Hopefully there is an easy fix for this. Thanks!

Check whether the read method is configured with the parameters below:
spark.read.format("jdbc") \
    .option("partitionColumn", "partition_key") \
    .option("lowerBound", "<lb>") \
    .option("upperBound", "<ub>") \
    .option("numPartitions", "<np>") \
    .option("fetchsize", "<fs>")
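A rough sketch of how those parameters could be bundled and applied, in case it helps. The helper names `partitioned_read_options` and `read_partitioned` are made up for illustration, and `spark` is assumed to be an existing SparkSession; only the option names come from the snippet above.

```python
def partitioned_read_options(partition_column, lower_bound, upper_bound,
                             num_partitions, fetch_size):
    """Build the JDBC options for a range-partitioned read (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not be larger than upper_bound")
    # The bounds only decide how the column range is split into partitions;
    # they do not filter rows out of the result.
    return {
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),
    }


def read_partitioned(spark, url, table, opts):
    """Apply the options to a JDBC read; `spark` is an existing SparkSession."""
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .options(**opts)
            .load())
```

With `numPartitions` set to 8, for example, Spark opens up to 8 connections and each one reads its own slice of `partition_key`.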


Optimizing Spark JDBC connection read time by adding query parameter

I'm connecting SQL Server to Spark using the following package. At the moment I am reading the entire table, which is bad for performance. To optimize performance I want to pass a query to the following config, for example: select * from my table where record time > timestamp. Is this possible? How would I do this?
DF = spark.read \
    .format("") \
    .option("url", jdbcUrl) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password).load()
You can just filter the data frame that you are creating. Spark supports predicate pushdown, which means that the filtering will most likely run directly in the database. You can check that it works by looking at the Spark UI or the explain plan.
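A minimal sketch of both routes, filtering after the load and passing a query as a parenthesized subquery in `dbtable`. The helper names `as_subquery` and `read_filtered`, the alias `src`, and the `connector_format` parameter (the format string is not shown in the question) are assumptions for illustration.

```python
def as_subquery(select_sql, alias="src"):
    """Wrap a SELECT so it can be passed as the JDBC `dbtable` option."""
    sql = select_sql.strip().rstrip(";")
    return "({}) AS {}".format(sql, alias)


def read_filtered(spark, connector_format, jdbc_url, table_name,
                  username, password, column, cutoff):
    """Load the table, then filter; Spark may push the predicate to the database."""
    df = (spark.read.format(connector_format)
          .option("url", jdbc_url)
          .option("dbtable", table_name)
          .option("user", username)
          .option("password", password)
          .load())
    recent = df.filter(df[column] > cutoff)
    # Print the physical plan; a pushed-down predicate usually shows up
    # in the scan node rather than as a separate Filter step.
    recent.explain()
    return recent
```

To send the query itself, something like `.option("dbtable", as_subquery("select * from my_table where record_time > '2021-01-01'"))` should work, where the table and column names are placeholders.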

Is my use case for GCP Dataproc feasible?

Not sure if there is a place or people to ask for one-on-one advice on Dataproc setup and tuning, but I figure here is as good a place as any to find some help.
Our team has been primarily using BigQuery to do our data analysis on location-driven data. We're carrying data back to 2019, so we're carrying a lot of data. We've added some clustering (we always had date partitioning) to help keep costs down, but it's getting to the point where it's just not feasible. At the moment we have upwards of 200 TB of data, and daily raw data ranges from 3-8 TB (it gets reduced quite a bit after a few steps).
First we'd like to move our 200 TB of data to GCS and segment it to more granular level. The schema for this data is:
uid -- STRING
timestamp_of_observation -- TIMESTAMP,
lat -- FLOAT,
lon -- FLOAT,
datasource -- STRING,
cbg (short for census_block_group) -- STRING
We would like to save the data to GCS using hive partitioning so that our bucket folder structure looks like
year > month > day > cbg
Knowing that we are processing about 200 TB and 3 years of data, and that cbgs alone have about 200,000 possible values, is this feasible?
We have a few other options, using either census block tracts (84,414 subfolders) or counties (35,000); the more granularity for us the better.
In my first attempts I either get an OOM error or stages that just run forever. My initial PySpark code looks like the following:
from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth, rand
from pyspark.sql.functions import col, spark_partition_id, asc, desc
# use appropriate version for jar depending on the scala version
# (the jar configuration is not shown in the post)
spark = SparkSession.builder \
    .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 365*100)
# "<table>" is a placeholder; the BigQuery table is not shown in the post
df = spark.read \
    .format("bigquery") \
    .load("<table>")
df1 = df.withColumn("year", year(col("visit_timestamp"))) \
    .withColumn("month", month(col("visit_timestamp"))) \
    .withColumn("day", dayofmonth(col("visit_timestamp"))) \
    .withColumn("cbg", col("boundary_partition")) \
    .withColumn('salt', rand())
# the random salt spreads skewed days over more shuffle partitions
df1.repartition(365*100, 'salt', 'year', 'month', 'day') \
    .drop('salt') \
    .write.mode("overwrite") \
    .format("parquet") \
    .partitionBy("year", "month", "day", "cbg") \
    .save("<gcs_path>")  # placeholder; the output path is not shown in the post
This code was given to me by a fellow engineer. He told me to add salt for skewness and to increase my partitions.
Any and all advice would be helpful. The goal here is to do one huge batch to migrate our data to GCS, and then begin saving our transformed raw data daily to GCS as opposed to BigQuery.
I would envision the number of files to be written as 3*12*30*200000 (216,000,000), which seems like a lot. Is there a better way to organize this? Our original purpose was to make this data MUCH cheaper to query downstream. Right now the date partition has been the best way to minimize cost; we have clustering on the CBG column, but it doesn't seem to drive cost down very much. My thought is that the GCS hive structure would essentially make CBG (or another spatial grouping) a true partition and not just a cluster.
Lastly, I'm not doing much to the cluster configuration. I've played around with the number of worker nodes and machines but haven't truly gotten anything to work. Again, any help is appreciated, and thank you for looking!
This is the cluster setup CLI code:
gcloud dataproc clusters create cluster-f35f \
    --autoscaling-policy location_data \
    --enable-component-gateway \
    --bucket cbg-test-patino \
    --region us-central1 \
    --zone us-central1-f \
    --master-machine-type n1-standard-8 \
    --master-boot-disk-type pd-ssd \
    --master-boot-disk-size 500 \
    --num-workers 30 \
    --worker-machine-type n2-standard-16 \
    --worker-boot-disk-type pd-ssd \
    --worker-boot-disk-size 1000 \
    --image-version 2.0-debian10 \
    --optional-components JUPYTER \
    --project data-*********** \
    --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/ \
    --metadata bigquery-connector-version=1.2.0 \
    --metadata spark-bigquery-connector-version=0.21.0
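To put numbers on the layout question above, here is a quick back-of-the-envelope calculation using only the figures from the post (3 years approximated as 12 months of 30 days, and 200 TB in total). The function names are made up, and the average size assumes every day has data for every spatial unit and that the stored size stays near 200 TB, which Parquet compression will likely change.

```python
DAYS = 3 * 12 * 30          # three years, approximated as in the post
TOTAL_BYTES = 200 * 10**12  # about 200 TB, decimal units


def leaf_directories(days, spatial_units):
    """Number of year/month/day/<unit> directories if all are populated."""
    return days * spatial_units


def average_bytes_per_directory(total_bytes, directories):
    """Average data volume per leaf directory under the same assumption."""
    return total_bytes / directories


options = {
    "census block groups": 200_000,
    "census block tracts": 84_414,
    "counties": 35_000,
}
for name, units in options.items():
    dirs = leaf_directories(DAYS, units)
    avg_mb = average_bytes_per_directory(TOTAL_BYTES, dirs) / 10**6
    print("{}: {:,} directories, about {:.1f} MB each".format(name, dirs, avg_mb))
```

Under these assumptions the cbg layout averages under 1 MB per directory, which points toward a coarser spatial grouping or keeping cbg as a sort or cluster column inside each day.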

Performance issues in loading data from Databricks to Azure SQL

I am trying to load 1 million records from a Delta table in Databricks to an Azure SQL database using the recently released connector by Microsoft, which supports the Python API and Spark 3.0.
Performance does not really look awesome to me. It takes 19 minutes to load 1 million records. Below is the code I am using. Do you think I am missing something here?
The cluster has 8 worker nodes with 28 GB memory and 8 cores.
The Azure SQL database is a 4 vCore Gen5.
try:
    df.write \
        .format("") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", "lending_club_acc_loans") \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .option("batchsize", "200000") \
        .option("reliabilityLevel", "BEST_EFFORT") \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Is there something I can do to boost the performance?
Repartition the data frame. My source data frame originally had a single partition; repartitioning it to 8 improved the performance.
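A small sketch of that idea, assuming `df` is the source data frame from the question. The helper names and the `connector_format` parameter (the format string is not shown in the question) are hypothetical; the other options used in the question can be added the same way.

```python
def partition_count(df):
    """Return how many partitions (and so parallel write tasks) df has."""
    return df.rdd.getNumPartitions()


def write_repartitioned(df, num_partitions, connector_format,
                        url, table, username, password):
    """Repartition first so several tasks write to the database at once."""
    (df.repartition(num_partitions)
       .write
       .format(connector_format)
       .mode("overwrite")
       .option("url", url)
       .option("dbtable", table)
       .option("user", username)
       .option("password", password)
       .save())
```

Checking `partition_count(df)` before the write shows whether the data frame really is a single partition; 8 matched the cluster here, but the best value depends on what the database can absorb.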

Parallel execution of read and write API calls in PySpark SQL

I need to load the incremental records from a set of tables in MySQL to Amazon S3 in Parquet format. These tables are common across several databases/schemas in the AWS MySQL managed instance. The code should copy data from each of the schemas (which has a set of common tables) in parallel.
I'm using the PySpark SQL read API to connect to the MySQL instance and read the data of each table for a schema, and I'm writing the resulting dataframe to S3 as a Parquet file using the write API. I'm running this in a loop for each table in a database, as shown in the code below:
def load_data_to_s3(databases_df):
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for row in databases_df.collect():
        for table in db_query_properties.keys():
            last_recorded_id_value = auto_id_values[table]
            select_sql = "select * from {}.{} where id>{}".format(row.database_name, table, last_recorded_id_value)
            # dbtable takes a table name or a parenthesized subquery with an alias
            df = spark.read.format("jdbc") \
                .option("driver", mysql_db_properties['driver']) \
                .option("url", row.database_connection_url) \
                .option("dbtable", "({}) as t".format(select_sql)) \
                .option("user", username) \
                .option("password", password) \
                .load()
            s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
            df.write.parquet(s3_path, mode="append")
I would like to know how I can scale this code to multiple databases running in parallel in an EMR cluster. Please suggest a suitable approach, and let me know if any more details are required.
I can propose two solutions:
1. Easy way
Submit multiple jobs to your EMR at once (one job per DB). If monitoring is the problem, just have the logs for the failed ones written to S3 or HDFS.
2. Bit of code change required
You could try using threading to parallelize the data pulls from each DB. I can show a sample of how to do it, but you might need to make more changes to suit your use case.
Sample implementation:
import threading
def load_data_to_s3(row):
    # copy every configured table of one database (one row of databases_df)
    db_query_properties = config['mysql-query']
    auto_id_values = config['mysql-auto-id-values']
    for table in db_query_properties.keys():
        last_recorded_id_value = auto_id_values[table]
        select_sql = "select * from {}.{} where id>{}".format(row.database_name, table, last_recorded_id_value)
        # dbtable takes a table name or a parenthesized subquery with an alias
        df = spark.read.format("jdbc") \
            .option("driver", mysql_db_properties['driver']) \
            .option("url", row.database_connection_url) \
            .option("dbtable", "({}) as t".format(select_sql)) \
            .option("user", username) \
            .option("password", password) \
            .load()
        s3_path = 's3a://{}/{}/{}'.format(s3_bucket, database_dir, table)
        df.write.parquet(s3_path, mode="append")
# one thread per database; collect() brings the rows to the driver
threads = [threading.Thread(target=load_data_to_s3, args=(row,)) for row in databases_df.collect()]
for t in threads:
    t.start()
for t in threads:
    t.join()
Also, please make sure to change the scheduler to FAIR using the set('spark.scheduler.mode', 'FAIR') property. The sample creates a thread for each of your DBs. If you want to control the number of threads running in parallel, modify the for loop accordingly.
Additionally, if you want to create new jobs from within the program, pass your SparkSession along with the arguments.
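If you would rather cap the number of concurrent threads than hand-tune the loop, a thread pool is one option. This is only a sketch: `run_per_database` is a made-up name, and `load_one` stands for any function that copies the tables of a single database (it would use the shared SparkSession internally).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_per_database(databases, load_one, max_workers=4):
    """Run load_one(db) for every database, at most max_workers at a time.

    Returns (succeeded, failed), where failed maps each database to the
    exception it raised.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_one, db): db for db in databases}
        for future in as_completed(futures):
            db = futures[future]
            try:
                future.result()
                succeeded.append(db)
            except Exception as exc:
                # Keep going so one failing database does not stop the rest.
                failed[db] = exc
    return succeeded, failed
```

The threads only submit Spark jobs from the driver; the actual reads and writes still run on the executors, which is why the FAIR scheduler setting matters.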
Your list_of_databases is not parallelized. To process it in parallel, you should parallelize the list and run the work with foreach or another operation that Spark provides.
Alternatively, turn on the concurrent option in EMR and send an EMR step for each table, or use Spark's fair scheduler, which can run the jobs in parallel internally with a small modification of your code.

Batch write from Databricks to Kafka does not observe checkpoints and writes duplicates

Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts). Retrying kicks in and processing starts over again. But this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink.
So should checkpoints work in batch-writing mode at all? Or am I missing something?
EH_SASL = ' required username="$ConnectionString" password="Endpoint=sb://;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'
dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.bootstrap.servers", "") \
    .option("topic", "mytopic") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()
Spark checkpoints tend to cause duplicates. Storing and reading offsets from ZooKeeper may solve this issue. Here is the link for details:
Also, in your case, are checkpoints not working at all, or are checkpoints causing duplicates? The URL above helps with the latter case.
