Slow performance writing from pyspark dataframe to Azure Synapse pool - apache-spark

I am writing data from a Spark dataframe in an Azure Databricks notebook into a dedicated Synapse pool. The problem is this takes an extremely long time given the small size of the data involved.
Read performance is fine, this syntax will happily read 100,000 rows in a couple of seconds. However takes about 25 minutes to write a similar number of rows (of 3 columns).
Are there any options I should be adding to improve write performance? Or is there a faster way of completing the same task?
(df.write
.format("jdbc")
.option("url", f"jdbc:sqlserver://<blahblah>.sql.azuresynapse.net:1433;database=<blahblah>;user=<blah>#<blahblah>;password={password};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;")
.option("dbtable", "dbo.newtablename")
.option("user", <username>)
.option("password", <password>)
.option("createTableColumnTypes", "column1 VARCHAR(36), column2 VARCHAR(1), column3 VARCHAR(1)")
.save()
)
I added the createTableColumnTypes option as the default column type did not permit a columnstore index to be created.
Other options I have reviwed in the documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) do not seem relevant.

Related

Is my use case for GCP Dataproc feasible?

Not sure if there is a place/people to ask for one on one advice for Dataproc setup and tuning. But figure here is as good as place as any to find some help.
Our team has been primarily using BigQuery to do our data analysis on location driven data. We're carrying data back to 2019, so we're carry a lot of data. We've added some clustering (always had date partitioning) to help keep cost down, but its getting to the point where it just not feasible. At the moment we have upwards to 200 TB of data and daily raw data ranges from 3-8 TB (gets reduce quite a bit after a few steps).
First we'd like to move our 200 TB of data to GCS and segment it to more granular level. The schema for this data is:
uid -- STRING
timestamp_of_observation -- TIMESTAMP,
lat -- FLOAT,
lon -- FLOAT,
datasource -- STRING,
cbg (short for census_block_group) -- STRING
We would like to save the data to GCS using hive partitioning so that our bucket folder structure looks like
year > month > day > cbg
Knowing we are processing about 200TB and 3 years of data and cbgs alone have about 200,000 possibilities is this feasible?
We have a few other options using either census block tracts (84,414 subfolders) or counties (35,000), the more granularity for us the better.
My first attempts I either get just a OOM or I get stages just running forever. My initial pyspark code looks like the following:
from pyspark import SparkFiles
from pyspark.sql.functions import year, month, dayofmonth, rand
from pyspark.sql.functions import col, spark_partition_id, asc, desc
# use appropriate version for jar depending on the scala version
spark = SparkSession.builder\
.appName('BigNumeric')\
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 365*100)
df = spark.read \
.format("bigquery") \
.load("data-location-338617.Raw_Location_Data.data_cbg_enrich_proto")
df1 = df.withColumn("year", year(col("visit_timestamp"))) \
.withColumn("month", month(col("visit_timestamp"))) \
.withColumn("day", dayofmonth(col("visit_timestamp"))) \
.withColumn("cbg", col("boundary_partition")) \
.withColumn('salt', rand())
df1.repartition(365*100,'salt','year','month','day') \
.drop('salt') \
.write.mode("overwrite") \
.format("parquet") \
.partitionBy("year", "month", "day", "cbg") \
.save("gs://initial_test/cbg_data/")
This code was given to me but a fellow engineer. He told me to add salt for skewness, to increase my partitions.
Any and all advice would be helpful. The goal here to do one huge batch to migrate our data to GCS and then daily begin to save our raw data transformed to GCS as oppose to Bigquery.
I would envision that the file numbers to be written are 31230*200000 (216000000) which seems like a lot. Is there a better way to organize this, our original purpose was to make this data MUCH cheaper downstream to query. Right now the date partition has been the best way to minimize cost, we have clustering on CBG column but it doesn't seem to drive cost down very much. My thought is that with the GCS hive structure, it would essentially make CBG (or other spatial grouping) as a true partition and now just a cluster.
Lastly I"m not doing much to the cluster configuration, I've played around with number of worker nodes and machines but haven't truly gotten anything to work again any help is appreciated and thank you for looking!
This is the cluster setup CLI code
gcloud dataproc clusters create cluster-f35f --autoscaling-policy location_data --enable-component-gateway --bucket cbg-test-patino --region us-central1 --zone us-central1-f --master-machine-type n1-standard-8 --master-boot-disk-type pd-ssd --master-boot-disk-size 500 --num-workers 30 --worker-machine-type n2-standard-16 --worker-boot-disk-type pd-ssd --worker-boot-disk-size 1000 --image-version 2.0-debian10 --optional-components JUPYTER --project data-*********** --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh --metadata bigquery-connector-version=1.2.0 --metadata spark-bigquery-connector-version=0.21.0

Write data to specific partitions in Azure Dedicated SQL pool

At the moment ,we are using steps in the below article to do a full load of the data from one of our spark data sources(delta lake table) and write them to a table on SQL DW.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Specifically, the write is carried out using,
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.option("maxStrLength",4000).mode("overwrite").save()
Now,our source data,by virture of it being a delta lake, is partitioned on the basis of countryid. And we would to load/refresh only certain partitions to the SQL DWH, instead of the full drop table and load(because we specify "overwrite") that is happening now.I tried adding an adding a additional option (partitionBy,countryid) to the above script,but that doesnt seem to work.
Also the above article doesn't mention partitioning.
How do I work around this?
There might be better ways to do this, but this is how I achieved it. If the target Synapse table is partitioned, then we could leverage the "preActions" option provided by the Synapse connector to delete the existing data at that partition. And then we append new data pertaining to that partition(read as a dataframe from source), instead of overwriting the whole data.

SparkSQL queries execution is slower than my database

Greeting,
I have created a Spark 2.1.1 cluster in Amazon EC2 with instance type m4.large of 1 master and 5 slaves to start. My PostgreSQL 9.5 database (t2.large) has a table of over 2 billions rows and 7 column that I would like to process. I have followed the direction from Apache Spark website and other various sources on how to connect and process these data.
My problem is that Spark SQL performance is way slower than my database. My sql statement (see below in the code) takes about 21mins in PSQL, but Spark SQL take about 42 min to finish. My main goal is to measure the performance of PSQL vs Spark SQL and so far I am not getting the desire results. I would appreciate the help.
Thank you
I have tried increasing fetchSize from 10000 to 100000, caching the dataframe, increase numpartition to 100, set spark.sql.shuffle to 2000, double my cluster size, and use larger instance type and so far I have not seen any improvements.
val spark = SparkSession.builder()
.appName("Spark SQL")
.getOrCreate();
val jdbcDF = spark.read.format("jdbc")
.option("url", DBI_URL)
.option("driver", "org.postgresql.Driver")
.option("dbtable", "ghcn_all")
.option("fetchsize", 10000)
.load()
.createOrReplaceTempView("ghcn_all");
val sqlStatement = "SELECT ghcn_date, element_value/10.0
FROM ghcn_all
WHERE station_id = 'USW00094846'
AND (ghcn_date >= '2015-01-01' AND ghcn_date <= '2015-12-31')
AND qflag IS NULL
AND element_type = 'PRCP'
ORDER BY ghcn_date";
val sqlDF = spark.sql(sqlStatement);
var start:Long = System.nanoTime;
val num_rows:Long = sqlDF.count();
var end:Long = System.nanoTime;
println("Total Row : " + num_rows);
println("Total Collect Time Lapse : " + ((end - start) / 1000000) + " ms");
There is no good reason for this code to ever run faster on Spark, than database alone. First of all it is not even distributed, as you made the same mistake as many before you and don't partition the data.
But it more important is that you actually load data from the database - as a result it has to do at least as much work (and in practice more), then send data over the network, then data has to parsed by Spark, and processed. You basically do way more work and expect things to be faster - that's not going to happen.
If you want to reliably improve performance on Spark you should at least:
Extract data from the database.
Write to efficient (like not S3) distributed storage.
Use proper bucketing and partitioning to enable partition pruning and predicate pushdown.
Then you might have a better lack. But again, proper indexing of your data on the cluster, should improve performance as well, likely at a lower overall cost.
It is very important to set partitionColumn when your read from SQL. It use for parallel query. So you should decide which column is your partitionColumn.
In your case for example:
val jdbcDF = spark.read.format("jdbc")
.option("url", DBI_URL)
.option("driver", "org.postgresql.Driver")
.option("dbtable", "ghcn_all")
.option("fetchsize", 10000)
.option("partitionColumn", "ghcn_date")
.option("lowerBound", "2015-01-01")
.option("upperBound", "2015-12-31")
.option("numPartitions",16 )
.load()
.createOrReplaceTempView("ghcn_all");
More Reference:
How Apache Spark Makes Your Slow MySQL Queries 10x Faster (or More)
Tips for using JDBC in Apache Spark SQL

Spark Cluster configuration

I'm using a spark cluster with of two nodes each having two executors(each using 2 cores and 6GB memory).
Is this a good cluster configuration for a faster execution of my spark jobs?
I am kind of new to spark and I am running a job on 80 million rows of data which includes shuffling heavy tasks like aggregate(count) and join operations(self join on a dataframe).
Bottlenecks:
Showing Insufficient resources for my executors while reading the data.
On a smaller dataset, it's taking a lot of time.
What should be my approach and how can I do away with my bottlenecks?
Any suggestion would be highly appreciable.
query= "(Select x,y,z from table) as df"
jdbcDF = spark.read.format("jdbc").option("url", mysqlUrl) \
.option("dbtable", query) \
.option("user", mysqldetails[2]) \
.option("password", mysqldetails[3]) \
.option("numPartitions", "1000")\
.load()
This gives me a dataframe which on jdbcDF.rdd.getNumPartitions() gives me value of 1. Am I missing something here?. I think I am not parallelizing my dataset.
There are different ways to improve the performance of your application. PFB some of the points which may help.
Try to reduce the number of records and columns for processing. As you have mentioned you are new to spark and you might not need all 80 million rows, so you can filter the rows to whatever you require. Also, select the columns which is required but not all.
If you are using some data frequently then try considering caching the data, so that for the next operation it will be read from the memory.
If you are joining two DataFrames and if one of them is small enough to fit in memory then you can consider broadcast join.
Increasing the resources might not improve the performance of your application in all cases, but looking at your configuration of the cluster, it should help. It might be good idea to throw some more resources and check the performance.
You can also try using Spark UI to monitor your application and see if there are few task which are taking long time than others. Then probably you need to deal with skewness of your data.
You can try considering to Partition your data based on the columns which you are using in your filter criteria.

Incremental Data loading and Querying in Pyspark without restarting Spark JOB

Hi All I want to do incremental data query.
df = spark .read.csv('csvFile', header=True) #1000 Rows
df.persist() #Assume it takes 5 min
df.registerTempTable('data_table') #or createOrReplaceTempView
result = spark.sql('select * from data_table where column1 > 10') #100 rows
df_incremental = spark.read.csv('incremental.csv') #200 Rows
df_combined = df.unionAll(df_incremental)
df_combined.persist() #It will take morethan 5 mins, I want to avoid this, because other queries might be running at this time
df_combined.registerTempTable("data_table")
result = spark.sql('select * from data_table where column1 > 10') # 105 Rows.
read a csv/mysql Table data into spark dataframe.
Persist that dataframe in memory Only(reason: I need performance & My dataset can fit to memory)
Register as temp table and run spark sql queries. #Till this my spark job is UP and RUNNING.
Next day i will receive a incremental Dataset(in a temp_mysql_table or a csv file). Now I want to run same query on a Total set i:e persisted_prevData + recent_read_IncrementalData. i will call it mixedDataset.
*** there is no certainty that when incremental data comes to system, it can come 30 times a day.
Till here also I don't want the spark-Application to be down,. It should always be Up. And I need performance of querying mixedDataset with same time measure as if it is persisted.
My Concerns :
In P4, Do i need to unpersist the prev_data and again persist the union-Dataframe of prev&Incremantal data?
And my most important concern is i don't want to restart the Spark-JOB to load/start with Updated Data(Only if server went down, i have to restart of course).
So, on a high level, i need to query (faster performance) dataset + Incremnatal_data_if_any dynamically.
Currently i am doing this exercise by creating a folder for all the data, and incremental file also placed in the same directory. Every 2-3 hrs, i am restarting the server and my sparkApp starts with reading all the csv files present in that system. Then queries running on them.
And trying to explore hive persistentTable and Spark Streaming, will update here if found any result.
Please suggest me a way/architecture to achieve this.
Please comment, if anything is not clear on Question, without downvoting it :)
Thanks.
Try streaming instead it will be much faster since the session is already running and it will be triggered everytime you place something in the folder:
df_incremental = spark \
.readStream \
.option("sep", ",") \
.schema(input_schema) \
.csv(input_path)
df_incremental.where("column1 > 10") \
.writeStream \
.queryName("data_table") \
.format("memory") \
.start()
spark.sql("SELECT * FROM data_table).show()

Resources