SparkSQL queries execution is slower than my database - apache-spark

Greetings,
I have created a Spark 2.1.1 cluster in Amazon EC2 with instance type m4.large, 1 master and 5 slaves to start. My PostgreSQL 9.5 database (t2.large) has a table of over 2 billion rows and 7 columns that I would like to process. I have followed the directions from the Apache Spark website and various other sources on how to connect to and process these data.
My problem is that Spark SQL performance is way slower than my database. My SQL statement (see below in the code) takes about 21 min in PSQL, but Spark SQL takes about 42 min to finish. My main goal is to measure the performance of PSQL vs Spark SQL, and so far I am not getting the desired results. I would appreciate the help.
Thank you
I have tried increasing fetchsize from 10000 to 100000, caching the dataframe, increasing numPartitions to 100, setting spark.sql.shuffle.partitions to 2000, doubling my cluster size, and using a larger instance type, and so far I have not seen any improvements.
val spark = SparkSession.builder()
.appName("Spark SQL")
.getOrCreate();
val jdbcDF = spark.read.format("jdbc")
.option("url", DBI_URL)
.option("driver", "org.postgresql.Driver")
.option("dbtable", "ghcn_all")
.option("fetchsize", 10000)
.load()
.createOrReplaceTempView("ghcn_all");
val sqlStatement = """SELECT ghcn_date, element_value/10.0
FROM ghcn_all
WHERE station_id = 'USW00094846'
AND (ghcn_date >= '2015-01-01' AND ghcn_date <= '2015-12-31')
AND qflag IS NULL
AND element_type = 'PRCP'
ORDER BY ghcn_date""";
val sqlDF = spark.sql(sqlStatement);
var start:Long = System.nanoTime;
val num_rows:Long = sqlDF.count();
var end:Long = System.nanoTime;
println("Total Row : " + num_rows);
println("Total Collect Time Lapse : " + ((end - start) / 1000000) + " ms");

There is no good reason for this code to ever run faster on Spark than in the database alone. First of all, it is not even distributed, as you made the same mistake as many before you and did not partition the data.
More important, though, is that you still load the data from the database - as a result, the database has to do at least as much work as before (and in practice more), then send the data over the network, and then the data has to be parsed and processed by Spark. You are basically doing far more work and expecting things to be faster - that's not going to happen.
If you want to reliably improve performance on Spark you should at least:
Extract data from the database.
Write it to efficient distributed storage (not S3, for example).
Use proper bucketing and partitioning to enable partition pruning and predicate pushdown.
Then you might have better luck. But again, proper indexing of your data on the cluster should improve performance as well, likely at a lower overall cost.
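For example, a minimal sketch of the extract-and-store step (assuming an HDFS path and reusing the table and column names from the question; not the asker's exact setup):
// one-time extract of the table over JDBC
val raw = spark.read.format("jdbc")
  .option("url", DBI_URL)
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "ghcn_all")
  .load()
// write to columnar, partitioned storage so later filters can prune partitions
raw.write
  .partitionBy("element_type")             // hypothetical choice of partitioning column
  .parquet("hdfs:///warehouse/ghcn_all")   // hypothetical path on the cluster's storage
// subsequent queries scan only the matching partitions and push the remaining filters down
spark.read.parquet("hdfs:///warehouse/ghcn_all").createOrReplaceTempView("ghcn_all")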

It is very important to set partitionColumn when you read from SQL. It is used to parallelize the queries, so you should decide which column to use as your partitionColumn.
In your case for example:
val jdbcDF = spark.read.format("jdbc")
.option("url", DBI_URL)
.option("driver", "org.postgresql.Driver")
.option("dbtable", "ghcn_all")
.option("fetchsize", 10000)
.option("partitionColumn", "ghcn_date")
.option("lowerBound", "2015-01-01")
.option("upperBound", "2015-12-31")
.option("numPartitions",16 )
.load()
.createOrReplaceTempView("ghcn_all");
More references:
How Apache Spark Makes Your Slow MySQL Queries 10x Faster (or More)
Tips for using JDBC in Apache Spark SQL

Related

Slow performance writing from pyspark dataframe to Azure Synapse pool

I am writing data from a Spark dataframe in an Azure Databricks notebook into a dedicated Synapse pool. The problem is this takes an extremely long time given the small size of the data involved.
Read performance is fine; this syntax will happily read 100,000 rows in a couple of seconds. However, it takes about 25 minutes to write a similar number of rows (of 3 columns).
Are there any options I should be adding to improve write performance? Or is there a faster way of completing the same task?
(df.write
.format("jdbc")
.option("url", f"jdbc:sqlserver://<blahblah>.sql.azuresynapse.net:1433;database=<blahblah>;user=<blah>#<blahblah>;password={password};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;")
.option("dbtable", "dbo.newtablename")
.option("user", <username>)
.option("password", <password>)
.option("createTableColumnTypes", "column1 VARCHAR(36), column2 VARCHAR(1), column3 VARCHAR(1)")
.save()
)
I added the createTableColumnTypes option as the default column type did not permit a columnstore index to be created.
Other options I have reviewed in the documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) do not seem relevant.

Does the Spark Shell JDBC read numPartitions value depend on the number of executors?

I have Spark set up in standalone mode on a single node with 2 cores and 16GB of RAM to make some rough POCs.
I want to load data from a SQL source using val df = spark.read.format('jdbc')...option('numPartitions',n).load(). When I tried to measure the time taken to read a table for different numPartitions values by calling df.rdd.count, I saw that the time was the same regardless of the value I gave. I also noticed on the Spark web UI that the number of Active executors was 1, even though I set SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=1 in my spark_env.sh file.
I have 2 questions:
Do the numPartitions actually created depend on the number of executors?
How do I start spark-shell with multiple executors in my current setup?
Thanks!
The number of partitions doesn't depend on your number of executors - although there is a best practice (a few partitions per core), it is not determined by the number of executor instances.
When reading from JDBC, to parallelize the read you need a partition column, e.g.:
spark.read("jdbc")
.option("url", url)
.option("dbtable", "table")
.option("user", user)
.option("password", password)
.option("numPartitions", numPartitions)
.option("partitionColumn", "<partition_column>")
.option("lowerBound", 1)
.option("upperBound", 10000)
.load()
That will parallelize the read into numPartitions concurrent queries, each covering a stride of roughly (upperBound - lowerBound) / numPartitions values of the partition column (here about 10,000 / numPartitions).
About your second question, you can find all of the Spark configuration here: https://spark.apache.org/docs/latest/configuration.html (e.g. spark2-shell --num-executors, or the configuration --conf spark.executor.instances).
Specifying the number of executors means dynamic allocation will be off, so be aware of that.
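For example, a minimal sketch of such a launch (flag names taken from above; the master URL and memory value are placeholders for your standalone setup):
spark-shell \
  --master spark://<master-host>:7077 \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=4g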

Spark Cluster configuration

I'm using a Spark cluster of two nodes, each having two executors (each using 2 cores and 6GB memory).
Is this a good cluster configuration for a faster execution of my spark jobs?
I am kind of new to Spark and I am running a job on 80 million rows of data which includes shuffle-heavy tasks like aggregate (count) and join operations (self join on a dataframe).
Bottlenecks:
Spark reports insufficient resources for my executors while reading the data.
On a smaller dataset, it's taking a lot of time.
What should be my approach and how can I do away with my bottlenecks?
Any suggestion would be highly appreciated.
query= "(Select x,y,z from table) as df"
jdbcDF = spark.read.format("jdbc").option("url", mysqlUrl) \
.option("dbtable", query) \
.option("user", mysqldetails[2]) \
.option("password", mysqldetails[3]) \
.option("numPartitions", "1000")\
.load()
This gives me a dataframe on which jdbcDF.rdd.getNumPartitions() returns a value of 1. Am I missing something here? I think I am not parallelizing my dataset.
There are different ways to improve the performance of your application. Below are some points which may help.
Try to reduce the number of records and columns for processing. As you have mentioned you are new to Spark, you might not need all 80 million rows, so you can filter the rows to whatever you require. Also, select only the columns you require, not all of them.
If you use some data frequently, consider caching it so that the next operation reads it from memory.
If you are joining two DataFrames and one of them is small enough to fit in memory, consider a broadcast join (see the sketch below).
Increasing the resources might not improve the performance of your application in all cases, but looking at your configuration of the cluster, it should help. It might be a good idea to throw some more resources at it and check the performance.
You can also use the Spark UI to monitor your application and see if there are a few tasks which are taking much longer than others. Then you probably need to deal with skew in your data.
You can also consider partitioning your data based on the columns which you are using in your filter criteria.
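For instance, a minimal sketch of the caching and broadcast-join points above (in Scala, mirroring the first question's code; the same API exists in PySpark, jdbcDF stands for the dataframe read above, and smallDF and the join key are placeholders):
import org.apache.spark.sql.functions.broadcast
// keep only the needed columns and cache the result for repeated use
val big = jdbcDF.select("x", "y", "z").cache()
// broadcast the small side so the join does not shuffle the 80-million-row side
val joined = big.join(broadcast(smallDF), Seq("x"))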

Incremental Data loading and Querying in Pyspark without restarting Spark JOB

Hi all, I want to do incremental data querying.
df = spark .read.csv('csvFile', header=True) #1000 Rows
df.persist() #Assume it takes 5 min
df.registerTempTable('data_table') #or createOrReplaceTempView
result = spark.sql('select * from data_table where column1 > 10') #100 rows
df_incremental = spark.read.csv('incremental.csv') #200 Rows
df_combined = df.unionAll(df_incremental)
df_combined.persist() #It will take morethan 5 mins, I want to avoid this, because other queries might be running at this time
df_combined.registerTempTable("data_table")
result = spark.sql('select * from data_table where column1 > 10') # 105 Rows.
P1. Read CSV / MySQL table data into a Spark dataframe.
P2. Persist that dataframe in memory only (reason: I need performance & my dataset can fit in memory).
P3. Register it as a temp table and run Spark SQL queries. # Till this point my Spark job is UP and RUNNING.
P4. The next day I will receive an incremental dataset (in a temp MySQL table or a CSV file). Now I want to run the same query on the total set, i.e. persisted_prevData + recent_read_IncrementalData. I will call it mixedDataset.
*** There is no certainty about when incremental data comes into the system; it can come 30 times a day.
Throughout this, I don't want the Spark application to be down. It should always be up. And I need the performance of querying mixedDataset with the same time measure as if it were persisted.
My concerns:
In P4, do I need to unpersist the previous data and persist the union dataframe of previous & incremental data again?
And my most important concern is that I don't want to restart the Spark job to load/start with the updated data (only if the server goes down do I have to restart, of course).
So, at a high level, I need to query (with fast performance) the dataset + incremental_data_if_any dynamically.
Currently I am doing this exercise by creating a folder for all the data, with the incremental file also placed in the same directory. Every 2-3 hrs, I restart the server and my Spark app starts by reading all the CSV files present in that directory. Then the queries run on them.
I am also trying to explore Hive persistent tables and Spark Streaming, and will update here if I find any results.
Please suggest a way/architecture to achieve this.
Please comment if anything is not clear in the question, without downvoting it :)
Thanks.
Try streaming instead; it will be much faster since the session is already running, and it will be triggered every time you place something in the folder:
df_incremental = spark \
.readStream \
.option("sep", ",") \
.schema(input_schema) \
.csv(input_path)
df_incremental.where("column1 > 10") \
.writeStream \
.queryName("data_table") \
.format("memory") \
.start()
spark.sql("SELECT * FROM data_table).show()

How to control worker transactions with jdbc data source?

When using Spark for delete (or update) and insert, I want either all to succeed or all to fail.
And since a Spark application is distributed across many JVMs, how can I keep every worker's transaction synchronized?
// DELETE: BEGIN
Class.forName("com.oracle.jdbc.Driver");
conn = DriverManager.getConnection(DB_URL, USER, PASS);
String query = "delete from users where id = ?";
PreparedStatement preparedStmt = conn.prepareStatement(query);
preparedStmt.setInt(1, 3);
preparedStmt.execute();
// DELETE: END
val jdbcDF = spark
.read
.jdbc("DB_URL", "schema.tablename", connectionProperties)
.write
.format("jdbc")
.option("url", "DB_URL")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.save()
tl;dr You can't.
Spark is a fast and general engine for large-scale data processing (i.e. a multi-threaded distributed computing platform), and the main selling point is that you may, and surely will, execute multiple simultaneously running tasks to process your massive datasets faster (and perhaps even cheaper).
JDBC is not a very suitable data source for Spark, as you are limited by the capacity of your JDBC database. That's why many people are migrating from JDBC databases to HDFS or Cassandra or similar data storage, where thousands of connections are not much of an issue (not to mention other benefits like partitioning your datasets before Spark even touches the data).
You can control JDBC using some configuration parameters (e.g. partitionColumn, lowerBound, upperBound, numPartitions, fetchsize, batchsize or isolationLevel) that give you some flexibility, but wishing to "transaction synchronize" is outside the scope of Spark.
Use JDBC directly instead (just like you did for DELETE).
Note that the code between DELETE: BEGIN and DELETE: END is executed on the driver (on a single thread).
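If you do need all-or-nothing behaviour for a delete plus insert, one option is to run both statements yourself in a single JDBC transaction on the driver. A minimal sketch (the users table, columns and values are placeholders, reusing the connection constants from the question):
import java.sql.DriverManager
val conn = DriverManager.getConnection(DB_URL, USER, PASS)
try {
  conn.setAutoCommit(false)                                    // one transaction for both statements
  val del = conn.prepareStatement("delete from users where id = ?")
  del.setInt(1, 3)
  del.execute()
  val ins = conn.prepareStatement("insert into users (id, name) values (?, ?)")
  ins.setInt(1, 3)
  ins.setString(2, "replacement row")
  ins.execute()
  conn.commit()                                                // both statements succeed together...
} catch {
  case e: Exception =>
    conn.rollback()                                            // ...or both are rolled back
    throw e
} finally {
  conn.close()
}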
