Spark SQL performance is very bad - apache-spark

I want to use Spark SQL, but I found the performance is very bad.
My first solution:
When each SQL query comes in,
load the data from the HBase table into a dataRDD,
then register this dataRDD with the SQLContext,
and finally execute the Spark SQL query.
This solution is obviously bad because it has to load the data every time.
So I improved on it.
My second solution (ignoring HBase updates and inserts):
When the app starts, load the current data from the HBase table into a dataRDD, named cachedDataRDD.
Register cachedDataRDD with the SQLContext.
When each SQL query comes in, just execute it. The performance is very good.
But some tables need to take updates and inserts into account,
so I changed the solution again, based on the second one.
My third solution (taking HBase updates and inserts into account):
When the app starts, load the current data from the HBase table into a dataRDD, named cachedDataRDD.
When a SQL query comes in, load the newly updated and inserted data into another dataRDD, named newDataRDD.
Then set cachedDataRDD = cachedDataRDD.union(newDataRDD);
Register cachedDataRDD with the SQLContext.
Finally, execute the Spark SQL query.
But I found that the union transformation makes the collect action that retrieves the query result very slow, much slower than the HBase API query.
Is there any way to tune the performance of the third solution? Under what conditions is Spark SQL the better choice? Are there any good use cases for Spark SQL? Thanks.

Consider registering newDataRDD as its own table and doing the UNION on the Spark SQL side. For example, instead of unioning the RDDs, do:
SELECT * FROM data
UNION
SELECT * FROM newData
This should provide more information to the query optimizer and hopefully help make your query faster.
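A minimal sketch of that idea, assuming the two RDDs have already been converted to DataFrames (the variable names below are illustrative, not from the original post; on older Spark versions you would call sqlContext.registerTempTable instead of createOrReplaceTempView):
// Register the initially-loaded data and the freshly-loaded delta as two views,
// and let Spark SQL perform the UNION instead of unioning the RDDs up front.
cachedDataDF.createOrReplaceTempView("data")
newDataDF.createOrReplaceTempView("newData")

val result = spark.sql(
  """SELECT * FROM data
    |UNION
    |SELECT * FROM newData""".stripMargin)

result.show()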

Related

Is there any performance hit when calling createOrReplaceTempView on a Spark Dataset?

In my code we use createOrReplaceTempView a lot so that we can run SQL against the generated view. This is done at multiple stages of the transformation, and it also helps us keep the code in modules that each perform a particular operation. Sample code to put my question in context is shown below. My questions are:
What is the performance penalty, if any, of creating the temp view
from the Dataset?
When I create more than one view per transformation, does this
increase memory usage?
What is the life cycle of those views, and is there a function call
to remove them?
val dfOne = spark.read.option("header", true).csv("/apps/cortex/landing/auth/cof_auth.csv")
dfOne.createOrReplaceTempView("dfOne")
val dfTwo = spark.sql("select * from dfOne where column_one=1234567890")
dfTwo.createOrReplaceTempView("dfTwo")
val dfThree = spark.sql("select column_two, count(*) as count_two from dfTwo group by column_two")
dfThree.createOrReplaceTempView("dfThree")
No.
From the manual, under Running SQL Queries Programmatically:
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
In order to do this you register the DataFrame as a SQL temporary view. This is a "lazy" artefact, and there must already be a DataFrame / Dataset present. It just needs registering to enable the SQL interface.
createOrReplaceTempView takes some time to process.
A UDF also takes time, since it has to be registered with the Spark application, and it can cause performance issues.
From my experience, it's best to look for a built-in function first; I would put the other two at roughly the same speed (UDF == SQL temp table/view). Note, though, that a table/view does not cover all possibilities.
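On the life-cycle question specifically: a temp view lives for the duration of the SparkSession that registered it, and it can be removed explicitly through the catalog. A small sketch, reusing the view names from the sample code above:
// Temp views are session-scoped metadata; dropping them does not touch the underlying data.
spark.catalog.dropTempView("dfOne")   // returns true if the view existed
spark.catalog.dropTempView("dfTwo")
spark.catalog.dropTempView("dfThree")

// For a view that should outlive the session (within the same application),
// use createGlobalTempView and query it via the global_temp database.
dfOne.createGlobalTempView("dfOneGlobal")
spark.sql("SELECT * FROM global_temp.dfOneGlobal").show()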

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark DataFrames by adding Ignite on top of them. The following code is how we currently read the DataFrame:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark DataFrame from Ignite by following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read
  .format(IgniteDataFrameSettings.FORMAT_IGNITE)              // Data source
  .option(IgniteDataFrameSettings.OPTION_TABLE, "person")     // Table to read
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, CONFIG) // Ignite config
  .load()
df.createOrReplaceTempView("person")
SQL queries (like select a, b, c from table where x) on the Ignite DataFrame work, but the performance is much slower than Spark alone (i.e. without Ignite, querying the Spark DF directly). A SQL query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" clause but a smaller result is processed faster. Overall, I get the impression that the Ignite DataFrame support is just a thin wrapper on top of Spark, and hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark", so I couldn't change any cache configuration in the XML (because I need to specify the cache name in the XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. In the case of the Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will drop significantly once some of the data has to be fetched during execution.
However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
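As a rough illustration of that suggestion (a sketch only, not a tested configuration: it assumes CONFIG is the path to the same Ignite XML used in the question, that the Spark-created table "person" lives in cache SQL_PUBLIC_PERSON, and that your Ignite version accepts this DDL):
// Run the CREATE INDEX DDL through an Ignite SqlFieldsQuery.
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

val ignite = Ignition.start(CONFIG)
val cache = ignite.cache[Any, Any]("SQL_PUBLIC_PERSON")

// Index the column used in the WHERE clause so queries on x avoid full scans.
cache.query(
  new SqlFieldsQuery("CREATE INDEX IF NOT EXISTS person_x_idx ON person (x)")
    .setSchema("PUBLIC")
).getAll()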

How can Hive on Spark read data from jdbc?

We are using Hive on Spark, and we want to do everything in Hive, using Spark for the computation. That means we don't need to write map/reduce code, only SQL-like code.
Now we have a problem: we want to read a data source like PostgreSQL and control it with simple SQL code, and we want it to run on the cluster.
I had an idea: I could write some Hive UDFs that connect over JDBC and produce table-like data, but I found that this doesn't run as a Spark job, so it would be useless.
What we want is to type something like this in Hive:
hive> select myfunc('jdbc:***://***','root','pw','some sql here');
Then I can get a table in Hive and join it with others. Put another way: no matter what engine Hive uses, we want to read other data sources from Hive.
I don't know what to do now; maybe someone can give me some advice.
Is there any way to do something like this:
hive> select * from hive_table where hive_table.id in
(select myfunc('jdbcUrl','user','pw','sql'));
I know that Hive compiles SQL into MapReduce jobs. What I want to know is how to make my SQL/UDF compile into a job that reads over JDBC, the way spark.read().jdbc(...) does.
I think it's easier to load the data from the database into a DataFrame; then you can dump it to Hive if necessary.
Read this: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases
See the property named dbtable; you can load part of a table defined by a SQL query.
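A minimal sketch of that approach (the URL, credentials, table and column names below are placeholders, not from the original post):
// Read a PostgreSQL table (or an arbitrary SQL query) into a DataFrame over JDBC,
// then expose it to Hive.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "(select id, name from some_table where active) as src") // a query can stand in for a table
  .option("user", "root")
  .option("password", "pw")
  .load()

// Either query it directly from Spark SQL...
jdbcDF.createOrReplaceTempView("pg_src")
// ...or persist it as a Hive table so Hive queries can join against it.
jdbcDF.write.mode("overwrite").saveAsTable("pg_src_hive")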

Should we create separate dataframe for each table in a join query in SparkSQL

We need to convert and execute Hive queries in Spark SQL. The query involves a join between two tables. We will create a DataFrame and then run Spark SQL queries on top of it. Please find a sample Hive query along with the converted query below.
------Hive query
select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6,b.col7
from table1 a left outer join table2 b
on a.col3=b.col3
-----Spark SQL
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val q1 = hiveContext.sql("select col1,col2,col3,col4 from table1")
val q2 = hiveContext.sql("select col3,col5,col6,col7 from table2")
val q3 = q1.join(q2, q1("col3") === q2("col3"), "left_outer")
But it is also possible for us to execute the entire query in a single DataFrame, as below:
val q5 = hiveContext.sql(
  """select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6,b.col7
    |from table1 a left outer join table2 b
    |on a.col3=b.col3""".stripMargin)
I would like to know which of the two approaches (single vs. multiple DataFrames) is better to use in such a situation, and the advantages of one over the other on parameters like performance and readability.
The second approach seems wise in all aspects.
When you run SQL on top of Hive data, HiveContext runs the query in Hive and returns the result metadata to Spark, so Spark only needs to store the resulting metadata set. But in the case above it has to store all the Hive data in its RDDs.
Maintaining a single RDD also helps in optimizing the DAG.
If you run it as a single query, the Spark Catalyst optimizer can optimize it further.
It is also better for readability.
Both approaches are identical. It doesn't really matter from a performance standpoint: the Catalyst optimizer will create the same physical plan for both queries.
However, there are other aspects to consider. Writing a SQL query is generally easy, but you lose compile-time type checking. If you have a typo or an incorrect column name in the SQL, it is impossible to catch unless you run it on the cluster. With DataFrame operations, by contrast, the code won't compile, which helps you code faster.
But then again, writing complex SQL with the DataFrame API is not a trivial task, so I generally use the DataFrame API where the operations are relatively easy and SQL for complex queries.
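One way to check the claim about identical plans for yourself is to compare the plans Spark produces for both formulations. A small sketch, reusing hiveContext and the table/column names from the Hive query above:
// Build the join both ways and compare the plans Catalyst produces.
val left  = hiveContext.sql("select col1, col2, col3 from table1")
val right = hiveContext.sql("select col3, col4, col5, col6, col7 from table2")
val dfJoined = left.join(right, left("col3") === right("col3"), "left_outer")

val sqlJoined = hiveContext.sql(
  """select a.col1, a.col2, a.col3, b.col4, b.col5, b.col6, b.col7
    |from table1 a left outer join table2 b on a.col3 = b.col3""".stripMargin)

// explain(true) prints the parsed, analyzed, optimized and physical plans,
// which should end up equivalent for both formulations.
dfJoined.explain(true)
sqlJoined.explain(true)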

What role does Spark SQL play? In-memory DB?

Recently I have come to Spark SQL.
I have read the Data Source API and am still confused about what role Spark SQL plays.
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory? Does that mean Spark SQL is only an in-memory DB that works on data that is already loaded? Or does it scan locally every time?
Really looking forward to any answers.
Best Regards.
I have read the Data Source API and am still confused about what role Spark SQL plays.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over data that you store in Spark's row-based structure called a DataFrame.
To run a SQL query via Spark, the first requirement is that the table on which you are trying to run the query should either be present in the Hive Metastore (i.e. the table should be present in Hive) or be a temporary view that is part of the current SQLContext/HiveContext.
So, if you have a DataFrame df and you want to run SQL queries over it, you can use:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory?
The data will be stored in memory or on disk depending on the persistence strategy that you use. If you choose to cache the table, the data will be stored in memory and the operations will be considerably faster than when the data has to be fetched from disk. That part is configurable and up to the user: you can basically tell Spark how you want it to store the data.
Spark SQL will only cache the rows that are pulled by the action, which means it will cache as many partitions as it has to read during that action. This makes your first call much faster than your second call.
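A minimal sketch of the caching choice described above (the source path and column name are illustrative placeholders; the view name reuses temp_table from earlier in this answer):
// Option 1: cache the DataFrame itself; data is materialized on the first action
// according to the chosen storage level.
val df = spark.read.parquet("/some/illustrative/path")
df.createOrReplaceTempView("temp_table")
df.cache()

// Option 2: cache by table name through the catalog; equivalent for a temp view.
spark.catalog.cacheTable("temp_table")

// The first action populates the cache (only the partitions it actually reads)...
spark.sql("SELECT count(*) FROM temp_table").show()
// ...and subsequent queries over the cached data avoid re-reading the source.
spark.sql("SELECT * FROM temp_table WHERE some_col = 1").show()

// Release the cached data when it is no longer needed.
spark.catalog.uncacheTable("temp_table")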
