Apache Spark distributed sql - apache-spark

I use Spark DataFrameReader to perform sql query from database. For each query performed the SparkSession is required. What I would like to do is: for each of JavaPairRDDs perform map, which would invoke sql query with parameters from this RDD. This means that I need to pass SparkSession in each lambda, which seems to be bad design. What is common approach in such problems?
It could look like:
roots.map(r -> DBLoader.getData(sparkSession, r._1));
How I load data now:
JavaRDD<Row> javaRDD = sparkSession.read().format("jdbc")
.options(options)
.load()
.javaRDD();

The purpose of Big Data is to have data locality and be able to execute your code where your data resides, it is ok to do a big load of a table into memory or local disk (cache/persist), but continuous remote jdbc queries will defeat the purpose.

Related

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing spark dataframe by adding ignite on top of it. Following code is how we currently read dataframe
val df = sparksession.read.parquet(path).cache()
I managed to save and load spark dataframe from ignite by the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. Following code is how I do it now with ignite
val df = spark.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), CONFIG) //Ignite config.
.load();
df.createOrReplaceTempView("person");
SQL Query(like select a, b, c from table where x) on ignite dataframe is working but the performance is much slower than spark alone(i.e without ignite, query spark DF directly), an SQL query often take 5 to 30 seconds, and it's common to be 2 or 3 times slower spark alone. I noticed many data(100MB+) are exchanged between ignite container and spark container for every query. Query with same "where" but smaller result is processed faster. Overall I feel ignite dataframe support seems to be a simple wrapper on top of spark. Hence most of the case it is slower than spark alone. Is my understanding correct?
Also by following the code example when the cache is created in ignite it automatically has a name like "SQL_PUBLIC_name_of_table_in_spark". So I could't change any cache configuration in xml (Because I need to specify cache name in xml/code to configure it and ignite will complain it already exists) Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch Parquet data, cache it locally in Spark, and only then execute the query. In case of Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve the performance. For this particular case, you should create index on the x field to avoid scanning all the data every time query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index

Should we create separate dataframe for each table in a join query in SparkSQL

We need to convert and execute execute hive queries in Spark SQL.The query involves a join between 2 tables.We will create a dataframe and then sparksql queries on top of it.Please find samples hive query along with converted query.
------Hive query
select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6.b.col7
from table1 a left outer join table2 b
on a.col3=b.col3
-----Spark SQL
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val q1=hivecontext.sql("select col1,col2,col3,col4 from table1");
val q2=hivecontext.sql("select col3,col5,col6,col7 from table2");
val q3=q1.join(q2,q1("col3")===q2("col3"));
But it is also possible for us to execute the entire query in a single data frame as below
**
val q5=hivecontext.sql("select
a.col1,a.col2,a.col3,b.col4,b.col5,b.col6.b.col7
from table1 a left outer join table2 b
on a.col3=b.col3")**
I would like to know which of the 2 approach(single vs multiple dataframe) we is better to use in such situation and the advantages over the other in various parameters like performance and readability.
Second approach seems to be wise in all aspects
When you run SQL on top of Hive data, HiveContext will run the query in hive and returns the result metadata to Spark. So spark just need to store the resultant metadata set.But in the above case it has to store all the data in hive into its RDD's.
Maintaining a single RDD helps in optimizing DAG as well.
If you run as a single query even Spark catalyst will optimize it more.
It looks even better for Readability.
Both the approaches are identical. It doesn't matter really from the performance standpoint. Catalyst optimizer will create the same physical plan for both the queries.
Now however there are other aspects to consider. Writing SQL query is generally easy however you loose the compile time type check. If you have a typo or incorrect column name in the SQL it is impossible to find unless you run that on the cluster. However, if you are using dataframe operation the code won't compile. So it helps faster coding speed.
But again writing complex SQL with dataframe APIs is not trivial tasks. So generally I use Dataframe APIs where the operations are relatively easy and use SQL for complex queries.

What role Spark SQL acts? Memory DB?

Recently i come to Spark SQL.
I read the Data Source Api and still confused at what role Spark SQL acts.
When i do SQL on whatever i need, will spark load all the data first and perform sql in memory? That means spark sql is only a memory db that works on data already loaded. Or it scan locally every time?
Really willing to any answers.
Best Regards.
I read the Data Source Api and still confused at what role Spark SQL acts.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over the data that you store in Spark specific row-based structures called DataFrame
To run a SQL query via Spark, the first requirement is that the table on which you are trying to run a query should be present in either the Hive Metastore (i.e the table should be present in Hive) or it should be a temporary view that is part of the current SQLContext/HiveContext.
So, if you have a dataframe df and you want to run SQL queries over it, you can either use:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally
When i do SQL on whatever i need, will spark load all the data first and perform sql in memory?
The data will be stored in-memory or on disk depending upon the persistence strategy that you use. If you choose to cache the table, the data will get stored in memory and the operations would be considerable faster when compared to the case where data is fetched from the disk. That part is anyway configurable and up to the user. You can basically tell Spark how you want it to store the data.
Spark-sql will only cache the rows that are pulled by the action, this means that it will cache as many partitions as it has to read during the action. this makes your first call much faster than your second call

Writing small amount of data to cassandra table in Spark

I need to write some small amount of data to a cassandra table in a spark application. The data is not an RDD, and it is just a double value. How to do this in a Spark application using Java API?
Thanks for any hint.
As a quick solution you can use sc.parallelize and save RDD to Cassandra as usual.
If you need to run a query you can use CassandraConnector pool like in doc:
val connector = CassandraConnector(sc.getConf)
connector.withSessionDo(session => ...)
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md

Spark SQL performance is very bad

I want to use SPARK SQL. I found the performance is very bad.
In my first solution:
when each SQL query coming,
load the data from hbase entity to a dataRDD,
then register this dataRDD to SQLcontext.
at last execute spark SQL query.
Apparently the solution is very bad because it needs to load data every time.
So I improved the first solution.
In my second solution don't consider hbase data updates and inserts:
When app starts, load the current data from HBASE entity to a dataRDD, named cachedDataRDD.
Register cachedDataRDD to SQLcontext
When each SQL query coming, execute spark SQL query. The performance is very good.
But some entity needs to consider the updates and inserts.
So I changed the solution base on the second solution.
In my third solution need to consider the hbase data updates and inserts:
When app starts, load the current data from HBASE entity to a dataRDD, named cachedDataRDD.
When SQL query coming, load the new updates and inserts data to another dataRDD, named newDataRDD.
Then set cachedDataRDD = cachedDataRDD.union(dataRDD);
Register cachedDataRDD to SQLcontext
At last execute spark SQL query.
But I found the union transformation will cause the collect action for getting the query result is very slow. Much slow than the hbase api query.
Is there any way to tune the third solution performance? Usually under what condition to use the spark SQL is better? Is any good use case of using spark SQL? Thanks
Consider creating a new table for newDataRDD and do the UNION on the Spark SQL side. So for example, instead of unioning the RDD, do:
SELECT * FROM data
UNION
SELECT * FROM newData
This should provide more information to the query optimizer and hopefully help make your query faster.

Resources