Spark DataFrames: registerTempTable vs not - apache-spark

I just started with DataFrame yesterday and am really liking it so far.
I dont understand one thing though...
(Referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema)
In this example the dataframe is registered as a table (I am guessing to provide access to SQL queries..?) but the exact same information that is being accessed can also be done by peopleDataFrame.select("name").
So question is.. When would you want to register a dataframe as a table instead of just using the given dataframe functions? And is one option more efficient than the other?

The reason to use the registerTempTable( tableName ) method for a DataFrame, is so that in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql( sqlQuery ) method, that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.
val sc: SparkContext = ...
val hc = new HiveContext( sc )
val customerDataFrame = myCodeToCreateOrLoadDataFrame()
customerDataFrame.registerTempTable( "cust" )
val query = """SELECT custId, sum( purchaseAmount ) FROM cust GROUP BY custId"""
val salesPerCustomer: DataFrame = hc.sql( query )
salesPerCustomer.show()
Whether to use SQL or DataFrame methods like select and groupBy is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.
In my case, I found that certain kinds of aggregation and windowing queries that I needed, like computing a running balance per customer, were available in the Hive SQL query language, that I suspect would have been very difficult to do in Spark.
If you want to use SQL, then you most likely will want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than available via a plain SQLContext.

It's convenient to load the dataframe into a temp view in a notebook for example, where you can run exploratory queries on the data:
df.createOrReplaceTempView("myTempView")
Then in another notebook you can run a sql query and get all the nice integration features that come out of the box e.g. table and graph visualisation etc.
%sql
SELECT * FROM myTempView

Related

How to execute CQL query using pyspark

I want to execute Cassandra CQL query using PySpark.But I am not finding the way to execute it.I can load whole table to dataframe and create Tempview and query it.
df = spark.read.format("org.apache.spark.sql.cassandra").
options(table="country_production2",keyspace="country").load()
df.createOrReplaceTempView("Test")
Please suggest any better way to so that I can execute CQL query in PySpark.
Spark SQL doesn't support Cassandra's cql dialects directly. It only allows you to load the table as a Dataframe and operate on it.
If you are concerned about reading a whole table to query it, then you may use the filters as given below to let Spark push the predicates the load only the data you need.
from pyspark.sql.functions import *
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table=table_name, keyspace=keys_space_name)\
.load()\
.filter(col("id")=="A")
df.createOrReplaceTempView("Test")
In pyspark you're using SQL, not CQL. If the SQL query somehow matches the CQL, i.e., you're querying by partition or primary key, then Spark Cassandra Connector (SCC) will transform query into that CQL, and execute (so-called predicates pushdown). If it doesn't match, then Spark will load all data via SCC, and perform filtering on the Spark level.
So after you're registered temporary view, you can do:
val result = spark.sql("select ... from Test where ...")
and work with results in result variable. To check if predicates pushdown happens, execute result.explain(), and check for the * marker in the conditions in the PushedFilters section.

Should we create separate dataframe for each table in a join query in SparkSQL

We need to convert and execute execute hive queries in Spark SQL.The query involves a join between 2 tables.We will create a dataframe and then sparksql queries on top of it.Please find samples hive query along with converted query.
------Hive query
select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6.b.col7
from table1 a left outer join table2 b
on a.col3=b.col3
-----Spark SQL
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val q1=hivecontext.sql("select col1,col2,col3,col4 from table1");
val q2=hivecontext.sql("select col3,col5,col6,col7 from table2");
val q3=q1.join(q2,q1("col3")===q2("col3"));
But it is also possible for us to execute the entire query in a single data frame as below
**
val q5=hivecontext.sql("select
a.col1,a.col2,a.col3,b.col4,b.col5,b.col6.b.col7
from table1 a left outer join table2 b
on a.col3=b.col3")**
I would like to know which of the 2 approach(single vs multiple dataframe) we is better to use in such situation and the advantages over the other in various parameters like performance and readability.
Second approach seems to be wise in all aspects
When you run SQL on top of Hive data, HiveContext will run the query in hive and returns the result metadata to Spark. So spark just need to store the resultant metadata set.But in the above case it has to store all the data in hive into its RDD's.
Maintaining a single RDD helps in optimizing DAG as well.
If you run as a single query even Spark catalyst will optimize it more.
It looks even better for Readability.
Both the approaches are identical. It doesn't matter really from the performance standpoint. Catalyst optimizer will create the same physical plan for both the queries.
Now however there are other aspects to consider. Writing SQL query is generally easy however you loose the compile time type check. If you have a typo or incorrect column name in the SQL it is impossible to find unless you run that on the cluster. However, if you are using dataframe operation the code won't compile. So it helps faster coding speed.
But again writing complex SQL with dataframe APIs is not trivial tasks. So generally I use Dataframe APIs where the operations are relatively easy and use SQL for complex queries.

How to list partition-pruned inputs for Hive tables?

I am using Spark SQL to query data in Hive. The data is partitioned and Spark SQL correctly prunes the partitions when querying.
However, I need to list either the source tables along with partition filters or the specific input files (.inputFiles would be an obvious choice for this but it does not reflect pruning) for a given query in order to determine on which part of the data the computation will be taking place.
The closest I was able to get was by calling df.queryExecution.executedPlan.collectLeaves(). This contains the relevant plan nodes as HiveTableScanExec instances. However, this class is private[hive] for the org.apache.spark.sql.hive package. I think the relevant fields are relation and partitionPruningPred.
Is there any way to achieve this?
Update: I was able to get the relevant information thanks to Jacek's suggestion and by using getHiveQlPartitions on the returned relation and providing partitionPruningPred as the parameter:
scan.findHiveTables(execPlan).flatMap(e => e.relation.getHiveQlPartitions(e.partitionPruningPred))
This contained all the data I needed, including the paths to all input files, properly partition pruned.
Well, you're asking for low-level details of the query execution and things are bumpy down there. You've been warned :)
As you noted in your comment, all the execution information are in this private[hive] HiveTableScanExec.
One way to get some insight into HiveTableScanExec physical operator (that is a Hive table at execution time) is to create a sort of backdoor in org.apache.spark.sql.hive package that is not private[hive].
package org.apache.spark.sql.hive
import org.apache.spark.sql.hive.execution.HiveTableScanExec
object scan {
def findHiveTables(execPlan: org.apache.spark.sql.execution.SparkPlan) = execPlan.collect { case hiveTables: HiveTableScanExec => hiveTables }
}
Change the code to meet your needs.
With the scan.findHiveTables, I usually use :paste -raw while in spark-shell to sneak into such "uncharted areas".
You could then simply do the following:
scala> spark.version
res0: String = 2.4.0-SNAPSHOT
// Create a Hive table
import org.apache.spark.sql.types.StructType
spark.catalog.createTable(
tableName = "h1",
source = "hive", // <-- that makes for a Hive table
schema = new StructType().add($"id".long),
options = Map.empty[String, String])
// select * from h1
val q = spark.table("h1")
val execPlan = q.queryExecution.executedPlan
scala> println(execPlan.numberedTreeString)
00 HiveTableScan [id#22L], HiveTableRelation `default`.`h1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#22L]
// Use the above code and :paste -raw in spark-shell
import org.apache.spark.sql.hive.scan
scala> scan.findHiveTables(execPlan).size
res11: Int = 1
relation field is the Hive table after it's been resolved using ResolveRelations and FindDataSourceTable logical rule that Spark analyzer uses to resolve data source and hive tables.
You can get pretty much all the information Spark uses from a Hive metastore using ExternalCatalog interface that is available as spark.sharedState.externalCatalog. That gives you pretty much all the metadata Spark uses to plan queries over Hive tables.

How to translate HiveQL query to corresponding DataFrame operation?

In my usecase, I was using hivecontext inside myRDD.map() function. I got error that java.lang nullpointerexception. I realized, it is not possible to use hiveContext inside the map logic. The hivecontext was used to fire a hiveql query to another table (conf). hiveContext.sql(). The query is like this
select config_date, filefilter, family, id from mydb.conf where
id == 178 and config_date < cast("2011-02-04 13:05:41.0" as
timestamp) and family == "drf" order by config_date desc limit 1
I have decided to create a dataframe of this table before the start of the map process in the driver code. And perform dataframe operations inside the map logic. Basically, want to do method calls over dataframe instead of using hivecontext to query.
Is it possible? Can someone help me out here how to replicate this query over dataframe?
Yes, translating your Hive query to Dataset is perfectly possible.
You can just spark.sql(yourQueryHere) or rewrite the query to use Spark SQL's Dataset API.
Just load your Hie table using spark.read.table("mydb.conf") and do the filtering and ordering.
val conf = spark.
read.
table("mydb.conf").
select("config_date", "filefilter", "family", "id").
... // you know the rest
You can then join this Dataset with the other and apply joined transformation that will will avoid using hiveContext inside map.

Spark SQL performance is very bad

I want to use SPARK SQL. I found the performance is very bad.
In my first solution:
when each SQL query coming,
load the data from hbase entity to a dataRDD,
then register this dataRDD to SQLcontext.
at last execute spark SQL query.
Apparently the solution is very bad because it needs to load data every time.
So I improved the first solution.
In my second solution don't consider hbase data updates and inserts:
When app starts, load the current data from HBASE entity to a dataRDD, named cachedDataRDD.
Register cachedDataRDD to SQLcontext
When each SQL query coming, execute spark SQL query. The performance is very good.
But some entity needs to consider the updates and inserts.
So I changed the solution base on the second solution.
In my third solution need to consider the hbase data updates and inserts:
When app starts, load the current data from HBASE entity to a dataRDD, named cachedDataRDD.
When SQL query coming, load the new updates and inserts data to another dataRDD, named newDataRDD.
Then set cachedDataRDD = cachedDataRDD.union(dataRDD);
Register cachedDataRDD to SQLcontext
At last execute spark SQL query.
But I found the union transformation will cause the collect action for getting the query result is very slow. Much slow than the hbase api query.
Is there any way to tune the third solution performance? Usually under what condition to use the spark SQL is better? Is any good use case of using spark SQL? Thanks
Consider creating a new table for newDataRDD and do the UNION on the Spark SQL side. So for example, instead of unioning the RDD, do:
SELECT * FROM data
UNION
SELECT * FROM newData
This should provide more information to the query optimizer and hopefully help make your query faster.

Resources