I am trying to understand Spark's HiveContext.
When we write a query using HiveContext like this:
val sqlContext = new HiveContext(sc)
sqlContext.sql("select * from TableA inner join TableB on ( a=b) ")
Is it using the Spark engine or the Hive engine? I believe the query above is executed by the Spark engine. But if that's the case, why do we need DataFrames?
We could blindly copy all our Hive queries into sqlContext.sql("") and run them without using DataFrames.
By DataFrames, I mean something like this: TableA.join(TableB, a === b)
We can even perform aggregations using SQL commands. Could anyone please clarify the concept? Is there any advantage to using DataFrame joins rather than sqlContext.sql() joins?
The join is just an example. :)
HiveContext uses the Spark execution engine underneath; see the Spark source code.
Parser support in Spark is pluggable, and HiveContext uses Spark's HiveQL parser.
Functionally, you can do everything with SQL, and DataFrames are not strictly needed. But DataFrames provide a convenient way to achieve the same results without writing a SQL statement.
I want to execute a Cassandra CQL query using PySpark, but I cannot find a way to do it. I can load the whole table into a DataFrame, create a temp view, and query it.
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="country_production2", keyspace="country").load()
df.createOrReplaceTempView("Test")
Please suggest a better way to execute a CQL query in PySpark.
Spark SQL doesn't support Cassandra's CQL dialect directly. It only allows you to load the table as a DataFrame and operate on it.
If you are concerned about reading the whole table in order to query it, you can apply a filter as shown below so that Spark pushes the predicate down and loads only the data you need.
from pyspark.sql.functions import col
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table=table_name, keyspace=keys_space_name)\
.load()\
.filter(col("id")=="A")
df.createOrReplaceTempView("Test")
In PySpark you're using SQL, not CQL. If the SQL query matches what CQL can express, i.e., you're querying by partition key or primary key, then the Spark Cassandra Connector (SCC) will translate the query into CQL and execute it (so-called predicate pushdown). If it doesn't match, Spark will load all the data via SCC and perform the filtering at the Spark level.
So after you've registered the temporary view, you can do:
result = spark.sql("select ... from Test where ...")
and work with the results in the result variable. To check whether predicate pushdown happens, execute result.explain() and look for the * marker on the conditions in the PushedFilters section.
I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Example:
warehouse_location = '/user/hive/warehouse'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Pyspark").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
DF = spark.sql("select * from hive_table")
In the case above, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?
I am just wondering how the SQL is processed: in Hive or in Spark?
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but this is no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen on Spark. You can check this in your example by running DF.count() and tracking the job in the Spark UI at http://localhost:4040.
I am working with HDP 2.6.4, to be more specific Hive 1.2.1 with TEZ 0.7.0 , Spark 2.2.0.
My task is simple. Store data in ORC file format then use Spark to process the data. To achieve this, I am doing this:
Create a Hive table through HiveQL
Use spark.sql("select ... from ...") to load data into a DataFrame
Process against the dataframe
My questions are:
1. What is Hive's role behind the scenes?
2. Is it possible to skip Hive?
You can skip Hive and use Spark SQL to run the command in step 1.
In your case, Hive defines a schema over your data and provides a query layer through which Spark and external clients can communicate.
Otherwise, spark.read.orc and DataFrame.write.orc exist for reading and writing DataFrames directly on the filesystem.
We need to convert Hive queries and execute them in Spark SQL. The query involves a join between two tables. We will create DataFrames and then run Spark SQL queries on top of them. Please find a sample Hive query below, along with the converted query.
------Hive query
select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6,b.col7
from table1 a left outer join table2 b
on a.col3=b.col3
-----Spark SQL
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val q1=hiveContext.sql("select col1,col2,col3,col4 from table1");
val q2=hiveContext.sql("select col3,col5,col6,col7 from table2");
val q3=q1.join(q2,q1("col3")===q2("col3"),"left_outer");
But it is also possible for us to execute the entire query in a single DataFrame, as below:
val q5=hiveContext.sql("""select
a.col1,a.col2,a.col3,b.col4,b.col5,b.col6,b.col7
from table1 a left outer join table2 b
on a.col3=b.col3""")
I would like to know which of the two approaches (single vs. multiple DataFrames) is better to use in this situation, and what advantages each has over the other in terms of performance and readability.
The second approach seems wise in all respects.
When you run SQL on top of Hive data, HiveContext reads only the table metadata from the Hive metastore; Spark itself executes the query. DataFrames are evaluated lazily, so in the first approach q1 and q2 are not loaded in full before the join either.
Maintaining a single DataFrame also helps keep the DAG simple.
If you run it as a single query, Spark's Catalyst optimizer sees the whole query at once.
It also reads better.
Both approaches are equivalent. It doesn't really matter from a performance standpoint: the Catalyst optimizer will create the same physical plan for both queries.
However, there are other aspects to consider. Writing a SQL query is generally easy, but you lose compile-time checking: the query is just a string, so a typo in a keyword or a column name is not found until you run it. With DataFrame operations, a mistyped method name won't compile, and with the typed Dataset API neither will a wrong field name (column names passed as strings are still only checked at runtime). So it helps you catch mistakes sooner.
But then again, writing complex SQL with the DataFrame API is not a trivial task. So I generally use the DataFrame API where the operations are relatively simple and SQL for complex queries.
In my use case, I was using HiveContext inside a myRDD.map() function. I got a java.lang.NullPointerException. I realized that it is not possible to use hiveContext inside the map logic. The HiveContext was used to fire a HiveQL query against another table (conf) via hiveContext.sql(). The query looks like this:
select config_date, filefilter, family, id from mydb.conf where
id == 178 and config_date < cast("2011-02-04 13:05:41.0" as
timestamp) and family == "drf" order by config_date desc limit 1
I have decided to create a DataFrame of this table in the driver code before the map process starts, and to perform DataFrame operations inside the map logic. Basically, I want to make method calls on the DataFrame instead of using HiveContext to query.
Is that possible? Can someone help me replicate this query on a DataFrame?
Yes, translating your Hive query to a Dataset is perfectly possible.
You can just call spark.sql(yourQueryHere) or rewrite the query to use Spark SQL's Dataset API.
Just load your Hive table using spark.read.table("mydb.conf") and do the filtering and ordering.
val conf = spark.
read.
table("mydb.conf").
select("config_date", "filefilter", "family", "id").
... // you know the rest
You can then join this Dataset with the other one and apply a transformation on the joined result, which avoids using hiveContext inside map.