What are the differences between Apache Spark SQLContext and HiveContext?
Some sources say that since HiveContext is a superset of SQLContext, developers should always use HiveContext, which has more features than SQLContext. But the current APIs of the two contexts are mostly the same.
In which scenarios is SQLContext or HiveContext more useful?
Is HiveContext more useful only when working with Hive?
Or is SQLContext all that is needed when implementing a Big Data app using Apache Spark?
Spark 2.0+
Spark 2.0 provides native window functions (SPARK-8641), improved parsing, and much better SQL 2003 compliance, so it is significantly less dependent on Hive for core functionality. As a result, HiveContext (SparkSession with Hive support) has become less important.
Spark < 2.0
Obviously, if you want to work with Hive you have to use HiveContext. Beyond that, the biggest differences as of now (Spark 1.5) are support for window functions and the ability to access Hive UDFs.
Generally speaking, window functions are a pretty cool feature and can be used to solve quite complex problems in a concise way without going back and forth between RDDs and DataFrames. Performance is still far from optimal, especially without a PARTITION BY clause, but that is nothing Spark-specific.
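As a sketch of what this looks like with the DataFrame API (Spark 2.x syntax; the data and column names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder.appName("window-example").getOrCreate()
import spark.implicits._

// Toy data: (department, revenue)
val sales = Seq(("a", 10), ("a", 30), ("b", 20), ("b", 40)).toDF("dept", "revenue")

// Rank rows within each department by descending revenue,
// then keep the top row per department -- no RDD round-trip needed.
val byDept = Window.partitionBy("dept").orderBy($"revenue".desc)
sales.withColumn("rank", row_number().over(byDept))
  .filter($"rank" === 1)
  .show()
```

On Spark < 2.0 the same query requires a HiveContext; with a plain SQLContext it fails at analysis time.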
Regarding Hive UDFs, this is not a serious issue now, but before Spark 1.5 many SQL functions were implemented using Hive UDFs and required HiveContext to work.
HiveContext also provides a more robust SQL parser. See for example: py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement
Finally, HiveContext is required to start the Thrift server.
The biggest problem with HiveContext is that it comes with large dependencies.
When programming against Spark SQL we have two entry points depending on
whether we need Hive support. The recommended entry point is the HiveContext to
provide access to HiveQL and other Hive-dependent functionality. The more basic
SQLContext provides a subset of the Spark SQL support that does not depend on
Hive.
-The separation exists for users who might have conflicts with including all of the Hive dependencies.
-Additional features of HiveContext not found in SQLContext include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables.
-Using a HiveContext does not require an existing Hive setup.
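For Spark < 2.0, the two entry points described above are constructed like this (a minimal sketch):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("example"))

// Basic entry point with no Hive dependencies:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Superset: HiveQL parser, Hive UDFs, and Hive table access.
// Works even without an existing Hive installation.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
```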
HiveContext is still a superset of SQLContext; it has certain extra capabilities, such as reading configuration from hive-site.xml. If you use Hive, use HiveContext; otherwise, simply use SQLContext.
Related
In the Spark official documentation, two types of SQL syntax are mentioned: Spark native SQL syntax and Hive QL syntax. I couldn't find a detailed explanation of their differences, and I'm confused by the following questions:
Is Spark native SQL syntax a subset of Hive QL? I ask because some articles say so. But according to the explanation on the Spark official page https://spark.apache.org/docs/3.0.0-preview2/sql-migration-guide.html#compatibility-with-apache-hive , it seems that Spark SQL does not support all features of Hive QL.
If the answer to question 1 is yes, why can I run "join A rlike B" in Spark SQL but not in Hive?
How does Spark decide whether to treat a SQL statement as Spark native SQL or Hive QL?
When we use enableHiveSupport during initialization of the SparkSession, does it mean Spark will treat all given SQL statements as Hive QL?
Prologue
HiveQL is a mixture of SQL-92, MySQL, and Oracle's SQL dialect. It also provides features from later SQL standards, such as window functions. Additionally, HiveQL adds some features that don't belong to any SQL standard; they are inspired by MapReduce, e.g., multi-table inserts.
Briefly speaking, you can analyze data with the Java-based power of MapReduce via the SQL-like HiveQL, since Apache Hive is a kind of data warehouse on top of Hadoop.
With Spark SQL, you can read and write data in a variety of structured formats, one of which is Hive tables. Spark SQL supports ANSI SQL:2003-compliant commands as well as HiveQL. In a nutshell, you can manipulate data with the power of the Spark engine via the SQL-like Spark SQL, and Spark SQL covers the majority of HiveQL's features.
When working with Hive, you must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. Spark deals with the storage for you.
import org.apache.spark.sql.SparkSession

val warehouseLocation = "spark-warehouse" // directory for managed tables

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
Answers
I would say that they overlap heavily. Spark SQL is almost a superset of HiveQL.
Spark SQL is not a subset of HiveQL. As for the latter part of the question: regular-expression predicates were introduced in the SQL:2003 standard. Spark SQL is SQL:2003-compliant, while HiveQL implements only a few of the features introduced in SQL:2003, and rlike is not among them.
You'd have to look at the Spark source code. Practically speaking, in my opinion, one only needs to keep in mind that Spark SQL helps you read and write data from a variety of data sources and that it covers HiveQL. Spark SQL has most of HiveQL's capabilities.
Not exactly. Spark SQL is Spark SQL. With the functionality you mentioned enabled, it usually means you're going to communicate with Apache Hive. Even if you don't have an actual Apache Hive installation, with that functionality enabled you can use some HiveQL features via Spark SQL, since Spark SQL supports the majority of HiveQL's features and Spark has an internal mechanism to handle data-warehouse storage.
/* Example of Utilizing HiveQL via Spark SQL */
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
/* Utilize a feature from HiveQL */
SELECT * FROM person
LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age;
References
Damji, J., Wenig, B., Das, T. and Lee, D. (2020). Learning Spark: Lightning-Fast Data Analytics. 2nd ed. Sebastopol, CA: O'Reilly, pp. 83-112.
ISO/IEC JTC 1/SC 32 (1992). Information technology — Database languages — SQL. ISO/IEC 9075:1992.
ISO/IEC JTC 1/SC 32 (2003). Information technology — Database languages — SQL — Part 2: Foundation (SQL/Foundation). ISO/IEC 9075-2:2003.
White, T. (2015). Hadoop: The Definitive Guide. 4th ed. Sebastopol, CA: O'Reilly Media, pp. 471-518.
Apache Hive Wiki (2013). LanguageManual LateralView. Available at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
spark.apache.org (2021). Hive Tables. Available at: https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html.
The Spark documentation lists the known incompatibilities.
I've also found some incompatibilities due to bugs in the Spark parser. It looks like Hive's parser is more robust.
You may also find differences in Spark's serialization/deserialization implementation, as explained in this answer. Basically, you must adjust these properties:
spark.sql.hive.convertMetastoreOrc=false
spark.sql.hive.convertMetastoreParquet=false
but beware that this will have a performance penalty.
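One way to set them is on the SparkSession builder (a sketch assuming Spark 2.x with Hive support; they can equally be passed via --conf on spark-submit):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-serde-example")
  .enableHiveSupport()
  // Fall back to Hive SerDes instead of Spark's built-in ORC/Parquet readers:
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .getOrCreate()
```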
I am running a Spark application that has a series of Spark SQL statements that are executed one after the other. The SQL queries are quite complex and the application is working (generating output). These days, I am working towards improving the performance of processing within Spark.
Please suggest whether Tungsten encoding has to be enabled separately, or whether it kicks in automatically when running Spark SQL.
I am using Cloudera 5.13 for my cluster (2 node).
It is enabled by default in Spark 2.x (and maybe in 1.6, but I'm not sure about that).
In any case, you can set this property:
spark.sql.tungsten.enabled=true
It can be enabled on spark-submit as follows:
spark-submit --conf spark.sql.tungsten.enabled=true
Tungsten should be enabled if you see a * next to the plan:
Also see: How to enable Tungsten optimization in Spark 2?
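For example, in spark-shell the star shows up in the physical plan of a simple aggregation (the exact plan text varies by Spark version; the output below is illustrative):

```scala
val df = spark.range(100).selectExpr("sum(id)")
df.explain()
// == Physical Plan ==
// *HashAggregate(keys=[], functions=[sum(id)])   <- leading * = Tungsten codegen
// +- Exchange SinglePartition
//    +- *HashAggregate(keys=[], functions=[partial_sum(id)])
//       +- *Range (0, 100, step=1, splits=8)
```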
Tungsten became the default in Spark 1.5 and can be enabled in an earlier version by setting spark.sql.tungsten.enabled=true.
Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.
To make sure your code benefits as much as possible from Tungsten optimizations, try to use the Dataset API with Scala (instead of RDDs).
Datasets bring the best of both worlds, mixing relational (DataFrame) and functional (RDD-style) transformations. The Dataset API is the most up to date and adds type safety, along with better error handling and far more readable unit tests.
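A minimal sketch of the difference in Scala (names are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder.appName("dataset-example").getOrCreate()
import spark.implicits._

val people = Seq(Person("John", 30), Person("Mary", 25)).toDS()

// Typed, compile-time-checked transformation that still goes through
// the Catalyst optimizer and Tungsten execution:
people.filter(_.age > 26).show()

// The untyped DataFrame equivalent would only fail at runtime
// if the column name were misspelled:
people.toDF().filter($"age" > 26).show()
```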
I am new to spark and hive. I need to understand what happens behind when a hive table is queried in Spark. I am using PySpark
Ex:
warehouse_location = '/user/hive/warehouse'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Pyspark").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark engine, or does it run in Hive's MapReduce framework?
I am just wondering how the SQL is being processed. Whether in Hive or in Spark?
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, better parser), but this is no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen in Spark. You can check this in your example by running DF.count() and tracking the job via the Spark UI at http://localhost:4040.
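For example, inspecting the physical plan (here in spark-shell; the same check works from PySpark via DF.explain()) shows Spark operators scanning the Hive table, with no MapReduce stage anywhere:

```scala
val df = spark.sql("select * from hive_table")
df.explain()  // shows a Spark scan of the Hive table, e.g. "Scan hive default.hive_table"
df.count()    // appears as a regular Spark job in the UI at http://localhost:4040
```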
I can use SparkSession to get the list of tables in Hive, or access a Hive table, as shown in the code below. Now my question is: in this case, am I using Spark with a Hive context?
Or, to use a Hive context in Spark, must I directly use a HiveContext object to access tables and perform other Hive-related functions?
spark.catalog.listTables.show
val personnelTable = spark.catalog.getTable("personnel")
I can use SparkSession to get the list of tables in Hive, or access a Hive table as shown in the code below.
Yes, you can!
Now my question is: in this case, am I using Spark with a Hive context?
It depends on how you created the spark value.
SparkSession has the Builder interface that comes with enableHiveSupport method.
enableHiveSupport(): Builder Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
If you used that method, you've got Hive support. If not, well, you don't have it.
You may think that spark.catalog is somehow related to Hive. Well, it was meant to offer Hive support, but by default the catalog is in-memory.
catalog: Catalog Interface through which the user may create, drop, alter or query underlying databases, tables, functions etc.
spark.catalog is just an interface that Spark SQL ships two implementations for - in-memory (the default) and hive.
Now, you might be asking yourself this question:
Is there anyway, such as through spark.conf, to find out if the hive support has been enabled?
There's no isHiveEnabled method or similar that I know of that you could use to check whether you are working with a Hive-aware SparkSession (as a matter of fact, you don't need such a method, since you're in charge of creating the SparkSession instance, so you should know what your Spark application does).
In environments where you're given a SparkSession instance (e.g. spark-shell or Databricks), the only way to check whether a particular SparkSession has Hive support enabled is to look at the type of the catalog implementation.
scala> spark.sessionState.catalog
res1: org.apache.spark.sql.catalyst.catalog.SessionCatalog = org.apache.spark.sql.hive.HiveSessionCatalog@4aebd384
If you see HiveSessionCatalog used, the SparkSession instance is Hive-aware.
In spark-shell, we can also use spark.conf.getAll. This command returns the Spark session configuration, and seeing "spark.sql.catalogImplementation -> hive" indicates Hive support.
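A slightly more direct probe than scanning getAll (assuming Spark 2.x):

```scala
// "hive"      -> the session was built with enableHiveSupport()
// "in-memory" -> a plain SparkSession without Hive support
spark.conf.get("spark.sql.catalogImplementation")
```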
I am a bit confused with the different actors in this story: PySpark, SparkSQL, Cassandra and the pyspark-cassandra connector.
As I understand it, Spark has evolved quite a bit, and SparkSQL is now a key component (with 'dataframes'). Apparently there is absolutely no reason to work without SparkSQL, especially when connecting to Cassandra.
So my question is: what component are needed and how do I connect them together in the simplest way possible?
With spark-shell in Scala I could do simply
./bin/spark-shell --jars spark-cassandra-connector-java-assembly-1.6.0-M1-SNAPSHOT.jar
and then
import org.apache.spark.sql.cassandra.CassandraSQLContext
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("mykeyspace")
val dataframe = cc.sql("SELECT count(*) FROM mytable group by beamstamp")
How can I do that with pyspark?
Here are a couple of subquestions along with partial answers I have collected (correct if I'm wrong).
Is pyspark-cassandra needed? (I don't think so - I don't understand what it was doing in the first place.)
Do I need to use pyspark or could I use my regular jupyter notebook and import the necessary things myself?
Pyspark should be started with the spark-cassandra-connector package as described in the Spark Cassandra Connector python docs.
./bin/pyspark
--packages com.datastax.spark:spark-cassandra-connector_$SPARK_SCALA_VERSION:$SPARK_VERSION
With this loaded you will be able to use any of the DataFrame operations already present in Spark on C* dataframes. More details on the options for using C* dataframes.
To set this up to run with jupyter notebook just set up your env with the following properties.
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
And calling pyspark will start up a notebook correctly configured.
There is no need to use pyspark-cassandra unless you are interested in working with RDDs in Python, which has a few performance pitfalls.
In Python, the connector exposes the DataFrame API. As long as spark-cassandra-connector is available and SparkConf contains the required configuration, there is no need for additional packages. You can simply specify the format and options:
df = (sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(table="mytable", keyspace="mykeyspace")
.load())
If you want to use plain SQL, you can register the DataFrame as follows:
df.registerTempTable("mytable")
## Optionally cache
sqlContext.cacheTable("mytable")
sqlContext.sql("SELECT count(*) FROM mytable group by beamstamp")
Advanced features of the connector, like CassandraRDD are not exposed to Python so if you need something beyond DataFrame capabilities then pyspark-cassandra may prove useful.