How is spark HiveContext/SQLContext retrieving schema/data? - apache-spark

I can't seem to find much documentation on this. When I pull data from Hive in Spark SQL, how does it retrieve the schema? Does it automatically look in the Hive metastore? Is it Hive that tells Spark where the files are located so the data can be pulled into a DataFrame? And how does it handle a view, or can it not handle views yet?

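Yes, Spark looks the table up in the Hive metastore.
Spark does not hand the query to Hive for execution. It reads the table's metadata (schema, storage format, and file location) from the metastore, then reads the underlying files itself and returns the result as a DataFrame of rows.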
From docs:
When working with Hive one must construct a HiveContext, which
inherits from SQLContext, and adds support for finding tables in the
MetaStore and writing queries using HiveQL

Related

Write a spark DataFrame to a table

I am trying to understand the Spark DataFrame API method called saveAsTable.
I have the following questions:
If I simply write a DataFrame using the saveAsTable API,
df7.write.saveAsTable("t1") (assuming t1 did not exist earlier), will the newly created table be a Hive table that can be read outside Spark using HiveQL?
Does Spark also create non-Hive tables (tables created using the saveAsTable API that cannot be read outside Spark using HiveQL)?
How can I check whether a table is a Hive table or a non-Hive table?
(I am new to big data processing, so pardon me if the question is not phrased properly.)
Yes. The newly created table will be a Hive table and can be queried from the Hive CLI, but only if the DataFrame was created from a single, non-partitioned input HDFS path.
Below is the documentation comment in the DataFrameWriter.scala class:
When the DataFrame is created from a non-partitioned
HadoopFsRelation with a single input path, and the data source
provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and
Parquet), the table is persisted in a Hive compatible format, which
means other systems like Hive will be able to read this table.
Otherwise, the table is persisted in a Spark SQL specific format.
Yes, you can. Your table can be partitioned by a column, but it cannot use bucketing (bucketing is a point of incompatibility between Spark and Hive).
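For the question of how to check what kind of table was created, one option is to inspect the metadata Spark recorded for it. The sketch below is a hedged illustration and not part of the answers above: describe_table is a hypothetical helper, spark is assumed to be an existing Hive-enabled SparkSession, and t1 is the table from the question. The row labels returned by DESCRIBE FORMATTED (for example Provider or Serde Library in Spark 2.x and later) vary by Spark version, so treat those names as assumptions.

```python
def describe_table(spark, table_name):
    """Return the DESCRIBE FORMATTED output as a {label: value} dict.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession
    created with enableHiveSupport().
    """
    rows = spark.sql("DESCRIBE FORMATTED {}".format(table_name)).collect()
    info = {}
    for row in rows:
        # Each row is (col_name, data_type, comment); skip blank separator rows.
        label = (row[0] or "").strip()
        if label:
            info[label] = row[1]
    return info


# Usage (assumes a running Hive-enabled session and an existing table t1):
# info = describe_table(spark, "t1")
# print(info.get("Provider"), info.get("Serde Library"))
```

The dict mixes column rows and table-level rows, so a column that happens to share a label with a metadata row would be overwritten by it; for a quick manual check that is usually acceptable.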

Understanding how Hive SQL gets executed in Spark

I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Ex:
warehouse_location = '/user/hive/warehouse'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Pyspark").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?
I am just wondering how the SQL is processed: in Hive or in Spark?
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but this is no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen on Spark. You can check this in your example by running DF.count() and tracking the job in the Spark UI at http://localhost:4040.
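A hedged PySpark sketch of that check follows. build_session and count_with_plan are hypothetical helper names; the table name hive_table comes from the question, and the warehouse path is an assumption. explain() prints the physical plan produced by Spark's own planner, and count() is an action that starts a Spark job, which is what appears in the Spark UI.

```python
def build_session(warehouse_dir="/user/hive/warehouse"):
    """Create (or reuse) a SparkSession that reads metadata from the Hive metastore."""
    # Imported inside the function so this file loads even where PySpark is absent.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("Pyspark")
        .config("spark.sql.warehouse.dir", warehouse_dir)
        .enableHiveSupport()
        .getOrCreate()
    )


def count_with_plan(spark, table_name):
    """Print Spark's physical plan for a full scan of the table, then count its rows."""
    df = spark.sql("SELECT * FROM {}".format(table_name))
    df.explain()       # prints the plan Spark will execute
    return df.count()  # action: runs a Spark job, visible in the Spark UI


# Usage (assumes a reachable metastore and an existing table named hive_table):
# spark = build_session()
# print(count_with_plan(spark, "hive_table"))
```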

Is it possible to use Spark with ORC file format without Hive?

I am working with HDP 2.6.4, to be more specific Hive 1.2.1 with TEZ 0.7.0 and Spark 2.2.0.
My task is simple: store data in the ORC file format, then use Spark to process the data. To achieve this, I am doing the following:
1. Create a Hive table through HiveQL
2. Use spark.sql("select ... from ...") to load the data into a DataFrame
3. Process the DataFrame
My questions are:
1. What is Hive's role behind the scene?
2. Is it possible to skip Hive?
You can skip Hive and use Spark SQL to run the command in step 1.
In your case, Hive defines a schema over your data and provides a query layer through which Spark and external clients can communicate.
Otherwise, spark.read.orc and DataFrame.write.orc exist for reading and writing DataFrames directly on the filesystem.
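A minimal sketch of the direct-to-filesystem route, under stated assumptions: orc_round_trip is a hypothetical helper, spark is an existing SparkSession, and the path is a placeholder. Depending on the Spark version, ORC support may still require a Spark build that includes the Hive module even though no Hive table or metastore is involved; my understanding is that this applies to releases before 2.3, so verify it for your distribution.

```python
def orc_round_trip(spark, df, path):
    """Write df as ORC files under path, then read them back as a new DataFrame."""
    df.write.mode("overwrite").orc(path)  # plain files; no table or metastore entry is created
    return spark.read.orc(path)           # schema is read from the ORC files themselves


# Usage (the path is a placeholder):
# df = spark.range(5)
# back = orc_round_trip(spark, df, "/tmp/example_orc")
# back.printSchema()
```

With this approach nothing registers the data as a table, so external clients that rely on the metastore will not see it.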

Does SparkSession always use Hive Context?

I can use SparkSession to get the list of tables in Hive, or access a Hive table as shown in the code below. Now my question is if in this case, I'm using Spark with Hive Context?
Or is it that to use hive context in Spark, I must directly use HiveContext object to access tables, and perform other Hive related functions?
spark.catalog.listTables.show
val personnelTable = spark.catalog.getTable("personnel")
I can use SparkSession to get the list of tables in Hive, or access a Hive table as shown in the code below.
Yes, you can!
Now my question is if in this case, I'm using Spark with Hive Context?
It depends on how you created the spark value.
SparkSession has the Builder interface that comes with enableHiveSupport method.
enableHiveSupport(): Builder Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
If you used that method, you've got Hive support. If not, well, you don't have it.
You may think that spark.catalog is somehow related to Hive. Well, it was meant to offer Hive support, but by default the catalog is in-memory.
catalog: Catalog Interface through which the user may create, drop, alter or query underlying databases, tables, functions etc.
spark.catalog is just an interface for which Spark SQL ships two implementations: in-memory (the default) and hive.
Now, you might be asking yourself this question:
Is there any way, such as through spark.conf, to find out if Hive support has been enabled?
There's no isHiveEnabled method or similar I know of that you could use to know whether you work with a Hive-aware SparkSession or not (as a matter of fact you don't need this method since you're in charge of creating a SparkSession instance so you should know what your Spark application does).
In environments where you're given a SparkSession instance (e.g. spark-shell or Databricks), the only way to check whether a particular SparkSession has Hive support enabled is to look at the type of the catalog implementation.
scala> spark.sessionState.catalog
res1: org.apache.spark.sql.catalyst.catalog.SessionCatalog = org.apache.spark.sql.hive.HiveSessionCatalog@4aebd384
If you see HiveSessionCatalog used, the SparkSession instance is Hive-aware.
In spark-shell, we can also use spark.conf.getAll. This command returns the Spark session configuration, where the entry "spark.sql.catalogImplementation -> hive" indicates Hive support.
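The same configuration key can be read from PySpark. This is a small sketch rather than an official API for the purpose: has_hive_support is a hypothetical helper, and it assumes that a session without Hive support reports the in-memory catalog.

```python
def has_hive_support(spark):
    """Return True if the session's catalog implementation is 'hive'."""
    # The second argument is the fallback used when the key is not set.
    impl = spark.conf.get("spark.sql.catalogImplementation", "in-memory")
    return impl == "hive"


# Usage (assumes an existing SparkSession named spark):
# print(has_hive_support(spark))
```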

How to save temptable to Hive metastore (and analyze it in Hive)?

I use Spark 1.3.1.
How to store/save a DataFrame data to a Hive metastore?
If I run show tables in Hive, the DataFrame does not appear as a table in any Hive database. I have copied hive-site.xml to $SPARK_HOME/conf, but it didn't help (and the DataFrame does not appear in the Hive metastore either).
I am following this document, using spark 1.4 version.
dataframe.registerTempTable("people")
How to analyze the spark table in Hive?
You can use the insertInto method to save a DataFrame into a Hive table.
Example:
dataframe.insertInto("table_name", false);
// the second argument specifies whether existing data should be overwritten
Found the solution. I was using Spark 1.3.1, which does not support all of this. After moving to Spark 1.5.1, the problem was resolved.
I have noticed that DataFrames work fully from 1.4.0 onward. Many of the older commands are deprecated.
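As background to why the temp table never showed up in Hive: a temp table is registered only inside the current Spark session and is not written to the metastore. A minimal sketch of persisting the data instead, assuming Spark 1.4 or later (where the df.write API is available) and a Hive-enabled context; save_as_hive_table is a hypothetical helper name.

```python
def save_as_hive_table(df, table_name, mode="overwrite"):
    """Persist df as a metastore table instead of a session-scoped temp table."""
    df.write.mode(mode).saveAsTable(table_name)
    return table_name


# Usage (assumes Spark and Hive point at the same metastore via hive-site.xml):
# save_as_hive_table(dataframe, "people")
# Running "show tables" in Hive should then list people.
```

Whether Hive can also read the table's data depends on the storage format Spark chose when persisting it.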
