Is it possible to use Spark with ORC file format without Hive? - apache-spark

I am working with HDP 2.6.4; to be more specific, Hive 1.2.1 with Tez 0.7.0 and Spark 2.2.0.
My task is simple: store data in ORC file format, then use Spark to process the data. To achieve this, I am doing the following:
1. Create a Hive table through HiveQL
2. Use spark.sql("select ... from ...") to load the data into a DataFrame
3. Process the DataFrame
My questions are:
1. What is Hive's role behind the scene?
2. Is it possible to skip Hive?

You can skip Hive and use Spark SQL to run the command in step 1.
In your case, Hive defines a schema over your data and provides a query layer through which Spark and external clients can communicate.
Otherwise, spark.read.orc and df.write.orc exist for reading and writing DataFrames as ORC directly on the filesystem, as in the sketch below.
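A minimal sketch of the Hive-free approach; the path /tmp/orc_demo and the column name are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OrcWithoutHive")
  .getOrCreate()

// Write a DataFrame as ORC files directly on the filesystem (no metastore involved)
val df = spark.range(10).toDF("id")
df.write.mode("overwrite").orc("/tmp/orc_demo")

// Read the ORC files back and query them with Spark SQL via a temporary view
val loaded = spark.read.orc("/tmp/orc_demo")
loaded.createOrReplaceTempView("orc_demo")
spark.sql("select count(*) from orc_demo").show()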

Related

Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive external table created on top of S3 (the files are in CSV format). When I run the Hive query, it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL has a WHERE condition on the partition column), the plan does not show that a partition filter is applied. For a Hive managed table, however, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of partitions of Hive external tables in Spark? Thanks.
Update:
For some reason, only the Spark plan is not showing the partition filters. However, when you look at the data loaded, it is only loading the data needed from the partitions.
For example, WHERE rating=0 loads only one file of 1 MB; when I don't have the filter, it reads all 3 partitions for 3 MB.
tl;dr: set the following before running the SQL against the external table:
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not due to the table being external vs. managed.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. The file format (I believe it is ORC in this case, from the screen capture)
Where the table was created (Hive or Spark)
If the table was created using Spark APIs, it is considered a datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both kinds of tables is stored in the Hive metastore; the only difference is in the provider field of the tables' TBLPROPERTIES (describe extended <tblName>). The value of the property is orc (or empty) for a Spark datasource table and hive for a Hive native table.
How Spark uses this information
When the provider is not hive (datasource table), Spark uses its native way of processing the data.
If the provider is hive, Spark uses Hive code to process the data.
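A quick way to check which kind of table you have (the table name is a placeholder):

spark.sql("describe extended my_table").show(100, false)
// Look for the provider entry in the output (or in TBLPROPERTIES):
// hive means a Hive native table; orc or an empty value means a Spark datasource table.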
File format
Spark provides config flags to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet.
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
  .doc("When set to true, the built-in ORC reader and writer are used to process " +
    "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
  .doc("When set to true, the built-in Parquet reader and writer are used to process " +
    "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
I also ran into this kind of problem when I had multiple joins of internal and external tables.
None of the tricks work, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?

Read all tables from one hive then write to another hive on another cluster using spark

We can read or write Hive tables by putting hive-site.xml into Spark's conf directory. But now I have two clusters which can be connected to each other. Let's say Hive 1 is on one cluster and Hive 2 is on another cluster.
Now I need to read data from Hive 1, do some transformation, and then write to Hive 2. The problem is that I can only put one hive-site.xml file into the Spark conf directory, which means that when I execute
someDataFrame.write.saveAsTable("dbName.tableName")
it will be saved to Hive 1, not Hive 2, because Spark only recognizes one Hive (Hive 1).
My question is: can I read from and write to different Hives on different clusters using Spark?
Since there would only be one Hive Context active during this operation, I'm going to say it's not possible.
At a minimum, you would have to actually register the table in the "local" Hive metastore as an external table with LOCATION hdfs://othernamenode:9000/table/path, then make Spark write to it that way (see the sketch below), but I've not tried it.
Alternatively, look into the Circus Train project for migrating Hive tables.
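A minimal sketch of the external-table idea above (untested); the database, table, columns, and path are placeholders, and the remote namenode is assumed to be reachable at hdfs://othernamenode:9000:

// Register an external table in the local metastore whose data lives on the other cluster
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS dbName.tableName (id INT, name STRING)
  STORED AS ORC
  LOCATION 'hdfs://othernamenode:9000/table/path'
""")

// Writing through this table should make the files land on the remote cluster's HDFS
someDataFrame.write.mode("append").insertInto("dbName.tableName")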

Understanding how Hive SQL gets executed in Spark

I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Ex:
warehouse_location = '/user/hive/warehouse'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Pyspark").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?
I am just wondering how the SQL is being processed: in Hive or in Spark?
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but this is no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen in Spark. You can check this in your example by running DF.count() and tracking the job via the Spark UI at http://localhost:4040.

How to save temptable to Hive metastore (and analyze it in Hive)?

I use Spark 1.3.1.
How do I store/save DataFrame data to a Hive metastore?
If I run show tables in Hive, the DataFrame does not appear as a table in the Hive databases. I have copied hive-site.xml to $SPARK_HOME/conf, but it didn't help (and the DataFrame does not appear in the Hive metastore either).
I am following this document, using Spark 1.4.
dataframe.registerTempTable("people")
How can I analyze the Spark table in Hive?
You can use the insertInto method to save a DataFrame to Hive.
Example:
dataframe.insertInto("table_name", false)
// 2nd arg specifies whether the table should be overwritten
Got the solution. I was using Spark 1.3.1, which does not support all of this. Now that I am using Spark 1.5.1, the problem is resolved.
I have noticed DataFrames working fully after 1.4.0. Many older commands are deprecated.
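For Spark 1.4 and later, a minimal sketch of persisting a DataFrame to the Hive metastore via the DataFrameWriter API; the database and table names are placeholders, and Hive support must be configured (e.g. hive-site.xml on the classpath):

// Registers the table in the Hive metastore and writes its data to the warehouse
dataframe.write
  .mode("overwrite")
  .saveAsTable("default.people")

// By contrast, registerTempTable / createOrReplaceTempView only creates a session-local
// temporary table, which is why it never shows up in Hive's show tables.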

How is spark HiveContext/SQLContext retrieving schema/data?

I can't seem to find much documentation on it, but when I pull data from Hive in Spark SQL, how does it retrieve the schema? Is it automatically looking in the Hive metastore? Also, is it Hive telling Spark to look at the file location to pull the data into a DataFrame? And how does it handle a view, or can it not handle views yet?
Yes, it looks up the Hive metastore.
Spark delegates Hive queries to Hive. It captures the output and turns it into a DataFrame of rows.
From docs:
When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding tables in the MetaStore and writing queries using HiveQL
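A minimal sketch of that pattern with the older Spark 1.x API (the table name is a placeholder):

import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext
val hiveContext = new HiveContext(sc)

// The table's schema and file locations are looked up in the Hive metastore
val df = hiveContext.sql("SELECT * FROM some_table")
df.show()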
