Hive query execution using Spark engine

I have set up Hadoop 2.7.2, Hive 2.1, Scala 2.11.8, and Spark 2.0 on an Ubuntu 16.04 system.
Hadoop, Hive, and Spark are working well. I can connect to the Hive CLI and run MapReduce jobs without any problem.
I need to improve the performance of my Hive queries that use an ORDER BY clause.
I have to use the Hive CLI only and cannot use spark-shell.
I am trying to use Spark as the query execution engine for Hive.
Following the instructions in this link, I set some properties in Hive:
set hive.execution.engine=spark;
set spark.home=/usr/local/spark;
set spark.master=spark://ip:7077;
I executed this query:
select count(*) from table_name;
Then it throws this exception:
Failed to create spark client.
I also increased the timeout for the Hive client's connection to Spark, but that did not help.

First, I recommend using the shell and following these steps:
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
And you can run:
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

// Create a HiveContext from the shell's existing SparkContext (sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Create a Hive table stored as ORC to verify the Hive connection works
hiveContext.sql("create table myTable (myField STRING) stored as orc")
If this works, you can run any other SQL query through hiveContext.
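For example, a small follow-up sketch querying the table created above; any SQL that Hive understands should work the same way:

// Query the newly created table through the same HiveContext
hiveContext.sql("select count(*) from myTable").show()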

Related

Apache Spark: how can I understand and control if my query is executed on Hive engine or on Spark engine?

I am running a local instance of Spark 2.4.0.
I want to execute a SQL query against Hive.
Before, with Spark 1.x.x, I was using HiveContext for this:
import org.apache.spark.sql.hive.HiveContext
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hivequery = hc.sql("show databases")
But now I see that HiveContext is deprecated: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/hive/HiveContext.html. Inside the HiveContext.sql() code I see that it is now simply a wrapper over SparkSession.sql(). The recommendation is to use enableHiveSupport in the SparkSession builder, but as this question clarifies, that only concerns the metastore and the list of tables; it does not change the execution engine.
So the questions are:
How can I tell whether my query is running on the Hive engine or on the Spark engine?
How can I control this?
From my understanding, there is no separate Hive engine to run your query. You submit a query to Hive, and Hive executes it on one of these engines:
Spark
Tez (a DAG engine that generalizes MapReduce)
MapReduce (classic Hadoop)
If you use Spark, your query will be executed by Spark using Spark SQL (starting with Spark 1.5.x, if I recall correctly).
Which engine Hive uses depends on its configuration; I remember seeing a Hive-on-Spark configuration in the Cloudera distribution.
So Hive would use Spark to execute the job for your query (instead of MapReduce or Tez), but Hive itself still parses and analyzes the query.
Using a local Spark instance, you will only use the Spark engine (Spark SQL / Catalyst), but you can use it with Hive support. That means you can read an existing Hive metastore and interact with it.
It requires a Spark installation with Hive support: the Hive dependencies and hive-site.xml on your classpath.
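As a minimal sketch of that setup (assuming hive-site.xml is already on the classpath; the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

// Build a session that can read the existing Hive metastore
val spark = SparkSession.builder()
  .appName("spark-with-hive-support")
  .enableHiveSupport()
  .getOrCreate()

// This query runs on Spark's engine (Catalyst) but reads Hive's metastore
spark.sql("show databases").show()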

Can not write Spark dataframe into Cassandra table

I am connecting Spark on HDP 3.0 with Cassandra to write a DataFrame into a Cassandra table, but I am receiving the error below.
My code for writing into the Cassandra table is as follows:
HDP 3.0 is based on Hadoop 3.1.1, which uses the commons-configuration2 library instead of the commons-configuration library used by the Spark Cassandra Connector. You can start your spark-shell or spark-submit with the following:
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.1,commons-configuration:commons-configuration:1.10
to explicitly add commons-configuration.
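Once the shell is up, writing the DataFrame would look roughly like this (a sketch: df is the DataFrame from the question, and keyspace ks / table tbl are placeholder names):

// Write the DataFrame to an existing Cassandra table via the connector
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "tbl"))
  .mode("append")
  .save()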

Can Spark-sql work without a hive installation?

I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, as in the example below, I get the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
  .createOrReplaceTempView("some_sql_view")

spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
    FROM some_sql_view GROUP BY DEST_COUNTRY_NAME""")
  .where("DEST_COUNTRY_NAME like 'S%'")
  .where("sum(count) > 10")
  .count()
Most of the fixes that I have seen for this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
Yes, you can run Spark SQL queries on Spark without installing Hive. By default, Hive uses MapReduce as its execution engine; Hive can be configured to use Spark or Tez instead to execute queries much faster. With Hive on Spark, Hive uses the Hive metastore to run Hive queries. At the same time, SQL queries can be executed through Spark directly. If Spark is used to execute simple SQL queries without being connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db is created under the home folder of the user who executes the query.
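For reference, switching Hive's engine is a Hive-side setting; a minimal sketch of the corresponding hive-site.xml entry (equivalent to set hive.execution.engine=spark; in the Hive CLI):

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>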

How to query hdfs from a spark cluster (2.1) which is running on kubernetes?

I was trying to access HDFS files from a Spark cluster running inside Kubernetes containers.
However, I keep getting the error:
AnalysisException: 'The ORC data source must be used with Hive support enabled;'
What am I missing here?
Do you have a SparkSession created with enableHiveSupport()?
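A rough sketch of what that looks like (the app name and HDFS path are placeholders):

import org.apache.spark.sql.SparkSession

// In Spark 2.1 the ORC data source is provided through Hive,
// so Hive support must be enabled to read ORC files
val spark = SparkSession.builder()
  .appName("orc-on-k8s")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder path; replace with your actual HDFS location
val df = spark.read.orc("hdfs://namenode:8020/path/to/data")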
Similar issue:
Spark can access Hive table from pyspark but not from spark-submit

Spark SQL: how does it tell hive to run query on spark?

As rightly pointed out here:
Spark SQL query execution on Hive
Spark SQL, when running through HiveContext, will make the SQL query use the Spark engine.
How does Spark SQL setting hive.execution.engine=spark tell Hive to do so?
Note that this works automatically; we do not have to specify this in hive-site.xml in Spark's conf directory.
There are two independent projects here:
Hive on Spark - a Hive project that integrates Spark as an additional execution engine.
Spark SQL - a Spark module that makes use of the Hive code.
HiveContext belongs to the second, and hive.execution.engine is a property of the first. Spark SQL never hands the query back to Hive: it reuses Hive code (most importantly the metastore) but always executes with its own engine, so the hive.execution.engine property plays no role there.
