The Beeline script, located in Spark/bin, is one way of connecting to HiveServer2.
I ran a simple query, and in the output I can see that Map-Reduce is being launched.
I am just trying to understand the advantage of the Beeline feature in Spark, given that it follows the traditional Map-Reduce execution framework.
Can we use Spark's RDD features in Beeline?
Thanks in advance.
Beeline is not part of Spark.
It's just a HiveServer2 client.
You can launch the Spark shell and execute queries within it, but that has nothing to do with Beeline, and Beeline has nothing to do with Spark.
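If the goal is simply to run SQL through Spark instead of through Beeline, a minimal sketch in the Spark shell would look like this (the table name is a placeholder, and querying a Hive table assumes the session was built with Hive support):

// In spark-shell, `spark` is a ready-made SparkSession.
// The query is planned and executed by Spark itself, not by Map-Reduce.
val df = spark.sql("SELECT * FROM some_hive_table LIMIT 10")  // placeholder table name
df.show()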
This is one way. If you don't want to use MapReduce, you can use Tez as the execution engine, which runs in memory and is faster than MR.
SET hive.execution.engine=tez;
But you cannot run Spark from Beeline. Beeline is a standalone application that connects to HiveServer2.
Related
We have a Spark program which executes multiple queries, and the tables are Hive tables.
Currently the queries are executed from Spark using the Tez engine.
I set sqlContext.sql("SET hive.execution.engine=spark") in the program, and my understanding is that the queries/program would then run on Spark. We are using HDP 2.6.5 and Spark 2.3.0 in our cluster.
Can someone confirm whether this is the correct approach, given that we do not need to run the queries with the Tez engine and want Spark to run them as is?
In the config file /etc/spark2/conf/hive-site.xml we do not have any specific engine property set up; we only have the Kerberos and metastore property details.
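For reference, a rough sketch of how the program issues the queries (database and table names below are placeholders, not our real ones; a SparkSession is built here, although sqlContext.sql behaves the same way for this purpose):

import org.apache.spark.sql.SparkSession

// Session with Hive support, as used on our HDP 2.6.5 / Spark 2.3.0 cluster
val spark = SparkSession.builder()
  .appName("HiveQueriesFromSpark")
  .enableHiveSupport()
  .getOrCreate()

// The property we are unsure about: is this needed at all when running from Spark?
spark.sql("SET hive.execution.engine=spark")

// Example of one of the queries against a Hive table (placeholder names)
spark.sql("SELECT some_col, count(*) FROM some_db.some_table GROUP BY some_col").show()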
I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, as in the example below, I get the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
  .createOrReplaceTempView("some_sql_view")

spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("`sum(count)` > 10").count()
Most of the fixes that I have seen in relation to this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
Yes, we can run Spark SQL queries on Spark without installing Hive. By default Hive uses MapReduce as its execution engine, and we can configure Hive to use Spark or Tez as the execution engine to run queries much faster. Hive on Spark uses the Hive metastore to run Hive queries. At the same time, SQL queries can be executed directly through Spark. If Spark is used to execute simple SQL queries and is not connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db is created under the home folder of the user who executes the query.
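As a rough local illustration (names and the local master are assumptions), Spark SQL works without any Hive installation; and if the session is instead built with enableHiveSupport() but nothing points it to a metastore service, Spark falls back to the embedded Derby metastore_db described above:

import org.apache.spark.sql.SparkSession

// Plain Spark SQL, no Hive installation required.
val spark = SparkSession.builder()
  .appName("SqlWithoutHive")
  .master("local[*]")   // assumption: local run, purely for illustration
  .getOrCreate()

val df = spark.range(1, 100).toDF("n")
df.createOrReplaceTempView("numbers")
spark.sql("SELECT count(*) FROM numbers WHERE n % 2 = 0").show()

// If the session were built with .enableHiveSupport() and no external metastore
// were configured, Spark would use an embedded Derby database and a
// metastore_db folder would appear locally, as described above.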
I need to understand how a Hive query gets executed in a Spark cluster. Will it operate as a MapReduce job running in memory, or will it use the Spark architecture for running the Hive queries? Please clarify.
If you run Hive queries in Hive or Beeline, they will use Map-Reduce; but if you run Hive queries from the Spark REPL or a Spark program, the queries simply get converted into DataFrames, the logical and physical plans are created the same way as for a DataFrame, and they are executed by Spark. Hence they use the full power of Spark.
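A quick way to see this for yourself, assuming a temp view or Hive table is already registered (the view and column names below are placeholders), is to print the plan Spark builds for the SQL text:

// The SQL is parsed into the same logical/physical plan a DataFrame operation
// would produce, and it is executed by Spark, not by Map-Reduce.
spark.sql("SELECT dest, sum(cnt) FROM some_view GROUP BY dest").explain(true)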
Assuming that you have a Hadoop cluster with YARN and Spark configured:
The Hive execution engine is controlled by the hive.execution.engine property. According to the docs it can be mr (the default), tez, or spark.
I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions at https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help; I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is neither suggested nor supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try bringing your own Spark, compiled as described in the wiki, in an initialization action.
If you just want to move Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.
Can someone spell out the differences between using the Spark SQL CLI vs. the Thrift Server with Beeline to query/modify data in Hive? The Spark SQL documentation mentions both of them, but when would you use one or the other, or are they functionally equivalent alternatives?
For clarification:
spark-sql is a program that runs a single instance of Spark; you interact with it as if it were a MySQL-like shell prompt, and it makes use of spark-warehouse and those types of features.
Spark with the Thrift Server is an application that exposes a connection to a running instance of Spark over JDBC.
https://community.hortonworks.com/questions/33715/why-do-we-need-to-setup-spark-thrift-server.html
Beeline is a query/consumer tool that one uses to connect to a running HiveServer2 (hive2) JDBC endpoint (and thus, in the Spark documentation, Beeline is used to test that the JDBC connection is in fact working). Note: query/connector programs like SQL Workbench can be made to connect to Spark with the Thrift Server if the proper Hive2 JDBC drivers and jars are imported.
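To make the JDBC point concrete, here is a rough sketch of a plain JDBC client talking to the Spark Thrift Server (host, port, user, and table name are assumptions; the Thrift Server listens on port 10000 by default, and the Hive JDBC driver must be on the classpath). Beeline does essentially the same thing from the command line:

import java.sql.DriverManager

// Register the Hive JDBC driver (from the org.apache.hive:hive-jdbc artifact).
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Assumed host/port; adjust for your Thrift Server deployment.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT * FROM some_table LIMIT 5")  // placeholder table name
while (rs.next()) {
  println(rs.getString(1))
}
rs.close(); stmt.close(); conn.close()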