Apache spark installation and db_metastore - apache-spark

I am beginner in Spark.
I installed java and spark-1.6.1-bin-hadoop2.6.tgz(I have not installed Hadoop) and with out changing any configuration in conf directory ran spark-shell.
In the director where spark is installed , I see another metastore_db created with tmp folder inside it.
why is this metastore_db is created , where is this configured ?
Also I see sqlContext being created after running spark-shell, what does this sqlContext represent?

When running spark-shell, a SparkContext and SQLContext are created. SQLContext is an extension of SparkContext to enable support of Spark SQL. It has method to execute sql queries (method sql) and to create DataFrames.
db_metastore is a Hive metastore path. Spark support Apache Hive queries via HiveContext. If there is no hive-site.xml configured, Spark will use db_metastore path, see documentation for details.
However, it would be good if you will download Spark 2.0. There you've got unified entry point to Spark, named SparkSession. This class allows you to read data from many sources, create Datasets, etc.

Related

Apache Spark: how can I understand and control if my query is executed on Hive engine or on Spark engine?

I am running local instance of spark 2.4.0
I want to execute an SQL query vs Hive
Before, with Spark 1.x.x., I was using HiveContext for this:
import org.apache.spark.sql.hive.HiveContext
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hivequery = hc.sql(“show databases”)
But now I see that HiveContext is deprecated: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/hive/HiveContext.html. Inside HiveContext.sql() code I see that it is now simply a wrapper over SparkSession.sql(). The recomendation is to use enableHiveSupport in SparkSession builder, but as this question clarifies this is only about metastore and list of tables, this is not changing execution engine.
So the questions are:
how can I understand if my query is running on Hive engine or on Spark engine?
how can I control this?
From my understanding there is no Hive Engine to run your query. You submit a query to Hive and Hive would execute it on an engine :
Spark
Tez(based on MapReduce)
MapReduce (commnly Hadoop)
If you use Spark, your query will be executed by Spark using SparkSQL (starting with Spark v1.5.x, if I recall correctly)
How is configured the Hive Engine depends on configuration and I remember seeing Hive on Spark configuration on Cloudera distribution.
So Hive would use Spark to execute the job matching you query (instead of MapReduce or Tez) but Hive would parse, analyze it.
Using local Spark instance, you will only use Spark engine (SparkSQL / Catalyst), but you can use it with Hive Support. It means, you would be able to read an existing Hive metastore and interact with it.
It requires a Spark installation with Hive support : Hive dependencies and hive-site.xml in your classpath

Can Spark-sql work without a hive installation?

I have installed spark 2.4.0 on a clean ubuntu instance. Spark dataframes work fine but when I try to use spark.sql against a dataframe such as in the example below,i am getting an error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
.createOrReplaceTempView("some_sql_view")
spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("sum(count) > 10").count()
Most of the fixes that I have see in relation to this error refer to environments where hive is installed. Is hive required if I want to use sql statements against dataframes in spark or am i missing something else?
To follow up with my fix. The problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default metastore_db started working.
Yes, we can run spark sql queries on spark without installing hive, by default hive uses mapred as an execution engine, we can configure hive to use spark or tez as an execution engine to execute our queries much faster. Hive on spark hive uses hive metastore to run hive queries. At the same time, sql queries can be executed through spark. If spark is used to execute simple sql queries or not connected with hive metastore server, its uses embedded derby database and a new folder with name metastore_db will be created under the user home folder who executes the query.

How to concurrently use two DataFrames in different notebooks? [duplicate]

Having two separate pyspark applications that instantiate a HiveContext in place of a SQLContext lets one of the two applications fail with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to make use of some Dataframe functions, that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is enough to reproduce:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)
data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows when reading the parquet-file. When I replace HiveContext with SQLContext everything's fine.
Does anyone know why that is?
By default Hive(Context) is using embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications you should configure a standalone metastore. At this moment Hive supports PostgreSQL, MySQL, Oracle and MySQL. Details of configuration depend on a backend and option (local / remote) but generally speaking you'll need:
a running RDBMS server
a metastore database created using provided scripts
a proper Hive configuration
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
Theoretically it should be also possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.
For development you can start applications in different working directories. This will create separate metastore_db for each application and avoid the issue of multiple active users. Providing separate Hive configuration should work as well but is less useful in development:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory

Multiple Spark applications with HiveContext

Having two separate pyspark applications that instantiate a HiveContext in place of a SQLContext lets one of the two applications fail with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to make use of some Dataframe functions, that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is enough to reproduce:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)
data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows when reading the parquet-file. When I replace HiveContext with SQLContext everything's fine.
Does anyone know why that is?
By default Hive(Context) is using embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications you should configure a standalone metastore. At this moment Hive supports PostgreSQL, MySQL, Oracle and MySQL. Details of configuration depend on a backend and option (local / remote) but generally speaking you'll need:
a running RDBMS server
a metastore database created using provided scripts
a proper Hive configuration
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
Theoretically it should be also possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.
For development you can start applications in different working directories. This will create separate metastore_db for each application and avoid the issue of multiple active users. Providing separate Hive configuration should work as well but is less useful in development:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory

Can't load a Hive table through Spark

I am new to Spark and needed help in figuring out why my Hive databases are not accessible to perform a data load through Spark.
Background:
I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x, on a VirtualBox.
I have downloaded pre-built Spark 1.3.1.
I am using the Hive bundled with the VM and can run hive queries through Spark-shell and Hive cmd line without any issue. This includes running the command:
LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21');
Problem:
I am writing a Java program to read data from Cassandra and load it into Hive. I have saved the results of the Cassandra read in parquet format in a folder called 'result.parquet'.
Now I would like to load this into Hive. For this, I
Copied the Hive-site.xml to the Spark conf folder.
I made a change to this xml. I noticed that I had two hive-site.xml - one which was auto generated and another which had Hive execution parameters. I combined both into a single hive-site.xml.
Code used (Java):
HiveContext hiveContext = new
HiveContext(JavaSparkContext.toSparkContext(sc));
hiveContext.sql("show databases").show();
hiveContext.sql("LOAD DATA INPATH
'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/'
INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21')").show();
So, this worked. And I could load data into Hive. Except, after I restarted my VM, it has stopped working.
When I run the show databases Hive query, I get a result saying
result
default
instead of the databases in Hive, which are
default
test_spark
I also notice a folder called metastore_db being created in my Project Folder. From googling around, I know this happens when Spark can't connect to the Hive metastore, so it creates one of its own.I thought I had fixed that, but clearly not.
What am I missing?

Resources