How to concurrently use two DataFrames in different notebooks? [duplicate] - apache-spark

When two separate pyspark applications each instantiate a HiveContext instead of a SQLContext, one of the two applications fails with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to use some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is enough to reproduce:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)
data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error appears when reading the Parquet file. When I replace HiveContext with SQLContext, everything is fine.
Does anyone know why that is?

By default Hive(Context) uses embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications you should configure a standalone metastore. At this moment Hive supports PostgreSQL, MySQL, Oracle and MS SQL Server. Details of the configuration depend on the backend and the option chosen (local / remote), but generally speaking you'll need:
a running RDBMS server
a metastore database created using provided scripts
a proper Hive configuration
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
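As a rough illustration of the "proper Hive configuration" item, here is a minimal sketch that generates a hive-site.xml pointing the metastore at an external PostgreSQL database. The four javax.jdo.option.* property names are standard Hive settings; the host, database name, user and password are hypothetical placeholders, and a real setup typically needs further properties described in the guide above.

```python
import xml.etree.ElementTree as ET


def build_hive_site(jdbc_url, driver, user, password):
    """Return hive-site.xml content that points the metastore at an external RDBMS."""
    properties = {
        "javax.jdo.option.ConnectionURL": jdbc_url,
        "javax.jdo.option.ConnectionDriverName": driver,
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical connection details, for illustration only.
xml_text = build_hive_site(
    jdbc_url="jdbc:postgresql://localhost:5432/metastore",
    driver="org.postgresql.Driver",
    user="hive",
    password="change-me",
)
```

The resulting text would be saved as hive-site.xml in Spark's conf directory, and the matching JDBC driver jar has to be on the classpath.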
Theoretically it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.
For development you can start applications in different working directories. This will create a separate metastore_db for each application and avoid the issue of multiple active users. Providing a separate Hive configuration for each application should work as well but is less useful in development:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory
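The working-directory approach can be sketched with a small launcher that starts each application in its own freshly created directory, so every JVM creates its own metastore_db. This is only a sketch: the helper name launch_in_private_dir is made up for illustration, it assumes spark-submit is on the PATH, and the script path in the usage comment is hypothetical.

```python
import subprocess
import tempfile


def launch_in_private_dir(command, prefix="spark_app_"):
    """Start `command` in a new, empty working directory.

    Returns the running process and the directory it was started in.
    """
    workdir = tempfile.mkdtemp(prefix=prefix)
    process = subprocess.Popen(command, cwd=workdir)
    return process, workdir


# Usage (hypothetical script path; use an absolute path, because the
# working directory of the child process differs from the caller's):
#   first, first_dir = launch_in_private_dir(["spark-submit", "/path/to/app.py"])
#   second, second_dir = launch_in_private_dir(["spark-submit", "/path/to/app.py"])
```

The working directory is set on the launched process rather than changed from inside the script, since the embedded Derby database is resolved relative to the JVM's working directory.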

Related

Can Spark-sql work without a hive installation?

I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, as in the example below, I get the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
.createOrReplaceTempView("some_sql_view")
spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("sum(count) > 10").count()
Most of the fixes that I have seen for this error refer to environments where Hive is installed. Is Hive required if I want to run SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
Yes, you can run Spark SQL queries without installing Hive. By default Hive uses MapReduce (mapred) as its execution engine; it can be configured to use Spark or Tez instead, which runs queries much faster. Hive on Spark uses the Hive metastore to run Hive queries. Independently of that, SQL queries can be executed through Spark itself. If Spark is used only for simple SQL queries, or is not connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db is created in the working directory from which the query is run.

Why does spark-shell not start SQL context?

I use spark-2.0.2-bin-hadoop2.7 and am setting up a Spark environment. I have completed most of the steps to install and configure, but finally, I found something different from the online tutorials.
The logs are missing the line:
SQL context available as sqlContext.
When I run spark-shell, it just starts the Spark context. Why is the SQL context not started?
Under normal circumstances, should both of the following lines be printed at startup?
Spark context available as sc
SQL context available as sqlContext.
From Spark 2.0 onwards SparkSession is used instead (as SQL Context/sqlContext was "renamed" to SparkSession/spark).
When you run spark-shell, you will get a reference to this spark session as spark. You should see the following:
Spark session available as 'spark'.
If you want to access the underlying SQL context you could do the following:
spark.sqlContext
Please don't, since it's no longer required and most operations can be executed without it.

Zeppelin - Unable to instantiate SessionHiveMetaStoreClient

I am trying to get Zeppelin to work. But when I run a notebook twice, the second time it fails due to Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient. (full log at the end of the post)
It seems to happen because the lock in the metastore doesn't get removed. It is also advised to use, for example, Postgres as the metastore database instead of the Hive default, since it allows multiple users to run jobs in Zeppelin.
I created a Postgres DB and a hive-site.xml pointing to this DB. I added this file to the config folder of Zeppelin and also to the config folder of Spark. In the jdbc interpreter of Zeppelin I also added parameters similar to the ones in the hive-site.xml.
The problem persists though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4
Try using the Thrift server architecture in your Spark setup instead of a single-JVM Hive instance, which cannot serve multiple sessions.
There are mainly three types of connection to Hive:
Single JVM - Metastore stored locally in the warehouse, which doesn't allow multiple sessions
Multiple JVM - where each worker behaves as a metastore
Thrift Server Architecture - Multiple users can access the SQL engine and parallelism can be achieved
Another instance of Derby may have already booted the database
By default, Spark uses Derby as the metadata store, which can serve only one user. It seems you started multiple Spark interpreters, which is why you see the above error message. Here are two solutions:
Disable Hive in the Spark interpreter by setting zeppelin.spark.useHiveContext to false if you don't need Hive.
Set up a Hive metastore that supports multiple users. See https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
Stop Zeppelin. Go to your bin folder in Apache Zeppelin and try deleting metastore_db
sudo rm -r metastore_db/
Start Zeppelin again and try now.
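Before deleting the whole metastore_db, it can help to check whether Derby lock files are actually present. The sketch below assumes the usual embedded-Derby lock file names (db.lck and dbex.lck); the helper name find_derby_locks is made up for illustration. A lock file can be left behind by a crashed process, so its presence is only a hint, not proof that another instance is still running.

```python
import os

# Lock files that embedded Derby typically creates inside the database directory.
DERBY_LOCK_FILES = ("db.lck", "dbex.lck")


def find_derby_locks(metastore_dir):
    """Return the paths of Derby lock files currently present in metastore_dir."""
    return [
        os.path.join(metastore_dir, name)
        for name in DERBY_LOCK_FILES
        if os.path.isfile(os.path.join(metastore_dir, name))
    ]


# Usage (run from the directory that contains metastore_db):
#   locks = find_derby_locks("metastore_db")
#   if locks:
#       print("metastore_db may be in use or was not shut down cleanly:", locks)
```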

Apache Spark installation and metastore_db

I am a beginner in Spark.
I installed Java and spark-1.6.1-bin-hadoop2.6.tgz (I have not installed Hadoop) and, without changing any configuration in the conf directory, ran spark-shell.
In the directory where Spark is installed, I see that a metastore_db directory has been created, with a tmp folder inside it.
Why is this metastore_db created, and where is this configured?
Also I see sqlContext being created after running spark-shell, what does this sqlContext represent?
When running spark-shell, a SparkContext and a SQLContext are created. SQLContext builds on SparkContext to enable support for Spark SQL. It has methods to execute SQL queries (the sql method) and to create DataFrames.
metastore_db is the Hive metastore path. Spark supports Apache Hive queries via HiveContext. If no hive-site.xml is configured, Spark uses the metastore_db path; see the documentation for details.
However, it would be a good idea to download Spark 2.0. There you get a unified entry point to Spark, named SparkSession. This class allows you to read data from many sources, create Datasets, etc.
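To see why metastore_db ends up in the directory where spark-shell was started: when no hive-site.xml overrides it, Hive's javax.jdo.option.ConnectionURL defaults to an embedded Derby URL with a relative database name, which is resolved against the JVM's working directory. The sketch below only mimics that resolution for illustration. The helper name derby_database_dir is hypothetical, and it handles only URLs that use the databaseName attribute.

```python
import os

# Hive's default metastore connection URL when nothing else is configured.
DEFAULT_CONNECTION_URL = "jdbc:derby:;databaseName=metastore_db;create=true"


def derby_database_dir(connection_url, working_dir):
    """Resolve the database directory of an embedded Derby URL against working_dir."""
    prefix = "jdbc:derby:"
    if not connection_url.startswith(prefix):
        raise ValueError("not an embedded Derby JDBC URL")
    # Attributes follow the prefix as ';'-separated key=value pairs.
    attributes = dict(
        part.split("=", 1)
        for part in connection_url[len(prefix):].split(";")
        if "=" in part
    )
    # A relative databaseName is joined to working_dir; an absolute one is kept.
    return os.path.normpath(os.path.join(working_dir, attributes["databaseName"]))


# Usage (hypothetical install directory):
#   derby_database_dir(DEFAULT_CONNECTION_URL, "/opt/spark")
```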
