I created a spark-submit job in Databricks to run a .py script. In the script I create a SparkSession and try to access existing Hive tables, but the script fails with a "Table or view not found" error. Should I add some configuration settings to my spark-submit job to connect to the existing Hive metastore?
Please try creating the Spark session like below (Spark 2.0+):
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
This usually resolves this kind of error.
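For example, a minimal sketch of the submitted .py script (my_db.my_table is a placeholder for an existing Hive table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("my_db.my_table")  # resolved against the Hive metastore
df.show(10)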
Today we have this scenario:
Cluster: Azure HDInsight 4.0
Running workflows on Oozie
On this version, Spark and Hive no longer share metadata.
We came from HDInsight 3.6; to cope with this change, we now use the Hive Warehouse Connector.
Before: df.write.saveAsTable("tableName", mode="overwrite")
Now: df.write.mode("overwrite").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option('table', "tableName").save()
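For context, the full HWC write path looks roughly like this (a sketch assuming the pyspark_llap package shipped with the Hive Warehouse Connector; df and "tableName" are placeholders):

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

df = spark.createDataFrame([(1, "a")], ["id", "value"])  # placeholder data
df.write.mode("overwrite") \
    .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .option("table", "tableName") \
    .save()
existing = hive.executeQuery("SELECT * FROM tableName")  # reads also go through HWC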
The problem is at this point: using HWC makes it possible to save tables to Hive.
But the Hive databases/tables are not visible to Spark, Oozie and Jupyter; they only see tables in the Spark scope.
So this is a major problem for us, because it is not possible to get data from Hive managed tables and use it in an Oozie workflow.
To be able to save a table to Hive and have it visible across the whole cluster, I made these configurations in Ambari:
hive > hive.strict.managed.tables = false
spark2 > metastore.catalog.default = hive
And now it is possible to save a table to Hive the "old" way, with df.write.saveAsTable.
But there is a problem when the table is updated/overwritten:
pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.security.AccessControlException: Permission denied: user=hive,
path="wasbs://containerName#storageName.blob.core.windows.net/hive/warehouse/managed/table"
:user:supergroup:drwxr-xr-x);'
So, I have two questions:
Is this the correct way to save a table to Hive so that it is visible across the whole cluster?
How can I avoid this permission error on table overwrite? Keep in mind, this error occurs when we execute the Oozie workflow.
Thanks!
Does Databricks support submitting a SparkSQL job similar to Google Cloud Dataproc?
The Databricks Jobs API doesn't seem to have an option for submitting a Spark SQL job.
Reference:
https://docs.databricks.com/dev-tools/api/latest/jobs.html
https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.jobs
No, you submit a notebook.
That notebook can be many things: Python, a Spark script, or Spark SQL via %sql.
You can submit a Spark job to a Databricks cluster just like on Dataproc.
Run your Spark job in a Scala context and create a jar for it. Submitting Spark SQL directly is not supported.
To create a job follow the official guide https://docs.databricks.com/jobs.html
Also, to trigger the job via the REST API you can use the run-now request described at https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-submit
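For example, a sketch of triggering an existing job with the Jobs API run-now endpoint (the workspace URL, token and job ID below are placeholders):

import requests

resp = requests.post(
    "https://<databricks-instance>/api/2.0/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder token
    json={"job_id": 123},  # placeholder job ID
)
print(resp.json())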
I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata.
I create Hive external tables and they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL like spark.sql("select * from hive_table ...").
Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It looks like Glue jobs are not using the Glue catalog for Spark SQL the same way that Spark SQL running in EMR does.
I can work around this by using Glue APIs and registering dataframes as temp views:
create_dynamic_frame_from_catalog(...).toDF().createOrReplaceTempView(...)
but is there a way to do this automatically?
This was a much-awaited feature request (to use the Glue Data Catalog with Glue ETL jobs), which has been released recently.
When you create a new job, you'll find the following option
Use Glue data catalog as the Hive metastore
You may also enable it for an existing job by editing the job and adding --enable-glue-datacatalog to the job parameters, providing no value.
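For example, a sketch of setting this flag when creating a job with boto3 (the job name, role ARN and script location are placeholders):

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="my-etl-job",  # placeholder name
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",  # placeholder role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},  # placeholder script
    DefaultArguments={"--enable-glue-datacatalog": ""},  # no value, as noted above
)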
Instead of using SparkContext.getOrCreate(), you should use SparkSession.builder().enableHiveSupport().getOrCreate(), with enableHiveSupport() being the important part that's missing. I think what's probably happening is that your Spark job is not actually creating your tables in Glue but rather is creating them in Spark's embedded Hive metastore, since you have not enabled Hive support.
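For example, a minimal sketch of a Glue ETL script using this approach (my_database.hive_table is a placeholder, and it assumes the job is configured to use the Glue Data Catalog as its metastore):

from awsglue.context import GlueContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
glueContext = GlueContext(spark.sparkContext)  # still available for dynamic frames if needed

spark.sql("SELECT * FROM my_database.hive_table LIMIT 10").show()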
Had the same problem. It was working on my dev endpoint but not in the actual ETL job. It was fixed by editing the job from Spark 2.2 to Spark 2.4.
In my case:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate())
I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, as in the example below, I get the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json") \
    .createOrReplaceTempView("some_sql_view")

spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
    FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
    """) \
    .where("DEST_COUNTRY_NAME like 'S%'") \
    .where("`sum(count)` > 10") \
    .count()
Most of the fixes that I have seen in relation to this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
Yes, we can run Spark SQL queries in Spark without installing Hive. By default Hive uses MapReduce as its execution engine; we can configure Hive to use Spark or Tez as the execution engine to run our queries much faster. Hive on Spark uses the Hive metastore to run Hive queries. At the same time, SQL queries can be executed through Spark. If Spark is used to execute simple SQL queries, or is not connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db will be created under the home folder of the user who executes the query.
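For illustration, a minimal sketch with no Hive installation (the sample data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("no-hive-demo").getOrCreate()
spark.createDataFrame([("US", 12), ("SE", 3)], ["DEST_COUNTRY_NAME", "count"]) \
    .createOrReplaceTempView("some_sql_view")
# With no external metastore configured, Spark may fall back to the embedded
# Derby metastore_db described above.
spark.sql("SELECT DEST_COUNTRY_NAME, sum(count) FROM some_sql_view GROUP BY DEST_COUNTRY_NAME").show()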
I was trying to access HDFS files from a spark cluster which is running inside Kubernetes containers.
However I keep on getting the error:
AnalysisException: 'The ORC data source must be used with Hive support enabled;'
What am I missing here?
Do you have a SparkSession created with enableHiveSupport()?
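For reference, a minimal sketch of that fix (the HDFS path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.orc("hdfs:///path/to/orc/data")  # placeholder path
df.show(10)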
Similar issue:
Spark can access Hive table from pyspark but not from spark-submit