Hello, I created a Spark HDInsight cluster on Azure and I'm trying to read Hive tables with PySpark, but the problem is that it shows me only the default database.
Does anyone have an idea?
If you are using HDInsight 4.0, Spark and Hive no longer share metadata.
By default you will not see Hive tables from PySpark. It is a problem I describe in this post: How save/update table in hive, to be readbale on spark.
Anyway, here are some things you can try:
If you only want to test on the head node, edit hive-site.xml and change the value of the property "metastore.catalog.default" to hive, then open pyspark from the command line.
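For the head-node test, the property change can also be scripted. The following is a minimal sketch, not an official tool: the function name set_hive_site_property is made up for illustration, and the path in the usage note is an assumption about where Spark's hive-site.xml lives on your node. ElementTree drops XML comments when it rewrites the file, so back up the original first.

```python
import xml.etree.ElementTree as ET


def set_hive_site_property(path, name, value):
    """Set (or add) one <property> in a Hadoop-style XML config file."""
    tree = ET.parse(path)
    root = tree.getroot()  # expected to be the <configuration> element
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not present yet: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

A call would look like set_hive_site_property("/etc/spark2/conf/hive-site.xml", "metastore.catalog.default", "hive"), where the path is a guess that you should check against your own cluster. Start a new pyspark session afterwards so the change is picked up.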
If you want to apply the change to all cluster nodes, make it in Ambari:
Log in to Ambari as admin
Go to spark2 > Configs > hive-site-override
Again, set the property "metastore.catalog.default" to hive
Restart all required services from the Ambari panel
These changes make the Hive metastore catalog the default.
You can now see Hive databases and tables, but depending on the table structure, you may not see the table data properly.
If you have created tables in other databases, try show tables from database_name, replacing database_name with the actual name.
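From PySpark, the same check can be run through spark.sql. This is a small sketch that assumes an existing SparkSession is passed in as spark; the helper names show_tables_statement and list_tables are illustrative and not part of any library.

```python
def show_tables_statement(database_name):
    """Build the SHOW TABLES statement for one database."""
    # Backticks keep unusual database names from being misparsed.
    return "SHOW TABLES FROM `{}`".format(database_name)


def list_tables(spark, database_name):
    """Return the table names Spark can see in the given database."""
    rows = spark.sql(show_tables_statement(database_name)).collect()
    return [row.tableName for row in rows]
```

If list_tables(spark, "database_name") comes back empty for a database that has tables in Hive, Spark is most likely still reading its own catalog rather than the Hive one.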
Your SparkSession is missing the Hive metastore details. If you haven't supplied any, Spark creates and uses its own default database to run Spark SQL.
If you've already set spark.sql.warehouse.dir and spark.hadoop.hive.metastore.uris in the Spark default conf file, add enableHiveSupport() when creating the SparkSession.
Otherwise, add the configuration details when creating the SparkSession:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate())
Today we have this scenario:
Cluster: Azure HDInsight 4.0
Running workflows on Oozie
In this version, Spark and Hive no longer share metadata
We came from HDInsight 3.6; to work with this change, we now use the Hive Warehouse Connector
Before: df.write.saveAsTable("tableName", mode="overwrite")
Now: df.write.mode("overwrite").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option('table', "tableName").save()
The problem is at this point: using HWC makes it possible to save tables in Hive.
But Hive databases/tables are not visible to Spark, Oozie and Jupyter; they see only tables in the Spark scope.
This is a major problem for us, because it is not possible to get data from managed Hive tables and use it in an Oozie workflow.
To make it possible to save a table in Hive and have it visible across the whole cluster, I made these configuration changes in Ambari:
hive > hive.strict.managed.tables = false
spark2 > metastore.catalog.default = hive
Now it is possible to save a table in Hive the "old" way, with df.write.saveAsTable.
But there is a problem when a table is updated/overwritten:
pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.security.AccessControlException: Permission denied: user=hive,
path="wasbs://containerName#storageName.blob.core.windows.net/hive/warehouse/managed/table"
:user:supergroup:drwxr-xr-x);'
So, I have two questions:
Is this the correct way to save a table in Hive so that it is visible across the whole cluster?
How can I avoid this permission error on table overwrite? Keep in mind that this error occurs when we execute an Oozie workflow.
Thanks!
We installed our cluster via CDH 6.2.
I use PySpark to create a DataFrame, then save it to Hive.
The file is created in the warehouse correctly, but the table cannot be found in Hive or Impala using show tables.
It can be found from Spark SQL using spark.sql('show tables'), but that only shows tables created earlier by Spark code, which means Spark cannot see tables created via the Hive or Impala console.
So I think the Spark code may not be connecting to the Hive Metastore server.
But I don't know how to set it up to use the Hive Metastore server.
To connect to the Hive metastore, you need to make hive-site.xml available in the spark/conf directory, either by copying it or by symlinking it. Try the following:
ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
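The same link can be created from Python with a guard against overwriting an existing file. This is only a sketch: link_hive_site is an illustrative name, and the two directories are parameters because their real locations depend on the distribution.

```python
import os


def link_hive_site(hive_conf_dir, spark_conf_dir):
    """Symlink Hive's hive-site.xml into Spark's conf directory.

    Returns True if a link was created, False if the target already existed.
    """
    source = os.path.join(hive_conf_dir, "hive-site.xml")
    target = os.path.join(spark_conf_dir, "hive-site.xml")
    if not os.path.isfile(source):
        raise FileNotFoundError(source)
    if os.path.lexists(target):
        # Leave an existing file or link alone rather than overwrite it.
        return False
    os.symlink(source, target)
    return True
```

With the paths from the command above, the call would be link_hive_site("/usr/lib/hive/conf", "/usr/lib/spark/conf"). A False result means Spark already has its own hive-site.xml, which is worth inspecting before replacing.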
I use HDP 3.1 and I added Spark2, Hive and the other services that are needed. I turned off the ACID feature in Hive. The Spark job can't find the table in Hive, although the table exists there. The exception looks like:
org.apache.spark.sql.AnalysisException: Table or view not found
There is a hive-site.xml in Spark's conf folder. It is created automatically by HDP, but it isn't the same as the file in Hive's conf folder. From the log, Spark gets the thrift URI of Hive correctly.
I used Spark SQL to create a Hive table in spark-shell. I found that the table was created in the folder specified by spark.sql.warehouse.dir. I changed its value to the value of hive.metastore.warehouse.dir, but the problem is still there.
I also enabled Hive support when creating the Spark session.
val ss = SparkSession.builder().appName("统计").enableHiveSupport().getOrCreate()
The hive-site.xml in Spark's conf folder contains metastore.catalog.default, and its value is spark. It should be changed to hive. Also, the ACID feature of Hive should be disabled.
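To see exactly where Spark's copy of hive-site.xml differs from Hive's, including metastore.catalog.default, the two files can be compared property by property. This is a rough sketch with illustrative function names; it assumes both files use the usual configuration/property/name/value layout.

```python
import xml.etree.ElementTree as ET


def load_properties(path):
    """Return {name: value} for every <property> in a Hadoop-style XML file."""
    props = {}
    for prop in ET.parse(path).getroot().findall("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(spark_site, hive_site):
    """Map each differing property to (spark value, hive value); None means absent."""
    spark_props = load_properties(spark_site)
    hive_props = load_properties(hive_site)
    return {
        name: (spark_props.get(name), hive_props.get(name))
        for name in sorted(set(spark_props) | set(hive_props))
        if spark_props.get(name) != hive_props.get(name)
    }
```

Pass the paths of the two hive-site.xml files as arguments. An entry such as "metastore.catalog.default": ("spark", None) would show that Spark's copy sets the property and Hive's copy does not.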
You can use the Hive Warehouse Connector and enable LLAP in the Hive configuration.
In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms.
By default, Spark reads only the Spark catalog. This means Spark applications that attempt to read from or write to tables created with the Hive CLI will fail with a table not found exception.
Workaround:
Create the table in Hive CLI and Spark SQL
Hive Warehouse Connector
Where is the table created when I use SQL syntax for create table in Spark 2.*?
CREATE TABLE ... AS
Does Spark store it internally somewhere, or does there always have to be Impala or Hive support for Spark tables to work?
By default (when no connection is specified, e.g. in hive-site.xml), Spark saves it in a local metastore backed by a Derby database.
The data itself should be stored in the spark-warehouse directory in your Spark directory, or in whatever location is defined by the spark.sql.warehouse.dir property.
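To check which of these applies to a running session, the relevant settings can be read back from the session itself. A minimal sketch, assuming spark is an existing SparkSession; describe_catalog is an illustrative helper name, and the "in-memory" fallback is only used if the setting cannot be read.

```python
def describe_catalog(spark):
    """Collect the settings that decide where Spark keeps table metadata and data."""
    return {
        # "hive" when Hive support is enabled, otherwise "in-memory".
        "catalog": spark.conf.get("spark.sql.catalogImplementation", "in-memory"),
        # Directory where managed table data is written.
        "warehouse_dir": spark.conf.get("spark.sql.warehouse.dir"),
    }
```

Calling describe_catalog(spark) in pyspark returns both values as a dictionary, which makes it easy to compare what the session actually uses against what hive-site.xml was expected to provide.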
I am new to Spark and need help figuring out why my Hive databases are not accessible when I try to perform a data load through Spark.
Background:
I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x, on VirtualBox.
I have downloaded pre-built Spark 1.3.1.
I am using the Hive bundled with the VM and can run Hive queries through spark-shell and the Hive command line without any issue. This includes running the command:
LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21');
Problem:
I am writing a Java program to read data from Cassandra and load it into Hive. I have saved the results of the Cassandra read in parquet format in a folder called 'result.parquet'.
Now I would like to load this into Hive. For this, I:
Copied hive-site.xml to the Spark conf folder.
Made a change to this XML file. I noticed that I had two hive-site.xml files, one that was auto-generated and another that had Hive execution parameters, so I combined both into a single hive-site.xml.
Code used (Java):
HiveContext hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc));
hiveContext.sql("show databases").show();
hiveContext.sql("LOAD DATA INPATH "
    + "'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' "
    + "INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21')").show();
This worked, and I could load data into Hive. However, after I restarted my VM, it stopped working.
When I run the show databases Hive query, I get a result saying
result
default
instead of the databases in Hive, which are
default
test_spark
I also notice a folder called metastore_db being created in my project folder. From googling around, I know this happens when Spark can't connect to the Hive metastore, so it creates one of its own. I thought I had fixed that, but clearly not.
What am I missing?
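The metastore_db symptom described above can be checked for automatically before and after a run. This is a small sketch with an illustrative function name; it assumes the embedded Derby metastore leaves a metastore_db folder, and usually a derby.log file, in the directory the job was started from.

```python
import os


def find_local_metastore(directory):
    """Return paths of Derby artifacts that suggest Spark started its own local metastore."""
    candidates = ["metastore_db", "derby.log"]
    return [
        os.path.join(directory, name)
        for name in candidates
        if os.path.exists(os.path.join(directory, name))
    ]
```

Running find_local_metastore(os.getcwd()) from the project folder and getting a non-empty list suggests that the hive-site.xml Spark loaded did not point it at the real Hive metastore.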