Hive TBLPROPERTIES from Spark - apache-spark

I'm not seeing anything in the documentation, but is there a way to query the Hive TBLPROPERTIES for a table from Spark using a HiveContext or Hive-backed DataFrame?

AFAIK you cannot access the HiveMetastoreClient that Spark uses inside its HiveSession.
But, you can just instantiate another one -- hopefully the CLASSPATH is OK and contains both the Hive JARs and the directories containing Hadoop/Hive config files, and you don't have Kerberos authentication (or you benefit from the implicit Hadoop UGI of the Spark driver, that handles Kerberos automagically); so it's just a matter of new HiveMetaStoreClient(new HiveConf())
Then .getTable(...).getParameters() should get you the TBLPROPERTIES you want, in a Java Map.
https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.html

Related

Is it possible to use Spark with ORC file format without Hive?

I am working with HDP 2.6.4, to be more specific Hive 1.2.1 with TEZ 0.7.0 , Spark 2.2.0.
My task is simple. Store data in ORC file format then use Spark to process the data. To achieve this, I am doing this:
Create a Hive table through HiveQL
Use Spark.SQL("select ... from ...") to load data into dataframe
Process against the dataframe
My questions are:
1. What is Hive's role behind the scene?
2. Is it possible to skip Hive?
You can skip Hive and use SparkSQL to run the command in step 1
In your case, Hive is defining a schema over your data and providing you a query layer for Spark and external clients to communicate
Otherwise, spark.orc exists for reading and writing of dataframes directly on the filesystem

Does SparkSession always use Hive Context?

I can use SparkSession to get the list of tables in Hive, or access a Hive table as shown in the code below. Now my question is if in this case, I'm using Spark with Hive Context?
Or is it that to use hive context in Spark, I must directly use HiveContext object to access tables, and perform other Hive related functions?
spark.catalog.listTables.show
val personnelTable = spark.catalog.getTable("personnel")
I can use SparkSession to get the list of tables in Hive, or access a Hive table as shown in the code below.
Yes, you can!
Now my question is if in this case, I'm using Spark with Hive Context?
It depends on how you created the spark value.
SparkSession has the Builder interface that comes with enableHiveSupport method.
enableHiveSupport(): Builder Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
If you used that method, you've got Hive support. If not, well, you don't have it.
You may think that spark.catalog is somehow related to Hive. Well, it was meant to offer Hive support, but by default the catalog is in-memory.
catalog: Catalog Interface through which the user may create, drop, alter or query underlying databases, tables, functions etc.
spark.catalog is just an interface that Spark SQL comes with two implementations for - in-memory (default) and hive.
Now, you might be asking yourself this question:
Is there anyway, such as through spark.conf, to find out if the hive support has been enabled?
There's no isHiveEnabled method or similar I know of that you could use to know whether you work with a Hive-aware SparkSession or not (as a matter of fact you don't need this method since you're in charge of creating a SparkSession instance so you should know what your Spark application does).
In environments where you're given a SparkSession instance (e.g. spark-shell or Databricks), the only way to check if a particular SparkSesssion has the Hive support enabled would be to see the type of the catalog implementation.
scala> spark.sessionState.catalog
res1: org.apache.spark.sql.catalyst.catalog.SessionCatalog = org.apache.spark.sql.hive.HiveSessionCatalog#4aebd384
If you see HiveSessionCatalog used, the SparkSession instance is Hive-aware.
In spark-shell , we can also use spark.conf.getAll. This command will return spark session configuration and we can see "spark.sql.catalogImplementation -> hive" suggesting Hive support.

Why does reading from Hive fail with "java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found"?

I use Spark v1.6.1 and Hive v1.2.x with Python v2.7
For Hive, I have some tables (ORC files) stored in HDFS and some stored in S3. If we are trying to join 2 tables, where one is in HDFS and the other is in S3, a java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found is thrown.
For example this works when querying against a HIVE table in HDFS.
df1 = sqlContext.sql('select * from hdfs_db.tbl1')
This works when querying against a HIVE table in S3.
df2 = sqlContext.sql('select * from s3_db.tbl2')
This code below throws the above RuntimeException.
sql = """
select *
from hdfs_db.tbl1 a
join s3_db.tbl2 b on a.id = b.id
"""
df3 = sqlContext.sql(sql)
We are migrating from HDFS to S3, and so that's why there is a difference of storage backing HIVE tables (basically, ORC files in HDFS and S3). One interesting thing is that if we use DBeaver or the beeline clients to connect to Hive and issue the joined query, it works. I can also use sqlalchemy to issue the joined query and get result. This problem only shows on Spark's sqlContext.
More information on execution and environment: This code is executed in Jupyter notebook on an edge node (that already has spark, hadoop, hive, tez, etc... setup/configured). The Python environment is managed by conda for Python v2.7. Jupyter is started with pyspark as follows.
IPYTHON_OPTS="notebook --port 7005 --notebook-dir='~/' --ip='*' --no-browser" \
pyspark \
--queue default \
--master yarn-client
When I go to the Spark Application UI under Environment, the following Classpath Entries has the following.
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-api-jdo-3.2.6.jar
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-core-3.2.10.jar
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-rdbms-3.2.9.jar
/usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar
/usr/hdp/current/hadoop-client/conf/
/usr/hdp/current/spark-historyserver/conf/
The sun.boot.class.path has the following value: /usr/jdk64/jdk1.8.0_60/jre/lib/resources.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/rt.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/sunrsasign.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jsse.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jce.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/charsets.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jfr.jar:/usr/jdk64/jdk1.8.0_60/jre/classes.
The spark.executorEnv.PYTHONPATH has the following value: /usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip:/usr/hdp/2.4.2.0-258/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.9-src.zip.
The Hadoop distribution is via CDH: Hadoop 2.7.1.2.4.2.0-258
Quoting the Steve Loughran (who given his track record in Spark development seems to be the source of truth about the topic of accessing S3 file systems) from SPARK-15965 No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1:
This is being fixed with tests in my work in SPARK-7481; the manual
workaround is
Spark 1.6+
This needs my patch a rebuild of spark assembly. However, once that patch is in, trying to use the assembly without the AWS JARs will stop spark from starting —unless you move up to Hadoop 2.7.3
There are also some other sources where you can find workarounds:
https://issues.apache.org/jira/browse/SPARK-7442
https://community.mapr.com/thread/9635
How to access s3a:// files from Apache Spark?
https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html
https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html
I've got no environment (and experience) to give the above a shot so after you give the above a try please report back to have a better understanding of the current situation regarding S3 support in Spark. Thanks.

How to let Spark SQL and thift server see the same Hive metastore?

Using spark-shell and HiveContext, I tried to show all the hive tables. But when I start the thirft server, and use beeline to check all tables, it is empty there.
On Spark SQL documentation, it says
(1) if I put hive-site.xml to conf/ in spark, saveAsTable method for DataFrame will persist table to hive specified in the xml file.
(2) if I put hive-site.xml to conf/ in spark, thriftServer will connect to the hive specified in the xml file.
Now I don't have any such xml file in conf/, so I suppose they should all use the default configuration. But clearly it is not the case, could anyone help point out the reason?
Thank you so much.
When I use spark-shell, I see the following line:
INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
Does this cause the two(spark-shell and thrift-server) see different hive metastore?
The code I tried on spark-shell:
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val df = hc.sql("show tables")
df.collect()
I tried "show tables" on beeline;
Turns out it is because I don't know enough about hive.
Every time when running HiveQL(for example "SHOW TABLES"), if there is no metastore_db in the current folder, it will create one for me. metastore_db stores all the table schemas so that they can be queried.
So the solution is, run all the hive-related program in the same folder. For my case, I should run start-thriftserver.sh and spark-shell in the same folder. Now both of them can share the same tables.
Furthermore, if I edit hive-site.xml to specify the metastore location, it is possible that the metastore will always be in a fixed location, which I will explore more.

How is spark HiveContext/SQLContext retrieving schema/data?

I can't seem to find much documentation on it but when I pull data from Hive in Spark SQL how is it retrieving the schema, is it automatically looking in the Hive Metastore? Also is it Hive telling spark to look at the file location to pull the data into a DataFrame? And how does it handle a view or can it not handle a view yet?
Yes, it looks up hive metastore.
Spark delegates hive queries to hive. It captures output and turn it to a dataframe of rows.
From docs:
When working with Hive one must construct a HiveContext, which
inherits from SQLContext, and adds support for finding tables in the
MetaStore and writing queries using HiveQL

Resources