How to create a Spark DataFrame (v1.6) on a secured HBase table? - apache-spark

I am trying to create a Spark dataframe on an existing HBase table (HBase is secured via Kerberos). I need to perform some Spark SQL operations on this table.
I have tried creating an RDD on the HBase table, but I am unable to convert it into a dataframe.

You can create a Hive external table with the HBase storage handler and then use that table to run your Spark SQL queries.
Creating the Hive external table:
CREATE EXTERNAL TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');
Spark SQL:
val df = spark.sql("SELECT * FROM foo WHERE …")
Note: here spark is a SparkSession.
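For completeness, a minimal PySpark sketch of the querying side, assuming the external table above has already been created from the Hive CLI or beeline, the HBase storage handler jars are on Spark's classpath, and a valid Kerberos ticket (kinit or a keytab) is in place; the names follow the example above:
from pyspark.sql import SparkSession
# Hive support is required so spark.sql can resolve the HBase-backed Hive table
spark = SparkSession.builder \
    .appName("hbase-over-hive") \
    .enableHiveSupport() \
    .getOrCreate()
# foo is the Hive external table mapped to the HBase table bar above
df = spark.sql("SELECT rowkey, a, b FROM foo")
df.show()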

Related

Spark SQL returns an empty dataframe when reading a Hive managed table

Using Spark 2.4 and Hive 3.1.0 on HDP 3.1, I am trying to read a managed table from Hive using Spark SQL, but it returns an empty dataframe, while it can read an external table without a problem.
How can I read a managed table from Hive with Spark SQL?
Note: the Hive managed table is not empty when it is read with the Hive client.
1- I tried formatting the table as ORC and as Parquet, and it failed with both.
2- I failed to read it using HWC.
3- I failed to read it using JDBC.
os.environ["HADOOP_USER_NAME"] = 'hdfs'
spark = SparkSession\
.builder\
.appName('NHIC')\
.config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:50070/user/hive/warehouse')\
.config("hive.metastore.uris", "thrift://192.168.1.66:9083")\
.enableHiveSupport()\
.getOrCreate()
HiveTableName ='nhic_poc.nhic_data_sample_formatted'
data = spark.sql('select * from '+HiveTableName)
The expected result is a dataframe containing the table's data, but the returned dataframe is empty.
Could you check whether your Spark environment is over-configured?
Try running the code with the environment's default configuration by removing these lines from your code:
os.environ["HADOOP_USER_NAME"] = 'hdfs'
.config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:50070/user/hive/warehouse')
.config("hive.metastore.uris", "thrift://192.168.1.66:9083")

Accessing Hive Tables from Spark SQL when Data is Stored in Object Storage

I am using the Spark dataframe writer to write data to internal Hive tables in Parquet format in IBM Cloud Object Storage.
My Hive metastore is in an HDP cluster, and I am running the Spark job from that cluster. This Spark job writes the data to IBM COS in Parquet format.
This is how I am starting the Spark session:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite")
        .config("hive.metastore.uris", "<thrift_url>")
        .config("spark.sql.sources.bucketing.enabled", true)
        .enableHiveSupport()
        .master("yarn")
        .getOrCreate();
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key", credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id", credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint", credentials.get(ConnectionConstants.COS_ENDPOINT));
The issue I am facing is that when I partition the data and store it (via partitionBy), I am unable to access the data directly from Spark SQL:
spark.sql("select * from partitioned_table").show
To fetch the data from the partitioned table, I have to load the dataframe and register it as a temp table and then query it.
The above issue does not occur when the table is not partitioned.
The code to write the data is this:
dfWithSchema.orderBy(sortKey).write()
        .partitionBy("somekey")
        .mode("append")
        .format("parquet")
        .option("path", PARQUET_PATH + tableName)
        .saveAsTable(tableName);
Any idea why the direct query approach is not working for the partitioned tables in COS/Parquet?
To read a partitioned table created by Spark this way, you need to give the absolute path of the table, as below.
selected_Data = spark.read.format("parquet").option("header", "false").load("hdfs/path/loc.db/partition_table")
To filter it further, please try the approach below (col comes from pyspark.sql.functions).
from pyspark.sql.functions import col
selected_Data.where(col("column_name") == 'col_value').show()
This issue occurs when the property hive.metastore.try.direct.sql is set to true in the Hive metastore configuration and the Spark SQL query is run over a non-STRING partition column.
For Spark, it is recommended to create tables with partition columns of STRING type.
If you are getting the error message below while filtering a Hive partitioned table in Spark:
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
recreate your Hive partitioned table with the partition column datatype as string; then you will be able to access the data directly from Spark SQL.
Otherwise you have to specify the absolute path of your HDFS location to get the data, in case your partition column has been defined as varchar:
selected_Data = spark.read.format("parquet").option("header", "false").load("hdfs/path/loc.db/partition_table")
However, I was not able to understand why it differentiates between the varchar and string datatypes for a partition column.
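A minimal sketch of the STRING-partition route through spark.sql; the table and column names (partitioned_table_str, x, y, somekey) and the filter value are illustrative, not from the original post:
# Allow dynamic partition inserts
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
# Recreate the table with a STRING partition column
spark.sql("""
    CREATE TABLE IF NOT EXISTS partitioned_table_str (x STRING, y STRING)
    PARTITIONED BY (somekey STRING)
    STORED AS PARQUET
""")
# Copy the data over, casting the partition column to STRING
spark.sql("""
    INSERT OVERWRITE TABLE partitioned_table_str PARTITION (somekey)
    SELECT x, y, CAST(somekey AS STRING) AS somekey FROM partitioned_table
""")
# With a STRING partition column, filtered queries work directly from Spark SQL
spark.sql("SELECT * FROM partitioned_table_str WHERE somekey = 'some_value'").show()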

Set partition location in Qubole metastore using Spark

How do I set the partition location for my Hive table in the Qubole metastore?
I know that this is a MySQL DB, but how do I access it and apply a SQL script with a fix using Spark?
UPD: The issue is that ALTER TABLE table_name [PARTITION (partition_spec)] SET LOCATION works slowly for >1000 partitions. Do you know how to update the metastore directly for Qubole? I want to pass the locations to the metastore in a batch to improve performance.
Set the Hive metastore URIs in your Spark config, if not set already. This can be done in the Qubole cluster settings.
Set up a SparkSession with some properties:
val spark: SparkSession =
  SparkSession
    .builder()
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()
Assuming AWS, define an external table on S3 using spark.sql:
CREATE EXTERNAL TABLE foo (...) PARTITIONED BY (...) LOCATION 's3a://bucket/path'
Generate your dataframe according to that table's schema.
Register a temp table for the dataframe; let's call it tempTable.
Run an insert command with your partitions, again using spark.sql:
INSERT OVERWRITE TABLE foo PARTITION(part1, part2)
SELECT x, y, z, part1, part2 FROM tempTable
The partition columns must come last in the SELECT.
Partition locations will be placed within the table location in S3.
If you want to use external partition locations, check out the Hive documentation on ALTER TABLE [PARTITION (spec)], which accepts a LOCATION path.
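Putting the steps together, a rough PySpark sketch; df stands for the dataframe from the generate step, and the table name, columns, storage format, and S3 path are illustrative:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .enableHiveSupport() \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .getOrCreate()
# External, partitioned table on S3
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS foo (x STRING, y STRING, z STRING)
    PARTITIONED BY (part1 STRING, part2 STRING)
    STORED AS PARQUET
    LOCATION 's3a://bucket/path'
""")
# Register the dataframe holding the data
df.createOrReplaceTempView("tempTable")
# Dynamic-partition insert; partition columns go last in the SELECT
spark.sql("""
    INSERT OVERWRITE TABLE foo PARTITION (part1, part2)
    SELECT x, y, z, part1, part2 FROM tempTable
""")
# For an explicitly placed partition, the ALTER TABLE form mentioned above would look like:
# spark.sql("ALTER TABLE foo PARTITION (part1='a', part2='b') SET LOCATION 's3a://bucket/other/path'")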

How do I create external Hive Table based on existing Orc file?

I have some ORC files produced by a Spark job.
Is there an easy way to create an external table directly from those files?
The way I have done this is to first register a temp table in the Spark job itself and then leverage the sql method of the HiveContext to create a new table in Hive using the data from the temp table. For example, if I have a dataframe df and a HiveContext hc, the general process is:
df.registerTempTable("my_temp_table")
hc.sql("CREATE TABLE new_table_name STORED AS ORC AS SELECT * from my_temp_table")

How to use different Hive metastore for saveAsTable?

I am using Spark SQL (Spark 1.6.1) with PySpark, and I have a requirement to load a table from one Hive metastore and write the resulting dataframe into a different Hive metastore.
I am wondering how I can use two different metastores in one Spark SQL script?
Here is what my script looks like.
# Hive metastore 1
sc1 = SparkContext()
hiveContext1 = HiveContext(sc1)
hiveContext1.setConf("hive.metastore.warehouse.dir", "tmp/Metastore1")
# Hive metastore 2
sc2 = SparkContext()
hiveContext2 = HiveContext(sc2)
hiveContext2.setConf("hive.metastore.warehouse.dir", "tmp/Metastore2")
# Reading from a table present in metastore1
df_extract = hiveContext1.sql("select * from emp where emp_id = 1")
# Need to write the result into a table in a different metastore
df_extract.saveAsTable('targetdbname.target_table', mode='append', path='maprfs:///abc/datapath...')
Hotels.com have developed an application (Waggle Dance) specifically for this: https://github.com/HotelsDotCom/waggle-dance. Using it as a proxy, you should be able to achieve what you're trying to do.
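I have not tried this end to end, but the general idea is that Waggle Dance exposes a single federated Thrift endpoint in front of the individual metastores; Spark is then pointed at that one endpoint and addresses tables in the different metastores through the databases the proxy exposes. A rough sketch using the Spark 2.x SparkSession API, with a placeholder endpoint and illustrative database names:
from pyspark.sql import SparkSession
# <waggle-dance-host>:<port> is a placeholder for the proxy's Thrift endpoint
spark = SparkSession.builder \
    .appName("cross-metastore-copy") \
    .config("hive.metastore.uris", "thrift://<waggle-dance-host>:<port>") \
    .enableHiveSupport() \
    .getOrCreate()
# Read from a database backed by one metastore, write to a database backed by another
df_extract = spark.sql("select * from sourcedb.emp where emp_id = 1")
df_extract.write.mode("append").saveAsTable("targetdbname.target_table")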
TL;DR It is not possible to use one Hive metastore (for some tables) and another (for other tables).
Since Spark SQL supports a single Hive metastore (in a SharedState) regardless of the number of SparkSessions, reading from and writing to different Hive metastores is technically impossible.
