Set partition location in Qubole metastore using Spark - apache-spark

How to set partition location for my Hive table in Qubole metastore?
I know that this is a MySQL DB, but how do I access it and pass a SQL script with a fix using Spark?
UPD: The issue is that ALTER TABLE table_name [PARTITION (partition_spec)] SET LOCATION works slowly for >1000 partitions. Do you know how to update the metastore directly for Qubole? I want to pass locations to the metastore in a batch to improve performance.

Set the Hive metastore URIs in your Spark config, if they are not set already. This can be done in the Qubole cluster settings.
Set up a SparkSession with some properties:
val spark: SparkSession =
  SparkSession
    .builder()
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()
Assuming AWS, define an external table on S3 using spark.sql
CREATE EXTERNAL TABLE foo (...) PARTITIONED BY (...) LOCATION 's3a://bucket/path'
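For instance, a minimal sketch (the column names match the INSERT example below; the Parquet format and STRING types are assumptions):
// Sketch only: columns follow the INSERT below; storage format and types are assumed.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS foo (x STRING, y STRING, z STRING)
    |PARTITIONED BY (part1 STRING, part2 STRING)
    |STORED AS PARQUET
    |LOCATION 's3a://bucket/path'""".stripMargin)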
Generate your dataframe according to that table schema.
Register a temp table for the dataframe. Let's call it tempTable
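For example, assuming the dataframe is called df:
df.createOrReplaceTempView("tempTable")  // makes df queryable as tempTable from spark.sql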
Run an insert command with your partitions, again using spark.sql
INSERT OVERWRITE TABLE foo PARTITION(part1, part2)
SELECT x, y, z, part1, part2 from tempTable
The partition columns must come last in the SELECT list.
Partition locations will be placed within the table location in S3.
If you want to use external partition locations, see the Hive documentation on ALTER TABLE ... PARTITION (spec) ... SET LOCATION, which accepts a location path.
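For the UPD above (repointing many partitions in a batch), one option is to loop over the partition specs and issue those ALTER TABLE statements through spark.sql. A minimal sketch, assuming a hypothetical partitionLocations map from partition spec to S3 path; note this still makes one metastore call per partition, so it automates the batch rather than fundamentally speeding it up:
// Sketch only: partitionLocations is a hypothetical
// Map("part1='a',part2='b'" -> "s3a://bucket/other/path", ...).
partitionLocations.foreach { case (spec, path) =>
  spark.sql(s"ALTER TABLE foo PARTITION ($spec) SET LOCATION '$path'")
}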

Related

Write Spark Dataframe to Hive accessible table in HDP2.6

I know there are already lots of answers on writing to Hive from Spark, but none of them seem to work for me. So first some background. This is an older cluster, running HDP 2.6, i.e. Hive 2 and Spark 2.1.
Here is an example program:
import org.apache.spark.sql.SparkSession

case class Record(key: Int, value: String)

val spark = SparkSession.builder()
  .appName("Test App")
  .config("spark.sql.warehouse.dir", "/app/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.saveAsTable("records_table")
If I log into the spark-shell and run that code, a new table called records_table shows up in Hive. However, if I deploy that code in a jar and submit it to the cluster using spark-submit, the table shows up in the same HDFS location as all of the other Hive tables, but it's not accessible to Hive.
I know that in HDP 3.1 you have to use a HiveWarehouseConnector class, but I can't find any reference to that in HDP 2.6. Some people have mentioned the HiveContext class, while others say to just use the enableHiveSupport call on the SparkSession builder. I have tried both approaches, but neither seems to work. I have tried saveAsTable. I have tried insertInto. I have even tried creating a temp view, then hiveContext.sql("create table if not exists mytable as select * from tmptable"). With each attempt, I get a parquet file in hdfs:/apps/hive/warehouse, but I cannot access that table from Hive itself.
Based on the information provided, here is what I suggest you do.
Create a SparkSession; enableHiveSupport is required:
val spark = SparkSession.builder()
  .appName("Test App")
  .enableHiveSupport()
  .getOrCreate()
Next, execute the DDL for the resultant table via spark.sql (here hdfsLocation holds the HDFS path where the table data should live):
val ddlStr: String =
s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_table(key int, value string)
|ROW FORMAT SERDE
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
|STORED AS INPUTFORMAT
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
|OUTPUTFORMAT
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
|LOCATION '$hdfsLocation'""".stripMargin
spark.sql(ddlStr)
Write data as per your use case,
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.format("orc").insertInto("records_table")
Notes:
The behavior will be the same for spark-shell and spark-submit.
Partitioning can be defined in the DDL, so do not use partitionBy while writing the data frame (see the sketch after these notes).
Bucketing/clustering is not supported.
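For example, a partitioned variant of the DDL above might look like the sketch below (the dt partition column is hypothetical, hdfsLocation is as before, and dynamic-partition inserts additionally need hive.exec.dynamic.partition / hive.exec.dynamic.partition.mode=nonstrict on the session):
// Sketch only: partitioning declared in the DDL, not via partitionBy.
val partitionedDdl: String =
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_table_by_day(key int, value string)
     |PARTITIONED BY (dt string)
     |STORED AS ORC
     |LOCATION '$hdfsLocation'""".stripMargin
spark.sql(partitionedDdl)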
Hope this helps/ Cheers.

How to create a Spark Dataframe(v1.6) on a secured Hbase Table?

I am trying to create a Spark dataframe on an existing HBase table (HBase is secured via Kerberos). I need to perform some Spark SQL operations on this table.
I have tried creating an RDD on the HBase table but was unable to convert it into a dataframe.
You can create a Hive external table with the HBase storage handler and then use that table to run your Spark SQL queries.
Creating the Hive external table:
CREATE EXTERNAL TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');
Spark-sql:
val df = spark.sql("SELECT * FROM foo WHERE …")
Note: Here spark is a SparkSession
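A minimal sketch of building such a session (the app name is a placeholder; enableHiveSupport is what makes the Hive-on-HBase table foo visible to Spark SQL). Note that SparkSession is a Spark 2.x API; on Spark 1.6 the equivalent entry point would be a HiveContext:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hbase-via-hive")  // placeholder app name
  .enableHiveSupport()        // required so the Hive table foo is visible
  .getOrCreate()
val df = spark.sql("SELECT * FROM foo")
df.show()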

Accessing Hive Tables from Spark SQL when Data is Stored in Object Storage

I am using the Spark dataframe writer to write data to internal Hive tables in Parquet format in IBM Cloud Object Storage.
So, my Hive metastore is in the HDP cluster and I am running the Spark job from the HDP cluster. This Spark job writes the data to IBM COS in Parquet format.
This is how I am starting the Spark session:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite")
    .config("hive.metastore.uris", "<thrift_url>")
    .config("spark.sql.sources.bucketing.enabled", true)
    .enableHiveSupport()
    .master("yarn").getOrCreate();
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key", credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id", credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint", credentials.get(ConnectionConstants.COS_ENDPOINT));
The issue that I am facing is that when I partition the data and store it (via partitionBy), I am unable to access the data directly from Spark SQL:
spark.sql("select * from partitioned_table").show
To fetch the data from the partitioned table, I have to load the dataframe, register it as a temp table, and then query it.
The above issue does not occur when the table is not partitioned.
The code to write the data is this:
dfWithSchema.orderBy(sortKey).write()
    .partitionBy("somekey")
    .mode("append")
    .format("parquet")
    .option("path", PARQUET_PATH + tableName)
    .saveAsTable(tableName);
Any idea why the direct query approach is not working for the partitioned tables in COS/Parquet?
To read a partitioned table (created by Spark), you need to give the absolute path of the table, as below.
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
To filter it further, please try the below approach.
selected_Data.where(col("column_name") == 'col_value').show()
This issue occurs when the property hive.metastore.try.direct.sql is set to true in the Hive metastore configuration and the Spark SQL query is run over a non-STRING partition column.
For Spark, it is recommended to create tables with partition columns of STRING type.
If you are getting the below error message while filtering the Hive partitioned table in Spark:
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
then recreate your Hive partitioned table with the partition column datatype as string; you will then be able to access the data directly from Spark SQL.
Otherwise, you have to specify the absolute path of your HDFS location to get the data, in case your partition column has been defined as varchar:
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
However, I was not able to understand why it differentiates between a varchar and a string datatype for the partition column.
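A sketch of the recommended fix, recreating the table with a STRING partition column (the session needs Hive support; the table, column names, and location below are placeholders):
// Sketch only: the partition column is declared as STRING rather than VARCHAR.
spark.sql("""
  CREATE EXTERNAL TABLE partitioned_table_str (x STRING, y STRING)
  PARTITIONED BY (part_col STRING)
  STORED AS PARQUET
  LOCATION 'cos://mybucket.mpcos/path/partitioned_table'
""")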

Spark SQL Fails when hive partition is missing

I have a table which has some missing partitions. When I query it in Hive it works fine:
SELECT *
FROM my_table
but when I call it from pyspark (v2.3.0) it fails with the message Input path does not exist: hdfs://path/to/partition. The Spark code I am running is just naive:
spark = (SparkSession
    .builder
    .appName("prueba1")
    .master("yarn")
    .config("spark.sql.hive.verifyPartitionPath", "false")
    .enableHiveSupport()
    .getOrCreate())
spark.table('some_schema.my_table').show(10)
The config("spark.sql.hive.verifyPartitionPath", "false") setting has been proposed in this question but does not seem to work for me.
Is there any way I can configure the SparkSession so I can get rid of these errors? I am afraid that more partitions will go missing in the future, so a hardcoded solution is not possible.
This error occurs when partitioned data is dropped from HDFS directly, i.e. without using Hive commands to drop the partition.
If the data is dropped from HDFS directly, Hive doesn't know about the dropped partition; when we query the Hive table it still looks for the directory, and since the directory doesn't exist in HDFS it results in a file-not-found exception.
To fix this issue, we need to drop the partition associated with the directory from the Hive table as well, by using
alter table <db_name>.<table_name> drop partition(<partition_col_name>=<partition_value>);
Then Hive drops the partition from the metadata. This is the only way to remove the metadata from the Hive table if the partition directory has been dropped from HDFS.
msck repair table doesn't drop partitions; it only adds new partitions if new ones have been added to HDFS.
The correct way to avoid this kind of issue in the future is to drop partitions using Hive's drop partition commands.
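The same drop can also be issued from Spark, for example (a sketch; the session needs Hive support, and all names are placeholders):
// Sketch only: drops the partition metadata for a directory already removed from HDFS.
spark.sql("ALTER TABLE some_schema.my_table DROP IF EXISTS PARTITION (partition_col='partition_value')")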
Does the other way around, .config("spark.sql.hive.verifyPartitionPath", "true"), work for you? I have just managed to load data using spark-sql with this setting while one of the partition paths from Hive was empty and the partition still existed in the Hive metastore. Though there are caveats: it seems to take significantly more time to load data compared to when this setting is set to false.

unable to insert into hive partitioned table from spark

I created an external partitioned table in Hive.
In the logs it shows numInputRows, which means the query is working and sending data. But when I connect to Hive using beeline and query, select * or count(*), it's always empty.
def hiveOrcSetWriter[T](event_stream: Dataset[T])(implicit spark: SparkSession): DataStreamWriter[T] = {
  import spark.implicits._
  val hiveOrcSetWriter: DataStreamWriter[T] = event_stream
    .writeStream
    .partitionBy("year", "month", "day")
    .format("orc")
    .outputMode("append")
    .option("compression", "zlib")
    .option("path", _table_loc)
    .option("checkpointLocation", _table_checkpoint)
  hiveOrcSetWriter
}
What can be the issue? I'm unable to understand.
msck repair table tablename
It goes and checks the location of the table and adds partitions if new ones exist.
In your Spark process, add this step in order to query from Hive.
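For example, a minimal sketch of that step (the table name is a placeholder and the session must have Hive support enabled):
// Sketch only: refresh partition metadata after the sink has written new directories.
spark.sql("MSCK REPAIR TABLE mydb.tablename")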
Your streaming job is writing new partitions to the table_location. But the Hive metastore is not aware of this.
When you run a select query on the table, Hive checks the metastore to get the list of table partitions. Since the information in the metastore is outdated, the data doesn't show up in the result.
You need to run -
ALTER TABLE <TABLE_NAME> RECOVER PARTITIONS
command from Hive/Spark to update the metastore with new partition info.
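A sketch of issuing it from Spark (the table name is a placeholder; on the Hive side, MSCK REPAIR TABLE is the usual equivalent):
// Sketch only: registers any partition directories the metastore does not know about yet.
spark.sql("ALTER TABLE mydb.tablename RECOVER PARTITIONS")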
