Unable to insert into Hive partitioned table from Spark - apache-spark

I created an external partitioned table in Hive.
In the logs the streaming query shows numInputRows, which means the query is running and writing data. But when I connect to Hive using beeline and run select * or count(*), the table is always empty.
def hiveOrcSetWriter[T](event_stream: Dataset[T])(implicit spark: SparkSession): DataStreamWriter[T] = {
  import spark.implicits._
  // Stream the dataset as zlib-compressed ORC files into the external table's
  // location, partitioned by year/month/day
  val hiveOrcSetWriter: DataStreamWriter[T] = event_stream
    .writeStream
    .partitionBy("year", "month", "day")
    .format("orc")
    .outputMode("append")
    .option("compression", "zlib")
    .option("path", _table_loc)
    .option("checkpointLocation", _table_checkpoint)
  hiveOrcSetWriter
}
What could be the issue? I'm unable to understand it.

msck repair table tablename
This makes Hive go and check the table location and add partitions if new ones exist.
Add this step to your Spark process in order to query the data from Hive.
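For example, a minimal sketch of issuing the repair from the Spark job itself; the table name is a placeholder for the external table defined on _table_loc:
// Refresh the metastore once the stream has written new files
// ("your_db.your_table" is a placeholder, not from the original post)
spark.sql("MSCK REPAIR TABLE your_db.your_table")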

Your streaming job is writing new partitions to the table location, but the Hive metastore is not aware of them.
When you run a select query on the table, Hive checks the metastore to get the list of table partitions. Since the information in the metastore is outdated, the data doesn't show up in the result.
You need to run
ALTER TABLE <TABLE_NAME> RECOVER PARTITIONS
from Hive/Spark to update the metastore with the new partition info.
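If you want the metastore to stay in sync while the stream runs, one possible sketch (not from the original answer; eventDF, _table_loc, _table_checkpoint and your_db.your_table are placeholders) is to write each micro-batch with foreachBatch and recover partitions right after the write:
import org.apache.spark.sql.DataFrame

// Sketch only: write each micro-batch as ORC under the table location,
// then let the metastore discover any new partition directories
val query = eventDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .mode("append")
      .partitionBy("year", "month", "day")
      .orc(_table_loc)
    spark.sql("ALTER TABLE your_db.your_table RECOVER PARTITIONS")
  }
  .option("checkpointLocation", _table_checkpoint)
  .start()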

Related

How to automatically update the Hive external table metadata partitions for streaming data

I am writing Spark streaming data into HDFS partitions using PySpark.
Please find the code below:
data = (spark.readStream.format("json").schema(fileSchema).load(inputDirectoryOfJsonFiles))
output = (data.writeStream
.format("parquet")
.partitionBy("date")
.option("compression", "none")
.option("path" , "/user/hdfs/stream-test")
.option("checkpointLocation", "/user/hdfs/stream-ckp")
.outputMode("append")
.start().awaitTermination())
After writing the data into HDFS, I am creating the Hive external partitioned table:
CREATE EXTERNAL TABLE test (id string,record string)
PARTITIONED BY (`date` date)
STORED AS PARQUET
LOCATION '/user/hdfs/stream-test/'
TBLPROPERTIES ('discover.partitions' = 'true');
But the newly created partitions are not being recognized by the Hive metastore. I am updating the metastore using the msck command:
msck repair table test sync partitions
Now, for the streaming data, how do I automate this task of updating the Hive metastore with the real-time partitions?
Please suggest a solution to this problem.
Spark Structured Streaming doesn't natively support this, but you can use foreachBatch as a workaround:
import org.apache.spark.sql.{DataFrame, SaveMode}

// Use readStream (not read) so the result is a streaming DataFrame that supports writeStream;
// a Kafka source also needs a subscribe/assign option (topic name is a placeholder)
val yourStream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "your_topic")
  .load()

// foreachBatch exposes each micro-batch as a regular DataFrame, so the batch writer's
// insertInto can be used; it appends the data and registers new partitions in the metastore
val query = yourStream.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF
    .write
    .mode(SaveMode.Append)
    .insertInto("your_db.your_hive_table")
}.start()

query.awaitTermination()
For more details, refer to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch

Accessing Hive Tables from Spark SQL when Data is Stored in Object Storage

I am using the Spark dataframe writer to write data to internal Hive tables in Parquet format in IBM Cloud Object Storage.
So, my Hive metastore is in the HDP cluster, and I am running the Spark job from the HDP cluster. This Spark job writes the data to IBM COS in Parquet format.
This is how I am starting the Spark session:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite")
.config("hive.metastore.uris", "<thrift_url>")
.config("spark.sql.sources.bucketing.enabled", true)
.enableHiveSupport()
.master("yarn").getOrCreate();
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key",credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id",credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint",credentials.get(ConnectionConstants.COS_ENDPOINT));
The issue I am facing is that when I partition the data and store it (via partitionBy), I am unable to access the data directly from Spark SQL:
spark.sql("select * from partitioned_table").show
To fetch the data from the partitioned table, I have to load the dataframe, register it as a temp table, and then query it.
The above issue does not occur when the table is not partitioned.
The code to write the data is this
dfWithSchema.orderBy(sortKey).write()
.partitionBy("somekey")
.mode("append")
.format("parquet")
.option("path",PARQUET_PATH+tableName )
.saveAsTable(tableName);
Any idea why the direct query approach is not working for the partitioned tables in COS/Parquet?
To read a partitioned table created by Spark, you need to give the absolute path of the table, as below:
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
To filter it further, try the below approach:
selected_Data.where(col("column_name")=='col_value').show()
This issue occurs when the property hive.metastore.try.direct.sql is set to true in the Hive metastore configuration and the Spark SQL query is run over a non-STRING type partition column.
For Spark, it is recommended to create tables with partition columns of STRING type.
If you are getting the below error message while filtering a Hive partitioned table in Spark:
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
recreate your Hive partitioned table with the partition column datatype as string; then you will be able to access the data directly from Spark SQL.
Otherwise, you have to specify the absolute path of your HDFS location to get the data, in case your partition column has been defined as varchar:
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
However, I was not able to understand why it differentiates between the varchar and string datatypes for the partition column.
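A minimal sketch of the recommended DDL, assuming hypothetical table, column and location names (adjust to your own schema; the cos:// path is a placeholder):
// Sketch: recreate the external table with a STRING partition column so Spark SQL
// can push partition filters through the metastore
spark.sql("""
  CREATE EXTERNAL TABLE partitioned_table (
    id STRING,
    record STRING
  )
  PARTITIONED BY (somekey STRING)
  STORED AS PARQUET
  LOCATION 'cos://your-bucket.mpcos/path/partitioned_table'
""")
// Pick up the partition directories that already exist under the location
spark.sql("MSCK REPAIR TABLE partitioned_table")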

How can I refresh a Hive/Impala table from Spark Structured Streaming?

Currently my Spark Structured Streaming query looks like this (sink part only):
//Output aggregation query to Parquet in append mode
aggregationQuery.writeStream
.format("parquet")
.trigger(Trigger.ProcessingTime("15 seconds"))
.partitionBy("date", "hour")
.option("path", "hdfs://<myip>:8020/user/myuser/spark/proyecto3")
.option("checkpointLocation", "hdfs://<myip>:8020/user/myuser/spark/checkpointfolder3")
.outputMode("append")
.start()
The above code generates .parquet files in the directory defined by path.
I have externally defined an Impala table that reads from that path, but I need the table to be updated or refreshed after every append of parquet files.
How can this be achieved?
You need to add the new partitions to your table after the file sink writes them. Since the table is partitioned by both date and hour, the partition spec needs both keys:
import spark.sql

val query1 = "ALTER TABLE proyecto3 ADD IF NOT EXISTS PARTITION (date='20200803', hour='104700') LOCATION '/your/location/proyecto3/date=20200803/hour=104700'"
sql(query1)

Set partition location in Qubole metastore using Spark

How do I set the partition location for my Hive table in the Qubole metastore?
I know that it is a MySQL DB, but how do I access it and apply a SQL fix script using Spark?
UPD: The issue is that ALTER TABLE table_name [PARTITION (partition_spec)] SET LOCATION works slowly for >1000 partitions. Do you know how to update the metastore directly for Qubole? I want to pass the locations to the metastore in a batch to improve performance.
Set Hive metastore uris in your Spark config, if not set already. This can be done in the Qubole cluster settings.
Set up a SparkSession with some properties:
val spark: SparkSession =
  SparkSession
    .builder()
    .enableHiveSupport()
    // hive.metastore.uris can also be set here via .config(...) if it is not
    // already part of the cluster configuration
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()
Assuming AWS, define an external table on S3 using spark.sql
CREATE EXTERNAL TABLE foo (...) PARTITIONED BY (...) LOCATION 's3a://bucket/path'
Generate your dataframe according to that table schema.
Register a temp view for the dataframe; let's call it tempTable.
Run an insert command with your partitions, again using spark.sql:
INSERT OVERWRITE TABLE foo PARTITION(part1, part2)
SELECT x, y, z, part1, part2 from tempTable
The partition columns must come last in the select list (see the combined sketch after this answer).
Partition locations will be placed within the table location in S3.
If you want to use external partition locations, check out the Hive documentation on the ALTER TABLE ... PARTITION (spec) variants that accept a LOCATION path.
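Putting the last steps together, a minimal sketch (df, foo, tempTable and the column names are the placeholders used in the steps above):
// Register the dataframe under the temp view name used in the INSERT,
// then run the dynamic-partition insert; partition columns go last in the SELECT
df.createOrReplaceTempView("tempTable")
spark.sql("""
  INSERT OVERWRITE TABLE foo PARTITION (part1, part2)
  SELECT x, y, z, part1, part2 FROM tempTable
""")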

How do I save spark.writeStream results in hive?

I am using spark.readStream to read data from Kafka and running an explode on the resulting dataframe.
I am trying to save the result of the explode in a Hive table and I am not able to find any solution for that.
I tried the following method but it doesn't work (it runs but I don't see any new partitions created)
val query = tradelines.writeStream.outputMode("append")
.format("memory")
.option("truncate", "false")
.option("checkpointLocation", checkpointLocation)
.queryName("tl")
.start()
sc.sql("set hive.exec.dynamic.partition.mode=nonstrict;")
sc.sql("INSERT INTO TABLE default.tradelines PARTITION (dt) SELECT * FROM tl")
Check HDFS for the dt partitions on the file system.
You need to run MSCK REPAIR TABLE on the Hive table to see the new partitions.
If you aren't doing anything special with Spark, then it's worth pointing out that Kafka Connect HDFS is capable of registering Hive partitions directly from Kafka.
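As a sketch of the file-based route this answer implies (the path and partition column are assumptions based on the question): the memory format only keeps the results as an in-memory table on the driver, so no files ever land on HDFS. Writing the exploded stream as files under the table's location instead, and then repairing the table, makes the dt partitions visible to Hive:
// Sketch only: the path below is a placeholder for default.tradelines' actual location
val fileQuery = tradelines.writeStream
  .outputMode("append")
  .format("parquet")
  .partitionBy("dt")
  .option("path", "/path/to/warehouse/default.db/tradelines")
  .option("checkpointLocation", checkpointLocation)
  .start()

// Once files have landed, refresh the metastore so the dt partitions become queryable
spark.sql("MSCK REPAIR TABLE default.tradelines")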
