Number of partitions scanned (=1000) on table 'table' exceeds limit (=100) - apache-spark

All queries for table fail with this error.
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Number of partitions scanned (=1000) on table 'table' exceeds limit (=100). This is controlled on the metastore server by metastore.limit.partition.request.)
spark.table("table").
filter($"dt" === "2023-01-01").
show
I have these configs for spark-shell
--conf spark.sql.hive.convertMetastoreOrc=false \
--conf spark.sql.hive.metastorePartitionPruning=true \
Spark seems to be scanning the whole table despite the filter and the configs. Why does this happen? My table definition:
CREATE EXTERNAL TABLE table(
columns ...
)
PARTITIONED BY (dt date)
STORED AS ORC
TBLPROPERTIES ('external.table.purge'='true', 'orc.compress'='ZLIB')

You should set the hive.metastore.limit.partition.request property (see docs), as the error says. The default value is -1 (unlimited), but your cluster configuration probably sets it somewhere.
To set Hive properties through Spark you pass them via the Spark configuration, typically with the spark.hadoop. prefix (the spark.sql.* options in your example are Spark's own settings, not Hive properties). Note that the error message says this limit is controlled on the metastore server, so it may need to be changed in the metastore's hive-site.xml rather than on the Spark side.
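As a hedged PySpark illustration (the table name and filter mirror the question; whether a client-side override of the limit is honored depends on your Hive setup, since the error says the limit is enforced on the metastore server), you can forward the property with the spark.hadoop. prefix and use explain() to confirm the dt predicate is pushed down as a partition filter:
from pyspark.sql import SparkSession

# Sketch only: the spark.hadoop. prefix forwards the property to the Hive
# client configuration; the metastore server may still enforce its own limit.
spark = (SparkSession.builder
    .enableHiveSupport()
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .config("spark.hadoop.hive.metastore.limit.partition.request", "-1")
    .getOrCreate())

# If pruning works, the physical plan shows the dt predicate applied as a
# partition filter, so only the matching partition is requested.
spark.table("table").filter("dt = '2023-01-01'").explain(True)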

Related

Accessing Hive Tables from Spark SQL when Data is Stored in Object Storage

I am using the Spark DataFrame writer to write data into internal Hive tables in Parquet format in IBM Cloud Object Storage.
My Hive metastore is in an HDP cluster, and I am running the Spark job from that HDP cluster. The job writes the data to IBM COS in Parquet format.
This is how I am starting the spark session
SparkSession session = SparkSession.builder().appName("ParquetReadWrite")
.config("hive.metastore.uris", "<thrift_url>")
.config("spark.sql.sources.bucketing.enabled", true)
.enableHiveSupport()
.master("yarn").getOrCreate();
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key",credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id",credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint",credentials.get(ConnectionConstants.COS_ENDPOINT));
The issue that I am facing is that when I partition the data and store it (via partitionBy) I am unable to access the data directly from Spark SQL:
spark.sql("select * from partitioned_table").show
To fetch the data from the partitioned table, I have to load the DataFrame, register it as a temp table, and then query it.
The above issue does not occur when the table is not partitioned.
The code to write the data is this
dfWithSchema.orderBy(sortKey).write()
.partitionBy("somekey")
.mode("append")
.format("parquet")
.option("path",PARQUET_PATH+tableName )
.saveAsTable(tableName);
Any idea why the direct query approach is not working for the partitioned tables in COS/Parquet?
To read the partitioned table (created by Spark), you need to give the absolute path of the table, as below.
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
To filter it further, try the approach below.
from pyspark.sql.functions import col
selected_Data.where(col("column_name") == 'col_value').show()
This issue occurs when the property hive.metastore.try.direct.sql is set to true in the Hive metastore configuration and the Spark SQL query filters on a non-STRING partition column.
For Spark, it is recommended to create tables with partition columns of STRING type.
If you get the error message below while filtering a Hive partitioned table in Spark:
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
recreate your Hive partitioned table with the partition column datatype as string; then you will be able to access the data directly from Spark SQL.
Otherwise, if your partition column has been defined as varchar, you have to specify the absolute path of the HDFS location to get the data:
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
However, I was not able to understand why it differentiates between the varchar and string datatypes for the partition column.
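A hedged PySpark sketch of that recreation, reusing the placeholders from the question (dfWithSchema, PARQUET_PATH, tableName) and simply casting the partition key to string before writing:
from pyspark.sql.functions import col

# Sketch only: dfWithSchema, PARQUET_PATH and tableName are the placeholders
# from the question; mode("overwrite") recreates the table with a STRING
# partition column so Spark SQL can filter on it directly.
(dfWithSchema
    .withColumn("somekey", col("somekey").cast("string"))
    .write
    .partitionBy("somekey")
    .mode("overwrite")
    .format("parquet")
    .option("path", PARQUET_PATH + tableName)
    .saveAsTable(tableName))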

Spark SQL Fails when hive partition is missing

I have a table which has some missing partitions. When I query it in Hive it works fine
SELECT *
FROM my_table
but when I call it from pyspark (v. 2.3.0) it fails with the message Input path does not exist: hdfs://path/to/partition. The Spark code I am running is quite naive:
spark = ( SparkSession
.builder
.appName("prueba1")
.master("yarn")
.config("spark.sql.hive.verifyPartitionPath", "false")
.enableHiveSupport()
.getOrCreate())
spark.table('some_schema.my_table').show(10)
the config("spark.sql.hive.verifyPartitionPath", "false") has been proposed is
this question but seems to not work fine for me
Is there any way I can configure SparkSession so I can get rid of these. I am afraid that in the future more partitions will miss, so a hardcode solution is not possible
This error occurs when partition data has been dropped from HDFS directly, i.e. without using Hive commands to drop the partition.
If the data is dropped from HDFS directly, Hive doesn't know about the dropped partition; when we query the Hive table it still looks for the directory, and since the directory no longer exists in HDFS, it results in a file-not-found exception.
To fix this issue we also need to drop the partition associated with that directory from the Hive table, using
alter table <db_name>.<table_name> drop partition(<partition_col_name>=<partition_value>);
Hive then drops the partition from its metadata; this is the only way to remove the metadata from the Hive table once the partition directory has been dropped from HDFS.
msck repair table doesn't drop partitions; it only adds new partitions when new partition directories appear in HDFS.
The correct way to avoid this kind of issue in the future is to drop partitions using Hive's drop partition command.
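For reference, a hedged sketch of running that cleanup from the Spark side (the schema, table, and partition spec are placeholders, and spark is the SparkSession built in the question):
# Placeholders: some_schema.my_table and the partition spec are illustrative.
# spark is the SparkSession with Hive support from the question.
spark.sql(
    "ALTER TABLE some_schema.my_table "
    "DROP IF EXISTS PARTITION (partition_col='partition_value')"
)
# Once the metadata no longer references the missing path, the plain read works.
spark.table("some_schema.my_table").show(10)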
Does the other way around, .config("spark.sql.hive.verifyPartitionPath", "true"), work for you? I have just managed to load data using spark-sql with this setting while one of the partition paths from Hive was empty and the partition still existed in the Hive metastore. There are caveats though - it seems to take significantly more time to load data compared to when this setting is set to false.

Hive: insert into table by Hue produces different number of files than pyspark

I have a Cloudera cluster on which I am accumulating large amounts of data in a Hive table stored as Parquet. The table is partitioned by an integer batch_id. My workflow for inserting a new batch of rows is to first insert the rows into a staging table, then insert into the large accumulating table. I am using a local-mode Python Pyspark script to do this. The script is essentially:
sc = pyspark.SparkContext()
hc = pyspark.HiveContext(sc)
hc.sql(
"""
INSERT INTO largeAccumulatorTable
PARTITION (batch_id = {0})
SELECT * FROM stagingBatchId{0}
"""
.format(batch_id)
)
I execute it using this shell script:
#!/bin/bash
spark-submit \
--master local[*] \
--num-executors 8 \
--executor-cores 1 \
--executor-memory 2G \
spark_insert.py
I have noticed that the resulting Parquet files in the large accumulating table are very small (some just a few KB) and numerous. I want to avoid this. I want the Parquet files to be large and few. I've tried setting different Hive configuration values at runtime in Pyspark to no avail:
Set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
Set mapred.map.tasks to a small number
Set num-executors to a small number
Use local[1] master instead of local[*]
Set mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to high values
None of these changes had any effect on the number or sizes of Parquet files. However, when I open Cloudera Hue and enter this simple statement:
INSERT INTO largeAccumulatorTable
PARTITION (batch_id = XXX)
SELECT * FROM stagingBatchIdXXX
It works exactly as I would hope, producing a small number of Parquet files that are all about 100 MB.
What am I doing wrong in Pyspark? How can I make it achieve the same result as in Hue? Thanks!
Spark's default number of shuffle partitions is 200. Based on your data size, try reducing or increasing the configuration value: sqlContext.sql("set spark.sql.shuffle.partitions=20");
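A hedged sketch of applying that suggestion inside the script from the question (batch_id is the question's placeholder; note that lowering spark.sql.shuffle.partitions only changes the output file count if the insert actually triggers a shuffle):
import pyspark

sc = pyspark.SparkContext()
hc = pyspark.HiveContext(sc)

# 20 is an arbitrary example value; tune it to the batch size.
hc.sql("set spark.sql.shuffle.partitions=20")

# batch_id is the same placeholder used in the question.
hc.sql(
    """
    INSERT INTO largeAccumulatorTable
    PARTITION (batch_id = {0})
    SELECT * FROM stagingBatchId{0}
    """
    .format(batch_id)
)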

Spark 2.1 table loaded from Hive Metastore has null values

I am trying to migrate table definitions from one Hive metastore to another.
The source cluster has:
Spark 1.6.0
Hive 1.1.0 (cdh)
HDFS
The destination cluster is an EMR cluster with:
Spark 2.1.1
Hive 2.1.1
S3
To migrate the tables I did the following:
Copy data from HDFS to S3
Run SHOW CREATE TABLE my_table; in the source cluster
Modify the returned create query - change LOCATION from the HDFS path to the S3 path
Run the modified query on the destination cluster's Hive
Run SELECT * FROM my_table;. This returns 0 rows (expected)
Run MSCK REPAIR TABLE my_table;. This passes as expected and registers the partitions in the metastore.
Run SELECT * FROM my_table LIMIT 10; - 10 lines are returned with correct values
On the destination cluster, from Spark that is configured to work with the Hive Metastore, run the following code: spark.sql("SELECT * FROM my_table limit 10").show() - This returns null values!
The result returned from the Spark SQL query has all the correct columns, and the correct number of lines, but all the values are null.
To get Spark to correctly load the values, I can add the following properties to the TBLPROPERTIES part of the create query:
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='org.apache.spark.sql.parquet',
'spark.sql.sources.schema.numPartCols'='<partition-count>',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='<json-schema as seen by spark>',
'spark.sql.sources.schema.partCol.0'='<partition name 1>',
'spark.sql.sources.schema.partCol.1'='<partition name 2>',
...
The other side of this problem is that in the source cluster, Spark reads the table values without any problem and without the extra TBLPROPERTIES.
Why is this happening? How can it be fixed?
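One possible way to avoid hand-writing those spark.sql.sources.* properties is to let Spark itself create the table definition on the destination cluster. A rough, hypothetical sketch with placeholder table name, columns, and S3 path, assuming Spark 2.1's CREATE TABLE ... USING syntax and MSCK REPAIR TABLE support:
# Rough sketch with placeholder names, columns and path: creating the table as
# a Spark data source table makes Spark record the spark.sql.sources.*
# properties itself instead of requiring them to be added by hand.
spark.sql("""
    CREATE TABLE my_table (col1 STRING, col2 BIGINT, dt STRING)
    USING PARQUET
    OPTIONS (path 's3://my-bucket/path/my_table')
    PARTITIONED BY (dt)
""")
# Register the partitions that already exist under the S3 path.
spark.sql("MSCK REPAIR TABLE my_table")
spark.sql("SELECT * FROM my_table LIMIT 10").show()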

Creating hive table on spark output on HDFS

I have a Spark job which runs every 30 minutes and writes its output to HDFS (/tmp/data/1497567600000). This job runs continuously in the cluster.
How can I create a Hive table on top of this data? I have seen one solution on Stack Overflow which creates a Hive table on top of data partitioned by a date field, like this:
CREATE EXTERNAL TABLE `mydb.mytable`
(`col1` string,
`col2` decimal(38,0),
`create_date` timestamp,
`update_date` timestamp)
PARTITIONED BY (`my_date` string)
STORED AS ORC
LOCATION '/tmp/out/'
and the solution suggests altering the table as follows:
ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
But in my case, I have no idea how the output directories are being written, so I clearly can't create the partitions as suggested above.
How can I handle this case, where the output directories are written with arbitrary timestamps and are not in the format /tmp/data/timestamp=1497567600000?
How can I make Hive pick up the data under the directory /tmp/data?
I can suggest two solutions:
If you can change your Spark job, then you can partition your data by hour (e.g. /tmp/data/1, /tmp/data/2), add Hive partitions for each hour, and just write to the relevant partition.
You can write a bash script responsible for adding Hive partitions (a PySpark sketch of the same comparison is shown after this list), which can be achieved by:
listing the HDFS subdirectories using the command hadoop fs -ls /tmp/data
listing the Hive partitions for the table using the command: hive -e 'show partitions table;'
comparing the above lists to find the missing partitions
adding the new Hive partitions with the command shown above: ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
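The answer describes the comparison as a bash script; purely as a hedged illustration, the same compare-and-add logic can be sketched in PySpark instead (mydb.mytable, my_date and /tmp/data are the placeholder names used in this thread, and _jvm/_jsc are Spark's internal py4j accessors):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Access the Hadoop FileSystem through Spark's internal py4j gateway.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# Subdirectories of /tmp/data (each one is a timestamped output directory).
hdfs_dirs = {
    status.getPath().getName()
    for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path("/tmp/data"))
    if status.isDirectory()
}

# Partitions Hive already knows about, e.g. "my_date=1497567600000".
existing = {
    row.partition.split("=")[1]
    for row in spark.sql("SHOW PARTITIONS mydb.mytable").collect()
}

# Add the missing ones.
for ts in sorted(hdfs_dirs - existing):
    spark.sql(
        "ALTER TABLE mydb.mytable "
        "ADD PARTITION (my_date='{0}') LOCATION '/tmp/data/{0}'".format(ts)
    )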
