I am facing this issue with Hive.
When I query a table that is partitioned on the date column, e.g.
SELECT count(*) from table_name where date='2018-06-01'
the query reads the entire table data and keeps running for hours.
Using EXPLAIN I found that Hive is not applying the PartitionFilter to the query.
I have double-checked with desc table_name that the table is partitioned on the date column.
The execution engine is Spark, and the data is stored in Azure Data Lake in Parquet format.
However, I have another table in the database for which the PartitionFilter is applied, and it executes as expected.
Could there be some issue with the Hive metadata, or is it something else?
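For reference, the checks mentioned above looked roughly like this (same table and partition column as in the query):
EXPLAIN SELECT count(*) FROM table_name WHERE date='2018-06-01'; -- check whether the plan contains a partition filter / pruning step on the date column
DESC table_name; -- the date column should be listed under "# Partition Information"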
Found the cause of this issue:
Hive wasn't applying the partition filters on some tables because those tables were cached.
When I restarted the Thrift server, the cache was cleared and the partition filters were applied.
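If restarting the Thrift server is not convenient, clearing the cached table from the Spark SQL session may have the same effect (a sketch; the table name is a placeholder):
UNCACHE TABLE table_name; -- drop the cached copy so the next query is planned against the metastore again
CLEAR CACHE; -- or clear everything cached in the session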
We have a Hive managed table (it is both partitioned and bucketed, with 'transactional' = 'true').
We are using Spark (version 2.4) to interact with this Hive table.
We are able to successfully ingest data into this table using the following:
sparkSession.sql("insert into table values('')")
But we are not able to delete a row from this table. We are attempting to delete using the command below:
sparkSession.sql("delete from table where col1 = '' and col2 = ''")
We are getting an operationNotAccepted exception.
Do we need to do anything specific to be able to perform this action?
Thanks
Anuj
Unless it is a DELTA table, this is not possible.
Spark cannot DELETE from Hive ACID (bucketed, transactional ORC) tables directly. See https://github.com/qubole/spark-acid (a Hive ACID datasource for Spark).
HUDI on AWS could also be an option.
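For illustration only: if the data were kept as a Delta Lake table instead (hypothetical table and values), a row-level delete from Spark SQL would look like the line below. SQL DELETE for Delta needs Spark 3.x with Delta Lake; on Spark 2.4 the DeltaTable Scala/Python API would have to be used instead.
DELETE FROM my_delta_table WHERE col1 = 'some_value' AND col2 = 'other_value'; -- Delta rewrites the affected files transactionally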
Need some help as we are baffled. Using Impala SQL, we did an ALTER TABLE to add 3 columns to a Parquet table. The table is used by both Spark (v2) and Impala jobs.
After the columns were added, Impala correctly reports the new columns using describe; however, Spark does not report the freshly added columns when spark.sql("describe tablename") is executed.
We double checked Hive and it correctly reports the added columns.
We ran refresh table tablename in Spark, but it still doesn't see the new columns. We believe we must be overlooking something simple. What step did we miss?
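For reference, the refresh was along these lines (table name is a placeholder), and the describe afterwards still showed the old column list:
REFRESH TABLE tablename; -- Spark SQL: invalidates cached metadata and file listings for the table
DESCRIBE tablename; -- in Spark this still showed only the original columns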
Update: Impala sees the table with the new columns but Spark does not acknowledge them. Reading more about Spark, apparently the Spark engine reads the schema from the Parquet files rather than from the Hive metastore. The suggested workaround did not work, and the only recourse we could find was to drop the table and rebuild it.
I created a Hive table on top of a Parquet folder written via Spark. On one test server it runs fine and returns results (Hive version 2.6.5.196), but in production it returns no records (Hive 2.6.5.179). Could someone please point out what the exact issue could be?
If you created the table on top of an existing partition structure, you have to make the table aware that there are partitions at that location.
MSCK REPAIR TABLE table_name; -- adds missing partitions
SELECT * FROM table_name; -- should return records now
This problem shouldn't happen if there are only files in that location and they are in the expected format.
You can verify with:
SHOW CREATE TABLE table_name; -- to see the expected format
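As a rough sketch (columns, types and location are made up), the DDL returned by SHOW CREATE TABLE for a partitioned Parquet table contains a PARTITIONED BY clause, which is what makes MSCK REPAIR necessary after files are written outside of Hive:
CREATE EXTERNAL TABLE table_name (id BIGINT, name STRING)
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION 'hdfs:///path/to/parquet_folder';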
Regarding "created hive table on top of a parquet folder written via spark":
Check whether the database you are using is available:
show databases;
Check the DDL of the table you created on your test server against the one on production:
show create table table_name;
Make sure that both DDLs match exactly.
Load the incremental data, or the data from all the partitions:
msck repair table table_name;
Then view the records:
select * from table_name;
I have a Hive table (say table1) in Avro file format with 1900 columns. When I query the table in Hive I am able to fetch data, but when I query the same table in Spark SQL I get "metastore client lost connection. Attempting to reconnect".
I have also queried another Hive table (say table2) in Avro file format with 130 columns; it fetches data in both Hive and Spark.
What I observed is that I can see data in the HDFS location of table2, but I can't see any data in table1's HDFS location (yet it fetches data when I query it in Hive).
Splits tell you the number of mappers in an MR job.
They don't show you the exact location from which the data was read.
The steps below will help you check where the data for table1 is stored in HDFS.
For table1: you can check the location of the data in HDFS by running a SELECT query with WHERE conditions in Hive, with MapReduce as the execution engine. Once the job is complete, check the map task's log in the YARN application (look for the text "Processing file") to find where the input data files were read from.
Also, check the location recorded in the Hive metastore by running "SHOW CREATE TABLE <table_name>;" for both tables in Hive. In the result, look at the "LOCATION" details.
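A minimal sketch of that metastore check, using the table names from the question:
SHOW CREATE TABLE table1; -- compare the LOCATION clause of the two tables
SHOW CREATE TABLE table2;
DESCRIBE FORMATTED table1; -- also shows the Location field, plus serde/format details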
I am working on an AWS cluster with Hive and Spark. I faced a weird situation the previous day while running an ETL PySpark script over an external table in Hive.
We have a control table with an extract date column. We filter data from a staging table (a managed table in Hive, but with its location in an S3 bucket) based on the extract date and load it into a target table, which is an external table with its data located in an S3 bucket. We load the table as below:
spark.sql("INSERT OVERWRITE TABLE target_table select * from DF_made_from_stage_table")
Now when I checked the count(*) of the target table via Spark as well as via the Hive CLI directly, the two gave different counts.
In Spark:
spark.sql("select count(1) from target") -- gives 50K records
In Hive:
select count(1) from target -- gives a count of 50K - 100
Note: there was an intermittent issue with statistics on the external table, which was giving -1 as the count in Hive. We resolved this by running:
ANALYZE TABLE target COMPUTE STATISTICS
But even after doing all this, we are still getting original_count - 100 in Hive, while Spark gives the correct count.
There was a mistake in the DDL used for the external table. "skip.header.line.count"="1" was present in the DDL, and we have 100 output files, so 1 line per file was skipped, which caused original_count - 100 in Hive while Spark calculated it correctly. After removing "skip.header.line.count"="1", the count is as expected.
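For anyone hitting the same thing: the property can be dropped from the existing table without rebuilding it (a sketch, using the target table from above; UNSET TBLPROPERTIES is standard Hive DDL), after which both engines should agree:
ALTER TABLE target UNSET TBLPROPERTIES ('skip.header.line.count');
SELECT count(1) FROM target; -- re-check the count in both Hive and Spark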