SparkSQL - some partitions appear in HiveServer2 but not SparkSQL - apache-spark

A Hive external table points to files on S3, and its DDL includes a PARTITIONED BY (eod) clause. Under the table folder there are 5 subfolders, each with a file underneath for a different partition date, i.e.
eod=20180602/fileA
eod=20180603/fileA
eod=20180604/fileA
eod=20180605/fileA
eod=20180606/fileA
MSCK REPAIR TABLE is run on HiveServer2.
select distinct part_dt from tbl on HiveServer2 (port 10000) returns all 5 dates.
However, select distinct part_dt from tbl on SparkThriftServer (i.e. SparkSQL, port 10015) returns only the first 2 dates.
How is this possible?
Even when running MSCK REPAIR on SparkThriftServer, the discrepancy still exists.
The file schema is the same on all dates (i.e. each file has the same number/type of columns).

Resolved: those 8 affected tables had previously been cached in SparkSQL (i.e. CACHE TABLE <table>). Once I ran UNCACHE TABLE <table>, all partitions lined up again!
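For reference, a minimal sketch of clearing that cached copy on the Spark Thrift Server side; the table name my_db.tbl is a placeholder:

-- drop the in-memory cached copy so Spark goes back to the metastore and files
UNCACHE TABLE my_db.tbl;
-- alternatively, refresh Spark's cached metadata and file listing for the table
REFRESH TABLE my_db.tbl;
-- the partition list should now match HiveServer2
SHOW PARTITIONS my_db.tbl;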

Related

Apache Hive: CREATE TABLE statement without schema over parquet can fail to infer partition column

I have a partitioned parquet at the following path:
/path/to/partitioned/parq/
with partitions like:
/path/to/partitioned/parq/part_date=2021_01_01_01_01_01
/path/to/partitioned/parq/part_date=2021_01_02_01_01_01
/path/to/partitioned/parq/part_date=2021_01_03_01_01_01
When I run a Spark SQL CREATE TABLE statement like:
CREATE TABLE IF NOT EXISTS
my_db.my_table
USING PARQUET
LOCATION '/path/to/partitioned/parq'
The partition column part_date shows up in my dataset, but DESCRIBE EXTENDED indicates there are no PARTITIONS. SHOW PARTITIONS my_db.my_table shows no partition data.
This seems to happen intermittently: sometimes Spark infers the partitions, other times it doesn't. This causes issues downstream when we add a partition and try to MSCK REPAIR TABLE my_db.my_table, and it says you can't run that on non-partitioned tables.
I see that if you DO declare a schema, you can force the PARTITIONED BY part of the clause, but we do not have the luxury of a schema, just the files underneath.
Why is Spark intermittently unable to determine partition columns from a Parquet dataset in this shape?
Unfortunately, with Hive you need to specify the schema, even though Parquet obviously carries it itself. You need to add a PARTITIONED BY clause to the DDL, then use ALTER TABLE statements to add each partition separately with its location.
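A rough sketch of that approach under the question's layout; the data column col_a and its type are assumptions, since the real schema isn't shown:

CREATE TABLE IF NOT EXISTS my_db.my_table (
  col_a STRING,
  part_date STRING
)
USING PARQUET
PARTITIONED BY (part_date)
LOCATION '/path/to/partitioned/parq';

-- register each partition explicitly, pointing at its folder
ALTER TABLE my_db.my_table ADD IF NOT EXISTS
  PARTITION (part_date = '2021_01_01_01_01_01')
  LOCATION '/path/to/partitioned/parq/part_date=2021_01_01_01_01_01';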

Apache Hive Add TIMESTAMP partition using alter table statement

I'm currently running MSCK REPAIR TABLE SCHEMA.TABLENAME for all my tables after data is loaded.
As the partitions grow, this statement is taking much longer (sometimes more than 5 minutes) for one table. I know it scans and parses through all partitions in S3 (where my data is) and then adds the latest partitions into the Hive metastore.
I want to replace MSCK REPAIR with ALTER TABLE ADD PARTITION statements. MSCK REPAIR works perfectly fine for adding the latest partitions; however, I'm facing a problem with the TIMESTAMP value in the partition when using ALTER TABLE ADD PARTITION.
I have a table with four partitions (part_dt STRING, part_src STRING, part_src_file STRING, part_ldts TIMESTAMP).
After running MSCK REPAIR, the SHOW PARTITIONS command gives me the output below:
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39
But when I drop the above partition from the metastore and recreate it using ALTER TABLE ADD PARTITION:
hive> alter table hub_cont add partition(part_dt='20181016',part_src='asfs',part_src_file='kjui',part_ldts='2019-05-02 06:30:39');
OK
Time taken: 1.595 seconds
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39.0
Time taken: 0.128 seconds, Fetched: 1 row(s)
It is adding .0 at the end of the timestamp value. When I query the table for this partition, it gives me 0 records.
Is there a way to add a partition with a timestamp value without this zero getting added at the end? I'm unable to figure out what MSCK REPAIR is doing in this case that the ALTER TABLE statement is not.
The same would happen if you inserted dynamic partitions: it creates new partitions with .0 because the default timestamp string representation format includes a milliseconds part. MSCK REPAIR TABLE finds the new folders and adds the partitions to the metastore, and it also works correctly because a timestamp string without milliseconds is quite compatible with the timestamp...
The solution is to use STRING instead of TIMESTAMP and remove the milliseconds explicitly.
But first of all, double-check that you really have millions of rows in a single partition and really need timestamp-grain partitioning rather than DATE, and that this partition column is really significant (for example, if it is functionally dependent on another partition column such as part_src_file, you can get rid of it entirely). Too many partitions will cause performance degradation.
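A sketch of that suggestion using the partition spec from the question, with part_ldts declared as STRING in the DDL so the value is stored exactly as written:

-- assumed DDL change: PARTITIONED BY (part_dt STRING, part_src STRING, part_src_file STRING, part_ldts STRING)
ALTER TABLE hub_cont ADD IF NOT EXISTS PARTITION (
  part_dt = '20181016',
  part_src = 'asfs',
  part_src_file = 'kjui',
  part_ldts = '2019-05-02 06:30:39'  -- kept verbatim, no .0 suffix appended
);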

Spark SQL query returns output although hive table does not contain enough records on queried column

I got extra output from a Spark SQL query even though the actual Hive table doesn't contain records for those values of the queried column. The Hive table is partitioned by an integer column date_nbr which contains values like 20181125 and 20181005. For some reason I had to truncate the table (note: I did not delete the partition directories in HDFS) and reload it for the week date_nbr=20181202.
After the data load I ran the query below on Hive and got the expected result:
SELECT DISTINCT date_nbr FROM transdb.temp
date_nbr
20181202
But Spark SQL doesn't give the same output as Hive:
scala> spark.sql("SELECT DISTINCT date_nbr FROM transdb.temp").map(_.getAs[Int](0)).collect.toList
res9: List[Int] = List(20181125, 20181005, 20181202)
I'm a bit confused by the Spark SQL result.
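For what it's worth, a sketch of checks that would narrow this down, comparing both engines and invalidating any cached copy on the Spark side (not a confirmed fix):

-- run on both HiveServer2 and Spark SQL and compare
SHOW PARTITIONS transdb.temp;
-- on the Spark side, invalidate cached metadata and file listings for the table, then re-query
REFRESH TABLE transdb.temp;
SELECT DISTINCT date_nbr FROM transdb.temp;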

Replace a hive partition from Spark

Is there a way I can replace an existing Hive partition from a Spark program? Only the latest partition should be replaced; the rest of the partitions remain the same.
Below is the idea I am trying to work on:
We get transactional data from our RDBMS systems into HDFS every minute. There is a Spark program (running every 5 or 10 minutes) which reads the data, performs the ETL and writes the output into a Hive table.
Since overwriting the entire Hive table would be huge,
we would like to overwrite only today's partition of the Hive table.
At end of day, the source and destination partitions would be changed to the next day.
Thanks in advance
As you know the Hive table location, append the current date to the location (since your table is partitioned on date) and overwrite that HDFS path:
df.write.format(source).mode("overwrite").save(path)
Once it is completed, run MSCK REPAIR TABLE on the Hive table.
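An alternative sketch in SQL that overwrites just one partition; the table names, columns, and date value here are placeholders:

-- replace only today's partition, leaving the rest untouched
INSERT OVERWRITE TABLE my_db.target_tbl PARTITION (load_date = '20181203')
SELECT col_a, col_b
FROM my_db.staging_tbl
WHERE load_date = '20181203';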

Spark-Hive partitioning

The Hive table was created with 4 buckets:
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen in the Hive table, it has 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4, as that leads to a very, very slow system. I have also tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not support writing to bucketed Hive tables natively using Spark SQL. When creating the bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema; I don't see that in the CREATE TABLE statement shown. Assuming it does exist and you know the clustering column, you could add the
.bucketBy([colName])
call when using the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
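For reference, a bucketed Hive DDL of the shape the answer describes would carry a CLUSTERED BY clause, for example (the choice of cells as the clustering column is only a guess):

CREATE TABLE IF NOT EXISTS hourlysuspect (
  cells INT,
  sms_in INT
)
PARTITIONED BY (traffic_date_hour STRING)
CLUSTERED BY (cells) INTO 4 BUCKETS
STORED AS ORC;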
