Spark SQL query returns output although hive table does not contain enough records on queried column - apache-spark

I got output from a Spark SQL query even though the actual Hive table doesn't contain records with those values in the queried column. The Hive table is partitioned by the integer column date_nbr, which contains values like 20181125 and 20181005. For some reason I had to truncate the table (note: I did not delete the partition directories in HDFS) and reload it for the week date_nbr=20181202.
After the data load I ran the query below in Hive and got the expected result:
SELECT DISTINCT date_nbr FROM transdb.temp
date_nbr
20181202
but Spark SQL doesn't give the same output as Hive:
scala> spark.sql("SELECT DISTINCT date_nbr FROM transdb.temp").map(_.getAs[Int](0)).collect.toList
res9: List[Int] = List(20181125, 20181005, 20181202)
I'm a bit confused by the Spark SQL result.
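For reference, a minimal sketch of how one could cross-check what Spark and the metastore report for this table after the reload, assuming a spark-shell session where spark is already defined; the table name and partition values are the ones from the question:
// Clear Spark's cached metadata and file listing for the table, then list
// the partitions still registered in the metastore
spark.catalog.refreshTable("transdb.temp")
spark.sql("SHOW PARTITIONS transdb.temp").show(false)
// If the old partitions still appear because their directories were never removed,
// they can be dropped explicitly
spark.sql("ALTER TABLE transdb.temp DROP IF EXISTS PARTITION (date_nbr=20181125)")
spark.sql("ALTER TABLE transdb.temp DROP IF EXISTS PARTITION (date_nbr=20181005)")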

Related

SparkSQL cannot find existing rows in a Hive table that Trino (Presto) can

We are using Trino (Presto) and SparkSQL to query Hive tables on S3, but they give different results for the same query on the same tables. We found the main problem: there are rows in a problematic Hive table that can be found with a simple WHERE filter on a specific column in Trino, but cannot be found with SparkSQL. The SQL statements are the same in both.
On the other hand, SparkSQL can find these rows in the source table of that problematic table when filtering on the same column.
The CREATE statement:
CREATE TABLE problematic_hive_table AS SELECT c1,c2,c3 FROM source_table
The SELECT that finds the missing rows in Trino but not in SparkSQL:
SELECT * FROM problematic_hive_table WHERE c1='missing_rows_value_in_column'
And this is the SELECT that finds these missing rows in SparkSQL:
SELECT * FROM source_table WHERE c1='missing_rows_value_in_column'
We execute the CTAS in Trino (Presto). If we use ...WHERE trim(c1) = 'missing_key' then Spark can also find the missing rows, but the fields do not contain trailing spaces (the length of these fields is the same in the source table as in the problematic table). In the source table Spark can find these missing rows without trim.
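For reference, a small diagnostic one could run from Spark to check whether hidden padding or other invisible characters explain the difference; this is just a sketch using the placeholder table and value names from the question:
// Compare raw and trimmed lengths of the key column for the rows that only trim() finds
spark.sql("""
  SELECT c1, length(c1) AS len, length(trim(c1)) AS trimmed_len
  FROM problematic_hive_table
  WHERE trim(c1) = 'missing_rows_value_in_column'
""").show(false)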

Apache Hive: CREATE TABLE statement without schema over parquet can fail to infer partition column

I have a partitioned parquet at the following path:
/path/to/partitioned/parq/
with partitions like:
/path/to/partitioned/parq/part_date=2021_01_01_01_01_01
/path/to/partitioned/parq/part_date=2021_01_02_01_01_01
/path/to/partitioned/parq/part_date=2021_01_03_01_01_01
When I run a Spark SQL CREATE TABLE statement like:
CREATE TABLE IF NOT EXISTS
my_db.my_table
USING PARQUET
LOCATION '/path/to/partitioned/parq'
The partition column part_date shows up in my dataset, but DESCRIBE EXTENDED indicates there are no partitions, and SHOW PARTITIONS my_db.my_table shows no partition data.
This seems to happen intermittently: sometimes Spark infers the partitions, other times it doesn't. It causes issues downstream when we add a partition and try to MSCK REPAIR TABLE my_db.my_table, which then complains that you can't run that on non-partitioned tables.
I see that if you DO declare a schema, you can force the PARTITIONED BY part of the clause, but we don't have the luxury of a schema, just the files underneath.
Why is Spark intermittently unable to determine the partition column from Parquet files laid out like this?
Unfortunately, with Hive you need to specify the schema, even though Parquet obviously carries it itself.
You need to add a PARTITIONED BY clause to the DDL.
Then use ALTER TABLE statements to add each partition separately, with its location (see the sketch below).
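A sketch of what that could look like for the layout above, run through Spark SQL; the non-partition columns c1 and c2 and their types are assumptions, since the real schema isn't shown in the question, while the paths and table name are the ones given:
// Declare the schema (including the partition column) and mark the table as partitioned
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_db.my_table (c1 STRING, c2 BIGINT, part_date STRING)
  USING PARQUET
  PARTITIONED BY (part_date)
  LOCATION '/path/to/partitioned/parq'
""")
// Register each partition directory explicitly ...
spark.sql("""
  ALTER TABLE my_db.my_table ADD IF NOT EXISTS
  PARTITION (part_date='2021_01_01_01_01_01')
  LOCATION '/path/to/partitioned/parq/part_date=2021_01_01_01_01_01'
""")
// ... or, once the table is declared as partitioned, discover all partition
// directories in one pass
spark.sql("MSCK REPAIR TABLE my_db.my_table")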

pyspark hive.table not reading all rows of hive table

I am using Hive LLAP (https://github.com/hortonworks-spark/spark-llap) in PySpark to read a Hive internal table like this:
df = hive.table(<tableName>)
But the issue is that my table has 18 million records, yet when I do
df.count()
I only get 7.5 million as the count, which is wrong.
You might have to refresh the table metadata cached by Spark; Spark keeps its own catalog cache on top of the Hive metastore, and the cached statistics and file listing can simply be stale.
You can refresh the table from PySpark like this:
spark.sql("REFRESH TABLE <TABLE_NAME>")

Spark: Record count mismatch

I am quite confused because I am facing a weird situation.
My Spark application reads data from an Oracle database and loads it into a DataFrame with this instruction:
private val df = spark.read.jdbc(
  url = [the jdbc url],
  table = "(" + [the query] + ") qry",
  properties = [the oracle driver]
)
Then, I save in a variable the number of records in this dataframe:
val records = df.count()
Then I create a Hive table ([my table]) with the DataFrame schema, and I dump the content of the DataFrame into it:
df.write
.mode(SaveMode.Append)
.insertInto([my hive db].[my table])
Well, here is where I am lost: when I perform a SELECT COUNT(*) on the Hive table the DataFrame is loaded into, "sometimes" there are a few more records in Hive than in the records variable.
Can you think of what could be the source of this mismatch?
*Regarding the possible duplicate, my question is different. I am not counting my DataFrame many times and getting different values; I count the records in my DataFrame once, dump the DataFrame into Hive, count the records in the Hive table, and sometimes there are more in Hive than in my count.*
Thank you very much in advance.
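For context, one thing worth noting is that a DataFrame built from spark.read.jdbc is re-evaluated for every action, so count() and the later write can each run the query against the live Oracle database at different moments. Below is a minimal sketch of pinning the data once so both operations see the same snapshot; the bracketed placeholders are the ones from the question, and the Properties object stands in for the elided driver settings:
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel

// Materialize the JDBC result once so that count() and the subsequent write
// use the same data rather than re-running the Oracle query for each action
val df = spark.read
  .jdbc(url = "[the jdbc url]", table = "([the query]) qry",
        properties = /* [the oracle driver] */ new Properties())
  .persist(StorageLevel.MEMORY_AND_DISK)

val records = df.count()                                              // first action: fills the cache
df.write.mode(SaveMode.Append).insertInto("[my hive db].[my table]")  // second action: reuses it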

What's the number of partitions when Spark SQL reads a Hive table?

After reading up on this answer, I know that the number of partitions when reading data from Hive is decided by the HDFS block size.
But I've hit a problem: I use Spark SQL to read a Hive table and save the data into a new Hive table, but the two Hive tables have different partition numbers when loaded by Spark SQL.
val data = spark.sql("select * from src_table")
val partitionsNum = data.rdd.getNumPartitions
println(partitionsNum)
val newData = data
newData.write.mode("overwrite").format("parquet").saveAsTable("new_table")
I don't understand why the same data gives different partition numbers.
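For reference, two knobs usually explain a difference like this: on the read side, Parquet-backed tables are split according to spark.sql.files.maxPartitionBytes together with the actual file sizes, while on the write side the number of files (and hence the partition count seen when the new table is read back) follows the number of writing tasks. A minimal sketch, using the table names from the question and a purely illustrative repartition value:
// Read-side split size that, together with file sizes, drives the partition count
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))

val data = spark.sql("select * from src_table")
println(data.rdd.getNumPartitions)

// The write produces one file per task; repartition makes that explicit (8 is just an example)
data.repartition(8).write.mode("overwrite").format("parquet").saveAsTable("new_table")

// Reading the new table back can give a different partition count, driven by its own file sizes
println(spark.table("new_table").rdd.getNumPartitions)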
