What is the number of partitions when Spark SQL reads a Hive table? - apache-spark

After reading this answer, I understand that the number of partitions when reading data from Hive is determined by the HDFS block size.
But I have run into a problem: I use Spark SQL to read a Hive table and save the data to a new Hive table, yet the two Hive tables have different partition counts when loaded by Spark SQL.
val data = spark.sql("select * from src_table")
val partitionsNum = data.rdd.getNumPartitions
println(partitionsNum)
val newData = data
newData.write.mode("overwrite").format("parquet").saveAsTable("new_table")
I don't understand why the same data ends up with different partition numbers.
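One thing that may explain the difference: as far as I know, a table written with format("parquet") and saveAsTable is read back through Spark's file-based data source, which sizes read partitions from the file sizes and a few settings rather than from the HDFS block size alone. The sketch below reproduces that arithmetic in plain Python. It follows my reading of Spark's file-partitioning logic and may differ between versions. The 128 MB and 4 MB values are the documented defaults of spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes; the function names, the `parallelism` argument, and the file sizes are hypothetical and chosen only for illustration.

```python
# Sketch of how Spark sizes read partitions for file-based tables.
# Assumption: files are splittable and `parallelism` stands in for the
# minimum partition number (the default parallelism of the cluster).

MB = 1024 * 1024


def max_split_bytes(file_sizes, parallelism,
                    max_partition_bytes=128 * MB,  # spark.sql.files.maxPartitionBytes
                    open_cost=4 * MB):             # spark.sql.files.openCostInBytes
    # Every file is padded with the estimated cost of opening it.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def count_read_partitions(file_sizes, parallelism,
                          max_partition_bytes=128 * MB, open_cost=4 * MB):
    split = max_split_bytes(file_sizes, parallelism,
                            max_partition_bytes, open_cost)

    # Cut each file into chunks of at most `split` bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(split, size - offset))
            offset += split

    # Pack chunks largest first; close the current partition when the
    # next chunk would push it past the split size.
    chunks.sort(reverse=True)
    partitions, current, has_files = 0, 0, False
    for chunk in chunks:
        if has_files and current + chunk > split:
            partitions += 1
            current, has_files = 0, False
        current += chunk + open_cost
        has_files = True
    if has_files:
        partitions += 1
    return partitions


# Hypothetical layouts: one large source file vs. eight small output files.
source_partitions = count_read_partitions([1000 * MB], parallelism=4)
new_partitions = count_read_partitions([30 * MB] * 8, parallelism=4)
print(source_partitions, new_partitions)
```

With these made-up numbers the single 1000 MB file is cut into 8 read partitions, while the eight 30 MB files are packed two per partition into 4. So the same rows can give different partition counts simply because the new table is stored as a different number of files with different sizes. A Hive table read through the Hive SerDe path is split by the Hadoop input format instead, which is another reason the two counts need not match.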

Related

Difference between Spark dataframe writer partitionBy vs delta create table partition by

What is the exact difference between Spark dataframe writer partitionBy and delta/hive create table partition by? Which one will be faster and why?

pyspark hive.table not reading all rows of hive table

I am using Hive LLAP (https://github.com/hortonworks-spark/spark-llap) in PySpark to read a Hive internal table like this:
df = hive.table(<tableName>)
The issue is that my table has 18 million records, but when I run
df.count()
I get only 7.5 million as the count, which is wrong.
You might need to refresh the metadata that Spark caches for the table; it can go stale when the table is changed outside Spark.
You can refresh it from PySpark like this:
spark.sql("REFRESH TABLE <TABLE_NAME>")

Spark SQL query returns output although hive table does not contain enough records on queried column

I got output from a Spark SQL query even though the actual Hive table doesn't contain enough records in the queried column. The Hive table is partitioned by the integer column date_nbr, which contains values like 20181125 and 20181005. For some reason I had to truncate the table (note: I did not delete the partition directories in HDFS) and reload it for the week date_nbr=20181202.
After the data load I ran the query below in Hive and got the expected result:
SELECT DISTINCT date_nbr FROM transdb.temp
date_nbr
20181202
But Spark SQL doesn't give the same output as Hive:
scala> spark.sql("SELECT DISTINCT date_nbr FROM transdb.temp").map(_.getAs[Int](0)).collect.toList
res9: List[Int] = List(20181125, 20181005, 20181202)
I'm a bit confused by the Spark SQL result.

Spark: Record count mismatch

I am quite confused because I am facing a weird situation.
My Spark application reads data from an Oracle database and loads it into a dataframe using this instruction:
private val df = spark.read.jdbc(
url = [the jdbc url],
table="(" + [the query] + ") qry",
properties= [the oracle driver]
)
Then I save the number of records in this dataframe in a variable:
val records = df.count()
Then I create a Hive table ([my table]) with the dataframe schema, and I dump the content of the dataframe into it:
df.write
.mode(SaveMode.Append)
.insertInto([my hive db].[my table])
Well, here is where I am lost: when I run a select count(*) on the Hive table the dataframe is loaded into, "sometimes" there are a few more records in Hive than in the records variable.
Can you think of what could be the source of this mismatch?
*Regarding the possible duplicate: my question is different. I am not counting my dataframe many times and getting different values. I count the records in my dataframe once, dump the dataframe into Hive, and count the records in the Hive table, and sometimes there are more in Hive than in my count.*
Thank you very much in advance.

How to store Spark data frame as a dynamic partitioned Hive table in Parquet format?

The current raw data is in Hive. I want to join several partitioned, terabyte-scale Hive tables and then output the result as a partitioned Hive table in Parquet format.
I am considering loading all partitions of the Hive tables as Spark dataframes and then doing the join, group by, etc. Is this the right way to do it?
Finally, I will need to save the data. Can we save a Spark dataframe as a dynamically partitioned Hive table in Parquet format? How should the metadata be handled?
If one of the data sets is sufficiently smaller than the others, you may want to consider broadcasting it to make the data transfer more efficient.
Depending on the nature of the data, you could try grouping first and then joining, so that each machine only needs to process a specific subset of the data, which reduces the amount of data transferred while the tasks run.
Hive supports storing data in Parquet format directly: https://cwiki.apache.org/confluence/display/Hive/Parquet. Have you given that a try?
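To illustrate the broadcast suggestion above, here is a minimal plain-Python sketch of the idea behind a broadcast hash join: the small side is turned into a lookup table that every worker receives, and each partition of the large side is joined locally without moving its rows. This is a conceptual illustration, not Spark's API; the function name, the row layout, and the sample data are all hypothetical.

```python
# Conceptual sketch of a broadcast (map-side) hash join.
# Assumption: rows are dicts and the large side is a list of partitions.

def broadcast_hash_join(small_rows, large_partitions, key):
    # Build the lookup table once from the small side; this is the copy
    # that would be shipped ("broadcast") to every worker.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    # Each partition of the large side is joined where it already lives,
    # so the large rows are never shuffled across the network.
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical data: a small dimension table and a partitioned fact table.
countries = [{"country_id": 1, "name": "FR"}, {"country_id": 2, "name": "DE"}]
orders = [
    [{"order_id": 10, "country_id": 1}, {"order_id": 11, "country_id": 3}],
    [{"order_id": 12, "country_id": 2}],
]
result = broadcast_hash_join(countries, orders, "country_id")
print(result)
```

The order with country_id 3 has no match and is dropped, as in an inner join. In Spark itself you would not write this by hand; wrapping the small dataframe in pyspark.sql.functions.broadcast before the join asks the planner for this strategy.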
