Issue with Spark dataframe loading timestamp data to hive table - apache-spark

I am trying to load a dataframe into a Hive table, but an additional 30 minutes is being added to the timestamp values.
I have tried the following:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

df_load.write.mode("append").saveAsTable("default.DATA_LOAD")
The df_load dataframe has a column "currenthour" with the value "2020-09-01 09:00:00", but in the table it is loaded as "2020-09-01 09:30:00".
How can I resolve this issue?

It's a common issue with the Timestamp datatype because of the timezone.
Refer to this:
Spark SQL to Hive table - Datetime Field Hours Bug
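As a hedged illustration of the timezone angle (not taken from the linked answer): one setting that often matters here is spark.sql.session.timeZone, which controls how Spark interprets and renders timestamps. A minimal sketch, assuming the 30-minute shift comes from a session/cluster timezone mismatch; the zone string "UTC" is a placeholder:

from pyspark.sql import SparkSession

# Sketch only: pin the session timezone before writing so timestamps are not
# shifted by a mismatch between the cluster's JVM timezone and the session zone.
# "UTC" is a placeholder -- use the zone the "currenthour" values were produced in.
spark = (
    SparkSession.builder
    .appName("timestamp-load")
    .enableHiveSupport()
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

# df_load is the dataframe from the question.
df_load.write.mode("append").saveAsTable("default.DATA_LOAD")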

Related

pyspark hive.table not reading all rows of hive table

I am using Hive LLAP (https://github.com/hortonworks-spark/spark-llap) in pyspark to read a Hive internal table like this:
df = hive.table(<tableName>)
The issue is that my table has 18 million records, but when I do
df.count()
I only get 7.5 million as the count, which is wrong.
You might have to refresh the Spark metastore cache; Spark keeps its own catalog cache that does not always reflect the Hive metastore, so the cached stats might just be stale.
You can refresh it from pyspark like this:
spark.sql("REFRESH TABLE <TABLE_NAME>")

TIMESTAMP column issue CDH5 vs CDH6 in parquet table

We recently upgraded our server from CDH 5 to CDH 6. When inserting data into TIMESTAMP columns of Parquet tables using Spark, there is a difference in how the data is inserted.
CDH 5:
HIVE:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Hive, the value is '2019-01-30 00:00:00 0'.
CDH 6:
HIVE:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Hive, the value is '2019-01-30 04:00:00'.
IMPALA:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Impala, the value is '2019-01-30 04:00:00'.
Please let me know if there are any Spark properties we can use. My primary goal is to match the Hive value between CDH 5 and CDH 6, and if possible, the value selected from Impala should be '2019-01-30 00:00:00'.
To avoid data type issues between Spark and Hive, the convention Spark uses to write Parquet data is configurable.
This is determined by the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If set to true, Spark uses the same convention as Hive for writing the Parquet data.
val spark = SparkSession
  .builder()
  .appName("MyApp")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "200") // Change to a more reasonable default number of partitions for our data
  .config("spark.sql.parquet.writeLegacyFormat", true)
  .getOrCreate()

Spark: Record count mismatch

I am quite confused because I am facing a weird situation.
My Spark application reads data from an Oracle database and loads it into a dataframe using this instruction:
private val df = spark.read.jdbc(
  url = [the jdbc url],
  table = "(" + [the query] + ") qry",
  properties = [the oracle driver]
)
Then, I save in a variable the number of records in this dataframe:
val records = df.count()
Then I create a Hive table ([my table]) with the dataframe schema, and I dump the content of the dataframe into it:
df.write
  .mode(SaveMode.Append)
  .insertInto([my hive db].[my table])
Well, here is where I am lost: when I perform a select count(*) on the Hive table where the dataframe is loaded, "sometimes" there are a few more records in Hive than in the records variable.
Can you think of what could be the source of this mismatch?
*Related to the possible duplicate, my question is different. I am not counting my dataframe many times with different values. I count the records in my dataframe once, I dump the dataframe into Hive, and when I count the records in the Hive table, sometimes there are more in Hive than in my count.*
Thank you very much in advance.

What's the number of partitions when Spark SQL reads a Hive table?

After reading up on this answer, I know that the number of partitions when reading data from Hive is decided by the HDFS block size.
But I ran into a problem: I use Spark SQL to read a Hive table and save the data to a new Hive table, yet the two Hive tables have different partition numbers when loaded by Spark SQL.
val data = spark.sql("select * from src_table")
val partitionsNum = data.rdd.getNumPartitions
println(partitionsNum)
val newData = data
newData.write.mode("overwrite").format("parquet").saveAsTable("new_table")
I don't understand why the same data ends up with different partition numbers.

Writing Hive table from Spark specifying CSV as the format

I'm having an issue writing a Hive table from Spark. The following code works just fine; I can write the table (which defaults to the Parquet format) and read it back in Hive:
df.write.mode('overwrite').saveAsTable("db.table")
hive> describe table;
OK
val string
Time taken: 0.021 seconds, Fetched: 1 row(s)
However, if I specify that the format should be CSV:
df.write.mode('overwrite').format('csv').saveAsTable("db.table")
then I can save the table, but Hive doesn't recognize the schema:
hive> describe table;
OK
col array<string> from deserializer
Time taken: 0.02 seconds, Fetched: 1 row(s)
It's also worth noting that I can create a Hive table manually and then insertInto it:
spark.sql("create table db.table(val string)")
df.select('val').write.mode("overwrite").insertInto("db.table")
Doing so, Hive seems to recognize the schema. But that's clunky, and I can't figure out a way to automate the schema string anyway.
That is because the Hive SerDe does not support CSV by default.
If you insist on using the CSV format, create the table as below:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Then insert the data through df.write.insertInto.
For more info:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
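Putting the two steps together in PySpark, as a sketch rather than a drop-in solution (the table and column names are placeholders, and only separatorChar is set to keep the quoting simple):

# Create the Hive table with the OpenCSVSerde, then load the dataframe into it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.table_csv (val STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ("separatorChar" = ",")
    STORED AS TEXTFILE
""")

# df is the dataframe from the question; insertInto matches columns by position.
df.select("val").write.mode("overwrite").insertInto("db.table_csv")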
You are creating a table in text format and trying to insert CSV data into it, which may run into problems. So, as suggested in the answer by Zhang Tong, create the Hive table using the Hive OpenCSVSerde.
After that, if you are more comfortable with the Hive query language than with dataframes, you can try this:
df.registerTempTable("temp")
spark.sql("insert overwrite db.table select * from temp")
This happens because the Hive SerDe for CSV is different from what Spark uses. Hive by default uses the TEXTFILE format, and the delimiter has to be specified while creating the table.
One option is to use the insertInto API instead of saveAsTable when writing from Spark. With insertInto, Spark writes the contents of the dataframe to the specified table, but it requires the schema of the dataframe to be the same as the schema of the table. The position of the columns is important here, as insertInto ignores column names.
Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
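A small PySpark illustration of that positional behaviour (sketch only; t1 is assumed to already exist with two columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame([(5, 6)], ["a", "b"])

# insertInto matches by position, not by name:
df.write.insertInto("t1")                    # a -> t1's first column, b -> second
df.select("b", "a").write.insertInto("t1")   # b -> t1's first column, a -> second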
