Parquet column issue while loading data to SQL Server using spark-submit - apache-spark

I am facing the following issue while migrating data from Hive to SQL Server using a Spark job, with the query supplied through a JSON file.
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file.
Column: [abc], Expected: string, Found: INT32
From what I understand, the Parquet file has a different column structure than the Hive view. I am able to retrieve the data using tools like Teradata, etc., but loading it to a different server causes the issue.
Can anyone help me understand the problem and give a workaround for the same?
Edit:
spark version 2.4.4.2
Scala version 2.11.12
Hive 2.3.6
SQL Server 2016

Related

Table created with "stored as Parquet" option using PySpark SQL or Hive does not actually store data files in Parquet format

I create a table on a Hadoop cluster using PySpark SQL: spark.sql("CREATE TABLE my_table (...) PARTITIONED BY (...) STORED AS Parquet") and load some data with: spark.sql("INSERT INTO my_table SELECT * FROM my_other_table"). However, the resulting files do not seem to be Parquet files: they're missing the ".snappy.parquet" extension.
The same problem occurs when repeating those steps in Hive.
But surprisingly, when I create the table using a PySpark DataFrame: df.write.partitionBy("my_column").saveAsTable(name="my_table", format="Parquet")
everything works just fine.
So, my question is: what's wrong with the SQL way of creating and populating Parquet table?
Spark version 2.4.5, Hive version 3.1.2.
Update (27 Dec 2022, after @mazaneicha's answer)
Unfortunately, there is no parquet-tools on the cluster I'm working with, so the best I could do is to check the content of the files with hdfs dfs -tail (and -head). And in all cases there is "PAR1" both at the beginning and at the end of the file. And even more - the meta-data of parquet version (implementation):
Method                 | # of files | Total size | Parquet version            | File name
Hive Insert            | 8          | 34.7 G     | Jparquet-mr version 1.10.0 | xxxxxx_x
PySpark SQL Insert     | 8          | 10.4 G     | Iparquet-mr version 1.6.0  | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF insertInto  | 8          | 10.9 G     | Iparquet-mr version 1.6.0  | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF saveAsTable | 8          | 11.5 G     | Jparquet-mr version 1.10.1 | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet
(To create the same number of files I used "repartition" with df, and "distribute by" with SQL).
So, considering the above, it's still not clear:
Why is there no file extension in 3 out of 4 cases?
Why are the files created with Hive so big? (No compression, I suppose.)
Why do the PySpark SQL and PySpark DataFrame versions/implementations of Parquet differ, and how can I set them explicitly?
File format is not defined by the extension, but rather by the contents. You can quickly check if format is parquet by looking for magic bytes PAR1 at the very beginning and the very end of a file.
For in-depth format, metadata and consistency checking, try opening a file with parquet-tools.
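The magic-byte check described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the file is available on a local filesystem (for HDFS it would first have to be copied locally, for example with hdfs dfs -get); the function name looks_like_parquet is hypothetical. It only inspects the first and last four bytes and does not validate the footer metadata.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the 4-byte Parquet magic."""
    # A Parquet file holds at least the leading magic, a 4-byte footer
    # length and the trailing magic, so anything under 12 bytes cannot qualify.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last four bytes
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A file without the ".parquet" extension that passes this check is still a Parquet file as far as readers are concerned.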
Update:
As mentioned in online docs, parquet is supported by Spark as one of the many data sources via its common DataSource framework, so that it doesn't have to rely on Hive:
"When reading from Hive metastore Parquet tables and writing to non-partitioned Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance..."
You can find and review this implementation in the Spark git repo (it's open-source! :))

delta write with databricks job is not working as expected

I wrote the following code in Scala:
val input = spark.read.format("csv").option("header", true).load("input_path")
input.write.format("delta")
  .partitionBy("col1Name", "col2Name")
  .mode("overwrite")
  .save("output_path")
The input is read properly and contains col1Name and col2Name.
The problem is that the data is written as Parquet but the _delta_log folder stays empty (no items), so when I try to read the Delta data I get an error: `output_path` is not a Delta table
How can I change the code so that the data is written as Delta, with the _delta_log folder properly filled in?
I tried the following configurations on the Databricks cluster: (Apache Spark 2.4.5, Scala 2.11) and (Apache Spark 3.1.2, Scala 2.12), but both give the same result.
Any idea how to fix this please?
Thanks a lot
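To confirm the symptom described above, one can check whether the _delta_log folder contains any commit files. Delta names its commits as 20-digit, zero-padded version numbers with a .json extension. The sketch below assumes the output path is reachable as a local filesystem path (for example through a mount); the function name has_delta_commits is hypothetical, and the check only diagnoses the empty log; it does not explain why nothing was committed.

```python
import os
import re

# Delta commit files look like 00000000000000000000.json
COMMIT_FILE = re.compile(r"^\d{20}\.json$")


def has_delta_commits(table_path):
    """Return True if <table_path>/_delta_log holds at least one commit file."""
    log_dir = os.path.join(table_path, "_delta_log")
    if not os.path.isdir(log_dir):
        return False
    return any(COMMIT_FILE.match(name) for name in os.listdir(log_dir))
```

A directory with Parquet part files but no commit files, as in the question, returns False here, which matches the "is not a Delta table" error.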

Spark - Error with datetime to read parquet file

I'm on EMR, getting data from the Glue Catalog.
When I try to pass this data on and read it via Spark SQL, it throws the following error:
Caused by: org.apache.spark.SparkUpgradeException:
You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.
See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteTimestampRebaseFuncInRead$1(DataSourceUtils.scala:209)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anon$4.addLong(ParquetRowConverter.scala:330)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:268)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:367)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
... 21 more
I tried changing the following settings in Spark, without success:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead","CORRECTED")
and
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
I also did a select on the view created with the following code and it worked without problems.
So it makes me think the problem occurs when I use %sql.
Why does this happen? Am I doing something wrong?

SparkSQL attempts to read data from non-existing path

I am having an issue with the PySpark SQL module. I created a partitioned table and saved it as Parquet files into a Hive table by running a Spark job after multiple transformations.
The data load into Hive is successful and I am able to query the data there. But when I try to query the same data from Spark, it says the file path doesn't exist.
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition mentioned in the error above belongs to old partition-column data that no longer exists.
I have since run the Spark job, which populates a new partition value.
I searched for solutions, but all I can find is people saying there was no issue in Spark 1.4 and there is an issue in 1.6.
Can someone please suggest a solution to this problem?

Spark AVRO compatible with BigQuery

I'm trying to create an external table in Hive and another in BigQuery using the same data stored in Google Storage in Avro format, written with Spark.
I'm using a Dataproc cluster with Spark 2.2.0, Spark-avro 4.0.0 and Hive 2.1.1.
There are some differences between Avro versions/packages, but if I create the table using Hive and then write the files using Spark, I'm able to see them in Hive.
BigQuery is different, though: it is able to read Hive Avro files but NOT Spark Avro files.
Error:
The Apache Avro library failed to parse the header with the follwing error: Invalid namespace: .someField
After searching a little about the error, it seems the problem is that Spark Avro files are different from Hive/BigQuery Avro files.
I don't know exactly how to fix this. Maybe using a different Avro package in Spark would help, but I haven't found one that is compatible with all the systems.
I would also like to avoid workarounds such as creating a temporary table in Hive and then creating another using insert into ... select * from ... since I'll be writing a lot of data.
Any help would be appreciated. Thanks
The error message is thrown by the C++ Avro library, which BigQuery uses. Hive probably uses the Java Avro library. The C++ library doesn't like namespace to start with ".".
This is the code from the library:
if (!ns_.empty() && (ns_[0] == '.' || ns_[ns_.size() - 1] == '.' ||
    std::find_if(ns_.begin(), ns_.end(), invalidChar1) != ns_.end())) {
    throw Exception("Invalid namespace: " + ns_);
}
Spark-avro has an additional option, recordNamespace, to set the root namespace so that it will not start with a ".".
https://github.com/databricks/spark-avro/blob/branch-4.0/README-for-old-spark-versions.md
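To pre-check a namespace before handing files to BigQuery, the leading/trailing-dot part of the C++ condition quoted above can be mirrored in Python. This is only a sketch: the function name namespace_rejected is hypothetical, and the invalidChar1 test is deliberately left out because its definition is not shown in the quoted snippet.

```python
def namespace_rejected(ns):
    """Mirror the leading/trailing '.' part of the quoted C++ namespace check.

    Assumption: the invalidChar1 character test from the C++ snippet is not
    reproduced, so a namespace accepted here may still be rejected there.
    """
    # An empty namespace is allowed; a non-empty one must not start or end with '.'.
    return bool(ns) and (ns[0] == "." or ns[-1] == ".")
```

With this check, the namespace from the error message, ".someField", is flagged, while one set explicitly through recordNamespace (for example "com.example", a made-up value) would pass.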
Wondering if you ever found an answer to this.
I am seeing the same thing while trying to load data into a BigQuery table. The library first loads the data into GCS in Avro format. The schema has an array of structs as well, and the namespace begins with a ".".
