Spark - Datetime error when reading a Parquet file - apache-spark

I'm on EMR, getting data from the Glue Catalog.
When I try to pass this data and read it via Spark SQL, it throws the following error:
Caused by: org.apache.spark.SparkUpgradeException:
You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.
See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteTimestampRebaseFuncInRead$1(DataSourceUtils.scala:209)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anon$4.addLong(ParquetRowConverter.scala:330)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:268)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:367)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
... 21 more
I tried changing the following settings in Spark, but with no success:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead","CORRECTED")
and
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
I also did a select on the view created with the following code and it worked without problems,
so it makes me think the problem occurs when I use %sql.
Why does this happen? Am I doing something wrong?
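As a minimal sketch (the database and table names are hypothetical, and the INT96 setting is an assumption based on the timestamp converter in the stack trace), the rebase mode tends to be applied more reliably when it is set before the session first reads the table, e.g. at session build time or via --conf on spark-submit:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("glue-parquet-rebase")
  .enableHiveSupport()
  // Spark 3.0/3.1 config name; Spark 3.2+ also accepts spark.sql.parquet.datetimeRebaseModeInRead
  .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
  // Hive/Glue writers often store timestamps as INT96, which has its own rebase config in Spark 3.1+
  .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
  .getOrCreate()
// Hypothetical Glue database and table
spark.sql("SELECT * FROM my_glue_db.my_table").show(5, false)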

Related

Unsupported encoding: DELTA_BYTE_ARRAY when reading from Kusto using Kusto Spark connector or using Kusto export with Spark version < 3.3.0

Since last week we have started getting java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY while reading from Kusto using the Kusto Spark connector in 'Distributed' mode (the same thing happens when using the export command and reading the resulting Parquet). How can we resolve this issue? Is it caused by a change in the Kusto service or in Spark?
We tried setting the configs "spark.sql.parquet.enableVectorizedReader=false" and "parquet.split.files=false". This works, but we are worried about the consequences of this approach.
The change in behavior is due to Kusto rolling out a new implementation of the Parquet writer that uses new encoding schemes, one of which is delta byte array for strings and other byte-array-based Parquet types. This encoding scheme has been part of the Parquet format for a few years now, and modern readers (i.e. Spark 3.3.0+) are expected to support it. It provides performance and cost improvements, so we strongly advise customers to move to Spark 3.3.0 or above. The Kusto Spark connector uses Kusto export for reading and therefore produces Parquet files written with the new writer.
Possible solutions in case this is not an option:
Use Kusto Spark connector version 3.1.10, which checks the Spark version and disables the new writer in the export command if the version is less than 3.3.0.
Disable the Spark configs:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("parquet.split.files", "false")
If none of the above solves the issue, you may open a support ticket with ADX to disable the feature (this is a temporary solution).
Note: Synapse workspaces will receive the updated connector version in the following days.
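As a rough sketch of the config-based workaround above (the storage path is hypothetical), the two settings are applied on the session before the exported Parquet is read:
// Disable the vectorized Parquet reader and file splitting before reading
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("parquet.split.files", "false")
// Hypothetical path to the Parquet files produced by the Kusto export
val exported = spark.read.parquet("abfss://exports@mystorage.dfs.core.windows.net/kusto/")
exported.show(5, false)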

Parquet column issue while loading data to SQL Server using spark-submit

I am facing the following issue while migrating data from Hive to SQL Server using a Spark job, with the query supplied through a JSON file.
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file.
Column: [abc], Expected: string, Found: INT32
From what I understand, the Parquet file contains a different column structure than the Hive view. I am able to retrieve the data using tools like Teradata, etc., but loading it to a different server causes the issue.
Can anyone help me understand the problem and suggest a workaround?
Edit:
Spark version 2.4.4.2
Scala version 2.11.12
Hive 2.3.6
SQL Server 2016
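A minimal diagnostic sketch (the table name, warehouse path, and column are hypothetical): this error usually means the metastore schema and the schema actually written in the Parquet files disagree for that column, so comparing the two is a reasonable first step:
// Schema as declared in the Hive metastore
spark.table("mydb.mytable").printSchema()
// Schema as actually written in the Parquet files backing the table
spark.read.parquet("/warehouse/mydb.db/mytable").printSchema()
// If the files really contain INT32 where the view declares string, one workaround
// is to read the files directly and cast before writing to SQL Server
import org.apache.spark.sql.functions.col
val fixed = spark.read.parquet("/warehouse/mydb.db/mytable")
  .withColumn("abc", col("abc").cast("string"))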

How to solve the following issue in Spark 3.0? Can not create the managed table. The associated location already exists.;

In my Spark job, I try to overwrite a table in each micro-batch of Structured Streaming:
batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
It generated the following error.
Can not create the managed table('`mytable`'). The associated location('file:/home/ec2-user/environment/spark/spark-local/spark-warehouse/mytable') already exists.;
I know that in Spark 2.x the way to solve this issue is to add the following option:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
It works well in Spark 2.x. However, this option was removed in Spark 3.0.0. How should we solve this issue in Spark 3.0.0?
Thanks!
It looks like you run your test data generation and your actual test in the same process. Can you just replace these with createOrReplaceTempView to save them to Spark's in-memory catalog instead of into a Hive catalog?
Something like: batchDF.createOrReplaceTempView("mytable")
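A minimal sketch of that suggestion inside a foreachBatch sink (the streaming source here is a hypothetical stand-in):
import org.apache.spark.sql.DataFrame
// Hypothetical stand-in stream; substitute your real source
val streamingDF = spark.readStream.format("rate").option("rowsPerSecond", "1").load()
val query = streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Register the micro-batch in Spark's in-memory catalog instead of
    // creating a managed Hive table whose location already exists
    batchDF.createOrReplaceTempView("mytable")
  }
  .start()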

Parquet column cannot be converted: Expected decimal, Found binary

I'm using Apache NiFi 1.9.2 to load data from a relational database into Google Cloud Storage. The goal is to write the outcome into Parquet files, since Parquet stores data in a columnar way. To achieve this I use the ConvertAvroToParquet (default settings) processor in NiFi (followed by the PutGCSObject processor). The problem with the resulting files is that I cannot read Decimal-typed columns when consuming the files in Spark 2.4.0 (Scala 2.11.12): Parquet column cannot be converted ... Column: [ARHG3A], Expected: decimal(2,0), Found: BINARY
Links to parquet/avro example files:
https://drive.google.com/file/d/1PmaP1qanIZjKTAOnNehw3XKD6-JuDiwC/view?usp=sharing
https://drive.google.com/file/d/138BEZROzHKwmSo_Y-SNPMLNp0rj9ci7q/view?usp=sharing
Since NiFi works with the Avro format between processors within the flowfile, I have also written out the Avro file (as it is just before the ConvertAvroToParquet processor), and that one I can read in Spark.
It is also possible not to use logical types in Avro, but then I lose the column types in the end and all columns become strings (not preferred).
I have also experimented with the PutParquet processor, without success.
val arhg_parquet = spark.read.format("parquet").load("ARHG.parquet")
arhg_parquet.printSchema()
arhg_parquet.show(10,false)
printSchema() gives the proper result, indicating ARHG3A is a decimal(2,0).
Executing the show(10,false) results in an ERROR: Parquet column cannot be converted in file file:///C:/ARHG.parquet. Column: [ARHG3A], Expected: decimal(2,0), Found: BINARY
To achieve this I make use of the ConvertAvroToParquet (default settings) processor in Nifi (followed by the PutGCSObject processor)
Try upgrading to NiFi 1.12.1, our latest release. Some improvements were made to handling decimals that might be applicable here. Also, you can use the Parquet reader and writer services to convert from Avro to Parquet now as of ~1.10.0. If that doesn't work, it may be a bug that should have a Jira ticket filed against it.
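If upgrading NiFi is not immediately possible, one check on the Spark side (similar to the workaround mentioned in the Kusto answer above, and not guaranteed to fix the underlying writer issue) is to disable the vectorized Parquet reader and see whether the non-vectorized path handles the column differently:
// Disable the vectorized reader for this session and retry the read (path as in the question)
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val arhg_retry = spark.read.format("parquet").load("ARHG.parquet")
arhg_retry.show(10, false)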

SparkSQL attempts to read data from non-existing path

I am having an issue with the pyspark sql module. I created a partitioned table and saved it as a Parquet file into a Hive table by running a Spark job after multiple transformations.
The data load into Hive is successful and I am also able to query the data there. But when I try to query the same data from Spark, it says the file path doesn't exist.
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition mentioned in the above error contains old partition-column data that doesn't even exist any more.
I have since run the Spark job, which populates a new partition value.
I searched for solutions, but all I can find is people saying there was no issue in Spark 1.4 and that there is an issue in 1.6.
Can someone please suggest a solution for this problem?
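The error (a partition path that no longer exists) suggests Spark is holding on to stale cached metadata for the table. A minimal sketch of refreshing it before querying (the table name is hypothetical):
// Drop any cached file listing / metadata Spark holds for the table
spark.sql("REFRESH TABLE mydb.mytable")
// or, equivalently, through the catalog API
spark.catalog.refreshTable("mydb.mytable")
// then re-run the query that previously failed
spark.sql("SELECT * FROM mydb.mytable").show(5)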
