How to fix 'Failed to convert the JSON string 'varchar(2)' to a data type.' - apache-spark

We want to move from Spark 3.0.1 to 3.1.2. According to the migration guide, varchar data types are now supported in table schemas. Unfortunately, data onboarded with the new version can't be queried by older Spark versions, which treated varchar as a string in the table schema. According to the migration guide, setting spark.sql.legacy.charVarcharAsString to true in the Spark session configuration should do the trick, but we still get the varchar data type instead of string in the Hive table schema.
What are we missing here?
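For reference, setting the flag in the session configuration typically looks like the sketch below (the app name is hypothetical, and the runtime spark.conf.set variant is an assumption about how the flag is applied, not our exact setup):

import org.apache.spark.sql.SparkSession

// Build the session with the legacy char/varchar behaviour enabled.
val spark = SparkSession.builder()
  .appName("varchar-as-string")   // hypothetical app name
  .enableHiveSupport()
  .config("spark.sql.legacy.charVarcharAsString", "true")
  .getOrCreate()

// The flag is a SQL conf, so it can also be set on an existing session:
spark.conf.set("spark.sql.legacy.charVarcharAsString", "true")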

You should upgrade the Spark version according to https://issues.apache.org/jira/browse/SPARK-37452. There is a bug that affects versions 3.1.2 and 3.2.0; it was fixed in versions 3.1.3, 3.2.1 and 3.3.0.

Related

Spark org.apache.spark.sql.catalyst.analysis.UnresolvedException error in loading Hive table

While trying to load data from a dataset into a Hive table, I am getting the error:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid
call to dataType on unresolved object, tree: 'ipl_appl_signed_date
My dataset contains the same columns as the Hive table, and the column for which I am getting the error has the Date datatype in my code (Java) as well as in Hive.
Java code:
Date IPL_APPL_SIGNED_DATE = rs.getDate("DTL.IPL_APPL_SIGNED_DATE"); // using JDBC to get the record
Encoder<DimPolicy> encoder = Encoders.bean(DimPolicy.class);        // bean encoder for the row type
Dataset<DimPolicy> test = spark.createDataset(allRows, encoder);    // spark is the SparkSession
test.write().mode("append").insertInto("someSchema.someTable");
I think the issue is due to a bug in Spark, i.e. [SPARK-26379] Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp, which was fixed in 2.3.3, 2.4.1 and 3.0.0.
A solution is to downgrade to a version of Spark that is unaffected by the bug (or wait for a new version).

Saving empty DataFrame with known schema (Spark 2.2.1)

Is it possible to save an empty DataFrame with a known schema such that the schema is written to the file, even though it has 0 records?
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

def example(spark: SparkSession, path: String, schema: StructType) = {
  val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet")
  dataframeWriter.save(path)
  spark.read.load(path) // ERROR!! No files to read, so schema unknown
}
This is the answer I received from Databricks Support:
This is actually a known issue in Spark. There is already a fix in the open-source JIRA: https://issues.apache.org/jira/browse/SPARK-23271. For more details on how this behavior will change from 2.4, please check this doc change: https://github.com/apache/spark/pull/20525/files#diff-d8aa7a37d17a1227cba38c99f9f22511R1808
The behavior will change from Spark 2.4. Until then you need to go with one of the following workarounds:
Save a dataframe with at least one record to preserve its schema
Save the schema in a JSON file and use it later (see the sketch below)
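A sketch of the second workaround, assuming the spark session and schema from the example above (how and where the JSON string is persisted is up to you):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataType, StructType}

// Serialise the schema to a JSON string and stash it somewhere, e.g. a small
// text file next to the Parquet directory.
val schemaJson: String = schema.json

// Later, rebuild the StructType from the JSON and recreate the empty DataFrame.
val restored = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val emptyAgain = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], restored)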
I got a similar problem with Spark 2.1.0. I solved it using repartition before writing.
df.repartition(1).write.parquet("my/path")

Unable to query HIVE Parquet based EXTERNAL table from spark-sql

We have an external Hive table which is stored as Parquet. I am not the owner of the schema in which this Hive/Parquet table lives, so I don't have much info.
The problem here is that when I try to query that table from the spark-sql> shell prompt (not via Scala like spark.read.parquet("path")), I get 0 records with the message "Unable to infer schema". But when I created a managed table with CTAS in my personal schema just for testing, I was able to query it from the spark-sql> shell prompt.
When I try it from spark-shell> via spark.read.parquet("../../00000_0").show(10), I am able to see the data.
So this makes it clear that something is wrong between
External Hive table - Parquet - Spark-SQL (shell)
If locating the schema were the issue, it should behave the same when accessed through the Spark session (spark.read.parquet("")).
I am using MapR 5.2, Spark version 2.1.0.
Please suggest what the issue could be.

Apache Spark 1.3.1: getting ClassNotFoundException when a UDTF is used with lateral view

While executing a Hive query with a UDTF using Apache Spark 1.3.1, the alias name is not picked up.
And if I use the UDTF with a lateral view, I get ClassNotFoundException for my custom UDTF class.
This issue was reported against Spark 1.1.0, 1.1.1, 1.2.1 and 1.3.0; it has been resolved and will be available with Spark 1.4.0.
https://issues.apache.org/jira/browse/SPARK-4811
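For context, a lateral-view query over a custom UDTF typically looks like the sketch below (the function, class and table names are hypothetical, and the UDTF jar is assumed to already be on the classpath). On the affected 1.x versions the custom UDTF class fails to resolve; from 1.4.0 the same query should work.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc: the existing SparkContext
hiveContext.sql("CREATE TEMPORARY FUNCTION my_udtf AS 'com.example.MyUDTF'")
hiveContext.sql(
  "SELECT t.id, exploded.value " +
  "FROM some_table t " +
  "LATERAL VIEW my_udtf(t.payload) exploded AS value").show()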

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented in the DataFrame API docs,
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though this field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
What I did was to upgrade my Spark from 1.3.0 to 1.4.0 in Cloudera QuickStart CDH-5.4.0, and the second filtering method now works. Although I still can't explain why 1.3.0 had problems with it.
