I'm running PySpark and trying to read in some Avro files stored in AWS S3. The script goes something like:
df = spark.read.format('avro').load('/path.avro')
df.checkpoint()
However, I'm getting this error, and notably it occurs only sometimes for the same input files:
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException:
Cannot convert Avro to catalyst because schema at path dummy.lat is
not compatible (avroType = "double", sqlType = FloatType).
I investigated further and dummy.lat is indeed stored as a double. It used to be stored as a float in our database.
Why is this causing an issue? Isn't Spark able to infer the schema?
Double to Float is not listed as a supported conversion: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
You could try changing your SQL Type to Double.
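If a schema with FloatType is being supplied somewhere (for example, one left over from when the column was still a float), a minimal PySpark sketch of that fix might look like this; the struct layout under dummy is guessed from the error message, and the rest of your columns would need to be filled in:

from pyspark.sql.types import StructType, StructField, DoubleType

# Layout assumed from the error path dummy.lat; add your remaining fields here.
schema = StructType([
    StructField("dummy", StructType([
        StructField("lat", DoubleType(), True),  # DoubleType matches the Avro "double"
    ]), True),
])

df = spark.read.format("avro").schema(schema).load("/path.avro")
df.checkpoint()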
Related
I'm trying to read a Parquet file in Spark and I have a question.
How is the type determined when loading a Parquet file with spark.read.parquet?
1. Parquet type INT32 -> Spark IntegerType (a fixed mapping)
2. Spark IntegerType inferred from the actual stored values
Is there a dictionary-like mapping, as in 1?
Or is the type inferred from the actual stored values, as in 2?
Spark uses the Parquet schema to parse the file into an internal representation (i.e., a StructType). This is a bit hard to find in the Spark docs, so I went through the code to find the mapping you are looking for here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L197-L281
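As a quick sanity check that the mapping is a fixed dictionary (your option 1) rather than something inferred from the stored values, you can read any Parquet file and look at printSchema(); the path below is only a placeholder:

# The Spark type comes from the Parquet footer schema, not from scanning values.
df = spark.read.parquet("/tmp/some_file.parquet")  # placeholder path
df.printSchema()  # e.g. an INT32 column shows up as integer, INT64 as long, BINARY annotated as UTF8 as string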
The Parquet file is generated by Azure Data Factory (a copy activity copying from Azure SQL to Parquet in the data lake). When I try to read the same Parquet file from Hive, it gives the error org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block.
If you are generating Parquet using Spark you can set spark.sql.parquet.writeLegacyFormat=true, but how do you handle the same thing in Azure Data Factory?
The issue occurs with decimal conversions.
This issue is caused by the different Parquet conventions used by Hive and Spark.
I assume Hive expects decimals as fixed-length byte arrays, whereas Spark writes them out as INT32 for 1 <= precision <= 9 and INT64 for 10 <= precision <= 18.
Errors like these are caused by a schema with a DECIMAL field. Try using DOUBLE instead of DECIMAL.
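For reference, when Spark itself is the writer, the setting mentioned above goes in like this (a PySpark sketch with a made-up output path; it does not help with files produced by Data Factory):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.parquet.writeLegacyFormat", "true")  # write decimals as fixed-length byte arrays, as Hive expects
         .getOrCreate())

df = spark.sql("SELECT CAST(123.45 AS DECIMAL(9, 2)) AS amount")
df.write.mode("overwrite").parquet("/tmp/legacy_decimal_demo")  # placeholder path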
I'm using Apache NiFi 1.9.2 to load data from a relational database into Google Cloud Storage. The goal is to write the outcome into Parquet files, since Parquet stores data in a columnar way. To achieve this I use the ConvertAvroToParquet processor (default settings) in NiFi, followed by the PutGCSObject processor. The problem with the resulting files is that I cannot read decimal-typed columns when consuming the files in Spark 2.4.0 (Scala 2.11.12): Parquet column cannot be converted ... Column: [ARHG3A], Expected: decimal(2,0), Found: BINARY
Links to parquet/avro example files:
https://drive.google.com/file/d/1PmaP1qanIZjKTAOnNehw3XKD6-JuDiwC/view?usp=sharing
https://drive.google.com/file/d/138BEZROzHKwmSo_Y-SNPMLNp0rj9ci7q/view?usp=sharing
Since NiFi works with the Avro format between processors within the flowfile, I have also written out the Avro file (as it is just before the ConvertAvroToParquet processor), and that file I can read in Spark.
It is also possible to not use logical types in Avro, but then I lose the column types in the end and all columns are strings (not preferred).
I have also experimented with the PutParquet processor, without success.
val arhg_parquet = spark.read.format("parquet").load("ARHG.parquet")
arhg_parquet.printSchema()
arhg_parquet.show(10,false)
printSchema() gives the proper result, indicating that ARHG3A is a decimal(2,0).
Executing show(10, false) results in an error: Parquet column cannot be converted in file file:///C:/ARHG.parquet. Column: [ARHG3A], Expected: decimal(2,0), Found: BINARY
Try upgrading to NiFi 1.12.1, our latest release. Some improvements were made to handling decimals that might be applicable here. Also, you can use the Parquet reader and writer services to convert from Avro to Parquet now as of ~1.10.0. If that doesn't work, it may be a bug that should have a Jira ticket filed against it.
I have a Spark job (on CDH 5.5.1) that loads two Avro files (both with the same schema), combines them into a DataFrame (also with the same schema), then writes the result back out to Avro.
The job explicitly compares the two input schemas to ensure they are the same.
This is used to combine existing data with a few updates (since the files are immutable). I then replace the original file with the new combined file by renaming them in HDFS.
However, if I repeat the update process (i.e. try to add some further updates to the previously updated file), the job fails because the schemas are now different! What is going on?
This is due to the behaviour of the spark-avro package.
When writing to Avro, spark-avro writes everything as unions of the given type along with a null option.
In other words, "string" becomes ["string", "null"] so every field becomes nullable.
If your input schema already contains only nullable fields, then this problem doesn't become apparent.
This isn't mentioned on the spark-avro page, but is described as one of the limitations of spark-avro in some Cloudera documentation:
Because Spark is converting data types, watch for the following:
Enumerated types are erased - Avro enumerated types become strings when they are read into Spark, because Spark does not support enumerated types.
Unions on output - Spark writes everything as unions of the given type along with a null option.
Avro schema changes - Spark reads everything into an internal representation. Even if you just read and then write the data, the schema for the output will be different.
Spark schema reordering - Spark reorders the elements in its schema when writing them to disk so that the elements being partitioned on are the last elements.
See also this GitHub issue: spark-avro #92
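As a rough PySpark illustration of the round-trip effect (paths are placeholders, and with the databricks package the format string would be "com.databricks.spark.avro" rather than the built-in "avro"):

# Read the original file, write it back out, and compare schemas the way the job in the question does.
original = spark.read.format("avro").load("/data/original.avro")  # placeholder path
original.write.format("avro").save("/data/roundtrip.avro")        # placeholder path

roundtrip = spark.read.format("avro").load("/data/roundtrip.avro")
print(original.schema == roundtrip.schema)  # may be False: fields written as ["type", "null"] unions come back nullable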
I have existing Hive data stored in Avro format. For whatever reason, reading this data with a SELECT is very slow, and I haven't figured out why yet. The data is partitioned, and my WHERE clause always follows the partition columns. So I decided to read the data directly by navigating to the partition path and using the Spark SQLContext, which is much faster. However, the problem I have is reading the DOUBLE values: Avro stores them in a binary format.
When I execute the following query in Hive:
select myDoubleValue from myTable;
I'm getting the correct expected values
841.79
4435.13
.....
but the following Spark code:
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // provides the .avro(...) reader

val path = "PathToMyPartition"
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.avro(path)
df.select("myDoubleValue").rdd.map(x => x.getAs[Double](0))
gives me this exception
java.lang.ClassCastException : [B cannot be cast to java.lang.Double
What would be the right way either to provide a schema or convert the value that is stored in a binary format into a double format?
I found a partial solution for converting the Avro schema to a Spark SQL StructType. com.databricks.spark.avro.SchemaConverters, developed by Databricks, has a bug in its toSqlType(avroSchema: Schema) method when converting Avro logical data types: it was incorrectly converting the logicalType
{"name":"MyDecimalField","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":18}],"doc":"","default":null}
into
StructField("MyDecimalField",BinaryType,true)
I fixed this bug in my local version of the code and now it is converting into
StructField("MyDecimalField",DecimalType(38,18),true)
Now, the following code reads the Avro file and creates a DataFrame:
val avroSchema = new Schema.Parser().parse(QueryProvider.getQueryString(pathSchema))
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.schema(MyAvroSchemaConverter.toSqlType(avroSchema).dataType.asInstanceOf[StructType]).avro(path)
However, when I select the field that I expect to be decimal with
df.select("MyDecimalField")
I'm getting the following exception:
scala.MatchError: [B#3e6e0d8f (of class [B)
This is where I'm stuck at the moment, and I would appreciate it if anyone could suggest what to do next or any other workaround.