How does schema inference work in spark.read.parquet? - apache-spark

I'm trying to read a Parquet file in Spark and I have a question.
How is the type determined when loading a Parquet file with spark.read.parquet?
1. Fixed mapping from the Parquet type, e.g. Parquet INT32 -> Spark IntegerType
2. Inferred from the actual stored values -> Spark IntegerType
Is there a mapping dictionary as in 1?
Or is the type inferred from the actual stored values as in 2?

Spark parses the Parquet schema into its internal representation (i.e., StructType); it does not infer types from the stored values. This is a bit hard to find in the Spark docs, so I went through the code. The mapping you are looking for is here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L197-L281
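If you want to see the mapping in action, here is a minimal sketch (the output path and column name are placeholders): write a DataFrame with a known Spark type, then read the file back. The type comes straight from the Parquet file's schema, not from scanning the values.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-type-mapping").getOrCreate()
import spark.implicits._

// Write an IntegerType column; Parquet stores it with physical type INT32.
val path = "/tmp/ints_parquet"  // placeholder path
Seq(1, 2, 3).toDF("id").write.mode("overwrite").parquet(path)

// On read, the schema converter maps INT32 back to IntegerType from the
// file schema alone; no stored values are inspected.
spark.read.parquet(path).printSchema()
// root
//  |-- id: integer (nullable = true)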

Related

Pyspark org.apache.spark.sql.avro.IncompatibleSchemaException

I'm running pyspark and trying to read some avro files in. The avro files are stored in AWS S3. The script goes something like:
df = spark.read.format('avro').load('/path.avro')
df.checkpoint()
However, I'm getting this error, and notably it occurs only sometimes for the same input files:
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException:
Cannot convert Avro to catalyst because schema at path dummy.lat is
not compatible (avroType = "double", sqlType = FloatType).
I investigated further and dummy.lat is indeed stored as a double. It used to be stored as a float in our database.
Why is this causing an issue? Isn't spark able to infer schema?
Double to Float is not listed as a supported conversion: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
You could try changing your SQL Type to Double.
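If you go that route, here is a hedged sketch in Scala (the PySpark call is analogous), assuming the Avro source accepts a user-specified Spark schema here; only the field from the error message is shown, and the rest of your columns would need to be added:
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical partial schema: dummy.lat declared as double to match the file.
val readSchema = StructType(Seq(
  StructField("dummy", StructType(Seq(
    StructField("lat", DoubleType, nullable = true)
  )), nullable = true)
))

val df = spark.read.format("avro").schema(readSchema).load("/path.avro")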

Schema mismatch - Spark DataFrame written to Delta

When writing a dataframe to delta format, the resulting delta does not seem to follow the schema of the dataframe that was written. Specifically, the 'nullable' property of a field seems to be always 'true' in the resulting delta regardless of the source dataframe schema. Is this expected or am I making a mistake here? Is there a way to get the schema of the written delta to match exactly with the source df?
scala> df.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,false), StructField(val1,StringType,true), StructField(val2,StringType,false), StructField(dt,StringType,true))
scala> df.write.format("delta").save("D:/temp/d1")
scala> spark.read.format("delta").load("D:/temp/d1").schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,true), StructField(val1,StringType,true), StructField(val2,StringType,true), StructField(dt,StringType,true))
Parquet, the underlying storage format of Delta Lake, can't guarantee the nullability of a column.
You may have written a Parquet file whose data happens to contain no nulls, but the schema is never validated on write in Parquet, and anyone could later append data with the same schema but containing nulls. So, as a precaution, Spark always marks the columns as nullable.
This behavior can be prevented by using a catalog, which will validate that the dataframe follows the expected schema.
The problem is that a lot of users thought their schema was not nullable and wrote null data. Then they couldn't read the data back, as their Parquet files were corrupted. In order to avoid this, we always assume the table schema is nullable in Delta. In Spark 3.0, when creating a table, you will be able to specify columns as NOT NULL. This way, Delta will actually prevent null values from being written, because Delta will check that the columns are in fact not null when writing the data.
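As a sketch of what that looks like (Spark 3.0+ with Delta Lake configured; the table name is hypothetical, the column names are taken from the schema above):
// Declare columns as NOT NULL at table creation time so Delta enforces the
// constraint instead of silently relaxing everything to nullable.
spark.sql("""
  CREATE TABLE my_delta_table (
    device_id STRING NOT NULL,
    val1 STRING,
    val2 STRING NOT NULL,
    dt STRING
  ) USING DELTA
""")
// A write that puts a null into device_id or val2 now fails Delta's NOT NULL check.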

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in Spark from a Parquet file that has many columns? Is spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use a typesafe Dataset with case classes to pre-define my schema, but I'm not sure.
val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, Parquet is a columnar storage format and it is exactly meant for this type of use case. Try running df.explain: Spark will print the execution plan and show that only the corresponding columns are read. explain will also tell you which filters are pushed down to the physical plan in case you also use a where condition. Finally, use the following code to convert the dataframe (a dataset of rows) to a dataset of your case class.
case class MyData...
val ds = df.as[MyData]
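For a fuller sketch of the same pattern (the column names and types here are assumptions; replace them with your actual schema):
// Assumed columns col1: Long and col2: Float -- adjust to your file.
case class MyData(col1: Long, col2: Float)

val df = spark.read.parquet("fs://path/file.parquet").select("col1", "col2")
df.explain()            // the Parquet scan's ReadSchema should list only col1, col2

import spark.implicits._
val ds = df.as[MyData]  // typed Dataset backed by the pruned scan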
At least in some cases, getting a dataframe with all columns and then selecting a subset won't work. E.g., the following will fail if the Parquet file contains at least one field with a type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to provide a schema that contains only the requested columns to load:
spark.read.format("parquet").load("<path_to_file>",
                                  schema="col1 bigint, col2 float")
Using this, you will be able to load a subset of the Spark-supported Parquet columns even if loading the full file is not possible. I'm using PySpark here, but I'd expect the Scala version to have something similar.
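Indeed, here is a hedged sketch of the Scala counterpart (same assumed column names and types as above): the DataFrameReader also accepts a DDL-formatted schema string, so you can restrict what gets materialized the same way.
val subset = spark.read
  .format("parquet")
  .schema("col1 bigint, col2 float")  // DDL-style schema string, Spark 2.3+
  .load("<path_to_file>")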
Spark supports column pruning and filter pushdown with Parquet, so
load(<parquet>).select(...col1, col2)
is fine.
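To see this in the physical plan, a quick sketch (the predicate and column names are placeholders):
val pruned = spark.read.parquet("<parquet>")
  .select("col1", "col2")
  .where("col1 > 0")   // placeholder predicate

pruned.explain()
// The Parquet scan node should show ReadSchema with only col1/col2 and a
// PushedFilters entry for the col1 > 0 predicate.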
I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
This could be an issue, as it looks like some optimizations don't work in this context; see Spark 2.0 Dataset vs DataFrame.
Parquet is a columnar file format. It is designed exactly for this kind of use case.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Spark changes the schema when writing to Avro

I have a Spark job (in CDH 5.5.1) that loads two Avro files (both with the same schema), combines them to make a DataFrame (also with the same schema) then writes them back out to Avro.
The job explicitly compares the two input schemas to ensure they are the same.
This is used to combine existing data with a few updates (since the files are immutable). I then replace the original file with the new combined file by renaming them in HDFS.
However, if I repeat the update process (i.e. try to add some further updates to the previously updated file), the job fails because the schemas are now different! What is going on?
This is due to the behaviour of the spark-avro package.
When writing to Avro, spark-avro writes everything as unions of the given type along with a null option.
In other words, "string" becomes ["string", "null"] so every field becomes nullable.
If your input schema already contains only nullable fields, then this problem doesn't become apparent.
This isn't mentioned on the spark-avro page, but is described as one of the limitations of spark-avro in some Cloudera documentation:
Because Spark is converting data types, watch for the following:
Enumerated types are erased - Avro enumerated types become strings when they are read into Spark, because Spark does not support enumerated types.
Unions on output - Spark writes everything as unions of the given type along with a null option.
Avro schema changes - Spark reads everything into an internal representation. Even if you just read and then write the data, the schema for the output will be different.
Spark schema reordering - Spark reorders the elements in its schema when writing them to disk so that the elements being partitioned on are the last elements.
See also this github issue: (spark-avro 92)
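If the explicit schema comparison in your job is what fails after the first round trip, one possible workaround is to normalize nullability before comparing. This is only a shallow sketch: it does not recurse into nested structs or handle the reordering/enum caveats above, and df1/df2 are placeholders for your two inputs.
import org.apache.spark.sql.types.StructType

// Mark every top-level field as nullable so spark-avro's ["type", "null"]
// rewrite no longer makes otherwise-identical schemas compare unequal.
def asAllNullable(schema: StructType): StructType =
  StructType(schema.fields.map(f => f.copy(nullable = true)))

val schemasMatch = asAllNullable(df1.schema) == asAllNullable(df2.schema)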
