Schema mismatch - Spark DataFrame written to Delta - apache-spark

When writing a DataFrame to Delta format, the resulting Delta table does not seem to follow the schema of the DataFrame that was written. Specifically, the 'nullable' property of a field always seems to be 'true' in the resulting Delta table regardless of the source DataFrame schema. Is this expected, or am I making a mistake here? Is there a way to get the schema of the written Delta table to match the source df exactly?
scala> df.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,false), StructField(val1,StringType,true), StructField(val2,StringType,false), StructField(dt,StringType,true))
scala> df.write.format("delta").save("D:/temp/d1")
scala> spark.read.format("delta").load("D:/temp/d1").schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,true), StructField(val1,StringType,true), StructField(val2,StringType,true), StructField(dt,StringType,true))

Writing in Parquet, the underlying format of Delta Lake, can't guarantee the nullability of a column.
You may have written a Parquet file whose columns really contain no nulls, but the schema is never validated on write in Parquet, and anyone could later append data with the same schema that does contain nulls. So Spark always marks the columns as nullable, as a precaution.
This behavior can be prevented by using a catalog, which will validate that the DataFrame follows the expected schema.

The problem is that a lot of users thought that their schema was not nullable, and wrote null data. Then they couldn't read the data back, as their Parquet files were corrupted. In order to avoid this, we always assume the table schema is nullable in Delta. In Spark 3.0, when creating a table, you will be able to specify columns as NOT NULL. This way, Delta will actually prevent null values from being written, because Delta will check that the columns are in fact not null when writing them.
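As a hedged illustration of that Spark 3.0 behaviour (a minimal sketch, assuming Spark 3.0+ with Delta Lake configured as the catalog; the table name events is made up, while the columns mirror the DataFrame schema above):

// Sketch only: declaring columns as NOT NULL in the table definition lets Delta
// enforce the constraint on write instead of relaxing the schema to nullable.
spark.sql("""
  CREATE TABLE events (
    device_id STRING NOT NULL,
    val1      STRING,
    val2      STRING NOT NULL,
    dt        STRING
  ) USING DELTA
""")

// Appending rows with a null device_id or val2 should now be rejected at write time.
df.write.format("delta").mode("append").saveAsTable("events")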

Related

How to stop Spark from changing varchar to string?

I have a Hive table with below schema:
hive> desc <DB>.<TN>;
id int,
name varchar(10),
reg varchar(8);
When I try to describe the same table in Spark (PySpark shell), it converts varchar to string.
spark.sql("""describe <DB>.<TN>""").show()
id int
name string
reg string
I would like to retain the Hive data types while querying in Spark, i.e. I expect varchar instead of string. Does anyone know how to stop Spark from inferring data types on its own?
There is no varchar in Apache Spark's DataFrame API; it's all strings. Yes, the documentation mentions a VarcharType, but it is only used for table schemas.
Once the data is in the DataFrame, things are transparent. When you save the data back, it should be stored as varchar in Hive again.
You can force a schema when reading a DataFrame where that is supported (CSV, for example), but I do not think it is possible for Hive, which is already typed.
I was going to tell you to just add a schema
schema = StructType([StructField('ID', IntegerType(), True),StructField('name', VarcharType(10), True),StructField('reg', VarcharType(8), True)])
df3 = sqlContext.createDataFrame(rdd, schema)
to a DataFrame, but DataFrames do not have a varchar type in Spark <= 2.4, which is likely why your varchars are being converted to StringType. That isn't to say they aren't available in versions after 2.4.
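As a quick check rather than a fix (a sketch, using the placeholder table name <DB>.<TN> from the question), you can confirm that the varchar lengths survive in the Hive metastore even though the DataFrame reports string:

// Compare what the metastore records with what the DataFrame reports.
spark.sql("SHOW CREATE TABLE <DB>.<TN>").show(truncate = false)  // Hive DDL still shows varchar(10)/varchar(8)
spark.table("<DB>.<TN>").printSchema()                           // the DataFrame schema shows string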

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, Parquet is a columnar storage format and is meant exactly for this type of use case. Try running df.explain and Spark will tell you that only the corresponding columns are read (it prints the execution plan). explain will also tell you which filters are pushed down to the physical execution plan in case you also use a where condition. Finally, use the following code to convert the DataFrame (a Dataset of Rows) to a Dataset of your case class.
case class MyData...
val ds = df.as[MyData]
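A more complete, runnable sketch of this approach (the column names and types in MyData are hypothetical, since the original case class is elided):

import org.apache.spark.sql.SparkSession

// Hypothetical columns; adjust to the real Parquet schema.
case class MyData(col1: Long, col2: Double)

val spark = SparkSession.builder().appName("read-column-subset").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("fs://path/file.parquet").select("col1", "col2")
df.explain()           // the Parquet scan's ReadSchema should list only col1 and col2
val ds = df.as[MyData] // typed Dataset over just the selected columns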
At least in some cases, reading the DataFrame with all columns and then selecting a subset won't work. For example, the following will fail if the Parquet file contains at least one field with a type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to pass load a schema that contains only the requested columns:
spark.read.format("parquet").load("<path_to_file>",
schema="col1 bigint, col2 float")
Using this, you will be able to load a subset of the Spark-supported Parquet columns even if loading the full file is not possible. I'm using PySpark here, but I would expect the Scala version to have something similar.
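For reference, a hedged Scala counterpart of the PySpark snippet above (the DDL string mirrors the one shown there):

// Sketch: DataFrameReader.schema also accepts a DDL-formatted string in Scala,
// so only the listed columns are materialized when reading.
val subset = spark.read
  .schema("col1 BIGINT, col2 FLOAT")
  .parquet("<path_to_file>")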
Spark supports pushdowns with Parquet so
load(<parquet>).select(...col1, col2)
is fine.
I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
This could be an issue, as it looks like some optimizations don't work in this context; see Spark 2.0 Dataset vs DataFrame.
Parquet is a columnar file format. It is designed exactly for these kinds of use cases.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Upgrading to Spark 2.0.1 broke array<string> in parquet DataFrame

I have a table with a few columns, some of which are arrays. Since upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always null when reading in a DataFrame.
When writing the Parquet files, the schema of the column is specified as
StructField("packageIds",ArrayType(StringType)).
The schema of the column in the Hive Metastore is
packageIds array<string>
The schema used in the writer exactly matches the schema in the Metastore.
The query is a simple "select *"
spark.sql("select * from tablename limit 1").collect() // null columns in Row
How can I debug this issue? Notable things I've already investigated:
It works in Spark 1.6.
I've inspected the parquet files using parquet-tools and can see the data.
I also have another table written in exactly the same way and it doesn't have the issue.
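One hedged way to start debugging (not from the original post) is to read the Parquet files directly, bypassing the Hive metastore, and compare the result with the metastore-backed query; <table_location> is a placeholder for the table's storage path.

// If the direct read shows the array values, the problem is more likely in the
// metastore schema conversion than in the files themselves.
val direct = spark.read.parquet("<table_location>")
direct.printSchema()                                     // how is packageIds typed here?
direct.select("packageIds").show(5, truncate = false)

// Compare with the metastore-backed read that returns nulls:
spark.sql("select packageIds from tablename limit 5").show(false)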

How to read Hive DOUBLE values stored in Avro logical format using Spark

I have existing Hive data stored in Avro format. For whatever reason, reading this data by executing a SELECT is very slow; I haven't figured out why yet. The data is partitioned, and my WHERE clause always follows the partition columns, so I decided to read the data directly by navigating to the partition path and using Spark SQLContext. This works much faster. However, the problem I have is reading the DOUBLE values: Avro stores them in a binary format.
When I execute the following query in Hive:
select myDoubleValue from myTable;
I'm getting the correct expected values
841.79
4435.13
.....
but the following Spark code:
val path="PathToMyPartition"
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.avro(path)
df.select("myDoubleValue").rdd.map(x => x.getAs[Double](0))
gives me this exception
java.lang.ClassCastException : [B cannot be cast to java.lang.Double
What would be the right way to either provide a schema or convert the value stored in binary format into a double?
I found a partial solution for converting the Avro schema to a Spark SQL StructType. com.databricks.spark.avro.SchemaConverters, developed by Databricks, has a bug in its toSqlType(avroSchema: Schema) method: it was incorrectly converting the logicalType
{"name":"MyDecimalField","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":18}],"doc":"","default":null}
into
StructField("MyDecimalField",BinaryType,true)
I fixed this bug in my local version of the code, and now it converts it into
StructField("MyDecimalField",DecimalType(38,18),true)
Now, the following code reads the Avro file and creates a DataFrame:
val avroSchema = new Schema.Parser().parse(QueryProvider.getQueryString(pathSchema))
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.schema(MyAvroSchemaConverter.toSqlType(avroSchema).dataType.asInstanceOf[StructType]).avro(path)
However, when I select the field that I expect to be decimal with
df.select("MyDecimalField")
I'm getting the following exception:
scala.MatchError: [B#3e6e0d8f (of class [B)
This is where I'm stuck at the moment, and I would appreciate it if anyone could suggest what to do next or any other workaround.
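One possible workaround, offered as a hedged sketch rather than a confirmed fix: Avro's decimal logical type stores the unscaled value as big-endian two's-complement bytes, so if the column is read with its original BinaryType (i.e. without the patched SchemaConverters) it can be decoded manually with a UDF. The scale of 18 is taken from the Avro schema above.

import java.math.{BigDecimal => JBigDecimal, BigInteger}
import org.apache.spark.sql.functions.udf

// Decode the unscaled big-endian two's-complement bytes into a java.math.BigDecimal
// using the scale declared in the Avro schema.
val decodeDecimal = udf { (bytes: Array[Byte]) =>
  if (bytes == null) null else new JBigDecimal(new BigInteger(bytes), 18)
}

// Assumes df was read with MyDecimalField still typed as binary.
val decoded = df.withColumn("MyDecimalField", decodeDecimal(df("MyDecimalField")))
decoded.select("MyDecimalField").show(false)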
