Validate NULL values from parquet files - apache-spark

I'm reading parquet files from a third party. It seems that parquet always converts the schema of files to nullable columns regardless of how they were written.
When reading these files I would like to reject files that contain a NULL value in a particular column. With csv or json you can do:
schema = StructType([StructField("id", IntegerType(), False), StructField("col1", IntegerType(), False)])
df = spark.read.format("csv").schema(schema).option("mode", "FAILFAST").load(myPath)
And the load will be rejected if it contains a NULL in col1. If you try this with Parquet it will be accepted.
I could do a filter or count on the column for NULL values and raise an error, but from a performance standpoint that is terrible because it adds an extra stage to the job. It also rejects the complete dataframe and all files (yes, the CSV route does this as well).
Is there any way to enforce validation on the files on read?
I'm using Spark 3, if it helps.
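For reference, the filter/count fallback I'd like to avoid looks roughly like this (the extra action is the problem):
from pyspark.sql.functions import col

df = spark.read.format("parquet").load(myPath)
# this count is a separate action, i.e. an extra stage just for validation
if df.filter(col("col1").isNull()).count() > 0:
    raise ValueError("col1 contains NULL values")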
Edit with example:
from pyspark.sql.types import *
schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("col1", IntegerType(), True)
])
df = spark.createDataFrame([(1,1),(2, None)], schema)
df.write.format("parquet").mode("overwrite").save("/tmp/parquetValidation/")
df2 = spark.read.format("parquet").load("/tmp/parquetValidation/")
df2.printSchema()
Returns
root
|-- Id: integer (nullable = true)
|-- col1: integer (nullable = true)
Re-read the file with a schema blocking nulls:
schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("col1", IntegerType(), False)
])
df3 = spark.read.format("parquet").schema(schema).option("mode", "FAILFAST").load("/tmp/parquetValidation/")
df3.printSchema()
Returns:
root
|-- Id: integer (nullable = true)
|-- col1: integer (nullable = true)
i.e. the schema is not applied.

Thanks to @Sasa in the comments on the question.
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("col1", IntegerType(), False)
])

# Read as usual - the parquet reader silently drops the non-nullable flags
df_junk = spark.read.format("parquet").schema(schema).load("/tmp/parquetValidation/")

# Rebuild the DataFrame on the JVM side, re-applying the schema with nullable=False
new_java_schema = spark._jvm.org.apache.spark.sql.types.DataType.fromJson(schema.json())
java_rdd = df_junk._jdf.toJavaRDD()
new_jdf = spark._jsparkSession.createDataFrame(java_rdd, new_java_schema)
df_validate = DataFrame(new_jdf, df_junk.sql_ctx)
df_validate.printSchema()
Returns
root
|-- Id: integer (nullable = false)
|-- col1: integer (nullable = false)
And running an action causes:
java.lang.RuntimeException: The 1th field 'col1' of input row cannot be null.
Not nice dropping to a Java RDD, but it works.
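As far as I can tell, the reason this works is that re-applying the schema through the JVM createDataFrame call actually keeps nullable=False, and Spark then adds a per-row null check to the plan; that check is what throws the "field ... cannot be null" error on the first action, so no separate validation stage is needed.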

Related

Spark Dataframe returns NULL for entire row when one column value of that row is NULL

Input data -
{"driverId":1,"driverRef":"hamilton","number":44,"code":"HAM","name":{"forename":"Lewis","surname":"Hamilton"},"dob":"1985-01-07","nationality":"British","url":"http://en.wikipedia.org/wiki/Lewis_Hamilton"}
{"driverId":2,"driverRef":"heidfeld","number":"\\N","code":"HEI","name":{"forename":"Nick","surname":"Heidfeld"},"dob":"1977-05-10","nationality":"German","url":"http://en.wikipedia.org/wiki/Nick_Heidfeld"}
{"driverId":3,"driverRef":"rosberg","number":6,"code":"ROS","name":{"forename":"Nico","surname":"Rosberg"},"dob":"1985-06-27","nationality":"German","url":"http://en.wikipedia.org/wiki/Nico_Rosberg"}
{"driverId":4,"driverRef":"alonso","number":14,"code":"ALO","name":{"forename":"Fernando","surname":"Alonso"},"dob":"1981-07-29","nationality":"Spanish","url":"http://en.wikipedia.org/wiki/Fernando_Alonso"}
{"driverId":5,"driverRef":"kovalainen","number":"\\N","code":"KOV","name":{"forename":"Heikki","surname":"Kovalainen"},"dob":"1981-10-19","nationality":"Finnish","url":"http://en.wikipedia.org/wiki/Heikki_Kovalainen"}
{"driverId":6,"driverRef":"nakajima","number":"\\N","code":"NAK","name":{"forename":"Kazuki","surname":"Nakajima"},"dob":"1985-01-11","nationality":"Japanese","url":"http://en.wikipedia.org/wiki/Kazuki_Nakajima"}
{"driverId":7,"driverRef":"bourdais","number":"\\N","code":"BOU","name":{"forename":"Sébastien","surname":"Bourdais"},"dob":"1979-02-28","nationality":"French","url":"http://en.wikipedia.org/wiki/S%C3%A9bastien_Bourdais"}
After reading this data into a Spark dataframe and displaying it, I can see that the entire row for driverId 2, 5, 6 and 7 is NULL. Those are exactly the rows where the number value is "\N" in the source data.
Here is my code. Any mistakes here?
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType
name_field = StructType(fields=[
    StructField("forename", StringType(), True),
    StructField("surname", StringType(), True)
])

driver_schema = StructType(fields=[
    StructField("driverId", IntegerType(), False),
    StructField("driverRef", StringType(), True),
    StructField("number", IntegerType(), True),
    StructField("code", StringType(), True),
    StructField("name", name_field),
    StructField("dob", DateType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True)
])

driver_df = spark.read \
    .schema(driver_schema) \
    .json('dbfs:/mnt/databrickslearnf1azure/raw/drivers.json')

driver_df.printSchema()
root
|-- driverId: integer (nullable = true)
|-- driverRef: string (nullable = true)
|-- number: integer (nullable = true)
|-- code: string (nullable = true)
|-- name: struct (nullable = true)
| |-- forename: string (nullable = true)
| |-- surname: string (nullable = true)
|-- dob: date (nullable = true)
|-- nationality: string (nullable = true)
|-- url: string (nullable = true)
display(driver_df)
You can change your initial schema as follows, so that number is read as a string:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType
name_field = StructType(fields=[
    StructField("forename", StringType(), True),
    StructField("surname", StringType(), True)
])

driver_schema = StructType(fields=[
    StructField("driverId", IntegerType(), False),
    StructField("driverRef", StringType(), True),
    StructField("number", StringType(), True),
    StructField("code", StringType(), True),
    StructField("name", name_field),
    StructField("dob", DateType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True)
])
Then you can read the data from the JSON file using the same code that you already have:
driver_df = spark.read \
    .schema(driver_schema) \
    .json('dbfs:/mnt/databrickslearnf1azure/raw/drivers.json')

driver_df.printSchema()
Once you have read the data, you can replace the "\N" marker and then cast the column from string to integer; the cast turns any remaining non-numeric value into a null:
from pyspark.sql.functions import *

# replace the "\N" marker, then cast - casting the non-numeric placeholder yields null
df = driver_df.withColumn("number", when(driver_df.number == "\\N", "null").otherwise(driver_df.number))
finaldf = df.withColumn("number", df.number.cast(IntegerType()))
finaldf.printSchema()
Now if you do a display or show on the dataframe, you can see that the previously all-NULL rows are populated, with only the number column null for the drivers whose source value was "\N".
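A slightly cleaner variant, not part of the original answer, replaces the marker with an actual null instead of the string "null" before casting:
from pyspark.sql.functions import when, col, lit
from pyspark.sql.types import IntegerType

finaldf = driver_df.withColumn(
    "number",
    when(col("number") == "\\N", lit(None)).otherwise(col("number")).cast(IntegerType())
)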
You are seeing this because, according to the official Databricks docs:
Cause: Spark 3.0 and above (Databricks Runtime 7.3 LTS and above) cannot parse JSON arrays as structs; you should pass the schema as ArrayType instead of StructType.
Solution: Pass the schema as ArrayType instead of StructType.
from pyspark.sql.types import ArrayType

driver_schema = ArrayType(StructType(fields=[
    StructField("driverId", IntegerType(), False),
    StructField("driverRef", StringType(), True),
    StructField("number", IntegerType(), True),
    StructField("code", StringType(), True),
    StructField("name", name_field),
    StructField("dob", DateType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True)
]))

spark data frame Schema With Data Definitions

I'm trying to add comments to the fields (a schema with data definitions); below is the implementation I'm trying.
I tried StructType.add() (code in comments) and also StructType([StructField("field", dtype, nullable, metadata)]), and got the error below. I'm not sure this implementation works. Can someone help me here? I'm new to Spark.
I'm looking for output (schema with data definitions) like:
df.printSchema()
root
|-- firstname: string (nullable = true) comments:val1
|-- middlename: string (nullable = true) comments:val2
|-- lastname: string (nullable = true) comments:val3
|-- id: string (nullable = true) comments:val4
|-- gender: string (nullable = true) comments:val5
|-- salary: integer (nullable = true) comments:val6
error:
IllegalArgumentException: Failed to convert the JSON string '{"metadata":"val1","name":"firstname","nullable":true,"type":"string"}' to a field.
The code where I'm trying to add the comments to the fields:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[1]") \
    .appName('SparkByExamples.com') \
    .getOrCreate()

data = [("James", "", "Smith", "36636", "M", 3000),
        ("Michael", "Rose", "", "40288", "M", 4000),
        ("Robert", "", "Williams", "42114", "M", 4000),
        ("Maria", "Anne", "Jones", "39192", "F", 4000),
        ("Jen", "Mary", "Brown", "", "F", -1)
        ]

schema = StructType([
    StructField("firstname", StringType(), True, 'val1'),
    StructField("middlename", StringType(), True, 'val2'),
    StructField("lastname", StringType(), True, 'val3'),
    StructField("id", StringType(), True, 'val4'),
    StructField("gender", StringType(), True, 'val5'),
    StructField("salary", IntegerType(), True, 'val6')
])

# schema = StructType().add("firstname", StringType(), True, 'val1').add("middlename", StringType(), True, 'val2') \
#     .add("lastname", StringType(), True, 'val3').add("id", StringType(), True, 'val4') \
#     .add("gender", StringType(), True, 'val5').add("salary", IntegerType(), True, 'val6')

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
StructField's metadata parameter expects a dictionary. It would be something like this:
StructField("firstname", StringType(), True, {"comment":"val1"})

Pyspark dataframe write and read changes schema

I have a Spark dataframe which contains both string and int columns.
But when I write the dataframe to a csv file and then load it later, all the columns are loaded as string.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                           ["Name", "count"])
Before:
df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: long (nullable = true)
df.write.mode('overwrite').option('header', True).csv(filepath)
new_df = spark.read.option('header', True).csv(filepath)
After:
new_df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: string (nullable = true)
How do I specify to store the schema as well while writing?
We can't store the schema in the csv file while writing, but we can specify the schema while reading.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = StructType([
    StructField('Name', StringType(), True),
    StructField('count', LongType(), True)
])
#specify schema while reading
new_df = spark.read.schema(schema).option('header', True).csv(filepath)
new_df.printSchema()
# or else use the inferSchema option, but specifying the schema explicitly is more robust
new_df = spark.read.option('header', True).option("inferSchema",True).csv(filepath)
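As a side note (not part of the original answer): if keeping the types across the write/read round trip is the main concern, writing to a self-describing format such as parquet preserves them; parquet_path below is illustrative:
df.write.mode('overwrite').parquet(parquet_path)
new_df = spark.read.parquet(parquet_path)
new_df.printSchema()  # count comes back as long, not string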

Python 3 function to loop over pandas data frame to change schema

I'm converting a bunch of pandas data frames into Spark dataframes and then writing them to HDFS. I'm also explicitly specifying the schema to change all data types to string, to avoid type-merge conflicts.
I'm trying to write a function that loops through all the pandas df columns and creates the schema, which I can then use for the conversion to Spark.
Here is what I have so far:
def creating_schema(df):
    for columnName in df.columns:
        schema = StructType([(StructField('"' + columnName + '"', StringType(), True))])
        print(schema)
    return schema
This outputs:
StructType(List(StructField("column_1",StringType,true)))
StructType(List(StructField("column_2",StringType,true)))
StructType(List(StructField("column_3",StringType,true)))
StructType(List(StructField("column_4",StringType,true)))
StructType(List(StructField("column_5",StringType,true)))
However, I believe I need something in this format for it to work:
schema = StructType([StructField("column_1" , StringType(), True),
StructField("column_2" , StringType(), True),
StructField("column_3" , StringType(), True),
StructField("column_4" , StringType(), True),
StructField("column_5" , StringType(), True)
])
Any help in writing this function would be helpful!
Thanks!
Try:
from pyspark.sql.types import StructType, StructField, StringType

def creating_schema(df):
    sf = []
    for columnName in df.columns:
        sf.append(StructField(columnName, StringType(), True))
    return StructType(sf)
Proof:
import pandas as pd

pdf = pd.DataFrame(columns=["column_1", "column_2", "column_3", "column_4", "column_5"])
schema = creating_schema(pdf)
sdf = sqlContext.createDataFrame(sc.emptyRDD(), schema)
sdf.printSchema()
root
|-- column_1: string (nullable = true)
|-- column_2: string (nullable = true)
|-- column_3: string (nullable = true)
|-- column_4: string (nullable = true)
|-- column_5: string (nullable = true)
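Continuing from the proof above, a sketch (not from the original answer) of using the same function with a populated pandas frame, assuming the values are stringified first so they match the all-string schema:
pdf = pd.DataFrame({"column_1": [1, 2], "column_2": [3.5, 4.5]})
sdf = spark.createDataFrame(pdf.astype(str), schema=creating_schema(pdf))
sdf.printSchema()  # both columns come out as string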

Spark DataFrame Schema Nullable Fields

I wrote the following code in both Scala and Python; however, the DataFrame that is returned doesn't appear to honour the non-nullable fields in the schema I am applying. italianVotes.csv is a csv file with '~' as a separator and four fields. I'm using Spark 2.1.0.
italianVotes.csv
2657~135~2~2013-11-22 00:00:00.0
2658~142~2~2013-11-22 00:00:00.0
2659~142~1~2013-11-22 00:00:00.0
2660~140~2~2013-11-22 00:00:00.0
2661~140~1~2013-11-22 00:00:00.0
2662~1354~2~2013-11-22 00:00:00.0
2663~1356~2~2013-11-22 00:00:00.0
2664~1353~2~2013-11-22 00:00:00.0
2665~1351~2~2013-11-22 00:00:00.0
2667~1357~2~2013-11-22 00:00:00.0
Scala
import org.apache.spark.sql.types._
val schema = StructType(
  StructField("id", IntegerType, false) ::
  StructField("postId", IntegerType, false) ::
  StructField("voteType", IntegerType, true) ::
  StructField("time", TimestampType, true) :: Nil)
val fileName = "italianVotes.csv"
val italianDF = spark.read.schema(schema).option("sep", "~").csv(fileName)
italianDF.printSchema()
// output
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
Python
from pyspark.sql.types import *
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("postId", IntegerType(), False),
    StructField("voteType", IntegerType(), True),
    StructField("time", TimestampType(), True),
])
file_name = "italianVotes.csv"
italian_df = spark.read.csv(file_name, schema = schema, sep = "~")
# print schema
italian_df.printSchema()
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
My main question is why are the first two fields nullable when I have set them to non-nullable in my schema?
In general Spark Datasets either inherit the nullable property from their parents, or infer it based on the external data types.
You can argue whether this is a good approach or not, but ultimately it is sensible. If the semantics of a data source don't support nullability constraints, then applying a schema cannot enforce them either. At the end of the day it is always better to assume that things can be null than to fail at runtime if the opposite assumption turns out to be incorrect.
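A quick way to see this behaviour, and to enforce the constraint manually if it matters downstream (a sketch, not part of the original answer):
# The reader keeps the types but relaxes nullability:
print(italian_df.schema["id"].nullable)  # True, despite False in the schema passed in

# If the constraint matters, enforce it explicitly (this adds an extra action):
if italian_df.filter(italian_df["id"].isNull() | italian_df["postId"].isNull()).count() > 0:
    raise ValueError("id/postId contain nulls")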
