Spark DataFrame Schema Nullable Fields - apache-spark

I wrote the following code in both Scala and Python; however, the DataFrame that is returned doesn't appear to apply the non-nullable fields from the schema I am supplying. italianVotes.csv is a CSV file with '~' as a separator and four fields. I'm using Spark 2.1.0.
italianVotes.csv
2657~135~2~2013-11-22 00:00:00.0
2658~142~2~2013-11-22 00:00:00.0
2659~142~1~2013-11-22 00:00:00.0
2660~140~2~2013-11-22 00:00:00.0
2661~140~1~2013-11-22 00:00:00.0
2662~1354~2~2013-11-22 00:00:00.0
2663~1356~2~2013-11-22 00:00:00.0
2664~1353~2~2013-11-22 00:00:00.0
2665~1351~2~2013-11-22 00:00:00.0
2667~1357~2~2013-11-22 00:00:00.0
Scala
import org.apache.spark.sql.types._
val schema = StructType(
  StructField("id", IntegerType, false) ::
  StructField("postId", IntegerType, false) ::
  StructField("voteType", IntegerType, true) ::
  StructField("time", TimestampType, true) :: Nil)
val fileName = "italianVotes.csv"
val italianDF = spark.read.schema(schema).option("sep", "~").csv(fileName)
italianDF.printSchema()
// output
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
Python
from pyspark.sql.types import *
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("postId", IntegerType(), False),
    StructField("voteType", IntegerType(), True),
    StructField("time", TimestampType(), True),
])
file_name = "italianVotes.csv"
italian_df = spark.read.csv(file_name, schema = schema, sep = "~")
# print schema
italian_df.printSchema()
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
My main question is why are the first two fields nullable when I have set them to non-nullable in my schema?

In general, Spark Datasets either inherit the nullable property from their parents or infer it from the external data type.
You can argue whether this is a good approach or not, but ultimately it is sensible. If the semantics of a data source don't support nullability constraints, then applying a schema cannot enforce them either. At the end of the day it is always better to assume that things can be null than to fail at runtime if the opposite assumption turns out to be incorrect.
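If the constraint itself matters downstream, one way to re-impose it (a minimal sketch, not the only option; raw_df and strict_df are just illustrative names) is to re-apply the schema on top of the rows after the read. PySpark's createDataFrame verifies rows against non-nullable fields (verifySchema defaults to True), so a NULL id or postId fails when the data is evaluated:
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("postId", IntegerType(), False),
    StructField("voteType", IntegerType(), True),
    StructField("time", TimestampType(), True),
])

# the CSV source relaxes the first two fields to nullable = true
raw_df = spark.read.csv("italianVotes.csv", schema=schema, sep="~")

# re-applying the schema restores nullable = false and makes PySpark
# verify each row against it when an action runs
strict_df = spark.createDataFrame(raw_df.rdd, schema)
strict_df.printSchema()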

Related

Validate NULL values from parquet files

I'm reading Parquet files from a third party. It seems that Parquet always converts the schema of the files to nullable columns, regardless of how they were written.
When reading these files I would like to reject any file that contains a NULL value in a particular column. With CSV or JSON you can do:
schema = StructType([StructField("id", IntegerType(), False), StructField("col1", IntegerType(), False)])
df = spark.read.format("csv").schema(schema).option("mode", "FAILFAST").load(myPath)
And the load will be rejected if it contains a NULL in col1. If you try this with Parquet, it will be accepted.
I could do a filter or count on the column for NULL values and raise an error, but from a performance standpoint that is terrible because I get an extra stage in the job. It would also reject the complete dataframe and all the files (yes, the CSV route does this as well).
Is there any way to enforce validation on the files on read?
I'm using Spark 3, if it helps.
Edit with example:
from pyspark.sql.types import *
schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("col1", IntegerType(), True)
])
df = spark.createDataFrame([(1,1),(2, None)], schema)
df.write.format("parquet").mode("overwrite").save("/tmp/parquetValidation/")
df2 = spark.read.format("parquet").load("/tmp/parquetValidation/")
df2.printSchema()
Returns
|-- Id: integer (nullable = true)
|-- col1: integer (nullable = true)
Re-read the file with a schema blocking nulls:
schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("col1", IntegerType(), False)
])
df3 = spark.read.format("parquet").schema(schema).option("mode", "FAILFAST").load("/tmp/parquetValidation/")
df3.printSchema()
Returns:
|-- Id: integer (nullable = true)
|-- col1: integer (nullable = true)
I.e. the schema is not applied.
Thanks to @Sasa in the comments on the question.
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("col1", IntegerType(), False)
])
df_junk = spark.read.format("parquet").schema(schema).load("/tmp/parquetValidation/")

# build a JVM-side copy of the schema and re-wrap the underlying rows with it
new_java_schema = spark._jvm.org.apache.spark.sql.types.DataType.fromJson(schema.json())
java_rdd = df_junk._jdf.toJavaRDD()
new_jdf = spark._jsparkSession.createDataFrame(java_rdd, new_java_schema)
df_validate = DataFrame(new_jdf, df_junk.sql_ctx)
df_validate.printSchema()
Returns
|-- Id: integer (nullable = false)
|-- col1: integer (nullable = false)
And running an action causes:
java.lang.RuntimeException: The 1th field 'col1' of input row cannot be null.
Not nice dropping to a Java RDD, but it works.
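A pure-Python alternative (a sketch under the same setup; df_loose and df_strict are illustrative names, and it incurs the cost of serialising rows through Python) is to rebuild the DataFrame from its RDD with the strict schema. PySpark's verifySchema defaults to True for RDD input, so a NULL in col1 raises an error once an action evaluates the rows:
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("col1", IntegerType(), False)
])
df_loose = spark.read.format("parquet").load("/tmp/parquetValidation/")

# re-apply the strict schema; rows are verified against it when the job runs
df_strict = spark.createDataFrame(df_loose.rdd, schema)
df_strict.count()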

Pyspark dataframe write and read changes schema

I have a Spark dataframe which contains both string and int columns.
But when I write the dataframe to a CSV file and then load it later, all the columns are loaded as string.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                           ["Name", "count"])
Before:
df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: long (nullable = true)
df.write.mode('overwrite').option('header', True).csv(filepath)
new_df = spark.read.option('header', True).csv(filepath)
After:
new_df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: string (nullable = true)
How do I specify to store the schema as well while writing?
We don't have to specify a schema while writing, but we can specify the schema while reading.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = StructType(
    [
        StructField('Name', StringType(), True),
        StructField('count', LongType(), True)
    ]
)
#specify schema while reading
new_df = spark.read.schema(schema).option('header', True).csv(filepath)
new_df.printSchema()
# or else set the inferSchema option to true, but specifying the schema explicitly will be more robust
new_df = spark.read.option('header', True).option("inferSchema",True).csv(filepath)
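CSV has no way to embed the column types in the file, so they must be supplied or inferred on every read. If re-declaring the schema is a concern, a self-describing format such as Parquet keeps the types across the round trip (nullability aside, as discussed in the first answer). A small sketch, where parquet_path is a placeholder path:
df.write.mode('overwrite').parquet(parquet_path)
new_df = spark.read.parquet(parquet_path)
new_df.printSchema()   # Name stays string, count stays long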

Python 3 function to loop over pandas data frame to change schema

I'm converting a bunch of pandas data frames into Spark dataframes and then writing them to HDFS. I'm also explicitly specifying the schema to change all data types into string, to avoid type-merge conflicts.
I'm trying to write a function that will loop through all the pandas df columns and create the schema, which I can then use to convert to Spark.
Here is what I have so far:
def creating_schema(df):
    for columnName in df.columns:
        schema = StructType([(StructField('"' + columnName + '"', StringType(), True))])
        print(schema)
    return(schema)
This outputs:
StructType(List(StructField("column_1",StringType,true)))
StructType(List(StructField("column_2",StringType,true)))
StructType(List(StructField("column_3",StringType,true)))
StructType(List(StructField("column_4",StringType,true)))
StructType(List(StructField("column_5",StringType,true)))
However, I believe I need something in this format for it to work:
schema = StructType([StructField("column_1", StringType(), True),
                     StructField("column_2", StringType(), True),
                     StructField("column_3", StringType(), True),
                     StructField("column_4", StringType(), True),
                     StructField("column_5", StringType(), True)
                     ])
Any help in writing this function would be helpful!
Thanks!
Try:
from pyspark.sql.types import StructType, StructField, StringType

def creating_schema(df):
    sf = []
    for columnName in df.columns:
        sf.append(StructField(columnName, StringType(), True))
    return StructType(sf)
Proof:
import pandas as pd

pdf = pd.DataFrame(columns=["column_1", "column_2", "column_3", "column_4", "column_5"])
schema = creating_schema(pdf)
sdf = sqlContext.createDataFrame(sc.emptyRDD(), schema)
sdf.printSchema()
root
|-- column_1: string (nullable = true)
|-- column_2: string (nullable = true)
|-- column_3: string (nullable = true)
|-- column_4: string (nullable = true)
|-- column_5: string (nullable = true)
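Equivalently, the same schema can be built with a list comprehension; this is just a stylistic variant of the loop above:
def creating_schema(df):
    # one StructField per pandas column, all typed as string
    return StructType([StructField(c, StringType(), True) for c in df.columns])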

How to save a non-nullable dataframe to a Hive table?

I have a dataframe in Spark which has a non-nullable column in it. When I save it to Hive and then read it back from Hive, the non-nullable column is nullable. What might be wrong?
For some context, I am taking an existing dataframe and changing its schema to include that non-nullable property.
df = spark.table("myhive_table")
df.printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
schema = StructType([StructField("col1", spark_types.StringType(), True),
                     StructField("col2", spark_types.DoubleType(), False),
                     ])
df2 = spark.createDataFrame(df.rdd,schema)
df2.printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = false)
spark.sql('drop table myhive_table')
df.write.saveAsTable("myhive_table",overwrite = True)
spark.table("myhive_table").printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
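(No answer was posted here, but the first answer above applies: the Hive table source does not carry nullability constraints, so Spark marks every column nullable on read. If the constraint is needed after loading, the same re-application shown in the question can be repeated on the reloaded table. A sketch, assuming col2 is stored as a string as in the original table:)
from pyspark.sql.types import StructType, StructField, StringType

strict_schema = StructType([StructField("col1", StringType(), True),
                            StructField("col2", StringType(), False)])
reloaded = spark.table("myhive_table")

# rebuild on top of the rows read back from Hive to restore nullable = false
strict_df = spark.createDataFrame(reloaded.rdd, strict_schema)
strict_df.printSchema()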

Got an Error when using DataFrame.schema.fields.update

I want to cast two columns in my DataFrame. Here is my code:
val session = SparkSession
  .builder
  .master("local")
  .appName("UDTransform").getOrCreate()

var df: DataFrame = session
  .createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112), (3, "Hadoop", 113), (4, "Kafka", 114), (5, "Flume", 115), (6, "Hbase", 116)))
  .toDF("CID", "Name", "STD")
df.printSchema()
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
I get these logs from my console:
root
|-- CID: integer (nullable = false)
|-- Name: string (nullable = true)
|-- STD: integer (nullable = false)
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
17/06/28 12:44:32 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 36, Column 31: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
All I want to know is why this ERROR happens and how I can solve it.
I'd appreciate any help very much!
You cannot update the schema of a DataFrame in place, since DataFrames are immutable,
but you can create a new DataFrame with the updated schema and assign it to a new variable.
Here is how you can do it:
val newDF = df.withColumn("CID", col("CID").cast("string"))
.withColumn("STD", col("STD").cast("string"))
newDF.printSchema()
The schema of newDF is
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
Your code:
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
In your code,
df.schema.fields returns an Array of StructField, i.e.
Array[StructField]
If you then try to update it as
df.schema.fields.update(0, StructField("CID", StringType))
this updates the value of that Array[StructField] at the 0th position, which is not what you wanted.
DataFrame.schema.fields.update does not update the dataframe's schema; rather, it updates the array of StructField returned by DataFrame.schema.fields.
Hope this helps!
