Pyspark dataframe write and read changes schema - apache-spark

I have a Spark dataframe which contains both string and int columns.
But when I write the dataframe to a CSV file and then load it back, all the columns are loaded as strings.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)],
["Name", "count"])
Before:
df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: long (nullable = true)
df.write.mode('overwrite').option('header', True).csv(filepath)
new_df = spark.read.option('header', True).csv(filepath)
After:
new_df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: string (nullable = true)
How do I specify to store the schema as well while writing?

A CSV file doesn't store the schema, so there is nothing to specify while writing, but we can specify the schema while reading.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = StructType(
[
StructField('Name', StringType(), True),
StructField('count', LongType(), True)
]
)
#specify schema while reading
new_df = spark.read.schema(schema).option('header', True).csv(filepath)
new_df.printSchema()
# or set the inferSchema option to true, though specifying the schema explicitly is more robust
new_df = spark.read.option('header', True).option("inferSchema",True).csv(filepath)
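If preserving the schema across the write/read round trip is the goal, a self-describing format such as Parquet stores the schema with the data. A minimal sketch, assuming filepath points at a separate location used for the Parquet output:
df.write.mode('overwrite').parquet(filepath)
# Parquet embeds the schema, so count comes back as long
new_df = spark.read.parquet(filepath)
new_df.printSchema()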

Related

pyspark read csv with user specified schema - returned all StringType

I'm new to PySpark. I am trying to read a CSV file from a Data Lake blob using PySpark with a user-specified schema. Below is the code I tried.
from pyspark.sql.types import *
customschema = StructType([
StructField("A", StringType(), True)
,StructField("B", DoubleType(), True)
,StructField("C", TimestampType(), True)
])
df_1 = spark.read.format("csv").options(header="true", schema=customschema, multiline="true", enforceSchema='true').load(destinationPath)
df_1.show()
Out:
+---------+------+--------------------+
| A| B| C|
+---------+------+--------------------+
|322849691|9547.0|2020-09-24 07:30:...|
|322847371| 492.0|2020-09-23 13:15:...|
|322329853|6661.0|2020-09-07 09:45:...|
|322283810| 500.0|2020-09-04 13:12:...|
|322319107| 251.0|2020-09-02 13:51:...|
|322319096| 254.0|2020-09-02 13:51:...|
+---------+------+--------------------+
But all the field types came back as string instead. I am not quite sure what I have done wrong.
df_1.printSchema()
Out:
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
When you use the DataFrameReader load method, you should pass the schema using .schema() and not in the options:
df_1 = spark.read.format("csv") \
.options(header="true", multiline="true")\
.schema(customschema).load(destinationPath)
That's not the same as the spark.read.csv method, which accepts the schema as an argument:
df_1 = spark.read.csv(destinationPath, schema=customschema, header=True)
It works with the following syntax
customschema=StructType([
StructField("A",StringType(), True),
StructField("B",DoubleType(), True),
StructField("C",TimestampType(), True)
])
df = spark.read.csv("test.csv", header=True, sep=";", schema=customschema)
df.show()
df.printSchema()
or you can also use
df = spark.read.load("test.csv",format="csv", sep=";", schema=customschema, header="true")
It is interesting that passing the schema through .options() in the read().options().load() syntax does not work. The schema key is not a recognized CSV option, so it is silently ignored, which is why all the columns come back as strings; the schema has to be supplied through the .schema() method (or the schema argument of spark.read.csv).
Another option would be to cast the data types afterwards:
import pyspark.sql.functions as f
df = (df
    .withColumn("A", f.col("A").cast("string"))
    .withColumn("B", f.col("B").cast("double"))
    .withColumn("C", f.col("C").cast("timestamp"))
)
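If there are many columns, the same casts can be driven from the schema itself rather than listed one by one. A minimal sketch, assuming the customschema defined above:
import pyspark.sql.functions as f
# cast each column to the type declared in customschema
for field in customschema.fields:
    df = df.withColumn(field.name, f.col(field.name).cast(field.dataType))
df.printSchema()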

Validate NULL values from parquet files

I'm reading parquet files from a third party. It seems that parquet always converts the schema of files to nullable columns regardless of how they were written.
When reading these files I would like to reject files that contain a NULL value in a particular column. With csv or json you can do:
schema = StructType([StructField("id", IntegerType(), False), StructField("col1", IntegerType(), False)])
df = spark.read.format("csv").schema(schema).option("mode", "FAILFAST").load(myPath)
And the load will be rejected if it contains a NULL in col1. If you try this with Parquet, the nulls are accepted.
I could do a filter or count on the column for NULL values and raise an error, but from a performance standpoint that is terrible because it adds an extra stage to the job. It also rejects the complete dataframe and all files (yes, the CSV route does this as well).
Is there any way to enforce validation on the files at read time?
I'm using Spark 3, if it helps.
Edit with example:
from pyspark.sql.types import *
schema = StructType([
StructField("Id", IntegerType(), False),
StructField("col1", IntegerType(), True)
])
df = spark.createDataFrame([(1,1),(2, None)], schema)
df.write.format("parquet").mode("overwrite").save("/tmp/parquetValidation/")
df2 = spark.read.format("parquet").load("/tmp/parquetValidation/")
df2.printSchema()
Returns
|-- Id: integer (nullable = true)
|-- col1: integer (nullable = true)
Re-read the file with a schema blocking nulls:
schema = StructType([
StructField("Id", IntegerType(), False),
StructField("col1", IntegerType(), False)
])
df3 = spark.read.format("parquet").schema(schema).option("mode", "FAILFAST").load("/tmp/parquetValidation/")
df3.printSchema()
Returns:
|-- Id: integer (nullable = true)
|-- col1: integer (nullable = true)
I.e. the nullability constraints in the schema are not applied.
Thanks to @Sasa in the comments on the question.
from pyspark.sql import DataFrame
schema = StructType([
StructField("Id", IntegerType(), False),
StructField("col1", IntegerType(), False)
])
df_junk = spark.read.format("parquet").schema(schema).load("/tmp/parquetValidation/")
new_java_schema = spark._jvm.org.apache.spark.sql.types.DataType.fromJson(schema.json())
java_rdd = df_junk._jdf.toJavaRDD()
new_jdf = spark._jsparkSession.createDataFrame(java_rdd, new_java_schema)
df_validate = DataFrame(new_jdf, df.sql_ctx)
df_validate.printSchema()
Returns
|-- Id: integer (nullable = false)
|-- col1: integer (nullable = false)
And running an action causes:
java.lang.RuntimeException: The 1th field 'col1' of input row cannot be null.
It's not nice dropping down to a Java RDD, but it works.
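For comparison, the filter-and-count check mentioned in the question can be written as below. It is a sketch only: it costs an extra action over the data, but it stays in PySpark and avoids the JVM round trip:
from pyspark.sql import functions as F
df_check = spark.read.format("parquet").load("/tmp/parquetValidation/")
null_count = df_check.filter(F.col("col1").isNull()).count()
if null_count > 0:
    raise ValueError("Rejected: found {} NULL values in col1".format(null_count))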

create DataFrame of struct PySpark

How can I create a dataframe of empty structs, please?
Thank you.
from pyspark.sql.types import *
dataxx = []
schema = StructType(
[
StructField('Info1',
StructType([
StructField('fld', IntegerType(),True),
StructField('fld1', IntegerType(),True),
StructField('fld2', IntegerType(),True),
StructField('fld3', IntegerType(),True),
StructField('fld4', IntegerType(),True),
])
),
]
)
df = sqlCtx.createDataFrame(dataxx, schema)
Thank you for your help
If you want to create a DataFrame that has a specific schema but contains no data, you can do it simply by providing an empty list to the createDataFrame function:
from pyspark.sql.types import *
schema = StructType(
[
StructField('Info1',
StructType([
StructField('fld', IntegerType(),True),
StructField('fld1', IntegerType(),True),
StructField('fld2', IntegerType(),True),
StructField('fld3', IntegerType(),True),
StructField('fld4', IntegerType(),True),
])
),
]
)
df = spark.createDataFrame([], schema)
df.printSchema()
root
|-- Info1: struct (nullable = true)
| |-- fld: integer (nullable = true)
| |-- fld1: integer (nullable = true)
| |-- fld2: integer (nullable = true)
| |-- fld3: integer (nullable = true)
| |-- fld4: integer (nullable = true)
Here spark is the SparkSession.
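If you later want the struct column populated rather than empty, rows can be supplied as nested tuples that match the schema. A small sketch under that assumption:
# each outer tuple is one row; the inner tuple fills the Info1 struct
data = [((1, 2, 3, 4, 5),), ((6, 7, 8, 9, 10),)]
df2 = spark.createDataFrame(data, schema)
df2.show(truncate=False)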

Spark: Create nested dataframe from a flat one

From the following dataframe:
import spark.implicits._
val data = Seq(
(1, "value11", "value12"),
(2, "value21", "value22"),
(3, "value31", "value32")
)
val df = data.toDF("id", "v1", "v2")
Is it possible to turn df into a nested dataframe whose schema is the following:
val schema = StructType(Array(
StructField("id", IntegerType),
StructField("nested", StructType(Array(
StructField("value1", StringType),
StructField("value2", StringType)
)))
))
I know there is an RDD solution:
spark.createDataFrame(df.rdd.map(row => Row(row.get(0), Row(row.get(1), row.get(2)))), schema)
But I want to apply it dynamically to many columns, and that would lead to a lot of boilerplate code.
Is there an easier way?
Thanks.
One way you could do this is by using struct:
//list the column names you want to nest
val columns = df.columns.tail
//use struct to create the nested field and drop the original columns
val finalDF = df.withColumn("nested", struct(columns.map(col(_)): _*)).drop(columns: _*)
You can also rename the columns beforehand if you want:
val newColumns = List("value1", "value2")
columns.zip(newColumns).foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name._1, name._2)
}
Final Schema:
finalDF.printSchema()
root
|-- id: integer (nullable = false)
|-- nested: struct (nullable = false)
| |-- v1: string (nullable = true)
| |-- v2: string (nullable = true)
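For readers working in PySpark rather than Scala, an equivalent sketch (assuming a dataframe df with the same id, v1, v2 columns) looks like this:
from pyspark.sql import functions as F
cols = df.columns[1:]  # every column except "id"
final_df = df.withColumn("nested", F.struct(*[F.col(c) for c in cols])).drop(*cols)
final_df.printSchema()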

Spark DataFrame Schema Nullable Fields

I wrote the following code in both Scala and Python; however, the DataFrame that is returned doesn't appear to apply the non-nullable fields in the schema I am applying. italianVotes.csv is a CSV file with '~' as the separator and four fields. I'm using Spark 2.1.0.
italianVotes.csv
2657~135~2~2013-11-22 00:00:00.0
2658~142~2~2013-11-22 00:00:00.0
2659~142~1~2013-11-22 00:00:00.0
2660~140~2~2013-11-22 00:00:00.0
2661~140~1~2013-11-22 00:00:00.0
2662~1354~2~2013-11-22 00:00:00.0
2663~1356~2~2013-11-22 00:00:00.0
2664~1353~2~2013-11-22 00:00:00.0
2665~1351~2~2013-11-22 00:00:00.0
2667~1357~2~2013-11-22 00:00:00.0
Scala
import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", IntegerType, false) ::
StructField("postId", IntegerType, false) ::
StructField("voteType", IntegerType, true) ::
StructField("time", TimestampType, true) :: Nil)
val fileName = "italianVotes.csv"
val italianDF = spark.read.schema(schema).option("sep", "~").csv(fileName)
italianDF.printSchema()
// output
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
Python
from pyspark.sql.types import *
schema = StructType([
StructField("id", IntegerType(), False),
StructField("postId", IntegerType(), False),
StructField("voteType", IntegerType(), True),
StructField("time", TimestampType(), True),
])
file_name = "italianVotes.csv"
italian_df = spark.read.csv(file_name, schema = schema, sep = "~")
# print schema
italian_df.printSchema()
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
My main question is why are the first two fields nullable when I have set them to non-nullable in my schema?
In general, Spark Datasets either inherit the nullable property from their parents or infer it based on the external data types.
You can argue whether this is a good approach or not, but ultimately it is sensible. If the semantics of a data source don't support nullability constraints, then applying a schema cannot enforce them either. At the end of the day, it is better to assume that things can be null than to fail at runtime if the opposite assumption turns out to be incorrect.
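To see exactly which nullability flags Spark kept after the read, you can inspect the schema fields directly. A small sketch, assuming the italian_df from the Python example above:
# list each column's name together with the nullable flag Spark assigned
print([(f.name, f.nullable) for f in italian_df.schema.fields])
# [('id', True), ('postId', True), ('voteType', True), ('time', True)]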
