Spark: create a nested schema

With Spark:
import spark.implicits._
val data = Seq(
(1, ("value11", "value12")),
(2, ("value21", "value22")),
(3, ("value31", "value32"))
)
val df = data.toDF("id", "v1")
df.printSchema()
The result is the following:
root
|-- id: integer (nullable = false)
|-- v1: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
Now if I want to create the schema myself, how should I proceed?
val schema = StructType(Array(
StructField("id", IntegerType),
StructField("nested", ???)
))
Thanks.

According to the example here:
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/types/StructType.html
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val innerStruct =
StructType(
StructField("f1", IntegerType, true) ::
StructField("f2", LongType, false) ::
StructField("f3", BooleanType, false) :: Nil)
val struct = StructType(
StructField("a", innerStruct, true) :: Nil)
// Create a Row with the schema defined by struct
val row = Row(Row(1, 2, true))
And in your case it will be:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val schema = StructType(Array(
StructField("id", IntegerType),
StructField("nested", StructType(Array(
StructField("value1", StringType),
StructField("value2", StringType)
)))
))
Output:
StructType(
StructField(id,IntegerType,true),
StructField(nested,StructType(
StructField(value1,StringType,true),
StructField(value2,StringType,true)
),true)
)
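To sanity-check the hand-built schema, you can apply it to a few rows (a minimal sketch, assuming a running SparkSession named spark):
import org.apache.spark.sql.Row
// each nested tuple becomes an inner Row inside the outer Row
val rows = spark.sparkContext.parallelize(Seq(
  Row(1, Row("value11", "value12")),
  Row(2, Row("value21", "value22"))
))
val nestedDF = spark.createDataFrame(rows, schema)
nestedDF.printSchema()
// root
//  |-- id: integer (nullable = true)
//  |-- nested: struct (nullable = true)
//  |    |-- value1: string (nullable = true)
//  |    |-- value2: string (nullable = true)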

Related

Spark withColumn changes column nullable property in schema

I'm using withColumn to override a certain column (applying the same value to the entire data frame). My problem is that withColumn changes the nullable property of the column:
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
val schema = StructType(Array(
StructField("id", StringType, true),
StructField("name", StringType, true)
))
val data = Seq(Row("1", "pepsi"), Row("2", "coca cola"))
val rdd = spark.sparkContext.parallelize(data)
val df = spark.createDataFrame(rdd, schema)
val df2 = df.withColumn("name", lit("*******"))
df2.printSchema
result:
root
|-- id: string (nullable = true)
|-- name: string (nullable = false)
The best idea I have is to change the schema after the manipulation; I was wondering if someone has a better idea.
Thanks!
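One workaround (my own suggestion, not from this thread) is to wrap the literal in a when without an otherwise; an expression that can produce null keeps the column nullable:
import org.apache.spark.sql.functions.{col, lit, when}
// `when` with no `otherwise` branch may evaluate to null,
// so Spark keeps nullable = true on the resulting column
val masked = df.withColumn("name", when(col("name").isNotNull, lit("*******")))
masked.printSchema
// root
//  |-- id: string (nullable = true)
//  |-- name: string (nullable = true)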

Pyspark dataframe write and read changes schema

I have a spark dataframe which contains both string and int columns.
But when I write the dataframe to a csv file and then load it later, all the columns are loaded as string.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)],
["Name", "count"])
Before:
df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: long (nullable = true)
df.write.mode('overwrite').option('header', True).csv(filepath)
new_df = spark.read.option('header', True).csv(filepath)
After:
new_df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: string (nullable = true)
How do I store the schema as well while writing?
We can't store the schema while writing to CSV (the format doesn't carry type information), but we can specify the schema while reading.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = StructType(
[
StructField('Name', StringType(), True),
StructField('count', LongType(), True)
]
)
#specify schema while reading
new_df = spark.read.schema(schema).option('header', True).csv(filepath)
new_df.printSchema()
#or use the inferSchema option instead, but specifying the schema explicitly is more robust
new_df = spark.read.option('header', True).option("inferSchema",True).csv(filepath)

Spark: Create nested dataframe from a flat one

From the following dataframe:
import spark.implicits._
val data = Seq(
(1, "value11", "value12"),
(2, "value21", "value22"),
(3, "value31", "value32")
)
val df = data.toDF("id", "v1", "v2")
Is it possible to turn df into a nested dataframe whose schema is:
val schema = StructType(Array(
StructField("id", IntegerType),
StructField("nested", StructType(Array(
StructField("value1", StringType),
StructField("value2", StringType)
)))
))
I know there is an RDD solution:
spark.createDataFrame(df.rdd.map(row => Row(row.get(0), Row(row.get(1), row.get(2)))), schema)
But applying this dynamically to many columns would lead to a lot of boilerplate code.
Is there an easier way?
Thx.
One way you could do this is by using struct:
import org.apache.spark.sql.functions.{col, struct}
//list the column names you want to nest
val columns = df.columns.tail
//use struct to create the nested field, then drop the original columns
val finalDF = df.withColumn("nested", struct(columns.map(col(_)): _*)).drop(columns: _*)
You can also rename the columns first if you want:
val newColumns = List("value1", "value2")
columns.zip(newColumns).foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name._1, name._2)
}
Final Schema:
finalDF.printSchema()
root
|-- id: integer (nullable = false)
|-- nested: struct (nullable = false)
| |-- v1: string (nullable = true)
| |-- v2: string (nullable = true)
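If you also want the field names from the target schema (value1, value2) without a separate rename pass, one option is to alias each column inside struct. A sketch, not from the original answer:
import org.apache.spark.sql.functions.{col, struct}
// the alias becomes the nested field name
val aliased = df
  .withColumn("nested", struct(col("v1").as("value1"), col("v2").as("value2")))
  .drop("v1", "v2")
aliased.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- nested: struct (nullable = false)
//  |    |-- value1: string (nullable = true)
//  |    |-- value2: string (nullable = true)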

Spark DataFrame Schema Nullable Fields

I wrote the following code in both Scala and Python, however the returned DataFrame doesn't appear to honor the non-nullable fields in the schema I am applying. italianVotes.csv is a csv file with '~' as a separator and four fields. I'm using Spark 2.1.0.
italianVotes.csv
2657~135~2~2013-11-22 00:00:00.0
2658~142~2~2013-11-22 00:00:00.0
2659~142~1~2013-11-22 00:00:00.0
2660~140~2~2013-11-22 00:00:00.0
2661~140~1~2013-11-22 00:00:00.0
2662~1354~2~2013-11-22 00:00:00.0
2663~1356~2~2013-11-22 00:00:00.0
2664~1353~2~2013-11-22 00:00:00.0
2665~1351~2~2013-11-22 00:00:00.0
2667~1357~2~2013-11-22 00:00:00.0
Scala
import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", IntegerType, false) ::
StructField("postId", IntegerType, false) ::
StructField("voteType", IntegerType, true) ::
StructField("time", TimestampType, true) :: Nil)
val fileName = "italianVotes.csv"
val italianDF = spark.read.schema(schema).option("sep", "~").csv(fileName)
italianDF.printSchema()
// output
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
Python
from pyspark.sql.types import *
schema = StructType([
StructField("id", IntegerType(), False),
StructField("postId", IntegerType(), False),
StructField("voteType", IntegerType(), True),
StructField("time", TimestampType(), True),
])
file_name = "italianVotes.csv"
italian_df = spark.read.csv(file_name, schema = schema, sep = "~")
# print schema
italian_df.printSchema()
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
My main question is why are the first two fields nullable when I have set them to non-nullable in my schema?
In general Spark Datasets either inherit the nullable property from their parents, or infer it based on the external data types.
You can argue whether it is a good approach or not, but ultimately it is sensible. If the semantics of a data source don't support nullability constraints, then applying a schema cannot enforce them either. At the end of the day it is always better to assume that things can be null than to fail at runtime if the opposite assumption turns out to be incorrect.
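If you really need the non-nullable flags anyway, a common workaround (my addition, with exactly the caveat the answer warns about) is to re-apply the schema after loading; Spark takes it verbatim without validating, so a null id would only fail later at runtime:
// nullable = false is taken as-is and NOT checked against the data
val forcedDF = spark.createDataFrame(italianDF.rdd, schema)
forcedDF.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- postId: integer (nullable = false)
//  |-- voteType: integer (nullable = true)
//  |-- time: timestamp (nullable = true)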

How to cast all columns of a dataframe to string

I have a mixed type dataframe.
I am reading this dataframe from a Hive table using the
spark.sql('select a,b,c from table') command.
Some columns are int, bigint or double and others are string. There are 32 columns in total.
Is there any way in pyspark to convert all columns in the data frame to string type ?
Just:
from pyspark.sql.functions import col
table = spark.table("table")
table.select([col(c).cast("string") for c in table.columns])
Here's a one-line solution in Scala:
df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
Let's see an example:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = Seq(
Row(1, "a"),
Row(5, "z")
)
val schema = StructType(
List(
StructField("num", IntegerType, true),
StructField("letter", StringType, true)
)
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
schema
)
df.printSchema
//root
//|-- num: integer (nullable = true)
//|-- letter: string (nullable = true)
val newDf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
newDf.printSchema
//root
//|-- num: string (nullable = true)
//|-- letter: string (nullable = true)
I hope it helps
from pyspark.sql.types import StringType
for col in df_data.columns:
    df_data = df_data.withColumn(col, df_data[col].cast(StringType()))
For Scala, Spark version > 2.0:
case class Row(id: Int, value: Double)
import spark.implicits._
import org.apache.spark.sql.functions._
val r1 = Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)).toDF()
r1.show
+---+-----+
| id|value|
+---+-----+
| 1| 1.0|
| 2| 2.0|
| 3| 3.0|
+---+-----+
val castedDF = r1.columns.foldLeft(r1)((current, c) => current.withColumn(c, col(c).cast("String")))
castedDF.printSchema
root
|-- id: string (nullable = false)
|-- value: string (nullable = false)
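An equivalent spelling (my addition) uses selectExpr with SQL cast syntax:
// builds "cast(id as string) as id", "cast(value as string) as value"
val castedDF2 = r1.selectExpr(r1.columns.map(c => s"cast($c as string) as $c"): _*)
castedDF2.printSchema
// root
//  |-- id: string (nullable = false)
//  |-- value: string (nullable = false)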
You can cast a single column like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.withColumn("id", F.col("id").cast(T.StringType()))
and to cast all columns:
df = df.select([F.col(c).cast(T.StringType()) for c in df.columns])
