Spark: Create nested dataframe from a flat one - apache-spark

From the following dataframe:
import spark.implicits._
val data = Seq(
  (1, "value11", "value12"),
  (2, "value21", "value22"),
  (3, "value31", "value32")
)
val df = data.toDF("id", "v1", "v2")
Is it possible to turn df into a nested dataframe whose schema is:
val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("nested", StructType(Array(
    StructField("value1", StringType),
    StructField("value2", StringType)
  )))
))
I know there is an RDD-based solution:
spark.createDataFrame(df.rdd.map(row => Row(row.get(0), Row(row.get(1), row.get(2)))), schema)
But I want to apply it dynamically to many columns, and that will lead to a lot of boilerplate code.
Is there an easier way?
Thanks.

One way you could do this is by using struct:
import org.apache.spark.sql.functions.{col, struct}

// list the column names you want to nest
val columns = df.columns.tail
// use struct to create the nested field, then drop the flat columns
val finalDF = df.withColumn("nested", struct(columns.map(col(_)): _*)).drop(columns: _*)
If you want different names for the nested fields, you can also rename the columns first, as:
val newColumns = List("value1", "value2")
columns.zip(newColumns).foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name._1, name._2)
}
Final schema (without the rename):
finalDF.printSchema()
root
 |-- id: integer (nullable = false)
 |-- nested: struct (nullable = false)
 |    |-- v1: string (nullable = true)
 |    |-- v2: string (nullable = true)
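If you want exactly the schema asked for, with the nested fields named value1 and value2, here is a minimal sketch combining the rename and struct steps (the names renamed and nestedDF are illustrative):
import org.apache.spark.sql.functions.{col, struct}

val newColumns = List("value1", "value2")
// rename v1/v2 first so the struct fields pick up the new names
val renamed = df.columns.tail.zip(newColumns).foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name._1, name._2)
}
// nest the renamed columns and drop the flat originals
val nestedDF = renamed
  .withColumn("nested", struct(newColumns.map(col(_)): _*))
  .drop(newColumns: _*)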

Related

Spark withColumn changes column nullable property in schema

I'm using withColumn in order to override a certain column (applying the same value to the entire data frame). My problem is that withColumn changes the nullable property of the column:
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit

val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
))
val data = Seq(Row("1", "pepsi"), Row("2", "coca cola"))
val rdd = spark.sparkContext.parallelize(data)
val df = spark.createDataFrame(rdd, schema)
val df2 = df.withColumn("name", lit("*******"))
df2.printSchema
result:
root
|-- id: string (nullable = true)
|-- name: string (nullable = false)
The best idea I have is to change the schema after the manipulation; I was wondering if someone has a better idea.
Thanks!
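For reference, here is a minimal sketch of the workaround mentioned above, reapplying the original (fully nullable) schema after the manipulation; it assumes spark is the active SparkSession:
val overridden = df.withColumn("name", lit("*******"))
// rebuild against the original schema, where both fields are declared nullable
val restored = spark.createDataFrame(overridden.rdd, schema)
restored.printSchema // name: string (nullable = true) again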

Pyspark dataframe write and read changes schema

I have a spark dataframe which contains both string and int columns.
But when I write the dataframe to a csv file and then load it back, all the columns are loaded as strings.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                           ["Name", "count"])
Before:
df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: long (nullable = true)
df.write.mode('overwrite').option('header', True).csv(filepath)
new_df = spark.read.option('header', True).csv(filepath)
After:
new_df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- count: string (nullable = true)
How do I specify to store the schema as well while writing?
A CSV file doesn't store the schema, so there is no way to specify it while writing, but we can specify the schema while reading.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = StructType([
    StructField('Name', StringType(), True),
    StructField('count', LongType(), True)
])

# specify the schema while reading
new_df = spark.read.schema(schema).option('header', True).csv(filepath)
new_df.printSchema()

# or else use the inferSchema option; specifying the schema explicitly is more robust
new_df = spark.read.option('header', True).option("inferSchema", True).csv(filepath)

Spark: create a nested schema

With Spark:
import spark.implicits._

val data = Seq(
  (1, ("value11", "value12")),
  (2, ("value21", "value22")),
  (3, ("value31", "value32"))
)
val df = data.toDF("id", "v1")
df.printSchema()
The result is the following:
root
 |-- id: integer (nullable = false)
 |-- v1: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)
Now if I want to create the schema myself, how should I proceed?
val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("nested", ???)
))
Thanks.
According to the example here:
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/types/StructType.html
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val innerStruct =
  StructType(
    StructField("f1", IntegerType, true) ::
    StructField("f2", LongType, false) ::
    StructField("f3", BooleanType, false) :: Nil)

val struct = StructType(
  StructField("a", innerStruct, true) :: Nil)

// Create a Row with the schema defined by struct
val row = Row(Row(1, 2, true))
And in your case it will be:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("nested", StructType(Array(
    StructField("value1", StringType),
    StructField("value2", StringType)
  )))
))
Output:
StructType(
  StructField(id,IntegerType,true),
  StructField(nested,StructType(
    StructField(value1,StringType,true),
    StructField(value2,StringType,true)
  ),true)
)
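As a quick sanity check, you can build a DataFrame with this schema (a minimal sketch; spark is assumed to be the active SparkSession and the row values are illustrative):
import org.apache.spark.sql.Row

val rows = Seq(
  Row(1, Row("value11", "value12")),
  Row(2, Row("value21", "value22"))
)
val nestedDF = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
nestedDF.printSchema()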

How to add column to exploded struct in Spark?

Say I have the following data:
{"id":1, "payload":[{"foo":1, "lol":2},{"foo":2, "lol":2}]}
I would like to explode the payload and add a column to it, like this:
from pyspark.sql import functions as F

df = df.select('id', F.explode('payload').alias('data'))
df = df.withColumn('data.bar', F.col('data.foo') * 2)
However, this results in a dataframe with three columns:
id
data
data.bar
I expected the data.bar to be part of the data struct...
How can I add a column to the exploded struct, instead of adding a top-level column?
df = df.withColumn('data', F.struct(
    df['data']['foo'].alias('foo'),
    (df['data']['foo'] * 2).alias('bar')
))
This will result in:
root
 |-- id: long (nullable = true)
 |-- data: struct (nullable = false)
 |    |-- foo: long (nullable = true)
 |    |-- bar: long (nullable = true)
UPDATE:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType

def func(x):
    tmp = x.asDict()
    tmp['foo'] = tmp.get('foo', 0) * 100
    res = list(zip(*tmp.items()))  # list() so it is indexable under Python 3
    return Row(*res[0])(*res[1])

df = df.withColumn('data', F.udf(func, StructType(
    [StructField('foo', LongType()), StructField('lol', LongType())]))(df['data']))
P.S.
Spark generally does not support in-place operations.
So whenever you want to modify something "in place", you actually need to build a replacement.

Got an Error when using DataFrame.schema.fields.update

I want to cast two columns in my DataFrame. Here is my code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField}

val session = SparkSession
  .builder
  .master("local")
  .appName("UDTransform")
  .getOrCreate()
val df: DataFrame = session
  .createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112), (3, "Hadoop", 113), (4, "Kafka", 114), (5, "Flume", 115), (6, "Hbase", 116)))
  .toDF("CID", "Name", "STD")
df.printSchema()
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
I get these logs from my console:
root
|-- CID: integer (nullable = false)
|-- Name: string (nullable = true)
|-- STD: integer (nullable = false)
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
17/06/28 12:44:32 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 36, Column 31: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
All I want to know is why this error happens and how I can solve it.
I'd appreciate that very much!
You cannot update the schema of a DataFrame, since DataFrames are immutable. But you can create a new DataFrame with the updated schema. Here is how you can do it:
import org.apache.spark.sql.functions.col

val newDF = df.withColumn("CID", col("CID").cast("string"))
  .withColumn("STD", col("STD").cast("string"))
newDF.printSchema()
The schema of newDF is:
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
Your code:
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
In your code,
df.schema.fields
returns an array of StructField, i.e. an
Array[StructField]
If you then try to update it as
df.schema.fields.update(0, StructField("CID", StringType))
this updates the value of that Array[StructField] at position 0, which I think is not what you wanted.
DataFrame.schema.fields.update does not update the DataFrame's schema; rather, it updates the array of StructField returned by DataFrame.schema.fields.
Hope this helps!
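If you have many columns to cast, the same withColumn/cast pattern extends with a fold; a small sketch, using the two column names from the question:
import org.apache.spark.sql.functions.col

val toCast = Seq("CID", "STD")
val castedDF = toCast.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, col(c).cast("string"))
}
castedDF.printSchema()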
