Is Spark rdd.toDF() thread-safe? - multithreading

I have an application that runs on a Spark 2.4.4 cluster. It converts two RDDs to DataFrames with rdd.toDF() and then writes them out to files.
To make better use of worker resources, the application runs the jobs in multiple threads. The code snippet looks like this:
import spark.implicits._
Seq(rdd1, rdd2).par.foreach { rdd =>
  rdd.toDF().write.text("xxx")
}
I found that toDF() is not thread-safe: the application sometimes failed with java.lang.UnsupportedOperationException.
You can reproduce it with the following code snippet (it happens in roughly 1% of runs, and is more likely when the case class has a large number of fields):
package example

import org.apache.spark.sql.SparkSession

object SparkToDF {

  case class SampleData(
    field_1: Option[Int] = None,
    field_2: Option[Long] = Some(0L),
    field_3: Option[Double] = None,
    field_4: Option[Float] = None,
    field_5: Option[Short] = None,
    field_6: Option[Byte] = None,
    field_7: Option[Boolean] = Some(false),
    field_8: Option[StructData] = Some(StructData(Some(0), Some(0L), None, None, None, None, None)),
    field_9: String = "",
    field_10: String = "",
    field_11: String = "",
    field_12: String = "",
    field_13: String = "",
    field_14: String = "",
    field_15: String = "",
    field_16: String = "",
    field_17: String = "",
    field_18: String = "",
    field_19: String = "",
    field_20: String = "",
    field_21: String = "",
    field_22: String = "",
    field_23: String = "",
    field_24: String = "",
    field_25: String = "",
    field_26: String = "",
    field_27: String = "",
    field_28: String = "",
    field_29: String = "",
    field_30: String = "",
    field_31: String = "",
    field_32: String = "",
    field_33: String = "",
    field_34: String = "",
    field_35: String = "",
    field_36: String = "",
    field_37: String = "",
    field_38: String = "",
    field_39: String = "",
    field_40: String = "",
    field_41: String = "",
    field_42: String = "",
    field_43: String = "",
    field_44: String = "",
    field_45: String = "",
    field_46: String = "",
    field_47: String = "",
    field_48: String = "",
    field_49: String = "",
    field_50: String = "",
    field_51: String = "",
    field_52: String = "",
    field_53: String = "",
    field_54: String = "",
    field_55: String = "",
    field_56: String = "",
    field_57: String = "",
    field_58: String = "",
    field_59: String = "",
    field_60: String = ""
  )

  case class StructData(
    field_1: Option[Int],
    field_2: Option[Long],
    field_3: Option[Double],
    field_4: Option[Float],
    field_5: Option[Short],
    field_6: Option[Byte],
    field_7: Option[Boolean]
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ThreadSafeToDF()")
      .getOrCreate()
    import spark.implicits._
    try {
      (1 to 10).par.foreach { _ =>
        (1 to 10000).map(_ => SampleData()).toDF()
      }
    } finally {
      spark.close()
    }
  }
}
for i in {1..200}; do
  echo "iter: $i (`date`)";
  ./bin/spark-submit --class example.SparkToDF your.jar
done
You may get an exception message like "Schema for type A is not supported":
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type scala.Option[scala.Float] is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:812)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:743)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:929)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:742)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:407)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:406)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:406)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:176)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:929)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:176)
at org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:164)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34)
From the error message, the exception was thrown by ScalaReflection.schemaFor().
I looked at the code, and it seems Spark uses Scala reflection to derive the data types; as far as I know, Scala reflection has known concurrency issues:
https://issues.apache.org/jira/browse/SPARK-26555
Thread Safety in Scala reflection with type matching
Is this expected behavior? I could not find any documentation about thread safety when creating DataFrames.
I worked around this by adding a lock when converting an RDD to a DataFrame:
object toDFLock

import spark.implicits._

Seq(rdd1, rdd2).par.foreach { rdd =>
  toDFLock.synchronized(rdd.toDF()).write.text("xxx")
}
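If serializing every toDF() call through a single lock becomes a bottleneck, a minimal untested sketch of an alternative (assuming both RDDs hold the same case class, here the SampleData class from the repro) is to resolve the encoder once on the driver before the parallel loop, so the reflective schema derivation should no longer run concurrently:

import org.apache.spark.sql.{Encoder, Encoders}

// Sketch: build the encoder (and its schema) once, single-threaded.
implicit val sampleDataEncoder: Encoder[SampleData] = Encoders.product[SampleData]

Seq(rdd1, rdd2).par.foreach { rdd =>
  // createDataset reuses the pre-built encoder, so ScalaReflection
  // should not be invoked again from each thread.
  spark.createDataset(rdd).toDF().write.text("xxx")
}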

Related

Consume Spring Flux by Retrofit Observable/Flowable

Is it possible to consume a Flux endpoint as an Observable or Flowable in Retrofit?
What I am aiming to achieve is to emit items from the endpoint to the consumer.
Spring boot WebFlux endpoint
@RestController
@RequestMapping("user")
class UserController {

    private val repo = listOf(
        UserDoc().apply {
            id = 1
            name = "test1"
            createdAt = LocalDateTime.now()
        },
        UserDoc().apply {
            id = 2
            name = "test2"
            createdAt = LocalDateTime.now()
        },
        UserDoc().apply {
            id = 3
            name = "test3"
            createdAt = LocalDateTime.now()
        }
    )

    @GetMapping
    fun findAll(): Flux<UserDoc> = Flux.just(*repo.toTypedArray())
}
Sample response
[
  {
    "id": 1,
    "name": "test1",
    "createdAt": "2022-10-04T15:25:34.540953"
  },
  {
    "id": 2,
    "name": "test2",
    "createdAt": "2022-10-04T15:25:34.540976"
  },
  {
    "id": 3,
    "name": "test3",
    "createdAt": "2022-10-04T15:25:34.54098"
  }
]
Retrofit client
fun main() {
    val userApi = Retrofit.Builder()
        .addCallAdapterFactory(RxJava2CallAdapterFactory.createWithScheduler(Schedulers.io()))
        .addConverterFactory(GsonConverterFactory.create())
        .baseUrl("http://localhost:2020/")
        .client(OkHttpClient())
        .build()
        .create(UserApi::class.java)

    val compositeDisposable = CompositeDisposable()
    compositeDisposable.add(
        userApi.findAll()
            .subscribeOn(Schedulers.io())
            .subscribe { user: User ->
                println(user)
            }
    )
    Thread.currentThread().join()
}

interface UserApi {
    @GET("user")
    fun findAll(): Flowable<User>
}

data class User(
    val id: Long,
    val name: String
)
The current implementation returns this error:
com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was BEGIN_ARRAY at line 1 column 2 path $
If I change Flowable<User> to Flowable<List<User>>, it works fine.
Is it possible to subscribe to the list user by user, or should I create a WebSocket and my own custom Observable?

Retrieve nested Data from Firebase Database android

Snapshot of my firebase realtime database
I want to extract all of the data under the "Orders" node. How should I model my data classes for Android in Kotlin?
I tried this type of modeling, after getting the reference to (Orders/uid/):
Order.kt
data class Order(
    val items: ArrayList<MyItems> = ArrayList(),
    val timeStamp: Long = 0,
    val totalCost: Int = 0
)
MyItems.kt
data class MyItems(
    val Item: ArrayList<Menu> = ArrayList()
)
Menu.kt
data class Menu(
    val menCategory: String = "",
    val menName: String = "",
    val menImage: String = "",
    val menId: String = "",
    val menQuantity: Int = 0,
    val menCost: Int = 0
)
After a lot of thinking and research online, I was finally able to model my classes and attach a value event listener. Here it goes:
Order.kt
data class Order(
    val items: ArrayList<HashMap<String, Any>> = ArrayList(),
    val timeStamp: Long = 0,
    val totalCost: Int = 0
)
OItem.kt
data class OItem(
    val menCategory: String = "",
    val menId: String = "",
    val menImage: String = "",
    val menName: String = "",
    val menPrice: Int = 0,
    var menQuantity: Int = 0
)
MainActivity.kt
val uid = FirebaseAuth.getInstance().uid
val ref = FirebaseDatabase.getInstance().getReference("Orders/$uid")

ref.addListenerForSingleValueEvent(object : ValueEventListener {
    override fun onCancelled(error: DatabaseError) {
        //
    }

    override fun onDataChange(p0: DataSnapshot) {
        p0.children.forEach {
            val order = it.getValue(Order::class.java)
            ordList.add(order!!)
        }
        Log.d("hf", ordList.toString())
    }
})

Spark AnalysisException when "flattening" DataFrame in Spark SQL

I'm using the approach given here to flatten a DataFrame in Spark SQL. Here is my code:
package com.acme.etl.xml

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, SparkSession}

object RuntimeError {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlattenSchema").getOrCreate()

    val rowTag = "idocData"
    val dataFrameReader =
      spark.read
        .option("rowTag", rowTag)
    val xmlUri = "bad_011_1.xml"
    val df =
      dataFrameReader
        .format("xml")
        .load(xmlUri)

    val schema: StructType = df.schema
    val columns: Array[Column] = flattenSchema(schema)
    val df2 = df.select(columns: _*)
  }

  def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
    schema.fields.flatMap(f => {
      val colName: String = if (prefix == null) f.name else prefix + "." + f.name
      val dataType = f.dataType
      dataType match {
        case st: StructType => flattenSchema(st, colName)
        case _: StringType => Array(new org.apache.spark.sql.Column(colName))
        case _: LongType => Array(new org.apache.spark.sql.Column(colName))
        case _: DoubleType => Array(new org.apache.spark.sql.Column(colName))
        case arrayType: ArrayType => arrayType.elementType match {
          case structType: StructType => flattenSchema(structType, colName)
        }
        case _ => Array(new org.apache.spark.sql.Column(colName))
      }
    })
  }
}
Much of the time, this works fine. But for the XML given below:
<Receive xmlns="http://Microsoft.LobServices.Sap/2007/03/Idoc/3/ORDERS05/ZORDERS5/702/Receive">
    <idocData>
        <E2EDP01008GRP xmlns="http://Microsoft.LobServices.Sap/2007/03/Types/Idoc/3/ORDERS05/ZORDERS5/702">
            <E2EDPT1001GRP>
                <E2EDPT2001>
                    <DATAHEADERCOLUMN_DOCNUM>0000000141036013</DATAHEADERCOLUMN_DOCNUM>
                </E2EDPT2001>
                <E2EDPT2001>
                    <DATAHEADERCOLUMN_DOCNUM>0000000141036013</DATAHEADERCOLUMN_DOCNUM>
                </E2EDPT2001>
            </E2EDPT1001GRP>
        </E2EDP01008GRP>
        <E2EDP01008GRP xmlns="http://Microsoft.LobServices.Sap/2007/03/Types/Idoc/3/ORDERS05/ZORDERS5/702">
        </E2EDP01008GRP>
    </idocData>
</Receive>
this exception occurs:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`E2EDP01008GRP`.`E2EDPT1001GRP`.`E2EDPT2001`['DATAHEADERCOLUMN_DOCNUM']' due to data type mismatch: argument 2 requires integral type, however, ''DATAHEADERCOLUMN_DOCNUM'' is of string type.;;
'Project [E2EDP01008GRP#0.E2EDPT1001GRP.E2EDPT2001[DATAHEADERCOLUMN_DOCNUM] AS DATAHEADERCOLUMN_DOCNUM#3, E2EDP01008GRP#0._VALUE AS _VALUE#4, E2EDP01008GRP#0._xmlns AS _xmlns#5]
+- Relation[E2EDP01008GRP#0] XmlRelation(<function0>,Some(/Users/paulreiners/s3/cdi-events-partition-staging/content_acme_purchase_order_json_v1/bad_011_1.xml),Map(rowtag -> idocData, path -> /Users/paulreiners/s3/cdi-events-partition-staging/content_acme_purchase_order_json_v1/bad_011_1.xml),null)
What is causing this?
Your document contains a multi-valued array, so you can't flatten it completely in a single pass: you can't give both elements of the array the same column name.
Also, it's usually a bad idea to use a dot within a column name, since it can easily confuse the Spark parser and has to be escaped at all times.
The usual way to flatten such a dataset is to create new rows for each element of the array.
You can use the explode function to do this, but you will need to call your flatten operation recursively, because explode calls can't be nested.
The following code works as expected, using '_' instead of '.' as the column-name separator:
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.{Dataset, Row}

object RuntimeError {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlattenSchema").getOrCreate()
    val rowTag = "idocData"
    val dataFrameReader = spark.read.option("rowTag", rowTag)
    val xmlUri = "bad_011_1.xml"
    val df = dataFrameReader.format("xml").load(xmlUri)
    val df2 = flatten(df)
  }

  def flatten(df: Dataset[Row], prefixSeparator: String = "_"): Dataset[Row] = {
    import org.apache.spark.sql.functions.{col, explode}

    def mustFlatten(sc: StructType): Boolean =
      sc.fields.exists(f => f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType])

    def flattenAndExplodeOne(sc: StructType, parent: Column = null, prefix: String = null, cols: Array[(DataType, Column)] = Array[(DataType, Column)]()): Array[(DataType, Column)] = {
      val res = sc.fields.foldLeft(cols)((columns, f) => {
        val my_col = if (parent == null) col(f.name) else parent.getItem(f.name)
        val flat_name = if (prefix == null) f.name else s"${prefix}${prefixSeparator}${f.name}"
        f.dataType match {
          case st: StructType => flattenAndExplodeOne(st, my_col, flat_name, columns)
          case dt: ArrayType => {
            if (columns.exists(_._1.isInstanceOf[ArrayType])) {
              columns :+ ((dt, my_col.as(flat_name)))
            } else {
              columns :+ ((dt, explode(my_col).as(flat_name)))
            }
          }
          case dt => columns :+ ((dt, my_col.as(flat_name)))
        }
      })
      res
    }

    var flatDf = df
    while (mustFlatten(flatDf.schema)) {
      val newColumns = flattenAndExplodeOne(flatDf.schema, null, null).map(_._2)
      flatDf = flatDf.select(newColumns: _*)
    }
    flatDf
  }
}
The resulting df2 has the following schema and data:
df2.printSchema
root
|-- E2EDP01008GRP_E2EDPT1001GRP_E2EDPT2001_DATAHEADERCOLUMN_DOCNUM: long (nullable = true)
|-- E2EDP01008GRP__xmlns: string (nullable = true)
df2.show(true)
+--------------------------------------------------------------+--------------------+
|E2EDP01008GRP_E2EDPT1001GRP_E2EDPT2001_DATAHEADERCOLUMN_DOCNUM|E2EDP01008GRP__xmlns|
+--------------------------------------------------------------+--------------------+
| 141036013|http://Microsoft....|
| 141036013|http://Microsoft....|
+--------------------------------------------------------------+--------------------+

Spark Structured streaming kafka avro Producer

I have a DataFrame, let's say:
val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")
I want to send that DataFrame to a Kafka topic using Avro serialization and the Schema Registry. I believe I'm almost there, but I can't seem to get past the Task not serializable error. I understand there is a Kafka sink, but it doesn't talk to the Schema Registry, which is a requirement.
object Holder extends Serializable {

  def prop(): java.util.Properties = {
    val props = new Properties()
    props.put("schema.registry.url", schemaRegistryURL)
    props.put("key.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
    props.put("value.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
    props.put("schema.registry.url", schemaRegistryURL)
    props.put("bootstrap.servers", brokers)
    props
  }

  def vProps(props: java.util.Properties): kafka.utils.VerifiableProperties = {
    val vProps = new kafka.utils.VerifiableProperties(props)
    vProps
  }

  def messageSchema(vProps: kafka.utils.VerifiableProperties): org.apache.avro.Schema = {
    val ser = new KafkaAvroEncoder(vProps)
    val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(subjectValueName)
    val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
    messageSchema
  }

  def avroRecord(messageSchema: org.apache.avro.Schema): org.apache.avro.generic.GenericData.Record = {
    val avroRecord = new GenericData.Record(messageSchema)
    avroRecord
  }

  def ProducerRecord(avroRecord: org.apache.avro.generic.GenericData.Record): org.apache.kafka.clients.producer.ProducerRecord[org.apache.avro.generic.GenericRecord, org.apache.avro.generic.GenericRecord] = {
    val record = new ProducerRecord[GenericRecord, GenericRecord](topicWrite, avroRecord)
    record
  }

  def producer(props: java.util.Properties): KafkaProducer[GenericRecord, GenericRecord] = {
    val producer = new KafkaProducer[GenericRecord, GenericRecord](props)
    producer
  }
}
val prod: (String, String) => String = (
  number: String,
  word: String
) => {
  val prop = Holder.prop()
  val vProps = Holder.vProps(prop)
  val mSchema = Holder.messageSchema(vProps)
  val aRecord = Holder.avroRecord(mSchema)
  aRecord.put("number", number)
  aRecord.put("word", word)
  val record = Holder.ProducerRecord(aRecord)
  val producer = Holder.producer(prop)
  producer.send(record)
  "sent"
}

val prodUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
  udf((
    number: String,
    word: String
  ) => prod(number, word))

val testDF = firstDF.withColumn("sent", prodUDF(col("number"), col("word")))
KafkaProducer is not serializable.
Create the KafkaProducer inside prod() instead of creating it outside.
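A minimal sketch of that suggestion, reusing the Holder helpers from the question; the key point is that every Kafka object is constructed (and closed) inside the function body, so the UDF closure only captures serializable values:

val prod: (String, String) => String = (number: String, word: String) => {
  // Everything Kafka-related is built here, on the executor, so the
  // closure does not capture the non-serializable KafkaProducer.
  val props = Holder.prop()
  val producer = Holder.producer(props)
  try {
    val schema = Holder.messageSchema(Holder.vProps(props))
    val record = Holder.avroRecord(schema)
    record.put("number", number)
    record.put("word", word)
    producer.send(Holder.ProducerRecord(record)).get() // wait for the ack
    "sent"
  } finally {
    producer.close()
  }
}

In practice you would probably reuse one producer per executor (for example a lazy val in a singleton object) rather than opening and closing one per row, but that is beyond this sketch.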

How to use from_json with schema as string (i.e. a JSON-encoded schema)?

I'm reading a stream from Kafka and converting the value from Kafka (which is JSON) into a struct.
from_json has a variant that takes a schema of type String, but I could not find a sample. Please advise what is wrong in the code below.
Error
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '(' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT',
== SQL ==
STRUCT ( `firstName`: STRING, `lastName`: STRING, `email`: STRING, `addresses`: ARRAY ( STRUCT ( `city`: STRING, `state`: STRING, `zip`: STRING ) ) )
-------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
Program
public static void main(String[] args) throws AnalysisException {
    String master = "local[*]";
    String brokers = "quickstart:9092";
    String topics = "simple_topic_6";

    SparkSession sparkSession = SparkSession
            .builder().appName(EmployeeSchemaLoader.class.getName())
            .master(master).getOrCreate();

    String employeeSchema = "STRUCT ( firstName: STRING, lastName: STRING, email: STRING, " +
            "addresses: ARRAY ( STRUCT ( city: STRING, state: STRING, zip: STRING ) ) ) ";

    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel("ERROR");
    SQLContext sqlCtx = sparkSession.sqlContext();

    Dataset<Row> employeeDataset = sparkSession.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", brokers)
            .option("subscribe", topics).load();
    employeeDataset.printSchema();

    employeeDataset = employeeDataset.withColumn("strValue", employeeDataset.col("value").cast("string"));
    employeeDataset = employeeDataset.withColumn("employeeRecord",
            functions.from_json(employeeDataset.col("strValue"), employeeSchema, new HashMap<>()));
    employeeDataset.printSchema();

    employeeDataset.createOrReplaceTempView("employeeView");
    sparkSession.catalog().listTables().show();
    sqlCtx.sql("select * from employeeView").show();
}
Your question helped me to find out that the variant of from_json with a String-based schema was only available in the Java API and has only recently been added to the Spark API for Scala in the upcoming 2.3.0. I had long lived with the strong belief that the Spark API for Scala was always the most feature-rich, and your question helped me learn that it was not so before the change in 2.3.0 (!)
Back to your question: you can actually define the string-based schema in JSON or DDL format.
Writing JSON by hand may be a bit cumbersome, so I'd take a different approach (which, given that I'm a Scala developer, is fairly easy).
Let's first define the schema using the Spark API for Scala.
import org.apache.spark.sql.types._
import spark.implicits._ // for the $"name" syntax (already in scope in spark-shell)

val addressesSchema = new StructType()
  .add($"city".string)
  .add($"state".string)
  .add($"zip".string)

val schema = new StructType()
  .add($"firstName".string)
  .add($"lastName".string)
  .add($"email".string)
  .add($"addresses".array(addressesSchema))
scala> schema.printTreeString
root
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- email: string (nullable = true)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- zip: string (nullable = true)
That seems to match your schema, doesn't it?
With that, converting the schema to a JSON-encoded string is a breeze with the json method.
val schemaAsJson = schema.json
schemaAsJson is exactly your JSON string, which looks pretty...hmmm...complex. For display purposes I'd rather use the prettyJson method.
scala> println(schema.prettyJson)
{
  "type" : "struct",
  "fields" : [ {
    "name" : "firstName",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "lastName",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "email",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "addresses",
    "type" : {
      "type" : "array",
      "elementType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "city",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "state",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "zip",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        } ]
      },
      "containsNull" : true
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
That's your schema in JSON.
You can use DataType and "validate" the JSON string (using DataType.fromJson that Spark uses under the covers for from_json).
import org.apache.spark.sql.types.DataType
val dt = DataType.fromJson(schemaAsJson)
scala> println(dt.sql)
STRUCT<`firstName`: STRING, `lastName`: STRING, `email`: STRING, `addresses`: ARRAY<STRUCT<`city`: STRING, `state`: STRING, `zip`: STRING>>>
All seems fine. Mind if I check this out with a sample dataset?
val rawJsons = Seq("""
  {
    "firstName" : "Jacek",
    "lastName" : "Laskowski",
    "email" : "jacek@japila.pl",
    "addresses" : [
      {
        "city" : "Warsaw",
        "state" : "N/A",
        "zip" : "02-791"
      }
    ]
  }
""").toDF("rawjson")
val people = rawJsons
  .select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
  .select("json.*")                                       // <-- flatten the struct field
  .withColumn("address", explode($"addresses"))           // <-- explode the array field
  .drop("addresses")                                      // <-- no longer needed
  .select("firstName", "lastName", "email", "address.*")  // <-- flatten the struct field
scala> people.show
+---------+---------+---------------+------+-----+------+
|firstName| lastName| email| city|state| zip|
+---------+---------+---------------+------+-----+------+
| Jacek|Laskowski|jacek@japila.pl|Warsaw| N/A|02-791|
+---------+---------+---------------+------+-----+------+
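For completeness, an untested sketch of the other string format mentioned above: from Spark 2.3 onwards the same String-schema variant of from_json should also accept a DDL-formatted schema (essentially what dt.sql printed earlier), so a hand-written schema stays readable. ddlSchema and peopleFromDdl are just illustrative names:

// Sketch (untested): a DDL-style schema string passed straight to from_json.
val ddlSchema =
  "firstName STRING, lastName STRING, email STRING, " +
  "addresses ARRAY<STRUCT<city: STRING, state: STRING, zip: STRING>>"

val peopleFromDdl = rawJsons
  .select(from_json($"rawjson", ddlSchema, Map.empty[String, String]) as "json")
  .select("json.*")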
