Spark Struct structfield names getting changed in UDF - apache-spark

I am trying to pass a struct column to a UDF in Spark. Inside the UDF the field names are gone and have been replaced by names based on the column position. How do I fix it?
object TestCSV {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("localTest").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val inputData = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter","|")
.option("header", "true")
.load("test.csv")
inputData.printSchema()
inputData.show()
val groupedData = inputData.withColumn("name",struct(inputData("firstname"),inputData("lastname")))
val udfApply = groupedData.withColumn("newName",processName(groupedData("name")))
udfApply.show()
}
def processName = udf((input:Row) =>{
println(input)
println(input.schema)
Map("firstName" -> input.getAs[String]("firstname"), "lastName" -> input.getAs[String]("lastname"))
})
}
Output:
root
|-- id: string (nullable = true)
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
+---+---------+--------+
| id|firstname|lastname|
+---+---------+--------+
| 1| jack| reacher|
| 2| john| Doe|
+---+---------+--------+
Error:
[jack,reacher]
StructType(StructField(i[1],StringType,true), StructField(i[2],StringType,true))
17/03/08 09:45:35 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.IllegalArgumentException: Field "firstname" does not exist.

What you are encountering is really strange. After playing around a bit, I think it may be related to a problem in the optimizer. The problem seems to lie not in the UDF but in the struct function.
It works for me (Spark 1.6.3) when I cache groupedData. Without caching I get the exception you reported:
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
object Demo {
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[1]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions._
def processName = udf((input: Row) => {
Map("firstName" -> input.getAs[String]("firstname"), "lastName" -> input.getAs[String]("lastname"))
})
val inputData =
sc.parallelize(
Seq(("1", "Kevin", "Costner"))
).toDF("id", "firstname", "lastname")
val groupedData = inputData.withColumn("name", struct(inputData("firstname"), inputData("lastname")))
.cache() // does not work without cache
val udfApply = groupedData.withColumn("newName", processName(groupedData("name")))
udfApply.show()
}
}
Alternatively you can use the RDD API to make your struct, but this is not really nice:
case class Name(firstname:String,lastname:String) // define outside main
val groupedData = inputData.rdd
.map{r =>
(r.getAs[String]("id"),
Name(
r.getAs[String]("firstname"),
r.getAs[String]("lastname")
)
)
}
.toDF("id","name")

Related

Is it possible to write a dataframe into 2 files of different type?

We can use the following API to write a dataframe to local files.
df.write.parquet(path)
df.write.json(path)
However, can I write to both Parquet and JSON in one go, without computing the dataframe twice?
By the way, I don't want to cache the data in memory, because it's too big.
If you don't cache/persist the dataframe, it will need to be recomputed for each output format.
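If memory is the concern, persisting to disk only is one way to avoid recomputing the lineage without holding the data in RAM. This is a minimal sketch, assuming Spark 2.x or later and enough local disk on the executors; the names WriteTwiceDemo and writeTwice and the output paths are illustrative:

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object WriteTwiceDemo {
  // Persists the dataframe on disk, writes it in two formats, then frees the blocks.
  def writeTwice(df: DataFrame, parquetPath: String, jsonPath: String): Unit = {
    val persisted = df.persist(StorageLevel.DISK_ONLY)
    try {
      persisted.write.parquet(parquetPath) // first action computes and stores the partitions
      persisted.write.json(jsonPath)       // second action reads the stored partitions
    } finally {
      persisted.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WriteTwiceDemo")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // The target directories must not exist yet (default save mode is ErrorIfExists).
    val base = Files.createTempDirectory("write-twice")
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    writeTwice(df, base.resolve("parquet").toString, base.resolve("json").toString)
    spark.stop()
  }
}
```

The data is still written once more than strictly needed (to the local block store), so this trades disk I/O for not re-running the upstream computation.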
We can implement an org.apache.spark.sql.execution.datasources.FileFormat that writes every row to two underlying formats.
DuplicateOutFormat demo
/**
* Very Dangerous Toy Code. DO NOT USE IN PRODUCTION.
*/
class DuplicateOutFormat
extends FileFormat
with DataSourceRegister
with Serializable {
override def inferSchema(sparkSession: SparkSession, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] = {
throw new UnsupportedOperationException()
}
override def prepareWrite(sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory = {
val format1 = options("format1")
val format2 = options("format2")
val format1Instance = DataSource.lookupDataSource(format1, sparkSession.sessionState.conf)
.newInstance().asInstanceOf[FileFormat]
val format2Instance = DataSource.lookupDataSource(format2, sparkSession.sessionState.conf)
.newInstance().asInstanceOf[FileFormat]
val writerFactory1 = format1Instance.prepareWrite(sparkSession, job, options, dataSchema)
val writerFactory2 = format2Instance.prepareWrite(sparkSession, job, options, dataSchema)
new OutputWriterFactory {
override def getFileExtension(context: TaskAttemptContext): String = ".dup"
override def newInstance(path: String, dataSchema: StructType, context: TaskAttemptContext): OutputWriter = {
val path1 = path.replace(".dup", writerFactory1.getFileExtension(context))
val path2 = path.replace(".dup", writerFactory2.getFileExtension(context))
val writer1 = writerFactory1.newInstance(path1, dataSchema, context)
val writer2 = writerFactory2.newInstance(path2, dataSchema, context)
new OutputWriter {
override def write(row: InternalRow): Unit = {
writer1.write(row)
writer2.write(row)
}
override def close(): Unit = {
writer1.close()
writer2.close()
}
}
}
}
}
override def shortName(): String = "dup"
}
SPI
Register the format through a service provider file, /META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, whose content is the fully qualified class name:
com.github.sparkdemo.DuplicateOutFormat
demo usage
class DuplicateOutFormatTest extends FunSuite {
val spark = SparkSession.builder()
.master("local")
.getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
test("testDuplicateWrite") {
val data = Array(
("k1", "fa", "20210901", 16),
("k2", null, "20210902", 15),
("k3", "df", "20210903", 14),
("k4", null, "20210904", 13)
)
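// Join with Paths.get so a missing trailing separator in java.io.tmpdir does not corrupt the path
val tempDir = java.nio.file.Paths.get(System.getProperty("java.io.tmpdir"), "spark-dup-test" + System.nanoTime()).toString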
val df = sc.parallelize(data).toDF("k", "col2", "day", "col4")
df.write
.option("format1", "csv")
.option("format2", "orc")
.format("dup").save(tempDir)
df.show(1000, false)
}
}
WARNING
Spark SQL has some tight coupling in DataFrameWriter#saveToV1Source and other internal code that we can't change. This custom DuplicateOutFormat is just a demo and lacks tests. Full demo on GitHub.

Save RDD as csv file using coalesce function

I am trying to stream Twitter data using Apache Spark in IntelliJ. However, when I use the coalesce function, it says that it cannot resolve symbol coalesce. Here is my main code:
val spark = SparkSession.builder().appName("twitterStream").master("local[*]").getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val streamContext = new StreamingContext(sc, Seconds(5))
val filters = Array("Singapore")
val filtered = TwitterUtils.createStream(streamContext, None, filters)
val englishTweets = filtered.filter(_.getLang() == "en")
//englishTweets.print()
englishTweets.foreachRDD{rdd =>
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
val tweets = rdd.map( field =>
(
field.getId,
field.getUser.getScreenName,
field.getCreatedAt.toInstant.toString,
field.getText.toLowerCase.split(" ").filter(_.matches("^[a-zA-Z0-9 ]+$")).fold("")((a, b) => a + " " + b).trim,
sentiment(field.getText)
)
)
val tweetsdf = tweets.toDF("userID", "user", "createdAt", "text", "sentimentType")
tweetsdf.printSchema()
tweetsdf.show(false)
}.coalesce(1).write.csv("hdfs://localhost:9000/usr/sparkApp/test/testing.csv")
I tried this with my own dataset: I read it in and applied coalesce while writing, and it produced the expected output. Please refer to the code below; it may help you.
import org.apache.spark.sql.SparkSession
import com.spark.Rdd.DriverProgram
import org.apache.log4j.{ Logger, Level }
import org.apache.spark.sql.SaveMode
import java.sql.Date
object JsonDataDF {
System.setProperty("hadoop.home.dir", "C:\\hadoop"); // This is the system property which is useful to find the winutils.exe
Logger.getLogger("org").setLevel(Level.WARN) // This will remove Logs
case class AOK(appDate:Date, arr:String, base:String, Comments:String)
val dp = new DriverProgram
val spark = dp.getSparkSession()
def main(args : Array[String]): Unit = {
import spark.implicits._
val jsonDf = spark.read.option("multiline", "true").json("C:\\Users\\34979\\Desktop\\Work\\Datasets\\JSONdata.txt").as[AOK]
jsonDf.coalesce(1) // Refer Here
.write
.mode(SaveMode.Overwrite)
.option("header", "true")
.format("csv")
.save("C:\\Users\\34979\\Desktop\\Work\\Datasets\\JsonToCsv")
}
}
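The compile error in the question comes from the fact that foreachRDD returns Unit, so there is nothing to call coalesce on after its closing brace. The write has to happen inside the closure, once per batch. This is a minimal sketch of such a per-batch writer, assuming Spark 2.x or later; the helper name saveBatch, the append mode and the empty-batch check are my assumptions, not part of the original code:

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object SaveBatchDemo {
  // Writes one micro-batch as a single CSV part file, appending to outputDir.
  // Empty batches are skipped so they do not leave empty part files behind.
  def saveBatch(batch: DataFrame, outputDir: String): Unit = {
    if (!batch.head(1).isEmpty) {
      batch.coalesce(1)
        .write
        .mode(SaveMode.Append)
        .csv(outputDir)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SaveBatchDemo")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Two in-memory dataframes stand in for two streaming batches.
    val outputDir = Files.createTempDirectory("tweets").resolve("csv").toString
    val batch1 = Seq((1L, "alice", "hello")).toDF("userID", "user", "text")
    val batch2 = Seq((2L, "bob", "hi"), (3L, "carol", "hey")).toDF("userID", "user", "text")

    saveBatch(batch1, outputDir)
    saveBatch(batch2, outputDir)

    println(spark.read.csv(outputDir).count())
    spark.stop()
  }
}
```

In the question's code, the equivalent would be to call something like saveBatch(tweetsdf, ...) as the last statement inside the foreachRDD block and to drop the trailing .coalesce(1).write.csv(...) chain. Note that the target is a directory of part files, not a single file named testing.csv.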

Spark AnalysisException when "flattening" DataFrame in Spark SQL

I'm using the approach given here to flatten a DataFrame in Spark SQL. Here is my code:
package com.acme.etl.xml
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, SparkSession}
object RuntimeError { def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("FlattenSchema").getOrCreate()
val rowTag = "idocData"
val dataFrameReader =
spark.read
.option("rowTag", rowTag)
val xmlUri = "bad_011_1.xml"
val df =
dataFrameReader
.format("xml")
.load(xmlUri)
val schema: StructType = df.schema
val columns: Array[Column] = flattenSchema(schema)
val df2 = df.select(columns: _*)
}
def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
schema.fields.flatMap(f => {
val colName: String = if (prefix == null) f.name else prefix + "." + f.name
val dataType = f.dataType
dataType match {
case st: StructType => flattenSchema(st, colName)
case _: StringType => Array(new org.apache.spark.sql.Column(colName))
case _: LongType => Array(new org.apache.spark.sql.Column(colName))
case _: DoubleType => Array(new org.apache.spark.sql.Column(colName))
case arrayType: ArrayType => arrayType.elementType match {
case structType: StructType => flattenSchema(structType, colName)
}
case _ => Array(new org.apache.spark.sql.Column(colName))
}
})
}
}
Much of the time, this works fine. But for the XML given below:
<Receive xmlns="http://Microsoft.LobServices.Sap/2007/03/Idoc/3/ORDERS05/ZORDERS5/702/Receive">
<idocData>
<E2EDP01008GRP xmlns="http://Microsoft.LobServices.Sap/2007/03/Types/Idoc/3/ORDERS05/ZORDERS5/702">
<E2EDPT1001GRP>
<E2EDPT2001>
<DATAHEADERCOLUMN_DOCNUM>0000000141036013</DATAHEADERCOLUMN_DOCNUM>
</E2EDPT2001>
<E2EDPT2001>
<DATAHEADERCOLUMN_DOCNUM>0000000141036013</DATAHEADERCOLUMN_DOCNUM>
</E2EDPT2001>
</E2EDPT1001GRP>
</E2EDP01008GRP>
<E2EDP01008GRP xmlns="http://Microsoft.LobServices.Sap/2007/03/Types/Idoc/3/ORDERS05/ZORDERS5/702">
</E2EDP01008GRP>
</idocData>
</Receive>
this exception occurs:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`E2EDP01008GRP`.`E2EDPT1001GRP`.`E2EDPT2001`['DATAHEADERCOLUMN_DOCNUM']' due to data type mismatch: argument 2 requires integral type, however, ''DATAHEADERCOLUMN_DOCNUM'' is of string type.;;
'Project [E2EDP01008GRP#0.E2EDPT1001GRP.E2EDPT2001[DATAHEADERCOLUMN_DOCNUM] AS DATAHEADERCOLUMN_DOCNUM#3, E2EDP01008GRP#0._VALUE AS _VALUE#4, E2EDP01008GRP#0._xmlns AS _xmlns#5]
+- Relation[E2EDP01008GRP#0] XmlRelation(<function0>,Some(/Users/paulreiners/s3/cdi-events-partition-staging/content_acme_purchase_order_json_v1/bad_011_1.xml),Map(rowtag -> idocData, path -> /Users/paulreiners/s3/cdi-events-partition-staging/content_acme_purchase_order_json_v1/bad_011_1.xml),null)
What is causing this?
Your document contains a multi-valued array so you can't flatten it completely in one pass since you can't give both elements of the array the same column name.
Also, it's usually a bad idea to use a dot within a column name, since it can easily confuse the Spark parser and will need to be escaped at all times.
The usual way to flatten such a dataset is to create new rows for each element of the array.
You can use the explode function to do this but you will need to recursively call your flatten operation because explode can't be nested.
The following code works as expected, using '_' instead of '.' as column name separator:
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.{Dataset, Row}
object RuntimeError {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("FlattenSchema").getOrCreate()
val rowTag = "idocData"
val dataFrameReader = spark.read.option("rowTag", rowTag)
val xmlUri = "bad_011_1.xml"
val df = dataFrameReader.format("xml").load(xmlUri)
val df2 = flatten(df)
}
def flatten(df: Dataset[Row], prefixSeparator: String = "_") : Dataset[Row] = {
import org.apache.spark.sql.functions.{col,explode}
def mustFlatten(sc: StructType): Boolean =
sc.fields.exists(f => f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType])
def flattenAndExplodeOne(sc: StructType, parent: Column = null, prefix: String = null, cols: Array[(DataType,Column)] = Array[(DataType,Column)]()): Array[(DataType,Column)] = {
val res = sc.fields.foldLeft(cols)( (columns, f) => {
val my_col = if (parent == null) col(f.name) else parent.getItem(f.name)
val flat_name = if (prefix == null) f.name else s"${prefix}${prefixSeparator}${f.name}"
f.dataType match {
case st: StructType => flattenAndExplodeOne(st, my_col, flat_name, columns)
case dt: ArrayType => {
if (columns.exists(_._1.isInstanceOf[ArrayType])) {
columns :+ ((dt, my_col.as(flat_name)))
} else {
columns :+ ((dt, explode(my_col).as(flat_name)))
}
}
case dt => columns :+ ((dt, my_col.as(flat_name)))
}
})
res
}
var flatDf = df
while (mustFlatten(flatDf.schema)) {
val newColumns = flattenAndExplodeOne(flatDf.schema, null, null).map(_._2)
flatDf = flatDf.select(newColumns:_*)
}
flatDf
}
}
The resulting df2 has the following schema and data:
df2.printSchema
root
|-- E2EDP01008GRP_E2EDPT1001GRP_E2EDPT2001_DATAHEADERCOLUMN_DOCNUM: long (nullable = true)
|-- E2EDP01008GRP__xmlns: string (nullable = true)
df2.show(true)
+--------------------------------------------------------------+--------------------+
|E2EDP01008GRP_E2EDPT1001GRP_E2EDPT2001_DATAHEADERCOLUMN_DOCNUM|E2EDP01008GRP__xmlns|
+--------------------------------------------------------------+--------------------+
| 141036013|http://Microsoft....|
| 141036013|http://Microsoft....|
+--------------------------------------------------------------+--------------------+
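To see what explode does in isolation: it emits one output row per array element, which is why the two E2EDPT2001 entries become two rows above. This is a small sketch on in-memory data, assuming Spark 2.x or later; the case classes, the object name ExplodeDemo and the column names are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeDemo {
  // Hypothetical nested shape: each group holds an array of structs.
  case class Item(docnum: Long)
  case class Group(id: String, items: Seq[Item])

  // Step 1: explode the array into one row per element.
  // Step 2: pull the struct field up and name it with '_' as separator.
  def explodeItems(df: DataFrame): DataFrame =
    df.select(col("id"), explode(col("items")).as("item"))
      .select(col("id"), col("item.docnum").as("item_docnum"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExplodeDemo")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Group("a", Seq(Item(1L), Item(2L))),
      Group("b", Seq.empty[Item]) // explode drops rows whose array is empty
    ).toDF()

    explodeItems(df).show(false)
    spark.stop()
  }
}
```

Rows with an empty or null array disappear with explode, which matches the empty second E2EDP01008GRP element not producing a row of its own. If such rows must be kept, explode_outer (available from Spark 2.2) is the usual alternative.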

Spark How to RDD[JSONObject] to Dataset

I am reading data from an RDD whose elements are of type com.google.gson.JsonObject. I am trying to convert it into a Dataset but have no clue how to do this.
import com.google.gson.{JsonParser}
import org.apache.hadoop.io.LongWritable
import org.apache.spark.sql.{SparkSession}
object tmp {
class people(name: String, age: Long, phone: String)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
val parser = new JsonParser();
val jsonObject1 = parser.parse("""{"name":"abc","age":23,"phone":"0208"}""").getAsJsonObject()
val jsonObject2 = parser.parse("""{"name":"xyz","age":33}""").getAsJsonObject()
val PairRDD = sc.parallelize(List(
(new LongWritable(1l), jsonObject1),
(new LongWritable(2l), jsonObject2)
))
val rdd1 =PairRDD.map(element => element._2)
import spark.implicits._
//How to create Dataset as schema People from rdd1?
}
}
Even trying to print rdd1 elements throws
object not serializable (class: org.apache.hadoop.io.LongWritable, value: 1)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (1,{"name":"abc","age":23,"phone":"0208"}))
Basically I get this RDD[(LongWritable, JsonObject)] from a BigQuery table, and I want to convert it to a Dataset so I can apply SQL transformations.
I've intentionally left phone out of the second record: BigQuery returns nothing for an element whose value is null.
Thanks for the clarification. You need to register the classes with Kryo. The following should work. I am running in spark-shell, so I had to stop the old context and create a new SparkContext with a config that includes the registered Kryo classes.
import com.google.gson.{JsonParser}
import org.apache.hadoop.io.LongWritable
import org.apache.spark.SparkContext
sc.stop()
val conf = sc.getConf
conf.registerKryoClasses( Array(classOf[LongWritable], classOf[JsonParser] ))
conf.get("spark.kryo.classesToRegister")
val sc = new SparkContext(conf)
val parser = new JsonParser();
val jsonObject1 = parser.parse("""{"name":"abc","age":23,"phone":"0208"}""").getAsJsonObject()
val jsonObject2 = parser.parse("""{"name":"xyz","age":33}""").getAsJsonObject()
val pairRDD = sc.parallelize(List(
(new LongWritable(1l), jsonObject1),
(new LongWritable(2l), jsonObject2)
))
val rdd = pairRDD.map(element => element._2)
rdd.collect()
// res9: Array[com.google.gson.JsonObject] = Array({"name":"abc","age":23,"phone":"0208"}, {"name":"xyz","age":33})
val jsonstrs = rdd.map(e=>e.toString).collect()
val df = spark.read.json( sc.parallelize(jsonstrs) )
df.printSchema
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- phone: string (nullable = true)
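The collect() followed by parallelize above pulls every record through the driver. This is a sketch of one way to stay distributed and end up with a typed Dataset, assuming Spark 2.2 or later (where spark.read.json accepts a Dataset[String]). The People case class with an optional phone is my assumption about the intended schema, and plain JSON strings stand in for the result of JsonObject.toString:

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

object JsonToDatasetDemo {
  // Assumed target schema; phone is optional because some records omit it.
  case class People(name: String, age: Long, phone: Option[String])

  // Parses JSON strings on the executors and maps them onto the case class.
  // Passing an explicit schema means a field missing from every record
  // still yields a (null) column instead of an analysis error.
  def toPeople(spark: SparkSession, json: Dataset[String]): Dataset[People] = {
    import spark.implicits._
    val schema = Encoders.product[People].schema
    spark.read.schema(schema).json(json).as[People]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonToDatasetDemo")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // With the RDD from the question this would be rdd.map(_.toString).toDS()
    val json = Seq(
      """{"name":"abc","age":23,"phone":"0208"}""",
      """{"name":"xyz","age":33}"""
    ).toDS()

    val people = toPeople(spark, json)
    people.createOrReplaceTempView("people")
    spark.sql("select name, age from people where phone is null").show()
    spark.stop()
  }
}
```

Converting each JsonObject to a String inside map, before any shuffle or collect, also means only strings have to be serialized between stages.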

convert RDD to Dataframe in 2.0

I am trying to convert rdd to dataframe in Spark2.0
val conf=new SparkConf().setAppName("dataframes").setMaster("local")
val sc=new SparkContext(conf)
val sqlCon=new SQLContext(sc)
import sqlCon.implicits._
val rdd=sc.textFile("/home/cloudera/alpha.dat").persist()
val row=rdd.first()
val data=rdd.filter { x => !x.contains(row) }
data.foreach { x => println(x) }
case class person(name:String,age:Int,city:String)
val rdd2=data.map { x => x.split(",") }
val rdd3=rdd2.map { x => person(x(0),x(1).toInt,x(2)) }
val df=rdd3.toDF()
df.printSchema();
df.registerTempTable("alpha")
val df1=sqlCon.sql("select * from alpha")
df1.foreach { x => println(x) }
but I am getting the error below at toDF(), on the line "val df=rdd3.toDF()":
Multiple markers at this line:
- Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
- Implicit conversion found: rdd3 ⇒ rddToDatasetHolder(rdd3): (implicit evidence$4:
org.apache.spark.sql.Encoder[person])org.apache.spark.sql.DatasetHolder[person]
How can I convert the above to a DataFrame using toDF()?
Cloudera & Spark 2.0? hmmm, didn't think we supported that yet :)
Anyway, first of all you don't need to call .persist() on your RDD so you can remove that bit. Secondly, since Person is a case class you should capitalize its name.
Lastly, in Spark 2.0 you no longer call import sqlContext.implicits._ to implicitly build a DataFrame schema, you now call import spark.implicits._. This is hinted at by your error message.
There was a simple mistake: I had defined the case class inside the main method. After moving it outside main, I am able to convert the RDD to a DataFrame.
package sparksql
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoders
import org.apache.spark.SparkContext
object asw {
case class Person(name:String,age:Int,city:String)
def main(args: Array[String]): Unit = {
val conf=new SparkConf().setMaster("local").setAppName("Dataframe")
val sc=new SparkContext(conf)
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
val rdd1=sc.textFile("/home/cloudera/alpha.dat")
val row=rdd1.first()
val data=rdd1.filter { x => !x.contains(row) }
val rdd2=data.map { x => x.split(",") }
val df=rdd2.map { x => Person(x(0),x(1).toInt,x(2)) }.toDF()
df.createOrReplaceTempView("rdd21")
spark.sql("select * from rdd21").show()
}
}

Resources