I am using Spark 1.5.0 and I want to read a sequence file where the key is a filename (ext) and the value is a list of Java objects of type MyClass.
Here is my code for doing that without Spark.
import java.io.{FileInputStream, ObjectInputStream}
import scala.collection.JavaConverters._

val ois = new ObjectInputStream(new FileInputStream("/path/to/seqfile"))
val data = ois.readObject.asInstanceOf[java.util.List[MyClass]]
val scalalist = data.asScala
I want to use Spark to do the same. However, I am not sure, once the serialized data is available as bytes, how to create an RDD where the second element of each tuple is cast to a List of MyClass objects.
import java.io.ByteArrayInputStream
import org.apache.hadoop.io.{BytesWritable, Text}

val seq_rdd = sc.sequenceFile("/path/to/seqfile", classOf[Text], classOf[BytesWritable])
val seq_formatted_rdd = seq_rdd.map { case (text, bytes) => (text.toString, bytes.copyBytes) }
val my_rdd = seq_formatted_rdd.map { case (text, ser_bytes) => (text, new ByteArrayInputStream(ser_bytes)) }
I get the following exception because ByteArrayInputStream does not implement Serializable:
object not serializable (class: java.io.ByteArrayInputStream, value: java.io.ByteArrayInputStream@73d5a077)
After that, I want to do the following:
val my_rdd1 = my_rdd.map { case (text, bytestream) => (text, new ObjectInputStream(bytestream).readObject.asInstanceOf[java.util.List[MyClass]])}
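For what it's worth, a minimal sketch that folds the deserialization into a single map, so no ByteArrayInputStream ever becomes an RDD element (this assumes the values really were written with ObjectOutputStream and that MyClass is Serializable and on the executor classpath):
import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.hadoop.io.{BytesWritable, Text}
import scala.collection.JavaConverters._

val my_rdd = sc.sequenceFile("/path/to/seqfile", classOf[Text], classOf[BytesWritable])
  .map { case (text, bytes) =>
    // deserialize right here, so only (String, List[MyClass]) leaves the closure
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes.copyBytes))
    val list = ois.readObject.asInstanceOf[java.util.List[MyClass]]
    ois.close()
    (text.toString, list.asScala.toList)
  }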
Related
I am using Structured Streaming and the following code works:
val j = new Jedis() // a Redis client, which is not serializable
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call a Jedis function here
batchDF.rdd.mapPartitions(...)
}}
But the following code throws an exception: object not serializable (class: redis.clients.jedis.Jedis, value: redis.clients.jedis.Jedis@a8e0378)
The only change is from RDD to DataFrame:
val j = new Jedis() // a Redis client, which is not serializable
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call a Jedis function here
batchDF.mapPartitions(...) // the only change: batchDF.rdd is now batchDF
}}
My Jedis code should be executed on the driver and never reach an executor. I would expect Spark RDDs and DataFrames to have similar APIs, so why does this happen?
I used Ctrl+click to step into the lower-level code. batchDF.mapPartitions goes to:
@Experimental
@InterfaceStability.Evolving
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] = {
  new Dataset[U](
    sparkSession,
    MapPartitions[T, U](func, logicalPlan),
    implicitly[Encoder[U]])
}
and batchDF.rdd.mapPartitions goes to:
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
My Spark version is 2.4.3.
My simplest version of the code is below, and I just found something else...
val j = new Jedis() // a Redis client, which is not serializable
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call a Jedis function here
batchDF.mapPartitions(x => {
val arr = x.grouped(2).toArray // this line matters
})
// the only change: batchDF.rdd is now batchDF
}}
See this DataFrame API implementation; internally it is calling rdd.mapPartitions with your function.
/**
 * Returns a new RDD by applying a function to each partition of this DataFrame.
 * @group rdd
 * @since 1.3.0
 */
def mapPartitions[R: ClassTag](f: Iterator[Row] => Iterator[R]): RDD[R] = {
  rdd.mapPartitions(f)
}
There is no difference; somewhere else you might have made a mistake.
AFAIK, ideally this should be the way:
batchDF.mapPartitions { yourPartition =>
  // better to create a JedisPool and take an object from it rather than new Jedis
  val j = new Jedis()
  val result = yourPartition.map { row =>
    // do some processing here
    row
  }
  j.close() // release and take care of connections/resources here
  result
}
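As a side note, here is a slightly fuller sketch of that JedisPool comment, written with foreachPartition since the Redis call is a pure side effect (the pool lives in a lazily initialized object so each executor JVM builds it once; the host, port, stream arguments and the RedisConnection name are only illustrative):
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

// illustrative holder: the lazy val means the pool is created once per executor JVM
object RedisConnection {
  lazy val pool = new JedisPool(new JedisPoolConfig(), "localhost", 6379)
}

batchDF.rdd.foreachPartition { partition =>
  val j: Jedis = RedisConnection.pool.getResource
  try {
    // iterate the partition and call Jedis as needed; arguments here are hypothetical
    partition.foreach(row => j.xtrim("mystream", 1000L, true))
  } finally {
    j.close() // returns the connection to the pool
  }
}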
I use the partitionBy function to divide my RDD into multiple partitions, and then I want to put each partition into ES.
EsSpark.saveToEs needs an RDD, but partitionBy leaves me with an Iterator per partition. Is there a method to save the Iterator to ES, or to convert the Iterator to an RDD? I use ES-Spark 5.2.2.
The code is below:
var entry = Array("vpn","linux","error")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
var resultRDD=stream.map( record => {
val json = parse(record.value())
val x = json.extract[vpnLogEntry]
if (!x.innerIP.equals("-")){
("vpn",x)
}else{
("linux",x)
}
})
resultRDD.foreachRDD { (rdd,durationTime) =>
val entryToIndexDis = rdd.context.broadcast(entry.zipWithIndex.toMap)
val indexToEntryDis = rdd.context.broadcast(entry.zipWithIndex.map(_.swap).toMap)
rdd.partitionBy(new Partitioner {
override def numPartitions: Int = entryToIndexDis.value.size
override def getPartition(key: Any): Int = {
entryToIndexDis.value.get(key.toString).get
}
}).mapPartitionsWithIndex((index, data) => {
val index_type = indexToEntryDis.value(index)
//here, I want to put vpn data into vpn/vpn of ES,
//and put linux data into linux/linux of ES.
//the variable data is of type Iterator,
//so I cannot use the EsSpark.saveToEs function
data
}, true).count()
}
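A hedged sketch of one workaround: drop the custom Partitioner and issue one EsSpark.saveToEs call per entry type by filtering the pair RDD on its key (EsSpark.saveToEs(rdd, resource) comes from elasticsearch-spark; the per-entry filtering here is only an illustration and has not been tested against ES 5.2.2):
import org.elasticsearch.spark.rdd.EsSpark

resultRDD.foreachRDD { rdd =>
  rdd.cache() // reused once per entry type below
  entry.foreach { e =>
    val sub = rdd.filter { case (k, _) => k == e }.map { case (_, v) => v }
    EsSpark.saveToEs(sub, s"$e/$e") // e.g. "vpn/vpn", "linux/linux"
  }
  rdd.unpersist()
}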
I am trying to convert an RDD to a DataFrame in Spark 2.0:
val conf=new SparkConf().setAppName("dataframes").setMaster("local")
val sc=new SparkContext(conf)
val sqlCon=new SQLContext(sc)
import sqlCon.implicits._
val rdd=sc.textFile("/home/cloudera/alpha.dat").persist()
val row=rdd.first()
val data=rdd.filter { x => !x.contains(row) }
data.foreach { x => println(x) }
case class person(name:String,age:Int,city:String)
val rdd2=data.map { x => x.split(",") }
val rdd3=rdd2.map { x => person(x(0),x(1).toInt,x(2)) }
val df=rdd3.toDF()
df.printSchema();
df.registerTempTable("alpha")
val df1=sqlCon.sql("select * from alpha")
df1.foreach { x => println(x) }
but I am getting the below error at toDF() ---> "val df=rdd3.toDF()"
Multiple markers at this line:
- Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
- Implicit conversion found: rdd3 ⇒ rddToDatasetHolder(rdd3): (implicit evidence$4:
org.apache.spark.sql.Encoder[person])org.apache.spark.sql.DatasetHolder[person]
How do I convert the above to a DataFrame using toDF()?
Cloudera & Spark 2.0? hmmm, didn't think we supported that yet :)
Anyway, first of all, you don't need to call .persist() on your RDD, so you can remove that bit. Secondly, since person is a case class, you should capitalize its name.
Lastly, in Spark 2.0 you no longer import sqlContext.implicits._ to implicitly build a DataFrame schema; you now import spark.implicits._. This is hinted at by your error message.
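For reference, a minimal sketch of that Spark 2.x entry point, with the builder options mirroring the question:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("dataframes")
  .getOrCreate()
import spark.implicits._

val df = rdd3.toDF()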
There was a simple mistake: I had defined the case class inside the main method. After moving it outside the method (so that an implicit Encoder can be derived for it), I am able to convert the RDD to a DataFrame.
package sparksql
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoders
import org.apache.spark.SparkContext
object asw {

  case class Person(name: String, age: Int, city: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("Dataframe")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val rdd1 = sc.textFile("/home/cloudera/alpha.dat")
    val row = rdd1.first()
    val data = rdd1.filter { x => !x.contains(row) }
    val rdd2 = data.map { x => x.split(",") }
    val df = rdd2.map { x => Person(x(0), x(1).toInt, x(2)) }.toDF()
    df.createOrReplaceTempView("rdd21")
    spark.sql("select * from rdd21").show()
  }
}
I'm trying to create a user-defined type in Spark SQL, but I receive
com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType even when using their example. Has anyone made this work?
My code:
test("udt serialisation") {
val points = Seq(new ExamplePoint(1.3, 1.6), new ExamplePoint(1.3, 1.8))
val df = SparkContextForStdout.context.parallelize(points).toDF()
}
#SQLUserDefinedType(udt = classOf[ExamplePointUDT])
case class ExamplePoint(val x: Double, val y: Double)
/**
* User-defined type for [[ExamplePoint]].
*/
class ExamplePointUDT extends UserDefinedType[ExamplePoint] {

  override def sqlType: DataType = ArrayType(DoubleType, false)

  override def pyUDT: String = "pyspark.sql.tests.ExamplePointUDT"

  override def serialize(obj: Any): Seq[Double] = {
    obj match {
      case p: ExamplePoint =>
        Seq(p.x, p.y)
    }
  }

  override def deserialize(datum: Any): ExamplePoint = {
    datum match {
      case values: Seq[_] =>
        val xy = values.asInstanceOf[Seq[Double]]
        assert(xy.length == 2)
        new ExamplePoint(xy(0), xy(1))
      case values: util.ArrayList[_] =>
        val xy = values.asInstanceOf[util.ArrayList[Double]].asScala
        new ExamplePoint(xy(0), xy(1))
    }
  }

  override def userClass: Class[ExamplePoint] = classOf[ExamplePoint]
}
The useful part of the stack trace is this:
com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType
java.lang.ClassCastException: com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
It seems that the UDT needs to be used inside another class to work (as the type of a field). One solution to use it directly is to wrap it in a Tuple1:
test("udt serialisation") {
val points = Seq(new Tuple1(new ExamplePoint(1.3, 1.6)), new Tuple1(new ExamplePoint(1.3, 1.8)))
val df = SparkContextForStdout.context.parallelize(points).toDF()
df.collect().foreach(println(_))
}
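An equivalent sketch with a named wrapper instead of Tuple1, following the same observation that the UDT works as the type of a field (the PointRecord case class is purely illustrative and carries the same caveats):
case class PointRecord(point: ExamplePoint)

test("udt serialisation via wrapper") {
  val points = Seq(PointRecord(new ExamplePoint(1.3, 1.6)), PointRecord(new ExamplePoint(1.3, 1.8)))
  val df = SparkContextForStdout.context.parallelize(points).toDF()
  df.collect().foreach(println(_))
}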
I am new to Spark SQL. The concat function is not available in Spark SQL queries, so we registered a SQL function; within this function I need to access another table. For that we have written a Spark SQL query on the SQLContext object.
When I invoke this query I get a NullPointerException. Please can you help with this?
Thanks in advance
// This is my code
class SalesHistory_2(sqlContext:SQLContext,sparkContext:SparkContext) extends Serializable {
import sqlContext._
import sqlContext.createSchemaRDD
try{
sqlContext.registerFunction("MaterialTransformation", Material_Transformation _)
def Material_Transformation(Material_ID: String): String =
{
var material:String =null;
var dd = sqlContext.sql("select * from product_master")
material
}
/* Product master*/
val productRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_PRODUCT_MASTER.txt")
val product_schemaString = productRDD.first
val product_withoutHeaders = dropHeader(productRDD)
val product_schema = StructType(product_schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val productdata = product_withoutHeaders.map{_.replace("|", "| ")}.map(x=> x.split("\\|"))
var product_rowRDD = productdata.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val product_srctableRDD = sqlContext.applySchema(product_rowRDD, product_schema)
product_srctableRDD.registerTempTable("product_master")
cacheTable("product_master")
/* Customer master*/
/* Sales History*/
val srcRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_TRADE_SALES_HISTORY_DS_4_20150119.txt")
val schemaString= srcRDD.first
val withoutHeaders = dropHeader(srcRDD)
val schema = StructType(schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val lines = withoutHeaders.map {_.replace("|", "| ")}.map(x=> x.split("\\|"))
var rowRDD = lines.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val srctableRDD = sqlContext.applySchema(rowRDD, schema)
srctableRDD.registerTempTable("SALES_HISTORY")
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
val path: Path = Path ("D:/Realease 8.0/files/output/")
try {
path.deleteRecursively(continueOnFailure = false)
} catch {
case e: IOException => // some file could not be deleted
}
val successRDDToFile = srcResults.map { x => x.mkString("|")}
successRDDToFile.coalesce(1).saveAsTextFile("D:/Realease 8.0/files/output/")
}
catch {
case ex: Exception => println(ex) // TODO: handle error
}
this.sparkContext.stop()
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) lines.drop(1) else lines
})
}
}
The answer here is rather short and probably disappointing - you simply cannot do something like this.
The general rule in Spark is that you cannot trigger an action or a transformation from within another action or transformation; or, to be a little more precise, outside the driver the SparkContext is no longer accessible / defined.
Calling Spark SQL for each row in the Sales History RDD looks like a very bad idea:
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
You'd be better off using a join between your tables and forgetting your custom function:
val srcResults = sqlContext.sql("SELECT s.*, p.* FROM SALES_HISTORY s join product_master p on s.Material_ID=p.ID")
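The same join expressed through the DataFrame API, as a sketch (it assumes a Spark version with DataFrames, 1.3 or later, and assumes the product_master key column ID exactly as in the SQL above):
val sales = sqlContext.table("SALES_HISTORY")
val products = sqlContext.table("product_master")
val joined = sales.join(products, sales("Material_ID") === products("ID"))
joined.show()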