I am using Spark 1.5.0 and I want to read a sequence file where the key is a filename (ext) and the value is a list of Java objects of type MyClass.
Here is my code for doing that without Spark.
import java.io.{FileInputStream, ObjectInputStream}
import scala.collection.JavaConverters._

val ois = new ObjectInputStream(new FileInputStream("/path/to/seqfile"))
val data = ois.readObject.asInstanceOf[java.util.List[MyClass]]
val scalalist = data.asScala
I want to use Spark to do the same. However, I am not sure, once the serialized data is available as bytes, how to create an RDD where the second element of each tuple is cast to a List of MyClass objects.
import java.io.ByteArrayInputStream
import org.apache.hadoop.io.{BytesWritable, Text}

val seq_rdd = sc.sequenceFile("/path/to/seqfile", classOf[Text], classOf[BytesWritable])
val seq_formatted_rdd = seq_rdd.map { case (text, bytes) => (text.toString, bytes.copyBytes) }
val my_rdd = seq_formatted_rdd.map { case (text, ser_bytes) => (text, new ByteArrayInputStream(ser_bytes)) }
I get the following exception because ByteArrayInputStream does not implement Serializable:
object not serializable (class: java.io.ByteArrayInputStream, value: java.io.ByteArrayInputStream@73d5a077)
After that, I want to do the following:
val my_rdd1 = my_rdd.map { case (text, bytestream) => (text, new ObjectInputStream(bytestream).readObject.asInstanceOf[java.util.List[MyClass]])}
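For what it's worth, a minimal sketch that folds the deserialization into a single map, so no ByteArrayInputStream ever becomes an RDD element (this assumes the values really were written with ObjectOutputStream and that MyClass is Serializable and on the executor classpath):
import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.hadoop.io.{BytesWritable, Text}
import scala.collection.JavaConverters._

val my_rdd = sc.sequenceFile("/path/to/seqfile", classOf[Text], classOf[BytesWritable])
  .map { case (text, bytes) =>
    // deserialize right here, so only (String, List[MyClass]) leaves the closure
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes.copyBytes))
    val list = ois.readObject.asInstanceOf[java.util.List[MyClass]]
    ois.close()
    (text.toString, list.asScala.toList)
  }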
Related
I am using Structured Streaming and the following code works:
val j = new Jedis() // a Redis client, which is not serializable
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call a Jedis function here
batchDF.rdd.mapPartitions(...)
}}
But the following code throws an exception: object not serializable (class: redis.clients.jedis.Jedis, value: redis.clients.jedis.Jedis@a8e0378)
The only change is from RDD to DataFrame:
val j = new Jedis() // a Redis client, which is not serializable
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call a Jedis function here
batchDF.mapPartitions(...) // the only change: batchDF.rdd is now batchDF
}}
My Jedis code should be executed on the driver and never reach an executor. I would expect Spark RDDs and DataFrames to have similar APIs, so why does this happen?
I used Ctrl+click to step into the lower-level code. batchDF.mapPartitions goes to:
@Experimental
@InterfaceStability.Evolving
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] = {
  new Dataset[U](
    sparkSession,
    MapPartitions[T, U](func, logicalPlan),
    implicitly[Encoder[U]])
}
and batchDF.rdd.mapPartitions goes to:
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
My Spark version is 2.4.3.
My simplest version of the code is below, and I just found something else...
val j = new Jedis() // a Redis client, which is not serializable
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call a Jedis function here
batchDF.mapPartitions(x => {
val arr = x.grouped(2).toArray // this line matters
})
// the only change: batchDF.rdd is now batchDF
}}
See this DataFrame API implementation; internally it is calling rdd.mapPartitions with your function.
/**
 * Returns a new RDD by applying a function to each partition of this DataFrame.
 * @group rdd
 * @since 1.3.0
 */
def mapPartitions[R: ClassTag](f: Iterator[Row] => Iterator[R]): RDD[R] = {
  rdd.mapPartitions(f)
}
There is no difference; somewhere else you might have made a mistake.
AFAIK, ideally this should be the way:
batchDF.mapPartitions { yourPartition =>
  // better to create a JedisPool and take an object from it rather than new Jedis
  val j = new Jedis()
  val result = yourPartition.map { row =>
    // do some processing here
    row
  }
  j.close() // release and take care of connections/resources here
  result
}
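As a side note, here is a slightly fuller sketch of that JedisPool comment, written with foreachPartition since the Redis call is a pure side effect (the pool lives in a lazily initialized object so each executor JVM builds it once; the host, port, stream arguments and the RedisConnection name are only illustrative):
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

// illustrative holder: the lazy val means the pool is created once per executor JVM
object RedisConnection {
  lazy val pool = new JedisPool(new JedisPoolConfig(), "localhost", 6379)
}

batchDF.rdd.foreachPartition { partition =>
  val j: Jedis = RedisConnection.pool.getResource
  try {
    // iterate the partition and call Jedis as needed; arguments here are hypothetical
    partition.foreach(row => j.xtrim("mystream", 1000L, true))
  } finally {
    j.close() // returns the connection to the pool
  }
}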
I use the partitionBy function to divide my RDD into multiple partitions, and then I want to put each partition into ES.
EsSpark.saveToEs needs an RDD, but partitionBy leaves me with an Iterator per partition. Is there a method to save the Iterator to ES, or to convert the Iterator to an RDD? I use ES-Spark 5.2.2.
The code is below:
var entry = Array("vpn","linux","error")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
var resultRDD=stream.map( record => {
val json = parse(record.value())
val x = json.extract[vpnLogEntry]
if (!x.innerIP.equals("-")){
("vpn",x)
}else{
("linux",x)
}
})
resultRDD.foreachRDD { (rdd,durationTime) =>
val entryToIndexDis = rdd.context.broadcast(entry.zipWithIndex.toMap)
val indexToEntryDis = rdd.context.broadcast(entry.zipWithIndex.map(_.swap).toMap)
rdd.partitionBy(new Partitioner {
override def numPartitions: Int = entryToIndexDis.value.size
override def getPartition(key: Any): Int = {
entryToIndexDis.value.get(key.toString).get
}
}).mapPartitionsWithIndex((index, data) => {
val index_type = indexToEntryDis.value(index)
//here, I want to put vpn data into vpn/vpn of ES,
//and put linux data into linux/linux of ES.
//the variable data is of type Iterator,
//so I cannot use the EsSpark.saveToEs function
data
}, true).count()
}
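A hedged sketch of one workaround: drop the custom Partitioner and issue one EsSpark.saveToEs call per entry type by filtering the pair RDD on its key (EsSpark.saveToEs(rdd, resource) comes from elasticsearch-spark; the per-entry filtering here is only an illustration and has not been tested against ES 5.2.2):
import org.elasticsearch.spark.rdd.EsSpark

resultRDD.foreachRDD { rdd =>
  rdd.cache() // reused once per entry type below
  entry.foreach { e =>
    val sub = rdd.filter { case (k, _) => k == e }.map { case (_, v) => v }
    EsSpark.saveToEs(sub, s"$e/$e") // e.g. "vpn/vpn", "linux/linux"
  }
  rdd.unpersist()
}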
I am trying to convert an RDD to a DataFrame in Spark 2.0:
val conf=new SparkConf().setAppName("dataframes").setMaster("local")
val sc=new SparkContext(conf)
val sqlCon=new SQLContext(sc)
import sqlCon.implicits._
val rdd=sc.textFile("/home/cloudera/alpha.dat").persist()
val row=rdd.first()
val data=rdd.filter { x => !x.contains(row) }
data.foreach { x => println(x) }
case class person(name:String,age:Int,city:String)
val rdd2=data.map { x => x.split(",") }
val rdd3=rdd2.map { x => person(x(0),x(1).toInt,x(2)) }
val df=rdd3.toDF()
df.printSchema();
df.registerTempTable("alpha")
val df1=sqlCon.sql("select * from alpha")
df1.foreach { x => println(x) }
but I am getting the below error at toDF() ---> "val df=rdd3.toDF()"
Multiple markers at this line:
- Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
- Implicit conversion found: rdd3 ⇒ rddToDatasetHolder(rdd3): (implicit evidence$4:
org.apache.spark.sql.Encoder[person])org.apache.spark.sql.DatasetHolder[person]
How do I convert the above to a DataFrame using toDF()?
Cloudera & Spark 2.0? hmmm, didn't think we supported that yet :)
Anyway, first of all, you don't need to call .persist() on your RDD, so you can remove that bit. Secondly, since person is a case class, you should capitalize its name.
Lastly, in Spark 2.0 you no longer import sqlContext.implicits._ to implicitly build a DataFrame schema; you now import spark.implicits._. This is hinted at by your error message.
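For reference, a minimal sketch of that Spark 2.x entry point, with the builder options mirroring the question:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("dataframes")
  .getOrCreate()
import spark.implicits._

val df = rdd3.toDF()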
There was a simple mistake: I had defined the case class inside the main method. After moving it outside the method (so that an implicit Encoder can be derived for it), I am able to convert the RDD to a DataFrame.
package sparksql
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoders
import org.apache.spark.SparkContext
object asw {

  case class Person(name: String, age: Int, city: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("Dataframe")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val rdd1 = sc.textFile("/home/cloudera/alpha.dat")
    val row = rdd1.first()
    val data = rdd1.filter { x => !x.contains(row) }
    val rdd2 = data.map { x => x.split(",") }
    val df = rdd2.map { x => Person(x(0), x(1).toInt, x(2)) }.toDF()
    df.createOrReplaceTempView("rdd21")
    spark.sql("select * from rdd21").show()
  }
}
I'm trying to create a user-defined type in Spark SQL, but I receive
com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType even when using their example. Has anyone made this work?
My code:
test("udt serialisation") {
val points = Seq(new ExamplePoint(1.3, 1.6), new ExamplePoint(1.3, 1.8))
val df = SparkContextForStdout.context.parallelize(points).toDF()
}
#SQLUserDefinedType(udt = classOf[ExamplePointUDT])
case class ExamplePoint(val x: Double, val y: Double)
/**
* User-defined type for [[ExamplePoint]].
*/
class ExamplePointUDT extends UserDefinedType[ExamplePoint] {

  override def sqlType: DataType = ArrayType(DoubleType, false)

  override def pyUDT: String = "pyspark.sql.tests.ExamplePointUDT"

  override def serialize(obj: Any): Seq[Double] = {
    obj match {
      case p: ExamplePoint =>
        Seq(p.x, p.y)
    }
  }

  override def deserialize(datum: Any): ExamplePoint = {
    datum match {
      case values: Seq[_] =>
        val xy = values.asInstanceOf[Seq[Double]]
        assert(xy.length == 2)
        new ExamplePoint(xy(0), xy(1))
      case values: util.ArrayList[_] =>
        val xy = values.asInstanceOf[util.ArrayList[Double]].asScala
        new ExamplePoint(xy(0), xy(1))
    }
  }

  override def userClass: Class[ExamplePoint] = classOf[ExamplePoint]
}
The useful part of the stack trace is this:
com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType
java.lang.ClassCastException: com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
It seems that the UDT needs to be used inside another class to work (as the type of a field). One solution to use it directly is to wrap it in a Tuple1:
test("udt serialisation") {
val points = Seq(new Tuple1(new ExamplePoint(1.3, 1.6)), new Tuple1(new ExamplePoint(1.3, 1.8)))
val df = SparkContextForStdout.context.parallelize(points).toDF()
df.collect().foreach(println(_))
}
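An equivalent sketch with a named wrapper instead of Tuple1, following the same observation that the UDT works as the type of a field (the PointRecord case class is purely illustrative and carries the same caveats):
case class PointRecord(point: ExamplePoint)

test("udt serialisation via wrapper") {
  val points = Seq(PointRecord(new ExamplePoint(1.3, 1.6)), PointRecord(new ExamplePoint(1.3, 1.8)))
  val df = SparkContextForStdout.context.parallelize(points).toDF()
  df.collect().foreach(println(_))
}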
I am new to Spark SQL. The concat function is not available in Spark SQL queries, so we registered a SQL function; within this function I need to access another table. For that we have written a Spark SQL query on the SQLContext object.
When I invoke this query I get a NullPointerException. Please can you help with this?
Thanks in advance
// This is my code
class SalesHistory_2(sqlContext:SQLContext,sparkContext:SparkContext) extends Serializable {
import sqlContext._
import sqlContext.createSchemaRDD
try{
sqlContext.registerFunction("MaterialTransformation", Material_Transformation _)
def Material_Transformation(Material_ID: String): String =
{
var material:String =null;
var dd = sqlContext.sql("select * from product_master")
material
}
/* Product master*/
val productRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_PRODUCT_MASTER.txt")
val product_schemaString = productRDD.first
val product_withoutHeaders = dropHeader(productRDD)
val product_schema = StructType(product_schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val productdata = product_withoutHeaders.map{_.replace("|", "| ")}.map(x=> x.split("\\|"))
var product_rowRDD = productdata.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val product_srctableRDD = sqlContext.applySchema(product_rowRDD, product_schema)
product_srctableRDD.registerTempTable("product_master")
cacheTable("product_master")
/* Customer master*/
/* Sales History*/
val srcRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_TRADE_SALES_HISTORY_DS_4_20150119.txt")
val schemaString= srcRDD.first
val withoutHeaders = dropHeader(srcRDD)
val schema = StructType(schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val lines = withoutHeaders.map {_.replace("|", "| ")}.map(x=> x.split("\\|"))
var rowRDD = lines.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val srctableRDD = sqlContext.applySchema(rowRDD, schema)
srctableRDD.registerTempTable("SALES_HISTORY")
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
val path: Path = Path ("D:/Realease 8.0/files/output/")
try {
path.deleteRecursively(continueOnFailure = false)
} catch {
case e: IOException => // some file could not be deleted
}
val successRDDToFile = srcResults.map { x => x.mkString("|")}
successRDDToFile.coalesce(1).saveAsTextFile("D:/Realease 8.0/files/output/")
}
catch {
case ex: Exception => println(ex) // TODO: handle error
}
this.sparkContext.stop()
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) lines.drop(1) else lines
})
}
}
The answer here is rather short and probably disappointing - you simply cannot do something like this.
The general rule in Spark is that you cannot trigger an action or a transformation from within another action or transformation; or, to be a little more precise, outside the driver the SparkContext is no longer accessible / defined.
Calling Spark SQL for each row in the Sales History RDD looks like a very bad idea:
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
You'd be better off using a join between your tables and forgetting your custom function:
val srcResults = sqlContext.sql("SELECT s.*, p.* FROM SALES_HISTORY s join product_master p on s.Material_ID=p.ID")
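The same join expressed through the DataFrame API, as a sketch (it assumes a Spark version with DataFrames, 1.3 or later, and assumes the product_master key column ID exactly as in the SQL above):
val sales = sqlContext.table("SALES_HISTORY")
val products = sqlContext.table("product_master")
val joined = sales.join(products, sales("Material_ID") === products("ID"))
joined.show()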