How to save an Iterator to ES - apache-spark

I use the partitionBy function to divide my RDD into multiple partitions, and then I want to write each partition to ES.
EsSpark.saveToEs needs an RDD, but after partitionBy the per-partition function only gives me an Iterator. Is there a way to save the Iterator to ES, or to convert the Iterator to an RDD? I use ES-Spark 5.2.2.
The code is below:
var entry = Array("vpn","linux","error")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
var resultRDD=stream.map( record => {
val json = parse(record.value())
val x = json.extract[vpnLogEntry]
if (!x.innerIP.equals("-")){
("vpn",x)
}else{
("linux",x)
}
})
resultRDD.foreachRDD { (rdd,durationTime) =>
val entryToIndexDis = rdd.context.broadcast(entry.zipWithIndex.toMap)
val indexToEntryDis = rdd.context.broadcast(entry.zipWithIndex.map(_.swap).toMap)
rdd.partitionBy(new Partitioner {
override def numPartitions: Int = entryToIndexDis.value.size
override def getPartition(key: Any): Int = {
entryToIndexDis.value.get(key.toString).get
}
}).mapPartitionsWithIndex((index, data) => {
val index_type = indexToEntryDis.value(index)
//here, I want to put vpn data into vpn/vpn of ES,
//and put linux data into linux/linux of ES.
//the variable of data is type of Iterator,
//so can not use EsSpark.saveToEs function
data
}, true).count()
}

Related

Spark not reading all the records from binary file

I am trying to read Avro files from S3 and, as shown in the Spark documentation, I am able to read them fine. My files are listed below; each file consists of 5000 records.
s3a://bucket/part-0.avro
s3a://bucket/part-1.avro
s3a://bucket/part-2.avro
val byteRDD: RDD[Array[Byte]] = sc.binaryFiles(s"$s3URL/*.avro").map{ case(file, pds) => {
val dis = pds.open()
val len = dis.available()
val buf = Array.ofDim[Byte](len)
pds.open().readFully(buf)
buf
}}
import org.apache.avro.io.DecoderFactory
val deserialisedAvroRDD = byteRDD.map(record => {
import org.apache.avro.Schema
val schema = new Schema.Parser().parse(schemaJson)
val datumReader = new GenericDatumReader[GenericRecord](schema)
val decoder = DecoderFactory.get.binaryDecoder(record, null)
var datum: GenericRecord = null
while (!decoder.isEnd()) {
datum = datumReader.read(datum, decoder)
}
datum
}
)
deserialisedAvroRDD.count() ---> 3
I am deserializing the binary Avro messages to generate GenericRecords, and I was expecting the deserialized RDD to have 15k records, since each .avro file has 5k records. However, after deserializing I only get 3 records. Can someone please help me find the issue with my code? How can I deserialize one record at a time?
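Your map emits exactly one element per file, and the while loop keeps only the last datum it read, so three files give three records. Use flatMap and collect every record instead. This should work: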
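import scala.collection.mutable.ArrayBuffer
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.rdd.RDD

val recRDD: RDD[GenericRecord] = sc.binaryFiles(s"$s3URL/*.avro").flatMap {
  case (file, pds) =>
    val schema = new Schema.Parser().parse(schemaJson)
    val datumReader = new GenericDatumReader[GenericRecord](schema)
    // pds.toArray() reads the whole file into a byte array
    val decoder = DecoderFactory.get.binaryDecoder(pds.toArray(), null)
    val out = ArrayBuffer[GenericRecord]()
    while (!decoder.isEnd()) {
      // passing null as the reuse object allocates a fresh record on every read
      out += datumReader.read(null, decoder)
    }
    out
}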
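The difference between map and flatMap that this answer relies on can be seen without Spark. The following is only a minimal sketch using plain Scala collections as a stand-in for the RDD; the object name and the integer "records" are made up for illustration and are not part of the original code.

```scala
object MapVsFlatMapDemo {
  // Hypothetical stand-in for three files holding 5000 records each.
  val files: Seq[Seq[Int]] = Seq.fill(3)((1 to 5000).toSeq)

  // map: one output element per input file (here, the last record of each file).
  val mapped: Seq[Int] = files.map(records => records.last)

  // flatMap: every record of every file becomes its own output element.
  val flattened: Seq[Int] = files.flatMap(records => records)

  def main(args: Array[String]): Unit = {
    println(s"map produced ${mapped.size} elements")
    println(s"flatMap produced ${flattened.size} elements")
  }
}
```

The same reasoning applies to RDDs: map keeps the element count equal to the number of inputs, while flatMap lets each input contribute any number of outputs.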

Why Spark not serializable exception occurs when changing RDD to DataFrame?

I am using Structured Streaming, and the following code works:
val j = new Jedis() // a Redis client, which is not serializable.
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call function of Jedis here
batchDF.rdd.mapPartitions(...)
}}
But the following code throws an exception: object not serializable (class: redis.clients.jedis.Jedis, value: redis.clients.jedis.Jedis@a8e0378)
The code differs in only one place (the RDD is replaced by the DataFrame):
val j = new Jedis() // a Redis client, which is not serializable.
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call function of Jedis here
batchDF.mapPartitions(...) // the only change: batchDF.rdd becomes batchDF
}}
My Jedis code should be executed on the driver and never reach the executors. I assumed the Spark RDD and DataFrame APIs would behave similarly. Why does this happen?
I Ctrl-clicked into the lower-level code. batchDF.mapPartitions goes to
@Experimental
@InterfaceStability.Evolving
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] =
{
new Dataset[U](
sparkSession,
MapPartitions[T, U](func, logicalPlan),
implicitly[Encoder[U]])
}
and batchDF.rdd.mapPartitions goes to
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
My Spark version is 2.4.3.
Below is the simplest version of my code, where I just found something else:
val j = new Jedis() // a Redis client, which is not serializable.
xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
j.xtrim(...)... // call function of Jedis here
batchDF.mapPartitions(x => {
val arr = x.grouped(2).toArray // this line matters
})
// the only change: batchDF.rdd becomes batchDF
}}
See this DataFrame API implementation.
Internally it calls rdd.mapPartitions with your function.
/**
* Returns a new RDD by applying a function to each partition of this DataFrame.
* @group rdd
* @since 1.3.0
*/
def mapPartitions[R: ClassTag](f: Iterator[Row] => Iterator[R]): RDD[R] = {
rdd.mapPartitions(f)
}
There is no difference; you may have made a mistake somewhere else.
AFAIK, ideally this should be the way:
batchDF.mapPartitions { partition =>
  // better to create a JedisPool and take a connection from it rather than calling new Jedis
  val j = new Jedis()
  // Iterator.map is lazy, so force the results with toList while the connection is still open
  val result = partition.map { row =>
    // do some processing here
    row // placeholder: return the processed value
  }.toList
  j.close() // release and take care of connections/resources here
  result.iterator
}
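One detail of the per-partition pattern is worth spelling out: Iterator.map is lazy, so a connection that is closed right after calling map is already closed by the time the elements are actually processed. The following is only a sketch in plain Scala, with no Spark or Redis involved. FakeConnection and processPartition are hypothetical names invented for this illustration, and FakeConnection merely mimics a client that fails once closed.

```scala
// Hypothetical stand-in for a connection such as Jedis; not a real client.
class FakeConnection {
  private var open = true

  def lookup(key: String): String = {
    require(open, "connection already closed")
    key.toUpperCase
  }

  def close(): Unit = {
    open = false
  }
}

object PartitionResourceDemo {
  // Processes one partition's iterator with a connection scoped to that partition.
  def processPartition(partition: Iterator[String]): Iterator[String] = {
    val conn = new FakeConnection
    try {
      // toList forces evaluation while the connection is still open.
      partition.map(conn.lookup).toList.iterator
    } finally {
      conn.close()
    }
  }

  def main(args: Array[String]): Unit = {
    println(processPartition(Iterator("a", "b")).toList) // prints List(A, B)
  }
}
```

Materializing with toList keeps the whole partition's results in memory, which is the price of closing the connection deterministically. For very large partitions, closing the connection only after the iterator is exhausted is an alternative worth considering.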

Spark Structured streaming kafka avro Producer

I have a DataFrame, let's say:
val someDF = Seq(
(8, "bat"),
(64, "mouse"),
(-27, "horse")
).toDF("number", "word")
I want to send that DataFrame to a Kafka topic using Avro serialization and the schema registry. I believe I'm almost there, but I can't seem to get past the Task not serializable error. I understand there is a sink for Kafka, but it doesn't communicate with the schema registry, which is a requirement.
object Holder extends Serializable{
def prop(): java.util.Properties = {
val props = new Properties()
props.put("schema.registry.url", schemaRegistryURL)
props.put("key.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
props.put("value.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
props.put("schema.registry.url", schemaRegistryURL)
props.put("bootstrap.servers", brokers)
props
}
def vProps(props: java.util.Properties): kafka.utils.VerifiableProperties = {
val vProps = new kafka.utils.VerifiableProperties(props)
vProps
}
def messageSchema(vProps: kafka.utils.VerifiableProperties): org.apache.avro.Schema = {
val ser = new KafkaAvroEncoder(vProps)
val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(subjectValueName)
val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
messageSchema
}
def avroRecord(messageSchema: org.apache.avro.Schema): org.apache.avro.generic.GenericData.Record = {
val avroRecord = new GenericData.Record(messageSchema)
avroRecord
}
def ProducerRecord(avroRecord:org.apache.avro.generic.GenericData.Record): org.apache.kafka.clients.producer.ProducerRecord[org.apache.avro.generic.GenericRecord,org.apache.avro.generic.GenericRecord] = {
val record = new ProducerRecord[GenericRecord, GenericRecord](topicWrite, avroRecord)
record
}
def producer(props: java.util.Properties): KafkaProducer[GenericRecord, GenericRecord] = {
val producer = new KafkaProducer[GenericRecord, GenericRecord](props)
producer
}
}
val prod: (String, String) => String = (
number: String,
word: String,
) => {
val prop = Holder.prop()
val vProps = Holder.vProps(prop)
val mSchema = Holder.messageSchema(vProps)
val aRecord = Holder.avroRecord(mSchema)
aRecord.put("number", number)
aRecord.put("word", word)
val record = Holder.ProducerRecord(aRecord)
val producer = Holder.producer(prop)
producer.send(record)
"sent"
}
val prodUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
udf((
number: String,
word: String,
) => prod(number,word))
val testDF = firstDF.withColumn("sent", prodUDF(col("number"), col("word")))
KafkaProducer is not serializable.
Create the KafkaProducer inside prod() instead of creating it outside.
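Why the place of creation matters can be shown with plain Java serialization, which is what Spark applies to closures by default. This is only a sketch under stated assumptions: FakeProducer is a hypothetical stand-in for a non-serializable client such as KafkaProducer, and isSerializable, capturingOutside and creatingInside are helper names invented for this illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a non-serializable client such as KafkaProducer.
class FakeProducer {
  def send(msg: String): String = s"sent $msg"
}

object ClosureSerializationDemo extends Serializable {
  // Returns true if Java serialization accepts the object, false if it is rejected.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // The closure captures a producer created outside of it,
  // so the producer has to be serialized together with the closure.
  def capturingOutside(): String => String = {
    val producer = new FakeProducer
    msg => producer.send(msg)
  }

  // The closure creates its own producer when it runs, so it captures nothing.
  def creatingInside(): String => String =
    msg => new FakeProducer().send(msg)

  def main(args: Array[String]): Unit = {
    println(s"capturing outside: ${isSerializable(capturingOutside())}")
    println(s"creating inside:   ${isSerializable(creatingInside())}")
  }
}
```

Creating a client per call is expensive with a real producer. A common compromise is to create one client per partition or to hold it in a lazily initialized field of a singleton object, so that each executor builds its own instance.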

Why is the number of partitions different after a join in Spark Streaming?

val sparkConf = new SparkConf()
val streamingContext = new StreamingContext(sparkConf, Minutes(1))
var historyRdd: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD
var historyRdd_2: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD
val stream_1 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams , Set(inputTopic_1))
val stream_2 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams , Set(inputTopic_2))
val dstream_2 = stream_2.map((r: Tuple2[String, GenericData.Record]) =>
{
//some mapping
})
val historyDStream = dstream_1.transform(rdd => rdd.union(historyRdd))
val historyDStream_2 = dstream_2.transform(rdd => rdd.union(historyRdd_2))
val fullJoinResult = historyDStream.fullOuterJoin(historyDStream_2)
val filtered = fullJoinResult.filter(r => r._2._1.isEmpty)
filtered.foreachRDD{rdd =>
val formatted = rdd.map(r => (r._1 , r._2._2.get))
historyRdd_2.unpersist(false) // unpersist the 'old' history RDD
historyRdd_2 = formatted // assign the new history
historyRdd_2.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
val filteredStream = fullJoinResult.filter(r => r._2._2.isEmpty)
filteredStream.foreachRDD{rdd =>
val formatted = rdd.map(r => (r._1 , r._2._1.get))
historyRdd.unpersist(false) // unpersist the 'old' history RDD
historyRdd = formatted // assign the new history
historyRdd.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
streamingContext.start()
streamingContext.awaitTermination()
}
}
DStream_1 and DStream_2 have 128 partitions each, but after the join the resulting DStream has 3 partitions. I haven't done any repartitioning. I assumed that if both DStreams have the same number of partitions, the joined DStream would have that same number, since the join happens partition to partition. Please correct me if I am wrong.

How can I use the Spark SQLContext object in a Spark SQL registered function?

I am new to Spark SQL. A concat function is not available in Spark SQL queries, so we registered our own SQL function. Within this function I need to access another table, so we wrote a Spark SQL query against the SQLContext object.
When I invoke this query I get a NullPointerException. Can you please help with this?
Thanks in advance.
// This is my code
class SalesHistory_2(sqlContext:SQLContext,sparkContext:SparkContext) extends Serializable {
import sqlContext._
import sqlContext.createSchemaRDD
try{
sqlContext.registerFunction("MaterialTransformation", Material_Transformation _)
def Material_Transformation(Material_ID: String): String =
{
var material:String =null;
var dd = sqlContext.sql("select * from product_master")
material
}
/* Product master*/
val productRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_PRODUCT_MASTER.txt")
val product_schemaString = productRDD.first
val product_withoutHeaders = dropHeader(productRDD)
val product_schema = StructType(product_schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val productdata = product_withoutHeaders.map{_.replace("|", "| ")}.map(x=> x.split("\\|"))
var product_rowRDD = productdata.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val product_srctableRDD = sqlContext.applySchema(product_rowRDD, product_schema)
product_srctableRDD.registerTempTable("product_master")
cacheTable("product_master")
/* Customer master*/
/* Sales History*/
val srcRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_TRADE_SALES_HISTORY_DS_4_20150119.txt")
val schemaString= srcRDD.first
val withoutHeaders = dropHeader(srcRDD)
val schema = StructType(schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val lines = withoutHeaders.map {_.replace("|", "| ")}.map(x=> x.split("\\|"))
var rowRDD = lines.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val srctableRDD = sqlContext.applySchema(rowRDD, schema)
srctableRDD.registerTempTable("SALES_HISTORY")
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
val path: Path = Path ("D:/Realease 8.0/files/output/")
try {
path.deleteRecursively(continueOnFailure = false)
} catch {
case e: IOException => // some file could not be deleted
}
val successRDDToFile = srcResults.map { x => x.mkString("|")}
successRDDToFile.coalesce(1).saveAsTextFile("D:/Realease 8.0/files/output/")
}
catch {
case ex: Exception => println(ex) // TODO: handle error
}
this.sparkContext.stop()
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
// drop returns a new iterator, so its result must be the value of the branch
if (idx == 0) lines.drop(1) else lines
})
}
}
The answer here is rather short and probably disappointing: you simply cannot do something like this.
The general rule in Spark is that you cannot trigger an action or transformation from inside another action or transformation. To be a little more precise, the SparkContext is not accessible or defined outside the driver.
Calling Spark SQL for each row in the Sales History RDD looks like a very bad idea:
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
You'd be better off using a join between your RDDs and forgetting your custom function:
val srcResults = sqlContext.sql("SELECT s.*, p.* FROM SALES_HISTORY s join product_master p on s.Material_ID=p.ID")

Resources