Kafka Avro to Elasticsearch with Spark - apache-spark

Want to put Avro messages from Kafka topics into Elasticsearch using Spark job (and SchemaRegistry with many defined schemas). I was able to read and deserialize records into Strings (json) format succesfully (with those 2 methods):
// Deserialize Avro to String
def avroToJsonString(record: GenericRecord): String = try {
val baos = new ByteArrayOutputStream
try {
val schema = record.getSchema
val jsonEncoder = EncoderFactory.get.jsonEncoder(schema, baos, false)
val avroWriter = new SpecificDatumWriter[GenericRecord](schema)
avroWriter.write(record, jsonEncoder)
jsonEncoder.flush()
baos.flush()
new String(baos.toByteArray)
} catch {
case ex: IOException =>
throw new IllegalStateException(ex)
} finally if (baos != null) baos.close()
}
// Parse JSON String
val parseJsonStream = (inStream: String) => {
try {
val parsed = Json.parse(inStream)
Option(parsed)
} catch {
case e: Exception => System.err.println("Exception while parsing JSON: " + inStream)
e.printStackTrace()
None
}
}
I'm reading record by record and I see deserialized JSON strings in debugger, everything looks fine, but for some reason I couldn't save them into Elasticsearch, because I guess RDD is needed to call saveToEs method. This is how I read avro records from Kafka:
val kafkaStream : InputDStream[ConsumerRecord[String, GenericRecord]] = KafkaUtils.createDirectStream[String, GenericRecord](ssc, PreferBrokers, Subscribe[String, GenericRecord](KAFKA_AVRO_TOPICS, kafkaParams))
val kafkaStreamParsed= kafkaStream.foreachRDD(rdd => {
rdd.foreach( x => {
val jsonString: String = avroToJsonString(x.value())
parseJsonStream(jsonString)
})
})
In case when I was reading json (not Avro) records, I was able to do it with:
EsSparkStreaming.saveToEs(kafkaStreamParsed, ELASTICSEARCH_EVENTS_INDEX + "/" + ELASTICSEARCH_TYPE)
I have an error in saveToEs method saying
Cannot resolve overloaded method 'saveToEs'
Tried to make rdd with sc.makeRDD() but had no luck either. How should I put all these records from batch job into RDD and afterward to Elasticsearch or I'm doing it all wrong?
UPDATE
Tried with solution:
val messages: DStream[Unit] = kafkaStream
.map(record => record.value)
.flatMap(record => {
val record1 = avroToJsonString(record)
JSON.parseFull(record1).map(rawMap => {
val map = rawMap.asInstanceOf[Map[String,String]]
})
})
again with the same Error (cannot resolve overloaded method)
UPDATE2
val kafkaStreamParsed: DStream[Any] = kafkaStream.map(rdd => {
val eventJSON = avroToJsonString(rdd.value())
parseJsonStream(eventJSON)
})
try {
EsSparkStreaming.saveToEs(kafkaStreamParsed, ELASTICSEARCH_EVENTS_INDEX + "/" + ELASTICSEARCH_TYPE)
} catch {
case e: Exception =>
EsSparkStreaming.saveToEs(kafkaStreamParsed, ELASTICSEARCH_FAILED_EVENTS)
e.printStackTrace()
}
Now I get the records in ES.
Using Spark 2.3.0 and Scala 2.11.8

I've managed to do it:
val kafkaStream : InputDStream[ConsumerRecord[String, GenericRecord]] = KafkaUtils.createDirectStream[String, GenericRecord](ssc, PreferBrokers, Subscribe[String, GenericRecord](KAFKA_AVRO_EVENT_TOPICS, kafkaParams))
val kafkaStreamParsed: DStream[Any] = kafkaStream.map(rdd => {
val eventJSON = avroToJsonString(rdd.value())
parseJsonStream(eventJSON)
})
try {
EsSparkStreaming.saveToEs(kafkaStreamParsed, ELASTICSEARCH_EVENTS_INDEX + "/" + ELASTICSEARCH_TYPE)
} catch {
case e: Exception =>
EsSparkStreaming.saveToEs(kafkaStreamParsed, ELASTICSEARCH_FAILED_EVENTS)
e.printStackTrace()
}

Related

Spark custom receicer not get data

I'm using spark streaming to ingest my company's internal data source. I followed this tutorial to write a receiver: https://spark.apache.org/docs/latest/streaming-custom-receivers.html. But in Spark UI streaming tag, I always see 0 msgs coming in. Also I don't see any errors in driver logs. Really confused what goes wrong. (To connect to the internal data source, need to create a client, then listen() will keep running to get the new msgs) I doubt is it because of the listen mode on the data source?
My Receiver
class MyReceiver(val clientId: String, val token: String, val env: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
def onStart() {
new Thread("My Data Source") { override def run() { receive() } }.start()
}
def onStop() { }
private def receive() {
while(!isStopped()) {
try {
val client = new Client(clientId, token, "STAGE")
client.connect()
client.listen(Client.Topic, new ClientMsgHandler() {
override def process(event: ClientMsg): Unit = {
val msg: String = event.getBody
store(msg)
}
override def onException(event: ClientEvent): Unit = {
}
})
} catch {
case ce: java.net.ConnectException =>
System.out.println("Could not connect")
case t: Throwable =>
System.out.println("Error receiving data")
}
}
}
}
==================================================================
Create Stream
class MyStream(sc: SparkContext, sqlContext: SQLContext, cpDir: String) {
def creatingFunc(): StreamingContext = {
val ssc = new StreamingContext(sc, Seconds(3))
// Set the active SQLContext so that we can access it statically within the foreachRDD
SQLContext.setActiveSession(sqlContext)
ssc.checkpoint(cpDir)
val ClientId = <Myclientid>
val Token = <Mytoken>
val env = "STAGE"
val stream = ssc.receiverStream(new MyReceiver(ClientId, Token, env))
stream.foreachRDD { rdd => println("Here"+rdd.take(10).mkString(", "))
}
ssc
}
}
==================================================================
Start Streaming
val checkpoint_dir = <my_checkpoint_dir>
val MyDataSourceStream = new MyStream(sc, sqlContext, checkpoint_dir)
val ssc = StreamingContext.getActiveOrCreate(checkpoint_dir, MyDataSourceStream.creatingFunc _)
ssc.start()
ssc.awaitTermination()
Updates:
Since it's an internal source, I cannot share Client source code. But I've tested the connection. It works for below code and msg can be printed out correctly. You can think of Client as an external lib which has no connection issues.
val ClientId = <myclientid>
val Token = <mytoken>
val client = new EVClient(ClientId, Token, "STAGE")
client.connect()
client.listen(Client.Topic, new ClientMsgHandler() {
override def onEvent(event: ClientMsg): Unit = {
val res = event.getBody
println(res)
}
override def onException(event: ClientEvent): Unit = {
}
})

Spark not reading all the records from binary file

I am trying to read Avro files from S3 and as shown in this spark documentation I am able to read it fine. My files are like below, these files consist of 5000 record each.
s3a://bucket/part-0.avro
s3a://bucket/part-1.avro
s3a://bucket/part-2.avro
val byteRDD: RDD[Array[Byte]] = sc.binaryFiles(s"$s3URL/*.avro").map{ case(file, pds) => {
val dis = pds.open()
val len = dis.available()
val buf = Array.ofDim[Byte](len)
pds.open().readFully(buf)
buf
}}
import org.apache.avro.io.DecoderFactory
val deserialisedAvroRDD = byteRDD.map(record => {
import org.apache.avro.Schema
val schema = new Schema.Parser().parse(schemaJson)
val datumReader = new GenericDatumReader[GenericRecord](schema)
val decoder = DecoderFactory.get.binaryDecoder(record, null)
var datum: GenericRecord = null
while (!decoder.isEnd()) {
datum = datumReader.read(datum, decoder)
}
datum
}
)
deserialisedAvroRDD.count() ---> 3
I am deserializing the binaryAvro messages to generate GenericRecords and I was expecting the deserilized RDD to have 15k records as each .avro file had 5k record however after deserializing I only get 3 record. Can someone please help in finding out the issue with my code? How can I serialize one record at a time.
This should work
val recRDD: RDD[GenericRecord] = sc.binaryFiles(s"$s3URL/*.avro").flatMap {
case (file, pds) => {
val schema = new Schema.Parser().parse(schemaJson)
val datumReader = new GenericDatumReader[GenericRecord](schema)
val decoder = DecoderFactory.get.binaryDecoder(pds.toArray(), null)
var datum: GenericRecord = null
val out = ArrayBuffer[GenericRecord]()
while (!decoder.isEnd()) {
out += datumReader.read(datum, decoder)
}
out
}
}

how to save Iterator to ES

I use the partitionBy functions to divide my rdd to multiple partitions, and then I want to put partitions to ES.
EsSpark.saveToEs need rdd, but the partitionBy function leave me the parameter Iterator. Is there a method to save the Iterator to ES or
convert Iterator to rdd?I use the ES-spark 5.2.2
the code is below:
var entry = Array("vpn","linux","error")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
var resultRDD=stream.map( record => {
val json = parse(record.value())
val x = json.extract[vpnLogEntry]
if (!x.innerIP.equals("-")){
("vpn",x)
}else{
("linux",x)
}
})
resultRDD.foreachRDD { (rdd,durationTime) =>
val entryToIndexDis = rdd.context.broadcast(entry.zipWithIndex.toMap)
val indexToEntryDis = rdd.context.broadcast(entry.zipWithIndex.map(_.swap).toMap)
rdd.partitionBy(new Partitioner {
override def numPartitions: Int = entryToIndexDis.value.size
override def getPartition(key: Any): Int = {
entryToIndexDis.value.get(key.toString).get
}
}).mapPartitionsWithIndex((index, data) => {
val index_type = indexToEntryDis.value(index)
//here, I want to put vpn data into vpn/vpn of ES,
//and put linux data into linux/linux of ES.
//the variable of data is type of Iterator,
//so can not use EsSpark.saveToEs function
data
}, true).count()

spark streaming hbase error

I want to insert streaming data into hbase;
this is my code :
val tableName = "streamingz"
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
print("-----------------------------------------------------------------------------------------------------------")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("z1".getBytes()))
tableDesc.addFamily(new HColumnDescriptor("z2".getBytes()))
admin.createTable(tableDesc)
} else {
print("Table already exists!!--------------------------------------------------------------------------------------")
}
val ssc = new StreamingContext(sc, Seconds(10))
val topicSet = Set("fluxAstellia")
val kafkaParams = Map[String, String]("metadata.broker.list" - > "10.32.201.90:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val lines = stream.map(_._2).map(_.split(" ", -1)).foreachRDD(rdd => {
if (!rdd.partitions.isEmpty) {
val myTable = new HTable(conf, tableName)
rdd.map(rec => {
var put = new Put(rec._1.getBytes)
put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec._2))
myTable.put(put)
}).saveAsNewAPIHadoopDataset(conf)
myTable.flushCommits()
} else {
println("rdd is empty")
}
})
ssc.start()
ssc.awaitTermination()
}
}
I got this error:
:66: error: value _1 is not a member of Array[String]
var put = new Put(rec._1.getBytes)
I'm beginner so how I can't fix this error, and I have a question:
where exactly create the table; outside the streaming process or inside ?
Thank you
You error is basically on line var put = new Put(rec._1.getBytes)
You can call _n only on a Map(_1 for key and _2 for value) or a Tuple.
rec is a String Array you got by splitting the string in the stream by space characters. If you were after first element, you'd write it as var put = new Put(rec(0).getBytes). Likewise in the next line you'd write it as put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec(1)))

How can use spark SqlContext object in spark sql registeredFunction?

I am new to Spark SQL. Concat function not available in Spark Sql Query for this we have registered one sql function, with in this function i need access another table. for that we have written spark sql query on SQLContext object.
when i invoke this query i am getting NullpointerException.please can you help on this.
Thanks in advance
//This I My code
class SalesHistory_2(sqlContext:SQLContext,sparkContext:SparkContext) extends Serializable {
import sqlContext._
import sqlContext.createSchemaRDD
try{
sqlContext.registerFunction("MaterialTransformation", Material_Transformation _)
def Material_Transformation(Material_ID: String): String =
{
var material:String =null;
var dd = sqlContext.sql("select * from product_master")
material
}
/* Product master*/
val productRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_PRODUCT_MASTER.txt")
val product_schemaString = productRDD.first
val product_withoutHeaders = dropHeader(productRDD)
val product_schema = StructType(product_schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val productdata = product_withoutHeaders.map{_.replace("|", "| ")}.map(x=> x.split("\\|"))
var product_rowRDD = productdata.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val product_srctableRDD = sqlContext.applySchema(product_rowRDD, product_schema)
product_srctableRDD.registerTempTable("product_master")
cacheTable("product_master")
/* Customer master*/
/* Sales History*/
val srcRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_TRADE_SALES_HISTORY_DS_4_20150119.txt")
val schemaString= srcRDD.first
val withoutHeaders = dropHeader(srcRDD)
val schema = StructType(schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val lines = withoutHeaders.map {_.replace("|", "| ")}.map(x=> x.split("\\|"))
var rowRDD = lines.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val srctableRDD = sqlContext.applySchema(rowRDD, schema)
srctableRDD.registerTempTable("SALES_HISTORY")
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
val path: Path = Path ("D:/Realease 8.0/files/output/")
try {
path.deleteRecursively(continueOnFailure = false)
} catch {
case e: IOException => // some file could not be deleted
}
val successRDDToFile = srcResults.map { x => x.mkString("|")}
successRDDToFile.coalesce(1).saveAsTextFile("D:/Realease 8.0/files/output/")
}
catch {
case ex: Exception => println(ex) // TODO: handle error
}
this.sparkContext.stop()
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) {
lines.drop(1)
}
lines
})
}
The answer here is rather short and probably disappointing - you simply cannot do something like this.
General rule in Spark is you cannot trigger action or transformation from another action and transformation or, to be a little bit more precise, outside the driver Spark Context is no longer accessible / defined.
Calling Spark SQL for each row in the Sales History RDD looks like a very bad idea:
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
You'd better user a join between your RDDs and forget you custom function:
val srcResults = sqlContext.sql("SELECT s.*, p.* FROM SALES_HISTORY s join product_master p on s.Material_ID=p.ID")

Resources