Say I have a DataFrame:
val someDF = Seq(
(8, "bat"),
(64, "mouse"),
(-27, "horse")
).toDF("number", "word")
I want to send that DataFrame to a Kafka topic using Avro serialization and the Schema Registry. I believe I'm almost there, but I can't get past the Task not serializable error. I understand there is a Kafka sink, but it doesn't communicate with the Schema Registry, which is a requirement.
object Holder extends Serializable{
def prop(): java.util.Properties = {
val props = new Properties()
props.put("schema.registry.url", schemaRegistryURL)
props.put("key.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
props.put("value.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
props.put("bootstrap.servers", brokers)
props
}
def vProps(props: java.util.Properties): kafka.utils.VerifiableProperties = {
val vProps = new kafka.utils.VerifiableProperties(props)
vProps
}
def messageSchema(vProps: kafka.utils.VerifiableProperties): org.apache.avro.Schema = {
val ser = new KafkaAvroEncoder(vProps)
val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(subjectValueName)
val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
messageSchema
}
def avroRecord(messageSchema: org.apache.avro.Schema): org.apache.avro.generic.GenericData.Record = {
val avroRecord = new GenericData.Record(messageSchema)
avroRecord
}
def ProducerRecord(avroRecord:org.apache.avro.generic.GenericData.Record): org.apache.kafka.clients.producer.ProducerRecord[org.apache.avro.generic.GenericRecord,org.apache.avro.generic.GenericRecord] = {
val record = new ProducerRecord[GenericRecord, GenericRecord](topicWrite, avroRecord)
record
}
def producer(props: java.util.Properties): KafkaProducer[GenericRecord, GenericRecord] = {
val producer = new KafkaProducer[GenericRecord, GenericRecord](props)
producer
}
}
val prod: (String, String) => String = (
number: String,
word: String,
) => {
val prop = Holder.prop()
val vProps = Holder.vProps(prop)
val mSchema = Holder.messageSchema(vProps)
val aRecord = Holder.avroRecord(mSchema)
aRecord.put("number", number)
aRecord.put("word", word)
val record = Holder.ProducerRecord(aRecord)
val producer = Holder.producer(prop)
producer.send(record)
"sent"
}
val prodUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
udf((
number: String,
word: String,
) => prod(number,word))
val testDF = someDF.withColumn("sent", prodUDF(col("number"), col("word")))
KafkaProducer is not serializable, so it cannot be captured by a closure that Spark ships to the executors.
Create the KafkaProducer inside prod() instead of creating it outside.
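The failure mode can be reproduced without Spark or Kafka. The following is only a minimal sketch: FakeProducer is a hypothetical stand-in for KafkaProducer, and plain Java serialization is assumed as a rough proxy for how Spark serializes task closures.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for KafkaProducer: it does not extend Serializable.
class FakeProducer {
  def send(msg: String): String = s"sent $msg"
}

object SerializationSketch {
  // Attempts plain Java serialization of an object and reports whether it succeeded.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // The producer is created outside and captured by the returned closure.
  def capturing(producer: FakeProducer): String => String =
    msg => producer.send(msg)

  // The producer is created inside the function body, so nothing is captured.
  def creatingInside: String => String =
    msg => new FakeProducer().send(msg)

  def main(args: Array[String]): Unit = {
    println(s"captured producer serializable: ${canSerialize(capturing(new FakeProducer))}")
    println(s"producer created inside serializable: ${canSerialize(creatingInside)}")
  }
}
```

Creating a producer on every call is costly in a real job; a common alternative is to create one producer per partition inside foreachPartition, which keeps the producer out of the serialized closure in the same way.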
We can use the following API to write a DataFrame to local files.
df.write.parquet(path)
df.write.json(path)
However, can I write both a Parquet and a JSON output in one pass, without computing the DataFrame twice?
By the way, I don't want to cache the data in memory, because it's too big.
If you don't cache/persist the DataFrame, it will need to be recomputed for each output format.
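If memory is the only concern, one option is to persist to disk rather than to memory. This is a minimal sketch, not a tuned solution: writeBoth and its path parameters are hypothetical names, and it assumes the executors have enough local disk for the persisted partitions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object WriteTwice {
  // Hypothetical helper: persists the DataFrame on disk only, writes both formats, then releases it.
  def writeBoth(df: DataFrame, parquetPath: String, jsonPath: String): Unit = {
    val persisted = df.persist(StorageLevel.DISK_ONLY)
    try {
      persisted.write.parquet(parquetPath) // first action materializes the persisted partitions
      persisted.write.json(jsonPath)       // second action reads them back instead of recomputing
    } finally {
      persisted.unpersist()
    }
  }
}
```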
We can implement a custom org.apache.spark.sql.execution.datasources.FileFormat that writes each row to two underlying formats.
DuplicateOutFormat demo
/**
* Very Dangerous Toy Code. DO NOT USE IN PRODUCTION.
*/
class DuplicateOutFormat
extends FileFormat
with DataSourceRegister
with Serializable {
override def inferSchema(sparkSession: SparkSession, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] = {
throw new UnsupportedOperationException()
}
override def prepareWrite(sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory = {
val format1 = options("format1")
val format2 = options("format2")
val format1Instance = DataSource.lookupDataSource(format1, sparkSession.sessionState.conf)
.newInstance().asInstanceOf[FileFormat]
val format2Instance = DataSource.lookupDataSource(format2, sparkSession.sessionState.conf)
.newInstance().asInstanceOf[FileFormat]
val writerFactory1 = format1Instance.prepareWrite(sparkSession, job, options, dataSchema)
val writerFactory2 = format2Instance.prepareWrite(sparkSession, job, options, dataSchema)
new OutputWriterFactory {
override def getFileExtension(context: TaskAttemptContext): String = ".dup"
override def newInstance(path: String, dataSchema: StructType, context: TaskAttemptContext): OutputWriter = {
val path1 = path.replace(".dup", writerFactory1.getFileExtension(context))
val path2 = path.replace(".dup", writerFactory2.getFileExtension(context))
val writer1 = writerFactory1.newInstance(path1, dataSchema, context)
val writer2 = writerFactory2.newInstance(path2, dataSchema, context)
new OutputWriter {
override def write(row: InternalRow): Unit = {
writer1.write(row)
writer2.write(row)
}
override def close(): Unit = {
writer1.close()
writer2.close()
}
}
}
}
}
override def shortName(): String = "dup"
}
SPI
We need to create an SPI file at META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the following content:
com.github.sparkdemo.DuplicateOutFormat
Demo usage
class DuplicateOutFormatTest extends FunSuite {
val spark = SparkSession.builder()
.master("local")
.getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
test("testDuplicateWrite") {
val data = Array(
("k1", "fa", "20210901", 16),
("k2", null, "20210902", 15),
("k3", "df", "20210903", 14),
("k4", null, "20210904", 13)
)
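// Join with Paths.get so a missing trailing separator on java.io.tmpdir does not produce a wrong path.
val tempDir = java.nio.file.Paths.get(System.getProperty("java.io.tmpdir"), "spark-dup-test" + System.nanoTime()).toString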
val df = sc.parallelize(data).toDF("k", "col2", "day", "col4")
df.write
.option("format1", "csv")
.option("format2", "orc")
.format("dup").save(tempDir)
df.show(1000, false)
}
}
WARNING
Spark SQL couples some behavior into DataFrameWriter#saveToV1Source and other internal source code that we can't change. This custom DuplicateOutFormat is only a demo and is untested. The full demo is on GitHub.
Snapshot of my firebase realtime database
I want to extract all the data under the "Orders" node. How should I model my data classes for Android in Kotlin?
I tried the following model, after getting a reference to Orders/uid/:
Order.kt
data class Order(
val items:ArrayList<MyItems>=ArrayList(),
val timeStamp:Long=0,
val totalCost:Int=0
)
MyItems.kt
data class MyItems(
val Item:ArrayList<Menu>=ArrayList()
)
Menu.kt
data class Menu(
val menCategory:String="",
val menName:String="",
val menImage:String="",
val menId:String="",
val menQuantity:Int=0,
val menCost:Int=0
)
After a lot of thinking and research online, I was finally able to model my classes and attach a value event listener to the reference. Here it is:
Order.kt
data class Order(
val items: ArrayList<HashMap<String, Any>> = ArrayList(),
val timeStamp: Long = 0,
val totalCost: Int = 0
)
OItem.kt
data class OItem(
val menCategory: String = "",
val menId: String = "",
val menImage: String = "",
val menName: String = "",
val menPrice: Int = 0,
var menQuantity: Int = 0
)
MainActivity.kt
val uid = FirebaseAuth.getInstance().uid
val ref = FirebaseDatabase.getInstance().getReference("Orders/$uid")
ref.addListenerForSingleValueEvent(object : ValueEventListener {
override fun onCancelled(error: DatabaseError) {
//
}
override fun onDataChange(p0: DataSnapshot) {
p0.children.forEach {
val order = it.getValue(Order::class.java)
ordList.add(order!!)
}
Log.d("hf", ordList.toString())
}
})
I am trying to read Avro files from S3, and following the Spark documentation I am able to read them fine. My files look like the ones below; each file contains 5000 records.
s3a://bucket/part-0.avro
s3a://bucket/part-1.avro
s3a://bucket/part-2.avro
val byteRDD: RDD[Array[Byte]] = sc.binaryFiles(s"$s3URL/*.avro").map{ case(file, pds) => {
val dis = pds.open()
val len = dis.available()
val buf = Array.ofDim[Byte](len)
pds.open().readFully(buf)
buf
}}
import org.apache.avro.io.DecoderFactory
val deserialisedAvroRDD = byteRDD.map(record => {
import org.apache.avro.Schema
val schema = new Schema.Parser().parse(schemaJson)
val datumReader = new GenericDatumReader[GenericRecord](schema)
val decoder = DecoderFactory.get.binaryDecoder(record, null)
var datum: GenericRecord = null
while (!decoder.isEnd()) {
datum = datumReader.read(datum, decoder)
}
datum
}
)
deserialisedAvroRDD.count() ---> 3
I am deserializing the binary Avro messages to generate GenericRecords, and I expected the deserialized RDD to have 15k records, since each .avro file has 5k records. However, after deserializing I only get 3 records. Can someone please help me find the issue with my code? How can I deserialize one record at a time?
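This should work. map produces exactly one output element per file (here, the last record read), which is why you get 3 results; flatMap lets each file emit all of its records: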
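import scala.collection.mutable.ArrayBuffer
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.rdd.RDD

val recRDD: RDD[GenericRecord] = sc.binaryFiles(s"$s3URL/*.avro").flatMap {
  case (_, pds) =>
    val schema = new Schema.Parser().parse(schemaJson)
    val datumReader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get.binaryDecoder(pds.toArray(), null)
    // Collect every record in the file instead of keeping only the last one.
    val out = ArrayBuffer[GenericRecord]()
    while (!decoder.isEnd()) {
      // Passing null makes the reader allocate a fresh record on each read.
      out += datumReader.read(null, decoder)
    }
    out
}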
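To see the difference in isolation, here is a minimal sketch on plain Scala collections. The nested sequences stand in for the three files, and lastOnly and decodeAll are hypothetical names that mirror the question's loop and the fix, respectively.

```scala
object MapVsFlatMap {
  // Mirrors the question's loop: visits every record but returns only the last one.
  def lastOnly(file: Seq[String]): String = {
    var datum: String = null
    for (record <- file) datum = record
    datum
  }

  // Mirrors the fix: returns every record in the file.
  def decodeAll(file: Seq[String]): Seq[String] = file

  // Three "files" with two records each.
  val files: Seq[Seq[String]] = Seq(Seq("a1", "a2"), Seq("b1", "b2"), Seq("c1", "c2"))

  def main(args: Array[String]): Unit = {
    println(files.map(lastOnly).size)      // 3: one element per file
    println(files.flatMap(decodeAll).size) // 6: one element per record
  }
}
```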
object SparkMain extends App {
System.setProperty("spark.cassandra.connection.host", "127.0.0.1")
val conf = new SparkConf().setMaster("local[2]").setAppName("kafkaspark").set("spark.streaming.concurrentJobs","4")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(5))
val sqlContext= new SQLContext(sc)
val host = "localhost:2181"
val topicList = List("test","fb")
topicList.foreach{
topic=> val lines =KafkaUtils.createStream(ssc, host, topic, Map(topic -> 1)).map(_._2);
//configureStream(topic, lines)
lines.foreachRDD(rdd => rdd.map(test(_)).saveToCassandra("test","rawdata",SomeColumns("key")))
}
ssc.addStreamingListener(new StreamingListener {
override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
System.out.println("Batch completed, Total delay :" + batchCompleted.batchInfo.totalDelay.get.toString + " ms")
}
override def onReceiverStarted(receiverStarted: StreamingListenerReceiverStarted): Unit = {
println("inside onReceiverStarted")
}
override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
println("inside onReceiverError")
}
override def onReceiverStopped(receiverStopped: StreamingListenerReceiverStopped): Unit = {
println("inside onReceiverStopped")
}
override def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted): Unit = {
println("inside onBatchSubmitted")
}
override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
println("inside onBatchStarted")
}
})
ssc.start()
println("===========================")
ssc.awaitTermination()
}
case class test(key: String)
If I use any single topic at a time, it works. But when the topic list has more than one topic, after getting the stream from the Kafka topics, it just keeps printing "inside onBatchSubmitted".
My bad, I configured it wrong.
Using setMaster("local[*]") in place of setMaster("local[2]") works:
val conf = new SparkConf().setMaster("local[*]").setAppName("kafkaspark").set("spark.streaming.concurrentJobs","4")
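The likely reason is that each receiver-based stream created with KafkaUtils.createStream occupies one thread for as long as it runs, so with two topics both threads of local[2] go to receivers and none is left to process batches. A minimal sketch of that arithmetic follows; localMaster is a hypothetical helper, not a Spark API.

```scala
object LocalMaster {
  // Hypothetical helper: one thread per receiver plus at least one thread for processing batches.
  def localMaster(receivers: Int, processingThreads: Int = 1): String = {
    require(receivers >= 0, "receivers must be non-negative")
    require(processingThreads >= 1, "at least one processing thread is needed")
    s"local[${receivers + processingThreads}]"
  }

  def main(args: Array[String]): Unit = {
    // Two topics in the question means two receivers.
    println(localMaster(receivers = 2)) // local[3]
  }
}
```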
I want to insert streaming data into HBase.
This is my code:
val tableName = "streamingz"
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
print("-----------------------------------------------------------------------------------------------------------")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("z1".getBytes()))
tableDesc.addFamily(new HColumnDescriptor("z2".getBytes()))
admin.createTable(tableDesc)
} else {
print("Table already exists!!--------------------------------------------------------------------------------------")
}
val ssc = new StreamingContext(sc, Seconds(10))
val topicSet = Set("fluxAstellia")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "10.32.201.90:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val lines = stream.map(_._2).map(_.split(" ", -1)).foreachRDD(rdd => {
if (!rdd.partitions.isEmpty) {
val myTable = new HTable(conf, tableName)
rdd.map(rec => {
var put = new Put(rec._1.getBytes)
put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec._2))
myTable.put(put)
}).saveAsNewAPIHadoopDataset(conf)
myTable.flushCommits()
} else {
println("rdd is empty")
}
})
ssc.start()
ssc.awaitTermination()
}
}
I got this error:
:66: error: value _1 is not a member of Array[String]
var put = new Put(rec._1.getBytes)
I'm a beginner, so I don't know how to fix this error. I also have a question:
where exactly should I create the table: outside the streaming process or inside it?
Thank you
Your error is on the line var put = new Put(rec._1.getBytes)
The accessors _1, _2, ... are defined only on tuples (for example, the key/value pairs you get when iterating over a Map).
rec is an Array[String] that you got by splitting each string in the stream on space characters. If you are after the first element, write var put = new Put(rec(0).getBytes). Likewise, the next line becomes put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec(1)))
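The distinction can be checked in plain Scala without Spark or HBase. This is a minimal sketch; the sample line "row1 alice" and the helper toPair are made up for illustration.

```scala
object TupleVsArray {
  // Hypothetical helper: splits a line the same way as the question and packs two fields into a tuple.
  def toPair(line: String): (String, String) = {
    val fields: Array[String] = line.split(" ", -1)
    // Arrays are indexed with parentheses: fields(0), fields(1), ...
    (fields(0), fields(1))
  }

  def main(args: Array[String]): Unit = {
    val pair = toPair("row1 alice")
    // Tuples expose their members as _1, _2, ...
    println(pair._1) // row1
    println(pair._2) // alice
  }
}
```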