Loading data from Spark Structured Streaming into ArrayList

I need to send data from Kafka to Kinesis Firehose. I am processing the Kafka data using Spark Structured Streaming. I am not sure how to collect the dataset of the streaming query into an ArrayList variable - say, recordList - of e.g. 100 records (could be any other value), and then call the Firehose API's putRecordBatch(recordList) to put the records into Firehose.

I think you want to check out foreach and foreachBatch, depending on your Spark version. foreachBatch comes in v2.4.0, and foreach is available before v2.4.0 as well. If there is no streaming sink implementation available for Kinesis Firehose, then you should make your own implementation of ForeachWriter. Databricks has some nice examples of using foreach to create custom writers.
I haven't ever used Kinesis, but here is an example of what your custom sink might look like.
case class MyConfigInfo(info1: String, info2: String)

class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
  var kinesisProducer: KinesisProducer = _ // e.g. the KPL's com.amazonaws.services.kinesis.producer.KinesisProducer

  def open(partitionId: Long, version: Long): Boolean = {
    kinesisProducer = ??? // set up the kinesis producer using configInfo
    true
  }

  def process(value: (String, String)): Unit = {
    // ask kinesisProducer to send data
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the kinesis producer
  }
}
If you're using the AWS Kinesis Firehose API, you might do something like this:
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose
import com.amazonaws.services.kinesisfirehose.model.{PutRecordBatchRequest, Record}

case class MyConfigInfo(info1: String, info2: String)

class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
  var firehoseClient: AmazonKinesisFirehose = _
  val batch = scala.collection.mutable.ArrayBuffer[Record]()
  val recordLimit = 500 // maybe you need to tune this; PutRecordBatch takes at most 500 records per call

  def open(partitionId: Long, version: Long): Boolean = {
    firehoseClient = ??? // set up the firehose client using configInfo
    true
  }

  def process(value: (String, String)): Unit = {
    // batch the records and ask the firehose client to send them once the limit is reached
    val record: Record = ??? // create Record out of value
    batch += record
    if (batch.size >= recordLimit) {
      firehoseClient.putRecordBatch(new PutRecordBatchRequest().withRecords(batch.asJava))
      batch.clear()
    }
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the firehose client
    // or instead you could put the remaining batch to the firehose client here, but I'm not sure if that's good practice
  }
}
Then you'd use it as such
val writer = new KinesisSink(configuration)
val query =
  streamingSelectDF
    .writeStream
    .foreach(writer)
    .outputMode("update")
    .trigger(ProcessingTime("25 seconds"))
    .start()
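Since the original goal is to send batches of roughly 100 records with putRecordBatch, on Spark 2.4+ you could also do the batching with foreachBatch instead of a ForeachWriter. A rough sketch, assuming the AWS SDK v1 Firehose client, that the streaming DataFrame still carries Kafka's binary value column, and a hypothetical delivery stream name:
import java.nio.ByteBuffer
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder
import com.amazonaws.services.kinesisfirehose.model.{PutRecordBatchRequest, Record}
import org.apache.spark.sql.{DataFrame, Row}

streamingSelectDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.select("value").foreachPartition { rows: Iterator[Row] =>
      // one Firehose client per partition, created on the executor
      val firehose = AmazonKinesisFirehoseClientBuilder.defaultClient()
      rows.map(r => new Record().withData(ByteBuffer.wrap(r.getAs[Array[Byte]]("value"))))
        .grouped(100) // the batch size from the question; Firehose allows at most 500 per call
        .foreach { group =>
          val req = new PutRecordBatchRequest()
            .withDeliveryStreamName("my-delivery-stream") // hypothetical stream name
            .withRecords(group.asJava)
          firehose.putRecordBatch(req)
        }
    }
  }
  .outputMode("update")
  .start()
This keeps the ArrayList-style buffering out of the picture entirely; each micro-batch is grouped into chunks of 100 and pushed with one putRecordBatch call per chunk.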

Related

Spark custom receiver not getting data

I'm using Spark Streaming to ingest my company's internal data source. I followed this tutorial to write a receiver: https://spark.apache.org/docs/latest/streaming-custom-receivers.html. But in the Spark UI's Streaming tab I always see 0 messages coming in, and I don't see any errors in the driver logs, so I'm really confused about what's going wrong. (To connect to the internal data source, you need to create a client; then listen() keeps running to get the new messages.) Could it be because of the listen mode on the data source?
My Receiver
class MyReceiver(val clientId: String, val token: String, val env: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart() {
    new Thread("My Data Source") { override def run() { receive() } }.start()
  }

  def onStop() { }

  private def receive() {
    while (!isStopped()) {
      try {
        val client = new Client(clientId, token, "STAGE")
        client.connect()
        client.listen(Client.Topic, new ClientMsgHandler() {
          override def process(event: ClientMsg): Unit = {
            val msg: String = event.getBody
            store(msg)
          }
          override def onException(event: ClientEvent): Unit = {
          }
        })
      } catch {
        case ce: java.net.ConnectException =>
          System.out.println("Could not connect")
        case t: Throwable =>
          System.out.println("Error receiving data")
      }
    }
  }
}
==================================================================
Create Stream
class MyStream(sc: SparkContext, sqlContext: SQLContext, cpDir: String) {
  def creatingFunc(): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(3))
    // Set the active SQLContext so that we can access it statically within the foreachRDD
    SQLContext.setActiveSession(sqlContext)
    ssc.checkpoint(cpDir)
    val ClientId = <Myclientid>
    val Token = <Mytoken>
    val env = "STAGE"
    val stream = ssc.receiverStream(new MyReceiver(ClientId, Token, env))
    stream.foreachRDD { rdd =>
      println("Here" + rdd.take(10).mkString(", "))
    }
    ssc
  }
}
==================================================================
Start Streaming
val checkpoint_dir = <my_checkpoint_dir>
val MyDataSourceStream = new MyStream(sc, sqlContext, checkpoint_dir)
val ssc = StreamingContext.getActiveOrCreate(checkpoint_dir, MyDataSourceStream.creatingFunc _)
ssc.start()
ssc.awaitTermination()
Updates:
Since it's an internal source, I cannot share the Client source code. But I've tested the connection: it works with the code below, and the messages are printed out correctly. You can think of Client as an external lib which has no connection issues.
val ClientId = <myclientid>
val Token = <mytoken>
val client = new EVClient(ClientId, Token, "STAGE")
client.connect()
client.listen(Client.Topic, new ClientMsgHandler() {
  override def onEvent(event: ClientMsg): Unit = {
    val res = event.getBody
    println(res)
  }
  override def onException(event: ClientEvent): Unit = {
  }
})
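For reference, the custom receiver example in the guide linked above keeps a blocking read loop inside receive(), calls store() for each record, and uses restart() when the connection drops. A condensed sketch of that socket-based pattern from the docs (not the internal Client):
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("Socket Receiver") { override def run(): Unit = receive() }.start()
  }

  def onStop(): Unit = { }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line) // the blocking read loop pushes each record to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException => restart("Error connecting to " + host, e)
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}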

Why does a Spark not-serializable exception occur when changing RDD to DataFrame?

I am using Structured Streaming, and the following code works:
val j = new Jedis() // a Redis client which is not serializable
xx.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => {
  j.xtrim(...)... // call function of Jedis here
  batchDF.rdd.mapPartitions(...)
}}
But the following code throws an exception: object not serializable (class: redis.clients.jedis.Jedis, value: redis.clients.jedis.Jedis@a8e0378).
The code has only one change (RDD to DataFrame):
val j = new Jedis() // a Redis client which is not serializable
xx.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => {
  j.xtrim(...)... // call function of Jedis here
  batchDF.mapPartitions(...) // only change is batchDF.rdd to batchDF
}}
My Jedis code should be executed on the driver and never reach the executors. I suppose Spark RDD and DataFrame should have similar APIs? Why does this happen?
I used Ctrl+click to go into the lower-level code. batchDF.mapPartitions goes to
@Experimental
@InterfaceStability.Evolving
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] = {
  new Dataset[U](
    sparkSession,
    MapPartitions[T, U](func, logicalPlan),
    implicitly[Encoder[U]])
}
and batchDF.rdd.mapPartitions goes to
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  val cleanedF = sc.clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
    preservesPartitioning)
}
My Spark version is 2.4.3.
My simplest version of the code is below, and I just found something else...
val j = new Jedis() // a Redis client which is not serializable
xx.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => {
  j.xtrim(...)... // call function of Jedis here
  batchDF.mapPartitions(x => {
    val arr = x.grouped(2).toArray // this line matters
  })
  // only change is batchDF.rdd to batchDF
}}
See this DataFrame API implementation; internally it's calling rdd.mapPartitions with your function.
/**
 * Returns a new RDD by applying a function to each partition of this DataFrame.
 * @group rdd
 * @since 1.3.0
 */
def mapPartitions[R: ClassTag](f: Iterator[Row] => Iterator[R]): RDD[R] = {
  rdd.mapPartitions(f)
}
There is no difference; somewhere else you might have made a mistake.
AFAIK, ideally this should be the way:
batchDF.mapPartitions { yourPartition =>
  // better to create a JedisPool and take an object from it rather than new Jedis
  val j = new Jedis()
  val result = yourPartition.map { row =>
    // do some processing here
    row
  }
  j.close() // release and take care of connections/resources here
  result
}
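Expanding on the JedisPool remark, one sketch (with a hypothetical Redis host/port) keeps every Jedis object on the executor side by using foreachPartition inside foreachBatch, so nothing non-serializable is captured from the driver:
import org.apache.spark.sql.{DataFrame, Row}
import redis.clients.jedis.JedisPool

xx.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.foreachPartition { rows: Iterator[Row] =>
    // pool and connection are created on the executor, once per partition
    val pool = new JedisPool("localhost", 6379) // hypothetical Redis host/port
    val jedis = pool.getResource
    rows.foreach { row =>
      // do some per-row processing with jedis here
    }
    jedis.close() // returns the connection to the pool
    pool.close()
  }
}.start()
Because foreachPartition is an action that returns Unit, it also sidesteps the Encoder requirement that Dataset.mapPartitions carries.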

Manipulate trigger interval in spark structured streaming

For a given scenario, I want to filter the datasets in Structured Streaming using a combination of continuous and batch triggers.
I know it sounds unrealistic or maybe not feasible. Below is what I am trying to achieve.
Let the processing-time interval set in the app be 5 minutes.
Let the record be of below schema:
{
  "type":"record",
  "name":"event",
  "fields":[
    { "name":"Student", "type":"string" },
    { "name":"Subject", "type":"string" }
  ]
}
My streaming app is supposed to write the result into the sink when either of the two criteria below is met:
1. A student has more than 5 subjects (priority to be given to this criterion).
2. The processing time provided in the trigger has expired.
private static Injection<GenericRecord, byte[]> recordInjection;
private static StructType type;
public static final String USER_SCHEMA = "{"
    + "\"type\":\"record\","
    + "\"name\":\"alarm\","
    + "\"fields\":["
    + " { \"name\":\"student\", \"type\":\"string\" },"
    + " { \"name\":\"subject\", \"type\":\"string\" }"
    + "]}";
private static Schema.Parser parser = new Schema.Parser();
private static Schema schema = parser.parse(USER_SCHEMA);

static {
    recordInjection = GenericAvroCodecs.toBinary(schema);
    type = (StructType) SchemaConverters.toSqlType(schema).dataType();
}

sparkSession.udf().register("deserialize", (byte[] data) -> {
    GenericRecord record = recordInjection.invert(data).get();
    return RowFactory.create(record.get("student").toString(), record.get("subject").toString());
}, DataTypes.createStructType(type.fields()));

Dataset<Row> ds2 = ds1
    .select("value").as(Encoders.BINARY())
    .selectExpr("deserialize(value) as rows")
    .select("rows.*")
    .selectExpr("student", "subject");

StreamingQuery query1 = ds2
    .writeStream()
    .foreachBatch(
        new VoidFunction2<Dataset<Row>, Long>() {
            @Override
            public void call(Dataset<Row> rowDataset, Long aLong) throws Exception {
                rowDataset.select("student,concat(',',subject)").alias("value").groupBy("student");
            }
        }
    ).format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "new_in")
    .option("checkpointLocation", "checkpoint")
    .outputMode("append")
    .trigger(Trigger.ProcessingTime(10000))
    .start();

query1.awaitTermination();
Kafka Producer console:
Student:Test, Subject:x
Student:Test, Subject:y
Student:Test, Subject:z
Student:Test1, Subject:x
Student:Test2, Subject:x
Student:Test, Subject:w
Student:Test1, Subject:y
Student:Test2, Subject:y
Student:Test, Subject:v
In the Kafka consumer console, I am expecting output like below.
Test:{x,y,z,w,v} =>This should be the first response
Test1:{x,y} => second
Test2:{x,y} => Third by the end of processing time
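No answer is recorded here, but one possible direction (sketched in Scala rather than Java, and not a definitive implementation) is arbitrary stateful processing with flatMapGroupsWithState and a processing-time timeout: emit a student's subjects early once the count exceeds 5, otherwise emit them when the state times out. ds2 is assumed to be the (student, subject) Dataset built above, and the 5-minute timeout stands in for the app's processing interval:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._

case class StudentSubjects(student: String, subjects: Seq[String])

val perStudent = ds2.as[(String, String)]
  .groupByKey { case (student, _) => student }
  .flatMapGroupsWithState[Seq[String], StudentSubjects](
    OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout()) {
    (student: String, rows: Iterator[(String, String)], state: GroupState[Seq[String]]) =>
      val subjects = state.getOption.getOrElse(Seq.empty[String]) ++ rows.map(_._2)
      if (subjects.size > 5 || state.hasTimedOut) {
        // criterion 1: more than 5 subjects; criterion 2: the processing-time window expired
        state.remove()
        Iterator(StudentSubjects(student, subjects))
      } else {
        state.update(subjects)
        state.setTimeoutDuration("5 minutes") // the app's processing interval
        Iterator.empty
      }
  }
perStudent could then be formatted into a value column (student plus concatenated subjects) and written to the Kafka sink with outputMode "append".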

how to save Iterator to ES

I use the partitionBy function to divide my RDD into multiple partitions, and then I want to put the partitions into ES. EsSpark.saveToEs needs an RDD, but the partition function leaves me with an Iterator parameter. Is there a method to save the Iterator to ES, or to convert the Iterator to an RDD? I use ES-Spark 5.2.2.
The code is below:
var entry = Array("vpn", "linux", "error")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
var resultRDD = stream.map(record => {
  val json = parse(record.value())
  val x = json.extract[vpnLogEntry]
  if (!x.innerIP.equals("-")) {
    ("vpn", x)
  } else {
    ("linux", x)
  }
})
resultRDD.foreachRDD { (rdd, durationTime) =>
  val entryToIndexDis = rdd.context.broadcast(entry.zipWithIndex.toMap)
  val indexToEntryDis = rdd.context.broadcast(entry.zipWithIndex.map(_.swap).toMap)
  rdd.partitionBy(new Partitioner {
    override def numPartitions: Int = entryToIndexDis.value.size
    override def getPartition(key: Any): Int = {
      entryToIndexDis.value.get(key.toString).get
    }
  }).mapPartitionsWithIndex((index, data) => {
    val index_type = indexToEntryDis.value(index)
    // here, I want to put vpn data into vpn/vpn of ES,
    // and put linux data into linux/linux of ES.
    // the variable data is of type Iterator,
    // so I can not use the EsSpark.saveToEs function
    data
  }, true).count()
}
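No answer is recorded here, but it is worth knowing that elasticsearch-hadoop can route each document to a different index/type through a field reference in the resource string, which can remove the need to partition by type first. A rough sketch, with a hypothetical field selection built from the parsed entries:
import org.elasticsearch.spark.rdd.EsSpark

resultRDD.foreachRDD { (rdd, durationTime) =>
  // turn each (entryType, log) pair into a document that carries its own type field
  val docs = rdd.map { case (entryType, log) =>
    Map("type" -> entryType, "innerIP" -> log.innerIP) // hypothetical field selection
  }
  // "{type}" is resolved per document, e.g. vpn/vpn or linux/linux
  EsSpark.saveToEs(docs, "{type}/{type}")
}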

Spark : cleaner way to build Dataset out of Spark streaming

I want to create an API which looks like this:
public Dataset<Row> getDataFromKafka(SparkContext sc, String topic, StructType schema);
Here:
topic - the Kafka topic name from which the data is going to be consumed.
schema - the schema information for the Dataset.
So my function contains the following code:
JavaStreamingContext jsc = new JavaStreamingContext(javaSparkContext, Durations.milliseconds(2000L));
JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(
    jsc, String.class, String.class,
    StringDecoder.class, StringDecoder.class,
    kafkaConsumerConfig(), topics
);
Dataset<Row> dataSet = sqlContext.createDataFrame(javaSparkContext.emptyRDD(), schema);
DataSetHolder holder = new DataSetHolder(dataSet);
LongAccumulator stopStreaming = sc.longAccumulator("stop");
directStream.foreachRDD(rdd -> {
    RDD<Row> rows = rdd.values().map(value -> {
        // get type of message from value
        Row row = null;
        if (END == msg) {
            stopStreaming.add(1);
            row = null;
        } else {
            row = new GenericRow(/*row data created from values*/);
        }
        return row;
    }).filter(row -> row != null).rdd();
    holder.union(sqlContext.createDataFrame(rows, schema));
    holder.get().count();
});
jsc.start();
// stop the stream if the stopStreaming value is greater than 0; it's spawned as a new thread.
return holder.get();
Here, DataSetHolder is a wrapper class around Dataset to combine the results of all the RDDs.
class DataSetHolder {
    private Dataset<Row> df = null;

    public DataSetHolder(Dataset<Row> df) {
        this.df = df;
    }

    public void union(Dataset<Row> frame) {
        this.df = df.union(frame);
    }

    public Dataset<Row> get() {
        return df;
    }
}
This doesn't look good at all, but I had to do it. I am wondering what the right way to do this is, or whether Spark provides anything for it.
Update
So after consuming all the data from the stream, i.e. from the Kafka topic, we create a DataFrame out of it so that a data analyst can register it as a temp table and fire any query to get meaningful results.
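No answer is recorded here, but since Spark 2.0 the Structured Streaming Kafka source returns a streaming Dataset directly, which avoids the DStream, the accumulator, and the manual union entirely. A sketch in Scala (the question is in Java, but the API is analogous), assuming spark-sql-kafka-0-10 is on the classpath, hypothetical broker addresses, and that the Kafka values are JSON matching the supplied schema:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

// Streaming DataFrame straight from Kafka; no DStream, accumulator or manual union needed.
def getDataFromKafka(spark: SparkSession, topic: String, schema: StructType): DataFrame = {
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical brokers
    .option("subscribe", topic)
    .load()                                              // columns: key, value, topic, partition, offset, ...
    .select(from_json(col("value").cast("string"), schema).as("record"))
    .select("record.*")
}
The returned DataFrame is unbounded, so the analyst would query it through a streaming query (for example a memory sink or createOrReplaceTempView plus a streaming aggregation) rather than a one-shot count.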
