Spark Structured Streaming Kafka Offset Management - apache-spark

I'm looking into storing Kafka offsets inside Kafka itself for Spark Structured Streaming, the way it works for DStreams with stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges), but for Structured Streaming.
Is this supported for Structured Streaming? If yes, how can I achieve it?
I know about HDFS checkpointing via .option("checkpointLocation", checkpointLocation), but I'm interested specifically in Kafka's built-in offset management.
I expect Kafka to store the offsets on its own, without a Spark HDFS checkpoint.
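For reference, the DStreams pattern I mean looks roughly like this in Java (a rough sketch; streamingContext, topics and kafkaParams are set up elsewhere):
// DStreams (not Structured Streaming): commit the processed offsets back to Kafka
JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                streamingContext,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

stream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // ... process the rdd ...
    // once the output has completed, commit the offsets to Kafka itself
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});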

I am using this piece of code found somewhere.
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OffsetManager {

    private String storagePrefix;

    public OffsetManager(String storagePrefix) {
        this.storagePrefix = storagePrefix;
    }

    /**
     * Overwrite the offset for the topic in an external storage.
     *
     * @param topic     - Topic name.
     * @param partition - Partition of the topic.
     * @param offset    - Offset to be stored.
     */
    void saveOffsetInExternalStore(String topic, int partition, long offset) {
        try {
            FileWriter writer = new FileWriter(storageName(topic, partition), false);
            BufferedWriter bufferedWriter = new BufferedWriter(writer);
            bufferedWriter.write(offset + "");
            bufferedWriter.flush();
            bufferedWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
            throw new RuntimeException(e);
        }
    }

    /**
     * @return The last offset + 1 for the provided topic and partition.
     */
    long readOffsetFromExternalStore(String topic, int partition) {
        try {
            Stream<String> stream = Files.lines(Paths.get(storageName(topic, partition)));
            return Long.parseLong(stream.collect(Collectors.toList()).get(0)) + 1;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 0;
    }

    private String storageName(String topic, int partition) {
        return "Offsets\\" + storagePrefix + "-" + topic + "-" + partition;
    }
}
saveOffsetInExternalStore is called only after the record processing succeeds; otherwise no offset is stored. Since I am using Kafka topics as the source, I specify startingOffsets as the offsets retrieved from readOffsetFromExternalStore.
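For completeness, this is roughly how I feed the retrieved offsets back in: the Kafka source takes a startingOffsets JSON string with per topic/partition offsets (the topic name, partition count and bootstrap servers here are just placeholders):
OffsetManager offsetManager = new OffsetManager("spark");

// builds e.g. {"myTopic":{"0":42,"1":17}}
String startingOffsets = "{\"myTopic\":{"
        + "\"0\":" + offsetManager.readOffsetFromExternalStore("myTopic", 0) + ","
        + "\"1\":" + offsetManager.readOffsetFromExternalStore("myTopic", 1)
        + "}}";

Dataset<Row> df = sparkSession
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "myTopic")
        .option("startingOffsets", startingOffsets)
        .load();
As far as I understand, startingOffsets is only honoured when the query starts without an existing checkpoint; when resuming from a checkpoint, the query always continues from where it left off.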

"Is it supporting for structured streaming?"
No, it is not supported in Structured Streaming to commit offsets back to Kafka, similar to what could be done using Spark Streaming (DStreams). The Spark Structured Streaming + Kafka Integration Guide on Kafka specific configurations is very precise about this:
"Kafka source doesn’t commit any offset."
I have written a more comprehensive answer about this in How to manually set groupId and commit Kafka offsets in Spark Structured Streaming.
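If you only need the processed offsets to be visible outside of Spark (for monitoring, or to feed them back in via startingOffsets), one common workaround is to read them from the query progress with a StreamingQueryListener. A minimal sketch, assuming spark is your SparkSession and the query has a single Kafka source; this does not commit anything to Kafka, it merely exposes the per-source offset JSON so you can persist it wherever you like:
spark.streams().addListener(new StreamingQueryListener() {
    @Override
    public void onQueryStarted(QueryStartedEvent event) { }

    @Override
    public void onQueryProgress(QueryProgressEvent event) {
        for (SourceProgress source : event.progress().sources()) {
            // endOffset() is a JSON string such as {"myTopic":{"0":42,"1":17}}
            String endOffsets = source.endOffset();
            // persist endOffsets to your external store (file, database, Kafka, ...)
        }
    }

    @Override
    public void onQueryTerminated(QueryTerminatedEvent event) { }
});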

Related

Reading from HDFS partition sequentially one record at a time using PySpark/Python

I want to read from an HDFS partition one record at a time, sequentially. I found a sample Java snippet that handles this logic. Is there a way to achieve this using PySpark/Python?
Sample Java snippet below (note the while loop):
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/file1.txt");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    // code to manipulate the data which is read
    System.out.println(new String(b, 0, numBytes));
}
in.close();
fileSystem.close();

Missing Avro Custom Header when using Spark SQL Streaming

Before sending an Avro GenericRecord to Kafka, a Header is inserted like so.
ProducerRecord<String, byte[]> record = new ProducerRecord<>(topicName, key, message);
record.headers().add("schema", schema);
Consuming the record.
When using Spark Streaming, the header from the ConsumerRecord is intact.
KafkaUtils.createDirectStream(streamingContext, LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams))
    .foreachRDD(rdd -> {
        rdd.foreach(record -> {
            System.out.println(new String(record.headers().headers("schema").iterator().next().value()));
        });
    });
But when using Spark SQL Streaming, the header seems to be missing.
StreamingQuery query = dataset.writeStream().foreach(new ForeachWriter<>() {
    ...
    @Override
    public void process(Row row) {
        String topic = (String) row.get(2);
        int partition = (int) row.get(3);
        long offset = (long) row.get(4);
        String key = new String((byte[]) row.get(0));
        byte[] value = (byte[]) row.get(1);
        ConsumerRecord<String, byte[]> record =
                new ConsumerRecord<String, byte[]>(topic, partition, offset, key, value);
        // I need the schema to decode the Avro!
    }
}).start();
Where can I find the custom header value when using Spark SQL Streaming approach?
Version:
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
UPDATE
I tried 3.0.0-preview2 of spark-sql_2.12 and spark-sql-kafka-0-10_2.12. I added
.option("includeHeaders", true)
But I still only get these columns from the Row.
+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
Kafka headers in Structured Streaming are supported only from Spark 3.0: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html
Please look at the includeHeaders option for more details.
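With includeHeaders enabled on 3.0, the Kafka source exposes an extra headers column of type array<struct<key: string, value: binary>>, so something roughly like this should surface the custom header (the bootstrap servers and topic name below are placeholders):
Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "my-topic")
        .option("includeHeaders", "true")
        .load();

// keep key/value and extract the value of the "schema" header
Dataset<Row> withSchemaHeader = df.selectExpr(
        "key",
        "value",
        "filter(headers, h -> h.key = 'schema')[0].value AS schemaHeader");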

Manipulate trigger interval in spark structured streaming

For a given scenario, I want to filter the datasets in Structured Streaming using a combination of continuous and batch triggers.
I know it may sound unrealistic or perhaps not feasible. Below is what I am trying to achieve.
Let the processing time-interval set in the app be 5 minutes.
Let the record be of below schema:
{
  "type": "record",
  "name": "event",
  "fields": [
    { "name": "Student", "type": "string" },
    { "name": "Subject", "type": "string" }
  ]
}
My streaming app is supposed to write the result to the sink when either of the two criteria below is met:
A student has more than 5 subjects (this criterion takes priority).
The processing time provided in the trigger has expired.
private static Injection<GenericRecord, byte[]> recordInjection;
private static StructType type;

public static final String USER_SCHEMA = "{"
        + "\"type\":\"record\","
        + "\"name\":\"alarm\","
        + "\"fields\":["
        + " { \"name\":\"student\", \"type\":\"string\" },"
        + " { \"name\":\"subject\", \"type\":\"string\" }"
        + "]}";

private static Schema.Parser parser = new Schema.Parser();
private static Schema schema = parser.parse(USER_SCHEMA);

static {
    recordInjection = GenericAvroCodecs.toBinary(schema);
    type = (StructType) SchemaConverters.toSqlType(schema).dataType();
}

sparkSession.udf().register("deserialize", (byte[] data) -> {
    GenericRecord record = recordInjection.invert(data).get();
    return RowFactory.create(record.get("student").toString(), record.get("subject").toString());
}, DataTypes.createStructType(type.fields()));

Dataset<Row> ds2 = ds1
        .select("value").as(Encoders.BINARY())
        .selectExpr("deserialize(value) as rows")
        .select("rows.*")
        .selectExpr("student", "subject");

StreamingQuery query1 = ds2
        .writeStream()
        .foreachBatch(
                new VoidFunction2<Dataset<Row>, Long>() {
                    @Override
                    public void call(Dataset<Row> rowDataset, Long aLong) throws Exception {
                        rowDataset.select("student,concat(',',subject)").alias("value").groupBy("student");
                    }
                }
        ).format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "new_in")
        .option("checkpointLocation", "checkpoint")
        .outputMode("append")
        .trigger(Trigger.ProcessingTime(10000))
        .start();

query1.awaitTermination();
Kafka Producer console:
Student:Test, Subject:x
Student:Test, Subject:y
Student:Test, Subject:z
Student:Test1, Subject:x
Student:Test2, Subject:x
Student:Test, Subject:w
Student:Test1, Subject:y
Student:Test2, Subject:y
Student:Test, Subject:v
In the Kafka consumer console, I am expecting output like the below.
Test:{x,y,z,w,v} =>This should be the first response
Test1:{x,y} => second
Test2:{x,y} => Third by the end of processing time
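One note on the foreachBatch usage above: the aggregation and the Kafka write usually both go inside the batch function rather than also chaining format("kafka") on the same writer. A rough sketch of that per-trigger pattern (it reuses the topic/bootstrap settings from the snippet above and does not implement the "more than 5 subjects" early flush):
// functions.* come from org.apache.spark.sql.functions
StreamingQuery query1 = ds2
        .writeStream()
        .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
            @Override
            public void call(Dataset<Row> batchDf, Long batchId) throws Exception {
                // group the micro-batch by student and concatenate the subjects seen in it
                Dataset<Row> grouped = batchDf
                        .groupBy("student")
                        .agg(functions.concat_ws(",",
                                functions.collect_list(functions.col("subject"))).alias("subjects"));

                // write e.g. "Test:{x,y,z}" to Kafka as the record value
                grouped
                        .selectExpr("student AS key", "concat(student, ':{', subjects, '}') AS value")
                        .write()
                        .format("kafka")
                        .option("kafka.bootstrap.servers", "localhost:9092")
                        .option("topic", "new_in")
                        .save();
            }
        })
        .option("checkpointLocation", "checkpoint")
        .trigger(Trigger.ProcessingTime(10000))
        .start();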

Loading data from Spark Structured Streaming into ArrayList

I need to send data from Kafka to Kinesis Firehose. I am processing the Kafka data with Spark Structured Streaming. I am not sure how to collect the streaming query's dataset into an ArrayList variable - say, recordList - of e.g. 100 records (could be any other value), and then call the Firehose API's putRecordBatch(recordList) to put the records into Firehose.
I think you want to check out foreach and foreachBatch, depending on your Spark version. foreachBatch arrives in v2.4.0, and foreach is available before v2.4.0 as well. If there is no streaming sink implementation available for Kinesis Firehose, then you should write your own implementation of ForeachWriter. Databricks has some nice examples of using foreach to create custom writers.
I haven't ever used Kinesis but here is an example of what your custom sink might look like.
case class MyConfigInfo(info1: String, info2: String)

class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
  var kinesisProducer: KinesisProducer = _

  def open(partitionId: Long, version: Long): Boolean = {
    kinesisProducer = ??? // set up the kinesis producer using MyConfigInfo
    true
  }

  def process(value: (String, String)): Unit = {
    // ask kinesisProducer to send data
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the kinesis producer
  }
}
If you're using the AWS kinesisfirehose API, you might do something like this:
case class MyConfigInfo(info1: String, info2: String)

class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
  var firehoseClient: AmazonKinesisFirehose = _
  var req = new PutRecordBatchRequest()
  var records = 0
  val recordLimit = 100 // maybe you need to set this?

  def open(partitionId: Long, version: Long): Boolean = {
    firehoseClient = ??? // set up the firehose client using MyConfigInfo
    true
  }

  def process(value: (String, String)): Unit = {
    // ask the firehose client to send data or batch the request
    val record: Record = ??? // create a Record out of value
    req = req.withRecords(record) // appends to the request's record list
    records = records + 1
    if (records >= recordLimit) {
      firehoseClient.putRecordBatch(req)
      req = new PutRecordBatchRequest()
      records = 0
    }
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the firehose client
    // or instead you could put the last (partial) batch request to the firehose client here,
    // but I'm not sure if that's good practice
  }
}
Then you'd use it as such
val writer = new KinesisSink(configuration)
val query =
streamingSelectDF
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(ProcessingTime("25 seconds"))
.start()
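Since you mentioned collecting batches of ~100 records into a list for putRecordBatch, a foreachBatch variant (Spark 2.4+) may map more directly to that. A rough Java sketch using the AWS SDK v1 Firehose client; the delivery stream name and checkpoint location are placeholders, and note that collectAsList() pulls each micro-batch to the driver:
// assumed imports: java.nio.ByteBuffer, java.nio.charset.StandardCharsets, java.util.*,
// com.amazonaws.services.kinesisfirehose.*, com.amazonaws.services.kinesisfirehose.model.*
StreamingQuery query = streamingSelectDF
        .writeStream()
        .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
            @Override
            public void call(Dataset<Row> batchDf, Long batchId) throws Exception {
                AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
                List<Record> recordList = new ArrayList<>();
                // collect the micro-batch to the driver as JSON strings
                for (String json : batchDf.toJSON().collectAsList()) {
                    recordList.add(new Record().withData(
                            ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8))));
                    if (recordList.size() == 100) { // Firehose allows at most 500 records per call
                        firehose.putRecordBatch(new PutRecordBatchRequest()
                                .withDeliveryStreamName("my-delivery-stream") // placeholder
                                .withRecords(recordList));
                        recordList.clear();
                    }
                }
                if (!recordList.isEmpty()) { // flush the remainder of the micro-batch
                    firehose.putRecordBatch(new PutRecordBatchRequest()
                            .withDeliveryStreamName("my-delivery-stream")
                            .withRecords(recordList));
                }
            }
        })
        .option("checkpointLocation", "checkpoint-firehose") // placeholder
        .trigger(Trigger.ProcessingTime("25 seconds"))
        .start();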

Storm analog for spark.streaming.kafka.maxRatePerPartition in Spark

There is a spark.streaming.kafka.maxRatePerPartition property in Spark Streaming which limits the number of messages read from Apache Kafka per second. Is there a similar property for Storm?
I think there is no property to limit the number of messages per second.
If you use the new Kafka client (Kafka 0.9) spout, you can set 'MaxUncommittedOffsets', which throttles the number of uncommitted offsets (i.e. the number of in-flight messages).
However, if you are still using the old Kafka spout (Kafka prior to 0.9), you can use the Storm property 'topology.max.spout.pending', which throttles the total number of unacknowledged messages per spout task.
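For reference, those two settings look roughly like this (the bootstrap servers, topic and the limits themselves are placeholders):
// new storm-kafka-client spout (Kafka 0.9+): cap the number of uncommitted (in-flight) offsets
KafkaSpoutConfig<String, String> spoutConfig =
        KafkaSpoutConfig.builder("localhost:9092", "my-topic")
                .setMaxUncommittedOffsets(250)
                .build();

// old spout (or any acking spout): cap unacknowledged tuples per spout task
Config conf = new Config();
conf.setMaxSpoutPending(250);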
There is a workaround which helps to do that in Storm. You can simply write the following wrapper for KafkaSpout, which counts the number of messages emitted by the spout per second. When it reaches the desired number (Config.RATE), it emits nothing.
public class MyKafkaSpout extends KafkaSpout {

    private int counter = 0;
    private int currentSecond = 0;
    private final int tuplesPerSecond = Config.RATE;

    public MyKafkaSpout(SpoutConfig spoutConf) {
        super(spoutConf);
    }

    @Override
    public void nextTuple() {
        if (counter == tuplesPerSecond) {
            int newSecond = (int) TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis());
            if (newSecond <= currentSecond) {
                return;
            }
            counter = 0;
            currentSecond = newSecond;
        }
        ++counter;
        super.nextTuple();
    }
}
