Spark Structured Streaming gets messages from only one partition of Kafka - apache-spark

I have run into a situation where Spark Structured Streaming can only get messages from one partition of a two-partition Kafka topic.
My topic:
C:\bigdata\kafka_2.11-0.10.1.1\bin\windows>kafka-topics --create --zookeeper localhost:2181 --partitions 2 --replication-factor 1 --topic test4
Kafka Producer:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaFileProducer {

    // kafka producer
    Producer<String, String> producer;

    public KafkaFileProducer() {
        // configs
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        //props.put("group.id", "testgroup");
        props.put("batch.size", "16384");
        props.put("auto.commit.interval.ms", "1000");
        props.put("linger.ms", "0");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("block.on.buffer.full", "true");

        // instantiate a producer
        producer = new KafkaProducer<String, String>(props);
    }

    /**
     * @param filePath
     */
    public void sendFile(String filePath) {
        FileInputStream fis;
        BufferedReader br = null;
        try {
            fis = new FileInputStream(filePath);
            // construct BufferedReader from InputStreamReader
            br = new BufferedReader(new InputStreamReader(fis));
            int count = 0;
            String line = null;
            while ((line = br.readLine()) != null) {
                count++;
                // don't send the header
                if (count > 1) {
                    producer.send(new ProducerRecord<String, String>("test4", count + "", line));
                    Thread.sleep(10);
                }
            }
            System.out.println("Sent " + count + " lines of data");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            producer.close();
        }
    }
}
Spark Structured Streaming consumer:
System.setProperty("hadoop.home.dir", "C:\\bigdata\\winutils");

final SparkSession sparkSession = SparkSession.builder()
        .appName("Spark Data Processing")
        .master("local[2]")
        .getOrCreate();

// create kafka stream to get the lines
Dataset<Tuple2<String, String>> stream = sparkSession
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "test4")
        .option("startingOffsets", "{\"test4\":{\"0\":-1,\"1\":-1}}")
        .option("failOnDataLoss", "false")
        .load()
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .as(Encoders.tuple(Encoders.STRING(), Encoders.STRING()));

Dataset<String> lines = stream.map(
        (MapFunction<Tuple2<String, String>, String>) (Tuple2<String, String> tuple) -> tuple._2,
        Encoders.STRING());

Dataset<Row> result = lines.groupBy().count();

// Start running the query that prints the running counts to the console
StreamingQuery query = result //.orderBy("callTimeBin")
        .writeStream()
        .outputMode("complete")
        .format("console")
        .start();

// wait for the query to finish
try {
    query.awaitTermination();
} catch (StreamingQueryException e) {
    e.printStackTrace();
}
When I run the producer to send 100 lines from a file, the query only returns 51 lines. I read Spark's debug logs and noticed the following:
17/02/15 10:52:49 DEBUG StreamExecution: Execution stats: ExecutionStats(Map(),List(),Map(watermark -> 1970-01-01T00:00:00.000Z))
17/02/15 10:52:49 DEBUG StreamExecution: Starting Trigger Calculation
17/02/15 10:52:49 DEBUG KafkaConsumer: Pausing partition test4-1
17/02/15 10:52:49 DEBUG KafkaConsumer: Pausing partition test4-0
17/02/15 10:52:49 DEBUG KafkaSource: Partitions assigned to consumer: [test4-1, test4-0]. Seeking to the end.
17/02/15 10:52:49 DEBUG KafkaConsumer: Seeking to end of partition test4-1
17/02/15 10:52:49 DEBUG KafkaConsumer: Seeking to end of partition test4-0
17/02/15 10:52:49 DEBUG Fetcher: Resetting offset for partition test4-1 to latest offset.
17/02/15 10:52:49 DEBUG Fetcher: **Fetched {timestamp=-1, offset=49} for partition test4-1
17/02/15 10:52:49 DEBUG Fetcher: Resetting offset for partition test4-1 to earliest offset.
17/02/15 10:52:49 DEBUG Fetcher: Fetched {timestamp=-1, offset=0} for partition test4-1**
17/02/15 10:52:49 DEBUG Fetcher: Resetting offset for partition test4-0 to latest offset.
17/02/15 10:52:49 DEBUG Fetcher: Fetched {timestamp=-1, offset=51} for partition test4-0
17/02/15 10:52:49 DEBUG KafkaSource: Got latest offsets for partition : Map(test4-1 -> 0, test4-0 -> 51)
17/02/15 10:52:49 DEBUG KafkaSource: GetOffset: ArrayBuffer((test4-0,51), (test4-1,0))
17/02/15 10:52:49 DEBUG StreamExecution: getOffset took 0 ms
17/02/15 10:52:49 DEBUG StreamExecution: triggerExecution took 0 ms
I don't know why test4-1 is always reset to the earliest offset.
If someone knows how to get all messages from all partitions, I would greatly appreciate it.
Thanks,

There is a known Kafka issue in the 0.10.1.* client: https://issues.apache.org/jira/browse/KAFKA-4547
Right now you can use the 0.10.0.1 client as a workaround; it can still talk to a Kafka 0.10.1.* cluster.
See https://issues.apache.org/jira/browse/SPARK-18779 for more details.
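One way to confirm which kafka-clients version the application actually picked up is to print it at runtime. A minimal sketch, assuming the standard kafka-clients jar (which ships org.apache.kafka.common.utils.AppInfoParser) is on the classpath:

import org.apache.kafka.common.utils.AppInfoParser;

public class KafkaClientVersionCheck {
    public static void main(String[] args) {
        // prints the version of the kafka-clients jar that was loaded, which should
        // report 0.10.0.1 once the workaround dependency is in place
        System.out.println("kafka-clients version: " + AppInfoParser.getVersion());
    }
}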

Related

SparkSession null pointer exception in Dataset foreach

I am new to Spark.
I want to keep getting messages from Kafka and then save them to S3 once the message count exceeds 100,000.
I implemented this with Dataset.collectAsList(), but it threw an error: Total size of serialized results of 3 tasks (1389.3 MiB) is bigger than spark.driver.maxResultSize.
So I switched to foreach, and it throws a null pointer exception when the SparkSession is used to createDataFrame.
Any idea about it? Thanks.
---Code---
SparkSession spark = generateSparkSession();
registerUdf4AddPartition(spark);

Dataset<Row> dataset = spark.readStream().format("kafka")
        .option("kafka.bootstrap.servers", args[0])
        .option("kafka.group.id", args[1])
        .option("subscribe", args[2])
        .option("kafka.security.protocol", SecurityProtocol.SASL_PLAINTEXT.name)
        .load();

DataStreamWriter<Row> console = dataset.toDF().writeStream().foreachBatch((rawDataset, time) -> {
    Dataset<Row> rowDataset = rawDataset.selectExpr("CAST(value AS STRING)");

    // using foreach
    rowDataset.foreach(row -> {
        List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0)))
                .withType(Span.class).build().parse();
        spans.addAll(rawDataList);
        batchSave(spark);
    });

    // using collectAsList
    List<Row> rows = rowDataset.collectAsList();
    for (Row row : rows) {
        List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0)))
                .withType(Span.class).build().parse();
        spans.addAll(rawDataList);
        batchSave(spark);
    }
});

StreamingQuery start = console.start();
start.awaitTermination();

public static void batchSave(SparkSession spark) {
    synchronized (spans) {
        if (spans.size() == 100000) {
            System.out.println(spans.isEmpty());
            Dataset<Row> spanDataSet = spark.createDataFrame(spans, Span.class);
            Dataset<Row> finalResult = addCustomizedTimeByUdf(spanDataSet);
            StringBuilder pathBuilder = new StringBuilder("s3a://fwk-dataplatform-np/datalake/log/FWK/ART2/test/leftAndRight");
            finalResult.repartition(1).write().partitionBy("year", "month", "day", "hour")
                    .format("csv").mode("append").save(pathBuilder.toString());
            spans.clear();
        }
    }
}
The main SparkSession runs on the driver, while the tasks inside foreach(...) run distributed across the executors, so the spark session is not available on the executors; that is why createDataFrame throws a null pointer exception there.
As a side note, there is no point in using synchronized inside a foreach task, since the work is distributed across separate JVMs.
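As a sketch only (not from the original post): one way to keep all SparkSession work on the driver is to transform the whole micro-batch Dataset inside foreachBatch and write it out directly. This reuses Span, CsvToBeanBuilder and addCustomizedTimeByUdf from the question; the output path is illustrative, and the manual 100,000-record buffering is replaced by writing each micro-batch.

// parse and write each micro-batch with Dataset operations instead of
// calling createDataFrame from inside a per-row foreach on the executors
DataStreamWriter<Row> writer = dataset.writeStream().foreachBatch((rawDataset, batchId) -> {
    Dataset<Span> spanDs = rawDataset
            .selectExpr("CAST(value AS STRING)")
            .as(Encoders.STRING())
            .flatMap((FlatMapFunction<String, Span>) value ->
                            new CsvToBeanBuilder<Span>(new StringReader(value))
                                    .withType(Span.class).build().parse().iterator(),
                    Encoders.bean(Span.class));

    Dataset<Row> finalResult = addCustomizedTimeByUdf(spanDs.toDF());

    finalResult.repartition(1).write()
            .partitionBy("year", "month", "day", "hour")
            .format("csv").mode("append")
            .save("s3a://some-bucket/some/prefix"); // illustrative output path
});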

Spark Structured Streaming Kafka - how to read from a specific partition of a topic and do offset management

I am new to Spark Structured Streaming and Kafka offset management.
I am using spark-streaming-kafka-0-10_2.11.
In the consumer, how can I read from a specific partition of a topic?
comapany_df = sparkSession
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", applicationProperties.getProperty(BOOTSTRAP_SERVERS_CONFIG))
        .option("subscribe", topicName)
I am using something like the above. How do I specify a particular partition to read from?
You can use the following code block to read from a particular Kafka Partition.
public void processKafka() throws InterruptedException {
    LOG.info("************ SparkStreamingKafka.processKafka start");

    // Create the spark application
    SparkConf sparkConf = new SparkConf();
    sparkConf.set("spark.executor.cores", "5");

    // To express any Spark Streaming computation, a StreamingContext object needs to be created.
    // This object serves as the main entry point for all Spark Streaming functionality.
    // This creates the spark streaming context with a 'numSeconds' second batch size
    jssc = new JavaStreamingContext(sparkConf, Durations.seconds(sparkBatchInterval));

    // List of parameters
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", this.getBrokerList());
    kafkaParams.put("client.id", "SpliceSpark");
    kafkaParams.put("group.id", "mynewgroup");
    kafkaParams.put("auto.offset.reset", "earliest");
    kafkaParams.put("enable.auto.commit", false);
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);

    List<TopicPartition> topicPartitions = new ArrayList<TopicPartition>();
    for (int i = 0; i < 5; i++) {
        topicPartitions.add(new TopicPartition("mytopic", i));
    }

    // List of kafka topics to process
    Collection<String> topics = Arrays.asList(this.getTopicList().split(","));

    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
    );

    // Another version of an attempt
    /*
    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Assign(topicPartitions, kafkaParams)
    );
    */

    messages.foreachRDD(new PrintRDDDetails());

    // Start running the job to receive and transform the data
    jssc.start();

    // Allows the current thread to wait for the termination of the context by stop() or by an exception
    jssc.awaitTermination();
}
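If the goal is to do this in Structured Streaming itself (as in the readStream snippet from the question) rather than with the DStream API above, a specific partition can be requested with the assign option instead of subscribe. A minimal sketch, reusing the names from the question:

// "assign" takes a JSON map of topic -> partition numbers; here only partition 0 of topicName is read
Dataset<Row> companyDf = sparkSession
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", applicationProperties.getProperty(BOOTSTRAP_SERVERS_CONFIG))
        .option("assign", "{\"" + topicName + "\":[0]}")
        .load();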

Spark Structured Streaming Kafka Offset Management

I'm looking into storing Kafka offsets inside Kafka itself for Spark Structured Streaming, the way it works for DStreams with stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges); I'm looking for the same thing, but for Structured Streaming.
Is it supported for Structured Streaming? If yes, how can I achieve it?
I know about HDFS checkpointing using .option("checkpointLocation", checkpointLocation), but I'm interested specifically in built-in offset management.
I expect Kafka to store the offsets internally, without a Spark HDFS checkpoint.
I am using this piece of code, found somewhere:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OffsetManager {

    private String storagePrefix;

    public OffsetManager(String storagePrefix) {
        this.storagePrefix = storagePrefix;
    }

    /**
     * Overwrite the offset for the topic in an external storage.
     *
     * @param topic     - Topic name.
     * @param partition - Partition of the topic.
     * @param offset    - Offset to be stored.
     */
    void saveOffsetInExternalStore(String topic, int partition, long offset) {
        try {
            FileWriter writer = new FileWriter(storageName(topic, partition), false);
            BufferedWriter bufferedWriter = new BufferedWriter(writer);
            bufferedWriter.write(offset + "");
            bufferedWriter.flush();
            bufferedWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
            throw new RuntimeException(e);
        }
    }

    /**
     * @return The last offset + 1 for the provided topic and partition.
     */
    long readOffsetFromExternalStore(String topic, int partition) {
        try {
            Stream<String> stream = Files.lines(Paths.get(storageName(topic, partition)));
            return Long.parseLong(stream.collect(Collectors.toList()).get(0)) + 1;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 0;
    }

    private String storageName(String topic, int partition) {
        return "Offsets\\" + storagePrefix + "-" + topic + "-" + partition;
    }
}
saveOffsetInExternalStore(...) is called after record processing succeeds; otherwise no offset is stored. And since I am using Kafka topics as the source, I specify startingOffsets as the offsets retrieved via readOffsetFromExternalStore(...).
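For illustration only, a sketch of how such externally stored offsets might be fed back into the Kafka source through the startingOffsets JSON option; the topic name, partition count, bootstrap servers and the sparkSession variable are assumptions, not from the original post:

// build a startingOffsets JSON such as {"mytopic":{"0":42,"1":17}} from the external store
OffsetManager offsetManager = new OffsetManager("my-app");
String startingOffsets = String.format(
        "{\"mytopic\":{\"0\":%d,\"1\":%d}}",
        offsetManager.readOffsetFromExternalStore("mytopic", 0),
        offsetManager.readOffsetFromExternalStore("mytopic", 1));

Dataset<Row> kafkaDf = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "mytopic")
        .option("startingOffsets", startingOffsets)
        .load();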
"Is it supporting for structured streaming?"
No, it is not supported in Structured Streaming to commit offsets back to Kafka, similar to what could be done using Spark Streaming (DStreams). The Spark Structured Streaming + Kafka Integration Guide on Kafka specific configurations is very precise about this:
"Kafka source doesn’t commit any offset."
I have written a more comprehensive answer about this in How to manually set groupId and commit Kafka offsets in Spark Structured Streaming.
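In practice this means relying on the query's own checkpoint rather than offsets committed to Kafka. A minimal sketch, where kafkaDf stands for the Kafka source Dataset and the checkpoint path is illustrative:

// Structured Streaming tracks Kafka offsets itself in the checkpoint directory,
// so restarting the query with the same checkpointLocation resumes where it left off
StreamingQuery query = kafkaDf.writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/kafka-offset-demo") // illustrative path
        .start();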

org.apache.spark.SparkException: Task not serializable?

Here is my code:
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val lines = stream.map(_._2)
lines.print()

lines.foreachRDD { rdd =>
  rdd.foreach { data =>
    if (data != null) {
      println(data.toString)
      val records = data.toString
      CassandraConnector(conf).withSessionDo { session =>
        session.execute("INSERT INTO propatterns_test.meterreadings JSON " + records + ";")
      }
    }
  }
}
So where am I going wrong?
My error log is:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:874)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:873)
at etc.
I am storing the RDD messages into Cassandra.

Kafka Spark directStream cannot get data

I'm using the Spark directStream API to read data from Kafka. My code is as follows:
val sparkConf = new SparkConf().setAppName("testdirectStreaming")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))

val kafkaParams = Map[String, String](
  "auto.offset.reset" -> "smallest",
  "metadata.broker.list" -> "10.0.0.11:9092",
  "spark.streaming.kafka.maxRatePerPartition" -> "100"
)

// I set fromOffset to 0 for all 3 partitions
var fromOffsets: Map[TopicAndPartition, Long] = Map(TopicAndPartition("mytopic", 0) -> 0)
fromOffsets += (TopicAndPartition("mytopic", 1) -> 0)
fromOffsets += (TopicAndPartition("mytopic", 2) -> 0)

val kafkaData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, MessageAndMetadata[String, String]](
  ssc, kafkaParams, fromOffsets, (mmd: MessageAndMetadata[String, String]) => mmd)

var offsetRanges = Array[OffsetRange]()

kafkaData.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map {
  _.message()
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"---${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
  rdd.foreachPartition { partitionOfRecords =>
    partitionOfRecords.foreach { line =>
      println("===============value:" + line)
    }
  }
}
I'm sure there is data in the Kafka cluster, but my code could not get any of it. Thanks in advance.
I found the reason: the old messages in Kafka had already been deleted because the retention period had expired. So when I set fromOffset to 0, it caused an OffsetOutOfRange exception, and that exception caused Spark to reset the offsets to the latest ones. Therefore I could not get any messages. The solution is to set appropriate fromOffsets to avoid the exception.
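For what it's worth, a sketch of how the earliest offsets that still exist on the broker can be looked up before building fromOffsets; this uses the newer Java consumer (org.apache.kafka.clients.consumer.KafkaConsumer, available with a 0.10.1+ kafka-clients jar), and the broker address and topic below are illustrative:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class EarliestOffsetLookup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.0.0.11:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // look up the earliest offsets still available for each partition,
        // which can then be used as fromOffsets instead of hard-coded zeros
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = Arrays.asList(
                    new TopicPartition("mytopic", 0),
                    new TopicPartition("mytopic", 1),
                    new TopicPartition("mytopic", 2));
            Map<TopicPartition, Long> earliest = consumer.beginningOffsets(partitions);
            earliest.forEach((tp, offset) ->
                    System.out.println(tp + " earliest available offset: " + offset));
        }
    }
}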
