Spark Structured Streaming with Kafka doesn't honor startingOffset="earliest" - apache-spark

I've set up Spark Structured Streaming (Spark 2.3.2) to read from Kafka (2.0.0). I'm unable to consume from the beginning of the topic if messages entered the topic before the Spark streaming job was started. Is this expected behavior of Spark streaming, i.e. does it ignore Kafka messages produced prior to the initial run of the streaming job (even with .option("stratingOffsets","earliest"))?
Steps to reproduce
Before starting the streaming job, create a test topic (single broker, single partition) and produce messages to it (3 messages in my example; a minimal producer sketch follows).
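(Not part of the original post: a minimal Scala sketch of this producing step, assuming the kafka-clients library is on the classpath and the single broker from the question at localhost:9097.)
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
// Hypothetical helper: sends three test messages to the "test" topic.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9097")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
(1 to 3).foreach(i => producer.send(new ProducerRecord[String, String]("test", s"message-$i")))
producer.flush()
producer.close()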
Start spark-shell with the following command: spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2.3.1.0.0-78 --repositories http://repo.hortonworks.com/content/repositories/releases/
Execute the Spark Scala code below.
// Local
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9097")
.option("failOnDataLoss","false")
.option("stratingOffsets","earliest")
.option("subscribe", "test")
.load()
// Sink Console
val ds = df.writeStream.format("console").queryName("Write to console")
.trigger(org.apache.spark.sql.streaming.Trigger.ProcessingTime("10 second"))
.start()
Expected vs actual output
I expect the stream to start from offset=1. However, it starts reading from offset=3. You can see that the Kafka client is actually resetting the starting offset: 2019-06-18 21:22:57 INFO Fetcher:583 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Resetting offset for partition test-0 to offset 3.
I can see that the Spark stream processes messages that I produce after starting the streaming job, but nothing that was produced before it started. The full driver log from the query start is below.
2019-06-18 21:22:57 INFO AppInfoParser:109 - Kafka version : 2.0.0.3.1.0.0-78
2019-06-18 21:22:57 INFO AppInfoParser:110 - Kafka commitId : 0f47b27cde30d177
2019-06-18 21:22:57 INFO MicroBatchExecution:54 - Starting new streaming query.
2019-06-18 21:22:57 INFO Metadata:273 - Cluster ID: LqofSZfjTu29BhZm6hsgsg
2019-06-18 21:22:57 INFO AbstractCoordinator:677 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Discovered group coordinator localhost:9097 (id: 2147483647 rack: null)
2019-06-18 21:22:57 INFO ConsumerCoordinator:462 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Revoking previously assigned partitions []
2019-06-18 21:22:57 INFO AbstractCoordinator:509 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] (Re-)joining group
2019-06-18 21:22:57 INFO AbstractCoordinator:473 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Successfully joined group with generation 1
2019-06-18 21:22:57 INFO ConsumerCoordinator:280 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Setting newly assigned partitions [test-0]
2019-06-18 21:22:57 INFO Fetcher:583 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Resetting offset for partition test-0 to offset 3.
2019-06-18 21:22:58 INFO KafkaSource:54 - Initial offsets: {"test":{"0":3}}
2019-06-18 21:22:58 INFO Fetcher:583 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Resetting offset for partition test-0 to offset 3.
2019-06-18 21:22:58 INFO MicroBatchExecution:54 - Committed offsets for batch 0. Metadata OffsetSeqMetadata(0,1560910978083,Map(spark.sql.shuffle.partitions -> 200, spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider))
2019-06-18 21:22:58 INFO KafkaSource:54 - GetBatch called with start = None, end = {"test":{"0":3}}
Spark Batch mode
I was able to confirm that batch mode reads from the beginning, so there is no issue with the Kafka retention configuration:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9097")
.option("subscribe", "test")
.load()
df.count // Long = 3

Haha, it was a simple typo: "stratingOffsets" should be "startingOffsets".
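For completeness (not in the original answer): the Kafka source simply ignores option keys it does not recognize, so the misspelled key silently fell back to the streaming default of startingOffsets = latest. The corrected reader:
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9097")
.option("failOnDataLoss", "false")
.option("startingOffsets", "earliest")
.option("subscribe", "test")
.load()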

You can do this in two ways: load data from Kafka into a streaming dataframe, or load data from Kafka into a static dataframe (for testing).
I think you are not seeing the data because of the group id. Kafka commits the consumer group and its offsets to an internal topic, so make sure the group name is unique for each read.
Here are the two options.
Option 1: read data from Kafka into a streaming dataframe
// spark structured streaming with kafka
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StringType
import spark.implicits._
// dataSchema is assumed to be the StructType of the JSON messages in the topic
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers","app01.app.test.net:9097,app02.app.test.net:9097")
.option("subscribe", "kafka-testing-topic")
.option("kafka.security.protocol", "SASL_PLAINTEXT")
.option("startingOffsets","earliest")
.option("maxOffsetsPerTrigger","6000")
.load()
val ds2 = ds1.select(from_json($"value".cast(StringType), dataSchema).as("data")).select("data.*")
val ds3 = ds2.groupBy("TABLE_NAME").count()
ds3.writeStream
.trigger(Trigger.ProcessingTime("10 seconds"))
.queryName("query1").format("console")
.outputMode("complete")
.start()
.awaitTermination()
Option 2: read data from Kafka into a static dataframe (for testing; it will load from the beginning)
// Subscribe to 1 topic, defaults to the earliest and latest offsets
val ds1 = spark.read.format("kafka")
.option("kafka.bootstrap.servers","app01.app.test.net:9097,app02.app.test.net:9097")
.option("subscribe", "kafka-testing-topic")
.option("kafka.security.protocol", "SASL_PLAINTEXT")
.option("spark.streaming.kafka.consumer.cache.enabled","false")
.load()
val ds2 = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic", "partition", "offset", "timestamp")
val ds3 = ds2.select("value").rdd.map(x => x.toString)
ds3.count()
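As a side note (not in the original answer), the same count can be taken without dropping to the RDD API; an equivalent of the last two lines:
import spark.implicits._
val valueCount = ds2.select($"value").as[String].count()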

Related

Spark Streaming HUDI HoodieException: Config conflict(key current value existing value): RecordKey:

I am connecting to the Kafka topic with Spark, creating a dataframe, and then storing it into Hudi:
df
.selectExpr("key", "topic", "partition", "offset", "timestamp", "timestampType", "CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("hudi")
.options(getQuickstartWriteConfigs)
.option(PRECOMBINE_FIELD.key(), "essDateTime")
.option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator")
.option(RECORDKEY_FIELD.key(), "offset,timestamp")//"offset,essDateTime")
.option(TBL_NAME.key, streamingTableName)
.option("path", baseStreamingPath)
.trigger(ProcessingTime(10000))
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
I am getting the following exception:
[ERROR] 2023-01-31 09:35:25.474 [stream execution thread for [id = 8b30fd4b-8506-490b-80ad-76868c14594f, runId = 25d34e6f-10e2-42c2-b094-654797f5d79c]] HoodieStreamingSink - Micro batch id=1 threw following exception:
org.apache.hudi.exception.HoodieException: Config conflict(key current value existing value):
RecordKey: offset,timestamp uuid
KeyGenerator: org.apache.hudi.keygen.ComplexKeyGenerator org.apache.hudi.keygen.SimpleKeyGenerator
at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:167) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:90) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:129) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at scala.util.Try$.apply(Try.scala:213) ~[scala-library-2.12.15.jar:?]
at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:128) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:214) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:127) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:666) ~[spark-sql_2.12-3.3.1.jar:3.3.1]
My goal is to store all the Kafka data into the Hudi table.
In Apache Hudi there are some configurations which you cannot override once the table exists, like the KeyGenerator. It seems you have already written to the table with org.apache.hudi.keygen.SimpleKeyGenerator, so you need to recreate the table to change this config and the record/partition keys.
If you want a quick test, you can change baseStreamingPath to write the data into a new Hudi table; a minimal sketch follows.
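(Not part of the original answer: a sketch of that quick test, reusing the imports and values from the question; the new table name, path and checkpoint suffix are hypothetical placeholders.)
val newStreamingTableName = streamingTableName + "_v2" // placeholder
val newBaseStreamingPath = baseStreamingPath + "_v2" // placeholder, must point to an empty location
df
.selectExpr("key", "topic", "partition", "offset", "timestamp", "timestampType", "CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("hudi")
.options(getQuickstartWriteConfigs)
.option(PRECOMBINE_FIELD.key(), "essDateTime")
.option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
.option(RECORDKEY_FIELD.key(), "offset,timestamp")
.option(TBL_NAME.key, newStreamingTableName)
.option("path", newBaseStreamingPath)
.option("checkpointLocation", checkpointLocation + "_v2") // fresh checkpoint for the new query
.trigger(ProcessingTime(10000))
.outputMode("append")
.start()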

How to determine where the Kafka consumption is made in Spark

We tried two MLlib transformers for Kafka consumption: one using a Structured Batch Query Stream like here, and one using normal Kafka consumers. The normal consumers read into a list that gets converted to a dataframe.
We created two MLlib pipelines starting with an empty dataframe. The first transformer of each pipe did the reading from Kafka; we had a pipe for each Kafka transformer type.
We then ran the pipes: first with normal Kafka consumers, second with the Spark consumer, third with normal consumers again.
The Spark config had 4 executors, with 1 core each:
"spark.executor.cores": "1", "spark.executor.instances": "4"
Questions are:
A. Where is the consumption made, on the executors or on the driver? According to the driver UI, it looks like in both cases the executors did all the work: the driver did not pass any data, and 4 executors got created. (See the sketch after the code below.)
B. Why do we have a different number of executors running? In the 1st run, with normal consumers, we see 4 executors working; in the 2nd run, with the Spark Kafka connector, 1 executor; in the 3rd run, with normal consumers, 1 executor but 2 cores?
You'll see the driver's UI attached at the bottom.
This is the relevant code:
Normal Consumer:
// (consumer construction/configuration and the poll() that fills `records` are elided in the original post)
var kafkaConsumer: KafkaConsumer[String, String] = null
val readMessages = () => {
for (record <- records) {
recordList.append(record.value())
}
}
kafkaConsumer.subscribe(util.Arrays.asList($(topic)))
readMessages()
var df = recordList.toDF
kafkaConsumer.close()
val json_schema =
df.sparkSession.read.json(df.select("value").as[String]).schema
df = df.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
Spark consumer:
val records = dataset
.sparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", $(url))
.option("subscribe", $(this.topic))
.option("kafkaConsumer.pollTimeoutMs", s"${$(timeoutMs)}")
.option("startingOffsets", $(startingOffsets))
.option("endingOffsets", $(endingOffsets))
.load
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
OmnixLogger.warn(uid, "Executor ID AFTER polling: " + SparkEnv.get.executorId)
val json_schema = records.sparkSession.read.json(records.select("value").as[String]).schema
var df: DataFrame = records.select(from_json(col("value"), json_schema).as("json"))
df = df.select(col("json.*"))
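(Not part of the original thread: one way to check question A empirically. A minimal sketch, assuming the question's records dataset from the Spark consumer is in scope; the Kafka source only plans offset ranges on the driver, while the actual polling happens inside tasks on the executors.)
import org.apache.spark.SparkEnv
val spark = records.sparkSession
import spark.implicits._
// On the driver this prints "driver": only planning happens here.
println(s"Planning side, executor id = ${SparkEnv.get.executorId}")
// Tag every record with the id of the executor whose task actually read it from Kafka.
val executorIds = records.mapPartitions { iter =>
val id = SparkEnv.get.executorId
iter.map(_ => id)
}.distinct()
executorIds.show() // lists the executor ids that consumed the records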

Duplicates while publishing data to kafka topic using spark-streaming

I have a spark-streaming application which consumes data from topic1 and parses it, then publishes the same records in two processes: one into topic2 and the other into a Hive table. While publishing data to Kafka topic2 I see duplicates, and I don't see duplicates in the Hive table.
Using
Spark 2.2, Kafka 0.10.0
import org.apache.spark.sql.functions.{struct, to_json}
KafkaWriter.write(spark, storeSalesStreamingFinalDF, config)
writeToHIVE(spark, storeSalesStreamingFinalDF, config)
object KafkaWriter {
def write(spark: SparkSession, df: DataFrame, config: Config)
{
df.select(to_json(struct("*")) as 'value)
.write
.format("kafka")
.option("kafka.bootstrap.servers", config.getString("kafka.dev.bootstrap.servers"))
.option("topic",config.getString("kafka.topic"))
.option("kafka.compression.type",config.getString("kafka.compression.type"))
.option("kafka.session.timeout.ms",config.getString("kafka.session.timeout.ms"))
.option("kafka.request.timeout.ms",config.getString("kafka.request.timeout.ms"))
.save()
}
}
Can someone help with this?
I am expecting no duplicates in Kafka topic2.
To handle the duplicate data, we should set .option("kafka.processing.guarantee","exactly_once").

How to know the name of Kafka consumer group that streaming query uses for Kafka data source?

I am consuming data from a Kafka topic through Spark Structured Streaming; the topic has 3 partitions. As Spark Structured Streaming does not allow you to explicitly provide group.id and assigns some random id to the consumer, I tried to check the consumer group ids using the Kafka command below:
./kafka-consumer-groups.sh --bootstrap-server kfk01.sboxdc.com:9092,kfk02.sboxdc.com:9092,kfk03.sboxdc.com:9092 --list
output
spark-kafka-source-054e8dac-bea9-46e8-9374-8298daafcd23--1587684247-driver-0
spark-kafka-source-756c08e8-6a84-447c-8326-5af1ef0412f5-209267112-driver-0
spark-kafka-source-9528b191-4322-4334-923d-8c1500ef8194-2006218471-driver-0
Below are my questions:
1) Why does it create 3 consumer groups? Is it because of 3 partitions?
2) Is there any way I can get these consumer group names in the Spark application?
3) Even though my Spark application was still running, after some time these group names didn't show up in the consumer groups list. Is this because all the data was consumed by the Spark application and there was no more data in that Kafka topic?
4) If my assumption is right about point 3, will it create a new consumer group id if new data arrives, or will the name of the consumer group remain the same?
Below is my read stream
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
// .option("assign"," {\""+topic+"\":[0]}")
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 60000)
.load()
I have 3 writestreams in the application as below
val df = inputDf.selectExpr("CAST(value AS STRING)","CAST(topic AS STRING)","CAST (partition AS INT)","CAST (offset AS INT)","CAST (timestamp AS STRING)")
val df1 = inputDf.selectExpr("CAST (partition AS INT)","CAST (offset AS INT)","CAST (timestamp AS STRING)")
//First stream
val checkpoint_loc1= "/warehouse/test_duplicate/download/chk1"
df1.agg(min("offset"), max("offset"))
.writeStream
.foreach(writer)
.outputMode("complete")
.option("checkpointLocation", checkpoint_loc1).start()
val result = df.select(
df1("result").getItem("_1").as("col1"),
df1("result").getItem("_2").as("col2"),
df1("result").getItem("_5").as("eventdate"))
val distDates = result.select(result("eventdate")).distinct
//Second stream
val checkpoint_loc2= "/warehouse/test_duplicate/download/chk2"
distDates.writeStream.foreach(writer1)
.option("checkpointLocation", checkpoint_loc2).start()
//Third stream
val kafkaOutput =result.writeStream
.outputMode("append")
.format("orc")
.option("path",data_dir)
.option("checkpointLocation", checkpoint_loc3)
.start()
The streaming query is used only once in the code and there are no joins.
Execution plan
== Parsed Logical Plan ==
StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#109e44ba, kafka, Map(maxOffsetsPerTrigger -> 60000, startingOffsets -> earliest, subscribe -> downloading, kafka.bootstrap.servers -> kfk01.sboxdc.com:9092,kfk02.sboxdc.com:9092,kfk03.sboxdc.com:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#593197cb,kafka,List(),None,List(),None,Map(maxOffsetsPerTrigger -> 60000, startingOffsets -> earliest, subscribe -> downloading, kafka.bootstrap.servers -> kfk01.sboxdc.com:9092,kfk02.sboxdc.com:9092,kfk03.sboxdc.com:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
== Analyzed Logical Plan ==
key: binary, value: binary, topic: string, partition: int, offset: bigint, timestamp: timestamp, timestampType: int
StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#109e44ba, kafka, Map(maxOffsetsPerTrigger -> 60000, startingOffsets -> earliest, subscribe -> downloading, kafka.bootstrap.servers -> kfk01.sboxdc.com:9092,kfk02.sboxdc.com:9092,kfk03.sboxdc.com:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#593197cb,kafka,List(),None,List(),None,Map(maxOffsetsPerTrigger -> 60000, startingOffsets -> earliest, subscribe -> downloading, kafka.bootstrap.servers -> kfk01.sboxdc.com:9092,kfk02.sboxdc.com:9092,kfk03.sboxdc.com:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
== Optimized Logical Plan ==
StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider#109e44ba, kafka, Map(maxOffsetsPerTrigger -> 60000, startingOffsets -> earliest, subscribe -> downloading, kafka.bootstrap.servers -> kfk01.sboxdc.com:9092,kfk02.sboxdc.com:9092,kfk03.sboxdc.com:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession#593197cb,kafka,List(),None,List(),None,Map(maxOffsetsPerTrigger -> 60000, startingOffsets -> earliest, subscribe -> downloading, kafka.bootstrap.servers -> kfk01.sboxdc.com:9092,kfk02.sboxdc.com:9092,kfk03.sboxdc.com:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
== Physical Plan ==
StreamingRelation kafka, [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
1) Why does it create 3 consumer groups? Is it because of 3 partitions?
Certainly not. It is just a coincidence. You seem to have run the application 3 times already and the topic has 3 partitions.
Let's start over to back it up.
I deleted all consumer groups to make sure that we start afresh.
$ ./bin/kafka-consumer-groups.sh --list --bootstrap-server :9092
spark-kafka-source-cd8c4070-cac0-4653-81bd-4819501769f9-1567209638-driver-0
spark-kafka-source-6a60f735-f05c-49e4-ae88-a6193e7d4bf8--525530617-driver-0
$ ./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --delete --group spark-kafka-source-cd8c4070-cac0-4653-81bd-4819501769f9-1567209638-driver-0
Deletion of requested consumer groups ('spark-kafka-source-cd8c4070-cac0-4653-81bd-4819501769f9-1567209638-driver-0') was successful.
$ ./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --delete --group spark-kafka-source-6a60f735-f05c-49e4-ae88-a6193e7d4bf8--525530617-driver-0
Deletion of requested consumer groups ('spark-kafka-source-6a60f735-f05c-49e4-ae88-a6193e7d4bf8--525530617-driver-0') was successful.
$ ./bin/kafka-consumer-groups.sh --list --bootstrap-server :9092
// nothing got printed out
I created a topic with 5 partitions.
$ ./bin/kafka-topics.sh --create --zookeeper :2181 --topic jacek-five-partitions --partitions 5 --replication-factor 1
Created topic "jacek-five-partitions".
$ ./bin/kafka-topics.sh --describe --zookeeper :2181 --topic jacek-five-partitions
Topic:jacek-five-partitions PartitionCount:5 ReplicationFactor:1 Configs:
Topic: jacek-five-partitions Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Topic: jacek-five-partitions Partition: 1 Leader: 0 Replicas: 0 Isr: 0
Topic: jacek-five-partitions Partition: 2 Leader: 0 Replicas: 0 Isr: 0
Topic: jacek-five-partitions Partition: 3 Leader: 0 Replicas: 0 Isr: 0
Topic: jacek-five-partitions Partition: 4 Leader: 0 Replicas: 0 Isr: 0
The code I use is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
object SparkApp extends App {
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
val q = spark
.readStream
.format("kafka")
.option("startingoffsets", "latest")
.option("subscribe", "jacek-five-partitions")
.option("kafka.bootstrap.servers", ":9092")
.load
.select($"value" cast "string")
.writeStream
.format("console")
.trigger(Trigger.ProcessingTime("30 seconds"))
.start
q.awaitTermination()
}
When I run the above Spark Structured Streaming application, I have just one consumer group created.
$ ./bin/kafka-consumer-groups.sh --list --bootstrap-server :9092
spark-kafka-source-380da653-c829-45db-859f-09aa9b37784d-338965656-driver-0
And that makes sense since all the Spark processing should use as many Kafka consumers as there are partitions, but regardless of the number of consumers, there should only be one consumer group (or the Kafka consumers would consume all records and there would be duplicates).
2) Is there any way I can get these consumer group names in spark application?
There is no public API for this so the answer is no.
You could, however, "hack" Spark and go below the public API, down to the internal Kafka consumer that uses this line:
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
Or even this line to be precise:
val kafkaOffsetReader = new KafkaOffsetReader(
strategy(caseInsensitiveParams),
kafkaParamsForDriver(specifiedKafkaParams),
parameters,
driverGroupIdPrefix = s"$uniqueGroupId-driver")
Just find the KafkaMicroBatchReader for the Kafka data source and request from it the KafkaOffsetReader, which knows the groupId. That seems doable.
3) Even though my Spark application was still running, after some time these group names didn't show up in the consumer groups list. Is this because all the data was consumed by the Spark application and there was no more data in that Kafka topic?
Could that be related to KIP-211: Revise Expiration Semantics of Consumer Group Offsets which says:
The offset of a topic partition within a consumer group expires when the expiration timestamp associated with that partition is reached. This expiration timestamp is usually affected by the broker config offsets.retention.minutes, unless user overrides that default and uses a custom retention.
4) Will it create a new consumer group id if new data arrives, or will the name of the consumer group remain the same?
It will remain the same.
Moreover, the consumer group must not be deleted when at least one consumer from the group is active.
group.id: Kafka source will create a unique group id for each query automatically.
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
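(Not part of the original answer, but related: Spark 3.0+ adds options that make the group id predictable. A minimal sketch, applicable only if upgrading is an option; brokers and topic are the values from the question.)
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.option("groupIdPrefix", "my-app") // generated group ids start with my-app- instead of spark-kafka-source-
// .option("kafka.group.id", "my-app-group") // or pin the group id explicitly (see the caveats in the integration guide)
.load()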

Spark Streaming: Text data source supports only a single column

I am consuming Kafka data and then streaming the data to HDFS.
The data stored in Kafka topic trial is like:
hadoop
hive
hive
kafka
hive
However, when I submit my code, it returns:
Exception in thread "main"
org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}
My question is: as shown above, the data stored in Kafka comprises only ONE column, so why does the program say there are 7 columns?
Any help is appreciated.
My Spark Streaming code:
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder.master("local[4]")
.appName("SpeedTester")
.config("spark.driver.memory", "3g")
.getOrCreate()
val ds = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.95.20:9092")
.option("subscribe", "trial")
.option("startingOffsets" , "earliest")
.load()
.writeStream
.format("text")
.option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
.awaitTermination()
}
That is explained in the Structured Streaming + Kafka Integration Guide:
Each row in the source has the following schema:
Column Type
key binary
value binary
topic string
partition int
offset long
timestamp long
timestampType int
Which gives exactly seven columns. If you want to write only the payload (value), select it and cast it to string:
spark.readStream
...
.load()
.selectExpr("CAST(value as string)")
.writeStream
...
.awaitTermination()
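Putting the fix into the question's original query, a sketch using the broker address, topic and paths from the question:
val query = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.95.20:9092")
.option("subscribe", "trial")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)") // keep only the payload as a single string column
.writeStream
.format("text")
.option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
query.awaitTermination()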
