Spark DStream from Kafka always starts at beginning

See my last comment on the accepted answer for the solution
I configured a DStream like so:
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka1.example.com:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[KafkaAvroDeserializer],
  "group.id" -> "mygroup",
  "specific.avro.reader" -> true,
  "schema.registry.url" -> "http://schema.example.com:8081"
)
val stream = KafkaUtils.createDirectStream(
  ssc,
  PreferConsistent,
  Subscribe[String, DataFile](topics, kafkaParams)
)
While this works and I get the DataFiles as expected, when I stop and re-run the job, it always starts at the beginning of the topic. How can I make it continue where it last left off?
Follow up 1
As in the answer by Bhima Rao Gogineni, I changed my configuration like this:
val consumerParams =
  Map("bootstrap.servers" -> bootstrapServerString,
      "schema.registry.url" -> schemaRegistryUri.toString,
      "specific.avro.reader" -> "true",
      "group.id" -> "measuring-data-files",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[KafkaAvroDeserializer],
      "enable.auto.commit" -> (false: JavaBool),
      "auto.offset.reset" -> "earliest")
And I set up the stream:
val stream = KafkaUtils.
  createDirectStream(ssc,
                     LocationStrategies.PreferConsistent,
                     ConsumerStrategies.Subscribe[String, DataFile](List(inTopic), consumerParams))
And then I process it:
stream.
  foreachRDD { rdd =>
    ... // Do stuff with the RDD - transform, produce to other topic etc.
    // Commit the offsets
    log.info("Committing the offsets")
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
But it still always starts from the beginning when re-running.
Here is an excerpt from my Kafka log:
A run:
[2018-07-04 07:47:31,593] INFO [GroupCoordinator 0]: Preparing to rebalance group measuring-data-files with old generation 22 (__consumer_offsets-8) (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:47:31,594] INFO [GroupCoordinator 0]: Stabilized group measuring-data-files generation 23 (__consumer_offsets-8) (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:47:31,599] INFO [GroupCoordinator 0]: Assignment received from leader for group measuring-data-files for generation 23 (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:48:06,690] INFO [ProducerStateManager partition=data-0] Writing producer snapshot at offset 131488999 (kafka.log.ProducerStateManager)
[2018-07-04 07:48:06,690] INFO [Log partition=data-0, dir=E:\confluent-4.1.1\data\kafka] Rolled new log segment at offset 131488999 in 1 ms. (kafka.log.Log)
[2018-07-04 07:48:10,788] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-07-04 07:48:30,074] INFO [GroupCoordinator 0]: Member consumer-1-262ece09-93c4-483e-b488-87057578dabc in group measuring-data-files has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:48:30,074] INFO [GroupCoordinator 0]: Preparing to rebalance group measuring-data-files with old generation 23 (__consumer_offsets-8) (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:48:30,074] INFO [GroupCoordinator 0]: Group measuring-data-files with generation 24 is now empty (__consumer_offsets-8) (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:48:45,761] INFO [ProducerStateManager partition=data-0] Writing producer snapshot at offset 153680971 (kafka.log.ProducerStateManager)
[2018-07-04 07:48:45,763] INFO [Log partition=data-0, dir=E:\confluent-4.1.1\data\kafka] Rolled new log segment at offset 153680971 in 3 ms. (kafka.log.Log)
[2018-07-04 07:49:24,819] INFO [ProducerStateManager partition=data-0] Writing producer snapshot at offset 175872864 (kafka.log.ProducerStateManager)
[2018-07-04 07:49:24,820] INFO [Log partition=data-0, dir=E:\confluent-4.1.1\data\kafka] Rolled new log segment at offset 175872864 in 1 ms. (kafka.log.Log)
Next run:
[2018-07-04 07:50:13,748] INFO [GroupCoordinator 0]: Preparing to rebalance group measuring-data-files with old generation 24 (__consumer_offsets-8) (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:50:13,749] INFO [GroupCoordinator 0]: Stabilized group measuring-data-files generation 25 (__consumer_offsets-8) (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:50:13,754] INFO [GroupCoordinator 0]: Assignment received from leader for group measuring-data-files for generation 25 (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:50:43,758] INFO [GroupCoordinator 0]: Member consumer-1-906c2eaa-f012-4283-96fc-c34582de33fb in group measuring-data-files has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:50:43,758] INFO [GroupCoordinator 0]: Preparing to rebalance group measuring-data-files with old generation 25 (__consumer_offsets-8) (kafka.coordinator.group.GroupCoordinator)
[2018-07-04 07:50:43,758] INFO [GroupCoordinator 0]: Group measuring-data-files with generation 26 is now empty (__consumer_offsets-8) (kafka.coordinator.group.GroupCoordinator)
Follow up 2
I made saving the offsets more verbose like this:
// Commit the offsets
log.info("Committing the offsets")
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
if (offsetRanges.isEmpty) {
  log.info("Offset ranges is empty...")
} else {
  log.info("# offset ranges: %d" format offsetRanges.length)
}
object cb extends OffsetCommitCallback {
  def onComplete(offsets: util.Map[TopicPartition, OffsetAndMetadata],
                 exception: Exception): Unit =
    if (exception != null) {
      log.info("Commit FAILED")
      log.error(exception.getMessage, exception)
    } else {
      log.info("Commit SUCCEEDED - count: %d" format offsets.size())
      offsets.
        asScala.
        foreach {
          case (p, omd) =>
            log.info("partition = %d; topic = %s; offset = %d; metadata = %s".
              format(p.partition(), p.topic(), omd.offset(), omd.metadata()))
        }
    }
}
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, cb)
And I get this exception:
2018-07-04 10:14:00 ERROR DumpTask$:136 - Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:600)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:541)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:679)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:658)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:426)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:278)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:163)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:182)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:209)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
How should I solve this?

With the new Spark Kafka integration API (spark-streaming-kafka-0-10), we can use async commits: read the offsets for each batch and commit them once the processing is done.
Kafka config for this:
enable.auto.commit=false
auto.offset.reset=earliest or auto.offset.reset=latest --> this setting only takes effect when there is no committed offset for the consumer group in Kafka; in that case the stream starts reading from the beginning or the end of the topic, respectively.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Here is the source: https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html

Spark provides two APIs for reading messages from Kafka.
From the Spark documentation:
Approach 1: Receiver-based Approach
This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and then jobs launched by Spark Streaming processes the data.
Approach 2: Direct Approach (No Receivers)
This new receiver-less “direct” approach has been introduced in Spark 1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka (similar to read files from a file system).
Note that one disadvantage of this approach is that it does not update offsets in Zookeeper, hence Zookeeper-based Kafka monitoring tools will not show progress. However, you can access the offsets processed by this approach in each batch and update Zookeeper yourself.
In your case you are using the Direct Approach, so you need to handle the message offsets yourself and specify the offset range from which you want to read. Or, if you want ZooKeeper to take care of your message offsets, you can use the Receiver-based Approach via the KafkaUtils.createStream() API.
You can find more on how to handle Kafka offsets in the Spark documentation.
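For the Direct Approach, starting offsets can be passed explicitly when the consumer strategy is created. A minimal sketch, reusing the topic, params, and value type from your question; savedOffsets is a hypothetical map that you would load from wherever you persist offsets (a database, HDFS, ZooKeeper, ...):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._

// Hypothetical starting positions, previously saved by your own code
val savedOffsets: Map[TopicPartition, Long] = Map(new TopicPartition(inTopic, 0) -> 42L)

val stream = KafkaUtils.createDirectStream[String, DataFile](
  ssc,
  LocationStrategies.PreferConsistent,
  // The three-argument Subscribe pins each partition to the given offset on startup
  ConsumerStrategies.Subscribe[String, DataFile](List(inTopic), consumerParams, savedOffsets))

On restart, read the last committed or saved offsets for the group and feed them into savedOffsets before creating the stream.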

Related

Abstract Method Exception when connecting to Azure Cosmos DB Cassandra API table using Azure databricks job

I am trying to write data to a Cassandra table (Cosmos DB) via an Azure DBR job (Spark streaming) and am getting the exception below:
Query [id = , runId = ] terminated with exception: Failed to open native connection to Cassandra at {<name>.cassandra.cosmosdb.azure.com:10350} :: Method com/microsoft/azure/cosmosdb/cassandra/CosmosDbConnectionFactory$.createSession(Lcom/datastax/spark/connector/cql/CassandraConnectorConf;)Lcom/datastax/oss/driver/api/core/CqlSession; is abstract
Caused by: IOException: Failed to open native connection to Cassandra at {<name>.cassandra.cosmosdb.azure.com:10350} :: Method com/microsoft/azure/cosmosdb/cassandra/CosmosDbConnectionFactory$.createSession(Lcom/datastax/spark/connector/cql/CassandraConnectorConf;)Lcom/datastax/oss/driver/api/core/CqlSession; is abstract
Caused by: AbstractMethodError: Method com/microsoft/azure/cosmosdb/cassandra/CosmosDbConnectionFactory$.createSession(Lcom/datastax/spark/connector/cql/CassandraConnectorConf;)Lcom/datastax/oss/driver/api/core/CqlSession; is abstract
What I did to get here:
created cosmos DB account
created cassandra keyspace
created cassandra table
created DBR job
added com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0 to the job cluster
added com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0 to the job cluster
What I tried:
Different versions of the connector and of the Azure Cosmos DB helper libraries, but I always hit one ClassNotFoundException or method-not-found error or another.
Code Snippet:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.log4j.Logger
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra._
import java.time.LocalDateTime
def writeDelta(spark: SparkSession, dataFrame: DataFrame, sourceName: String, checkpointLocation: String, dataPath: String, loadType: String, log: Logger): Boolean = {
  spark.conf.set("spark.cassandra.output.batch.size.rows", "1")
  spark.conf.set("spark.cassandra.connection.remoteConnectionsPerExecutor", "10")
  spark.conf.set("spark.cassandra.connection.localConnectionsPerExecutor", "10")
  spark.conf.set("spark.cassandra.output.concurrent.writes", "100")
  spark.conf.set("spark.cassandra.concurrent.reads", "512")
  spark.conf.set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
  spark.conf.set("spark.cassandra.connection.keepAliveMS", "60000000") // Increase this number as needed
  spark.conf.set("spark.cassandra.output.ignoreNulls", "true")
  spark.conf.set("spark.cassandra.connection.host", "*******.cassandra.cosmosdb.azure.com")
  spark.conf.set("spark.cassandra.connection.port", "10350")
  spark.conf.set("spark.cassandra.connection.ssl.enabled", "true")
  // spark.cassandra.auth.username and password are set in cluster conf
  val write = dataFrame.writeStream.
    format("org.apache.spark.sql.cassandra").
    options(Map("table" -> "****", "keyspace" -> "****")).
    foreachBatch(upsertToDelta _).
    outputMode("update").
    option("mergeSchema", "true").
    option("mode", "PERMISSIVE").
    option("checkpointLocation", checkpointLocation).
    start()
  write.awaitTermination()
}

def upsertToDelta(newBatch: DataFrame, batchId: Long) {
  try {
    val spark = SparkSession.active
    println(LocalDateTime.now())
    println("BATCH ID = " + batchId + " REC COUNT = " + newBatch.count())
    newBatch.persist()
    val userWindow = Window.partitionBy(keyColumn).orderBy(col(timestampCol).desc)
    val deDup = newBatch.withColumn("rank", row_number().over(userWindow)).where(col("rank") === 1).drop("rank")
    deDup.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "****", "keyspace" -> "****"))
      .mode("append")
      .save()
    newBatch.unpersist()
  } catch {
    case e: Exception =>
      throw e
  }
}
Update:
After implementing the solution suggested by @theo-van-kraay, I am getting the following error in the executors' logs (the job keeps on running even after this error):
23/02/13 07:28:55 INFO CassandraConnector: Connected to Cassandra cluster.
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 9 (task 26, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO DataWritingSparkTask: Committed partition 9 (task 26, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO Executor: Finished task 9.0 in stage 6.0 (TID 26). 1511 bytes result sent to driver
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 7 (task 24, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 1 (task 18, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 3 (task 20, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 5 (task 22, attempt 0, stage 6.0)
23/02/13 07:28:56 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Unable to get Token Metadata
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.$anonfun$tokenMap$1(LocalNodeFirstLoadBalancingPolicy.scala:86)
at scala.Option.orElse(Option.scala:447)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.tokenMap(LocalNodeFirstLoadBalancingPolicy.scala:86)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.replicasForRoutingKey$1(LocalNodeFirstLoadBalancingPolicy.scala:103)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.$anonfun$getReplicas$8(LocalNodeFirstLoadBalancingPolicy.scala:107)
at scala.Option.flatMap(Option.scala:271)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.$anonfun$getReplicas$7(LocalNodeFirstLoadBalancingPolicy.scala:107)
at scala.Option.orElse(Option.scala:447)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.$anonfun$getReplicas$3(LocalNodeFirstLoadBalancingPolicy.scala:107)
at scala.Option.flatMap(Option.scala:271)
...
...
23/02/13 07:28:56 ERROR Utils: Aborting task
You can remove:
com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0
It is not required with the Spark 3 Cassandra connector and was created for Spark 2 only. Also remove any references to it in the code.
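For illustration, the plain spark.cassandra.* settings shown in the question are all the Spark 3 connector needs; what has to go is any helper-specific override such as the connection factory. The setting below is shown only as an example of what to delete, assuming it was set somewhere in your job or cluster config:

// Remove any override that points the connector at the Cosmos helper, e.g.:
// spark.conf.set("spark.cassandra.connection.factory",
//   "com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory")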
The "Unable to get Token Metadata" error is a known issue that affects Spark 3 (Java driver 4.x) and the Cosmos DB API for Apache Cassandra in certain scenarios. It has been fixed recently but is still in the process of being rolled out across the service. If resolution is urgent, you can raise a support case in Azure and we can expedite it by enabling the fix explicitly on your account until it has been fully deployed. Feel free to mention this Stack Overflow question when raising the support case so that the engineer who handles it has context.

One Kafka consumer in a group consistently rejects coordinator, but only when Spark and Kafka are both in EC2

I have a Java application trying to consume topics from a Kafka 2.5.1 cluster in EC2, using spark-streaming-kafka-0-10_2.11. It works only from Spark clusters or standalone installs outside AWS: when Spark is also hosted in EC2, the Kafka consumer group never fully initializes. For three topics in all, only two consumers ever connect, while the third repeatedly rejects the group coordinator as "unavailable or invalid".
It is always the same third topic whose consumer fails, but the second and third topics are identically configured and both empty; the only difference between them is the name. Deleting and recreating the third topic does not change anything. Ignoring topic #3 in the application code (the first two can't easily be extricated) results in a successful startup.
All the Spark installations are version 2.4.5, with the Google Guava JAR updated to 19.0 from the shipped 14.0.1, but with no special configuration otherwise.
Kafka is a three-node EC2 cluster, with each node hosting a broker, a Zookeeper instance, and a Spark worker. Everything's talking and pingable from everything else. server.properties configures listeners to the internal DNS name, while advertised.listeners is the external.
listeners=PLAINTEXT://ip-abc-def-ghi-jkl.region.compute.internal:9092
advertised.listeners=PLAINTEXT://ec2-mno-pqr-stu-vwx.region.compute.amazonaws.com:9092
The failing Spark application launch from inside EC2:
2020-10-08/21:09:32.694/UTC org.apache.kafka.clients.consumer.ConsumerConfig INFO ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [ip-(broker 1 private dns).region.compute.internal:9092]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = mygroup
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
2020-10-08/21:09:32.843/UTC org.apache.kafka.common.utils.AppInfoParser INFO Kafka version : 2.0.0
2020-10-08/21:09:32.844/UTC org.apache.kafka.common.utils.AppInfoParser INFO Kafka commitId : 3402a8361b734732
2020-10-08/21:09:33.098/UTC org.apache.kafka.clients.Metadata INFO Cluster ID: mk34tRyzT1m1VR1ZC9GYnQ
2020-10-08/21:09:33.100/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-1, groupId=mygroup] Discovered group coordinator ec2-(broker 3 public dns).region.compute.amazonaws.com:9092 (id: 2147483644 rack: null)
2020-10-08/21:09:33.135/UTC org.apache.kafka.clients.consumer.internals.ConsumerCoordinator INFO [Consumer clientId=consumer-1, groupId=mygroup] Revoking previously assigned partitions []
2020-10-08/21:09:33.135/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-1, groupId=mygroup] (Re-)joining group
2020-10-08/21:09:39.155/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-1, groupId=mygroup] Successfully joined group with generation 1
2020-10-08/21:09:39.159/UTC org.apache.kafka.clients.consumer.internals.ConsumerCoordinator INFO [Consumer clientId=consumer-1, groupId=mygroup] Setting newly assigned partitions [partitions here]
2020-10-08/21:09:39.188/UTC org.apache.kafka.clients.consumer.internals.Fetcher INFO [Consumer clientId=consumer-1, groupId=mygroup] Resetting offset for partition station-data-19 to offset 1976.
(more offset resets; consumer 2 has also joined the group successfully between 21:09:32 and 21:09:39. No activity from consumer 3 yet, unlike launches from an external Spark. Consumer 3 spin-up starts next)
2020-10-08/21:09:39.200/UTC org.apache.kafka.common.utils.AppInfoParser INFO Kafka version : 2.0.0
2020-10-08/21:09:39.205/UTC org.apache.kafka.common.utils.AppInfoParser INFO Kafka commitId : 3402a8361b734732
2020-10-08/21:09:39.213/UTC org.apache.kafka.clients.Metadata INFO Cluster ID: mk34tRyzT1m1VR1ZC9GYnQ
2020-10-08/21:09:39.214/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-3, groupId=mygroup] Discovered group coordinator ec2-(broker 3 public dns).region.compute.amazonaws.com:9092 (id: 2147483644 rack: null)
2020-10-08/21:09:39.219/UTC org.apache.kafka.clients.consumer.internals.ConsumerCoordinator INFO [Consumer clientId=consumer-3, groupId=mygroup] Revoking previously assigned partitions []
2020-10-08/21:09:39.219/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-3, groupId=mygroup] (Re-)joining group
2020-10-08/21:09:42.268/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-1, groupId=mygroup] Attempt to heartbeat failed since group is rebalancing
2020-10-08/21:09:42.268/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-2, groupId=mygroup] Attempt to heartbeat failed since group is rebalancing
(more heartbeat failures for consumers 1 and 2)
2020-10-08/21:10:09.254/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-3, groupId=mygroup] Group coordinator ec2-(broker 3 public dns).region.compute.amazonaws.com:9092 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery
2020-10-08/21:10:09.330/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-2, groupId=mygroup] Attempt to heartbeat failed since group is rebalancing
2020-10-08/21:10:09.331/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-1, groupId=mygroup] Attempt to heartbeat failed since group is rebalancing
2020-10-08/21:10:09.377/UTC org.apache.kafka.clients.consumer.internals.AbstractCoordinator INFO [Consumer clientId=consumer-3, groupId=mygroup] Discovered group coordinator ec2-(broker 3 public dns).region.compute.amazonaws.com:9092 (id: 2147483644 rack: null)
With Spark outside EC2, this doesn't happen: all three consumers discover the coordinator and join the group more or less simultaneously, work out their offsets, and it's off to the races. But when the application is submitted to the Spark cluster in EC2, only the first two consumers join the group successfully. The third consumer doesn't start to initialize until after the first two have connected and reset their offsets, whereupon it discovers the group coordinator, tries to talk to it (causing a rebalance which prevents the other consumers from heartbeating), fails and decides it's invalid, and then finds the same coordinator again, repeat ad nauseam.
The only significant configuration difference between failing Spark submissions like this and successful Spark submissions from outside EC2 is that the latter bootstrap.servers have to point to brokers' external DNS names. However, the internal application launch fails whether it's pointed at brokers' external or internal names, one broker or multiple.
Here's the Kafka server.log from broker 3, identified above as the group coordinator:
[2020-10-08 21:09:33,128] INFO [GroupCoordinator 3]: Dynamic Member with unknown member id joins group mygroup in Empty state. Created a new member id consumer-2-cd4f7a30-d897-4902-81e7-4211b6a1e233 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:09:33,128] INFO [GroupCoordinator 3]: Preparing to rebalance group mygroup in state PreparingRebalance with old generation 0 (__consumer_offsets-25) (reason: Adding new member consumer-2-cd4f7a30-d897-4902-81e7-4211b6a1e233 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:09:33,136] INFO [GroupCoordinator 3]: Dynamic Member with unknown member id joins group mygroup in PreparingRebalance state. Created a new member id consumer-1-63909784-c821-4903-a08a-98a250d49b19 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:09:39,128] INFO [GroupCoordinator 3]: Stabilized group mygroup generation 1 (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:09:39,146] INFO [GroupCoordinator 3]: Assignment received from leader for group mygroup for generation 1 (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:09:39,222] INFO [GroupCoordinator 3]: Dynamic Member with unknown member id joins group mygroup in Stable state. Created a new member id consumer-3-3a91c573-a90b-4b5c-9707-af285bf9bbac for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:09:39,222] INFO [GroupCoordinator 3]: Preparing to rebalance group mygroup in state PreparingRebalance with old generation 1 (__consumer_offsets-25) (reason: Adding new member consumer-3-3a91c573-a90b-4b5c-9707-af285bf9bbac with group instance id None) (kafka.coordinator.group.GroupCoordinator)
... (more unknown member ids joining the group in PreparingRebalance)
[2020-10-08 21:14:37,756] INFO [GroupCoordinator 3]: Member consumer-2-cd4f7a30-d897-4902-81e7-4211b6a1e233 in group mygroup has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:14:37,757] INFO [GroupCoordinator 3]: Member consumer-1-63909784-c821-4903-a08a-98a250d49b19 in group mygroup has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:14:37,757] INFO [GroupCoordinator 3]: Stabilized group mygroup generation 2 (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:14:39,625] INFO [GroupMetadataManager brokerId=3] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-10-08 21:14:47,758] INFO [GroupCoordinator 3]: Member consumer-3-057c4229-7183-4733-b973-9f758b9a69d0 in group mygroup has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:14:47,758] INFO [GroupCoordinator 3]: Preparing to rebalance group mygroup in state PreparingRebalance with old generation 2 (__consumer_offsets-25) (reason: removing member consumer-3-057c4229-7183-4733-b973-9f758b9a69d0 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2020-10-08 21:14:47,758] INFO [GroupCoordinator 3]: Member consumer-3-58c8ca8c-2daa-46c9-964b-7be883193287 in group mygroup has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
... (more members failing and being removed)
[2020-10-08 21:14:47,759] INFO [GroupCoordinator 3]: Group mygroup with generation 3 is now empty (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)
This turned out to be the key:
The third consumer doesn't start to initialize until after the first two have connected and reset their offsets
The first two consumers had a few seconds of no activity while consumer-3 spun up, after which the heartbeat failures started.
The Spark driver is an older instance (m3.large) with two vCPU cores. Our successful tests were from more recent machines with more cores, and when we restricted CPU availability with taskset on our test machines we were able to reproduce the problem exactly. Allowing spark-submit three cores succeeded.
This hadn't been an issue for us previously with the "simple" Kafka 0.8+ API, but getting started with the "new" Consumer API for 0.10+ seems to require a core per consumer in the group.
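For anyone trying to reproduce this, the CPU restriction was along these lines (the core list, master URL, and application names are placeholders):

# Pin spark-submit to two cores to mimic the m3.large driver; with three cores the group stabilized
taskset -c 0,1 spark-submit --master spark://master:7077 --class com.example.StreamingApp streaming-app.jar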

Spark Dataframe leftanti Join Fails

We are trying to publish deltas from a Hive table to Kafka. The table in question is a single partition, single block file of 244 MB. Our cluster is configured for a 256M block size, so we're just about at the max for a single file in this case.
Each time that table is updated, a copy is archived, then we run our delta process.
In the function below, we have isolated the different joins and have confirmed that the inner join performs acceptably (about 3 minutes), but the two antijoin dataframes will not complete -- we keep throwing more resources at the Spark job, but are continuing to see the errors below.
Is there a practical limit on dataframe sizes for this kind of join?
private class DeltaColumnPublisher(spark: SparkSession, sink: KafkaSink, source: RegisteredDataset)
  extends BasePublisher(spark, sink, source) with Serializable {

  val deltaColumn = "hadoop_update_ts" // TODO: move to the dataset object

  def publishDeltaRun(dataLocation: String, archiveLocation: String): (Long, Long) = {
    val current = spark.read.parquet(dataLocation)
    val previous = spark.read.parquet(archiveLocation)
    val inserts = current.join(previous, keys, "leftanti")
    val updates = current.join(previous, keys).where(current.col(deltaColumn) =!= previous.col(deltaColumn))
    val deletes = previous.join(current, keys, "leftanti")
    val upsertCounter = spark.sparkContext.longAccumulator("upserts")
    val deleteCounter = spark.sparkContext.longAccumulator("deletes")
    logInfo("sending inserts to kafka")
    sink.sendDeltasToKafka(inserts, "U", upsertCounter)
    logInfo("sending updates to kafka")
    sink.sendDeltasToKafka(updates, "U", upsertCounter)
    logInfo("sending deletes to kafka")
    sink.sendDeltasToKafka(deletes, "D", deleteCounter)
    (upsertCounter.value, deleteCounter.value)
  }
}
The errors we're seeing seem to indicate that the driver is losing contact with the executors. We have increased the executor memory up to 24G, the network timeout as high as 900s, and the heartbeat interval as high as 120s.
17/11/27 20:36:18 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;#596e3aa6,BlockManagerId(1, server, 46292, None))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at ...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at ...
Later in the logs:
17/11/27 20:42:37 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;#25d1bd5f,BlockManagerId(1, server, 46292, None))] in 3 attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at ...
Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Could not find HeartbeatReceiver.
The config switches we have been manipulating (without success) are --executor-memory 24G --conf spark.network.timeout=900s --conf spark.executor.heartbeatInterval=120s
The option I failed to consider is to increase my driver resources. I added --driver-memory 4G and --driver-cores 2 and saw my job complete in about 9 minutes.
It appears that an inner join of these two files (or using the built-in except() method) puts memory pressure on the executors. Partitioning on one of the key columns seems to help ease that memory pressure, but increases overall time because there is more shuffling involved.
Doing the left-anti join between these two files requires that we have more driver resources. Didn’t expect that.
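Putting it together, the successful submission looked something like this (the class and jar names are placeholders; the other switches are the ones listed above):

spark-submit --driver-memory 4G --driver-cores 2 --executor-memory 24G --conf spark.network.timeout=900s --conf spark.executor.heartbeatInterval=120s --class com.example.DeltaPublisherJob delta-publisher.jar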

Message getting lost in Kafka + Spark Streaming

I am facing an issue of data loss in Spark Streaming with Kafka. My use case is as follows:
A Spark Streaming (DirectStream) application reads messages from a Kafka topic and processes them.
Based on the processed message, the app writes it to a different Kafka topic: e.g., if the message is harmonized, it is written to the harmonized topic, otherwise to the unharmonized topic.
Now, the problem is that during streaming I am somehow losing some messages, i.e., not all of the incoming messages are written to the harmonized or unharmonized topics.
For example, if the app receives 30 messages in one batch, sometimes it writes all of them to the output topics (the expected behaviour), but sometimes it writes only 27 (3 messages are lost; this number can change).
Following is the version I am using:
Spark 1.6.0
Kafka 0.9
Kafka topics configuration is as follow:
num of brokers: 3
num replication factor: 3
num of partitions: 3
Following are the properties I am using for Kafka:
val props = new Properties()
props.put("metadata.broker.list", properties.getProperty("metadataBrokerList"))
props.put("auto.offset.reset", properties.getProperty("autoOffsetReset"))
props.put("group.id", properties.getProperty("group.id"))
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("outTopicHarmonized", properties.getProperty("outletKafkaTopicHarmonized"))
props.put("outTopicUnharmonized", properties.getProperty("outletKafkaTopicUnharmonized"))
props.put("acks", "all");
props.put("retries", "5");
props.put("request.required.acks", "-1")
Following is the piece of code where I am writing processed messages to Kafka:
val schemaRdd2 = finalHarmonizedDF.toJSON
schemaRdd2.foreachPartition { partition =>
  val producerConfig = new ProducerConfig(props)
  val producer = new Producer[String, String](producerConfig)
  partition.foreach { row =>
    if (debug) println(row.mkString)
    val keyedMessage = new KeyedMessage[String, String](props.getProperty("outTopicHarmonized"),
      null, row.toString())
    producer.send(keyedMessage)
  }
  // hack, should be done with the flush
  Thread.sleep(1000)
  producer.close()
}
I have explicitly added sleep(1000) for testing purposes.
But this is also not solving the problem :(
Any suggestion would be appreciated.
Try tuning the batchDuration parameter (when initializing the StreamingContext) to a value larger than the processing time of each RDD. This solved my problem.
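A minimal sketch of that change, assuming your batches take at most roughly 25 seconds to process (sparkConf is your existing SparkConf and the interval here is illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval chosen to be longer than the worst-case per-batch processing time
val ssc = new StreamingContext(sparkConf, Seconds(30))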
Because you don't want to lose any messages, you might want to choose 'exactly once' delivery semantics, which provides no data loss. In order to configure exactly-once delivery semantics you have to use acks='all', which you did.
According to this resource [1], the acks='all' property must be used in conjunction with the min.insync.replicas property.
[1] https://www.linkedin.com/pulse/kafka-producer-delivery-semantics-sylvester-daniel/
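A sketch of how the two settings fit together, using the producer Properties from the question (the replica counts in the comment are examples, and min.insync.replicas is a topic/broker setting, not a producer property):

// Producer side: require acknowledgement from all in-sync replicas
props.put("acks", "all")
// Topic/broker side: with a replication factor of 3, setting min.insync.replicas=2 on the topic
// makes acks=all fail fast (NotEnoughReplicas) instead of succeeding with a single in-sync replica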

Apache Spark Kinesis Integration: connected, but no records received

tldr; Can't use Kinesis Spark Streaming integration, because it receives no data.
Testing stream is set up, nodejs app sends 1 simple record per second.
Standard Spark 1.5.2 cluster is set up with master and worker nodes (4 cores) with docker-compose, AWS credentials in environment
spark-streaming-kinesis-asl-assembly_2.10-1.5.2.jar is downloaded and added to classpath
job.py or job.jar (just reads and prints) submitted.
Everything seems to be okay, but no records what-so-ever are received.
From time to time the KCL Worker thread says "Sleeping ..." - it might be failing silently (I checked all the stderr I could find, but no hints). Maybe a swallowed OutOfMemoryError... but I doubt that, given the volume of 1 record per second.
-------------------------------------------
Time: 1448645109000 ms
-------------------------------------------
15/11/27 17:25:09 INFO JobScheduler: Finished job streaming job 1448645109000 ms.0 from job set of time 1448645109000 ms
15/11/27 17:25:09 INFO KinesisBackedBlockRDD: Removing RDD 102 from persistence list
15/11/27 17:25:09 INFO JobScheduler: Total delay: 0.002 s for time 1448645109000 ms (execution: 0.001 s)
15/11/27 17:25:09 INFO BlockManager: Removing RDD 102
15/11/27 17:25:09 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[102] at createStream at NewClass.java:25 of time 1448645109000 ms
15/11/27 17:25:09 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1448645107000 ms)
15/11/27 17:25:09 INFO InputInfoTracker: remove old batch metadata: 1448645107000 ms
15/11/27 17:25:10 INFO JobScheduler: Added jobs for time 1448645110000 ms
15/11/27 17:25:10 INFO JobScheduler: Starting job streaming job 1448645110000 ms.0 from job set of time 1448645110000 ms
-------------------------------------------
Time: 1448645110000 ms
-------------------------------------------
<----- Some data expected to show up here!
15/11/27 17:25:10 INFO JobScheduler: Finished job streaming job 1448645110000 ms.0 from job set of time 1448645110000 ms
15/11/27 17:25:10 INFO JobScheduler: Total delay: 0.003 s for time 1448645110000 ms (execution: 0.001 s)
15/11/27 17:25:10 INFO KinesisBackedBlockRDD: Removing RDD 103 from persistence list
15/11/27 17:25:10 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[103] at createStream at NewClass.java:25 of time 1448645110000 ms
15/11/27 17:25:10 INFO BlockManager: Removing RDD 103
15/11/27 17:25:10 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1448645108000 ms)
15/11/27 17:25:10 INFO InputInfoTracker: remove old batch metadata: 1448645108000 ms
15/11/27 17:25:11 INFO JobScheduler: Added jobs for time 1448645111000 ms
15/11/27 17:25:11 INFO JobScheduler: Starting job streaming job 1448645111000 ms.0 from job set of time 1448645111000 ms
Please let me know if you have any hints; I'd really like to use Spark for real-time analytics... everything seems to be OK except this small detail of not receiving data :)
PS: I find it strange that Spark somehow ignores my settings for storage level (memory and disk 2) and checkpoint interval (20,000 ms):
15/11/27 17:23:26 INFO KinesisInputDStream: metadataCleanupDelay = -1
15/11/27 17:23:26 INFO KinesisInputDStream: Slide time = 1000 ms
15/11/27 17:23:26 INFO KinesisInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/11/27 17:23:26 INFO KinesisInputDStream: Checkpoint interval = null
15/11/27 17:23:26 INFO KinesisInputDStream: Remember duration = 1000 ms
15/11/27 17:23:26 INFO KinesisInputDStream: Initialized and validated org.apache.spark.streaming.kinesis.KinesisInputDStream#74b21a6
Source code (java):
public class NewClass {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("appname").setMaster("local[3]");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
    JavaReceiverInputDStream kinesisStream = KinesisUtils.createStream(
        ssc, "webassist-test", "test", "https://kinesis.us-west-1.amazonaws.com", "us-west-1",
        InitialPositionInStream.LATEST,
        new Duration(20000),
        StorageLevel.MEMORY_AND_DISK_2()
    );
    kinesisStream.print();
    ssc.start();
    ssc.awaitTermination();
  }
}
Python code (tried both pprinting before and sending to MongoDB):
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from sys import argv
sc = SparkContext(appName="webassist-test")
ssc = StreamingContext(sc, 5)
stream = KinesisUtils.createStream(ssc,
"appname",
"test",
"https://kinesis.us-west-1.amazonaws.com",
"us-west-1",
InitialPositionInStream.LATEST,
5,
StorageLevel.MEMORY_AND_DISK_2)
stream.pprint()
ssc.start()
ssc.awaitTermination()
Note: I also tried sending data to MongoDB with stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition)), but I'm not pasting it here, since you'd need a MongoDB instance and it's not related to the problem: no records come in on the input in the first place.
One more thing - the KCL never commits. The corresponding DynamoDB looks like this:
leaseKey checkpoint leaseCounter leaseOwner ownerSwitchesSinceCheckpoint
shardId-000000000000 LATEST 614 localhost:d92516... 8
The command used for submitting:
spark-submit --executor-memory 1024m --master spark://IpAddress:7077 /path/test.py
In the MasterUI I can see:
Input Rate
Receivers: 1 / 1 active
Avg: 0.00 events/sec
KinesisReceiver-0
Avg: 0.00 events/sec
...
Completed Batches (last 76 out of 76)
Thanks for any help!
I've had issues with no record activity being shown in Spark Streaming in the past when connecting with Kinesis.
I'd try these things to get more feedback/a different behaviour from Spark:
Make sure that you force the evaluation of your DStream transformation operations with output operations like foreachRDD, print, saveAs...
Create a new KCL application in DynamoDB by using a new name for the "Kinesis app name" parameter when creating the stream, or purge the existing one.
Switch between TRIM_HORIZON and LATEST for the initial position when creating the stream (see the sketch after this list).
Restart the context when you try these changes.
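For points 2 and 3, the change amounts to something like this in the createStream call (shown in Scala; ssc is your existing StreamingContext, the stream name and endpoint are the ones from your snippets, and the new app name is arbitrary):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils

val kinesisStream = KinesisUtils.createStream(
  ssc,
  "webassist-test-fresh",                 // new KCL app name => a fresh DynamoDB lease table
  "test",
  "https://kinesis.us-west-1.amazonaws.com",
  "us-west-1",
  InitialPositionInStream.TRIM_HORIZON,   // read records already in the stream, not only new ones
  Seconds(20),
  StorageLevel.MEMORY_AND_DISK_2)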
EDIT after code was added:
Perhaps I'm missing something obvious, but I cannot spot anything wrong with your source code. Do you have n+1 CPUs running this application (n being the number of Kinesis shards)?
If you run a KCL application (Java/Python/...) reading from the shards in your Docker instance, does it work? Perhaps there's something wrong with your network configuration, but I'd expect some error messages pointing it out.
If this is important enough and you have a bit of time, you can quickly implement a KCL reader in your Docker instance, which will allow you to compare it with your Spark application. Some URLs:
Python
Java
Python example
Another option is to run your Spark Streaming application in a different cluster and to compare.
P.S.: I'm currently using Spark Streaming 1.5.2 with Kinesis in different clusters and it processes records / shows activity as expected.
I was facing this issue when I followed the suggested documentation and examples. The following Scala code works fine for me (you can always use Java instead):
val conf = ConfigFactory.load
val config = new SparkConf().setAppName(conf.getString("app.name"))
val ssc = new StreamingContext(config, Seconds(conf.getInt("app.aws.batchDuration")))

val stream = if (conf.hasPath("app.aws.key") && conf.hasPath("app.aws.secret")) {
  logger.info("Specifying AWS account using credentials.")
  KinesisUtils.createStream(
    ssc,
    conf.getString("app.name"),
    conf.getString("app.aws.stream"),
    conf.getString("app.aws.endpoint"),
    conf.getString("app.aws.region"),
    InitialPositionInStream.LATEST,
    Seconds(conf.getInt("app.aws.batchDuration")),
    StorageLevel.MEMORY_AND_DISK_2,
    conf.getString("app.aws.key"),
    conf.getString("app.aws.secret")
  )
} else {
  logger.info("Specifying AWS account using EC2 profile.")
  KinesisUtils.createStream(
    ssc,
    conf.getString("app.name"),
    conf.getString("app.aws.stream"),
    conf.getString("app.aws.endpoint"),
    conf.getString("app.aws.region"),
    InitialPositionInStream.LATEST,
    Seconds(conf.getInt("app.aws.batchDuration")),
    StorageLevel.MEMORY_AND_DISK_2
  )
}

stream.foreachRDD((rdd: RDD[Array[Byte]], time) => {
  val rddstr: RDD[String] = rdd.map(arrByte => new String(arrByte))
  rddstr.foreach(x => println(x))
})
