I have a Spark Streaming app that runs on a Spark cluster with 4 nodes. A few days ago the app started resetting the Kafka offsets over and over and no longer fetches any Kafka data, even though auto.offset.reset is set.
This is the log:
22/06/28 21:39:38 INFO AppInfoParser: Kafka version : 2.0.0
18|stream | 22/06/28 21:39:38 INFO AppInfoParser: Kafka commitId : 3402a8361b734732
18|stream | 22/06/28 21:39:39 INFO Metadata: Cluster ID: 3cAbAp6-QNyO1cKEc1dtUA
18|stream | 22/06/28 21:39:39 INFO AbstractCoordinator: [Consumer clientId=consumer-1, groupId=testid1] Discovered group coordinator xxx.xxx.xxx.xxx:9092 (id: 2147483647 rack: null)
18|stream | 22/06/28 21:39:39 INFO ConsumerCoordinator: [Consumer clientId=consumer-1, groupId=testid1] Revoking previously assigned partitions []
18|stream | 22/06/28 21:39:39 INFO AbstractCoordinator: [Consumer clientId=consumer-1, groupId=testid1] (Re-)joining group
18|stream | 22/06/28 21:39:39 INFO AbstractCoordinator: [Consumer clientId=consumer-1, groupId=testid1] Successfully joined group with generation 9042
18|stream | 22/06/28 21:39:39 INFO ConsumerCoordinator: [Consumer clientId=consumer-1, groupId=testid1] Setting newly assigned partitions [applog-15, applog-14, applog-13, applog-12, applog-11, applog-10, applog-9, new_apploglog-0, applog-8, applog-7, applog-6, applog-5, applog-4, applog-3, applog-2, applog-1, applog-0]
18|stream | 22/06/28 21:39:39 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition new_apploglog-0 to offset 16767946.
18|stream | 22/06/28 21:39:39 INFO RecurringTimer: Started timer for JobGenerator at time 1656452400000
18|stream | 22/06/28 21:39:39 INFO JobGenerator: Started JobGenerator at 1656452400000 ms
18|stream | 22/06/28 21:39:39 INFO JobScheduler: Started JobScheduler
18|stream | 22/06/28 21:39:39 INFO StreamingContext: StreamingContext started
18|stream | 22/06/28 21:39:40 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xxx.xxx.xxx:46588) with ID 2
18|stream | 22/06/28 21:39:40 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xxx.xxx.xxx:48860) with ID 3
18|stream | 22/06/28 21:39:40 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xxx.xxx.xxx:35981 with 4.6 GB RAM, BlockManagerId(2, xxx.xxx.xxx.xxx, 35981, None)
18|stream | 22/06/28 21:39:40 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xxx.xxx.xxx:40001 with 4.6 GB RAM, BlockManagerId(3, xxx.xxx.xxx.xxx, 40001, None)
18|stream | 22/06/28 21:39:40 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xxx.xxx.xxx:39858) with ID 1
18|stream | 22/06/28 21:39:40 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xxx.xxx.xxx:57696) with ID 0
18|stream | 22/06/28 21:39:40 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xxx.xxx.xxx:44765 with 4.6 GB RAM, BlockManagerId(1, xxx.xxx.xxx.xxx, 44765, None)
18|stream | 22/06/28 21:39:40 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xxx.xxx.xxx:46661 with 4.6 GB RAM, BlockManagerId(0, xxx.xxx.xxx.xxx, 46661, None)
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-15 to offset 285007408.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-14 to offset 285006512.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-13 to offset 285006673.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-12 to offset 285006392.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-11 to offset 285006399.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-10 to offset 285006961.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-9 to offset 285007334.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition new_apploglog-0 to offset 16838546.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-8 to offset 285007057.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-7 to offset 285005614.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-6 to offset 285007348.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-5 to offset 285004512.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-4 to offset 285005570.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-3 to offset 285008145.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-2 to offset 285007214.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-1 to offset 285007686.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-0 to offset 316632614.
18|stream | 22/06/28 21:40:00 INFO JobScheduler: Added jobs for time 1656452400000 ms
18|stream | 22/06/28 21:40:00 INFO JobScheduler: Starting job streaming job 1656452400000 ms.0 from job set of time 1656452400000 ms
18|stream | 22/06/28 21:40:00 INFO SparkContext: Starting job: collect at Main.scala:76
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Registering RDD 1 (repartition at Main.scala:69)
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Got job 0 (collect at Main.scala:76) with 16 output partitions
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Final stage: ResultStage 1 (collect at Main.scala:76)
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[1] at repartition at Main.scala:69), which has no missing parents
18|stream | 22/06/28 21:40:00 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.9 KB, free 5.2 GB)
18|stream | 22/06/28 21:40:00 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.1 KB, free 5.2 GB)
18|stream | 22/06/28 21:40:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:41399 (size: 3.1 KB, free: 5.2 GB)
18|stream | 22/06/28 21:40:00 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Submitting 17 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at repartition at Main.scala:69) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
18|stream | 22/06/28 21:40:00 INFO TaskSchedulerImpl: Adding task set 0.0 with 17 tasks
18|stream | 22/06/28 21:40:00 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 0, xxx.xxx.xxx.xxx, executor 0, partition 1, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:00 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 1, xxx.xxx.xxx.xxx, executor 1, partition 8, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:00 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 2, xxx.xxx.xxx.xxx, executor 2, partition 3, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:00 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 3, xxx.xxx.xxx.xxx, executor 3, partition 0, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:44765 (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:40001 (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:35981 (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:46661 (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:02 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 4, xxx.xxx.xxx.xxx, executor 1, partition 10, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 5, xxx.xxx.xxx.xxx, executor 3, partition 4, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 1) in 2181 ms on xxx.xxx.xxx.xxx (executor 1) (1/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 3) in 2187 ms on xxx.xxx.xxx.xxx (executor 3) (2/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 6, xxx.xxx.xxx.xxx, executor 2, partition 7, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 2) in 2463 ms on xxx.xxx.xxx.xxx (executor 2) (3/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 7, xxx.xxx.xxx.xxx, executor 3, partition 5, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 5) in 343 ms on xxx.xxx.xxx.xxx (executor 3) (4/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 13.0 in stage 0.0 (TID 8, xxx.xxx.xxx.xxx, executor 1, partition 13, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 10.0 in stage 0.0 (TID 4) in 389 ms on xxx.xxx.xxx.xxx (executor 1) (5/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 9, xxx.xxx.xxx.xxx, executor 0, partition 2, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 0) in 2773 ms on xxx.xxx.xxx.xxx (executor 0) (6/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 11.0 in stage 0.0 (TID 10, xxx.xxx.xxx.xxx, executor 2, partition 11, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 6) in 403 ms on xxx.xxx.xxx.xxx (executor 2) (7/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 11, xxx.xxx.xxx.xxx, executor 3, partition 9, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 7) in 362 ms on xxx.xxx.xxx.xxx (executor 3) (8/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 15.0 in stage 0.0 (TID 12, xxx.xxx.xxx.xxx, executor 1, partition 15, PROCESS_LOCAL, 7746 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 13.0 in stage 0.0 (TID 8) in 369 ms on xxx.xxx.xxx.xxx (executor 1) (9/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 16.0 in stage 0.0 (TID 13, xxx.xxx.xxx.xxx, executor 1, partition 16, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 15.0 in stage 0.0 (TID 12) in 146 ms on xxx.xxx.xxx.xxx (executor 1) (10/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 11) in 247 ms on xxx.xxx.xxx.xxx (executor 3) (11/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 14, xxx.xxx.xxx.xxx, executor 0, partition 6, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 9) in 382 ms on xxx.xxx.xxx.xxx (executor 0) (12/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 14.0 in stage 0.0 (TID 15, xxx.xxx.xxx.xxx, executor 2, partition 14, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 11.0 in stage 0.0 (TID 10) in 337 ms on xxx.xxx.xxx.xxx (executor 2) (13/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 16.0 in stage 0.0 (TID 13) in 331 ms on xxx.xxx.xxx.xxx (executor 1) (14/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 12.0 in stage 0.0 (TID 16, xxx.xxx.xxx.xxx, executor 0, partition 12, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 14) in 303 ms on xxx.xxx.xxx.xxx (executor 0) (15/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 15) in 271 ms on xxx.xxx.xxx.xxx (executor 2) (16/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 12.0 in stage 0.0 (TID 16) in 291 ms on xxx.xxx.xxx.xxx (executor 0) (17/17)
18|stream | 22/06/28 21:40:04 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: ShuffleMapStage 0 (repartition at Main.scala:69) finished in 4.222 s
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: looking for newly runnable stages
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: running: Set()
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: waiting: Set(ResultStage 1)
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: failed: Set()
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at map at Main.scala:73), which has no missing parents
18|stream | 22/06/28 21:40:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 5.2 GB)
18|stream | 22/06/28 21:40:04 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.7 KB, free 5.2 GB)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx:41399 (size: 2.7 KB, free: 5.2 GB)
18|stream | 22/06/28 21:40:04 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: Submitting 16 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at map at Main.scala:73) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
18|stream | 22/06/28 21:40:04 INFO TaskSchedulerImpl: Adding task set 1.0 with 16 tasks
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 17, xxx.xxx.xxx.xxx, executor 2, partition 0, NODE_LOCAL, 7942 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 18, xxx.xxx.xxx.xxx, executor 3, partition 1, NODE_LOCAL, 7942 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 19, xxx.xxx.xxx.xxx, executor 1, partition 2, NODE_LOCAL, 7942 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 20, xxx.xxx.xxx.xxx, executor 0, partition 3, NODE_LOCAL, 7942 bytes)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx (size: 2.7 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx:44765 (size: 2.7 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx:40001 (size: 2.7 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx:46661 (size: 2.7 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to xxx.xxx.xxx.xxx:48860
18|stream | 22/06/28 21:40:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to xxx.xxx.xxx.xxx:39858
18|stream | 22/06/28 21:40:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to xxx.xxx.xxx.xxx:46588
18|stream | 22/06/28 21:40:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to xxx.xxx.xxx.xxx:57696
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:41399 in memory (size: 3.1 KB, free: 5.2 GB)
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:44765 in memory (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:35981 in memory (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:46661 in memory (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:40001 in memory (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-15 to offset 285007532.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-14 to offset 285006636.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-13 to offset 285006799.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-12 to offset 285006518.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-11 to offset 285006525.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-10 to offset 285007087.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-9 to offset 285007459.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition new_apploglog-0 to offset 16838553.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-8 to offset 285007182.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-7 to offset 285005739.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-6 to offset 285007471.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-5 to offset 285004635.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-4 to offset 285005693.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-3 to offset 285008268.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-2 to offset 285007337.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-1 to offset 285007810.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-0 to offset 316632738.
18|stream | 22/06/28 21:42:00 INFO JobScheduler: Added jobs for time 1656452520000 ms
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-15 to offset 285007665.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-14 to offset 285006770.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-13 to offset 285006931.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-12 to offset 285006650.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-11 to offset 285006657.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-10 to offset 285007219.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-9 to offset 285007591.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition new_apploglog-0 to offset 16838556.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-8 to offset 285007314.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-7 to offset 285005871.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-6 to offset 285007603.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-5 to offset 285004767.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-4 to offset 285005825.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-3 to offset 285008400.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-2 to offset 285007469.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-1 to offset 285007942.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-0 to offset 316632870.
18|stream | 22/06/28 21:44:00 INFO JobScheduler: Added jobs for time 1656452640000 ms
The offset resetting repeats forever, without even an error.
I took these actions to solve the problem, but none of them helped:
reset the Kafka offset to earliest and to latest
delete the consumer group and create a new one
I even changed the topic, but nothing changed, so I guessed the problem was in the Spark cluster; however, I can load Kafka data with the PySpark shell on the same cluster.
Notes:
the app had been working fine for about 3 years!
we recently had a server migration and some of our resources were reduced
other non-streaming jobs run on the Spark cluster without any issue
Is there anything that I'm missing?
Can you confirm what value auto.offset.reset was set to before the upgrade? You can also inspect the offsets for the consumer group on that particular topic by executing the following command:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group consumer-group
Also, check for any recent Kafka upgrade or broker configuration changes.
There are also edge-case scenarios for this behaviour caused by poison pills or changes in consumer behaviour, because there are many factors that determine when the auto.offset.reset property actually kicks in.
One such case from the docs:
There is an edge case that could result in data loss, whereby a message is not redelivered in a retryable exception scenario. This scenario applies to a new consumer group that is yet to have recorded any current offset (or the offset has been deleted).
Two consumer instances, A and B, join a new consumer group.
The consumer instances are configured with auto.offset.reset as latest (i.e. new messages only).
Consumer A consumes a new message from the topic partition.
Consumer A dies before processing of the message has completed. The consumer offsets are not updated to mark the message as consumed.
The consumer group rebalances, and Consumer B is assigned to the topic partition.
As there is no valid offset, and auto.offset.reset is set to latest, the message is not consumed.
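For reference, here is a minimal sketch of how these consumer settings are typically passed to a Spark Streaming direct stream (spark-streaming-kafka-0-10). The broker address is a placeholder; the group and topic names are taken from your log, and everything else is an assumption, not your actual code:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

// 120-second batches, matching the 2-minute cadence visible in your log.
val ssc = new StreamingContext(new SparkConf().setAppName("stream"), Seconds(120))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",  // placeholder broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "testid1",
  // "earliest" replays from the beginning when no committed offset exists;
  // "latest" only reads new messages, which is the data-loss edge case above.
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("applog", "new_apploglog"), kafkaParams)
)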
It turned out to be my Redis server. After the migration it became unstable and sometimes did not work correctly, so I restarted the service and everything worked fine.
I am using Kafka version 2.0, Spark version 2.2.0.2.6.4.0-91, and Python version 2.7.5.
I am running the code below and it streams without any error, but the count is not printed in the output.
import sys
import os

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 60)  # 60-second batch interval
    print("spark context set")

    zkQuorum, topic = 'master.hdp:2181', 'streamit'
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "console-consumer-68081", {topic: 1})
    print("connection set")

    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
My spark-submit command is:
/usr/hdp/current/spark2-client/bin/spark-submit --principal hdfs-ivory@KDCAUTH.COM --keytab /etc/security/keytabs/hdfs.headless.keytab --master yarn --deploy-mode client --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 kstream.py
The last part of my output log is shown below. It receives the stream but doesn't show the desired processed output.
-------------------------------------------
Time: 2020-01-22 19:29:00
-------------------------------------------
20/01/22 19:29:00 INFO JobScheduler: Finished job streaming job 1579701540000 ms.0 from job set of time 1579701540000 ms
20/01/22 19:29:00 INFO JobScheduler: Starting job streaming job 1579701540000 ms.1 from job set of time 1579701540000 ms
20/01/22 19:29:00 INFO SparkContext: Starting job: runJob at PythonRDD.scala:455
20/01/22 19:29:00 INFO DAGScheduler: Registering RDD 7 (call at /usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py:2230)
20/01/22 19:29:00 INFO DAGScheduler: Got job 2 (runJob at PythonRDD.scala:455) with 1 output partitions
20/01/22 19:29:00 INFO DAGScheduler: Final stage: ResultStage 4 (runJob at PythonRDD.scala:455)
20/01/22 19:29:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)
20/01/22 19:29:00 INFO DAGScheduler: Missing parents: List()
20/01/22 19:29:00 INFO DAGScheduler: Submitting ResultStage 4 (PythonRDD[11] at RDD at PythonRDD.scala:48), which has no missing parents
20/01/22 19:29:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 8.1 KB, free 366.2 MB)
20/01/22 19:29:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 4.4 KB, free 366.2 MB)
20/01/22 19:29:00 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.16.0.21:40801 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:29:00 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
20/01/22 19:29:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (PythonRDD[11] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
20/01/22 19:29:00 INFO YarnScheduler: Adding task set 4.0 with 1 tasks
20/01/22 19:29:00 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 71, master.hdp, executor 1, partition 0, PROCESS_LOCAL, 4632 bytes)
20/01/22 19:29:00 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on master.hdp:41184 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:29:00 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 172.16.0.21:51120
20/01/22 19:29:00 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 83 bytes
20/01/22 19:29:00 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 71) in 473 ms on master.hdp (executor 1) (1/1)
20/01/22 19:29:00 INFO YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
20/01/22 19:29:00 INFO DAGScheduler: ResultStage 4 (runJob at PythonRDD.scala:455) finished in 0.476 s
20/01/22 19:29:00 INFO DAGScheduler: Job 2 finished: runJob at PythonRDD.scala:455, took 0.497775 s
20/01/22 19:29:00 INFO SparkContext: Starting job: runJob at PythonRDD.scala:455
20/01/22 19:29:00 INFO DAGScheduler: Got job 3 (runJob at PythonRDD.scala:455) with 1 output partitions
20/01/22 19:29:00 INFO DAGScheduler: Final stage: ResultStage 6 (runJob at PythonRDD.scala:455)
20/01/22 19:29:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)
20/01/22 19:29:00 INFO DAGScheduler: Missing parents: List()
20/01/22 19:29:00 INFO DAGScheduler: Submitting ResultStage 6 (PythonRDD[12] at RDD at PythonRDD.scala:48), which has no missing parents
20/01/22 19:29:00 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 8.1 KB, free 366.1 MB)
20/01/22 19:29:00 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.4 KB, free 366.1 MB)
20/01/22 19:29:00 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 172.16.0.21:40801 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:29:00 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
20/01/22 19:29:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 6 (PythonRDD[12] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(1))
20/01/22 19:29:00 INFO YarnScheduler: Adding task set 6.0 with 1 tasks
20/01/22 19:29:00 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 72, master.hdp, executor 1, partition 1, PROCESS_LOCAL, 4632 bytes)
20/01/22 19:29:00 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on master.hdp:41184 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:29:00 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 72) in 123 ms on master.hdp (executor 1) (1/1)
20/01/22 19:29:00 INFO YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
20/01/22 19:29:00 INFO DAGScheduler: ResultStage 6 (runJob at PythonRDD.scala:455) finished in 0.125 s
20/01/22 19:29:00 INFO DAGScheduler: Job 3 finished: runJob at PythonRDD.scala:455, took 0.136936 s
-------------------------------------------
Time: 2020-01-22 19:29:00
-------------------------------------------
20/01/22 19:29:00 INFO JobScheduler: Finished job streaming job 1579701540000 ms.1 from job set of time 1579701540000 ms
20/01/22 19:29:00 INFO JobScheduler: Total delay: 0.811 s for time 1579701540000 ms (execution: 0.684 s)
20/01/22 19:29:00 INFO ReceivedBlockTracker: Deleting batches:
20/01/22 19:29:00 INFO InputInfoTracker: remove old batch metadata:
20/01/22 19:30:00 INFO JobScheduler: Added jobs for time 1579701600000 ms
20/01/22 19:30:00 INFO JobScheduler: Starting job streaming job 1579701600000 ms.0 from job set of time 1579701600000 ms
-------------------------------------------
Time: 2020-01-22 19:30:00
-------------------------------------------
20/01/22 19:30:00 INFO JobScheduler: Finished job streaming job 1579701600000 ms.0 from job set of time 1579701600000 ms
20/01/22 19:30:00 INFO JobScheduler: Starting job streaming job 1579701600000 ms.1 from job set of time 1579701600000 ms
20/01/22 19:30:00 INFO SparkContext: Starting job: runJob at PythonRDD.scala:455
20/01/22 19:30:00 INFO DAGScheduler: Registering RDD 16 (call at /usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py:2230)
20/01/22 19:30:00 INFO DAGScheduler: Got job 4 (runJob at PythonRDD.scala:455) with 1 output partitions
20/01/22 19:30:00 INFO DAGScheduler: Final stage: ResultStage 8 (runJob at PythonRDD.scala:455)
20/01/22 19:30:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 7)
20/01/22 19:30:00 INFO DAGScheduler: Missing parents: List()
20/01/22 19:30:00 INFO DAGScheduler: Submitting ResultStage 8 (PythonRDD[20] at RDD at PythonRDD.scala:48), which has no missing parents
20/01/22 19:30:00 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 8.1 KB, free 366.1 MB)
20/01/22 19:30:00 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.4 KB, free 366.1 MB)
20/01/22 19:30:00 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 172.16.0.21:40801 (size: 4.4 KB, free: 366.2 MB)
20/01/22 19:30:00 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
20/01/22 19:30:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (PythonRDD[20] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
20/01/22 19:30:00 INFO YarnScheduler: Adding task set 8.0 with 1 tasks
20/01/22 19:30:00 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 73, master.hdp, executor 1, partition 0, PROCESS_LOCAL, 4632 bytes)
20/01/22 19:30:00 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on master.hdp:41184 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:30:00 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 172.16.0.21:51120
20/01/22 19:30:00 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 83 bytes
20/01/22 19:30:00 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 73) in 120 ms on master.hdp (executor 1) (1/1)
20/01/22 19:30:00 INFO YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
20/01/22 19:30:00 INFO DAGScheduler: ResultStage 8 (runJob at PythonRDD.scala:455) finished in 0.121 s
20/01/22 19:30:00 INFO DAGScheduler: Job 4 finished: runJob at PythonRDD.scala:455, took 0.134627 s
20/01/22 19:30:00 INFO SparkContext: Starting job: runJob at PythonRDD.scala:455
20/01/22 19:30:00 INFO DAGScheduler: Got job 5 (runJob at PythonRDD.scala:455) with 1 output partitions
20/01/22 19:30:00 INFO DAGScheduler: Final stage: ResultStage 10 (runJob at PythonRDD.scala:455)
20/01/22 19:30:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 9)
20/01/22 19:30:00 INFO DAGScheduler: Missing parents: List()
20/01/22 19:30:00 INFO DAGScheduler: Submitting ResultStage 10 (PythonRDD[21] at RDD at PythonRDD.scala:48), which has no missing parents
20/01/22 19:30:00 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 8.1 KB, free 366.1 MB)
20/01/22 19:30:00 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 4.4 KB, free 366.1 MB)
20/01/22 19:30:00 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 172.16.0.21:40801 (size: 4.4 KB, free: 366.2 MB)
20/01/22 19:30:00 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
20/01/22 19:30:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 10 (PythonRDD[21] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(1))
20/01/22 19:30:00 INFO YarnScheduler: Adding task set 10.0 with 1 tasks
20/01/22 19:30:00 INFO TaskSetManager: Starting task 0.0 in stage 10.0 (TID 74, master.hdp, executor 1, partition 1, PROCESS_LOCAL, 4632 bytes)
20/01/22 19:30:00 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on master.hdp:41184 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:30:00 INFO TaskSetManager: Finished task 0.0 in stage 10.0 (TID 74) in 132 ms on master.hdp (executor 1) (1/1)
20/01/22 19:30:00 INFO YarnScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool
20/01/22 19:30:00 INFO DAGScheduler: ResultStage 10 (runJob at PythonRDD.scala:455) finished in 0.133 s
20/01/22 19:30:00 INFO DAGScheduler: Job 5 finished: runJob at PythonRDD.scala:455, took 0.143611 s
Give key.deserializer and value.deserializer in kafkaParams and use createDirectStream (note that the 0-8 integration expects "auto.offset.reset" with the value "smallest" rather than "earliest"):
kafkaParams = {
    "metadata.broker.list": config['kstream']['broker'],
    "auto.offset.reset": "smallest",
    "auto.create.topics.enable": "true",
    "key.deserializer": "org.springframework.kafka.support.serializer.JsonDeserializer",
    "value.deserializer": "org.springframework.kafka.support.serializer.JsonDeserializer"
}
kvs = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams,
                                    fromOffsets=None, messageHandler=None,
                                    keyDecoder=utf8_decoder, valueDecoder=utf8_decoder)
It seems that you are not using a logger; instead you are print()ing to standard output. In order to write to your log files, you need to set up a logger. For example, the following gets the logger from the SparkContext:
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("Example info")
LOGGER.error("Example error")
...
Spark version is 2.2.0.
Pseudocode:
read data1 from Kafka with a 5-minute window
read data2 from Kafka with a 10-minute window and a 5-minute slide duration
join data1 with data2 on some condition
do some aggregation and write to MySQL
Question:
the batch time is 15:00 but the submit time is 15:50, while the processing time is less than 1 minute. What happened?
val shareDs = KafkaUtils.createDirectStream[String, String](streamContext, LocationStrategies.PreferBrokers, shareReqConsumer)
val shareResDS = KafkaUtils.createDirectStream[String, String](streamContext, LocationStrategies.PreferBrokers, shareResConsumer).window(Minutes(WindowTime), Minutes(StreamTime))
shareDs.doSomeMap().join(shareResDS.doSomeMap()).foreachRDD { /* do some things, then write to MySQL */ }
There are some logs:
19/07/22 11:20:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:20:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:20:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:20:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-6, groupId=dashboard] Revoking previously assigned partitions [topic_wh_sparkstream_afp_com_input_result-2, topic_wh_sparkstream_afp_com_input_result-1, topic_wh_sparkstream_afp_com_input_result-0]
19/07/22 11:20:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-6, groupId=dashboard] (Re-)joining group
19/07/22 11:25:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-6, groupId=dashboard] Successfully joined group with generation 820
19/07/22 11:25:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-6, groupId=dashboard] Setting newly assigned partitions [topic_wh_sparkstream_afp_com_input_result-2, topic_wh_sparkstream_afp_com_input_result-1, topic_wh_sparkstream_afp_com_input_result-0]
19/07/22 11:25:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:25:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:25:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-5, groupId=dashboard] Revoking previously assigned partitions [topic_wh_sparkstream_decision_report_result-1, topic_wh_sparkstream_decision_report_result-2, topic_wh_sparkstream_decision_report_result-0]
19/07/22 11:25:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-5, groupId=dashboard] (Re-)joining group
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-5, groupId=dashboard] Successfully joined group with generation 821
19/07/22 11:30:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-5, groupId=dashboard] Setting newly assigned partitions [topic_wh_sparkstream_decision_report_result-1, topic_wh_sparkstream_decision_report_result-2, topic_wh_sparkstream_decision_report_result-0]
19/07/22 11:30:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Revoking previously assigned partitions [topic_wh_sparkstream_echo_mixed_risk_record-1, topic_wh_sparkstream_echo_mixed_risk_record-2, topic_wh_sparkstream_echo_mixed_risk_record-0]
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] (Re-)joining group
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Marking the coordinator 10.124.35.112:9092 (id: 2147483534 rack: null) dead
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Discovered group coordinator 10.124.35.112:9092 (id: 2147483534 rack: null)
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] (Re-)joining group
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Successfully joined group with generation 822
19/07/22 11:35:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Setting newly assigned partitions [topic_wh_sparkstream_echo_mixed_risk_record-1, topic_wh_sparkstream_echo_mixed_risk_record-2, topic_wh_sparkstream_echo_mixed_risk_record-0]
19/07/22 11:35:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:35:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] Revoking previously assigned partitions [topic_wh_sparkstream_echo_mixed_risk_result_detail-2, topic_wh_sparkstream_echo_mixed_risk_result_detail-1, topic_wh_sparkstream_echo_mixed_risk_result_detail-0, topic_wh_sparkstream_echo_behavior_features_result-0, topic_wh_sparkstream_echo_behavior_features_result-1, topic_wh_sparkstream_echo_behavior_features_result-2]
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] (Re-)joining group
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] Marking the coordinator 10.124.35.112:9092 (id: 2147483534 rack: null) dead
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] Discovered group coordinator 10.124.35.112:9092 (id: 2147483534 rack: null)
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] (Re-)joining group
At the window timestamp, only Kafka partition reassignments happen instead of a job being added.
I resolved this problem.
Use Spark Streaming with Kafka, configure each stream with a separate group.id, and disable auto-commit.
Configure the Kafka parameters appropriately, especially heartbeat, session-timeout, request-timeout, and max-poll-interval.
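For illustration, here is a minimal sketch of such a per-stream configuration in Scala; the broker address, group names, and all timeout values are assumptions to show the shape, not the original settings:

import org.apache.kafka.common.serialization.StringDeserializer

// Hypothetical helper: builds consumer params with a distinct group.id per stream.
def kafkaParamsFor(groupId: String): Map[String, Object] = Map(
  "bootstrap.servers" -> "localhost:9092",              // placeholder broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> groupId,                                // separate group per stream
  "enable.auto.commit" -> (false: java.lang.Boolean),   // commit offsets manually
  "heartbeat.interval.ms" -> (3000: java.lang.Integer),
  "session.timeout.ms" -> (30000: java.lang.Integer),
  "request.timeout.ms" -> (40000: java.lang.Integer),   // keep larger than session.timeout.ms
  "max.poll.interval.ms" -> (600000: java.lang.Integer) // keep larger than the slowest batch
)

val shareReqParams = kafkaParamsFor("dashboard-share-req")
val shareResParams = kafkaParamsFor("dashboard-share-res")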
I'm doing some kind of string processing on Spark. My code snippet:
val rdd = sc.objectFile[(String, String)]("some hdfs url", 1)
rdd.cache.count // force the cache to materialize

// finder is defined elsewhere; it accumulates processed entries and exposes the result.
val combOp = (f: List[String], g: List[String]) => {
  for (x <- f) {
    finder.processEntry(x)
  }
  for (x <- g) {
    finder.processEntry(x)
  }
  finder.result
}

val res = rdd.mapPartitions(x => {
  for (e <- x) {
    finder.processEntry(e)
  }
  Iterator(finder.result)
}, true).reduce(combOp)
The dataset I have is about 10GB. I'm running Spark on a 24-core machine with 48GB of memory. Config file:
spark.driver.memory 1g
spark.executor.memory 30g
spark.executor.extraJavaOptions -Xloggc:/var/log/gcmemory.log -XX:+PrintGCDetails
spark.executor.cores 4
Execution log snippet:
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.60.1.143, ANY, 1642 bytes)
INFO BlockManagerMasterEndpoint: Registering block manager 10.60.1.143:42850 with 15.5 GB RAM, BlockManagerId(0, 10.60.1.143, 42850)
INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.60.1.143:42850 (size: 1766.0 B, free: 15.5 GB)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.60.1.143:42850 (size: 16.8 KB, free: 15.5 GB)
INFO BlockManagerInfo: Added rdd_1_3 in memory on 10.60.1.143:42850 (size: 219.7 MB, free: 15.3 GB)
INFO BlockManagerInfo: Added rdd_1_1 in memory on 10.60.1.143:42850 (size: 229.7 MB, free: 15.1 GB)
INFO BlockManagerInfo: Added rdd_1_2 in memory on 10.60.1.143:42850 (size: 221.5 MB, free: 14.9 GB)
INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 6345 ms on 10.60.1.143 (1/34)
INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 6351 ms on 10.60.1.143 (2/34)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6354 ms on 10.60.1.143 (3/34)
INFO BlockManagerInfo: Added rdd_1_0 in memory on 10.60.1.143:42850 (size: 220.6 MB, free: 14.7 GB)
INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6454 ms on 10.60.1.143 (4/34)
INFO BlockManagerInfo: Added rdd_1_5 in memory on 10.60.1.143:42850 (size: 219.9 MB, free: 14.4 GB)
INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 2287 ms on 10.60.1.143 (5/34)
INFO BlockManagerInfo: Added rdd_1_4 in memory on 10.60.1.143:42850 (size: 222.7 MB, free: 14.2 GB)
INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, 10.60.1.143, ANY, 1642 bytes)
INFO BlockManagerInfo: Added rdd_1_6 in memory on 10.60.1.143:42850 (size: 210.7 MB, free: 14.0 GB)
INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 2350 ms on 10.60.1.143 (6/34)
INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 2356 ms on 10.60.1.143 (7/34)
INFO BlockManagerInfo: Added rdd_1_7 in memory on 10.60.1.143:42850 (size: 214.6 MB, free: 13.8 GB)
INFO TaskSetManager: Starting task 11.0 in stage 0.0 (TID 11, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 2289 ms on 10.60.1.143 (8/34)
INFO BlockManagerInfo: Added rdd_1_8 in memory on 10.60.1.143:42850 (size: 216.3 MB, free: 13.6 GB)
INFO TaskSetManager: Starting task 12.0 in stage 0.0 (TID 12, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 2430 ms on 10.60.1.143 (9/34)
INFO BlockManagerInfo: Added rdd_1_11 in memory on 10.60.1.143:42850 (size: 216.5 MB, free: 13.4 GB)
INFO BlockManagerInfo: Added rdd_1_10 in memory on 10.60.1.143:42850 (size: 216.5 MB, free: 13.2 GB)
INFO TaskSetManager: Starting task 13.0 in stage 0.0 (TID 13, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 11.0 in stage 0.0 (TID 11) in 2416 ms on 10.60.1.143 (10/34)
INFO TaskSetManager: Starting task 14.0 in stage 0.0 (TID 14, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 10.0 in stage 0.0 (TID 10) in 2445 ms on 10.60.1.143 (11/34)
INFO BlockManagerInfo: Added rdd_1_9 in memory on 10.60.1.143:42850 (size: 231.4 MB, free: 12.9 GB)
INFO TaskSetManager: Starting task 15.0 in stage 0.0 (TID 15, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 2528 ms on 10.60.1.143 (12/34)
INFO BlockManagerInfo: Added rdd_1_12 in memory on 10.60.1.143:42850 (size: 217.3 MB, free: 12.7 GB)
INFO TaskSetManager: Starting task 16.0 in stage 0.0 (TID 16, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 12.0 in stage 0.0 (TID 12) in 1797 ms on 10.60.1.143 (13/34)
INFO BlockManagerInfo: Added rdd_1_14 in memory on 10.60.1.143:42850 (size: 215.8 MB, free: 12.5 GB)
INFO TaskSetManager: Starting task 17.0 in stage 0.0 (TID 17, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 1748 ms on 10.60.1.143 (14/34)
INFO BlockManagerInfo: Added rdd_1_13 in memory on 10.60.1.143:42850 (size: 220.9 MB, free: 12.3 GB)
INFO TaskSetManager: Starting task 18.0 in stage 0.0 (TID 18, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 13.0 in stage 0.0 (TID 13) in 1812 ms on 10.60.1.143 (15/34)
INFO BlockManagerInfo: Added rdd_1_15 in memory on 10.60.1.143:42850 (size: 233.8 MB, free: 12.1 GB)
INFO TaskSetManager: Starting task 19.0 in stage 0.0 (TID 19, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 15.0 in stage 0.0 (TID 15) in 1756 ms on 10.60.1.143 (16/34)
INFO BlockManagerInfo: Added rdd_1_16 in memory on 10.60.1.143:42850 (size: 221.6 MB, free: 11.9 GB)
INFO TaskSetManager: Starting task 20.0 in stage 0.0 (TID 20, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 16.0 in stage 0.0 (TID 16) in 2600 ms on 10.60.1.143 (17/34)
How can the first tasks in the same task set take longer to execute than the later ones? Any help is very much appreciated.
A common cause of stragglers (executors that take longer on certain partitions than others) is unevenly partitioned data, so I'd suggest trying to repartition your data. The Spark UI may also have some helpful information (you can take a look at the input sizes, etc.). Sometimes some machines are slower for random reasons (especially common in virtualized environments, where noisy neighbors can share certain machines); in that case you can try enabling speculative execution (see https://spark.apache.org/docs/latest/configuration.html) by setting the spark.speculation flag, so Spark can try to run the task on another executor if it happens to be running slowly on one machine.
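For example, a minimal sketch of enabling speculation when building the SparkConf; the app name is hypothetical, and the quantile and multiplier shown are Spark's defaults, spelled out here only for illustration:

import org.apache.spark.SparkConf

// Speculatively re-launch tasks that run much slower than their peers.
val conf = new SparkConf()
  .setAppName("StringProcessing")              // hypothetical app name
  .set("spark.speculation", "true")
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
  .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as slow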