Spark streaming app resets kafka offset continuously - apache-spark

I have a Spark Streaming app that runs on a Spark cluster with 4 nodes. A few days ago the app started continuously resetting the Kafka offsets and no longer fetches any Kafka data, even though AUTO_OFFSET_RESET is set.
This is the log:
22/06/28 21:39:38 INFO AppInfoParser: Kafka version : 2.0.0
18|stream | 22/06/28 21:39:38 INFO AppInfoParser: Kafka commitId : 3402a8361b734732
18|stream | 22/06/28 21:39:39 INFO Metadata: Cluster ID: 3cAbAp6-QNyO1cKEc1dtUA
18|stream | 22/06/28 21:39:39 INFO AbstractCoordinator: [Consumer clientId=consumer-1, groupId=testid1] Discovered group coordinator xxx.xxx.xxx.xxx:9092 (id: 2147483647 rack: null)
18|stream | 22/06/28 21:39:39 INFO ConsumerCoordinator: [Consumer clientId=consumer-1, groupId=testid1] Revoking previously assigned partitions []
18|stream | 22/06/28 21:39:39 INFO AbstractCoordinator: [Consumer clientId=consumer-1, groupId=testid1] (Re-)joining group
18|stream | 22/06/28 21:39:39 INFO AbstractCoordinator: [Consumer clientId=consumer-1, groupId=testid1] Successfully joined group with generation 9042
18|stream | 22/06/28 21:39:39 INFO ConsumerCoordinator: [Consumer clientId=consumer-1, groupId=testid1] Setting newly assigned partitions [applog-15, applog-14, applog-13, applog-12, applog-11, applog-10, applog-9, new_apploglog-0, applog-8, applog-7, applog-6, applog-5, applog-4, applog-3, applog-2, applog-1, applog-0]
18|stream | 22/06/28 21:39:39 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition new_apploglog-0 to offset 16767946.
18|stream | 22/06/28 21:39:39 INFO RecurringTimer: Started timer for JobGenerator at time 1656452400000
18|stream | 22/06/28 21:39:39 INFO JobGenerator: Started JobGenerator at 1656452400000 ms
18|stream | 22/06/28 21:39:39 INFO JobScheduler: Started JobScheduler
18|stream | 22/06/28 21:39:39 INFO StreamingContext: StreamingContext started
18|stream | 22/06/28 21:39:40 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xxx.xxx.xxx:46588) with ID 2
18|stream | 22/06/28 21:39:40 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xxx.xxx.xxx:48860) with ID 3
18|stream | 22/06/28 21:39:40 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xxx.xxx.xxx:35981 with 4.6 GB RAM, BlockManagerId(2, xxx.xxx.xxx.xxx, 35981, None)
18|stream | 22/06/28 21:39:40 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xxx.xxx.xxx:40001 with 4.6 GB RAM, BlockManagerId(3, xxx.xxx.xxx.xxx, 40001, None)
18|stream | 22/06/28 21:39:40 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xxx.xxx.xxx:39858) with ID 1
18|stream | 22/06/28 21:39:40 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xxx.xxx.xxx:57696) with ID 0
18|stream | 22/06/28 21:39:40 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xxx.xxx.xxx:44765 with 4.6 GB RAM, BlockManagerId(1, xxx.xxx.xxx.xxx, 44765, None)
18|stream | 22/06/28 21:39:40 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xxx.xxx.xxx:46661 with 4.6 GB RAM, BlockManagerId(0, xxx.xxx.xxx.xxx, 46661, None)
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-15 to offset 285007408.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-14 to offset 285006512.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-13 to offset 285006673.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-12 to offset 285006392.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-11 to offset 285006399.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-10 to offset 285006961.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-9 to offset 285007334.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition new_apploglog-0 to offset 16838546.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-8 to offset 285007057.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-7 to offset 285005614.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-6 to offset 285007348.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-5 to offset 285004512.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-4 to offset 285005570.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-3 to offset 285008145.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-2 to offset 285007214.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-1 to offset 285007686.
18|stream | 22/06/28 21:40:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-0 to offset 316632614.
18|stream | 22/06/28 21:40:00 INFO JobScheduler: Added jobs for time 1656452400000 ms
18|stream | 22/06/28 21:40:00 INFO JobScheduler: Starting job streaming job 1656452400000 ms.0 from job set of time 1656452400000 ms
18|stream | 22/06/28 21:40:00 INFO SparkContext: Starting job: collect at Main.scala:76
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Registering RDD 1 (repartition at Main.scala:69)
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Got job 0 (collect at Main.scala:76) with 16 output partitions
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Final stage: ResultStage 1 (collect at Main.scala:76)
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[1] at repartition at Main.scala:69), which has no missing parents
18|stream | 22/06/28 21:40:00 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.9 KB, free 5.2 GB)
18|stream | 22/06/28 21:40:00 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.1 KB, free 5.2 GB)
18|stream | 22/06/28 21:40:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:41399 (size: 3.1 KB, free: 5.2 GB)
18|stream | 22/06/28 21:40:00 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
18|stream | 22/06/28 21:40:00 INFO DAGScheduler: Submitting 17 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at repartition at Main.scala:69) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
18|stream | 22/06/28 21:40:00 INFO TaskSchedulerImpl: Adding task set 0.0 with 17 tasks
18|stream | 22/06/28 21:40:00 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 0, xxx.xxx.xxx.xxx, executor 0, partition 1, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:00 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 1, xxx.xxx.xxx.xxx, executor 1, partition 8, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:00 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 2, xxx.xxx.xxx.xxx, executor 2, partition 3, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:00 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 3, xxx.xxx.xxx.xxx, executor 3, partition 0, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:44765 (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:40001 (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:35981 (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xxx.xxx.xxx.xxx:46661 (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:02 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 4, xxx.xxx.xxx.xxx, executor 1, partition 10, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 5, xxx.xxx.xxx.xxx, executor 3, partition 4, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 1) in 2181 ms on xxx.xxx.xxx.xxx (executor 1) (1/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 3) in 2187 ms on xxx.xxx.xxx.xxx (executor 3) (2/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 6, xxx.xxx.xxx.xxx, executor 2, partition 7, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 2) in 2463 ms on xxx.xxx.xxx.xxx (executor 2) (3/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 7, xxx.xxx.xxx.xxx, executor 3, partition 5, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 5) in 343 ms on xxx.xxx.xxx.xxx (executor 3) (4/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 13.0 in stage 0.0 (TID 8, xxx.xxx.xxx.xxx, executor 1, partition 13, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 10.0 in stage 0.0 (TID 4) in 389 ms on xxx.xxx.xxx.xxx (executor 1) (5/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 9, xxx.xxx.xxx.xxx, executor 0, partition 2, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 0) in 2773 ms on xxx.xxx.xxx.xxx (executor 0) (6/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 11.0 in stage 0.0 (TID 10, xxx.xxx.xxx.xxx, executor 2, partition 11, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 6) in 403 ms on xxx.xxx.xxx.xxx (executor 2) (7/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 11, xxx.xxx.xxx.xxx, executor 3, partition 9, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 7) in 362 ms on xxx.xxx.xxx.xxx (executor 3) (8/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 15.0 in stage 0.0 (TID 12, xxx.xxx.xxx.xxx, executor 1, partition 15, PROCESS_LOCAL, 7746 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 13.0 in stage 0.0 (TID 8) in 369 ms on xxx.xxx.xxx.xxx (executor 1) (9/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 16.0 in stage 0.0 (TID 13, xxx.xxx.xxx.xxx, executor 1, partition 16, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 15.0 in stage 0.0 (TID 12) in 146 ms on xxx.xxx.xxx.xxx (executor 1) (10/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 11) in 247 ms on xxx.xxx.xxx.xxx (executor 3) (11/17)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 14, xxx.xxx.xxx.xxx, executor 0, partition 6, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:03 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 9) in 382 ms on xxx.xxx.xxx.xxx (executor 0) (12/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 14.0 in stage 0.0 (TID 15, xxx.xxx.xxx.xxx, executor 2, partition 14, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 11.0 in stage 0.0 (TID 10) in 337 ms on xxx.xxx.xxx.xxx (executor 2) (13/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 16.0 in stage 0.0 (TID 13) in 331 ms on xxx.xxx.xxx.xxx (executor 1) (14/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 12.0 in stage 0.0 (TID 16, xxx.xxx.xxx.xxx, executor 0, partition 12, PROCESS_LOCAL, 7748 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 14) in 303 ms on xxx.xxx.xxx.xxx (executor 0) (15/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 15) in 271 ms on xxx.xxx.xxx.xxx (executor 2) (16/17)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Finished task 12.0 in stage 0.0 (TID 16) in 291 ms on xxx.xxx.xxx.xxx (executor 0) (17/17)
18|stream | 22/06/28 21:40:04 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: ShuffleMapStage 0 (repartition at Main.scala:69) finished in 4.222 s
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: looking for newly runnable stages
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: running: Set()
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: waiting: Set(ResultStage 1)
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: failed: Set()
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at map at Main.scala:73), which has no missing parents
18|stream | 22/06/28 21:40:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 5.2 GB)
18|stream | 22/06/28 21:40:04 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.7 KB, free 5.2 GB)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx:41399 (size: 2.7 KB, free: 5.2 GB)
18|stream | 22/06/28 21:40:04 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
18|stream | 22/06/28 21:40:04 INFO DAGScheduler: Submitting 16 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at map at Main.scala:73) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
18|stream | 22/06/28 21:40:04 INFO TaskSchedulerImpl: Adding task set 1.0 with 16 tasks
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 17, xxx.xxx.xxx.xxx, executor 2, partition 0, NODE_LOCAL, 7942 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 18, xxx.xxx.xxx.xxx, executor 3, partition 1, NODE_LOCAL, 7942 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 19, xxx.xxx.xxx.xxx, executor 1, partition 2, NODE_LOCAL, 7942 bytes)
18|stream | 22/06/28 21:40:04 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 20, xxx.xxx.xxx.xxx, executor 0, partition 3, NODE_LOCAL, 7942 bytes)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx (size: 2.7 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx:44765 (size: 2.7 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx:40001 (size: 2.7 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on xxx.xxx.xxx.xxx:46661 (size: 2.7 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to xxx.xxx.xxx.xxx:48860
18|stream | 22/06/28 21:40:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to xxx.xxx.xxx.xxx:39858
18|stream | 22/06/28 21:40:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to xxx.xxx.xxx.xxx:46588
18|stream | 22/06/28 21:40:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to xxx.xxx.xxx.xxx:57696
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:41399 in memory (size: 3.1 KB, free: 5.2 GB)
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:44765 in memory (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:35981 in memory (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:46661 in memory (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:40:32 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xxx.xxx.xxx.xxx:40001 in memory (size: 3.1 KB, free: 4.6 GB)
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-15 to offset 285007532.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-14 to offset 285006636.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-13 to offset 285006799.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-12 to offset 285006518.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-11 to offset 285006525.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-10 to offset 285007087.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-9 to offset 285007459.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition new_apploglog-0 to offset 16838553.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-8 to offset 285007182.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-7 to offset 285005739.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-6 to offset 285007471.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-5 to offset 285004635.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-4 to offset 285005693.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-3 to offset 285008268.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-2 to offset 285007337.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-1 to offset 285007810.
18|stream | 22/06/28 21:42:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-0 to offset 316632738.
18|stream | 22/06/28 21:42:00 INFO JobScheduler: Added jobs for time 1656452520000 ms
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-15 to offset 285007665.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-14 to offset 285006770.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-13 to offset 285006931.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-12 to offset 285006650.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-11 to offset 285006657.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-10 to offset 285007219.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-9 to offset 285007591.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition new_apploglog-0 to offset 16838556.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-8 to offset 285007314.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-7 to offset 285005871.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-6 to offset 285007603.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-5 to offset 285004767.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-4 to offset 285005825.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-3 to offset 285008400.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-2 to offset 285007469.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-1 to offset 285007942.
18|stream | 22/06/28 21:44:00 INFO Fetcher: [Consumer clientId=consumer-1, groupId=testid1] Resetting offset for partition applog-0 to offset 316632870.
18|stream | 22/06/28 21:44:00 INFO JobScheduler: Added jobs for time 1656452640000 ms
The Kafka offset resetting repeats forever, without even an error.
I tried the following to solve the problem, but none of it helped:
reset the Kafka offsets to earliest or latest
delete the consumer group and create a new one
I even changed the topic, but nothing changed, so I guessed the problem was in the Spark cluster; however, I can load Kafka data with the PySpark shell on the same cluster.
Notes:
the app had been working fine for about 3 years!
we recently had a server migration and some of our resources were reduced
other, non-streaming jobs run on the Spark cluster without any issue
Is there anything that I'm missing?
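For reference, the direct stream in this app is created along the lines of the sketch below (the group id, topic names and 2-minute batch interval match the log; the broker address, deserializers and everything else are assumptions):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// 2-minute batch interval, as suggested by the log timestamps.
val ssc = new StreamingContext(new SparkConf().setAppName("stream"), Minutes(2))

// Assumed consumer settings; only group.id and the topic names come from the log above.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "xxx.xxx.xxx.xxx:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "testid1",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("applog", "new_apploglog"), kafkaParams)
)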

Can you confirm what value was set for AUTO_OFFSET_RESET before the upgrade? You can also inspect the offsets for the consumer group on that particular topic by executing the following command:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group consumer-group
Also, check for any recent Kafka upgrade or broker configuration changes.
There are also edge-case scenarios for this behaviour, for example poison pills or changes in consumer behaviour, because many factors determine when the auto.offset.reset property actually kicks in.
One such case from the documentation:
There is an edge case that could result in data loss, whereby a message is not redelivered in a retryable exception scenario. This scenario applies to a new consumer group that is yet to have recorded any current offset (or the offset has been deleted).
Two consumer instances, A and B, join a new consumer group.
The consumer instances are configured with auto.offset.reset as latest (i.e. new messages only).
Consumer A consumes a new message from the topic partition.
Consumer A dies before processing of the message has completed. The consumer offsets are not updated to mark the message as consumed.
The consumer group rebalances, and Consumer B is assigned to the topic partition.
As there is no valid offset, and auto.offset.reset is set to latest, the message is not consumed.
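A minimal sketch of one way to reduce exposure to that edge case with the Spark Streaming Kafka 0-10 API: disable enable.auto.commit and commit offsets yourself only after a batch has been fully processed, so the group always has a valid committed offset and auto.offset.reset rarely has to decide anything (stream here is the direct stream from the question's sketch; the processing step is a placeholder):
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Capture the Kafka offset ranges before transforming the RDD.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here ...

  // Commit only after processing succeeded, so a crash leads to reprocessing
  // rather than to falling back on auto.offset.reset.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}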

It turned out to be my Redis server. After the migration it became unstable and sometimes did not work correctly, so I restarted the service and everything worked fine.

Related

Kafka create stream running but not printing the processed output from Kafka Topic in Pyspark

I am using Kafka version 2.0, Spark version 2.2.0.2.6.4.0-91 and Python version 2.7.5.
I am running the code below; it streams without any error, but the count is not printed in the output.
import sys
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 60)
    print("spark context set")
    zkQuorum, topic = 'master.hdp:2181', 'streamit'
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "console-consumer-68081", {topic: 1})
    print("connection set")
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
My spark-submit command is:
/usr/hdp/current/spark2-client/bin/spark-submit --principal hdfs-ivory@KDCAUTH.COM --keytab /etc/security/keytabs/hdfs.headless.keytab --master yarn --deploy-mode client --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 kstream.py
The last part of my output log is shown below. It receives the stream but doesn't show the desired processed output.
-------------------------------------------
Time: 2020-01-22 19:29:00
-------------------------------------------
20/01/22 19:29:00 INFO JobScheduler: Finished job streaming job 1579701540000 ms.0 from job set of time 1579701540000 ms
20/01/22 19:29:00 INFO JobScheduler: Starting job streaming job 1579701540000 ms.1 from job set of time 1579701540000 ms
20/01/22 19:29:00 INFO SparkContext: Starting job: runJob at PythonRDD.scala:455
20/01/22 19:29:00 INFO DAGScheduler: Registering RDD 7 (call at /usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py:2230)
20/01/22 19:29:00 INFO DAGScheduler: Got job 2 (runJob at PythonRDD.scala:455) with 1 output partitions
20/01/22 19:29:00 INFO DAGScheduler: Final stage: ResultStage 4 (runJob at PythonRDD.scala:455)
20/01/22 19:29:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)
20/01/22 19:29:00 INFO DAGScheduler: Missing parents: List()
20/01/22 19:29:00 INFO DAGScheduler: Submitting ResultStage 4 (PythonRDD[11] at RDD at PythonRDD.scala:48), which has no missing parents
20/01/22 19:29:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 8.1 KB, free 366.2 MB)
20/01/22 19:29:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 4.4 KB, free 366.2 MB)
20/01/22 19:29:00 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.16.0.21:40801 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:29:00 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
20/01/22 19:29:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (PythonRDD[11] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
20/01/22 19:29:00 INFO YarnScheduler: Adding task set 4.0 with 1 tasks
20/01/22 19:29:00 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 71, master.hdp, executor 1, partition 0, PROCESS_LOCAL, 4632 bytes)
20/01/22 19:29:00 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on master.hdp:41184 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:29:00 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 172.16.0.21:51120
20/01/22 19:29:00 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 83 bytes
20/01/22 19:29:00 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 71) in 473 ms on master.hdp (executor 1) (1/1)
20/01/22 19:29:00 INFO YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
20/01/22 19:29:00 INFO DAGScheduler: ResultStage 4 (runJob at PythonRDD.scala:455) finished in 0.476 s
20/01/22 19:29:00 INFO DAGScheduler: Job 2 finished: runJob at PythonRDD.scala:455, took 0.497775 s
20/01/22 19:29:00 INFO SparkContext: Starting job: runJob at PythonRDD.scala:455
20/01/22 19:29:00 INFO DAGScheduler: Got job 3 (runJob at PythonRDD.scala:455) with 1 output partitions
20/01/22 19:29:00 INFO DAGScheduler: Final stage: ResultStage 6 (runJob at PythonRDD.scala:455)
20/01/22 19:29:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)
20/01/22 19:29:00 INFO DAGScheduler: Missing parents: List()
20/01/22 19:29:00 INFO DAGScheduler: Submitting ResultStage 6 (PythonRDD[12] at RDD at PythonRDD.scala:48), which has no missing parents
20/01/22 19:29:00 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 8.1 KB, free 366.1 MB)
20/01/22 19:29:00 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.4 KB, free 366.1 MB)
20/01/22 19:29:00 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 172.16.0.21:40801 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:29:00 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
20/01/22 19:29:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 6 (PythonRDD[12] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(1))
20/01/22 19:29:00 INFO YarnScheduler: Adding task set 6.0 with 1 tasks
20/01/22 19:29:00 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 72, master.hdp, executor 1, partition 1, PROCESS_LOCAL, 4632 bytes)
20/01/22 19:29:00 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on master.hdp:41184 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:29:00 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 72) in 123 ms on master.hdp (executor 1) (1/1)
20/01/22 19:29:00 INFO YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
20/01/22 19:29:00 INFO DAGScheduler: ResultStage 6 (runJob at PythonRDD.scala:455) finished in 0.125 s
20/01/22 19:29:00 INFO DAGScheduler: Job 3 finished: runJob at PythonRDD.scala:455, took 0.136936 s
-------------------------------------------
Time: 2020-01-22 19:29:00
-------------------------------------------
20/01/22 19:29:00 INFO JobScheduler: Finished job streaming job 1579701540000 ms.1 from job set of time 1579701540000 ms
20/01/22 19:29:00 INFO JobScheduler: Total delay: 0.811 s for time 1579701540000 ms (execution: 0.684 s)
20/01/22 19:29:00 INFO ReceivedBlockTracker: Deleting batches:
20/01/22 19:29:00 INFO InputInfoTracker: remove old batch metadata:
20/01/22 19:30:00 INFO JobScheduler: Added jobs for time 1579701600000 ms
20/01/22 19:30:00 INFO JobScheduler: Starting job streaming job 1579701600000 ms.0 from job set of time 1579701600000 ms
-------------------------------------------
Time: 2020-01-22 19:30:00
-------------------------------------------
20/01/22 19:30:00 INFO JobScheduler: Finished job streaming job 1579701600000 ms.0 from job set of time 1579701600000 ms
20/01/22 19:30:00 INFO JobScheduler: Starting job streaming job 1579701600000 ms.1 from job set of time 1579701600000 ms
20/01/22 19:30:00 INFO SparkContext: Starting job: runJob at PythonRDD.scala:455
20/01/22 19:30:00 INFO DAGScheduler: Registering RDD 16 (call at /usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py:2230)
20/01/22 19:30:00 INFO DAGScheduler: Got job 4 (runJob at PythonRDD.scala:455) with 1 output partitions
20/01/22 19:30:00 INFO DAGScheduler: Final stage: ResultStage 8 (runJob at PythonRDD.scala:455)
20/01/22 19:30:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 7)
20/01/22 19:30:00 INFO DAGScheduler: Missing parents: List()
20/01/22 19:30:00 INFO DAGScheduler: Submitting ResultStage 8 (PythonRDD[20] at RDD at PythonRDD.scala:48), which has no missing parents
20/01/22 19:30:00 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 8.1 KB, free 366.1 MB)
20/01/22 19:30:00 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.4 KB, free 366.1 MB)
20/01/22 19:30:00 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 172.16.0.21:40801 (size: 4.4 KB, free: 366.2 MB)
20/01/22 19:30:00 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
20/01/22 19:30:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (PythonRDD[20] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
20/01/22 19:30:00 INFO YarnScheduler: Adding task set 8.0 with 1 tasks
20/01/22 19:30:00 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 73, master.hdp, executor 1, partition 0, PROCESS_LOCAL, 4632 bytes)
20/01/22 19:30:00 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on master.hdp:41184 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:30:00 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to 172.16.0.21:51120
20/01/22 19:30:00 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 83 bytes
20/01/22 19:30:00 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 73) in 120 ms on master.hdp (executor 1) (1/1)
20/01/22 19:30:00 INFO YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
20/01/22 19:30:00 INFO DAGScheduler: ResultStage 8 (runJob at PythonRDD.scala:455) finished in 0.121 s
20/01/22 19:30:00 INFO DAGScheduler: Job 4 finished: runJob at PythonRDD.scala:455, took 0.134627 s
20/01/22 19:30:00 INFO SparkContext: Starting job: runJob at PythonRDD.scala:455
20/01/22 19:30:00 INFO DAGScheduler: Got job 5 (runJob at PythonRDD.scala:455) with 1 output partitions
20/01/22 19:30:00 INFO DAGScheduler: Final stage: ResultStage 10 (runJob at PythonRDD.scala:455)
20/01/22 19:30:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 9)
20/01/22 19:30:00 INFO DAGScheduler: Missing parents: List()
20/01/22 19:30:00 INFO DAGScheduler: Submitting ResultStage 10 (PythonRDD[21] at RDD at PythonRDD.scala:48), which has no missing parents
20/01/22 19:30:00 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 8.1 KB, free 366.1 MB)
20/01/22 19:30:00 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 4.4 KB, free 366.1 MB)
20/01/22 19:30:00 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 172.16.0.21:40801 (size: 4.4 KB, free: 366.2 MB)
20/01/22 19:30:00 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
20/01/22 19:30:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 10 (PythonRDD[21] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(1))
20/01/22 19:30:00 INFO YarnScheduler: Adding task set 10.0 with 1 tasks
20/01/22 19:30:00 INFO TaskSetManager: Starting task 0.0 in stage 10.0 (TID 74, master.hdp, executor 1, partition 1, PROCESS_LOCAL, 4632 bytes)
20/01/22 19:30:00 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on master.hdp:41184 (size: 4.4 KB, free: 366.3 MB)
20/01/22 19:30:00 INFO TaskSetManager: Finished task 0.0 in stage 10.0 (TID 74) in 132 ms on master.hdp (executor 1) (1/1)
20/01/22 19:30:00 INFO YarnScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool
20/01/22 19:30:00 INFO DAGScheduler: ResultStage 10 (runJob at PythonRDD.scala:455) finished in 0.133 s
20/01/22 19:30:00 INFO DAGScheduler: Job 5 finished: runJob at PythonRDD.scala:455, took 0.143611 s
Give key.deserializer and value.deserializer in kafkaParams and use createDirectStream:
kafkaParams = {
    "metadata.broker.list": config['kstream']['broker'],
    "auto.offset.reset": 'earliest',
    "auto.create.topics.enable": "true",
    "key.deserializer": "org.springframework.kafka.support.serializer.JsonDeserializer",
    "value.deserializer": "org.springframework.kafka.support.serializer.JsonDeserializer"
}
kvs = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams,
                                    fromOffsets=None, messageHandler=None,
                                    keyDecoder=utf8_decoder, valueDecoder=utf8_decoder)
It seems that you are not using a logger; instead, you are print()ing to standard output. In order to write to your log files, you need to set up a logger. For example, the following gets the log4j logger through the SparkContext:
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("Example info")
LOGGER.error("Example error")
...

50 minute difference between batch time and submission time in spark streaming

Spark version is 2.2.0.
Pseudocode:
read data1 from Kafka with a 5-minute window
read data2 from Kafka with a 10-minute window and a 5-minute slide duration
join data1 with data2 on some condition
do some aggregation and write to MySQL
Question:
The batch time is 15:00 but the submitted time is 15:50, while the processing time is less than 1 minute. What happened?
val shareDs = KafkaUtils.createDirectStream[String, String](streamContext, LocationStrategies.PreferBrokers, shareReqConsumer)
val shareResDS = KafkaUtils.createDirectStream[String, String](streamContext, LocationStrategies.PreferBrokers, shareResConsumer).window(Minutes(WindowTime), Minutes(StreamTime))
shareDs doSomeMap join (shareResDS doSomeMap) foreachRDD { do some things, then write to MySQL }
Here are some logs:
19/07/22 11:20:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:20:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:20:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:20:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-6, groupId=dashboard] Revoking previously assigned partitions [topic_wh_sparkstream_afp_com_input_result-2, topic_wh_sparkstream_afp_com_input_result-1, topic_wh_sparkstream_afp_com_input_result-0]
19/07/22 11:20:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-6, groupId=dashboard] (Re-)joining group
19/07/22 11:25:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-6, groupId=dashboard] Successfully joined group with generation 820
19/07/22 11:25:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-6, groupId=dashboard] Setting newly assigned partitions [topic_wh_sparkstream_afp_com_input_result-2, topic_wh_sparkstream_afp_com_input_result-1, topic_wh_sparkstream_afp_com_input_result-0]
19/07/22 11:25:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:25:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:25:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-5, groupId=dashboard] Revoking previously assigned partitions [topic_wh_sparkstream_decision_report_result-1, topic_wh_sparkstream_decision_report_result-2, topic_wh_sparkstream_decision_report_result-0]
19/07/22 11:25:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-5, groupId=dashboard] (Re-)joining group
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-5, groupId=dashboard] Successfully joined group with generation 821
19/07/22 11:30:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-5, groupId=dashboard] Setting newly assigned partitions [topic_wh_sparkstream_decision_report_result-1, topic_wh_sparkstream_decision_report_result-2, topic_wh_sparkstream_decision_report_result-0]
19/07/22 11:30:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Revoking previously assigned partitions [topic_wh_sparkstream_echo_mixed_risk_record-1, topic_wh_sparkstream_echo_mixed_risk_record-2, topic_wh_sparkstream_echo_mixed_risk_record-0]
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] (Re-)joining group
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Marking the coordinator 10.124.35.112:9092 (id: 2147483534 rack: null) dead
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Discovered group coordinator 10.124.35.112:9092 (id: 2147483534 rack: null)
19/07/22 11:30:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] (Re-)joining group
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Successfully joined group with generation 822
19/07/22 11:35:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-4, groupId=dashboard] Setting newly assigned partitions [topic_wh_sparkstream_echo_mixed_risk_record-1, topic_wh_sparkstream_echo_mixed_risk_record-2, topic_wh_sparkstream_echo_mixed_risk_record-0]
19/07/22 11:35:00 INFO dstream.MappedDStream: Slicing from 1563765000000 ms to 1563765600000 ms (aligned to 1563765000000 ms and 1563765600000 ms)
19/07/22 11:35:00 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] Revoking previously assigned partitions [topic_wh_sparkstream_echo_mixed_risk_result_detail-2, topic_wh_sparkstream_echo_mixed_risk_result_detail-1, topic_wh_sparkstream_echo_mixed_risk_result_detail-0, topic_wh_sparkstream_echo_behavior_features_result-0, topic_wh_sparkstream_echo_behavior_features_result-1, topic_wh_sparkstream_echo_behavior_features_result-2]
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] (Re-)joining group
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] Marking the coordinator 10.124.35.112:9092 (id: 2147483534 rack: null) dead
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] Discovered group coordinator 10.124.35.112:9092 (id: 2147483534 rack: null)
19/07/22 11:35:00 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-3, groupId=dashboard] (Re-)joining group
At the window timestamps, it only performs Kafka partition reassignments (rebalances) instead of adding a job.
I resolved this problem.
When using Spark Streaming with Kafka, configure each stream with a separate group.id and disable auto-commit.
Configure the Kafka parameters appropriately, especially heartbeat.interval.ms, session.timeout.ms, request.timeout.ms and max.poll.interval.ms.
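A minimal sketch of what that might look like for the two streams in the question (topic names and timeout values are assumptions and need tuning to your batch and window sizes):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// One helper per stream: a distinct group.id, auto-commit disabled, and timeouts
// large enough that a long batch does not trigger a rebalance mid-window.
def kafkaParamsFor(groupId: String): Map[String, Object] = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> groupId,
  "enable.auto.commit" -> (false: java.lang.Boolean),
  "heartbeat.interval.ms" -> (10000: java.lang.Integer),
  "session.timeout.ms" -> (60000: java.lang.Integer),
  "request.timeout.ms" -> (120000: java.lang.Integer),
  "max.poll.interval.ms" -> (600000: java.lang.Integer)
)

val shareReqConsumer = ConsumerStrategies.Subscribe[String, String](
  Seq("share_req_topic"), kafkaParamsFor("dashboard_share_req"))
val shareResConsumer = ConsumerStrategies.Subscribe[String, String](
  Seq("share_res_topic"), kafkaParamsFor("dashboard_share_res"))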

Why Cassandra TableWriter writing 0 records and how to fix it?

I am trying to write an RDD into a Cassandra table.
As shown below, TableWriter writes 0 rows several times and only finally writes a row to Cassandra.
18/10/22 07:15:50 INFO TableWriter: Wrote 0 rows to log_by_date in 0.171 s.
18/10/22 07:15:50 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 622 bytes result sent to driver
18/10/22 07:15:50 INFO TableWriter: Wrote 0 rows to log_by_date in 0.220 s.
18/10/22 07:15:50 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 665 bytes result sent to driver
18/10/22 07:15:50 INFO TableWriter: Wrote 0 rows to log_by_date in 0.194 s.
18/10/22 07:15:50 INFO TableWriter: Wrote 0 rows to log_by_date in 0.224 s.
18/10/22 07:15:50 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 708 bytes result sent to driver
18/10/22 07:15:50 INFO TableWriter: Wrote 0 rows to log_by_date in 0.231 s.
18/10/22 07:15:50 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 622 bytes result sent to driver
18/10/22 07:15:50 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 622 bytes result sent to driver
18/10/22 07:15:50 INFO TableWriter: Wrote 0 rows to log_by_date in 0.246 s.
18/10/22 07:15:50 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 708 bytes result sent to driver
18/10/22 07:15:50 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 418 ms on localhost (executor driver) (1/8)
18/10/22 07:15:50 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 433 ms on localhost (executor driver) (2/8)
18/10/22 07:15:50 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 426 ms on localhost (executor driver) (3/8)
18/10/22 07:15:50 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 433 ms on localhost (executor driver) (4/8)
18/10/22 07:15:50 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 456 ms on localhost (executor driver) (5/8)
18/10/22 07:15:50 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 436 ms on localhost (executor driver) (6/8)
18/10/22 07:15:50 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 424 ms on localhost (executor driver) (7/8)
18/10/22 07:15:50 INFO **TableWriter: Wrote 1 rows to log_by_date in 0.342 s.**
Why does it fail to save anything several times before succeeding, and how can I tune this for production?
This is not a failure, as noted by user10465355. When Spark breaks a job into tasks, it is possible that the work is not evenly distributed or that there isn't enough work for every task to have something to do. This results in some tasks being empty, so when they are processed by the Spark Cassandra Connector they write 0 rows.
For example, say:
You read 100 records into 10 Spark partitions/tasks.
You apply a filter that eliminates values, so only 30 records remain, all in 5 of the tasks. The other 5 are empty.
When you write, you will now only see records written for 5 tasks, and 5 tasks will report that they had no rows written. A short sketch of this situation follows.
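A minimal, self-contained sketch of how that plays out and one way to compact the partitions before writing (the keyspace, table, column names and host are made up; it assumes the Spark Cassandra Connector is on the classpath):
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("log-by-date-writer")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
val sc = new SparkContext(conf)

// 100 records in 10 partitions; after the filter only 5 records survive,
// so 5 of the 10 tasks have nothing to write.
val rows = sc.parallelize(1 to 100, numSlices = 10)
  .filter(_ % 20 == 0)
  .map(i => ("2018-10-22", i)) // hypothetical (date, value) rows

// Optionally coalesce so that no task writes 0 rows.
rows.coalesce(2)
  .saveToCassandra("my_keyspace", "log_by_date", SomeColumns("date", "value"))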

query takes a long time 'selecting' nothing

I have a query that I run through the Thrift server that takes a very long time. I run it on a single partition of a table which has 500k rows.
The query looks like this:
select col0 from <table> where partition=<partition> and <col1>=<val>
I made it so col1 != val, so the query returns 0 rows.
This query takes about 30 seconds (a minute if I use select *).
When I run the exact same query but with select count(col0) it takes 2 seconds.
What could cause queries to take a long time with select col but not with select count(col)?
Here are the two queries explained:
explain select col0 from table where `partition` = partition and col=val;
*Project [col0#607]
+- *Filter (isnotnull(col1#607) && (col1#607 = aaaa))
+- *FileScan parquet table[col1#607,partition#611]
Batched: true,
Format: Parquet,
Location: PrunedInMemoryFileIndex[...,
PartitionCount: 23,
PartitionFilters: [isnotnull(partition#611),
(cast(partition#611 as int) = partition_name)],
PushedFilters: [IsNotNull(col1),
EqualTo(col1,aaaa)],
ReadSchema: struct
explain select count(col0) from table where `partition` = partition and col=val;
*HashAggregate(keys=[], functions=[count(col0#625)])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(col0#625)])
+- *Project [col0#625]
+- *Filter (isnotnull(col1#625) && (col1#625 = aaaa))
+- *FileScan parquet table[col1#625,partition#629]
Batched: true,
Format: Parquet,
Location: PrunedInMemoryFileIndex[...,
PartitionCount: 23,
PartitionFilters: [isnotnull(partition#629),
(cast(partition#629 as int) = partition_name)],
PushedFilters: [IsNotNull(col1),
EqualTo(col1,aaaa)],
ReadSchema: struct
As far as I can tell, the process is exactly the same, only the count query has more steps. So how come it's 15x faster?
Edit:
I found this interesting nugget in the logs:
with count:
18/06/28 11:42:55 INFO TaskSetManager: Starting task 0.0 in stage 2509.0 (TID 8092, ip-123456, executor 36, partition 0, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 1.0 in stage 2509.0 (TID 8093, ip-123456, executor 35, partition 1, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 2.0 in stage 2509.0 (TID 8094, ip-123456, executor 36, partition 2, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 3.0 in stage 2509.0 (TID 8095, ip-123456, executor 35, partition 3, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 4.0 in stage 2509.0 (TID 8096, ip-123456, executor 36, partition 4, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 5.0 in stage 2509.0 (TID 8097, ip-123456, executor 35, partition 5, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 6.0 in stage 2509.0 (TID 8098, ip-123456, executor 36, partition 6, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 7.0 in stage 2509.0 (TID 8099, ip-123456, executor 35, partition 7, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 8.0 in stage 2509.0 (TID 8100, ip-123456, executor 36, partition 8, RACK_LOCAL, 5521 bytes)
18/06/28 11:42:55 INFO TaskSetManager: Starting task 9.0 in stage 2509.0 (TID 8101, ip-123456, executor 35, partition 9, RACK_LOCAL, 5521 bytes)
without count:
18/06/28 11:45:32 INFO TaskSetManager: Starting task 0.0 in stage 2512.0 (TID 8136, ip-10-117-49-97.eu-west-1.compute.internal, executor 37, partition 1, RACK_LOCAL, 5532 bytes)
18/06/28 11:45:32 INFO BlockManagerInfo: Added broadcast_2352_piece0 in memory on ip-10-117-49-97.eu-west-1.compute.internal:40489 (size: 12.6 KB, free: 11.6 GB)
18/06/28 11:45:32 INFO TaskSetManager: Finished task 0.0 in stage 2512.0 (TID 8136) in 667 ms on ip-10-117-49-97.eu-west-1.compute.internal (executor 37) (1/1)
18/06/28 11:45:32 INFO YarnScheduler: Removed TaskSet 2512.0, whose tasks have all completed, from pool
18/06/28 11:45:32 INFO DAGScheduler: ResultStage 2512 (getNextRowSet at OperationManager.java:220) finished in 0.668 s
18/06/28 11:45:32 INFO DAGScheduler: Job 2293 finished: getNextRowSet at OperationManager.java:220, took 0.671740 s
18/06/28 11:45:32 INFO SparkContext: Starting job: getNextRowSet at OperationManager.java:220
18/06/28 11:45:32 INFO DAGScheduler: Got job 2294 (getNextRowSet at OperationManager.java:220) with 1 output partitions
18/06/28 11:45:32 INFO DAGScheduler: Final stage: ResultStage 2513 (getNextRowSet at OperationManager.java:220)
18/06/28 11:45:32 INFO DAGScheduler: Parents of final stage: List()
18/06/28 11:45:32 INFO DAGScheduler: Missing parents: List()
18/06/28 11:45:32 INFO DAGScheduler: Submitting ResultStage 2513 (MapPartitionsRDD[312] at run at AccessController.java:0), which has no missing parents
18/06/28 11:45:32 INFO MemoryStore: Block broadcast_2353 stored as values in memory (estimated size 66.6 KB, free 12.1 GB)
18/06/28 11:45:32 INFO MemoryStore: Block broadcast_2353_piece0 stored as bytes in memory (estimated size 12.6 KB, free 12.1 GB)
18/06/28 11:45:32 INFO BlockManagerInfo: Added broadcast_2353_piece0 in memory on 10.117.48.68:41493 (size: 12.6 KB, free: 12.1 GB)
18/06/28 11:45:32 INFO SparkContext: Created broadcast 2353 from broadcast at DAGScheduler.scala:1047
18/06/28 11:45:32 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2513 (MapPartitionsRDD[312] at run at AccessController.java:0) (first 15 tasks are for partitions Vector(2))
18/06/28 11:45:32 INFO YarnScheduler: Adding task set 2513.0 with 1 tasks
18/06/28 11:45:32 INFO TaskSetManager: Starting task 0.0 in stage 2513.0 (TID 8137, ip-10-117-49-97.eu-west-1.compute.internal, executor 37, partition 2, RACK_LOCAL, 5532 bytes)
18/06/28 11:45:33 INFO BlockManagerInfo: Added broadcast_2353_piece0 in memory on ip-10-117-49-97.eu-west-1.compute.internal:40489 (size: 12.6 KB, free: 11.6 GB)
18/06/28 11:45:38 INFO TaskSetManager: Finished task 0.0 in stage 2513.0 (TID 8137) in 5238 ms on ip-10-117-49-97.eu-west-1.compute.internal (executor 37) (1/1)
18/06/28 11:45:38 INFO YarnScheduler: Removed TaskSet 2513.0, whose tasks have all completed, from pool
18/06/28 11:45:38 INFO DAGScheduler: ResultStage 2513 (getNextRowSet at OperationManager.java:220) finished in 5.238 s
18/06/28 11:45:38 INFO DAGScheduler: Job 2294 finished: getNextRowSet at OperationManager.java:220, took 5.242084 s
18/06/28 11:45:38 INFO SparkContext: Starting job: getNextRowSet at OperationManager.java:220
18/06/28 11:45:38 INFO DAGScheduler: Got job 2295 (getNextRowSet at OperationManager.java:220) with 1 output partitions
18/06/28 11:45:38 INFO DAGScheduler: Final stage: ResultStage 2514 (getNextRowSet at OperationManager.java:220)
18/06/28 11:45:38 INFO DAGScheduler: Parents of final stage: List()
18/06/28 11:45:38 INFO DAGScheduler: Missing parents: List()
18/06/28 11:45:38 INFO DAGScheduler: Submitting ResultStage 2514 (MapPartitionsRDD[312] at run at AccessController.java:0), which has no missing parents
18/06/28 11:45:38 INFO MemoryStore: Block broadcast_2354 stored as values in memory (estimated size 66.6 KB, free 12.1 GB)
18/06/28 11:45:38 INFO MemoryStore: Block broadcast_2354_piece0 stored as bytes in memory (estimated size 12.6 KB, free 12.1 GB)
18/06/28 11:45:38 INFO BlockManagerInfo: Added broadcast_2354_piece0 in memory on 10.117.48.68:41493 (size: 12.6 KB, free: 12.1 GB)
18/06/28 11:45:38 INFO SparkContext: Created broadcast 2354 from broadcast at DAGScheduler.scala:1047
18/06/28 11:45:38 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2514 (MapPartitionsRDD[312] at run at AccessController.java:0) (first 15 tasks are for partitions Vector(3))
(i.e. it repeats this block; it looks like it runs the tasks sequentially and not in parallel as in the count case)
I also tried adding "order by", and it actually made the query run 2x faster.
Running the same query on the same data using Spark directly instead of Thrift was much faster.
I run Thrift on AWS emr-5.11.1 with:
Hive 2.3.2
Spark 2.2.1
thrift 0.11.0
Found the problem. I had this flag set on the Thrift server:
spark.sql.thriftServer.incrementalCollect=true
It collects the output from every worker sequentially, which is what creates this massive overhead. Removing the flag fixed the issue. I guess it is optimized not to collect sequentially when doing "count", since that will necessarily not return a lot of data.

Why do tasks in the same task set have very different execution time?

I'm doing some kind of string processing on Spark. My code snippet:
val rdd = sc.objectFile[(String, String)]("some hdfs url", 1)
rdd.cache.count // let the cache happen

val combOp = (f: List[String], g: List[String]) => {
  for (x <- f) {
    finder.processEntry(x)
  }
  for (x <- g) {
    finder.processEntry(x)
  }
  finder.result
}

val res = rdd.mapPartitions(x => {
  for (e <- x) {
    finder.processEntry(e)
  }
  Iterator(finder.result)
}, true).reduce(combOp)
The dataset I have is about 10 GB. I'm running Spark on a 24-core machine with 48 GB of memory. Config file:
spark.driver.memory 1g
spark.executor.memory 30g
spark.executor.extraJavaOptions -Xloggc:/var/log/gcmemory.log -XX:+PrintGCDetails
spark.executor.cores 4
Execution log snippet:
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.60.1.143, ANY, 1642 bytes)
INFO BlockManagerMasterEndpoint: Registering block manager 10.60.1.143:42850 with 15.5 GB RAM, BlockManagerId(0, 10.60.1.143, 42850)
INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.60.1.143:42850 (size: 1766.0 B, free: 15.5 GB)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.60.1.143:42850 (size: 16.8 KB, free: 15.5 GB)
INFO BlockManagerInfo: Added rdd_1_3 in memory on 10.60.1.143:42850 (size: 219.7 MB, free: 15.3 GB)
INFO BlockManagerInfo: Added rdd_1_1 in memory on 10.60.1.143:42850 (size: 229.7 MB, free: 15.1 GB)
INFO BlockManagerInfo: Added rdd_1_2 in memory on 10.60.1.143:42850 (size: 221.5 MB, free: 14.9 GB)
INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 6345 ms on 10.60.1.143 (1/34)
INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 6351 ms on 10.60.1.143 (2/34)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6354 ms on 10.60.1.143 (3/34)
INFO BlockManagerInfo: Added rdd_1_0 in memory on 10.60.1.143:42850 (size: 220.6 MB, free: 14.7 GB)
INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6454 ms on 10.60.1.143 (4/34)
INFO BlockManagerInfo: Added rdd_1_5 in memory on 10.60.1.143:42850 (size: 219.9 MB, free: 14.4 GB)
INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 2287 ms on 10.60.1.143 (5/34)
INFO BlockManagerInfo: Added rdd_1_4 in memory on 10.60.1.143:42850 (size: 222.7 MB, free: 14.2 GB)
INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, 10.60.1.143, ANY, 1642 bytes)
INFO BlockManagerInfo: Added rdd_1_6 in memory on 10.60.1.143:42850 (size: 210.7 MB, free: 14.0 GB)
INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 2350 ms on 10.60.1.143 (6/34)
INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 2356 ms on 10.60.1.143 (7/34)
INFO BlockManagerInfo: Added rdd_1_7 in memory on 10.60.1.143:42850 (size: 214.6 MB, free: 13.8 GB)
INFO TaskSetManager: Starting task 11.0 in stage 0.0 (TID 11, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 2289 ms on 10.60.1.143 (8/34)
INFO BlockManagerInfo: Added rdd_1_8 in memory on 10.60.1.143:42850 (size: 216.3 MB, free: 13.6 GB)
INFO TaskSetManager: Starting task 12.0 in stage 0.0 (TID 12, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 2430 ms on 10.60.1.143 (9/34)
INFO BlockManagerInfo: Added rdd_1_11 in memory on 10.60.1.143:42850 (size: 216.5 MB, free: 13.4 GB)
INFO BlockManagerInfo: Added rdd_1_10 in memory on 10.60.1.143:42850 (size: 216.5 MB, free: 13.2 GB)
INFO TaskSetManager: Starting task 13.0 in stage 0.0 (TID 13, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 11.0 in stage 0.0 (TID 11) in 2416 ms on 10.60.1.143 (10/34)
INFO TaskSetManager: Starting task 14.0 in stage 0.0 (TID 14, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 10.0 in stage 0.0 (TID 10) in 2445 ms on 10.60.1.143 (11/34)
INFO BlockManagerInfo: Added rdd_1_9 in memory on 10.60.1.143:42850 (size: 231.4 MB, free: 12.9 GB)
INFO TaskSetManager: Starting task 15.0 in stage 0.0 (TID 15, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 2528 ms on 10.60.1.143 (12/34)
INFO BlockManagerInfo: Added rdd_1_12 in memory on 10.60.1.143:42850 (size: 217.3 MB, free: 12.7 GB)
INFO TaskSetManager: Starting task 16.0 in stage 0.0 (TID 16, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 12.0 in stage 0.0 (TID 12) in 1797 ms on 10.60.1.143 (13/34)
INFO BlockManagerInfo: Added rdd_1_14 in memory on 10.60.1.143:42850 (size: 215.8 MB, free: 12.5 GB)
INFO TaskSetManager: Starting task 17.0 in stage 0.0 (TID 17, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 1748 ms on 10.60.1.143 (14/34)
INFO BlockManagerInfo: Added rdd_1_13 in memory on 10.60.1.143:42850 (size: 220.9 MB, free: 12.3 GB)
INFO TaskSetManager: Starting task 18.0 in stage 0.0 (TID 18, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 13.0 in stage 0.0 (TID 13) in 1812 ms on 10.60.1.143 (15/34)
INFO BlockManagerInfo: Added rdd_1_15 in memory on 10.60.1.143:42850 (size: 233.8 MB, free: 12.1 GB)
INFO TaskSetManager: Starting task 19.0 in stage 0.0 (TID 19, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 15.0 in stage 0.0 (TID 15) in 1756 ms on 10.60.1.143 (16/34)
INFO BlockManagerInfo: Added rdd_1_16 in memory on 10.60.1.143:42850 (size: 221.6 MB, free: 11.9 GB)
INFO TaskSetManager: Starting task 20.0 in stage 0.0 (TID 20, 10.60.1.143, ANY, 1642 bytes)
INFO TaskSetManager: Finished task 16.0 in stage 0.0 (TID 16) in 2600 ms on 10.60.1.143 (17/34)
How can the first tasks in the same task set take longer to execute than the later ones? Any help is very much appreciated.
A common cause of stragglers (tasks that take longer for certain partitions than others) is unevenly partitioned data, so I'd suggest trying to repartition your data. The Spark UI may also have some helpful information (you can look at the input sizes etc.). Sometimes some machines are slower for random reasons, which is especially common in virtualized environments where you can have noisy neighbours on certain machines; you can try enabling speculative execution by setting the spark.speculation flag (see https://spark.apache.org/docs/latest/configuration.html) so Spark can retry a task on another executor if it happens to be running slowly on one machine.
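A minimal sketch of both suggestions applied to the code from the question (the path, partition count and app name are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

// Enable speculative execution so slow task attempts are retried on another executor.
val conf = new SparkConf()
  .setAppName("string-processing")
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)

// Repartition so the records are spread evenly over more, smaller partitions
// (48 is just an example for a 24-core machine), then cache as before.
val rdd = sc.objectFile[(String, String)]("some hdfs url")
  .repartition(48)
  .cache()
rdd.count() // materialise the cache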
