Writing TB-scale data to Parquet - apache-spark

We have 2.5 TB of data in HBase; the region size is 5 GB or 10 GB, and the HBase table has 450 regions. We need to transform it for Spark SQL. The method used is:
1. Snapshot the HBase table.
2. Read the HFiles with newAPIHadoopRDD.
3. Write to Parquet.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSnapshotInputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.Row

val hconf = HBaseConfiguration.create()
hconf.set("hbase.rootdir", "/hbase")
hconf.set("hbase.zookeeper.quorum", HbaseToSparksqlBySnapshotParam.zookeeperQurum)
hconf.set(TableInputFormat.SCAN, convertScanToString(scan))

val job = Job.getInstance(hconf)
val path = new Path("/snapshot")  // restore directory for the snapshot files
val snapshotName = HbaseToSparksqlBySnapshotParam.snapshotName
TableSnapshotInputFormat.setInput(job, snapshotName, path)

// One input partition per HBase region
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

val rdd = hbaseRDD.map { case (_, result) =>
  ...
  Row(...)
}
val df = spark.createDataFrame(rdd, schema)
df.write.parquet("/test")
num-executors | executor-memory | executor-cores | run_time
16            | 6g              | 2              | error (see below)
5             | 15g             | 8              | 6 hours
I don't know how to set these parameters (num-executors, executor-memory, executor-cores) so that it runs faster. When I read just a single region (10 GB) with num-executors 1, executor-memory 3g, executor-cores 1, it runs in 14 minutes.
I am using Spark 2.1.0.
error:
19/01/18 00:55:26 INFO TaskSetManager: Finished task 386.0 in stage 0.0 (TID 331) in 1187343 ms on hdh68 (executor 5) (331/470)
19/01/18 00:55:31 INFO TaskSetManager: Starting task 423.0 in stage 0.0 (TID 363, hdh68, executor 1, partition 423, NODE_LOCAL, 6905 bytes)
19/01/18 00:55:31 INFO TaskSetManager: Finished task 383.0 in stage 0.0 (TID 328) in 1427677 ms on hdh68 (executor 1) (332/470)
19/01/18 00:57:36 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 10.
19/01/18 00:57:36 INFO DAGScheduler: Executor lost: 10 (epoch 0)
19/01/18 00:57:36 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/01/18 00:57:36 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(10, hdh68, 40679, None)
19/01/18 00:57:36 INFO BlockManagerMaster: Removed 10 successfully in removeExecutor
19/01/18 00:57:36 INFO DAGScheduler: Shuffle files lost for executor: 10 (epoch 0)
19/01/18 00:57:36 WARN DFSClient: Slow ReadProcessor read fields took 114044ms (threshold=30000ms); ack: seqno: 49 reply: 0 downstreamAckTimeNanos: 0, targets: [DatanodeInfoWithStorage[10.41.2.68:50010,DS-26a72d43-2e16-41a9-9a71-99593f14ab6f,DISK]]
19/01/18 00:57:39 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 6.6 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
19/01/18 00:57:39 ERROR YarnScheduler: Lost executor 10 on hdh68: Container killed by YARN for exceeding memory limits. 6.6 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
19/01/18 00:57:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/01/18 00:57:39 INFO BlockManagerMaster: Removal of executor 10 requested
19/01/18 00:57:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 10
19/01/18 00:57:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.41.2.68:49771) with ID 17
19/01/18 00:57:52 INFO TaskSetManager: Starting task 387.1 in stage 0.0 (TID 364, hdh68, executor 17, partition 387, NODE_LOCAL, 6906 bytes)
19/01/18 00:57:52 INFO TaskSetManager: Starting task 417.1 in stage 0.0 (TID 365, hdh68, executor 17, partition 417, NODE_LOCAL, 6906 bytes)
19/01/18 00:57:53 INFO BlockManagerMasterEndpoint: Registering block manager hdh68:38247 with 3.0 GB RAM, BlockManagerId(17, hdh68, 38247, None)
19/01/18 00:58:42 INFO TaskSetManager: Starting task 426.0 in stage 0.0 (TID 366, hdh68, executor 13, partition 426, NODE_LOCAL, 6905 bytes)
19/01/18 00:58:42 INFO TaskSetManager: Finished task 396.0 in stage 0.0 (TID 341) in 1014645 ms on hdh68 (executor 13) (333/470)
java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:139)
at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287)
at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:802)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:313)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:770)
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1256)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:754)
at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:754)
at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:807)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835)
at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017)
at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256)
at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:150)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
19/01/18 00:59:58 INFO TaskSetManager: Starting task 428.0 in stage 0.0 (TID 367, hdh68, executor 17, partition 428, NODE_LOCAL, 6906 bytes)
19/01/18 00:59:58 WARN TaskSetManager: Lost task 387.1 in stage 0.0 (TID 364, hdh68, executor 17): java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
19/01/18 01:00:04 INFO TaskSetManager: Starting task 387.2 in stage 0.0 (TID 368, hdh68, executor 14, partition 387, NODE_LOCAL, 6906 bytes)
19/01/18 01:00:04 INFO TaskSetManager: Finished task 385.0 in stage 0.0 (TID 330) in 1471628 ms on hdh68 (executor 14) (334/470)
19/01/18 01:00:12 INFO TaskSetManager: Starting task 429.0 in stage 0.0 (TID 369, hdh68, executor 8, partition 429, NODE_LOCAL, 6905 bytes)
19/01/18 01:00:12 INFO TaskSetManager: Finished task 392.0 in stage 0.0 (TID 337) in 1220925 ms on hdh68 (executor 8) (335/470)
19/01/18 01:00:27 INFO TaskSetManager: Starting task 430.0 in stage 0.0 (TID 370, hdh68, executor 11, partition 430, NODE_LOCAL, 6906 bytes)
19/01/18 01:01:16 INFO TaskSetManager: Finished task 390.0 in stage 0.0 (TID 335) in 1353136 ms on hdh68 (executor 14) (337/470)
19/01/18 01:01:30 INFO TaskSetManager: Starting task 432.0 in stage 0.0 (TID 372, hdh68, executor 5, partition 432, NODE_LOCAL, 6906 bytes)
19/01/18 01:01:30 INFO TaskSetManager: Finished task 393.0 in stage 0.0 (TID 338) in 1244707 ms on hdh68 (executor 5) (338/470)
java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:139)
at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287)
at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:802)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:319)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
19/01/18 01:01:59 INFO TaskSetManager: Starting task 434.0 in stage 0.0 (TID 374, hdh68, executor 17, partition 434, NODE_LOCAL, 6905 bytes)
19/01/18 01:01:59 WARN TaskSetManager: Lost task 417.1 in stage 0.0 (TID 365, hdh68, executor 17): java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
Update:
The cluster is pseudo-distributed: Memory Total 200 GB, VCores Total 64. These are the root.root queue's resources, and the queue has no other application running:
Used Resources: <memory:171776, vCores:44>
Num Active Applications: 7
Num Pending Applications: 0
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:204800, vCores:64>
Steady Fair Share: <memory:54614, vCores:0>
Instantaneous Fair Share: <memory:102400, vCores:0>
Preemptable: true
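For reference, the YARN message in the log above ("Consider boosting spark.yarn.executor.memoryOverhead") points at the container limit rather than the heap. A minimal spark-submit sketch showing how that setting and the executor sizing are passed; the numbers, class, and jar names are only illustrative placeholders for a 200 GB / 64-vcore queue, not a recommendation:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 12g \
  --conf spark.yarn.executor.memoryOverhead=3072 \
  --class your.main.Class your-application.jar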

Related

java.io.FileNotFoundException error in Apache Spark even though my file exists

I'm new to Spark and doing a POC to download a file and then read it. However, I'm facing an issue where the file doesn't exist:
java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
But when I printed the path of the file, I found that the file exists and the path is also correct.
This is the output:
23/02/19 13:10:46 INFO BlockManagerMasterEndpoint: Registering block manager 10.14.142.21:37515 with 2.2 GiB RAM, BlockManagerId(1, 10.14.142.21, 37515, None)
FILE IS DOWNLOADED
['/app/data-Feb-19-2023_131049.json']
23/02/19 13:10:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
23/02/19 13:10:49 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
23/02/19 13:10:50 INFO InMemoryFileIndex: It took 39 ms to list leaf files for 1 paths.
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 206.6 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 35.8 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on experian-el-d41b428669cc1e8e-driver-svc.environments-quin-dev-1.svc:7079 (size: 35.8 KiB, free: 1048.8 MiB)
23/02/19 13:10:51 INFO SparkContext: Created broadcast 0 from json at <unknown>:0
23/02/19 13:10:51 INFO FileInputFormat: Total input files to process : 1
23/02/19 13:10:51 INFO FileInputFormat: Total input files to process : 1
23/02/19 13:10:51 INFO SparkContext: Starting job: json at <unknown>:0
23/02/19 13:10:51 INFO DAGScheduler: Got job 0 (json at <unknown>:0) with 1 output partitions
23/02/19 13:10:51 INFO DAGScheduler: Final stage: ResultStage 0 (json at <unknown>:0)
23/02/19 13:10:51 INFO DAGScheduler: Parents of final stage: List()
23/02/19 13:10:51 INFO DAGScheduler: Missing parents: List()
23/02/19 13:10:51 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at json at <unknown>:0), which has no missing parents
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 9.0 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.8 KiB, free 1048.5 MiB)
23/02/19 13:10:51 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on experian-el-d41b428669cc1e8e-driver-svc.environments-quin-dev-1.svc:7079 (size: 4.8 KiB, free: 1048.8 MiB)
23/02/19 13:10:51 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1513
23/02/19 13:10:51 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at json at <unknown>:0) (first 15 tasks are for partitions Vector(0))
23/02/19 13:10:51 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
23/02/19 13:10:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.14.142.21:37515 (size: 4.8 KiB, free: 2.2 GiB)
23/02/19 13:10:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.14.142.21:37515 (size: 35.8 KiB, free: 2.2 GiB)
23/02/19 13:10:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (10.14.142.21 executor 1): java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStream(CodecStreams.scala:40)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStreamWithCloseResource(CodecStreams.scala:52)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.dataToInputStream(JsonDataSource.scala:195)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.createParser(JsonDataSource.scala:199)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.$anonfun$infer$4(JsonDataSource.scala:165)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$3(JsonInferSchema.scala:86)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$2(JsonInferSchema.scala:86)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator.isEmpty(Iterator.scala:387)
at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceLeftOption(TraversableOnce.scala:249)
at scala.collection.TraversableOnce.reduceLeftOption$(TraversableOnce.scala:248)
at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceOption(TraversableOnce.scala:256)
at scala.collection.TraversableOnce.reduceOption$(TraversableOnce.scala:256)
at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1431)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$1(JsonInferSchema.scala:103)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 1]
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 2]
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 3]
23/02/19 13:10:52 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
23/02/19 13:10:52 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
23/02/19 13:10:52 INFO TaskSchedulerImpl: Cancelling stage 0
23/02/19 13:10:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
23/02/19 13:10:52 INFO DAGScheduler: ResultStage 0 (json at <unknown>:0) failed in 1.128 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.14.142.21 executor 1): java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStream(CodecStreams.scala:40)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStreamWithCloseResource(CodecStreams.scala:52)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.dataToInputStream(JsonDataSource.scala:195)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.createParser(JsonDataSource.scala:199)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.$anonfun$infer$4(JsonDataSource.scala:165)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$3(JsonInferSchema.scala:86)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$2(JsonInferSchema.scala:86)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator.isEmpty(Iterator.scala:387)
at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceLeftOption(TraversableOnce.scala:249)
at scala.collection.TraversableOnce.reduceLeftOption$(TraversableOnce.scala:248)
at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceOption(TraversableOnce.scala:256)
at scala.collection.TraversableOnce.reduceOption$(TraversableOnce.scala:256)
at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1431)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$1(JsonInferSchema.scala:103)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
This is my code to download the file and print its path:
def find_files(self, filename, search_path):
    result = []
    # Walking top-down from the root
    for root, dir, files in os.walk(search_path):
        if filename in files:
            result.append(os.path.join(root, filename))
    return result

def downloadData(self, access_token, data):
    headers = {
        'Content-Type': 'application/json',
        'Charset': 'UTF-8',
        'Authorization': f'Bearer {access_token}'
    }
    try:
        response = requests.post(self.kyc_url, data=json.dumps(data), headers=headers)
        response.raise_for_status()
        logger.debug("received kyc data")
        response_filename = ("data-" + time.strftime('%b-%d-%Y_%H%M%S', time.localtime()) + ".json")
        with open(response_filename, 'w', encoding='utf-8') as f:
            json.dump(response.json(), f, ensure_ascii=False, indent=4)
            f.close()
        print("FILE IS DOWNLOADED")
        print(self.find_files(response_filename, "/"))
    except requests.exceptions.HTTPError as err:
        logger.error("failed to fetch kyc data")
        raise SystemExit(err)
    return response_filename
This is my code to read the file and upload it to MinIO:
def load(spark: SparkSession, json_file_path: str, destination_path: str) -> None:
    df = spark.read.option("multiline", "true").json(json_file_path)
    df.write.format("delta").save(f"s3a://{destination_path}")
I'm running Spark on Kubernetes with the Spark operator.
This is my SparkApplication manifest:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: myApp
  namespace: demo
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "myImage"
  imagePullPolicy: Always
  mainApplicationFile: local:///app/main.py
  sparkVersion: "3.3.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  timeToLiveSeconds: 86400
  deps:
    packages:
      - io.delta:delta-core_2.12:2.2.0
      - org.apache.hadoop:hadoop-aws:3.3.1
  driver:
    env:
      - name: NAMESPACE
        value: demo
    cores: 2
    coreLimit: "2000m"
    memory: "2048m"
    labels:
      version: 3.3.1
    serviceAccount: spark-driver
  executor:
    cores: 4
    instances: 1
    memory: "4096m"
    coreRequest: "500m"
    coreLimit: "4000m"
    labels:
      version: 3.3.1
  dynamicAllocation:
    enabled: false
Can someone please point out what I am doing wrong?
Thank you.
If you are running in cluster mode, your input files need to be on a shared filesystem such as HDFS or S3 rather than on the local FS, since both the driver and the executors must be able to access the input file.
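A minimal sketch of that approach, assuming the file is pushed to the same MinIO/S3 storage the job already writes to; the bucket, key, and endpoint/credential configuration are placeholders, not taken from the question:
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# File downloaded on the driver (path from the question).
local_path = "/app/data-Feb-19-2023_131049.json"
bucket, key = "my-bucket", "input/data.json"   # placeholders

# Copy the file to shared storage so the executors can read it too
# (boto3 endpoint_url/credentials for MinIO are omitted here).
boto3.client("s3").upload_file(local_path, bucket, key)

# Read from the shared path instead of the driver-local path.
df = spark.read.option("multiline", "true").json(f"s3a://{bucket}/{key}")
df.write.format("delta").save("s3a://my-bucket/output/")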

Structured Streaming delta file does not exist

I am running Spark 2.2.1 Structured Streaming; the program failed after some time because a file did not exist. I found this: enter link description here, but it didn't work for me. I then thought the problem might be the checkpoint, so I changed my code to the following:
`
Dataset<Row> df = this.spark.readStream().format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingOffsets", startingOffsets)
.option("failOnDataLoss", "false")
.load();
……
StreamingQuery start = result.writeStream()
.foreach(new CrossVhcLaneForeach(kafkaProperties, laneTopic))
.outputMode("update")
.option("checkpointLocation", this.checkPointLocation+"/laneDir")
.trigger(Trigger.ProcessingTime(Long.parseLong(delayTime),TimeUnit.SECONDS))
.start();
`
But then the program goes into a kind of suspended animation: it won't stop running, and it won't give an error. I hope someone has a way to help me. Thanks.
I am using Java 1.8, Spark 2.2.1 standalone, and Hadoop 2.7.3. The errors I encountered are as follows:
19/01/24 10:50:22 INFO TaskSetManager: Starting task 5.1 in stage 13.0 (TID 979, 34.55.0.164, executor 1, partition 5, ANY, 4730 bytes)
19/01/24 10:50:22 WARN TaskSetManager: Lost task 4.0 in stage 13.0 (TID 976, 34.55.0.164, executor 1): java.lang.IllegalStateException: Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta of HDFSStateStoreProvider[id = (op=0, part=4), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta does not exist
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:407)
... 21 more
19/01/24 10:50:22 INFO TaskSetManager: Starting task 4.1 in stage 13.0 (TID 980, 34.55.0.164, executor 1, partition 4, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 3.1 in stage 13.0 (TID 978) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta of HDFSStateStoreProvider[id = (op=0, part=3), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta does not exist) [duplicate 1]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 3.2 in stage 13.0 (TID 981, 34.55.0.164, executor 1, partition 3, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 5.1 in stage 13.0 (TID 979) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5/1.delta of HDFSStateStoreProvider[id = (op=0, part=5), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5/1.delta does not exist) [duplicate 1]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 5.2 in stage 13.0 (TID 982, 34.55.0.164, executor 1, partition 5, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 3.2 in stage 13.0 (TID 981) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta of HDFSStateStoreProvider[id = (op=0, part=3), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta does not exist) [duplicate 2]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 3.3 in stage 13.0 (TID 983, 34.55.0.164, executor 1, partition 3, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 4.1 in stage 13.0 (TID 980) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta of HDFSStateStoreProvider[id = (op=0, part=4), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta does not exist) [duplicate 1]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 4.2 in stage 13.0 (TID 984, 34.55.0.164, executor 1, partition 4, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 5.2 in stage 13.0 (TID 982) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5/1.delta of HDFSStateStoreProvider[id = (op=0, part=5), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5/1.delta does not exist) [duplicate 2]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 5.3 in stage 13.0 (TID 985, 34.55.0.164, executor 1, partition 5, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 4.2 in stage 13.0 (TID 984) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta of HDFSStateStoreProvider[id = (op=0, part=4), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta does not exist) [duplicate 2]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 4.3 in stage 13.0 (TID 986, 34.55.0.164, executor 1, partition 4, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 3.3 in stage 13.0 (TID 983) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta of HDFSStateStoreProvider[id = (op=0, part=3), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta does not exist) [duplicate 3]
19/01/24 10:50:22 ERROR TaskSetManager: Task 3 in stage 13.0 failed 4 times; aborting job
19/01/24 10:50:22 INFO TaskSchedulerImpl: Cancelling stage 13
19/01/24 10:50:22 INFO TaskSchedulerImpl: Stage 13 was cancelled
Spark saves streaming state in the checkpoint location (HDFS if you prefer). If a particular query stops abruptly or fails (for example after losing an executor in an app with dynamic node allocation), Spark can't pick the state back up. The solution is to clear the offsets/checkpoint and start again (it mainly depends on the use case).
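A minimal PySpark sketch of the same idea, keeping the checkpoint on a durable HDFS path instead of a temporary local directory; the broker, topic, and paths are placeholders, and the same checkpointLocation option exists in the Java API used above:
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka-0-10 package on the classpath.
spark = SparkSession.builder.appName("lane-stream").getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
      .option("subscribe", "lane-topic")                  # placeholder
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", "false")
      .load())

# Keep checkpoints (offsets and state-store delta files) on HDFS, not /tmp,
# so they survive restarts and are visible to every executor.
query = (df.writeStream
         .outputMode("update")
         .format("console")  # stand-in for the custom foreach sink in the question
         .option("checkpointLocation", "hdfs:///checkpoints/laneDir")
         .start())

query.awaitTermination()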

When would pyspark fail with "java.lang.AssertionError: assertion failed" from BlockInfo.checkInvariants?

I am using pyspark and got the following messages:
17/12/03 11:57:48 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 1800, 172.31.27.9, executor 0): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
17/12/03 11:57:48 INFO TaskSetManager: Starting task 0.1 in stage 5.0 (TID 1801, 172.31.27.9, executor 0, partition 0, PROCESS_LOCAL, 4871 bytes)
17/12/03 11:57:48 INFO TaskSetManager: Lost task 0.1 in stage 5.0 (TID 1801) on 172.31.27.9, executor 0: java.lang.AssertionError (assertion failed) [duplicate 1]
17/12/03 11:57:48 INFO TaskSetManager: Starting task 0.2 in stage 5.0 (TID 1802, 172.31.27.9, executor 0, partition 0, PROCESS_LOCAL, 4871 bytes)
17/12/03 11:57:48 INFO TaskSetManager: Lost task 0.2 in stage 5.0 (TID 1802) on 172.31.27.9, executor 0: java.lang.AssertionError (assertion failed) [duplicate 2]
17/12/03 11:57:48 INFO TaskSetManager: Starting task 0.3 in stage 5.0 (TID 1803, 172.31.27.9, executor 0, partition 0, PROCESS_LOCAL, 4871 bytes)
17/12/03 11:57:48 INFO TaskSetManager: Lost task 0.3 in stage 5.0 (TID 1803) on 172.31.27.9, executor 0: java.lang.AssertionError (assertion failed) [duplicate 3]
17/12/03 11:57:48 ERROR TaskSetManager: Task 0 in stage 5.0 failed 4 times; aborting job
17/12/03 11:57:48 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
17/12/03 11:57:48 INFO TaskSchedulerImpl: Cancelling stage 5
17/12/03 11:57:48 INFO DAGScheduler: ResultStage 5 (runJob at PythonRDD.scala:446) failed in 0.078 s due to Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 1803, 172.31.27.9, executor 0): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
It seems it has happened to others (TaskSetManager: Task 0 in stage 5.0 failed 4 times; aborting job), but I am particularly concerned about the message: ... Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 1803, 172.31.27.9, executor 0): java.lang.AssertionError: assertion failed.
I suspect the problem happens when writing the RDD: there may be errors on write, and Spark then has trouble parsing the data back later.
I am using the ALS library and the Rating object, so before saving the RDD it is more convenient for me to map it to
data.map(lambda x: (x.user, x.product, x.rating)).saveAsTextFile("hdfs://"+master_ip+":9000/RDD/data")
and read and parse as
data = sc.textFile("hdfs://"+master_ip+":9000/RDD/data")
data = data.map(lambda x: x[1:-1]).map(lambda x: x.split(", ")).\
map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
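As a hedged alternative (not a confirmed fix for the assertion error), the manual string parsing can be avoided by round-tripping the tuples through a pickle file; the path suffix is a placeholder, everything else mirrors the code above:
from pyspark.mllib.recommendation import Rating

# Save the (user, product, rating) tuples in a binary format instead of text.
data.map(lambda x: (x.user, x.product, x.rating)) \
    .saveAsPickleFile("hdfs://" + master_ip + ":9000/RDD/data_pickle")

# Read them back and rebuild Rating objects without splitting strings.
data = sc.pickleFile("hdfs://" + master_ip + ":9000/RDD/data_pickle") \
         .map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))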
I am pretty curious since these error messages don't appear every time and are not always reproducible.
I am using Spark 2.2.0 and Hadoop 2.7. Has anyone seen this before?
Thanks!

Spark (standalone) did not execute my job

Alive Workers: 4
Cores in use: 16 Total
Memory in use: 27.2 GB Total
My driver program:
if __name__ == '__main__':
    sc = SparkContext()
    image_list = []
    init_weight = 1 / (const.POSITIVE_SAMPLES_NUMBER * 2)
    sample_type = True
    acc_sample_count = sc.accumulator(-1)
    samp_list = []
    index = 0
    for image_name in os.listdir(const.POSITIVE_SAMPLES_PATH):
        image = Image.open(const.POSITIVE_SAMPLES_PATH + image_name)
        image_resize = image.resize((24, 24), Image.ANTIALIAS)
        image_gray_format = preoperate.format_gray(array(image_resize))
        image_list.append([index, image_gray_format])
        index += 1
    image_rdd = sc.parallelize(image_list, 128).setName('image_rdd')
    samples_rdd = image_rdd.map(lambda img_index: generate_sample(img_index[1], sample_type, init_weight, img_index[0])).setName('sample_rdd')
    samples_list = samples_rdd.collect()
    print('#######################################')
    print(sys.getsizeof(samples_list))
    bc_samples = sc.broadcast(samples_list)
    del samples_list
    image_rdd.unpersist()
    y2_position_rdd = sc.parallelize(const.Y2_POSITION, 2000)
    y2_fea_val_rdd = y2_position_rdd.mapPartitions(lambda pos: map_cal(pos, bc_samples.value, haarcal.haar_like_Y2_cal), 2000).persist(StorageLevel.DISK_ONLY)
    take_list = y2_fea_val_rdd.take(1)
    print('#####################################')
    for sample in take_list[0][1]:
        print(sample.index)
    print('#####################################')
    print(y2_fea_val_rdd.count())
Here there are two jobs: the first job works well, but the second job was ignored without any log. The stderr in the web UI shows no error logs either.
My log:
17/03/11 16:49:32 INFO TaskSetManager: Finished task 115.0 in stage 0.0 (TID 115) in 1463 ms on 10.29.90.41 (executor 3) (116/128)
17/03/11 16:49:32 INFO TaskSetManager: Finished task 116.0 in stage 0.0 (TID 116) in 1469 ms on 10.30.147.199 (executor 9) (117/128)
17/03/11 16:49:32 INFO TaskSetManager: Finished task 117.0 in stage 0.0 (TID 117) in 1380 ms on 10.30.147.84 (executor 13) (118/128)
17/03/11 16:49:33 INFO TaskSetManager: Finished task 119.0 in stage 0.0 (TID 119) in 1376 ms on 10.30.147.84 (executor 12) (119/128)
17/03/11 16:49:33 INFO TaskSetManager: Finished task 118.0 in stage 0.0 (TID 118) in 1410 ms on 10.30.147.154 (executor 4) (120/128)
17/03/11 16:49:33 INFO TaskSetManager: Finished task 120.0 in stage 0.0 (TID 120) in 1495 ms on 10.30.147.154 (executor 7) (121/128)
17/03/11 16:49:33 INFO TaskSetManager: Finished task 122.0 in stage 0.0 (TID 122) in 1355 ms on 10.30.147.154 (executor 6) (122/128)
17/03/11 16:49:33 INFO TaskSetManager: Finished task 123.0 in stage 0.0 (TID 123) in 1424 ms on 10.30.147.84 (executor 14) (123/128)
17/03/11 16:49:33 INFO TaskSetManager: Finished task 121.0 in stage 0.0 (TID 121) in 1566 ms on 10.30.147.199 (executor 8) (124/128)
17/03/11 16:49:33 INFO TaskSetManager: Finished task 124.0 in stage 0.0 (TID 124) in 1342 ms on 10.30.147.154 (executor 5) (125/128)
17/03/11 16:49:33 INFO TaskSetManager: Finished task 125.0 in stage 0.0 (TID 125) in 1344 ms on 10.29.90.41 (executor 0) (126/128)
17/03/11 16:49:34 INFO TaskSetManager: Finished task 126.0 in stage 0.0 (TID 126) in 1599 ms on 10.30.147.199 (executor 11) (127/128)
17/03/11 16:49:35 INFO TaskSetManager: Finished task 127.0 in stage 0.0 (TID 127) in 2667 ms on 10.29.90.41 (executor 2) (128/128)
17/03/11 16:49:35 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/11 16:49:35 INFO DAGScheduler: ResultStage 0 (collect at /home/dcooo/projects/FaceDetection/main/facedetection/spark_cal_features.py:70) finished in 16.419 s
17/03/11 16:49:35 INFO DAGScheduler: Job 0 finished: collect at /home/dcooo/projects/FaceDetection/main/facedetection/spark_cal_features.py:70, took 17.078753 s
#
40816
17/03/11 16:49:37 INFO SparkContext: Invoking stop() from shutdown hook
17/03/11 16:49:37 INFO SparkUI: Stopped Spark web UI at http://10.165.51.174:4040
17/03/11 16:49:37 INFO StandaloneSchedulerBackend: Shutting down all executors
17/03/11 16:49:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/03/11 16:49:37 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/11 16:49:37 INFO MemoryStore: MemoryStore cleared
17/03/11 16:49:37 INFO BlockManager: BlockManager stopped
17/03/11 16:49:37 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/11 16:49:37 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/11 16:49:37 INFO SparkContext: Successfully stopped SparkContext
17/03/11 16:49:37 INFO ShutdownHookManager: Shutdown hook called
17/03/11 16:49:37 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8666f9e-c811-48f1-b116-7b48cd74347a
17/03/11 16:49:37 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8666f9e-c811-48f1-b116-7b48cd74347a/pyspark-a74f8c26-c7ef-43ed-ad5e-100a7e4c57e5
My first job works well, but Spark did not execute my second job.
My environment:
spark.executor.cores 1
spark.cores.max 16
spark.executor.memory 1g
spark.default.parallelism 128

spark pyspark mllib model - when prediction rdd is generated using map, it throws exception on collect()

I am using Spark 1.2.0 (I cannot upgrade as I don't have control over it). I am using MLlib to build a model:
points = labels.zip(tfidf).map(lambda t: LabeledPoint(t[0], t[1] ))
train_data, test_data = points.randomSplit([0.6, 0.4], 17)
iterations = 3
model = LogisticRegressionWithSGD.train(train_data, iterations)
labelsAndPreds = test_data.map(lambda p: (p.label, model.predict(p.features)) )
print("labels = "+str(labelsAndPreds.collect()))
When I run this code I get a NullPointerException on collect(). In fact, any operation on the prediction result throws this exception.
15/08/26 04:02:43 INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 26, dojo3s10118.rtp1.hadoop.fmr.com, PROCESS_LOCAL, 1464 bytes)
15/08/26 04:02:43 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on dojo3s10118.rtp1.hadoop.fmr.com:41145 (size: 9.6 KB, free: 529.8 MB)
15/08/26 04:02:43 INFO BlockManagerInfo: Added broadcast_14_piece0 in memory on dojo3s10118.rtp1.hadoop.fmr.com:41145 (size: 68.0 B, free: 529.8 MB)
15/08/26 04:02:43 WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 26, dojo3s10118.rtp1.hadoop.fmr.com): java.lang.NullPointerException
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/08/26 04:02:43 INFO TaskSetManager: Starting task 0.1 in stage 17.0 (TID 27, dojo3s10118.rtp1.hadoop.fmr.com, PROCESS_LOCAL, 1464 bytes)
15/08/26 04:02:44 INFO TaskSetManager: Lost task 0.1 in stage 17.0 (TID 27) on executor dojo3s10118.rtp1.hadoop.fmr.com: java.lang.NullPointerException (null) [duplicate 1]
15/08/26 04:02:44 INFO TaskSetManager: Starting task 0.2 in stage 17.0 (TID 28, dojo3s10118.rtp1.hadoop.fmr.com, PROCESS_LOCAL, 1464 bytes)
15/08/26 04:02:44 INFO TaskSetManager: Lost task 0.2 in stage 17.0 (TID 28) on executor dojo3s10118.rtp1.hadoop.fmr.com: java.lang.NullPointerException (null) [duplicate 2]
15/08/26 04:02:44 INFO TaskSetManager: Starting task 0.3 in stage 17.0 (TID 29, dojo3s10118.rtp1.hadoop.fmr.com, PROCESS_LOCAL, 1464 bytes)
15/08/26 04:02:44 INFO TaskSetManager: Lost task 0.3 in stage 17.0 (TID 29) on executor dojo3s10118.rtp1.hadoop.fmr.com: java.lang.NullPointerException (null) [duplicate 3]
15/08/26 04:02:44 ERROR TaskSetManager: Task 0 in stage 17.0 failed 4 times; aborting job
15/08/26 04:02:44 INFO YarnClientClusterScheduler: Removed TaskSet 17.0, whose tasks have all completed, from pool
15/08/26 04:02:44 INFO YarnClientClusterScheduler: Cancelling stage 17
15/08/26 04:02:44 INFO DAGScheduler: Job 8 failed: collect at /home/a560975/spark-exp/./ml-py-exp-2.py:102, took 0.209401 s
Traceback (most recent call last):
File "/home/a560975/spark-exp/./ml-py-exp-2.py", line 102, in <module>
print("labels = "+str(labelsAndPreds.collect()))
File "/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p711.386/lib/spark/python/pyspark/rdd.py", line 676, in collect
bytesInJava = self._jrdd.collect().iterator()
File "/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p711.386/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p711.386/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o118.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 29, dojo3s10118.rtp1.hadoop.fmr.com
): java.lang.NullPointerException
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
If instead of doing a test_data.map(lambda p: (p.label, model.predict(p.features)) )
I do the following
for lp in test_data.collect():
    print("predicted = " + str(model.predict(lp.features)))
Then the prediction does not throw any exception, but it is not parallel.
Why do I get the exception when I try to do model prediction inside a map function? How do I get past it?
I have tried sc.broadcast(model) to broadcast the model, but I still see the same problem. Please help.
If you are using Python, the reason is that "In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."
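A minimal sketch of that documented workaround, reusing test_data and model from the question: call predict on an RDD of feature vectors directly, then zip the predictions back with the labels.
# Call predict on an RDD of features instead of inside a map().
features = test_data.map(lambda p: p.features)
predictions = model.predict(features)   # returns an RDD of predicted labels

# Pair each true label with its prediction (zip works here because both RDDs
# come from the same parent with identical partitioning).
labelsAndPreds = test_data.map(lambda p: p.label).zip(predictions)
print("labels = " + str(labelsAndPreds.collect()))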
