How to fit Spark's classifier in parallel? - apache-spark

I have a strange problem.
I'm trying to train a multiclass SVM classifier like this:
JavaPairRDD<Tuple2<String, String>, SVMModel> jp = scmap.mapToPair(
    new PairFunction<Tuple2<Tuple2<String, String>, RDD<LabeledPoint>>, Tuple2<String, String>, SVMModel>() {
        @Override
        public Tuple2<Tuple2<String, String>, SVMModel> call(Tuple2<Tuple2<String, String>, RDD<LabeledPoint>> tup) {
            // Configure the SGD optimizer for this classifier.
            SVMWithSGD svmAlg = new SVMWithSGD();
            svmAlg.optimizer()
                .setNumIterations(100)
                .setRegParam(0.1)
                .setUpdater(new SquaredL2Updater());
            // Train on the RDD paired with this key.
            final SVMModel model = svmAlg.run(tup._2());
            // Make predict() return raw scores instead of 0/1 labels.
            model.clearThreshold();
            return new Tuple2<Tuple2<String, String>, SVMModel>(tup._1(), model);
        }
    });
But when I try to collect() jp, I get this error:
15/01/16 20:06:30 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 5.0 (TID 147, fujitsu11.in.nu): java.lang.NullPointerException:
org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:157)
org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
scala.Option.getOrElse(Option.scala:120)
org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
org.apache.spark.rdd.RDD.take(RDD.scala:1060)
org.apache.spark.rdd.RDD.first(RDD.scala:1092)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:141)
maven.maven1.App$10.call(App.java:430)
maven.maven1.App$10.call(App.java:1)
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:926)
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:926)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Starting task 7.1 in stage 5.0 (TID 148, fujitsu11.in.nu, PROCESS_LOCAL, 2219 bytes)
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Lost task 7.1 in stage 5.0 (TID 148) on executor fujitsu11.in.nu: java.lang.NullPointerException (null) [duplicate 1]
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Starting task 7.2 in stage 5.0 (TID 149, fujitsu11.in.nu, PROCESS_LOCAL, 2219 bytes)
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Lost task 7.2 in stage 5.0 (TID 149) on executor fujitsu11.in.nu: java.lang.NullPointerException (null) [duplicate 2]
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Starting task 7.3 in stage 5.0 (TID 150, fujitsu11.in.nu, PROCESS_LOCAL, 2219 bytes)
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Lost task 7.3 in stage 5.0 (TID 150) on executor fujitsu11.in.nu: java.lang.NullPointerException (null) [duplicate 3]
15/01/16 20:06:30 ERROR scheduler.TaskSetManager: Task 7 in stage 5.0 failed 4 times; aborting job
15/01/16 20:06:30 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
15/01/16 20:06:30 INFO scheduler.TaskSchedulerImpl: Cancelling stage 5
15/01/16 20:06:30 INFO scheduler.DAGScheduler: Failed to run collectAsMap at App.java:452
Why do I get a NullPointerException here? I checked several times that my RDD<LabeledPoint> and Tuple2<String, String> are not null. Maybe it isn't possible to train a classifier in parallel on the workers?
Thank you.

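You cannot run distributed operations inside other distributed operations: an RDD can only be used from the driver, so calling svmAlg.run on an RDD from inside mapToPair, which executes on the workers, fails. But your first mapToPair need not be a distributed operation. Just .par.map a local collection on the driver (that is the Scala idiom; in Java, a parallel stream or a thread pool plays the same role), so that each element spawns its own distributed operation to fit a model.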
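A minimal Java sketch of that driver-side pattern follows. It deliberately has no Spark dependency: `DriverSideTraining`, `fitAll`, and the `trainer` function are hypothetical names introduced here, and the summing trainer in `main` is only a stand-in. In the question's setting, the data type would be RDD<LabeledPoint>, the model type SVMModel, and the trainer something like `rdd -> svmAlg.run(rdd)`, assuming the per-key RDDs are held in an ordinary local map on the driver.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Driver-side fan-out: walk a local map in parallel and run one training
 * call per entry. D is the per-key training data and M the fitted model.
 */
final class DriverSideTraining {

    static <K, D, M> Map<K, M> fitAll(Map<K, D> datasets, Function<D, M> trainer) {
        // ConcurrentHashMap because several driver threads write results at once.
        // Note: it rejects null values, so the trainer must not return null.
        Map<K, M> models = new ConcurrentHashMap<>();
        // parallelStream() uses threads on the driver only. Each trainer call
        // is free to submit its own distributed job from its thread.
        datasets.entrySet().parallelStream()
                .forEach(e -> models.put(e.getKey(), trainer.apply(e.getValue())));
        return models;
    }

    public static void main(String[] args) {
        Map<String, int[]> datasets = new LinkedHashMap<>();
        datasets.put("a-vs-b", new int[] {1, 2, 3});
        datasets.put("a-vs-c", new int[] {4, 5});
        // Stand-in trainer: sums the array instead of fitting a model.
        Map<String, Integer> models = fitAll(datasets, d -> Arrays.stream(d).sum());
        System.out.println(models);
    }
}
```

Spark accepts jobs submitted concurrently from different threads of one application, so the parallelism here comes from the driver threads while each fit still runs on the cluster.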

Related

Structured Streaming delta file does not exist

I am running Spark 2.2.1 Structured Streaming. The program failed after some time because a file did not exist. I found a post describing this problem, but its fix didn't work for me. I then suspected the problem might be the checkpoint, so I changed my code to the following:
`
Dataset<Row> df = this.spark.readStream().format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingOffsets", startingOffsets)
.option("failOnDataLoss", "false")
.load();
……
StreamingQuery start = result.writeStream()
.foreach(new CrossVhcLaneForeach(kafkaProperties, laneTopic))
.outputMode("update")
.option("checkpointLocation", this.checkPointLocation+"/laneDir")
.trigger(Trigger.ProcessingTime(Long.parseLong(delayTime),TimeUnit.SECONDS))
.start();
`
But now the program seems to hang: it neither stops running nor reports an error. I hope someone can help me. Thanks.
I am using Java 1.8, Spark 2.2.1 (standalone), and Hadoop 2.7.3. The errors I encountered are as follows:
19/01/24 10:50:22 INFO TaskSetManager: Starting task 5.1 in stage 13.0 (TID 979, 34.55.0.164, executor 1, partition 5, ANY, 4730 bytes)
19/01/24 10:50:22 WARN TaskSetManager: Lost task 4.0 in stage 13.0 (TID 976, 34.55.0.164, executor 1): java.lang.IllegalStateException: Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta of HDFSStateStoreProvider[id = (op=0, part=4), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta does not exist
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:407)
... 21 more
19/01/24 10:50:22 INFO TaskSetManager: Starting task 4.1 in stage 13.0 (TID 980, 34.55.0.164, executor 1, partition 4, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 3.1 in stage 13.0 (TID 978) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta of HDFSStateStoreProvider[id = (op=0, part=3), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta does not exist) [duplicate 1]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 3.2 in stage 13.0 (TID 981, 34.55.0.164, executor 1, partition 3, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 5.1 in stage 13.0 (TID 979) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5/1.delta of HDFSStateStoreProvider[id = (op=0, part=5), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5/1.delta does not exist) [duplicate 1]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 5.2 in stage 13.0 (TID 982, 34.55.0.164, executor 1, partition 5, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 3.2 in stage 13.0 (TID 981) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta of HDFSStateStoreProvider[id = (op=0, part=3), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta does not exist) [duplicate 2]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 3.3 in stage 13.0 (TID 983, 34.55.0.164, executor 1, partition 3, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 4.1 in stage 13.0 (TID 980) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta of HDFSStateStoreProvider[id = (op=0, part=4), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta does not exist) [duplicate 1]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 4.2 in stage 13.0 (TID 984, 34.55.0.164, executor 1, partition 4, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 5.2 in stage 13.0 (TID 982) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5/1.delta of HDFSStateStoreProvider[id = (op=0, part=5), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/5/1.delta does not exist) [duplicate 2]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 5.3 in stage 13.0 (TID 985, 34.55.0.164, executor 1, partition 5, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 4.2 in stage 13.0 (TID 984) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta of HDFSStateStoreProvider[id = (op=0, part=4), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/4/1.delta does not exist) [duplicate 2]
19/01/24 10:50:22 INFO TaskSetManager: Starting task 4.3 in stage 13.0 (TID 986, 34.55.0.164, executor 1, partition 4, ANY, 4730 bytes)
19/01/24 10:50:22 INFO TaskSetManager: Lost task 3.3 in stage 13.0 (TID 983) on 34.55.0.164, executor 1: java.lang.IllegalStateException (Error reading delta file /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta of HDFSStateStoreProvider[id = (op=0, part=3), dir = /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3]: /tmp/temporary-507089d7-9a64-40aa-9e8e-ab8a276f5bcf/state/0/3/1.delta does not exist) [duplicate 3]
19/01/24 10:50:22 ERROR TaskSetManager: Task 3 in stage 13.0 failed 4 times; aborting job
19/01/24 10:50:22 INFO TaskSchedulerImpl: Cancelling stage 13
19/01/24 10:50:22 INFO TaskSchedulerImpl: Stage 13 was cancelled
Spark stores streaming state in the checkpoint location (on HDFS, if that is what you have configured). If the state was left incomplete because the query stopped abruptly or failed (for example, by losing an executor in an application that uses dynamic allocation), Spark cannot resume from it. The solution is to clear the checkpointed offsets and start again, though whether that is acceptable depends mainly on the use case.

TB data write to parquet

We have 2.5 TB of data in HBase. The region size is 5 GB or 10 GB, and the HBase table has 450 regions. We need to convert it for use with Spark SQL, using the method below:
1. Snapshot the HBase table.
2. Read the HFiles with newAPIHadoopRDD.
3. Write to Parquet.
val hconf = HBaseConfiguration.create()
hconf.set("hbase.rootdir", "/hbase")
hconf.set("hbase.zookeeper.quorum", HbaseToSparksqlBySnapshotParam.zookeeperQurum)
hconf.set(TableInputFormat.SCAN, convertScanToString(scan))
val job = Job.getInstance(hconf)
val path = new Path("/snapshot")
val snapshotName = HbaseToSparksqlBySnapshotParam.snapshotName
TableSnapshotInputFormat.setInput(job, snapshotName, path)
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],classOf[ImmutableBytesWritable], classOf[Result])
val rdd = hbaseRDD.map{
case(_,result) =>
...
Row
}
val df = spark.createDataFrame(rdd, schema)
df.write.parquet("/test")
num-executors   executor-memory   executor-cores   run_time
16              6g                2                failed (error below)
5               15g               8                6 hours
I don't know how to set these parameters (num-executors, executor-memory, executor-cores) so that the job runs faster. When I read just one region (10 GB) with num-executors 1, executor-memory 3g, and executor-cores 1, it runs in 14 minutes.
I am using Spark 2.1.0.
Error:
19/01/18 00:55:26 INFO TaskSetManager: Finished task 386.0 in stage 0.0 (TID 331) in 1187343 ms on hdh68 (executor 5) (331/470)
19/01/18 00:55:31 INFO TaskSetManager: Starting task 423.0 in stage 0.0 (TID 363, hdh68, executor 1, partition 423, NODE_LOCAL, 6905 bytes)
19/01/18 00:55:31 INFO TaskSetManager: Finished task 383.0 in stage 0.0 (TID 328) in 1427677 ms on hdh68 (executor 1) (332/470)
19/01/18 00:57:36 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 10.
19/01/18 00:57:36 INFO DAGScheduler: Executor lost: 10 (epoch 0)
19/01/18 00:57:36 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/01/18 00:57:36 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(10, hdh68, 40679, None)
19/01/18 00:57:36 INFO BlockManagerMaster: Removed 10 successfully in removeExecutor
19/01/18 00:57:36 INFO DAGScheduler: Shuffle files lost for executor: 10 (epoch 0)
19/01/18 00:57:36 WARN DFSClient: Slow ReadProcessor read fields took 114044ms (threshold=30000ms); ack: seqno: 49 reply: 0 downstreamAckTimeNanos: 0, targets: [DatanodeInfoWithStorage[10.41.2.68:50010,DS-26a72d43-2e16-41a9-9a71-99593f14ab6f,DISK]]
19/01/18 00:57:39 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 6.6 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
19/01/18 00:57:39 ERROR YarnScheduler: Lost executor 10 on hdh68: Container killed by YARN for exceeding memory limits. 6.6 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
19/01/18 00:57:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/01/18 00:57:39 INFO BlockManagerMaster: Removal of executor 10 requested
19/01/18 00:57:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 10
19/01/18 00:57:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.41.2.68:49771) with ID 17
19/01/18 00:57:52 INFO TaskSetManager: Starting task 387.1 in stage 0.0 (TID 364, hdh68, executor 17, partition 387, NODE_LOCAL, 6906 bytes)
19/01/18 00:57:52 INFO TaskSetManager: Starting task 417.1 in stage 0.0 (TID 365, hdh68, executor 17, partition 417, NODE_LOCAL, 6906 bytes)
19/01/18 00:57:53 INFO BlockManagerMasterEndpoint: Registering block manager hdh68:38247 with 3.0 GB RAM, BlockManagerId(17, hdh68, 38247, None)
19/01/18 00:58:42 INFO TaskSetManager: Starting task 426.0 in stage 0.0 (TID 366, hdh68, executor 13, partition 426, NODE_LOCAL, 6905 bytes)
19/01/18 00:58:42 INFO TaskSetManager: Finished task 396.0 in stage 0.0 (TID 341) in 1014645 ms on hdh68 (executor 13) (333/470)
java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:139)
at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287)
at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:802)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:313)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:770)
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1256)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:754)
at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:754)
at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:807)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835)
at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017)
at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256)
at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:150)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
19/01/18 00:59:58 INFO TaskSetManager: Starting task 428.0 in stage 0.0 (TID 367, hdh68, executor 17, partition 428, NODE_LOCAL, 6906 bytes)
19/01/18 00:59:58 WARN TaskSetManager: Lost task 387.1 in stage 0.0 (TID 364, hdh68, executor 17): java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
19/01/18 01:00:04 INFO TaskSetManager: Starting task 387.2 in stage 0.0 (TID 368, hdh68, executor 14, partition 387, NODE_LOCAL, 6906 bytes)
19/01/18 01:00:04 INFO TaskSetManager: Finished task 385.0 in stage 0.0 (TID 330) in 1471628 ms on hdh68 (executor 14) (334/470)
19/01/18 01:00:12 INFO TaskSetManager: Starting task 429.0 in stage 0.0 (TID 369, hdh68, executor 8, partition 429, NODE_LOCAL, 6905 bytes)
19/01/18 01:00:12 INFO TaskSetManager: Finished task 392.0 in stage 0.0 (TID 337) in 1220925 ms on hdh68 (executor 8) (335/470)
19/01/18 01:00:27 INFO TaskSetManager: Starting task 430.0 in stage 0.0 (TID 370, hdh68, executor 11, partition 430, NODE_LOCAL, 6906 bytes)
19/01/18 01:01:16 INFO TaskSetManager: Finished task 390.0 in stage 0.0 (TID 335) in 1353136 ms on hdh68 (executor 14) (337/470)
19/01/18 01:01:30 INFO TaskSetManager: Starting task 432.0 in stage 0.0 (TID 372, hdh68, executor 5, partition 432, NODE_LOCAL, 6906 bytes)
19/01/18 01:01:30 INFO TaskSetManager: Finished task 393.0 in stage 0.0 (TID 338) in 1244707 ms on hdh68 (executor 5) (338/470)
java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:139)
at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287)
at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:802)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:319)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
19/01/18 01:01:59 INFO TaskSetManager: Starting task 434.0 in stage 0.0 (TID 374, hdh68, executor 17, partition 434, NODE_LOCAL, 6905 bytes)
19/01/18 01:01:59 WARN TaskSetManager: Lost task 417.1 in stage 0.0 (TID 365, hdh68, executor 17): java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
Update:
The cluster is pseudo-distributed, with 200 GB of total memory and 64 total vCores. The resources of the root.root queue are shown below; the queue has no other applications:
Used Resources: <memory:171776, vCores:44>
Num Active Applications: 7
Num Pending Applications: 0
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:204800, vCores:64>
Steady Fair Share: <memory:54614, vCores:0>
Instantaneous Fair Share: <memory:102400, vCores:0>
Preemptable: true
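
The "Container killed by YARN for exceeding memory limits. 6.6 GB of 6.6 GB physical memory used" message in the log above can be checked with a little arithmetic. The sketch below assumes that the executor-memory values in the table are in GB and that spark.yarn.executor.memoryOverhead was left at its Spark 2.1 default of 10% of executor memory with a 384 MB floor; `YarnContainerSize` and its methods are hypothetical helper names, not Spark API.

```java
/** Estimates the YARN container size requested for one Spark executor. */
final class YarnContainerSize {

    // Assumed default floor for spark.yarn.executor.memoryOverhead, in MB.
    static final long MIN_OVERHEAD_MB = 384;

    /** Assumed default overhead: 10% of executor memory, at least 384 MB. */
    static long defaultOverheadMb(long executorMemoryMb) {
        return Math.max(MIN_OVERHEAD_MB, executorMemoryMb / 10);
    }

    /** Container size = executor heap + off-heap overhead. */
    static long containerMb(long executorMemoryMb) {
        return executorMemoryMb + defaultOverheadMb(executorMemoryMb);
    }

    public static void main(String[] args) {
        // The three executor-memory settings mentioned in the question.
        for (long gb : new long[] {3, 6, 15}) {
            long mb = gb * 1024;
            System.out.printf("executor-memory %dg -> overhead %d MB, container %d MB%n",
                    gb, defaultOverheadMb(mb), containerMb(mb));
        }
    }
}
```

For the 6g setting this gives 6144 + 614 = 6758 MB, about 6.6 GB, which matches the limit reported in the log. Raising spark.yarn.executor.memoryOverhead, as the log message itself suggests, enlarges the container without changing the heap size.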

When would pyspark fail with "java.lang.AssertionError: assertion failed" from BlockInfo.checkInvariants?

I am using pyspark and got the following messages:
17/12/03 11:57:48 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 1800, 172.31.27.9, executor 0): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
17/12/03 11:57:48 INFO TaskSetManager: Starting task 0.1 in stage 5.0 (TID 1801, 172.31.27.9, executor 0, partition 0, PROCESS_LOCAL, 4871 bytes)
17/12/03 11:57:48 INFO TaskSetManager: Lost task 0.1 in stage 5.0 (TID 1801) on 172.31.27.9, executor 0: java.lang.AssertionError (assertion failed) [duplicate 1]
17/12/03 11:57:48 INFO TaskSetManager: Starting task 0.2 in stage 5.0 (TID 1802, 172.31.27.9, executor 0, partition 0, PROCESS_LOCAL, 4871 bytes)
17/12/03 11:57:48 INFO TaskSetManager: Lost task 0.2 in stage 5.0 (TID 1802) on 172.31.27.9, executor 0: java.lang.AssertionError (assertion failed) [duplicate 2]
17/12/03 11:57:48 INFO TaskSetManager: Starting task 0.3 in stage 5.0 (TID 1803, 172.31.27.9, executor 0, partition 0, PROCESS_LOCAL, 4871 bytes)
17/12/03 11:57:48 INFO TaskSetManager: Lost task 0.3 in stage 5.0 (TID 1803) on 172.31.27.9, executor 0: java.lang.AssertionError (assertion failed) [duplicate 3]
17/12/03 11:57:48 ERROR TaskSetManager: Task 0 in stage 5.0 failed 4 times; aborting job
17/12/03 11:57:48 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
17/12/03 11:57:48 INFO TaskSchedulerImpl: Cancelling stage 5
17/12/03 11:57:48 INFO DAGScheduler: ResultStage 5 (runJob at PythonRDD.scala:446) failed in 0.078 s due to Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 1803, 172.31.27.9, executor 0): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Others seem to have run into "TaskSetManager: Task 0 in stage 5.0 failed 4 times; aborting job" as well, but I am particularly concerned about this message: ... Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 1803, 172.31.27.9, executor 0): java.lang.AssertionError: assertion failed.
I suspect the problem arises when the RDD is written out: there may be errors during the write, and Spark then has trouble parsing the data later.
I am using the ALS library and its Rating object, so before saving the RDD it is more convenient for me to map it to plain tuples:
data.map(lambda x: (x.user, x.product, x.rating)).saveAsTextFile ("hdfs://"+master_ip+":9000/RDD/data")
and read and parse as
data = sc.textFile("hdfs://"+master_ip+":9000/RDD/data")
data = data.map(lambda x: x[1:-1]).map(lambda x: x.split(", ")).\
map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
I am puzzled because these error messages do not appear on every run and are not always reproducible.
I am using Spark 2.2.0 and Hadoop 2.7. Has anyone seen this before?
Thanks!
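The round trip described above (tuples written with saveAsTextFile, then sliced and split by hand) can be made less fragile by parsing each line as a Python literal. The sketch below is not a fix for the AssertionError itself; it only shows one way to parse lines such as "(1, 2, 3.5)" and to skip malformed ones. The Rating namedtuple is a stand-in for pyspark.mllib.recommendation.Rating so that the example runs without Spark, and parse_rating / try_parse_rating are hypothetical helper names.

```python
import ast
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating (assumption: same three fields),
# used here only so the sketch runs without a Spark installation.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def parse_rating(line):
    """Parse one text line such as "(1, 2, 3.5)" into a Rating."""
    # literal_eval only accepts Python literals, so it cannot execute arbitrary code.
    user, product, rating = ast.literal_eval(line.strip())
    return Rating(int(user), int(product), float(rating))


def try_parse_rating(line):
    """Return a Rating, or None when the line is malformed."""
    try:
        return parse_rating(line)
    except (ValueError, SyntaxError, TypeError):
        return None
```

With Spark, the same helpers could be applied as sc.textFile(path).map(try_parse_rating).filter(lambda r: r is not None), which would drop any half-written or corrupted lines instead of failing the task.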

NoSuchMethodError: com.google.common.io.ByteStreams.limit

I run Spark to write data to HBase, but I get a NoSuchMethodError:
15/10/23 18:45:21 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, dn18-formal.i.nease.net): java.lang.NoSuchMethodError: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
I found guava.jar in the hadoop/hbase directory and its version is 12.0, but com.google.common.io.ByteStreams.limit has only existed since 14.0, hence the NoSuchMethodError.
I tried running spark-submit with --jars, but the result was the same. I also tried adding
configuration.set("spark.executor.extraClassPath", "/home/ljh")
configuration.set("spark.driver.userClassPathFirst","true");
to my code, but the result is still the same.
How can I solve this? How can I remove the guava.jar in hadoop/hbase from the classpath? Why does Spark not use the guava.jar in the Spark directory?
Here is my code:
rdd.foreach({ res =>
val configuration = HBaseConfiguration.create();
configuration.set("hbase.zookeeper.property.clientPort", "2181");
configuration.set("hbase.zookeeper.quorum", "ip.66");
configuration.set("hbase.master", "ip:60000");
configuration.set("spark.executor.extraClassPath", "/home/ljh")
configuration.set("spark.driver.userClassPathFirst","true");
val hadmin = new HBaseAdmin(configuration);
configuration.clear();
configuration.addResource("/home/hadoop/conf/core-default.xml")
configuration.addResource("/home/hadoop/conf/core-site.xml")
configuration.addResource("/home/hadoop/conf/mapred-default.xml")
configuration.addResource("/home/hadoop/conf/mapred-site.xml")
configuration.addResource("/home/hadoop/conf/yarn-default.xml")
configuration.addResource("/home/hadoop/conf/yarn-site.xml")
configuration.addResource("/home/hadoop/conf/hdfs-default.xml")
configuration.addResource("/home/hadoop/conf/hdfs-site.xml")
configuration.addResource("/home/hadoop/conf/hbase-default.xml")
configuration.addResource("/home/ljhn1829/hbase-site.xml")
val table = new HTable(configuration, "ljh_test2");
var put = new Put(Bytes.toBytes(res.toKey()));
put.add(Bytes.toBytes("basic"), Bytes.toBytes("name"), Bytes.toBytes(res.totalCount + "\t" + res.positiveCount));
table.put(put);
table.flushCommits()
})
and the error message:
15/10/23 19:06:42 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, gdc-dn126-formal.i.nease.net): java.lang.NoSuchMethodError:
com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.nextBatchStream(ExternalAppendOnlyMap.scala:420)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.<init>(ExternalAppendOnlyMap.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:207)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:63)
at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:83)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:63)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/10/23 19:06:42 INFO TaskSetManager: Starting task 0.1 in stage 1.0 (TID 2, gdc-dn166-formal.i.nease.net, PROCESS_LOCAL, 1277 bytes)
15/10/23 19:06:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on gdc-dn166-formal.i.nease.net:3838 (size: 3.2 KB, free: 1060.3 MB)
15/10/23 19:06:42 ERROR YarnScheduler: Lost executor 1 on gdc-dn126-formal.i.nease.net: remote Rpc client disassociated
15/10/23 19:06:42 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@gdc-dn126-formal.i.nease.net:1656] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/10/23 19:06:42 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 1.0
15/10/23 19:06:42 INFO DAGScheduler: Executor lost: 1 (epoch 1)
15/10/23 19:06:42 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
15/10/23 19:06:42 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, gdc-dn126-formal.i.nease.net, 44635)
15/10/23 19:06:42 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
15/10/23 19:06:42 INFO ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 1 (0/1, false)
15/10/23 19:06:42 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to gdc-dn166-formal.i.nease.net:28595
15/10/23 19:06:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 84 bytes
15/10/23 19:06:42 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2, gdc-dn166-formal.i.nease.net): FetchFailed(null, shuffleId=1, mapId=-1, reduceId=0, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:385)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:172)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Add the Guava 14.0.1 dependency to your pom.xml:
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>14.0.1</version>
</dependency>
because ByteStreams.limit has only existed since Guava 14.0, as its source shows (https://guava.dev/releases/19.0/api/docs/src-html/com/google/common/io/ByteStreams.html#line.596):
/**
 * Wraps a {@link InputStream}, limiting the number of bytes which can be
 * read.
 *
 * @param in the input stream to be wrapped
 * @param limit the maximum number of bytes to be read
 * @return a length-limited {@link InputStream}
 * @since 14.0 (since 1.0 as com.google.common.io.LimitInputStream)
 */
public static InputStream limit(InputStream in, long limit) {
  return new LimitedInputStream(in, limit);
}
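To see which Guava jars a node actually has, and which of them predate 14.0, a small script can scan the relevant directories. This is only a rough sketch: the helper names are made up for illustration, the directories passed in are whatever you choose to scan, and it only recognises jars named like guava-<version>.jar.

```python
import os
import re

# Matches file names such as guava-12.0.jar or guava-14.0.1.jar (assumed naming scheme).
GUAVA_JAR = re.compile(r"^guava-(\d+(?:\.\d+)*).*\.jar$")


def guava_version(filename):
    """Return the version as a tuple of ints, or None if this is not a Guava jar."""
    match = GUAVA_JAR.match(filename)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_older(version, minimum):
    """Compare two version tuples after padding the shorter one with zeros."""
    width = max(len(version), len(minimum))
    pad = lambda t: t + (0,) * (width - len(t))
    return pad(version) < pad(minimum)


def find_old_guava(dirs, minimum=(14, 0)):
    """List (path, version) for every Guava jar under dirs older than minimum."""
    hits = []
    for directory in dirs:
        for root, _subdirs, files in os.walk(directory):
            for name in sorted(files):
                version = guava_version(name)
                if version is not None and is_older(version, minimum):
                    hits.append((os.path.join(root, name), version))
    return hits
```

Calling find_old_guava with the Hadoop and HBase lib directories of a node (the exact paths depend on the installation) would list the jars that lack ByteStreams.limit.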

spark pyspark mllib model - when prediction rdd is generated using map, it throws exception on collect()

I am using Spark 1.2.0 (I cannot upgrade as I don't have control over it). I am using MLlib to build a model:
points = labels.zip(tfidf).map(lambda t: LabeledPoint(t[0], t[1] ))
train_data, test_data = points.randomSplit([0.6, 0.4], 17)
iterations = 3
model = LogisticRegressionWithSGD.train(train_data, iterations)
labelsAndPreds = test_data.map(lambda p: (p.label, model.predict(p.features)) )
print("labels = "+str(labelsAndPreds.collect()))
When I run this code I get a NullPointerException on collect(). In fact, any operation on the resulting predictions throws this exception.
15/08/26 04:02:43 INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 26, dojo3s10118.rtp1.hadoop.fmr.com, PROCESS_LOCAL, 1464 bytes)
15/08/26 04:02:43 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on dojo3s10118.rtp1.hadoop.fmr.com:41145 (size: 9.6 KB, free: 529.8 MB)
15/08/26 04:02:43 INFO BlockManagerInfo: Added broadcast_14_piece0 in memory on dojo3s10118.rtp1.hadoop.fmr.com:41145 (size: 68.0 B, free: 529.8 MB)
15/08/26 04:02:43 WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 26, dojo3s10118.rtp1.hadoop.fmr.com): java.lang.NullPointerException
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/08/26 04:02:43 INFO TaskSetManager: Starting task 0.1 in stage 17.0 (TID 27, dojo3s10118.rtp1.hadoop.fmr.com, PROCESS_LOCAL, 1464 bytes)
15/08/26 04:02:44 INFO TaskSetManager: Lost task 0.1 in stage 17.0 (TID 27) on executor dojo3s10118.rtp1.hadoop.fmr.com: java.lang.NullPointerException (null) [duplicate 1]
15/08/26 04:02:44 INFO TaskSetManager: Starting task 0.2 in stage 17.0 (TID 28, dojo3s10118.rtp1.hadoop.fmr.com, PROCESS_LOCAL, 1464 bytes)
15/08/26 04:02:44 INFO TaskSetManager: Lost task 0.2 in stage 17.0 (TID 28) on executor dojo3s10118.rtp1.hadoop.fmr.com: java.lang.NullPointerException (null) [duplicate 2]
15/08/26 04:02:44 INFO TaskSetManager: Starting task 0.3 in stage 17.0 (TID 29, dojo3s10118.rtp1.hadoop.fmr.com, PROCESS_LOCAL, 1464 bytes)
15/08/26 04:02:44 INFO TaskSetManager: Lost task 0.3 in stage 17.0 (TID 29) on executor dojo3s10118.rtp1.hadoop.fmr.com: java.lang.NullPointerException (null) [duplicate 3]
15/08/26 04:02:44 ERROR TaskSetManager: Task 0 in stage 17.0 failed 4 times; aborting job
15/08/26 04:02:44 INFO YarnClientClusterScheduler: Removed TaskSet 17.0, whose tasks have all completed, from pool
15/08/26 04:02:44 INFO YarnClientClusterScheduler: Cancelling stage 17
15/08/26 04:02:44 INFO DAGScheduler: Job 8 failed: collect at /home/a560975/spark-exp/./ml-py-exp-2.py:102, took 0.209401 s
Traceback (most recent call last):
File "/home/a560975/spark-exp/./ml-py-exp-2.py", line 102, in <module>
print("labels = "+str(labelsAndPreds.collect()))
File "/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p711.386/lib/spark/python/pyspark/rdd.py", line 676, in collect
bytesInJava = self._jrdd.collect().iterator()
File "/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p711.386/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p711.386/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o118.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 29, dojo3s10118.rtp1.hadoop.fmr.com): java.lang.NullPointerException
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:229)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:229)
If, instead of calling test_data.map(lambda p: (p.label, model.predict(p.features))),
I do the following:
for lp in test_data.collect():
    print("predicted = "+str(model.predict(lp.features)))
Then the prediction does not throw any exception, but it does not run in parallel.
Why do I get the exception when I try to do the model prediction inside a map function? How do I get past it?
I have tried sc.broadcast(model) to broadcast the model, but I still see the same problem. Please help.
If you are using Python, the reason is that “In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead.”
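A minimal sketch of what "call predict directly on the RDD" could look like is shown below. It assumes that the model's predict method accepts an RDD of feature vectors and returns an RDD of predictions; whether that holds depends on the model class and the Spark version, so treat it as something to verify rather than a guaranteed fix. The function name labels_and_predictions is hypothetical.

```python
def labels_and_predictions(model, test_data):
    """Pair each label with its prediction without calling predict inside map."""
    # Extract the feature vectors; this closure does not reference the model.
    features = test_data.map(lambda p: p.features)
    # Call predict once, on the whole RDD (assumption: the model supports RDD input).
    predictions = model.predict(features)
    # Both RDDs are derived from test_data by map, so their elements line up for zip.
    labels = test_data.map(lambda p: p.label)
    return labels.zip(predictions)
```

The result can then be collected as before, for example labels_and_predictions(model, test_data).collect(), and the prediction work stays distributed instead of running in a driver-side loop.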

Resources