Spark Cassandra Connector Maven Build Issue - apache-spark

Hi, I am trying to write a Spark application that reads data from Cassandra. My Scala version is 2.11 and my Spark version is 2.2.0. Unfortunately, I am facing a build issue. It says "missing or invalid dependency detected while loading class file 'package.class'". I do not know what is causing this issue.
Here's my POM file:
<properties>
<maven.compiler.source>1.6</maven.compiler.source>
<maven.compiler.target>1.6</maven.compiler.target>
<encoding>UTF-8</encoding>
<!--scala.tools.version>2.11.8</scala.tools.version-->
<scala.version>2.11.8</scala.version>
</properties>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<!-- see http://davidb.github.com/scala-maven-plugin -->
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.1.3</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<!--arg>-make:transitive</arg-->
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.13</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<!-- If you have classpath issue like NoDefClassError,... -->
<!-- useManifestOnlyJar>false</useManifestOnlyJar -->
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>
<!-- "package" command plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4.1</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
<dependencies>
<!-- Scala and Spark dependencies -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-xml</artifactId>
<version>2.11.0-M4</version>
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-parser-combinators_2.11</artifactId>
<version>1.0.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.7</version>
</dependency>
<!--dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.11</artifactId>
<version>1.5.0-RC1</version>
</dependency-->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.12</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.7.1</version>
</dependency>
</dependencies>
I am getting the following error:
[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ search-count ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO]
[INFO] --- maven-compiler-plugin:2.0.2:compile (default-compile) @ search-count ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- scala-maven-plugin:3.1.3:compile (default) @ search-count ---
[WARNING] Expected all dependencies to require Scala version: 2.11.8
[WARNING] search-count:search-count:0.0.1-SNAPSHOT requires scala version: 2.11.8
[WARNING] org.scala-lang.modules:scala-parser-combinators_2.11:1.0.2 requires scala version: 2.11.1
[WARNING] Multiple versions of scala libraries detected!
[ERROR] error: missing or invalid dependency detected while loading class file 'package.class'.
[INFO] Could not access type DataFrame in value org.apache.spark.sql.package,
[INFO] because it (or its dependencies) are missing. Check your build definition for
[INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
[INFO] A full rebuild may help if 'package.class' was compiled against an incompatible version of org.apache.spark.sql.package.
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.052s
[INFO] Finished at: Wed Apr 04 11:33:51 CEST 2018
[INFO] Final Memory: 22M/425M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.3:compile (default) on project search-count: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]
Any ideas what could be causing the problem?
Console logs after running my app:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/04/04 14:15:31 INFO SparkContext: Running Spark version 2.2.0
18/04/04 14:15:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/04 14:15:32 WARN Utils: Your hostname, obel-pc0083 resolves to a loopback address: 127.0.1.1; using 10.96.20.75 instead (on interface eth0)
18/04/04 14:15:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/04/04 14:15:32 INFO SparkContext: Submitted application: Online Gateway Count
18/04/04 14:15:32 INFO Utils: Successfully started service 'sparkDriver' on port 45111.
18/04/04 14:15:32 INFO SparkEnv: Registering MapOutputTracker
18/04/04 14:15:32 INFO SparkEnv: Registering BlockManagerMaster
18/04/04 14:15:32 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/04/04 14:15:32 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/04/04 14:15:32 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e7cfde5b-87f0-4447-a19e-771d100d7422
18/04/04 14:15:32 INFO MemoryStore: MemoryStore started with capacity 1137.6 MB
18/04/04 14:15:32 INFO SparkEnv: Registering OutputCommitCoordinator
18/04/04 14:15:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/04/04 14:15:32 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.96.20.75:4040
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.96.20.75:7077...
18/04/04 14:15:33 INFO TransportClientFactory: Successfully created connection to /10.96.20.75:7077 after 59 ms (0 ms spent in bootstraps)
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20180404141533-0009
18/04/04 14:15:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39062.
18/04/04 14:15:33 INFO NettyBlockTransferService: Server created on 10.96.20.75:39062
18/04/04 14:15:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180404141533-0009/0 on worker-20180403185515-10.96.20.75-38166 (10.96.20.75:38166) with 4 cores
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: Granted executor ID app-20180404141533-0009/0 on hostPort 10.96.20.75:38166 with 4 cores, 1024.0 MB RAM
18/04/04 14:15:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.96.20.75:39062 with 1137.6 MB RAM, BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.96.20.75, 39062, None)
18/04/04 14:15:33 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180404141533-0009/0 is now RUNNING
18/04/04 14:15:33 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/04/04 14:15:34 INFO Native: Could not load JNR C Library, native system calls through this library will not be available (set this logger level to DEBUG to see the full stack trace).
18/04/04 14:15:34 INFO ClockFactory: Using java.lang.System clock to generate timestamps.
18/04/04 14:15:35 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
18/04/04 14:15:36 INFO Cluster: New Cassandra host /10.96.20.75:9042 added
18/04/04 14:15:36 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
18/04/04 14:15:36 INFO SparkContext: Starting job: count at SearchCount.scala:47
18/04/04 14:15:36 INFO DAGScheduler: Registering RDD 4 (distinct at SearchCount.scala:47)
18/04/04 14:15:36 INFO DAGScheduler: Got job 0 (count at SearchCount.scala:47) with 6 output partitions
18/04/04 14:15:36 INFO DAGScheduler: Final stage: ResultStage 1 (count at SearchCount.scala:47)
18/04/04 14:15:36 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/04/04 14:15:36 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/04/04 14:15:36 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[4] at distinct at SearchCount.scala:47), which has no missing parents
18/04/04 14:15:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.6 KB, free 1137.6 MB)
18/04/04 14:15:37 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.2 KB, free 1137.6 MB)
18/04/04 14:15:37 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.96.20.75:39062 (size: 5.2 KB, free: 1137.6 MB)
18/04/04 14:15:37 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
18/04/04 14:15:37 INFO DAGScheduler: Submitting 6 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[4] at distinct at SearchCount.scala:47) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5))
18/04/04 14:15:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
18/04/04 14:15:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.96.20.75:43727) with ID 0
18/04/04 14:15:37 INFO BlockManagerMasterEndpoint: Registering block manager 10.96.20.75:46125 with 366.3 MB RAM, BlockManagerId(0, 10.96.20.75, 46125, None)
18/04/04 14:15:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.96.20.75, executor 0, partition 0, NODE_LOCAL, 12327 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.96.20.75, executor 0, partition 1, NODE_LOCAL, 11729 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.96.20.75, executor 0, partition 2, NODE_LOCAL, 13038 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.96.20.75, executor 0, partition 3, NODE_LOCAL, 12445 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.96.20.75, executor 0, partition 4, NODE_LOCAL, 12209 bytes)
18/04/04 14:15:38 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.96.20.75, executor 0, partition 5, NODE_LOCAL, 6864 bytes)
18/04/04 14:15:38 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.96.20.75, executor 0): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:309)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Edit: I completely missed that 1.5.0-RC1 was commented out...
It should be enough to specify the spark-cassandra-connector dependency - it already has a dependency on spark-core & spark-sql. But if you use Spark 2.x, you need to use the 2.x version of the spark-cassandra-connector (although it has a dependency on 2.0.2, it could work with 2.2.0).
I don't know where you got version 1.5.0-RC1 from - it's very old...
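For illustration only, here is a minimal sketch of the dependency block this answer implies, assuming Spark 2.2.0 and Scala 2.11. The explicit spark-sql artifact is my own addition (not stated in the answer); it is one way to make the org.apache.spark.sql.DataFrame type from the compile error available on the compile classpath rather than relying only on the connector's transitive version:
<!-- Sketch only: Spark 2.2.0 / Scala 2.11. spark-sql is added explicitly as an assumption. -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.7</version>
</dependency>
If the runtime ClassNotFoundException for CassandraPartition from the console log persists after the build is fixed, that usually means the connector classes are not reaching the executors; submitting the jar-with-dependencies assembly that the POM already builds (or using spark-submit --packages) is the usual way to get them there.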

Related

Permission denied error when setting up local Spark instance and running pyspark

I am setting up a local Spark instance on Windows to use with PySpark as described in this guide (but with spark-3.0.0 / hadoop 2.7 instead): https://phoenixnap.com/kb/install-spark-on-windows-10.
I can start up Spark with:
C:\Spark\spark-3.0.0-bin-hadoop2.7\bin>spark-shell.cmd
and connect to it with http://localhost:4040/ in my browser (I see the Spark GUI).
But when I run the Python pyspark example with
C:\Spark\spark-3.0.0-bin-hadoop2.7\examples>run-example SparkPi
it throws a Permission Denied error, as in this trace:
21/03/08 10:51:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/03/08 10:51:04 INFO SparkContext: Running Spark version 3.0.0
21/03/08 10:51:04 INFO ResourceUtils: ==============================================================
21/03/08 10:51:04 INFO ResourceUtils: Resources for spark.driver:
21/03/08 10:51:04 INFO ResourceUtils: ==============================================================
21/03/08 10:51:04 INFO SparkContext: Submitted application: Spark Pi
21/03/08 10:51:04 INFO SecurityManager: Changing view acls to: #####
21/03/08 10:51:04 INFO SecurityManager: Changing modify acls to: #####
21/03/08 10:51:04 INFO SecurityManager: Changing view acls groups to:
21/03/08 10:51:04 INFO SecurityManager: Changing modify acls groups to:
21/03/08 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(#####); groups with view permissions: Set(); users with modify permissions: Set(#####); groups with modify permissions: Set()
21/03/08 10:51:05 INFO Utils: Successfully started service 'sparkDriver' on port 63213.
21/03/08 10:51:05 INFO SparkEnv: Registering MapOutputTracker
21/03/08 10:51:05 INFO SparkEnv: Registering BlockManagerMaster
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/03/08 10:51:05 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/03/08 10:51:05 INFO DiskBlockManager: Created local directory at C:\Users\#####\AppData\Local\Temp\blockmgr-dce03954-27a7-484d-8e54-f552b21433f7
21/03/08 10:51:05 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
21/03/08 10:51:05 INFO SparkEnv: Registering OutputCommitCoordinator
21/03/08 10:51:05 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/03/08 10:51:05 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://WORKSTATION.DOMAIN.EXT:4040
21/03/08 10:51:05 INFO SparkContext: Added JAR file:///C:/Spark/spark-3.0.0-bin-hadoop2.7/examples/jars/scopt_2.12-3.7.1.jar at spark://WORKSTATION.DOMAIN.EXT:63213/jars/scopt_2.12-3.7.1.jar with timestamp 1615197065578
21/03/08 10:51:05 INFO SparkContext: Added JAR file:///C:/Spark/spark-3.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.12-3.0.0.jar at spark://WORKSTATION.DOMAIN.EXT:63213/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615197065579
21/03/08 10:51:05 INFO Executor: Starting executor ID driver on host WORKSTATION.DOMAIN.EXT
21/03/08 10:51:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63260.
21/03/08 10:51:05 INFO NettyBlockTransferService: Server created on WORKSTATION.DOMAIN.EXT:63260
21/03/08 10:51:05 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/03/08 10:51:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: Registering block manager NLLR4000250910.solon.prd:63260 with 366.3 MiB RAM, BlockManagerId(driver, WORKSTATION.DOMAIN.EXT, 63260, None)
21/03/08 10:51:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:05 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:06 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
21/03/08 10:51:06 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
21/03/08 10:51:06 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
21/03/08 10:51:06 INFO DAGScheduler: Parents of final stage: List()
21/03/08 10:51:06 INFO DAGScheduler: Missing parents: List()
21/03/08 10:51:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
21/03/08 10:51:06 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.1 KiB, free 366.3 MiB)
21/03/08 10:51:06 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1816.0 B, free 366.3 MiB)
21/03/08 10:51:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on WORKSTATION.DOMAIN.EXT:63260 (size: 1816.0 B, free: 366.3 MiB)
21/03/08 10:51:06 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1200
21/03/08 10:51:06 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
21/03/08 10:51:06 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
21/03/08 10:51:06 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, WORKSTATION.DOMAIN.EXT, executor driver, partition 0, PROCESS_LOCAL, 7393 bytes)
21/03/08 10:51:06 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, WORKSTATION.DOMAIN.EXT, executor driver, partition 1, PROCESS_LOCAL, 7393 bytes)
21/03/08 10:51:06 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/03/08 10:51:06 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/03/08 10:51:06 INFO Executor: Fetching spark://WORKSTATION.DOMAIN.EXT:63213/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615197065579
21/03/08 10:51:06 ERROR Utils: Aborting task
java.io.IOException: Failed to connect to WORKSTATION.DOMAIN.EXT/192.168.#.#:63213
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:359)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:535)
at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:869)
at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:860)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:860)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:404)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedSocketException: Permission denied: no further information: WORKSTATION.DOMAIN.EXT/192.168.#.#:63213
Caused by: java.net.SocketException: Permission denied: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Unknown Source)
[snip]
When running it on a different machine with seemingly the same config, where it works fine, I get this trace at the point where the exception is thrown in the other trace:
[snip]
21/03/08 08:00:22 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/03/08 08:00:22 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/03/08 08:00:22 INFO Executor: Fetching spark://WORKSTATION.DOMAIN.EXT:63646/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615186820489
21/03/08 08:00:22 INFO TransportClientFactory: Successfully created connection to WORKSTATION.DOMAIN.EXT/10.121.#.#:63646 after 86 ms (0 ms spent in bootstraps)
21/03/08 08:00:22 INFO Utils: Fetching spark://WORKSTATION.DOMAIN.EXT:63646/jars/spark-examples_2.12-3.0.0.jar to C:\Users\#####\AppData\Local\Temp\spark-54a13d9f-9064-4f34-ba81-af49b18d9a0c\userFiles-24c3eabc-02a4-4aca-8abb-424431c6442f\fetchFileTemp5258763437798623210.tmp
21/03/08 08:00:24 INFO Executor: Adding file:/C:/Users/#####/AppData/Local/Temp/spark-54a13d9f-9064-4f34-ba81-af49b18d9a0c/userFiles-24c3eabc-02a4-4aca-8abb-424431c6442f/spark-examples_2.12-3.0.0.jar to class loader
[snip]
At first it seemed like a firewall issue to me, but adding the executing java.exe as an exception to the firewall didn't solve the issue.
Does anyone know what I should try next to get this issue resolved?
Finally, I was able to solve it by setting SPARK_LOCAL_IP to localhost in my environment variables: go to your Windows environment variables and set SPARK_LOCAL_IP=localhost.

FileNotFoundException on submitting Spark Jobs to remote

I've created an environment with 3 Docker containers: one for Airflow, using the puckel/docker-airflow image with Spark and Hadoop additionally installed. The other two containers basically act as the Spark master and worker (created from the gettyimages/spark Docker image). All 3 containers are connected to each other via a bridge network, so all containers are able to communicate with each other.
What I'm trying to do next is to submit a Spark job from the Airflow container to the Spark cluster (master).
As an initial example, I'm using the wordcount sample script. I created a sample.txt file in the Airflow container at the path usr/local/airflow/sample.txt. I've shelled into the Airflow container and I'm using the command given below to run wordcount.py on the Spark master, located at the IP which I found after inspecting the bridge network.
spark-submit --master spark://ipaddress:7077 --files usr/local/airflow/sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
After submitting the script, I can see from the logs that a connection has been established with the master (from the Airflow container), and it also copied the file specified by --files to the master and worker, but then it just errors out saying,
java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
As per my understanding (which could be wrong), when we specify files to copy to the master using --files, you can access them directly via the file name (sample.txt in my case). So what I'm trying to figure out is: if the job has been submitted and the file has been copied to the master, then why is it searching in the location file:/usr/local/airflow/sample.txt? How do I make it refer to the correct path?
I apologize, as this question has been asked a couple of times, but I've read all the related questions on Stack Overflow and I'm still unable to resolve this. I'd really appreciate your help with this.
Thanks.
The full log is below:
user#machine:/usr/local/airflow# spark-submit --master spark://172.22.0.2:7077 --files sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py ./sample.txt
20/07/25 03:23:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/25 03:23:35 INFO SparkContext: Running Spark version 2.4.1
20/07/25 03:23:35 INFO SparkContext: Submitted application: PythonWordCount
20/07/25 03:23:35 INFO SecurityManager: Changing view acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing view acls groups to:
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls groups to:
20/07/25 03:23:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/07/25 03:23:35 INFO Utils: Successfully started service 'sparkDriver' on port 33457.
20/07/25 03:23:35 INFO SparkEnv: Registering MapOutputTracker
20/07/25 03:23:36 INFO SparkEnv: Registering BlockManagerMaster
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/25 03:23:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-dd1957de-6907-484d-a3d8-2b3b88e0c7ca
20/07/25 03:23:36 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/25 03:23:36 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/25 03:23:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/25 03:23:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://0508a77fcaad:4040
20/07/25 03:23:37 INFO SparkContext: Added file file:///usr/local/airflow/sample.txt at spark://0508a77fcaad:33457/files/sample.txt with timestamp 1595647417081
20/07/25 03:23:37 INFO Utils: Copying /usr/local/airflow/sample.txt to /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/userFiles-74f8cfe4-8a19-4d2e-8fa1-1f0bd1f0ef12/sample.txt
20/07/25 03:23:37 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://172.22.0.2:7077...
20/07/25 03:23:37 INFO TransportClientFactory: Successfully created connection to /172.22.0.2:7077 after 32 ms (0 ms spent in bootstraps)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200725032338-0003
20/07/25 03:23:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45057.
20/07/25 03:23:38 INFO NettyBlockTransferService: Server created on 0508a77fcaad:45057
20/07/25 03:23:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200725032338-0003/0 on worker-20200725025003-172.22.0.4-8881 (172.22.0.4:8881) with 2 core(s)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20200725032338-0003/0 on hostPort 172.22.0.4:8881 with 2 core(s), 1024.0 MB RAM
20/07/25 03:23:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMasterEndpoint: Registering block manager 0508a77fcaad:45057 with 366.3 MB RAM, BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200725032338-0003/0 is now RUNNING
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/07/25 03:23:38 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/usr/local/airflow/spark-warehouse').
20/07/25 03:23:38 INFO SharedState: Warehouse path is 'file:/usr/local/airflow/spark-warehouse'.
20/07/25 03:23:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/25 03:23:47 INFO FileSourceStrategy: Pruning directories with:
20/07/25 03:23:47 INFO FileSourceStrategy: Post-Scan Filters:
20/07/25 03:23:47 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
20/07/25 03:23:47 INFO FileSourceScanExec: Pushed Filters:
20/07/25 03:23:51 INFO CodeGenerator: Code generated in 2187.926234 ms
20/07/25 03:23:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 220.9 KB, free 366.1 MB)
20/07/25 03:23:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.8 KB, free 366.1 MB)
20/07/25 03:23:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 0508a77fcaad:45057 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:23:55 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0
20/07/25 03:23:55 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/07/25 03:23:57 INFO SparkContext: Starting job: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40
20/07/25 03:23:58 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.22.0.4:59324) with ID 0
20/07/25 03:23:58 INFO DAGScheduler: Registering RDD 5 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39)
20/07/25 03:23:58 INFO DAGScheduler: Got job 0 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40) with 1 output partitions
20/07/25 03:23:58 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40)
20/07/25 03:23:58 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39), which has no missing parents
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.2 KB, free 366.0 MB)
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.1 KB, free 366.0 MB)
20/07/25 03:23:58 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 0508a77fcaad:45057 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:23:58 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
20/07/25 03:23:58 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) (first 15 tasks are for partitions Vector(0))
20/07/25 03:23:58 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/07/25 03:23:58 INFO BlockManagerMasterEndpoint: Registering block manager 172.22.0.4:45435 with 366.3 MB RAM, BlockManagerId(0, 172.22.0.4, 45435, None)
20/07/25 03:23:58 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.22.0.4:45435 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:24:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.22.0.4:45435 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:24:11 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:11 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 1]
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 2]
20/07/25 03:24:12 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 3]
20/07/25 03:24:12 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
20/07/25 03:24:12 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/07/25 03:24:12 INFO TaskSchedulerImpl: Cancelling stage 0
20/07/25 03:24:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
20/07/25 03:24:12 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) failed in 13.690 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
20/07/25 03:24:12 INFO DAGScheduler: Job 0 failed: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40, took 14.579961 s
Traceback (most recent call last):
File "/opt/spark-2.4.1/examples/src/main/python/wordcount.py", line 40, in <module>
output = counts.collect()
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:13 INFO SparkContext: Invoking stop() from shutdown hook
20/07/25 03:24:13 INFO SparkUI: Stopped Spark web UI at http://0508a77fcaad:4040
20/07/25 03:24:13 INFO StandaloneSchedulerBackend: Shutting down all executors
20/07/25 03:24:13 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/07/25 03:24:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/25 03:24:16 INFO MemoryStore: MemoryStore cleared
20/07/25 03:24:16 INFO BlockManager: BlockManager stopped
20/07/25 03:24:16 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/25 03:24:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/25 03:24:16 INFO SparkContext: Successfully stopped SparkContext
20/07/25 03:24:16 INFO ShutdownHookManager: Shutdown hook called
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-2dfb2222-d56c-4ee1-ab62-86e71e5e751b
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/pyspark-2ee74d07-6606-4edc-8420-fe46212c50e5
Change your spark-submit command as shown below to submit your Spark job (add --deploy-mode cluster if you want to pass just the file name to wordcount.py):
spark-submit \
--master spark://ipaddress:7077 \
--deploy-mode cluster \
--files usr/local/airflow/sample.txt \
/opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
OR
spark-submit \
--master spark://ipaddress:7077 \
/opt/spark-2.4.1/examples/src/main/python/wordcount.py /usr/local/airflow/sample.txt

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging

I'm new to Spark. I attempted to run a Spark app (.jar) on CDH 5.8.0-0 on Oracle VirtualBox 5.1.4r110228 which leveraged Spark Streaming to perform sentiment analysis on Twitter. I have my Twitter account created and all required (4) tokens were generated. I was blocked by the NoClassDefFoundError exception.
I've been googling around for a couple of days. The best advice I found so far was at the URL below, but apparently my environment is still missing something.
http://javarevisited.blogspot.com/2011/06/noclassdeffounderror-exception-in.html#ixzz4Ia99dsp0
What does it mean for a library to show up at compile time but be missing at runtime? How can we fix this?
What is the Logging library? I came across an article stating that this Logging is going to be deprecated. Besides that, I do see log4j in my environment.
In my CDH 5.8, I'm running these versions of software:
Spark-2.0.0-bin-hadoop2.7 / spark-core_2.10-2.0.0
jdk-8u101-linux-x64 / jre-bu101-linux-x64
I appended the details of the exception at the end. Here is the procedure I followed to execute the app and some verification I did after hitting the exception:
Unzip twitter-streaming.zip (the Spark app)
cd twitter-streaming
run ./sbt/sbt assembly
Update env.sh with your Twitter account
$ cat env.sh
export SPARK_HOME=/home/cloudera/spark-2.0.0-bin-hadoop2.7
export CONSUMER_KEY=<my_consumer_key>
export CONSUMER_SECRET=<my_consumer_secret>
export ACCESS_TOKEN=<my_twitterapp_access_token>
export ACCESS_TOKEN_SECRET=<my_twitterapp_access_token>
The submit.sh script wraps the spark-submit command with the required credential info from env.sh:
$ cat submit.sh
source ./env.sh
$SPARK_HOME/bin/spark-submit --class "TwitterStreamingApp" --master local[*] ./target/scala-2.10/twitter-streaming-assembly-1.0.jar $CONSUMER_KEY $CONSUMER_SECRET $ACCESS_TOKEN $ACCESS_TOKEN_SECRET
The log of the assembly process:
[cloudera@quickstart twitter-streaming]$ ./sbt/sbt assembly
Launching sbt from sbt/sbt-launch-0.13.7.jar
[info] Loading project definition from /home/cloudera/workspace/twitter-streaming/project
[info] Set current project to twitter-streaming (in build file:/home/cloudera/workspace/twitter-streaming/)
[info] Including: twitter4j-stream-3.0.3.jar
[info] Including: twitter4j-core-3.0.3.jar
[info] Including: scala-library.jar
[info] Including: unused-1.0.0.jar
[info] Including: spark-streaming-twitter_2.10-1.4.1.jar
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[warn] Merging 'META-INF/LICENSE.txt' with strategy 'first'
[warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.spark-project.spark/unused/pom.properties' with strategy 'first'
[warn] Merging 'META-INF/maven/org.spark-project.spark/unused/pom.xml' with strategy 'first'
[warn] Merging 'log4j.properties' with strategy 'discard'
[warn] Merging 'org/apache/spark/unused/UnusedStubClass.class' with strategy 'first'
[warn] Strategy 'discard' was applied to 2 files
[warn] Strategy 'first' was applied to 4 files
[info] SHA-1: 69146d6fdecc2a97e346d36fafc86c2819d5bd8f
[info] Packaging /home/cloudera/workspace/twitter-streaming/target/scala-2.10/twitter-streaming-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 6 s, completed Aug 27, 2016 11:58:03 AM
Not sure exactly what it means, but everything looked good when I ran the Hadoop native check:
$ hadoop checknative -a
16/08/27 13:27:22 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
16/08/27 13:27:22 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: true /usr/lib/hadoop/lib/native/libsnappy.so.1
lz4: true revision:10301
bzip2: true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
Here is the console log of my exception:
$ ./submit.sh
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/08/28 20:13:23 INFO SparkContext: Running Spark version 2.0.0
16/08/28 20:13:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/28 20:13:24 WARN Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
16/08/28 20:13:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/08/28 20:13:24 INFO SecurityManager: Changing view acls to: cloudera
16/08/28 20:13:24 INFO SecurityManager: Changing modify acls to: cloudera
16/08/28 20:13:24 INFO SecurityManager: Changing view acls groups to:
16/08/28 20:13:24 INFO SecurityManager: Changing modify acls groups to:
16/08/28 20:13:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); groups with view permissions: Set(); users with modify permissions: Set(cloudera); groups with modify permissions: Set()
16/08/28 20:13:25 INFO Utils: Successfully started service 'sparkDriver' on port 37550.
16/08/28 20:13:25 INFO SparkEnv: Registering MapOutputTracker
16/08/28 20:13:25 INFO SparkEnv: Registering BlockManagerMaster
16/08/28 20:13:25 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-37a0492e-67e3-4ad5-ac38-40448c25d523
16/08/28 20:13:25 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
16/08/28 20:13:25 INFO SparkEnv: Registering OutputCommitCoordinator
16/08/28 20:13:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/08/28 20:13:25 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
16/08/28 20:13:25 INFO SparkContext: Added JAR file:/home/cloudera/workspace/twitter-streaming/target/scala-2.10/twitter-streaming-assembly-1.1.jar at spark://10.0.2.15:37550/jars/twitter-streaming-assembly-1.1.jar with timestamp 1472440405882
16/08/28 20:13:26 INFO Executor: Starting executor ID driver on host localhost
16/08/28 20:13:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41264.
16/08/28 20:13:26 INFO NettyBlockTransferService: Server created on 10.0.2.15:41264
16/08/28 20:13:26 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 41264)
16/08/28 20:13:26 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:41264 with 366.3 MB RAM, BlockManagerId(driver, 10.0.2.15, 41264)
16/08/28 20:13:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 41264)
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.streaming.twitter.TwitterUtils$.createStream(TwitterUtils.scala:44)
at TwitterStreamingApp$.main(TwitterStreamingApp.scala:42)
at TwitterStreamingApp.main(TwitterStreamingApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 23 more
16/08/28 20:13:26 INFO SparkContext: Invoking stop() from shutdown hook
16/08/28 20:13:26 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
16/08/28 20:13:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/08/28 20:13:26 INFO MemoryStore: MemoryStore cleared
16/08/28 20:13:26 INFO BlockManager: BlockManager stopped
16/08/28 20:13:26 INFO BlockManagerMaster: BlockManagerMaster stopped
16/08/28 20:13:26 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/08/28 20:13:26 INFO SparkContext: Successfully stopped SparkContext
16/08/28 20:13:26 INFO ShutdownHookManager: Shutdown hook called
16/08/28 20:13:26 INFO ShutdownHookManager: Deleting directory /tmp/spark-5e29c3b2-74c2-4d89-970f-5be89d176b26
I understand my post was lengthy. Your advice or insights are highly appreciated!!
-jsung8
Use: spark-core_2.11-1.5.2.jar
I had the same problem described by @jsung8 and tried to find the .jar suggested by @youngstephen, but could not. However, linking in spark-core_2.11-1.5.2.jar instead of spark-core_2.11-1.5.2.logging.jar resolved the exception in the way @youngstephen suggested.
org/apache/spark/Logging was removed after Spark 1.5.2.
Since your spark-core version is 2.0, the simplest workaround is to download a standalone spark-core_2.11-1.5.2.logging.jar and put it in the jars directory under your Spark root directory.
That solved my problem; hope it helps.
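If your build uses sbt (the assembly jar path in the log above suggests it does), a related workaround is to pin the older core in the build itself rather than dropping a jar into the Spark distribution. A minimal sketch, assuming the rest of your dependencies can stay on the 1.5.x line:
// Sketch only: pin spark-core to 1.5.2, the version suggested above,
// so org.apache.spark.Logging is back on the classpath.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"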
One possible cause of this problem is a library/class conflict.
I faced the same issue and solved it with a few Maven exclusions:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
<exclusions>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.0.0</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
You're using an old version of the Spark Twitter connector. This class from your stack trace hints at that:
org.apache.spark.streaming.twitter.TwitterUtils
Spark removed that integration in version 2.0. You're using the connector from an older Spark release, and it references the old Logging class, which was moved to a different package in Spark 2.0.
If you want to use Spark 2.0, you'll need to use the Twitter connector from the Bahir project.
Spark core would have to be downgraded to 1.5 because of the error below:
java.lang.NoClassDefFoundError: org/apache/spark/Logging
However, http://bahir.apache.org/docs/spark/2.0.0/spark-streaming-twitter/ provides a better solution. Adding the dependency below resolved my issue:
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-twitter_2.11</artifactId>
<version>2.0.0</version>
</dependency>
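If you build with sbt rather than Maven, the equivalent coordinate would look roughly like this (a sketch; keep the version in step with your Spark 2.0.x release):
// Sketch: Bahir's Twitter connector as an sbt dependency; %% picks the matching Scala suffix.
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0"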

Scala application using spark cassandra connector hangs

I am developing a test application in IntelliJ using Scala and the Spark Cassandra connector. Here is my build.sbt:
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark".%%("spark-sql") % "1.6.1"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0-M2"
I have created a Cassandra cluster with 4 nodes using ccm. The keyspace I created has a replication factor of 3. Here is the code in my Scala app:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("SparkCassandra")
  //set Cassandra host address as your local address
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("excelsior", "emp")
val total = rdd.count()
println(total)
println("exiting now:")
sc.stop()
But the Spark job hangs at the following line:
CassandraConnector: Disconnected from Cassandra cluster: cluster4nodes
Only 3 tasks out of 4 are completed. Here is the full log:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/07/08 14:23:26 INFO SparkContext: Running Spark version 1.6.0
16/07/08 14:23:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/08 14:23:27 WARN Utils: Your hostname, renuka-Inspiron-3542 resolves to a loopback address: 127.0.1.1; using 192.168.1.189 instead (on interface wlan0)
16/07/08 14:23:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/07/08 14:23:27 INFO SecurityManager: Changing view acls to: renuka
16/07/08 14:23:27 INFO SecurityManager: Changing modify acls to: renuka
16/07/08 14:23:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(renuka); users with modify permissions: Set(renuka)
16/07/08 14:23:28 INFO Utils: Successfully started service 'sparkDriver' on port 41027.
16/07/08 14:23:28 INFO Slf4jLogger: Slf4jLogger started
16/07/08 14:23:28 INFO Remoting: Starting remoting
16/07/08 14:23:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.189:46329]
16/07/08 14:23:28 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 46329.
16/07/08 14:23:29 INFO SparkEnv: Registering MapOutputTracker
16/07/08 14:23:29 INFO SparkEnv: Registering BlockManagerMaster
16/07/08 14:23:29 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-d1db2ea9-3fba-4b64-ad2b-80deddd3f05a
16/07/08 14:23:29 INFO MemoryStore: MemoryStore started with capacity 1091.3 MB
16/07/08 14:23:29 INFO SparkEnv: Registering OutputCommitCoordinator
16/07/08 14:23:29 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/08 14:23:29 INFO SparkUI: Started SparkUI at http://192.168.1.189:4040
16/07/08 14:23:30 INFO Executor: Starting executor ID driver on host localhost
16/07/08 14:23:30 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40956.
16/07/08 14:23:30 INFO NettyBlockTransferService: Server created on 40956
16/07/08 14:23:30 INFO BlockManagerMaster: Trying to register BlockManager
16/07/08 14:23:30 INFO BlockManagerMasterEndpoint: Registering block manager localhost:40956 with 1091.3 MB RAM, BlockManagerId(driver, localhost, 40956)
16/07/08 14:23:30 INFO BlockManagerMaster: Registered BlockManager
16/07/08 14:23:30 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
16/07/08 14:23:31 INFO Cluster: New Cassandra host /127.0.0.1:9042 added
16/07/08 14:23:31 INFO Cluster: New Cassandra host /127.0.0.2:9042 added
16/07/08 14:23:31 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.2 (datacenter1)
16/07/08 14:23:31 INFO Cluster: New Cassandra host /127.0.0.3:9042 added
16/07/08 14:23:31 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.3 (datacenter1)
16/07/08 14:23:31 INFO Cluster: New Cassandra host /127.0.0.4:9042 added
16/07/08 14:23:31 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.4 (datacenter1)
16/07/08 14:23:31 INFO CassandraConnector: Connected to Cassandra cluster: cluster4nodes
16/07/08 14:23:31 INFO SparkContext: Starting job: count at hello.scala:36
16/07/08 14:23:32 INFO DAGScheduler: Got job 0 (count at hello.scala:36) with 4 output partitions
16/07/08 14:23:32 INFO DAGScheduler: Final stage: ResultStage 0 (count at hello.scala:36)
16/07/08 14:23:32 INFO DAGScheduler: Parents of final stage: List()
16/07/08 14:23:32 INFO DAGScheduler: Missing parents: List()
16/07/08 14:23:32 INFO DAGScheduler: Submitting ResultStage 0 (CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:18), which has no missing parents
16/07/08 14:23:32 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.2 KB, free 7.2 KB)
16/07/08 14:23:32 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.7 KB, free 10.9 KB)
16/07/08 14:23:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40956 (size: 3.7 KB, free: 1091.2 MB)
16/07/08 14:23:32 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/07/08 14:23:32 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:18)
16/07/08 14:23:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
16/07/08 14:23:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,NODE_LOCAL, 3530 bytes)
16/07/08 14:23:32 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,NODE_LOCAL, 3530 bytes)
16/07/08 14:23:32 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, partition 2,NODE_LOCAL, 3530 bytes)
16/07/08 14:23:32 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
16/07/08 14:23:32 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/07/08 14:23:32 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/08 14:23:34 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2082 bytes result sent to driver
16/07/08 14:23:34 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2082 bytes result sent to driver
16/07/08 14:23:34 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2082 bytes result sent to driver
16/07/08 14:23:34 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2089 ms on localhost (1/4)
16/07/08 14:23:34 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2179 ms on localhost (2/4)
16/07/08 14:23:34 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2117 ms on localhost (3/4)
16/07/08 14:23:41 INFO CassandraConnector: Disconnected from Cassandra cluster: cluster4nodes
If I create the keyspace with replication factor 4 on the 4-node cluster, the app works fine and never hangs. Am I missing anything in the configuration? Thanks in advance.

Spark: EOFException when reading from HDFS

I just started playing with Spark, so I ran the SimpleApp program from the tutorial (https://spark.apache.org/docs/1.0.0/quick-start.html), and it works fine.
However, if I change the file location from local to HDFS, I get an EOFException.
Some searching online suggests this error is caused by Hadoop version conflicts. I made the suggested modification in my sbt file, but I still get the same error.
I am using CDH 5.1; the code and full error log are below. Any help is greatly appreciated.
Thanks
Scala:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "hdfs://plogs001.sjc.domain.com:8020/tmp/data.txt" // Should be some file on your system
    val conf = new SparkConf()
      .setMaster("spark://plogs004.sjc.domain.com:7077")
      .setAppName("SimpleApp")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    //val logFile = "/tmp/data.txt" // Should be some file on your system
    //val conf = new SparkConf().setAppName("Simple Application")
    //val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
SBT:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.1.0"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
Error Log:
[hdfs@plogs001 test1]$ spark-submit --class SimpleApp --master spark://spark@plogs004.sjc.domain.com:7077 target/scala-2.10/simple-project_2.10-1.0.jar
14/09/09 16:56:41 INFO spark.SecurityManager: Changing view acls to: hdfs
14/09/09 16:56:41 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs)
14/09/09 16:56:41 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/09/09 16:56:41 INFO Remoting: Starting remoting
14/09/09 16:56:41 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@plogs001.sjc.domain.com:34607]
14/09/09 16:56:41 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@plogs001.sjc.domain.com:34607]
14/09/09 16:56:41 INFO spark.SparkEnv: Registering MapOutputTracker
14/09/09 16:56:41 INFO spark.SparkEnv: Registering BlockManagerMaster
14/09/09 16:56:41 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140909165641-375e
14/09/09 16:56:41 INFO storage.MemoryStore: MemoryStore started with capacity 294.9 MB.
14/09/09 16:56:41 INFO network.ConnectionManager: Bound socket to port 40833 with id = ConnectionManagerId(plogs001.sjc.domain.com,40833)
14/09/09 16:56:41 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/09/09 16:56:41 INFO storage.BlockManagerInfo: Registering block manager plogs001.sjc.domain.com:40833 with 294.9 MB RAM
14/09/09 16:56:41 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/09 16:56:41 INFO spark.HttpServer: Starting HTTP Server
14/09/09 16:56:42 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/09 16:56:42 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:47419
14/09/09 16:56:42 INFO broadcast.HttpBroadcast: Broadcast server started at http://172.16.30.161:47419
14/09/09 16:56:42 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-7026d0b6-777e-4dd3-9bbb-e79d7487e7d7
14/09/09 16:56:42 INFO spark.HttpServer: Starting HTTP Server
14/09/09 16:56:42 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/09 16:56:42 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:42388
14/09/09 16:56:42 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/09 16:56:42 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/09/09 16:56:42 INFO ui.SparkUI: Started SparkUI at http://plogs001.sjc.domain.com:4040
14/09/09 16:56:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/09 16:56:42 INFO spark.SparkContext: Added JAR file:/home/hdfs/kent/test1/target/scala-2.10/simple-project_2.10-1.0.jar at http://172.16.30.161:42388/jars/simple-project_2.10-1.0.jar with timestamp 1410307002737
14/09/09 16:56:42 INFO client.AppClient$ClientActor: Connecting to master spark://plogs004.sjc.domain.com:7077...
14/09/09 16:56:42 INFO storage.MemoryStore: ensureFreeSpace(155704) called with curMem=0, maxMem=309225062
14/09/09 16:56:42 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 152.1 KB, free 294.8 MB)
14/09/09 16:56:42 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140909165642-0041
14/09/09 16:56:42 INFO client.AppClient$ClientActor: Executor added: app-20140909165642-0041/0 on worker-20140902113555-plogs005.sjc.domain.com-7078 (plogs005.sjc.domain.com:7078) with 24 cores
14/09/09 16:56:42 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140909165642-0041/0 on hostPort plogs005.sjc.domain.com:7078 with 24 cores, 1024.0 MB RAM
14/09/09 16:56:42 INFO client.AppClient$ClientActor: Executor added: app-20140909165642-0041/1 on worker-20140902113555-plogs006.sjc.domain.com-7078 (plogs006.sjc.domain.com:7078) with 24 cores
14/09/09 16:56:42 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140909165642-0041/1 on hostPort plogs006.sjc.domain.com:7078 with 24 cores, 1024.0 MB RAM
14/09/09 16:56:42 INFO client.AppClient$ClientActor: Executor added: app-20140909165642-0041/2 on worker-20140902113556-plogs004.sjc.domain.com-7078 (plogs004.sjc.domain.com:7078) with 24 cores
14/09/09 16:56:42 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140909165642-0041/2 on hostPort plogs004.sjc.domain.com:7078 with 24 cores, 1024.0 MB RAM
14/09/09 16:56:42 INFO client.AppClient$ClientActor: Executor updated: app-20140909165642-0041/2 is now RUNNING
14/09/09 16:56:42 INFO client.AppClient$ClientActor: Executor updated: app-20140909165642-0041/1 is now RUNNING
14/09/09 16:56:42 INFO client.AppClient$ClientActor: Executor updated: app-20140909165642-0041/0 is now RUNNING
14/09/09 16:56:43 INFO mapred.FileInputFormat: Total input paths to process : 1
14/09/09 16:56:43 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:22
14/09/09 16:56:43 INFO scheduler.DAGScheduler: Got job 0 (count at SimpleApp.scala:22) with 2 output partitions (allowLocal=false)
14/09/09 16:56:43 INFO scheduler.DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:22)
14/09/09 16:56:43 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/09/09 16:56:43 INFO scheduler.DAGScheduler: Missing parents: List()
14/09/09 16:56:43 INFO scheduler.DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:22), which has no missing parents
14/09/09 16:56:43 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:22)
14/09/09 16:56:43 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/09/09 16:56:44 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@plogs005.sjc.domain.com:59110/user/Executor#181141295] with ID 0
14/09/09 16:56:44 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: plogs005.sjc.domain.com (PROCESS_LOCAL)
14/09/09 16:56:44 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1915 bytes in 2 ms
14/09/09 16:56:44 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor 0: plogs005.sjc.domain.com (PROCESS_LOCAL)
14/09/09 16:56:44 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1915 bytes in 0 ms
14/09/09 16:56:44 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@plogs006.sjc.domain.com:45192/user/Executor#2003979349] with ID 1
14/09/09 16:56:44 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@plogs004.sjc.domain.com:46711/user/Executor#-1654256828] with ID 2
14/09/09 16:56:44 INFO storage.BlockManagerInfo: Registering block manager plogs005.sjc.domain.com:36798 with 589.2 MB RAM
14/09/09 16:56:44 INFO storage.BlockManagerInfo: Registering block manager plogs004.sjc.domain.com:40459 with 589.2 MB RAM
14/09/09 16:56:44 INFO storage.BlockManagerInfo: Registering block manager plogs006.sjc.domain.com:54696 with 589.2 MB RAM
14/09/09 16:56:45 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
14/09/09 16:56:45 WARN scheduler.TaskSetManager: Loss was due to java.io.EOFException
java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744)
at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1032)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:260)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:252)
at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:147)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/09/09 16:56:45 WARN scheduler.TaskSetManager: Lost TID 1 (task 0.0:1)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Loss was due to java.io.EOFException [duplicate 1]
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 2 on executor 2: plogs004.sjc.domain.com (NODE_LOCAL)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1915 bytes in 1 ms
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 3 on executor 1: plogs006.sjc.domain.com (NODE_LOCAL)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1915 bytes in 0 ms
14/09/09 16:56:45 WARN scheduler.TaskSetManager: Lost TID 3 (task 0.0:0)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Loss was due to java.io.EOFException [duplicate 2]
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 4 on executor 2: plogs004.sjc.domain.com (NODE_LOCAL)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1915 bytes in 1 ms
14/09/09 16:56:45 WARN scheduler.TaskSetManager: Lost TID 2 (task 0.0:1)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Loss was due to java.io.EOFException [duplicate 3]
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 5 on executor 2: plogs004.sjc.domain.com (NODE_LOCAL)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1915 bytes in 0 ms
14/09/09 16:56:45 WARN scheduler.TaskSetManager: Lost TID 4 (task 0.0:0)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Loss was due to java.io.EOFException [duplicate 4]
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 6 on executor 2: plogs004.sjc.domain.com (NODE_LOCAL)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1915 bytes in 0 ms
14/09/09 16:56:45 WARN scheduler.TaskSetManager: Lost TID 5 (task 0.0:1)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Loss was due to java.io.EOFException [duplicate 5]
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 7 on executor 0: plogs005.sjc.domain.com (NODE_LOCAL)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1915 bytes in 0 ms
14/09/09 16:56:45 WARN scheduler.TaskSetManager: Lost TID 6 (task 0.0:0)
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Loss was due to java.io.EOFException [duplicate 6]
14/09/09 16:56:45 ERROR scheduler.TaskSetManager: Task 0.0:0 failed 4 times; aborting job
14/09/09 16:56:45 INFO scheduler.DAGScheduler: Failed to run count at SimpleApp.scala:22
Exception in thread "main" 14/09/09 16:56:45 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host plogs004.sjc.domain.com: java.io.EOFException
java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744)
java.io.ObjectInputStream.readFully(ObjectInputStream.java:1032)
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
org.apache.hadoop.io.UTF8.readChars(UTF8.java:260)
org.apache.hadoop.io.UTF8.readString(UTF8.java:252)
org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:147)
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/09/09 16:56:45 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/09/09 16:56:45 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
14/09/09 16:56:45 INFO scheduler.TaskSetManager: Loss was due to java.io.EOFException [duplicate 7]
14/09/09 16:56:45 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

Resources