"Application attempt...doesn't exist in ApplicationMasterService cache” cause? (Pregel: maxIterations impact on cluster for non-convergent algorithm) - apache-spark

I tried to run my own Pregel-based method on a relatively small graph (250k vertices, 1.5M edges). The algorithm I use is very likely non-convergent, so in most cases the maxIterations setting acts as a hard stop that ends all calculations.
I'm using AWS EMR with Apache Spark and m5.2xlarge instances for all nodes, in a setup with EMR-managed scaling. Initially, the cluster is set to run 1 master and 4 worker nodes, with expansion up to a maximum of 8.
With the same cluster setup, I increased maxIterations gradually from 100 to 500 in steps of 100 [100, 200, 300, 400, 500]. I assumed that a setup sufficient for 100 iterations would also be sufficient for any other number, because memory that is no longer used would be freed.
However, when I ran the set of jobs with maxIterations increasing from 100 to 500, I found that all jobs with maxIterations > 100 were terminated due to a step error. I checked the Spark logs for issues, and this is what I got:
log start
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt1/yarn/usercache/hadoop/filecache/10/__spark_libs__364046395941885636.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for TERM
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for HUP
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for INT
21/02/13 21:23:24 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/13 21:23:24 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/13 21:23:24 INFO SecurityManager: Changing view acls groups to:
21/02/13 21:23:24 INFO SecurityManager: Changing modify acls groups to:
21/02/13 21:23:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:24 INFO ApplicationMaster: Preparing Local resources
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:25 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1613251201422_0001_000001
21/02/13 21:23:25 INFO ApplicationMaster: Starting the user application in a separate Thread
21/02/13 21:23:25 INFO ApplicationMaster: Waiting for spark context initialization...
21/02/13 21:23:25 INFO SparkContext: Running Spark version 2.4.7-amzn-0
21/02/13 21:23:25 INFO SparkContext: Submitted application: Read JDBC Datasites2
21/02/13 21:23:25 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/13 21:23:25 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/13 21:23:25 INFO SecurityManager: Changing view acls groups to:
21/02/13 21:23:25 INFO SecurityManager: Changing modify acls groups to:
21/02/13 21:23:25 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:25 INFO Utils: Successfully started service 'sparkDriver' on port 41117.
21/02/13 21:23:25 INFO SparkEnv: Registering MapOutputTracker
21/02/13 21:23:25 INFO SparkEnv: Registering BlockManagerMaster
21/02/13 21:23:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/02/13 21:23:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/02/13 21:23:25 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1613251201422_0001/blockmgr-bc544c91-1a59-41f3-890f-faaa392bea09
21/02/13 21:23:25 INFO DiskBlockManager: Created local directory at /mnt1/yarn/usercache/hadoop/appcache/application_1613251201422_0001/blockmgr-14e3f36f-6d3f-4ffe-a28c-fa3f81f0c5c9
21/02/13 21:23:26 INFO MemoryStore: MemoryStore started with capacity 1008.9 MB
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:26 INFO SparkEnv: Registering OutputCommitCoordinator
21/02/13 21:23:26 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
21/02/13 21:23:26 INFO Utils: Successfully started service 'SparkUI' on port 43659.
21/02/13 21:23:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ip-172-31-21-88.ec2.internal:43659
21/02/13 21:23:26 INFO YarnClusterScheduler: Created YarnClusterScheduler
21/02/13 21:23:26 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1613251201422_0001 and attemptId Some(appattempt_1613251201422_0001_000001)
21/02/13 21:23:26 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34665.
21/02/13 21:23:26 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:26 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:27 INFO RMProxy: Connecting to ResourceManager at ip-172-31-29-
command:
LD_LIBRARY_PATH=\"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH\" \
{{JAVA_HOME}}/bin/java \
-server \
-Xmx4743m \
'-verbose:gc' \
'-XX:+PrintGCDetails' \
'-XX:+PrintGCDateStamps' \
'-XX:OnOutOfMemoryError=kill -9 %p' \
'-XX:+UseParallelGC' \
'-XX:InitiatingHeapOccupancyPercent=70' \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.history.ui.port=18080' \
'-Dspark.ui.port=0' \
'-Dspark.driver.port=41117' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler#ip-172-31-21-88.ec2.internal:41117 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
2 \
--app-id \
application_1613251201422_0001 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
__app__.jar -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/force-pregel.jar" } size: 27378 timestamp: 1613251399566 type: FILE visibility: PRIVATE
__spark_libs__ -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/__spark_libs__364046395941885636.zip" } size: 239655683 timestamp: 1613251397751 type: ARCHIVE visibility: PRIVATE
__spark_conf__ -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/__spark_conf__.zip" } size: 274365 timestamp: 1613251399776 type: ARCHIVE visibility: PRIVATE
hive-site.xml -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/hive-site.xml" } size: 2137 timestamp: 1613251399631 type: FILE visibility: PRIVATE
===============================================================================
21/02/13 21:23:27 INFO Configuration: resource-types.xml not found
21/02/13 21:23:27 INFO ResourceUtils: Unable to find 'resource-types.xml'.
21/02/13 21:23:27 INFO ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
21/02/13 21:23:27 INFO ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
21/02/13 21:23:27 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:27 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM#ip-172-31-21-88.ec2.internal:41117)
21/02/13 21:23:27 INFO YarnAllocator: Will request up to 100 executor container(s), each with <memory:5632, max memory:2147483647, vCores:2, max vCores:2147483647>
21/02/13 21:23:27 INFO YarnAllocator: Submitted 100 unlocalized container requests.
21/02/13 21:23:27 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
21/02/13 21:23:27 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000002 on host ip-172-31-21-88.ec2.internal for executor with ID 1 with resources <memory:5632, max memory:12288, vCores:1, max vCores:8>
21/02/13 21:23:27 INFO YarnAllocator: Launching executor with 4742m of heap (plus 890m overhead) and 2 cores
21/02/13 21:23:27 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
21/02/13 21:23:28 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000004 on host ip-172-31-25-102.ec2.internal for executor with ID 2 with resources <memory:11264, vCores:2>
21/02/13 21:23:28 INFO YarnAllocator: Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores
21/02/13 21:23:28 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000006 on host ip-172-31-28-143.ec2.internal for executor with ID 3 with resources <memory:11264, vCores:2>
21/02/13 21:23:28 INFO YarnAllocator: Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores
21/02/13 21:23:28 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
30 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.21.88:53634) with ID 1
21/02/13 21:23:30 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
21/02/13 21:23:30 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-21-88.ec2.internal:45667 with 2.3 GB RAM, BlockManagerId(1, ip-172-31-21-88.ec2.internal, 45667, None)
This is followed by approximately 2 MB of similar output, and then the job finishes with:
21/02/13 21:28:25 INFO TaskSetManager: Finished task 199.0 in stage 37207.0 (TID 93528) in 8 ms on ip-172-31-25-102.ec2.internal (executor 2) (158/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_31 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 252.3 KB, free: 2.1 GB)
21/02/13 21:28:25 ERROR ApplicationMaster: Exception from Reporter thread.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy23.allocate(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:300)
at org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:279)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$allocationThreadImpl(ApplicationMaster.scala:541)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:607)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
at org.apache.hadoop.ipc.Client.call(Client.java:1495)
at org.apache.hadoop.ipc.Client.call(Client.java:1394)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy22.allocate(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
... 13 more
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_30 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 244.8 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 40.0 in stage 37207.0 (TID 93533, ip-172-31-21-88.ec2.internal, executor 1, partition 40, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 31.0 in stage 37207.0 (TID 93532) in 16 ms on ip-172-31-21-88.ec2.internal (executor 1) (162/200)
21/02/13 21:28:25 INFO ApplicationMaster: Final app status: FAILED, exitCode: 12, (reason: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)
)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 41.0 in stage 37207.0 (TID 93534, ip-172-31-21-88.ec2.internal, executor 1, partition 41, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 30.0 in stage 37207.0 (TID 93531) in 22 ms on ip-172-31-21-88.ec2.internal (executor 1) (163/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_40 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 234.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 48.0 in stage 37207.0 (TID 93535, ip-172-31-21-88.ec2.internal, executor 1, partition 48, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 40.0 in stage 37207.0 (TID 93533) in 17 ms on ip-172-31-21-88.ec2.internal (executor 1) (164/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_41 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 233.4 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 51.0 in stage 37207.0 (TID 93536, ip-172-31-21-88.ec2.internal, executor 1, partition 51, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 41.0 in stage 37207.0 (TID 93534) in 15 ms on ip-172-31-21-88.ec2.internal (executor 1) (165/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_48 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 235.1 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 57.0 in stage 37207.0 (TID 93537, ip-172-31-21-88.ec2.internal, executor 1, partition 57, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 48.0 in stage 37207.0 (TID 93535) in 11 ms on ip-172-31-21-88.ec2.internal (executor 1) (166/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_57 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 232.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_51 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 244.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 61.0 in stage 37207.0 (TID 93538, ip-172-31-21-88.ec2.internal, executor 1, partition 61, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 57.0 in stage 37207.0 (TID 93537) in 10 ms on ip-172-31-21-88.ec2.internal (executor 1) (167/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 63.0 in stage 37207.0 (TID 93539, ip-172-31-21-88.ec2.internal, executor 1, partition 63, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 51.0 in stage 37207.0 (TID 93536) in 17 ms on ip-172-31-21-88.ec2.internal (executor 1) (168/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_61 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 228.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 67.0 in stage 37207.0 (TID 93540, ip-172-31-21-88.ec2.internal, executor 1, partition 67, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 61.0 in stage 37207.0 (TID 93538) in 10 ms on ip-172-31-21-88.ec2.internal (executor 1) (169/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_63 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 238.3 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 71.0 in stage 37207.0 (TID 93541, ip-172-31-21-88.ec2.internal, executor 1, partition 71, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 63.0 in stage 37207.0 (TID 93539) in 14 ms on ip-172-31-21-88.ec2.internal (executor 1) (170/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_67 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 247.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_71 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 243.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 77.0 in stage 37207.0 (TID 93542, ip-172-31-21-88.ec2.internal, executor 1, partition 77, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 67.0 in stage 37207.0 (TID 93540) in 18 ms on ip-172-31-21-88.ec2.internal (executor 1) (171/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 79.0 in stage 37207.0 (TID 93543, ip-172-31-21-88.ec2.internal, executor 1, partition 79, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 71.0 in stage 37207.0 (TID 93541) in 12 ms on ip-172-31-21-88.ec2.internal (executor 1) (172/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_79 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 253.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_77 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 222.5 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 86.0 in stage 37207.0 (TID 93544, ip-172-31-21-88.ec2.internal, executor 1, partition 86, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 79.0 in stage 37207.0 (TID 93543) in 12 ms on ip-172-31-21-88.ec2.internal (executor 1) (173/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 87.0 in stage 37207.0 (TID 93545, ip-172-31-21-88.ec2.internal, executor 1, partition 87, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 77.0 in stage 37207.0 (TID 93542) in 14 ms on ip-172-31-21-88.ec2.internal (executor 1) (174/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_86 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 254.5 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_87 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 267.1 KB, free: 2.1 GB)
Am I correct that Pregel doesn't finish 200 or more iterations because of an OutOfMemory error on some of the cluster nodes?
If so, how does Pregel work such that 100 iterations do not cause it but 200 or 300 do? My understanding before this issue was that Pregel, like many other iterative approaches, only stores the values and results of the previous and current iterations. The values change from iteration to iteration, but their quantity does not grow: it is still a graph with 250k vertices and 1.5M edges, and only the messages valid for the current iteration are added to the heap.
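The mental model described above can be written down as a toy, single-process loop. This is only a sketch of the Pregel idea and not GraphX's implementation: the function pregel_toy and the two-vertex graph are hypothetical, and the parameter names merely echo GraphX's vprog, sendMsg and mergeMsg. In this toy the per-vertex state keeps a constant size, and max_iterations is the only thing that stops a program that always sends messages. Whether the same holds for resource use on a real cluster is the open question here.

```python
def pregel_toy(vertices, edges, initial_msg, max_iterations, vprog, send_msg, merge_msg):
    """Toy model of the Pregel loop (hypothetical helper, not the GraphX API)."""
    # Superstep 0: every vertex receives the initial message.
    state = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    iterations = 0
    while iterations < max_iterations:
        # Collect the messages for this superstep only; older messages are not kept.
        inbox = {}
        for src, dst in edges:
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages left: the algorithm has converged
        for vid, msg in inbox.items():
            state[vid] = vprog(vid, state[vid], msg)
        iterations += 1
    return state, iterations


# A deliberately non-convergent program: every edge always sends a message.
vertices = {1: 0, 2: 0}
edges = [(1, 2), (2, 1)]

def add_message(vid, attr, msg):
    return attr + msg

def always_send(src, src_attr, dst, dst_attr):
    return [(dst, 1)]

def sum_messages(a, b):
    return a + b

for limit in (100, 200):
    state, iterations = pregel_toy(
        vertices, edges, 0, limit, add_message, always_send, sum_messages
    )
    # The number of stored vertex values stays the same regardless of the limit.
    print(limit, iterations, len(state))
```

With a program that never stops sending, the loop runs exactly max_iterations supersteps, which matches the "hard stop" behaviour described earlier.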
Throughout the log I was not able to find any indication of low memory and, as seen above, there are gigabytes of it available on each node before the job terminates.
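One way to double-check that observation on a 2 MB log is to scan it mechanically. The sketch below is a minimal example that assumes the "yy/MM/dd HH:mm:ss LEVEL Logger: message" layout seen in the excerpt; parse_line and the SAMPLE list are hypothetical, and SAMPLE holds only a few lines copied from the log above. It reports the time between the first log line and the first ERROR, and looks for the literal text java.lang.OutOfMemoryError.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "21/02/13 21:23:24 INFO SignalUtils: message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (INFO|WARN|ERROR) (\S+): (.*)$"
)

def parse_line(line):
    """Return (timestamp, level, logger, message), or None for non-matching lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # e.g. stack-trace lines or truncated entries
    ts, level, logger, message = match.groups()
    return datetime.strptime(ts, "%y/%m/%d %H:%M:%S"), level, logger, message

# A few lines copied from the log above; a real run would read the whole file.
SAMPLE = [
    "21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for TERM",
    "21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_31 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 252.3 KB, free: 2.1 GB)",
    "21/02/13 21:28:25 ERROR ApplicationMaster: Exception from Reporter thread.",
    "org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.",
]

records = [r for r in map(parse_line, SAMPLE) if r is not None]
errors = [r for r in records if r[1] == "ERROR"]
# Seconds between the first parsed line and the first ERROR line.
elapsed_seconds = (errors[0][0] - records[0][0]).total_seconds()
oom_lines = [line for line in SAMPLE if "java.lang.OutOfMemoryError" in line]

print(elapsed_seconds, len(errors), len(oom_lines))
```

On these sample lines the first ERROR appears 301 seconds after the first log line and no OutOfMemoryError text is present, which is consistent with the observation above but says nothing yet about the rest of the log.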

Related

FileNotFoundException on submitting Spark Jobs to remote

I've created an environment with 3 Docker containers: one for Airflow, using the puckel/docker-airflow image with Spark and Hadoop additionally installed, and two others that imitate a Spark master and worker (created from the gettyimages/spark Docker image). All 3 containers are connected via a bridge network, so they can all communicate with each other.
What I'm trying to do next is submit a Spark job from the Airflow container to the Spark cluster (master).
As an initial example, I'm using the wordcount sample script. I created a sample.txt file in the Airflow container at path usr/local/airflow/sample.txt. I opened a bash shell in the Airflow container, and I'm using the command below to run wordcount.py on the Spark master, located at the IP address I found by inspecting the bridge network.
spark-submit --master spark://ipaddress:7077 --files usr/local/airflow/sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
After submitting the script, I can see from the logs that a connection to the master was established (from the Airflow container) and that the file specified by --files was copied to the master and worker, but then it errors out with:
java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
As per my understanding (which could be wrong), when we specify files to copy to the master using --files, we can access them directly via the file name (sample.txt in my case). So what I'm trying to figure out is: if the job has been submitted and the file has been copied to the master, why is it searching in the location file:/usr/local/airflow/sample.txt? How do I make it refer to the correct path?
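One possible reading of where that path comes from can be checked with plain path arithmetic. The sketch below does not call Spark at all. It assumes the script argument is resolved against the working directory of the shell that ran spark-submit and is given the local file: scheme; resolve_on_driver is a hypothetical helper, and the inputs are taken from the invocation shown in the full log (working directory /usr/local/airflow, argument ./sample.txt).

```python
import posixpath

def resolve_on_driver(cwd, path_arg):
    """Hypothetical helper: qualify a relative path against a working directory."""
    return "file:" + posixpath.normpath(posixpath.join(cwd, path_arg))

# Values taken from the spark-submit invocation in the full log.
driver_cwd = "/usr/local/airflow"
script_arg = "./sample.txt"

resolved = resolve_on_driver(driver_cwd, script_arg)
reported = "file:/usr/local/airflow/sample.txt"  # path from the FileNotFoundException

print(resolved)
print(resolved == reported)
```

Under that assumption the resolved path is identical to the one in the exception, which would mean the job is looking for a path on the submitting container rather than for the copy that --files distributed.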
I apologize, as this question has been asked a couple of times, but I've read all the related questions on Stack Overflow and I'm still unable to resolve this. I'd really appreciate y'all's help on this.
Thanks.
The full log is below:
user#machine:/usr/local/airflow# spark-submit --master spark://172.22.0.2:7077 --files sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py ./sample.txt
20/07/25 03:23:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/25 03:23:35 INFO SparkContext: Running Spark version 2.4.1
20/07/25 03:23:35 INFO SparkContext: Submitted application: PythonWordCount
20/07/25 03:23:35 INFO SecurityManager: Changing view acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing view acls groups to:
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls groups to:
20/07/25 03:23:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/07/25 03:23:35 INFO Utils: Successfully started service 'sparkDriver' on port 33457.
20/07/25 03:23:35 INFO SparkEnv: Registering MapOutputTracker
20/07/25 03:23:36 INFO SparkEnv: Registering BlockManagerMaster
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/25 03:23:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-dd1957de-6907-484d-a3d8-2b3b88e0c7ca
20/07/25 03:23:36 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/25 03:23:36 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/25 03:23:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/25 03:23:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://0508a77fcaad:4040
20/07/25 03:23:37 INFO SparkContext: Added file file:///usr/local/airflow/sample.txt at spark://0508a77fcaad:33457/files/sample.txt with timestamp 1595647417081
20/07/25 03:23:37 INFO Utils: Copying /usr/local/airflow/sample.txt to /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/userFiles-74f8cfe4-8a19-4d2e-8fa1-1f0bd1f0ef12/sample.txt
20/07/25 03:23:37 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://172.22.0.2:7077...
20/07/25 03:23:37 INFO TransportClientFactory: Successfully created connection to /172.22.0.2:7077 after 32 ms (0 ms spent in bootstraps)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200725032338-0003
20/07/25 03:23:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45057.
20/07/25 03:23:38 INFO NettyBlockTransferService: Server created on 0508a77fcaad:45057
20/07/25 03:23:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200725032338-0003/0 on worker-20200725025003-172.22.0.4-8881 (172.22.0.4:8881) with 2 core(s)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20200725032338-0003/0 on hostPort 172.22.0.4:8881 with 2 core(s), 1024.0 MB RAM
20/07/25 03:23:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMasterEndpoint: Registering block manager 0508a77fcaad:45057 with 366.3 MB RAM, BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200725032338-0003/0 is now RUNNING
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/07/25 03:23:38 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/usr/local/airflow/spark-warehouse').
20/07/25 03:23:38 INFO SharedState: Warehouse path is 'file:/usr/local/airflow/spark-warehouse'.
20/07/25 03:23:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/25 03:23:47 INFO FileSourceStrategy: Pruning directories with:
20/07/25 03:23:47 INFO FileSourceStrategy: Post-Scan Filters:
20/07/25 03:23:47 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
20/07/25 03:23:47 INFO FileSourceScanExec: Pushed Filters:
20/07/25 03:23:51 INFO CodeGenerator: Code generated in 2187.926234 ms
20/07/25 03:23:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 220.9 KB, free 366.1 MB)
20/07/25 03:23:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.8 KB, free 366.1 MB)
20/07/25 03:23:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 0508a77fcaad:45057 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:23:55 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0
20/07/25 03:23:55 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/07/25 03:23:57 INFO SparkContext: Starting job: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40
20/07/25 03:23:58 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.22.0.4:59324) with ID 0
20/07/25 03:23:58 INFO DAGScheduler: Registering RDD 5 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39)
20/07/25 03:23:58 INFO DAGScheduler: Got job 0 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40) with 1 output partitions
20/07/25 03:23:58 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40)
20/07/25 03:23:58 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39), which has no missing parents
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.2 KB, free 366.0 MB)
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.1 KB, free 366.0 MB)
20/07/25 03:23:58 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 0508a77fcaad:45057 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:23:58 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
20/07/25 03:23:58 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) (first 15 tasks are for partitions Vector(0))
20/07/25 03:23:58 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/07/25 03:23:58 INFO BlockManagerMasterEndpoint: Registering block manager 172.22.0.4:45435 with 366.3 MB RAM, BlockManagerId(0, 172.22.0.4, 45435, None)
20/07/25 03:23:58 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.22.0.4:45435 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:24:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.22.0.4:45435 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:24:11 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:11 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 1]
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 2]
20/07/25 03:24:12 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 3]
20/07/25 03:24:12 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
20/07/25 03:24:12 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/07/25 03:24:12 INFO TaskSchedulerImpl: Cancelling stage 0
20/07/25 03:24:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
20/07/25 03:24:12 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) failed in 13.690 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
20/07/25 03:24:12 INFO DAGScheduler: Job 0 failed: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40, took 14.579961 s
Traceback (most recent call last):
File "/opt/spark-2.4.1/examples/src/main/python/wordcount.py", line 40, in <module>
output = counts.collect()
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:13 INFO SparkContext: Invoking stop() from shutdown hook
20/07/25 03:24:13 INFO SparkUI: Stopped Spark web UI at http://0508a77fcaad:4040
20/07/25 03:24:13 INFO StandaloneSchedulerBackend: Shutting down all executors
20/07/25 03:24:13 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/07/25 03:24:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/25 03:24:16 INFO MemoryStore: MemoryStore cleared
20/07/25 03:24:16 INFO BlockManager: BlockManager stopped
20/07/25 03:24:16 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/25 03:24:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/25 03:24:16 INFO SparkContext: Successfully stopped SparkContext
20/07/25 03:24:16 INFO ShutdownHookManager: Shutdown hook called
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-2dfb2222-d56c-4ee1-ab62-86e71e5e751b
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/pyspark-2ee74d07-6606-4edc-8420-fe46212c50e5
Change your spark-submit command as shown below. The --deploy-mode cluster and --files options are only needed if you want to pass just the file name to wordcount.py:
spark-submit \
--master spark://ipaddress:7077 \
--deploy-mode cluster \
--files /usr/local/airflow/sample.txt \
/opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
OR, if the file exists at the same path on every worker node:
spark-submit \
--master spark://ipaddress:7077 \
/opt/spark-2.4.1/examples/src/main/python/wordcount.py /usr/local/airflow/sample.txt
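In the log above, every attempt of task 0 fails on the worker at 172.22.0.4, which suggests the file is missing on that host rather than on the driver. As a rough sketch (not part of Spark; `LOST_TASK` and `failures_by_executor` are hypothetical names made up for this illustration), the "Lost task" lines can be tallied per host with plain Python:

```python
import re
from collections import Counter

# Task-failure lines copied from the driver log above.
LOG = """\
20/07/25 03:24:11 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
20/07/25 03:24:11 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
"""

# Matches both shapes of the "Lost task" message seen in this log:
#   "(TID 0, <host>, executor 0): <exception>"  and
#   "(TID 1) on <host>, executor 0: <exception>"
LOST_TASK = re.compile(
    r"Lost task (?P<task>\d+\.\d+) in stage \S+ \(TID \d+"
    r"(?:, |\) on )(?P<host>[\d.]+), executor (?P<executor>\d+)\)?: "
    r"(?P<exc>[\w.$]+)"
)


def failures_by_executor(log_text):
    """Count lost-task messages per (host, executor id, exception class)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOST_TASK.search(line)
        if match:
            counts[(match["host"], match["executor"], match["exc"])] += 1
    return counts


if __name__ == "__main__":
    for (host, executor, exc), n in failures_by_executor(LOG).items():
        print(f"{n} failure(s) on {host} (executor {executor}): {exc}")
```

If all failures land on executor hosts and none on the driver, the input path is most likely only valid on the machine that ran spark-submit.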

Why would Spark executors be removed (with "ExecutorAllocationManager: Request to remove executorIds" in the logs)?

I'm trying to run a Spark job on an AWS cluster of 6 c4.2xlarge nodes, and I don't know why Spark is killing the executors.
Any help will be appreciated.
Here is the spark-submit command:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 12 --executor-cores 3 --executor-memory 7G --driver-memory 7g --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
At this line Spark starts to remove executors:
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3
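One thing that can be checked directly from the log is how long executor 5 existed before this removal request. The sketch below is only an illustration (the helper names `log_time` and `seconds_between` are made up here). It compares the timestamp of the "Registered executor ... with ID 5" line from the log with the removal line, and sets the result against 60 seconds, which is Spark's documented default for spark.dynamicAllocation.executorIdleTimeout. The log does mention spark.dynamicAllocation settings, but whether the executor was actually idle for that whole interval is an assumption this check cannot confirm.

```python
from datetime import datetime

# Timestamp layout used by these Spark driver logs: yy/MM/dd HH:mm:ss
TS_FORMAT = "%y/%m/%d %H:%M:%S"
TS_LENGTH = 17  # length of e.g. "17/05/25 14:41:50"

# Default of spark.dynamicAllocation.executorIdleTimeout, in seconds.
DEFAULT_IDLE_TIMEOUT_S = 60

# Two lines copied from the spark-submit log in this question.
REGISTERED = (
    "17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: "
    "Registered executor NettyRpcEndpointRef(null) (10.185.52.31:57406) with ID 5"
)
REMOVE_REQUEST = (
    "17/05/25 14:42:50 INFO ExecutorAllocationManager: "
    "Request to remove executorIds: 5, 3"
)


def log_time(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)


def seconds_between(first_line, second_line):
    """Elapsed seconds from first_line's timestamp to second_line's."""
    return (log_time(second_line) - log_time(first_line)).total_seconds()


if __name__ == "__main__":
    elapsed = seconds_between(REGISTERED, REMOVE_REQUEST)
    print(f"executor 5 lived {elapsed:.0f}s before the removal request")
    print("matches default idle timeout:", elapsed == DEFAULT_IDLE_TIMEOUT_S)
```

Here the gap is exactly 60 seconds, which is consistent with dynamic allocation releasing executors it considers idle, though it does not prove it.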
Output spark-submit log:
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-avro_2.11;3.2.0 in central
found org.slf4j#slf4j-api;1.7.5 in central
found org.apache.avro#avro;1.7.6 in central
found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
found com.thoughtworks.paranamer#paranamer;2.3 in central
found org.xerial.snappy#snappy-java;1.0.5 in central
found org.apache.commons#commons-compress;1.4.1 in central
found org.tukaani#xz;1.0 in central
:: resolution report :: resolve 284ms :: artifacts dl 8ms
:: modules in use:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.3 from central in [default]
org.apache.avro#avro;1.7.6 from central in [default]
org.apache.commons#commons-compress;1.4.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.5 from central in [default]
org.tukaani#xz;1.0 from central in [default]
org.xerial.snappy#snappy-java;1.0.5 from central in [default]
:: evicted modules:
org.slf4j#slf4j-api;1.6.4 by [org.slf4j#slf4j-api;1.7.5] in [default]
---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |   10  |   0   |   0   |   1   ||   9   |   0   |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 9 already retrieved (0kB/8ms)
17/05/25 14:41:37 INFO SparkContext: Running Spark version 2.1.0
17/05/25 14:41:38 INFO SecurityManager: Changing view acls to: hadoop
17/05/25 14:41:38 INFO SecurityManager: Changing modify acls to: hadoop
17/05/25 14:41:38 INFO SecurityManager: Changing view acls groups to:
17/05/25 14:41:38 INFO SecurityManager: Changing modify acls groups to:
17/05/25 14:41:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/05/25 14:41:38 INFO Utils: Successfully started service 'sparkDriver' on port 37132.
17/05/25 14:41:38 INFO SparkEnv: Registering MapOutputTracker
17/05/25 14:41:38 INFO SparkEnv: Registering BlockManagerMaster
17/05/25 14:41:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/05/25 14:41:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/05/25 14:41:38 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-e368a261-c1a1-49e7-8533-8081896a45e4
17/05/25 14:41:38 INFO MemoryStore: MemoryStore started with capacity 4.0 GB
17/05/25 14:41:38 INFO SparkEnv: Registering OutputCommitCoordinator
17/05/25 14:41:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/05/25 14:41:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.185.53.161:4040
17/05/25 14:41:39 INFO Utils: Using initial executors = 12, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/05/25 14:41:39 INFO RMProxy: Connecting to ResourceManager at ip-10-185-53-161.eu-west-1.compute.internal/10.185.53.161:8032
17/05/25 14:41:39 INFO Client: Requesting a new application from cluster with 5 NodeManagers
17/05/25 14:41:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
17/05/25 14:41:40 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
17/05/25 14:41:40 INFO Client: Setting up container launch context for our AM
17/05/25 14:41:40 INFO Client: Setting up the launch environment for our AM container
17/05/25 14:41:40 INFO Client: Preparing resources for our AM container
17/05/25 14:41:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/05/25 14:41:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4f534fa1-c377-4113-9c86-96d5cdab4cb5/__spark_libs__6500399427935716229.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/__spark_libs__6500399427935716229.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/RedshiftJDBC42-1.2.1.1001.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/RedshiftJDBC42-1.2.1.1001.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/com.databricks_spark-avro_2.11-3.2.0.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/com.databricks_spark-avro_2.11-3.2.0.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.slf4j_slf4j-api-1.7.5.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.apache.avro_avro-1.7.6.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.codehaus.jackson_jackson-core-asl-1.9.13.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/com.thoughtworks.paranamer_paranamer-2.3.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.xerial.snappy_snappy-java-1.0.5.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.apache.commons_commons-compress-1.4.1.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.tukaani_xz-1.0.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/hive-site.xml
17/05/25 14:41:43 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/pyspark.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/py4j-0.10.4-src.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/dependencies.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/dependencies.zip
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.databricks_spark-avro_2.11-3.2.0.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.6.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache.
17/05/25 14:41:43 INFO Client: Uploading resource file:/mnt/tmp/spark-4f534fa1-c377-4113-9c86-96d5cdab4cb5/__spark_conf__1516567354161750682.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/__spark_conf__.zip
17/05/25 14:41:43 INFO SecurityManager: Changing view acls to: hadoop
17/05/25 14:41:43 INFO SecurityManager: Changing modify acls to: hadoop
17/05/25 14:41:43 INFO SecurityManager: Changing view acls groups to:
17/05/25 14:41:43 INFO SecurityManager: Changing modify acls groups to:
17/05/25 14:41:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/05/25 14:41:43 INFO Client: Submitting application application_1495720658394_0004 to ResourceManager
17/05/25 14:41:43 INFO YarnClientImpl: Submitted application application_1495720658394_0004
17/05/25 14:41:43 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1495720658394_0004 and attemptId None
17/05/25 14:41:44 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:44 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1495723303463
final status: UNDEFINED
tracking URL: http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004/
user: hadoop
17/05/25 14:41:45 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:46 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
17/05/25 14:41:46 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:46 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-185-53-161.eu-west-1.compute.internal, PROXY_URI_BASES -> http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004), /proxy/application_1495720658394_0004
17/05/25 14:41:46 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
17/05/25 14:41:47 INFO Client: Application report for application_1495720658394_0004 (state: RUNNING)
17/05/25 14:41:47 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.185.52.31
ApplicationMaster RPC port: 0
queue: default
start time: 1495723303463
final status: UNDEFINED
tracking URL: http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004/
user: hadoop
17/05/25 14:41:47 INFO YarnClientSchedulerBackend: Application application_1495720658394_0004 has started running.
17/05/25 14:41:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37860.
17/05/25 14:41:47 INFO NettyBlockTransferService: Server created on 10.185.53.161:37860
17/05/25 14:41:47 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/05/25 14:41:47 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManagerMasterEndpoint: Registering block manager 10.185.53.161:37860 with 4.0 GB RAM, BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManager: external shuffle service port = 7337
17/05/25 14:41:47 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1495720658394_0004
17/05/25 14:41:47 INFO Utils: Using initial executors = 12, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.52.31:57406) with ID 5
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 5 has registered (new total is 1)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-52-31.eu-west-1.compute.internal:38781 with 4.0 GB RAM, BlockManagerId(5, ip-10-185-52-31.eu-west-1.compute.internal, 38781, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.45:40096) with ID 3
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 3 has registered (new total is 2)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-45.eu-west-1.compute.internal:43702 with 4.0 GB RAM, BlockManagerId(3, ip-10-185-53-45.eu-west-1.compute.internal, 43702, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.135:42390) with ID 2
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 2 has registered (new total is 3)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-135.eu-west-1.compute.internal:41552 with 4.0 GB RAM, BlockManagerId(2, ip-10-185-53-135.eu-west-1.compute.internal, 41552, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.10:60612) with ID 1
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 4)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-10.eu-west-1.compute.internal:33391 with 4.0 GB RAM, BlockManagerId(1, ip-10-185-53-10.eu-west-1.compute.internal, 33391, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.68:57424) with ID 4
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 4 has registered (new total is 5)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-68.eu-west-1.compute.internal:34222 with 4.0 GB RAM, BlockManagerId(4, ip-10-185-53-68.eu-west-1.compute.internal, 34222, None)
17/05/25 14:42:09 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
17/05/25 14:42:09 INFO SharedState: Warehouse path is 'hdfs:///user/spark/warehouse'.
17/05/25 14:42:10 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
17/05/25 14:42:11 INFO CodeGenerator: Code generated in 170.416763 ms
17/05/25 14:42:11 INFO SparkContext: Starting job: collect at /home/hadoop/iface_extractions/select_fields.py:90
17/05/25 14:42:11 INFO DAGScheduler: Got job 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90) with 1 output partitions
17/05/25 14:42:11 INFO DAGScheduler: Final stage: ResultStage 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90)
17/05/25 14:42:11 INFO DAGScheduler: Parents of final stage: List()
17/05/25 14:42:11 INFO DAGScheduler: Missing parents: List()
17/05/25 14:42:11 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at collect at /home/hadoop/iface_extractions/select_fields.py:90), which has no missing parents
17/05/25 14:42:11 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.5 KB, free 4.0 GB)
17/05/25 14:42:11 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.1 KB, free 4.0 GB)
17/05/25 14:42:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.185.53.161:37860 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:11 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/05/25 14:42:11 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at collect at /home/hadoop/iface_extractions/select_fields.py:90)
17/05/25 14:42:11 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
17/05/25 14:42:11 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-10-185-53-135.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5899 bytes)
17/05/25 14:42:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-185-53-135.eu-west-1.compute.internal:41552 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1101 ms on ip-10-185-53-135.eu-west-1.compute.internal (executor 2) (1/1)
17/05/25 14:42:12 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/05/25 14:42:12 INFO DAGScheduler: ResultStage 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90) finished in 1.109 s
17/05/25 14:42:12 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/iface_extractions/select_fields.py:90, took 1.290037 s
17/05/25 14:42:12 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.185.53.161:37860 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO SparkContext: Starting job: collect at /home/hadoop/iface_extractions/select_fields.py:91
17/05/25 14:42:12 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ip-10-185-53-135.eu-west-1.compute.internal:41552 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO DAGScheduler: Got job 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91) with 1 output partitions
17/05/25 14:42:12 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91)
17/05/25 14:42:12 INFO DAGScheduler: Parents of final stage: List()
17/05/25 14:42:12 INFO DAGScheduler: Missing parents: List()
17/05/25 14:42:12 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at collect at /home/hadoop/iface_extractions/select_fields.py:91), which has no missing parents
17/05/25 14:42:12 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.5 KB, free 4.0 GB)
17/05/25 14:42:12 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.1 KB, free 4.0 GB)
17/05/25 14:42:12 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.185.53.161:37860 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/05/25 14:42:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at collect at /home/hadoop/iface_extractions/select_fields.py:91)
17/05/25 14:42:12 INFO YarnScheduler: Adding task set 1.0 with 1 tasks
17/05/25 14:42:12 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ip-10-185-53-68.eu-west-1.compute.internal, executor 4, partition 0, PROCESS_LOCAL, 5900 bytes)
17/05/25 14:42:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-185-53-68.eu-west-1.compute.internal:34222 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1047 ms on ip-10-185-53-68.eu-west-1.compute.internal (executor 4) (1/1)
17/05/25 14:42:14 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/05/25 14:42:14 INFO DAGScheduler: ResultStage 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91) finished in 1.047 s
17/05/25 14:42:14 INFO DAGScheduler: Job 1 finished: collect at /home/hadoop/iface_extractions/select_fields.py:91, took 1.054768 s
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 13.109425 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 12.568665 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 11.257538 ms
17/05/25 14:42:14 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 10.185.53.161:37860 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO BlockManagerInfo: Removed broadcast_1_piece0 on ip-10-185-53-68.eu-west-1.compute.internal:34222 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 11.563958 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 18.189301 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 13.490762 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 15.156166 ms
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 5, 3
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 5, 3
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 5 because it has been idle for 60 seconds (new desired total will be 4)
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 3)
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 1
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 1
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
17/05/25 14:42:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5.
17/05/25 14:42:50 INFO DAGScheduler: Executor lost: 5 (epoch 0)
17/05/25 14:42:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/05/25 14:42:50 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-185-52-31.eu-west-1.compute.internal, 38781, None)
17/05/25 14:42:50 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
17/05/25 14:42:50 INFO YarnScheduler: Executor 5 on ip-10-185-52-31.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:50 INFO ExecutorAllocationManager: Existing executor 5 has been removed (new total is 4)
17/05/25 14:42:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
17/05/25 14:42:51 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ip-10-185-53-10.eu-west-1.compute.internal, 33391, None)
17/05/25 14:42:51 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
17/05/25 14:42:51 INFO YarnScheduler: Executor 1 on ip-10-185-53-10.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:51 INFO ExecutorAllocationManager: Existing executor 1 has been removed (new total is 3)
17/05/25 14:42:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
17/05/25 14:42:51 INFO DAGScheduler: Executor lost: 3 (epoch 0)
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, ip-10-185-53-45.eu-west-1.compute.internal, 43702, None)
17/05/25 14:42:51 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
17/05/25 14:42:51 INFO YarnScheduler: Executor 3 on ip-10-185-53-45.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:51 INFO ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
17/05/25 14:43:12 INFO ExecutorAllocationManager: Request to remove executorIds: 2
17/05/25 14:43:12 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
17/05/25 14:43:12 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 2
17/05/25 14:43:12 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 1)
17/05/25 14:43:13 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 2.
17/05/25 14:43:13 INFO DAGScheduler: Executor lost: 2 (epoch 0)
17/05/25 14:43:13 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/05/25 14:43:13 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, ip-10-185-53-135.eu-west-1.compute.internal, 41552, None)
17/05/25 14:43:13 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/05/25 14:43:13 INFO YarnScheduler: Executor 2 on ip-10-185-53-135.eu-west-1.compute.internal killed by driver.
17/05/25 14:43:13 INFO ExecutorAllocationManager: Existing executor 2 has been removed (new total is 1)
17/05/25 14:43:14 INFO ExecutorAllocationManager: Request to remove executorIds: 4
17/05/25 14:43:14 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
17/05/25 14:43:14 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 4
17/05/25 14:43:14 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 0)
17/05/25 14:43:17 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4.
17/05/25 14:43:17 INFO DAGScheduler: Executor lost: 4 (epoch 0)
17/05/25 14:43:17 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/05/25 14:43:17 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4, ip-10-185-53-68.eu-west-1.compute.internal, 34222, None)
17/05/25 14:43:17 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
17/05/25 14:43:17 INFO YarnScheduler: Executor 4 on ip-10-185-53-68.eu-west-1.compute.internal killed by driver.
17/05/25 14:43:17 INFO ExecutorAllocationManager: Existing executor 4 has been removed (new total is 0)
My guess is that you've got Dynamic Resource Allocation enabled in your Spark configuration.
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
This feature is disabled by default and available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
The documentation quoted above says the feature is disabled by default, so I can only guess that it was enabled explicitly in your setup.
From ExecutorAllocationManager:
An agent that dynamically allocates and removes executors based on the workload.
With that said, I'd open the web UI (Environment tab) and check whether the spark.dynamicAllocation.enabled property is set to true.
There are two requirements for using this feature (Dynamic Resource Allocation). First, your application must set spark.dynamicAllocation.enabled to true. Second, you must set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application.
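The two requirements quoted above can be checked mechanically once you have the application's properties in hand. The following is a minimal sketch in plain Python, not part of Spark: the helper name `unmet_dynamic_allocation_requirements` is hypothetical, and it assumes you have copied the properties into a dict (for example from the web UI). It treats a missing property as false, which is how both properties default.

```python
def unmet_dynamic_allocation_requirements(conf):
    """Return the required properties that are not set to true.

    Hypothetical helper: `conf` is a plain dict mapping Spark property
    names to string values. An empty result means both requirements
    quoted above are satisfied.
    """
    required = (
        "spark.dynamicAllocation.enabled",
        "spark.shuffle.service.enabled",
    )
    # Assumption of this sketch: a property that is absent counts as "false".
    return [
        key for key in required
        if str(conf.get(key, "false")).strip().lower() != "true"
    ]


both_set = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
}
only_one = {"spark.dynamicAllocation.enabled": "true"}

print(unmet_dynamic_allocation_requirements(both_set))
print(unmet_dynamic_allocation_requirements(only_one))
```

If the first call returns an empty list for your application's properties, dynamic allocation is the likely reason executors are being released.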
This is the line that prints out the INFO message:
logInfo("Request to remove executorIds: " + executors.mkString(", "))
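To confirm from a driver log that executors were released by the allocation manager rather than lost to a failure, you can scan for the two ExecutorAllocationManager messages shown in the question's log. This is a rough sketch in Python; the function name `summarize_removals` is made up for illustration, and the regular expressions assume the message wording of the Spark version that produced the log above.

```python
import re

# Message wording taken from the driver log in the question; other Spark
# versions may phrase these lines differently (assumption of this sketch).
REQUEST = re.compile(r"Request to remove executorIds: (.+)$")
IDLE = re.compile(
    r"Removing executor (\S+) because it has been idle for (\d+) seconds"
)


def summarize_removals(log_lines):
    """Return (requested_ids, idle_seconds_by_id) parsed from driver log lines."""
    requested = []
    idle = {}
    for line in log_lines:
        line = line.strip()
        match = REQUEST.search(line)
        if match:
            # The ids are printed as a comma-separated list, e.g. "5, 3".
            requested.extend(part.strip() for part in match.group(1).split(","))
            continue
        match = IDLE.search(line)
        if match:
            idle[match.group(1)] = int(match.group(2))
    return requested, idle


sample = [
    "17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3",
    "17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 5 because it has been idle for 60 seconds (new desired total will be 4)",
    "17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 3)",
    "17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 1",
]

requested, idle = summarize_removals(sample)
print(requested)
print(idle)
```

Every requested id that also appears in the idle mapping was removed because of the idle timeout, which is the behaviour dynamic allocation is designed to produce.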
You can also kill executors yourself using SparkContext.killExecutors, which gives a Spark developer a way to request executor removal directly.
killExecutors(executorIds: Seq[String]): Boolean
Request that the cluster manager kill the specified executors.
There are actually two such methods (killExecutors and killExecutor), and they are very helpful for demo purposes because you can easily show how executors come and go.

Spark: why tasks assigned only to one worker?

I'm new to Apache Spark and trying to run a simple program on my cluster. The problem is that the driver allocates all tasks to one worker.
I am running Spark in standalone cluster mode on 2 computers:
1 - runs the master and a worker with 4 cores: 1 used for the master, 3 for the worker. Ip: 192.168.1.101
2 - runs only a worker with 4 cores: all for worker. Ip: 192.168.1.104
This is the code:
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-project");
    JavaSparkContext sc = new JavaSparkContext(conf);
    try {
        Thread.sleep(5000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    JavaRDD<String> lines = sc.textFile("/Datasets/somefile.txt", 7);
    System.out.println(lines.partitions().size());
    Accumulator<Integer> sum = sc.accumulator(0);
    JavaRDD<Integer> numbers = lines.map(line -> 1);
    System.out.println(numbers.partitions().size());
    numbers.foreach(num -> System.out.println(num));
    numbers.foreach(num -> sum.add(num));
    System.out.println(sum.value());
    sc.close();
}
Note: I added the Thread.sleep() call because I tried the workaround described in https://issues.apache.org/jira/browse/SPARK-3100
I used the submit script:
bin/spark-submit --class spark.Main --master spark://192.168.1.101:7077 --deploy-mode cluster /home/sparkUser/JarsOfSpark/JarForSpark.jar
This is the result I got from the driver stdout:
7
7
50144
Logs from the master:
log4j:WARN No appenders could be found for logger(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/15 19:22:14 INFO SecurityManager: Changing view acls to: sparkUser
16/01/15 19:22:14 INFO SecurityManager: Changing modify acls to: sparkUser
16/01/15 19:22:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkUser); users with modify permissions: Set(sparkUser)
16/01/15 19:22:24 INFO Slf4jLogger: Slf4jLogger started
16/01/15 19:22:24 INFO Utils: Successfully started service 'Driver' on port 46546.
16/01/15 19:22:24 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@192.168.1.101:43150/user/Worker
16/01/15 19:22:24 INFO SparkContext: Running Spark version 1.4.1
16/01/15 19:22:24 INFO SecurityManager: Changing view acls to: sparkUser
16/01/15 19:22:24 INFO SecurityManager: Changing modify acls to: sparkUser
16/01/15 19:22:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkUser); users with modify permissions: Set(sparkUser)
16/01/15 19:22:24 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@192.168.1.101:43150/user/Worker
16/01/15 19:22:25 INFO Slf4jLogger: Slf4jLogger started
16/01/15 19:22:25 INFO Utils: Successfully started service 'sparkDriver' on port 38186.
16/01/15 19:22:25 INFO SparkEnv: Registering MapOutputTracker
16/01/15 19:22:25 INFO SparkEnv: Registering BlockManagerMaster
16/01/15 19:22:25 INFO DiskBlockManager: Created local directory at /tmp/spark-ef3b8193-e086-4764-993c-0a40534052c1/blockmgr-e80c1c60-fe19-4be1-b3f9-259b3f1031a0
16/01/15 19:22:25 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
16/01/15 19:22:25 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ef3b8193-e086-4764-993c-0a40534052c1/httpd-e05a5a70-dbf3-4055-b6ab-7efa22dfa4d2
16/01/15 19:22:25 INFO HttpServer: Starting HTTP Server
16/01/15 19:22:25 INFO Utils: Successfully started service 'HTTP file server' on port 34728.
16/01/15 19:22:25 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/15 19:22:35 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/15 19:22:35 INFO SparkUI: Started SparkUI at http://192.168.1.101:4040
16/01/15 19:22:35 INFO SparkContext: Added JAR file:/home/sparkUser/JarsOfSpark/JarForSpark.jar at http://192.168.1.101:34728/jars/JarForSpark.jar with timestamp 1452878555317
16/01/15 19:22:35 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@192.168.1.101:7077/user/Master...
16/01/15 19:22:35 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160115192235-0016
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor added: app-20160115192235-0016/0 on worker-20160115181337-192.168.1.104-50099 (192.168.1.104:50099) with 4 cores
16/01/15 19:22:35 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160115192235-0016/0 on hostPort 192.168.1.104:50099 with 4 cores, 512.0 MB RAM
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor added: app-20160115192235-0016/1 on worker-20160115125104-192.168.1.101-43150 (192.168.1.101:43150) with 3 cores
16/01/15 19:22:35 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160115192235-0016/1 on hostPort 192.168.1.101:43150 with 3 cores, 512.0 MB RAM
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor updated: app-20160115192235-0016/1 is now LOADING
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor updated: app-20160115192235-0016/0 is now LOADING
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor updated: app-20160115192235-0016/0 is now RUNNING
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor updated: app-20160115192235-0016/1 is now RUNNING
16/01/15 19:22:35 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33359.
16/01/15 19:22:35 INFO NettyBlockTransferService: Server created on 33359
16/01/15 19:22:35 INFO BlockManagerMaster: Trying to register BlockManager
16/01/15 19:22:35 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.101:33359 with 265.1 MB RAM, BlockManagerId(driver, 192.168.1.101, 33359)
16/01/15 19:22:35 INFO BlockManagerMaster: Registered BlockManager
16/01/15 19:22:35 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/01/15 19:22:38 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@192.168.1.104:49573/user/Executor#1472403765]) with ID 0
16/01/15 19:22:39 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.104:33856 with 265.1 MB RAM, BlockManagerId(0, 192.168.1.104, 33856)
16/01/15 19:22:40 INFO MemoryStore: ensureFreeSpace(130448) called with curMem=0, maxMem=278019440
16/01/15 19:22:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 265.0 MB)
16/01/15 19:22:40 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=130448, maxMem=278019440
16/01/15 19:22:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 265.0 MB)
16/01/15 19:22:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.101:33359 (size: 13.9 KB, free: 265.1 MB)
16/01/15 19:22:40 INFO SparkContext: Created broadcast 0 from textFile at Main.java:25
16/01/15 19:22:41 INFO FileInputFormat: Total input paths to process : 1
16/01/15 19:22:41 INFO SparkContext: Starting job: foreach at Main.java:33
16/01/15 19:22:41 INFO DAGScheduler: Got job 0 (foreach at Main.java:33) with 7 output partitions (allowLocal=false)
16/01/15 19:22:41 INFO DAGScheduler: Final stage: ResultStage 0(foreach at Main.java:33)
16/01/15 19:22:41 INFO DAGScheduler: Parents of final stage: List()
16/01/15 19:22:41 INFO DAGScheduler: Missing parents: List()
16/01/15 19:22:41 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at map at Main.java:30), which has no missing parents
16/01/15 19:22:41 INFO MemoryStore: ensureFreeSpace(4400) called with curMem=144705, maxMem=278019440
16/01/15 19:22:41 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 265.0 MB)
16/01/15 19:22:41 INFO MemoryStore: ensureFreeSpace(2538) called with curMem=149105, maxMem=278019440
16/01/15 19:22:41 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 265.0 MB)
16/01/15 19:22:41 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.101:33359 (size: 2.5 KB, free: 265.1 MB)
16/01/15 19:22:41 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
16/01/15 19:22:41 INFO DAGScheduler: Submitting 7 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at map at Main.java:30)
16/01/15 19:22:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
16/01/15 19:22:41 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.104:33856 (size: 2.5 KB, free: 265.1 MB)
16/01/15 19:22:42 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.104:33856 (size: 13.9 KB, free: 265.1 MB)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 2017 ms on 192.168.1.104 (1/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2036 ms on 192.168.1.104 (2/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2027 ms on 192.168.1.104 (3/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2027 ms on 192.168.1.104 (4/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 143 ms on 192.168.1.104 (5/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 199 ms on 192.168.1.104 (6/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 206 ms on 192.168.1.104 (7/7)
16/01/15 19:22:43 INFO DAGScheduler: ResultStage 0 (foreach at Main.java:33) finished in 2.218 s
16/01/15 19:22:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/15 19:22:43 INFO DAGScheduler: Job 0 finished: foreach at Main.java:33, took 2.289399 s
16/01/15 19:22:43 INFO SparkContext: Starting job: foreach at Main.java:34
16/01/15 19:22:43 INFO DAGScheduler: Got job 1 (foreach at Main.java:34) with 7 output partitions (allowLocal=false)
16/01/15 19:22:43 INFO DAGScheduler: Final stage: ResultStage 1(foreach at Main.java:34)
16/01/15 19:22:43 INFO DAGScheduler: Parents of final stage: List()
16/01/15 19:22:43 INFO DAGScheduler: Missing parents: List()
16/01/15 19:22:43 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[2] at map at Main.java:30), which has no missing parents
16/01/15 19:22:43 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=151643, maxMem=278019440
16/01/15 19:22:43 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.7 KB, free 265.0 MB)
16/01/15 19:22:43 INFO MemoryStore: ensureFreeSpace(2761) called with curMem=156467, maxMem=278019440
16/01/15 19:22:43 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.7 KB, free 265.0 MB)
16/01/15 19:22:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.101:33359 (size: 2.7 KB, free: 265.1 MB)
16/01/15 19:22:43 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
16/01/15 19:22:43 INFO DAGScheduler: Submitting 7 missing tasks from ResultStage 1 (MapPartitionsRDD[2] at map at Main.java:30)
16/01/15 19:22:43 INFO TaskSchedulerImpl: Adding task set 1.0 with 7 tasks
16/01/15 19:22:43 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 7, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 8, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 9, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 10, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.104:33856 (size: 2.7 KB, free: 265.1 MB)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 11, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 7) in 106 ms on 192.168.1.104 (1/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 8) in 125 ms on 192.168.1.104 (2/7)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 12, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 13, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 9) in 131 ms on 192.168.1.104 (3/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 10) in 133 ms on 192.168.1.104 (4/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID 12) in 32 ms on 192.168.1.104 (5/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 11) in 61 ms on 192.168.1.104 (6/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 13) in 34 ms on 192.168.1.104 (7/7)
16/01/15 19:22:43 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/01/15 19:22:43 INFO DAGScheduler: ResultStage 1 (foreach at Main.java:34) finished in 0.165 s
16/01/15 19:22:43 INFO DAGScheduler: Job 1 finished: foreach at Main.java:34, took 0.177378 s
16/01/15 19:22:43 INFO SparkUI: Stopped Spark web UI at http://192.168.1.101:4040
16/01/15 19:22:43 INFO DAGScheduler: Stopping DAGScheduler
16/01/15 19:22:43 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/01/15 19:22:43 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/01/15 19:22:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/01/15 19:22:43 INFO Utils: path = /tmp/spark-ef3b8193-e086-4764-993c-0a40534052c1/blockmgr-e80c1c60-fe19-4be1-b3f9-259b3f1031a0, already present as root for deletion.
16/01/15 19:22:43 INFO MemoryStore: MemoryStore cleared
16/01/15 19:22:43 INFO BlockManager: BlockManager stopped
16/01/15 19:22:43 INFO BlockManagerMaster: BlockManagerMaster stopped
16/01/15 19:22:43 INFO SparkContext: Successfully stopped SparkContext
16/01/15 19:22:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/01/15 19:22:43 INFO Utils: Shutdown hook called
16/01/15 19:22:43 INFO Utils: Deleting directory /tmp/spark-ef3b8193-e086-4764-993c-0a40534052c1
logs from worker 192.168.1.101:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/15 18:14:15 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/01/15 18:14:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/15 18:14:15 INFO SecurityManager: Changing view acls to: sparkUser
16/01/15 18:14:15 INFO SecurityManager: Changing modify acls to: sparkUser
16/01/15 18:14:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkUser); users with modify permissions: Set(sparkUser)
logs from worker 192.168.1.104:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/15 19:23:23 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/01/15 19:23:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/15 19:23:24 INFO SecurityManager: Changing view acls to: root,sparkUser
16/01/15 19:23:24 INFO SecurityManager: Changing modify acls to: root,sparkUser
16/01/15 19:23:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, sparkUser); users with modify permissions: Set(root, sparkUser)
16/01/15 19:23:25 INFO Slf4jLogger: Slf4jLogger started
16/01/15 19:23:25 INFO Remoting: Starting remoting
16/01/15 19:23:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher#192.168.1.104:43937]
16/01/15 19:23:25 INFO Utils: Successfully started service 'driverPropsFetcher' on port 43937.
16/01/15 19:23:26 INFO SecurityManager: Changing view acls to: root,sparkUser
16/01/15 19:23:26 INFO SecurityManager: Changing modify acls to: root,sparkUser
16/01/15 19:23:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, sparkUser); users with modify permissions: Set(root, sparkUser)
16/01/15 19:23:26 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/15 19:23:26 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/01/15 19:23:26 INFO Slf4jLogger: Slf4jLogger started
16/01/15 19:23:26 INFO Remoting: Starting remoting
16/01/15 19:23:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor#192.168.1.104:49573]
16/01/15 19:23:26 INFO Utils: Successfully started service 'sparkExecutor' on port 49573.
16/01/15 19:23:26 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/01/15 19:23:26 INFO DiskBlockManager: Created local directory at /tmp/spark-6ffb215c-7267-4a93-a766-2486d2331f6b/executor-146bfe64-d7e8-4da4-9144-8003754f0b5b/blockmgr-41031d8c-b069-4147-90c9-2237baed04f1
16/01/15 19:23:26 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
16/01/15 19:23:26 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver#192.168.1.101:38186/user/CoarseGrainedScheduler
16/01/15 19:23:26 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker#192.168.1.104:50099/user/Worker
16/01/15 19:23:26 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker#192.168.1.104:50099/user/Worker
16/01/15 19:23:26 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
16/01/15 19:23:26 INFO Executor: Starting executor ID 0 on host 192.168.1.104
16/01/15 19:23:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33856.
16/01/15 19:23:26 INFO NettyBlockTransferService: Server created on 33856
16/01/15 19:23:26 INFO BlockManagerMaster: Trying to register BlockManager
16/01/15 19:23:26 INFO BlockManagerMaster: Registered BlockManager
16/01/15 19:23:29 INFO CoarseGrainedExecutorBackend: Got assigned task 0
16/01/15 19:23:29 INFO CoarseGrainedExecutorBackend: Got assigned task 1
16/01/15 19:23:29 INFO CoarseGrainedExecutorBackend: Got assigned task 2
16/01/15 19:23:29 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/01/15 19:23:29 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
16/01/15 19:23:29 INFO CoarseGrainedExecutorBackend: Got assigned task 3
16/01/15 19:23:29 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/01/15 19:23:29 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
16/01/15 19:23:29 INFO Executor: Fetching http://192.168.1.101:34728/jars/JarForSpark.jar with timestamp 1452878555317
16/01/15 19:23:29 INFO Utils: Fetching http://192.168.1.101:34728/jars/JarForSpark.jar to /tmp/spark-6ffb215c-7267-4a93-a766-2486d2331f6b/executor-146bfe64-d7e8-4da4-9144-8003754f0b5b/fetchFileTemp1585609242243689070.tmp
16/01/15 19:23:29 INFO Utils: Copying /tmp/spark-6ffb215c-7267-4a93-a766-2486d2331f6b/executor-146bfe64-d7e8-4da4-9144-8003754f0b5b/3339800781452878555317_cache to /home/sparkUser2/Programs/spark-1.4.1-bin-hadoop2.6/work/app-20160115192235-0016/0/./JarForSpark.jar
16/01/15 19:23:29 INFO Executor: Adding file:/home/sparkUser2/Programs/spark-1.4.1-bin-hadoop2.6/work/app-20160115192235-0016/0/./JarForSpark.jar to class loader
16/01/15 19:23:29 INFO TorrentBroadcast: Started reading broadcast variable 1
16/01/15 19:23:29 INFO MemoryStore: ensureFreeSpace(2538) called with curMem=0, maxMem=278019440
16/01/15 19:23:29 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 265.1 MB)
16/01/15 19:23:29 INFO TorrentBroadcast: Reading broadcast variable 1 took 273 ms
16/01/15 19:23:29 INFO MemoryStore: ensureFreeSpace(4400) called with curMem=2538, maxMem=278019440
16/01/15 19:23:29 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 265.1 MB)
16/01/15 19:23:29 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:161772+161772
16/01/15 19:23:29 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:323544+161772
16/01/15 19:23:29 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:0+161772
16/01/15 19:23:29 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:485316+161772
16/01/15 19:23:29 INFO TorrentBroadcast: Started reading broadcast variable 0
16/01/15 19:23:29 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=6938, maxMem=278019440
16/01/15 19:23:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 265.1 MB)
16/01/15 19:23:30 INFO TorrentBroadcast: Reading broadcast variable 0 took 66 ms
16/01/15 19:23:30 INFO MemoryStore: ensureFreeSpace(188976) called with curMem=21195, maxMem=278019440
16/01/15 19:23:30 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 184.5 KB, free 264.9 MB)
16/01/15 19:23:30 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/01/15 19:23:30 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/01/15 19:23:30 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/01/15 19:23:30 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/01/15 19:23:30 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/01/15 19:23:30 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 1796 bytes result sent to driver
16/01/15 19:23:30 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1796 bytes result sent to driver
16/01/15 19:23:30 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1796 bytes result sent to driver
16/01/15 19:23:30 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1796 bytes result sent to driver
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 4
16/01/15 19:23:31 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:647088+161772
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 5
16/01/15 19:23:31 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 6
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:808860+161772
16/01/15 19:23:31 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:970632+161773
16/01/15 19:23:31 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 1796 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 1796 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 1796 bytes result sent to driver
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 7
16/01/15 19:23:31 INFO Executor: Running task 0.0 in stage 1.0 (TID 7)
16/01/15 19:23:31 INFO TorrentBroadcast: Started reading broadcast variable 2
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 8
16/01/15 19:23:31 INFO Executor: Running task 1.0 in stage 1.0 (TID 8)
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 9
16/01/15 19:23:31 INFO Executor: Running task 2.0 in stage 1.0 (TID 9)
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 10
16/01/15 19:23:31 INFO Executor: Running task 3.0 in stage 1.0 (TID 10)
16/01/15 19:23:31 INFO MemoryStore: ensureFreeSpace(2761) called with curMem=210171, maxMem=278019440
16/01/15 19:23:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.7 KB, free 264.9 MB)
16/01/15 19:23:31 INFO TorrentBroadcast: Reading broadcast variable 2 took 42 ms
16/01/15 19:23:31 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=212932, maxMem=278019440
16/01/15 19:23:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.7 KB, free 264.9 MB)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:0+161772
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:161772+161772
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:323544+161772
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:485316+161772
16/01/15 19:23:31 INFO Executor: Finished task 0.0 in stage 1.0 (TID 7). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 1.0 in stage 1.0 (TID 8). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 2.0 in stage 1.0 (TID 9). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 3.0 in stage 1.0 (TID 10). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 11
16/01/15 19:23:31 INFO Executor: Running task 4.0 in stage 1.0 (TID 11)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:647088+161772
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 12
16/01/15 19:23:31 INFO Executor: Running task 5.0 in stage 1.0 (TID 12)
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 13
16/01/15 19:23:31 INFO Executor: Running task 6.0 in stage 1.0 (TID 13)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:808860+161772
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:970632+161773
16/01/15 19:23:31 INFO Executor: Finished task 5.0 in stage 1.0 (TID 12). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 4.0 in stage 1.0 (TID 11). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 6.0 in stage 1.0 (TID 13). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
I also tried stopping one of the workers to see what would happen, and the program completed successfully on the other worker.
I also looked at this post, but unfortunately it didn't solve my problem:
Why my tasks only be done in one worker in Spark cluster
Appreciate your help!
This is because of data locality: "how close data is to the code processing it".
Spark tries to schedule each available task at the best locality level it can.
By default it tries PROCESS_LOCAL first and falls back to a lower level (such as NODE_LOCAL, RACK_LOCAL or ANY) only if no CPU at the current level becomes free within a certain wait interval.
The default wait before falling back to the next level is 3s (see the spark.locality.wait parameter).
Looking at the driver log, every one of your tasks finished in under 3 seconds, so the scheduler never needed to fall back to another level:
16/01/15 19:22:41 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.104:33856 (size: 2.5 KB, free: 265.1 MB)
16/01/15 19:22:42 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.104:33856 (size: 13.9 KB, free: 265.1 MB)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 2017 ms on 192.168.1.104 (1/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2036 ms on 192.168.1.104 (2/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2027 ms on 192.168.1.104 (3/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2027 ms on 192.168.1.104 (4/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 143 ms on 192.168.1.104 (5/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 199 ms on 192.168.1.104 (6/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 206 ms on 192.168.1.104 (7/7)
16/01/15 19:22:43 INFO DAGScheduler: ResultStage 0 (foreach at Main.java:33) finished in 2.218 s
16/01/15 19:22:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/15 19:22:43 INFO DAGScheduler: Job 0 finished: foreach at Main.java:33, took 2.289399 s
I would suggest trying larger files (in the GB range), where each task takes some time to produce its result.
For more information on data locality, see the "Data Locality" section of the Spark tuning guide.
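The observation about task durations can be checked mechanically. The following is a minimal sketch and not part of Spark: the class name LocalityWaitCheck and its two helper methods are hypothetical, the sample lines are copied from the driver log quoted above, and the 3000 ms threshold assumes the default spark.locality.wait of 3s.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts task durations from driver log lines and
// compares them with the locality wait. It does not talk to Spark at all.
class LocalityWaitCheck {
    // Matches lines such as:
    // "... INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 2017 ms on 192.168.1.104 (1/7)"
    private static final Pattern FINISHED = Pattern.compile(
            "Finished task \\S+ in stage \\S+ \\(TID \\d+\\) in (\\d+) ms");

    // Returns the duration in milliseconds of every "Finished task" line.
    static List<Long> taskDurationsMs(List<String> logLines) {
        List<Long> durations = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FINISHED.matcher(line);
            if (m.find()) {
                durations.add(Long.parseLong(m.group(1)));
            }
        }
        return durations;
    }

    // True when every task finished in less time than the locality wait.
    static boolean allWithinWait(List<Long> durationsMs, long waitMs) {
        for (long d : durationsMs) {
            if (d >= waitMs) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample lines taken from the stage 0.0 driver log in the question.
        List<String> lines = Arrays.asList(
            "16/01/15 19:22:43 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 2017 ms on 192.168.1.104 (1/7)",
            "16/01/15 19:22:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2036 ms on 192.168.1.104 (2/7)",
            "16/01/15 19:22:43 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 143 ms on 192.168.1.104 (5/7)");

        long waitMs = 3000L; // assumed default of spark.locality.wait (3s)
        List<Long> durations = taskDurationsMs(lines);
        System.out.println("longest task (ms): " + Collections.max(durations));
        System.out.println("all within locality wait: " + allWithinWait(durations, waitMs));
    }
}
```

If a longer run reports tasks that exceed the wait, that would be the point at which, by the reasoning above, the scheduler may start placing tasks at a lower locality level.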

Apache Spark Multi Node Clustering - java.io.FileNotFoundException

I am a newbie to Apache Spark and cluster computing. I first set up Spark in standalone mode (master and worker on the same machine), and it worked fine for me.
Then I downloaded a pre-built version of Spark and, following these instructions, placed it on every node of my cluster: http://spark.apache.org/docs/latest/spark-standalone.html#installing-spark-standalone-to-a-cluster.
My master node has the IP address 172.17.0.224, and my slave nodes have the IP addresses 172.17.0.221, 172.17.0.222 and 172.17.0.223.
I edited the slaves and spark-env.sh files to add the IP addresses of my slaves and the IP address of my master, respectively.
I started the master node with start-master.sh and the slave nodes with start-slaves.sh; everything worked fine.
I submitted my Spark job using the command:
spark-submit --class "Rice" --master spark://172.17.0.224:7077 cs453project/target/scala-2.11/simple-project_2.11-1.0.jar cs453project/input.txt cs453project/ouput2 cs453project/ouput3
These are the error messages I got:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/25 11:22:27 INFO SparkContext: Running Spark version 1.5.2
15/11/25 11:22:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/25 11:22:28 WARN Utils: Your hostname, node04 resolves to a loopback address: 127.0.1.1; using 172.17.0.224 instead (on interface eth0)
15/11/25 11:22:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/11/25 11:22:28 INFO SecurityManager: Changing view acls to: ujjwal
15/11/25 11:22:28 INFO SecurityManager: Changing modify acls to: ujjwal
15/11/25 11:22:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ujjwal); users with modify permissions: Set(ujjwal)
15/11/25 11:22:28 INFO Slf4jLogger: Slf4jLogger started
15/11/25 11:22:28 INFO Remoting: Starting remoting
15/11/25 11:22:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#172.17.0.224:58478]
15/11/25 11:22:28 INFO Utils: Successfully started service 'sparkDriver' on port 58478.
15/11/25 11:22:28 INFO SparkEnv: Registering MapOutputTracker
15/11/25 11:22:28 INFO SparkEnv: Registering BlockManagerMaster
15/11/25 11:22:28 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-bc18e422-d334-4fe5-9663-9439620ec054
15/11/25 11:22:28 INFO MemoryStore: MemoryStore started with capacity 530.3 MB
15/11/25 11:22:29 INFO HttpFileServer: HTTP File server directory is /tmp/spark-7c6e0ad4-52ae-4f5a-9aaa-6ad9fbf48685/httpd-13d8dd4d-6ff1-450d-baac-f2702c7a4e5b
15/11/25 11:22:29 INFO HttpServer: Starting HTTP Server
15/11/25 11:22:29 INFO Utils: Successfully started service 'HTTP file server' on port 49496.
15/11/25 11:22:29 INFO SparkEnv: Registering OutputCommitCoordinator
15/11/25 11:22:29 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/11/25 11:22:29 INFO SparkUI: Started SparkUI at http://172.17.0.224:4040
15/11/25 11:22:29 INFO SparkContext: Added JAR file:/home/ujjwal/cs453project/target/scala-2.11/simple-project_2.11-1.0.jar at http://172.17.0.224:49496/jars/simple-project_2.11-1.0.jar with timestamp 1448479349380
15/11/25 11:22:29 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/11/25 11:22:29 INFO AppClient$ClientEndpoint: Connecting to master spark://172.17.0.224:7077...
15/11/25 11:22:29 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20151125112229-0001
15/11/25 11:22:29 INFO AppClient$ClientEndpoint: Executor added: app-20151125112229-0001/0 on worker-20151125095922-172.17.0.221-33366 (172.17.0.221:33366) with 2 cores
15/11/25 11:22:29 INFO SparkDeploySchedulerBackend: Granted executor ID app-20151125112229-0001/0 on hostPort 172.17.0.221:33366 with 2 cores, 1024.0 MB RAM
15/11/25 11:22:29 INFO AppClient$ClientEndpoint: Executor updated: app-20151125112229-0001/0 is now LOADING
15/11/25 11:22:29 INFO AppClient$ClientEndpoint: Executor updated: app-20151125112229-0001/0 is now RUNNING
15/11/25 11:22:29 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47843.
15/11/25 11:22:29 INFO NettyBlockTransferService: Server created on 47843
15/11/25 11:22:29 INFO BlockManagerMaster: Trying to register BlockManager
15/11/25 11:22:29 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.224:47843 with 530.3 MB RAM, BlockManagerId(driver, 172.17.0.224, 47843)
15/11/25 11:22:29 INFO BlockManagerMaster: Registered BlockManager
15/11/25 11:22:29 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/11/25 11:22:30 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0, maxMem=556038881
15/11/25 11:22:30 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 530.1 MB)
15/11/25 11:22:30 INFO MemoryStore: ensureFreeSpace(14276) called with curMem=157248, maxMem=556038881
15/11/25 11:22:30 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 530.1 MB)
15/11/25 11:22:30 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.17.0.224:47843 (size: 13.9 KB, free: 530.3 MB)
15/11/25 11:22:30 INFO SparkContext: Created broadcast 0 from textFile at build.scala:11
15/11/25 11:22:30 INFO FileInputFormat: Total input paths to process : 1
15/11/25 11:22:30 INFO SparkContext: Starting job: count at build.scala:13
15/11/25 11:22:30 INFO DAGScheduler: Got job 0 (count at build.scala:13) with 108 output partitions
15/11/25 11:22:30 INFO DAGScheduler: Final stage: ResultStage 0(count at build.scala:13)
15/11/25 11:22:30 INFO DAGScheduler: Parents of final stage: List()
15/11/25 11:22:30 INFO DAGScheduler: Missing parents: List()
15/11/25 11:22:30 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at map at build.scala:12), which has no missing parents
15/11/25 11:22:30 INFO MemoryStore: ensureFreeSpace(3424) called with curMem=171524, maxMem=556038881
15/11/25 11:22:30 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.3 KB, free 530.1 MB)
15/11/25 11:22:30 INFO MemoryStore: ensureFreeSpace(1934) called with curMem=174948, maxMem=556038881
15/11/25 11:22:30 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1934.0 B, free 530.1 MB)
15/11/25 11:22:30 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.17.0.224:47843 (size: 1934.0 B, free: 530.3 MB)
15/11/25 11:22:30 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
15/11/25 11:22:30 INFO DAGScheduler: Submitting 108 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at map at build.scala:12)
15/11/25 11:22:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 108 tasks
15/11/25 11:22:31 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#172.17.0.221:55861/user/Executor#-498212581]) with ID 0
15/11/25 11:22:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.221:49642 with 530.3 MB RAM, BlockManagerId(0, 172.17.0.221, 49642)
15/11/25 11:22:32 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.17.0.221:49642 (size: 1934.0 B, free: 530.3 MB)
15/11/25 11:22:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.17.0.221:49642 (size: 13.9 KB, free: 530.3 MB)
15/11/25 11:22:32 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 172.17.0.221): java.io.FileNotFoundException: File file:/home/ujjwal/cs453project/input.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 1]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 3.1 in stage 0.0 (TID 5, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 2]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 6, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 3]
15/11/25 11:22:32 INFO TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 4]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 4.1 in stage 0.0 (TID 7, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 3.1 in stage 0.0 (TID 5) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 5]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 3.2 in stage 0.0 (TID 8, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 6) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 6]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 9, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 3.2 in stage 0.0 (TID 8) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 7]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 3.3 in stage 0.0 (TID 10, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 4.1 in stage 0.0 (TID 7) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 8]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 4.2 in stage 0.0 (TID 11, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 9) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 9]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 12, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 3.3 in stage 0.0 (TID 10) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 10]
15/11/25 11:22:32 ERROR TaskSetManager: Task 3 in stage 0.0 failed 4 times; aborting job
15/11/25 11:22:32 INFO TaskSchedulerImpl: Cancelling stage 0
15/11/25 11:22:32 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/11/25 11:22:32 INFO DAGScheduler: ResultStage 0 (count at build.scala:13) failed in 2.216 s
15/11/25 11:22:32 INFO TaskSetManager: Lost task 4.2 in stage 0.0 (TID 11) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 11]
15/11/25 11:22:32 INFO DAGScheduler: Job 0 failed: count at build.scala:13, took 2.373631 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 10, 172.17.0.221): java.io.FileNotFoundException: File file:/home/ujjwal/cs453project/input.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
at org.apache.spark.rdd.RDD.count(RDD.scala:1125)
at Rice$.main(build.scala:13)
at Rice.main(build.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: File file:/home/ujjwal/cs453project/input.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 12) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 12]
15/11/25 11:22:32 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/11/25 11:22:32 INFO SparkContext: Invoking stop() from shutdown hook
15/11/25 11:22:33 INFO SparkUI: Stopped Spark web UI at http://172.17.0.224:4040
15/11/25 11:22:33 INFO DAGScheduler: Stopping DAGScheduler
15/11/25 11:22:33 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/11/25 11:22:33 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
15/11/25 11:22:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/11/25 11:22:33 INFO MemoryStore: MemoryStore cleared
15/11/25 11:22:33 INFO BlockManager: BlockManager stopped
15/11/25 11:22:33 INFO BlockManagerMaster: BlockManagerMaster stopped
15/11/25 11:22:33 INFO SparkContext: Successfully stopped SparkContext
15/11/25 11:22:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/11/25 11:22:33 INFO ShutdownHookManager: Shutdown hook called
15/11/25 11:22:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-7c6e0ad4-52ae-4f5a-9aaa-6ad9fbf48685
Could you please help me understand how I can solve this problem? Thanks!
The path you used is probably local to the driver only. You have to use a path that is accessible to all of the workers. The driver does not send the actual data to the workers; that would be prohibitively slow. Instead, each worker tries to read the data from the path you gave it. In this case, they fail because they don't have the file locally.
@user3180835, as suggested by @Mike Park, in my case it started working after I copied the file from the local Linux file system to HDFS.
hdfs dfs -cp file:///<path_to_local_file> /<hdfs_file_dir>
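A minimal shell sketch of that workflow, assuming the `hdfs` CLI is on the PATH. The helper names `is_local_path` and `stage_input` and the target directory used below are hypothetical, not taken from the posts above. The first function flags paths that only exist on the machine that resolves them; the second copies such a file into HDFS so that every worker can read it.

```shell
#!/usr/bin/env bash

# Succeeds (exit 0) when the path uses file: or has no URI scheme.
# Assumption: a scheme-less path is treated as local, which only holds
# when the default file system is the local one (as in the trace above).
is_local_path() {
  case "$1" in
    file:*) return 0 ;;
    *://*)  return 1 ;;   # hdfs://, s3a://, ... point at shared storage
    *)      return 0 ;;
  esac
}

# Hypothetical helper: copies a local file into an HDFS directory.
# Strips a leading "file:" prefix; file://host/... URIs are not handled.
stage_input() {
  local src="$1" dest_dir="$2"
  if is_local_path "$src"; then
    hdfs dfs -mkdir -p "$dest_dir" &&
      hdfs dfs -put -f "${src#file:}" "$dest_dir/"
  else
    echo "already on shared storage: $src"
  fi
}
```

For example, `stage_input file:/home/ujjwal/cs453project/input.txt /user/ujjwal/input` would upload the file, after which the job could read `hdfs:///user/ujjwal/input/input.txt` (the HDFS directory here is only an illustration).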

Spark metrics on wordcount example

I read the Metrics section on the Spark website and wanted to try it on the wordcount example, but I can't make it work.
spark/conf/metrics.properties :
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
# Polling directory for CsvSink
*.sink.csv.directory=/home/spark/Documents/test/
# Worker instance overlap polling period
worker.sink.csv.period=1
worker.sink.csv.unit=seconds
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I run my app in local mode, as in the documentation:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
I checked /home/spark/Documents/test/ and it is empty.
What did I miss?
Shell:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] --conf spark.metrics.conf=/home/spark/development/spark/conf/metrics.properties target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
INFO SparkContext: Running Spark version 1.3.0
WARN Utils: Your hostname, cv-local resolves to a loopback address: 127.0.1.1; using 192.168.1.64 instead (on interface eth0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
INFO SecurityManager: Changing view acls to: spark
INFO SecurityManager: Changing modify acls to: spark
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
INFO Slf4jLogger: Slf4jLogger started
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@cv-local.local:35895]
INFO Utils: Successfully started service 'sparkDriver' on port 35895.
INFO SparkEnv: Registering MapOutputTracker
INFO SparkEnv: Registering BlockManagerMaster
INFO DiskBlockManager: Created local directory at /tmp/spark-447d56c9-cfe5-4f9d-9e0a-6bb476ddede6/blockmgr-4eaa04f4-b4b2-4b05-ba0e-fd1aeb92b289
INFO MemoryStore: MemoryStore started with capacity 265.4 MB
INFO HttpFileServer: HTTP File server directory is /tmp/spark-fae11cd2-937e-4be3-a273-be8b4c4847df/httpd-ca163445-6fff-45e4-9c69-35edcea83b68
INFO HttpServer: Starting HTTP Server
INFO Utils: Successfully started service 'HTTP file server' on port 52828.
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4040.
INFO SparkUI: Started SparkUI at http://cv-local.local:4040
INFO SparkContext: Added JAR file:/home/spark/workspace/IdeaProjects/wordcount/target/scala-2.10/simple-project_2.10-1.0.jar at http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Executor: Starting executor ID <driver> on host localhost
INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@cv-local.local:35895/user/HeartbeatReceiver
INFO NettyBlockTransferService: Server created on 60320
INFO BlockManagerMaster: Trying to register BlockManager
INFO BlockManagerMasterActor: Registering block manager localhost:60320 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 60320)
INFO BlockManagerMaster: Registered BlockManager
INFO MemoryStore: ensureFreeSpace(34046) called with curMem=0, maxMem=278302556
INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 33.2 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(5221) called with curMem=34046, maxMem=278302556
INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.1 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60320 (size: 5.1 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN LoadSnappy: Snappy native library not loaded
INFO FileInputFormat: Total input paths to process : 1
INFO SparkContext: Starting job: count at SimpleApp.scala:12
INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:12)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=39267, maxMem=278302556
INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.8 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=42115, maxMem=278302556
INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
INFO Executor: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Utils: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar to /tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/fetchFileTemp4229868141058449157.tmp
INFO Executor: Adding file:/tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/simple-project_2.10-1.0.jar to class loader
INFO CacheManager: Partition rdd_1_1 not found, computing it
INFO CacheManager: Partition rdd_1_0 not found, computing it
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:2659+2659
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:0+2659
INFO MemoryStore: ensureFreeSpace(7840) called with curMem=44171, maxMem=278302556
INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 7.7 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:60320 (size: 7.7 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_0
INFO MemoryStore: ensureFreeSpace(8648) called with curMem=52011, maxMem=278302556
INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 8.4 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:60320 (size: 8.4 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_1
INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2399 bytes result sent to driver
INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2399 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 139 ms on localhost (1/2)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 133 ms on localhost (2/2)
INFO DAGScheduler: Stage 0 (count at SimpleApp.scala:12) finished in 0.151 s
INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.225939 s
INFO SparkContext: Starting job: count at SimpleApp.scala:13
INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 1(count at SimpleApp.scala:13)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=60659, maxMem=278302556
INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.8 KB, free 265.3 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=63507, maxMem=278302556
INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.3 MB)
INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
INFO BlockManager: Found block rdd_1_0 locally
INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 9 ms on localhost (1/2)
INFO BlockManager: Found block rdd_1_1 locally
INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 10 ms on localhost (2/2)
INFO DAGScheduler: Stage 1 (count at SimpleApp.scala:13) finished in 0.011 s
INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.024084 s
Lines with a: 5, Lines with b: 12
I made it work by passing the path to the metrics file in the spark-submit command:
--files=/yourPath/metrics.properties --conf spark.metrics.conf=./metrics.properties
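Put together with the command from the question, the full invocation might look like the sketch below. The wrapper function `submit_with_metrics` and the `SUBMIT` override are hypothetical conveniences, not part of Spark; the class name, master, and flags come from the posts above. `--files` places the listed file in the working directory of each executor, which is why the relative `./metrics.properties` path is used for `spark.metrics.conf`.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper around spark-submit.
#   $1: path to metrics.properties on the submitting machine
#   $2: application jar
# Set SUBMIT=echo to print the arguments instead of launching Spark.
submit_with_metrics() {
  local metrics_file="$1" app_jar="$2"
  local submit="${SUBMIT:-$SPARK_HOME/bin/spark-submit}"
  "$submit" \
    --class SimpleApp \
    --master 'local[4]' \
    --files "$metrics_file" \
    --conf "spark.metrics.conf=./$(basename "$metrics_file")" \
    "$app_jar"
}
```

A dry run such as `SUBMIT=echo submit_with_metrics /yourPath/metrics.properties target/scala-2.10/simple-project_2.10-1.0.jar` shows the exact arguments before anything is launched.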

Resources