Identical code works in pyspark shell but not via spark-submit - apache-spark

So I have a PySpark project with the following structure:
main.py: does the real work (imports PySpark UDFs from utils.py and parameters from common.py)
utils.py: some utility functions (imports from common.py)
common.py: some parameters
Inside a PySpark shell, I can run the code from common.py, utils.py, and main.py in that order and get the result I want; however, when I submit it via spark-submit, no error is reported, but the job keeps running at a cluster load of under 1%, which I suspect means nothing is actually being computed.
Here is the spark-submit command:
spark-submit --master yarn --deploy-mode cluster --driver-cores 4 --driver-memory 20G --num-executors 10 --executor-cores 4 --executor-memory 20G --py-files project.zip project/main.py
Here is what's inside main.py (the other 2 .py files were a bit lengthy):
from pyspark.sql import SparkSession
from utils import foo1, foo2
from common import bar1, bar2

if __name__ == '__main__':
    input_path = bar1
    output_path = bar2
    # build spark session
    spark = SparkSession.builder\
        .appName("app")\
        .getOrCreate()
    rd = spark.read.json(input_path).repartition(100)
    # add a new column Col2
    rd_app = rd.withColumn('Col2', foo1(rd.Col1))
    rd_app_not_null = rd_app.filter('Col2 is not null')
    # cleanup queries
    rd_no_query = rd_app_not_null\
        .withColumn('URLClean', foo2(rd_app_not_null.URL))\
        .drop('URL')\
        .withColumnRenamed('URLClean', 'URL')
    # save to S3
    rd_no_query.write.json(output_path, compression='gzip')
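For context, a minimal sketch of what utils.py and common.py might look like (hypothetical contents, since the real files were omitted as lengthy; this only illustrates the import structure described above):
# common.py (hypothetical contents)
bar1 = 's3://some-input-bucket/input/'    # input path read by main.py
bar2 = 's3://some-output-bucket/output/'  # output path written by main.py
# utils.py (hypothetical contents)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import common  # utils.py depends on common.py, as described above
def _clean_col(value):
    return value.strip() if value else None
def _strip_query(url):
    return url.split('?')[0] if url else None
foo1 = udf(_clean_col, StringType())    # applied as rd.withColumn('Col2', foo1(rd.Col1))
foo2 = udf(_strip_query, StringType())  # applied to the URL column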
Running on EMR; the PySpark shell runs in YARN client mode, spark-submit in YARN cluster mode.
Any help is appreciated!
Edit: the last few lines of the logs (it stayed like this for hours):
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 285.578911 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 41.334709 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 9.883221 ms
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 313.5 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 28.2 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.42.228:35995 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 2 from json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
17/03/08 22:06:31 INFO SparkContext: Starting job: json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO DAGScheduler: Registering RDD 8 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Got job 1 (json at NativeMethodAccessorImpl.java:0) with 100 output partitions
17/03/08 22:06:31 INFO DAGScheduler: Final stage: ResultStage 2 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0), which has no missing parents
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 25.3 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.8 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.31.42.228:35995 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/03/08 22:06:31 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO YarnClusterScheduler: Adding task set 1.0 with 24 tasks
17/03/08 22:06:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, ip-172-31-46-251.ec2.internal, executor 2, partition 0, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, ip-172-31-43-215.ec2.internal, executor 4, partition 1, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, ip-172-31-41-81.ec2.internal, executor 3, partition 2, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 27, ip-172-31-34-182.ec2.internal, executor 5, partition 3, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 28, ip-172-31-46-251.ec2.internal, executor 2, partition 4, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 29, ip-172-31-43-215.ec2.internal, executor 4, partition 5, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 30, ip-172-31-41-81.ec2.internal, executor 3, partition 6, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 31, ip-172-31-34-182.ec2.internal, executor 5, partition 7, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 32, ip-172-31-46-251.ec2.internal, executor 2, partition 8, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 33, ip-172-31-43-215.ec2.internal, executor 4, partition 9, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 10.0 in stage 1.0 (TID 34, ip-172-31-41-81.ec2.internal, executor 3, partition 10, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 35, ip-172-31-34-182.ec2.internal, executor 5, partition 11, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 12.0 in stage 1.0 (TID 36, ip-172-31-46-251.ec2.internal, executor 2, partition 12, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 13.0 in stage 1.0 (TID 37, ip-172-31-43-215.ec2.internal, executor 4, partition 13, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 14.0 in stage 1.0 (TID 38, ip-172-31-41-81.ec2.internal, executor 3, partition 14, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 15.0 in stage 1.0 (TID 39, ip-172-31-34-182.ec2.internal, executor 5, partition 15, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO YarnAllocator: Driver requested a total number of 5 executor(s).
17/03/08 22:06:32 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:32 INFO ExecutorAllocationManager: Requesting 5 new executors because tasks are backlogged (new desired total will be 5)
17/03/08 22:06:32 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:32 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO AMRMClientImpl: Received new token for : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:32 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000009 on host ip-172-31-45-4.ec2.internal
17/03/08 22:06:32 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:33 INFO YarnAllocator: Driver requested a total number of 6 executor(s).
17/03/08 22:06:33 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:33 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 6)
17/03/08 22:06:33 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:33 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000012 on host ip-172-31-40-142.ec2.internal
17/03/08 22:06:33 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-40-142.ec2.internal:8041
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.40.142:36208) with ID 8
17/03/08 22:06:36 INFO TaskSetManager: Starting task 16.0 in stage 1.0 (TID 40, ip-172-31-40-142.ec2.internal, executor 8, partition 16, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:36 INFO ExecutorAllocationManager: New executor 8 has registered (new total is 5)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 17.0 in stage 1.0 (TID 41, ip-172-31-40-142.ec2.internal, executor 8, partition 17, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 18.0 in stage 1.0 (TID 42, ip-172-31-40-142.ec2.internal, executor 8, partition 18, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 19.0 in stage 1.0 (TID 43, ip-172-31-40-142.ec2.internal, executor 8, partition 19, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-40-142.ec2.internal:42371 with 11.8 GB RAM, BlockManagerId(8, ip-172-31-40-142.ec2.internal, 42371, None)
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO AMRMClientImpl: Received new token for : ip-172-31-33-112.ec2.internal:8041
17/03/08 22:06:36 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 0 of them.
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:37 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.45.4:33330) with ID 7
17/03/08 22:06:37 INFO TaskSetManager: Starting task 20.0 in stage 1.0 (TID 44, ip-172-31-45-4.ec2.internal, executor 7, partition 20, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO ExecutorAllocationManager: New executor 7 has registered (new total is 6)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 21.0 in stage 1.0 (TID 45, ip-172-31-45-4.ec2.internal, executor 7, partition 21, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 22.0 in stage 1.0 (TID 46, ip-172-31-45-4.ec2.internal, executor 7, partition 22, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 23.0 in stage 1.0 (TID 47, ip-172-31-45-4.ec2.internal, executor 7, partition 23, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-45-4.ec2.internal:45646 with 11.8 GB RAM, BlockManagerId(7, ip-172-31-45-4.ec2.internal, 45646, None)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 28.2 KB, free: 11.8 GB)

Related

"Application attempt...doesn't exist in ApplicationMasterService cacheā€ cause? (Pregel: maxIterations impact on cluster for non-convergent algorithm)

I've tried to run my own Pregel method on a relatively small graph (250k vertices, 1.5M edges). The algorithm I use may well be (there is a high chance it is) non-convergent, meaning that in most cases the maxIterations setting acts as the hard stop that ends all calculations.
I'm using AWS EMR with Apache Spark and m5.2xlarge instances for all nodes, in a setup with EMR-managed scaling. Initially the cluster runs 1 master and 4 worker nodes, with expansion up to a maximum of 8.
For the same cluster setup, I gradually increased maxIterations from 100 to 500 in steps of 100 [100, 200, 300, 400, 500]. I assumed that a setup sufficient for 100 iterations would also be sufficient for any other number, simply because unused memory would be freed up.
However, when I ran the set of jobs with maxIterations increasing from 100 to 500, I found that all jobs with maxIterations > 100 were terminated due to a step error. I checked the Spark logs for issues, and this is what I got:
log start
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt1/yarn/usercache/hadoop/filecache/10/__spark_libs__364046395941885636.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for TERM
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for HUP
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for INT
21/02/13 21:23:24 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/13 21:23:24 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/13 21:23:24 INFO SecurityManager: Changing view acls groups to:
21/02/13 21:23:24 INFO SecurityManager: Changing modify acls groups to:
21/02/13 21:23:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:24 INFO ApplicationMaster: Preparing Local resources
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:25 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1613251201422_0001_000001
21/02/13 21:23:25 INFO ApplicationMaster: Starting the user application in a separate Thread
21/02/13 21:23:25 INFO ApplicationMaster: Waiting for spark context initialization...
21/02/13 21:23:25 INFO SparkContext: Running Spark version 2.4.7-amzn-0
21/02/13 21:23:25 INFO SparkContext: Submitted application: Read JDBC Datasites2
21/02/13 21:23:25 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/13 21:23:25 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/13 21:23:25 INFO SecurityManager: Changing view acls groups to:
21/02/13 21:23:25 INFO SecurityManager: Changing modify acls groups to:
21/02/13 21:23:25 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:25 INFO Utils: Successfully started service 'sparkDriver' on port 41117.
21/02/13 21:23:25 INFO SparkEnv: Registering MapOutputTracker
21/02/13 21:23:25 INFO SparkEnv: Registering BlockManagerMaster
21/02/13 21:23:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/02/13 21:23:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/02/13 21:23:25 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1613251201422_0001/blockmgr-bc544c91-1a59-41f3-890f-faaa392bea09
21/02/13 21:23:25 INFO DiskBlockManager: Created local directory at /mnt1/yarn/usercache/hadoop/appcache/application_1613251201422_0001/blockmgr-14e3f36f-6d3f-4ffe-a28c-fa3f81f0c5c9
21/02/13 21:23:26 INFO MemoryStore: MemoryStore started with capacity 1008.9 MB
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:26 INFO SparkEnv: Registering OutputCommitCoordinator
21/02/13 21:23:26 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
21/02/13 21:23:26 INFO Utils: Successfully started service 'SparkUI' on port 43659.
21/02/13 21:23:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ip-172-31-21-88.ec2.internal:43659
21/02/13 21:23:26 INFO YarnClusterScheduler: Created YarnClusterScheduler
21/02/13 21:23:26 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1613251201422_0001 and attemptId Some(appattempt_1613251201422_0001_000001)
21/02/13 21:23:26 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34665.
21/02/13 21:23:26 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:26 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir; Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs; Ignoring.
21/02/13 21:23:27 INFO RMProxy: Connecting to ResourceManager at ip-172-31-29-
command:
LD_LIBRARY_PATH=\"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH\" \
{{JAVA_HOME}}/bin/java \
-server \
-Xmx4743m \
'-verbose:gc' \
'-XX:+PrintGCDetails' \
'-XX:+PrintGCDateStamps' \
'-XX:OnOutOfMemoryError=kill -9 %p' \
'-XX:+UseParallelGC' \
'-XX:InitiatingHeapOccupancyPercent=70' \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.history.ui.port=18080' \
'-Dspark.ui.port=0' \
'-Dspark.driver.port=41117' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler#ip-172-31-21-88.ec2.internal:41117 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
2 \
--app-id \
application_1613251201422_0001 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
__app__.jar -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/force-pregel.jar" } size: 27378 timestamp: 1613251399566 type: FILE visibility: PRIVATE
__spark_libs__ -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/__spark_libs__364046395941885636.zip" } size: 239655683 timestamp: 1613251397751 type: ARCHIVE visibility: PRIVATE
__spark_conf__ -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/__spark_conf__.zip" } size: 274365 timestamp: 1613251399776 type: ARCHIVE visibility: PRIVATE
hive-site.xml -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/hive-site.xml" } size: 2137 timestamp: 1613251399631 type: FILE visibility: PRIVATE
===============================================================================
21/02/13 21:23:27 INFO Configuration: resource-types.xml not found
21/02/13 21:23:27 INFO ResourceUtils: Unable to find 'resource-types.xml'.
21/02/13 21:23:27 INFO ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
21/02/13 21:23:27 INFO ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
21/02/13 21:23:27 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:27 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM#ip-172-31-21-88.ec2.internal:41117)
21/02/13 21:23:27 INFO YarnAllocator: Will request up to 100 executor container(s), each with <memory:5632, max memory:2147483647, vCores:2, max vCores:2147483647>
21/02/13 21:23:27 INFO YarnAllocator: Submitted 100 unlocalized container requests.
21/02/13 21:23:27 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
21/02/13 21:23:27 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000002 on host ip-172-31-21-88.ec2.internal for executor with ID 1 with resources <memory:5632, max memory:12288, vCores:1, max vCores:8>
21/02/13 21:23:27 INFO YarnAllocator: Launching executor with 4742m of heap (plus 890m overhead) and 2 cores
21/02/13 21:23:27 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
21/02/13 21:23:28 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000004 on host ip-172-31-25-102.ec2.internal for executor with ID 2 with resources <memory:11264, vCores:2>
21/02/13 21:23:28 INFO YarnAllocator: Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores
21/02/13 21:23:28 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000006 on host ip-172-31-28-143.ec2.internal for executor with ID 3 with resources <memory:11264, vCores:2>
21/02/13 21:23:28 INFO YarnAllocator: Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores
21/02/13 21:23:28 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
30 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.21.88:53634) with ID 1
21/02/13 21:23:30 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
21/02/13 21:23:30 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-21-88.ec2.internal:45667 with 2.3 GB RAM, BlockManagerId(1, ip-172-31-21-88.ec2.internal, 45667, None)
then approximately 2 MB of the same output follow, and then it finishes with:
21/02/13 21:28:25 INFO TaskSetManager: Finished task 199.0 in stage 37207.0 (TID 93528) in 8 ms on ip-172-31-25-102.ec2.internal (executor 2) (158/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_31 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 252.3 KB, free: 2.1 GB)
21/02/13 21:28:25 ERROR ApplicationMaster: Exception from Reporter thread.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy23.allocate(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:300)
at org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:279)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$allocationThreadImpl(ApplicationMaster.scala:541)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:607)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
at org.apache.hadoop.ipc.Client.call(Client.java:1495)
at org.apache.hadoop.ipc.Client.call(Client.java:1394)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy22.allocate(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
... 13 more
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_30 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 244.8 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 40.0 in stage 37207.0 (TID 93533, ip-172-31-21-88.ec2.internal, executor 1, partition 40, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 31.0 in stage 37207.0 (TID 93532) in 16 ms on ip-172-31-21-88.ec2.internal (executor 1) (162/200)
21/02/13 21:28:25 INFO ApplicationMaster: Final app status: FAILED, exitCode: 12, (reason: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)
)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 41.0 in stage 37207.0 (TID 93534, ip-172-31-21-88.ec2.internal, executor 1, partition 41, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 30.0 in stage 37207.0 (TID 93531) in 22 ms on ip-172-31-21-88.ec2.internal (executor 1) (163/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_40 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 234.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 48.0 in stage 37207.0 (TID 93535, ip-172-31-21-88.ec2.internal, executor 1, partition 48, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 40.0 in stage 37207.0 (TID 93533) in 17 ms on ip-172-31-21-88.ec2.internal (executor 1) (164/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_41 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 233.4 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 51.0 in stage 37207.0 (TID 93536, ip-172-31-21-88.ec2.internal, executor 1, partition 51, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 41.0 in stage 37207.0 (TID 93534) in 15 ms on ip-172-31-21-88.ec2.internal (executor 1) (165/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_48 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 235.1 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 57.0 in stage 37207.0 (TID 93537, ip-172-31-21-88.ec2.internal, executor 1, partition 57, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 48.0 in stage 37207.0 (TID 93535) in 11 ms on ip-172-31-21-88.ec2.internal (executor 1) (166/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_57 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 232.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_51 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 244.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 61.0 in stage 37207.0 (TID 93538, ip-172-31-21-88.ec2.internal, executor 1, partition 61, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 57.0 in stage 37207.0 (TID 93537) in 10 ms on ip-172-31-21-88.ec2.internal (executor 1) (167/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 63.0 in stage 37207.0 (TID 93539, ip-172-31-21-88.ec2.internal, executor 1, partition 63, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 51.0 in stage 37207.0 (TID 93536) in 17 ms on ip-172-31-21-88.ec2.internal (executor 1) (168/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_61 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 228.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 67.0 in stage 37207.0 (TID 93540, ip-172-31-21-88.ec2.internal, executor 1, partition 67, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 61.0 in stage 37207.0 (TID 93538) in 10 ms on ip-172-31-21-88.ec2.internal (executor 1) (169/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_63 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 238.3 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 71.0 in stage 37207.0 (TID 93541, ip-172-31-21-88.ec2.internal, executor 1, partition 71, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 63.0 in stage 37207.0 (TID 93539) in 14 ms on ip-172-31-21-88.ec2.internal (executor 1) (170/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_67 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 247.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_71 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 243.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 77.0 in stage 37207.0 (TID 93542, ip-172-31-21-88.ec2.internal, executor 1, partition 77, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 67.0 in stage 37207.0 (TID 93540) in 18 ms on ip-172-31-21-88.ec2.internal (executor 1) (171/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 79.0 in stage 37207.0 (TID 93543, ip-172-31-21-88.ec2.internal, executor 1, partition 79, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 71.0 in stage 37207.0 (TID 93541) in 12 ms on ip-172-31-21-88.ec2.internal (executor 1) (172/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_79 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 253.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_77 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 222.5 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 86.0 in stage 37207.0 (TID 93544, ip-172-31-21-88.ec2.internal, executor 1, partition 86, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 79.0 in stage 37207.0 (TID 93543) in 12 ms on ip-172-31-21-88.ec2.internal (executor 1) (173/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 87.0 in stage 37207.0 (TID 93545, ip-172-31-21-88.ec2.internal, executor 1, partition 87, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 77.0 in stage 37207.0 (TID 93542) in 14 ms on ip-172-31-21-88.ec2.internal (executor 1) (174/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_86 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 254.5 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_87 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 267.1 KB, free: 2.1 GB)
Am I correct that Pregel doesn't finish 200 or more iterations because of an OutOfMemory error on some of the cluster nodes?
If so, how does Pregel work such that 100 iterations don't cause it but 200 or 300 do? My understanding before this issue was that Pregel, like many other iterative approaches, only 'stores' the previous and current iterations' values and results; iteration by iteration the values change, but their quantity does not increase, meaning it is still a graph with 250k vertices and 1.5M edges, and only the messages valid for the current iteration add to the heap.
Throughout the log I was not able to find any information about low memory, and as seen above, there are gigabytes of it available on each node before the job terminates.
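For illustration only (this is hypothetical code, not the Pregel job above): in an iterative Spark loop, the logical plan and lineage grow with every iteration unless they are periodically truncated, so the per-iteration cost is not necessarily constant even when the data itself stays the same size. A minimal PySpark sketch of that pattern:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("lineage-growth-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # required before checkpoint()
df = spark.range(0, 250000).withColumn("value", F.lit(0.0))
for i in range(300):  # stands in for maxIterations
    # each iteration builds on the previous DataFrame, so the plan keeps growing
    df = df.withColumn("value", df["value"] + F.rand())
    if i % 50 == 0:
        df = df.checkpoint()  # truncates the lineage accumulated so far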

How to correctly parallelize multiple JSON file aggregation in PySpark

I have a large set of json_lines files on S3 containing some logs that I would like to aggregate (basically just count the number of requests by path, location, etc.). I've been doing the following, but judging by the logs, I'm not sure it's actually parallelized: first it takes about 3 minutes to download the individual S3 files one by one, and then the rest still seems to be split into 1000 executions. I thought Spark would break this down into a map-reduce kind of approach itself, but maybe I totally misunderstood what it does and doesn't do. Could someone provide a hint, please?
df = (
    spark.read
    .json(test_paths, schema=schema)
    .filter(col('method') == 'GET')
    .filter((col('status_code') == 200) | (col('status_code') == 206))
    .withColumn('date', from_unixtime('timestamp').cast(DateType()))
    .groupBy('path', 'client_country_code', 'date', 'file_size')
    .count()
)
Here's the driver log for 1000 URLs:
20/11/15 19:15:23 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under 1000 paths. The first several paths are: s3n://bucket../10004.json_lines.gz.
20/11/15 19:15:23 INFO SparkContext: Starting job: json at NativeMethodAccessorImpl.java:0
20/11/15 19:15:23 INFO DAGScheduler: Got job 49 (json at NativeMethodAccessorImpl.java:0) with 1000 output partitions
20/11/15 19:15:23 INFO DAGScheduler: Final stage: ResultStage 75 (json at NativeMethodAccessorImpl.java:0)
20/11/15 19:15:23 INFO DAGScheduler: Parents of final stage: List()
20/11/15 19:15:23 INFO DAGScheduler: Missing parents: List()
20/11/15 19:15:23 INFO DAGScheduler: Submitting ResultStage 75 (MapPartitionsRDD[206] at json at NativeMethodAccessorImpl.java:0), which has no missing parents
20/11/15 19:15:23 INFO MemoryStore: Block broadcast_77 stored as values in memory (estimated size 84.3 KiB, free 2.2 GiB)
20/11/15 19:15:23 INFO MemoryStore: Block broadcast_77_piece0 stored as bytes in memory (estimated size 29.9 KiB, free 2.2 GiB)
20/11/15 19:15:23 INFO BlockManagerInfo: Added broadcast_77_piece0 in memory on e05e979b7108:34999 (size: 29.9 KiB, free: 2.2 GiB)
20/11/15 19:15:23 INFO SparkContext: Created broadcast 77 from broadcast at DAGScheduler.scala:1223
20/11/15 19:15:23 INFO DAGScheduler: Submitting 1000 missing tasks from ResultStage 75 (MapPartitionsRDD[206] at json at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
20/11/15 19:15:23 INFO TaskSchedulerImpl: Adding task set 75.0 with 1000 tasks
20/11/15 19:15:23 INFO TaskSetManager: Starting task 0.0 in stage 75.0 (TID 33224, e05e979b7108, executor driver, partition 0, PROCESS_LOCAL, 7473 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 1.0 in stage 75.0 (TID 33225, e05e979b7108, executor driver, partition 1, PROCESS_LOCAL, 7473 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 2.0 in stage 75.0 (TID 33226, e05e979b7108, executor driver, partition 2, PROCESS_LOCAL, 7474 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 3.0 in stage 75.0 (TID 33227, e05e979b7108, executor driver, partition 3, PROCESS_LOCAL, 7475 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 4.0 in stage 75.0 (TID 33228, e05e979b7108, executor driver, partition 4, PROCESS_LOCAL, 7476 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 5.0 in stage 75.0 (TID 33229, e05e979b7108, executor driver, partition 5, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 6.0 in stage 75.0 (TID 33230, e05e979b7108, executor driver, partition 6, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 7.0 in stage 75.0 (TID 33231, e05e979b7108, executor driver, partition 7, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO Executor: Running task 0.0 in stage 75.0 (TID 33224)
20/11/15 19:15:23 INFO Executor: Running task 1.0 in stage 75.0 (TID 33225)
20/11/15 19:15:23 INFO Executor: Running task 2.0 in stage 75.0 (TID 33226)
20/11/15 19:15:23 INFO Executor: Running task 5.0 in stage 75.0 (TID 33229)
20/11/15 19:15:23 INFO Executor: Running task 3.0 in stage 75.0 (TID 33227)
20/11/15 19:15:23 INFO Executor: Running task 7.0 in stage 75.0 (TID 33231)
20/11/15 19:15:23 INFO Executor: Running task 6.0 in stage 75.0 (TID 33230)
20/11/15 19:15:23 INFO Executor: Running task 4.0 in stage 75.0 (TID 33228)
20/11/15 19:15:24 INFO Executor: Finished task 1.0 in stage 75.0 (TID 33225). 2025 bytes result sent to driver
20/11/15 19:15:24 INFO Executor: Finished task 0.0 in stage 75.0 (TID 33224). 2025 bytes result sent to driver
20/11/15 19:15:24 INFO TaskSetManager: Starting task 8.0 in stage 75.0 (TID 33232, e05e979b7108, executor driver, partition 8, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 1.0 in stage 75.0 (TID 33225) in 567 ms on e05e979b7108 (executor driver) (1/1000)
20/11/15 19:15:24 INFO Executor: Running task 8.0 in stage 75.0 (TID 33232)
20/11/15 19:15:24 INFO Executor: Finished task 6.0 in stage 75.0 (TID 33230). 2033 bytes result sent to driver
20/11/15 19:15:24 INFO TaskSetManager: Starting task 9.0 in stage 75.0 (TID 33233, e05e979b7108, executor driver, partition 9, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Starting task 10.0 in stage 75.0 (TID 33234, e05e979b7108, executor driver, partition 10, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 0.0 in stage 75.0 (TID 33224) in 570 ms on e05e979b7108 (executor driver) (2/1000)
20/11/15 19:15:24 INFO Executor: Running task 9.0 in stage 75.0 (TID 33233)
20/11/15 19:15:24 INFO Executor: Running task 10.0 in stage 75.0 (TID 33234)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 6.0 in stage 75.0 (TID 33230) in 571 ms on e05e979b7108 (executor driver) (3/1000)
....
20/11/15 19:15:43 INFO TaskSetManager: Finished task 998.0 in stage 75.0 (TID 34222) in 158 ms on e05e979b7108 (executor driver) (999/1000)
20/11/15 19:15:43 INFO Executor: Finished task 999.0 in stage 75.0 (TID 34223). 2033 bytes result sent to driver
20/11/15 19:15:43 INFO TaskSetManager: Finished task 999.0 in stage 75.0 (TID 34223) in 175 ms on e05e979b7108 (executor driver) (1000/1000)
20/11/15 19:15:43 INFO TaskSchedulerImpl: Removed TaskSet 75.0, whose tasks have all completed, from pool
20/11/15 19:15:43 INFO DAGScheduler: ResultStage 75 (json at NativeMethodAccessorImpl.java:0) finished in 19.850 s
20/11/15 19:15:43 INFO DAGScheduler: Job 49 is finished. Cancelling potential speculative or zombie tasks for this job
20/11/15 19:15:43 INFO TaskSchedulerImpl: Killing all running tasks in stage 75: Stage finished
20/11/15 19:15:43 INFO DAGScheduler: Job 49 finished: json at NativeMethodAccessorImpl.java:0, took 19.890458 s
20/11/15 19:15:43 INFO InMemoryFileIndex: It took 19936 ms to list leaf files for 1000 paths.
There's a lot of setup overhead, especially with many small files. JSON is also a very inefficient storage format, as the whole file needs to be read every time. Ideally each file should be 64+ MB to give the Spark workers enough data to process efficiently.
Have you considered making step 1 of your workflow just reading in the JSON files and then saving them in a columnar format like Parquet, in a smaller number of files?
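A minimal sketch of that suggested first step (bucket paths and the output file count are illustrative assumptions, not values from the question):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()
# read the raw json_lines logs once; passing an explicit schema (as in the question) avoids an extra inference pass
raw = spark.read.json("s3a://some-bucket/logs/*.json_lines.gz")
# consolidate many small inputs into fewer, larger columnar files
raw.coalesce(32).write.mode("overwrite").parquet("s3a://some-bucket/logs-parquet/")
# later aggregations read the Parquet copy instead of the raw JSON
df = spark.read.parquet("s3a://some-bucket/logs-parquet/")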

Spark Memory Error Java Runtime Environment

I have a Spark job written in Python, where I retrieve data from Redshift and then apply many transformations: join, filter, withColumn, agg, ...
There are around 30K records in the dataframes.
I perform all the transformations, and when I try to write an Avro file the Spark job fails.
My spark-submit:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 10 --executor-cores 3 --executor-memory 10G --driver-memory 14g --conf spark.sql.broadcastTimeout=3600 --conf spark.network.timeout=10000000 --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
I'm using --executor-memory 10G --driver-memory 14g on 6 machines in Amazon, each with 8 cores and 15G RAM. Why am I getting an out-of-memory error?
Error returned:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/hadoop/hs_err_pid13688.log
This is the end of the Spark log:
17/05/29 10:13:09 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 19, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:09 INFO TransportClientFactory: Successfully created connection to ip-10-185-53-172.eu-west-1.compute.internal/10.185.53.172:39759 after 3 ms (0 ms spent in bootstraps)
17/05/29 10:13:09 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.9 KB, free: 5.3 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 10.185.52.91:43829 in memory (size: 30.4 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.4 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on 10.185.52.91:43829 in memory (size: 30.3 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.3 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on 10.185.52.91:43829 in memory (size: 30.6 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.6 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Added taskresult_2 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 499.6 MB, free: 4.8 GB)
17/05/29 10:13:11 INFO TaskSetManager: Starting task 0.0 in stage 23.0 (TID 20, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:12 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.8 KB, free: 4.8 GB)
17/05/29 10:13:13 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 270161 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:13 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/05/29 10:13:13 INFO DAGScheduler: ResultStage 3 (run at ThreadPoolExecutor.java:1142) finished in 270.162 s
17/05/29 10:13:13 INFO DAGScheduler: Job 3 finished: run at ThreadPoolExecutor.java:1142, took 270.230067 s
17/05/29 10:13:13 INFO BlockManagerInfo: Removed taskresult_3 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.5 MB, free: 5.3 GB)
17/05/29 10:13:16 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.185.52.91:43829 in memory (size: 5.5 KB, free: 8.2 GB)
17/05/29 10:13:17 INFO BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 5.5 KB, free: 5.3 GB)
17/05/29 10:13:20 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 276982 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:20 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/05/29 10:13:20 INFO DAGScheduler: ResultStage 2 (run at ThreadPoolExecutor.java:1142) finished in 276.984 s
17/05/29 10:13:20 INFO DAGScheduler: Job 2 finished: run at ThreadPoolExecutor.java:1142, took 277.000009 s
17/05/29 10:13:20 INFO BlockManagerInfo: Removed taskresult_2 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.6 MB, free: 5.8 GB)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000667766000, 196608, 0) failed; error='Cannot allocate memory' (errno=12)
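As a rough back-of-the-envelope check of the numbers above (assuming Spark 2.x's default YARN memory overhead of max(384 MB, 10% of the requested heap); this is illustrative arithmetic, not output from the job):
# per-JVM footprint = requested heap + memory overhead (assumed default: max(384 MB, 10% of heap))
def jvm_footprint_mb(heap_gb):
    heap_mb = heap_gb * 1024
    overhead_mb = max(384, int(heap_mb * 0.10))
    return heap_mb + overhead_mb
print(jvm_footprint_mb(10))  # executor: 11264 MB per container
print(jvm_footprint_mb(14))  # driver: 15769 MB, on a machine with roughly 15360 MB of RAM
# in client mode the driver is not a YARN container, but its JVM still needs native memory beyond the 14g heap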

EMR Spark 2.1.0 process gets stuck at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)

I have an EMR application that reads JSON files, "explodes" the hierarchical structure into a relational table, and then writes it out as Parquet (a minimal sketch of this pattern is included at the end of this question). Source and destination are HDFS. The application had been running fine for about a month; suddenly today, I assume due to a new source data set, it does not terminate. The log suggests that it gets stuck at some point, with this driver output:
17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at NativeMethodAccessorImpl.java:0)
17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has no missing parents
17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 170.5 KB, free 2.2 GB)
17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 65.2 KB, free 2.2 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xxx.xx:37907 (size: 65.2 KB, free: 2.2 GB)
17/04/18 01:41:14 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:996
17/04/18 01:41:14 INFO DAGScheduler: Submitting 9 missing tasks from ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0)
17/04/18 01:41:14 INFO YarnScheduler: Adding task set 4.0 with 9 tasks
17/04/18 01:41:14 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 72, xxx.xxx.xx.xx.xx, executor 12, partition 1, NODE_LOCAL, 8184 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 73, xxx.xxx.xx.xx.xx, executor 13, partition 0, NODE_LOCAL, 7967 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 74, xxx.xxx.xx.xx.xx, executor 14, partition 2, NODE_LOCAL, 8181 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 75, xxx.xxx.xx.xx.xx, executor 16, partition 6, NODE_LOCAL, 8400 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 76, xxx.xxx.xx.xx.xx, executor 10, partition 7, NODE_LOCAL, 8398 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 77, xxx.xxx.xx.xx.xx, executor 11, partition 3, NODE_LOCAL, 8182 bytes)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO ExecutorAllocationManager: Requesting 9 new executors because tasks are backlogged (new desired total will be 9)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 78, xxx.xxx.xx.xx.xx, executor 15, partition 4, RACK_LOCAL, 8400 bytes)
17/04/18 01:41:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 79, xxx.xxx.xx.xx.xx, executor 9, partition 5, RACK_LOCAL, 8400 bytes)
17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 28.0 KB, free: 4.0 GB)
When I connect to one of the nodes in the cluster and run jstack, I get the following stack trace (among others that do not look very interesting):
"Executor task launch worker-0" #39 daemon prio=5 os_prio=0 tid=0x00007f6210352800 nid=0x4542 runnable [0x00007f61f52b3000]
java.lang.Thread.State: RUNNABLE
at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any suggestions will be very much appreciated.
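For anyone who wants to reproduce the thread dump above, something along these lines should work on a worker node. This is only a sketch: it assumes a standard JDK (jps/jstack on the PATH) and the usual CoarseGrainedExecutorBackend executor process on YARN.
# find the executor JVM's PID on the node
jps -lm | grep CoarseGrainedExecutorBackend
# dump its threads; run as the user that owns the container process (e.g. yarn), <pid> is the PID found above
sudo -u yarn jstack <pid>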

Spark metrics on wordcount example

I read the Metrics section on the Spark website and want to try it on the wordcount example, but I can't make it work.
spark/conf/metrics.properties :
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
# Polling directory for CsvSink
*.sink.csv.directory=/home/spark/Documents/test/
# Worker instance overlap polling period
worker.sink.csv.period=1
worker.sink.csv.unit=seconds
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I run my app in local mode as shown in the documentation:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
I checked /home/spark/Documents/test/ and it is empty.
What did I miss?
Shell command and output:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] --conf spark.metrics.conf=/home/spark/development/spark/conf/metrics.properties target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
INFO SparkContext: Running Spark version 1.3.0
WARN Utils: Your hostname, cv-local resolves to a loopback address: 127.0.1.1; using 192.168.1.64 instead (on interface eth0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
INFO SecurityManager: Changing view acls to: spark
INFO SecurityManager: Changing modify acls to: spark
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
INFO Slf4jLogger: Slf4jLogger started
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#cv-local.local:35895]
INFO Utils: Successfully started service 'sparkDriver' on port 35895.
INFO SparkEnv: Registering MapOutputTracker
INFO SparkEnv: Registering BlockManagerMaster
INFO DiskBlockManager: Created local directory at /tmp/spark-447d56c9-cfe5-4f9d-9e0a-6bb476ddede6/blockmgr-4eaa04f4-b4b2-4b05-ba0e-fd1aeb92b289
INFO MemoryStore: MemoryStore started with capacity 265.4 MB
INFO HttpFileServer: HTTP File server directory is /tmp/spark-fae11cd2-937e-4be3-a273-be8b4c4847df/httpd-ca163445-6fff-45e4-9c69-35edcea83b68
INFO HttpServer: Starting HTTP Server
INFO Utils: Successfully started service 'HTTP file server' on port 52828.
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4040.
INFO SparkUI: Started SparkUI at http://cv-local.local:4040
INFO SparkContext: Added JAR file:/home/spark/workspace/IdeaProjects/wordcount/target/scala-2.10/simple-project_2.10-1.0.jar at http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Executor: Starting executor ID <driver> on host localhost
INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#cv-local.local:35895/user/HeartbeatReceiver
INFO NettyBlockTransferService: Server created on 60320
INFO BlockManagerMaster: Trying to register BlockManager
INFO BlockManagerMasterActor: Registering block manager localhost:60320 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 60320)
INFO BlockManagerMaster: Registered BlockManager
INFO MemoryStore: ensureFreeSpace(34046) called with curMem=0, maxMem=278302556
INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 33.2 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(5221) called with curMem=34046, maxMem=278302556
INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.1 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60320 (size: 5.1 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN LoadSnappy: Snappy native library not loaded
INFO FileInputFormat: Total input paths to process : 1
INFO SparkContext: Starting job: count at SimpleApp.scala:12
INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:12)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=39267, maxMem=278302556
INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.8 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=42115, maxMem=278302556
INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
INFO Executor: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Utils: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar to /tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/fetchFileTemp4229868141058449157.tmp
INFO Executor: Adding file:/tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/simple-project_2.10-1.0.jar to class loader
INFO CacheManager: Partition rdd_1_1 not found, computing it
INFO CacheManager: Partition rdd_1_0 not found, computing it
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:2659+2659
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:0+2659
INFO MemoryStore: ensureFreeSpace(7840) called with curMem=44171, maxMem=278302556
INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 7.7 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:60320 (size: 7.7 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_0
INFO MemoryStore: ensureFreeSpace(8648) called with curMem=52011, maxMem=278302556
INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 8.4 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:60320 (size: 8.4 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_1
INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2399 bytes result sent to driver
INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2399 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 139 ms on localhost (1/2)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 133 ms on localhost (2/2)
INFO DAGScheduler: Stage 0 (count at SimpleApp.scala:12) finished in 0.151 s
INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.225939 s
INFO SparkContext: Starting job: count at SimpleApp.scala:13
INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 1(count at SimpleApp.scala:13)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=60659, maxMem=278302556
INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.8 KB, free 265.3 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=63507, maxMem=278302556
INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.3 MB)
INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
INFO BlockManager: Found block rdd_1_0 locally
INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 9 ms on localhost (1/2)
INFO BlockManager: Found block rdd_1_1 locally
INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 10 ms on localhost (2/2)
INFO DAGScheduler: Stage 1 (count at SimpleApp.scala:13) finished in 0.011 s
INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.024084 s
Lines with a: 5, Lines with b: 12
I made it work by specifying the path to the metrics file in the spark-submit command:
--files=/yourPath/metrics.properties --conf spark.metrics.conf=./metrics.properties
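Putting that together with the command from the question, the working invocation should look roughly like this (paths are the same placeholders as above):
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] --files /yourPath/metrics.properties --conf spark.metrics.conf=./metrics.properties target/scala-2.10/simple-project_2.10-1.0.jar
As far as I understand, --files ships the properties file into the working directory of the driver and executors, which is why spark.metrics.conf can then refer to it with the relative path ./metrics.properties.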

Resources