I run Apache Spark on AWS EMR under YARN.
The cluster has 1 master and 10 executors.
After several hours of processing the cluster failed, so I looked at the logs.
I see that all running executors tried to kill their tasks at the same moment (this is the log of one executor):
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 66.0 in stage 2.0 (TID 466), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 65.0 in stage 2.0 (TID 465), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 67.0 in stage 2.0 (TID 467), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 64.0 in stage 2.0 (TID 464), reason: Stage cancelled
20/03/05 00:02:12 ERROR Utils: Aborting a task
The reason given is "Stage cancelled", but I can't find any details about it. I also checked the driver logs and found that their last record is from a much earlier time.
So I have 2 questions:
Why are the driver logs much shorter than the executor logs?
How can I find the real reason why the stage was cancelled?
These are the last lines of the driver log:
20/03/04 18:39:40 INFO TaskSetManager: Starting task 159.0 in stage 1.0 (TID 359, ip-172-31-6-236.us-west-2.compute.internal, executor 40, partition 159, RACK_LOCAL, 8421 bytes)
20/03/04 18:39:40 INFO ExecutorAllocationManager: New executor 40 has registered (new total is 40)
20/03/04 18:39:41 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-6-236.us-west-2.compute.internal:33589 with 2.8 GB RAM, BlockManagerId(40, ip-172-31-6-236.us-west-2.compute.internal, 33589, None)
20/03/04 18:39:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 44.7 KB, free: 2.8 GB)
20/03/04 18:39:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 37.4 KB, free: 2.8 GB)
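One way to approach both questions is to compare the last timestamps of the driver and executor logs and to pull out every WARN/ERROR record. The sketch below is only an illustration: it assumes the default `yy/MM/dd HH:mm:ss LEVEL Component: message` layout seen in the excerpts above, and the function names (`parse_line`, `summarize`) are made up for this example.

```python
import re
from datetime import datetime

# Matches the log layout seen in the excerpts above,
# e.g. "20/03/05 00:02:12 INFO Executor: ..."
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) ([^:]+): (.*)$")


def parse_line(line):
    """Return (timestamp, level, component, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
    return ts, m.group(2), m.group(3), m.group(4)


def summarize(lines):
    """Return the latest timestamp seen and all WARN/ERROR records."""
    last_ts = None
    problems = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if last_ts is None or rec[0] > last_ts:
            last_ts = rec[0]
        if rec[1] in ("WARN", "ERROR"):
            problems.append(rec)
    return last_ts, problems


# Sample lines copied from the excerpts above
executor_log = [
    "20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 66.0 in stage 2.0 (TID 466), reason: Stage cancelled",
    "20/03/05 00:02:12 ERROR Utils: Aborting a task",
]
driver_log = [
    "20/03/04 18:39:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 37.4 KB, free: 2.8 GB)",
]

exec_last, exec_problems = summarize(executor_log)
driver_last, _ = summarize(driver_log)
gap = exec_last - driver_last
print("driver log ends", gap, "before the executor log")
for ts, level, component, message in exec_problems:
    print(ts, level, component, message)
```

A gap of several hours like this one suggests that the driver log being read is incomplete (for example rotated or truncated), so the cancellation reason is probably recorded in a part of the driver output that is not shown here. On YARN the aggregated container logs can usually be fetched with `yarn logs -applicationId <application id>`.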
I have a Spark job written in Python that retrieves data from Redshift and then applies many transformations (join, filter, withColumn, agg, ...).
There are around 30K records in the dataframes.
All the transformations run, but when I try to write an Avro file the Spark job fails.
My spark-submit command:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 10 --executor-cores 3 --executor-memory 10G --driver-memory 14g --conf spark.sql.broadcastTimeout=3600 --conf spark.network.timeout=10000000 --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
I'm using --executor-memory 10G and --driver-memory 14g on 6 Amazon machines with 8 cores and 15G RAM each. Why am I getting an out-of-memory error?
Error returned:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/hadoop/hs_err_pid13688.log
This is the end of the Spark log:
17/05/29 10:13:09 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 19, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:09 INFO TransportClientFactory: Successfully created connection to ip-10-185-53-172.eu-west-1.compute.internal/10.185.53.172:39759 after 3 ms (0 ms spent in bootstraps)
17/05/29 10:13:09 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.9 KB, free: 5.3 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 10.185.52.91:43829 in memory (size: 30.4 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.4 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on 10.185.52.91:43829 in memory (size: 30.3 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.3 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on 10.185.52.91:43829 in memory (size: 30.6 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.6 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Added taskresult_2 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 499.6 MB, free: 4.8 GB)
17/05/29 10:13:11 INFO TaskSetManager: Starting task 0.0 in stage 23.0 (TID 20, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:12 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.8 KB, free: 4.8 GB)
17/05/29 10:13:13 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 270161 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:13 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/05/29 10:13:13 INFO DAGScheduler: ResultStage 3 (run at ThreadPoolExecutor.java:1142) finished in 270.162 s
17/05/29 10:13:13 INFO DAGScheduler: Job 3 finished: run at ThreadPoolExecutor.java:1142, took 270.230067 s
17/05/29 10:13:13 INFO BlockManagerInfo: Removed taskresult_3 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.5 MB, free: 5.3 GB)
17/05/29 10:13:16 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.185.52.91:43829 in memory (size: 5.5 KB, free: 8.2 GB)
17/05/29 10:13:17 INFO BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 5.5 KB, free: 5.3 GB)
17/05/29 10:13:20 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 276982 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:20 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/05/29 10:13:20 INFO DAGScheduler: ResultStage 2 (run at ThreadPoolExecutor.java:1142) finished in 276.984 s
17/05/29 10:13:20 INFO DAGScheduler: Job 2 finished: run at ThreadPoolExecutor.java:1142, took 277.000009 s
17/05/29 10:13:20 INFO BlockManagerInfo: Removed taskresult_2 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.6 MB, free: 5.8 GB)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000667766000, 196608, 0) failed; error='Cannot allocate memory' (errno=12)
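A rough memory budget may help with the question above. This sketch assumes Spark's default YARN overhead of max(384 MB, 10% of the requested heap) per container, and that in client deploy mode the driver JVM runs on the machine where spark-submit was launched. The function name `yarn_container_mb` is illustrative, and the real memory available to YARN on each node is lower than the physical 15G.

```python
def yarn_container_mb(heap_mb, overhead_percent=10, min_overhead_mb=384):
    """Approximate YARN container size for a JVM with the given heap (assumed default overhead rule)."""
    overhead_mb = max(min_overhead_mb, heap_mb * overhead_percent // 100)
    return heap_mb + overhead_mb


node_ram_mb = 15 * 1024                                  # 15G machines
executor_container_mb = yarn_container_mb(10 * 1024)     # --executor-memory 10G
driver_heap_mb = 14 * 1024                               # --driver-memory 14g

executors_per_node = node_ram_mb // executor_container_mb
driver_headroom_mb = node_ram_mb - driver_heap_mb

print("executor container:", executor_container_mb, "MB")
print("executors that fit on one 15G node:", executors_per_node)
print("RAM left next to a 14g driver heap:", driver_headroom_mb, "MB")
```

With these numbers a 14g driver heap leaves only about 1 GB of a 15G machine for the operating system, the Python process and the JVM's native allocations. That would be consistent with the mmap failure reported in hs_err_pid13688.log, although this is a guess from the arithmetic rather than a confirmed diagnosis.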
I have a PySpark project with the following structure:
main.py: does the real work (imports PySpark UDFs from utils.py and parameters from common.py)
utils.py: some utility functions (imports from common.py)
common.py: some params
Inside a PySpark shell, I can run the code from common.py, utils.py and main.py in this order and get the result I want. However, if I submit it via spark-submit, no error is reported, but the job keeps executing at a cluster load of < 1%, which I suspect means nothing is actually being computed.
Here is the spark-submit command:
spark-submit --master yarn --deploy-mode cluster --driver-cores 4 --driver-memory 20G --num-executors 10 --executor-cores 4 --executor-memory 20G --py-files project.zip project/main.py
Here is what's inside main.py (the other 2 .py files were a bit lengthy):
from pyspark.sql import SparkSession
from utils import foo1, foo2
from common import bar1, bar2

if __name__ == '__main__':
    input_path = bar1
    output_path = bar2
    # build spark session
    spark = SparkSession.builder\
        .appName("app")\
        .getOrCreate()
    rd = spark.read.json(input_path).repartition(100)
    # add a new column Col2
    rd_app = rd.withColumn('Col2', foo1(rd.Col1))
    rd_app_not_null = rd_app.filter('Col2 is not null')
    # cleanup queries
    rd_no_query = rd_app_not_null\
        .withColumn('URLClean', foo2(rd_app_not_null.URL))\
        .drop('URL')\
        .withColumnRenamed('URLClean', 'URL')
    # save to S3
    rd_no_query.write.json(output_path, compression='gzip')
Running on EMR; the PySpark shell runs in YARN client mode, spark-submit in YARN cluster mode.
Any help is appreciated!
Edit: the last lines of the log (it stayed like that for hours):
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 285.578911 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 41.334709 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 9.883221 ms
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 313.5 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 28.2 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.42.228:35995 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 2 from json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
17/03/08 22:06:31 INFO SparkContext: Starting job: json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO DAGScheduler: Registering RDD 8 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Got job 1 (json at NativeMethodAccessorImpl.java:0) with 100 output partitions
17/03/08 22:06:31 INFO DAGScheduler: Final stage: ResultStage 2 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0), which has no missing parents
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 25.3 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.8 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.31.42.228:35995 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/03/08 22:06:31 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO YarnClusterScheduler: Adding task set 1.0 with 24 tasks
17/03/08 22:06:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, ip-172-31-46-251.ec2.internal, executor 2, partition 0, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, ip-172-31-43-215.ec2.internal, executor 4, partition 1, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, ip-172-31-41-81.ec2.internal, executor 3, partition 2, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 27, ip-172-31-34-182.ec2.internal, executor 5, partition 3, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 28, ip-172-31-46-251.ec2.internal, executor 2, partition 4, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 29, ip-172-31-43-215.ec2.internal, executor 4, partition 5, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 30, ip-172-31-41-81.ec2.internal, executor 3, partition 6, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 31, ip-172-31-34-182.ec2.internal, executor 5, partition 7, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 32, ip-172-31-46-251.ec2.internal, executor 2, partition 8, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 33, ip-172-31-43-215.ec2.internal, executor 4, partition 9, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 10.0 in stage 1.0 (TID 34, ip-172-31-41-81.ec2.internal, executor 3, partition 10, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 35, ip-172-31-34-182.ec2.internal, executor 5, partition 11, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 12.0 in stage 1.0 (TID 36, ip-172-31-46-251.ec2.internal, executor 2, partition 12, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 13.0 in stage 1.0 (TID 37, ip-172-31-43-215.ec2.internal, executor 4, partition 13, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 14.0 in stage 1.0 (TID 38, ip-172-31-41-81.ec2.internal, executor 3, partition 14, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 15.0 in stage 1.0 (TID 39, ip-172-31-34-182.ec2.internal, executor 5, partition 15, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO YarnAllocator: Driver requested a total number of 5 executor(s).
17/03/08 22:06:32 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:32 INFO ExecutorAllocationManager: Requesting 5 new executors because tasks are backlogged (new desired total will be 5)
17/03/08 22:06:32 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:32 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO AMRMClientImpl: Received new token for : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:32 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000009 on host ip-172-31-45-4.ec2.internal
17/03/08 22:06:32 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:33 INFO YarnAllocator: Driver requested a total number of 6 executor(s).
17/03/08 22:06:33 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:33 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 6)
17/03/08 22:06:33 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:33 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000012 on host ip-172-31-40-142.ec2.internal
17/03/08 22:06:33 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-40-142.ec2.internal:8041
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.40.142:36208) with ID 8
17/03/08 22:06:36 INFO TaskSetManager: Starting task 16.0 in stage 1.0 (TID 40, ip-172-31-40-142.ec2.internal, executor 8, partition 16, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:36 INFO ExecutorAllocationManager: New executor 8 has registered (new total is 5)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 17.0 in stage 1.0 (TID 41, ip-172-31-40-142.ec2.internal, executor 8, partition 17, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 18.0 in stage 1.0 (TID 42, ip-172-31-40-142.ec2.internal, executor 8, partition 18, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 19.0 in stage 1.0 (TID 43, ip-172-31-40-142.ec2.internal, executor 8, partition 19, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-40-142.ec2.internal:42371 with 11.8 GB RAM, BlockManagerId(8, ip-172-31-40-142.ec2.internal, 42371, None)
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO AMRMClientImpl: Received new token for : ip-172-31-33-112.ec2.internal:8041
17/03/08 22:06:36 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 0 of them.
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:37 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.45.4:33330) with ID 7
17/03/08 22:06:37 INFO TaskSetManager: Starting task 20.0 in stage 1.0 (TID 44, ip-172-31-45-4.ec2.internal, executor 7, partition 20, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO ExecutorAllocationManager: New executor 7 has registered (new total is 6)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 21.0 in stage 1.0 (TID 45, ip-172-31-45-4.ec2.internal, executor 7, partition 21, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 22.0 in stage 1.0 (TID 46, ip-172-31-45-4.ec2.internal, executor 7, partition 22, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 23.0 in stage 1.0 (TID 47, ip-172-31-45-4.ec2.internal, executor 7, partition 23, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-45-4.ec2.internal:45646 with 11.8 GB RAM, BlockManagerId(7, ip-172-31-45-4.ec2.internal, 45646, None)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 28.2 KB, free: 11.8 GB)
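To check whether anything is actually progressing, the driver log can be reduced to started and finished task counts per stage. This is a minimal sketch that assumes the TaskSetManager message format shown in the log above; `task_progress` is a made-up helper name.

```python
import re
from collections import Counter

STARTED_RE = re.compile(r"TaskSetManager: Starting task \d+\.\d+ in stage (\d+\.\d+)")
FINISHED_RE = re.compile(r"TaskSetManager: Finished task \d+\.\d+ in stage (\d+\.\d+)")


def task_progress(lines):
    """Return {stage: (tasks started, tasks finished)} for the given driver log lines."""
    started, finished = Counter(), Counter()
    for line in lines:
        m = STARTED_RE.search(line)
        if m:
            started[m.group(1)] += 1
            continue
        m = FINISHED_RE.search(line)
        if m:
            finished[m.group(1)] += 1
    stages = sorted(set(started) | set(finished))
    return {stage: (started[stage], finished[stage]) for stage in stages}


# Sample lines copied from the log above
sample = [
    "17/03/08 22:06:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, ip-172-31-46-251.ec2.internal, executor 2, partition 0, RACK_LOCAL, 6588 bytes)",
    "17/03/08 22:06:31 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, ip-172-31-43-215.ec2.internal, executor 4, partition 1, RACK_LOCAL, 6588 bytes)",
    "17/03/08 22:06:32 INFO YarnAllocator: Driver requested a total number of 5 executor(s).",
]

progress = task_progress(sample)
print(progress)
```

In the excerpt above all 24 tasks of stage 1.0 are started and none is reported as finished, so the tasks appear to be running (or stuck) on the executors rather than never being scheduled. The executor stderr logs would be the next place to look.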
Spark version: 1.5.2 with YARN 2.7.1.2.3.0.0-2557
I'm running into a problem while exploring data in spark-shell: I'm trying to create a very wide dataframe with 3000 columns. The code is below:
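import org.apache.spark.sql.functions.udf

// Look up dataItemId in the map; return NaN when the key is missing
val valueFunctionUDF = udf((valMap: Map[String, String], dataItemId: String) =>
  valMap.get(dataItemId) match {
    case Some(v) => v.toDouble
    case None => Double.NaN
  })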
s1 is the main dataframe; its schema is:
|-- combKey: string (nullable = true)
|-- valMaps: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
After I run this code:
dataItemIdVals.foreach { w =>
  s1 = s1.withColumn(w, valueFunctionUDF($"valMaps", $"combKey"))
}
my terminal just hangs after the last line, with the following info printed:
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 172.22.49.20:41494 in memory (size: 7.6 KB, free: 5.2 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43026 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:44890 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52020 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:33272 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:48481 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:44026 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:34539 in memory (size: 7.6 KB, free: 5.0 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43734 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:42769 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:60603 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:59102 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:47578 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43149 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52488 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52298 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 9
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 172.22.49.20:41494 in memory (size: 7.3 KB, free: 5.2 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:33272 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:59102 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:44026 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:42769 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43149 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43026 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52298 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:42890 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:47578 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:60603 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43734 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:48481 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52020 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52488 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:34539 in memory (size: 7.3 KB, free: 5.0 GB)
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 8
16/07/11 12:20:54 INFO ContextCleaner: Cleaned shuffle 0
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 7
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 6
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 5
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 4
Nothing is happening in the Spark UI, so I guess Spark is computing some metadata for the new dataframe (number of columns etc.)? Has anyone seen this kind of issue before? Is there any way to get around it?
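A commonly suggested workaround for this kind of hang is to add all columns in a single projection instead of calling withColumn thousands of times, because every withColumn call adds another projection to the plan that has to be analyzed again. The sketch below is in PySpark rather than Scala, is untested against 1.5.2, and makes several assumptions: each new column w should hold the value of valMaps[w], the item ids contain no quote or backtick characters, and a null for a missing key (instead of Double.NaN as in the UDF) is acceptable. `build_select_exprs` and `widen` are illustrative names.

```python
def build_select_exprs(item_ids, map_col="valMaps"):
    """Build one SQL expression per item id: the map value cast to double."""
    return [
        "CAST({m}['{k}'] AS DOUBLE) AS `{k}`".format(m=map_col, k=item_id)
        for item_id in item_ids
    ]


def widen(df, item_ids):
    """Add all columns in one projection; df is assumed to be a Spark DataFrame."""
    # "*" keeps the existing columns, as repeated withColumn calls would
    return df.selectExpr("*", *build_select_exprs(item_ids))


exprs = build_select_exprs(["a1", "b2"])
for e in exprs:
    print(e)
```

With a DataFrame at hand the call would be `wide = widen(s1, data_item_ids)`, where `data_item_ids` is a plain Python list of the 3000 ids. Whether a 3000-column plan is then fast enough on this Spark version would still need to be measured.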
I read the Metrics section on the Spark website. I want to try it on the wordcount example, but I can't make it work.
spark/conf/metrics.properties:
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
# Polling directory for CsvSink
*.sink.csv.directory=/home/spark/Documents/test/
# Worker instance overlap polling period
worker.sink.csv.period=1
worker.sink.csv.unit=seconds
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I run my app locally, as in the documentation:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
I checked /home/spark/Documents/test/ and it is empty.
What did I miss?
Shell:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] --conf spark.metrics.conf=/home/spark/development/spark/conf/metrics.properties target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
INFO SparkContext: Running Spark version 1.3.0
WARN Utils: Your hostname, cv-local resolves to a loopback address: 127.0.1.1; using 192.168.1.64 instead (on interface eth0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
INFO SecurityManager: Changing view acls to: spark
INFO SecurityManager: Changing modify acls to: spark
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
INFO Slf4jLogger: Slf4jLogger started
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#cv-local.local:35895]
INFO Utils: Successfully started service 'sparkDriver' on port 35895.
INFO SparkEnv: Registering MapOutputTracker
INFO SparkEnv: Registering BlockManagerMaster
INFO DiskBlockManager: Created local directory at /tmp/spark-447d56c9-cfe5-4f9d-9e0a-6bb476ddede6/blockmgr-4eaa04f4-b4b2-4b05-ba0e-fd1aeb92b289
INFO MemoryStore: MemoryStore started with capacity 265.4 MB
INFO HttpFileServer: HTTP File server directory is /tmp/spark-fae11cd2-937e-4be3-a273-be8b4c4847df/httpd-ca163445-6fff-45e4-9c69-35edcea83b68
INFO HttpServer: Starting HTTP Server
INFO Utils: Successfully started service 'HTTP file server' on port 52828.
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4040.
INFO SparkUI: Started SparkUI at http://cv-local.local:4040
INFO SparkContext: Added JAR file:/home/spark/workspace/IdeaProjects/wordcount/target/scala-2.10/simple-project_2.10-1.0.jar at http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Executor: Starting executor ID <driver> on host localhost
INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#cv-local.local:35895/user/HeartbeatReceiver
INFO NettyBlockTransferService: Server created on 60320
INFO BlockManagerMaster: Trying to register BlockManager
INFO BlockManagerMasterActor: Registering block manager localhost:60320 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 60320)
INFO BlockManagerMaster: Registered BlockManager
INFO MemoryStore: ensureFreeSpace(34046) called with curMem=0, maxMem=278302556
INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 33.2 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(5221) called with curMem=34046, maxMem=278302556
INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.1 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60320 (size: 5.1 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN LoadSnappy: Snappy native library not loaded
INFO FileInputFormat: Total input paths to process : 1
INFO SparkContext: Starting job: count at SimpleApp.scala:12
INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:12)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=39267, maxMem=278302556
INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.8 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=42115, maxMem=278302556
INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
INFO Executor: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Utils: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar to /tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/fetchFileTemp4229868141058449157.tmp
INFO Executor: Adding file:/tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/simple-project_2.10-1.0.jar to class loader
INFO CacheManager: Partition rdd_1_1 not found, computing it
INFO CacheManager: Partition rdd_1_0 not found, computing it
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:2659+2659
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:0+2659
INFO MemoryStore: ensureFreeSpace(7840) called with curMem=44171, maxMem=278302556
INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 7.7 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:60320 (size: 7.7 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_0
INFO MemoryStore: ensureFreeSpace(8648) called with curMem=52011, maxMem=278302556
INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 8.4 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:60320 (size: 8.4 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_1
INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2399 bytes result sent to driver
INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2399 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 139 ms on localhost (1/2)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 133 ms on localhost (2/2)
INFO DAGScheduler: Stage 0 (count at SimpleApp.scala:12) finished in 0.151 s
INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.225939 s
INFO SparkContext: Starting job: count at SimpleApp.scala:13
INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 1(count at SimpleApp.scala:13)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=60659, maxMem=278302556
INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.8 KB, free 265.3 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=63507, maxMem=278302556
INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.3 MB)
INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
INFO BlockManager: Found block rdd_1_0 locally
INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 9 ms on localhost (1/2)
INFO BlockManager: Found block rdd_1_1 locally
INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 10 ms on localhost (2/2)
INFO DAGScheduler: Stage 1 (count at SimpleApp.scala:13) finished in 0.011 s
INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.024084 s
Lines with a: 5, Lines with b: 12
I made it work by specifying the path to the metrics file in the spark-submit command:
--files=/yourPath/metrics.properties --conf spark.metrics.conf=./metrics.properties
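Before submitting, the sink settings in the file can be sanity-checked. This is a rough sketch, assuming a plain key=value properties file like the one in the question; it only lists the configured CSV sink directories and whether each exists on the machine where the script runs. The function names are hypothetical.

```python
import os


def load_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def sink_directories(props):
    """Return {instance: directory} for every '<instance>.sink.csv.directory' key."""
    suffix = ".sink.csv.directory"
    return {k[: -len(suffix)]: v for k, v in props.items() if k.endswith(suffix)}


# Sample content copied from the metrics.properties in the question
sample = """
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
*.sink.csv.directory=/home/spark/Documents/test/
"""

props = load_properties(sample)
dirs = sink_directories(props)
for instance, directory in dirs.items():
    state = "exists" if os.path.isdir(directory) else "missing"
    print(instance, directory, state)
```

For a real file, `load_properties(open(path).read())` would replace the sample string.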