What is the maximum number of columns supported by an Apache Spark DataFrame? - apache-spark

Spark version: 1.5.2 with YARN 2.7.1.2.3.0.0-2557
I'm running into a problem while exploring data through spark-shell: I'm trying to create a really wide DataFrame with 3000 columns. Code as below:
val valueFunctionUDF = udf((valMap: Map[String, String], dataItemId: String) =>
  valMap.get(dataItemId) match {
    case Some(v) => v.toDouble
    case None    => Double.NaN
  })
s1 is the main DataFrame, and its schema is as below:
|-- combKey: string (nullable = true)
|-- valMaps: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
After I run the code:
dataItemIdVals.foreach { w =>
  s1 = s1.withColumn(w, valueFunctionUDF($"valMaps", $"combKey"))
}
my terminal just hangs after the above code, with only this info being printed out:
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 172.22.49.20:41494 in memory (size: 7.6 KB, free: 5.2 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43026 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:44890 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52020 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:33272 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:48481 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:44026 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:34539 in memory (size: 7.6 KB, free: 5.0 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43734 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:42769 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:60603 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:59102 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:47578 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:43149 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52488 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_3_piece0 on xxxxx:52298 in memory (size: 7.6 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 9
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 172.22.49.20:41494 in memory (size: 7.3 KB, free: 5.2 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:33272 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:59102 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:44026 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:42769 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43149 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43026 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52298 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:42890 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:47578 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:60603 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:43734 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:48481 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52020 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:52488 in memory (size: 7.3 KB, free: 5.1 GB)
16/07/11 12:20:54 INFO BlockManagerInfo: Removed broadcast_2_piece0 on xxxxx:34539 in memory (size: 7.3 KB, free: 5.0 GB)
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 8
16/07/11 12:20:54 INFO ContextCleaner: Cleaned shuffle 0
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 7
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 6
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 5
16/07/11 12:20:54 INFO ContextCleaner: Cleaned accumulator 4
Nothing is going on in the Spark UI, and I guess Spark is computing some metadata for the new DataFrame (number of columns, etc.)? Has anyone seen this kind of issue before? Is there any way to work around it?
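Edit: one workaround I'm considering (a sketch, assuming dataItemIdVals is a Seq[String] or Array[String]): each withColumn call wraps the previous DataFrame in a new projection, so 3000 iterations make the analyzed plan very deep and the driver can spend a long time in analysis before any job starts. Building all the columns in a single select keeps the plan shallow:
// Sketch: add all 3000 columns in one select instead of chaining withColumn.
import org.apache.spark.sql.functions.col
val newCols = dataItemIdVals.map(w => valueFunctionUDF($"valMaps", $"combKey").as(w))
val s1Wide  = s1.select(col("*") +: newCols: _*)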

Related

Spark Memory Error Java Runtime Environment

I have a Spark job written in Python, where I retrieve data from Redshift and then apply many transformations: join, filter, withColumn, agg, ...
There are around 30K records in the DataFrames.
I perform all the transformations, and when I try to write an Avro file the Spark job fails.
My spark submit:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 10 --executor-cores 3 --executor-memory 10G --driver-memory 14g --conf spark.sql.broadcastTimeout=3600 --conf spark.network.timeout=10000000 --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
I'm using --executor-memory 10G --driver-memory 14g on 6 machines in Amazon, each with 8 cores and 15 GB RAM. Why am I getting an out-of-memory error?
Error returned:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/hadoop/hs_err_pid13688.log
This is the end of the Spark log:
17/05/29 10:13:09 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 19, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:09 INFO TransportClientFactory: Successfully created connection to ip-10-185-53-172.eu-west-1.compute.internal/10.185.53.172:39759 after 3 ms (0 ms spent in bootstraps)
17/05/29 10:13:09 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.9 KB, free: 5.3 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 10.185.52.91:43829 in memory (size: 30.4 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_8_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.4 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on 10.185.52.91:43829 in memory (size: 30.3 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_7_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.3 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on 10.185.52.91:43829 in memory (size: 30.6 KB, free: 8.2 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Removed broadcast_6_piece0 on ip-10-185-52-186.eu-west-1.compute.internal:35010 in memory (size: 30.6 KB, free: 5.8 GB)
17/05/29 10:13:11 INFO BlockManagerInfo: Added taskresult_2 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 499.6 MB, free: 4.8 GB)
17/05/29 10:13:11 INFO TaskSetManager: Starting task 0.0 in stage 23.0 (TID 20, ip-10-185-53-172.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5779 bytes)
17/05/29 10:13:12 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on ip-10-185-53-172.eu-west-1.compute.internal:39759 (size: 8.8 KB, free: 4.8 GB)
17/05/29 10:13:13 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 270161 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:13 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/05/29 10:13:13 INFO DAGScheduler: ResultStage 3 (run at ThreadPoolExecutor.java:1142) finished in 270.162 s
17/05/29 10:13:13 INFO DAGScheduler: Job 3 finished: run at ThreadPoolExecutor.java:1142, took 270.230067 s
17/05/29 10:13:13 INFO BlockManagerInfo: Removed taskresult_3 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.5 MB, free: 5.3 GB)
17/05/29 10:13:16 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.185.52.91:43829 in memory (size: 5.5 KB, free: 8.2 GB)
17/05/29 10:13:17 INFO BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 5.5 KB, free: 5.3 GB)
17/05/29 10:13:20 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 276982 ms on ip-10-185-53-172.eu-west-1.compute.internal (executor 2) (1/1)
17/05/29 10:13:20 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/05/29 10:13:20 INFO DAGScheduler: ResultStage 2 (run at ThreadPoolExecutor.java:1142) finished in 276.984 s
17/05/29 10:13:20 INFO DAGScheduler: Job 2 finished: run at ThreadPoolExecutor.java:1142, took 277.000009 s
17/05/29 10:13:20 INFO BlockManagerInfo: Removed taskresult_2 on ip-10-185-53-172.eu-west-1.compute.internal:39759 in memory (size: 499.6 MB, free: 5.8 GB)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000667766000, 196608, 0) failed; error='Cannot allocate memory' (errno=12)
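A rough sizing check (my own arithmetic, assuming Spark's default YARN memory overhead of max(384 MB, 10% of the requested memory)):
executor container: 10 GB heap + ~1 GB overhead ≈ 11 GB per executor
driver: 14 GB heap on a 15 GB machine leaves well under 1 GB for the OS and native allocations
That would be consistent with the mmap failure above, where the JVM could not commit even 196608 bytes of native memory.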

EMR Spark 2.1.0 process gets stuck at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)

I have an EMR application that reads JSON files, "explodes" the hierarchical structure into a relational table, and then writes it out as Parquet. Source and destination are HDFS. The application has been running fine for about a month; suddenly today (I assume due to a new source data set) it does not terminate. The log suggests that it gets stuck at some point, with this driver output:
17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at NativeMethodAccessorImpl.java:0)
17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has no missing parents
17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 170.5 KB, free 2.2 GB)
17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 65.2 KB, free 2.2 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xxx.xx:37907 (size: 65.2 KB, free: 2.2 GB)
17/04/18 01:41:14 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:996
17/04/18 01:41:14 INFO DAGScheduler: Submitting 9 missing tasks from ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0)
17/04/18 01:41:14 INFO YarnScheduler: Adding task set 4.0 with 9 tasks
17/04/18 01:41:14 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 72, xxx.xxx.xx.xx.xx, executor 12, partition 1, NODE_LOCAL, 8184 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 73, xxx.xxx.xx.xx.xx, executor 13, partition 0, NODE_LOCAL, 7967 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 74, xxx.xxx.xx.xx.xx, executor 14, partition 2, NODE_LOCAL, 8181 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 75, xxx.xxx.xx.xx.xx, executor 16, partition 6, NODE_LOCAL, 8400 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 76, xxx.xxx.xx.xx.xx, executor 10, partition 7, NODE_LOCAL, 8398 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 77, xxx.xxx.xx.xx.xx, executor 11, partition 3, NODE_LOCAL, 8182 bytes)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO ExecutorAllocationManager: Requesting 9 new executors because tasks are backlogged (new desired total will be 9)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 78, xxx.xxx.xx.xx.xx, executor 15, partition 4, RACK_LOCAL, 8400 bytes)
17/04/18 01:41:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 79, xxx.xxx.xx.xx.xx, executor 9, partition 5, RACK_LOCAL, 8400 bytes)
17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 28.0 KB, free: 4.0 GB)
When I connect to one of the nodes in the cluster and run jstack, I get the following stack trace (among others that do not look very interesting):
"Executor task launch worker-0" #39 daemon prio=5 os_prio=0 tid=0x00007f6210352800 nid=0x4542 runnable [0x00007f61f52b3000]
java.lang.Thread.State: RUNNABLE
at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any suggestions will be very much appreciated.
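For context, a minimal sketch of the kind of job described above (my reconstruction; the input/output paths and the nested column name items are hypothetical):
// Hypothetical reconstruction: read JSON from HDFS, explode a nested array
// into one row per element, write the flattened result as Parquet.
import spark.implicits._
import org.apache.spark.sql.functions.explode
val events = spark.read.json("hdfs:///data/input/")
val flat   = events.withColumn("item", explode($"items")).drop("items")
flat.write.parquet("hdfs:///data/output/")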

Identical code works in pyspark shell but not via spark-submit

So I have a Pyspark project in the following structure:
main.py: doing the real stuff (imports pyspark udf's from utils.py and stuff from common.py)
utils.py: some utility functions (imports from common.py)
common.py: some params
Inside a PySpark shell, I can run the code from common.py, utils.py, and main.py in that order and get the result I want; however, if I submit it via spark-submit, no error is reported but the job keeps running at a cluster load of < 1%, which I suspect means nothing is really being computed.
Here is the spark-submit code:
spark-submit --master yarn --deploy-mode cluster --driver-cores 4 --driver-memory 20G --num-executors 10 --executor-cores 4 --executor-memory 20G --py-files project.zip project/main.py
Here is what's inside main.py (the other two .py files are a bit lengthy):
from pyspark.sql import SparkSession
from utils import foo1, foo2
from common import bar1, bar2

if __name__ == '__main__':
    input_path = bar1
    output_path = bar2

    # build spark session
    spark = SparkSession.builder\
        .appName("app")\
        .getOrCreate()

    rd = spark.read.json(input_path).repartition(100)

    # add a new column Col2
    rd_app = rd.withColumn('Col2', foo1(rd.Col1))
    rd_app_not_null = rd_app.filter('Col2 is not null')

    # cleanup queries
    rd_no_query = rd_app_not_null\
        .withColumn('URLClean', foo2(rd_app_not_null.URL))\
        .drop('URL')\
        .withColumnRenamed('URLClean', 'URL')

    # save to S3
    rd_no_query.write.json(output_path, compression='gzip')
Running on EMR; the PySpark shell is in YARN client mode, spark-submit in YARN cluster mode.
Any help is appreciated!
Edit: the last couple of lines of the logs (it stayed like this for hours):
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 285.578911 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 41.334709 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 9.883221 ms
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 313.5 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 28.2 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.42.228:35995 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 2 from json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
17/03/08 22:06:31 INFO SparkContext: Starting job: json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO DAGScheduler: Registering RDD 8 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Got job 1 (json at NativeMethodAccessorImpl.java:0) with 100 output partitions
17/03/08 22:06:31 INFO DAGScheduler: Final stage: ResultStage 2 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0), which has no missing parents
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 25.3 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.8 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.31.42.228:35995 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/03/08 22:06:31 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO YarnClusterScheduler: Adding task set 1.0 with 24 tasks
17/03/08 22:06:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, ip-172-31-46-251.ec2.internal, executor 2, partition 0, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, ip-172-31-43-215.ec2.internal, executor 4, partition 1, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, ip-172-31-41-81.ec2.internal, executor 3, partition 2, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 27, ip-172-31-34-182.ec2.internal, executor 5, partition 3, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 28, ip-172-31-46-251.ec2.internal, executor 2, partition 4, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 29, ip-172-31-43-215.ec2.internal, executor 4, partition 5, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 30, ip-172-31-41-81.ec2.internal, executor 3, partition 6, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 31, ip-172-31-34-182.ec2.internal, executor 5, partition 7, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 32, ip-172-31-46-251.ec2.internal, executor 2, partition 8, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 33, ip-172-31-43-215.ec2.internal, executor 4, partition 9, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 10.0 in stage 1.0 (TID 34, ip-172-31-41-81.ec2.internal, executor 3, partition 10, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 35, ip-172-31-34-182.ec2.internal, executor 5, partition 11, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 12.0 in stage 1.0 (TID 36, ip-172-31-46-251.ec2.internal, executor 2, partition 12, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 13.0 in stage 1.0 (TID 37, ip-172-31-43-215.ec2.internal, executor 4, partition 13, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 14.0 in stage 1.0 (TID 38, ip-172-31-41-81.ec2.internal, executor 3, partition 14, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 15.0 in stage 1.0 (TID 39, ip-172-31-34-182.ec2.internal, executor 5, partition 15, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO YarnAllocator: Driver requested a total number of 5 executor(s).
17/03/08 22:06:32 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:32 INFO ExecutorAllocationManager: Requesting 5 new executors because tasks are backlogged (new desired total will be 5)
17/03/08 22:06:32 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:32 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO AMRMClientImpl: Received new token for : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:32 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000009 on host ip-172-31-45-4.ec2.internal
17/03/08 22:06:32 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:33 INFO YarnAllocator: Driver requested a total number of 6 executor(s).
17/03/08 22:06:33 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:33 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 6)
17/03/08 22:06:33 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:33 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000012 on host ip-172-31-40-142.ec2.internal
17/03/08 22:06:33 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-40-142.ec2.internal:8041
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.40.142:36208) with ID 8
17/03/08 22:06:36 INFO TaskSetManager: Starting task 16.0 in stage 1.0 (TID 40, ip-172-31-40-142.ec2.internal, executor 8, partition 16, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:36 INFO ExecutorAllocationManager: New executor 8 has registered (new total is 5)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 17.0 in stage 1.0 (TID 41, ip-172-31-40-142.ec2.internal, executor 8, partition 17, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 18.0 in stage 1.0 (TID 42, ip-172-31-40-142.ec2.internal, executor 8, partition 18, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 19.0 in stage 1.0 (TID 43, ip-172-31-40-142.ec2.internal, executor 8, partition 19, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-40-142.ec2.internal:42371 with 11.8 GB RAM, BlockManagerId(8, ip-172-31-40-142.ec2.internal, 42371, None)
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO AMRMClientImpl: Received new token for : ip-172-31-33-112.ec2.internal:8041
17/03/08 22:06:36 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 0 of them.
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:37 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.45.4:33330) with ID 7
17/03/08 22:06:37 INFO TaskSetManager: Starting task 20.0 in stage 1.0 (TID 44, ip-172-31-45-4.ec2.internal, executor 7, partition 20, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO ExecutorAllocationManager: New executor 7 has registered (new total is 6)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 21.0 in stage 1.0 (TID 45, ip-172-31-45-4.ec2.internal, executor 7, partition 21, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 22.0 in stage 1.0 (TID 46, ip-172-31-45-4.ec2.internal, executor 7, partition 22, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 23.0 in stage 1.0 (TID 47, ip-172-31-45-4.ec2.internal, executor 7, partition 23, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-45-4.ec2.internal:45646 with 11.8 GB RAM, BlockManagerId(7, ip-172-31-45-4.ec2.internal, 45646, None)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 28.2 KB, free: 11.8 GB)

Spark metrics on wordcount example

I read the Metrics section on the Spark website. I want to try it on the wordcount example, but I can't make it work.
spark/conf/metrics.properties:
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
# Polling directory for CsvSink
*.sink.csv.directory=/home/spark/Documents/test/
# Worker instance overlap polling period
worker.sink.csv.period=1
worker.sink.csv.unit=seconds
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I run my app locally as in the documentation:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
I checked /home/spark/Documents/test/ and it is empty.
What did I miss?
Shell:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] --conf spark.metrics.conf=/home/spark/development/spark/conf/metrics.properties target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
INFO SparkContext: Running Spark version 1.3.0
WARN Utils: Your hostname, cv-local resolves to a loopback address: 127.0.1.1; using 192.168.1.64 instead (on interface eth0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
INFO SecurityManager: Changing view acls to: spark
INFO SecurityManager: Changing modify acls to: spark
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
INFO Slf4jLogger: Slf4jLogger started
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#cv-local.local:35895]
INFO Utils: Successfully started service 'sparkDriver' on port 35895.
INFO SparkEnv: Registering MapOutputTracker
INFO SparkEnv: Registering BlockManagerMaster
INFO DiskBlockManager: Created local directory at /tmp/spark-447d56c9-cfe5-4f9d-9e0a-6bb476ddede6/blockmgr-4eaa04f4-b4b2-4b05-ba0e-fd1aeb92b289
INFO MemoryStore: MemoryStore started with capacity 265.4 MB
INFO HttpFileServer: HTTP File server directory is /tmp/spark-fae11cd2-937e-4be3-a273-be8b4c4847df/httpd-ca163445-6fff-45e4-9c69-35edcea83b68
INFO HttpServer: Starting HTTP Server
INFO Utils: Successfully started service 'HTTP file server' on port 52828.
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4040.
INFO SparkUI: Started SparkUI at http://cv-local.local:4040
INFO SparkContext: Added JAR file:/home/spark/workspace/IdeaProjects/wordcount/target/scala-2.10/simple-project_2.10-1.0.jar at http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Executor: Starting executor ID <driver> on host localhost
INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#cv-local.local:35895/user/HeartbeatReceiver
INFO NettyBlockTransferService: Server created on 60320
INFO BlockManagerMaster: Trying to register BlockManager
INFO BlockManagerMasterActor: Registering block manager localhost:60320 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 60320)
INFO BlockManagerMaster: Registered BlockManager
INFO MemoryStore: ensureFreeSpace(34046) called with curMem=0, maxMem=278302556
INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 33.2 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(5221) called with curMem=34046, maxMem=278302556
INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.1 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60320 (size: 5.1 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN LoadSnappy: Snappy native library not loaded
INFO FileInputFormat: Total input paths to process : 1
INFO SparkContext: Starting job: count at SimpleApp.scala:12
INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:12)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=39267, maxMem=278302556
INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.8 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=42115, maxMem=278302556
INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
INFO Executor: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Utils: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar to /tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/fetchFileTemp4229868141058449157.tmp
INFO Executor: Adding file:/tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/simple-project_2.10-1.0.jar to class loader
INFO CacheManager: Partition rdd_1_1 not found, computing it
INFO CacheManager: Partition rdd_1_0 not found, computing it
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:2659+2659
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:0+2659
INFO MemoryStore: ensureFreeSpace(7840) called with curMem=44171, maxMem=278302556
INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 7.7 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:60320 (size: 7.7 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_0
INFO MemoryStore: ensureFreeSpace(8648) called with curMem=52011, maxMem=278302556
INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 8.4 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:60320 (size: 8.4 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_1
INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2399 bytes result sent to driver
INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2399 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 139 ms on localhost (1/2)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 133 ms on localhost (2/2)
INFO DAGScheduler: Stage 0 (count at SimpleApp.scala:12) finished in 0.151 s
INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.225939 s
INFO SparkContext: Starting job: count at SimpleApp.scala:13
INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 1(count at SimpleApp.scala:13)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=60659, maxMem=278302556
INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.8 KB, free 265.3 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=63507, maxMem=278302556
INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.3 MB)
INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
INFO BlockManager: Found block rdd_1_0 locally
INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 9 ms on localhost (1/2)
INFO BlockManager: Found block rdd_1_1 locally
INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 10 ms on localhost (2/2)
INFO DAGScheduler: Stage 1 (count at SimpleApp.scala:13) finished in 0.011 s
INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.024084 s
Lines with a: 5, Lines with b: 12
I made it work by specifying the path to the metrics file in the spark-submit command:
--files=/yourPath/metrics.properties --conf spark.metrics.conf=./metrics.properties
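Putting it together, the full submit command then looks roughly like this:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] --files /yourPath/metrics.properties --conf spark.metrics.conf=./metrics.properties target/scala-2.10/simple-project_2.10-1.0.jar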

Spark Shell Listens on localhost instead of configured IP address

I am trying to run a simple Spark job via spark-shell, and it looks like the BlockManager for the spark-shell listens on localhost instead of the configured IP address, which causes the Spark job to fail. The exception thrown is "Failed to connect to localhost".
Here is my configuration:
Machine 1(ubunt64): Spark master [192.168.253.136]
Machine 2(ubuntu64server): Spark Slave [192.168.253.137]
Machine 3(ubuntu64server2): Spark Shell Client[192.168.253.138]
Spark Version: spark-1.3.0-bin-hadoop2.4
Environment: Ubuntu 14.04
Source Code to be executed in Spark Shell:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
var conf = new SparkConf().setMaster("spark://192.168.253.136:7077")
conf.set("spark.driver.host","192.168.253.138")
conf.set("spark.local.ip","192.168.253.138")
sc.stop
var sc = new SparkContext(conf)
val textFile = sc.textFile("README.md")
textFile.count()
The above code works fine if I run it on Machine 2, where the slave is running, but it fails on Machine 1 (Master) and Machine 3 (Spark Shell).
I'm not sure why spark-shell listens on localhost instead of the configured IP address. I have set SPARK_LOCAL_IP on Machine 3 in spark-env.sh as well as in .bashrc (export SPARK_LOCAL_IP=192.168.253.138). I confirmed that the spark-shell Java process does listen on port 44015. I'm not sure why spark-shell is broadcasting the localhost address.
Any help troubleshooting this issue will be highly appreciated. I am probably missing some configuration setting.
Logs:
scala> val textFile = sc.textFile("README.md")
15/04/22 18:15:22 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
15/04/22 18:15:22 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
15/04/22 18:15:22 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975
15/04/22 18:15:22 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
15/04/22 18:15:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:44015 (size: 22.2 KB, free: 267.2 MB)
scala> textFile.count()
15/04/22 18:16:07 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (README.md MapPartitionsRDD[1] at textFile at :25)
15/04/22 18:16:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/04/22 18:16:08 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ubuntu64server, PROCESS_LOCAL, 1326 bytes)
15/04/22 18:16:23 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ubuntu64server, PROCESS_LOCAL, 1326 bytes)
15/04/22 18:16:23 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ubuntu64server): java.io.IOException: Failed to connect to localhost/127.0.0.1:44015
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I found a workaround for this BlockManager localhost issue by providing the Spark master address at shell initiation (it can also be set in spark-defaults.conf).
./spark-shell --master spark://192.168.253.136:7077
This way, I didn't have to stop the Spark context, and the original context was able to read files as well as read data from Cassandra tables.
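Equivalently, the master address can be put in conf/spark-defaults.conf so the shell picks it up without the command-line flag (a sketch using the master from above):
spark.master    spark://192.168.253.136:7077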
Here is the log of the BlockManager listening on localhost (after stopping and dynamically recreating the context), which fails with the "Failed to connect" exception:
15/04/25 07:10:27 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:40235 (size: 1966.0 B, free: 267.2 MB)
Compare to listening on the actual server name (when the Spark master is provided on the command line), which works:
15/04/25 07:12:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ubuntu64server2:33301 (size: 1966.0 B, free: 267.2 MB)
This looks like a bug in the BlockManager code when the context is dynamically created in the shell.
Hope this helps someone.
