How to correctly parallelize multiple JSON file aggregation in PySpark - apache-spark

I have a large set of json_lines files on S3 containing logs that I would like to aggregate (basically just counting the number of requests by path, location, etc.). I've been doing the following, but judging by the logs, I'm not sure it's actually parallelized: first it takes about 3 minutes to download the individual S3 files one by one, and then the rest is still split into 1000 executions. I thought Spark would break this down into a map-reduce kind of approach itself, but maybe I totally misunderstood what it does and what it doesn't do. Could someone provide a hint, please?
from pyspark.sql.functions import col, from_unixtime
from pyspark.sql.types import DateType

# test_paths (list of S3 paths) and schema (the log schema) are defined elsewhere
df = (
    spark.read
    .json(test_paths, schema=schema)
    .filter(col('method') == 'GET')
    .filter((col('status_code') == 200) | (col('status_code') == 206))
    .withColumn('date', from_unixtime('timestamp').cast(DateType()))
    .groupBy('path', 'client_country_code', 'date', 'file_size')
    .count()
)
Here's the driver log for 1000 URLs:
20/11/15 19:15:23 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under 1000 paths. The first several paths are: s3n://bucket../10004.json_lines.gz.
20/11/15 19:15:23 INFO SparkContext: Starting job: json at NativeMethodAccessorImpl.java:0
20/11/15 19:15:23 INFO DAGScheduler: Got job 49 (json at NativeMethodAccessorImpl.java:0) with 1000 output partitions
20/11/15 19:15:23 INFO DAGScheduler: Final stage: ResultStage 75 (json at NativeMethodAccessorImpl.java:0)
20/11/15 19:15:23 INFO DAGScheduler: Parents of final stage: List()
20/11/15 19:15:23 INFO DAGScheduler: Missing parents: List()
20/11/15 19:15:23 INFO DAGScheduler: Submitting ResultStage 75 (MapPartitionsRDD[206] at json at NativeMethodAccessorImpl.java:0), which has no missing parents
20/11/15 19:15:23 INFO MemoryStore: Block broadcast_77 stored as values in memory (estimated size 84.3 KiB, free 2.2 GiB)
20/11/15 19:15:23 INFO MemoryStore: Block broadcast_77_piece0 stored as bytes in memory (estimated size 29.9 KiB, free 2.2 GiB)
20/11/15 19:15:23 INFO BlockManagerInfo: Added broadcast_77_piece0 in memory on e05e979b7108:34999 (size: 29.9 KiB, free: 2.2 GiB)
20/11/15 19:15:23 INFO SparkContext: Created broadcast 77 from broadcast at DAGScheduler.scala:1223
20/11/15 19:15:23 INFO DAGScheduler: Submitting 1000 missing tasks from ResultStage 75 (MapPartitionsRDD[206] at json at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
20/11/15 19:15:23 INFO TaskSchedulerImpl: Adding task set 75.0 with 1000 tasks
20/11/15 19:15:23 INFO TaskSetManager: Starting task 0.0 in stage 75.0 (TID 33224, e05e979b7108, executor driver, partition 0, PROCESS_LOCAL, 7473 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 1.0 in stage 75.0 (TID 33225, e05e979b7108, executor driver, partition 1, PROCESS_LOCAL, 7473 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 2.0 in stage 75.0 (TID 33226, e05e979b7108, executor driver, partition 2, PROCESS_LOCAL, 7474 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 3.0 in stage 75.0 (TID 33227, e05e979b7108, executor driver, partition 3, PROCESS_LOCAL, 7475 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 4.0 in stage 75.0 (TID 33228, e05e979b7108, executor driver, partition 4, PROCESS_LOCAL, 7476 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 5.0 in stage 75.0 (TID 33229, e05e979b7108, executor driver, partition 5, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 6.0 in stage 75.0 (TID 33230, e05e979b7108, executor driver, partition 6, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 7.0 in stage 75.0 (TID 33231, e05e979b7108, executor driver, partition 7, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO Executor: Running task 0.0 in stage 75.0 (TID 33224)
20/11/15 19:15:23 INFO Executor: Running task 1.0 in stage 75.0 (TID 33225)
20/11/15 19:15:23 INFO Executor: Running task 2.0 in stage 75.0 (TID 33226)
20/11/15 19:15:23 INFO Executor: Running task 5.0 in stage 75.0 (TID 33229)
20/11/15 19:15:23 INFO Executor: Running task 3.0 in stage 75.0 (TID 33227)
20/11/15 19:15:23 INFO Executor: Running task 7.0 in stage 75.0 (TID 33231)
20/11/15 19:15:23 INFO Executor: Running task 6.0 in stage 75.0 (TID 33230)
20/11/15 19:15:23 INFO Executor: Running task 4.0 in stage 75.0 (TID 33228)
20/11/15 19:15:24 INFO Executor: Finished task 1.0 in stage 75.0 (TID 33225). 2025 bytes result sent to driver
20/11/15 19:15:24 INFO Executor: Finished task 0.0 in stage 75.0 (TID 33224). 2025 bytes result sent to driver
20/11/15 19:15:24 INFO TaskSetManager: Starting task 8.0 in stage 75.0 (TID 33232, e05e979b7108, executor driver, partition 8, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 1.0 in stage 75.0 (TID 33225) in 567 ms on e05e979b7108 (executor driver) (1/1000)
20/11/15 19:15:24 INFO Executor: Running task 8.0 in stage 75.0 (TID 33232)
20/11/15 19:15:24 INFO Executor: Finished task 6.0 in stage 75.0 (TID 33230). 2033 bytes result sent to driver
20/11/15 19:15:24 INFO TaskSetManager: Starting task 9.0 in stage 75.0 (TID 33233, e05e979b7108, executor driver, partition 9, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Starting task 10.0 in stage 75.0 (TID 33234, e05e979b7108, executor driver, partition 10, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 0.0 in stage 75.0 (TID 33224) in 570 ms on e05e979b7108 (executor driver) (2/1000)
20/11/15 19:15:24 INFO Executor: Running task 9.0 in stage 75.0 (TID 33233)
20/11/15 19:15:24 INFO Executor: Running task 10.0 in stage 75.0 (TID 33234)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 6.0 in stage 75.0 (TID 33230) in 571 ms on e05e979b7108 (executor driver) (3/1000)
....
20/11/15 19:15:43 INFO TaskSetManager: Finished task 998.0 in stage 75.0 (TID 34222) in 158 ms on e05e979b7108 (executor driver) (999/1000)
20/11/15 19:15:43 INFO Executor: Finished task 999.0 in stage 75.0 (TID 34223). 2033 bytes result sent to driver
20/11/15 19:15:43 INFO TaskSetManager: Finished task 999.0 in stage 75.0 (TID 34223) in 175 ms on e05e979b7108 (executor driver) (1000/1000)
20/11/15 19:15:43 INFO TaskSchedulerImpl: Removed TaskSet 75.0, whose tasks have all completed, from pool
20/11/15 19:15:43 INFO DAGScheduler: ResultStage 75 (json at NativeMethodAccessorImpl.java:0) finished in 19.850 s
20/11/15 19:15:43 INFO DAGScheduler: Job 49 is finished. Cancelling potential speculative or zombie tasks for this job
20/11/15 19:15:43 INFO TaskSchedulerImpl: Killing all running tasks in stage 75: Stage finished
20/11/15 19:15:43 INFO DAGScheduler: Job 49 finished: json at NativeMethodAccessorImpl.java:0, took 19.890458 s
20/11/15 19:15:43 INFO InMemoryFileIndex: It took 19936 ms to list leaf files for 1000 paths.

There's a lot of setup overhead, especially with many small files. JSON is also a very inefficient storage format, since the whole file needs to be read every time. Ideally each file should be 64+ MB, to give the Spark workers enough data to process efficiently.
Have you considered making step 1 of your workflow just reading in the JSON files and then saving them in a columnar format like Parquet, to a smaller number of files?
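For example, a one-time compaction job could look roughly like this (a minimal sketch reusing test_paths and schema from your snippet; the output path and partition count are placeholders to tune):

# read the many small JSON files once...
df = spark.read.json(test_paths, schema=schema)

# ...and write a small number of larger Parquet files
(df
    .repartition(32)  # aim for roughly 64+ MB per output file
    .write
    .mode('overwrite')
    .parquet('s3://bucket/logs-parquet/'))  # placeholder output path

# subsequent aggregations read the columnar data instead
logs = spark.read.parquet('s3://bucket/logs-parquet/')

The aggregation itself stays the same, but Parquet lets Spark read only the columns it needs, and both the file-listing and the task-scheduling overhead shrink with the file count.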

Related

Spark Streaming integration with Kinesis not receiving records in EMR

I'm trying to run the word count example described here, but the DStream reading from the Kinesis stream is always empty.
This is how I'm running it:
Launched an AWS EMR cluster in version 6.5.0 (running Spark 3.1.2)
SSHed into the master instance
Ran: spark-example --packages org.apache.spark:spark-streaming-kinesis-asl_2.12:3.1.2 streaming.JavaKinesisWordCountASL streaming_test streaming_test https://kinesis.sa-east-1.amazonaws.com
In another tab, ran: spark-example --packages org.apache.spark:spark-streaming-kinesis-asl_2.12:3.1.2 streaming.KinesisWordProducerASL streaming-test https://kinesis.sa-east-1.amazonaws.com 100 10
Additional info:
EMR cluster with 2 m5.xlarge instances
Kinesis with a single shard only
I can fetch records from the stream using boto3 (see the sketch after this list)
A DynamoDB table was indeed created for storing checkpoints, but nothing was written on it
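The boto3 check was along these lines (a minimal sketch; the stream name and region are taken from the commands above):

import boto3

# sanity check: read records directly from the stream, outside of Spark
client = boto3.client('kinesis', region_name='sa-east-1')

shard_id = client.describe_stream(
    StreamName='streaming_test')['StreamDescription']['Shards'][0]['ShardId']
iterator = client.get_shard_iterator(
    StreamName='streaming_test',
    ShardId=shard_id,
    ShardIteratorType='TRIM_HORIZON')['ShardIterator']

print(client.get_records(ShardIterator=iterator)['Records'])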
Logs (this is just a sample - after it finishes initializing, it keeps repeating that pattern of pprint with no records, followed by a bunch of Spark-related logs, then again another pprint with no records):
22/01/27 21:39:46 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 77) (ip-10-0-13-187.sa-east-1.compute.internal, executor 1, partition 6, PROCESS_LOCAL, 4443 bytes) taskResourceAssignments Map()
22/01/27 21:39:46 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 76) in 19 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (1/3)
22/01/27 21:39:46 INFO TaskSetManager: Starting task 2.0 in stage 8.0 (TID 78) (ip-10-0-13-187.sa-east-1.compute.internal, executor 1, partition 7, PROCESS_LOCAL, 4443 bytes) taskResourceAssignments Map()
22/01/27 21:39:46 INFO TaskSetManager: Finished task 1.0 in stage 8.0 (TID 77) in 10 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (2/3)
22/01/27 21:39:46 INFO TaskSetManager: Finished task 2.0 in stage 8.0 (TID 78) in 8 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (3/3)
22/01/27 21:39:46 INFO YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
22/01/27 21:39:46 INFO DAGScheduler: ResultStage 8 (print at JavaKinesisWordCountASL.java:190) finished in 0,042 s
22/01/27 21:39:46 INFO DAGScheduler: Job 4 is finished. Cancelling potential speculative or zombie tasks for this job
22/01/27 21:39:46 INFO YarnScheduler: Killing all running tasks in stage 8: Stage finished
22/01/27 21:39:46 INFO DAGScheduler: Job 4 finished: print at JavaKinesisWordCountASL.java:190, took 0,048372 s
-------------------------------------------
Time: 1643319586000 ms
-------------------------------------------
22/01/27 21:39:46 INFO JobScheduler: Finished job streaming job 1643319586000 ms.0 from job set of time 1643319586000 ms
22/01/27 21:39:46 INFO JobScheduler: Total delay: 0,271 s for time 1643319586000 ms (execution: 0,227 s)
22/01/27 21:39:46 INFO ReceivedBlockTracker: Deleting batches:
Also, the library apparently does manage to connect to the Kinesis stream:
22/01/27 21:39:44 INFO KinesisInputDStream: Slide time = 2000 ms
22/01/27 21:39:44 INFO KinesisInputDStream: Storage level = Serialized 1x Replicated
22/01/27 21:39:44 INFO KinesisInputDStream: Checkpoint interval = null
22/01/27 21:39:44 INFO KinesisInputDStream: Remember interval = 2000 ms
22/01/27 21:39:44 INFO KinesisInputDStream: Initialized and validated org.apache.spark.streaming.kinesis.KinesisInputDStream#7cc3580b
Help would be much appreciated!

Apache Spark driver logs don't specify reason of stage cancelling

I run Apache Spark on AWS EMR under YARN.
The cluster has 1 master and 10 executors.
After some hours of processing, my cluster failed, so I went to look at the logs.
I saw that all working executors were trying to kill tasks at the same time (this is the log of one executor):
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 66.0 in stage 2.0 (TID 466), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 65.0 in stage 2.0 (TID 465), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 67.0 in stage 2.0 (TID 467), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 64.0 in stage 2.0 (TID 464), reason: Stage cancelled
20/03/05 00:02:12 ERROR Utils: Aborting a task
I can see that the reason is "Stage cancelled", but I can't get any details about it. Looking at the driver logs, I find that their last record is from a much earlier time.
So I have 2 questions:
Why are the driver logs much shorter than the executor logs?
How can I get the real reason why the stage was cancelled?
20/03/04 18:39:40 INFO TaskSetManager: Starting task 159.0 in stage 1.0 (TID 359, ip-172-31-6-236.us-west-2.compute.internal, executor 40, partition 159, RACK_LOCAL, 8421 bytes)
20/03/04 18:39:40 INFO ExecutorAllocationManager: New executor 40 has registered (new total is 40)
20/03/04 18:39:41 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-6-236.us-west-2.compute.internal:33589 with 2.8 GB RAM, BlockManagerId(40, ip-172-31-6-236.us-west-2.compute.internal, 33589, None)
20/03/04 18:39:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 44.7 KB, free: 2.8 GB)
20/03/04 18:39:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 37.4 KB, free: 2.8 GB)

Identical code works in pyspark shell but not via spark-submit

So I have a PySpark project with the following structure:
main.py: does the real work (imports PySpark udfs from utils.py and params from common.py)
utils.py: some utility functions (imports from common.py)
common.py: some params
Inside a PySpark shell, I can run the code from common.py, utils.py, and main.py in this order and get the result I want; however, if I submit it via spark-submit, no error is reported, but the job keeps executing at a cluster load of < 1%, which I suspect means nothing is really being computed.
Here is the spark-submit command:
spark-submit --master yarn --deploy-mode cluster --driver-cores 4 --driver-memory 20G --num-executors 10 --executor-cores 4 --executor-memory 20G --py-files project.zip project/main.py
Here is what's inside main.py (the other 2 .py files were a bit lengthy):
from pyspark.sql import SparkSession
from utils import foo1, foo2
from common import bar1, bar2

if __name__ == '__main__':
    input_path = bar1
    output_path = bar2

    # build spark session
    spark = SparkSession.builder\
        .appName("app")\
        .getOrCreate()

    rd = spark.read.json(input_path).repartition(100)

    # add a new column Col2
    rd_app = rd.withColumn('Col2', foo1(rd.Col1))
    rd_app_not_null = rd_app.filter('Col2 is not null')

    # cleanup queries
    rd_no_query = rd_app_not_null\
        .withColumn('URLClean', foo2(rd_app_not_null.URL))\
        .drop('URL')\
        .withColumnRenamed('URLClean', 'URL')

    # save to S3
    rd_no_query.write.json(output_path, compression='gzip')
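For context, here is a hypothetical sketch of how the udfs in utils.py are shaped (the real implementations are lengthy and not shown, so the bodies below are placeholders):

# utils.py -- hypothetical sketch, placeholder logic only
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def _derive_col2(col1):
    return col1.lower() if col1 else None      # placeholder

def _strip_query(url):
    return url.split('?')[0] if url else None  # placeholder

foo1 = udf(_derive_col2, StringType())
foo2 = udf(_strip_query, StringType())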
Running on EMR; the PySpark shell runs in YARN client mode, spark-submit in YARN cluster mode.
Any help is appreciated!
Edit: the last couple of lines of the logs (it stayed like that for hours):
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 285.578911 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 41.334709 ms
17/03/08 22:06:31 INFO CodeGenerator: Code generated in 9.883221 ms
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 313.5 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 28.2 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.42.228:35995 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 2 from json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
17/03/08 22:06:31 INFO SparkContext: Starting job: json at NativeMethodAccessorImpl.java:0
17/03/08 22:06:31 INFO DAGScheduler: Registering RDD 8 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Got job 1 (json at NativeMethodAccessorImpl.java:0) with 100 output partitions
17/03/08 22:06:31 INFO DAGScheduler: Final stage: ResultStage 2 (json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
17/03/08 22:06:31 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0), which has no missing parents
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 25.3 KB, free 11.8 GB)
17/03/08 22:06:31 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.8 KB, free 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.31.42.228:35995 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/03/08 22:06:31 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[8] at json at NativeMethodAccessorImpl.java:0)
17/03/08 22:06:31 INFO YarnClusterScheduler: Adding task set 1.0 with 24 tasks
17/03/08 22:06:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, ip-172-31-46-251.ec2.internal, executor 2, partition 0, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, ip-172-31-43-215.ec2.internal, executor 4, partition 1, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, ip-172-31-41-81.ec2.internal, executor 3, partition 2, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 27, ip-172-31-34-182.ec2.internal, executor 5, partition 3, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 28, ip-172-31-46-251.ec2.internal, executor 2, partition 4, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 29, ip-172-31-43-215.ec2.internal, executor 4, partition 5, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 30, ip-172-31-41-81.ec2.internal, executor 3, partition 6, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 31, ip-172-31-34-182.ec2.internal, executor 5, partition 7, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 32, ip-172-31-46-251.ec2.internal, executor 2, partition 8, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 33, ip-172-31-43-215.ec2.internal, executor 4, partition 9, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 10.0 in stage 1.0 (TID 34, ip-172-31-41-81.ec2.internal, executor 3, partition 10, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 35, ip-172-31-34-182.ec2.internal, executor 5, partition 11, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 12.0 in stage 1.0 (TID 36, ip-172-31-46-251.ec2.internal, executor 2, partition 12, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 13.0 in stage 1.0 (TID 37, ip-172-31-43-215.ec2.internal, executor 4, partition 13, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 14.0 in stage 1.0 (TID 38, ip-172-31-41-81.ec2.internal, executor 3, partition 14, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:31 INFO TaskSetManager: Starting task 15.0 in stage 1.0 (TID 39, ip-172-31-34-182.ec2.internal, executor 5, partition 15, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO YarnAllocator: Driver requested a total number of 5 executor(s).
17/03/08 22:06:32 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:32 INFO ExecutorAllocationManager: Requesting 5 new executors because tasks are backlogged (new desired total will be 5)
17/03/08 22:06:32 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:32 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-46-251.ec2.internal:43957 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:32 INFO AMRMClientImpl: Received new token for : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:32 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000009 on host ip-172-31-45-4.ec2.internal
17/03/08 22:06:32 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:32 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-45-4.ec2.internal:8041
17/03/08 22:06:33 INFO YarnAllocator: Driver requested a total number of 6 executor(s).
17/03/08 22:06:33 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 22528 MB memory (including 2048 MB of overhead)
17/03/08 22:06:33 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 6)
17/03/08 22:06:33 INFO YarnAllocator: Submitted container request for host *.
17/03/08 22:06:33 INFO YarnAllocator: Launching container container_1489010282810_0001_01_000012 on host ip-172-31-40-142.ec2.internal
17/03/08 22:06:33 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/03/08 22:06:33 INFO ContainerManagementProtocolProxy: Opening proxy : ip-172-31-40-142.ec2.internal:8041
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-41-81.ec2.internal:40460 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-34-182.ec2.internal:40881 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.40.142:36208) with ID 8
17/03/08 22:06:36 INFO TaskSetManager: Starting task 16.0 in stage 1.0 (TID 40, ip-172-31-40-142.ec2.internal, executor 8, partition 16, RACK_LOCAL, 6588 bytes)
17/03/08 22:06:36 INFO ExecutorAllocationManager: New executor 8 has registered (new total is 5)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 17.0 in stage 1.0 (TID 41, ip-172-31-40-142.ec2.internal, executor 8, partition 17, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 18.0 in stage 1.0 (TID 42, ip-172-31-40-142.ec2.internal, executor 8, partition 18, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO TaskSetManager: Starting task 19.0 in stage 1.0 (TID 43, ip-172-31-40-142.ec2.internal, executor 8, partition 19, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:36 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-40-142.ec2.internal:42371 with 11.8 GB RAM, BlockManagerId(8, ip-172-31-40-142.ec2.internal, 42371, None)
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-43-215.ec2.internal:34026 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:36 INFO AMRMClientImpl: Received new token for : ip-172-31-33-112.ec2.internal:8041
17/03/08 22:06:36 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 0 of them.
17/03/08 22:06:36 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:37 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (172.31.45.4:33330) with ID 7
17/03/08 22:06:37 INFO TaskSetManager: Starting task 20.0 in stage 1.0 (TID 44, ip-172-31-45-4.ec2.internal, executor 7, partition 20, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO ExecutorAllocationManager: New executor 7 has registered (new total is 6)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 21.0 in stage 1.0 (TID 45, ip-172-31-45-4.ec2.internal, executor 7, partition 21, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 22.0 in stage 1.0 (TID 46, ip-172-31-45-4.ec2.internal, executor 7, partition 22, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO TaskSetManager: Starting task 23.0 in stage 1.0 (TID 47, ip-172-31-45-4.ec2.internal, executor 7, partition 23, RACK_LOCAL, 6587 bytes)
17/03/08 22:06:37 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-45-4.ec2.internal:45646 with 11.8 GB RAM, BlockManagerId(7, ip-172-31-45-4.ec2.internal, 45646, None)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-40-142.ec2.internal:42371 (size: 28.2 KB, free: 11.8 GB)
17/03/08 22:06:38 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 9.8 KB, free: 11.8 GB)
17/03/08 22:06:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-45-4.ec2.internal:45646 (size: 28.2 KB, free: 11.8 GB)

Spark: why tasks assigned only to one worker?

I'm new to Apache Spark and trying to run a simple program on my cluster. The problem is that the driver allocates all tasks to one worker.
I am running a Spark standalone cluster on 2 computers:
1 - runs the master and a worker with 4 cores: 1 used for the master, 3 for the worker. IP: 192.168.1.101
2 - runs only a worker with 4 cores: all for the worker. IP: 192.168.1.104
This is the code:
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-project");
    JavaSparkContext sc = new JavaSparkContext(conf);

    try {
        Thread.sleep(5000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    JavaRDD<String> lines = sc.textFile("/Datasets/somefile.txt", 7);
    System.out.println(lines.partitions().size());

    Accumulator<Integer> sum = sc.accumulator(0);
    JavaRDD<Integer> numbers = lines.map(line -> 1);
    System.out.println(numbers.partitions().size());

    numbers.foreach(num -> System.out.println(num));
    numbers.foreach(num -> sum.add(num));
    System.out.println(sum.value());

    sc.close();
}
Note: I added the Thread.sleep() call because of this issue: https://issues.apache.org/jira/browse/SPARK-3100
I used this submit script:
bin/spark-submit --class spark.Main --master spark://192.168.1.101:7077 --deploy-mode cluster /home/sparkUser/JarsOfSpark/JarForSpark.jar
This is the result I got from the driver's stdout:
7
7
50144
logs from the master:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/15 19:22:14 INFO SecurityManager: Changing view acls to: sparkUser
16/01/15 19:22:14 INFO SecurityManager: Changing modify acls to: sparkUser
16/01/15 19:22:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkUser); users with modify permissions: Set(sparkUser)
16/01/15 19:22:24 INFO Slf4jLogger: Slf4jLogger started
16/01/15 19:22:24 INFO Utils: Successfully started service 'Driver' on port 46546.
16/01/15 19:22:24 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker#192.168.1.101:43150/user/Worker
16/01/15 19:22:24 INFO SparkContext: Running Spark version 1.4.1
16/01/15 19:22:24 INFO SecurityManager: Changing view acls to: sparkUser
16/01/15 19:22:24 INFO SecurityManager: Changing modify acls to: sparkUser
16/01/15 19:22:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkUser); users with modify permissions: Set(sparkUser)
16/01/15 19:22:24 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker#192.168.1.101:43150/user/Worker
16/01/15 19:22:25 INFO Slf4jLogger: Slf4jLogger started
16/01/15 19:22:25 INFO Utils: Successfully started service 'sparkDriver' on port 38186.
16/01/15 19:22:25 INFO SparkEnv: Registering MapOutputTracker
16/01/15 19:22:25 INFO SparkEnv: Registering BlockManagerMaster
16/01/15 19:22:25 INFO DiskBlockManager: Created local directory at /tmp/spark-ef3b8193-e086-4764-993c-0a40534052c1/blockmgr-e80c1c60-fe19-4be1-b3f9-259b3f1031a0
16/01/15 19:22:25 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
16/01/15 19:22:25 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ef3b8193-e086-4764-993c-0a40534052c1/httpd-e05a5a70-dbf3-4055-b6ab-7efa22dfa4d2
16/01/15 19:22:25 INFO HttpServer: Starting HTTP Server
16/01/15 19:22:25 INFO Utils: Successfully started service 'HTTP file server' on port 34728.
16/01/15 19:22:25 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/15 19:22:35 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/15 19:22:35 INFO SparkUI: Started SparkUI at http://192.168.1.101:4040
16/01/15 19:22:35 INFO SparkContext: Added JAR file:/home/sparkUser/JarsOfSpark/JarForSpark.jar at http://192.168.1.101:34728/jars/JarForSpark.jar with timestamp 1452878555317
16/01/15 19:22:35 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#192.168.1.101:7077/user/Master...
16/01/15 19:22:35 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160115192235-0016
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor added: app-20160115192235-0016/0 on worker-20160115181337-192.168.1.104-50099 (192.168.1.104:50099) with 4 cores
16/01/15 19:22:35 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160115192235-0016/0 on hostPort 192.168.1.104:50099 with 4 cores, 512.0 MB RAM
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor added: app-20160115192235-0016/1 on worker-20160115125104-192.168.1.101-43150 (192.168.1.101:43150) with 3 cores
16/01/15 19:22:35 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160115192235-0016/1 on hostPort 192.168.1.101:43150 with 3 cores, 512.0 MB RAM
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor updated: app-20160115192235-0016/1 is now LOADING
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor updated: app-20160115192235-0016/0 is now LOADING
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor updated: app-20160115192235-0016/0 is now RUNNING
16/01/15 19:22:35 INFO AppClient$ClientActor: Executor updated: app-20160115192235-0016/1 is now RUNNING
16/01/15 19:22:35 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33359.
16/01/15 19:22:35 INFO NettyBlockTransferService: Server created on 33359
16/01/15 19:22:35 INFO BlockManagerMaster: Trying to register BlockManager
16/01/15 19:22:35 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.101:33359 with 265.1 MB RAM, BlockManagerId(driver, 192.168.1.101, 33359)
16/01/15 19:22:35 INFO BlockManagerMaster: Registered BlockManager
16/01/15 19:22:35 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/01/15 19:22:38 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#192.168.1.104:49573/user/Executor#1472403765]) with ID 0
16/01/15 19:22:39 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.104:33856 with 265.1 MB RAM, BlockManagerId(0, 192.168.1.104, 33856)
16/01/15 19:22:40 INFO MemoryStore: ensureFreeSpace(130448) called with curMem=0, maxMem=278019440
16/01/15 19:22:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 265.0 MB)
16/01/15 19:22:40 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=130448, maxMem=278019440
16/01/15 19:22:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 265.0 MB)
16/01/15 19:22:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.101:33359 (size: 13.9 KB, free: 265.1 MB)
16/01/15 19:22:40 INFO SparkContext: Created broadcast 0 from textFile at Main.java:25
16/01/15 19:22:41 INFO FileInputFormat: Total input paths to process : 1
16/01/15 19:22:41 INFO SparkContext: Starting job: foreach at Main.java:33
16/01/15 19:22:41 INFO DAGScheduler: Got job 0 (foreach at Main.java:33) with 7 output partitions (allowLocal=false)
16/01/15 19:22:41 INFO DAGScheduler: Final stage: ResultStage 0(foreach at Main.java:33)
16/01/15 19:22:41 INFO DAGScheduler: Parents of final stage: List()
16/01/15 19:22:41 INFO DAGScheduler: Missing parents: List()
16/01/15 19:22:41 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at map at Main.java:30), which has no missing parents
16/01/15 19:22:41 INFO MemoryStore: ensureFreeSpace(4400) called with curMem=144705, maxMem=278019440
16/01/15 19:22:41 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 265.0 MB)
16/01/15 19:22:41 INFO MemoryStore: ensureFreeSpace(2538) called with curMem=149105, maxMem=278019440
16/01/15 19:22:41 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 265.0 MB)
16/01/15 19:22:41 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.101:33359 (size: 2.5 KB, free: 265.1 MB)
16/01/15 19:22:41 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
16/01/15 19:22:41 INFO DAGScheduler: Submitting 7 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at map at Main.java:30)
16/01/15 19:22:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
16/01/15 19:22:41 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.104:33856 (size: 2.5 KB, free: 265.1 MB)
16/01/15 19:22:42 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.104:33856 (size: 13.9 KB, free: 265.1 MB)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 2017 ms on 192.168.1.104 (1/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2036 ms on 192.168.1.104 (2/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2027 ms on 192.168.1.104 (3/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2027 ms on 192.168.1.104 (4/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 143 ms on 192.168.1.104 (5/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 199 ms on 192.168.1.104 (6/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 206 ms on 192.168.1.104 (7/7)
16/01/15 19:22:43 INFO DAGScheduler: ResultStage 0 (foreach at Main.java:33) finished in 2.218 s
16/01/15 19:22:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/15 19:22:43 INFO DAGScheduler: Job 0 finished: foreach at Main.java:33, took 2.289399 s
16/01/15 19:22:43 INFO SparkContext: Starting job: foreach at Main.java:34
16/01/15 19:22:43 INFO DAGScheduler: Got job 1 (foreach at Main.java:34) with 7 output partitions (allowLocal=false)
16/01/15 19:22:43 INFO DAGScheduler: Final stage: ResultStage 1(foreach at Main.java:34)
16/01/15 19:22:43 INFO DAGScheduler: Parents of final stage: List()
16/01/15 19:22:43 INFO DAGScheduler: Missing parents: List()
16/01/15 19:22:43 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[2] at map at Main.java:30), which has no missing parents
16/01/15 19:22:43 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=151643, maxMem=278019440
16/01/15 19:22:43 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.7 KB, free 265.0 MB)
16/01/15 19:22:43 INFO MemoryStore: ensureFreeSpace(2761) called with curMem=156467, maxMem=278019440
16/01/15 19:22:43 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.7 KB, free 265.0 MB)
16/01/15 19:22:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.101:33359 (size: 2.7 KB, free: 265.1 MB)
16/01/15 19:22:43 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
16/01/15 19:22:43 INFO DAGScheduler: Submitting 7 missing tasks from ResultStage 1 (MapPartitionsRDD[2] at map at Main.java:30)
16/01/15 19:22:43 INFO TaskSchedulerImpl: Adding task set 1.0 with 7 tasks
16/01/15 19:22:43 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 7, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 8, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 9, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 10, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.104:33856 (size: 2.7 KB, free: 265.1 MB)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 11, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 7) in 106 ms on 192.168.1.104 (1/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 8) in 125 ms on 192.168.1.104 (2/7)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 12, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 13, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 9) in 131 ms on 192.168.1.104 (3/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 10) in 133 ms on 192.168.1.104 (4/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID 12) in 32 ms on 192.168.1.104 (5/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 11) in 61 ms on 192.168.1.104 (6/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 13) in 34 ms on 192.168.1.104 (7/7)
16/01/15 19:22:43 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/01/15 19:22:43 INFO DAGScheduler: ResultStage 1 (foreach at Main.java:34) finished in 0.165 s
16/01/15 19:22:43 INFO DAGScheduler: Job 1 finished: foreach at Main.java:34, took 0.177378 s
16/01/15 19:22:43 INFO SparkUI: Stopped Spark web UI at http://192.168.1.101:4040
16/01/15 19:22:43 INFO DAGScheduler: Stopping DAGScheduler
16/01/15 19:22:43 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/01/15 19:22:43 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/01/15 19:22:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/01/15 19:22:43 INFO Utils: path = /tmp/spark-ef3b8193-e086-4764-993c-0a40534052c1/blockmgr-e80c1c60-fe19-4be1-b3f9-259b3f1031a0, already present as root for deletion.
16/01/15 19:22:43 INFO MemoryStore: MemoryStore cleared
16/01/15 19:22:43 INFO BlockManager: BlockManager stopped
16/01/15 19:22:43 INFO BlockManagerMaster: BlockManagerMaster stopped
16/01/15 19:22:43 INFO SparkContext: Successfully stopped SparkContext
16/01/15 19:22:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/01/15 19:22:43 INFO Utils: Shutdown hook called
16/01/15 19:22:43 INFO Utils: Deleting directory /tmp/spark-ef3b8193-e086-4764-993c-0a40534052c1
logs from worker 192.168.1.101:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/15 18:14:15 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/01/15 18:14:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/15 18:14:15 INFO SecurityManager: Changing view acls to: sparkUser
16/01/15 18:14:15 INFO SecurityManager: Changing modify acls to: sparkUser
16/01/15 18:14:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkUser); users with modify permissions: Set(sparkUser)
logs from worker 192.168.1.104:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/15 19:23:23 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/01/15 19:23:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/15 19:23:24 INFO SecurityManager: Changing view acls to: root,sparkUser
16/01/15 19:23:24 INFO SecurityManager: Changing modify acls to: root,sparkUser
16/01/15 19:23:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, sparkUser); users with modify permissions: Set(root, sparkUser)
16/01/15 19:23:25 INFO Slf4jLogger: Slf4jLogger started
16/01/15 19:23:25 INFO Remoting: Starting remoting
16/01/15 19:23:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher#192.168.1.104:43937]
16/01/15 19:23:25 INFO Utils: Successfully started service 'driverPropsFetcher' on port 43937.
16/01/15 19:23:26 INFO SecurityManager: Changing view acls to: root,sparkUser
16/01/15 19:23:26 INFO SecurityManager: Changing modify acls to: root,sparkUser
16/01/15 19:23:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, sparkUser); users with modify permissions: Set(root, sparkUser)
16/01/15 19:23:26 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/15 19:23:26 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/01/15 19:23:26 INFO Slf4jLogger: Slf4jLogger started
16/01/15 19:23:26 INFO Remoting: Starting remoting
16/01/15 19:23:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor#192.168.1.104:49573]
16/01/15 19:23:26 INFO Utils: Successfully started service 'sparkExecutor' on port 49573.
16/01/15 19:23:26 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/01/15 19:23:26 INFO DiskBlockManager: Created local directory at /tmp/spark-6ffb215c-7267-4a93-a766-2486d2331f6b/executor-146bfe64-d7e8-4da4-9144-8003754f0b5b/blockmgr-41031d8c-b069-4147-90c9-2237baed04f1
16/01/15 19:23:26 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
16/01/15 19:23:26 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver#192.168.1.101:38186/user/CoarseGrainedScheduler
16/01/15 19:23:26 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker#192.168.1.104:50099/user/Worker
16/01/15 19:23:26 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker#192.168.1.104:50099/user/Worker
16/01/15 19:23:26 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
16/01/15 19:23:26 INFO Executor: Starting executor ID 0 on host 192.168.1.104
16/01/15 19:23:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33856.
16/01/15 19:23:26 INFO NettyBlockTransferService: Server created on 33856
16/01/15 19:23:26 INFO BlockManagerMaster: Trying to register BlockManager
16/01/15 19:23:26 INFO BlockManagerMaster: Registered BlockManager
16/01/15 19:23:29 INFO CoarseGrainedExecutorBackend: Got assigned task 0
16/01/15 19:23:29 INFO CoarseGrainedExecutorBackend: Got assigned task 1
16/01/15 19:23:29 INFO CoarseGrainedExecutorBackend: Got assigned task 2
16/01/15 19:23:29 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/01/15 19:23:29 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
16/01/15 19:23:29 INFO CoarseGrainedExecutorBackend: Got assigned task 3
16/01/15 19:23:29 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/01/15 19:23:29 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
16/01/15 19:23:29 INFO Executor: Fetching http://192.168.1.101:34728/jars/JarForSpark.jar with timestamp 1452878555317
16/01/15 19:23:29 INFO Utils: Fetching http://192.168.1.101:34728/jars/JarForSpark.jar to /tmp/spark-6ffb215c-7267-4a93-a766-2486d2331f6b/executor-146bfe64-d7e8-4da4-9144-8003754f0b5b/fetchFileTemp1585609242243689070.tmp
16/01/15 19:23:29 INFO Utils: Copying /tmp/spark-6ffb215c-7267-4a93-a766-2486d2331f6b/executor-146bfe64-d7e8-4da4-9144-8003754f0b5b/3339800781452878555317_cache to /home/sparkUser2/Programs/spark-1.4.1-bin-hadoop2.6/work/app-20160115192235-0016/0/./JarForSpark.jar
16/01/15 19:23:29 INFO Executor: Adding file:/home/sparkUser2/Programs/spark-1.4.1-bin-hadoop2.6/work/app-20160115192235-0016/0/./JarForSpark.jar to class loader
16/01/15 19:23:29 INFO TorrentBroadcast: Started reading broadcast variable 1
16/01/15 19:23:29 INFO MemoryStore: ensureFreeSpace(2538) called with curMem=0, maxMem=278019440
16/01/15 19:23:29 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 265.1 MB)
16/01/15 19:23:29 INFO TorrentBroadcast: Reading broadcast variable 1 took 273 ms
16/01/15 19:23:29 INFO MemoryStore: ensureFreeSpace(4400) called with curMem=2538, maxMem=278019440
16/01/15 19:23:29 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 265.1 MB)
16/01/15 19:23:29 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:161772+161772
16/01/15 19:23:29 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:323544+161772
16/01/15 19:23:29 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:0+161772
16/01/15 19:23:29 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:485316+161772
16/01/15 19:23:29 INFO TorrentBroadcast: Started reading broadcast variable 0
16/01/15 19:23:29 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=6938, maxMem=278019440
16/01/15 19:23:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 265.1 MB)
16/01/15 19:23:30 INFO TorrentBroadcast: Reading broadcast variable 0 took 66 ms
16/01/15 19:23:30 INFO MemoryStore: ensureFreeSpace(188976) called with curMem=21195, maxMem=278019440
16/01/15 19:23:30 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 184.5 KB, free 264.9 MB)
16/01/15 19:23:30 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/01/15 19:23:30 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/01/15 19:23:30 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/01/15 19:23:30 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/01/15 19:23:30 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/01/15 19:23:30 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 1796 bytes result sent to driver
16/01/15 19:23:30 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1796 bytes result sent to driver
16/01/15 19:23:30 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1796 bytes result sent to driver
16/01/15 19:23:30 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1796 bytes result sent to driver
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 4
16/01/15 19:23:31 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:647088+161772
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 5
16/01/15 19:23:31 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 6
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:808860+161772
16/01/15 19:23:31 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:970632+161773
16/01/15 19:23:31 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 1796 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 1796 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 1796 bytes result sent to driver
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 7
16/01/15 19:23:31 INFO Executor: Running task 0.0 in stage 1.0 (TID 7)
16/01/15 19:23:31 INFO TorrentBroadcast: Started reading broadcast variable 2
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 8
16/01/15 19:23:31 INFO Executor: Running task 1.0 in stage 1.0 (TID 8)
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 9
16/01/15 19:23:31 INFO Executor: Running task 2.0 in stage 1.0 (TID 9)
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 10
16/01/15 19:23:31 INFO Executor: Running task 3.0 in stage 1.0 (TID 10)
16/01/15 19:23:31 INFO MemoryStore: ensureFreeSpace(2761) called with curMem=210171, maxMem=278019440
16/01/15 19:23:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.7 KB, free 264.9 MB)
16/01/15 19:23:31 INFO TorrentBroadcast: Reading broadcast variable 2 took 42 ms
16/01/15 19:23:31 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=212932, maxMem=278019440
16/01/15 19:23:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.7 KB, free 264.9 MB)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:0+161772
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:161772+161772
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:323544+161772
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:485316+161772
16/01/15 19:23:31 INFO Executor: Finished task 0.0 in stage 1.0 (TID 7). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 1.0 in stage 1.0 (TID 8). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 2.0 in stage 1.0 (TID 9). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 3.0 in stage 1.0 (TID 10). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 11
16/01/15 19:23:31 INFO Executor: Running task 4.0 in stage 1.0 (TID 11)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:647088+161772
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 12
16/01/15 19:23:31 INFO Executor: Running task 5.0 in stage 1.0 (TID 12)
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Got assigned task 13
16/01/15 19:23:31 INFO Executor: Running task 6.0 in stage 1.0 (TID 13)
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:808860+161772
16/01/15 19:23:31 INFO HadoopRDD: Input split: file:/Datasets/somefile.txt:970632+161773
16/01/15 19:23:31 INFO Executor: Finished task 5.0 in stage 1.0 (TID 12). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 4.0 in stage 1.0 (TID 11). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO Executor: Finished task 6.0 in stage 1.0 (TID 13). 1814 bytes result sent to driver
16/01/15 19:23:31 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
I also tried stopping one of the workers to see what happens, and the program successfully completed on the other worker.
I also looked at this post, but unfortunately it didn't solve my problem:
Why my tasks only be done in one worker in Spark cluster
Appreciate your help!
It is because of data locality - "how close data is to the code processing it".
Spark tries to schedule the available tasks at their best locality levels.
By default, Spark tries "PROCESS_LOCAL" as the first option and switches to lower levels only if it sees that none of the CPUs are freed after a certain time interval.
The default wait time before switching to lower levels is 3s (see the spark.locality.wait parameter).
And looking at the logs, all your tasks finished within 3 seconds.
16/01/15 19:22:41 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:41 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.104:33856 (size: 2.5 KB, free: 265.1 MB)
16/01/15 19:22:42 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.104:33856 (size: 13.9 KB, free: 265.1 MB)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 192.168.1.104, PROCESS_LOCAL, 1495 bytes)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 2017 ms on 192.168.1.104 (1/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2036 ms on 192.168.1.104 (2/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2027 ms on 192.168.1.104 (3/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2027 ms on 192.168.1.104 (4/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 143 ms on 192.168.1.104 (5/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 199 ms on 192.168.1.104 (6/7)
16/01/15 19:22:43 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 206 ms on 192.168.1.104 (7/7)
16/01/15 19:22:43 INFO DAGScheduler: ResultStage 0 (foreach at Main.java:33) finished in 2.218 s
16/01/15 19:22:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/15 19:22:43 INFO DAGScheduler: Job 0 finished: foreach at Main.java:33, took 2.289399 s
I would suggest trying with larger files (in the GBs), where each task takes some time to produce its final result.
For more information on data locality, please read the "Data Locality" section of the Spark tuning guide.
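If the wait itself is the issue, it can be shortened (or set to 0 to disable locality waiting) so the scheduler falls back to other executors sooner. A minimal sketch in PySpark; the same key can also be passed as --conf spark.locality.wait=1s to spark-submit:

from pyspark import SparkConf, SparkContext

# fall back from PROCESS_LOCAL to other executors after 1s instead of the 3s default
conf = (SparkConf()
        .setAppName('spark-project')
        .set('spark.locality.wait', '1s'))
sc = SparkContext(conf=conf)

In the Java code above, the equivalent would be calling conf.set("spark.locality.wait", "1s") on the SparkConf before creating the JavaSparkContext.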

Apache Spark Multi Node Clustering - java.io.FileNotFoundException

I am a newbie to Apache Spark and cluster computing. I set up Spark in standalone mode (master and worker on the same machine) and it worked fine for me.
Then I downloaded a pre-built version of Spark, followed these instructions, and placed it on every node of my cluster: http://spark.apache.org/docs/latest/spark-standalone.html#installing-spark-standalone-to-a-cluster.
My master node has IP address 172.17.0.224 and my slave nodes have IP addresses 172.17.0.221, 172.17.0.222 and 172.17.0.223.
I edited the slaves and spark-env.sh files to add the IP addresses of my slaves and the IP address of my master, respectively.
I started the master node with start-master.sh and the slave nodes with start-slaves.sh; everything worked fine.
I submitted my Spark job using the command spark-submit --class "Rice" --master spark://172.17.0.224:7077 cs453project/target/scala-2.11/simple-project_2.11-1.0.jar cs453project/input.txt cs453project/ouput2 cs453project/ouput3.
These are the error messages I got:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/25 11:22:27 INFO SparkContext: Running Spark version 1.5.2
15/11/25 11:22:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/25 11:22:28 WARN Utils: Your hostname, node04 resolves to a loopback address: 127.0.1.1; using 172.17.0.224 instead (on interface eth0)
15/11/25 11:22:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/11/25 11:22:28 INFO SecurityManager: Changing view acls to: ujjwal
15/11/25 11:22:28 INFO SecurityManager: Changing modify acls to: ujjwal
15/11/25 11:22:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ujjwal); users with modify permissions: Set(ujjwal)
15/11/25 11:22:28 INFO Slf4jLogger: Slf4jLogger started
15/11/25 11:22:28 INFO Remoting: Starting remoting
15/11/25 11:22:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#172.17.0.224:58478]
15/11/25 11:22:28 INFO Utils: Successfully started service 'sparkDriver' on port 58478.
15/11/25 11:22:28 INFO SparkEnv: Registering MapOutputTracker
15/11/25 11:22:28 INFO SparkEnv: Registering BlockManagerMaster
15/11/25 11:22:28 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-bc18e422-d334-4fe5-9663-9439620ec054
15/11/25 11:22:28 INFO MemoryStore: MemoryStore started with capacity 530.3 MB
15/11/25 11:22:29 INFO HttpFileServer: HTTP File server directory is /tmp/spark-7c6e0ad4-52ae-4f5a-9aaa-6ad9fbf48685/httpd-13d8dd4d-6ff1-450d-baac-f2702c7a4e5b
15/11/25 11:22:29 INFO HttpServer: Starting HTTP Server
15/11/25 11:22:29 INFO Utils: Successfully started service 'HTTP file server' on port 49496.
15/11/25 11:22:29 INFO SparkEnv: Registering OutputCommitCoordinator
15/11/25 11:22:29 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/11/25 11:22:29 INFO SparkUI: Started SparkUI at http://172.17.0.224:4040
15/11/25 11:22:29 INFO SparkContext: Added JAR file:/home/ujjwal/cs453project/target/scala-2.11/simple-project_2.11-1.0.jar at http://172.17.0.224:49496/jars/simple-project_2.11-1.0.jar with timestamp 1448479349380
15/11/25 11:22:29 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/11/25 11:22:29 INFO AppClient$ClientEndpoint: Connecting to master spark://172.17.0.224:7077...
15/11/25 11:22:29 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20151125112229-0001
15/11/25 11:22:29 INFO AppClient$ClientEndpoint: Executor added: app-20151125112229-0001/0 on worker-20151125095922-172.17.0.221-33366 (172.17.0.221:33366) with 2 cores
15/11/25 11:22:29 INFO SparkDeploySchedulerBackend: Granted executor ID app-20151125112229-0001/0 on hostPort 172.17.0.221:33366 with 2 cores, 1024.0 MB RAM
15/11/25 11:22:29 INFO AppClient$ClientEndpoint: Executor updated: app-20151125112229-0001/0 is now LOADING
15/11/25 11:22:29 INFO AppClient$ClientEndpoint: Executor updated: app-20151125112229-0001/0 is now RUNNING
15/11/25 11:22:29 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47843.
15/11/25 11:22:29 INFO NettyBlockTransferService: Server created on 47843
15/11/25 11:22:29 INFO BlockManagerMaster: Trying to register BlockManager
15/11/25 11:22:29 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.224:47843 with 530.3 MB RAM, BlockManagerId(driver, 172.17.0.224, 47843)
15/11/25 11:22:29 INFO BlockManagerMaster: Registered BlockManager
15/11/25 11:22:29 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/11/25 11:22:30 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0, maxMem=556038881
15/11/25 11:22:30 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 530.1 MB)
15/11/25 11:22:30 INFO MemoryStore: ensureFreeSpace(14276) called with curMem=157248, maxMem=556038881
15/11/25 11:22:30 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 530.1 MB)
15/11/25 11:22:30 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.17.0.224:47843 (size: 13.9 KB, free: 530.3 MB)
15/11/25 11:22:30 INFO SparkContext: Created broadcast 0 from textFile at build.scala:11
15/11/25 11:22:30 INFO FileInputFormat: Total input paths to process : 1
15/11/25 11:22:30 INFO SparkContext: Starting job: count at build.scala:13
15/11/25 11:22:30 INFO DAGScheduler: Got job 0 (count at build.scala:13) with 108 output partitions
15/11/25 11:22:30 INFO DAGScheduler: Final stage: ResultStage 0(count at build.scala:13)
15/11/25 11:22:30 INFO DAGScheduler: Parents of final stage: List()
15/11/25 11:22:30 INFO DAGScheduler: Missing parents: List()
15/11/25 11:22:30 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at map at build.scala:12), which has no missing parents
15/11/25 11:22:30 INFO MemoryStore: ensureFreeSpace(3424) called with curMem=171524, maxMem=556038881
15/11/25 11:22:30 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.3 KB, free 530.1 MB)
15/11/25 11:22:30 INFO MemoryStore: ensureFreeSpace(1934) called with curMem=174948, maxMem=556038881
15/11/25 11:22:30 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1934.0 B, free 530.1 MB)
15/11/25 11:22:30 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.17.0.224:47843 (size: 1934.0 B, free: 530.3 MB)
15/11/25 11:22:30 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
15/11/25 11:22:30 INFO DAGScheduler: Submitting 108 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at map at build.scala:12)
15/11/25 11:22:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 108 tasks
15/11/25 11:22:31 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#172.17.0.221:55861/user/Executor#-498212581]) with ID 0
15/11/25 11:22:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.221:49642 with 530.3 MB RAM, BlockManagerId(0, 172.17.0.221, 49642)
15/11/25 11:22:32 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.17.0.221:49642 (size: 1934.0 B, free: 530.3 MB)
15/11/25 11:22:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.17.0.221:49642 (size: 13.9 KB, free: 530.3 MB)
15/11/25 11:22:32 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 172.17.0.221): java.io.FileNotFoundException: File file:/home/ujjwal/cs453project/input.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 1]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 3.1 in stage 0.0 (TID 5, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 2]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 6, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 3]
15/11/25 11:22:32 INFO TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 4]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 4.1 in stage 0.0 (TID 7, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 3.1 in stage 0.0 (TID 5) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 5]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 3.2 in stage 0.0 (TID 8, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 6) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 6]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 9, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 3.2 in stage 0.0 (TID 8) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 7]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 3.3 in stage 0.0 (TID 10, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 4.1 in stage 0.0 (TID 7) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 8]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 4.2 in stage 0.0 (TID 11, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 9) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 9]
15/11/25 11:22:32 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 12, 172.17.0.221, PROCESS_LOCAL, 2217 bytes)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 3.3 in stage 0.0 (TID 10) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 10]
15/11/25 11:22:32 ERROR TaskSetManager: Task 3 in stage 0.0 failed 4 times; aborting job
15/11/25 11:22:32 INFO TaskSchedulerImpl: Cancelling stage 0
15/11/25 11:22:32 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/11/25 11:22:32 INFO DAGScheduler: ResultStage 0 (count at build.scala:13) failed in 2.216 s
15/11/25 11:22:32 INFO TaskSetManager: Lost task 4.2 in stage 0.0 (TID 11) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 11]
15/11/25 11:22:32 INFO DAGScheduler: Job 0 failed: count at build.scala:13, took 2.373631 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 10, 172.17.0.221): java.io.FileNotFoundException: File file:/home/ujjwal/cs453project/input.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
at org.apache.spark.rdd.RDD.count(RDD.scala:1125)
at Rice$.main(build.scala:13)
at Rice.main(build.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: File file:/home/ujjwal/cs453project/input.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/11/25 11:22:32 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 12) on executor 172.17.0.221: java.io.FileNotFoundException (File file:/home/ujjwal/cs453project/input.txt does not exist) [duplicate 12]
15/11/25 11:22:32 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/11/25 11:22:32 INFO SparkContext: Invoking stop() from shutdown hook
15/11/25 11:22:33 INFO SparkUI: Stopped Spark web UI at http://172.17.0.224:4040
15/11/25 11:22:33 INFO DAGScheduler: Stopping DAGScheduler
15/11/25 11:22:33 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/11/25 11:22:33 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
15/11/25 11:22:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/11/25 11:22:33 INFO MemoryStore: MemoryStore cleared
15/11/25 11:22:33 INFO BlockManager: BlockManager stopped
15/11/25 11:22:33 INFO BlockManagerMaster: BlockManagerMaster stopped
15/11/25 11:22:33 INFO SparkContext: Successfully stopped SparkContext
15/11/25 11:22:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/11/25 11:22:33 INFO ShutdownHookManager: Shutdown hook called
15/11/25 11:22:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-7c6e0ad4-52ae-4f5a-9aaa-6ad9fbf48685
Could you please help me understand how I can solve my problem? Thanks!
The path you used is probably only local to the driver. You have to use a path that is accessible to all of the workers. The driver does not send the actual data to the workers - that would be painfully slow. The workers try to read the data using the path you gave them, and in this case they fail because they don't have the file locally.
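In code the difference looks roughly like this - a PySpark sketch for consistency with the rest of the thread; the HDFS URI is illustrative, not the asker's actual cluster:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Fails on a cluster: only the driver's machine has this local file.
# rdd = sc.textFile("file:///home/ujjwal/cs453project/input.txt")

# Works: every executor can resolve the same shared path (HDFS, S3, NFS, ...).
rdd = sc.textFile("hdfs://namenode:9000/user/ujjwal/input.txt")
print(rdd.count())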
@user3180835, as suggested by @Mike Park, in my case it started working after I copied the file from the local Linux file system to HDFS:
hdfs dfs -cp file:///<path_to_local_file> /<hdfs_file_dir>
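If copying into HDFS isn't an option, SparkContext.addFile can also ship a small driver-local file to every executor. A minimal PySpark sketch (the file name is taken from the question; the line-count job is purely illustrative):
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()
sc.addFile("/home/ujjwal/cs453project/input.txt")  # driver-local file

def count_lines(_):
    # Each executor opens its own shipped copy of the file.
    with open(SparkFiles.get("input.txt")) as f:
        yield sum(1 for _ in f)

print(sc.parallelize([0], 1).mapPartitions(count_lines).collect())
This suits small side files; for the main dataset, a shared filesystem such as HDFS remains the right fix.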
