Collect failed in ... s due to Stage cancelled because SparkContext was shut down - apache-spark

I want to display the number of elements in each partition, so I write the following:
def count_in_a_partition(iterator):
yield sum(1 for _ in iterator)
If I use it like this
print("number of element in each partitions: {}".format(
my_rdd.mapPartitions(count_in_a_partition).collect()
))
I get the following:
19/02/18 21:41:15 INFO DAGScheduler: Job 3 failed: collect at /project/6008168/tamouze/testSparkCedar.py:435, took 30.859710 s
19/02/18 21:41:15 INFO DAGScheduler: ResultStage 3 (collect at /project/6008168/tamouze/testSparkCedar.py:435) failed in 30.848 s due to Stage cancelled because SparkContext was shut down
19/02/18 21:41:15 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/02/18 21:41:16 INFO MemoryStore: MemoryStore cleared
19/02/18 21:41:16 INFO BlockManager: BlockManager stopped
19/02/18 21:41:16 INFO BlockManagerMaster: BlockManagerMaster stopped
19/02/18 21:41:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/02/18 21:41:16 WARN BlockManager: Putting block rdd_3_14 failed due to exception java.net.SocketException: Connection reset.
19/02/18 21:41:16 WARN BlockManager: Block rdd_3_14 could not be removed as it was not found on disk or in memory
19/02/18 21:41:16 WARN BlockManager: Putting block rdd_3_3 failed due to exception java.net.SocketException: Connection reset.
19/02/18 21:41:16 WARN BlockManager: Block rdd_3_3 could not be removed as it was not found on disk or in memory
19/02/18 21:41:16 INFO SparkContext: Successfully stopped SparkContext
....
noting that my_rdd.take(1) return :
[(u'id', u'text', array([-0.31921682, ...,0.890875]))]
How can I solve this issue?

You have to use glom() function for that. Let’s take an example.
Let's create a DataFrame first.
rdd=sc.parallelize([('a',22),('b',1),('c',4),('b',1),('d',2),('e',0),('d',3),('a',1),('c',4),('b',7),('a',2),('f',1)] )
df=rdd.toDF(['key','value'])
df=df.repartition(5,"key") # Make 5 Partitions
The number of partitions -
print("Number of partitions: {}".format(df.rdd.getNumPartitions()))
Number of partitions: 5
Number of rows/elements on each partition. This can give you an idea of skew -
print('Partitioning distribution: '+ str(df.rdd.glom().map(len).collect()))
Partitioning distribution: [3, 3, 2, 2, 2]
See how actually are rows distributed on the partitions. Behold that if the dataset is big, then your system could crash because of Out of Memory OOM issue.
print("Partitions structure: {}".format(df.rdd.glom().collect()))
Partitions structure: [
#Partition 1 [Row(key='a', value=22), Row(key='a', value=1), Row(key='a', value=2)],
#Partition 2 [Row(key='b', value=1), Row(key='b', value=1), Row(key='b', value=7)],
#Partition 3 [Row(key='c', value=4), Row(key='c', value=4)],
#Partition 4 [Row(key='e', value=0), Row(key='f', value=1)],
#Partition 5 [Row(key='d', value=2), Row(key='d', value=3)]
]

Related

Permission denied error when setting up local Spark instance and running pyspark

I am setting up a local Spark instance on Windows to use with PySpark as described in this guide (but with spark-3.0.0 / hadoop 2.7 instead): https://phoenixnap.com/kb/install-spark-on-windows-10.
I can startup Spark with:
C:\Spark\spark-3.0.0-bin-hadoop2.7\bin>spark-shell.cmd
and connect to it with http://localhost:4040/ in my browser (I see the Spark GUI).
But when am running the Python pyspark example with
C:\Spark\spark-3.0.0-bin-hadoop2.7\examples>run-example SparkPi
it throws an Permission Denied error like in this trace:
21/03/08 10:51:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/03/08 10:51:04 INFO SparkContext: Running Spark version 3.0.0
21/03/08 10:51:04 INFO ResourceUtils: ==============================================================
21/03/08 10:51:04 INFO ResourceUtils: Resources for spark.driver:
21/03/08 10:51:04 INFO ResourceUtils: ==============================================================
21/03/08 10:51:04 INFO SparkContext: Submitted application: Spark Pi
21/03/08 10:51:04 INFO SecurityManager: Changing view acls to: #####
21/03/08 10:51:04 INFO SecurityManager: Changing modify acls to: #####
21/03/08 10:51:04 INFO SecurityManager: Changing view acls groups to:
21/03/08 10:51:04 INFO SecurityManager: Changing modify acls groups to:
21/03/08 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(#####); groups with view permissions: Set(); users with modify permissions: Set(#####); groups with modify permissions: Set()
21/03/08 10:51:05 INFO Utils: Successfully started service 'sparkDriver' on port 63213.
21/03/08 10:51:05 INFO SparkEnv: Registering MapOutputTracker
21/03/08 10:51:05 INFO SparkEnv: Registering BlockManagerMaster
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/03/08 10:51:05 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/03/08 10:51:05 INFO DiskBlockManager: Created local directory at C:\Users\#####\AppData\Local\Temp\blockmgr-dce03954-27a7-484d-8e54-f552b21433f7
21/03/08 10:51:05 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
21/03/08 10:51:05 INFO SparkEnv: Registering OutputCommitCoordinator
21/03/08 10:51:05 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/03/08 10:51:05 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://WORKSTATION.DOMAIN.EXT:4040
21/03/08 10:51:05 INFO SparkContext: Added JAR file:///C:/Spark/spark-3.0.0-bin-hadoop2.7/examples/jars/scopt_2.12-3.7.1.jar at spark://WORKSTATION.DOMAIN.EXT:63213/jars/scopt_2.12-3.7.1.jar with timestamp 1615197065578
21/03/08 10:51:05 INFO SparkContext: Added JAR file:///C:/Spark/spark-3.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.12-3.0.0.jar at spark://WORKSTATION.DOMAIN.EXT:63213/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615197065579
21/03/08 10:51:05 INFO Executor: Starting executor ID driver on host WORKSTATION.DOMAIN.EXT
21/03/08 10:51:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63260.
21/03/08 10:51:05 INFO NettyBlockTransferService: Server created on WORKSTATION.DOMAIN.EXT:63260
21/03/08 10:51:05 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/03/08 10:51:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:05 INFO BlockManagerMasterEndpoint: Registering block manager NLLR4000250910.solon.prd:63260 with 366.3 MiB RAM, BlockManagerId(driver, WORKSTATION.DOMAIN.EXT, 63260, None)
21/03/08 10:51:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:05 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, NLLR4000250910.solon.prd, 63260, None)
21/03/08 10:51:06 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
21/03/08 10:51:06 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
21/03/08 10:51:06 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
21/03/08 10:51:06 INFO DAGScheduler: Parents of final stage: List()
21/03/08 10:51:06 INFO DAGScheduler: Missing parents: List()
21/03/08 10:51:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
21/03/08 10:51:06 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.1 KiB, free 366.3 MiB)
21/03/08 10:51:06 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1816.0 B, free 366.3 MiB)
21/03/08 10:51:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on WORKSTATION.DOMAIN.EXT:63260 (size: 1816.0 B, free: 366.3 MiB)
21/03/08 10:51:06 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1200
21/03/08 10:51:06 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
21/03/08 10:51:06 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
21/03/08 10:51:06 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, WORKSTATION.DOMAIN.EXT, executor driver, partition 0, PROCESS_LOCAL, 7393 bytes)
21/03/08 10:51:06 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, WORKSTATION.DOMAIN.EXT, executor driver, partition 1, PROCESS_LOCAL, 7393 bytes)
21/03/08 10:51:06 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/03/08 10:51:06 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/03/08 10:51:06 INFO Executor: Fetching spark://WORKSTATION.DOMAIN.EXT:63213/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615197065579
21/03/08 10:51:06 ERROR Utils: Aborting task
java.io.IOException: Failed to connect to WORKSTATION.DOMAIN.EXT/192.168.#.#:63213
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:359)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:535)
at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:869)
at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:860)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:860)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:404)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedSocketException: Permission denied: no further information: WORKSTATION.DOMAIN.EXT/192.168.#.#:63213
Caused by: java.net.SocketException: Permission denied: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Unknown Source)
[snip]
When running it on a different machine with seemingly the same config where it works fine, I get this trace on the part where the Exception is thrown on the other trace:
[snip]
21/03/08 08:00:22 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/03/08 08:00:22 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/03/08 08:00:22 INFO Executor: Fetching spark://WORKSTATION.DOMAIN.EXT:63646/jars/spark-examples_2.12-3.0.0.jar with timestamp 1615186820489
21/03/08 08:00:22 INFO TransportClientFactory: Successfully created connection to WORKSTATION.DOMAIN.EXT/10.121.#.#:63646 after 86 ms (0 ms spent in bootstraps)
21/03/08 08:00:22 INFO Utils: Fetching spark://WORKSTATION.DOMAIN.EXT:63646/jars/spark-examples_2.12-3.0.0.jar to C:\Users\#####\AppData\Local\Temp\spark-54a13d9f-9064-4f34-ba81-af49b18d9a0c\userFiles-24c3eabc-02a4-4aca-8abb-424431c6442f\fetchFileTemp5258763437798623210.tmp
21/03/08 08:00:24 INFO Executor: Adding file:/C:/Users/#####/AppData/Local/Temp/spark-54a13d9f-9064-4f34-ba81-af49b18d9a0c/userFiles-24c3eabc-02a4-4aca-8abb-424431c6442f/spark-examples_2.12-3.0.0.jar to class loader
[snip]
At first it seemed to me as a Firewall issue, but adding the executing java.exe as exeption to the firewall didn't solve the issue.
Does anyone know what I should try next to get this issue resolved?
Finally I could solve it by setting my SPARK_LOCAL_IP to localhost in my environment variables: Go to your Windows environment variables and set SPARK_LOCAL_IP=localhost

FileNotFoundException on submitting Spark Jobs to remote

I've created an environment where I've set up 3 Docker containers, 1 for Airflow using the puckel/docker-airflow image with spark and hadoop additionally installed. The other two containers are basically imitating spark master and worker (used gettyimages/spark Docker image to create this). All 3 containers are connected to each other via a bridge network, so all containers are able to communicate with each other.
What I'm trying to do next is to submit spark job from the Airflow container to the Spark cluster (master).
As an initial example, I'm using the wordcount sample script. I created a sample.txt file in the airflow container at path usr/local/airflow/sample.txt. I've bashed into the Airflow container and I'm using the command given below to run the wordcount.py on spark master located at the ip which I found after inspecting the bridge network.
spark-submit --master spark://ipaddress:7077 --files usr/local/airflow/sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
After submitting the script, from the logs, I can see that a connection has been established with the master (from airflow container), and it also copied the file specified by --files to the master and worker, but then it just errors out saying,
java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
As per my understanding (could be wrong), but when we specify files to copy to master using --files you can access it directly via the file name (sample.txt in my case). So what I'm trying to figure out is if a job has been submitted and the file has been copied to master, then why is it searching in the location file:/usr/local/airflow/sample.txt? How do I make it refer to the correct path?
I apologize as this question has been asked a couple of times, but I've read all the related question on stackoverflow, but I'm still unable to resolve this. I'd really appreciate y'alls help on this.
Thanks.
The full log below,
user#machine:/usr/local/airflow# spark-submit --master spark://172.22.0.2:7077 --files sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py ./sample.txt
20/07/25 03:23:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/25 03:23:35 INFO SparkContext: Running Spark version 2.4.1
20/07/25 03:23:35 INFO SparkContext: Submitted application: PythonWordCount
20/07/25 03:23:35 INFO SecurityManager: Changing view acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing view acls groups to:
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls groups to:
20/07/25 03:23:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/07/25 03:23:35 INFO Utils: Successfully started service 'sparkDriver' on port 33457.
20/07/25 03:23:35 INFO SparkEnv: Registering MapOutputTracker
20/07/25 03:23:36 INFO SparkEnv: Registering BlockManagerMaster
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/25 03:23:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-dd1957de-6907-484d-a3d8-2b3b88e0c7ca
20/07/25 03:23:36 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/25 03:23:36 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/25 03:23:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/25 03:23:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://0508a77fcaad:4040
20/07/25 03:23:37 INFO SparkContext: Added file file:///usr/local/airflow/sample.txt at spark://0508a77fcaad:33457/files/sample.txt with timestamp 1595647417081
20/07/25 03:23:37 INFO Utils: Copying /usr/local/airflow/sample.txt to /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/userFiles-74f8cfe4-8a19-4d2e-8fa1-1f0bd1f0ef12/sample.txt
20/07/25 03:23:37 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://172.22.0.2:7077...
20/07/25 03:23:37 INFO TransportClientFactory: Successfully created connection to /172.22.0.2:7077 after 32 ms (0 ms spent in bootstraps)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200725032338-0003
20/07/25 03:23:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45057.
20/07/25 03:23:38 INFO NettyBlockTransferService: Server created on 0508a77fcaad:45057
20/07/25 03:23:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200725032338-0003/0 on worker-20200725025003-172.22.0.4-8881 (172.22.0.4:8881) with 2 core(s)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20200725032338-0003/0 on hostPort 172.22.0.4:8881 with 2 core(s), 1024.0 MB RAM
20/07/25 03:23:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMasterEndpoint: Registering block manager 0508a77fcaad:45057 with 366.3 MB RAM, BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200725032338-0003/0 is now RUNNING
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.020/07/25 03:23:38 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/usr/local/airflow/spark-warehouse').
20/07/25 03:23:38 INFO SharedState: Warehouse path is 'file:/usr/local/airflow/spark-warehouse'.
20/07/25 03:23:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/25 03:23:47 INFO FileSourceStrategy: Pruning directories with:
20/07/25 03:23:47 INFO FileSourceStrategy: Post-Scan Filters:
20/07/25 03:23:47 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
20/07/25 03:23:47 INFO FileSourceScanExec: Pushed Filters:
20/07/25 03:23:51 INFO CodeGenerator: Code generated in 2187.926234 ms
20/07/25 03:23:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 220.9 KB, free 366.1 MB)
20/07/25 03:23:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.8 KB, free 366.1 MB)
20/07/25 03:23:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 0508a77fcaad:45057 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:23:55 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0
20/07/25 03:23:55 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/07/25 03:23:57 INFO SparkContext: Starting job: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40
20/07/25 03:23:58 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.22.0.4:59324) with ID 0
20/07/25 03:23:58 INFO DAGScheduler: Registering RDD 5 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39)
20/07/25 03:23:58 INFO DAGScheduler: Got job 0 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40) with 1 output partitions
20/07/25 03:23:58 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40)
20/07/25 03:23:58 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39), which has no missing parents
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.2 KB, free 366.0 MB)
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.1 KB, free 366.0 MB)
20/07/25 03:23:58 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 0508a77fcaad:45057 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:23:58 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
20/07/25 03:23:58 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) (first 15 tasks are for partitions Vector(0))
20/07/25 03:23:58 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/07/25 03:23:58 INFO BlockManagerMasterEndpoint: Registering block manager 172.22.0.4:45435 with 366.3 MB RAM, BlockManagerId(0, 172.22.0.4, 45435, None)
20/07/25 03:23:58 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.22.0.4:45435 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:24:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.22.0.4:45435 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:24:11 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:11 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 1]
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 2]
20/07/25 03:24:12 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 3]
20/07/25 03:24:12 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
20/07/25 03:24:12 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/07/25 03:24:12 INFO TaskSchedulerImpl: Cancelling stage 0
20/07/25 03:24:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
20/07/25 03:24:12 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) failed in 13.690 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
20/07/25 03:24:12 INFO DAGScheduler: Job 0 failed: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40, took 14.579961 s
Traceback (most recent call last):
File "/opt/spark-2.4.1/examples/src/main/python/wordcount.py", line 40, in <module>
output = counts.collect()
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:13 INFO SparkContext: Invoking stop() from shutdown hook
20/07/25 03:24:13 INFO SparkUI: Stopped Spark web UI at http://0508a77fcaad:4040
20/07/25 03:24:13 INFO StandaloneSchedulerBackend: Shutting down all executors
20/07/25 03:24:13 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/07/25 03:24:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/25 03:24:16 INFO MemoryStore: MemoryStore cleared
20/07/25 03:24:16 INFO BlockManager: BlockManager stopped
20/07/25 03:24:16 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/25 03:24:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/25 03:24:16 INFO SparkContext: Successfully stopped SparkContext
20/07/25 03:24:16 INFO ShutdownHookManager: Shutdown hook called
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-2dfb2222-d56c-4ee1-ab62-86e71e5e751b
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/pyspark-2ee74d07-6606-4edc-8420-fe46212c50e5
Change your spark-submit like below for submitting your spark job.
spark-submit \
--master spark://ipaddress:7077 \
--deploy-mode cluster # add this if you want to pass file name to wordcount.py
--files usr/local/airflow/sample.txt \
/opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
OR
spark-submit \
--master spark://ipaddress:7077 \
/opt/spark-2.4.1/examples/src/main/python/wordcount.py /usr/local/airflow/sample.txt

Optimization of spark cluster configuration( Low configuration virtual machine cluster)

When I use Spark's standalone mode to process a large number of datasets,the log said:
ERROR TaskSchedulerImpl:70 - Lost executor 1 on : Executor heartbeat timed out after 381181 ms
I search the internet, they say I should set parameters with spark submit:
[hadoop#Master spark2.4.0]$ bin/spark-submit --master spark://master:7077 --conf spark.worker.timeout 10000000 --py-files id.py id.py --name id
Error message in log:
Error: Invalid argument to --conf: spark.worker.timeout
Questions:
How to set timeout parameter?
Thanks to meniluca's answer, I lost the symbols in instructions
After adjusting the timeout, the log displays
2019-12-05 19:42:27 WARN Utils:87 - Suppressing exception in finally: broken pipe (Write failed)
java.net.SocketException: broken pipe (Write failed)
2019-12-05 21:13:09 INFO SparkContext:54 - Invoking stop() from shutdown hook
Exception in thread "serve-DataFrame" java.net.SocketException: Connection reset
Suppressed: java.net.SocketException: broken pipe (Write failed)
then,I change thessh,add ServerAliveInterval 60 while ~/.ssh/ config
ServerAliveInterval 60
the error stil exits, then I try to increase the driver memory, error still exists, and show that the connection is disconnected
[hadoop#Master spark2.4.0]$ bin/spark-submit --master spark://master:7077 --conf spark.worker.timeout=10000000 --driver-memory 1g --py-files id.py id.py --name id
2019-12-06 10:38:49 INFO ContextCleaner:54 - Cleaned accumulator 374
Exception in thread "serve-DataFrame" java.net.SocketException: broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$serveIterator$1.apply(PythonRDD.scala:413)
at org.apache.spark.api.python.PythonRDD$$anonfun$serveIterator$1.apply(PythonRDD.scala:412)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
2019-12-06 11:06:12 WARN HeartbeatReceiver:66 - Removing executor 1 with no recent heartbeats: 149103 ms exceeds timeout 120000 ms
2019-12-06 11:06:12 ERROR TaskSchedulerImpl:70 - Lost executor 1 on 219.226.109.129: Executor heartbeat timed out after 149103 ms
2019-12-06 11:06:13 INFO SparkContext:54 - Invoking stop() from shutdown hook
2019-12-06 11:06:13 INFO DAGScheduler:54 - Executor lost: 1 (epoch 6)
2019-12-06 11:06:13 WARN HeartbeatReceiver:66 - Removing executor 0 with no recent heartbeats: 155761 ms exceeds timeout 120000 ms
2019-12-06 11:06:13 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 219.226.109.131: Executor heartbeat timed out after 155761 ms
2019-12-06 11:06:13 INFO StandaloneSchedulerBackend:54 - Requesting to kill executor(s) 1
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(1, 219.226.109.129, 42501, None)
2019-12-06 11:06:13 INFO BlockManagerMaster:54 - Removed 1 successfully in removeExecutor
2019-12-06 11:06:13 INFO DAGScheduler:54 - Shuffle files lost for executor: 1 (epoch 6)
2019-12-06 11:06:13 INFO StandaloneSchedulerBackend:54 - Actual list of executor(s) to be killed is 1
2019-12-06 11:06:13 INFO DAGScheduler:54 - Host added was in lost list earlier: 219.226.109.129
2019-12-06 11:06:13 INFO DAGScheduler:54 - Executor lost: 0 (epoch 7)
2019-12-06 11:06:13 INFO AbstractConnector:318 - Stopped Spark#490228e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 0 from BlockManagerMaster.
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(0, 219.226.109.131, 42164, None)
2019-12-06 11:06:13 INFO BlockManagerMaster:54 - Removed 0 successfully in removeExecutor
2019-12-06 11:06:13 INFO DAGScheduler:54 - Shuffle files lost for executor: 0 (epoch 7)
2019-12-06 11:06:13 INFO DAGScheduler:54 - Host added was in lost list earlier: 219.226.109.131
2019-12-06 11:06:13 INFO SparkUI:54 - Stopped Spark web UI at http://Master:4040
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Registering block manager 219.226.109.129:42501 with 413.9 MB RAM, BlockManagerId(1, 219.226.109.129, 42501, None)
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Registering block manager 219.226.109.131:42164 with 413.9 MB RAM, BlockManagerId(0, 219.226.109.131, 42164, None)
2019-12-06 11:06:14 INFO StandaloneSchedulerBackend:54 - Shutting down all executors
2019-12-06 11:06:14 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asking each executor to shut down
2019-12-06 11:06:14 INFO BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on 219.226.109.129:42501 (size: 21.1 KB, free: 413.9 MB)
2019-12-06 11:06:15 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-12-06 11:06:15 INFO BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on 219.226.109.131:42164 (size: 21.1 KB, free: 413.9 MB)
2019-12-06 11:06:16 INFO MemoryStore:54 - MemoryStore cleared
2019-12-06 11:06:16 INFO BlockManager:54 - BlockManager stopped
2019-12-06 11:06:16 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-12-06 11:06:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-12-06 11:06:17 ERROR TransportResponseHandler:144 - Still have 1 requests outstanding when connection from Master/219.226.109.130:7077 is closed
2019-12-06 11:06:17 INFO SparkContext:54 - Successfully stopped SparkContext
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Shutdown hook called
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e2a29bac-7277-4476-ad23-315a27e9ccf0
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/localPyFiles-dd95954c-2e77-41ca-969d-a201269f5b5b
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bcd56b4a-fb32-4b58-a1d5-71abc5218d32
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e2a29bac-7277-4476-ad23-315a27e9ccf0/pyspark-d04b799f-a116-44d5-b6a5-811cc8c03743
Question
Is SSH related to broken pipe?
Is increasing driver memory helpful to this problem?
I see the configuration posts on the Internet, but they're are highly configured. Since I use my computer to built clusters on virtual machine,
, the master has two cores , the slave has one core. How to adjust the configuration ?
Please try with
--conf spark.worker.timeout=10000000
you are missing the equal character between the configuration name and value.
java.net.SocketException: broken pipe (Write failed) occurs when something is wrong with the access port.
I suggest you to change the master which is at port 8080. The port can be changed either in the configuration file or via command-line options.
sbin/start-master.sh
Same can be tried with worker node as well if the above does not fix issue.
To see which ports are being used you can use :
sudo netstat -ltup

How to deploy spark on Kubeedge?

I tried to use k8s deployment mode to deploy spark-2.4.3 on Kubeedge 1.1.0 but failed (docker version 19.03.4,k8s version 1.16.1).
SPARK_DRIVER_BIND_ADDRESS=10.4.20.34
SPARK_IMAGE=spark:2.4.3
SPARK_MASTER="k8s://http://127.0.0.1:8080"
CMD=(
"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--conf "spark.kubernetes.container.image=${SPARK_IMAGE}"
--conf "spark.executor.instances=1"
--conf "spark.kubernetes.executor.limit.cores=1"
--deploy-mode client
--master ${SPARK_MASTER}
--name spark-pi
--class org.apache.spark.examples.SparkPi
--driver-memory 1G
--executor-memory 1G
--num-executors 1
--executor-cores 1
file://${PWD}/spark-examples_2.11-2.4.3.jar
)
${CMD[#]}
Node status is normal.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
edge-node-001 Ready edge 6d1h v1.15.3-kubeedge-v1.1.0-beta.0.178+c6a5aa738261e7-dirty
ubuntu-ms-7b89 Ready master 6d4h v1.16.1
But I got some errors
19/11/17 21:45:12 INFO k8s.ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/11/17 21:45:12 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46571.
19/11/17 21:45:12 INFO netty.NettyBlockTransferService: Server created on 10.4.20.34:46571
19/11/17 21:45:12 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/11/17 21:45:12 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.4.20.34:46571 with 366.3 MB RAM, BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#451882b2{/metrics/json,null,AVAILABLE,#Spark}
19/11/17 21:45:42 INFO k8s.KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
19/11/17 21:45:42 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:38
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Parents of final stage: List()
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Missing parents: List()
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/11/17 21:45:42 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 366.3 MB)
19/11/17 21:45:42 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 366.3 MB)
19/11/17 21:45:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.4.20.34:46571 (size: 1256.0 B, free: 366.3 MB)
19/11/17 21:45:42 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
19/11/17 21:45:42 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/11/17 21:45:57 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:12 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:27 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:42 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:57 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:47:12 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Is it possible to deploy spark on Kubeedge in Kubernetes deployment mode? Or I should try standalone deployment mode?
I'm so confused.

First query to cassandra tables through Thrift server takes too long

I am trying to query cassandra table through Thrift server. I have setup my spark cluster having one master and one worker in the same node.
I am starting thrift server with following command without having any custom configuration.
$SPARK_HOME/sbin/start-thriftserver.sh --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 --conf spark.cassandra.connection.host=127.0.0.1 --master spark://<spark-master>:7077
I have created following table in cassandra and inserted not more than 10 records in it and configured in hive metastore.
CREATE TABLE IF NOT EXISTS places_for_research(
research_id uuid,
tenant_id uuid,
country text,
place_id uuid,
PRIMARY KEY((tenant_id,research_id),country,place_id)
);
Now when I query this table from beeline, first time it takes around 19 seconds and on subsequent execution it reduces this time to half second.
Following is the query which I execute from beeline which return 2 records.
select * from places_for_research where tenant_id='340276cb-389b-4f57-a2cf-6ff5ec3e4d91' and research_id='95dafbe7-78d0-4509-9553-899dfaa7b858';
Wondering what is causing so much time for first request. How can I optimise first request performance?
Following is the thrift server logs for your ref
17/11/03 20:12:50 INFO SparkExecuteStatementOperation: Running query 'select * from places_for_research where tenant_id='340276cb-389b-4f57-a2cf-6ff5ec3e4d91' and research_id='95dafbe7-78d0-4509-9553-899dfaa7b858'' with 9d9a5c7c-2766-48c3-ab58-348b461b6577
17/11/03 20:12:50 INFO SparkSqlParser: Parsing command: select * from places_for_research where tenant_id='340276cb-389b-4f57-a2cf-6ff5ec3e4d91' and research_id='95dafbe7-78d0-4509-9553-899dfaa7b858'
17/11/03 20:12:51 INFO HiveMetaStore: 2: get_table : db=default tbl=places_for_research
17/11/03 20:12:51 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=places_for_research
17/11/03 20:12:51 INFO HiveMetaStore: 2: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/11/03 20:12:51 INFO ObjectStore: ObjectStore, initialize called
17/11/03 20:12:51 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery#0" since the connection used is closing
17/11/03 20:12:51 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/11/03 20:12:51 INFO ObjectStore: Initialized ObjectStore
17/11/03 20:12:52 INFO CatalystSqlParser: Parsing command: array<string>
17/11/03 20:12:52 INFO HiveMetaStore: 2: get_table : db=default tbl=places_for_research
17/11/03 20:12:52 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=places_for_research
17/11/03 20:12:52 INFO CatalystSqlParser: Parsing command: array<string>
17/11/03 20:12:53 INFO ClockFactory: Using native clock to generate timestamps.
17/11/03 20:12:53 WARN NettyUtil: Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead.
17/11/03 20:12:54 INFO Cluster: New Cassandra host /127.0.0.1:9042 added
17/11/03 20:12:54 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
17/11/03 20:12:55 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(tenant_id), IsNotNull(research_id), EqualTo(tenant_id,340276cb-389b-4f57-a2cf-6ff5ec3e4d91), EqualTo(research_id,95dafbe7-78d0-4509-9553-899dfaa7b858)]
17/11/03 20:12:55 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(tenant_id), IsNotNull(research_id), EqualTo(tenant_id,340276cb-389b-4f57-a2cf-6ff5ec3e4d91), EqualTo(research_id,95dafbe7-78d0-4509-9553-899dfaa7b858)]
17/11/03 20:12:57 INFO CodeGenerator: Code generated in 652.925772 ms
17/11/03 20:12:57 INFO SparkContext: Starting job: run at AccessController.java:0
17/11/03 20:12:57 INFO DAGScheduler: Got job 0 (run at AccessController.java:0) with 1 output partitions
17/11/03 20:12:57 INFO DAGScheduler: Final stage: ResultStage 0 (run at AccessController.java:0)
17/11/03 20:12:57 INFO DAGScheduler: Parents of final stage: List()
17/11/03 20:12:57 INFO DAGScheduler: Missing parents: List()
17/11/03 20:12:57 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at run at AccessController.java:0), which has no missing parents
17/11/03 20:12:58 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 12.8 KB, free 366.3 MB)
17/11/03 20:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.3 KB, free 366.3 MB)
17/11/03 20:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.110:57001 (size: 6.3 KB, free: 366.3 MB)
17/11/03 20:12:58 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/11/03 20:12:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at run at AccessController.java:0)
17/11/03 20:12:58 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/11/03 20:12:58 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.110, executor 0, partition 0, ANY, 8403 bytes)
17/11/03 20:13:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.110:57005 (size: 6.3 KB, free: 366.3 MB)
17/11/03 20:13:05 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
17/11/03 20:13:09 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 11709 ms on 192.168.1.110 (executor 0) (1/1)
17/11/03 20:13:09 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/11/03 20:13:10 INFO DAGScheduler: ResultStage 0 (run at AccessController.java:0) finished in 11.734 s
17/11/03 20:13:10 INFO DAGScheduler: Job 0 finished: run at AccessController.java:0, took 12.189787 s
17/11/03 20:13:10 INFO CodeGenerator: Code generated in 63.249603 ms
17/11/03 20:13:10 INFO SparkExecuteStatementOperation: Result Schema: StructType(StructField(tenant_id,StringType,true), StructField(research_id,StringType,true), StructField(country,StringType,true), StructField(place_id,StringType,true))
Thanks.
The Spark Thrift Server is lazy which means it doesn't actually start any machinery for doing queries until after the first query is launched. The delay you see is the actual starting up and requesting of remote resources. This will always take some non-zero amount of time but you could possibly avoid this by always having your thrift server immediately queried with a dummy request after being started up.

Resources