Spark job failed to write to Alluxio due to DeadlineExceededException - apache-spark

I am running a Spark job writing to an Alluxio cluster with 20 workers (Alluxio 1.6.1). The Spark job failed to write its output due to alluxio.exception.status.DeadlineExceededException, yet the worker still shows as alive in the Alluxio WebUI. How can I avoid this failure?
alluxio.exception.status.DeadlineExceededException: Timeout writing to WorkerNetAddress{host=spark-74-44.xxxx, rpcPort=51998, dataPort=51999, webPort=51997, domainSocketPath=} for request type: ALLUXIO_BLOCK
id: 3209355843338240
tier: 0
worker_group {
host: "spark6-64-156.xxxx"
rpc_port: 51998
data_port: 51999
web_port: 51997
socket_path: ""
}

This error indicates that your Spark job timed out while trying to write data to an Alluxio worker. The worker could be under high load, or have a slow connection to your UFS.
The default timeout is 30 seconds. To increase the timeout, configure alluxio.user.network.netty.timeout on the Spark side.
For example, to increase the timeout to 5 minutes, use the --conf option of spark-submit:
$ spark-submit --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.network.netty.timeout=5min' \
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.network.netty.timeout=5min' \
...
You can also set these properties in your spark-defaults.conf file to have them automatically applied to all jobs.
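For reference, a minimal spark-defaults.conf sketch with the same properties (the 5-minute value is just an example; tune it to your workload):
# spark-defaults.conf -- applies the Alluxio write timeout to every job submitted from this client
spark.executor.extraJavaOptions  -Dalluxio.user.network.netty.timeout=5min
spark.driver.extraJavaOptions    -Dalluxio.user.network.netty.timeout=5min
If you already set other extraJavaOptions, append the -D flag to the existing value rather than adding a second line, since only one value per key is applied.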
Source: https://www.alluxio.org/docs/1.6/en/Configuration-Settings.html#spark-jobs

Related

Spark Submit ExecutorAllocationManager Warning

I'm running a Spark job on an EMR cluster, but I keep getting this warning in the logs.
Is this an important warning, and how can I fix it? I think it has to do with cluster scaling.
Also, after looking in the Spark job history, I found that executors got removed before the job finished.
I run the job with:
spark-submit --master yarn --deploy-mode client --executor-cores 4 --num-executors 7 myJob.py
The job also takes over 1 hour; is that normal?
The job reads a CSV file (~1 GB), fills some empty fields, and then writes a new CSV file.

Spark executor failure on mesos agent while using mesos dispatcher in cluster mode

I launched the dispatcher as follows, and the launch was successful as seen from the logs:
./sbin/start-mesos-dispatcher.sh --master mesos://10.0.0.6:5050
The REST server was activated on port 7078.
I submitted the job to the dispatcher as follows:
./bin/spark-submit \
--class com.ibm.cds.spark.samples.HelloSpark \
--master mesos://10.0.0.6:7078 \
--deploy-mode cluster \
--verbose \
https://github.com/../helloSpark.jar
On the Spark slave, I get the following error in the Mesos agent sandbox (stderr):
17/11/22 09:22:06 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://10.0.0.6:5050.
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestProtocolException: Malformed response received from server
at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:268)
at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.s
Questions:
Why is the executor submitting the launch of the application to the Mesos master? In spark-submit (above), I clearly give the Spark master address (at port 7078); why is that not used?
How can I avoid this error?
I am using Mesos version 1.4.1.
I removed all entries in spark-defaults.conf except the one below:
spark.eventLog.enabled true
It works fine now, meaning I no longer get this error.
It looks like having spark.master set in spark-defaults.conf was causing this issue.
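For illustration, a sketch of the before/after spark-defaults.conf — the "before" spark.master value is an assumption, chosen to match the mesos://10.0.0.6:5050 address that appears in the log above:
# Before (assumed): with a spark.master entry like this, the launch request apparently went
# to the Mesos master on 5050 rather than the dispatcher on 7078
# spark.master             mesos://10.0.0.6:5050
# spark.eventLog.enabled   true

# After: only the entry below remains; the --master mesos://10.0.0.6:7078 given to spark-submit is used
spark.eventLog.enabled true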

Flume is not able to send the event when submitting the job on cluster with yarn-client

I am using a Hortonworks cluster (2 nodes) to run Spark and Flume. When I run the job with --master "local[*]", Flume is able to send the events and Spark is able to receive them; checking localhost:4040 I can see the events being received from Flume. (We are pumping 100 events/sec from Flume using the flume-ng-sql source, each roughly 1 KB in size.)
Whereas when I run the same example with --master "yarn-client", I get the error below in Flume and Spark does not receive any events:
2015-08-13 18:24:24,927 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to send events
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:403)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 55555 }: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:182)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:121)
at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:638)
at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
at org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:127)
at org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:222)
at org.apache.flume.sink.AbstractRpcSink.verifyConnection(AbstractRpcSink.java:283)
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:360)
... 3 more
Caused by: java.io.IOException: Error connecting to localhost/127.0.0.1:55555
at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:203)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:152)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:168)
... 10 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:496)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:452)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:365)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
I have also observed the following in the cluster:
-- Memory consumption with YARN is considerably higher than in local mode.
-- When I pump only 100 events per 30 seconds, Flume and Spark are able to connect and process them with yarn-client as well as local.
Below are the commands I am using for Flume and Spark.
Flume:
sudo -u hdfs flume-ng agent --conf conf/ -f conf/flume_mysql_spark.conf -n agent1 -Dflume.root.logger=INFO,console > flumelog.txt
Spark:
sudo -u hdfs spark-submit --master "yarn-client" --class "org.paladion.atm.FlumeEventCount" target/atm-1.1-jar-with-dependencies.jar > sparklog.txt
sudo -u hdfs spark-submit --master "local[*]" --class "org.paladion.atm.FlumeEventCount" target/atm-1.1-jar-with-dependencies.jar > sparklog.txt
Kindly let me know what could be wrong here.
It got solved as below:
1 - If running in local mode, give the IP of the local machine in both Flume and Spark.
2 - If running on the cluster (yarn-client or yarn-cluster), give the IP of the cluster machine that should receive the events (other than the one where you are executing the program, so for example a node that is not the master node) in both Flume and Spark.
Let me know if I am wrong and this could have worked for some other reason, or if there is a better solution for the same.
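A sketch of what point 2 might look like on the Flume side — the agent name matches the -n agent1 flag above, the port comes from the stack trace, and the sink name and node IP are placeholders:
# conf/flume_mysql_spark.conf (sketch): point the Avro sink at a fixed cluster node instead of localhost,
# so the Spark Flume receiver running under YARN on that node is reachable
agent1.sinks.sparkSink.type = avro
agent1.sinks.sparkSink.hostname = 10.0.0.21    # placeholder: IP of a non-master node that will run the receiver
agent1.sinks.sparkSink.port = 55555
The Spark application (FlumeEventCount here) would then need to create its Flume receiver with the same host and port.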

How to prevent Spark Executors from getting Lost when using YARN client mode?

I have a Spark job which runs fine locally with less data, but when I schedule it on YARN I keep getting the following error; slowly all executors get removed from the UI and my job fails.
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc client disassociated
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 6 on myhost2.com: remote Rpc client disassociated
I use the following command to schedule the Spark job in yarn-client mode:
./spark-submit --class com.xyz.MySpark --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 3g --master yarn-client --executor-memory 2G --executor-cores 8 --num-executors 12 /home/myuser/myspark-1.0.jar
What is the problem here? I am new to Spark.
I had a very similar problem. I had many executors being lost no matter how much memory we allocated to them.
The solution, if you're using YARN, was to set --conf spark.yarn.executor.memoryOverhead=600; alternatively, if your cluster uses Mesos, you can try --conf spark.mesos.executor.memoryOverhead=600 instead.
In Spark 2.3.1+ the configuration option has been renamed to spark.executor.memoryOverhead, so the equivalent setting is --conf spark.executor.memoryOverhead=600.
It seems like we were not leaving sufficient memory for YARN itself and containers were being killed because of it. After setting that we've had different out of memory errors, but not the same lost executor problem.
You can follow this AWS post to calculate memory overhead (and other spark configs to tune): best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr
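Applied to the spark-submit command from the question, the extra setting would look roughly like this (600 MB is the value from this answer; the right figure depends on your executor memory, typically on the order of 10% of it):
./spark-submit --class com.xyz.MySpark \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
  --conf spark.yarn.executor.memoryOverhead=600 \
  --driver-java-options -XX:MaxPermSize=512m \
  --driver-memory 3g --master yarn-client \
  --executor-memory 2G --executor-cores 8 --num-executors 12 \
  /home/myuser/myspark-1.0.jar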
When I had the same issue, deleting logs and freeing up more HDFS space worked.

Running two Spark applications at the same time

I am using Spark Streaming and saving the processed output to a data.csv file:
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
At the same time I would like to read the NetworkWordCount output data.csv, along with another new file, and process it again simultaneously.
My questions here are:
Is it possible to run two Spark applications at the same time?
Is it possible to submit a Spark application through the code itself?
I am using a Mac; right now I am submitting the Spark application from the Spark folder with the following command:
bin/spark-submit --class "com.abc.test.SparkStreamingTest" --master spark://xyz:7077 --executor-memory 20G --total-executor-cores 100 ../workspace/Test/target/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar 1000
or just without the spark://ip:port master address, executor memory, and total executor cores:
bin/spark-submit --class "com.abc.test.SparkStreamingTest" --master local[4] ../workspace/Test/target/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar
and the other application, which reads the text file for batch processing, as follows:
bin/spark-submit --class "com.abc.test.BatchTest" --master local[4] ../workspace/Batch/target/BatchTesting-0.0.1-SNAPSHOT-jar-with-dependencies.jar
When I run the two applications SparkStreamingTest and BatchTest separately, both work fine, but when I try to run both simultaneously, I get the following error.
Currently I am using Spark standalone mode.
WARN AbstractLifeCycle: FAILED SelectChannelConnector#0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
Any help is much appreciated; I am totally out of ideas.
From http://spark.apache.org/docs/1.1.0/monitoring.html
If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Your apps should be able to run. It's just a warning telling you about the port conflict, which happens because you run the two Spark apps at the same time. But don't worry about it: Spark will try 4041, 4042, and so on until it finds an available port. So in your case you will find two web UIs, ip:4040 and ip:4041, one for each app.
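If you prefer each application's UI on a predictable port rather than relying on the automatic retry, you can pin it per app with spark.ui.port — a sketch using the two commands from the question, with 4040 and 4050 as example values:
bin/spark-submit --class "com.abc.test.SparkStreamingTest" --master local[4] \
  --conf spark.ui.port=4040 ../workspace/Test/target/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar

bin/spark-submit --class "com.abc.test.BatchTest" --master local[4] \
  --conf spark.ui.port=4050 ../workspace/Batch/target/BatchTesting-0.0.1-SNAPSHOT-jar-with-dependencies.jar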
