Unable to open native connection with Spark sometimes - Cassandra

I'm running a Spark job with Spark version 1.4 and Cassandra 2.18. I can telnet from the master to the Cassandra machine successfully. Sometimes the job runs fine, and sometimes I get the following exception. Why would this happen only intermittently?
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, 172.28.0.162): java.io.IOException: Failed to open native connection to Cassandra at {172.28.0.164}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:155)
Sometimes it also throws this exception along with the one above:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.28.0.164:9042 (com.datastax.driver.core.TransportException: [/172.28.0.164:9042] Connection has been closed))

I hit the second error, NoHostAvailableException, quite a few times this week while porting Python Spark code to Java Spark.
I was having issues with the driver thread being nearly out of memory, and the GC was taking up all my cores (98% of all 8 cores), pausing the JVM constantly.
In Python, when this happens it's much more obvious (to me), so it took me a while to realize what was going on, and I hit this error quite a few times.
I had two theories on the root cause, but either way the solution was to stop the GC from going crazy.
First theory: because the JVM was pausing so often, the driver simply couldn't connect to Cassandra.
Second theory: Cassandra was running on the same machine as Spark, and with the JVM taking 100% of every CPU, Cassandra couldn't answer in time, so to the driver it looked like there were no Cassandra hosts.
Hope this helps!
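If you suspect the same driver-side GC thrashing, a minimal sketch of the spark-submit flags that give the driver more heap and log its GC activity, built here as a plain argument list (the memory size and application file name are illustrative assumptions, not values from the original post):

```python
# Sketch: more driver heap plus GC logging, to confirm whether long GC
# pauses are behind the NoHostAvailableException. Values are illustrative.
args = [
    "spark-submit",
    "--driver-memory", "4g",  # extra headroom for the driver JVM
    "--driver-java-options", "-verbose:gc -XX:+PrintGCDetails",
    "my_job.py",              # hypothetical application file
]
print(" ".join(args))
```

Watching the resulting GC log for frequent full collections tells you quickly whether the pauses line up with the connection failures.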

Related

Change ulimit value in Spark

I am running Spark code on an EC2 instance. I am running into the "Too many open files" issue (logs below). From searching online, it seems I need to set ulimit to a higher number. Since I am running the Spark job on AWS and don't know where the config file is, how can I set that value from my Spark code?
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 255 in stage 19.1 failed 4 times, most recent failure: Lost task 255.3 in stage 19.1 (TID 749786, 172.31.20.34, executor 207): java.io.FileNotFoundException: /media/ebs0/yarn/local/usercache/data-platform/appcache/application_1559339304634_2088/blockmgr-90a63e4a-dace-4246-a158-270b0b34c1f9/20/broadcast_13 (Too many open files)
Apart from changing the ulimit, you should also look for connection leaks. For example, check whether your I/O connections are properly closed. We saw the Too many open files exception even with a 655k ulimit on every node; it later turned out to be connection leaks in the code.
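If you cannot easily change system-wide limits, the soft file-descriptor limit can also be raised from inside the driver process itself, up to the hard limit the OS allows. A sketch using the standard resource module (the target value is an assumption; note this only affects the local process, so executors on other machines still need their limits raised via the OS, e.g. /etc/security/limits.conf):

```python
import resource

def raise_nofile_limit(target=65536):
    """Raise this process's soft limit on open file descriptors,
    capped at the hard limit the OS allows."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)       # cannot exceed the hard limit
    new_soft = max(soft, target)         # never lower an already-higher limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft

print(raise_nofile_limit(4096))
```

For the leak side of the problem, wrapping files and sockets in `with` blocks (or `contextlib.closing`) guarantees they are released even on error paths.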

How to avoid ExecutorFailure Error in Spark

How can I avoid executor failures while Spark jobs are executing?
We are using Spark 1.6 as part of Cloudera CDH 5.10.
I normally get the error below.
ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 127100 ms
There could be various reasons behind slow task execution that eventually times out; you need to drill down to find the root cause.
Sometimes tuning the default timeout configuration parameters also helps. Go to the Spark UI Configuration tab, find the current values for the parameters below, and then increase the timeouts in spark-submit.
spark.worker.timeout
spark.network.timeout
spark.akka.timeout
Running the job with speculative execution (spark.speculation=true) also helps: if one or more tasks are running slowly in a stage, they will be re-launched.
Explore the Spark 1.6.0 configuration properties for more details.
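As a concrete sketch, the timeout and speculation settings above can be passed on the command line. The values here are illustrative starting points, not tuned recommendations, and my_job.py is a hypothetical application file:

```python
# Illustrative values only -- tune against your own workload.
conf = {
    "spark.network.timeout": "300s",  # umbrella for most network timeouts
    "spark.worker.timeout": "300",    # standalone master <-> worker, seconds
    "spark.akka.timeout": "300",      # Spark 1.x Akka RPC timeout, seconds
    "spark.speculation": "true",      # re-launch straggler tasks
}

args = ["spark-submit"]
for key in sorted(conf):
    args += ["--conf", f"{key}={conf[key]}"]
args.append("my_job.py")  # hypothetical application file
print(" ".join(args))
```

Raising timeouts masks slowness rather than fixing it, so treat this as a stopgap while you find the root cause.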

How to run spark distributed in cluster mode, but take file locally?

Is it possible to have Spark take a local file as input but process it in a distributed fashion?
I have sc.textFile("file:///path-to-file-locally") in my code, and I know the exact path to the file is correct. Yet I am still getting
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 14, spark-slave11.ydcloud.net): java.io.FileNotFoundException: File file:/<path to file> does not exist
I am running Spark distributed, not locally. Why does this error occur?
It is possible, but when you declare a local path as input, the file has to be present on each worker machine as well as on the driver. That means you have to distribute it first, either manually or using built-in tools such as SparkFiles.
The file must be located in a centralized location accessible to all the nodes. This can be achieved with a distributed file system; DSE provides a replacement for HDFS called CFS (Cassandra File System). CFS is available when DSE is started in analytics mode using the -k option.
For further details on setting up and using CFS, see http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/ana/anaCFS.html
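To make the SparkFiles approach concrete without a cluster, here is a pure-Python sketch of what sc.addFile effectively does: the driver's local file is copied into every worker's working directory before any task opens it. The temporary directories below just stand in for worker machines:

```python
import os
import shutil
import tempfile

def distribute(src_path, worker_dirs):
    """Mimic SparkFiles: copy a driver-local file into each worker's
    local directory so tasks can open it by its base name."""
    copies = []
    for d in worker_dirs:
        dest = os.path.join(d, os.path.basename(src_path))
        shutil.copy(src_path, dest)
        copies.append(dest)
    return copies

# Demo: temporary directories stand in for three worker machines.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "input.txt")
    with open(src, "w") as f:
        f.write("hello\n")
    workers = [tempfile.mkdtemp(dir=root) for _ in range(3)]
    for copy in distribute(src, workers):
        with open(copy) as f:
            assert f.read() == "hello\n"  # every "worker" sees the file
```

In real PySpark code the equivalent is sc.addFile("/local/path/data.txt") on the driver and open(SparkFiles.get("data.txt")) inside tasks.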

Spark Streaming stops after some time due to Executor Lost

I am using Spark 1.3 for a Spark Streaming application. When I start my application, I can see in the Spark UI that a few of the jobs have failed tasks. Investigating the job details, I see that some tasks failed due to an Executor Lost exception: either ExecutorLostFailure (executor 11 lost) or Resubmitted (resubmitted due to lost executor).
In the application logs from YARN, the only error shown is Lost executor 11 on <machineip>: remote Akka client disassociated. I don't see any other exception or error being thrown.
The application stops after a couple of hours. The logs show that all executors are lost when the application fails.
Can anyone suggest how to resolve this issue, or point me to a relevant link?
There are many potential reasons for executor loss. One thing I have observed in the past is that Java garbage collection can pause for very long periods under heavy load. As a result, the executor is 'lost' when a GC pause takes too long, and it returns shortly thereafter.
You can determine whether this is the issue by turning on executor GC logging. Simply add the following configuration:
--conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy"
See this great guide from Intel/Databricks for more details on GC tuning: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

Spark: DecoderException: java.lang.OutOfMemoryError

I am running a Spark Streaming application on a cluster with 3 worker nodes. Once in a while, jobs fail with the following exception:
Job aborted due to stage failure: Task 0 in stage 4508517.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4508517.0 (TID 1376191, 172.31.47.126): io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError
at sun.misc.Unsafe.allocateMemory(Native Method)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:440)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:187)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:165)
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
... 10 more
I am submitting the job in client mode without any special parameters.
Both the master and the workers have 15 GB of memory. The Spark version is 1.4.0.
Is this solvable by tuning the configuration?
I'm facing the same problem and found out that it's probably caused by a memory leak in netty 4.0.23.Final, which is used by Spark 1.4 (see https://github.com/netty/netty/issues/3837).
It is solved at least in Spark 1.5.0 (see https://issues.apache.org/jira/browse/SPARK-8101), which uses netty 4.0.29.Final.
So an upgrade to the latest Spark version should solve the problem. I will try it in the next few days.
Additionally, the current version of Spark Jobserver forces netty 4.0.23.Final, so it needs a fix too.
EDIT: I upgraded to Spark 1.6 with netty 4.0.29.Final, but I'm still getting a direct-buffer OOM when using Spark Jobserver.