Spark scheduler thresholds - apache-spark

I'm running an analysis tool on top of Spark that creates plenty of overhead, so computations take a lot longer. When I run it I get this error:
16/08/30 23:36:37 WARN TransportChannelHandler: Exception in connection from /132.68.60.126:36922
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/08/30 23:36:37 ERROR TaskSchedulerImpl: Lost executor 0 on 132.68.60.126: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I guess this happens because the scheduler thinks the executor failed, so it starts another one.
The workload is a simple string search (grep); both master and slave are local, so there shouldn't be any failures. When running without the overhead, everything is fine.
The question is: can I configure those timeout thresholds somewhere?
Thanks!

Solved it by setting spark.network.timeout 10000000 in spark-defaults.conf.

I was getting the same error even after trying many things. My job used to get stuck and throw this error after running for a very long time. I tried a few workarounds that helped me resolve it. Although I still see the same error, at least my job runs fine.
One reason could be that the executors kill themselves thinking they have
lost the connection to the master. I added the configurations below to
the spark-defaults.conf file.
spark.network.timeout 10000000
spark.executor.heartbeatInterval 10000000
Basically, I increased the network timeout and heartbeat interval.
For the particular step that used to get stuck, I cached the
dataframe that is used for processing in that step.
Note: these are workarounds. I still see the same error in the error logs, but my job does not get terminated.
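For reference, the two properties from the workaround above can live in spark-defaults.conf roughly like this. The values here are illustrative, not a recommendation: explicit time suffixes such as "s" avoid unit ambiguity, and on recent Spark versions spark.executor.heartbeatInterval must not be larger than spark.network.timeout, so keep the heartbeat well below the network timeout.

```properties
# spark-defaults.conf sketch -- illustrative values, not a recommendation
# Raise the RPC/network timeout so slow executors are not declared lost.
spark.network.timeout            800s
# Heartbeat interval; keep it well below spark.network.timeout.
spark.executor.heartbeatInterval 60s
```

Very large timeouts like 10000000 effectively disable failure detection, which is why a genuinely hung executor will then block the job instead of being replaced; a more moderate value is usually a safer trade-off.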

Related

Apache Pulsar times out in an unpredictable way

I installed Apache Pulsar standalone. Pulsar times out sometimes. It's not related to high throughput nor to a particular topic (see the following log). pulsar-admin brokers healthcheck returns OK, or sometimes also times out. How can I investigate this?
10:46:46.365 [pulsar-ordered-OrderedExecutor-7-0] WARN org.apache.pulsar.broker.service.BrokerService - Got exception when reading persistence policy for persistent://nnx/agent_ns/action_up-53da8177-b4b9-4b92-8f75-efe94dc2309d: null
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232]
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.5.0.jar:2.5.0]
at org.apache.pulsar.broker.service.BrokerService.lambda$getManagedLedgerConfig$32(BrokerService.java:922) ~[org.apache.pulsar-pulsar-broker-2.5.0.jar:2.5.0]
at org.apache.bookkeeper.mledger.util.SafeRun$2.safeRun(SafeRun.java:49) [org.apache.pulsar-managed-ledger-2.5.0.jar:2.5.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
I am glad you were able to resolve the issue by adding more cores. The issue was a connection timeout while trying to access some topic metadata that is stored inside of ZooKeeper, as indicated by the following line in the stack trace:
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.5.0.jar:2.5.0]
Increasing the cores must have freed up enough threads to allow the ZK node to respond to this request.
You can check the connection to the server; this looks like a connection issue. If you are using a TLS certificate, check the file path and that you have the right certificate.
The problem is that there aren't a lot of solutions on the internet for Apache Pulsar, but following the Apache Pulsar documentation might help, and there are also the Apache Pulsar GitHub repository and sample projects.

Connecting to remote Dataproc master in SparkSession

I created a 3-node (1 master, 2 workers) Apache Spark cluster on Google Cloud Dataproc. I'm able to submit jobs to the cluster when connecting through SSH to the master, but I can't get it to work remotely. I can't find any documentation about how to do this, except for a similar issue on AWS, but that isn't working for me.
Here is what I am trying:
import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://<master-node-ip>:7077')
sc = pyspark.SparkContext(conf=conf)
I get the error
19/11/13 13:33:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/11/13 13:33:53 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master <master-node-ip>:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /<master-node-ip>:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /<master-node-ip>:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.ConnectException: Connection refused
I added a firewall rule to allow ingress traffic on tcp:7077, but that doesn't solve it.
Ultimately I would like to set up a VM on Compute Engine that can run this code while connecting over internal IP addresses (in a VPC I created) to run jobs on Dataproc without using gcloud dataproc jobs submit. I tried it both over internal and external IP, but neither works.
Does anyone know how I can get it working?
So there are a few things to unpack here.
The first thing I want to make sure you understand is that when exposing your distributed computing framework to ingress traffic, you should be very careful. If Dataproc exposed a Spark standalone cluster on port 7077, you would want to make sure that you lock down that ingress traffic. It sounds like you know that, since you want a VM on a shared VPC, but this is pretty important even when testing, if you open up firewalls.
The main problem it looks like you're having, though, is that you appear to be trying to connect as if it were a Spark standalone cluster. Dataproc actually runs Spark on YARN. To connect, you will need to set the Spark cluster manager type to "yarn" and correctly configure your local machine to talk to the remote YARN cluster, either by setting up a yarn-site.xml and having HADOOP_CONF_DIR point to it, or by directly setting YARN properties like yarn.resourcemanager.address via spark-submit --conf.
Also note this is similar to this question once you know that Dataproc uses YARN: Scala Spark connect to remote cluster
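As a rough sketch of the two options above (hostnames, paths, and the job file name here are hypothetical, not from the question; this also assumes the YARN ports are reachable from your machine, which needs the same firewall care discussed above):

```shell
# Option 1: point Spark at a directory containing yarn-site.xml / core-site.xml
# copied from the Dataproc master (path is hypothetical).
export HADOOP_CONF_DIR=$HOME/dataproc-conf
spark-submit --master yarn --deploy-mode client my_job.py

# Option 2: set the resource manager address directly
# (hostname hypothetical; 8032 is the default YARN ResourceManager RPC port).
spark-submit --master yarn \
  --conf spark.hadoop.yarn.resourcemanager.address=<master-node-ip>:8032 \
  my_job.py
```

Either way, the key change from the question's code is --master yarn instead of spark://<master-node-ip>:7077, since nothing is listening on 7077 on a Dataproc master.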

Cassandra node not starting

I have a 4-node Cassandra cluster, of which 2 nodes are up and 2 are down.
When I start the downed nodes, they immediately go down again.
When I check with service cassandra status,
I get "could not access pidfile for cassandra",
and in the system.log file, the error is:
ERROR [main] 2017-09-15 15:44:46,277 CassandraDaemon.java:752 - Exception encountered during startup
java.lang.NullPointerException: null
at org.apache.cassandra.gms.Gossiper.isSafeForStartup(Gossiper.java:756) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:553) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:666) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:394) [apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:601) [apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:735) [apache-cassandra-3.10.jar:3.10]
INFO [StorageServiceShutdownHook] 2017-09-15 15:44:46,281 HintsService.java:221 - Paused hints dispatch
INFO [StorageServiceShutdownHook] 2017-09-15 15:44:46,282 Gossiper.java:1506 - Announcing shutdown
From the source code of Gossiper (link), I suspect that your nodes are stuck in the bootstrapping phase. Other nodes see them as already bootstrapped, but they failed to finish joining the cluster.
What could help is forcibly removing the stuck nodes from your cluster, using nodetool removenode on any other instance that was able to boot.
After that, you should clean the data on the stuck instances by wiping out the data directory (located in data/, or under the system folder if you installed from an OS package) and start the instances one by one.
If you post the gossipinfo and status of your cluster, it might help to figure out what the real problem is.
For further reference, see the official guide.
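The recovery steps above might look roughly like the following. The host ID and data paths are hypothetical (the paths shown are common package-install defaults; verify them against your cassandra.yaml before deleting anything):

```shell
# On a healthy node: find the Host ID of each stuck node, then remove it.
nodetool status
nodetool removenode <host-id-of-stuck-node>

# On each stuck node: stop Cassandra and wipe its local state.
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data \
            /var/lib/cassandra/commitlog \
            /var/lib/cassandra/saved_caches

# Then start the wiped nodes back one by one, waiting for each
# to finish joining (nodetool status shows UN) before the next.
sudo service cassandra start
```

Wiping the data directory means the node rejoins as a brand-new member and re-streams its data, so do this one node at a time and only while the rest of the cluster is healthy.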
One cause can be misconfigured IP addresses/seeds:
double-check that the cluster name and the seeds are the same in every node's cassandra.yaml file, and that all of the IP addresses are correct.
In the cassandra.yaml file, comment out these values and let Cassandra use the defaults:
data_file_directories: db/cassandra/data
commitlog_directory: db/cassandra/commitlog

Networking problems with submitting Spark application

I want to submit a Spark application from my laptop to a Spark cluster running in standalone mode. They are on the same network and have routable IP addresses.
I can get it to work if I disable the firewall on my laptop. But with the firewall enabled, it fails while trying to connect to the master node. The following error message seems to indicate that the driver can't get any resources.
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Next, I tried setting spark.driver.port in my driver program and opened up that particular port on the driver machine. This gets past the first error, but gives me a new one:
WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 172.31.51.158): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:393)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It seems like the master and driver are able to use spark.driver.port to get something started, but it fails later? It still seems like some kind of networking problem.
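One hedged guess, based on the stack trace: the timeout happens in Utils.doFetchFile, i.e. an executor fetching dependencies from a server on the driver, not on spark.driver.port itself. Older Spark releases open several driver-side listeners on random ports unless pinned, so pinning them all and opening each in the firewall may help. A sketch, with arbitrary example port numbers:

```properties
# spark-defaults.conf sketch -- port numbers are arbitrary examples.
# RPC endpoint that executors connect back to
spark.driver.port       50100
# Block-manager transfers
spark.blockManager.port 50200
# HTTP file server for jars/files (Spark 1.x only; removed in later versions)
spark.fileserver.port   50300
# HTTP broadcast server (Spark 1.x only)
spark.broadcast.port    50400
```

With all of these fixed, the firewall rule can allow exactly those inbound ports on the laptop instead of disabling the firewall entirely.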

Cassandra NoHostAvailableException

I am getting the following issue:
2015-05-06 00:00:24,368 648892 INFO [STDOUT] (ajp-0.0.0.0-8109-1:)
00:00:24,365 ERROR [DispatcherServlet] com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout during read), /:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout during read), /:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256)
The cluster configuration is 2 nodes with RF = 1, running a query from an application that uses a secondary index. The query runs fine on the TEST cluster, but it throws the above exception on the STAGE cluster.
I was able to reproduce the issue in my local environment and added .withSocketOptions(socket.setReadTimeoutMillis(80000)) to the code, which resolved the issue. But if the number of rows increases, I have to increase that value further. There is no error in the Cassandra system logs and the application is able to connect to the cluster, but it times out only for the STAGE cluster. Any idea why this might be happening? The configuration for both clusters is the same.
This is caused by the change detailed in JAVA-425, where a query that hits a read timeout marks the host as unavailable. This behavior was reverted in 2.0.10, which was released Monday, and will be included in 2.1.6 when it is released in the coming weeks. The new behavior is that if a query hits the configured read timeout, the host remains available instead of being marked unavailable until the driver can reconnect.
Could you try java-driver 2.0.10 and let me know if the problem persists? If you are on a 2.1.x driver version, you may need to keep using an increasingly higher timeout until 2.1.6 is released.
As for the issue only existing on your STAGE cluster: the performance of your query is going to be very dependent on your dataset and environment, especially with secondary indexes (which in general are recommended only for specific use cases). I'd take a look at "When to use an index" to determine whether secondary indexes are right for your use case.
