Does anyone have insights into achieving fault tolerance in Apache Livy? Say, for instance, the Livy server fails: how can we achieve HA?
Actually, using multiple Livy servers behind a load balancer doesn't work right now due to this bug: https://issues.apache.org/jira/browse/LIVY-541
However, for deployments that require high availability, Livy supports session recovery using ZooKeeper, which ensures that the Spark cluster remains available if the Livy server fails. After a restart, the Livy server can reconnect to existing sessions and roll back to the state before the failure.
If you want Livy sessions to persist across a restart, just set these properties in livy.conf:
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = filesystem
livy.server.recovery.state-store.url = file:///home/livy
You can use an hdfs:// URL for the state store as well.
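For the ZooKeeper-based recovery mentioned above, the configuration is similar; here is a minimal sketch, assuming a ZooKeeper quorum reachable at zk1:2181 and zk2:2181 (the host names are illustrative):
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = zookeeper
livy.server.recovery.state-store.url = zk1:2181,zk2:2181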
The official documentation and all sorts of books and articles repeat the recommendation that Spark in local mode should not be used for production purposes. Why not? Why is it a bad idea to run a Spark application on one machine for production purposes? Is it simply because Spark is designed for distributed computing and if you only have one machine there are much easier ways to proceed?
Local mode in Apache Spark is intended for development and testing purposes, and should not be used in production because:
Scalability: Local mode only uses a single machine, so it cannot handle large data sets or the processing needs of a production environment.
Resource Management: Spark's standalone cluster manager, or a cluster manager like YARN, Mesos, or Kubernetes, provides more advanced resource management capabilities for production environments than local mode.
Fault Tolerance: Local mode has no ability to recover from failures, while a cluster manager can provide fault tolerance by automatically restarting failed tasks on other nodes.
Security: Spark's cluster managers provide built-in security features such as authentication and authorization, which are not present in local mode.
Therefore, it is recommended to use a cluster manager for production environments to ensure scalability, resource management, fault tolerance, and security.
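To make the contrast concrete, the difference is just the --master URL passed at submit time; a minimal sketch, where my_app.py is a placeholder for your application:
# local mode: driver and executors share a single JVM on this machine (dev/testing only)
spark-submit --master "local[*]" my_app.py
# the same application under a cluster manager, e.g. YARN, for production
spark-submit --master yarn --deploy-mode cluster my_app.py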
I have the same question. I am certainly not an authority on the subject, but since no one has answered this question, I'll try to list the reasons I've encountered while using Spark local mode in Java. So far:
Spark calls System.exit() on certain occasions, such as an out-of-memory error or when the local dir does not have write permissions, so if such a call is triggered, the entire JVM shuts down, including your own application from within which Spark runs (see e.g. https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala#L45 and https://github.com/apache/spark/blob/b22946ed8b5f41648a32a4e0c4c40226141a06a0/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L63). Moreover, it circumvents your own shutdownHook, so there is no way to handle such exits gracefully in your own application. On a cluster it is usually fine if a certain machine restarts, but if all Spark components are contained in a single JVM, the assumption that the entire JVM can be shut down upon a Spark failure does not (always) hold.
I have a standalone Spark cluster on Kubernetes and I want to use it to load some temp views in memory and expose them via JDBC using the Spark Thrift Server.
I already got it working with no security by submitting a Spark job (PySpark in my case) and starting the Thrift Server in that same job, so I can access the temp views.
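For context, that in-process approach looks roughly like this; a minimal sketch, assuming Spark 2.x built with Hive/Thrift support, where _jwrapped exposes the session's underlying Java SQLContext (the path and view name are placeholders):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("views-over-jdbc")
         .enableHiveSupport()
         .getOrCreate())

# register an in-memory temp view from some source data
spark.read.parquet("/data/some_table").createOrReplaceTempView("my_view")

# start HiveThriftServer2 inside this driver JVM so JDBC clients
# share its session and can therefore see the temp views
jvm = spark.sparkContext._gateway.jvm
jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark._jwrapped)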
Since I'll need to expose some sensitive data, I want to apply at least an authentication mechanism.
I've been reading a lot, and I see basically two methods to do so:
PAM, which is not advised for production, since some critical files need permissions granted to users besides root.
Kerberos, which appears to be the most appropriate one for this situation.
My question is:
- For a standalone Spark cluster (running on K8s), is Kerberos the best approach? If not, which one?
- If Kerberos is the best one, it's really hard to find any guidance or step-by-step on how to set up Kerberos to work with the Spark Thrift Server, especially in my case, where I'm not using any specific distribution (MapR, Hortonworks, etc.).
Appreciate your help
I have an analytics node running with the Spark SQL Thrift Server on it. Now I can't run another Spark application with spark-submit.
It says it doesn't have resources. How do I configure the DSE node to be able to run both?
The Spark SQL Thrift Server is a Spark application like any other. This means it requests and reserves all resources in the cluster by default.
There are two options if you want to run multiple applications at the same time:
Allocate only part of your resources to each application.
This is done by setting spark.cores.max to a smaller value than the total cores in your cluster (see the Spark docs and the sketch after this list).
Dynamic allocation,
which allows applications to change the amount of resources they use depending on how much work they are trying to do (see the Spark docs).
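As a concrete illustration, both options are plain properties in spark-defaults.conf or on the submit command line; the values below are illustrative, not recommendations:
# Option 1: cap the cores this application may reserve
spark.cores.max                  4
# Option 2: let the application grow and shrink its executors
spark.dynamicAllocation.enabled  true
# dynamic allocation also needs the external shuffle service on most cluster managers
spark.shuffle.service.enabled    true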
I have installed 3 ArangoDB servers, but I always get the same listening port, 8529, and no 8530 for the coordinator, so I cannot create a cluster.
tcp 0 0 0.0.0.0:8529 0.0.0.0:* LISTEN 13142/arangod
So when I try to create a cluster via the web interface, I get the following error:
ERROR bootstrapping DB servers failed: Could not connect to 'tcp://10.0.0.18:8530' 'connect() failed with #111 - Connection refused'
How can I start and/or configure the coordinator to listen on my servers?
Regards
Dispatcher-based clusters
Please note that dispatcher-based setups such as the one you're asking about are intended for evaluation purposes only.
To start a cluster from the dispatcher web frontend, you need to configure all nodes to start the arangod daemon in dispatcher mode:
[cluster]
disable-dispatcher-kickstarter = no
disable-dispatcher-frontend = no
To start a cluster on a single machine you only need to install ArangoDB and reconfigure it once; it will then use the same installation to start the dispatcher and dbserver nodes.
One should know that the initial cluster startup may take a while.
Another side note is that authentication is not supported in this scenario, so you may need to turn it off.
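On the ArangoDB 2.x series, where these dispatcher setups lived, turning authentication off is a config switch; a sketch, assuming the stock config file location is in use:
[server]
disable-authentication = true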
You should now find the log output of the dbserver and coordinator instances under /var/log/arangodb/cluster/, so you can get the actual information about what went wrong.
Script-based cloud install clusters
A better way to get a cluster running in the cloud may be to use one of the scripts we prepared for Digital Ocean, Google Compute Engine, AWS or Azure.
ArangoDB Clusters based on Mesosphere DCOS
The currently recommended way of running an ArangoDB cluster is to use Mesosphere DCOS, as Max describes in these slides using some example configurations.
ArangoDB is an official Mesosphere partner and we offer an official DCOS subcommand to manage an ArangoDB Cluster on Mesosphere DCOS.
Mesosphere adds additional services on top of Mesos and eases management of the Mesos cluster via the dcos-cli.
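For reference, installing through the dcos CLI is a one-liner; a sketch, assuming the CLI is already pointed at your cluster (the package name may differ between DCOS versions):
dcos package install arangodb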
If you want to use a raw Apache Mesos cluster, you can use the Mesos framework directly to schedule the necessary tasks to create an ArangoDB cluster.
Meanwhile there is a better article about Running ArangoDB on DC/OS.
I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a Spark job running in client mode needs to connect to the nodes directly, and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver runs where you invoke Python), the driver becomes the Mesos framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
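In practice that comes down to exporting the two variables before submitting; a minimal sketch, where 10.8.0.5 stands in for this machine's VPN address reachable from the agents, and the master URL and script name are placeholders:
# address the Mesos agents can use to reach back to the driver
export LIBPROCESS_IP=10.8.0.5
export SPARK_LOCAL_IP=10.8.0.5
spark-submit --master mesos://10.0.0.1:5050 --deploy-mode client my_job.py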