Unable to Connect a Remote Spark Session in YARN Mode on Kubeflow - apache-spark

The main problem is that we are unable to run Spark in client mode.
Whenever we try to connect to Spark in YARN mode from a Kubeflow notebook, we get the following error:
```
Py4JJavaError: An error occurred while calling o81.showString.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:932)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:930)
```
It seems to be the exact same issue as described here:
Up to now:
- We have managed to spark-submit jobs from the notebook.
- Connecting from the Kubeflow notebook in cluster deploy mode works.
- We have managed to run a Spark session from the Python shell on one of the Kubernetes worker servers, and we can connect to the remote edge node managed by Cloudera.
- We have checked that there is no network issue between the Hadoop clusters and the Kubernetes clusters.
However, we still cannot get interactive Spark access from the Jupyter notebook.
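For reference, a minimal sketch of what an interactive client-mode session from the notebook pod would look like, assuming HADOOP_CONF_DIR/YARN_CONF_DIR point at the Cloudera client configs inside the pod; the pod IP and ports below are hypothetical. In client mode the YARN executors must be able to connect back to the driver running inside the pod, which is the usual reason the SparkContext shuts down as in the trace above:

```python
from pyspark.sql import SparkSession

# Hypothetical pod address; executors on the Hadoop cluster must be able to
# reach it, otherwise the driver gives up and the SparkContext shuts down.
POD_IP = "10.42.3.17"

spark = (
    SparkSession.builder
    .master("yarn")                                  # needs HADOOP_CONF_DIR/YARN_CONF_DIR in the pod
    .appName("kubeflow-notebook-client-mode")
    .config("spark.submit.deployMode", "client")
    .config("spark.driver.host", POD_IP)             # address advertised to the executors
    .config("spark.driver.bindAddress", "0.0.0.0")   # listen on all pod interfaces
    .config("spark.driver.port", "40000")            # fixed ports so they can be exposed/allowed
    .config("spark.blockManager.port", "40001")
    .getOrCreate()
)

spark.range(10).show()
```

If this shows rows, the callback path from YARN to the pod works; if the context shuts down as above, the executors most likely cannot reach spark.driver.host/spark.driver.port.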

Related

Azure Synapse Spark: run a Jupyter notebook running on e.g. localhost against the pool

I have an Apache Spark pool running on web.azuresynapse.net and can use it to run Spark jobs from my "Synapse Analytics workspace".
I can develop and run Python notebooks using the Spark pool from there.
From what I see when I run a job, Livy is supported, though the link provided by Azure Synapse is for some reason inaccessible.
How could I connect to that pool, e.g. via Livy, from my local Jupyter notebook and use it? Or is using a pipeline the only way to run Spark code?
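As a hedged sketch rather than Synapse-specific guidance: each Synapse Spark pool exposes a Livy-compatible REST endpoint, but it expects an Azure AD bearer token rather than anonymous access. The workspace name, pool name, API version, and token below are placeholders and the exact URL shape should be checked against the current Synapse REST documentation:

```python
import requests

# Placeholder workspace, pool, API version, and token.
LIVY_URL = ("https://myworkspace.dev.azuresynapse.net/livyApi/versions/"
            "2019-11-01-preview/sparkPools/mypool/sessions")
TOKEN = "<Azure AD access token for the Synapse resource>"

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

# Ask the Livy-style endpoint for an interactive PySpark session.
resp = requests.post(LIVY_URL, headers=headers,
                     json={"name": "local-notebook-session", "kind": "pyspark"})
resp.raise_for_status()
print(resp.json())  # contains the session id to poll and to POST /statements against
```

Tools such as sparkmagic speak the same Livy protocol, but they would also need to supply the bearer token.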

Not able to connect to Snowflake from EMR Cluster using Pyspark

I am trying to connect to Snowflake from an EMR cluster using PySpark.
I am using these two jars in spark-submit.
snowflake-jdbc-3.5.2.jar
spark-snowflake_2.11-2.7.0-spark_2.4.jar
But it is failing with a connection timeout error.
I have the correct proxy configured for the EMR cluster. From the same EC2 instance (the EMR master) I am able to connect to Snowflake using SnowSQL and the Python connector.
I am not sure why it times out for PySpark.
This looks like a network issue. You can use our SnowCD tool to run connectivity diagnostics:
https://docs.snowflake.com/en/user-guide/snowcd.html
Below are the commands I tried when running through the EMR shell.
pyspark --packages net.snowflake:snowflake-jdbc:3.6.27,net.snowflake:spark-snowflake_2.12:2.4.14-spark_2.4
spark-submit --packages net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4 SparkConnPythonWithCert.py
spark-shell --packages net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4
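For comparison, a minimal PySpark read through the Snowflake connector; the account, credentials, and table below are placeholders, and the connector versions are just the ones from the spark-submit above. The connector talks to Snowflake over HTTPS from the executors as well as the driver, and the JVM does not pick up shell proxy variables on its own, which is one possible reason SnowSQL on the master works while PySpark times out:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("emr-snowflake-connectivity-check")
    # Match these versions to your Spark/Scala build.
    .config("spark.jars.packages",
            "net.snowflake:snowflake-jdbc:3.8.0,"
            "net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4")
    .getOrCreate()
)

# Placeholder account and credentials.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "MY_USER",
    "sfPassword": "MY_PASSWORD",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
    # If outbound access only goes through a proxy, the connector may need to be
    # told about it explicitly, e.g.:
    # "use_proxy": "true", "proxy_host": "proxy.example.com", "proxy_port": "8080",
}

df = (
    spark.read
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "MY_TABLE")
    .load()
)
df.show()
```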

Apache livy error - "Unable to connect to server" spark://[hostname]:7077

I'm trying to submit a Spark batch job using Apache Livy.
I am running my Spark cluster and Livy service on OpenShift.
The Spark cluster is in standalone cluster mode, and Livy runs Spark in cluster deploy mode.
I've configured Livy to point at the Spark cluster's master service:
livy.spark.master = "spark://spark.spark.svc:7077
When I try to run a Scala Spark job using the POST /batches method, I get the exception:
Unable to connect to server spark://spark.spark.svc:7077
Has anyone experienced this issue?
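One way to reproduce this outside of the failing job is to post a trivial batch straight to the Livy REST API and read back its log, which should show the same "Unable to connect to server" line if Livy itself cannot reach the Spark master. The in-cluster Livy URL and example jar path below are hypothetical:

```python
import requests

# Hypothetical in-cluster Livy URL and example jar path; adjust to your deployment.
LIVY_URL = "http://livy.spark.svc:8998"

payload = {
    "file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar",
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["100"],
}

# Submit the batch; requests sets Content-Type: application/json for us.
resp = requests.post(f"{LIVY_URL}/batches", json=payload)
resp.raise_for_status()
batch = resp.json()
print(batch["id"], batch["state"])

# The batch log will contain the connection error if the master URL is wrong,
# e.g. if the quotes in livy.conf are being taken literally.
log = requests.get(f"{LIVY_URL}/batches/{batch['id']}/log").json()
print("\n".join(log.get("log", [])))
```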

Persisting to Kerberized HDFS from Spark cluster

My current set-up:
Spark version 2.3.1 (cluster running on Windows), using a Spark secret (basic).
HDFS (cluster running on Linux), Kerberized.
Not ideal, but there's a good reason why I can't use the same set of machines for both clusters.
I am able to read/write to HDFS from a standalone Spark application, but when I try to run similar code on the Spark cluster I get an authentication error.
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via: [TOKEN, KERBEROS]; Host Details....
Where is your other cluster node? Which user is running Spark in cluster mode? Does that user have permission to access the keytab? I think it could be a permission issue or a typo.
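To rule out the keytab and permission questions above, a small sketch that fails fast before Spark ever touches HDFS. The keytab path, principal, and NameNode address are placeholders, it assumes MIT Kerberos client tools (kinit) are installed on the Windows hosts, and in a standalone cluster every node that runs an executor needs a valid ticket or readable keytab of its own:

```python
import os
import subprocess
from pyspark.sql import SparkSession

KEYTAB = "/etc/security/keytabs/spark.service.keytab"  # placeholder path
PRINCIPAL = "spark/worker01@EXAMPLE.COM"               # placeholder principal

# A common cause of "Client cannot authenticate via: [TOKEN, KERBEROS]" is simply
# that the OS user running the Spark process cannot read the keytab.
if not os.access(KEYTAB, os.R_OK):
    raise PermissionError(f"Keytab not readable by this user: {KEYTAB}")

# Obtain a TGT into the default credentials cache before the Hadoop client starts.
subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

spark = SparkSession.builder.appName("kerberized-hdfs-check").getOrCreate()
spark.range(10).write.mode("overwrite").parquet(
    "hdfs://namenode.example.com:8020/tmp/kerberos_smoke_test"  # placeholder NameNode
)
```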

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a Spark job running in client mode needs to connect to the nodes directly, and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python) the driver becomes the Mesos Framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
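A minimal sketch of that suggestion, assuming a hypothetical VPN address and Mesos master URL; both variables have to be set before the SparkSession (and its JVM) is created so the framework advertises an address the agents can actually route back to:

```python
import os
from pyspark.sql import SparkSession

# Hypothetical VPN/host IP that the Mesos agents can route back to.
DRIVER_IP = "10.8.0.5"

# Both the Mesos framework (libprocess) and Spark itself must advertise an
# address the agents can reach; NAT between them breaks client mode.
os.environ["LIBPROCESS_IP"] = DRIVER_IP
os.environ["SPARK_LOCAL_IP"] = DRIVER_IP

spark = (
    SparkSession.builder
    .master("mesos://zk://master.mesos:2181/mesos")  # hypothetical master URL
    .appName("pyspark-client-mode-on-mesos")
    .getOrCreate()
)

print(spark.range(5).count())
```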
