spark cluster on kubernetes - apache-spark

I'm trying to deploy Spark on Kubernetes.
Before this I ran Spark in standalone mode and it worked, but now I'm having some trouble.
I'm using the same images for cluster mode that I used for standalone mode.
These images were built on miniconda because I needed to customize packages.
Dockerfile for the Spark master:
---
FROM continuumio/miniconda3:latest
COPY master.sh /
COPY elasticsearch-hadoop-8.1.1/dist/* /spark/jars/
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/python
ENV PYSPARK_PYTHON=/opt/conda/bin/python
ENV PYTHONPATH=/opt/conda/bin/python
ENV SPARK_MASTER_PORT 7077
ENV SPARK_MASTER_WEBUI_PORT 8080
ENV SPARK_MASTER_LOG /spark/logs
RUN conda install -c conda-forge findspark
RUN conda install -c conda-forge openjdk
RUN conda install -c conda-forge elasticsearch==8.1.0
RUN conda install -c conda-forge pyspark==3.2.1
RUN conda install -c conda-forge pyarrow==8.0.0
RUN conda install -c conda-forge prophet
RUN conda install -c conda-forge pandas
EXPOSE 8080 7077 6066
CMD ["/bin/bash", "/master.sh"]
and the second one, for the worker, is very similar.
master.sh
export SPARK_MASTER_HOST=${SPARK_MASTER_HOST:-`hostname -i`}
export SPARK_HOME=/spark
. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"
mkdir -p $SPARK_MASTER_LOG
ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out
cd /spark/bin && /spark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master \
--ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT --deploy-mode cluster >> $SPARK_MASTER_LOG/spark-master.out
My YAML for creating the pod:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: spark-master-deployment
  namespace: spark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-master
  template:
    metadata:
      labels:
        app: spark-master
    spec:
      containers:
        - name: spark-master
          image: spark:master_conda_pyspark_pyarrow_script_v3
          imagePullPolicy: Always
          command: ["bash"]
          args: ["-c",
            "source /opt/conda/lib/python3.9/site-packages/pyspark/sbin/spark-config.sh &&\
            source /opt/conda/lib/python3.9/site-packages/pyspark/bin/load-spark-env.sh &&\
            mkdir -p $SPARK_MASTER_LOG &&\
            ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out &&\
            cd /opt/conda/lib/python3.9/site-packages/pyspark/bin &&\
            echo $SPARK_MASTER_HOST $SPARK_MASTER_PORT $SPARK_MASTER_WEBUI_PORT &&\
            /opt/conda/lib/python3.9/site-packages/pyspark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master \
            --host $SPARK_MASTER_HOST \
            --port $SPARK_MASTER_PORT \
            --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG/spark-master.out"
          ]
          ports:
            - containerPort: 8080
            - containerPort: 7077
          resources:
            requests:
              memory: '512Mi'
              cpu: '300m'
            limits:
              memory: '1Gi'
              cpu: '600m'
          env:
            - name: INIT_DAEMON_STEP
              value: 'setup_spark'
            - name: SPARK_LOCAL_HOSTNAME
              value: 'spark-master'
            - name: SPARK_WORKLOAD
              value: 'master'
            #- name: SPARK_PUBLIC_DNS
            #  value: '10.242.130.225'
            - name: SPARK_MASTER_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: SPARK_MASTER_PORT
              value: '7077'
            - name: SPARK_MASTER_WEBUI_PORT
              value: '8080'
            - name: SPARK_HOME
              value: '/opt/conda/lib/python3.9/site-packages/pyspark'
            - name: JAVA_HOME
              value: '/opt/conda'
---
apiVersion: v1
kind: Service
metadata:
  name: spark-master-service
  namespace: spark
spec:
  selector:
    app: spark-master
  ports:
    - name: http
      port: 8080
      protocol: TCP
    - name: http-cos
      port: 7077
      protocol: TCP
Then I try to establish a Spark session with the Spark master from a Jupyter notebook:
#import
import pyspark
from pyspark.sql import SparkSession
from time import time
import numpy as np
import operator
import os
import mlflow
from random import random
from operator import add
from pyspark import SparkContext
from pyspark import StorageLevel
from pyspark.sql.types import StructType, StructField, TimestampType, StringType, FloatType
#set pyspark environment
os.environ["JAVA_HOME"]="/usr/lib/jvm/java-11-openjdk-amd64"
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python'
os.environ["PYSPARK_SUBMIT_ARGS"] = '--master spark://spark-master-service:7077 pyspark-shell'
#SK start spark session
spark = (SparkSession.builder
    .master("spark://spark-master-service:7077")
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.jars", '/usr/local/bin/elasticsearch-spark-30_2.12-8.1.0.jar')
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .appName('SparkTest')
    .getOrCreate())
I got the error below:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [8], in <cell line: 2>()
1 #SK start spark session
----> 2 spark = SparkSession.builder.master("spark://spark-master-service:7077").config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true").config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true").config("spark.jars", '/usr/local/bin/elasticsearch-spark-30_2.12-8.1.0.jar').config("spark.sql.execution.arrow.pyspark.enabled", "true").appName('SparkTestarrow').getOrCreate()
File /usr/local/spark/python/pyspark/sql/session.py:228, in SparkSession.Builder.getOrCreate(self)
226 sparkConf.set(key, value)
227 # This SparkContext may be an existing one.
--> 228 sc = SparkContext.getOrCreate(sparkConf)
229 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
230 # by all sessions.
231 session = SparkSession(sc)
File /usr/local/spark/python/pyspark/context.py:392, in SparkContext.getOrCreate(cls, conf)
390 with SparkContext._lock:
391 if SparkContext._active_spark_context is None:
--> 392 SparkContext(conf=conf or SparkConf())
393 return SparkContext._active_spark_context
File /usr/local/spark/python/pyspark/context.py:144, in SparkContext.__init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
139 if gateway is not None and gateway.gateway_parameters.auth_token is None:
140 raise ValueError(
141 "You are trying to pass an insecure Py4j gateway to Spark. This"
142 " is not allowed as it is a security risk.")
--> 144 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
145 try:
146 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
147 conf, jsc, profiler_cls)
File /usr/local/spark/python/pyspark/context.py:339, in SparkContext._ensure_initialized(cls, instance, gateway, conf)
337 with SparkContext._lock:
338 if not SparkContext._gateway:
--> 339 SparkContext._gateway = gateway or launch_gateway(conf)
340 SparkContext._jvm = SparkContext._gateway.jvm
342 if instance:
File /usr/local/spark/python/pyspark/java_gateway.py:108, in launch_gateway(conf, popen_kwargs)
105 time.sleep(0.1)
107 if not os.path.isfile(conn_info_file):
--> 108 raise RuntimeError("Java gateway process exited before sending its port number")
110 with open(conn_info_file, "rb") as info:
111 gateway_port = read_int(info)
RuntimeError: Java gateway process exited before sending its port number
Can you see what's wrong in my configuration?

Related

Java gateway process exited before sending the driver its port number

I installed Java 8 from its website, while apache-spark, python and scala were installed using Homebrew. Previously I had also installed Java 8 through Homebrew, but then I kept getting a Py4J error.
But now, with Java installed separately, the runtime error keeps coming up.
I have been reading solutions for this error, and most of them suggest setting JAVA_HOME and PYSPARK_HOME. How do I do that on a Mac? I tried on Windows and it ran without errors, but what do I need to do on a Mac?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test Spark').getOrCreate()
sc = spark.sparkContext
/Users/ainaazazi/opt/anaconda3/lib/python3.9/site-packages/pyspark/bin/spark-class: line 71: /Users/ainaazazi/opt/anaconda3/bin/java: No such file or directory
/Users/ainaazazi/opt/anaconda3/lib/python3.9/site-packages/pyspark/bin/spark-class: line 96: CMD: bad array subscript
head: illegal line count -- -1
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/var/folders/01/g4x6z44s45q86gcnbs0czn9w0000gn/T/ipykernel_4609/3137907738.py in <module>
1 from pyspark.sql import SparkSession
2
----> 3 spark = SparkSession.builder.appName('Test Spark').getOrCreate()
4
5 sc = spark.sparkContext
~/opt/anaconda3/lib/python3.9/site-packages/pyspark/sql/session.py in getOrCreate(self)
226 sparkConf.set(key, value)
227 # This SparkContext may be an existing one.
--> 228 sc = SparkContext.getOrCreate(sparkConf)
229 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
230 # by all sessions.
~/opt/anaconda3/lib/python3.9/site-packages/pyspark/context.py in getOrCreate(cls, conf)
390 with SparkContext._lock:
391 if SparkContext._active_spark_context is None:
--> 392 SparkContext(conf=conf or SparkConf())
393 return SparkContext._active_spark_context
394
~/opt/anaconda3/lib/python3.9/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
142 " is not allowed as it is a security risk.")
143
--> 144 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
145 try:
146 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
~/opt/anaconda3/lib/python3.9/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
337 with SparkContext._lock:
338 if not SparkContext._gateway:
--> 339 SparkContext._gateway = gateway or launch_gateway(conf)
340 SparkContext._jvm = SparkContext._gateway.jvm
341
~/opt/anaconda3/lib/python3.9/site-packages/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
106
107 if not os.path.isfile(conn_info_file):
--> 108 raise RuntimeError("Java gateway process exited before sending its port number")
109
110 with open(conn_info_file, "rb") as info:
RuntimeError: Java gateway process exited before sending its port number
To summarize: apache-spark was installed through Homebrew and Java 8 from the website.
You appear to have JAVA_HOME pointing at your Anaconda installation (Spark is trying to run /Users/ainaazazi/opt/anaconda3/bin/java), which is not correct.
The method you use to install Java shouldn't matter. For example, I use sdkman and others use asdf-java. But your real error is on the first line of what you've posted, not the last line.
Spark supports Java 11, and Java 8 is end of life, so upgrade. Then debug your installation by first running $JAVA_HOME/bin/java -version directly, since that is the binary Spark needs.
How do I do that on a Mac?
It's an environment variable, so look up how to set one. For example, edit your ~/.bashrc or ~/.zshrc file.
You can also edit the spark-env.sh file to fix Spark itself (assuming you've installed it with homebrew or downloaded directly, not used pip/conda).
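For reference, here is a minimal sketch (not the answer author's code) of overriding JAVA_HOME from inside the notebook before the first session is created; the JDK path below is an assumed Homebrew location and must be replaced with wherever your JDK actually lives:
# Minimal sketch: set JAVA_HOME before the Java gateway is launched,
# so spark-class finds a real JDK. The path is a hypothetical
# Homebrew install location and must be adjusted.
import os
from pyspark.sql import SparkSession

os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"  # hypothetical path, adjust

spark = SparkSession.builder.appName('Test Spark').getOrCreate()
print(spark.version)
Setting the same variable in ~/.zshrc (and restarting Jupyter) achieves the same thing for every notebook.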

Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM

I am trying to create SparkContext in jupyter notebook but I am getting following Error:
Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM
Here is my code
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local").setAppName("Groceries")
sc = SparkContext(conf = conf)
Py4JError Traceback (most recent call last)
<ipython-input-20-5058f350f58a> in <module>
1 conf = SparkConf().setMaster("local").setAppName("My App")
----> 2 sc = SparkContext(conf = conf)
~/Documents/python38env/lib/python3.8/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
144 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
145 try:
--> 146 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
147 conf, jsc, profiler_cls)
148 except:
~/Documents/python38env/lib/python3.8/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
224 self._encryption_enabled = self._jvm.PythonUtils.isEncryptionEnabled(self._jsc)
225 os.environ["SPARK_AUTH_SOCKET_TIMEOUT"] = \
--> 226 str(self._jvm.PythonUtils.getPythonAuthSocketTimeout(self._jsc))
227 os.environ["SPARK_BUFFER_SIZE"] = \
228 str(self._jvm.PythonUtils.getSparkBufferSize(self._jsc))
~/Documents/python38env/lib/python3.8/site-packages/py4j/java_gateway.py in __getattr__(self, name)
1528 answer, self._gateway_client, self._fqn, name)
1529 else:
-> 1530 raise Py4JError(
1531 "{0}.{1} does not exist in the JVM".format(self._fqn, name))
1532
Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM
This error is reported when the Python pyspark version and the Spark cluster version are inconsistent.
Uninstall the current pyspark, then install the version that matches the Spark cluster. My Spark version is 3.0.2, so I ran the following:
pip3 uninstall pyspark
pip3 install pyspark==3.0.2
We need to uninstall the default/existing/latest version of PySpark from PyCharm/Jupyter Notebook or whatever tool we use.
Then check the version of Spark that is installed, using the command spark-submit --version (in CMD/Terminal).
Then install the PySpark version that matches the Spark version you have.
For example, I have Spark 3.0.3, so I installed PySpark 3.0.3.
In CMD/PyCharm Terminal,
pip install pyspark==3.0.3
Or check this if you are a PyCharm user.
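As a quick sanity check (a sketch, not part of the answer above), the version of the pyspark package that the notebook kernel actually imports can be printed and compared with what spark-submit --version reports on the cluster:
# Sketch: confirm the pyspark package the kernel imports matches the
# cluster's Spark version before creating a context.
import pyspark
print("pyspark package version:", pyspark.__version__)
If the two versions differ, reinstall pyspark pinned to the cluster version as shown above.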
I have had the same error today and resolved it with the below code:
Execute this in a separate cell before you run your Spark session builder:
from pyspark import SparkContext,SQLContext,SparkConf,StorageLevel
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())

how to run spark from jupyter on yarn client

I have a single cluster deployed using Cloudera Manager with the Spark parcel installed.
When typing pyspark in a shell it works, yet running the code below in Jupyter throws an exception.
code
import sys
import py4j
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('SPARK APP')
sc = SparkContext(conf=conf)
# sc= SparkContext.getOrCreate()
# sc.stop()
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))
rdd = sc.parallelize(range(1000)).map(mod).take(10)
print (rdd)
Exception
/usr/lib/python3.6/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
187 self._accumulatorServer = accumulators._start_update_server(auth_token)
188 (host, port) = self._accumulatorServer.server_address
--> 189 self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port, auth_token)
190 self._jsc.sc().register(self._javaAccumulator)
191
TypeError: 'JavaPackage' object is not callable
After searching a bit: the Spark version in use (1.6) is not compatible with Python 3.7, so I had to run it using Python 2.7.
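If only the executor-side interpreter needs to change (the Jupyter kernel itself must already run a Python that the installed Spark supports), PYSPARK_PYTHON can be set before the context starts. A rough sketch, with an assumed interpreter path:
# Sketch: choose the Python that Spark launches on the executors. The path is
# an assumed example and must exist on every worker node.
import os
from pyspark import SparkContext, SparkConf

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"  # hypothetical worker-side path

conf = SparkConf().setMaster("yarn-client").setAppName("SPARK APP")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).map(lambda x: x * 2).collect())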

I am trying to run my first word count application in spark through jupyter. But I am getting error in the initialization of SparkContext

I am trying to run my first word count application in spark through jupyter. But I am getting error in the initialization of SparkContext.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Spark Count")
sc = SparkContext(conf=conf)
Below is the error:
ValueError Traceback (most recent call last)
<ipython-input-13-6b825dbb354c> in <module>()
----> 1 sc = SparkContext(conf=conf)
/home/master/Desktop/Apps/spark-2.1.0-bin-hadoop2.7/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
113 """
114 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
116 try:
117 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/home/master/Desktop/Apps/spark-2.1.0-bin-hadoop2.7/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
270 " created by %s at %s:%s "
271 % (currentAppName, currentMaster,
--> 272 callsite.function, callsite.file, callsite.linenum))
273 else:
274 SparkContext._active_spark_context = instance
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by <module> at /usr/local/lib/python3.3/site-packages/IPython/utils/py3compat.py:186
I think you already have a SparkContext object that is created by Jupyter automatically. You shouldn't have to create a new one.
Just type sc in a cell and execute it. It should display a reference to the existing context.
Hope that helps!
In fact the error indicated it already:
ValueError: Cannot run multiple SparkContexts at once
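A sketch of the safe pattern here is to reuse whatever context is already running instead of constructing a second one:
# Sketch: SparkContext.getOrCreate returns the context Jupyter/PySpark already
# started (its existing appName and master win) instead of raising
# "Cannot run multiple SparkContexts at once".
from pyspark import SparkContext, SparkConf

sc = SparkContext.getOrCreate(SparkConf().setAppName("Spark Count"))
print(sc.master, sc.appName)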

Can't write to a file that I own and is marked as writeable?

I am working on Debian Jessie. As user opuser I have created a file and I own it:
opuser#mymachine: $ ls -lash /webapps/myapp/run/gunicorn.sock
0 srwxrwxrwx 1 opuser webapps 0 Sep 1 18:50 /webapps/myapp/run/gunicorn.sock
Now if I try to open the file to write to it:
opuser#mymachine: $ vi /webapps/myapp/run/gunicorn.sock
vi shows an error at the bottom: "~/run/gunicorn.sock" [Permission Denied].
Why can't I open a file to write to it when I own it, and the file permissions show that it is world-writeable?
UPDATED:
The file was created by running gunicorn, and the reason I'm debugging this is that the gunicorn user can't write to it either:
gunicorn openprescribing.wsgi:application --name myapp_prod --workers 3 --bind=unix:/webapps/webapps/run/gunicorn.sock --user opuser --group webapps --log-level=debug
Here's the full error:
[2015-09-01 11:18:36 +0000] [9439] [DEBUG] Current configuration:
proxy_protocol: False
worker_connections: 1000
statsd_host: None
max_requests_jitter: 0
post_fork: <function post_fork at 0x7efebefd2230>
pythonpath: None
enable_stdio_inheritance: False
worker_class: sync
ssl_version: 3
suppress_ragged_eofs: True
syslog: False
syslog_facility: user
when_ready: <function when_ready at 0x7efebefc6ed8>
pre_fork: <function pre_fork at 0x7efebefd20c8>
cert_reqs: 0
preload_app: False
keepalive: 2
accesslog: None
group: 999
graceful_timeout: 30
do_handshake_on_connect: False
spew: False
workers: 3
proc_name: myapp_prod
sendfile: True
pidfile: None
umask: 0
on_reload: <function on_reload at 0x7efebefc6d70>
pre_exec: <function pre_exec at 0x7efebefd27d0>
worker_tmp_dir: None
post_worker_init: <function post_worker_init at 0x7efebefd2398>
limit_request_fields: 100
on_exit: <function on_exit at 0x7efebefd2e60>
config: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
proxy_allow_ips: ['127.0.0.1']
pre_request: <function pre_request at 0x7efebefd2938>
post_request: <function post_request at 0x7efebefd2a28>
user: 999
forwarded_allow_ips: ['127.0.0.1']
worker_int: <function worker_int at 0x7efebefd2500>
threads: 1
max_requests: 0
limit_request_line: 4094
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
certfile: None
worker_exit: <function worker_exit at 0x7efebefd2b90>
chdir: /webapps/myapp/myapp
paste: None
default_proc_name: myapp.wsgi:application
errorlog: -
loglevel: debug
logconfig: None
syslog_addr: udp://localhost:514
syslog_prefix: None
daemon: False
ciphers: TLSv1
on_starting: <function on_starting at 0x7efebefc6c08>
worker_abort: <function worker_abort at 0x7efebefd2668>
bind: ['unix:/webapps/myapp/run/gunicorn.sock']
raw_env: []
reload: False
check_config: False
limit_request_field_size: 8190
nworkers_changed: <function nworkers_changed at 0x7efebefd2cf8>
timeout: 30
ca_certs: None
django_settings: None
tmp_upload_dir: None
keyfile: None
backlog: 2048
logger_class: gunicorn.glogging.Logger
statsd_prefix:
[2015-09-01 11:18:36 +0000] [9439] [INFO] Starting gunicorn 19.3.0
Traceback (most recent call last):
File "/home/anna/.virtualenvs/myapp/bin/gunicorn", line 11, in <module>
sys.exit(run())
File "/home/anna/.virtualenvs/myapp/local/lib/python2.7/site-packages/gunicorn/app/wsgiapp.py", line 74, in run
WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
File "/home/anna/.virtualenvs/myapp/local/lib/python2.7/site-packages/gunicorn/app/base.py", line 189, in run
super(Application, self).run()
File "/home/anna/.virtualenvs/myapp/local/lib/python2.7/site-packages/gunicorn/app/base.py", line 72, in run
Arbiter(self).run()
File "/home/anna/.virtualenvs/myapp/local/lib/python2.7/site-packages/gunicorn/arbiter.py", line 171, in run
self.start()
File "/home/anna/.virtualenvs/myapp/local/lib/python2.7/site-packages/gunicorn/arbiter.py", line 130, in start
self.LISTENERS = create_sockets(self.cfg, self.log)
File "/home/anna/.virtualenvs/myapp/local/lib/python2.7/site-packages/gunicorn/sock.py", line 211, in create_sockets
sock = sock_type(addr, conf, log)
File "/home/anna/.virtualenvs/myapp/local/lib/python2.7/site-packages/gunicorn/sock.py", line 104, in __init__
os.remove(addr)
OSError: [Errno 13] Permission denied: '/webapps/myapp/run/gunicorn.sock'
The node you are trying to open is a socket, more precisely a UNIX domain socket (the s in the permission flags signals this). Sockets are not open(2)ed the normal way (that's the reason vi(1) fails). They have to be acquired with the socket(PF_UNIX, ...) system call (see unix(7)) and then bind(2)ed to a proper path in the filesystem (this is what makes them appear in the filesystem hierarchy).
Once you have a socket of this kind, you have to connect(2) it to another socket (or accept(2) on it, once it has been bound to a filesystem node) to allow communication to flow from one socket to the other.
For an introduction to sockets API programming (and UNIX domain sockets), read the famous W. Richard Stevens book Unix Network Programming, Volume 1: The Sockets Networking API (3rd Edition).
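To illustrate the difference, here is a minimal Python sketch of talking to a UNIX domain socket with socket()/connect() rather than open(); it assumes the gunicorn socket from the question is up and listening:
# Sketch: a UNIX domain socket is used via socket()/connect(), not open(2),
# which is why a regular text editor cannot open it.
import socket

path = "/webapps/myapp/run/gunicorn.sock"
with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
    s.connect(path)  # succeeds only if a server has bound the path and is accepting
    s.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")  # gunicorn speaks HTTP here
    print(s.recv(4096).decode(errors="replace"))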
Your file type is a socket. It is read once / write once. Not sure you can open that with a regular text editor.
