Multiple Spark sessions in one job submit on Kubernetes - apache-spark

Can we start and stop multiple Spark sessions within one submitted job on Kubernetes?
For example, if I submit one job like this:
bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
In my Python code, can I start and stop Spark sessions?
Example:
from pyspark.sql import SparkSession

# start a spark session
session = SparkSession \
    .builder \
    .appName(appname) \
    .getOrCreate()
## Doing some operations using spark
session.stop()

## some python code.

# start another spark session
session = SparkSession \
    .builder \
    .appName(appname) \
    .getOrCreate()
## Doing some operations using spark
session.stop()
Is this possible or not?

Related

Spark Structured Streaming + pyspark app returns "Initial job has not accepted any resources"

RunCode
spark-submit --master spark://{SparkMasterIP}:7077 \
  --deploy-mode cluster \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0,com.github.jnr:jnr-posix:3.1.15 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf com.datastax.spark:spark.cassandra.connectiohost={SparkMasterIP==CassandraIP},spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
  test.py
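For reference, spark-submit takes one key=value pair per --conf flag, and the connector's host property is spark.cassandra.connection.host (as used in the source code below). A hedged sketch of the same submission written that way, keeping the question's placeholders:

spark-submit --master spark://{SparkMasterIP}:7077 \
  --deploy-mode cluster \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0,com.github.jnr:jnr-posix:3.1.15 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.cassandra.connection.host={SparkMasterIP==CassandraIP} \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
  test.py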
Source Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SQLContext
# Spark session on the local driver, connecting to the Spark master
spark = SparkSession.builder \
    .master("spark://{SparkMasterIP}:7077") \
    .appName("Spark_Streaming+kafka+cassandra") \
    .config('spark.cassandra.connection.host', '{SparkMasterIP==CassandraIP}') \
    .config('spark.cassandra.connection.port', '9042') \
    .getOrCreate()
# Parse Schema of json
schema = StructType() \
    .add("col1", StringType()) \
    .add("col2", StringType()) \
    .add("col3", StringType()) \
    .add("col4", StringType()) \
    .add("col5", StringType()) \
    .add("col6", StringType()) \
    .add("col7", StringType())
# Read Stream From {TOPIC} at BootStrap
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "{KAFKAIP}:9092") \
    .option('startingOffsets', 'earliest') \
    .option("subscribe", "{TOPIC}") \
    .load() \
    .select(from_json(col("value").cast("String"), schema).alias("parsed_value")) \
    .select("parsed_value.*")
df.printSchema()
# write Stream at cassandra
ds = df.writeStream \
    .trigger(processingTime='15 seconds') \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "./checkPoint") \
    .options(table='{TABLE}', keyspace="{KEY}") \
    .outputMode('append') \
    .start()
ds.awaitTermination()
Error Code
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I checked the Spark UI, and the workers have no problem.
Here is my Spark status:
(screenshot of the Spark UI worker status)
My plan is:
kafka(DBIP) --readStream--> LOCAL(DriverIP) --writeStream--> Spark&kafka&cassandra(MasterIP)
DBIP, DriverIP, and MasterIP are different IPs.
LOCAL has no Spark installed, so I use pyspark in a Python virtualenv.
Edit
Your app can't run because there are no resources available in your Spark cluster.
If you look closely at the Spark UI screenshot you posted, all the cores are in use on all 3 workers. That means there are no cores left for any other apps, so any newly submitted app will have to wait until resources are available before it can be scheduled. Cheers!
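If the cluster has to host several applications at once, one common way to keep a single app from grabbing every core on a standalone master is to cap spark.cores.max per application. A minimal sketch, keeping the question's master placeholder (the value 4 is an arbitrary example, not taken from the post):

from pyspark.sql import SparkSession

# Cap the total cores this application may take across the cluster,
# so other applications are not starved of resources.
spark = SparkSession.builder \
    .master("spark://{SparkMasterIP}:7077") \
    .appName("Spark_Streaming+kafka+cassandra") \
    .config("spark.cores.max", "4") \
    .getOrCreate()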

spark dataframe not successfully written in elasticsearch

I am writing data from my Spark DataFrame into ES. I printed the schema and the total record count, and everything looks fine until the dump starts. The job runs successfully and no issue or error is raised in the Spark job, but the index does not end up with the amount of data it should have.
I have 1,800k records to dump, and sometimes only 500k get written, sometimes 800k, etc.
Here is the main section of the code:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config('spark.yarn.executor.memoryOverhead', '4096') \
    .enableHiveSupport() \
    .getOrCreate()

final_df = spark.read.load("/trans/MergedFinal_stage_p1", multiline="false", format="json")
print(final_df.count())  # It is perfectly ok
final_df.printSchema()   # Schema is also ok

## Issue when data gets written to the DB ##
final_df.write.mode("ignore").format(
    'org.elasticsearch.spark.sql'
).option(
    'es.nodes', ES_Nodes
).option(
    'es.port', ES_PORT
).option(
    'es.resource', ES_RESOURCE
).save()
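One detail worth noting about the write above, independent of the missing records: under Spark's generic SaveMode semantics, mode("ignore") skips the save entirely when data already exists at the target, whereas "append" adds new documents. A hedged sketch of the same write with append mode, with option names and variables unchanged from the question:

# Same write as above, but appending documents instead of ignoring
# the save when the target resource already exists.
final_df.write.mode("append").format(
    'org.elasticsearch.spark.sql'
).option(
    'es.nodes', ES_Nodes
).option(
    'es.port', ES_PORT
).option(
    'es.resource', ES_RESOURCE
).save()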
My resources are also fine.
Command to run the Spark job:
time spark-submit --class org.apache.spark.examples.SparkPi --jars elasticsearch-spark-30_2.12-7.14.1.jar --master yarn --deploy-mode cluster --driver-memory 6g --executor-memory 3g --num-executors 16 --executor-cores 2 main_es.py

How to use pyspark on yarn-cluster mode by code

This is my code.
spark = SparkSession.builder. \
    master("yarn"). \
    config("hive.metastore.uris", settings.HIVE_URL). \
    config("spark.executor.memory", '5g'). \
    config("spark.driver.memory", '2g'). \
    config('spark.submit.deployMode', 'cluster'). \
    enableHiveSupport(). \
    appName(app_name). \
    getOrCreate()
I want to use this approach to submit a task to a YARN cluster.
But it throws an exception:
Cluster deploy mode is not applicable to Spark shells.
What should I do?
I need cluster mode, not client mode.
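For reference, this error shows up because a PySpark program launched directly with python is treated as a shell, and shells cannot run in cluster deploy mode; cluster mode is normally requested when the script is handed to spark-submit instead of inside the SparkSession builder. A hedged sketch of an equivalent submission (my_app.py is a placeholder for your script; Hive settings such as hive.metastore.uris usually come from hive-site.xml or spark.hadoop.* properties):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=5g \
  --conf spark.driver.memory=2g \
  my_app.py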

AWS EKS Spark 3.0, Hadoop 3.2 Error - NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

I'm running JupyterHub on EKS and want to leverage EKS IRSA functionality to run Spark workloads on K8s. I had prior experience using Kube2IAM, but now I'm planning to move to IRSA.
This error is not because of IRSA, as service accounts are attached perfectly fine to the driver and executor pods and I can access S3 via the CLI and SDK from both. This issue is related to accessing S3 using Spark on Spark 3.0 / Hadoop 3.2.
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException
I'm using the following versions:
APACHE_SPARK_VERSION=3.0.1
HADOOP_VERSION=3.2
aws-java-sdk-1.11.890
hadoop-aws-3.2.0
Python 3.7.3
I tested with a different version as well:
aws-java-sdk-1.11.563.jar
Please share a solution if someone has come across this issue.
PS: This is not an IAM policy error either; the IAM policies are perfectly fine.
Finally, all the issues were solved with the jars below (see the sketch after this list for one way to pull them in):
hadoop-aws-3.2.0.jar
aws-java-sdk-bundle-1.11.874.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.874)
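One way to pull in exactly these two dependencies, instead of copying the jars around by hand, is to let Spark resolve them from Maven via spark.jars.packages; note this has to be set before the first SparkSession (and its JVM) is created. A minimal sketch, assuming the standard Maven coordinates that match the jar names above (the app name is a placeholder):

from pyspark.sql import SparkSession

# Resolve hadoop-aws and the AWS SDK bundle from Maven at startup
spark = SparkSession.builder \
    .appName("s3a-irsa-test") \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.2.0,com.amazonaws:aws-java-sdk-bundle:1.11.874") \
    .getOrCreate()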
For anyone trying to run Spark on EKS using IRSA, this is the correct Spark config:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("pyspark-data-analysis-1") \
    .config("spark.kubernetes.driver.master", "k8s://https://xxxxxx.gr7.ap-southeast-1.eks.amazonaws.com:443") \
    .config("spark.kubernetes.namespace", "jupyter") \
    .config("spark.kubernetes.container.image", "xxxxxx.dkr.ecr.ap-southeast-1.amazonaws.com/spark-ubuntu-3.0.1") \
    .config("spark.kubernetes.container.image.pullPolicy", "Always") \
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark") \
    .config("spark.kubernetes.authenticate.executor.serviceAccountName", "spark") \
    .config("spark.kubernetes.executor.annotation.eks.amazonaws.com/role-arn", "arn:aws:iam::xxxxxx:role/spark-irsa") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
    .config("spark.kubernetes.authenticate.submission.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt") \
    .config("spark.kubernetes.authenticate.submission.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token") \
    .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.fast.upload", "true") \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.cores", "3") \
    .config("spark.executor.memory", "10g") \
    .getOrCreate()
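As a quick smoke test of the S3A wiring (the bucket and prefix below are placeholders, not from the post), reading a small path can confirm that the IRSA credentials are actually picked up:

# Hypothetical path; replace with a bucket the IRSA role can read
df = spark.read.text("s3a://your-bucket/some-prefix/")
df.show(5)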
You can check out this blog (https://medium.com/swlh/how-to-perform-a-spark-submit-to-amazon-eks-cluster-with-irsa-50af9b26cae), which uses:
Spark 2.4.4
Hadoop 2.7.3
AWS SDK 1.11.834
The example spark-submit is:
/opt/spark/bin/spark-submit \
--master=k8s://https://4A5<i_am_tu>545E6.sk1.ap-southeast-1.eks.amazonaws.com \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
--conf spark.kubernetes.container.image=vitamingaugau/spark:spark-2.4.4-irsa \
--conf spark.kubernetes.namespace=spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
--conf spark.kubernetes.authenticate.executor.serviceAccountName=spark-pi \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
--conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.4.jar 20000

Kubernetes spark-submit in cluster mode: --packages not working as expected

I am trying to submit a Spark job to a Kubernetes cluster in cluster mode, from a client inside the cluster, with the --packages option so that dependencies are downloaded by the driver and executors, but it is not working: it refers to paths on the submitting client. (kubectl proxy is on.)
Here are the submit options:
/usr/local/bin/spark-submit \
--verbose \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image= <...> \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=datazone-s3-secret:AWS_ACCESS_KEY_ID \
--conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=datazone-s3-secret:AWS_SECRET_ACCESS_KEY \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
s3.py 10
In the logs I can see that the packages refer to my local file system.
Spark config:
(spark.kubernetes.namespace,spark)
(spark.jars,file:///Users/<my username>/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar,file:///Users/<my username>/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar,file:///Users/<my username>/.ivy2/jars/joda-time_joda-time-2.10.5.jar, ....
Has anyone faced this problem?
