KubernetesPodOperator Not Sending Arguments as expected - apache-spark

I have Airflow running a KubernetesPodOperator in order to make a spark-submit call:
spark_image = f'{getenv("REGISTRY")}/myApp:{getenv("TAG")}'
j2g = KubernetesPodOperator(
    dag=dag,
    task_id='myApp',
    name='myApp',
    namespace='data',
    image=spark_image,
    cmds=['/opt/spark/bin/spark-submit'],
    configmaps=["data"],
    arguments=[
        '--master k8s://https://10.96.0.1:443',
        '--deploy-mode cluster',
        '--name myApp',
        f'--conf spark.kubernetes.container.image={spark_image}',
        'local:///app/run.py'
    ],
)
However, I'm getting the following error:
Error: Unrecognized option: --master k8s://https://10.96.0.1:443
This is odd, because when I bash into a running pod and execute the spark-submit command by hand, it works.
Any idea how to pass the arguments as expected?

Solution from the GitHub ticket: each parameter should be sent with the flag and its value joined by '=', for example:
'--master=k8s://https://10.96.0.1:443',
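For completeness, a sketch of the corrected operator call (same fields as above, only the arguments list changes); each flag and its value are joined with '=' so spark-submit receives them as single, recognizable tokens. Passing the flag and its value as two separate list items should work as well.
j2g = KubernetesPodOperator(
    dag=dag,
    task_id='myApp',
    name='myApp',
    namespace='data',
    image=spark_image,
    cmds=['/opt/spark/bin/spark-submit'],
    configmaps=["data"],
    arguments=[
        # each list element becomes one argv token, so a flag and its value
        # must not be separated by a space inside the same string
        '--master=k8s://https://10.96.0.1:443',
        '--deploy-mode=cluster',
        '--name=myApp',
        f'--conf=spark.kubernetes.container.image={spark_image}',
        'local:///app/run.py',
    ],
)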

Related

Spark on Kubernetes with Minio: Postgres -> Minio, unable to create executor due to ./postgresql-42.2.14.jar

Hi, I am facing an error when providing dependency jars for spark-submit in Kubernetes.
/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit \
  --master k8s://https://112.23.123.23:6443 \
  --deploy-mode cluster \
  --name spark-postgres-minio-kubernetes \
  --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
  --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 \
  file:///AirflowData/kubernetes/python/postgresminioKube.py
Below is the code to execute. The jars needed for S3/MinIO and the related configurations are placed in spark_home/conf and spark_home/jars, and the Docker image is built from them.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
import json
#spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()
jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hosnamme", "port", "db")
connectionProperties = {
    "user": "username",
    "password": "password",
    "driver": "org.postgresql.Driver",
    "fetchsize": "100000"
}
pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id", lowerBound=1, upperBound=100, numPartitions=2, properties=connectionProperties)
df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').options(delimiter='|').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
The error is below. It is trying to execute the jar for some reason:
21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573
21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
The external jars get added to /opt/spark/work-dir, and the executor did not have access to that directory. So I changed the Dockerfile to grant access to the folder, and then it worked:
RUN chmod 777 /opt/spark/work-dir

Elasticsearch pyspark connection in insecure mode

My end goal is to insert data from HDFS into Elasticsearch, but the issue I am facing is connectivity.
I am able to connect to my Elasticsearch node using the below curl command:
curl -u username -X GET 'https://xx.xxx.xx.xxx:9200/_cat/indices?v' --insecure
but when it comes to connecting from Spark I am unable to do so. My command to insert data is:
df.write.mode("append").format('org.elasticsearch.spark.sql') \
    .option("es.net.http.auth.user", "username") \
    .option("es.net.http.auth.pass", "password") \
    .option("es.index.auto.create", "true") \
    .option('es.nodes', 'https://xx.xxx.xx.xxx') \
    .option('es.port', '9200') \
    .save('my-index/my-doctype')
The error I am getting is:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
....
....
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings) - all nodes failed; tried [[xx.xxx.xx.xxx:9200]]
....
...
What would be the PySpark equivalent of curl --insecure here?
Thanks.
After many attempts and different config options, I found a way to connect to Elasticsearch running on HTTPS insecurely:
dfToEs.write.mode("append").format('org.elasticsearch.spark.sql') \
.option("es.net.http.auth.user", username) \
.option("es.net.http.auth.pass", password) \
.option("es.net.ssl", "true") \
.option("es.net.ssl.cert.allow.self.signed", "true") \
.option("mergeSchema", "true") \
.option('es.index.auto.create', 'true') \
.option('es.nodes', 'https://{}'.format(es_ip)) \
.option('es.port', '9200') \
.option('es.batch.write.retry.wait', '100s') \
.save('{index}/_doc'.format(index=index))
The key setting is
(es.net.ssl, true)
We also have to allow the self-signed certificate with
(es.net.ssl.cert.allow.self.signed, true)
I checked a lot of things and finally I can write to the AWS Elasticsearch Service (ES), but with Scala/Spark.
In a VPC, create security groups to allow access from EMR to ES on port 443 (an inbound rule in the ES security group for the EMR security group, and likewise on the EMR side for the same port).
Check connectivity from the EMR master node with a telnet command:
telnet xyz.eu-west-1.es.amazonaws.com 443
Once that checks out, verify at the application level with a curl command:
curl 'https://xyz.eu-west-1.es.amazonaws.com:443/domainname/_search?pretty=true&q=*'
Then on to the code; in my case I tested with spark-shell, and the server confs were included at startup like this:
spark-shell --jars elasticsearch-spark-20_2.11-7.1.1.jar --conf spark.es.nodes="xyz.eu-west-1.es.amazonaws.com" --conf spark.es.port=443 --conf spark.es.nodes.wan.only=true --conf spark.es.nodes.discovery="false" --conf spark.es.index.auto.create="true" --conf spark.es.resource="domain/doc" --conf spark.es.scheme="https"
Finally the code to write:
import java.util.Date
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._
val dateformat = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
val currentdate = dateformat.format(new Date)
val colorsDF = spark.read.json("multilinecolors.json")
val mcolors = colorsDF.withColumn("Date",lit(currentdate))
mcolors.write.mode("append").format("org.elasticsearch.spark.sql").option("es.net.http.auth.user", "").option("es.net.http.auth.pass", "").option("es.net.ssl", "true").option("es.net.ssl.cert.allow.self.signed", "true").option("mergeSchema", "true").option("es.index.auto.create", "true").option("es.nodes","https://xyz.eu-west-1.es.amazonaws.com").option("es.port", "443").option("es.batch.write.retry.wait", "100").save("domainname/_doc")
Can you try with the below SparkConf settings?
val sparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.es.index.auto.create", "true")
.set("spark.es.nodes", "yourESaddress")
.set("spark.es.port", "9200")
.set("spark.es.net.http.auth.user","")
.set("spark.es.net.http.auth.pass", "")
.set("spark.es.resource", indexName)
.set("spark.es.nodes.wan.only", "true")
If you still face the problem, set es.net.ssl = true and see.
If you still get the error, try adding the configs below:
'es.resource' = 'ctrl_rater_resumen_lla/hb',
'es.nodes' = 'localhost',
'es.port' = '9200',
'es.index.auto.create' = 'true',
'es.index.read.missing.as.empty' = 'true',
'es.nodes.discovery' = 'true',
'es.net.ssl' = 'false',
'es.nodes.client.only' = 'false',
'es.nodes.wan.only' = 'true',
'es.net.http.auth.user' = 'xxxxx',
'es.net.http.auth.pass' = 'xxxxx',
'es.nodes.discovery' = 'false'
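For reference, a minimal PySpark sketch showing how key/value pairs like the ones above are passed as writer options; df, the host, the index and the credentials are the placeholders from the list, not real values.
# minimal sketch: applying the settings above as elasticsearch-hadoop writer options
(df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .option("es.index.auto.create", "true")
    .option("es.index.read.missing.as.empty", "true")
    .option("es.nodes.wan.only", "true")
    .option("es.nodes.discovery", "false")
    .option("es.net.ssl", "false")
    .option("es.net.http.auth.user", "xxxxx")
    .option("es.net.http.auth.pass", "xxxxx")
    .mode("append")
    .save("ctrl_rater_resumen_lla/hb"))  # es.resource can also be given as the save path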

Livy always runs in local mode

I am trying to run a PySpark (or Spark) job via the Livy server with "spark.master=yarn".
What I have done:
1) In spark-defaults.conf:
spark.master yarn
spark.submit.deployMode client
2) In livy.conf:
livy.spark.master = yarn
livy.spark.deployMode = client
3) I send the request via curl with "conf": {"spark.master": "yarn"}.
Example:
curl -X POST -H "Content-Type: application/json" localhost:8998/batches --data '{"file": "hdfs:///user/grzegorz/hello-world.py", "name": "MY", "conf": {"spark.master": "yarn"} }'
{"id":3,"state":"running","appId":null,"appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":["stdout: ","\nstderr: "]}
And this is what I always get in the logs:
18/01/02 14:45:07.880 qtp1758624236-28 INFO BatchSession$: Creating batch session 3: [owner: null, request: [proxyUser: None, file: hdfs:///user/grzegorz/hello-world.py, name: MY, conf: spark.master -> yarn]]
18/01/02 14:45:07.883 qtp1758624236-28 INFO SparkProcessBuilder: Running '/usr/local/share/spark/spark-2.0.2/bin/spark-submit' '--name' 'MY' '--conf' 'spark.master=local' 'hdfs:///user/grzegorz/hello-world.py'
I hope somebody has ideas on how to get past this. Thank you in advance.

Where to check my YARN + Spark application's logs?

I wrote an application with YARN + Spark; for simplicity, I list the following:
object testKafkaSparkStreaming extends Logging {
  private class Parser extends Logging {
    def parse(row: String): Row = {
      val row = "/_dc.gif| |20160616063934| |39.190.5.69| |729252016040907094857083a3c7c62e"
      logInfo("pengcz starting parse " + row)
    }
  }
  def main(args: Array[String]) {
    ...
    logInfo("main starting parse " + row)
    ...
  }
}
When I executed:
spark-submit --master local[*] --class $CLASS $JAR ...
I can see the two log lines in the console.
But when I executed:
spark-submit --master yarn --class $CLASS $JAR ...
And I opened the YARN web UI for my application:
http://192.168.36.172:8088/cluster/app/application_1465894400511_3624
When I click "logs" on that page, the page it opens does not contain my two log lines.
What should I do to find my own logs?
Any advice will be appreciated!
You can use this command to see the YARN logs of an application:
yarn logs -applicationId <Your Yarn Application ID> // i.e. application_1465894400511_3624
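As a usage sketch, assuming log aggregation is enabled on the cluster (yarn logs reads the aggregated logs), you can pipe the output through grep to pick out the custom lines from the code above:
yarn logs -applicationId application_1465894400511_3624 | grep "starting parse"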

Yarn support on Ooyala Spark JobServer

Just started experimenting with the JobServer and would like to use it in our production environment.
We usually run spark jobs individually in yarn-client mode and would like to shift towards the paradigm offered by the Ooyala Spark JobServer.
I am able to run the WordCount examples shown on the official page.
I tried submitting our custom Spark job to the Spark JobServer and got this error:
{
  "status": "ERROR",
  "result": {
    "message": "null",
    "errorClass": "scala.MatchError",
    "stack": ["spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:220)",
      "scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)",
      "scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)",
      "akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)",
      "akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)",
      "scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)",
      "scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)",
      "scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)",
      "scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)"]
  }
}
I had made the necessary code modifications like extending SparkJob and implementing the runJob() method.
This is the dev.conf file that I used:
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  master = "yarn-client"
  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4
  jobserver {
    port = 8090
    jar-store-rootdir = /tmp/jobserver/jars
    jobdao = spark.jobserver.io.JobFileDAO
    filedao {
      rootdir = /tmp/spark-job-server/filedao/data
    }
    context-creation-timeout = "60 s"
  }
  contexts {
    my-low-latency-context {
      num-cpu-cores = 1
      memory-per-node = 512m
    }
  }
  context-settings {
    num-cpu-cores = 2
    memory-per-node = 512m
  }
  home = "/data/softwares/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041"
}
spray.can.server {
  parsing.max-content-length = 200m
}
spark.driver.allowMultipleContexts = true
YARN_CONF_DIR=/home/spark/conf/
Also, how can I pass run-time parameters to the Spark job, such as --files and --jars?
For example, I usually run our custom spark job like this:
./spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041/bin/spark-submit --class com.demo.SparkDriver --master yarn-cluster --num-executors 3 --jars /tmp/api/myUtil.jar --files /tmp/myConfFile.conf,/tmp/mySchema.txt /tmp/mySparkJob.jar
Number of executors and extra jars are passed in a different way, through the config file (see dependent-jar-uris config setting).
YARN_CONF_DIR should be set in the environment and not in the .conf file.
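For illustration, a sketch of how the extra jar from the spark-submit example could be declared in the dev.conf above; dependent-jar-uris takes a list of URIs, and the exact placement under context-settings is an assumption to check against the JobServer documentation for your version:
context-settings {
  num-cpu-cores = 2
  memory-per-node = 512m
  # jar that was previously passed with --jars on the spark-submit command line
  dependent-jar-uris = ["file:///tmp/api/myUtil.jar"]
}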
As for other issues, the Google group is the right place to ask. You may want to search it for yarn-client issues, as several other folks have figured out how to get it to work.
