I am trying to run a PySpark (or Spark) job via the Livy server with "spark.master=yarn".
What I have done:
1) In spark-defaults.conf:
spark.master yarn
spark.submit.deployMode client
2) In livy.conf:
livy.spark.master = yarn
livy.spark.deployMode = client
3) I send request via CURL with "conf": {"spark.master": "yarn"}
Example:
curl -X POST -H "Content-Type: application/json" localhost:8998/batches --data '{"file": "hdfs:///user/grzegorz/hello-world.py", "name": "MY", "conf": {"spark.master": "yarn"} }'
{"id":3,"state":"running","appId":null,"appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":["stdout: ","\nstderr: "]}
And what I am always getting in logs:
18/01/02 14:45:07.880 qtp1758624236-28 INFO BatchSession$: Creating batch session 3: [owner: null, request: [proxyUser: None, file: hdfs:///user/grzegorz/hello-world.py, name: MY, conf: spark.master -> yarn]]
18/01/02 14:45:07.883 qtp1758624236-28 INFO SparkProcessBuilder: Running '/usr/local/share/spark/spark-2.0.2/bin/spark-submit' '--name' 'MY' '--conf' 'spark.master=local' 'hdfs:///user/grzegorz/hello-world.py'
I hope somebody has an idea how to get past this. Thank you in advance.
Related
Hi, I am facing an error when providing dependency jars for spark-submit in Kubernetes.
/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit --master k8s://https://112.23.123.23:6443 --deploy-mode cluster --name spark-postgres-minio-kubernetes --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --conf spark.executor.instances=1 --conf spark.kubernetes.namespace=spark --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code --conf spark.hadoop.fs.s3a.fast.upload=true --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 file:///AirflowData/kubernetes/python/postgresminioKube.py
Below is the code to execute. The jars and configuration needed for S3 (MinIO) are placed in spark_home/conf and spark_home/jars, and the Docker image is created.
import json

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
# spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()

jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hostname", "port", "db")
connectionProperties = {
    "user": "username",
    "password": "password",
    "driver": "org.postgresql.Driver",
    "fetchsize": "100000"
}

# Push the query down to Postgres and read it partitioned on employee_id
pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id", lowerBound=1, upperBound=100, numPartitions=2, properties=connectionProperties)

# Write the result to MinIO via the s3a connector (note: both writes target the same path, so the second overwrites the first)
df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
The error is below. It is trying to execute the jar for some reason:
21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573
21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
The external jars are getting added to /opt/spark/work-dir, and the executor didn't have access to that folder. So I changed the Dockerfile to grant access to the folder, and then it worked:
RUN chmod 777 /opt/spark/work-dir
Please help me figure out how to pass the --props file and --source-class options to a Livy API POST.
spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
--master yarn \
--deploy-mode cluster \
--conf spark.sql.shuffle.partitions=100 \
--driver-class-path $HADOOP_CONF_DIR \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field tst \
--target-base-path /user/hive/warehouse/stock_ticks_mor \
--target-table test \
--props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--continuous
I have converted the configs you are using into a JSON file that can be passed to the Livy API:
{
  "className": "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
  "proxyUser": "root",
  "driverCores": 1,
  "executorCores": 2,
  "executorMemory": "1G",
  "numExecutors": 4,
  "queue": "default",
  "name": "stock_ticks_mor",
  "file": "hdfs://tmp/hudi-utilities-bundle_2.12-0.8.0.jar",
  "conf": {
    "spark.sql.shuffle.partitions": "100",
    "spark.jars.packages": "org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.2",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.task.cpus": "1",
    "spark.executor.cores": "1"
  },
  "args": [
    "--props", "/var/demo/config/kafka-source.properties",
    "--table-type", "MERGE_ON_READ",
    "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
    "--target-base-path", "/user/hive/warehouse/stock_ticks_mor",
    "--target-table", "test",
    "--schemaprovider-class", "org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
    "--continuous"
  ]
}
You can submit this JSON to the Livy endpoint like:
curl -H "X-Requested-By: admin" -H "Content-Type: application/json" -X POST -d @config.json http://localhost:8999/batches
For reference : https://community.cloudera.com/t5/Community-Articles/How-to-Submit-Spark-Application-through-Livy-REST-API/ta-p/247502
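If you prefer to submit from Python instead of curl, the same payload can be POSTed to Livy's /batches endpoint with the requests library (a minimal sketch, reusing the host, headers, and config.json from the curl example above):
import json
import requests

# Load the batch definition shown above and POST it to the Livy /batches endpoint
with open("config.json") as f:
    payload = json.load(f)

resp = requests.post(
    "http://localhost:8999/batches",
    headers={"X-Requested-By": "admin", "Content-Type": "application/json"},
    data=json.dumps(payload),
)
print(resp.status_code, resp.json())  # the response includes the batch id and state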
I have Airflow running a KubernetesPodOperator in order to do a spark-submit call:
spark_image = f'{getenv("REGISTRY")}/myApp:{getenv("TAG")}'
j2g = KubernetesPodOperator(
    dag=dag,
    task_id='myApp',
    name='myApp',
    namespace='data',
    image=spark_image,
    cmds=['/opt/spark/bin/spark-submit'],
    configmaps=["data"],
    arguments=[
        '--master k8s://https://10.96.0.1:443',
        '--deploy-mode cluster',
        '--name myApp',
        f'--conf spark.kubernetes.container.image={spark_image}',
        'local:///app/run.py'
    ],
)
However, I'm getting the following error:
Error: Unrecognized option: --master k8s://https://10.96.0.1:443
Which is weird, because when I exec into a running pod with /bin/bash and run the spark-submit command there, it works.
Any idea how to pass the arguments as expected?
Solution from a GitHub ticket: the parameter should be sent like:
'--master=k8s://https://10.96.0.1:443',
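Applied to the operator above, each flag and its value are joined with '=' so that every list element is a single token (a sketch assuming the other flags accept the same '=' form; splitting each flag and its value into separate list elements is an alternative):
arguments=[
    '--master=k8s://https://10.96.0.1:443',
    '--deploy-mode=cluster',
    '--name=myApp',
    f'--conf=spark.kubernetes.container.image={spark_image}',
    'local:///app/run.py'
],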
I know this question has been answered in so many posts, but those have not helped me. I did research and tried, but I am still facing an issue making an API call to import test execution results.
Approach I took:
Created Test (Test Details: Cucumber), Test Precondition, Test Set, Test Plan and Test Execution
Exported Test using "Xray - Export to Cucumber" option
Added this to my BDD-Cucumber framework, executed it, and it generated a cucumber.json file after execution
Trying the API call using Postman
/api/v1/import/execution/cucumber
curl --location --request POST 'https://xray.cloud.xpand-it.com/api/v1/import/execution/cucumber' \
--header "Authorization: Bearer $token" \
--header 'Content-Type: application/json' \
--data-binary '@/Users/aranjan/Downloads/cucumber.json'
Error:
{ "error": "Error creating Test Execution - Team is required."}
Now, this means it is trying to create a new Test Execution instead of updating the existing one.
Then, I used
/api/v1/import/execution/cucumber/multipart
curl --location --request POST 'https://xray.cloud.xpand-it.com/api/v1/import/execution/cucumber/multipart' \
--header "Authorization: Bearer $token" \
--form 'info=@/Users/aranjan/Downloads/xrayresultimport.json' \
--form 'result=@/Users/aranjan/Downloads/cucumber.json'
Error:
{ "error": "Unexpected field (result)"}
xrayresultimport.json
{
  "fields": {
    "project": {
      "key": "HYP"
    },
    "customfield_10962": [
      "Team", "TeamQAAuto"
    ],
    "issuetype": {
      "id": "10722"
    }
  }
}
/api/v1/import/execution
curl --location --request POST 'https://xray.cloud.xpand-it.com/api/v1/import/execution' \
--header "Authorization: Bearer $token" \
--header 'Content-Type: application/json' \
--data-raw '{
"testExecutionKey": "HYP-3313",
"info" : {
"startDate" : "2020-09-25T11:47:35+01:00",
"finishDate" : "2020-09-25T11:53:00+01:00",
"testPlanKey" : "HYP-3341"
},
"tests" : [
{
"testKey" : "HYP-3330",
"start" : "2020-09-25T11:47:35+01:00",
"finish" : "2020-09-25T11:50:56+01:00",
"comment" : "Successful execution",
"status" : "PASSED"
}
]
}'
{ "error": "Error updating Test Execution - Issue update failed!"}
Agenda:
I want to import the execution results into my existing Test Execution.
I request you to guide me here.
Thanks in advance.
Currently, if you use the multipart endpoint it will always create new Test Executions.
The multipart request has a minor typo: you should have "results" instead of "result". An example would be something like this:
curl -H "Content-Type: multipart/form-data" -X POST -F info=#xrayresultimport.json -F results=#cucumber.json -H "Authorization: Bearer $token" https://xray.cloud.xpand-it.com/api/v2/import/execution/cucumber/multipart
That should make it work :)
Note: concerning the last example you gave, there could be several causes for it, including restrictions on the Jira side. That would need to be analyzed by the Xray support team.
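For completeness, the same multipart request can be made from Python with the requests library (a minimal sketch, assuming the file paths and token from the examples above; note the form field name is "results"):
import requests

token = "..."  # the Xray API token used in the curl examples above

# 'info' carries the Jira issue fields, 'results' carries the Cucumber report
files = {
    "info": open("/Users/aranjan/Downloads/xrayresultimport.json", "rb"),
    "results": open("/Users/aranjan/Downloads/cucumber.json", "rb"),
}
resp = requests.post(
    "https://xray.cloud.xpand-it.com/api/v2/import/execution/cucumber/multipart",
    headers={"Authorization": "Bearer {}".format(token)},
    files=files,
)
print(resp.status_code, resp.text)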
My end goal is to insert data from HDFS into Elasticsearch, but the issue I am facing is connectivity.
I am able to connect to my Elasticsearch node using the below curl command:
curl -u username -X GET 'https://xx.xxx.xx.xxx:9200/_cat/indices?v' --insecure
but when it comes to connecting from Spark I am unable to do so. My command to insert data is:
df.write.mode("append").format('org.elasticsearch.spark.sql').option("es.net.http.auth.user", "username").option("es.net.http.auth.pass", "password").option("es.index.auto.create","true").option('es.nodes', 'https://xx.xxx.xx.xxx').option('es.port','9200').save('my-index/my-doctype')
The error I am getting is:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
....
....
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings) - all nodes failed; tried [[xx.xxx.xx.xxx:9200]]
....
...
Here, what would be the PySpark equivalent of curl's --insecure flag?
Thanks
After many attempts and different config options, I found a way to connect to Elasticsearch running on HTTPS insecurely:
dfToEs.write.mode("append").format('org.elasticsearch.spark.sql') \
.option("es.net.http.auth.user", username) \
.option("es.net.http.auth.pass", password) \
.option("es.net.ssl", "true") \
.option("es.net.ssl.cert.allow.self.signed", "true") \
.option("mergeSchema", "true") \
.option('es.index.auto.create', 'true') \
.option('es.nodes', 'https://{}'.format(es_ip)) \
.option('es.port', '9200') \
.option('es.batch.write.retry.wait', '100s') \
.save('{index}/_doc'.format(index=index))
The key setting is:
es.net.ssl = true
We also have to allow the self-signed certificate, like below:
es.net.ssl.cert.allow.self.signed = true
I did check a lot of things and finally I can write to the AWS Elasticsearch Service (ES), but with Scala/Spark.
In a VPC, create security groups to allow access from EMR to ES on port 443 (inbound rules in ES for the EMR security group, and inbound rules in EMR on the same port).
Check connectivity from the EMR master node with a telnet command:
telnet xyz.eu-west-1.es.amazonaws.com 443
Once the above checks out, check the application level with a curl command:
curl 'https://xyz.eu-west-1.es.amazonaws.com:443/domainname/_search?pretty=true&q=*'
After that, go to the code. In my case I tested with spark-shell, with the server confs included at start-up like this:
spark-shell --jars elasticsearch-spark-20_2.11-7.1.1.jar --conf spark.es.nodes="xyz.eu-west-1.es.amazonaws.com" --conf spark.es.port=443 --conf spark.es.nodes.wan.only=true --conf spark.es.nodes.discovery="false" --conf spark.es.index.auto.create="true" --conf spark.es.resource="domain/doc" --conf spark.es.scheme="https"
Finally the code to write:
import java.util.Date
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.lit
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._

val dateformat = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
val currentdate = dateformat.format(new Date)
val colorsDF = spark.read.json("multilinecolors.json")
val mcolors = colorsDF.withColumn("Date", lit(currentdate))
mcolors.write.mode("append").format("org.elasticsearch.spark.sql")
  .option("es.net.http.auth.user", "").option("es.net.http.auth.pass", "")
  .option("es.net.ssl", "true").option("es.net.ssl.cert.allow.self.signed", "true")
  .option("mergeSchema", "true").option("es.index.auto.create", "true")
  .option("es.nodes", "https://xyz.eu-west-1.es.amazonaws.com").option("es.port", "443")
  .option("es.batch.write.retry.wait", "100").save("domainname/_doc")
Can you try with the below SparkConf settings?
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.es.index.auto.create", "true")
.set("spark.es.nodes", "yourESaddress")
.set("spark.es.port", "9200")
.set("spark.es.net.http.auth.user","")
.set("spark.es.net.http.auth.pass", "")
.set("spark.es.resource", indexName)
.set("spark.es.nodes.wan.only", "true")
If you still face the problem, then set es.net.ssl = true and see.
If you still get the error, try adding the below configs:
'es.resource' = 'ctrl_rater_resumen_lla/hb',
'es.nodes' = 'localhost',
'es.port' = '9200',
'es.index.auto.create' = 'true',
'es.index.read.missing.as.empty' = 'true',
'es.nodes.discovery' = 'true',
'es.net.ssl' = 'false',
'es.nodes.client.only' = 'false',
'es.nodes.wan.only' = 'true',
'es.net.http.auth.user' = 'xxxxx',
'es.net.http.auth.pass' = 'xxxxx',
'es.nodes.discovery' = 'false'
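In PySpark these settings map onto .option() calls on the elasticsearch-spark data source; a minimal sketch, assuming df is an existing DataFrame and reusing the placeholder values from the list above:
# Hypothetical DataFrame `df` written to Elasticsearch with the settings listed above
(df.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .option("es.index.auto.create", "true")
    .option("es.index.read.missing.as.empty", "true")
    .option("es.nodes.wan.only", "true")
    .option("es.nodes.discovery", "false")
    .option("es.net.ssl", "false")
    .option("es.net.http.auth.user", "xxxxx")
    .option("es.net.http.auth.pass", "xxxxx")
    .mode("append")
    .save("ctrl_rater_resumen_lla/hb"))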