How do I use spark xml data source in .net? - apache-spark

Is there a way to use spark-xml (https://github.com/databricks/spark-xml) in a spark .net/c# job?

I was able to use spark-xml data source from .Net.
Here is the test program:
using Microsoft.Spark.Sql;
namespace MySparkApp
{
class Program
{
static void Main(string[] args)
{
SparkSession spark = SparkSession
.Builder()
.AppName("spark-xml-example")
.GetOrCreate();
DataFrame df = spark.Read()
.Option("rowTag", "book")
.Format("xml")
.Load("books.xml");
df.Show();
df.Select("author", "_id")
.Write()
.Format("xml")
.Option("rootTag", "books")
.Option("rowTag", "book")
.Save("newbooks.xml");
spark.Stop();
}
}
}
Checkout https://github.com/databricks/spark-xml and build an assembly jar using 'sbt assembly' command, copy the assembly jar to the dotnet project workspace.
Build project: dotnet build
Submit Spark job:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--jars scala-2.11/spark-xml-assembly-0.10.0.jar \
--master local bin/Debug/netcoreapp3.1/microsoft-spark-2.4.x-0.10.0.jar \
dotnet bin/Debug/netcoreapp3.1/sparkxml.dll

Related

How to connect to Glue catalog from an EMR spark-submit step

We use pyspark in an EMR cluster to run queries against our glue database. We use two methods to execute the python scripts namely through a Zeppelin notebook and through EMR steps. Connecting to the glue database works greats in Zeppelin, but not in EMR steps. When we run a query against the glue database, we get the following error:
pyspark.sql.utils.AnalysisException: Database '{glue_database_name}' does not exist.
This is the spark configuration used in the executed .py file:
spark = SparkSession\
.builder\
.appName("dfp.sln.kunderelation.work")\
.config("spark.sql.broadcastTimeout", "36000")\
.config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")\
.config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")\
.config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")\
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.enableHiveSupport() \
.getOrCreate()
spark.conf.set("spark.sql.sources.ignoreDataLocality.enabled", "true")
The step is submitted using boto3 with the following configuration:
response = emr_client.add_job_flow_steps(
JobFlowId=cluster_id,
Steps=[
{ 'Name': name,
'ActionOnFailure': 'CONTINUE',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': ['spark-submit', '--deploy-mode', 'client', '--master', 'yarn', "{path to .py script}"]
}
},
]
)
Both types of EC2 users have been given glue and s3 privileges.
What do we need to setup to connect to the glue database?

Passing in VM arguments to Spark application via Argo

I have a spark application which is being triggered from argo yaml via dockerized image.
The argo workflow yaml is as :
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: test-argo-spark
namespace: argo
spec:
entrypoint: sparkapp
templates:
- name: sparkapp
container:
name: main
command:
args: [
"/bin/sh",
"-c",
"/opt/spark/bin/spark-submit \
--master k8s://https://kubernetes.default.svc \
--deploy-mode cluster \
--conf spark.kubernetes.container.image=/test-spark:latest \
--conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' \
--conf spark.app.name=test-spark-job \
--conf spark.jars.ivy=/tmp/.ivy \
--conf spark.kubernetes.driverEnv.HTTP2_DISABLE=true \
--conf spark.kubernetes.namespace=argo \
--conf spark.executor.instances=1 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--packages org.postgresql:postgresql:42.1.4 \
--conf spark.kubernetes.driver.pod.name=custom-app \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
--class SparkMain \
local:///opt/app/test-spark.jar"
]
image: /test-spark:latest
imagePullPolicy: IfNotPresent
resources: {}
This is calling in the spark submit which will invoke the jar file that has code residing in
The java code is as :
public class SparkMain {
public static void main(String args[]) {
SparkSession spark = SparkHelper.getSparkSession("SparkMain_Application");
System.out.println("Spark Java appn " + spark.logName());
}
}
The Spark session is being created as follows:
public static SparkSession getSparkSession(String appName) {
String sparkMode = System.getProperty("spark_mode");
if (sparkMode == null) {
sparkMode = "cluster";
}
if (sparkMode.equalsIgnoreCase("cluster")) {
return createSparkSession(appName);
} else if (sparkMode.equalsIgnoreCase("local")) {
return createLocalSparkSession(appName);
} else {
throw new RuntimeException("Invalid spark_mode option " + sparkMode);
}
}
If we see here there is a system property which needs to be passed in ,
String sparkMode = System.getProperty("spark_mode");
Can anyone tell how can we pass in these VM args from argo yaml when calling in the Spark submit.
Also how can we pass in multiple properties for a program

Spark on Kubernetes with Minio - Postgres -> Minio -unable to create executor due to

Hi I am facing an error with providing dependency jars for spark-submit in kubernetes.
/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit --master k8s://https://112.23.123.23:6443 --deploy-mode cluster --name spark-postgres-minio-kubernetes --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --conf spark.executor.instances=1 --conf spark.kubernetes.namespace=spark --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code --conf spark.hadoop.fs.s3a.fast.upload=true --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 file:///AirflowData/kubernetes/python/postgresminioKube.py
Below is the code to execute. The jars needed for the S3 minio and configurations are placed in the spark_home/conf and spark_home/jars and the docker image is created.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
import json
#spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()
jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hosnamme", "port", "db")
connectionProperties = {
"user" : "username",
"password" : "password",
"driver": "org.postgresql.Driver",
"fetchsize" : "100000"
}
pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id", lowerBound=1, upperBound=100, numPartitions=2, properties=connectionProperties)
df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').options(delimiter='|').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
Error is below . It is trying to execute the jar for some reason
21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573
21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
The external jars are getting added to the /opt/spark/work-dir and it didnt had access. So i changed the dockerfile to have access to the folder and then it worked.
RUN chmod 777 /opt/spark/work-dir

Elasticsearch pyspark connection in insecure mode

My end goal is to insert data from hdfs to elasticsearch but the issue i am facing is the connectivity
I am able to connect to my elasticsearch node using below curl command
curl -u username -X GET https://xx.xxx.xx.xxx:9200/_cat/indices?v' --insecure
but when it comes to connection with spark I am unable to do so. My command to insert data is
df.write.mode("append").format('org.elasticsearch.spark.sql').option("es.net.http.auth.user", "username").option("es.net.http.auth.pass", "password").option("es.index.auto.create","true").option('es.nodes', 'https://xx.xxx.xx.xxx').option('es.port','9200').save('my-index/my-doctype')
Error i am getting is
org.elastisearch.hadoop.EsHadoopIllegalArgumentException:Cannot detect ES version - typical this happens if then network/Elasticsearch cluster is not accessible or when targetting a Wan/Cloud instance without the proper setting 'es.nodes.wan.only'
....
....
Caused by: org.elasticseach.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proy settings)- all nodes failed; tried [[xx.xxx.xx.xxx:9200]]
....
...
Here, What would be the pyspark equivalent of curl --insecure
Thanks
After many attempt and different config options. I found a way how to connect elastisearch running on https insecurely
dfToEs.write.mode("append").format('org.elasticsearch.spark.sql') \
.option("es.net.http.auth.user", username) \
.option("es.net.http.auth.pass", password) \
.option("es.net.ssl", "true") \
.option("es.net.ssl.cert.allow.self.signed", "true") \
.option("mergeSchema", "true") \
.option('es.index.auto.create', 'true') \
.option('es.nodes', 'https://{}'.format(es_ip)) \
.option('es.port', '9200') \
.option('es.batch.write.retry.wait', '100s') \
.save('{index}/_doc'.format(index=index))
with the
(es.net.ssl, true)
We also have to provide self signed certificate like below
(es.net.ssl.cert.allow.self.signed, true)
I did check a lot of things and finally i can write in AWS ElasticSearch service (ES), but with scala/spark.
In a VPC, create security groups to access from EMR to ES with port 443 (inbound rules in ES to SG of EMR and inbound rules in EMR to same port)
Check connectivity from EMR master node, with a telnet command
telnet xyz.eu-west-1.es.amazonaws.com 443
Once check above, check app level with curl command
curl https://xyz.eu-west-1.es.amazonaws.com:443/domainname/_search?pretty=true&?q=*```
After, goes to the code, in my case i did test with spark-shell, but server confs was included in start like this:
spark-shell --jars elasticsearch-spark-20_2.11-7.1.1.jar --conf spark.es.nodes="xyz.eu-west-1.es.amazonaws.com" --conf spark.es.port=443 --conf spark.es.nodes.wan.only=true --conf spark.es.nodes.discovery="false" --conf spark.es.index.auto.create="true" --conf spark.es.resource="domain/doc" --conf spark.es.scheme="https"
Finally the code to write:
import java.util.Date
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._
val dateformat = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
val currentdate = dateformat.format(new Date)
val colorsDF = spark.read.json("multilinecolors.json")
val mcolors = colorsDF.withColumn("Date",lit(currentdate))
mcolors.write.mode("append").format("org.elasticsearch.spark.sql").option("es.net.http.auth.user", "").option("es.net.http.auth.pass", "").option("es.net.ssl", "true").option("es.net.ssl.cert.allow.self.signed", "true").option("mergeSchema", "true").option("es.index.auto.create", "true").option("es.nodes","https://xyz.eu-west-1.es.amazonaws.com").option("es.port", "443").option("es.batch.write.retry.wait", "100").save("domainname/_doc")```
can you try with the below sparkConfs,
val sparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.es.index.auto.create", "true")
.set("spark.es.nodes", "yourESaddress")
.set("spark.es.port", "9200")
.set("spark.es.net.http.auth.user","")
.set("spark.es.net.http.auth.pass", "")
.set("spark.es.resource", indexName)
.set("spark.es.nodes.wan.only", "true")
still you face the problem then, es.net.ssl = true and see.
If still you get the error try adding the below configs,
'es.resource' = 'ctrl_rater_resumen_lla/hb',
'es.nodes' = 'localhost',
'es.port' = '9200',
'es.index.auto.create' = 'true',
'es.index.read.missing.as.empty' = 'true',
'es.nodes.discovery'='true',
'es.net.ssl'='false'
'es.nodes.client.only'='false',
'es.nodes.wan.only' = 'true'
'es.net.http.auth.user'='xxxxx',
'es.net.http.auth.pass' = 'xxxxx'
'es.nodes.discovery' = 'false'

spark-submit parse [application-arguments] error in yarn-cluster

When i submit like this, and in my application
spark-submit --master spark://ip:7077 --class mainClass myjar.jar 10}}
it will print out
10}}
but when i submit like this
spark-submit --master yarn --deploy-mode cluster --class mianClass myjar.jar 10}}
it only print out
10
So where is the "}}"? The same result when "10}}}" "10}}}}"
My code:
def main(args: Array[String]): Unit = {
println("start")
println(args(0))
println("args")
}
Spark version is 2.1.0

Resources