spark-submit parse [application-arguments] error in yarn-cluster - apache-spark

When i submit like this, and in my application
spark-submit --master spark://ip:7077 --class mainClass myjar.jar 10}}
it will print out
10}}
but when i submit like this
spark-submit --master yarn --deploy-mode cluster --class mianClass myjar.jar 10}}
it only print out
10
So where is the "}}"? The same result when "10}}}" "10}}}}"
My code:
def main(args: Array[String]): Unit = {
println("start")
println(args(0))
println("args")
}
Spark version is 2.1.0

Related

Spark on Kubernetes with Minio - Postgres -> Minio -unable to create executor due to

Hi I am facing an error with providing dependency jars for spark-submit in kubernetes.
/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit --master k8s://https://112.23.123.23:6443 --deploy-mode cluster --name spark-postgres-minio-kubernetes --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --conf spark.executor.instances=1 --conf spark.kubernetes.namespace=spark --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code --conf spark.hadoop.fs.s3a.fast.upload=true --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 file:///AirflowData/kubernetes/python/postgresminioKube.py
Below is the code to execute. The jars needed for the S3 minio and configurations are placed in the spark_home/conf and spark_home/jars and the docker image is created.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
import json
#spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()
jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hosnamme", "port", "db")
connectionProperties = {
"user" : "username",
"password" : "password",
"driver": "org.postgresql.Driver",
"fetchsize" : "100000"
}
pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id", lowerBound=1, upperBound=100, numPartitions=2, properties=connectionProperties)
df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').options(delimiter='|').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
Error is below . It is trying to execute the jar for some reason
21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573
21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
The external jars are getting added to the /opt/spark/work-dir and it didnt had access. So i changed the dockerfile to have access to the folder and then it worked.
RUN chmod 777 /opt/spark/work-dir

How do I use spark xml data source in .net?

Is there a way to use spark-xml (https://github.com/databricks/spark-xml) in a spark .net/c# job?
I was able to use spark-xml data source from .Net.
Here is the test program:
using Microsoft.Spark.Sql;
namespace MySparkApp
{
class Program
{
static void Main(string[] args)
{
SparkSession spark = SparkSession
.Builder()
.AppName("spark-xml-example")
.GetOrCreate();
DataFrame df = spark.Read()
.Option("rowTag", "book")
.Format("xml")
.Load("books.xml");
df.Show();
df.Select("author", "_id")
.Write()
.Format("xml")
.Option("rootTag", "books")
.Option("rowTag", "book")
.Save("newbooks.xml");
spark.Stop();
}
}
}
Checkout https://github.com/databricks/spark-xml and build an assembly jar using 'sbt assembly' command, copy the assembly jar to the dotnet project workspace.
Build project: dotnet build
Submit Spark job:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--jars scala-2.11/spark-xml-assembly-0.10.0.jar \
--master local bin/Debug/netcoreapp3.1/microsoft-spark-2.4.x-0.10.0.jar \
dotnet bin/Debug/netcoreapp3.1/sparkxml.dll

Spark, kerberos, yarn-cluster -> connection to hbase

Facing one issue with Kerberos enabled Hadoop cluster.
We are trying to run a streaming job on yarn-cluster, which interacts with Kafka (direct stream), and hbase.
Somehow, we are not able to connect to hbase in the cluster mode. We use keytab to login to hbase.
This is what we do:
spark-submit --master yarn-cluster --keytab "dev.keytab" --principal "dev#IO-INT.COM" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j_executor_conf.properties -XX:+UseG1GC" --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j_driver_conf.properties -XX:+UseG1GC" --conf spark.yarn.stagingDir=hdfs:///tmp/spark/ --files "job.properties,log4j_driver_conf.properties,log4j_executor_conf.properties" service-0.0.1-SNAPSHOT.jar job.properties
To connect to hbase:
def getHbaseConnection(properties: SerializedProperties): (Connection, UserGroupInformation) = {
val config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM_VALUE);
config.set("hbase.zookeeper.property.clientPort", 2181);
config.set("hadoop.security.authentication", "kerberos");
config.set("hbase.security.authentication", "kerberos");
config.set("hbase.cluster.distributed", "true");
config.set("hbase.rpc.protection", "privacy");
config.set("hbase.regionserver.kerberos.principal", “hbase/_HOST#IO-INT.COM”);
config.set("hbase.master.kerberos.principal", “hbase/_HOST#IO-INT.COM”);
UserGroupInformation.setConfiguration(config);
var ugi: UserGroupInformation = null;
if (SparkFiles.get(properties.keytab) != null
&& (new java.io.File(SparkFiles.get(properties.keytab)).exists)) {
ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(properties.kerberosPrincipal,
SparkFiles.get(properties.keytab));
} else {
ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(properties.kerberosPrincipal,
properties.keytab);
}
val connection = ConnectionFactory.createConnection(config);
return (connection, ugi);
}
and we connect to hbase:
….
foreachRDD { rdd =>
if (!rdd.isEmpty()) {
//var ugi: UserGroupInformation = Utils.getHbaseConnection(properties)._2
rdd.foreachPartition { partition =>
val connection = Utils.getHbaseConnection(propsObj)._1
val table = …
partition.foreach { json =>
}
table.put(puts)
table.close()
connection.close()
}
}
}
Keytab file is not getting copied to yarn staging/temp directory, we are not getting that in SparkFiles.get… and if we pass keytab with --files, spark-submit is failing because it’s there in --keytab already.
error is:
This server is in the failed servers list: myserver.test.com/120.111.25.45:60020
RpcRetryingCaller{globalStartTime=1497943263013, pause=100, retries=5}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: myserver.test.com/120.111.25.45:60020
RpcRetryingCaller{globalStartTime=1497943263013, pause=100, retries=5}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: myserver.test.com/120.111.25.45:60020 at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:147)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:935)

Unable to log from Spark Streaming

I am trying to log output of spark streaming as shown in the code below
dStream.foreachRDD { rdd =>
if (rdd.count() > 0) {
#transient lazy val log = Logger.getLogger(getClass.getName)
log.info("Process Starting")
rdd.foreach { item =>
log.info("Output :: "+item._1 + "," + item._2 + "," + System.currentTimeMillis())
}
}
The code is executed on a yarn cluster using the following command
./bin/spark-submit --class "StreamingApp" --files file:/home/user/log4j.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home/user/log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home/user/log4j.properties" --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 /home/user/Abc.jar
When i view the logs from the yarn cluster, I can find the logs written before foreach i.e. log.info("Process Starting") but the logs inside foreach are not printing.
I had also tried creating a separate serializable class as below
object LoggerObj extends Serializable{
#transient lazy val log = Logger.getLogger(getClass.getName)
}
and using the same inside foreach as follows
dStream.foreachRDD { rdd =>
if (rdd.count() > 0) {
LoggerObj.log.info("Process Starting")
rdd.foreach { item =>
LoggerObj.log.info("Output :: "+item._1 + "," + item._2 + "," + System.currentTimeMillis())
}
}
but still the same issue, only the logs outside of foreach are being printed.
The log4j.properties is given below
log4j.rootLogger=INFO,stdout,FILE
log4j.rootCategory=INFO,FILE
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/Rt.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=500MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.Holder=INFO,FILE
I was able to fix it by putting "log4j.properties" file under each worker nodes.

where to check my yarn+spark's applications' log?

I wrote an application with yarn+spark, for simplicity, I list the following
object testKafkaSparkStreaming extends Logging {
private class Parser extends Logging{
def parse(row: String): Row =
{
val row = "/_dc.gif| |20160616063934| |39.190.5.69| |729252016040907094857083a3c7c62e"
logInfo("pengcz starting parse " + row)
}
}
def main(args: Array[String]) {
...
logInfo("main starting parse " + row)
...
}
}
When I executed:
spark-submit --master local[*] --class $CLASS $JAR ...
I can see the two log infos in the console
But when I executed:
spark-submit --master yarn --class $CLASS $JAR ...
And I opened the yarn web ui of my own application:
http://192.168.36.172:8088/cluster/app/application_1465894400511_3624
And I click logs under the page, I got:
But the page has not information including my two logs
What should I do to find my own logs?
Any advice will be appreciated!
You can use this command to see the Yarn Logs of an Application :
yarn logs -applicationId <Your Yarn Application ID> // i.e. application_1465894400511_3624

Resources