Unable to log from Spark Streaming - apache-spark

I am trying to log the output of a Spark Streaming job, as shown in the code below:
dStream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    @transient lazy val log = Logger.getLogger(getClass.getName)
    log.info("Process Starting")
    rdd.foreach { item =>
      log.info("Output :: " + item._1 + "," + item._2 + "," + System.currentTimeMillis())
    }
  }
}
The code is executed on a YARN cluster using the following command:
./bin/spark-submit --class "StreamingApp" --files file:/home/user/log4j.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home/user/log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home/user/log4j.properties" --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 /home/user/Abc.jar
When I view the logs from the YARN cluster, I can find the log written before the foreach, i.e. log.info("Process Starting"), but the logs inside the foreach are not printed.
I also tried creating a separate serializable object, as below,
object LoggerObj extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}
and using it inside the foreach as follows:
dStream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    LoggerObj.log.info("Process Starting")
    rdd.foreach { item =>
      LoggerObj.log.info("Output :: " + item._1 + "," + item._2 + "," + System.currentTimeMillis())
    }
  }
}
but I still have the same issue: only the logs outside the foreach are printed.
The log4j.properties file is given below:
log4j.rootLogger=INFO,stdout,FILE
log4j.rootCategory=INFO,FILE
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/Rt.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=500MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.Holder=INFO,FILE

I was able to fix it by putting the "log4j.properties" file on each worker node.
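For reference, a commonly used alternative (a sketch only, not verified on this cluster) is to keep --files as above but point the driver and executor JVM options at the bare file name, since YARN copies files passed via --files into each container's working directory:
./bin/spark-submit --class "StreamingApp" --files file:/home/user/log4j.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 /home/user/Abc.jar
The executor-side log.info output then ends up in the YARN container logs rather than on the driver, and can be retrieved with yarn logs -applicationId <appId>.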

Related

Spark on Kubernetes with Minio - Postgres -> Minio - unable to create executor due to

Hi, I am facing an error when providing dependency jars for spark-submit in Kubernetes.
/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit --master k8s://https://112.23.123.23:6443 --deploy-mode cluster --name spark-postgres-minio-kubernetes --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --conf spark.executor.instances=1 --conf spark.kubernetes.namespace=spark --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code --conf spark.hadoop.fs.s3a.fast.upload=true --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 file:///AirflowData/kubernetes/python/postgresminioKube.py
Below is the code to execute. The jars and configuration needed for S3/MinIO are placed in spark_home/conf and spark_home/jars, and the Docker image is built from that.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
import json

spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
# spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()

jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hosnamme", "port", "db")
connectionProperties = {
    "user": "username",
    "password": "password",
    "driver": "org.postgresql.Driver",
    "fetchsize": "100000"
}

pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id", lowerBound=1, upperBound=100, numPartitions=2, properties=connectionProperties)
df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').options(delimiter='|').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
The error is below. It is trying to execute the jar for some reason:
21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573
21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
The external jars get added to /opt/spark/work-dir, and the executor did not have access to that directory. So I changed the Dockerfile to give access to the folder, and then it worked.
RUN chmod 777 /opt/spark/work-dir
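For context, a minimal sketch of where that line could sit in a Dockerfile layered on top of the image referenced in the submit command above (the base image tag is taken from that command; everything else is an assumption):
FROM hostname:5000/spark-py:spark3.1.2
# open up the work dir so jars staged there (e.g. postgresql-42.2.14.jar) can be read by the executor
RUN chmod 777 /opt/spark/work-dir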

spark-submit parse [application-arguments] error in yarn-cluster

When I submit like this,
spark-submit --master spark://ip:7077 --class mainClass myjar.jar 10}}
it will print out
10}}
but when I submit like this,
spark-submit --master yarn --deploy-mode cluster --class mainClass myjar.jar 10}}
it only prints out
10
So where is the "}}"? The result is the same with "10}}}" and "10}}}}".
My code:
def main(args: Array[String]): Unit = {
  println("start")
  println(args(0))
  println("args")
}
Spark version is 2.1.0

Spark, kerberos, yarn-cluster -> connection to hbase

Facing an issue with a Kerberos-enabled Hadoop cluster.
We are trying to run a streaming job in yarn-cluster mode, which interacts with Kafka (direct stream) and HBase.
Somehow, we are not able to connect to HBase in cluster mode. We use a keytab to log in to HBase.
This is what we do:
spark-submit --master yarn-cluster --keytab "dev.keytab" --principal "dev@IO-INT.COM" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j_executor_conf.properties -XX:+UseG1GC" --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j_driver_conf.properties -XX:+UseG1GC" --conf spark.yarn.stagingDir=hdfs:///tmp/spark/ --files "job.properties,log4j_driver_conf.properties,log4j_executor_conf.properties" service-0.0.1-SNAPSHOT.jar job.properties
To connect to hbase:
def getHbaseConnection(properties: SerializedProperties): (Connection, UserGroupInformation) = {
  val config = HBaseConfiguration.create();
  config.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM_VALUE);
  config.set("hbase.zookeeper.property.clientPort", "2181");
  config.set("hadoop.security.authentication", "kerberos");
  config.set("hbase.security.authentication", "kerberos");
  config.set("hbase.cluster.distributed", "true");
  config.set("hbase.rpc.protection", "privacy");
  config.set("hbase.regionserver.kerberos.principal", "hbase/_HOST@IO-INT.COM");
  config.set("hbase.master.kerberos.principal", "hbase/_HOST@IO-INT.COM");
  UserGroupInformation.setConfiguration(config);
  var ugi: UserGroupInformation = null;
  if (SparkFiles.get(properties.keytab) != null
      && (new java.io.File(SparkFiles.get(properties.keytab)).exists)) {
    ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(properties.kerberosPrincipal,
      SparkFiles.get(properties.keytab));
  } else {
    ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(properties.kerberosPrincipal,
      properties.keytab);
  }
  val connection = ConnectionFactory.createConnection(config);
  return (connection, ugi);
}
and we connect to HBase:
….
foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // var ugi: UserGroupInformation = Utils.getHbaseConnection(properties)._2
    rdd.foreachPartition { partition =>
      val connection = Utils.getHbaseConnection(propsObj)._1
      val table = …
      partition.foreach { json =>
      }
      table.put(puts)
      table.close()
      connection.close()
    }
  }
}
The keytab file is not getting copied to the YARN staging/temp directory, so we are not getting it via SparkFiles.get…, and if we pass the keytab with --files, spark-submit fails because it is already given via --keytab.
The error is:
This server is in the failed servers list: myserver.test.com/120.111.25.45:60020
RpcRetryingCaller{globalStartTime=1497943263013, pause=100, retries=5}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: myserver.test.com/120.111.25.45:60020
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:147)
  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:935)

Where to check my YARN + Spark application's logs?

I wrote an application with YARN + Spark; for simplicity, I list the following:
object testKafkaSparkStreaming extends Logging {
  private class Parser extends Logging {
    def parse(row: String): Row = {
      val row = "/_dc.gif| |20160616063934| |39.190.5.69| |729252016040907094857083a3c7c62e"
      logInfo("pengcz starting parse " + row)
    }
  }

  def main(args: Array[String]) {
    ...
    logInfo("main starting parse " + row)
    ...
  }
}
When I executed:
spark-submit --master local[*] --class $CLASS $JAR ...
I can see the two log lines in the console.
But when I executed:
spark-submit --master yarn --class $CLASS $JAR ...
And I opened the YARN web UI of my application:
http://192.168.36.172:8088/cluster/app/application_1465894400511_3624
And when I click logs on that page, it contains no information, including my two log lines.
What should I do to find my own logs?
Any advice will be appreciated!
You can use this command to see the YARN logs of an application:
yarn logs -applicationId <your YARN application ID>   # e.g. application_1465894400511_3624
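For example (using the application ID from the URL above, and assuming YARN log aggregation is enabled on the cluster), the output can be saved to a file and searched for the two log lines from the code:
yarn logs -applicationId application_1465894400511_3624 > app.log
grep "starting parse" app.log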

Yarn support on Ooyala Spark JobServer

Just started experimenting with the JobServer and would like to use it in our production environment.
We usually run spark jobs individually in yarn-client mode and would like to shift towards the paradigm offered by the Ooyala Spark JobServer.
I am able to run the WordCount examples shown on the official page.
I tried submitting our custom Spark job to the Spark JobServer and got this error:
{
  "status": "ERROR",
  "result": {
    "message": "null",
    "errorClass": "scala.MatchError",
    "stack": ["spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:220)",
      "scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)",
      "scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)",
      "akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)",
      "akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)",
      "scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)",
      "scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)",
      "scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)",
      "scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)"]
  }
}
I had made the necessary code modifications like extending SparkJob and implementing the runJob() method.
This is the dev.conf file that I used:
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  master = "yarn-client"

  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4

  jobserver {
    port = 8090
    jar-store-rootdir = /tmp/jobserver/jars
    jobdao = spark.jobserver.io.JobFileDAO
    filedao {
      rootdir = /tmp/spark-job-server/filedao/data
    }
    context-creation-timeout = "60 s"
  }

  contexts {
    my-low-latency-context {
      num-cpu-cores = 1
      memory-per-node = 512m
    }
  }

  context-settings {
    num-cpu-cores = 2
    memory-per-node = 512m
  }

  home = "/data/softwares/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041"
}

spray.can.server {
  parsing.max-content-length = 200m
}

spark.driver.allowMultipleContexts = true
YARN_CONF_DIR=/home/spark/conf/
Also, how can I pass run-time parameters to the Spark job, such as --files and --jars?
For example, I usually run our custom spark job like this:
./spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041/bin/spark-submit --class com.demo.SparkDriver --master yarn-cluster --num-executors 3 --jars /tmp/api/myUtil.jar --files /tmp/myConfFile.conf,/tmp/mySchema.txt /tmp/mySparkJob.jar
The number of executors and extra jars are passed in a different way, through the config file (see the dependent-jar-uris config setting).
YARN_CONF_DIR should be set in the environment and not in the .conf file.
As for other issues, the Google group is the right place to ask. You may want to search it for yarn-client issues, as several other folks have figured out how to get it to work.
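For illustration, a sketch of how the extra jar from the spark-submit example above could be declared through the config instead of --jars (only dependent-jar-uris comes from the answer; the surrounding values are copied from the dev.conf shown earlier):
context-settings {
  num-cpu-cores = 2
  memory-per-node = 512m
  # replaces --jars /tmp/api/myUtil.jar from the spark-submit invocation
  dependent-jar-uris = ["file:///tmp/api/myUtil.jar"]
}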
