How to log in foreachRDD in streaming application? - apache-spark

we are working on a application consuming kafka messages using spark streaming ..running on a hadoop/yarn spark cluster ....i have log4j properties confiured both on driver and the worker's ...but i still don't see the log messages inside the foreachRDD ..i do see the "start for each rdd" and "end of for each rdd"
val broadcaseLme=sc.broadcast(lme)
logInfo("start for each rdd: ")
val lines: DStream[MetricTypes.InputStreamType] = myConsumer.createDefaultStream()
lines.foreachRDD(rdd => {
if ((rdd != null) && (rdd.count() > 0) && (!rdd.isEmpty()) ) {
**logInfo("filteredLines: " + rdd.count())**
**logInfo("start loop")**
rdd.foreach{x =>
val lme = broadcastLme.value
lme.aParser(x).get
}
logInfo("end loop")
} })
logInfo("end of for each rdd ")
lines.print(10)
I'm running the application on the cluster using this
spark-submit --verbose --class DevMain --master yarn-cluster --deploy-mode cluster --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.p‌​roperties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j‌​.properties" --files "hdfs://hdfs-name-node:8020/user/hadoopuser/log4j.properties‌​" hdfs://hdfs-name-node:8020/user/hadoopuser/streaming_2.10-1.‌​0.0-SNAPSHOT.jar hdfs://hdfs-name-node:8020/user/hadoopuser/enriched.properti‌​es
I'm new to spark could some one please help why i'm not seeing the log messages inside the foreachrdd this is the log4j.properties
log4j.rootLogger=WARN, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%p] %d %c %M - %m%n
log4j.appender.rolling.maxFileSize=100MB
log4j.appender.rolling.maxBackupIndex=10
log4j.appender.rolling.file=${spark.yarn.app.container.log.dir}/titanium-spark-enriched.log
#log4j.appender.rolling.encoding=URF-8
log4j.logger.org.apache,spark=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.com.x1.projectname=INFO
#log4j.appender.console=org.apache.log4j.ConsoleAppender
#log4j.appender.console.target=System.err
#log4j.appender.console.layout=org.apache.log4j.PatternLayout
#log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
#log4j.logger.org.spark-project.jetty=WARN
#log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
#log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
#log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
#log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender
#log4j.appender.RollingAppender.File=./logs/spark/enriched.log
#log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd
#log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
#log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n
#log4j.rootLogger=INFO, RollingAppender, console

The problem with the Spark Streaming application seems to be with a missing start after you've built your streaming computation, i.e.
ssc.start()
Quoting the scaladoc of StreamingContext:
After creating and transforming DStreams, the streaming computation can be started and stopped using context.start() and context.stop(), respectively.
context.awaitTermination() allows the current thread to wait for the termination of the context by stop() or by an exception.

Related

Dataproc: limit log size for long-running / streaming Spark jobs

I've a Spark Structured Streaming job on GCP Dataproc - which picks up data from Kafka, does processing and pushes data back into kafka topics.
Couple of questions :
Does Spark put all the log (incl. INFO, WARN etc) into stderr ?
What I notice is that stdout is empty, while all the logging is put in to stderr
Is there a way for me to expire the data in stderr (i.e. expire the older logs) ?
Since I've a long running streaming job, the stderr gets filled up over time and nodes/VMs become unavailable.
Pls advice.
Here is output of the yarn logs command :
root#versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:25:34,876 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:25:35,144 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:25:35 +0000 2022
LogLength:43251469683
LogContents:
applianceName=usa-isn0784-rt01, tenantName=NOV, mstatsTimeBlock=1663507200, tenantId=2, vsnId=0, mstatsTotSentOctets=11596, mstatsTotRecvdOctets=24481, mstatsTotSessDuration=300000, mstatsTotSessCount=1, mstatsType=sdwan-acc-ckt-app-stats, appId=https, site=usa-isn0784-rt01, accCkt=WAN-DIA, siteId=442, accCktId=1, user=10.126.117.196, risk=3, productivity=3, family=general-internet, subFamily=web, bzTag=Unknown,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa type(row) is -> <class 'str'>
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************
Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
***********************************************************************
root#versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:26:01,439 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:26:01,696 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:26:02 +0000 2022
LogLength:44309782124
LogContents:
, tenantId=3, vsnId=0, mstatsTotSentOctets=48210, mstatsTotRecvdOctets=242351, mstatsTotSessDuration=300000, mstatsTotSessCount=34, mstatsType=dest-stats, destIp=165.225.216.24, mstatsAttribs=,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa type(row) is -> <class 'str'>
22/09/19 23:26:02 WARN org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************
Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
Update :
Based on #Dagang's note, i'm using the RollingFileAppender in the log4j.properties .. and the new log file is getting created. However - some data is still getting into std err.
Here is the updated code :
spark-submit
gcloud dataproc jobs submit pyspark process-appstat.py \
--cluster $CLUSTER \
--properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.dynamicAllocation.executorIdleTimeout=120s#spark.shuffle.service.enabled=true#spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-executor.properties#spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j-driver.properties\
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.3.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar,gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar,gs://dataproc-spark-jars/bson-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar \
--files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani-noacl.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/alarm-compression-user-test.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/appstats-user-test.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/insights-user-test.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/intfutil-user-test.p12,gs://dataproc-spark-configs/metrics.properties,gs://dataproc-spark-configs/params.cfg,gs://kafka-certs/appstat-anomaly-user.p12,gs://kafka-certs/appstat-anomaly-user-test.p12,gs://kafka-certs/appstat-agg-user.p12,gs://kafka-certs/appstat-agg-user-test.p12,gs://kafka-certs/alarmblock-user.p12,gs://kafka-certs/alarmblock-user-test.p12,gs://kafka-certs/versa-alarmblock-test-user.p12,gs://kafka-certs/versa-bandwidth-test-user.p12,gs://kafka-certs/versa-appstat-test-user.p12,gs://kafka-certs/versa-alarmblock-user.p12,gs://kafka-certs/versa-bandwidth-user.p12,gs://kafka-certs/versa-appstat-user.p12,gs://dataproc-spark-configs/log4j-executor.properties,gs://dataproc-spark-configs/log4j-driver.properties \
--region $REGION \
--py-files streams.zip,utils.zip \
-- isdebug=$isdebug
log4j-executor.properties:
--------------------------
# Set everything to be logged to the console
# log4j.rootCategory=INFO, console
# log4j.appender.console=org.apache.log4j.ConsoleAppender
# log4j.appender.console.target=System.err
# log4j.appender.console.layout=org.apache.log4j.PatternLayout
# log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# logging to rolling_file, using RolligFileAppender
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/versa-ss-executor.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.eclipse.jetty=WARN
# Allow INFO logging from Spark Env for EFM
log4j.logger.org.apache.spark.SparkEnv=INFO
# Spark 3.x
log4j.logger.org.sparkproject.jetty.server.handler.ContextHandler=WARN
# Spark 2.x
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=WARN
# Reduce verbosity for other spammy core classes
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=WARN
log4j.logger.org.apache.spark.ExecutorAllocationManager=ERROR
log4j.logger.org.apache.spark=WARN
log4j-driver.properties:
-------------------------
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/versa-ss-driver.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.eclipse.jetty=WARN
# Allow INFO logging from Spark Env for EFM
log4j.logger.org.apache.spark.SparkEnv=INFO
# Spark 3.x
log4j.logger.org.sparkproject.jetty.server.handler.ContextHandler=WARN
# Spark 2.x
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=WARN
# Reduce verbosity for other spammy core classes
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=WARN
log4j.logger.org.apache.spark.ExecutorAllocationManager=ERROR
log4j.logger.org.apache.spark=WARN
any ideas on what needs to be done for this ?
Question on -> ${spark.yarn.app.container.log.dir}
What location does this get translated to ?
when i logon worker node and check this, i get the following :
karanalang#versa-structured-stream-v1-w-0:~$ echo $spark.yarn.app.container.log.dir
.yarn.app.container.log.dir
In yarn-site.xml:
Here are the relevant configs:
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/hadoop/yarn/nm-local-dir</value>
<description>
Directories on the local machine in which to application temp files.
</description>
</property>
<property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>gs://dataproc-temp-us-east1-939354532596-4ln8c3y1/fe57047f-13d9-4b9b-8bce-baa4a911aa65/yarn-logs</value>
<description>
The remote path, on the default FS, to store logs.
</description>
</property>
However the logs are in the location below:
root#versa-structured-stream-v1-w-0:/# find . -name versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0002/container_1664926662510_0002_01_000001/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000179/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000250/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000299/versa-ss-executor.log
where is the location - ./var/log/hadoop-yarn/userlogs - taken from (it is not in yarn-site.sml)?
Short answer:
You can use a custom log4j config with RollingFileAppender to limit the log size for long-running jobs.
Long answer:
The default log4j config for Spark on Dataproc is at /etc/spark/conf/log4j.properties. It configures root logger to stderr at INFO level. But at runtime driver logs (in client mode) will be directed by the Dataproc agent to GCS and streamed back to the client, and executor logs (and driver logs in cluster mode) will be redirected by YARN to the stderr file in the container's YARN log dir. Consider using /etc/spark/conf/log4j.properties as the template for your custom config.
In your custom config, you can configure logs to be written to a RollingFileAppender, e.g.,
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/my_app.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
...
Note that for executors (and drivers in cluster mode), the value of log4j.appender.rolling_file.File needs to be a path under ${spark.yarn.app.container.log.dir}, see this question and this doc.
Upload your log4j config(s) to a GCS bucket, driver and executor may or may not share the same config. In your case, you might want to update executor log4j config only, just use the default for driver.
Then submit the job with the custom log4j config with one of the following ways:
The file name must be log4j.properties, driver and executor will share the same config:
gcloud dataproc jobs submit spark ... \
--files gs://my-bucket/log4j.properties
The file name doesn't have to be log4j.properties, driver and executor can have different config:
gcloud dataproc jobs submit spark ... \
--files gs://my-bucket/my-log4j.properties \
--properties 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties'
The expectation is that there will be rolling logs under the YARN container log dirs (configurable through yarn.nodemanager.log-dirs with default value /var/log/hadoop-yarn/userlogs on Dataproc) for the Spark executors, they will be automatically aggregated and stored in GCS and Cloud Logging.

Spark streaming with kafka(ssl enabled)

I have seen multiple questions regarding this topic but unable to find an answer for it? I have tried to add comments and have not got any response yet.
Requirement -> spark-streaming with Kafka SSL enabled
Below are the version details
I am running spark locally.
Spark -> version 2.4.6
kafka version -> 2.2.1
code snippet:
import sys
import logging
from datetime import datetime
try:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
ssc = StreamingContext(sc, 20)
brokers = "b1:p,b2:p"
topic = "topic1"
kafkaParams = {"security.protocol":"SSL","metadata.broker.list": brokers}
kvs = KafkaUtils.createDirectStream(ssc, [topic],kafkaParams)
lines = kvs.map(lambda x: x[1])
print(lines)
counts = lines.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start()
ssc.awaitTermination()
except ImportError as e:
print("Error importing Spark Modules :", e)
sys.exit(1)
When I am submitting it as below.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.10:2.0.0 --master local pysparkKafka.py
This is due to the reference I took from this
Enabling SSL between Apache spark and Kafka broker
I am getting the below error
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.6 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.6.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
Now when I submit it as below. This is due to seeing the above error and picking a jar from maven.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 --master local pysparkKafka.py
I am getting the below error with the SSL port of the brokers.
20/07/24 11:58:46 INFO VerifiableProperties: Verifying properties
20/07/24 11:58:46 INFO VerifiableProperties: Property group.id is overridden to
20/07/24 11:58:46 WARN VerifiableProperties: Property security.protocol is not valid
20/07/24 11:58:46 INFO VerifiableProperties: Property zookeeper.connect is overridden to
20/07/24 11:58:46 INFO SimpleConsumer: Reconnect due to socket error: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
20/07/24 11:58:47 INFO SimpleConsumer: Reconnect due to socket error: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
Traceback (most recent call last):
Note -> The above submit works well without SSL. But my requirement is to enable SSL.
Any help is greatly appreciated. Thanks.

Pyspark Structured streaming locally with Kafka-Jupyter

After looking at the other answers i still cant figure it out.
I am able to use kafkaProducer and kafkaConsumer to send and receive a messages from within my notebook.
producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'],value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr',bootstrap_servers=['127.0.0.1:9092'],group_id='abc' )
I've tried to connect to the stream with both spark context and spark session.
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
Which gives me this error
Spark Streaming's Kafka libraries not found in class path. Try one
of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-
kafka-0-8:2.3.2 ...
It seems that i needed to add the JAR to my
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do i put in?
How do i get Pyspark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar, and trying to find pyspark-shell as a Java class inside of that.
As the first error says, you missed a --packages after spark-submit, which means you would do
spark-submit --packages ... someApp.jar com.example.YourClass
If you are just locally in Jupyter, you may want to try Kafka-Python, for example, rather than PySpark... Less overhead, and no Java dependencies.

Spark Streaming - Netcat messages are not received in Spark streaming

i am trying to test spark streaming. i have stand alone cloudera quickstart vm. started the spark-shell with the following command:
spark-shell --master yarn-client --conf spark.ui.port=23123
In the spark-shell i have executed the following statements:
sc.stop()
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
val conf = new SparkConf().setAppName("Spark Streaming")
val ssc = new StreamingContext(conf,org.apache.spark.streaming.Seconds(10))
val lines = ssc.socketTextStream("localhost",44444)
lines.print
In another terminal started the netcat service with the following command:
nc -lk 44444
In the spark-shell started the streaming context
ssc.start()
till now everything is fine. But, whatever the messages typed in the Netcat service are not received in Spark streaming.don't know where it is going wrong.
try spark-shell --master local[2] --conf spark.ui.port=23123 to see if it works.
If it works, then in your script, there is only one executor working, which is receiving message, but no executor is processing message.

Spark output when submitting via Yarn cluster vs. client

I am new to Spark and just got it running on my cluster (Spark 2.0.1 on a 9 node cluster running Community version of MapR). I submit the wordcount example via
./bin/spark-submit --master yarn --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md
and get the following output
17/04/07 13:21:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
: 68
help: 1
when: 1
Hadoop: 3
...
Looks like everything is working properly. When I add --deploy-mode cluster I get the following output:
17/04/07 13:23:52 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
So no errors but I am not seeing the wordcount results. What am I missing? I see the job in my History Server and it says it completed successfully. Also I checked my user directory in DFS but no new files were written except for this empty directory: /user/myuser/.sparkStaging
Code (wordcount.py example shipped with Spark):
from __future__ import print_function
import sys
from operator import add
from pyspark.sql import SparkSession
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: wordcount <file>", file=sys.stderr)
exit(-1)
spark = SparkSession\
.builder\
.appName("PythonWordCount")\
.getOrCreate()
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
counts = lines.flatMap(lambda x: x.split(' ')) \
.map(lambda x: (x, 1)) \
.reduceByKey(add)
output = counts.collect()
for (word, count) in output:
print("%s: %i" % (word, count))
spark.stop()
The reason for your output not printing is:
When you run in spark-client mode then the node on which you are initiating the job is the DRIVER and when you collect the result it is collected on that node and you print it.
In yarn-cluster mode your driver is some other node not the one through which you initiated the job. So when you call the .collect function the result is collected on that and printed on that node. You can find the result being printed in the sys-out of the driver.
A better approach would be to write the output somewhere in HDFS.
The reason for your spark.yarn.jars warning is:
In order to run a spark job yarn needs some binaries available on all the nodes of the cluster if these binaries are not available then as a part of job preparation, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
To solve this :
By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable(chmod 777) location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set spark.yarn.jars to hdfs:///some/path.
After placing your jars run your code like :
./bin/spark-submit --master yarn --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md --conf spark.yarn.jars="hdfs:///some/path"
Source : http://spark.apache.org/docs/latest/running-on-yarn.html

Resources