How do I manage print output in Spark jobs? - apache-spark

I'd like to view the output of print statements in my Spark applications, which use Python/PySpark. Am I correct that these outputs aren't considered part of logging? I changed my conf/log4j.properties file to output to a specific file, but only the INFO and other log messages are written to the designated log file.
How do I go about directing the output from print statements to a file? Do I have to use the typical shell redirection, like this: /usr/bin/spark-submit --master yarn --deploy-mode client --queue default /home/hadoop/app.py > /home/hadoop/output?
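A minimal sketch of that redirection, under the assumption of client deploy mode with the default log4j configuration (whose console appender targets stderr, so Spark's own log messages go to stderr while the driver's Python print output goes to stdout); the output file names are placeholders:

# Capture driver-side print output (stdout) separately from Spark's log4j output (stderr)
/usr/bin/spark-submit --master yarn --deploy-mode client --queue default \
    /home/hadoop/app.py > /home/hadoop/print_output.txt 2> /home/hadoop/spark_log.txt

# Or capture both streams in a single file
/usr/bin/spark-submit --master yarn --deploy-mode client --queue default \
    /home/hadoop/app.py > /home/hadoop/output 2>&1

Note that print calls executed inside tasks run on the executors, so their output lands in the executor stdout logs (reachable through the Spark/YARN UI or yarn logs), not in the driver's stdout.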

Related

How to get spark-submit logging result

When I submit a Spark job from the terminal, it produces logging output in the terminal, like in the attached image.
How can I capture that output and assign it to a value or object?
You can redirect the output of the spark-submit command to a log file like this:
spark-submit ... > log.txt
You can also view the application logs through YARN, and you can always write them to text files:
yarn logs -applicationId <application ID> [OPTIONS]
The application ID can be found through the cluster UI.
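For example (the application ID below is a made-up placeholder; substitute the one shown in the cluster UI or printed by spark-submit):

# Write the aggregated YARN container logs for a finished application to a text file
yarn logs -applicationId application_1234567890123_0001 > application_logs.txt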

Monitoring spark structured streaming with influxdb and grafana

I want to monitor Spark Structured Streaming with InfluxDB and Grafana.
I wrote a file called "streamingspark.py" that reads data from a Kafka topic and writes it to another Kafka topic. I then sent the data to InfluxDB through Logstash, and finally pulled it into Grafana and visualized it.
With that working, I now want to monitor the Spark Structured Streaming job itself.
Based on the instructions laid out in this website:
https://www.linkedin.com/pulse/monitoring-spark-streaming-influxdb-grafana-christian-g%C3%BCgi/
I added the following lines to my Spark Dockerfile.
RUN wget https://repo1.maven.org/maven2/com/izettle/metrics-influxdb/1.1.8/metrics-influxdb-1.1.8.jar && \
    mv metrics-influxdb-1.1.8.jar /opt/spark/jars
RUN wget https://repo1.maven.org/maven2/com/palantir/spark/influx/spark-influx-sink/0.4.0/spark-influx-sink-0.4.0.jar && \
    mv spark-influx-sink-0.4.0.jar /opt/spark/jars
and have added the following files to the /opt/spark/jars directory:
metrics-influxdb-1.1.8.jar
spark-influx-sink-0.4.0.jar
In /opt/spark/conf I have added a metrics.properties file:
(screenshot of the metrics.properties contents)
Then, in the [opt/spark/bin] directory, I tried:
spark-submit --master spark://spark-master:7077 --deploy-mode cluster \
  /opt/spark/conf/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  --jars /opt/spark/jars/metrics-influxdb-1.1.8.jar, /opt/spark/jars/spark-influx-sink-0.4.0.jar \
  --conf spark.driver.extraClassPath=spark-influx-sink-0.4.0.jar:metrics-influxdb-1.1.8.jar \
  --conf spark.executor.extraClassPath=spark-influx-sink.jar:metrics-influxdb-1.1.8.jar \
  /spark-work/streamingspark.py
Then I received:
(screenshot of the error message after the spark-submit command)
Question
--class: I'm still not sure what to put in for --class. When I wanted to run the "streamingspark.py" file, I just used the following command.
./bin/spark-submit --master spark://spark-master:7077 /spark-work/streamingspark.py
Is it feasible to send metrics data to InfluxDB and Grafana with the settings I currently have (the jar files and metrics.properties)?
This is my first time writing a question on Stack Overflow, so I apologize in advance if the format is off.
GO IU!
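A hedged sketch related to the --class question: per spark-submit's help text, --class names the main class of a Java/Scala application, so it is simply omitted when submitting a Python file, and a configuration file such as metrics.properties is more commonly shipped with --files than passed as the application argument. The command below keeps the asker's hosts and paths but is only an illustration of that shape, not a claim that it resolves the reported error (note also that, per the Spark docs, standalone cluster deploy mode does not support Python applications, so the default client mode is assumed here):

# --class is omitted: for PySpark, the .py file itself is the application
spark-submit --master spark://spark-master:7077 \
  --files /opt/spark/conf/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  --jars /opt/spark/jars/metrics-influxdb-1.1.8.jar,/opt/spark/jars/spark-influx-sink-0.4.0.jar \
  --conf spark.driver.extraClassPath=/opt/spark/jars/metrics-influxdb-1.1.8.jar:/opt/spark/jars/spark-influx-sink-0.4.0.jar \
  --conf spark.executor.extraClassPath=/opt/spark/jars/metrics-influxdb-1.1.8.jar:/opt/spark/jars/spark-influx-sink-0.4.0.jar \
  /spark-work/streamingspark.py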

spark-shell avoid typing spark.sql(""" query """)

I use spark-shell a lot, and often it is just to run SQL queries against a database. The only way to run SQL queries there is by wrapping them in spark.sql(""" query """).
Is there a way to switch to spark-sql directly and avoid the wrapper code? For example, when using beeline we get a direct SQL interface.
The spark-sql CLI is available with the Spark package:
$SPARK_HOME/bin/spark-sql
$ spark-sql
spark-sql> select 1-1;
0
Time taken: 6.368 seconds, Fetched 1 row(s)
spark-sql> select 1=1;
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql>
Notes:
The Spark SQL CLI cannot talk to the Thrift JDBC server.
Configuration of Hive is done by placing your hive-site.xml, core-site.xml and hdfs-site.xml files in conf/.
spark-sql --help
Usage: ./bin/spark-sql [options] [cli option]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
CLI options:
-d,--define <key=value> Variable subsitution to apply to hive
commands. e.g. -d A=B or --define A=B
--database <databasename> Specify the database to use
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)
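Using the CLI options listed above, queries can also be run non-interactively; the database name and file paths below are placeholders:

# Run a single statement from the command line
$SPARK_HOME/bin/spark-sql -e "SELECT count(*) FROM my_db.my_table"

# Run all statements from a SQL file against a chosen database
$SPARK_HOME/bin/spark-sql --database my_db -f /path/to/queries.sql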

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using the SparkContext (sc) from spark-shell in the terminal. For some reason, going through the IntelliJ application and Spark Streaming is not working. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So the method expects the path to a directory as its parameter.
So I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark Streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure that in the spark-submit command you give only the directory name, not the file name. Below is a sample command; here I am passing the directory name to the application as its first argument. You can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt
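As a sketch of the ordering described above (the class name, jar, and paths here are placeholders): start the streaming job first, pointing it at a directory, and only then move a new file into that directory, since textFileStream only picks up files that appear there after the job starts.

# 1. Start the streaming job against a directory (not a single file)
spark-submit --class com.example.HdfsWordCount /path/to/streaming-app.jar \
  file:///Users/userName/Documents/Notes/MoreNotes/ &

# 2. Once it is running, move (rather than append to) a new file into the monitored directory
mv /tmp/sampleFile /Users/userName/Documents/Notes/MoreNotes/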

can't add alluxio.security.login.username to spark-submit

I have a Spark driver program for which I'm trying to set the Alluxio user.
I read this post: How to pass -D parameter or environment variable to Spark job? and although helpful, none of the methods in there seem to do the trick.
My environment:
- Spark-2.2
- Alluxio-1.4
- packaged jar passed to spark-submit
The spark-submit job is being run as root (under supervisor), and alluxio only recognizes this user.
Here's where I've tried adding "-Dalluxio.security.login.username=alluxio":
spark.driver.extraJavaOptions in spark-defaults.conf
on the command line for spark-submit (using --conf)
within the sparkservices conf file of my jar application
within a new file called "alluxio-site.properties" in my jar application
None of these sets the user for Alluxio, though I'm easily able to set this property in a different (non-Spark) client application that also writes to Alluxio.
Anyone able to make this setting apply in spark-submit jobs?
If spark-submit is in client mode, you should use --driver-java-options instead of --conf spark.driver.extraJavaOptions=... in order for the driver JVM to be started with the desired options. Therefore your command would look something like:
./bin/spark-submit ... --driver-java-options "-Dalluxio.security.login.username=alluxio" ...
This should start the driver with the desired Java options.
If the Spark executors also need the option, you can set that with:
--conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio"
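Putting the two together, a sketch of a full submission for a packaged jar like the asker's (the main class and jar path are placeholders):

./bin/spark-submit \
  --class com.example.MyAlluxioDriver \
  --driver-java-options "-Dalluxio.security.login.username=alluxio" \
  --conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio" \
  /path/to/my-application.jar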
