I'm trying to get off the ground with Spark and Kubernetes but I'm facing difficulties. I used the helm chart here:
https://github.com/bitnami/charts/tree/main/bitnami/spark
I have 3 workers and they all report running successfully. I'm trying to run the following program remotely:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
df = spark.read.json('people.json')
Here's the part that's not entirely clear. Where should the file people.json actually live? I have it locally where I'm running the python code and I also have it on a PVC that the master and all workers can see at /sparkdata/people.json.
When I run the 3rd line as simply 'people.json' then it starts running but errors out with:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
If I run it as '/sparkdata/people.json' then I get
pyspark.sql.utils.AnalysisException: Path does not exist: file:/sparkdata/people.json
Not sure where I go from here. To be clear I want it to read files from the PVC. It's an NFS share that has the data files on it.
Your people.json file needs to be accessible to your driver + executor pods. This can be achieved in multiple ways:
having some kind of network/cloud drive that each pod can access
mounting volumes on your pods, and then uploading the data to those volumes using --files in your spark-submit.
The latter option might be the simpler one to set up. This page discusses in more detail how you could do this, but to get straight to the point: if you add the following arguments to your spark-submit, you should be able to get your people.json onto your driver and executors (you just have to choose sensible values for the $VAR variables in there):
--files people.json \
--conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
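Since you mention the data is already on a PVC that the master and workers mount at /sparkdata, the first option (a shared volume every pod can see) may be even simpler. Below is a minimal sketch of what the volume confs could look like for a persistentVolumeClaim volume type, assuming a hypothetical claim name spark-data-pvc, a hypothetical application file my_app.py, and that you submit in Kubernetes-native cluster mode (a k8s:// master); note that for persistentVolumeClaim the option is options.claimName rather than options.path:
spark-submit \
  --master k8s://https://<k8s-api-server>:<port> \
  --deploy-mode cluster \
  --name read-from-pvc \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/sparkdata \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-data-pvc \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/sparkdata \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-data-pvc \
  my_app.py
With the volume mounted on both driver and executors, my_app.py can read the file with an explicit local-file scheme, e.g. spark.read.json("file:///sparkdata/people.json").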
You can always verify the existence of your data by going inside of the pods themselves like so:
kubectl exec -it <driver/executor pod name> -- bash
(now you should be inside of a bash process in the pod)
cd <mount-path-you-chose>
ls -al
That last ls -al command should show you a people.json file in there (after having done your spark-submit of course).
Hope this helps!
I want to monitor Spark Structured Streaming with InfluxDB and Grafana.
I wrote a file called "streamingspark.py" that reads data from a Kafka topic and writes it to another Kafka topic. Then I sent the data to InfluxDB through Logstash, and finally visualized it in Grafana.
With that working, I wanted to monitor the Spark Structured Streaming job itself.
Based on the instructions laid out in this website:
https://www.linkedin.com/pulse/monitoring-spark-streaming-influxdb-grafana-christian-g%C3%BCgi/
I added the following to my Spark Dockerfile:
RUN wget https://repo1.maven.org/maven2/com/izettle/metrics-influxdb/1.1.8/metrics-influxdb-1.1.8.jar && \
    mv metrics-influxdb-1.1.8.jar /opt/spark/jars
RUN wget https://repo1.maven.org/maven2/com/palantir/spark/influx/spark-influx-sink/0.4.0/spark-influx-sink-0.4.0.jar && \
    mv spark-influx-sink-0.4.0.jar /opt/spark/jars
and have added the following files to /opt/spark/jars:
metrics-influxdb-1.1.8.jar
spark-influx-sink-0.4.0.jar
In /opt/spark/conf I have added a metrics.properties file:
(screenshot: contents of the metrics.properties file)
Then, from /opt/spark/bin, I tried:
spark-submit --master spark://spark-master:7077 --deploy-mode cluster /opt/spark/conf/metrics.properties --conf spark.metrics.conf=metrics.properties --jars /opt/spark/jars/metrics-influxdb-1.1.8.jar, /opt/spark/jars/spark-influx-sink-0.4.0.jar --conf spark.driver.extraClassPath=spark-influx-sink-0.4.0.jar:metrics-influxdb-1.1.8.jar --conf spark.executor.extraClassPath=spark-influx-sink.jar:metrics-influxdb-1.1.8.jar /spark-work/streamingspark.py
Then I received:
(screenshot: error message after running spark-submit)
Question
--class: I'm still not sure what to put in for --class. When I wanted to run the "streamingspark.py" file, I just used the following command:
./bin/spark-submit --master spark://spark-master:7077 /spark-work/streamingspark.py
Is it feasible to send metrics data to InfluxDB and Grafana with the current setup I described (the jar files and metrics.properties)?
This is my first time writing a question on Stack Overflow... I'm sorry if the format of my question is ridiculous; I want to apologize beforehand.
GO IU!
I use spark-shell a lot, and often it is just to run SQL queries against a database. The only way to run SQL queries there is by wrapping them in spark.sql(""" query """).
Is there a way to switch to spark-sql directly and avoid the wrapper code? E.g. when using beeline, we get a direct SQL interface.
The spark-sql CLI is available with the Spark package:
$SPARK_HOME/bin/spark-sql
$ spark-sql
spark-sql> select 1-1;
0
Time taken: 6.368 seconds, Fetched 1 row(s)
spark-sql> select 1=1;
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql>
Notes:
Spark SQL CLI cannot talk to the Thrift JDBC server
Configuration of Hive is done by placing your hive-site.xml, core-site.xml and hdfs-site.xml files in conf/
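For non-interactive use, the CLI also accepts -e and -f (documented under "CLI options" in the help output below). A quick sketch, assuming a hypothetical queries.sql file, a hypothetical database mydb, and a standalone master URL:
# run a single statement and exit
$SPARK_HOME/bin/spark-sql -e "SELECT 1 = 1"
# run all statements from a file
$SPARK_HOME/bin/spark-sql -f queries.sql
# pick the cluster and default database explicitly
$SPARK_HOME/bin/spark-sql --master spark://host:7077 --database mydb -e "SHOW TABLES"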
spark-sql --help
Usage: ./bin/spark-sql [options] [cli option]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
CLI options:
-d,--define <key=value> Variable subsitution to apply to hive
commands. e.g. -d A=B or --define A=B
--database <databasename> Specify the database to use
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)
I am currently trying to deploy a spark example jar on a Kubernetes cluster running on IBM Cloud.
When I follow these instructions to deploy Spark on a Kubernetes cluster, I am not able to launch Spark Pi, because I always get the error message:
The system cannot find the file specified
after entering the command:
bin/spark-submit \
--master k8s://<url of my kubernetes cluster> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///examples/jars/spark-examples_2.11-2.3.0.jar
I am in the right directory with the spark-examples_2.11-2.3.0.jar file in the examples/jars directory.
Ensure your .jar file is present inside the container image.
The instructions say it should be there:
Finally, notice that in the above example we specify a jar with a
specific URI with a scheme of local://. This URI is the location of
the example jar that is already in the Docker image.
In other words, the local:// scheme is stripped from local:///examples/jars/spark-examples_2.11-2.3.0.jar, and the path /examples/jars/spark-examples_2.11-2.3.0.jar is expected to be available inside the container image.
Please make sure this absolute path /examples/jars/spark-examples_2.11-2.3.0.jar exists in the image.
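If in doubt, you can inspect the image directly. A quick check, assuming Docker is available on the machine you submit from and <spark-image> is the image you pass to spark-submit (the second path is just a guess for images that keep Spark under /opt/spark):
# list the example jars baked into the image
docker run --rm <spark-image> ls -l /examples/jars/
# many Spark images keep them under /opt/spark instead
docker run --rm <spark-image> ls -l /opt/spark/examples/jars/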
Or, if you are trying to load a jar from the current directory, it would have to be a relative path like local://./examples/jars/spark-examples_2.11-2.3.0.jar; I'm not sure whether spark-submit accepts relative paths or not.
When running a Spark Shell query using something like this:
spark-shell yarn --name myQuery -i ./my-query.scala
Inside my-query.scala is a simple Spark SQL query where I read parquet files, run simple queries, and write out parquet files. When running these queries I get a nice progress bar like this:
[Stage7:===========> (14174 + 5) / 62500]
When I create a jar using the exact same query and run it with the following command-line:
spark-submit \
--master yarn-cluster \
--driver-memory 16G \
--queue default \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--name MyQuery \
--class com.data.MyQuery \
target/uber-my-query-0.1-SNAPSHOT.jar
I don't get any such progress bar. The command simply says repeatedly
17/10/20 17:52:25 INFO yarn.Client: Application report for application_1507058523816_0443 (state: RUNNING)
The query works fine and the results are fine, but I need feedback on when the process will finish. I have tried the following:
The web page of RUNNING Hadoop applications does have a progress bar, but it basically never moves. Even in the case of the spark-shell query, that progress bar is useless.
I have tried to get the progress through the YARN logs, but they are not aggregated until the job is complete, and even then there is no progress bar in them.
Is there a way to launch a Spark query from a jar on a cluster and still get a progress bar?
When I create a jar using the exact same query and run it with the following command-line (...) I don't get any such progress bar.
The difference between these two seemingly similar Spark executions is the master URL.
In the former Spark execution with spark-shell yarn, the master is YARN in client deploy mode, i.e. the driver runs on the machine where you start spark-shell from.
In the latter Spark execution with spark-submit --master yarn-cluster, the master is YARN in cluster deploy mode (which is actually equivalent to --master yarn --deploy-mode cluster), i.e. the driver runs on a YARN node.
With that said, you won't get the nice progress bar (which is actually called ConsoleProgressBar) on the local machine but on the machine where the driver runs.
A simple solution is to replace yarn-cluster with yarn.
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr.
The progress includes the stage id, the number of completed, active, and total tasks.
ConsoleProgressBar is created when the spark.ui.showConsoleProgress Spark property is turned on and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out, and so there is "space" for ConsoleProgressBar).
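Putting that together, a sketch of the same submission in client deploy mode with the console progress bar explicitly enabled (reusing the jar, class and resource settings from the question):
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.ui.showConsoleProgress=true \
  --driver-memory 16G \
  --queue default \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 32G \
  --name MyQuery \
  --class com.data.MyQuery \
  target/uber-my-query-0.1-SNAPSHOT.jar
The ConsoleProgressBar output then appears on stderr of the machine you run spark-submit from, since that is where the driver now lives.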
You can find more information in Mastering Apache Spark 2's ConsoleProgressBar.
I have a spark driver program which I'm trying to set the alluxio user for.
I read this post: How to pass -D parameter or environment variable to Spark job? and although helpful, none of the methods in there seem to do the trick.
My environment:
- Spark-2.2
- Alluxio-1.4
- packaged jar passed to spark-submit
The spark-submit job is being run as root (under supervisor), and alluxio only recognizes this user.
Here's where I've tried adding "-Dalluxio.security.login.username=alluxio":
spark.driver.extraJavaOptions in spark-defaults.conf
on the command line for spark-submit (using --conf)
within the sparkservices conf file of my jar application
within a new file called "alluxio-site.properties" in my jar application
None of these worked to set the user for Alluxio, though I'm easily able to set this property in a different (non-Spark) client application that also writes to Alluxio.
Anyone able to make this setting apply in spark-submit jobs?
If spark-submit is in client mode, you should use --driver-java-options instead of --conf spark.driver.extraJavaOptions=... in order for the driver JVM to be started with the desired options. Therefore your command would look something like:
./bin/spark-submit ... --driver-java-options "-Dalluxio.security.login.username=alluxio" ...
This should start the driver with the desired Java options.
If the Spark executors also need the option, you can set that with:
--conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio"