Apache Spark: Running jobs in parallel in standalone mode - apache-spark

We are trying to load data from an Oracle database into a Kinetica database through Apache Spark.
We installed Spark in standalone mode and ran the commands below. We have tried everything we could think of, but we could not get jobs to run in parallel. We use 2 IBM servers, each with 128 cores and 1 TB of memory.
We also added the following to spark-defaults.conf:
spark.executor.memory=64g
spark.executor.cores=32
spark.default.parallelism=32
spark.cores.max=64
spark.scheduler.mode=FAIR
spark.sql.shuffle.partitions=32
On the machine 10.20.10.228:
./start-master.sh --webui-port 8585
./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
On the machine 10.20.10.229:
./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
On the machine 10.20.10.228, we start the Spark shell:
spark-shell --master spark://10.20.10.228:7077
Then we run the following:
val df = spark.read.format("jdbc").
  option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb").
  option("dbtable", "dbo.temp_muh_hareket").
  option("user", "gpudb").
  option("password", "Kinetica2017!").
  load()
import com.kinetica.spark._
val lp = new LoaderParams("http://10.20.10.228:9191", "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20",
  false, "", 100000, true, true, "admin", "Kinetica2017!", 4, true, true, 1)
SparkKineticaLoader.KineticaWriter(df, lp)
The commands above work and the data transfer completes. However, the jobs run serially, not in parallel. The executors also take turns instead of working at the same time.
How can we make the jobs run in parallel?
I would really appreciate your help. We have done everything that we could.
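One thing worth checking, offered as an assumption rather than a confirmed diagnosis: the JDBC read above sets no partitioning options, and Spark reads such a source as a single partition, so the load that follows has only one task to schedule. The sketch below, written in PySpark style, shows the standard JDBC options partitionColumn, lowerBound, upperBound and numPartitions. The column name id and its bounds are hypothetical and would need to match a real numeric, date or timestamp column in the source table.

```python
def jdbc_partition_options(column, lower, upper, num_partitions):
    """Build the Spark JDBC options that split one read into parallel range queries."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    # Option values are passed as strings.
    return {
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def read_partitioned(spark, url, table, user, password, partition_opts):
    """Read a table over JDBC using an existing SparkSession passed in as `spark`."""
    reader = (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
    )
    for key, value in partition_opts.items():
        reader = reader.option(key, value)
    return reader.load()


# Hypothetical key column "id" with values between 1 and 32,000,000.
opts = jdbc_partition_options("id", 1, 32_000_000, 32)
print(opts)
```

With a live session, read_partitioned(spark, url, "dbo.temp_muh_hareket", user, password, opts) returns a DataFrame, and df.rdd.getNumPartitions() shows how many partitions the read actually produced.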

Related

Spark job not showing up on standalone cluster GUI

I am experimenting with running Spark jobs in my lab and have a three-node standalone cluster. When I execute a new job on the master node via the CLI:
spark-submit sparktest.py --master spark://myip:7077
the job completes as expected, but it does not show up at all on the cluster GUI. After some investigation, I added --master to the submit command, but to no avail. Both during job execution and after completion, when I navigate to http://mymasternodeip:8080/
none of these jobs are listed under Running Jobs or Completed Jobs. Any thoughts on why the jobs don't show up would be appreciated.
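You should specify the --master flag, and any other spark-submit options, before the application file. Anything that comes after the application file is passed to the application as its own arguments, so the master is never set and falls back to local.
spark-submit --master spark://myip:7077 sparktest.py
Also make sure that you don't override the master in your code when creating the SparkSession object. Either provide the same master URL in code or don't set it there at all.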
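To make the ordering rule concrete, here is a small sketch of a helper that assembles the command so that every spark-submit option lands before the application file. The helper name build_submit_command and its parameters are made up for this illustration; only the resulting command line comes from the answer above.

```python
def build_submit_command(app, master=None, options=None, app_args=None):
    """Return a spark-submit argv list with all options placed before the application file."""
    cmd = ["spark-submit"]
    if master is not None:
        cmd += ["--master", master]
    for flag, value in (options or {}).items():
        cmd += [flag, value]
    cmd.append(app)               # everything after this is passed to the application
    cmd += list(app_args or [])
    return cmd


cmd = build_submit_command("sparktest.py", master="spark://myip:7077")
print(" ".join(cmd))  # spark-submit --master spark://myip:7077 sparktest.py
```

The list form can be handed to subprocess.run directly, which avoids shell quoting issues.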

How to configure Spark spark_worker_opts for Jupyter notebooks

I use PySpark with Spark 2.4 in standalone mode on Linux to process a lot of incoming data via Kafka from a Jupyter notebook (currently for testing). I want to add these options to the notebook to prevent the /tmp/ directory from filling up with dozens of gigabytes after a few hours:
spark.worker.cleanup.enabled=true
spark.worker.cleanup.appDataTtl=120
But these conf locations do not work:
Spark’s default configuration (spark/conf/spark-env.sh) does not seem to be used by Jupyter notebooks at all:
SPARK_WORKER_OPTS="spark.worker.cleanup.enabled=true
spark.worker.cleanup.appDataTtl=120"
So I created a separate kernel configuration in ~/.local/share/jupyter/kernels/python3-spark1/kernel.json that I can select in JupyterHub. It is definitely picked up, since I can see the RAM adjustments take effect in htop:
"env": {
"PYSPARK_SUBMIT_ARGS": "--master local[*]
--conf spark.worker.cleanup.enabled=true --conf=spark.worker.cleanup.appDataTtl=120 driver-memory 145g --executor-memory 50g pyspark-shell"
But /tmp still fills up with dozens of gigabytes.
I also tried the “magic” in a Jupyter cell, but that did not work either.
Do you know how to configure Jupyter notebooks properly for these Spark settings?
SPARK_WORKER_OPTS takes configuration properties that apply only to the worker, in the form "-Dx=y":
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=60 -Dspark.worker.cleanup.appDataTtl=120"
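Note that each property needs the -D prefix, which the version in the question does not have. As a rough sketch, the export line can be generated from a plain mapping of property names to values; the helper name worker_opts is invented for this example.

```python
def worker_opts(properties):
    """Render a dict of Spark properties as a SPARK_WORKER_OPTS value made of "-Dx=y" pairs."""
    return " ".join(f"-D{key}={value}" for key, value in properties.items())


cleanup = {
    "spark.worker.cleanup.enabled": "true",
    "spark.worker.cleanup.interval": 60,
    "spark.worker.cleanup.appDataTtl": 120,
}

# Keep whatever SPARK_WORKER_OPTS already holds and append the cleanup settings.
line = f'export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS {worker_opts(cleanup)}"'
print(line)
```

The printed line matches the export shown above, so it can be pasted into spark-env.sh on each worker.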
If that does not work, you can try either of the options below.
Option-1: Updating spark-defaults.conf
On the worker node, set the following configuration options in the /spark/conf/spark-defaults.conf file:
spark.worker.cleanup.enabled: Enables periodic cleanup of worker and application directories. Disabled by default; set it to true to enable it. Note that this only affects standalone mode, as YARN works differently.
spark.worker.cleanup.interval: The frequency, in seconds, that the worker cleans up old application work directories. The default is 30 minutes.
spark.worker.cleanup.appDataTtl: The number of seconds to retain application work directories on each worker. The default is 7 days.
Then stop and start the workers.
sbin/stop-worker.sh - Stops all worker instances on the machine the script is executed on.
sbin/start-worker.sh - Starts a worker instance on the machine the script is executed on.
Option-2: If you set up the Spark cluster using docker-compose, set the environment variable in the Docker Compose file:
spark-worker-x:
  image: spark-worker
  container_name: spark-worker-x
  environment:
    - SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=60 -Dspark.worker.cleanup.appDataTtl=120"

How to get the progress bar (with stages and tasks) with yarn-cluster master?

When running a Spark Shell query using something like this:
spark-shell yarn --name myQuery -i ./my-query.scala
My query file contains a simple Spark SQL query that reads parquet files, runs simple queries, and writes out parquet files. When running these queries I get a nice progress bar like this:
[Stage7:===========> (14174 + 5) / 62500]
When I create a jar using the exact same query and run it with the following command-line:
spark-submit \
--master yarn-cluster \
--driver-memory 16G \
--queue default \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--name MyQuery \
--class com.data.MyQuery \
target/uber-my-query-0.1-SNAPSHOT.jar
I don't get any such progress bar. The command simply says repeatedly
17/10/20 17:52:25 INFO yarn.Client: Application report for application_1507058523816_0443 (state: RUNNING)
The query works fine and the results are fine, but I need feedback on when the process will finish. I have tried the following.
The web page of RUNNING Hadoop applications does have a progress bar, but it basically never moves. Even for the spark-shell query, that progress bar is useless.
I have tried to get the progress bar through the YARN logs, but they are not aggregated until the job is complete. Even then, there is no progress bar in the logs.
Is there a way to launch a Spark query in a jar on a cluster and have a progress bar?
When I create a jar using the exact same query and run it with the following command-line (...) I don't get any such progress bar.
The difference between these two seemingly similar Spark executions is the master URL.
In the former Spark execution with spark-shell yarn, the master is YARN in client deploy mode, i.e. the driver runs on the machine where you start spark-shell from.
In the latter Spark execution with spark-submit --master yarn-cluster, the master is YARN in cluster deploy mode (which is actually equivalent to --master yarn --deploy-mode cluster), i.e. the driver runs on a YARN node.
With that said, you won't get the nice progress bar (which is actually called ConsoleProgressBar) on the local machine but on the machine where the driver runs.
A simple solution is to replace yarn-cluster with yarn.
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr.
The progress includes the stage id, the number of completed, active, and total tasks.
ConsoleProgressBar is created when the spark.ui.showConsoleProgress Spark property is turned on and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed, so there is "space" for ConsoleProgressBar).
You can find more information in Mastering Apache Spark 2's ConsoleProgressBar.
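If the driver has to stay on the cluster, one alternative is for the driver to report stage progress itself through the SparkContext status tracker, for example by logging the lines periodically from a background thread. The sketch below is in PySpark style, whereas the job in the question is a Scala jar; it relies on statusTracker(), getActiveStageIds() and getStageInfo(), and the bar layout only imitates the console output. The function names are made up for this example.

```python
def format_stage_progress(stage_id, completed, active, total, width=20):
    """Render one stage roughly like the console bar: [Stage 7:===> (done + active) / total]."""
    filled = 0 if total == 0 else int(width * completed / total)
    bar = "=" * filled + (">" if filled < width else "")
    return f"[Stage {stage_id}:{bar.ljust(width)} ({completed} + {active}) / {total}]"


def report_progress(sc):
    """Collect one progress line per active stage from a live SparkContext `sc`."""
    tracker = sc.statusTracker()
    lines = []
    for stage_id in tracker.getActiveStageIds():
        info = tracker.getStageInfo(stage_id)  # None if the stage is no longer known
        if info is not None:
            lines.append(format_stage_progress(
                info.stageId, info.numCompletedTasks, info.numActiveTasks, info.numTasks))
    return lines


print(format_stage_progress(7, 14174, 5, 62500))
```

In cluster mode these lines end up in the driver's log on the YARN node, not on the submitting machine, so they still have to be read from there.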

How to execute a Spark program efficiently in a cluster

I have a 2-node Hadoop cluster. Each node has 16 GB of RAM and a 512 GB hard disk.
I have written a Spark program like the one below.
Code:
val input = sc.wholeTextFiles("folderpath/*")
// do some operations on input
// convert it to a DataFrame, register a temp table, then execute an INSERT command to write the DataFrame values into a Hive table
Then I open a terminal on host 1 (the namenode of the cluster) and run the spark-submit command:
spark-submit --class com.sample.parser --master yarn Parser.jar
But it takes more than 50 minutes to process 25 files totalling around 1 GB. And when I check the Spark UI, the executor list shows only host 2; host 1 is listed as the driver.
So in practice only one node (host 2) is executing the program. Why?
Is there a way to have my driver node also execute the program, so that it runs a little faster? Am I doing something wrong? Basically, I want my driver node to also act as an executor (both machines have 8 cores).
Thanks in advance.
By default, spark-submit runs in client deploy mode, where the driver runs locally on the machine you submit from. To submit the Spark job in cluster mode, use --deploy-mode:
spark-submit \
--class com.sample.parser \
--master yarn \
--deploy-mode cluster \
Parser.jar
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
Also, experiment with different values of <n> in --num-executors <n> and see whether it makes any difference to the performance of your app.
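One way to run that experiment is to generate the same cluster-mode command for several executor counts and time each submission. This is only a sketch: the counts 1, 2 and 4 are arbitrary example values, the helper names are invented, and the timing assumes spark-submit blocks until the YARN application finishes, which is the default behaviour.

```python
import subprocess
import time


def submit_command(num_executors, app_jar="Parser.jar", main_class="com.sample.parser"):
    """Build the cluster-mode spark-submit command for a given executor count."""
    return [
        "spark-submit",
        "--class", main_class,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),
        app_jar,
    ]


def time_run(num_executors):
    """Run one submission and return (exit code, elapsed seconds). Needs spark-submit on PATH."""
    start = time.monotonic()
    result = subprocess.run(submit_command(num_executors))
    return result.returncode, time.monotonic() - start


# Print the commands that would be compared; call time_run(n) on the cluster to measure them.
for n in (1, 2, 4):
    print(" ".join(submit_command(n)))
```

Comparing the elapsed times across runs, together with the executor list in the Spark UI, shows whether more executors are actually being allocated and used.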

Spark driver always fails to bind to submit host in cluster mode

Hi, I'm trying to deploy a Spark Streaming job on a standalone cluster. All the jars are installed locally on each node, and I run spark-submit from one of the nodes. The driver is then started on one of the workers at random, but it always tries to bind to the node where I submitted the job. If the driver happens to start on a different node, it always fails. I tried setting spark.driver.host to different values, but that didn't help.
Has anyone had the same problem? Or is there a better way to submit Spark jobs, ideally on a standalone cluster?
spark-env.sh
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_LOCAL_HOSTNAME=local_host_name
export SPARK_LOG_DIR=/var/log/spark
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOCAL_DIRS=/var/run/spark/tmp
export STANDALONE_SPARK_MASTER_HOST=master_host_name
spark-defaults.conf
spark.master spark://master_host_name:6066
spark.io.compression.codec lz4
I run it with spark-submit --deploy-mode cluster --supervise
Thanks a lot
