How can I run PySpark interactively in Jupyter using YARN-client mode?

I have succeeded in running PySpark in Jupyter in local mode using the second method mentioned in this blog.
Here is the code:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext("local", "First App")
I want to run it interactively in YARN-client mode. How can I do that?
Going further, how would I run it in other modes, e.g. standalone mode and YARN-cluster mode?

According to the docs:
The master URL accepts a yarn value; the cluster location is found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
So I can simply use:
sc = SparkContext("yarn-client", "First App")

Related

How to spark-submit a script directly in the terminal instead of using a script file?

I have some jobs that use the following command to execute some tasks:
pyspark --master yarn --deploy-mode cluster --py-files file.py --name file file.py
The script in my Python file is very simple:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)
table_1 = hive_context.sql("SELECT * FROM table_1")
table_1.write.insertInto("table_to_insert", overwrite=True)
My question is: can I run this command directly with the script instead of using a file? Something like:
"pyspark --master yarn --deploy-mode cluster --py-script 'from pyspark import SparkContext; from pyspark.sql import HiveContext; sc =SparkContext(); hive_context = HiveContext(sc); table_1 = hive_context.sql("SELECT * FROM table_1"); table_1.write.insertInto("table_to_insert", overwrite=True);'"
Is this possible?
Many thanks for your support!
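spark-submit and pyspark do not expose an inline-script flag like the hypothetical --py-script above. A sketch of one common workaround (not from this thread) is to write the snippet to a temporary file and submit that file; the example below assumes spark-submit is on the PATH and YARN is configured:
import subprocess
import tempfile

# The same job as above, kept as a string.
job_source = """
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext()
hive_context = HiveContext(sc)
table_1 = hive_context.sql("SELECT * FROM table_1")
table_1.write.insertInto("table_to_insert", overwrite=True)
"""

# Write the snippet to a throwaway .py file, then hand that file to spark-submit.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(job_source)
    job_path = f.name

subprocess.run(
    ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
     "--name", "inline_job", job_path],
    check=True,
)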

Setting PySpark executor.memory and executor.core within Jupyter Notebook

I am initializing PySpark from within a Jupyter Notebook as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("PySpark-testing-app").setMaster("yarn")
conf = (conf.set("deploy-mode", "client")
        .set("spark.driver.memory", "20g")
        .set("spark.executor.memory", "20g")
        .set("spark.driver.cores", "4")
        .set("spark.num.executors", "6")
        .set("spark.executor.cores", "4"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext.getOrCreate(sc)
However, when I open the YARN GUI and look at "RUNNING Applications", I see my session allocated 1 container, 1 vCPU, and 1 GB of RAM, i.e. the default values!
How can I get the resources I asked for by passing the values listed above?
Jupyter notebook launches PySpark in yarn-client mode, and the driver memory and some other configs cannot be set through the SparkConf class; you must set them on the command line.
Take a look at the official docs' explanation of the memory setting:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file.
There is another way to do it:
import os
memory = '20g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
Other configs that must be set before the JVM starts should be handled the same way.
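A fuller sketch combining that approach with the notebook setup above, assuming findspark is installed, SPARK_HOME and HADOOP_CONF_DIR are set, and the environment variable is assigned before any Spark JVM has started:
import os
# All submit-time options go into PYSPARK_SUBMIT_ARGS and the value must end with "pyspark-shell".
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master yarn --deploy-mode client "
    "--driver-memory 20g --executor-memory 20g "
    "--executor-cores 4 --num-executors 6 "
    "pyspark-shell"
)

import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
# Only app-level settings remain in SparkConf; resources were fixed above.
sc = SparkContext(conf=SparkConf().setAppName("PySpark-testing-app"))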
Execute the following at the top of the notebook, before Spark initializes:
%%configure -f
{
"driverMemory" : "20G",
"executorMemory": "20G"
}

Setting spark.local.dir in Pyspark/Jupyter

I'm using Pyspark from a Jupyter notebook and attempting to write a large parquet dataset to S3.
I get a 'no space left on device' error. I searched around and learned that it's because /tmp is filling up.
I want to now edit spark.local.dir to point to a directory that has space.
How can I set this parameter?
Most solutions I found suggested setting it when using spark-submit. However, I am not using spark-submit and just running it as a script from Jupyter.
Edit: I'm using Sparkmagic to work with an EMR backend. I think spark.local.dir needs to be set in the config JSON, but I am not sure how to specify it there.
I tried adding it in session_configs but it didn't work.
The answer depends on where your SparkContext comes from.
If you are starting Jupyter with pyspark:
PYSPARK_DRIVER_PYTHON='jupyter'\
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark
then your SparkContext is already initialized when you receive your Python kernel in Jupyter. You should therefore pass a parameter to pyspark (at the end of the command above): --conf spark.local.dir=...
If you are constructing a SparkContext in Python
If you have code in your notebook like:
import pyspark
sc = pyspark.SparkContext()
then you can configure the Spark context before creating it:
import pyspark
conf = pyspark.SparkConf()
conf.set('spark.local.dir', '...')
sc = pyspark.SparkContext(conf=conf)
Configuring Spark from the command line:
It's also possible to configure Spark by editing a configuration file in bash. The file you want to edit is ${SPARK_HOME}/conf/spark-defaults.conf. You can append to it as follows (creating it if it doesn't exist):
echo 'spark.local.dir /foo/bar' >> ${SPARK_HOME}/conf/spark-defaults.conf
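For the Sparkmagic/EMR setup mentioned in the question's edit, Spark properties generally need to be nested under a conf key rather than passed as top-level session settings. A sketch, assuming the %%configure cell magic is available and is run before the session starts (the directory is only an example):
%%configure -f
{
"conf": { "spark.local.dir": "/mnt/tmp" }
}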

Invalid Spark URL in local spark session

Since updating to Spark 2.3.0, tests run in my CI (Semaphore) fail due to an allegedly invalid Spark URL when creating the (local) Spark context:
18/03/07 03:07:11 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@LXC_trusty_1802-d57a40eb:44610
at org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)
at org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
at org.apache.spark.executor.Executor.<init>(Executor.scala:155)
at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)
at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
The Spark session is created as follows:
val sparkSession: SparkSession = SparkSession
.builder
.appName(s"LocalTestSparkSession")
.config("spark.broadcast.compress", "false")
.config("spark.shuffle.compress", "false")
.config("spark.shuffle.spill.compress", "false")
.master("local[3]")
.getOrCreate
Before updating to Spark 2.3.0, no problems were encountered in versions 2.2.1 and 2.1.0. Also, running the tests locally works fine.
Change the SPARK_LOCAL_HOSTNAME to localhost and try.
export SPARK_LOCAL_HOSTNAME=localhost
This has been resolved by setting sparkSession config "spark.driver.host" to the IP address.
It seems that this change is required from 2.3 onwards.
If you don't want to change the environment variable, you can change the code to add the config in the SparkSession builder (like Hanisha said above).
In PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host", "localhost").getOrCreate()
As mentioned in the answers above, you need to change SPARK_LOCAL_HOSTNAME to localhost. On Windows, you can use the SET command: SET SPARK_LOCAL_HOSTNAME=localhost
However, SET is temporary; you would have to run it again in every new terminal. Instead, you can use the SETX command, which is permanent.
SETX SPARK_LOCAL_HOSTNAME localhost
You can run this from any command prompt. Note that unlike SET, SETX does not take an equals sign; you separate the variable name and the value with a space.
On success, you will see a message like "SUCCESS: Specified value was saved".
You can verify that the variable was added by typing SET in a different command prompt (or SET S, which lists variables starting with the letter 'S'). You should see SPARK_LOCAL_HOSTNAME=localhost in the results, which would not be the case if you had used SET instead of SETX.
Change your hostname to have NO underscore.
spark://HeartbeatReceiver@LXC_trusty_1802-d57a40eb:44610 to spark://HeartbeatReceiver@LXCtrusty1802d57a40eb:44610
On Ubuntu, as root:
#hostnamectl status
#hostnamectl --static set-hostname LXCtrusty1802d57a40eb
#nano /etc/hosts
127.0.0.1 LXCtrusty1802d57a40eb
#reboot
Try running Spark locally, with as many worker threads as there are logical cores on your machine:
.master("local[*]")
I would like to complement @Prakash Annadurai's answer by saying:
If you want the variable setting to last after exiting the terminal, add it to your shell profile (e.g. ~/.bash_profile) with the same command:
export SPARK_LOCAL_HOSTNAME=localhost
For anyone working in a Jupyter Notebook: adding %env SPARK_LOCAL_HOSTNAME=localhost at the very beginning of a cell solved it for me, like so:
%env SPARK_LOCAL_HOSTNAME=localhost
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Test")
sc = SparkContext(conf = conf)
Setting .config("spark.driver.host", "localhost") fixed the issue for me.
SparkSession spark = SparkSession
.builder()
.config("spark.master", "local")
.config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
.config("spark.hadoop.fs.s3a.buffer.dir", "/tmp")
.config("spark.driver.memory", "2048m")
.config("spark.executor.memory", "2048m")
.config("spark.driver.bindAddress", "127.0.0.1")
.config("spark.driver.host", "localhost")
.getOrCreate();

how to run multi node on the spark-shell?

I have been using spark-submit to test my code on a multi-node system.
(Of course, I specified the master option as the master server address to get a multi-node environment.)
However, instead of using spark-submit, I would like to use spark-shell to test my code on the cluster, and I don't know how to configure multi-node cluster settings for spark-shell.
I think that just using spark-shell without changing any settings will result in local mode.
I tried to search the info and followed the below commands.
scala> sc.stop()
...
scala> import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.{SparkContext, SparkConf}
scala> val sc = new SparkContext(new SparkConf().setAppName("shell").setMaster("my server address"))
...
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@567a2954
However, I am not quite sure that this is the right way to set up spark-shell for a multi-node cluster.
Have you tried the --master parameter of spark-shell? For Spark standalone:
./spark-shell --master spark://master-ip:7077
The Spark shell is just a driver; it will connect to whatever cluster you specify in the master parameter.
Edit:
For YARN use
./spark-shell --master yarn
If you used setMaster("my server address") and "my server address" is not "local", then it won't run in local mode.
It is fine to set the master address in the code, but in production you'd pass the --master parameter on the CLI to spark-shell or spark-submit.
You can also write a separate .scala file, and pass that to spark-shell -i <filename>.scala
