Spark 2 application on the same time - apache-spark

I am using spark streaming and saving the processed output in a data.csv file
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000))
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
At the same time i would like to read output of NetworkWordCount data.csv along with another newfile and process it again simultaneously
My Question here is
Is it possible to run two spark applications at the same time?
Is it possible to submit a spark application through the code itself
I am using mac, Now i am submitting spark application from the spark folder with the following command
bin/spark-submit --class "com.abc.test.SparkStreamingTest" --master spark://xyz:7077 --executor-memory 20G --total-executor-cores 100 ../workspace/Test/target/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar 1000
or just without spark:ip:port and executor memory, total executor core
bin/spark-submit --class "com.abc.test.SparkStreamingTest" --master local[4] ../workspace/Test/target/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar
and the other application which read the textfile for batch processing like follows
bin/spark-submit --class "com.abc.test.BatchTest" --master local[4] ../workspace/Batch/target/BatchTesting-0.0.1-SNAPSHOT-jar-with-dependencies.jar
when i run the both the applictions SparkStreamingTest and BatchTest separately both works fine , but when i tried to run both simultaneously, i get the following error
Currently i am using spark stand alone mode
WARN AbstractLifeCycle: FAILED SelectChannelConnector#0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
Any help is much appriciated.. i am totally out of my mind

From http://spark.apache.org/docs/1.1.0/monitoring.html
If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Your apps should be able to run. It's just a warning to tell you about port conflicts. It's because you run the two Spark apps in the same time. But don't worry about it. Spark will try 4041, 4042 until it finds an available port. So in your case, you will find two Web UIs: ip:4040, ip:4041 for these two apps.

Related

Spark Structure Streaming job failing in cluster mode

I am using spark-sql-2.4.1 v in my application.
While writing data on to hdfs folder I am facing this issue in spark-streaming application
Error:
yarn.Client: Deleted staging directory hdfs://dev/user/xyz/.sparkStaging/application_1575699597805_47
20/02/24 14:02:15 ERROR yarn.Client: Application diagnostics message: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user= xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
.
.
.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:350)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:251)
While writing data on to HDFS folder I am facing this issue in spark-streaming application. When I run in yarn-cluster mode I face this issue i.e.
--master yarn \
--deploy-mode cluster \
But when I run in “yarn-client” mode it runs fine i.e.
--master yarn \
--deploy-mode client \
What is the root cause of this problem?
Fundamental question here, why it is trying to write in "/tmp/hadoop-admin/" instead of respective user directory i.e. hdfs://qa2/user/xyz/?
I have come across this fix:
https://issues.apache.org/jira/browse/SPARK-26825
How can I implement it in my spark-sql application?
The only difference between the working --deploy-mode client and the failing --deploy-mode cluster cases is the location of the driver. In client deploy mode, the driver runs on the machine you execute spark-submit (which is usually an edge node that is configured to use a YARN cluster, but it is not part of it) while in cluster deploy mode the driver runs as part of a YARN cluster (one of the nodes under control of YARN).
It looks like you've got a misconfigured edge node.
I'd not be surprised if a regular Spark SQL-only Spark application would be failing too. I'd not be surprised to hear that it has nothing to do with a streaming query (Spark Structured Streaming) and would fail for any Spark application.

What is happening when starting a Spark application on Kubernetes

I read this: Running Spark on Kubernetes.
I want to know more details about the interaction between Kubernetes Controller/Scheduler and Spark runtime when launching a Spark job on K8s.
Specially, assuming we launch an Spark app by :
bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--..............
My question is: the K8s may not be able to allocate 5 executors (or called containers/pods) immediately due to unavailability of cluster resources at the moment the Spark app is launched. Which way does Spark app take? (1) Spark starts running tasks as soon as possible when there is at least one executor is allocated. (2) Spark won't launch any tasks until all of the 5 executors have been allocated.
If you know Hadoop YARN, it would be great if you could also answer the question in the scenario of running Spark app on Hadoop YARN(DynamicAllocation Disabled) and point out the difference.

Spark Standalone: application gets 0 cores

I seem to be unable to assign cores to an application. This leads to the following (apparently common) error message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have one master and two slaves in a Spark cluster. All are 8-core i7s with 16GB of RAM.
I have left the spark-env.sh virtually virgin on all three, just specifying the master's IP address.
My spark-submit is the following:
nohup ./bin/spark-submit
--jars ./ikoda/extrajars/ikoda_assembled_ml_nlp.jar,./ikoda/extrajars/stanford-corenlp-3.8.0.jar,./ikoda/extrajars/stanford-parser-3.8.0.jar \
--packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
--class ikoda.mlserver.Application \
--conf spark.cassandra.connection.host=192.168.0.33 \
--conf spark.cores.max=4 \
--driver-memory 4g –num-executors 2 --executor-memory 2g --executor-cores 2 \
--master spark://192.168.0.141:7077 ./ikoda/ikodaanalysis-mlserver-0.1.0.jar 1000 > ./logs/nohup.out &
I suspect I am conflating the sparkConf initialization in my code with the spark-submit. I need this as the app involves SparkStreaming which can require reinitializing the SparkContext.
The sparkConf setup is as follows:
val conf = new SparkConf().setMaster(s"spark://$sparkmaster:7077").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
conf.set("spark.jars.packages", "datastax:spark-cassandra-connector:"+datastaxpackageversion)
conf.set("spark.cores.max", sparkcoresmax)
The Spark UI shows the following:
OK, this is definitely a case of programmer error.
But maybe others will make a similar error. The Master had been used as a local Spark previously. I had put some executor settings in spark-defaults.conf and then months later had forgotten about this.
There is a cascading hierarchy whereby SparkConf settings get precedence, then spark-submit settings and then spark-defaults.conf. spark-defaults.conf overrides defaults set by Apache Spark team
Once I removed the settings from spark-defaults, all was fixed.
It is because of the maximum of your physical memory.
your spark memory in spark UI 14.6GB So you must request memory for each executor below the volume
14.6GB, for this you can add config to your spark conf something like this:
conf.set("spark.executor.memory", "10gb")
if you request more than your physical memory spark does't allocate cpu cores to your job and display 0 in Cores in spark UI and run NOTHING.

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.Exception: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which port you are using: if on cluster: log in to master node and include:
--master spark://XXXX:7077
You can find it always in spark ui under port 8080
Also check your spark builder config if you have set master already as it takes priority when launching eg:
val spark = SparkSession
.builder
.appName("myapp")
.master("local[*]")

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run succesfully, however they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simply Python script processing data from HDFS like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?
Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>
I believe #MrChristine is correct -- the option flags you specify are being passed to your python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors since by default it will run on a single core and use two executors.
Its not true that python script doesn't run in cluster mode. I am not sure about previous versions but this is executing in spark 2.2 version on Hortonworks cluster.
Command : spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python Code :
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
.setMaster("yarn")
.setAppName("retrieve data"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output : Its big so i am not pasting. But it runs perfect.
It seems that PySpark does not run in distributed mode using Spark/YARN - you need to use stand-alone Spark with a Spark Master server. In that case, my PySpark script ran very well across the cluster with a Python process per core/node.

Resources