What is the difference between Spark Client and Spark Driver? - apache-spark

What is the difference between Spark Client and Spark Driver? I keep hearing that the Spark Client is the one that does spark-submit, but I never had to do that. All I do is install the Spark libraries and initialize a SparkContext inside public static void main(String[] args), like the code below:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.SparkConf;
import scala.Tuple2;
public class JavaWordCount {
  public static void main(String[] args) {
    // create Spark context with Spark configuration
    JavaSparkContext sc = new JavaSparkContext(new SparkConf()
        .setAppName("Spark Count")
        .setMaster(master_url)
        .setJars(JavaSparkContext.jarOfClass(JavaWordCount.class)));

    // get threshold
    final int threshold = Integer.parseInt(args[1]);

    // read in text file and split each document into words
    JavaRDD<String> tokenized = sc.textFile(args[0]).flatMap(
        new FlatMapFunction<String, String>() {
          @Override
          public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
          }
        }
    );

    // count the occurrence of each word
    JavaPairRDD<String, Integer> counts = tokenized.mapToPair(
        new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
          }
        }
    ).reduceByKey(
        new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
          }
        }
    );

    // filter out words with fewer than threshold occurrences
    JavaPairRDD<String, Integer> filtered = counts.filter(
        new Function<Tuple2<String, Integer>, Boolean>() {
          @Override
          public Boolean call(Tuple2<String, Integer> tup) {
            return tup._2() >= threshold;
          }
        }
    );

    // count characters
    JavaPairRDD<Character, Integer> charCounts = filtered.flatMap(
        new FlatMapFunction<Tuple2<String, Integer>, Character>() {
          @Override
          public Iterable<Character> call(Tuple2<String, Integer> s) {
            Collection<Character> chars = new ArrayList<Character>(s._1().length());
            for (char c : s._1().toCharArray()) {
              chars.add(c);
            }
            return chars;
          }
        }
    ).mapToPair(
        new PairFunction<Character, Character, Integer>() {
          @Override
          public Tuple2<Character, Integer> call(Character c) {
            return new Tuple2<Character, Integer>(c, 1);
          }
        }
    ).reduceByKey(
        new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
          }
        }
    );

    System.out.println(charCounts.collect());
  }
}
In this code I don't see any spark-submit, so which is the client and which is the driver here? Also, I don't really understand what is happening with SparkConf.setJars. Why is that needed and what does it do?

The Master is the one that creates tasks and submits them to the Workers and coordinates between the different job stages, whereas the Driver sits one step above the Master: it creates the context and builds the RDD graph, which it then submits to the Master.
The Spark Driver is the program that declares transformations and actions on RDDs of data and submits such requests to the Master. It is basically the process that creates your SparkContext.
Hence, the Driver prepares the context and "declares" operations on the data using RDD transformations and actions. The Driver then submits this serialized RDD graph to the Master.
The Master then creates tasks out of it and submits them to the Workers for execution. It also coordinates the different job stages.
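To make the split concrete, here is a minimal sketch (not taken from the question; the local[*] master and the numbers are purely illustrative) of which parts of a program run where: everything in main() executes in the driver process, while only the functions passed to RDD operations are serialized and run on the executors.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverDemo {
  public static void main(String[] args) {
    // Runs in the driver: creates the SparkContext and declares the RDD graph.
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("DriverDemo").setMaster("local[*]")); // assumed local master
    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
    // The filter function below is shipped to the executors; count() triggers the job
    // and brings the result back to the driver.
    long evens = nums.filter(n -> n % 2 == 0).count();
    System.out.println("evens = " + evens); // printed in the driver process
    sc.stop();
  }
}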

By Spark Client I'm assuming you're talking about Spark Master, which is the server responsible for scheduling Spark jobs.
The code you have works with the Spark API, but you have to run it, just as you would run any other program. In order to run jobs in Spark, you have to schedule them with the relevant cluster manager (Standalone, YARN or Mesos) so the job actually executes in the cluster environment.
After you assemble your JAR, you schedule the job using the spark-submit script, run from a machine that can reach the cluster. As the documentation states:
Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Some of the commonly used options are:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any
Also I don't really understand what is happening with SparkContext.setJars? Why is that needed and what does it do?
I assume you mean SparkConf.setJars. What that setting does is distribute JARs which aren't packaged into an uber JAR to the worker nodes, if needed. This is equivalent to passing the --jars flag to spark-submit.
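For example, a minimal sketch (the jar paths below are placeholders, not from the question) of shipping extra dependency jars to the executors via SparkConf:
// Placeholder jar paths; this has the same effect as passing --jars to spark-submit.
SparkConf conf = new SparkConf()
    .setAppName("Spark Count")
    .setMaster(master_url)
    .setJars(new String[] { "/opt/libs/extra-lib.jar", "/opt/libs/another-dep.jar" });
JavaSparkContext sc = new JavaSparkContext(conf);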

Related

ClassNotFound when submitting a Spring Boot fat jar to spark

When I build a fat jar using Spring Boot, its name is sparker4m-0.0.1.jar, and I "spark-submit" it to the Spark cluster:
./bin/spark-submit --deploy-mode client --class org.springframework.boot.loader.JarLauncher ~/Desktop/sparker4m-0.0.1.jar --spark.master=spark://172.16.179.52:7077 --source_path=file:///usr/local/spark-2.3.1-bin-hadoop2.7/hdfs_mock/1530842877616 --out_path=file:///usr/local/spark-2.3.1-bin-hadoop2.7/hdfs_mock/153084287761_out --hadoop.home.dir=/usr/local/hadoop
You can see I use "--class org.springframework.boot.loader.JarLauncher" because it's the Main-Class of the Spring Boot jar; the actual main class is test.sparker4m.Sparker4mApplication.
Here is my code:
@SpringBootApplication
public class Sparker4mApplication {
  public static void main(String[] args) throws Exception {
    ApplicationContext ctx =
        SpringApplication.run(Sparker4mApplication.class, args);

    SparkConf conf = new SparkConf().setAppName("Test")
        .setMaster(happyDay.getVal("spark.master"))
        .set("spark.driver.host", Inet4Address.getLocalHost().getHostAddress());
    JavaSparkContext sc = new JavaSparkContext(conf);
    .........
    JavaPairRDD<String, String> transformed = rdd.mapToPair(
        new PairFunction<Tuple2<String, PortableDataStream>, String, String>() {
          @Override
          public Tuple2<String, String> call(Tuple2<String, PortableDataStream> tuple2) throws Exception {
            String fname = tuple2._1();
            PortableDataStream content = tuple2._2();
            byte[] bytes = content.toArray();
            String result = YUVSimpleTrans.transform(bytes);
            return new Tuple2<>(fname, result);
          }
        });
    ..........
    .........
  }
}
Everything goes well, but when the Executor runs, an exception is thrown:
Caused by: java.lang.ClassNotFoundException: test.sparker4m.Sparker4mApplication$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
test.sparker4m.Sparker4mApplication$1 is an anonymous inner class used by test.sparker4m.Sparker4mApplication; it should be the "new PairFunction<Tuple2<String, PortableDataStream>, String, String>()" above.
I suppose that Spark's executor process uses its own URLClassLoader to load classes, but my (Spring Boot fat jar's) classes are all in sparker4m-0.0.1.jar/BOOT-INF/classes and my dependency jars are all in sparker4m-0.0.1.jar/BOOT-INF/lib, so that URLClassLoader cannot load the class.
Is there any good way around this? Help needed!
Finally, I got the solution.
The keys are:
1. Use an uber-jar with all classes and resources at the top level, not nested inside a "jar lib".
2. Understand how JarLauncher works. JarLauncher is provided by spring-boot-tools; it creates an archive, loads classes with a URLClassLoader from /BOOT-INF/* (Spring Boot 2.x), and then starts the real main class.
So, if "1" is done, "2" is not needed, because we no longer need to load classes from /BOOT-INF/*.
The solution is: use maven-shade-plugin (without spring-boot-maven-plugin) to build an uber-jar, and declare your real main class (not JarLauncher) with org.apache.maven.plugins.shade.resource.ManifestResourceTransformer in that plugin.
After everything was done, I found that the question Submitting spring boot application jar to spark-submit already covered this, but its answer didn't point out clearly that you should not use ${start-class} (JarLauncher) but the real main class.

Pass parameters to the jar when using spark launcher

I am trying to create an executable jar which uses a Spark launcher to run another jar with a data transformation task (that jar creates the Spark session).
I need to pass Java parameters (some Java arrays) to the jar which is executed by the launcher.
object launcher {
  @throws[Exception]
  // How do I pass parameters to spark_job_with_spark_session.jar?
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setAppResource("spark_job_with_spark_session.jar")
      .setVerbose(true)
      .setMaster("local[*]")
      .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
      .launch()
  }
}
How can I do that?
I need to pass Java parameters (some Java arrays)
It is equivalent to executing spark-submit, so you cannot pass Java objects directly. Use application arguments,
addAppArgs(String... args)
to pass them as strings, and parse them back in your app.
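As a rough sketch (the array contents, the --thresholds argument name, and the jar path are made up for illustration), one way is to encode the array as a single comma-separated string, pass it with addAppArgs, and decode it in the launched application's main:
import org.apache.spark.launcher.SparkLauncher;

public class LauncherArgsSketch {
  // Launcher side: encode the array as one comma-separated argument (values/paths are placeholders).
  public static void launch(int[] thresholds) throws Exception {
    StringBuilder encoded = new StringBuilder();
    for (int i = 0; i < thresholds.length; i++) {
      if (i > 0) encoded.append(',');
      encoded.append(thresholds[i]);
    }
    new SparkLauncher()
        .setAppResource("spark_job_with_spark_session.jar") // placeholder jar
        .setMaster("local[*]")
        .addAppArgs("--thresholds", encoded.toString())
        .launch();
  }

  // Application side: call this from the launched jar's main(String[] args) to get the array back.
  public static int[] decode(String[] args) {
    String value = "";
    for (int i = 0; i + 1 < args.length; i++) {
      if ("--thresholds".equals(args[i])) value = args[i + 1];
    }
    String[] parts = value.split(",");
    int[] decoded = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
      decoded[i] = Integer.parseInt(parts[i]);
    }
    return decoded;
  }
}
A complete working example from the answer follows.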
package com.meow.woof.meow_spark_launcher.app;

import com.meow.woof.meow_spark_launcher.common.TaskListener;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

/**
 * @author hahattpro
 */
public class ExampleSparkLauncherApp {
  public static void main(String[] args) throws Exception {
    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/home/cpu11453/workplace/experiment/SparkPlayground/target/scala-2.11/SparkPlayground-assembly-0.1.jar")
        .setMainClass("me.thaithien.playground.ConvertToCsv")
        .setMaster("spark://cpu11453:7077")
        .setConf(SparkLauncher.DRIVER_MEMORY, "3G")
        .addAppArgs("--input", "/data/download_hdfs/data1/2019_08_13/00/", "--output", "/data/download_hdfs/data1/2019_08_13/00_csv_output/")
        .startApplication(new TaskListener());

    handle.addListener(new SparkAppHandle.Listener() {
      @Override
      public void stateChanged(SparkAppHandle handle) {
        System.out.println(handle.getState() + " new state");
      }

      @Override
      public void infoChanged(SparkAppHandle handle) {
        System.out.println(handle.getState() + " new state");
      }
    });

    System.out.println(handle.getState().toString());

    while (!handle.getState().isFinal()) {
      // wait until the job finishes
      Thread.sleep(1000L);
    }
  }
}
The example code above works.

Spark Streaming Issue (Java)

I am trying to run a network word count job using Spark Streaming. The code is below:
public class myclass {
  public static Iterator<String> myfunc(String s) {
    String ar[] = s.split(" ");
    List<String> l = new ArrayList<String>();
    for (String a : ar) {
      l.add(a);
    }
    return l.iterator();
  }

  public static void main(String args[]) throws Exception {
    SparkConf conf = new SparkConf().setAppName("trial");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
    JavaPairDStream<String, Integer> tup = lines.flatMap(s -> myfunc(s))
        .mapToPair(s -> new Tuple2<String, Integer>(s, 1))
        .reduceByKey((a, b) -> a + b);
    tup.print();
    jssc.start();
    jssc.awaitTermination();
    jssc.stop();
    jssc.close();
  }
}
I ran this job using the command:
spark-submit --class myclass --master local <myjar.jar>
In another terminal I ran a local server with:
nc -lk 7777
The job runs fine, but it keeps going on and on, adding a new job for every batch interval. I tried typing lines into the local server; it shows no error whatsoever, but the job just goes on and on and no word-count output is shown.
I went through some online tutorials where, with the same code, the result is printed regularly like below:
time 14500000ms
__________________
<output>
But that doesn't happen in my case; it keeps on adding jobs like below:
Added jobs for time 1450000
Added jobs for time 1460000
Added jobs for time 1470000
. . . .
. . . .
What am I doing wrong ?

Spark Jobserver: Very large task size

I'm getting messages along the lines of the following in my Spark JobServer logs:
Stage 14 contains a task of very large size (9523 KB). The maximum recommended task size is 100 KB.
I'm creating my RDD with this code:
List<String> data = new ArrayList<>();
for (int i = 0; i < 2000000; i++) {
  data.add(UUID.randomUUID().toString());
}
JavaRDD<String> randomData = sc.parallelize(data).cache();
I understand that the first time I run this it could be big, because the data in the RDD doesn't exist on the executor nodes yet.
I would have thought that it would be quick on subsequent runs though (I'm using Spark JobServer to keep the session context around and reuse the RDD); since I'm reusing the RDD, the data should already exist on the nodes.
The code is very simple:
private static Function<String, Boolean> func = new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("a");
  }
};
----
rdd.filter(func).count();

Spark does not distribute work

I set up two virtual machines to test Spark in a distributed setup. It seems that my jobs are only run locally on one node, the one I use to submit the job.
One node runs as a datanode/worker node, and the second one additionally as namenode/secondary namenode.
I configured the underlying Hadoop to use YARN.
The jps command confirms that the various services started correctly and are basically available after I executed the start* scripts in hadoop/spark.
I use htop to "track" whether the other node is used, but its CPU usage jumps between 2 and 3% --> probably not used. I wonder what I am missing here.
I start my job with this command:
./spark-submit --class com.... DistributedTest --master yarn-client myJar.jar
This is the class I am executing (the data.txt file is about 1 GB of pure text):
public class DistributedTest
{
  public static void main(String[] args) throws IOException
  {
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt"); // .persist(StorageLevel.DISK_ONLY());

    long numAs = logData.filter(new Function<String, Boolean>()
    {
      public Boolean call(String s)
      {
        return s.contains("a");
      }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>()
    {
      public Boolean call(String s)
      {
        return s.contains("b");
      }
    }).count();

    sc.close();

    String s = "Lines with a: " + numAs + ", lines with b: " + numBs;
    System.out.println(s);
  }
}
Does anyone have any ideas why my setup does not distribute the work?
The filter operation is definitely distributed, and the count is partially computed on the workers, while the total count is calculated back on the driver. The result of the count also ends up on the driver.
Filtering one GB of data isn't really going to stress Spark anyway, so you should only see a short CPU spike on the worker. Rather, take a look at I/O usage.
Your app is fine; there must be something wrong with your setup.
First, go through the Spark UI and make sure you have multiple workers; it also depends on how many partitions your RDD has.
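As a rough sketch (the HDFS path is the one from the question; the partition count of 8 is just an illustrative value), you can request more input partitions and print how many you actually got, since the number of partitions limits how many tasks can run in parallel across the workers:
// Request a minimum number of input partitions (8 is an arbitrary example value),
// then print the actual count; each partition becomes a task that can run on any executor.
JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt", 8);
System.out.println("Number of partitions: " + logData.getNumPartitions());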
