Spark does not distribute work - apache-spark

I set up two virtual machines to test Spark in a distributed setup. It seems that my jobs are only run locally on one node, the one I use to submit the job.
One node runs as a datanode/worker node, and the second one additionally acts as namenode/secondary namenode.
I configured the underlying Hadoop to use YARN.
The jps command confirms that the various services start correctly and are available after I execute the start-* scripts in hadoop/spark.
I use htop to "track" whether the other node is used, but its CPU usage hovers between 2 and 3%, so it is probably not being used. I wonder what I am missing here.
I start my job with this command:
./spark-submit --class com.... DistributedTest --master yarn-client myJar.jar
This is the class I am executing (the data.txt file is about 1 GB of plain text):
import java.io.IOException;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class DistributedTest
{
    public static void main(String[] args) throws IOException
    {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt"); // .persist(StorageLevel.DISK_ONLY());

        long numAs = logData.filter(new Function<String, Boolean>()
        {
            public Boolean call(String s)
            {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>()
        {
            public Boolean call(String s)
            {
                return s.contains("b");
            }
        }).count();

        sc.close();

        String s = "Lines with a: " + numAs + ", lines with b: " + numBs;
        System.out.println(s);
    }
}
Does anyone have any idea why my setup does not distribute the work?

The filter operation is definitely distributed, and count is partially computed on the workers, while the total count is combined back on the master. The result of the count also ends up on the master.
Filtering one GB of data isn't really going to stress Spark anyway, so you should only see a short CPU spike on the worker. Look at I/O usage instead.
Your app is fine; there must be something wrong with your setup.

First, go through the Spark UI and make sure you have multiple workers registered. How well the work is spread out also depends on how many partitions your RDD has.
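As a rough illustration (the HDFS path is the one from the question, and the partition count of 8 is just an example value), you can check how many partitions the RDD has and increase them so the filter runs as several parallel tasks:

JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt");

// The number of partitions determines how many parallel tasks the filter/count can use.
System.out.println("partitions: " + logData.getNumPartitions());

// Optionally force more partitions; 8 here is an arbitrary example, not a recommendation.
JavaRDD<String> repartitioned = logData.repartition(8);
System.out.println("partitions after repartition: " + repartitioned.getNumPartitions());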

Related

ClassNotFound when submit a Spring-boot-fat-jar to spark

When I build a fat jar using Spring Boot, its name is
sparker4m-0.0.1.jar
and I "spark-submit" it to the Spark cluster:
./bin/spark-submit --deploy-mode client --class org.springframework.boot.loader.JarLauncher ~/Desktop/sparker4m-0.0.1.jar --spark.master=spark://172.16.179.52:7077 --source_path=file:///usr/local/spark-2.3.1-bin-hadoop2.7/hdfs_mock/1530842877616 --out_path=file:///usr/local/spark-2.3.1-bin-hadoop2.7/hdfs_mock/153084287761_out --hadoop.home.dir=/usr/local/hadoop
You can see I use "--class org.springframework.boot.loader.JarLauncher", because it is the main class of a Spring Boot jar; the actual main class is test.sparker4m.Sparker4mApplication.
Here is my code:
@SpringBootApplication
public class Sparker4mApplication {

    public static void main(String[] args) throws Exception {
        ApplicationContext ctx =
                SpringApplication.run(Sparker4mApplication.class, args);

        SparkConf conf = new SparkConf().setAppName("Test")
                .setMaster(happyDay.getVal("spark.master"))
                .set("spark.driver.host", Inet4Address.getLocalHost().getHostAddress());
        JavaSparkContext sc = new JavaSparkContext(conf);
        .........
        JavaPairRDD<String, String> transformed = rdd.mapToPair(new PairFunction<Tuple2<String, PortableDataStream>, String, String>() {
            @Override
            public Tuple2<String, String> call(Tuple2<String, PortableDataStream> tuple2) throws Exception {
                String fname = tuple2._1();
                PortableDataStream content = tuple2._2();
                byte[] bytes = content.toArray();
                String result = YUVSimpleTrans.transform(bytes);
                return new Tuple2<>(fname, result);
            }
        });
        ..........
        .........
    }
}
Everything goes well, but when the Executor runs, an exception is thrown:
Caused by: java.lang.ClassNotFoundException: test.sparker4m.Sparker4mApplication$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
test.sparker4m.Sparker4mApplication$1 is an anonymous inner class used by test.sparker4m.Sparker4mApplication; it is the "new PairFunction<Tuple2<String, PortableDataStream>, String, String>()" instance.
I suppose Spark's executor process uses its own URLClassLoader to load classes, but my (Spring Boot fat jar's) classes are all in sparker4m-0.0.1.jar/BOOT-INF/classes and my dependency jars are all in sparker4m-0.0.1.jar/BOOT-INF/lib, so the URLClassLoader cannot load the class.
Is there a good way to handle this? Any help is appreciated!
Finally, I found a solution.
The key points are:
1. Use an uber-jar that contains all classes and resources directly, not nested inside jar libs.
2. Understand how JarLauncher works. JarLauncher is provided by spring-boot-tools; it creates an archive, loads classes via a URLClassLoader from /BOOT-INF/* (Spring Boot 2.x), and starts the true main class.
So, once "1" is done, "2" is no longer needed, because we don't have to load classes from /BOOT-INF/* any more.
The solution is: use maven-shade-plugin (without spring-boot-maven-plugin) to build an uber-jar, and declare your true main class (not JarLauncher) using org.apache.maven.plugins.shade.resource.ManifestResourceTransformer in that plugin.
After getting everything working, I found that the question Submitting spring boot application jar to spark-submit had already covered this, but its answer didn't point out clearly that you should not use ${start-class} (JarLauncher) but the true main class.
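A minimal sketch of what that pom.xml configuration could look like (the plugin version and the spring.factories merge are illustrative assumptions; the main class is the one from the question):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <!-- Put the real main class into MANIFEST.MF instead of JarLauncher -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>test.sparker4m.Sparker4mApplication</mainClass>
                    </transformer>
                    <!-- Merge spring.factories entries from the shaded Spring dependencies -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/spring.factories</resource>
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>

With this, spark-submit can be pointed at --class test.sparker4m.Sparker4mApplication directly, since the classes now sit at the top level of the jar rather than under BOOT-INF.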

Apache Spark -- Data Grouping and Execution in worker nodes

We are getting live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and then the following is what we want to achieve:
Group the data by deviceId. This is done, but we are not sure whether we can get a Dataset.
Loop through the grouped data and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code collected the data, looped through the RDDs, converted them to Datasets and applied the aggregation logic on the Datasets using the Spark SQLContext APIs.
When doing load testing we saw that 90% of the processing happened on the master node; after a while the CPU usage spiked to 100% and the process bombed out.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far. It actually runs on the worker nodes, but we have yet to get a Dataset for the aggregation logic.
public static void main(String[] args) {
    try {
        mconf = new SparkConf();
        mconf.setAppName("OnPrem");
        mconf.setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(mconf);
        jssc = new JavaStreamingContext(sc, Durations.seconds(60));
        SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
        //spksess.sparkContext().setLogLevel("ERROR");

        Map<String, String> rabbitMqConParams = new HashMap<String, String>();
        rabbitMqConParams.put("hosts", "localhost");
        rabbitMqConParams.put("userName", "guest");
        rabbitMqConParams.put("password", "guest");
        rabbitMqConParams.put("vHost", "/");
        rabbitMqConParams.put("durable", "true");

        List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
        distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));

        Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
            public String call(Delivery message) {
                return new String(message.getBody());
            }
        };

        JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);

        JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2), Durations.seconds(60)); // every 60 seconds one RDD is created
        machineDataRDD.print();

        JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
        JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();

        groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String, Iterable<String>>>() {
            @Override
            public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
                data.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<String>>>>() {
                    @Override
                    public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
                        while (data.hasNext()) {
                            LOGGER.error("Machine Data == >>" + data.next());
                        }
                    }
                });
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
The grouping code below gives us an Iterable of Strings per device; ideally we would like to get a Dataset.
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is the looping using foreachPartition, so that the code execution gets pushed to the worker nodes.
After looking through more code samples and guidelines: SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we will change our strategy and not try to build a Dataset within the foreachPartition loop.
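A minimal sketch of that revised approach, assuming the SparkSession (spksess) and windowed stream (machineDataRDD) from the code above; the aggregation columns are illustrative values taken from the sample JSON, not the real aggregation logic:

// Requires imports for org.apache.spark.sql.Dataset, Row, Encoders and functions.
machineDataRDD.foreachRDD(rdd -> {
    if (rdd.isEmpty()) {
        return;
    }
    // foreachRDD runs on the driver, where the SparkSession is available,
    // but the Dataset operations declared here are still executed on the workers.
    Dataset<Row> batch = spksess.read().json(
            spksess.createDataset(rdd.rdd(), Encoders.STRING()));

    Dataset<Row> perDevice = batch.groupBy("DeviceId")
            .agg(functions.avg("data.speed").alias("avgSpeed"),
                 functions.max("data.Rate").alias("maxRate"));

    perDevice.show();
});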

Spark Streaming Issue (Java)

I am trying to run a network word count job using Spark Streaming. The code is below:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class myclass {

    public static Iterator<String> myfunc(String s) {
        String[] ar = s.split(" ");
        List<String> l = new ArrayList<String>();
        for (String a : ar) {
            l.add(a);
        }
        return l.iterator();
    }

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("trial");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
        JavaPairDStream<String, Integer> tup = lines.flatMap(s -> myfunc(s))
                .mapToPair(s -> new Tuple2<String, Integer>(s, 1))
                .reduceByKey((a, b) -> a + b);
        tup.print();
        jssc.start();
        jssc.awaitTermination();
        jssc.stop();
        jssc.close();
    }
}
I ran this job using the command:
spark-submit --class myclass --master local <myjar.jar>
In another terminal I ran a local server as :
nc -lk 7777
The job runs fine, but it keeps going on and on: it keeps adding a new job for every batch interval. I tried typing lines into the local server; it shows no error whatsoever, but the job keeps going and no output is shown.
I went through some online tutorials where, with the same code, the result is printed regularly, like below:
time 14500000ms
__________________
<output>
But that doesn't happen in my case; it keeps on adding jobs like below:
Added jobs for time 1450000
Added jobs for time 1460000
Added jobs for time 1470000
. . . .
. . . .
What am I doing wrong?

Spark Jobserver: Very large task size

I'm getting messages along the lines of the following in my Spark JobServer logs:
Stage 14 contains a task of very large size (9523 KB). The maximum recommended task size is 100 KB.
I'm creating my RDD with this code:
List<String> data = new ArrayList<>();
for (int i = 0; i < 2000000; i++) {
    data.add(UUID.randomUUID().toString());
}
JavaRDD<String> randomData = sc.parallelize(data).cache();
I understand that the first time I run this the task could be big, because the data in the RDD doesn't exist on the executor nodes yet.
I would have thought that it would be quick on subsequent runs, though (I'm using Spark JobServer to keep the session context around and reuse the RDD): since I'm reusing the RDD, the data should already exist on the nodes.
The code is very simple:
private static Function<String, Boolean> func = new Function<String, Boolean>() {
    public Boolean call(String s) {
        return s.contains("a");
    }
};
----
rdd.filter(func).count();

What is the difference between Spark Client and Spark Driver?

What is the difference between the Spark Client and the Spark Driver? I keep hearing that the Spark Client is the one that does spark-submit, but I never had to do that; all I do is install the Spark libraries and initialize the Spark context inside public static void main(String[] args), like the code below.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.SparkConf;

import scala.Tuple2;

public class JavaWordCount {

    public static void main(String[] args) {

        // create Spark context with Spark configuration
        JavaSparkContext sc = new JavaSparkContext(new SparkConf()
                .setAppName("Spark Count")
                .setMaster(master_url)
                .setJars(JavaSparkContext.jarOfClass(JavaWordCount.class)));

        // get threshold
        final int threshold = Integer.parseInt(args[1]);

        // read in text file and split each document into words
        JavaRDD<String> tokenized = sc.textFile(args[0]).flatMap(
            new FlatMapFunction<String, String>() {
                public Iterable<String> call(String s) {
                    return Arrays.asList(s.split(" "));
                }
            }
        );

        // count the occurrence of each word
        JavaPairRDD<String, Integer> counts = tokenized.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            }
        ).reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            }
        );

        // filter out words with fewer than threshold occurrences
        JavaPairRDD<String, Integer> filtered = counts.filter(
            new Function<Tuple2<String, Integer>, Boolean>() {
                public Boolean call(Tuple2<String, Integer> tup) {
                    return tup._2() >= threshold;
                }
            }
        );

        // count characters
        JavaPairRDD<Character, Integer> charCounts = filtered.flatMap(
            new FlatMapFunction<Tuple2<String, Integer>, Character>() {
                @Override
                public Iterable<Character> call(Tuple2<String, Integer> s) {
                    Collection<Character> chars = new ArrayList<Character>(s._1().length());
                    for (char c : s._1().toCharArray()) {
                        chars.add(c);
                    }
                    return chars;
                }
            }
        ).mapToPair(
            new PairFunction<Character, Character, Integer>() {
                @Override
                public Tuple2<Character, Integer> call(Character c) {
                    return new Tuple2<Character, Integer>(c, 1);
                }
            }
        ).reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            }
        );

        System.out.println(charCounts.collect());
    }
}
In this code I don't see any spark-submit, so which is the client and which is the driver here? Also, I don't really understand what is happening with SparkConf.setJars. Why is that needed and what does it do?
The Master is the one that creates the tasks, submits them to the Workers and coordinates between different job stages, whereas the Driver sits one step above the Master: it is involved in creating the context and building the RDD graph, which it then submits to the Master.
The Spark Driver is the program that declares transformations and actions on RDDs of data and submits such requests to the Master. It is basically where your SparkContext is created.
Hence, the Driver prepares the context and "declares" operations on the data using RDD transformations and actions. The Driver then submits this serialized RDD graph to the Master.
The Master then creates tasks out of it and submits them to the Workers for execution. It also coordinates the different job stages.
By Spark Client I'm assuming you're talking about Spark Master, which is the server responsible for scheduling Spark jobs.
The code you have works with the Spark API, but you have to run it, as you run any program. In order to run jobs in Spark, you have to schedule them with the relevant resource manager (Standalone, YARN or Mesos) to actually execute the job in the cluster environment.
After you assemble your JAR, you go to the Spark Master server and schedule the job using the spark-submit script. As the documentation states:
Once a user application is bundled, it can be launched using the
bin/spark-submit script. This script takes care of setting up the
classpath with Spark and its dependencies, and can support different
cluster managers and deploy modes that Spark supports:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Some of the commonly used options are:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any
Also I don't really understand what is happening with
SparkContext.setJars? Why is that needed and what does it do?
I assume you mean SparkConf.setJars. What that setting does is distribute JARs which aren't packaged into an uber JAR to the worker nodes, if needed. This is equivalent to passing the --jars flag to spark-submit.
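As a small illustration (the JAR path and master URL below are placeholders, not values from the question), the two approaches ship the same extra dependency jar to the executors:

// Programmatically, on the SparkConf:
SparkConf conf = new SparkConf()
        .setAppName("Spark Count")
        .setMaster("spark://master-host:7077")                      // placeholder master URL
        .setJars(new String[] { "/path/to/extra-dependency.jar" }); // placeholder jar path

// The equivalent from the command line:
//   ./bin/spark-submit --class JavaWordCount \
//     --master spark://master-host:7077 \
//     --jars /path/to/extra-dependency.jar \
//     my-app.jar

When running without spark-submit, as in the question's code, the application's own jar is typically added the same way, e.g. via JavaSparkContext.jarOfClass.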
