Apache Spark java.lang.ClassNotFoundException

My Spark standalone cluster looks like it's running without a problem:
http://i.stack.imgur.com/gF1fN.png
I followed this tutorial.
I have built a fat jar for running this Java app on the cluster. The project layout before running mvn package:
find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
The content of SimpleApp.java is:
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

public class SimpleApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("spark://10.35.23.13:7077")
                .setAppName("My app")
                .set("spark.executor.memory", "1g");
        JavaSparkContext sc = new JavaSparkContext(conf);

        String logFile = "/home/ubuntu/spark-0.9.1/test_data";
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        System.out.println("Lines with a: " + numAs);
    }
}
This program only works when the master is set to setMaster("local"). Otherwise I get this error:
$ java -cp path_to_file/simple-project-1.0-allinone.jar SimpleApp
http://i.stack.imgur.com/doRSn.png

There's an anonymous class (implementing Function) in the SimpleApp.java file. That class is compiled to SimpleApp$1, which has to be shipped to each worker in the Spark cluster.
The simplest way to do that is to add the jar explicitly to the Spark context: add something like sparkContext.addJar("path_to_file/simple-project-1.0-allinone.jar") right after creating the JavaSparkContext and rebuild your jar file. The main Spark program (called the driver program) will then automatically deliver your application code to the cluster.
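A minimal sketch of where that call goes, reusing the jar path and master URL from the question (the class name here is hypothetical, and the rest of the job stays as in SimpleApp):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleAppWithJar {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("spark://10.35.23.13:7077")
                .setAppName("My app")
                .set("spark.executor.memory", "1g");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Ship the fat jar (which contains SimpleApp and the anonymous
        // SimpleApp$1 class) to every executor in the cluster.
        sc.addJar("path_to_file/simple-project-1.0-allinone.jar");

        // ... the rest of the job (textFile, filter, count) as in the question ...

        sc.stop();
    }
}

Alternatively, launching the application through spark-submit with the fat jar as the application jar distributes the code for you, which is the usual way to avoid this kind of ClassNotFoundException.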

Related

Loading file from HDFS in spark

I'm trying to run this Spark program against HDFS because when I run it locally I don't have enough memory on my PC to handle it. Can someone tell me how to load the CSV file from HDFS instead of loading it locally? Here is my code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class VideoGamesSale {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Video Games Spark")
                .config("spark.master", "local")
                .getOrCreate();
You can use the code below to create a Dataset/DataFrame from a CSV file.
Dataset<Row> csvDS = spark.read().csv("/path/of/csv/file.csv");
If you want to read multiple files from different directories, you can use the following:
Seq<String> paths = scala.collection.JavaConversions.asScalaBuffer(Arrays.asList("path1","path2"));
Dataset<Row> csvsDS = spark.read().csv(paths);
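To address the HDFS part of the question: point the reader at an hdfs:// URI (or a plain path if fs.defaultFS already points at your HDFS cluster). A minimal sketch, with a hypothetical namenode host, port and file path:

// Hypothetical HDFS location; replace host, port and path with your own.
Dataset<Row> csvFromHdfs = spark.read()
        .option("header", "true")   // only if the CSV has a header row
        .csv("hdfs://namenode-host:8020/user/me/video_games_sales.csv");
csvFromHdfs.show(5);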

why some nodes were not assigned to allocate data on spark?

I am a beginner with Spark. I wrote the code below and ran it on multiple nodes.
I have one master node and four worker nodes. I ran the code multiple times and, to my surprise, sometimes only some of the workers did any work, while at other times all the worker nodes worked because they were assigned the data that the master distributed.
I didn't set up any detailed configuration, so this behavior looks weird to me.
I want all my worker nodes to process the data at the same time to get better and faster results. How can I achieve that?
I attached my code and commands. It is very straightforward, so I skipped a detailed explanation. Thanks.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * Created by dst on 2/1/17.
 */
public class Test {
    public static void main(String[] args) throws Exception {
        String inputFile = args[0];
        String outputFile = args[1];

        SparkConf conf = new SparkConf().setAppName("Data Transformation")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(inputFile);
        JavaRDD<String> newLine = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                List<String> ret = new ArrayList<String>();
                List<String> ls = Arrays.asList(s.split("\t"));

                String values = ls.get(ls.size() - 1);
                List<String> value = Arrays.asList(values.split("\\|"));

                for (int i = 0; i < value.size(); ++i) {
                    String ns = ls.get(0) + "\t" + ls.get(1) + "\t" + ls.get(2) + "\t"
                            + ls.get(3) + "\t" + ls.get(4) + "\t" + ls.get(5);
                    ns = ns + "\t" + value.get(i);
                    ret.add(ns);
                }
                return ret.iterator();
            }
        });
        newLine.saveAsTextFile(outputFile);
    }
}
The spark-submit command:
spark-submit \
--class Test \
--master spark://spark.dso.xxxx \
--executor-memory 10G \
/home/jumbo/user/sclee/dt/jars/dt_01_notcache-1.0-SNAPSHOT.jar \
/user/sclee/data/ /user/sclee/output
Referring to the documentation, try setting spark.deploy.spreadOut = false and check whether the behavior remains the same after this setting.
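Note that spark.deploy.spreadOut is a standalone-master property rather than an application property, so (assuming a standard standalone deployment) it would be set on the master, for example via SPARK_MASTER_OPTS in conf/spark-env.sh, followed by a master restart:

# conf/spark-env.sh on the master host (restart the master afterwards)
SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"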

streaming.StreamingContext: Error starting the context, marking it as stopped [Spark Streaming]

I was trying to run a sample Spark Streaming program, but I get this error:
16/06/02 15:25:42 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:624)
at com.streams.spark_consumer.SparkConsumer.main(SparkConsumer.java:56)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:624)
at com.streams.spark_consumer.SparkConsumer.main(SparkConsumer.java:56)
My code is given below. I know there are a few unused imports; I was doing something else and getting the same error, so I modified that code to run the sample program given on the Spark Streaming website:
package com.streams.spark_consumer;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

import scala.Tuple2;
import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkConsumer {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        System.out.println("Yes, it is running"); // just to know if this part of the code is executed

        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        System.out.println("Yes, I told you it is running - 1"); // just to know if this part of the code is executed

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words = lines.flatMap(
            new FlatMapFunction<String, String>() {
                public Iterable<String> call(String x) {
                    return Arrays.asList(x.split(" "));
                }
            });

        JavaPairDStream<String, Integer> pairs = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            });

        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            });

        jssc.start();
        jssc.awaitTermination();
    }
}
Can anybody help me out with this?
I am using a local master; even so, I have tried starting and stopping a master (and slaves as well). I didn't know why that might help, but I have already tried it just in case.
According to the Spark documentation:
Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
So use any of the output operations after your transformations (see the sketch after this list):
print()
foreachRDD(func)
saveAsObjectFiles(prefix, [suffix])
saveAsTextFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
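A minimal sketch of the fix applied to the question's code: registering any output operation on the final DStream, e.g. print(), before starting the context is enough to satisfy the check.

// Register an output operation so the DStream graph has something to execute.
wordCounts.print();

jssc.start();
jssc.awaitTermination();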

checkpoint SqlContext nullpointerException issue

I am using checkpointing in my application, and when my application restarts after a failure, I get a NullPointerException on SQLContext.
I assume the application is not able to recover the SQLContext because of serialization/deserialization issues. Is SQLContext not serializable?
Here is my code:
// Driver class
final JavaSparkContext javaSparkCtx = new JavaSparkContext(conf);
final SQLContext sqlContext = new SQLContext(javaSparkCtx);

JavaStreamingContextFactory javaStreamingContextFactory = new JavaStreamingContextFactory() {
    @Override
    public JavaStreamingContext create() { // only executed the first time
        JavaStreamingContext jssc = new JavaStreamingContext(javaSparkCtx, Durations.minutes(1));
        jssc.checkpoint(CHECKPOINT_DIRECTORY);

        HashMap<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", "abc.xyz.localdomain:6667");
        //....
        JavaDStream<String> fullMsg = messages.map(new MapFunction());
        fullMsg.foreachRDD(new SomeClass(sqlContext));
        return jssc;
    }
};
}

// Closure class
public class SomeClass implements Serializable, Function<JavaRDD<String>, Void> {
    SQLContext sqlContext;

    public SomeClass(SQLContext sqlContext) {
        this.sqlContext = sqlContext;
    }

    public void doSomething() {
        this.sqlContext.createDataFrame(); // here is the NullPointerException
    }
    //.......
}
SQLContext is Serializable because Spark SQL needs to use SQLContext on the executor side internally. However, you should not serialize it into the Streaming checkpoint. Instead, you should get it from the RDD, like this: SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
See Streaming docs for more details: http://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#dataframe-and-sql-operations
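A minimal sketch of what that looks like, assuming the same SomeClass shape as in the question (imports shown for completeness):

import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.SQLContext;

public class SomeClass implements Serializable, Function<JavaRDD<String>, Void> {
    @Override
    public Void call(JavaRDD<String> rdd) {
        // Recover the SQLContext from the RDD's SparkContext instead of
        // serializing it into the closure or the streaming checkpoint.
        SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
        // ... build DataFrames with sqlContext here ...
        return null;
    }
}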

Using Py4J to invoke a method that takes a JavaSparkContext and return a JavaRDD<Integer>

I am looking for some help or example code that illustrates PySpark calling user-written Java code outside of Spark itself, where the Java code takes a Spark context from Python and then returns an RDD built in Java.
For completeness, I'm using Py4J 0.81, Java 8, Python 2.7, and Spark 1.3.1.
Here is what I am using for the Python half:
import pyspark

sc = pyspark.SparkContext(master='local[4]',
                          appName='HelloWorld')
print "version", sc._jsc.version()

from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
print gateway.entry_point.getRDDFromSC(sc._jsc)
The Java portion is:
import java.util.Map;
import java.util.List;
import java.util.ArrayList;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import py4j.GatewayServer;

public class HelloWorld
{
    public JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc)
    {
        JavaRDD<Integer> result = null;
        if (jsc == null)
        {
            System.out.println("XXX Bad mojo XXX");
            return result;
        }

        int n = 10;
        List<Integer> l = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++)
        {
            l.add(i);
        }
        result = jsc.parallelize(l);
        return result;
    }

    public static void main(String[] args)
    {
        HelloWorld app = new HelloWorld();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Running it produces the following on the Python side:
$ spark-1.3.1-bin-hadoop1/bin/spark-submit main.py
version 1.3.1
sc._jsc <class 'py4j.java_gateway.JavaObject'>
org.apache.spark.api.java.JavaSparkContext@50418105
None
The Java side reports:
$ spark-1.3.1-bin-hadoop1/bin/spark-submit --class "HelloWorld" --master local[4] target/hello-world-1.0.jar
XXX Bad mojo XXX
The problem appears to be that I am not correctly passing the JavaSparkContext from Python to Java. The same failure (the returned JavaRDD being null) also occurs when, from Python, I use sc._jsc.sc().
What is the correct way to invoke user-defined Java code that uses Spark from Python?
So I've got an example of this in a branch that I'm working on for Sparkling Pandas. The branch lives at https://github.com/holdenk/sparklingpandas/tree/add-kurtosis-support and the PR is at https://github.com/sparklingpandas/sparklingpandas/pull/90 .
As it stands, it looks like you have two different gateway servers, which seems like it might cause some problems. Instead, you can just use the existing gateway server and do something like:
sc._jvm.what.ever.your.class.package.is.HelloWorld.getRDDFromSC(sc._jsc)
assuming you make that a static method as well.
