ClassNotFound when submitting a Spring Boot fat jar to Spark - apache-spark

When I build a fat jar using Spring Boot, its name is
sparker4m-0.0.1.jar
and I "spark-submit" it to the Spark cluster.
./bin/spark-submit --deploy-mode client --class org.springframework.boot.loader.JarLauncher ~/Desktop/sparker4m-0.0.1.jar --spark.master=spark://172.16.179.52:7077 --source_path=file:///usr/local/spark-2.3.1-bin-hadoop2.7/hdfs_mock/1530842877616 --out_path=file:///usr/local/spark-2.3.1-bin-hadoop2.7/hdfs_mock/153084287761_out --hadoop.home.dir=/usr/local/hadoop
You can see I use "--class org.springframework.boot.loader.JarLauncher", because that is the main class of a Spring Boot jar; the actual main class is test.sparker4m.Sparker4mApplication.
Here is my code:
@SpringBootApplication
public class Sparker4mApplication {
    public static void main(String[] args) throws Exception {
        ApplicationContext ctx =
                SpringApplication.run(Sparker4mApplication.class, args);
        SparkConf conf = new SparkConf().setAppName("Test")
                .setMaster(happyDay.getVal("spark.master"))
                .set("spark.driver.host", Inet4Address.getLocalHost().getHostAddress());
        JavaSparkContext sc = new JavaSparkContext(conf);
        .........
        JavaPairRDD<String, String> transformed = rdd.mapToPair(
                new PairFunction<Tuple2<String, PortableDataStream>, String, String>() {
            @Override
            public Tuple2<String, String> call(Tuple2<String, PortableDataStream> tuple2) throws Exception {
                String fname = tuple2._1();
                PortableDataStream content = tuple2._2();
                byte[] bytes = content.toArray();
                String result = YUVSimpleTrans.transform(bytes);
                return new Tuple2<>(fname, result);
            }
        });
        ..........
        .........
    }
}
Everything goes well, but when the Executor runs, an exception is thrown:
Caused by: java.lang.ClassNotFoundException: test.sparker4m.Sparker4mApplication$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
test.sparker4m.Sparker4mApplication$1 is an anonymous inner class used by test.sparker4m.Sparker4mApplication; it should be the "new PairFunction<Tuple2<String, PortableDataStream>, String, String>()" instance.
I suppose that Spark's Executor process uses its own URLClassLoader to load classes, but my (Spring Boot fat jar's) classes are all in sparker4m-0.0.1.jar/BOOT-INF/classes and my dependency jars are all in sparker4m-0.0.1.jar/BOOT-INF/lib, so the URLClassLoader cannot load the class.
Is there a good solution for this? Any help is appreciated!

Finally, I got a solution.
The key points are:
1. Use an uber-jar with all classes and resources unpacked, not nested "jars in lib".
2. Understand how JarLauncher works. JarLauncher is provided by spring-boot-tools; it creates an archive, URL-class-loads from /BOOT-INF/* (Spring Boot 2.x), and starts the real main class.
So if "1" is done, "2" is not needed, because we no longer need to URL-class-load from /BOOT-INF/*.
The solution is: use maven-shade-plugin (without spring-boot-maven-plugin) to build an uber-jar, and set your real main class (not JarLauncher) using org.apache.maven.plugins.shade.resource.ManifestResourceTransformer in that plugin.
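For reference, a minimal maven-shade-plugin sketch of that setup (the main class is the one from this question; the spring.factories merge is an assumption that applies to typical Spring Boot apps):
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <!-- Put the real main class (not JarLauncher) into MANIFEST.MF -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>test.sparker4m.Sparker4mApplication</mainClass>
                    </transformer>
                    <!-- Assumption: Spring Boot apps usually also need duplicate META-INF resources merged -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/spring.factories</resource>
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>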
After all was done, I found that the question Submitting spring boot application jar to spark-submit had already covered this, but that answer didn't point out clearly that you should not use ${start-class} (JarLauncher) but the real main class.

Related

Register Java Class in Flink Cluster

I am running my fat jar in a Flink cluster; it reads from Kafka and saves to Cassandra. The code is:
final Properties prop = getProperties();
final FlinkKafkaConsumer<String> flinkConsumer = new FlinkKafkaConsumer<>
        (kafkaTopicName, new SimpleStringSchema(), prop);
flinkConsumer.setStartFromEarliest();
final DataStream<String> stream = env.addSource(flinkConsumer);
DataStream<Person> sensorStreaming = stream.flatMap(new FlatMapFunction<String, Person>() {
    @Override
    public void flatMap(String value, Collector<Person> out) throws Exception {
        try {
            out.collect(objectMapper.readValue(value, Person.class));
        } catch (JsonProcessingException e) {
            logger.error("Json Processing Exception", e);
        }
    }
});
savePersonDetails(sensorStreaming);
env.execute();
savePersonDetails(sensorStreaming);
env.execute();
and the Person POJO contains:
@Column(name = "event_time")
private Instant eventTime;
A codec is required on the Cassandra side to store Instant, as below:
final Cluster cluster = ClusterManager.getCluster(cassandraIpAddress);
cluster.getConfiguration().getCodecRegistry().register(InstantCodec.instance);
When I run it standalone it works fine, but when I run it on a local cluster it throws the error below:
Caused by: com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [timestamp <-> java.time.Instant]
at com.datastax.driver.core.CodecRegistry.notFound(CodecRegistry.java:679)
at com.datastax.driver.core.CodecRegistry.createCodec(CodecRegistry.java:526)
at com.datastax.driver.core.CodecRegistry.findCodec(CodecRegistry.java:506)
at com.datastax.driver.core.CodecRegistry.access$200(CodecRegistry.java:140)
at com.datastax.driver.core.CodecRegistry$TypeCodecCacheLoader.load(CodecRegistry.java:211)
at com.datastax.driver.core.CodecRegistry$TypeCodecCacheLoader.load(CodecRegistry.java:208)
I read the document below about registering serializers,
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/custom_serializers.html
but InstantCodec is a third-party one. How can I register it?
I solved the problem: a LocalDateTime was being emitted, and converting it with that same type produced the error above. I changed the type to java.util.Date and then it worked.
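In other words, a minimal sketch of the change (the @Column import from the DataStax mapping annotations is an assumption; only the field type matters here):
import java.util.Date;
import com.datastax.driver.mapping.annotations.Column; // assumed mapping annotation

public class Person {
    // Was: private java.time.Instant eventTime;
    // With java.util.Date the driver's built-in timestamp codec applies,
    // so no InstantCodec registration is needed.
    @Column(name = "event_time")
    private Date eventTime;

    public Date getEventTime() { return eventTime; }
    public void setEventTime(Date eventTime) { this.eventTime = eventTime; }
}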

SparkListener in Spark on YARN-CLUSTER does not work?

My main purpose is to get the appId after submitting a yarn-cluster task through Java code, which is convenient for further business operations.
I add --conf=spark.extraListeners=Mylistener to the submit arguments.
While SparkListener does work when I use Spark in standalone mode, it doesn't work when I run Spark on a cluster over Yarn. Is it possible for SparkListener to work when running over Yarn? If so, what steps should I do to enable that?
Here is the Mylistener class code:
public class Mylistener extends SparkListener {
    private static Logger logger = LoggerFactory.getLogger(EnvelopeSparkListener.class);

    @Override
    public void onApplicationStart(SparkListenerApplicationStart sparkListenerApplicationStart) {
        Option<String> appId = sparkListenerApplicationStart.appId();
        EnvelopeSubmit.appId = appId.get();
        logger.info("====================start");
    }

    @Override
    public void onBlockManagerAdded(SparkListenerBlockManagerAdded blockManagerAdded) {
        logger.info("=====================add");
    }
}
Here is the Main class to submit the application:
public static void main(String[] args) {
    String jarpath = args[0];
    String childArg = args[1];
    System.out.println("jarpath:" + jarpath);
    System.out.println("childArg:" + childArg);
    System.setProperty("HADOOP_USER_NAME", "hdfs");
    String[] arg = {"--verbose=true", "--class=com.cloudera.labs.envelope.EnvelopeMain",
            "--master=yarn", "--deploy-mode=cluster", "--conf=spark.extraListeners=Mylistener",
            "--conf", "spark.eventLog.enabled=true",
            "--conf", "spark.yarn.jars=hdfs://192.168.6.188:8020/user/hdfs/lib/*", jarpath, childArg};
    SparkSubmit.main(arg);
}
If you just want to get the app id you can simply do this,
logger.info(s"Application id: ${sparkSession.sparkContext.applicationId}")
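Since the submit code above is Java, a roughly equivalent Java sketch (assuming the context is created inside the submitted application, e.g. in EnvelopeMain, and the app name is hypothetical) would be:
// Inside the driver, once the context exists:
JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("MyApp")); // hypothetical app name
// On YARN this returns an id of the form "application_<timestamp>_<sequence>"
String appId = jsc.sc().applicationId();
System.out.println("Application id: " + appId);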
Hope this answers your question!

What is the difference between Spark Client and Spark Driver?

What is the difference between Spark Client and Spark Driver? I keep hearing that the Spark Client is the one that does spark-submit, but I have never had to do that; all I do is install the Spark libraries and initialize the Spark context under public static void main(String[] args), like the code below.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.SparkConf;
import scala.Tuple2;

public class JavaWordCount {
    public static void main(String[] args) {
        // create Spark context with Spark configuration
        JavaSparkContext sc = new JavaSparkContext(new SparkConf()
                .setAppName("Spark Count")
                .setMaster(master_url)
                .setJars(JavaSparkContext.jarOfClass(JavaWordCount.class)));
        // get threshold
        final int threshold = Integer.parseInt(args[1]);
        // read in text file and split each document into words
        JavaRDD<String> tokenized = sc.textFile(args[0]).flatMap(
            new FlatMapFunction<String, String>() {
                public Iterable<String> call(String s) {
                    return Arrays.asList(s.split(" "));
                }
            }
        );
        // count the occurrence of each word
        JavaPairRDD<String, Integer> counts = tokenized.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            }
        ).reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            }
        );
        // filter out words with fewer than threshold occurrences
        JavaPairRDD<String, Integer> filtered = counts.filter(
            new Function<Tuple2<String, Integer>, Boolean>() {
                public Boolean call(Tuple2<String, Integer> tup) {
                    return tup._2 >= threshold;
                }
            }
        );
        // count characters
        JavaPairRDD<Character, Integer> charCounts = filtered.flatMap(
            new FlatMapFunction<Tuple2<String, Integer>, Character>() {
                @Override
                public Iterable<Character> call(Tuple2<String, Integer> s) {
                    Collection<Character> chars = new ArrayList<Character>(s._1().length());
                    for (char c : s._1().toCharArray()) {
                        chars.add(c);
                    }
                    return chars;
                }
            }
        ).mapToPair(
            new PairFunction<Character, Character, Integer>() {
                @Override
                public Tuple2<Character, Integer> call(Character c) {
                    return new Tuple2<Character, Integer>(c, 1);
                }
            }
        ).reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            }
        );
        System.out.println(charCounts.collect());
    }
}
In this code I don't see any spark-submit, so which is the client and which is the driver here? Also, I don't really understand what is happening with SparkConf.setJars. Why is it needed and what does it do?
The Master is the one who creates the tasks and submits them to the Workers and coordinates between the different job stages, whereas the Driver sits one step above the Master: it is involved in creating the context and building the RDD graph, which it then submits to the Master.
The Spark Driver is the program that declares transformations and actions on RDDs of data and submits such requests to the Master. It basically creates your SparkContext.
Hence, the Driver prepares the context and "declares" operations on the data using RDD transformations and actions. The Driver then submits this serialized RDD graph to the Master.
The Master then creates tasks out of it and submits them to the Workers for execution. It also coordinates the different job stages.
By Spark Client I'm assuming you're talking about Spark Master, which is the server responsible for scheduling Spark jobs.
The code you have works with the Spark API, but you have to run it, as you run any program. In order to run jobs in Spark, you have to schedule them with the relevant resource manager (Standalone, YARN or Mesos) to actually execute the job in the cluster environment.
After you assemble your JAR, you go to the Spark Master server and schedule the job using the spark-submit script. As the documentation states:
Once a user application is bundled, it can be launched using the
bin/spark-submit script. This script takes care of setting up the
classpath with Spark and its dependencies, and can support different
cluster managers and deploy modes that Spark supports:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Some of the commonly used options are:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any
Also I don't really understand what is happening with
SparkContext.setJars? Why is that needed and what does it do?
I assume you mean SparkConf.setJars. What that setting does is distribute JARs which aren't packaged into an uber JAR to the worker nodes, if needed. This is identical to passing the --jars flag to spark-submit.
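As an illustrative sketch (the master URL and jar paths below are hypothetical), setJars simply lists extra jars the driver should ship to the executors:
SparkConf conf = new SparkConf()
        .setAppName("Spark Count")
        .setMaster("spark://23.195.26.187:7077")  // hypothetical master URL
        // Ship dependency jars that are not bundled into an uber JAR;
        // equivalent to passing --jars to spark-submit.
        .setJars(new String[] {"/opt/libs/gson-2.8.9.jar", "/opt/libs/my-helpers.jar"}); // hypothetical paths
JavaSparkContext sc = new JavaSparkContext(conf);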

Java Spark Streaming to Cassandra

Goal: Read from Kafka with Spark Streaming and store the data in Cassandra
By: Java Spark Cassandra connector 1.6
Data input: simple JSON line objects, e.g. {"id":"1","field1":"value1"}
I have a Java class that reads from Kafka with Spark Streaming, processes the data read, and then stores it in Cassandra.
Here is the main code:
JavaPairReceiverInputDStream<String, String> messages =
        KafkaUtils.createStream(ssc, targetKafkaServerPort, targetTopic, topicMap);
JavaDStream<List<Object>> list = messages.map(new Function<Tuple2<String, String>, List<Object>>() {
    public List<Object> call(Tuple2<String, String> tuple2) {
        List<Object> list = new ArrayList<Object>();
        Gson gson = new Gson();
        MyClass myclass = gson.fromJson(tuple2._2(), MyClass.class);
        myclass.setNewData("new_data");
        String jsonInString = gson.toJson(myclass);
        list.add(jsonInString);
        return list;
    }
});
The next code is incorrect:
javaFunctions(list)
        .writerBuilder("schema", "table", mapToRow(JavaDStream.class))
        .saveToCassandra();
Because "javaFunctions" method expect a JavaRDD object and "list" is a JavaDStream...
I´d need to cast JavaDStream to JavaRDD but I don´t find the right way...
Any help?
Use import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.* instead of com.datastax.spark.connector.japi.CassandraJavaUtil.*
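With that static import, the DStream can be saved directly, without dropping down to foreachRDD. A minimal sketch, assuming a mapped POJO MyObject that matches the columns of "schema"."table" (note that mapToRow still comes from CassandraJavaUtil):
// Assumes:
//   import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.javaFunctions;
//   import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
JavaDStream<MyObject> objects = messages.map(new Function<Tuple2<String, String>, MyObject>() {
    public MyObject call(Tuple2<String, String> tuple2) {
        return new Gson().fromJson(tuple2._2(), MyObject.class);
    }
});
javaFunctions(objects)  // CassandraStreamingJavaUtil.javaFunctions accepts a JavaDStream directly
        .writerBuilder("schema", "table", mapToRow(MyObject.class))
        .saveToCassandra();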
Hmm, not really... What I've done is use a foreachRDD after creating the DStream:
dStream.foreachRDD(new Function<JavaRDD<MyObject>, Void>() {
    @Override
    public Void call(JavaRDD<MyObject> rdd) throws Exception {
        if (rdd != null) {
            javaFunctions(rdd)
                    .writerBuilder("schema", "table", mapToRow(MyObject.class))
                    .saveToCassandra();
            logging(" --> Saved data to cassandra", 1, null);
        }
        return null;
    }
});
Hope this is useful...

Spark does not distribute work

I set up two virtual machines to test Spark in a distributed setup. It seems that my jobs are only run locally on one node, the one I use to submit the job.
One node runs as a datanode/worker node, and the second one additionally runs as namenode/secondary namenode.
I configured the underlying Hadoop to use YARN.
The jps command confirms that the various services started correctly and are basically available after I executed the start* scripts in hadoop/spark.
I use htop to "track" whether the other node is used, but its CPU usage jumps between 2 and 3% --> probably not used. I wonder what I am missing here.
I start my job with this command:
./spark-submit --class com.... DistributedTest --master yarn-client myJar.jar
This is the class I am executing (the data.txt file is about 1 GB of plain text):
public class DistributedTest
{
    public static void main(String[] args)
            throws IOException
    {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt"); // .persist(StorageLevel.DISK_ONLY());
        long numAs = logData.filter(new Function<String, Boolean>()
        {
            public Boolean call(String s)
            {
                return s.contains("a");
            }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>()
        {
            public Boolean call(String s)
            {
                return s.contains("b");
            }
        }).count();
        sc.close();
        String s = "Lines with a: " + numAs + ", lines with b: " + numBs;
        System.out.println(s);
    }
}
Does anyone have any ideas why my setup does not distribute the work?
The filter operation is definitely distributed, and count is partially computed on a worker, while the total count is calculated back on the master. The result of the count is also on the master.
Filtering one GB of data isn't really going to stress Spark anyway, so you should only see a short CPU spike on the worker. Rather take a look at I/O usage.
Your app is fine, there must be something wrong with your setup.
First, go through your Spark UI and make sure you have multiple workers registered; it also depends on how many partitions your RDD has.
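For example, a quick way to check (a sketch; the minPartitions value of 8 is arbitrary) is to request more input splits when reading the file and print how many partitions the RDD actually got:
// Ask for more input splits and verify how many partitions were created.
JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt", 8); // 8 = arbitrary minPartitions
System.out.println("Number of partitions: " + logData.getNumPartitions());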
