Spark Streaming standalone app and dependencies - apache-spark

I've got a Scala Spark Streaming application that I'm running from inside IntelliJ. When I run against local[2], it runs fine. If I set the master to spark://masterip:port, then I get the following exception:
java.lang.ClassNotFoundException: RmqReceiver
I should add that I've got a custom receiver implemented in the same project called RmqReceiver. This is my app's code:
import akka.actor.{Props, ActorSystem}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}
object Streamer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).setMaster("spark://192.168.40.2:7077").setAppName("Streamer")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val messages = ssc.receiverStream(new RmqReceiver(...))
    messages.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The RmqReceiver class is in the same scala folder as Streamer. I understand that using spark-submit with --jars for dependencies will likely make this work. Is there any way to get this working from inside the application?

To run a job on a standalone Spark cluster, the cluster needs to know about all classes used in your application. You could add them to the Spark classpath at startup, but that is awkward and I don't recommend it.
Instead, package your application as an uber-jar (bundle all dependencies into a single jar file) and then add it to the SparkConf jars.
We use the sbt-assembly plugin. If you're using Maven, the Maven assembly plugin offers the same functionality.
val sparkConf = new SparkConf()
  .setMaster(config.getString("spark.master"))
  .setJars(SparkContext.jarOfClass(this.getClass).toSeq)
I don't think you can do it from IntelliJ IDEA, but you can definitely do it as part of the sbt test phase.

Related

How to connect to Phoenix From Apache Spark 2.X Using Java

Surprisingly, I couldn't find any up-to-date Java documentation on the web for this. The one or two examples on the entire World Wide Web are too old. I came up with the following, which fails with the error 'Module not Found org.apache.phoenix.spark', but that module is definitely part of the jar. I don't think this approach is right because it is copy-pasted from different examples, and loading a module like this is a bit of an anti-pattern, as we already have the package as part of the jar. Please show me the right way.
Note: no need for Scala or Python examples, they are easily available on the net.
public class ECLoad {
    public static void main(String[] args) {
        // Create a SparkSession to initialize
        String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
        SparkSession spark = SparkSession
                .builder()
                .appName("ECLoad")
                .master("local")
                .config("spark.sql.warehouse.dir", warehouseLocation)
                .getOrCreate();
        spark.conf().set("spark.testing.memory", "2147480000"); // if you face any memory issue
        Dataset<Row> df = spark.sqlContext().read().format("org.apache.phoenix.spark.*")
                .option("table", "CLINICAL.ENCOUNTER_CASES")
                .option("zkUrl", "localhost:2181")
                .load();
        df.show();
    }
}
I'm trying to run it as
spark-submit --class "encountercases.ECLoad" --jars phoenix-spark-5.0.0-HBase-2.0.jar,phoenix-core-5.0.0-HBase-2.0.jar --master local ./PASpark-1.0-SNAPSHOT.jar
and I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
I see the required jars are already at the suggested path and the hbase-site.xml symlink exists.
Before getting Phoenix working with Spark, you will need to set up the environment for Spark so that it knows how to access Phoenix/HBase.
First, create a symbolic link to hbase-site.xml:
ln -s /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/hbase-site.xml
Alternatively, you can add this file while creating the Spark session, or in the Spark defaults.
You will need to add the jars under /usr/hdp/current/phoenix-client/ to both the driver and the executor classpath. The parameters to set are spark.driver.extraClassPath and spark.executor.extraClassPath.
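For example, the submit command from the question could be extended along these lines (a sketch; the exact client jar name under /usr/hdp/current/phoenix-client/ may differ per distribution):
spark-submit --class "encountercases.ECLoad" \
  --master local \
  --conf spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar \
  --conf spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar \
  ./PASpark-1.0-SNAPSHOT.jar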
The final step is trivial and could easily be translated into Java/Scala/Python/R; the two steps above are critical for it to work, as they set up the environment:
val df = sqlContext.load("org.apache.phoenix.spark",Map("table" -> "CLINICAL.ENCOUNTER_CASES", "zkUrl" -> "localhost:2181"))
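Since the question asked for Java, here is a rough equivalent of the same load (a sketch, assuming the classpath and hbase-site.xml steps above are in place; note the format name has no trailing .*):
Dataset<Row> df = spark.read()
        .format("org.apache.phoenix.spark")
        .option("table", "CLINICAL.ENCOUNTER_CASES")
        .option("zkUrl", "localhost:2181")
        .load();
df.show();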
Refer to: https://community.hortonworks.com/articles/179762/how-to-connect-to-phoenix-tables-using-spark2.html

Register custom classes for Kryo serialization in Beam Spark runner

I have seen that Beam Spark runner uses BeamSparkRunnerRegistrator for kryo registration. Is there a way to register custom user classes as well?
There is a way to do so, but first, may I ask why you want to do this?
Generally speaking, Beam's Spark runner uses Beam coders to serialize user data.
We currently have a bug (BEAM-2669) in which cached DStreams are serialized using Kryo, and if the user classes are not Kryo-serializable this fails. We are currently attempting to solve this issue.
If this is the issue you are facing, you can currently work around it by using Kryo's registrator. Is this the issue you are facing, or do you have a different reason for doing this? Please let me know.
In any case, here is how you can provide your own custom JavaSparkContext instance to Beam's Spark runner by using SparkContextOptions:
SparkConf conf = new SparkConf();
conf.set("spark.serializer", KryoSerializer.class.getName());
conf.set("spark.kryo.registrator", "my.custom.KryoRegistrator");
JavaSparkContext jsc = new JavaSparkContext(..., conf);
SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);
Pipeline p = Pipeline.create(options);
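For completeness, a minimal sketch of what the registrator referenced as "my.custom.KryoRegistrator" above could look like in Java (MyUserClass is a placeholder for your own type):
package my.custom;

import com.esotericsoftware.kryo.Kryo;

// The fully-qualified name of this class is what goes into "spark.kryo.registrator" above.
public class KryoRegistrator implements org.apache.spark.serializer.KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        kryo.register(MyUserClass.class); // placeholder: register the classes that end up in cached data
    }
}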
For more information see:
Beam Spark runner documentation
Example: ProvidedSparkContextTest.java
Create your own KryoRegistrator with this custom serializer:
package Mypackage

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[A], new CustomASerializer())
  }
}
Then, add a configuration entry for it with your registrator's fully-qualified name, e.g. Mypackage.MyRegistrator:
val conf = new SparkConf()
conf.set("spark.kryo.registrator", "Mypackage.MyRegistrator")
See documentation: Data Serialization Spark
If you don’t want to register your classes, Kryo serialization will still work, but it will have to store the full class name with each object, which is wasteful.

Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI

I'm using Spark 2.0 with PySpark.
I am redefining SparkSession parameters through a GetOrCreate method that was introduced in 2.0:
This method first checks whether there is a valid global default SparkSession, and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.
https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.Builder.getOrCreate
So far so good:
from pyspark import SparkConf
SparkConf().toDebugString()
'spark.app.name=pyspark-shell\nspark.master=local[2]\nspark.submit.deployMode=client'
spark.conf.get("spark.app.name")
'pyspark-shell'
Then I redefine the SparkSession config, which the docs promise will be reflected in the WebUI:
appName(name)
Sets a name for the application, which will be shown in the Spark web UI.
https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.Builder.appName
c = SparkConf()
(c
.setAppName("MyApp")
.setMaster("local")
.set("spark.driver.memory","1g")
)
from pyspark.sql import SparkSession
(SparkSession
.builder
.enableHiveSupport() # metastore, serdes, Hive udf
.config(conf=c)
.getOrCreate())
spark.conf.get("spark.app.name")
'MyApp'
Now, when I go to localhost:4040, I would expect to see MyApp as the app name.
However, I still see the pyspark-shell application UI.
Where am I going wrong?
Thanks in advance!
I believe the documentation is a bit misleading here; when you work with Scala you actually see a warning like this:
... WARN SparkSession$Builder: Use an existing SparkSession, some configuration may not take effect.
It was more obvious prior to Spark 2.0, with the clear separation between the two contexts:
SparkContext configuration cannot be modified at runtime. You have to stop the existing context first.
SQLContext configuration can be modified at runtime.
spark.app.name, like many other options, is bound to SparkContext, and cannot be modified without stopping the context.
Reusing existing SparkContext / SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
spark.conf.get("spark.sql.shuffle.partitions")
String = 200
val conf = new SparkConf()
.setAppName("foo")
.set("spark.sql.shuffle.partitions", "2001")
val spark = SparkSession.builder.config(conf).getOrCreate()
... WARN SparkSession$Builder: Use an existing SparkSession ...
spark: org.apache.spark.sql.SparkSession = ...
spark.conf.get("spark.sql.shuffle.partitions")
String = 2001
While spark.app.name config is updated:
spark.conf.get("spark.app.name")
String = foo
it doesn't affect SparkContext:
spark.sparkContext.appName
String = Spark shell
Stopping existing SparkContext / SparkSession
Now let's stop the session and repeat the process:
spark.stop
val spark = SparkSession.builder.config(conf).getOrCreate()
... WARN SparkContext: Use an existing SparkContext ...
spark: org.apache.spark.sql.SparkSession = ...
spark.sparkContext.appName
String = foo
Interestingly, when we stop the session we still get a warning about using an existing SparkContext, but you can check that it has actually been stopped.
I ran into the same problem and struggled with it for a long time, then found a simple solution:
spark.stop()
Then build your new SparkSession again.

Got UnsatisfiedLinkError when start Spark program from Java code

I am using SparkLauncher to launch my Spark app from Java. The code looks like:
Map<String, String> envMap = new HashMap<>();
envMap.put("HADOOP_CONF_DIR","/etc/hadoop/conf");
envMap.put("JAVA_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native");
envMap.put("LD_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native");
envMap.put("SPARK_HOME","/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark");
envMap.put("DEFAULT_HADOOP_HOME","/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop");
envMap.put("SPARK_DIST_CLASSPATH","all jars under /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars");
envMap.put("HADOOP_HOME","/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop");
SparkLauncher sparklauncher = new SparkLauncher(envMap)
.setAppResource("myapp.jar")
.setSparkHome("/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/")
.setMainClass("spark.HelloSpark")
.setMaster("yarn-cluster")
.setConf(SparkLauncher.DRIVER_MEMORY, "2g")
.setConf("spark.driver.userClassPathFirst", "true")
.setConf("spark.executor.userClassPathFirst", "true").launch();
Every time, I got
User class threw exception: java.lang.UnsatisfiedLinkError:
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
It looks like your jar includes Spark/Hadoop libraries that conflict with other libraries in the cluster. Check that your Spark and Hadoop dependencies are marked as provided.
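For example, in Maven that means giving the Spark dependency the provided scope (a sketch; artifact and version are illustrative and depend on your build):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>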

Spark uses only one core when using forEachRemaining

I'm still new to Spark. I wrote this code to parse a large list of strings.
I had to use forEachRemaining because I had to initialize some non-serializable objects in each partition.
JavaRDD<String> lines = initRetriever();
lines.foreachPartition(iter -> {
    Obj1 obj1 = initObj1();
    MyStringParser parser = new MyStringParser(obj1);
    iter.forEachRemaining(str -> {
        try {
            parser.parse(str);
        } catch (ParsingException e) {
            e.printStackTrace();
        }
    });
    System.out.print("all parsed");
    obj1.close();
});
I believe Spark is all about parallelism, but this program uses only a single thread on my local machine. Did I do something wrong? Am I missing some configuration? Or maybe the iterator doesn't allow it to execute in parallel?
EDIT
I have no configuration files for Spark.
This is how I initialize Spark:
SparkConf conf = new SparkConf()
        .setAppName(AbstractSparkImporter.class.getCanonicalName())
        .setMaster("local");
I run it from the IDE and via the mvn exec command.
As #Alberto-Bonsanto indicated, using local[*] triggers Spark to use all available threads. More info here.
SparkConf conf = new SparkConf()
        .setAppName(AbstractSparkImporter.class.getCanonicalName())
        .setMaster("local[*]");
