I have seen that the Beam Spark runner uses BeamSparkRunnerRegistrator for Kryo registration. Is there a way to register custom user classes as well?
There is a way to do so, but first, may I ask why you want to do this?
Generally speaking, Beam's Spark runner uses Beam coders to serialize user data.
We currently have a bug (BEAM-2669) in which cached DStreams are serialized using Kryo, which fails if the user classes are not Kryo-serializable. We are working on a fix.
If this is the issue you are facing, you can currently work around it by using a Kryo registrator. If you have a different reason for doing this, please let me know.
In any case, here is how you can provide your own custom JavaSparkContext instance to Beam's Spark runner by using SparkContextOptions:
SparkConf conf = new SparkConf();
conf.set("spark.serializer", KryoSerializer.class.getName());
conf.set("spark.kryo.registrator", "my.custom.KryoRegistrator");
JavaSparkContext jsc = new JavaSparkContext(..., conf);
SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);
Pipeline p = Pipeline.create(options);
For more information see:
Beam Spark runner documentation
Example: ProvidedSparkContextTest.java
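Create your own KryoRegistrator with this custom serializer:
package Mypackage

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
// Registers the user class A with its custom Kryo serializer.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[A], new CustomASerializer())
  }
}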
Then add a configuration entry with your registrator's fully qualified name, e.g. Mypackage.MyRegistrator:
val conf = new SparkConf()
conf.set("spark.kryo.registrator", "Mypackage.MyRegistrator")
See documentation: Data Serialization Spark
If you don’t want to register your classes, Kryo serialization will still work, but it will have to store the full class name with each object, which is wasteful.
I've created a custom Catalog in Spark 3.0.0:
class ExCatalogPlugin extends SupportsNamespaces with TableCatalog
I've provided the configuration asking Spark to load the Catalog:
.config("spark.sql.catalog.ex", "com.test.ExCatalogPlugin")
But Spark never loads the plugin: while debugging, no breakpoints are hit inside the initialize method, and none of the namespaces it exposes are recognized. No error messages are logged either, and changing the class name to an invalid one produces no errors.
I wrote a small test case similar to the test cases in the Spark code, and I am able to load the plugin if I call:
package org.apache.spark.sql.connector.catalog
....
class CatalogsTest extends FunSuite {
test("EX") {
val conf = new SQLConf()
conf.setConfString("spark.sql.catalog.ex", "com.test.ExCatalogPlugin")
val plugin:CatalogPlugin = Catalogs.load("ex", conf)
}
}
Spark uses its normal lazy-loading approach and doesn't instantiate the custom catalog plugin until it's needed.
In my case, referencing the plugin in one of two ways worked:
USE ex: this explicit USE statement caused Spark to look up the catalog and instantiate it.
I have a companion TableProvider defined as class DefaultSource extends SupportsCatalogOptions. This class has a hard coded extractCatalog set to ex. If I create a reader for this source, it sees the name of the catalog provider and will instantiate it. It then uses the Catalog Provider to create the table.
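To illustrate why nothing fails (not even an invalid class name) until the catalog is first referenced, here is a minimal plain-Java model of lazy loading. It is not Spark's implementation: LazyCatalogs and its methods are hypothetical names made up for this sketch, and it only mimics the idea that a class name stored under spark.sql.catalog.<name> is resolved on first lookup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a lazily populated catalog registry (not a Spark class).
class LazyCatalogs {
    private final Map<String, String> conf;                        // config key -> class name
    private final Map<String, Object> loaded = new ConcurrentHashMap<>();

    LazyCatalogs(Map<String, String> conf) {
        this.conf = conf;                                          // nothing is instantiated here
    }

    // Resolves and instantiates the configured class the first time the name is used.
    Object catalog(String name) {
        return loaded.computeIfAbsent(name, n -> {
            String className = conf.get("spark.sql.catalog." + n);
            if (className == null) {
                throw new IllegalArgumentException("no catalog configured: " + n);
            }
            try {
                return Class.forName(className).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot load catalog " + n + " from " + className, e);
            }
        });
    }

    boolean isLoaded(String name) {
        return loaded.containsKey(name);
    }

    public static void main(String[] args) {
        // An invalid class name is accepted silently at configuration time...
        LazyCatalogs catalogs = new LazyCatalogs(Map.of("spark.sql.catalog.ex", "com.test.DoesNotExist"));
        System.out.println("loaded after construction: " + catalogs.isLoaded("ex"));
        // ...and only fails once something actually asks for the catalog.
        try {
            catalogs.catalog("ex");
        } catch (IllegalStateException e) {
            System.out.println("failed on first use: " + e.getMessage());
        }
    }
}
```

In this model, the first call to catalog("ex") plays the role of the USE ex statement or of the reader that names the catalog.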
I'm still new to Spark. I wrote this code to parse a large list of strings.
I had to use forEachRemaining because I had to initialize some non-serializable objects in each partition.
JavaRDD<String> lines = initRetriever();
lines.foreachPartition(iter -> {
    // Non-serializable helpers are created once per partition.
    Obj1 obj1 = initObj1();
    MyStringParser parser = new MyStringParser(obj1);
    iter.forEachRemaining(str -> {
        try {
            parser.parse(str);
        } catch (ParsingException e) {
            e.printStackTrace();
        }
    });
    System.out.print("all parsed");
    obj1.close();
});
I believe Spark is all about parallelism. But this program uses only a single thread on my local machine. Did I do something wrong? Missing configuration? Or maybe the iter doesn't allow it to execute all in parallel.
EDIT
I have no configuration files for Spark.
That's how I initialize Spark
SparkConf conf = new SparkConf()
.setAppName(AbstractSparkImporter.class.getCanonicalName())
.setMaster("local");
I run it from my IDE and using the mvn:exec command.
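As @Alberto-Bonsanto indicated, setting the master to local[*] makes Spark use as many worker threads as there are logical cores on the machine, whereas plain local runs with a single worker thread. More info here.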
SparkConf conf = new SparkConf()
.setAppName(AbstractSparkImporter.class.getCanonicalName())
.setMaster("local[*]");
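As a rough illustration of how the local master strings map to worker threads, here is a small plain-Java model of the documented behavior (local is one thread, local[K] is K threads, local[*] is one thread per logical core). It is not Spark's own parser: LocalMaster and workerThreads are hypothetical names, and other forms such as local[K,F] are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper modelling the documented meaning of simple local master URLs.
class LocalMaster {
    private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)\\]");

    // Returns the number of worker threads implied by "local", "local[K]" or "local[*]".
    static int workerThreads(String master) {
        if (master.equals("local")) {
            return 1;                                   // plain "local" means a single worker thread
        }
        Matcher m = LOCAL_N.matcher(master);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a simple local master: " + master);
        }
        String n = m.group(1);
        return n.equals("*")
                ? Runtime.getRuntime().availableProcessors()   // one thread per logical core
                : Integer.parseInt(n);
    }

    public static void main(String[] args) {
        for (String master : new String[] {"local", "local[4]", "local[*]"}) {
            System.out.println(master + " -> " + workerThreads(master) + " worker thread(s)");
        }
    }
}
```

This is why the original setMaster("local") run never used more than one thread, regardless of how many partitions the RDD had.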
Is there a way to get Spark configurations from the worker (i.e. inside the closure of a map function)? I tried using
SparkEnv.get().conf()
but it seems to not contain all the custom spark configs that I've set prior to creating SparkContext
EDIT:
Through SparkEnv I'm able to get default configurations set via spark-defaults.conf, but all confs I set explicitly through the setter method
SparkConf conf = new SparkConf();
conf.set("my.configuration.key", "myConfigValue");
SparkContext sc = new SparkContext(conf);
are not present in the SparkConf object I get through SparkEnv.get().conf()
SparkEnv is a part of the developer API and is not intended for external use.
You can simply create a broadcast variable, though.
val confBd = sc.broadcast(sc.getConf.getAll.toMap)
rdd.foreachPartition(_ => println(confBd.value.get("spark.driver.host")))
Is there an idiomatic way to create a SparkContext that defaults to a fallback master if no other master is provided?
e.g.
new SparkContext(defaultMaster = "local[4]")
If I run this with let's say, spark-submit and specify a master as a CLI param, or via an env variable, it will use that, but if I run it without specifying anything, it will default to what I provided above.
Is there a built-in way to achieve this? (I have workarounds, but I was wondering if there is a common pattern for this behavior.)
You can use the following:
val conf = new SparkConf()
conf.setIfMissing("spark.master", "local[4]")
val sc = new SparkContext(conf)
You can set the default master URL in conf/spark-defaults.conf in the Spark directory, or use:
val conf = new SparkConf()
conf.setMaster("local[4]")
val sc = new SparkContext(conf)
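A master URL passed with --master overrides the default from spark-defaults.conf. Note, however, that a value set explicitly on the SparkConf with setMaster takes precedence over --master, so use setIfMissing if the command-line value should win.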
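As a rough model of the precedence described in the Spark configuration documentation (values set directly on SparkConf win over spark-submit flags such as --master, which in turn win over spark-defaults.conf), here is a plain-Java sketch. It does not use Spark: MasterPrecedence, effectiveConf and masterWithFallback are hypothetical names, and Map.putIfAbsent merely stands in for SparkConf.setIfMissing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of how the three configuration sources are layered.
class MasterPrecedence {

    // Layers the sources from lowest to highest priority; later puts overwrite earlier ones.
    static Map<String, String> effectiveConf(Map<String, String> defaultsFile,
                                             Map<String, String> submitFlags,
                                             Map<String, String> setInCode) {
        Map<String, String> conf = new HashMap<>(defaultsFile);
        conf.putAll(submitFlags);
        conf.putAll(setInCode);
        return conf;
    }

    // Analogue of setIfMissing: the fallback applies only when no source supplied a master.
    static String masterWithFallback(Map<String, String> defaultsFile,
                                     Map<String, String> submitFlags,
                                     String fallback) {
        Map<String, String> conf = effectiveConf(defaultsFile, submitFlags, Map.of());
        conf.putIfAbsent("spark.master", fallback);
        return conf.get("spark.master");
    }

    public static void main(String[] args) {
        // No master supplied anywhere: the fallback is used.
        System.out.println(masterWithFallback(Map.of(), Map.of(), "local[4]"));
        // Master supplied on the command line: the fallback is ignored.
        System.out.println(masterWithFallback(Map.of(), Map.of("spark.master", "yarn"), "local[4]"));
    }
}
```

The design point is that a hard-coded setMaster behaves like the setInCode layer and always wins, while the putIfAbsent variant leaves room for whatever the launcher provided.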
I've got a Scala Spark Streaming application that I'm running from inside IntelliJ. When I run against local[2], it runs fine. If I set the master to spark://masterip:port, I get the following exception:
java.lang.ClassNotFoundException: RmqReceiver
I should add that I've got a custom receiver implemented in the same project called RmqReceiver. This is my app's code:
import akka.actor.{Props, ActorSystem}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}
object Streamer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).setMaster("spark://192.168.40.2:7077").setAppName("Streamer")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val messages = ssc.receiverStream(new RmqReceiver(...))
    messages.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The RmqReceiver class is in the same scala folder as Streamer. I understand that using spark-submit with --jars for dependencies will likely make this work. Is there any way to get this working from inside the application?
To run a job on a standalone Spark cluster, the cluster needs to know about all classes used in your application. You could add them to the Spark classpath at startup, but that is difficult and I don't recommend it.
Instead, package your application as an uber-jar (bundle all dependencies into a single jar file) and add it to the SparkConf jars.
We use the sbt-assembly plugin. If you're using Maven, the Maven Assembly plugin provides the same functionality.
val sparkConf = new SparkConf().
setMaster(config.getString("spark.master")).
setJars(SparkContext.jarOfClass(this.getClass).toSeq)
I don't think you can do it from IntelliJ IDEA, but you can definitely do it as part of the sbt test phase.