When are custom TableCatalogs loaded? - apache-spark

I've created a custom Catalog in Spark 3.0.0:
class ExCatalogPlugin extends SupportsNamespaces with TableCatalog
I've provided the configuration asking Spark to load the Catalog:
.config("spark.sql.catalog.ex", "com.test.ExCatalogPlugin")
But Spark never loads the plugin: during debugging, no breakpoints are ever hit inside the initialize method, and none of the namespaces it exposes are recognized. No error messages are logged either. Even if I change the configuration to an invalid class name, no errors are thrown.
I wrote a small test case, similar to the test cases in the Spark code, and it is able to load the plugin when I call Catalogs.load directly:
package org.apache.spark.sql.connector.catalog
....
class CatalogsTest extends FunSuite {
  test("EX") {
    val conf = new SQLConf()
    conf.setConfString("spark.sql.catalog.ex", "com.test.ExCatalogPlugin")
    val plugin: CatalogPlugin = Catalogs.load("ex", conf)
  }
}

Spark is using its normal lazy-loading technique and doesn't instantiate the custom catalog plugin until it's needed.
In my case, referencing the plugin in either of two ways caused it to load:
USE ex: this explicit USE statement causes Spark to look up the catalog and instantiate it.
I have a companion TableProvider defined as class DefaultSource extends SupportsCatalogOptions. This class has a hard-coded extractCatalog that returns ex. When I create a reader for this source, it sees the name of the catalog provider and instantiates it, then uses the catalog provider to create the table. Both triggers are sketched below.
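A minimal sketch of both triggers, assuming the configuration above and a hypothetical com.test.DefaultSource that implements SupportsCatalogOptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalog.ex", "com.test.ExCatalogPlugin")
  .getOrCreate()

// 1. An explicit USE forces Spark to resolve the catalog name and call initialize.
spark.sql("USE ex")

// 2. Reading through a SupportsCatalogOptions source whose extractCatalog
//    returns "ex" instantiates the catalog before the table is looked up.
val df = spark.read.format("com.test.DefaultSource").load()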

Related

Using Hive Jars with PySpark

The problem statement is the usage of Hive JARs in PySpark code.
We follow the standard set of steps below.
Create a temporary function in the PySpark code via spark.sql:
spark.sql("create temporary function public_upper_case_udf as 'com.hive.udf.PrivateUpperCase' using JAR 'gs://hivebqjarbucket/UpperCase.jar'")
Invoke the temporary function in subsequent spark.sql statements.
The issue we are facing is that if the Java class in the JAR file is not explicitly declared public, the spark.sql invocations of the Hive UDF fail with the error:
org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'com.hive.udf.PublicUpperCase'
Java Class Code
class PrivateUpperCase extends UDF {
    public String evaluate(String value) {
        return value.toUpperCase();
    }
}
When I make the class public, the issue seems to get resolved.
The question is whether making the class public is the only solution, or whether there is another way around it.
Any assistance is appreciated.
Note: the Hive JARs cannot be converted to Spark UDFs owing to their complexity.
If it were not public, how would external packages call PrivateUpperCase.evaluate?
https://www.java-made-easy.com/java-access-modifiers.html
For PrivateUpperCase to remain non-public, the class would need to be in the same package as the code that calls PrivateUpperCase.evaluate(). You might be able to hunt that caller down and use the same package name, but otherwise the class needs to be public.
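As an aside, a UDF written in Scala is public by default; a minimal sketch (the package and class names mirror the ones above and are illustrative, and it assumes hive-exec on the classpath):

package com.hive.udf

import org.apache.hadoop.hive.ql.exec.UDF

// Scala classes and methods are public unless marked otherwise, so Hive's
// reflective handler lookup can find both the class and evaluate().
class PublicUpperCase extends UDF {
  def evaluate(value: String): String = value.toUpperCase
}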

Circular Reference in Bean Class While Creating a Dataset from an Avro Generated Class

I have a class RawSpan.java that is Avro-generated from the corresponding avdl definition. I am trying to use this class to convert a DataFrame to a Dataset&lt;RawSpan&gt; in Spark:
val ds = df.select("value").select(from_avro($"value", "topic", "schema-reg-url")).select("from_avro(value).*").as[RawSpan]
However, I run into this error during deserialization:
UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema
The problem apparently happens here (L19), as per a similar question asked earlier.
I found this Jira but the PR to address it was closed due to no activity. Is there some workaround to this? My Spark version is 3.1.2. I am running this on Databricks.

How to register the GraphX Edge class with Kryo?

I've been trying to register the Edge class with Kryo, but I always get the following error.
java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.graphx.Edge
Note: To register this class use: kryo.register(org.apache.spark.graphx.Edge.class);
What is wrong with the following line?
sc.getConf.registerKryoClasses(Array(Class.forName("org.apache.spark.graphx.Edge")))
How should I do it?
I've had trouble getting GraphX classes registered. This finally worked for me:
import org.apache.spark.SparkConf
import org.apache.spark.graphx.GraphXUtils

val conf = new SparkConf().setAppName("yourAppName")
GraphXUtils.registerKryoClasses(conf)
Here's what's going on behind the scenes...
https://github.com/amplab/graphx/blob/master/graphx/src/main/scala/org/apache/spark/graphx/GraphKryoRegistrator.scala
In your case... I'm not sure why the following wouldn't work fine, since Edge is exposed...
conf.registerKryoClasses(Array(classOf[Edge]))
But I think there are private classes in GraphX that aren't exposed through the Spark API; at least, I see them in the graphx repo but not in the spark.graphx package. In my case, I couldn't get VertexAttributeBlock registered until I used the GraphXUtils method.
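Putting it together, a minimal sketch of the wiring (the app name is a placeholder, and spark.kryo.registrationRequired is optional but useful for catching unregistered classes early):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphXUtils

val conf = new SparkConf()
  .setAppName("yourAppName")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail fast if anything unregistered is serialized.
  .set("spark.kryo.registrationRequired", "true")

// Registers Edge and the GraphX-internal partition and message classes.
GraphXUtils.registerKryoClasses(conf)

val sc = new SparkContext(conf)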

SparkSession doesn't shutdown properly between unit tests

I have a few unit tests that need their own SparkSession. I extended SQLTestUtils and am overriding the beforeAll and afterAll functions used in many other Spark unit tests (from the source). I have a few test suites that look something like this:
class MyTestSuite extends QueryTest with SQLTestUtils {

  protected var spark: SparkSession = null

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = // initialize sparkSession...
  }

  override def afterAll(): Unit = {
    try {
      spark.stop()
      spark = null
    } finally {
      super.afterAll()
    }
  }

  // ... my tests ...
}
If I run one of these, it's fine, but if I run two or more, I get this error:
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/jenkins/workspace/Query/apache-spark/sql/hive-thriftserver-cat-server/metastore_db.
But I thought afterAll() was supposed to shut Spark down properly so that I could create a new session. Is that not right? How do I accomplish this?
One way to do this is to disable parallel test execution for your Spark app project, to make sure only one SparkSession instance is active at a time. In sbt syntax it would look like this:
lazy val sparkApp = project.in(file("your_spark_app"))
  .settings(parallelExecution in Test := false)
The downside is that this is a per-project setting, and it would also affect tests that would benefit from parallelization. A workaround is to create a separate project for the Spark tests, as sketched below.
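As a minimal build.sbt sketch, assuming a dedicated module for the Spark tests (module and path names are illustrative):

lazy val core = project.in(file("core"))

lazy val sparkTests = project.in(file("spark_tests"))
  .dependsOn(core)
  .settings(
    // Only one SparkSession (and one Derby metastore) at a time.
    parallelExecution in Test := false,
    // Run tests in a forked JVM so state doesn't leak between suites.
    fork in Test := true
  )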

Register custom classes for Kryo serialization in Beam Spark runner

I have seen that Beam Spark runner uses BeamSparkRunnerRegistrator for kryo registration. Is there a way to register custom user classes as well?
There is a way to do so, but first, may I ask why you want to do this?
Generally speaking, Beam's Spark runner uses Beam coders to serialize user data.
We currently have a bug (BEAM-2669) in which cached DStreams are serialized using Kryo, and this fails if the user classes are not Kryo-serializable. We are currently attempting to solve this issue.
If this is the issue you are facing, you can work around it for now by using Kryo's registrator. Is this the issue you are facing, or do you have a different reason for doing this? Please let me know.
In any case, here is how you can provide your own custom JavaSparkContext instance to Beam's Spark runner by using SparkContextOptions:
SparkConf conf = new SparkConf();
conf.set("spark.serializer", KryoSerializer.class.getName());
conf.set("spark.kryo.registrator", "my.custom.KryoRegistrator");
JavaSparkContext jsc = new JavaSparkContext(..., conf);
SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);
Pipeline p = Pipeline.create(options);
For more information see:
Beam Spark runner documentation
Example: ProvidedSparkContextTest.java
Create your own KryoRegistrator with your custom serializer:
package Mypackage

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[A], new CustomASerializer())
  }
}
Then add a configuration entry for it, using your registrator's fully-qualified name, e.g. Mypackage.MyRegistrator:
val conf = new SparkConf()
conf.set("spark.kryo.registrator", "Mypackage.KryoRegistrator")
See the Spark documentation: Data Serialization.
If you don’t want to register your classes, Kryo serialization will still work, but it will have to store the full class name with each object, which is wasteful.
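If you would rather fail fast than silently fall back to storing class names, Spark's spark.kryo.registrationRequired setting can be enabled alongside the registrator (a sketch, reusing the registrator name from above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "Mypackage.MyRegistrator")
  // Throw instead of quietly writing full class names for unregistered classes.
  .set("spark.kryo.registrationRequired", "true")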
