Using Hive Jars with Pyspark - apache-spark

The problem is about using Hive jars in PySpark code.
We are following the standard set of steps below.
Create a temporary function in the PySpark code via spark.sql(...):
spark.sql("create temporary function public_upper_case_udf as 'com.hive.udf.PrivateUpperCase' using JAR 'gs://hivebqjarbucket/UpperCase.jar'")
Invoke the temporary function in spark.sql statements.
The issue we are facing: if the Java class in the jar file is not explicitly declared public, the spark.sql invocation of the Hive UDF fails with the following error.
org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'com.hive.udf.PublicUpperCase'
Java Class Code
package com.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Package-private (no "public" modifier), which is what triggers the error above.
class PrivateUpperCase extends UDF {
    public String evaluate(String value) {
        return value.toUpperCase();
    }
}
When I make the class public, the issue seems to get resolved.
The question is whether making the class public is the only solution, or whether there is another way around it.
Any assistance is appreciated.
Note - The Hive Jars cannot be converted to Spark UDFs owing to the complexity.

If it were not public, how would external packages call PrivateUpperCase.evaluate?
https://www.java-made-easy.com/java-access-modifiers.html
For PrivateUpperCase to stay non-public (package-private), it would need to be in the same package as the code that calls PrivateUpperCase.evaluate(). You might be able to hunt that caller down and use the same package name, but otherwise the class needs to be public.
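To make the access rule concrete, here is a minimal sketch (Scala, illustration only; this is not Spark's actual code path, and it assumes UpperCase.jar is on the classpath) of the kind of reflective instantiation Spark performs on the UDF class from outside its package:
object AccessCheck {
  def main(args: Array[String]): Unit = {
    val clazz = Class.forName("com.hive.udf.PrivateUpperCase")
    // For a package-private class the no-arg constructor is package-private too,
    // so this throws IllegalAccessException when called from outside com.hive.udf.
    // Declaring the class public makes the call succeed.
    val udf = clazz.getDeclaredConstructor().newInstance()
    println(s"Instantiated ${udf.getClass.getName}")
  }
}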

Related

Circular Reference in Bean Class While Creating a Dataset from an Avro Generated Class

I have a class RawSpan.java that is generated by Avro from the corresponding avdl definition. I am trying to use this class to convert a DataFrame to a Dataset<RawSpan> in Spark as:
val ds = df.select("value").select(from_avro($"value", "topic", "schema-reg-url")).select("from_avro(value).*").as[RawSpan]
However, I run into this error during deserialization:
UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema
The problem apparently happens here (L19), as per a similar question asked earlier.
I found this JIRA, but the PR to address it was closed due to inactivity. Is there some workaround for this? My Spark version is 3.1.2. I am running this on Databricks.

Unable to find Databricks spark sql avro shaded jars in any public maven repository

We are trying to create Avro records with the Confluent schema registry, and we want to publish these records to a Kafka cluster.
To attach the schema ID to each record (magic bytes) we need to use:
to_avro(Column data, Column subject, String schemaRegistryAddress)
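For context, a call site using this three-parameter overload might look like the sketch below (Scala; the subject name, schema-registry address and input DataFrame are placeholders, and the overload only resolves against the Databricks runtime jar, which is exactly the problem described next):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.{col, lit, struct}

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(3).withColumnRenamed("id", "order_id")   // placeholder input

// Serialize the payload as Avro, registering the schema under the placeholder subject "orders-value".
val avroPayload = df.select(
  to_avro(struct(col("order_id")), lit("orders-value"), "https://my-schema-registry:8081").as("value"))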
To automate this, we need to build the project in a pipeline and configure Databricks jobs to use that jar.
The problem we are facing: in notebooks we are able to find a to_avro method with 3 parameters.
But the same library, downloaded for our build from https://mvnrepository.com/artifact/org.apache.spark/spark-avro_2.12/3.1.2, only has 2 overloaded to_avro methods.
Does Databricks have some other Maven repository for its shaded jars?
NOTEBOOK output
import org.apache.spark.sql.avro.functions
println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/databricks/jars/----workspace_spark_3_1--vendor--avro--avro_2.12_deploy_shaded.jar
functions
  .getClass()
  .getMethods()
  .filter(p => p.getName.equals("to_avro"))
  .foreach(f => println(f.getName, f.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))
LOCAL output
import org.apache.spark.sql.avro.functions
println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/<home-dir-path>/.gradle/caches/modules-2/files-2.1/org.apache.spark/spark-avro_2.12/3.1.2/1160ae134351328a0ed6a062183faf9a0d5b46ea/spark-avro_2.12-3.1.2.jar
functions
  .getClass()
  .getMethods()
  .filter(p => p.getName.equals("to_avro"))
  .foreach(f => println(f.getName, f.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))
Versions
Databricks => 9.1 LTS
Apache Spark => 3.1.2
Scala => 2.12
Update from Databricks support
Unfortunately we do not have a shareable jar supporting these functionalities in the DBR. There was a feature request to include this in DBConnect; however, it was not implemented as the feature did not receive enough upvotes.
Since your use case is to automate creation of the Jar file and then submit it as a Job in Databricks, we should be able to create a jar stub (dbr-avro-dummy.jar) with a dummy implementation of the to_avro() function with three parameters, and use this jar as a dependency to fool the compiler when building your actual Jar (for the Job).
This avoids the compilation error while building the Jar; at run time, since the Job runs in the Databricks environment, it will pick up the actual Avro Jar from the DBR.
You may build the dummy Jar Stub using the package code below (you will need the Maven/sbt Spark/Scala dependency for the Column type):
package org.apache.spark.sql

package object avro {
  // Dummy stub: same three-parameter signature as the DBR to_avro; the body is irrelevant.
  def to_avro(data: Column, subject: Column, schemaRegistryAddress: String): Column = {
    new Column("dummy")
  }
}
No, these jars aren't published to any public repository. You may check whether databricks-connect provides these jars (you can get their location with databricks-connect get-jar-dir), but I doubt it.
Another approach is to mock it: create a small library that declares a function with the required signature, use it for compilation only, and don't include it in the resulting jar.
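For example, with sbt and sbt-assembly, something like the following sketch keeps the stub on the compile classpath but out of the assembled job jar (assumptions: the stub jar sits in the project's lib/ directory as an unmanaged dependency, and sbt-assembly builds the fat jar):
// build.sbt (sketch): lib/ jars are picked up automatically as unmanaged compile-time
// dependencies; exclude the stub from the fat jar so the job links against the real
// shaded Avro jar provided by the Databricks runtime.
assembly / assemblyExcludedJars := {
  val cp = (assembly / fullClasspath).value
  cp.filter(_.data.getName == "dbr-avro-dummy.jar")
}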

When are custom TableCatalogs loaded?

I've created a custom Catalog in Spark 3.0.0:
class ExCatalogPlugin extends SupportsNamespaces with TableCatalog
I've provided the configuration asking Spark to load the Catalog:
.config("spark.sql.catalog.ex", "com.test.ExCatalogPlugin")
But Spark never loads the plugin: during debugging no breakpoints are ever hit inside the initialize method, and none of the namespaces it exposes are recognized. There are also no error messages logged, and if I change the class name to an invalid one, no error is thrown either.
I wrote a small test case similar to the test cases in the Spark code, and I am able to load the plugin if I call:
package org.apache.spark.sql.connector.catalog
....
class CatalogsTest extends FunSuite {
  test("EX") {
    val conf = new SQLConf()
    conf.setConfString("spark.sql.catalog.ex", "com.test.ExCatalogPlugin")
    val plugin: CatalogPlugin = Catalogs.load("ex", conf)
  }
}
Spark uses its normal lazy-loading techniques and doesn't instantiate the custom catalog plugin until it's needed.
In my case, referencing the plugin in one of two ways worked:
1. USE ex. This explicit USE statement causes Spark to look up the catalog and instantiate it (sketched below).
2. I have a companion TableProvider defined as class DefaultSource extends SupportsCatalogOptions. This class has a hard-coded extractCatalog set to ex. If I create a reader for this source, it sees the name of the catalog provider and instantiates it. It then uses the catalog provider to create the table.
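A minimal sketch of the first option (Scala; the catalog name and plugin class are taken from the question, and a Spark 3.x session is assumed):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.catalog.ex", "com.test.ExCatalogPlugin")
  .getOrCreate()

spark.sql("USE ex")           // the explicit USE looks up the catalog and calls initialize()
spark.sql("SHOW NAMESPACES")  // now resolved against ExCatalogPlugin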

Does Spark SQL 2.3+ support UDT?

I was going through this ticket and could not understand whether Spark supports UDTs in version 2.3+ in any language (Scala, Python, Java, R).
I have a class something like this:
class Test {
    String name;
    int age;
}
And my UDF method is:
public Test UDFMethod(String name, int age) {
    Test ob = new Test();
    ob.name = name;
    ob.age = age;
    return ob;
}
Sample Spark query
Select *, UDFMethod(name, age) From SomeTable;
Now UDFMethod(name, age) will return a Test object. So will this work in Spark SQL after using the SQLUserDefinedType annotation and extending the UserDefinedType class?
The UserDefinedType class was made private in Spark 2.0. I just want to know whether UDTs are supported in Spark 2.3+, and if so, whether it is better to use UserDefinedType or UDTRegistration. As of now both are private in Spark.
As you can check, the JIRA ticket you've linked has been postponed to at least Spark 3.0, which means there is no such option intended for public use for now.
It is always possible to get around access limits (by reflection, or by putting your own code in the Spark namespace), but it is definitely not supported, and you shouldn't expect help if it fails or breaks in the future.

How can I create User Defined Functions in Cassandra with Custom Java Class?

I couldn't find this anywhere online. How can I create a custom user-defined function in Cassandra?
For example:
CREATE OR REPLACE FUNCTION customfunc(custommap map<text, int>)
CALLED ON NULL INPUT
RETURNS map<int,bigint>
LANGUAGE java AS 'return MyClass.mymethod(custommap);';
Where "MyClass" is a class that I can register in the Classpath?
I have the same issue. A custom class in a UDF is supported in Cassandra 2.2.14, but not in Cassandra 3.11.4.
Going through the source code, Cassandra 3.11.4 sets up the UDF class loader with no parent class loader so that it has full control over which classes/resources a UDF uses. In org.apache.cassandra.cql3.functions.UDFunction.java, a whitelist and a blacklist are used to control which classes/packages can be accessed.
For your issue, you would have to add the fully qualified name of MyClass to the whitelist and rebuild Cassandra.
1. First build the Java project that contains your class. Remember to add a package name to your class.
Example:
package exp;

import java.util.*;

public class MyClass
{
    public static Map<Integer, Long> mymethod(Map<String, Integer> data) {
        Map<Integer, Long> map = new HashMap<>();
        map.put(1, 10L);
        map.put(2, 20L);
        map.put(3, 30L);
        return map;
    }
}
After compiling and building, I have the jar test.jar.
2. Copy the jar file to the $CASSANDRA_HOME/lib directory on every Cassandra node.
3. Restart all Cassandra nodes.
4. Create your custom function.
Example:
CREATE OR REPLACE FUNCTION customfunc(custommap map<text, int>)
CALLED ON NULL INPUT
RETURNS map<int,bigint>
LANGUAGE java
AS 'return exp.MyClass.mymethod(custommap);';
Now you can use the function:
cassandra#cqlsh:test> SELECT * FROM test_fun ;
id | data
----+------------------
1 | {'a': 1, 'b': 2}
(1 rows)
cassandra#cqlsh:test> SELECT customfunc(data) FROM test_fun ;
test.customfunc(data)
-----------------------
{1: 10, 2: 20, 3: 30}
(1 rows)
Just adding my 2 cents to this thread, as I tried building an external class method to support something similar. After trying for hours with the DataStax Sandbox 5.1, I could not get this to work: it couldn't seem to find my class and kept raising type errors.
My guess is that external JAR-based code for UDFs is not supported (see http://koff.io/posts/hll-in-cassandra/ and https://issues.apache.org/jira/browse/CASSANDRA-9892). Support for "TRUSTED" JARs is in the planning stages for Cassandra 4. It might work in versions before 3.0, but I'm using the latest version from DataStax.
To work around this issue, I had to fall back to using a JavaScript version instead (I was trying to convert a JSON string into a Map object).
While I realize Java UDFs perform better, the code I was testing was using Java's Nashorn JavaScript support anyway, so using JavaScript might not be such a bad thing. It also ends up as a simpler one-liner UDF.
