I was going through this ticket and could not work out whether Spark supports UDTs in version 2.3+ in any language (Scala, Python, Java, R).
I have a class something like this:
class Test {
    String name;
    int age;
}
And my UDF method is:
public Test UDFMethod(String name, int age) {
    Test ob = new Test();
    ob.name = name;
    ob.age = age;
    return ob;
}
Sample Spark query:
SELECT *, UDFMethod(name, age) FROM SomeTable;
Now UDFMethod(name, age) will return a Test object. So will this work in Spark SQL after annotating the class with SQLUserDefinedType and extending the UserDefinedType class?
The UserDefinedType class was made private in Spark 2.0, so I just want to know whether UDTs are supported in Spark 2.3+. If yes, which is the better one to use: UserDefinedType or UDTRegistration? As of now, both are private in Spark.
As you can check, the JIRA ticket you've linked has been deferred to at least Spark 3.0, which means that no such option is intended for public usage for now.
It is always possible to get around access limits (by reflection, or by putting your own code in the Spark namespace), but it is definitely not supported, and you shouldn't expect help if it fails or breaks in the future.
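To illustrate what that unsupported route looks like, here is a minimal sketch, assuming Spark 2.3 on the classpath and a hypothetical TestUDT subclass of the (private) UserDefinedType<Test>; it reaches into the private UDTRegistration API reflectively and may break in any release:

import java.lang.reflect.Method;

public final class UdtWorkaround {
    // Unsupported: invokes the private UDTRegistration Scala object reflectively.
    // "com.example.Test" and "com.example.TestUDT" are hypothetical class names.
    public static void registerTestUdt() throws Exception {
        Class<?> module = Class.forName("org.apache.spark.sql.types.UDTRegistration$");
        Object instance = module.getField("MODULE$").get(null); // Scala singleton instance
        Method register = module.getMethod("register", String.class, String.class);
        register.invoke(instance, "com.example.Test", "com.example.TestUDT");
    }
}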
Using Room and RxJava 3, my DAO interfaces always have methods with different return types (Maybe and Flowable) but with the exact same query. This is how I solve an important issue: whenever I use Flowables, I want an empty list emitted immediately when the table is empty (rather than nothing at all).
Duplicating the query string may introduce bugs if I ever get sloppy and forget to update all of the copies. Having found that I can still get syntax highlighting in Android Studio when the query is stored in a constant, I came up with the following.
String query = "SELECT * FROM MyObject"; // public, as the DAO is a public Java interface

@Query(query)
Maybe<List<MyObject>> maybeMyObjectList();

@Query(query)
Flowable<List<MyObject>> flowableMyObjectList();
This enables me to do things like:
flowableMyObjectList().startWith(maybeMyObjectList().defaultIfEmpty(Collections.emptyList())).distinctUntilChanged()
Still, having SQL queries stored in a public string feels like a bad idea security-wise. On the other hand, I don't think the database schema in my app bundle is supposed to be secret anyway. Can anyone with better knowledge than mine confirm that it is as bad as it sounds, or, better, propose a workaround?
Instead of an interface you can use an abstract class; you can then have methods with bodies and private variables.
You then have to make the DAO methods abstract.
So you could have:
@Dao
abstract class TheDao {
    private static final String query = "SELECT * FROM MyObject";

    @Query(query)
    abstract Maybe<List<MyObject>> maybeMyObjectList();

    @Query(query)
    abstract Flowable<List<MyObject>> flowableMyObjectList();
}
The problem statement is the usage of Hive JARs in PySpark code.
We are following the below set of standard steps:
1. Create a temporary function in the PySpark code via spark.sql(...):
spark.sql("create temporary function public_upper_case_udf as 'com.hive.udf.PrivateUpperCase' using JAR 'gs://hivebqjarbucket/UpperCase.jar'")
2. Invoke the temporary function in the spark.sql statements, as sketched below.
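For instance, the invocation looks something like this (the table name some_table is hypothetical):

spark.sql("SELECT public_upper_case_udf(name) FROM some_table").show()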
The issue we are facing is that if the Java class in the JAR file is not explicitly declared public, the spark.sql invocations of the Hive UDF fail with:
org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'com.hive.udf.PublicUpperCase'
Java class code:
package com.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

class PrivateUpperCase extends UDF {
    public String evaluate(String value) {
        return value.toUpperCase();
    }
}
When I make the class public, the issue gets resolved.
The question is whether making the class public is the only solution, or whether there is another way around it.
Any assistance is appreciated.
Note: the Hive JARs cannot be converted to Spark UDFs owing to their complexity.
If it were not public, how would external packages call PrivateUpperCase.evaluate?
https://www.java-made-easy.com/java-access-modifiers.html
For PrivateUpperCase to stay package-private, the class would need to be in the same package as the code that calls PrivateUpperCase.evaluate(). You might be able to hunt that down and set the package name to match, but otherwise the class needs to be public.
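That makes the fix the asker already found the idiomatic one; a minimal public version (with an added null guard for safety) would look like:

package com.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Hive instantiates this class from another package, so it must be public.
public class PrivateUpperCase extends UDF {
    public String evaluate(String value) {
        return value == null ? null : value.toUpperCase();
    }
}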
A question for those who know the Presto API for plugins.
I am implementing a BigQuery plugin. BigQuery supports a struct type, which can be represented by the RowType class in Presto.
RowType creates a RowBlockBuilder in RowType::createBlockBuilder; RowBlockBuilder has a RowBlockBuilder::appendStructure method, which accepts only instances of the AbstractSingleRowBlock class.
This means that in my implementation of Presto's RecordCursor, the BigQueryRecordCursor::getObject method has to return something that is an AbstractSingleRowBlock for a field of type RowType.
But AbstractSingleRowBlock has a package-private abstract method, which prevents me from subclassing it. Its only child, SingleRowBlock, has a package-private constructor, and there are no factories or builders that could build an instance for me.
How do I implement struct support in BigQueryRecordCursor::getObject?
(Reminder: BigQueryRecordCursor is a child of RecordCursor.)
You need to assemble the block for the row by calling beginBlockEntry, appending the values for each column via the column type's Type.writeXXX methods, and then calling closeEntry. Here's some pseudo-code:
BlockBuilder builder = type.createBlockBuilder(..);
BlockBuilder rowBuilder = builder.beginBlockEntry();
for each column {
    ...
    columnType.writeXXX(rowBuilder, ...);
}
builder.closeEntry();
return (Block) type.getObject(builder, 0);
However, I suggest you use the columnar APIs instead (i.e., ConnectorPageSource and friends). Take a look at how the Elasticsearch connector implements it:
https://github.com/prestosql/presto/blob/master/presto-elasticsearch/src/main/java/io/prestosql/elasticsearch/ElasticsearchPageSourceProvider.java
https://github.com/prestosql/presto/blob/master/presto-elasticsearch/src/main/java/io/prestosql/elasticsearch/ElasticsearchPageSource.java
Here's how it handles Row types:
https://github.com/prestosql/presto/blob/master/presto-elasticsearch/src/main/java/io/prestosql/elasticsearch/decoders/RowDecoder.java
Also, I suggest you join the #dev channel on the Presto Community Slack, where all the Presto developers hang out.
I have the following data structure (in pseudo code):
class GroupedData {
    String key;
    List<Tuple<String, String>> records;
}
I figured the best way to model this was by doing something like:
@Table("record_by_group")
class DataWithGroup {
    @PrimaryKeyColumn(name = "group_key", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
    String groupKey;

    @PrimaryKeyColumn(name = "data_key", ordinal = 1, type = PrimaryKeyType.CLUSTERED)
    String dataKey;

    String data;
}
I would then stream the GroupedData into DataWithGroup rows using batched operations inside Spring. I am doing this inside a WebFlux application, so I am using the reactive Cassandra repository. What I have noticed is that there is a vanilla CassandraBatchOperations, but there is no ReactiveCassandraBatchOperations. Am I missing something, or is there a way to insert batches reactively? Alternatively, how do I insert something structured like GroupedData into Cassandra? I think I should be using composite columns, but I couldn't really get my head around them, let alone figure out how to map them using spring-data.
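No answer was recorded for this question, but as a minimal sketch under assumptions (a ReactiveCrudRepository<DataWithGroup, MapId> named repo, an all-args DataWithGroup constructor, and a tuple type with getFirst/getSecond accessors, none of which appear in the original), the flattening could look like:

import reactor.core.publisher.Flux;

// Hypothetical flattening of one GroupedData into DataWithGroup rows.
// saveAll issues individual reactive inserts rather than a logged batch.
Flux<DataWithGroup> rows = Flux.fromIterable(grouped.records)
        .map(rec -> new DataWithGroup(grouped.key, rec.getFirst(), rec.getSecond()));
repo.saveAll(rows).subscribe();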
I couldn't find this anywhere online. How can I create a custom user-defined function in Cassandra?
For example:
CREATE OR REPLACE FUNCTION customfunc(custommap map<text, int>)
CALLED ON NULL INPUT
RETURNS map<int,bigint>
LANGUAGE java AS 'return MyClass.mymethod(custommap);';
Where "MyClass" is a class that I can register in the Classpath?
I have the same issue too. Custom classes in UDFs are supported in Cassandra 2.2.14, but not in Cassandra 3.11.4.
Going through the source code, Cassandra 3.11.4 sets up the UDF class loader with no parent class loader so that it has full control over which classes/resources a UDF can use. In org.apache.cassandra.cql3.functions.UDFunction.java, a whitelist and a blacklist are used to control which classes/packages can be accessed.
For your issue, you would have to add the fully qualified name of MyClass to the whitelist and rebuild Cassandra, as sketched below.
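Illustratively, the change in UDFunction.java amounts to appending your class to the whitelist array; the field name and entry format below are assumptions to verify against your source tree:

// In org.apache.cassandra.cql3.functions.UDFunction (Cassandra 3.11.x).
// Assumed shape of the whitelist; check the actual field name in your tree.
// The patterns are matched as '/'-separated resource names.
private static final String[] whitelistedPatterns = {
        "java/lang/",
        "java/util/",
        // ... existing entries ...
        "exp/MyClass",   // added: your UDF helper class
};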
1. First, build the Java project that contains your class. Remember that you have to add a package name to your class.
Example:
package exp;

import java.util.*;

public class MyClass {
    public static Map<Integer, Long> mymethod(Map<String, Integer> data) {
        Map<Integer, Long> map = new HashMap<>();
        map.put(1, 10L);
        map.put(2, 20L);
        map.put(3, 30L);
        return map;
    }
}
After compiling and building, I have the jar test.jar.
2. Copy the jar file to every Cassandra node's $CASSANDRA_HOME/lib directory.
3. Restart all Cassandra nodes.
4. Create your custom function.
Example :
CREATE OR REPLACE FUNCTION customfunc(custommap map<text, int>)
CALLED ON NULL INPUT
RETURNS map<int,bigint>
LANGUAGE java
AS 'return exp.MyClass.mymethod(custommap);';
Now you can use the function:
cassandra#cqlsh:test> SELECT * FROM test_fun ;
id | data
----+------------------
1 | {'a': 1, 'b': 2}
(1 rows)
cassandra#cqlsh:test> SELECT customfunc(data) FROM test_fun ;
test.customfunc(data)
-----------------------
{1: 10, 2: 20, 3: 30}
(1 rows)
Just adding my 2 cents to this thread, as I tried building an external class method to support something similar. After trying for hours with DataStax Sandbox 5.1, I could not get it to work: Cassandra couldn't seem to find my class and kept raising type errors.
My guess is that external JAR-based code for UDFs is not supported (see http://koff.io/posts/hll-in-cassandra/ and https://issues.apache.org/jira/browse/CASSANDRA-9892). Support for "TRUSTED" JARs is in the planning stages for Cassandra 4. It might work in versions before 3.0, but I'm using the latest version from DataStax.
To work around this issue, I had to fall back to a JavaScript version instead (I was trying to convert a JSON string into a Map object).
While I realize Java UDFs perform better, the code I was testing was using Java's Nashorn JavaScript support anyway, so using JavaScript might not be such a bad thing. It also ends up as a simpler one-liner UDF.
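For the record, a hypothetical sketch of that kind of JavaScript fallback; the function name, signature, and the Nashorn-to-CQL map conversion are assumptions, not the author's exact code:

CREATE OR REPLACE FUNCTION json_to_map(json text)
CALLED ON NULL INPUT
RETURNS map<text, text>
LANGUAGE javascript
AS $$
    var obj = JSON.parse(json);
    var map = new java.util.HashMap();      // Nashorn exposes java.util directly
    for (var k in obj) { map.put(k, '' + obj[k]); }
    map;                                    // last expression is the return value
$$;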