Spark Custom Aggregator -- register and invoke through PySpark - apache-spark

According to various docs, a custom Aggregator in Spark must be written in Java/Scala.
https://medium.com/swlh/apache-spark-3-0-remarkable-improvements-in-custom-aggregation-41dbaf725903
I have built and compiled a test implementation of a custom aggregator, but would now like to register and invoke it through PySpark and SparkSQL.
I tried spark.udf.registerJavaUDAF ... but that seems to work only with the older-style UDAF functions, not the new Aggregators.
How can I register a new Aggregator function written in Java through PySpark, if at all possible? (I know how to pass the JAR to spark-submit etc.; the problem is the registration call.)

I'm not sure what the correct approach is, but I was able to get the following to work.
In your Java class that extends Aggregator:
// This is assumed to be part of: com.example.java.udaf
// MyUdaf is the class that extends Aggregator
// Requires org.apache.spark.sql.SparkSession, org.apache.spark.sql.Encoders,
// and org.apache.spark.sql.functions
// I'm using Encoders.LONG() as an example; change this as needed
// Change the registered Spark SQL name, `myUdaf`, as needed
// (if you don't want to hardcode the "myUdaf" string, you can pass that in too)
// Expose UDAF registration -- this function is what Python will call
public static void register(SparkSession spark) {
    spark.udf().register("myUdaf", functions.udaf(new MyUdaf(), Encoders.LONG()));
}
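For context, this register method lives inside the Aggregator implementation itself. A minimal long-summing MyUdaf might look like the following sketch (the sum logic, buffer type, and encoders are just illustrative assumptions; adapt them to your aggregator):
package com.example.java.udaf;

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Sketch: input, buffer, and output are all Long (a simple sum)
public class MyUdaf extends Aggregator<Long, Long, Long> {
    @Override
    public Long zero() { return 0L; }                         // initial buffer value

    @Override
    public Long reduce(Long buffer, Long input) {             // fold one input value into the buffer
        return buffer + (input == null ? 0L : input);
    }

    @Override
    public Long merge(Long b1, Long b2) { return b1 + b2; }   // combine partial aggregates

    @Override
    public Long finish(Long reduction) { return reduction; }  // final result

    @Override
    public Encoder<Long> bufferEncoder() { return Encoders.LONG(); }

    @Override
    public Encoder<Long> outputEncoder() { return Encoders.LONG(); }

    // ... plus the static register(SparkSession) method shown above
}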
Then in Python:
from pyspark.sql import SparkSession
udaf_jar_path = "..."
# Running in standalone mode
spark = SparkSession.builder \
    .appName("udaf_demo") \
    .config("spark.jars", udaf_jar_path) \
    .master("local[*]") \
    .getOrCreate()
# Register using the registration function provided by the Java class
spark.sparkContext._jvm.com.example.java.udaf.MyUdaf.register(spark._jsparkSession)
As a bonus, you can use this same registration function in Java:
// Running in standalone mode
SparkSession spark = SparkSession
.builder()
.master("local[*]")
.appName("udaf_demo")
.getOrCreate();
register(spark);
Then you should be able to use this directly in Spark SQL:
SELECT
col0
, myUdaf(col1)
FROM some_table
GROUP BY 1
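The registered function can also be called through the DataFrame API rather than SQL; here is a sketch, assuming a Dataset<Row> named df with columns col0 and col1:
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Equivalent to the SQL above: group by col0 and apply the registered UDAF to col1
df.groupBy(col("col0"))
  .agg(callUDF("myUdaf", col("col1")).alias("myUdaf_col1"))
  .show();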
I tested this with a simple summation and it worked reasonably well. For summing 1M numbers, the Python version was ~150ms slower than the Java one (local testing using standalone mode, with both run directly within my IDEs). Compared to the built-in sum it was about half a second slower.
An alternative approach is to use Spark native functions. I haven't directly used this approach; however, I have used the spark-alchemy library which does. See their repo for more details.

Related

Apache Spark + Cassandra + Java + Spark session to display all records

I am working on a Spring Java project and integrating Apache Spark and Cassandra using the DataStax connector.
I have autowired sparkSession and the lines of code below seem to work.
Map<String, String> configMap = new HashMap<>();
configMap.put("keyspace", "key1");
configMap.put("table", tableName.toLowerCase());
Dataset<Row> ds = sparkSession.sqlContext().read().format("org.apache.spark.sql.cassandra").options(configMap)
.load();
ds.show();
But this always gives me 20 records. I want to select all the records of the table. Can someone tell me how to do this?
Thanks in advance.
show outputs 20 records by default, although you can pass an argument to specify how many rows you need. But show is usually used just to briefly examine the data, especially when working interactively.
In your case, everything really depends on what you want to do with the data - you have already successfully loaded it using the load function - after that you can just start to use normal Spark functions: select, filter, groupBy, etc.
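For example, here is a sketch in Java of what that could look like with the Dataset you already loaded (the column name some_column is just a placeholder):
// Assumes: import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row;
// import org.apache.spark.sql.functions; import java.util.List;

// Show more than the default 20 rows, without truncating long values
ds.show(100, false);

// Keep working with the full Dataset using normal Spark operations
long rowCount = ds.count();
Dataset<Row> filtered = ds.filter(functions.col("some_column").isNotNull());
filtered.select("some_column").show(50);

// Only collect to the driver if the result is small enough to fit in memory
List<Row> allRows = ds.collectAsList();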
P.S. You can find more examples of using the Spark Cassandra Connector (SCC) from Java here, although it's more cumbersome than using Scala... And I recommend making sure that you're using SCC 2.5.0 or higher because of the many new features there.

What should be the input to setJars() method in Spark Streaming

val conf = new SparkConf(true)
.setAppName("Streaming Example")
.setMaster("spark://127.0.0.1:7077")
.set("spark.cassandra.connection.host","127.0.0.1")
.set("spark.cleaner.ttl","3600")
.setJars(Array("your-app.jar"))
Let's say I am creating a Spark Streaming application.
What should be the content of the "your-app.jar" file? Do I have to create it manually in my local file system and pass the path, or is it a compiled Scala file built with sbt?
If it's a Scala file, please help me write the code.
Since I am a beginner, I am just trying to run some sample code.
The setJars method of the SparkConf class takes external JARs that need to be distributed on the cluster - any external drivers like JDBC, etc.
You do not have to pass your own application JAR here, if that's what you are asking.
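For illustration, a sketch in Java of passing an external driver JAR (the path is just a placeholder):
import org.apache.spark.SparkConf;

// Distribute an external driver JAR to the executors.
// The path below is only an illustration -- point it at a JAR that exists on the driver machine.
SparkConf conf = new SparkConf(true)
    .setAppName("Streaming Example")
    .setMaster("spark://127.0.0.1:7077")
    .setJars(new String[]{"/path/to/external-driver.jar"});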

SnappyData - snappy-job - cannot run jar file

I'm trying to run a jar file from the snappydata CLI.
I just want to create a SparkSession and a SnappySession at the beginning.
package io.test
import org.apache.spark.sql.{SnappySession, SparkSession}

object snappyTest {
  def main(args: Array[String]) {
    val spark: SparkSession = SparkSession
      .builder
      .appName("SparkApp")
      .master("local")
      .getOrCreate
    val snappy = new SnappySession(spark.sparkContext)
  }
}
From sbt file:
name := "SnappyPoc"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.0.0"
When I'm debugging the code in the IDE it works fine, but when I create a jar file and try to run it directly on snappy, I get the message:
"message": "Ask timed out on [Actor[akka://SnappyLeadJobServer/user/context-supervisor/snappyContext1508488669865777900#1900831413]] after [10000 ms]",
"errorClass": "akka.pattern.AskTimeoutException",
I have Spark Standalone 2.1.1 and SnappyData 1.0.0.
I added the dependencies to the Spark instance.
Could you help me? Thanks in advance.
The difference between "embedded" mode and "smart connector" mode needs to be explained first.
Normally when you run a job using spark-submit, it spawns a set of new executor JVMs as per the configuration to run the code. However, in the embedded mode of SnappyData, the nodes hosting the data also host long-running Spark executors themselves. This is done to minimize data movement (i.e. move execution rather than data). For that mode you can submit a job (using snappy-job.sh) which will run the code on those pre-existing executors. Alternative routes include JDBC/ODBC for embedded execution. This also means that you cannot (yet) use spark-submit to run embedded jobs, because it would spawn its own JVMs.
The "smart connector" mode is the normal way in which other Spark connectors work but, like all of those, it has the disadvantage of having to pull the required data into the executor JVMs, so it will be slower than embedded mode. To configure it, one has to set the "snappydata.connection" property to point to the thrift server running on the SnappyData cluster's locator. It is useful for many cases where users want to expand the execution capacity of the cluster (e.g. if the cluster's embedded execution is saturated on CPU all the time), or for existing Spark distributions/deployments. Needless to say, spark-submit works just fine in connector mode. What is "smart" about this mode is: a) if the physical nodes hosting the data and running the executors are common, then partitions will be routed to those executors as much as possible to minimize network usage, and b) it will use the optimized SnappyData plans for table scans, hash aggregation, and hash joins.
For this specific question, the answer is: runSnappyJob will receive the SnappySession object as an argument, which should be used rather than creating one. The rest of the body that uses the SnappySession will be exactly the same. Likewise, for working with the base SparkContext, it might be easier to implement SparkJob; the code will be similar, except that the SparkContext will be provided as a function argument, which should be used. The reason is as explained above: embedded mode already has a running SparkContext which needs to be used for jobs.
I think the methods isValidJob and runSnappyJob were missing.
When I added those to the code it works, but does anyone know what the relation is between the body of the runSnappyJob method and the main method?
Should they be the same in both?

RegisterTempTable using dataset Spark Java

I have been using DataFrames in my Java Spark project (Spark version 1.6.1).
Now I am refactoring, trying to use Datasets in order to exploit the strongly typed features that come with them.
In some part of the project I was using the following code:
dataframe.registerTempTable("table")
in order to use pure sql queries.
This kind of feature does not seem to be present with Datasets; I cannot find any similar method offered by them.
Can you confirm that?
I confirm that there is no method available in Spark 1.6 for registering a temp table or view using a Dataset.
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/Dataset.html
These methods were introduced in Spark 2.0.
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html
Use createOrReplaceTempView:
public void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
Parameters:
viewName - (undocumented)
Since:
2.0.0
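A sketch of the Spark 2.x usage in Java (the SparkSession spark, the Dataset ds, and the query are assumptions):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Register the Dataset as a temporary view, then query it with plain SQL
ds.createOrReplaceTempView("some_table");
Dataset<Row> result = spark.sql("SELECT * FROM some_table WHERE some_column IS NOT NULL");
result.show();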

How to use getOrCreate() method in SparkContext class and what exactly is the functionality we achieve from this method

What is the use of the getOrCreate() method in the SparkContext class, and how can I use it? I could not find any suitable example (coding-wise) for this.
What I understand is that using the above method I can share a Spark context between applications. What do we mean by applications here?
Is an application a different job submitted to a Spark cluster?
If so, should we then be able to use global variables (broadcast) and temp tables registered in one application in another application?
Please, if anyone can elaborate and give a suitable example on this.
As given in the Javadoc for SparkContext, getOrCreate() is useful when applications may wish to share a SparkContext. So yes, you can use it to share a SparkContext object across applications. And yes, you can re-use broadcast variables and temp tables across them.
As for understanding Spark applications, please refer to this link. In short, an application is the highest-level unit of computation in Spark. What you submit to a Spark cluster is not a job, but an application. Invoking an action inside a Spark application triggers the launch of a job to fulfill it.
getOrCreate
public SparkSession getOrCreate()
Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
This method first checks whether there is a valid thread-local SparkSession and if yes, return that one. It then checks whether there is a valid global default SparkSession and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.
Please check the link: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
An example can be:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
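The same builder pattern works from Java, and calling getOrCreate() a second time returns the existing session (applying any new config from that builder), which is an easy way to see the sharing behavior described above. A sketch:
import org.apache.spark.sql.SparkSession;

SparkSession first = SparkSession.builder()
    .appName("getOrCreate_demo")
    .master("local[*]")
    .getOrCreate();

// A second builder call does not create a new session; it returns the existing one
// (and applies any new config options from this builder to it, as the Javadoc describes)
SparkSession second = SparkSession.builder()
    .config("spark.some.config.option", "some-value")
    .getOrCreate();

System.out.println(first == second); // true -- both references point to the same SparkSession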
