Pyspark UDF in Java Spark Program - apache-spark

Is there any way I can use UDFs created in PySpark in a Java Spark job?
I know there is a way to use Java UDFs in PySpark, but I am looking for the other way round.

First, I have to say that I don’t recommend doing that. It will add significant latency to every UDF call, and I really suggest you try writing the UDF in Scala or Java instead.
If you still want to do that, here is how:
You can write a UDF that creates an embedded Python interpreter (Jython, in this example) and executes your code.
Here is a Scala code example:
import org.python.core.PyString
import org.python.util.PythonInterpreter

// skip importing the site module so the embedded interpreter starts without a full Python installation
System.setProperty("python.import.site", "false")
val interpreter = new PythonInterpreter
interpreter.exec("from __builtin__ import *")
// look up a built-in function that takes a string and returns its length
val someFunc = interpreter.get("len")
val result = someFunc.__call__(new PyString("Test!"))
val realResult = result.__tojava__(classOf[Integer]).asInstanceOf[Int]
print(realResult)
This code calls the Python len function on the string "Test!" and returns its result.
I really think it will hurt your job's performance, and you should reconsider this plan.

Related

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use Pyspark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for Pyspark. My strategy is not working, and I think it is because I try to build a Pyspark dataframe inside the kernel function of my map call, which results in an error:
# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()
file_path_list = [path1, path2]  ## list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use the plain-old Python map function (not pyspark's) to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all the results into a single pyspark dataframe, but that doesn't seem to leverage pyspark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard path, as in the sketch below.
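A minimal sketch of that, assuming all your files sit under one directory (the glob path is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# read every CSV matching the glob in one call; Spark plans this lazily and
# unions the files into a single dataframe with a shared schema
df = spark.read.options(delimiter=",", header=True).csv("/path/to/csv_dir/*.csv")
df.show(3)
spark.read.csv also accepts a list of paths, so passing your file_path_list to it directly works as well.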
A Spark cluster is composed of the scheduler (driver) and the workers. You use the SparkSession to pass work to the scheduler. Workers are not allowed to send work to the scheduler themselves, and in most use cases that would be an anti-pattern anyway.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read does not happen immediately. I feel like if it worked, it would pass a DAG back, not data.
That doesn't sound good either, because you want the loading of files to be lazy. Given you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluate eagerly. You would be even more likely to run out of memory that way.
Speaking of memory, Spark lets you perform out-of-core computation (spilling to disk), but there are limits to how much larger than the available memory your data can be. You will inevitably run into errors if you really don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above.
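If you do want per-file reads anyway (your question 2), they have to stay on the driver; a rough sketch of that route using the plain Python map plus functools.reduce you describe (everything here runs sequentially on the driver, and unionByName assumes all files share the same schema):
from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
file_path_list = ["/path/one.csv", "/path/two.csv"]  # placeholder paths

# each read is issued from the driver, so no SparkContext is touched on a worker
dfs = map(lambda p: spark.read.options(delimiter=",", header=True).csv(p),
          file_path_list)

# concatenate the per-file dataframes into one
combined = reduce(DataFrame.unionByName, dfs)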

Pass python functions to Scala RDD in pyspark

I have a Scala library which (to put it simply) receives a function, applies it to an RDD and returns another RDD:
def runFunction(rdd: RDD, function: Any => Any) = {
  ....
  val res = rdd.map(function)
  ...
}
In scala the usage would be
import mylibrary.runFunction
runFunction(myRdd, myScalaFun)
This library is packaged in a jar, and I now want to use it from Python too. What I would like to do is load this library in Python and pass it a Python function. Usage in Python would be:
spark._jvm.mylibrary.runFunction(myPythonRdd, myPythonFun)
This would allow me to use Python functions as well as Scala ones without porting the whole library to Python. Is this something that can be achieved with Spark's ability to go back and forth between Python and the JVM?
There are some subtleties in the way Python and JVM in PySpark communicate. The bridge uses Java objects, i.e., JavaRDD and not RDD, and those need explicit unboxing in Scala. Since your Scala function takes an RDD, you need to write a wrapper in Scala that receives a JavaRDD and performs the unboxing first:
def runFunctionWrapper(jrdd: JavaRDD, ...) = {
  runFunction(jrdd.rdd, ...)
}
Then call it like
spark._jvm.mylibrary.runFunctionWrapper(myPythonRdd._jrdd, ...)
Note that by Python convention, _jrdd is considered a private member of the Python RDD class, so this is effectively relying on an undocumented implementation detail. Same applies to the _jvm member of SparkContext.
The real problem is making Scala call back into Python to apply the function. In PySpark, the Python RDD's map() method creates an instance of org.apache.spark.api.python.PythonFunction, which holds a pickled reference to the Python mapper function together with its environment. Then each RDD partition gets serialised and, together with the pickled function, sent over TCP to a Python process colocated with the Spark executor, where the partition is deserialised and iterated over. Finally, the result gets serialised again and sent back to the executor. The whole process is orchestrated by an instance of org.apache.spark.api.python.PythonRunner. This is very different from building a wrapper around the Python function and passing it to the map() method of the RDD instance.
I believe it is best if you simply replicate the functionality of runFunction in Python or (much better performance-wise) replicate the functionality of myPythonFun in Scala. Or, if what you do can be done interactively, follow @EnzoBnl's suggestion and use a polyglot notebook environment like Zeppelin or Polynote.

What is the spark command to find the operations applied on a specific RDD

If my RDD value is:
val a = sc.parallelize(1 to 5)
and, after some code, I have forgotten what operations were applied to a, what is the command to find those operations?
RDD.toDebugString will give you the required information:
val a = sc.parallelize(1 to 5)
println(a.toDebugString)
prints
(4) ParallelCollectionRDD[0] at parallelize at Test.scala:31 []
You can find some more information on how to interpret the debug string here.
The debug string contains the DAG without data. There is no Spark feature that will "record" all operations including the data. If it is necessary to keep the data, one might try to intercept the Spark API with AspectJ, but this will require a considerable amount of work.
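For a slightly richer lineage than the bare parallelize above, here is a small PySpark sketch (the Scala output is analogous; note that consecutive Python-level map/filter calls are fused into a single PythonRDD entry, whereas in Scala each transformation gets its own MapPartitionsRDD line):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
counts = pairs.reduceByKey(lambda x, y: x + y)

# toDebugString returns bytes in PySpark, hence the decode; the indentation
# in the output marks the shuffle boundary introduced by reduceByKey
print(counts.toDebugString().decode("utf-8"))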

Parallelize SparkSession in PySpark

I would like to compute the top 5 keywords for each country, and inside the method that gets the top 5 keywords, is there any way I can parallelize SparkSessions?
Now I am doing
country_mapping_df.rdd.map(lambda country_tuple: get_top_5_keywords(country_tuple))

def get_top_5_keywords(country_tuple):
    result1 = spark.sql("""sample""")
    result.write_to_s3
which is not working! Does anyone know how to make this work?
Spark does not support two contexts/sessions running concurrently in the same program; hence, you cannot parallelize SparkSessions.
source: https://spark.apache.org/docs/2.4.0/rdd-programming-guide.html#unit-testing
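If the goal is just the top 5 keywords per country, the usual route is to express it as a single Spark job instead of nesting sessions. A rough sketch with a window function (the column names and output path are placeholders):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical input: one row per (country, keyword) with an occurrence count
keywords_df = spark.createDataFrame(
    [("US", "spark", 10), ("US", "python", 7), ("DE", "scala", 5)],
    ["country", "keyword", "cnt"],
)

# rank keywords within each country and keep the top 5
w = Window.partitionBy("country").orderBy(F.col("cnt").desc())
top5 = (keywords_df
        .withColumn("rank", F.row_number().over(w))
        .filter(F.col("rank") <= 5)
        .drop("rank"))

top5.write.mode("overwrite").parquet("s3://my-bucket/top5_keywords/")  # placeholder path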

Spark SQL Stackoverflow

I'm a newbie to Spark and Spark SQL, and I was trying to run the example from the Spark SQL website: just a simple SQL query after loading the schema and data from a directory of JSON files, like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

val path = "/home/shaza90/Desktop/tweets_1428981780000"
val tweet = sqlContext.jsonFile(path).cache()

tweet.registerTempTable("tweet")
tweet.printSchema() // This one works fine

val texts = sqlContext.sql("SELECT tweet.text FROM tweet").collect().foreach(println)
The exception that I'm getting is this one:
java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
Update
I'm able to execute select * from tweet but whenever I use a column name instead of * I get the error.
Any Advice?
This is SPARK-5009 and has been fixed in Apache Spark 1.3.0.
The issue was that to recognize keywords (like SELECT) with any case, all possible uppercase/lowercase combinations (like seLeCT) were generated in a recursive function. This recursion would lead to the StackOverflowError you're seeing, if the keyword was long enough and the stack size small enough. (This suggests that if upgrading to Apache Spark 1.3.0 or later is not an option, you can use -Xss to increase the JVM stack size as a workaround.)
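Since the SQL string is parsed on the driver, one way to apply that workaround is to raise the driver JVM's thread stack size when launching the shell or submitting the job (the 32m value is only an illustration):
spark-shell --driver-java-options "-Xss32m"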
