Issues with Scala ScriptEngine inside spark submit application - apache-spark

I am working on a system where I let users write DSLs, and I load them as instances of my own type at runtime so they can be applied on top of RDDs. The entire application runs as a spark-submit application, and I use a ScriptEngine to compile the DSLs, which are written in Scala itself. Every test works fine in SBT and IntelliJ, but during spark-submit my own types from my fat jar are not available to import in the script. I initialize the script engine as follows.
val engine: ScriptEngine = new ScriptEngineManager().getEngineByName("scala")
private val settings: Settings = engine.asInstanceOf[scala.tools.nsc.interpreter.IMain].settings
settings.usejavacp.value = true
settings.embeddedDefaults[DummyClass]
private val loader: ClassLoader = Thread.currentThread().getContextClassLoader
settings.embeddedDefaults(loader)
It seems like this is a classloader problem during spark-submit, but I cannot figure out why my own types, which live in the same jar as the main program for spark-submit, are unavailable in a script created in the same JVM. The scala-compiler, scala-reflect and scala-library versions are 2.11.8. Some help will be greatly appreciated.

I have found a working solution. By going through the code and a lot of debugging, I finally found out that ScriptEngine creates a classloader for itself by consuming the classpath string of the classloader used to create it. In the case of spark-submit, Spark creates a special classloader which can read from both local and HDFS files, but the classpath string obtained from this classloader does not contain our application jar, which is present on HDFS.
Manually appending my application jar to the ScriptEngine classpath before initialising it solved my problem. For this to work I had to download my application jar from HDFS to the local filesystem before appending it.
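A minimal sketch of that idea, assuming the jar has already been copied to a local path (the path and variable names here are illustrative, not my exact code):
import javax.script.{ScriptEngine, ScriptEngineManager}
import scala.tools.nsc.interpreter.IMain

// Hypothetical local path the application jar was downloaded to from HDFS
val localJarPath = "/tmp/my-app-assembly.jar"

val engine: ScriptEngine = new ScriptEngineManager().getEngineByName("scala")
val settings = engine.asInstanceOf[IMain].settings
settings.usejavacp.value = true
// Append our fat jar so that types defined in it resolve inside compiled scripts
settings.classpath.value = settings.classpath.value + java.io.File.pathSeparator + localJarPath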

If you instantiate the Scala interpreter directly instead of via ScriptEngineManager, you can pass in settings and override the classpath:
val cl = java.lang.Thread.currentThread.getContextClassLoader
val jar = cl.asInstanceOf[java.net.URLClassLoader].getURLs.toList.head.toString
val settings = new scala.tools.nsc.Settings()
settings.classpath.value = jar
val engine = scala.tools.nsc.interpreter.Scripted(settings = settings)

Related

How to get the Hadoop path with Java/Scala API in Code Repositories

I need to read other formats (JSON, binary, XML) and infer the schema dynamically within a transform in Code Repositories, using the Spark data source API.
Example:
val df = spark.read.json(<hadoop_path>)
For that, I need an accessor to the Foundry file system path, which is something like:
foundry://...#url:port/datasets/ri.foundry.main.dataset.../views/ri.foundry.main.transaction.../startTransactionRid/ri.foundry.main.transaction...
This is possible with PySpark API (Python):
filesystem = input_transform.filesystem()
hadoop_path = filesystem.hadoop_path
However, for Java/Scala I didn’t find a way to do it properly.
A getter for the Hadoop path has recently been added to the Foundry Java API. After upgrading the version of the Java transform (transformsJavaVersion >= 1.188.0), you can get it:
val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
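A possible usage sketch, feeding that path into the Spark data source API (the toString conversion is an assumption about the return type of hadoopPath(); myInput and spark are the transform input and session of the surrounding transform):
val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
val df = spark.read.json(hadoopPath.toString)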

Sharing a spark session

Let's say I have a Python file my_python.py in which I have created a SparkSession 'spark'. I have a jar, say my_jar.jar, in which some Spark logic is written. I am not creating a SparkSession in my jar; rather, I want to use the same session created in my_python.py. How do I write a spark-submit command which takes my Python file, my jar and my SparkSession 'spark' as an argument to my jar file?
Is it possible?
If not, please share an alternative way to do this.
So I feel there are two questions here.
Q1. How can you reuse an already created Spark session in your Scala code?
Ans: Inside your Scala code, use the builder to get the existing session:
SparkSession.builder().getOrCreate()
Please check the Spark doc
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html
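As a minimal sketch, the Scala entry point inside the jar might look like this (the object and method names are made up for illustration):
import org.apache.spark.sql.SparkSession

object MyJarLogic {
  def run(): Unit = {
    // Returns the session already created in this JVM by the Python driver,
    // instead of building a new one
    val spark = SparkSession.builder().getOrCreate()
    spark.range(10).show()
  }
}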
Q2. How do you do spark-submit with a .py file as the driver and Scala jar(s) as supporting jars?
Ans: It should be something like this:
./spark-submit --jars myjar.jar,otherjar.jar --py-files path/to/myegg.egg path/to/my_python.py arg1 arg2 arg3
Notice the method name, getOrCreate(): it means that if a Spark session has already been created, no new session is created; the existing one is used.
Check this link for full implementation example:
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/

Set spark configuration

I am trying to set the configuration of a few spark parameters inside the pyspark shell.
I tried the following
spark.conf.set("spark.executor.memory", "16g")
To check if the executor memory has been set, I did the following
spark.conf.get("spark.executor.memory")
which returned "16g".
I tried to check it through sc using
sc._conf.get("spark.executor.memory")
and that returned "4g".
Why do these two return different values, and what is the correct way to set these configurations?
Also, I am fiddling with a bunch of parameters like
"spark.executor.instances"
"spark.executor.cores"
"spark.executor.memory"
"spark.executor.memoryOverhead"
"spark.driver.memory"
"spark.driver.cores"
"spark.driver.memoryOverhead"
"spark.memory.offHeap.size"
"spark.memory.fraction"
"spark.task.cpus"
"spark.memory.offHeap.enabled "
"spark.rpc.io.serverThreads"
"spark.shuffle.file.buffer"
Is there a way to set all of these configurations at once?
EDIT
I need to set the configuration programmatically. How do I change it after I have done spark-submit or started the pyspark shell? I am trying to reduce the runtime of my jobs, so I am going through multiple iterations, changing the Spark configuration and recording the runtimes.
You can set environment variables (e.g. in spark-env.sh; standalone mode only):
SPARK_EXECUTOR_MEMORY=16g
You can also set it in spark-defaults.conf:
spark.executor.memory=16g
But these solutions are hardcoded and pretty much static, while you want different parameters for different jobs; still, you might want to set up some defaults this way.
The best approach is to use spark-submit:
spark-submit --executor-memory 16G
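For example, several of the parameters listed above can be passed in one submit command, either via dedicated flags or generic --conf entries (the values and the application path below are just placeholders):
spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 16g \
  --driver-memory 8g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.shuffle.file.buffer=64k \
  path/to/my_app.py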
The problem with defining variables programmatically is that some of them need to be defined at startup time; if not, precedence rules take over and changes made after the job has started are ignored.
Edit:
The amount of memory per executor is looked up when SparkContext is created.
And
once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
See: SparkConf Documentation
Have you tried changing the variable before the SparkContext is created, then running your iteration, stopping your SparkContext and changing your variable to iterate again?
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().set("spark.executor.memory", "16g")
val sc = new SparkContext(conf)
...
sc.stop()
val conf2 = new SparkConf().set("spark.executor.memory", "24g")
val sc2 = new SparkContext(conf2)
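The same idea with the SparkSession API, as a sketch (the spark and spark2 names are illustrative; in the pyspark shell you would call the equivalent methods on the existing spark object):
import org.apache.spark.sql.SparkSession

// Assuming `spark` is the current session, stop it first
spark.stop()

// Rebuild the session with the new settings before running the next iteration
val spark2 = SparkSession.builder()
  .config("spark.executor.memory", "24g")
  .config("spark.executor.instances", "8")
  .getOrCreate()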
You can debug your configuration using: sc.getConf.toDebugString
See: Spark Configuration
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
You'll need to make sure that your variable is not defined with higher precedence.
Precedence order (from lowest to highest):
conf/spark-defaults.conf
--conf or -c - the command-line option used by spark-submit
SparkConf
I hope this helps.
In Pyspark,
Suppose I want to increase the driver and executor memory in code. I can do it as below:
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory','9.7g')])
To view the updated settings:
spark.sparkContext._conf.getAll()

Read multiple files with SparkSession in Spark 2.0

In Spark 1.6 to read multiple files, I have used:
JavaSparkContext ctx;
ctx.textFile(filePaths);
Here filePaths is a comma-separated list of file paths, for example:
/home/user/folderA/0.log,/home/user/folderB/0.log
But when I upgraded to Spark 2.0, the method
SparkSession sparkSession;
sparkSession.read().textFile(filePaths);
no longer works. The code throws an exception: Path does not exist.
Question: Is there any way to read multiple files from multiple paths in Spark 2.0, just like in Spark 1.6?
Edit: I tried calling the method as in Spark 1.6 using:
sparkSession.sparkContext().textFile(filePaths, 1).toJavaRDD();
That solves the problem. But is there another solution?
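One alternative, sketched here in Scala under the assumption that filePaths is the same comma-separated string: DataFrameReader.textFile accepts multiple paths as varargs, so the string can be split instead of being passed as a single path (which is why the single-string call fails with Path does not exist).
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val filePaths = "/home/user/folderA/0.log,/home/user/folderB/0.log"

// Pass each path as a separate argument; one comma-separated string is treated as a single path
val lines: Dataset[String] = spark.read.textFile(filePaths.split(","): _*)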

Spark workflow with jar

I'm trying to understand the extent to which one must compile a jar to use Spark.
I'd normally write ad-hoc analysis code in an IDE, then run it locally against data with a single click (in the IDE). If my experiments with Spark are giving me the right indication, then I have to compile my script into a jar and send it to all the Spark nodes. I.e. my workflow would be:
Write the analysis script, which will upload a jar of itself (created below).
Make the jar.
Run the script.
For ad-hoc iterative work this seems a bit much, and I don't understand how the REPL gets away without it.
Update:
Here's an example, which I couldn't get to work unless I compiled it into a jar and did sc.addJar. But the fact that I must do this seems odd, since there is only plain Scala and Spark code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD
object Runner {
  def main(args: Array[String]) {
    val logFile = "myData.txt"
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("spark://Spark-Master:7077")
    val sc = new SparkContext(conf)
    sc.addJar("Analysis.jar")
    sc.addFile(logFile)
    val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()
    Analysis.run(logData)
  }
}

object Analysis {
  def run(logData: RDD[String]) {
    val numA = logData.filter(line => line.contains("a")).count()
    val numB = logData.filter(line => line.contains("b")).count()
    println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
  }
}
You are creating an anonymous function in the use of 'filter':
scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>
That function's generated name is not available unless the jar is distributed to the workers. Did the stack trace on the worker highlight a missing symbol?
If you just want to debug locally without having to distribute the jar you could use the 'local' master:
val conf = new SparkConf().setAppName("myApp").setMaster("local")
While creating JARs is the most common way of handling long-running Spark jobs, for interactive development work Spark has shells available directly in Scala, Python & R. The current quick start guide ( https://spark.apache.org/docs/latest/quick-start.html ) only mentions the Scala & Python shells, but the SparkR guide discusses how to work with SparkR interactively as well (see https://spark.apache.org/docs/latest/sparkr.html ). Best of luck with your journeys into Spark as you find yourself working with larger datasets :)
You can use SparkContext.jarOfObject(Analysis) to automatically include the jar that you want to distribute without packaging it yourself.
Find the JAR from which a given class was loaded, to make it easy for
users to pass their JARs to SparkContext.
def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]
You want to do something like:
sc.addJar(SparkContext.jarOfObject(Analysis).get)
HTH!
