Spark workflow with jar - apache-spark

I'm trying to understand the extent to which one must compile a jar to use Spark.
I'd normally write ad-hoc analysis code in an IDE, then run it locally against data with a single click (in the IDE). If my experiments with Spark are giving me the right indication then I have to compile my script into a jar, and send it to all the Spark nodes. I.e. my workflow would be
1. Write the analysis script, which will upload a jar of itself (created below).
2. Go make the jar.
3. Run the script.
For ad-hoc iterative work this seems a bit much, and I don't understand how the REPL gets away without it.
Update:
Here's an example, which I couldn't get to work unless I compiled it into a jar and did sc.addJar. But the fact that I must do this seems odd, since there is only plain Scala and Spark code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD
object Runner {
  def main(args: Array[String]) {
    val logFile = "myData.txt"
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("spark://Spark-Master:7077")
    val sc = new SparkContext(conf)
    sc.addJar("Analysis.jar")
    sc.addFile(logFile)
    val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()
    Analysis.run(logData)
  }
}
object Analysis {
  def run(logData: RDD[String]) {
    val numA = logData.filter(line => line.contains("a")).count()
    val numB = logData.filter(line => line.contains("b")).count()
    println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
  }
}

You are creating an anonymous function in the use of 'filter':
scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>
That function is compiled into a generated class, and the workers cannot load that class unless the jar containing it is distributed to them. Did the stack trace on the worker highlight a missing class?
If you just want to debug locally without having to distribute the jar you could use the 'local' master:
val conf = new SparkConf().setAppName("myApp").setMaster("local")
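To make the point about generated classes concrete, here is a minimal sketch in plain Scala with no Spark involved. It serializes a closure to bytes and reads it back, which is roughly the round trip a task makes between driver and worker. The object and method names are hypothetical; the assumption is that a worker lacking the jar would fail at the deserialize step, because the class that defines the closure cannot be found there.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureShipping {
  // Serialize a value to bytes, as a driver would before sending a task.
  def serialize(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  // Read the bytes back; this step needs the closure's defining class on the classpath.
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // The same kind of anonymous function that is passed to filter.
  val containsA: String => Boolean = line => line.contains("a")

  def main(args: Array[String]): Unit = {
    // The compiler-generated class name; its exact form depends on the Scala version.
    println(containsA.getClass.getName)
    val restored = deserialize[String => Boolean](serialize(containsA))
    println(restored("banana"))
  }
}
```

In a single JVM the round trip succeeds because the defining class is already loaded. With the 'local' master everything also runs in one JVM, which is why no jar needs to be shipped there.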

While creating JARs is the most common way of handling long-running Spark jobs, for interactive development work Spark has shells available directly in Scala, Python & R. The current quick start guide ( https://spark.apache.org/docs/latest/quick-start.html ) only mentions the Scala & Python shells, but the SparkR guide discusses how to work with SparkR interactively as well (see https://spark.apache.org/docs/latest/sparkr.html ). Best of luck with your journeys into Spark as you find yourself working with larger datasets :)

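You can use SparkContext.jarOfClass(Analysis.getClass) to look up the jar that Analysis was loaded from, so you do not have to hard-code its path. From the Spark API docs:
Find the JAR from which a given class was loaded, to make it easy for users to pass their JARs to SparkContext.
def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]
Note that jarOfObject calls getClass on its argument, so pass it the object itself (Analysis), or pass Analysis.getClass to jarOfClass. Both return None when the class was not loaded from a jar, in which case calling .get throws.
You want to do something like:
sc.addJar(SparkContext.jarOfClass(Analysis.getClass).get)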
HTH!
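For intuition about what such a lookup involves, here is a minimal sketch in plain Scala. It is not Spark's code; JarLocator and jarOf are hypothetical names. It asks the class for the URL of its own .class file and extracts the jar path when that URL has the jar:file: form.

```scala
object JarLocator {
  // Returns the path of the jar a class was loaded from, or None if the
  // class came from a directory, the JDK runtime image, or another source.
  def jarOf(cls: Class[_]): Option[String] = {
    val resourceName = "/" + cls.getName.replace('.', '/') + ".class"
    Option(cls.getResource(resourceName)).map(_.toString).collect {
      // Typical form: jar:file:/path/to/app.jar!/pkg/MyClass.class
      case uri if uri.startsWith("jar:file:") && uri.contains("!") =>
        uri.substring("jar:file:".length, uri.indexOf('!'))
    }
  }

  def main(args: Array[String]): Unit = {
    println(jarOf(classOf[Option[_]])) // usually the scala-library jar
    println(jarOf(JarLocator.getClass)) // None when run from a classes directory
  }
}
```

The None case matters for the workflow in the question: when the code runs straight from an IDE's output directory there is no jar to find, so a lookup like this cannot replace the packaging step.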

Related

Spark "modifiedBefore" option while reading data from files

I am using Spark 2.4 to read files from Hadoop.
The requirement is to read only the files whose modification time is before a provided value.
The Spark documentation (Modification Time Path Filters) mentions a modifiedBefore option, but I am not sure whether it is available in Spark 2.4. If it is not, how can I achieve this?
The options modifiedBefore and modifiedAfter are available from Spark 3.1 onward and apply only to batch reads, not streaming. For Spark 2.4, you can use the Hadoop FileSystem globStatus method and filter the files by getModificationTime.
Here is an example of a function that takes a path and a threshold and returns the list of file paths modified before that threshold:
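import org.apache.hadoop.fs.{FileStatus, Path}
// Assumes `sc` is the active SparkContext (as in spark-shell).
def getFilesModifiedBefore(path: Path, modifiedBefore: String): Array[String] = {
  // The threshold string is parsed in the JVM's default time zone.
  val format = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
  val thresholdTime = format.parse(modifiedBefore).getTime()
  // globStatus returns null when a non-glob path does not exist.
  val files = Option(path.getFileSystem(sc.hadoopConfiguration).globStatus(path))
    .getOrElse(Array.empty[FileStatus])
  files.filter(_.getModificationTime < thresholdTime).map(_.getPath.toString)
}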
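Then use it with spark.read.csv. Pass a glob such as /mypath/* so that globStatus returns the files inside the directory rather than the directory itself:
val df = spark.read.csv(getFilesModifiedBefore(new Path("/mypath/*"), "2021-03-17T10:46:12"):_*)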
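One detail worth pinning down is the time zone of the threshold. Hadoop reports modification times as milliseconds since the epoch in UTC, while SimpleDateFormat without an explicit zone parses in the JVM's default zone. The sketch below, using java.time and hypothetical names, makes the zone an explicit parameter so the comparison does not depend on where the driver runs.

```scala
import java.time.{LocalDateTime, ZoneId}

object ModifiedBeforeThreshold {
  // Convert an ISO-style local timestamp (yyyy-MM-ddTHH:mm:ss) to epoch milliseconds in the given zone.
  def toEpochMillis(timestamp: String, zone: ZoneId): Long =
    LocalDateTime.parse(timestamp).atZone(zone).toInstant.toEpochMilli

  // True if a file's modification time (epoch millis) is strictly before the threshold.
  def isModifiedBefore(modificationTimeMillis: Long, threshold: String, zone: ZoneId): Boolean =
    modificationTimeMillis < toEpochMillis(threshold, zone)
}
```

For example, ModifiedBeforeThreshold.isModifiedBefore(status.getModificationTime, "2021-03-17T10:46:12", ZoneId.of("UTC")) could stand in for the comparison in the filter, where status is assumed to be a Hadoop FileStatus.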

Issues with Scala ScriptEngine inside spark submit application

I am working on a system where users write DSLs, which I load at runtime as instances of my own type and which can then be applied on top of RDDs. The entire application runs as a spark-submit application, and I use a ScriptEngine to compile the DSLs, which are written in Scala itself. Every test works fine in SBT and IntelliJ, but under spark-submit the types from my own fat jar are not available to import in the script. I initialize the script engine as follows.
val engine: ScriptEngine = new ScriptEngineManager().getEngineByName("scala")
private val settings: Settings = engine.asInstanceOf[scala.tools.nsc.interpreter.IMain].settings
settings.usejavacp.value = true
settings.embeddedDefaults[DummyClass]
private val loader: ClassLoader = Thread.currentThread().getContextClassLoader
settings.embeddedDefaults(loader)
It seems like this is a classloader problem under spark-submit, but I cannot figure out why the types in my own jar, which also contains the main program for spark-submit, are unavailable in a script created in the same JVM. The scala-compiler, scala-reflect and scala-library versions are all 2.11.8. Any help will be greatly appreciated.
I have found a working solution. After reading the code and a lot of debugging, I found that ScriptEngine creates its own classloader from the classpath string of the classloader used to create it. Under spark-submit, Spark creates a special classloader that can read from both local and HDFS files, but the classpath string obtained from this classloader does not include application jars that live in HDFS.
Manually appending my application jar to the ScriptEngine classpath before initialising it solved my problem. For this to work, I first had to download the application jar from HDFS to the local filesystem.
If you instantiate the Scala interpreter directly instead of via ScriptEngineManager, you can pass in settings and override the classpath:
val cl = java.lang.Thread.currentThread.getContextClassLoader
val jar = cl.asInstanceOf[java.net.URLClassLoader].getURLs.toList.head.toString
val settings = new scala.tools.nsc.Settings()
settings.classpath.value = jar
val engine = scala.tools.nsc.interpreter.Scripted(settings = settings)
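Casting the context classloader to URLClassLoader and taking the first URL can be fragile, since the loader may be of another type or list several entries. As a hedged alternative, the sketch below walks the loader chain, collects every local file entry it can see, and appends one extra jar. InterpreterClasspath and the extraJar parameter are hypothetical; the resulting string is the kind of value that could be assigned to settings.classpath.value.

```scala
import java.io.File
import java.net.URLClassLoader
import java.nio.file.Paths

object InterpreterClasspath {
  // Walk the loader chain and collect every local (file:) entry exposed by a URLClassLoader.
  def localEntries(loader: ClassLoader): List[String] =
    Iterator.iterate(loader)(_.getParent)
      .takeWhile(_ != null)
      .toList
      .flatMap {
        case urlLoader: URLClassLoader =>
          urlLoader.getURLs.toList
            .filter(_.getProtocol == "file")
            .map(url => Paths.get(url.toURI).toString)
        case _ => Nil // loaders that do not expose their URLs contribute nothing
      }

  // Join the discovered entries plus one extra local jar into a classpath string.
  def withExtraJar(loader: ClassLoader, extraJar: String): String =
    (localEntries(loader) :+ extraJar).distinct.mkString(File.pathSeparator)
}
```

On Java 9 and later the application classloader is not a URLClassLoader, so localEntries may return fewer entries than expected. The explicitly appended jar, downloaded locally as the answer describes, is what closes the gap.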

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to spark. If anyone can help it would be very much appreciated!
You haven't mentioned your Spark version or the full details of the exception, but please see PHOENIX-2287, which has been fixed. It lists:
Environment: HBase 1.1.1 running in standalone mode on OS X, Spark 1.5.0, Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56
Updated patch adds support for Spark 1.5.0, and is backwards compatible back down to 1.3.0 (manually tested, Spark version profiles may be worth looking at in the future).
In 1.5.0, they've gone and explicitly hidden the GenericMutableRow data structure. Fortunately, we are able to [use] the external-facing 'Row' data type, which is backwards compatible, and should remain compatible in future releases as well.
As part of the update, Spark SQL deprecated a constructor on their 'DecimalType'. In updating this, I exposed a new issue, which is that we don't carry-forward the precision and scale of the underlying Decimal type through to Spark. For now I've set it to use the Spark defaults, but I'll create another issue for that specifically.
I've included an ignored integration test in this patch as well.

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df = ... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (such as df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one reason or another.
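A quick way to confirm whether a class is visible before running an action is to try loading it by name. The helper below is a minimal sketch with hypothetical names and uses only the JDK. It checks the JVM it runs in, so pasted into spark-shell it tells you about the driver's classpath only, not the executors'.

```scala
object ClassCheck {
  // True if the named class can be found by the given loader.
  // initialize = false avoids running the class's static initializers.
  def isLoadable(className: String,
                 loader: ClassLoader = getClass.getClassLoader): Boolean =
    try {
      Class.forName(className, false, loader)
      true
    } catch {
      case _: ClassNotFoundException | _: LinkageError => false
    }
}
```

For the error in the question, ClassCheck.isLoadable("org.apache.avro.mapred.AvroWrapper") returning false on the driver would point to the avro-mapred jar not being on the classpath at all.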
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded Spark from http://spark.apache.org/downloads.html. After that everything worked fine. I'm not sure what the issue was with the CDH version, but I'm not going to spend any more time trying to figure it out.

How to read a file from classpath without external dependencies?

Is there a one-liner in Scala to read a file from classpath without using external dependencies, e.g. commons-io?
IOUtils.toString(getClass.getClassLoader.getResourceAsStream("file.xml"), "UTF-8")
val text = io.Source.fromInputStream(getClass.getResourceAsStream("file.xml")).mkString
If you want to ensure that the file is closed:
val source = io.Source.fromInputStream(getClass.getResourceAsStream("file.xml"))
val text = try source.mkString finally source.close()
If the file is in the resources folder (so it ends up at the root of the class path), load it through the ClassLoader, which resolves names relative to the root of the class path.
This is the line to use if you want the content (in Scala 2.11):
val content: String = scala.io.Source.fromInputStream(getClass.getClassLoader.getResourceAsStream("file.xml")).mkString
In other versions of Scala, the Source class might live in a different package.
If you only want to get the Resource:
val resource = getClass.getClassLoader.getResource("file.xml")
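The one-liners above throw a NullPointerException when the resource is missing, because getResourceAsStream returns null in that case. Here is a minimal sketch that wraps the lookup in an Option and closes the stream; ClasspathText and read are hypothetical names, and UTF-8 is an assumed encoding.

```scala
import scala.io.{Codec, Source}

object ClasspathText {
  // Reads a classpath resource as UTF-8 text; None if the resource does not exist.
  // The name is resolved from the root of the class path (no leading slash).
  def read(name: String,
           loader: ClassLoader = getClass.getClassLoader): Option[String] =
    Option(loader.getResourceAsStream(name)).map { stream =>
      val source = Source.fromInputStream(stream)(Codec.UTF8)
      try source.mkString finally source.close() // closing the Source closes the stream
    }
}
```

Usage would look like ClasspathText.read("file.xml").getOrElse(sys.error("file.xml not found on the class path")), which trades the one-liner for an explicit failure message.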
Just an update, with Scala 2.13 it is possible to do something like this:
import scala.io.Source
import scala.util.Using
Using.resource(getClass.getResourceAsStream("file.xml")) { stream =>
Source.fromInputStream(stream).mkString
}
Hope it might help someone.
In "Read entire file in Scala?", #daniel-spiewak proposed a slightly different approach, which I personally like better than #dacwe's response.
// scala is imported implicitly
import io.Source._
val content = fromInputStream(getClass.getResourceAsStream("file.xml")).mkString
However, I wonder whether it still counts as a one-liner.

Resources