az synapse spark job submit - azure

According to the documentation, using az synapse spark job submit, I can pass arguments using --arguments. So far so good.
However, I cannot figure out how to actually access those arguments in my code. Here's my current effort:
val conf = new SparkConf().setAppName("foo")
val sc = new SparkContext(conf)
val spark = SparkSession.builder.appName("foo").getOrCreate()
val start_time = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm").format(LocalDateTime.now)
val appID = sc.getConf.getAppId
//let's get some arguments
val inputArgs = spark.sqlContext.getConf("spark.driver.args").split("\\s+")
//val inputArgs = sc.getConf.get("spark.driver.args").split("\\s+")
Either of those lines throws the following exception:
22/03/25 19:07:45 ERROR ApplicationMaster: User class threw exception: java.util.NoSuchElementException: spark.driver.args
java.util.NoSuchElementException: spark.driver.args
So, how do I read the arguments in the Scala code?

OK, I was overcomplicating this. The values passed with --arguments simply arrive as the regular arguments of main:
def main(args: Array[String]) {
  ...
  val foo = args(0)
}
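For reference, here is a minimal self-contained sketch of how this fits together. The object name, argument meanings, and log line are made up for illustration; the point is simply that whatever you pass via --arguments arrives, in order, in args:
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession

object Foo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("foo").getOrCreate()

    // args(0), args(1), ... correspond to the values given to --arguments
    val inputPath = args(0)                                   // hypothetical first argument
    val runLabel  = if (args.length > 1) args(1) else "none"  // hypothetical second argument

    val startTime = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm").format(LocalDateTime.now)
    println(s"[$startTime] appId=${spark.sparkContext.applicationId} input=$inputPath label=$runLabel")

    spark.stop()
  }
}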

Related

how to create dataframe in UDF

I have a problem. I want to create a DataFrame inside a UDF and use my model to transform it into another one, but I get the exception below. Is there something wrong in my Spark conf? Can anyone help me solve this problem?
Code:
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = spark.sparkContext.broadcast(model)
def model_predict(id:Long, text:String):Double = {
val modelLoaded = modelBroad.value
val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
val dataDF = sparkss.createDataFrame(Seq((id,text))).toDF("id","text")
val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
println(f"The prediction of $id and $text is $result")
result
}
val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text")).show()
Exception:
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.LocalTableScanExec.metrics$lzycompute(LocalTableScanExec.scala:37)
at org.apache.spark.sql.execution.LocalTableScanExec.metrics(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.SparkPlan.resetMetrics(SparkPlan.scala:85)
at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3366)
at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3365)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
at com.zamplus.mine.SparkSubmit$.com$zamplus$mine$SparkSubmit$$model_predict$1(SparkSubmit.scala:21)
at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
... 22 more
There is an issue with your UDF. A UDF runs on multiple executor instances and needs every variable it references inside it, so you should pass all required global variables, such as modelBroad, in as parameters; otherwise you will get a NullPointerException.
There are a few more good practices you are not following in the UDF:
You do not need to create a SparkSession inside the UDF. Doing so creates multiple sessions, which causes issues; instead, pass the global SparkSession in as a variable if it is required.
Remove the unnecessary println from the UDF; it adds overhead on every call.
I have changed your code below for reference. It is just a prototype of an ideal UDF; please adapt it accordingly.
val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = sparkss.sparkContext.broadcast(model)
def model_predict(id:Long, text:String,spark:SparkSession,modelBroad:<datatype>):Double = {
val modelLoaded = modelBroad.value
val dataDF = spark.createDataFrame(Seq((id,text))).toDF("id","text")
val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
result
}
val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text",lit(sparkss),lit(modelBroad))).show()
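As an aside (this is not part of the answer above): since PipelineModel.transform already scores a whole DataFrame in one distributed pass, a commonly used alternative is to skip the per-row UDF entirely. A minimal sketch, assuming the pipeline expects the id and text columns from the question:
// Apply the pipeline to the full DataFrame instead of building a DataFrame per row inside a UDF.
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val scored = model.transform(test)
scored.select("id", "text", "prediction").show()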

To avoid manual files errors how to code dynamic datatype of a column check in spark/scala

We are getting a lot of manual files, and we need to validate a few datatypes before processing the data frame. Can someone please suggest how I can proceed with this requirement? Basically I need to write one generic/common Spark program that works for many files. If possible, please also send more detail to this email id: pathirammi1#gmail.com.
It sounds like your files have delimiter-separated records (like a CSV file). If so, you can read them as a text file, split each record on the delimiter, and process the fields:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RDDFromCSVFile {
  def main(args: Array[String]): Unit = {
    def splitString(row: String): Array[String] = row.split(",")

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.textFile("randomfile.csv")
    val rdd2: RDD[(String, String, String, String)] = rdd.map(row => {
      val strArray = splitString(row)
      val field1 = strArray(0)
      val field2 = strArray(1)
      val field3 = strArray(2)
      val field4 = strArray(3)
      // Do custom validation/conversion here and return the value that forms the new RDD
      (field1, field2, field3, field4)
    })
    rdd2.foreach(a => println(a.toString))
  }
}
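Since the question is specifically about checking column datatypes, here is a minimal sketch of the kind of custom code that could go inside that map. The field names, expected types, and reporting format are assumptions for illustration; it uses scala.util.Try to test whether a field parses as the expected type:
import scala.util.Try

// Hypothetical expectation: field 0 is a Long id, field 1 free text,
// field 2 a Double amount, field 3 a yyyy-MM-dd date.
def validateRow(fields: Array[String]): Either[String, Array[String]] = {
  val idOk     = Try(fields(0).toLong).isSuccess
  val amountOk = Try(fields(2).toDouble).isSuccess
  val dateOk   = Try(java.time.LocalDate.parse(fields(3))).isSuccess
  if (idOk && amountOk && dateOk) Right(fields)
  else Left(s"bad record: ${fields.mkString(",")}")
}

// Usage inside the pipeline above: keep the valid rows and collect the rest for reporting.
// val checked = rdd.map(row => validateRow(splitString(row)))
// val badRows = checked.collect { case Left(msg) => msg }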
If you have unstructured data, you can use the code below:
import org.apache.spark.sql.SparkSession

object RDDFromWholeTextFile {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.wholeTextFiles("alice.txt")
    rdd.foreach(a => println(a._1 + "---->" + a._2))
  }
}
Hope this helps !!
Thanks,
Naveen

How to retrieve Metrics like Output Size and Records Written from Spark UI?

How do I collect these metrics on the console (from the Spark shell or a spark-submit job) right after the task or job is done?
We are using Spark to load data from MySQL to Cassandra, and it is quite huge (~200 GB and 600M rows). When the task is done, we want to verify exactly how many rows Spark processed. We can get the number from the Spark UI, but how can we retrieve that number ("Output Records Written") from the Spark shell or in a spark-submit job?
Sample command to load from MySQL to Cassandra:
val pt = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://...:3306/...").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "payment_types").option("user", "hadoop").option("password", "...").load()
pt.save("org.apache.spark.sql.cassandra",SaveMode.Overwrite,options = Map( "table" -> "payment_types", "keyspace" -> "test"))
I want to retrieve all the Spark UI metrics for the above task, mainly Output Size and Records Written.
Please help.
Thanks for your time!
Found the answer. You can get the stats by using a SparkListener.
If your job has no input or output metrics you might get None.get exceptions, which you can safely avoid by guarding with an if statement.
// Driver-side counters that the listener updates as tasks finish.
var inputRecords = 0L
var outputWritten = 0L

sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    val metrics = taskEnd.taskMetrics
    if (metrics.inputMetrics != None) {
      inputRecords += metrics.inputMetrics.get.recordsRead
    }
    if (metrics.outputMetrics != None) {
      outputWritten += metrics.outputMetrics.get.recordsWritten
    }
  }
})
Please find the full example below.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
val conf = new SparkConf()
.set("spark.cassandra.connection.host", "...")
.set("spark.driver.allowMultipleContexts","true")
.set("spark.master","spark://....:7077")
.set("spark.driver.memory","1g")
.set("spark.executor.memory","10g")
.set("spark.shuffle.spill","true")
.set("spark.shuffle.memoryFraction","0.2")
.setAppName("CassandraTest")
sc.stop
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
var inputRecords = 0L
var outputWritten = 0L
sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    val metrics = taskEnd.taskMetrics
    if (metrics.inputMetrics != None) {
      inputRecords += metrics.inputMetrics.get.recordsRead
    }
    if (metrics.outputMetrics != None) {
      outputWritten += metrics.outputMetrics.get.recordsWritten
    }
  }
})
val bp = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://...:3306/...").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "bucks_payments").option("partitionColumn","id").option("lowerBound","1").option("upperBound","14596").option("numPartitions","10").option("fetchSize","100000").option("user", "hadoop").option("password", "...").load()
bp.save("org.apache.spark.sql.cassandra",SaveMode.Overwrite,options = Map( "table" -> "bucks_payments", "keyspace" -> "test"))
println("outputWritten",outputWritten)
Result:
scala> println("outputWritten",outputWritten)
(outputWritten,16383)
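Note that the listener above targets an older Spark release where inputMetrics and outputMetrics are Options. On Spark 2.x/3.x they are plain objects, so a sketch of the same idea (using the spark session object that newer spark-shells provide, and assuming you only need the record counts) looks like this:
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

var inputRecords = 0L
var outputWritten = 0L

spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Guard against tasks that finished without metrics (e.g. failed tasks).
    val m = taskEnd.taskMetrics
    if (m != null) {
      inputRecords  += m.inputMetrics.recordsRead
      outputWritten += m.outputMetrics.recordsWritten
    }
  }
})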

Error when creating a StreamingContext

I open the Spark shell:
spark-shell --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0
Then I want to create a streaming context:
import org.apache.spark._
import org.apache.spark.streaming._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(1))
I run into an exception:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
When you open spark-shell, a SparkContext has already been created for you; it is called sc. That means you do not need to create a conf object yourself. Simply use the existing sc:
val ssc = new StreamingContext(sc, Seconds(1))
Note that this is a val, not a var.
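For context, a minimal sketch of what typically follows in the shell once ssc exists, in the spirit of NetworkWordCount (the host, port, and word-count logic are just illustrative):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
// Hypothetical source: a netcat server started with nc -lk 9999
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()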

Read from Spark RDD a Kryo File

I'm a Spark & Scala newbie.
I need to read and analyze in Spark a file that was written by my Scala code using Kryo serialization:
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
val kryo:Kryo = new Kryo()
val output:Output = new Output(new FileOutputStream("filename.ext",true))
//kryo.writeObject(output, feed) (tested both lines)
kryo.writeClassAndObject(output, myScalaObject)
This is pseudo-code for creating a file containing my serialized object (myScalaObject), which is a complex object.
The file seems to be written correctly, but I have a problem when I read it into a Spark RDD.
Pseudo-code in Spark:
val conf = new SparkConf()
.setMaster("local")
.setAppName("My application")
.set("spark.executor.memory", "1g")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "myScalaObject")
val sc = new SparkContext(conf)
val file=sc.objectFile[myScalaObject]("filename.ext")
val counts = file.count()
When I try to execute it, I receive this error:
org.apache.spark.SparkException:
Job aborted: Task 0.0:0 failed 1 times (most recent failure:
Exception failure: java.io.IOException: file: filename.ext not a SequenceFile)
Is it possible to read this type of file in Spark?
If that is not possible, what is a good way to create a complex file structure that can be read in Spark?
Thank you.
If you want to read with objectFile, write out the data with saveAsObjectFile.
val myObjects: Seq[MyObject] = ...
val rddToSave = sc.parallelize(myObjects) // Or better yet: construct as RDD from the start.
rddToSave.saveAsObjectFile("/tmp/x")
val rddLoaded = sc.objectFile[MyObject]("/tmp/x")
Alternatively, as zsxwing says, you can create an RDD of the filenames and use map to read the contents of each. If you want each file to be read into a separate partition, parallelize the filenames into separate partitions:
import java.io.FileInputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import org.apache.spark.rdd.RDD

def loadFiles(filenames: Seq[String]): RDD[AnyRef] = {
  def load(filename: String): AnyRef = {
    // Kryo is not serializable, so create it inside the task rather than capturing it from the driver.
    val kryo = new Kryo()
    val input = new Input(new FileInputStream(filename))
    try kryo.readClassAndObject(input)
    finally input.close()
  }
  val partitions = filenames.length
  sc.parallelize(filenames, partitions).map(load)
}
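For example, if the files were written with kryo.writeClassAndObject as in the question, usage could look like this (the paths and the cast back to your class are hypothetical):
val loaded = loadFiles(Seq("/tmp/part-0.kryo", "/tmp/part-1.kryo"))
val typed = loaded.map(_.asInstanceOf[myScalaObject])  // cast back to your own class
println(typed.count())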
