Getting parameters of Spark submit while running a Spark job - apache-spark

I am running a Spark job with spark-submit and using its --files parameter to ship a log4j.properties file.
Inside my Spark job I need to read this parameter:
import org.apache.spark.{SparkConf, SparkContext}

object LoggerSparkUsage {

  def main(args: Array[String]): Unit = {
    //DriverHolder.log.info("unspark")
    println("args are....." + args.mkString(" "))
    val conf = new SparkConf().setAppName("Simple_Application") //.setMaster("local[4]")
    val sc = new SparkContext(conf)
    // conf.getExecutorEnv.
    val count = sc.parallelize(Array(1, 2, 3)).count()
    println("these are files" + conf.get("files"))
    LoggerDriver.log.info("log1 for info..")
    LoggerDriver.log.info("log2 for infor..")
    f2
  }

  def f2 { LoggerDriver.log.info("logs from another function..") }
}
My spark-submit command is something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the property using
conf.get("files")
but it gives me an exception.
Can anyone give me a solution for this?

The correct key for the files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get the actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If that is not sufficient, you can pass the paths as application arguments and retrieve them from main's args, or set a custom key in the Spark conf.
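Putting the two together inside the job above (a minimal sketch; the file name log4j.properties matches the --files argument from the question):
import scala.util.Try
import org.apache.spark.SparkFiles

// Comma-separated list of files that were distributed with --files, if any.
val distributedFiles = Try(sc.getConf.get("spark.files")).getOrElse("<no files>")
println("spark.files = " + distributedFiles)

// Resolve the local path where a distributed file was placed.
val log4jPath = SparkFiles.get("log4j.properties")
println("log4j.properties is available at " + log4jPath)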

Related

Retrieve the time it took for spark to finish the job

I need to time some things in Spark, e.g. how long it takes Spark to read my file, so I like to use sc.setLogLevel("INFO") to print extra information to the screen, and one thing I find really useful is a message like this:
2018-12-18 02:05:38 INFO DAGScheduler:54 - Job 2 finished: count at <console>:26, took 9.555080 s
because it tells me how long something took.
Is there any way to get this programmatically (preferably in Scala)? Right now I just copy this result and save it in a text file.
You can create something like:
import scala.concurrent.duration._

case class TimedResult[R](result: R, durationInNanoSeconds: FiniteDuration)

def time[R](block: => R): TimedResult[R] = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  val duration = t1 - t0
  TimedResult(result, duration.nanoseconds)
}
and then use it to wrap your code block:
val timedResult = time {
  someDataframe.count()
}
println(s"Count of records: ${timedResult.result}")
println(s"Time taken: ${timedResult.durationInNanoSeconds.toSeconds} s")
There are two ways to record logging for your Spark program.
a) You can redirect your console output to a file when using the spark-submit command:
spark-submit your_code_file > logfile.txt 2>&1
b) Two log4j.properties files can be created, one for the driver and one for the executors, and included in the spark-submit command by passing their paths in the Java options for the driver and executors:
spark-submit --class MAIN_CLASS --driver-java-options "-Dlog4j.configuration=file:LOG4J_PATH" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:LOG4J_PATH" --master MASTER_IP:PORT JAR_PATH
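For reference, a minimal log4j.properties that LOG4J_PATH could point to might look like this (a sketch; the output file location and log pattern are assumptions):
# Send everything at INFO and above to a single log file.
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark-app.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n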

Error initializing SparkContext when using SPARK-SHELL in spark standalone

I have installed Scala.
I have installed Java 8.
All environment variables have been set for Spark, Java and Hadoop.
I am still getting this error when running the spark-shell command. Can someone please help? I have googled a lot but didn't find anything.
(screenshots of the spark-shell errors were attached to the original question)
Spark’s shell provides a simple way to learn the API. Start the shell by running the following in the Spark directory:
./bin/spark-shell
Then run the Scala code snippet below:
import org.apache.spark.sql.SparkSession
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
If the error still persists, then we have to look into the environment setup.

Unable to use a local file using spark-submit

I am trying to execute a Spark word count program. My input file and output dir are local, not on HDFS. When I execute the code, I get an input directory not found exception.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object WordCount {
  val sparkConf = new SparkConf()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(sparkConf).master("yarn").getOrCreate()
    val input = args(0)
    val output = args(1)
    val text = spark.sparkContext.textFile("input", 1)
    val outPath = text.flatMap(line => line.split(" "))
    val words = outPath.map(w => (w, 1))
    val wc = words.reduceByKey((x, y) => (x + y))
    wc.saveAsTextFile("output")
  }
}
Spark Submit:
spark-submit --class com.practice.WordCount sparkwordcount_2.11-0.1.jar --files home/hmusr/ReconTest/inputdir/sample /home/hmusr/ReconTest/inputdir/wordout
I am using the --files option to fetch the local input file and point the output to the output dir in spark-submit. When I submit the jar using spark-submit, it says the input path does not exist:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://dev/user/hmusr/input
Could anyone let me know what mistake I am making here?
A couple of things:
val text = spark.sparkContext.textFile(input,1)
To use the variable, remove the double quotes: it should be input, not "input".
You expect input and output as arguments, so pass them after the jar in spark-submit (without --files), and use local as the master.
Also, use the file:// prefix for local files.
Your spark-submit should look something like:
spark-submit --master local[2] \
--class com.practice.WordCount \
sparkwordcount_2.11-0.1.jar \
file:///home/hmusr/ReconTest/inputdir/sample \
file:///home/hmusr/ReconTest/inputdir/wordout
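Putting those fixes together, the job itself would read the paths from args (a sketch based on the corrections above; the package declaration from the question is omitted):
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run locally; input and output paths come from the spark-submit arguments.
    val spark = SparkSession.builder().master("local[2]").getOrCreate()
    val input = args(0)   // e.g. file:///home/hmusr/ReconTest/inputdir/sample
    val output = args(1)  // e.g. file:///home/hmusr/ReconTest/inputdir/wordout

    val text = spark.sparkContext.textFile(input, 1)
    val wc = text.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    wc.saveAsTextFile(output)

    spark.stop()
  }
}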

spark-submit is not exiting until I hit ctrl+C

I am running this spark-submit command to run a Spark Scala program successfully on a Hortonworks VM, but once the job is completed it does not exit from spark-submit until I hit Ctrl+C. Why?
spark-submit --class SimpleApp --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 12m --executor-cores 1 target/scala-2.10/application_2.10-1.0.jar /user/root/decks/largedeck.txt
Here is the code, I am running.
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val cards = sc.textFile(args(0)).flatMap(_.split(" "))
    val cardCount = cards.count()
    println(cardCount)
  }
}
You have to call stop() on context to exit your program cleanly.
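For example, in the SimpleApp code above, only the end of main changes:
    val cardCount = cards.count()
    println(cardCount)
    // Shut down the SparkContext so spark-submit can exit cleanly.
    sc.stop()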
I had the same kind of problem when writing files to S3. I use Spark 2.0; if it still doesn't work for you even after adding stop(), try the settings below.
In Spark 2.0 you can use:
val spark = SparkSession.builder().master("local[*]").appName("App_name").getOrCreate()
spark.conf.set("spark.hadoop.mapred.output.committer.class","com.appsflyer.spark.DirectOutputCommitter")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Parquet file in Spark SQL

I am trying to use Spark SQL with the Parquet file format. When I try the basic example:
import org.apache.spark.{SparkConf, SparkContext}

object parquet {
  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD

    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
I get a null pointer exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.parquet$.main(parquet.scala:16)
which points to the saveAsParquetFile line. What's the issue here?
This error occurred when I was using Spark in Eclipse on Windows. I tried the same in spark-shell and it works fine. I guess Spark might not be 100% compatible with Windows.
Spark is compatible with Windows. You can run your program in a spark-shell session on Windows, or you can run it using spark-submit with the necessary arguments such as --master (again, on Windows or any other OS).
You cannot just run your Spark program as an ordinary Java program in Eclipse without properly setting up the Spark environment and so on. Your problem has nothing to do with Windows.
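For example, after packaging the program into a jar, it could be launched like this (a sketch; the jar name is an assumption, the class name comes from the snippet above):
spark-submit --class parquet --master local[2] target/scala-2.10/your-project_2.10-1.0.jar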
