I am using Hadoop 1.2.1 to develop MapReduce programs. Many of the examples I see for Hadoop 1.2.1 (and in general) use a JobConf object in the driver class, but I also need the waitForCompletion() method, which belongs to the Job object.
So should I use both objects, JobConf and Job, in one driver class? Or just a Job? Or something else?
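A minimal driver sketch using only the new-API Job, assuming the new org.apache.hadoop.mapreduce API available in Hadoop 1.2.1; MyMapper and MyReducer are hypothetical placeholder classes, and everything is configured through Job plus a plain Configuration, with no JobConf:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");   // new-API Job; no JobConf needed
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(MyMapper.class);      // placeholder mapper class
        job.setReducerClass(MyReducer.class);    // placeholder reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() is on Job, so nothing here requires JobConf
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```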
I have noticed that in my project there are 2 ways of running spark jobs.
The first way is submitting the job with the spark-submit script:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100
The second way is to package the Java file into a jar and run it via Hadoop, with the Spark code inside MainClassName:
hadoop jar JarFile.jar MainClassName
What is the difference between these two ways?
Which prerequisites do I need in order to use either?
As you stated for the second way of running a Spark job, packaging a Java file with Spark classes and/or syntax essentially wraps your Spark job inside a Hadoop job. This has its disadvantages: mainly that your job becomes directly dependent on the Java and Scala versions installed on your system/cluster, plus some growing pains around compatibility between the two frameworks' versions. So in that case, the developer must be careful about the setup the job will run on, since it spans two different platforms. It may still seem a bit simpler for Hadoop users, who have a better grasp of Java and the Map/Reduce/Driver layout than of Spark's more opinionated design and Scala's somewhat steep learning curve.
The first way, submitting with spark-submit, is the most "standard" one (at least judging by the majority of usage seen online, so take this with a grain of salt). It runs the job almost entirely within Spark (except when your job writes its output to, or reads its input from, HDFS, of course). Used this way, you are only dependent on Spark itself, keeping the quirks of Hadoop (i.e. its YARN resource management) away from your job. It can also be significantly faster in execution time, since it is the most direct approach.
I have Spark reading from a JDBC source (Oracle). I specify lowerBound, upperBound, numPartitions, and partitionColumn, but looking at the web UI all the reading happens on the driver, not on the workers/executors. Is that expected?
In Spark, in general, whatever code you write within a transformation such as map or flatMap will be executed on the executors. To invoke a transformation you need an RDD, which is created from the dataset you are trying to compute on. To materialize the RDD you need to invoke an action, so that the transformations are applied to the data.
I believe in your case you have written a Spark application that reads JDBC data outside of any transformation. If that is the case, it will all be executed on the driver and not on the executors.
If you have not already, try creating a DataFrame using the JDBC read API.
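A minimal sketch of a partitioned JDBC read in Java; the URL, table, and column names are placeholders. With partitionColumn, lowerBound, upperBound, and numPartitions all set, Spark issues one range query per partition, so the actual reading happens on the executors:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-partitioned-read")
                .getOrCreate();

        // Each of the 8 partitions reads its own slice of ID_COL,
        // so the work is spread across executors instead of the driver.
        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE") // placeholder
                .option("dbtable", "MY_SCHEMA.MY_TABLE")                  // placeholder
                .option("partitionColumn", "ID_COL")   // placeholder numeric column
                .option("lowerBound", "1")
                .option("upperBound", "1000000")
                .option("numPartitions", "8")
                .load();

        df.count();  // action: triggers the distributed read
        spark.stop();
    }
}
```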
I deploy a Spark-on-YARN driver Java application. It submits Spark jobs (mainly doing offline statistics over Hive, Elasticsearch, and HBase) to the cluster whenever the task-scheduling system sends it a task call, so I keep this driver app running, always waiting for requests.
I use a thread pool to handle task calls. Every task opens a new SparkSession and closes it when the job finishes (we skip the scenario of multiple tasks arriving at the same time to simplify this question). The Java code looks like this:
SparkSession sparkSession = SparkSession.builder()
        .config(new SparkConf().setAppName(appName))
        .enableHiveSupport()
        .getOrCreate();
// ......doing statistics......
sparkSession.close();
The app is compiled and run under JDK 8, and memory is configured as follows:
spark.ui.enabled=false
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.driver.memory=2G
--driver-java-options "-XX:MaxDirectMemorySize=2048M -XX:+UseG1GC"
At first glance I thought this driver app would consume at most 4 GB of memory, but as it keeps running, top shows its resident size growing and growing.
I dumped its heap and saw many Spark-related instances left in ThreadLocals after the SparkSession was closed, such as the Hive metastore client and the SparkSession itself. After some study, I found that Spark uses a lot of ThreadLocals and doesn't remove them (or I just haven't found the right way to close a SparkSession). I added this code to clear the ThreadLocals that Spark leaves behind:
import org.apache.hadoop.hive.ql.metadata.Hive;
import org.apache.hadoop.hive.ql.session.SessionState;
......
SparkSession.clearDefaultSession();
sparkSession.close();
Hive.closeCurrent();
SessionState.detachSession();
SparkSession.clearActiveSession();
This seems to work for now, but I don't think it's decent enough, and I'm wondering whether there is a better way, e.g. a single Spark Java API that does all the cleanup. I just can't find a clue in the Spark documentation.
I would like to know whether there is a way to chain jobs in Spark, so that the output RDD (or other format) of the first job is passed as input to the second job.
Is there any API for it in Apache Spark? Is this even an idiomatic approach?
From what I found, there is a way to spin up another process through the YARN client, for example Spark - Call Spark jar from java with arguments, but this assumes that you save the data to some intermediate storage between jobs.
There are also runJob and submitJob on SparkContext, but are they a good fit for this?
Use the same RDD definition to define the input/output of your jobs.
You should then be able to chain them.
The other option is to use DataFrames instead of RDD and figure out the schema at run-time.
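A minimal sketch of the DataFrame variant within one application, assuming a hypothetical CSV input path and placeholder column names: because transformations are lazy, the result of the first "job" can be handed directly to the second, with an optional cache() in between and no intermediate storage:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ChainedJobs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("chained-jobs")
                .getOrCreate();

        // "Job" 1: read and transform; the schema is figured out at run-time.
        Dataset<Row> first = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/path/to/input.csv")   // placeholder input path
                .filter("value > 0");        // placeholder transformation

        first.cache();  // keep the first result in memory for reuse

        // "Job" 2: consumes the first result directly, no intermediate storage.
        Dataset<Row> second = first.groupBy("key").count();  // placeholder column
        second.show();

        spark.stop();
    }
}
```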
I was trying to profile some Spark jobs and I want to collect Java Flight Recorder (JFR) files from each executor. I am running my job on a YARN cluster with several nodes, so I cannot manually collect the JFR file for each run. I want to write a script that can collect the JFR file from each node in the cluster for a given job.
MR provides a way to name the JFR files generated by each task with the task ID: it replaces '#task#' with the TaskId in the Java opts. With this I can get a unique name for the JFR files created by each task, and since the TaskId also contains the JobId, I can parse it to distinguish files generated by different MR jobs.
I am wondering if Spark has something similar. Does Spark provide a way to determine the executor ID in the Java opts? Has anyone else tried to do something similar and found a better way to collect all JFR files for a Spark job?
You can't set an executor ID in the opts, but you can get the executor ID from the event log, as well as the slave node hosting it.
However, I believe the options you give to spark-submit for a YARN master and a standalone one have the same effect on the executors' JVMs, so you should be fine!
You can use the {{EXECUTOR_ID}} and {{APP_ID}} placeholders in the spark.executor.extraJavaOptions parameter. They will be replaced by Spark with the executor's ID and the application's ID, respectively.
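A sketch of a spark-submit invocation that uses these placeholders to give each executor's JFR file a unique name; the class, jar, and output directory are placeholders, and the JFR flags assume a JDK build where Flight Recorder is a commercial feature (on newer JDKs -XX:+UnlockCommercialFeatures is not needed):

```shell
./bin/spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures \
-XX:+FlightRecorder \
-XX:StartFlightRecording=duration=120s,filename=/tmp/jfr/{{APP_ID}}-{{EXECUTOR_ID}}.jfr" \
  /path/to/my-job.jar
```

Each executor then writes a file like appId-executorId.jfr, which a collection script can fetch from every node and attribute to the right job.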