I'm attempting to set driver log levels when launching jobs in Dataproc (https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#LoggingConfig). Launching is done via a Java program using the dataproc SDK.
LoggingConfig loggingConfig = new LoggingConfig();
loggingConfig.put("driverLogLevels", Collections.singletonMap("root", "ERROR"));
com.google.api.services.dataproc.model.SparkJob sparkJob =
    new com.google.api.services.dataproc.model.SparkJob()
        .setMainClass(mainClass)
        .setJarFileUris(jarFileUris)
        .setArgs(args)
        .setProperties(properties)
        .setLoggingConfig(loggingConfig);
Job job = new Job().setPlacement(new JobPlacement().setClusterName(clusterName)).setSparkJob(sparkJob);
// omitted irrelevant code
Dataproc dp = new Dataproc.Builder(httpTransport, jsonFactory, credential).setApplicationName(jobName).build();
SubmitJobRequest request = new SubmitJobRequest().setJob(job);
return dp.projects().regions().jobs().submit(googleProject, "global", request).execute();
The job launches successfully, but the log4j configuration is not applied:
log4j:ERROR Could not read configuration file from URL [file:/tmp/[guid]/driver_log4j.properties].
java.io.FileNotFoundException: /tmp/[guid]/driver_log4j.properties (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:117)
at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:102)
at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogIfNecessary(ApplicationMaster.scala:736)
at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
at org.apache.spark.deploy.yarn.ApplicationMaster$.log(ApplicationMaster.scala:736)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:751)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
log4j:ERROR Ignoring configuration file [file:/tmp/[guid]/driver_log4j.properties].
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
where [guid] is a GUID that differs for every job. Logging therefore falls back to the (verbose) default configuration.
How can I successfully set this configuration? What is the most elegant and robust way to adjust Spark log levels on Dataproc? Some other workaround would be a fallback, but I'd rather use a method that's not liable to change out from under me.
The official way to set the driver log level is the method described in your link; see the Dataproc docs.
I believe the way to invoke this from the Java SDK is within the setArgs(...) term of your builder. So in your case you would want to add:
args.add("--driver-log-levels");
args.add("root=ERROR");
like so:
args.add("--driver-log-levels");
args.add("root=ERROR");
com.google.api.services.dataproc.model.SparkJob sparkJob =
    new com.google.api.services.dataproc.model.SparkJob()
        .setMainClass(mainClass)
        .setJarFileUris(jarFileUris)
        .setArgs(args)
        .setProperties(properties)
        .setLoggingConfig(loggingConfig);
Job job = new Job().setPlacement(new JobPlacement().setClusterName(clusterName)).setSparkJob(sparkJob);
// omitted irrelevant code
Dataproc dp = new Dataproc.Builder(httpTransport, jsonFactory, credential).setApplicationName(jobName).build();
SubmitJobRequest request = new SubmitJobRequest().setJob(job);
return dp.projects().regions().jobs().submit(googleProject, "global", request).execute();
I'm not sure what you mean when you call this a feature that's liable to change out from under you. This should be a stable feature.
During a shuffle, the mappers dump their outputs to the local disk, from where they get picked up by the reducers. Where exactly on the disk are those files dumped? I am running a pyspark cluster on YARN.
What I have tried so far:
I think the possible locations where the intermediate files could be are (in decreasing order of likelihood):
hadoop/spark/tmp. As per the documentation, the local dirs come from the LOCAL_DIRS env variable that gets defined by YARN.
However, after starting the cluster (I am passing --master yarn) I couldn't find any LOCAL_DIRS env variable using os.environ, but I can see SPARK_LOCAL_DIRS, which according to the documentation should only happen for Mesos or standalone mode (any idea why that might be the case?). In any case, my SPARK_LOCAL_DIRS is hadoop/spark/tmp.
/tmp. The default value of spark.local.dir.
/home/username. I have tried passing a custom value to spark.local.dir when starting pyspark, using --conf spark.local.dir=/home/username.
hadoop/yarn/nm-local-dir. This is the value of yarn.nodemanager.local-dirs property in yarn-site.xml
I am running the following code and checking for any intermediate files being created at the above 4 locations by navigating to each location on a worker node.
The code I am running:
from pyspark import storagelevel
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
df_merged = df_sales.join(df_products,df_sales.product_id==df_products.product_id,'inner')
df_merged.persist(storagelevel.StorageLevel.DISK_ONLY)
df_merged.count()
No intermediate files are being created at any of the 4 locations I have listed above.
As suggested in one of the answers, I have tried getting the directory info in the terminal the following way:
At the end of the log4j.properties file located at $SPARK_HOME/conf/, add log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO
This did not help; with logging set to INFO, my terminal output (screenshot not reproduced here) showed nothing useful.
Where are the spark intermediate files (output of mappers, persist etc) stored?
Without getting into the weeds of Spark source, perhaps you can quickly check it live. Something like this:
>>> irdd = spark.sparkContext.range(0,100,1,10)
>>> def wherearemydirs(p):
... import os
... return os.getenv('LOCAL_DIRS')
...
>>>
>>> irdd.map(wherearemydirs).collect()
>>>
...will show local dirs in terminal
/data/1/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/10/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/11/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,...
But yes, it will basically point to the parent dir (created by YARN) of UUID-randomized subdirs created by DiskBlockManager, as #KoedIt mentioned:
:
23/01/05 10:15:37 INFO storage.DiskBlockManager: Created local directory at /data/1/yarn/nm/usercache/<your-user-id>/appcache/application_xxxxxxxxx_xxxxxxx/blockmgr-d4df4512-d18b-4dcf-8197-4dfe781b526a
:
This is going to depend on what your cluster setup is and your Spark version, but you're more or less looking at the correct places.
For this explanation, I'll be talking about Spark v3.3.1, which is the latest version as of the time of this post.
There is an interesting method in org.apache.spark.util.Utils called getConfiguredLocalDirs and it looks like this:
/**
 * Return the configured local directories where Spark can write files. This
 * method does not create any directories on its own, it only encapsulates the
 * logic of locating the local directories according to deployment mode.
 */
def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    randomizeInPlace(getYarnLocalDirs(conf).split(","))
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_SANDBOX") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_SANDBOX"))
  } else {
    if (conf.getenv("MESOS_SANDBOX") != null && shuffleServiceEnabled) {
      logInfo("MESOS_SANDBOX available but not using provided Mesos sandbox because " +
        s"${config.SHUFFLE_SERVICE_ENABLED.key} is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}
This is interesting, because it makes us understand the order of precedence each config setting has. The order is:
if running in Yarn, getYarnLocalDirs should give you your local dir, which depends on the LOCAL_DIRS environment variable
if SPARK_EXECUTOR_DIRS is set, it's going to be one of those
if SPARK_LOCAL_DIRS is set, it's going to be one of those
if MESOS_SANDBOX and !shuffleServiceEnabled, it's going to be MESOS_SANDBOX
if spark.local.dir is set, it's going to be that
ELSE (catch-all) it's going to be java.io.tmpdir
IMPORTANT: If you're using Kubernetes, all of this is disregarded and a separate Kubernetes-specific code path is used instead.
Now, how do we find this directory?
Luckily, there is a nicely placed logging line in DiskBlockManager.createLocalDirs which prints out this directory if your logging level is INFO.
So, set your default logging level to INFO in log4j.properties (like so), restart your spark application and you should be getting a line saying something like
Created local directory at YOUR-DIR-HERE
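Alternatively, you can check from the driver which of the inputs from the precedence list above are set in your environment. Here is a minimal PySpark sketch (assuming an active SparkSession named spark; note that _jvm is an internal handle, and that on YARN the LOCAL_DIRS variable is set inside the YARN containers, so a driver running in client mode may legitimately not see it, which matches the observation in the question):
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Environment variables consulted by getConfiguredLocalDirs, in the order shown above.
for var in ("LOCAL_DIRS", "SPARK_EXECUTOR_DIRS", "SPARK_LOCAL_DIRS", "MESOS_SANDBOX"):
    print(var, "=", os.environ.get(var))

# Configuration and JVM fallbacks.
print("spark.local.dir =", spark.conf.get("spark.local.dir", None))
print("java.io.tmpdir  =", spark.sparkContext._jvm.System.getProperty("java.io.tmpdir"))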
I am new to Airflow and the SparkSubmitOperator. I can see that Spark applications are submitted to the 'root.default' queue out of the box when targeting YARN.
Simple question: how does one set a custom queue name?
wordcount = SparkSubmitOperator(
    application='/path/to/wordcount.py',
    task_id="wordcount",
    conn_id="spark_default",
    dag=dag
)
p.s. I have read the docs:
https://airflow.apache.org/docs/stable/_modules/airflow/contrib/operators/spark_submit_operator.html
Thanks
I can see now that the --queue value is coming from the Airflow spark_default connection:
Conn Id = spark_default
Host = yarn
Extra = {"queue": "root.default"}
Go to the Admin menu > Connections, select spark_default and edit it: change Extra from {"queue": "root.default"} to {"queue": "default"} in the Airflow web server UI.
This of course means an Airflow connection is required for each queue.
To be clear, there are at least two ways to do this:
Via the Spark connection, as Phillip answered.
Via a --conf parameter, which Dustan mentions in a comment.
From my testing, if there's a queue set in the Connection's Extra field, that is used regardless of what you pass into the SparkSubmit conf.
However, if you remove queue from Extra in the Connection, and send it in the SparkSubmitOperator conf arg like below, YARN will show it properly.
conf={
    "spark.yarn.queue": "team_the_best_queue",
    "spark.submit.deployMode": "cluster",
    "spark.whatever.configs.you.have": "more_config",
}
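For completeness, here is roughly what that looks like wired into the operator from the question. This is a sketch, not a tested DAG: the import path follows the contrib docs linked above, the queue name is just the example used here, and dag is assumed to be the DAG object from the question.
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

wordcount = SparkSubmitOperator(
    task_id="wordcount",
    application="/path/to/wordcount.py",
    conn_id="spark_default",  # make sure this connection's Extra no longer sets "queue"
    conf={
        "spark.yarn.queue": "team_the_best_queue",
        "spark.submit.deployMode": "cluster",
    },
    dag=dag,
)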
In our current system there is a Java program that reads one file and generates many JSON documents for the full day (24h); all JSON docs are written to Cosmos DB. When I execute it in the console everything is OK. I have tried to schedule a Databricks job using the uber-jar file, and it failed with the following error:
"Resource with specified id or name already exists."
That makes sense to me: the default settings of the existing cluster contain many executors, so each executor tries to write the same set of JSON docs to Cosmos DB.
So I changed the main method as below:
public static void main(String[] args) {
    SparkConf conf01 = new SparkConf().set("spark.executor.instances", "1").set("spark.executor.cores", "1");
    SparkSession spark = SparkSession.builder().config(conf01).getOrCreate();
    ...
}
But I received the same error, "Resource with specified id or name already exists", from Cosmos DB.
I would like to have only one executor for this specific Java job. How can I make Spark use only one executor?
Any help (link/doc/url/code) will be appreciated.
Thank you !
I'm trying to log the properties of each Spark application that runs in one YARN cluster (properties like spark.shuffle.compress, spark.reducer.maxMbInFlight, spark.executor.instances, and so on).
However, I don't know if this information is logged anywhere. I know that we can access the YARN logs through the "yarn" command, but the properties I'm talking about are not stored there.
Is there any way to access this kind of info? The idea is to have a trace of all the applications that run in the cluster together with their properties, to identify which ones have the most impact on execution time.
You could log it yourself... use sc.getConf.toDebugString, sqlContext.getConf("") or sqlContext.getAllConfs.
scala> sqlContext.getConf("spark.sql.shuffle.partitions")
res129: String = 200
scala> sqlContext.getAllConfs
res130: scala.collection.immutable.Map[String,String] = Map(hive.server2.thrift.http.cookie.is.httponly -> true, dfs.namenode.resource.check.interval ....
scala> sc.getConf.toDebugString
res132: String =
spark.app.id=local-1449607289874
spark.app.name=Spark shell
spark.driver.host=10.5.10.153
Edit: However, I could not find the properties you specified among the 1200+ properties in sqlContext.getAllConfs :( Otherwise the documentation says:
The application web UI at http://<driver>:4040 lists Spark properties
in the “Environment” tab. This is a useful place to check to make sure
that your properties have been set correctly. Note that only values
explicitly specified through spark-defaults.conf, SparkConf, or the
command line will appear. For all other configuration properties, you
can assume the default value is used.
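If you are working from PySpark rather than the Scala shell, roughly equivalent calls are available. A sketch, assuming an active SparkSession named spark:
print(spark.sparkContext.getConf().toDebugString())    # explicitly-set Spark properties
print(spark.conf.get("spark.sql.shuffle.partitions"))  # a single SQL property
spark.sql("SET -v").show(truncate=False)               # SQL properties with defaults and descriptions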
I'm trying to get the path to spark.worker.dir for the current SparkContext.
If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
Spark 2.1+
spark.sparkContext.getConf().getAll(), where spark is your SparkSession (this gives you a list of (key, value) pairs with all configured settings)
Yes: sc.getConf().getAll()
Which uses the method:
SparkConf.getAll()
as accessed by
SparkContext.getConf()
See it in action:
In [4]: sc.getConf().getAll()
Out[4]:
[(u'spark.master', u'local'),
(u'spark.rdd.compress', u'True'),
(u'spark.serializer.objectStreamReset', u'100'),
(u'spark.app.name', u'PySparkShell')]
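Since getAll() returns a list of (key, value) pairs, it is easy to turn into a dict for lookups. A small sketch, reusing the sc from the snippet above:
conf_dict = dict(sc.getConf().getAll())   # list of pairs -> dict
print(conf_dict.get("spark.app.name"))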
update configuration in Spark 2.3.1
To change the default spark configurations you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
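Put together, the steps above look roughly like this. This is a sketch; the values are just examples, and _conf is an internal attribute, as noted above:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext._conf.getAll())      # current (explicitly set) configuration

conf = spark.sparkContext._conf.setAll([      # update the underlying SparkConf
    ("spark.executor.memory", "4g"),
    ("spark.app.name", "Spark Updated Conf"),
    ("spark.executor.cores", "4"),
])

spark.sparkContext.stop()                     # stop the current context/session
spark = SparkSession.builder.config(conf=conf).getOrCreate()  # recreate with the updated conf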
Spark 1.6+
sc.getConf.getAll.foreach(println)
For a complete overview of your Spark environment and configuration I found the following code snippets useful:
SparkContext:
for item in sorted(sc._conf.getAll()): print(item)
Hadoop Configuration:
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
prop = iterator.next()
hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)
Environment variables:
import os
for item in sorted(os.environ.items()): print(item)
Simply running
sc.getConf().getAll()
should give you a list with all settings.
Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:
The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.
(These three methods all return the same data on my cluster.)
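A short sketch of the second and third options above, assuming an active SparkSession named spark and that pandas is installed for toPandas():
explicit_conf = spark.sparkContext._conf.getAll()    # option 2: explicitly-set properties as (key, value) pairs
sql_props = spark.sql("SET").toPandas()              # option 3: property values as a pandas DataFrame
sql_props_verbose = spark.sql("SET -v").toPandas()   # adds default values and a description column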
For Spark 2+ you can also use the following when using Scala:
spark.conf.getAll // spark is your SparkSession
You can use:
sc.sparkContext.getConf.getAll
For example, I often have the following at the top of my Spark programs:
logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))
Just for the record, the analogous Java version:
Tuple2<String, String> sc[] = sparkConf.getAll();
for (int i = 0; i < sc.length; i++) {
    System.out.println(sc[i]);
}
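And a rough PySpark equivalent of the same idea, as a sketch that assumes Python's standard logging module is configured and spark is an active SparkSession:
import logging

logger = logging.getLogger(__name__)
# Log every explicitly-set Spark property, one per line.
logger.info("\n".join("%s=%s" % kv for kv in spark.sparkContext.getConf().getAll()))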
Suppose I want to increase the driver memory at runtime using SparkSession:
s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()
Now I want to view the updated settings:
s2.conf.get("spark.driver.memory")
To get all the settings, you can make use of spark.sparkContext._conf.getAll()
Hope this helps
Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:
from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())
If you want to see the configuration in Databricks, use the command below:
spark.sparkContext._conf.getAll()
I would suggest you try the method below in order to get the current Spark context settings:
SparkConf.getAll()
as accessed through the SparkContext's _conf attribute.
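Concretely, that looks like this in PySpark (a sketch; spark is an active SparkSession and _conf is an internal attribute):
for key, value in spark.sparkContext._conf.getAll():
    print(key, "=", value)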
Get the default configurations specifically for Spark 2.1+
spark.sparkContext.getConf().getAll()
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()