In Databricks, I understand that a notebook can be executed from another notebook, but it will run on the current cluster by default.
For example: I have notebook1 running on cluster1 and I am running notebook2 from notebook1 using the command below
dbutils.notebook.run("notebook2", 3600)
but this will run on cluster1. How can I make it run on cluster2?
After digging through dbutils.py, I found a hidden argument to dbutils.notebook.run() called _NotebookHandler__databricks_internal_cluster_spec that accepts a cluster configuration JSON.
If you want to run "notebook2" on a cluster you've already created, you'll simply pass the JSON for that cluster. If you want Databricks to create a new cluster for you, just define the cluster's resources under the key "new_cluster". For example:
cluster_config = '''
{
    "new_cluster": {
        "spark_version": "9.1.x-cpu-ml-scala2.12",
        "spark_conf": {
            "spark.databricks.delta.preview.enabled": "true"
        },
        ...
        "enable_elastic_disk": true,
        "num_workers": 4
    }
}
'''
dbutils.notebook.run('notebook2', 36000, _NotebookHandler__databricks_internal_cluster_spec=cluster_config)
I am only able to test this on Azure Databricks, unfortunately.
Just for the record: if you want to run the notebook job on a specific existing cluster, all you need to do is find the cluster ID of your desired cluster and run it as follows:
cluster_config = '''
{
    "existing_cluster_id": "****-032***82-bs34ww4f"
}
'''
dbutils.notebook.run('test_run_notebook', 36000, _NotebookHandler__databricks_internal_cluster_spec=cluster_config)
You can find the cluster ID in the sidebar -> Compute -> select your cluster -> expand "Automatically added tags".
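If it helps, the attached cluster's ID can also be read programmatically from inside a notebook. This is a small sketch relying on the spark.databricks.clusterUsageTags.clusterId Spark conf that Databricks sets on its clusters (not something covered in the answer above):

# Inside a Databricks notebook: the ID of the cluster the notebook is attached to.
current_cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
print(current_cluster_id)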
During a shuffle, the mappers dump their outputs to the local disk, from where they get picked up by the reducers. Where exactly on disk are those files dumped? I am running a PySpark cluster on YARN.
What I have tried so far:
I think the possible locations where the intermediate files could be are (in decreasing order of likelihood):
hadoop/spark/tmp. As per the documentation, this is the LOCAL_DIRS env variable that gets defined by YARN.
However, after starting the cluster (I am passing --master yarn), I couldn't find any LOCAL_DIRS env variable using os.environ, but I can see SPARK_LOCAL_DIRS, which according to the documentation should only be set in the case of Mesos or standalone (any idea why that might be the case?). Anyhow, my SPARK_LOCAL_DIRS is hadoop/spark/tmp.
/tmp. The default value of spark.local.dir.
/home/username. I have tried passing a custom value to spark.local.dir when starting PySpark using --conf spark.local.dir=/home/username.
hadoop/yarn/nm-local-dir. This is the value of the yarn.nodemanager.local-dirs property in yarn-site.xml.
I am running the following code and checking for any intermediate files being created at the above 4 locations by navigating to each location on a worker node.
The code I am running:
from pyspark import storagelevel
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
df_merged = df_sales.join(df_products,df_sales.product_id==df_products.product_id,'inner')
df_merged.persist(storagelevel.StorageLevel.DISK_ONLY)
df_merged.count()
No files are being created at any of the 4 locations that I have listed above.
As suggested in one of the answers, I have tried getting the directory info in the terminal the following way:
At the end of the log4j.properties file located at $SPARK_HOME/conf/, add log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO
This did not help, even with logging set to INFO in my terminal.
Where are the Spark intermediate files (output of mappers, persist, etc.) stored?
Without getting into the weeds of Spark source, perhaps you can quickly check it live. Something like this:
>>> irdd = spark.sparkContext.range(0,100,1,10)
>>> def wherearemydirs(p):
... import os
... return os.getenv('LOCAL_DIRS')
...
>>>
>>> irdd.map(wherearemydirs).collect()
>>>
...will show local dirs in terminal
/data/1/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/10/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/11/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,...
But yes, it will basically point to the parent dir (created by YARN) of UUID-randomized subdirs created by DiskBlockManager, as #KoedIt mentioned:
:
23/01/05 10:15:37 INFO storage.DiskBlockManager: Created local directory at /data/1/yarn/nm/usercache/<your-user-id>/appcache/application_xxxxxxxxx_xxxxxxx/blockmgr-d4df4512-d18b-4dcf-8197-4dfe781b526a
:
This is going to depend on what your cluster setup is and your Spark version, but you're more or less looking at the correct places.
For this explanation, I'll be talking about Spark v3.3.1, which is the latest version as of the time of this post.
There is an interesting method in org.apache.spark.util.Utils called getConfiguredLocalDirs and it looks like this:
/**
 * Return the configured local directories where Spark can write files. This
 * method does not create any directories on its own, it only encapsulates the
 * logic of locating the local directories according to deployment mode.
 */
def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    randomizeInPlace(getYarnLocalDirs(conf).split(","))
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_SANDBOX") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_SANDBOX"))
  } else {
    if (conf.getenv("MESOS_SANDBOX") != null && shuffleServiceEnabled) {
      logInfo("MESOS_SANDBOX available but not using provided Mesos sandbox because " +
        s"${config.SHUFFLE_SERVICE_ENABLED.key} is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}
This is interesting, because it makes us understand the order of precedence each config setting has. The order is:
if running in Yarn, getYarnLocalDirs should give you your local dir, which depends on the LOCAL_DIRS environment variable
if SPARK_EXECUTOR_DIRS is set, it's going to be one of those
if SPARK_LOCAL_DIRS is set, it's going to be one of those
if MESOS_SANDBOX and !shuffleServiceEnabled, it's going to be MESOS_SANDBOX
if spark.local.dir is set, it's going to be that
ELSE (catch-all) it's going to be java.io.tmpdir
IMPORTANT: In case you're using Kubernetes, all of this is disregarded and this logic is used.
Now, how do we find this directory?
Luckily, there is a nicely placed logging line in DiskBlockManager.createLocalDirs which prints out this directory if your logging level is INFO.
So, set your default logging level to INFO in log4j.properties (like so), restart your Spark application, and you should get a line saying something like
Created local directory at YOUR-DIR-HERE
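If you'd rather confirm this on the filesystem than through the logs, here is a rough sketch to run on a node; the candidate directories are just the precedence list from above, and the blockmgr-* naming comes from the DiskBlockManager log line shown earlier, so adjust both for your setup:

import glob
import os

# Candidate parent dirs, following the precedence order described above; adjust for your setup.
candidates = (
    os.environ.get("LOCAL_DIRS", "").split(",")          # YARN containers
    + os.environ.get("SPARK_LOCAL_DIRS", "").split(",")  # standalone / explicit override
    + ["/tmp"]                                           # java.io.tmpdir fallback
)

for parent in filter(None, candidates):
    # DiskBlockManager creates UUID-suffixed "blockmgr-*" subdirectories under each local dir.
    for path in glob.glob(os.path.join(parent, "**", "blockmgr-*"), recursive=True):
        print(path)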
I am trying to create a DAG which uses the DatabricksRunNowOperator to run PySpark.
However, I'm unable to figure out how I can access the Airflow config inside the PySpark script.
parity_check_run = DatabricksRunNowOperator(
    task_id='my_task',
    databricks_conn_id='databricks_default',
    job_id='1837',
    spark_submit_params=["file.py", "pre-defined-param"],
    dag=dag,
)
I've tried accessing it via kwargs but that doesn't seem to be working.
You can use the notebook_params argument, as seen in the documentation.
e.g.:
job_id = 42
notebook_params = {
    "dry-run": "true",
    "oldest-time-to-consider": "1457570074236"
}
notebook_run = DatabricksRunNowOperator(
    job_id=job_id,
    notebook_params=notebook_params,
)
Then you can access the value via dbutils.widgets.get("oldest-time-to-consider") in the PySpark code.
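For example, inside the notebook itself (widget names taken from the snippet above; note that dbutils.widgets.get returns strings):

# Values arrive as strings via the job's notebook widgets.
oldest_time = dbutils.widgets.get("oldest-time-to-consider")
dry_run = dbutils.widgets.get("dry-run") == "true"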
The DatabricksRunNowOperator supports different ways of providing parameters to existing jobs, depending on how the job is defined (doc):
notebook_params if you use notebooks - it's a dictionary of widget name -> value. You can fetch the parameters using dbutils.widgets.get
python_params - a list of parameters that will be passed to a Python task - you can fetch them via sys.argv (see the sketch after this list)
jar_params - a list of parameters that will be passed to a Jar task. You can get them as usual for a Java/Scala program
spark_submit_params - a list of parameters that will be passed to spark-submit
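As an illustration of the python_params route, here is a minimal sketch; the job_id mirrors the one in the question above, while the task_id, connection id, and parameter values are placeholders:

from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# DAG side: pass positional parameters to an existing Databricks job with a Python task.
run_python_job = DatabricksRunNowOperator(
    task_id="run_python_job",
    databricks_conn_id="databricks_default",
    job_id=1837,
    python_params=["2023-01-01", "--dry-run"],
)

# Task side (inside the Python file configured in the Databricks job):
# import sys
# run_date = sys.argv[1]          # "2023-01-01"
# dry_run = "--dry-run" in sys.argv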
I am new to Airflow and the SparkSubmitOperator. I can see that Spark applications are submitted to the 'root.default' queue out of the box when targeting YARN.
Simple question - how does one set a custom queue name?
wordcount = SparkSubmitOperator(
    application='/path/to/wordcount.py',
    task_id="wordcount",
    conn_id="spark_default",
    dag=dag
)
p.s. I have read the docs:
https://airflow.apache.org/docs/stable/_modules/airflow/contrib/operators/spark_submit_operator.html
Thanks
I can see now that the --queue value is coming from the Airflow spark_default connection:
Conn Id = spark_default
Host = yarn
Extra = {"queue": "root.default"}
Go to the Admin menu > Connections, select spark_default and edit it:
Change Extra {"queue": "root.default"} to {"queue": "default"} in the Airflow WebServer UI.
This of course means an Airflow connection is required for each queue.
To be clear, there are at least two ways to do this:
Via the Spark connection, as Phillip answered.
Via a --conf parameter, which Dustan mentions in a comment.
From my testing, if there's a queue set in the Connection's Extra field, that is used regardless of what you pass into the SparkSubmit conf.
However, if you remove queue from Extra in the Connection, and send it in the SparkSubmitOperator conf arg like below, YARN will show it properly.
conf={
    "spark.yarn.queue": "team_the_best_queue",
    "spark.submit.deployMode": "cluster",
    "spark.whatever.configs.you.have": "more_config",
}
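For context, here is roughly how that conf argument slots into the operator from the question; the queue name is the illustrative one from above, and the import path shown is the provider-package one (older Airflow 1.10 installs use airflow.contrib.operators.spark_submit_operator instead):

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

wordcount = SparkSubmitOperator(
    task_id="wordcount",
    application="/path/to/wordcount.py",
    conn_id="spark_default",        # make sure its Extra no longer sets "queue"
    conf={
        "spark.yarn.queue": "team_the_best_queue",   # queue now comes from conf
        "spark.submit.deployMode": "cluster",
    },
    dag=dag,
)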
I have Airflow jobs which run fine on an EMR cluster. What I need is this: say I have 4 Airflow jobs which each require an EMR cluster for about 20 minutes to complete their task. Why can't we create an EMR cluster at DAG run time and, once the job finishes, terminate the EMR cluster that was created?
Absolutely, that would be the most efficient use of resources. Let me warn you: there are a lot of details in this; I'll try to list as many as would get you going. I encourage you to add your own comprehensive answer listing any problems you encountered and their workarounds (once you are through this).
Regarding cluster creation / termination
For cluster creation and termination, you have EmrCreateJobFlowOperator and EmrTerminateJobFlowOperator respectively
Don't fret if you do not use an AWS SecretAccessKey (and rely wholly on IAM Roles); instantiating any AWS-related hook or operator in Airflow will automatically fall back to the underlying EC2 instance's attached IAM Role
If you're NOT using the EMR-Steps API for job submission, then you'll also have to manually sense both of the above operations using Sensors. There's already a sensor for polling the creation phase, called EmrJobFlowSensor, and you can modify it slightly to create a sensor for termination too (a minimal sensor sketch follows this list)
You pass your cluster-config JSON in job_flow_overrides. You can also pass configs in a Connection's (like my_emr_conn) extra param, but refrain from it because it often breaks SQLAlchemy ORM loading (since it's a big JSON)
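For the creation-polling part, here is a minimal sensor sketch; the task ids, connection id, and target_states are assumptions, and the import path requires a reasonably recent amazon provider package, so check the defaults and module layout of the version you run:

from airflow.providers.amazon.aws.sensors.emr import EmrJobFlowSensor

# Poll the newly created cluster until it is ready to accept work.
wait_for_cluster = EmrJobFlowSensor(
    task_id="wait_for_cluster",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
    target_states=["WAITING"],   # assumption: treat a WAITING cluster as "ready"
    aws_conn_id="aws_default",
)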
Regarding job submission
You can submit jobs to EMR using the EMR-Steps API, either during the cluster creation phase (within the cluster-config JSON) or afterwards using add_job_flow_steps(). There's even an EmrAddStepsOperator in Airflow, which also requires an EmrStepSensor. You can read more about it in the AWS docs, and you might also have to use command-runner.jar
For application-specific cases (like Hive, Livy), you can use their specific ways. For instance, you can use HiveServer2Hook to submit a Hive job. Here's a tricky part: the run_job_flow() call (made during the cluster-creation phase) only gives you a job_flow_id (cluster-id). You'll have to use a describe_cluster() call via EmrHook to obtain the private IP of the master node (see the lookup sketch after this list). Using this, you will then be able to programmatically create a Connection (such as a Hive Server 2 Thrift connection) and use it for submitting your computations to the cluster. And don't forget to delete those connections (for elegance) before completing your workflow.
Finally there's the good-old bash for interacting with cluster. For this you should also pass an EC2 key pair during cluster creation phase. Afterwards, you can programmatically create an SSH connection and use it (with an SSHHook or SSHOperator) for running jobs on your cluster. Read more about SSH-stuff in Airflow here
Particularly for submitting Spark jobs to a remote EMR cluster, read this discussion
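To illustrate the master-node lookup mentioned above, here is a rough sketch. Note the swap: the answer mentions describe_cluster(), while this sketch uses the boto3 EMR client's list_instances() call (obtained through EmrHook), which exposes the master's private IP directly; the connection id and function name are placeholders:

from airflow.providers.amazon.aws.hooks.emr import EmrHook

def get_master_private_ip(job_flow_id: str) -> str:
    """Return the private IP of the master node of a running EMR cluster."""
    emr_client = EmrHook(aws_conn_id="aws_default").get_conn()  # underlying boto3 EMR client
    master = emr_client.list_instances(
        ClusterId=job_flow_id, InstanceGroupTypes=["MASTER"]
    )["Instances"][0]
    return master["PrivateIpAddress"]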
The best way to do this is probably to have a node at the root of your Airflow DAG that creates the EMR cluster, and then another node at the very end of the DAG that spins the cluster down after all of the other nodes have completed.
Check my implementation; the DAG will create an EMR cluster, run the Spark job against the data in S3, and terminate automatically once done.
https://beyondexperiment.com/vijayravichandran06/aws-emr-orchestrate-with-airflow/
The best way to do it is as below:
create EMR cluster >> run spark application >> wait for spark application to complete >> terminate EMR cluster
import time
from airflow.operators.python import PythonOperator
from datetime import timedelta
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
from airflow.providers.amazon.aws.operators.emr_create_job_flow import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.operators.emr_terminate_job_flow import EmrTerminateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
# Spark-submit command for application
SPARK_APP = [
    {
        'Name': 'spark_app1',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode',
                'cluster',
                '--master',
                'yarn',
                '--class',
                'package_path_to_main',
                'location_of_jar',
                'application_args'],  # replace with your application's arguments
        },
    }
]
# EMR cluster configurations
JOB_FLOW_OVERRIDES = {
    'Name': 'emr_cluster_name',
    'ReleaseLabel': 'emr-6.4.0',
    'Applications': [{"Name": "Spark"}],
    'LogUri': 's3_path_for_log',
    'Instances': {
        'InstanceGroups': [
            {
                'Name': 'Master node',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'r5.8xlarge',
                'InstanceCount': 1
            },
            {
                'Name': "Slave nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'r5.8xlarge',
                'InstanceCount': 32
            }
        ],
        'Ec2SubnetId': 'subnet-id',
        'Ec2KeyName': 'KeyPair',
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        "AdditionalMasterSecurityGroups": ["security-group"]
    },
    'JobFlowRole': 'EMR_EC2_DefaultRole',
    'SecurityConfiguration': "SecurityConfig_name",
    'ServiceRole': 'EMR_DefaultRole',
    'StepConcurrencyLevel': 10,
}
# Airflow DAG definition
with DAG(
    dag_id='dag_name',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'email': ['email-address'],
        'email_on_failure': True,
        'email_on_retry': False,
    },
    dagrun_timeout=timedelta(hours=4),
    start_date=days_ago(1),
    schedule_interval='0 * * * *',
    catchup=False,
    tags=['example'],
) as dag:

    # EMR cluster creator
    cluster_creator = EmrCreateJobFlowOperator(
        task_id='cluster_creator',
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id='aws_default',
        emr_conn_id='emr_default',
    )

    # Step adder to run the Spark application
    step_adder_1 = EmrAddStepsOperator(
        task_id='step_adder_1',
        job_flow_id="{{ task_instance.xcom_pull(task_ids='cluster_creator', key='return_value') }}",
        aws_conn_id='aws_default',
        steps=SPARK_APP,
        trigger_rule='all_done',
    )

    # Step sensor to track the completion of the step added above
    step_checker_1 = EmrStepSensor(
        task_id='step_checker_1',
        job_flow_id="{{ task_instance.xcom_pull('cluster_creator', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='step_adder_1', key='return_value')[0] }}",
        aws_conn_id='aws_default',
        trigger_rule='all_done',
    )

    # Terminate the EMR cluster once all upstream tasks have completed
    cluster_remover = EmrTerminateJobFlowOperator(
        task_id='cluster_remover',
        job_flow_id="{{ task_instance.xcom_pull('cluster_creator', key='return_value') }}",
        aws_conn_id='aws_default',
        trigger_rule='all_done',
    )

    # define the task order
    cluster_creator >> step_adder_1 >> step_checker_1 >> cluster_remover
I am creating an HDInsight (HDI) Spark cluster using an ARM template.
"scriptActions": [
{
"name": "Install Server",
"uri": "https://raw..sh",
"parameters": "[parameters('clusterWorkerNode')]",
"isHeadNode": true,
"isWorkerNode": false,
"isZookeeperNode": false
}
]
How can I pass multiple values in parameters in above scriptActions?
if "isHeadNode": true, Will my script install on both headnodes? and What about for "isWorkerNode": false and "isZookeeperNode": false . Is it same scenario?
HDI cluster is taking more than 20 minutes to create. Is there a way to reduce the time taken?
Update:-
I am able to pass multiple dynamic variables using the following, and it works:
"parameters": "[concat(parameters('param1'),' ',parameters('param2'),' ',parameters('param3'),' ',parameters('param4'),' ',parameters('param5'))]",
How can I pass multiple values in parameters in above scriptActions?
You could add multiple parameter values, just like below:
"scriptActions": [
{
"name": "test",
"uri": "https://hdiconfigactions.blob.core.windows.net/linuxgiraphconfigactionv01/giraph-installer-v01.sh",
"parameters": "install upgrade",
"isHeadNode": true,
"isWorkerNode": true,
"isZookeeperNode": true
}
]
if "isHeadNode": true, Will my script install on both headnodes? and
What about for "isWorkerNode": false and "isZookeeperNode": false . Is
it same scenario?
Script Actions can be restricted to run only on certain node types, for example head nodes or worker nodes. If isHeadNode is true, the script will install on both head nodes; isWorkerNode and isZookeeperNode work the same way for their respective node types. For more information, please refer to this link.
The HDI cluster is taking more than 20 minutes to create. Is there a way to reduce the time taken?
Based on my knowledge, you cannot reduce it other than by selecting fewer VMs. The optimization of the installation process is controlled by Azure.