I am using Databricks and writing my code in a Python notebook. We recently deployed it to prod, but sometimes the notebook fails.
I am looking for a notebook command execution log file, but there is no option in Databricks to generate one.
I want to store log files in DBFS with a timestamp so I can refer to them if a run fails.
Is there any way we can achieve this? Thanks in advance for your help.
Yes, there is a way to do this: use the Databricks API. The following is taken from their documentation.
Create a cluster with logs delivered to a DBFS location
The following cURL command creates a cluster named “cluster_log_dbfs” and requests that Databricks send its logs to dbfs:/logs with the cluster ID as the path prefix.
curl -n -H "Content-Type: application/json" -X POST -d @- https://<databricks-instance>/api/2.0/clusters/create <<JSON
{
  "cluster_name": "cluster_log_dbfs",
  "spark_version": "5.2.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "num_workers": 1,
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/logs"
    }
  }
}
JSON
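If what you want is a per-run, timestamped log file written by the notebook itself (rather than cluster log delivery), here is a minimal Python sketch: it logs to the driver's local disk and copies the file to DBFS at the end. The /dbfs/FileStore/logs folder and the logger name are arbitrary choices for illustration, not anything Databricks requires.

# Minimal sketch: collect notebook logs locally, then copy the timestamped file to DBFS.
# Assumes a Databricks cluster, where DBFS is mounted at /dbfs; paths are examples only.
import logging
import os
import shutil
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
local_log = f"/tmp/notebook_{timestamp}.log"          # written on the driver's local disk
dbfs_dir = "/dbfs/FileStore/logs"                      # example DBFS folder for log files
dbfs_log = f"{dbfs_dir}/notebook_{timestamp}.log"

logger = logging.getLogger("notebook_logger")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(local_log)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

try:
    logger.info("Notebook started")
    # ... notebook logic goes here ...
    logger.info("Notebook finished successfully")
except Exception:
    logger.exception("Notebook failed")
    raise
finally:
    handler.flush()
    os.makedirs(dbfs_dir, exist_ok=True)
    shutil.copy(local_log, dbfs_log)   # persist the log to DBFS for later inspection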
I am trying to export a CSV file by following this guide https://docs.databricks.com/dev-tools/cli/index.html, but there is no response when executing the command below; it looks like the command exits immediately without saying whether the export succeeded or failed.
I have finished installing the CLI and set up authentication by entering a host and token in the Mac terminal, following the guide as well.
export DATABRICKS_CONFIG_FILE="dbfs:/FileStore/tables/partition.csv"
First, I wrote the dataframe to the file system with the code below:
df.coalesce(1).write.mode("overwrite").csv("dbfs:/FileStore/tables/partition.csv")
How can I successfully export the file from Databricks, and where is it stored locally?
Yes, you can copy it to your local machine or move it to another destination as needed.
Configure the Databricks CLI for Azure Databricks:
Please follow these steps:
pip install databricks-cli
Use the databricks configure --token command
Enter your Azure Databricks host name: https://adb-xxxxx.azuredatabricks.net/
Paste your personal access token.
Now you are all set to copy the CSV file to a destination location, either within DBFS or between DBFS and your local machine:
databricks fs cp dbfs:/FileStore/tables/partition.csv dbfs:/destination/your_folder/file.csv
databricks fs cp C:/folder/file.csv dbfs:/FileStore/folder
To export from DBFS to your local machine, put the DBFS path first and the local path second.
Or, if you have a lot of CSV files in a folder, you may prefer to export the entire folder rather than individual files.
Use -r to copy the folder recursively instead of an individual file:
databricks fs cp -r dbfs:/<folder> destination/folder
Alternative approach in Python:
You can directly use dbutils.fs.cp("dbfs:/FileStore/gender_submission.csv", "destination/folder")
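As a small illustration of that Python route (assuming a Databricks notebook, where dbutils is predefined, and example paths): dbutils.fs.cp can also copy from DBFS to the driver's local disk by using the file:/ scheme, after which the file can be read or shipped elsewhere.

# Copy from DBFS to the driver's local disk using the file:/ scheme (paths are examples).
dbutils.fs.cp("dbfs:/FileStore/gender_submission.csv", "file:/tmp/gender_submission.csv")

# The local copy can then be read with ordinary Python file APIs.
with open("/tmp/gender_submission.csv") as f:
    print(f.readline())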
We are trying to trigger an existing Azure Databricks notebook using a REST API call through a shell script. There are existing clusters running in the workspace. We want to attach the Databricks notebook to an existing cluster and trigger it.
We are trying to figure out the configuration and the REST API call that can trigger the notebook with a specific cluster dynamically at run time.
I have reproduced the above and got the below results.
Here, I have created two clusters C1 and C2 and two notebooks Nb1 and Nb2.
My Nb1 notebook code for sample:
print("Hello world")
I created a run for Nb1 and executed it on the C1 cluster using the shell script below from Nb2, which is attached to C2.
%sh
curl -n --header "Authorization: Bearer <Access token>" \
-X POST -H 'Content-Type: application/json' \
-d '{
  "run_name": "My Notebook run",
  "existing_cluster_id": "<cluster id>",
  "notebook_task": {
    "notebook_path": "<Your Notebook path>"
  }
}' https://<databricks-instance>/api/2.0/jobs/runs/submit
Execution from Nb2:
Job Created:
Job Output:
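If you prefer not to shell out to curl, a rough Python equivalent of the same runs/submit call using the requests library is sketched below; the workspace URL, token, cluster ID, and notebook path are placeholders to substitute.

# Sketch: submit a one-time notebook run on an existing cluster via the Jobs API 2.0
# and poll its state. Workspace URL, token, cluster id, and notebook path are placeholders.
import time
import requests

HOST = "https://<databricks-instance>"
HEADERS = {"Authorization": "Bearer <Access token>"}

payload = {
    "run_name": "My Notebook run",
    "existing_cluster_id": "<cluster id>",
    "notebook_task": {"notebook_path": "<Your Notebook path>"},
}

# Submit a one-time run (same call as the curl above)
resp = requests.post(f"{HOST}/api/2.0/jobs/runs/submit", headers=HEADERS, json=payload)
resp.raise_for_status()
run_id = resp.json()["run_id"]

# Poll until the run reaches a terminal life cycle state
while True:
    state = requests.get(
        f"{HOST}/api/2.0/jobs/runs/get", headers=HEADERS, params={"run_id": run_id}
    ).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print(state.get("result_state"), state.get("state_message", ""))
        break
    time.sleep(10)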
I have a configuration and cluster set up in GCP and I can submit a Spark job, but I am trying to run gcloud dataproc jobs submit spark from my CLI for the same configuration.
I've set up the service account locally; I am just unable to build the equivalent command for the console configuration.
console config:
"sparkJob": {
"mainClass": "main.class",
"properties": {
"spark.executor.extraJavaOptions": "-DARGO_ENV_FILE=gs://file.properties",
"spark.driver.extraJavaOptions": "-DARGO_ENV_FILE=gs://file.properties"
},
"jarFileUris": [
"gs://my_jar.jar"
],
"args": [
"arg1",
"arg2",
"arg3"
]
}
And the equivalent command that I built is:
cloud dataproc job submit spark
-t spark
-p spark.executor.extraJavaOptions:-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions-DARGO_ENV_FILE=gs://file.properties
-m main.class
-c my_cluster
-f gs://my_jar.jar
-a 'arg1','arg2','arg3'
It's not reading the file.properties files and gives this error:
error while opening file spark.executor.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties: error: open spark.executor.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties: no such file or directory
And when I run the command without the -p (properties) flag and those files, it runs but eventually fails because those properties are missing.
I can't figure out where I am going wrong.
PS: I'm trying to run the Dataproc command from the CLI as the equivalent of a spark-submit command like this:
spark-submit --conf "spark.driver.extraJavaOptions=-Dkafka.security.config.filename=file.properties" \
  --conf "spark.executor.extraJavaOptions=-Dkafka.security.config.filename=file.properties" \
  --class main.class my_jar.jar \
  --arg1 \
  --arg2 \
  --arg3
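Separately from the CLI, the same console configuration can also be submitted programmatically. Below is a hedged sketch using the google-cloud-dataproc Python client, where the project ID, region, and cluster name are placeholders and the jar, properties, and args simply mirror the console config above.

# Sketch: submit the Spark job from the console config via the Dataproc Python client.
# "my-project" and "us-central1" are placeholders; the rest mirrors the console config above.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my_cluster"},
    "spark_job": {
        "main_class": "main.class",
        "jar_file_uris": ["gs://my_jar.jar"],
        "properties": {
            "spark.executor.extraJavaOptions": "-DARGO_ENV_FILE=gs://file.properties",
            "spark.driver.extraJavaOptions": "-DARGO_ENV_FILE=gs://file.properties",
        },
        "args": ["arg1", "arg2", "arg3"],
    },
}

submitted = client.submit_job(project_id=project_id, region=region, job=job)
print("Submitted job:", submitted.reference.job_id)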
A Spark application can run many jobs. My Spark is running on YARN, version 2.2.0.
How do I get the running status and other info for a job with a given application ID, possibly using the REST API?
This might be late, but I'm putting it here for convenience; hope it helps. You can use the REST API command below to get the status of any job running on YARN.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343/state'
O/P - {"state":"RUNNING"}
Throughout the job's life cycle, the state will move through NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, or KILLED.
You can use jq for a formatted output.
curl --negotiate -s -u : -X GET 'http://resourcemanagerhost:8088/ws/v1/cluster/apps/application_121766109986_12343'| jq .app.state
O/P - "RUNNING"
YARN has a Cluster Applications API. This shows the state along with other information. To use it:
$ curl 'RMURL/ws/v1/cluster/apps/APP_ID'
with your application id as APP_ID.
It provides the application's state along with other details such as progress, queue, tracking URL, and start and finish times.
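The same endpoint is easy to poll from code. Here is a small Python sketch using the requests library, with the ResourceManager host and application ID as placeholders; it assumes the endpoint does not require SPNEGO/Kerberos authentication (the --negotiate flag in the curl examples above suggests your cluster might).

# Sketch: poll the YARN ResourceManager REST API for an application's state.
# The ResourceManager address and application id are placeholders.
import time
import requests

RM_URL = "http://resourcemanagerhost:8088"
APP_ID = "application_121766109986_12343"

while True:
    app = requests.get(f"{RM_URL}/ws/v1/cluster/apps/{APP_ID}").json()["app"]
    print(app["state"], app.get("finalStatus"))
    if app["state"] in ("FINISHED", "FAILED", "KILLED"):
        break
    time.sleep(15)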
We are trying to use the Azure CLI on linux to upload a WebJob as part of our continuous deployment pipeline.
azure site job upload -v $WEB_JOB_NAME $WEB_JOB_TYPE run.zip $WEB_SITE_NAME
But the command fails after > 20 mins of waiting on the "Uploading WebJob" step.
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Some more info:
The CLI is properly authenticated. We can trigger already existing WebJobs just fine.
The exact same run.zip uploads successfully from Microsoft Azure PowerShell on Windows.
The zip file contains a runnable JAR and a small .cmd script to start it. File size: 30 MB.
We tried setting the verbose flag, but it does not give any more information.
It looks like a bug in the xplat-cli. I don't think it's related to Linux, because I get the same error when I run the xplat-cli on Windows with a zip file that's also around 30 MB. I'd suggest opening an issue for them here: https://github.com/Azure/azure-xplat-cli/issues
Workaround:
You can use the CLI to get the site credentials and then use curl to upload the WebJob. Here is a little script that does that.
# get site config from azure cli
siteConfig=`azure site show $WEB_SITE_NAME -d --json`
# extract publishing username and password for the site
publishingUserName=`echo $siteConfig | python -c "import json,sys;obj=json.load(sys.stdin);print(obj['config']['publishingUserName'])"`
publishingPassword=`echo $siteConfig | python -c "import json,sys;obj=json.load(sys.stdin);print(obj['config']['publishingPassword'])"`
siteScmUrl=`echo $siteConfig | python -c "import json,sys;obj=json.load(sys.stdin);print(obj['site']['siteProperties']['properties']['RepositoryUri'])"`
# build the path for the webjob on the server
jobPath="zip/site/wwwroot/App_Data/jobs/$WEB_JOB_TYPE/$WEB_JOB_NAME"
fullUrl=$siteScmUrl$jobPath
# Upload the zip file using curl
curl -X PUT --data-binary @run.zip -u $publishingUserName:$publishingPassword $fullUrl
You can read more about the webjob REST APIs here https://github.com/projectkudu/kudu/wiki/WebJobs-API
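For completeness, here is a Python version of the same workaround; it simply mirrors the curl upload above using the requests library, and the credentials and SCM URL are whatever the script extracted from azure site show.

# Sketch: the same WebJob upload as the curl line above, via Python requests.
# The three values below are placeholders for what the script extracted with `azure site show`.
import requests

publishing_user = "<publishingUserName>"
publishing_password = "<publishingPassword>"
full_url = "<siteScmUrl>zip/site/wwwroot/App_Data/jobs/<WEB_JOB_TYPE>/<WEB_JOB_NAME>"

# PUT the zip as the request body, exactly as the curl command does
with open("run.zip", "rb") as zip_file:
    resp = requests.put(full_url, data=zip_file,
                        auth=(publishing_user, publishing_password))
resp.raise_for_status()
print("Uploaded, status:", resp.status_code)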