Notebook path can't be in DBFS? - databricks

Some of us are working with IDEs and trying to deploy notebook (.py) files to DBFS. The problem I have noticed is that when configuring jobs, those paths are not recognized as the notebook_path.
If I use this:
dbfs:/artifacts/client-state-vector/0.0.0/bootstrap.py
I get "Only absolute paths are currently supported. Paths must begin with '/'."
If I use this:
/dbfs/artifacts/client-state-vector/0.0.0/bootstrap.py
or
/artifacts/client-state-vector/0.0.0/bootstrap.py
I get "Notebook not found".
What could be the issue here?
I see from the Databricks architecture that notebooks live in the Microsoft-managed subscription whereas DBFS is in the customer's subscription. Could that be the reason (that the notebook task can only pick up notebooks from the Microsoft-managed subscription)? For example, the folders I created at the workspace level, where I have some notebooks, do not show up in the DBFS browser, so I am beginning to think that could be the reason.

Notebooks aren't files on a file system - they are stored inside the control plane, not in the data plane where DBFS is located. If you want to execute a notebook, you need to upload it via the Workspace API (import), via the databricks workspace import ... command of the Databricks CLI, or via the databricks_notebook resource of the Databricks Terraform provider. Only after that will you be able to refer to it in the notebook_path parameter.
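As a rough sketch with the Databricks CLI (the workspace folder /Shared/client-state-vector is just a placeholder, and the flags shown are from the legacy databricks-cli, so check your CLI version's help), the upload could look like:
databricks workspace mkdirs /Shared/client-state-vector
databricks workspace import ./bootstrap.py /Shared/client-state-vector/bootstrap --language PYTHON --format SOURCE --overwrite
The job's notebook_path would then be /Shared/client-state-vector/bootstrap - a workspace path, not a dbfs:/ or /dbfs/... path.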

Related

Access S3 files from Azure Synapse Notebook

Goal:
Move a lot of files from AWS S3 to ADLS Gen2 as fast as possible using Azure Synapse, with a parameterized regex for the filename pattern, in a Synapse Notebook.
What I tried so far:
I know that to access ADLS Gen2 we can use
mssparkutils.fs.ls('abfss://container_name@storage_account_name.blob.core.windows.net/foldername'), which works, but what is the equivalent to access S3?
I used mssparkutils.credentials.getSecret('AKV name','secretname') and mssparkutils.credentials.getSecret('AKV name','secret key id') to fetch secret details in the Synapse notebook, but I am unable to configure S3 connectivity in Synapse.
Question: Do I have to use the existing linked service via the credentials.getFullConnectionString(LinkedService) API?
In short, my question is: how do I configure connectivity to S3 from within a Synapse Notebook?
Answering my own question here: AzCopy worked. Below are the links which helped me finish the task. The steps are as follows.
Install AzCopy on your machine.
Go to your terminal, go to the directory where the executable is installed, and run "azcopy login"; authenticate with your Azure Active Directory credentials in the browser using the link from the terminal message, entering the CODE provided in the terminal.
Authorize with S3 using the below:
set AWS_ACCESS_KEY_ID=
set AWS_SECRET_ACCESS_KEY=
For ADLS Gen2, you are already done in step 2.
Use the commands (whichever suits your need) from the links below.
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3
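For reference, a minimal sketch of the actual copy step (the bucket, account, container, and folder names are placeholders; note that AzCopy's --include-pattern takes wildcards rather than full regex):
azcopy copy "https://s3.amazonaws.com/<bucket>/<folder>" "https://<storage-account>.blob.core.windows.net/<container>/<folder>" --recursive=true --include-pattern "*.csv"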

how to copy py file stored in dbfs location to databricks workspace folders

How do I copy a .py file stored in a DBFS location to Databricks workspace folders? Once it is copied to the workspace folders, I can run it as a notebook using the %run command.
DBFS & Workspace folders are two different things that aren't connected directly:
DBFS is located in your own environment (the so-called data plane, see the Databricks Architecture docs), built on top of a specific cloud storage, such as AWS S3, Azure Data Lake Storage, etc.
Workspace folders are located in the control plane that is owned by Databricks - the folders are just metadata to represent a hierarchy of notebooks. When executed, the code of the notebooks is sent from the Databricks environment to the machines running in your environment.
To put code into the workspace, you can either use the UI to upload it, use the Workspace API to import it, or, even easier, just use the workspace import command (or workspace import_dir to import many files from a directory) of the Databricks CLI, which is a wrapper over the REST API but easier to use.
If you already copied notebooks onto DBFS, you can simply download them again to your local machine using the fs cp command of the Databricks CLI, and then use workspace import (or workspace import_dir) to import them, as sketched below.
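A minimal sketch of that round trip with the legacy Databricks CLI (the DBFS and workspace paths are placeholders):
databricks fs cp -r dbfs:/artifacts/my-project ./my-project
databricks workspace import_dir ./my-project /Shared/my-project
After that, a file imported this way can be referenced from another notebook with %run /Shared/my-project/<notebook-name>.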

How to create an empty folder in Azure Blob from Azure Databricks

I have a scenario where I want to list all the folders inside a directory in Azure Blob. If no folders are present, I want to create a new folder with a certain name.
I am trying to list the folders using dbutils.fs.ls(path).
But the problem with the above command is it fails if the path doesn't exist, which is a valid scenario for me.
If my program runs for the first time the path will not exist and dbutils.fs.ls command will fail.
Is there any way I can handle this scenario dynamically from Databricks?
It will also work for me if I can create an empty folder in Azure Blob from Databricks before executing my job.
I have tried running below command from databricks notebook
%sh mkdir -p /mnt/<mountName>/path/folderName
The command runs successfully, but even though my container in Azure Blob is mounted, it doesn't create the folder.
Sorry for such an elongated post. Any help is much appreciated. Thanks in advance.
I found this was able to create a folder on the mounted blob storage:
dbutils.fs.mkdirs("/mnt/<mountName>/path/folderName")
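To handle the original scenario (list the folders if the path exists, otherwise create it), a small sketch for a Databricks notebook cell, where dbutils is predefined and /mnt/<mountName>/path/folderName is the placeholder path from the question:
path = "/mnt/<mountName>/path/folderName"
try:
    # dbutils.fs.ls raises an exception if the path does not exist
    folders = dbutils.fs.ls(path)
except Exception:
    # create the folder so later steps have somewhere to write
    dbutils.fs.mkdirs(path)
    folders = []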

How to ship Airflow logs to Azure Blob Store

I'm having trouble following section 3.6.5.3 of this guide, "Writing Logs to Azure Blob Storage".
The documentation states you need an active hook to Azure Blob storage. I'm not sure how to create this. Some sources say you need to create the hook in the UI, and some say you can use an environment variable. Either way, none of my logs are getting written to blob store and I'm at my wits end.
The Azure Blob Store hook (or any hook, for that matter) tells Airflow how to write to Azure Blob Store. This is already included in recent versions of Airflow as wasb_hook.
You will need to make sure that the hook is able to write to Azure Blob Store, and that the REMOTE_BASE_LOG_FOLDER container is named like wasb-xxx. Once you take care of these two things, the instructions work without a hitch.
I achieved writing logs to blob storage using the steps below:
Create a folder named config inside the airflow folder
Create empty __init__.py and log_config.py files inside the config folder
Search for airflow_local_settings.py on your machine, e.g.
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.pyc
Run
cp /home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py config/log_config.py
Edit the airflow.cfg [core] section:
remote_logging = True
remote_log_conn_id = log_sync
remote_base_log_folder = wasb://airflow-logs@storage-account.blob.core.windows.net/logs/
logging_config_class = log_config.DEFAULT_LOGGING_CONFIG
Add the log_sync connection object (the connection referenced by remote_log_conn_id), e.g. in the Airflow UI; see the sketch after these steps for an environment-variable alternative
Install the Airflow Azure dependency:
pip install apache-airflow[azure]
Restart the webserver and scheduler
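As a sketch of the connection step (assuming the wasb_hook convention of using the storage account name as the login and the access key as the password; the account name and key are placeholders), the log_sync connection can also be supplied through Airflow's AIRFLOW_CONN_<CONN_ID> environment-variable mechanism instead of the UI:
export AIRFLOW_CONN_LOG_SYNC='wasb://<storage-account-name>:<url-encoded-access-key>@'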

Unable to locate the repository cloned from git using Azure cloud shell

I opened Azure Cloud Shell and once the command prompt was ready, I tried git clone https://github.com/Azure-Samples/python-docs-hello-world and it was cloned successfully. However, I am unable to locate where the cloned files are. I need help with the process of locating them using Azure Cloud Shell.
Azure Cloud Shell stores the files in a file share within a storage account that you either specified or Azure created for you.
When you use basic settings and select only a subscription, Cloud Shell creates three resources on your behalf in the supported region that's nearest to you:
Resource group: cloud-shell-storage-<region>
Storage account: cs<uniqueGuid>
File share: cs-<user>-<domain>-com-<uniqueGuid>
Source.
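As a quick way to find the clone from inside Cloud Shell (a sketch; python-docs-hello-world is the repository from the question), the repository lands in your home directory, which is persisted as an image inside that file share, while ~/clouddrive exposes the share itself:
ls ~                # the cloned python-docs-hello-world folder shows up here
ls ~/clouddrive     # contents of the backing Azure Files share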
