Access S3 files from Azure Synapse Notebook

Goal:
Move a large number of files from AWS S3 to ADLS Gen2 as fast as possible, using an Azure Synapse notebook with a parameterized regex for the filename pattern.
What I tried so far:
I know that to access ADLS Gen2 we can use
mssparkutils.fs.ls('abfss://container_name@storage_account_name.dfs.core.windows.net/foldername') - that works, but what is the equivalent for accessing S3?
I used mssparkutils.credentials.getSecret('AKV name','secretname') and mssparkutils.credentials.getSecret('AKV name','secret key id') to fetch the secret details in the Synapse notebook, but I am unable to configure S3 access from Synapse.
Question: Do I have to use the existing linked service via the credentials.getFullConnectionString(LinkedService) API?
In short: how do I configure connectivity to S3 from within a Synapse notebook?

Answering my own question here: AzCopy worked. The links below helped me finish the task. The steps are as follows.
Install AzCopy on your machine.
Go to your terminal and change to the directory where the executable is installed; run "azcopy login"; authenticate with your Azure Active Directory credentials in the browser using the link shown in the terminal message, entering the code provided in the terminal.
Authorize with S3 by setting the environment variables below:
set AWS_ACCESS_KEY_ID=
set AWS_SECRET_ACCESS_KEY=
For ADLS Gen2, you already completed authorization in step 2.
Use the commands (whichever suits your need) from the links below.
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3
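The answer above uses AzCopy from a local machine. Since the original question asked how to configure S3 connectivity from inside the Synapse notebook itself, here is a minimal hedged sketch of the Spark-side alternative: push the keys fetched from Key Vault into the Hadoop S3A connector and read with the s3a:// scheme. The Key Vault name, secret names, bucket and path are placeholders, and this assumes the Spark pool has the hadoop-aws/S3A libraries available.

from notebookutils import mssparkutils

# Placeholders - substitute your own Key Vault and secret names
access_key = mssparkutils.credentials.getSecret('<AKV-name>', '<access-key-secret-name>')
secret_key = mssparkutils.credentials.getSecret('<AKV-name>', '<secret-key-secret-name>')

# Hand the credentials to the S3A connector
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', access_key)
hadoop_conf.set('fs.s3a.secret.key', secret_key)

# Read (or list) with the s3a:// scheme; a glob pattern can stand in for the
# parameterized filename regex, or you can filter mssparkutils.fs.ls() results in Python
df = spark.read.parquet('s3a://<bucket-name>/<prefix>/*.parquet')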

Related

Azure Data Factory Copy Snowflake to Azure blob storage ErrorCode 2200 User configuration issue ErrorCode=SnowflakeExportCopyCommandValidationFailed

Need help with this error.
ADF copy activity, moving data from Snowflake to Azure Blob Storage as delimited text.
I am able to preview the snowflake source data. I am also able to browse the containers via sink browse. This doesn't look like an issue with permissions.
ErrorCode=SnowflakeExportCopyCommandValidationFailed,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Snowflake Export Copy Command validation failed:
'The Snowflake copy command payload is invalid.
Cannot specify property: column mapping,
Source=Microsoft.DataTransfer.ClientLibrary,'
Thanks for your help
Clearing the mapping from the copy activity worked.
ErrorCode=SnowflakeExportCopyCommandValidationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Snowflake Export Copy Command validation failed: 'The Snowflake copy command payload is invalid. Cannot specify property: column mapping,Source=Microsoft.DataTransfer.ClientLibrary,'"
To resolve the above error, please check the following:
Check the copy command you are using; the error means that command is not valid.
To get more detail about the error, try running the command in validation mode, like below:
COPY INTO MYTABLE VALIDATION_MODE = 'RETURN_ERRORS' FILES=('result.csv');
Check if you have any issue with the data files before loading the data again.
Try checking the storage account you are currently using and note that Snowflake doesn't support Data Lake Storage Gen1.
Use the COPY INTO command to copy the data from the Snowflake database table into Azure blob storage container.
Note:
Use the blob.core.windows.net endpoint for all supported types of Azure blob storage accounts, including Data Lake Storage Gen2.
Make sure to have either ACCOUNTADMIN role or a role with the global CREATE INTEGRATION privilege to run the below sample command:
copy into 'azure://myaccount.blob.core.windows.net/mycontainer/unload/' from mytable storage_integration = myint;
By using the above command, you do not need to include credentials to access the storage.
For more in detail, please refer below links:
Unloading into Microsoft Azure — Snowflake Documentation
COPY INTO <table> — Snowflake Documentation
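If you want to run the validation-mode COPY shown above from a notebook or script rather than a Snowflake worksheet, here is a minimal hedged sketch using the snowflake-connector-python package (an assumption - any Snowflake SQL client works); the account, credentials and file name are placeholders.

import snowflake.connector

conn = snowflake.connector.connect(
    account='<account_identifier>',
    user='<user>',
    password='<password>',
    warehouse='<warehouse>',
    database='<database>',
    schema='<schema>',
)
try:
    cur = conn.cursor()
    # RETURN_ERRORS reports the files/rows that would fail instead of loading them
    cur.execute("COPY INTO MYTABLE VALIDATION_MODE = 'RETURN_ERRORS' FILES = ('result.csv')")
    for row in cur:
        print(row)  # each row describes a rejected record and the reason
finally:
    conn.close()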

Accessing Azure ADLS gen2 with Pyspark on Databricks

I'm trying to learn Spark, Databricks & Azure.
I'm trying to access GEN2 from Databricks using Pyspark.
I can't find a proper way, I believe it's super simple but I failed.
Currently each time I receive the following:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I already have ADLS Gen2 running, plus a SAS URI for access.
What I was trying so far:
(based on this link: https://learn.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then to reach out to data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files/}")
Your configuration is incorrect. The first property should be set to just the value SAS, while the second should be the name of a Scala/Java class that returns the SAS token - you can't use a URI with the SAS information embedded in it; you would need to implement some custom code.
wasbs is the protocol for accessing Azure Blob Storage; although it can also be used to access ADLS Gen2 (not recommended), you would then need blob.core.windows.net instead of dfs.core.windows.net and the corresponding Spark properties for Azure Blob access.
The more common procedure to follow is described here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
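To make the fix concrete, here is a hedged PySpark sketch of the two configurations that commonly work for abfss:// access from Databricks. The storage account, container, SAS token and service-principal values are placeholders; the FixedSASTokenProvider route assumes a runtime whose hadoop-azure ships that class, otherwise a custom SAS token provider is needed as described above.

STORAGE_ACCOUNT_NAME = '<storage-account>'
CONTAINER_NAME = '<container>'

# Option 1: fixed SAS token (the token value only, not a full SAS URI)
sas_token = '<sas-token-query-string>'
spark.conf.set(f'fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net', 'SAS')
spark.conf.set(f'fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net',
               'org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider')
spark.conf.set(f'fs.azure.sas.fixed.token.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net', sas_token)

# Option 2: OAuth with a service principal (the recommended approach linked above)
# spark.conf.set(f'fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net', 'OAuth')
# spark.conf.set(f'fs.azure.account.oauth.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net',
#                'org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider')
# spark.conf.set(f'fs.azure.account.oauth2.client.id.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net', '<client-id>')
# spark.conf.set(f'fs.azure.account.oauth2.client.secret.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net', '<client-secret>')
# spark.conf.set(f'fs.azure.account.oauth2.client.endpoint.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net',
#                'https://login.microsoftonline.com/<tenant-id>/oauth2/token')

# Read with abfss:// against the dfs endpoint rather than wasbs:// against blob
df = spark.read.parquet(f'abfss://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/<path-to-files>')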

Notebook path can't be in DBFS?

Some of us are working with IDEs and trying to deploy notebook (.py) files to DBFS. The problem I have noticed is that when configuring jobs, those paths are not recognized.
notebook_path:
If I use this:
dbfs:/artifacts/client-state-vector/0.0.0/bootstrap.py
I get "Only absolute paths are currently supported. Paths must begin with '/'."
If I use this:
/dbfs/artifacts/client-state-vector/0.0.0/bootstrap.py
or
/artifacts/client-state-vector/0.0.0/bootstrap.py
I get Notebook not found.
What could be the issue here?
I see from the Databricks architecture that notebooks live in the Microsoft-managed subscription whereas DBFS is in the customer's subscription. Could that be the reason (that the notebook task can only pick up notebooks from the Microsoft-managed subscription)? For example, the folders I created at the workspace level, where I have some notebooks, do not show up in the DBFS browser, so I am beginning to think that could be the reason.
Notebooks aren't files on a file system - they are stored inside the control plane, not in the data plane where DBFS is located. If you want to execute a notebook, you need to upload it via the Workspace API (import), via the databricks workspace import ... command of the databricks-cli, or via the databricks_notebook resource of the Databricks Terraform provider. Only after that will you be able to refer to it in the notebook_path parameter.
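As a hedged illustration of the Workspace API route (the host, token and workspace path below are placeholders, not values from the question), a local .py file can be imported as a notebook and then referenced by notebook_path:

import base64
import requests

host = 'https://<databricks-instance>'
token = '<personal-access-token>'

with open('bootstrap.py', 'rb') as f:
    content = base64.b64encode(f.read()).decode('utf-8')

resp = requests.post(
    f'{host}/api/2.0/workspace/import',
    headers={'Authorization': f'Bearer {token}'},
    json={
        'path': '/Shared/client-state-vector/bootstrap',  # a workspace path, not a DBFS path
        'format': 'SOURCE',
        'language': 'PYTHON',
        'overwrite': True,
        'content': content,
    },
)
resp.raise_for_status()

# The job definition can then use notebook_path = '/Shared/client-state-vector/bootstrap'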

Creating database in Azure databricks on External Blob Storage giving error

I have mapped my blob storage to dbfs:/mnt/ under the name /mnt/deltalake,
and the blob storage container name is deltalake.
Mounting to DBFS is done using an Azure Key Vault backed secret scope.
When I try to create a database using CREATE DATABASE abc LOCATION '/mnt/deltalake/databases/abc', it errors out saying the path does not exist.
However, when I used the DBFS path as storage, e.g. CREATE DATABASE abc LOCATION '/user/hive/warehouse/databases/abc', it was always successful.
I wonder what is going wrong.
Suggestions please.
Using a mount point, you should be able to access existing files or write new files through Databricks.
However, I believe SQL commands such as CREATE DATABASE only operate against the underlying Hive metastore.
You may need to create a database for your blob storage outside of Databricks, and then connect to that database to read and write from it using Databricks.
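Not part of the answer above, but a hedged sanity check grounded in the "path does not exist" error: confirm from a notebook on the same cluster that the mount is visible and that the location folder exists, then retry the same statement. The paths are the ones from the question.

# Should list the contents of the deltalake container if the mount (and its secret scope) is healthy
display(dbutils.fs.ls('/mnt/deltalake'))

# Make sure the target location exists, then retry the CREATE DATABASE
dbutils.fs.mkdirs('/mnt/deltalake/databases/abc')
spark.sql("CREATE DATABASE abc LOCATION '/mnt/deltalake/databases/abc'")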

How to ship Airflow logs to Azure Blob Store

I'm having trouble following this guide, section 3.6.5.3 "Writing Logs to Azure Blob Storage".
The documentation states you need an active hook to Azure Blob Storage. I'm not sure how to create this. Some sources say you need to create the hook in the UI, and some say you can use an environment variable. Either way, none of my logs are getting written to blob storage and I'm at my wits' end.
The Azure Blob Storage hook (or any hook, for that matter) tells Airflow how to write to Azure Blob Storage. It is already included in recent versions of Airflow as wasb_hook.
You will need to make sure that the hook is able to write to Azure Blob Storage, and that REMOTE_BASE_LOG_FOLDER starts with wasb (e.g. wasb-xxx) so Airflow knows to use that hook. Once you take care of these two things, the instructions work without a hitch.
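A hedged way to check that first point before touching the logging config - assuming Airflow 1.10.x (the python2.7 paths in the next answer suggest that era), a wasb-type connection already defined, and an existing container; the connection id and container name below are placeholders - is to write a test blob through the hook directly:

from airflow.contrib.hooks.wasb_hook import WasbHook

hook = WasbHook(wasb_conn_id='<your-wasb-conn-id>')
# If this succeeds, the hook can write to the container that will hold the logs
hook.load_string('airflow wasb connectivity check',
                 container_name='<your-logs-container>',
                 blob_name='connectivity-check.txt')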
I achieved writing logs to blob storage using the steps below.
Create a folder named config inside the airflow folder.
Create empty __init__.py and log_config.py files inside the config folder.
Search for airflow_local_settings.py on your machine, for example:
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.pyc
run
cp /home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py config/log_config.py
Edit airflow.cfg [core] section
remote_logging = True
remote_log_conn_id = log_sync
remote_base_log_folder = wasb://airflow-logs@storage-account.blob.core.windows.net/logs/
logging_config_class = log_config.DEFAULT_LOGGING_CONFIG
Add the log_sync connection object (connection id log_sync, connection type wasb); a hedged sketch of creating it programmatically appears after these steps.
Install the Airflow Azure dependency:
pip install apache-airflow[azure]
Restart webserver and scheduler
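For the log_sync connection mentioned above, here is a hedged sketch of creating it programmatically instead of through the UI (Airflow 1.10.x style imports; the storage account name and key are placeholders):

from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id='log_sync',
    conn_type='wasb',
    login='<storage-account-name>',     # storage account name
    password='<storage-account-key>',   # storage account access key
)
session = settings.Session()
session.add(conn)
session.commit()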
