Currently I have an Azure Databricks instance where I have the following:
myDF.withColumn("created_on", current_timestamp())\
.writeStream\
.format("delta")\
.trigger(processingTime=triggerDuration)\
.outputMode("append")\
.option("checkpointLocation", "/mnt/datalake/_checkpoint_Position")\
.option("path", "/mnt/datalake/DeltaData")\
.partitionBy("col1", "col2", "col3", "col4", "col5")\
.table("deltadata")
This is saving the data into a storage account as blobs.
Now I'm trying to connect to this table from another Azure Databricks workspace, and my first step is to mount the Azure storage account:
dbutils.fs.mount(
source = sourceString,
mountPoint = "/mnt/data",
extraConfigs = Map(confKey -> sasKey))
Note: sourceString, confKey and sasKey are not shown for obvious reasons; in any case, the mount works fine.
And then I try to create the table, but I get an error:
CREATE TABLE delta_data USING DELTA LOCATION '/mnt/data/DeltaData/'
Error in SQL statement: AnalysisException:
You are trying to create an external table `default`.`delta_data`
from `/mnt/data/DeltaData` using Databricks Delta, but the schema is not specified when the
input path is empty.
According to the documentation, the schema should be picked up from the existing data, correct?
Also, I am trying to do this in a different workspace because the idea is to give people read-only access.
It seems my issue was the mount. It did not give any error when I created it, but it was not working correctly. I discovered this after trying:
dbutils.fs.ls("/mnt/data/DeltaData")
which was not showing anything. I unmounted it, reviewed all the configs, and after that it worked.
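For anyone hitting the same symptom, a minimal Python sketch of the verify-then-create flow (same paths and table name as above):

# Verify the mount actually lists the Delta files before registering the table.
files = dbutils.fs.ls("/mnt/data/DeltaData")
assert len(files) > 0, "Mount looks empty - unmount, review the configs and remount first"

# Only then create the external table on top of the existing Delta data.
spark.sql("CREATE TABLE delta_data USING DELTA LOCATION '/mnt/data/DeltaData/'")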
Related
How can I transform my data in Databricks workspace 1 (DBW1) and then push it (send/save the table) to another Databricks workspace (DBW2)?
On DBW1 I installed this JDBC driver.
Then I tried:
(df.write
.format("jdbc")
.options(
url="jdbc:spark://<DBW2-url>:443/default;transportMode=http;ssl=1;httpPath=<http-path-of-cluster>;AuthMech=3;UID=<uid>;PWD=<pat>",
driver="com.simba.spark.jdbc.Driver",
dbtable="default.fromDBW1"
)
.save()
)
However, when I run it I get:
java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
How do I do this correctly?
Note: each DBW is in a different subscription.
From my point of view, the more scalable way would be to write directly into ADLS instead of using JDBC. But this needs to be done as follows:
You need to have a separate storage account for your data. In any case, using the DBFS root for storing the actual data isn't recommended, as it's not accessible from outside the workspace - that makes things like migration more complicated.
You need to have a way to access that storage account (ADLS or Blob Storage). You can access the data directly (via abfss:// or wasbs:// URLs).
In the target workspace you just create a table on top of the written data - a so-called unmanaged table. Just do (see doc):
create table <name>
using delta
location 'path_or_url_to_data'
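To make that concrete, here is a minimal Python sketch assuming a shared ADLS Gen2 container; the account, container and path below are placeholders, not values from the original post:

# --- In DBW1 (source workspace): write the transformed data straight to ADLS ---
# Placeholder URL; assumes the cluster already has credentials for this account.
target_path = "abfss://shared@mystorageacct.dfs.core.windows.net/tables/fromDBW1"

df.write.format("delta").mode("overwrite").save(target_path)

# --- In DBW2 (target workspace): register an unmanaged table over the same path ---
spark.sql(f"CREATE TABLE default.fromDBW1 USING DELTA LOCATION '{target_path}'")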
I want to access the Delta tables of one Databricks environment from another Databricks environment by creating a global Hive metastore in one of them. Let me know if it is possible or not.
Thanks in advance.
There are two aspects here:
The data itself - it should be available to other workspaces. This is done by having a shared storage account/container and writing the data into it. You can either mount that storage account or access it directly (via a service principal or AAD passthrough) - you shouldn't write data to the built-in DBFS root, which isn't available to other workspaces. After you write the data using dataframe.write.format("delta").save("some_path_on_adls"), you can read it from another workspace that has access to that shared storage - this can be done either:
via Spark API: spark.read.format("delta").load("some_path_on_adls")
via SQL, using the following syntax instead of a table name (see docs):
delta.`some_path_on_adls`
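A quick Python sketch of both read styles from the consuming workspace (the path is a placeholder):

# Placeholder path on the shared storage account.
path = "abfss://shared@mystorageacct.dfs.core.windows.net/tables/events"

# Via the Spark API
df = spark.read.format("delta").load(path)

# Via SQL, addressing the Delta data by path instead of a table name
spark.sql(f"SELECT COUNT(*) FROM delta.`{path}`").show()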
The metadata - if you want to represent the saved data as SQL tables with database & table names instead of a path, then you have the following choices:
Use the built-in metastore to save data into a location on ADLS, and then create a so-called external table in the other workspace's own metastore. In the source workspace do:
dataframe.write.format("delta").option("path", "some_path_on_adls")\
.saveAsTable("db_name.table_name")
and in the other workspace execute the following SQL (either via %sql in a notebook or via the spark.sql function):
CREATE TABLE db_name.table_name USING DELTA LOCATION 'some_path_on_adls'
Use an external metastore that is shared by multiple workspaces - in this case you just need to save the data correctly:
dataframe.write.format("delta").option("path", "some_path_on_adls")\
.saveAsTable("db_name.table_name")
You still need to save the data into a shared location so it is accessible from the other workspace, but you don't need to register the table explicitly, as the other workspace will read the metadata from the same shared metastore.
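Either way, once the table is visible (registered explicitly in the first option, or picked up from the shared metastore in the second), the consuming workspace simply queries it by name - a small check, reusing the names from the snippets above:

# Query the table by name from the other workspace.
spark.sql("SELECT * FROM db_name.table_name LIMIT 10").show()

# DESCRIBE EXTENDED confirms it is an external table pointing at the ADLS location.
spark.sql("DESCRIBE EXTENDED db_name.table_name").show(truncate=False)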
I am trying to connect Databricks to my blob containers in Azure Data Lake Gen2.
I can't find what my file-system-name or my storage-account-name is anywhere for making the connection.
dbutils.fs.ls("abfss://file-system-name#storage-account-name.dfs.core.windows.net/")
Thanks. If someone could reference an example that would be great.
Oh man! They could have provided better documentation with an example. It took some time to click.
file-system-name means the container name. For instance, if you have a container 'data' created in a storage account 'myadls2021', it would be like below.
val data = spark.read
.option("header", "true")
.csv("abfss://data#myadls2021.dfs.core.windows.net/Baby_Names__Beginning_2007.csv")
I would recommend following this documentation:
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#azure-data-lake-storage-gen2
It explains how to access and/or mount an Azure Datalake Gen2 from Databricks.
file-system-name is the container name and storage-account-name is the Azure storage account name. Refer to the image below.
dbutils.fs.mount(
source = "abfss://raw#storageaccadls01.dfs.core.windows.net/",
mount_point = "/mnt/raw",
extra_configs = configs)
You can then access the storage account files from the DBFS mount point location. Refer to the image below.
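The configs dictionary isn't shown above; for reference, a service-principal OAuth mount typically looks roughly like this (tenant, application and secret-scope values are placeholders):

# Placeholder service-principal details; keep the real secret in a secret scope.
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
  source = "abfss://raw@storageaccadls01.dfs.core.windows.net/",
  mount_point = "/mnt/raw",
  extra_configs = configs)

dbutils.fs.ls("/mnt/raw")  # sanity check: should list the container contents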
I am new to Azure Databricks and am trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen-2 location.
From a Databricks notebook I have tried to set the Spark configuration for ADLS access. Still, I am unable to execute the DDL I created.
Note: One solution that works for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I wanted to check whether it is possible to create the external table DDL with the ADLS path, without a mount location.
# Using Principal credentials
spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/tenant_id/oauth2/token")
DDL
create external table test(
id string,
name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'
Error Received
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);
I need help in knowing whether it is possible to refer to the ADLS location directly in the DDL.
Thanks.
Sort of, if you can use Python (or Scala).
Start by making the connection:
TenantID = "blah"
def connectLake():
  spark.conf.set("fs.azure.account.auth.type", "OAuth")
  spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
  spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
  spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")
connectLake()
lakePath = "abfss://liquix#mystorageaccount.dfs.core.windows.net/"
Using Python you can register a table using:
spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")
You can now query that table if you have executed the connectLake() function - which is fine in your current session/notebook.
The problem now is that if a new session comes in and tries select * from that table, it will fail unless connectLake() is run first. There is no way around that limitation, as you have to provide credentials to access the lake.
You may want to consider ADLS Gen2 credential pass through: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
Note that this requires using a High Concurrency cluster.
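In other words, every new session or notebook has to re-establish the credentials before touching the table; a short sketch of what that looks like in practice:

# In a fresh session/notebook: set the lake credentials first...
connectLake()

# ...and only then query the table registered earlier.
spark.sql("SELECT COUNT(*) AS row_count FROM DimDate").show()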
I am manipulating some data using Azure Databricks. The data is in an Azure Data Lake Storage Gen1 account. I mounted the data into DBFS, but now, after transforming the data, I would like to write it back into my data lake.
To mount the data I used the following:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
"dfs.adls.oauth2.client.id": "<your-service-client-id>",
"dfs.adls.oauth2.credential": "<your-service-credentials>",
"dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
dbutils.fs.mount(
  source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)
I want to write back a .csv file. For this task I am using the following line:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")
However, I get the following error:
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
Any piece of code that could help me? Or a link that walks me through it?
Thanks.
If you mount Azure Data Lake Store, you should use the mount point to store your data instead of the "adl://..." URL. For details on how to mount Azure Data Lake Store (ADLS) Gen1, see the Azure Databricks documentation. You can verify that the mount point works with:
dbutils.fs.ls("/mnt/<newmountpoint>")
So, after mounting ADLS Gen1, try:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("/mnt/<mount-name>/<your-directory-name>")
This should work if you added the mount point properly and the service principal also has the required access rights on the ADLS account.
Spark always writes multiple files into a directory, because each partition is saved individually. See also the following Stack Overflow question.
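If a single CSV file is really needed, one common workaround (only sensible for small outputs) is to coalesce to one partition before writing - a sketch reusing the mount path from above:

# Collapse to a single partition so Spark emits just one part-*.csv file,
# at the cost of writing through a single task.
(dfGPS
  .coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv("/mnt/<mount-name>/<your-directory-name>"))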