Is there an easy way to load data from Azure Databricks Spark DB to GCP Databricks Spark DB?
Obtain the JDBC connection details from the Azure instance and use them in the GCP instance to pull the data, just as you would from any other JDBC source.
// This is run in GCP instance
val some_table = spark.read
.format("jdbc")
.option("url", "jdbc:databricks://adb-xxxx.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/xxxx;AuthMech=3;UID=token;PWD=xxxx")
.option("dbtable", "some_table")
.load()
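To keep the connection details out of one long literal, the URL can be assembled from its parts. This is only a minimal sketch in Python: the function name `build_databricks_jdbc_url` and its parameters are hypothetical, and the URL layout simply mirrors the one shown above.

```python
def build_databricks_jdbc_url(host: str, http_path: str, token: str,
                              schema: str = "default") -> str:
    """Assemble a JDBC URL in the same layout as the example above."""
    # Connection properties, in the same order as in the example URL.
    params = {
        "transportMode": "http",
        "ssl": "1",
        "httpPath": http_path,
        "AuthMech": "3",
        "UID": "token",
        "PWD": token,
    }
    suffix = ";".join(f"{key}={value}" for key, value in params.items())
    return f"jdbc:databricks://{host}:443/{schema};{suffix}"


# Placeholder values copied from the example above.
url = build_databricks_jdbc_url("adb-xxxx.azuredatabricks.net", "sql/xxxx", "xxxx")
```

The resulting string can be passed to `.option("url", url)`. In practice the token would more likely come from a secret scope than from a literal.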
Alternatively, assuming the Azure data is stored in Blob or ADLS Gen2 storage, mount it in the GCP instance's DBFS and read the data directly.
// This is run in GCP instance
// Assuming ADLSv2 on Azure side
val configs = Map(
"fs.azure.account.auth.type" -> "OAuth",
"fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id" -> "<application-id>",
"fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mountPoint = "/mnt/<mount-name>",
extraConfigs = configs)
val some_data = spark.read
.format("delta")
.load("/mnt/<mount-name>/<some_schema>/<some_table>")
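For a Python notebook, the same mount inputs can be prepared with two small helpers. This is a sketch under assumptions: the helper names `abfss_uri` and `oauth_configs` are hypothetical, the configuration keys are the ones from the Scala `Map` above, and the container and account names in the example call are placeholders.

```python
def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI; container and account are separated by '@'."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


def oauth_configs(application_id: str, client_secret: str, directory_id: str) -> dict:
    """Return the same OAuth settings as the Scala Map above, as a Python dict."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": application_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


# Placeholder names, for illustration only.
source = abfss_uri("mycontainer", "myaccount")
configs = oauth_configs("<application-id>", "<client-secret>", "<directory-id>")
```

In a Databricks Python notebook these values would then go to `dbutils.fs.mount(source=source, mount_point="/mnt/<mount-name>", extra_configs=configs)`, with the client secret read through `dbutils.secrets.get` as in the Scala version.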
I created a Parquet file with custom metadata at file level:
Now I'm trying to read that metadata from the Parquet file in (Azure) Databricks. But when I run the following code, I don't see any of the metadata that is present in the file.
storageaccount = 'zzzzzz'
containername = 'yyyyy'
access_key = 'xxxx'
spark.conf.set(f'fs.azure.account.key.{storageaccount}.blob.core.windows.net', access_key)
path = f"wasbs://{containername}@{storageaccount}.blob.core.windows.net/generated_example_10m.parquet"
data = spark.read.format('parquet').load(path)
data.printSchema()
I tried to reproduce the same thing in my environment and got this output.
Follow the code below and use select("*", "_metadata"):
path = "wasbs://<container>@<storage_account_name>.blob.core.windows.net/<file_path>.parquet"
data = spark.read.format('parquet').load(path).select("*", "_metadata")
display(data)
Or specify your schema and load the path with .select("*", "_metadata"):
df = spark.read \
.format("parquet") \
.schema(schema) \
.load(path) \
.select("*", "_metadata")
display(df)
I am able to read/write files from spark standalone cluster to s3 using the below configuration.
val spark = SparkSession.builder()
.appName("Sample App")
.config("spark.master", "spark://spark:7077")
.config("spark.hadoop.fs.s3a.path.style.access", value = true)
.config("fs.s3a.fast.upload", value = true)
.config("fs.s3a.connection.ssl.enabled", value = false)
.config("mapreduce.fileoutputcommitter.algorithm.version", value = 2)
.config("spark.hadoop.fs.s3a.access.key", "Access Key Value")
.config("spark.hadoop.fs.s3a.secret.key", "Secret Key Value")
.config("spark.hadoop.fs.s3a.endpoint", "End-Point Value")
.getOrCreate()
But my requirement is to reuse the S3 connection instead of specifying the S3 keys every time I create a Spark session, like a mount point in Databricks.
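One possible direction, offered only as a sketch: when the application is launched through `spark-submit`, properties placed in `conf/spark-defaults.conf` are applied to every session, so the S3A settings would not have to be repeated in code. The helper below renders such lines from a dict. The name `render_spark_defaults` is hypothetical, the values are placeholders, and it assumes that `fs.*` Hadoop options need the `spark.hadoop.` prefix in order to reach the Hadoop configuration.

```python
def render_spark_defaults(settings: dict) -> str:
    """Render settings as spark-defaults.conf lines, one '<key> <value>' per line."""
    lines = []
    for key, value in settings.items():
        # Assumption: bare Hadoop keys are only forwarded with this prefix.
        if key.startswith("fs."):
            key = "spark.hadoop." + key
        # Write booleans in lowercase rather than Python's True/False.
        if isinstance(value, bool):
            value = str(value).lower()
        lines.append(f"{key} {value}")
    return "\n".join(lines)


# Placeholder settings modelled on the builder options above.
s3a_settings = {
    "spark.hadoop.fs.s3a.path.style.access": True,
    "fs.s3a.fast.upload": True,
    "fs.s3a.connection.ssl.enabled": False,
    "spark.hadoop.fs.s3a.endpoint": "<endpoint>",
}
conf_text = render_spark_defaults(s3a_settings)
```

The access and secret keys could be added the same way, although keeping them in a credentials provider or environment variables is usually preferable to a plain-text file.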
First of all, thank you for your time on the following question :)
I am trying to connect a Databricks Scala application to Azure Table Storage, but I am getting the following error:
Azure Table Scala APP
Error:
NoSuchMethodError: reactor.netty.http.client.HttpClient.resolver(Lio/netty/resolver/AddressResolverGroup;)Lreactor/netty/transport/ClientTransport;
at com.azure.core.http.netty.NettyAsyncHttpClientBuilder.build(NettyAsyncHttpClientBuilder.java:94)
at com.azure.core.http.netty.NettyAsyncHttpClientProvider.createInstance(NettyAsyncHttpClientProvider.java:18)
at com.azure.core.implementation.http.HttpClientProviders.createInstance(HttpClientProviders.java:58)
at com.azure.core.http.HttpClient.createDefault(HttpClient.java:50)
at com.azure.core.http.HttpClient.createDefault(HttpClient.java:40)
at com.azure.core.http.HttpPipelineBuilder.build(HttpPipelineBuilder.java:62)
at com.azure.data.tables.BuilderHelper.buildPipeline(BuilderHelper.java:122)
at com.azure.data.tables.TableServiceClientBuilder.buildAsyncClient(TableServiceClientBuilder.java:161)
at com.azure.data.tables.TableServiceClientBuilder.buildClient(TableServiceClientBuilder.java:93)
I attach the code:
val clientCredential: ClientSecretCredential = new ClientSecretCredentialBuilder()
.tenantId(tenantID)
.clientId(client_san_Id)
.clientSecret(client_san_Secret)
.build()
val tableService = new TableServiceClientBuilder()
.endpoint("https://<Resource-Table>.table.core.windows.net")
.credential(clientCredential)
.buildClient()
Thank you very much for your time!
First, you need to mount the storage on Azure Databricks.
Then use the code below to mount Table Storage.
dbutils.fs.mount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))
Access Table Storage using the code below:
// scala
// Either path form refers to the same mount point
val df = spark.read.text("/mnt/<mount-name>/...")
val dfFromDbfs = spark.read.text("dbfs:/mnt/<mount-name>/...")
You can refer to this notebook.
Also refer to this article by Gauri Mahajan.
I am trying to write a data pipeline that reads a .tsv file from Azure Blob Storage and writes the data to a MySQL database. I have a sensor that looks for a file with a given prefix within my storage container, and then a SparkSubmitOperator that actually reads the data and writes it to the database.
The sensor works fine and when I write the data from local storage to MySQL, that works fine as well. However, I am having quite a bit of trouble reading the data from Blob Storage.
This is the simple Spark job that I am trying to run,
spark = (SparkSession
.builder \
.config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem") \
.config("fs.azure.account.key.{}.blob.core.windows.net".format(blob_account_name), blob_account_key) \
.getOrCreate()
)
sc = spark.sparkContext
sc.setLogLevel("WARN")
df_tsv = spark.read.csv("wasb://{}@{}.blob.core.windows.net/{}".format(blob_container, blob_account_name, blob_name), sep=r'\t', header=True)
mysql_url = 'jdbc:mysql://' + mysql_server
df_tsv.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user": mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver"})
This is my SparkSubmitOperator,
spark_job = SparkSubmitOperator(
task_id="my-spark-app",
application="path/to/my/spark/job.py", # Spark application path created in airflow and spark cluster
name="my-spark-app",
conn_id="spark_default",
verbose=1,
conf={"spark.master":spark_master},
application_args=[tsv_file, mysql_server, mysql_user, mysql_password, mysql_table],
jars=azure_hadoop_jar + ", " + mysql_driver_jar,
driver_class_path=azure_hadoop_jar + ", " + mysql_driver_jar,
dag=dag)
I keep getting this error,
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
What exactly am I doing wrong?
I have both mysql-connector-java-8.0.27.jar and hadoop-azure-3.3.1.jar in my application. I have given the path to these in the driver_class_path and jars parameters. Is there something wrong with how I have done that here?
I have tried following the suggestions given here, Saving Pyspark Dataframe to Azure Storage, but they have not been helpful.
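One thing that may be worth checking in the operator above is how the two jar paths are joined. As far as I know, `--jars` in spark-submit expects a comma-separated list with no spaces, while `--driver-class-path` is an ordinary JVM classpath whose entries are separated by the platform path separator (`:` on Linux). The sketch below shows both joins; the helper names and the jar locations are hypothetical, and this is not confirmed to be the cause of the `ClassNotFoundException`.

```python
import os


def jars_argument(paths):
    """Join jar paths for --jars: comma-separated, with no surrounding spaces."""
    return ",".join(p.strip() for p in paths)


def classpath_argument(paths, separator=os.pathsep):
    """Join jar paths for a JVM classpath (os.pathsep is ':' on Linux/macOS)."""
    return separator.join(p.strip() for p in paths)


# Hypothetical jar locations, for illustration only.
azure_hadoop_jar = "/path/to/hadoop-azure-3.3.1.jar"
mysql_driver_jar = "/path/to/mysql-connector-java-8.0.27.jar"

jars = jars_argument([azure_hadoop_jar, mysql_driver_jar])
driver_class_path = classpath_argument([azure_hadoop_jar, mysql_driver_jar], separator=":")
```

These two strings would replace the `azure_hadoop_jar + ", " + mysql_driver_jar` expressions passed to `jars` and `driver_class_path`.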
I can see Spark connectors and guidelines for consuming events from Event Hubs using Scala in Azure Databricks.
But how can we consume events from Event Hubs in Azure Databricks using PySpark?
Any suggestions or documentation would help. Thanks.
Below is a snippet for reading events from Event Hubs with PySpark on Azure Databricks.
# A connection string with an entity path looks like this:
# "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"
# Source with default settings
connectionString = "Valid EventHubs connection string."
ehConf = {
'eventhubs.connectionString' : connectionString
}
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readInStreamBody = df.withColumn("body", df["body"].cast("string"))
display(readInStreamBody)
I think a slight modification is required if you are using Spark 2.4.5 or later with version 2.3.15 or above of the Azure Event Hubs connector.
From version 2.3.15 onward, the configuration dictionary requires the connection string to be encrypted, so you need to pass it as shown in the code snippet below.
connectionString = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readInStreamBody = df.withColumn("body", df["body"].cast("string"))
display(readInStreamBody)
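The connection string used in both snippets follows a `key=value;key=value` layout, so it can be assembled from its parts instead of being written as one literal. A minimal sketch, assuming the four fields shown above; the function name `event_hubs_connection_string` is hypothetical.

```python
def event_hubs_connection_string(endpoint, key_name, key, entity_path=None):
    """Assemble an Event Hubs connection string from its individual fields."""
    parts = [
        f"Endpoint={endpoint}",
        f"SharedAccessKeyName={key_name}",
        f"SharedAccessKey={key}",
    ]
    # EntityPath names the event hub; it is only appended when provided.
    if entity_path:
        parts.append(f"EntityPath={entity_path}")
    return ";".join(parts)


# Placeholder values copied from the snippet above.
connection_string = event_hubs_connection_string(
    "sb://SAMPLE", "KEY_NAME", "KEY", "EVENTHUB_NAME"
)
```

The result can then be passed through `EventHubsUtils.encrypt` and stored under `eventhubs.connectionString`, as in the snippet above.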