Connect to the Databricks managed Hive metastore from outside - Azure

I have:
An existing Databricks cluster
Azure blob store (wasb) mounted to HDFS
A Database with its LOCATION set to a path on wasb (via mount path)
A Delta table (which ultimately writes Delta-formatted Parquet files to a blob store path)
A Kubernetes cluster that reads and writes data in Parquet and/or Delta format within the same Azure blob store that Databricks uses (writing in Delta format via spark-submit PySpark jobs)
What I want to do:
Utilize the managed Hive metastore in Databricks to act as data catalog for all data within Azure blob store
To this end, I'd like to connect to the metastore from my outside pyspark job such that I can use consistent code to have a catalog that accurately represents my data.
In other words, if I were to prep my db from within Databricks:
dbutils.fs.mount(
  source = "wasbs://container@storage.blob.core.windows.net",
  mount_point = "/mnt/db",
  extra_configs = {..})
spark.sql('CREATE DATABASE db LOCATION "/mnt/db"')
Then from my Kubernetes pyspark cluster, I'd like to execute
df.write.mode('overwrite').format("delta").saveAsTable("db.table_name")
This should write the data to wasbs://container@storage.blob.core.windows.net/db/table_name as well as register the table with Hive (and thus make it queryable with HiveQL).
How do I connect to the Databricks managed Hive metastore from a PySpark session outside of the Databricks environment?

This doesn't answer my question (I don't think it's possible), but it mostly solves my problem: writing a crawler that registers tables from Delta files.
Mount the Blob container and create a database as in the question.
Write a file in Delta format from anywhere:
df.write.mode('overwrite').format("delta").save("/mnt/db/table")  # equivalently, save to wasb:..../db/table
Create a notebook and schedule it as a job to run regularly:
import os

def find_delta_dirs(ls_path):
    # Recursively walk the mount and yield every directory that contains a _delta_log folder.
    for dir_path in dbutils.fs.ls(ls_path):
        if dir_path.isFile():
            pass
        elif dir_path.isDir() and ls_path != dir_path.path:
            if dir_path.path.endswith("_delta_log/"):
                yield os.path.dirname(os.path.dirname(dir_path.path))
            yield from find_delta_dirs(dir_path.path)

def fmt_name(full_blob_path, mount_path):
    # Turn a path relative to the mount into a flat table name (slashes become underscores).
    relative_path = full_blob_path.split(mount_path)[-1].strip("/")
    return relative_path.replace("/", "_")

db_name = "db"  # the database created above
db_mount_path = "/mnt/db"

for path in find_delta_dirs(db_mount_path):
    spark.sql(f"CREATE TABLE IF NOT EXISTS {db_name}.{fmt_name(path, db_mount_path)} USING DELTA LOCATION '{path}'")

Related

Azure Databricks external Hive Metastore

I checked the [documentation][1] on using an external Hive metastore with Azure Databricks (an Azure SQL database).
I was able to download the jars and place them into /dbfs/hive_metastore_jar.
My next step is to run the cluster with an init file:
# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<host>.database.windows.net:1433;database=<database> #should I add more parameters?
# Username to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionUserName admin
# Password to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionPassword p#ssword
# Driver class name for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
# Spark specific configuration options
spark.sql.hive.metastore.version 2.7.3 #I am not sure about this
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars /dbfs/hive_metastore_jar
I uploaded the init file to DBFS and launched the cluster, but it failed to read the init file. Something is wrong.
[1]: https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
I solved this for now. The problems I faced:
I didn't copy the Hive jars to the cluster's local file system. This is important: spark.sql.hive.metastore.jars cannot point at DBFS; it has to point at a local copy of the Hive jars, and the init script copies them over.
The connection was fine. I also used the Azure template with a VNet, which is preferable; I then allowed traffic to Azure SQL from the VNet that contains Databricks.
The last issue: I had to create the Hive schema before starting Databricks, by copying and running the DDL for Hive version 1.2 from Git. I deployed it into the Azure SQL database and then I was good to go.
There is a useful notebook with steps for downloading the jars. It downloads the jars to a tmp directory, from which we should copy them to a folder of our own (a rough sketch of that copy follows). Finally, during cluster creation we reference the init script that holds all the parameters; it includes the step of copying the jars from DBFS to the local file system of every cluster node.
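As a rough sketch of that copy step (the source path is just a placeholder for wherever the jars were downloaded; the DBFS destination matches the init script below):
# Copy the downloaded metastore jars from the driver's local disk into a permanent DBFS folder.
# Both paths are illustrative.
dbutils.fs.cp("file:/tmp/hive-v1_2-jars/", "dbfs:/metastore_jars/hive-v1_2/", recurse=True)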
// This example is for an init script named `external-metastore_hive121.sh`.
dbutils.fs.put(
"dbfs:/databricks/scripts/external-metastore_hive121.sh",
"""#!/bin/sh
|# A temporary workaround to make sure /dbfs is available.
|sleep 10
|# Copy metastore jars from DBFS to the local FileSystem of every node.
|cp -r /dbfs/metastore_jars/hive-v1_2/* /databricks/hive_1_2_1_metastore_jars
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options.
| # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
| # JDBC connect string for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://host--name.database.windows.net:1433;database=tcdatabricksmetastore_dev;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net"
|
| # Username to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "admin"
|
| # Password to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "P#ssword"
|
| # Driver class name for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "1.2.1"
| # Skip this one if ${hive-version} is 0.13.x.
| "spark.sql.hive.metastore.jars" = "/databricks/hive_1_2_1_metastore_jars/*"
|}
|EOF
|""".stripMargin,
overwrite = true)
The command creates a file in DBFS, and we reference it as the init script during cluster creation.
According to the documentation, we should use the config:
datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false
in order to have the Hive schema created automatically. That didn't work for me, which is why I took the DDL from Git and created the schema and tables myself.
You can test that everything works with the command:
%sql show databases

fs.s3 configuration with two s3 account with EMR

I have a pipeline using Lambda and EMR, where I read CSV from S3 in account A and write Parquet to S3 in account B.
I created the EMR cluster in account B, and it has access to S3 in account B.
I cannot add access to account A's S3 bucket to EMR_EC2_DefaultRole (that account is enterprise-wide data storage), so I use an access key, secret key, and session token to access account A's bucket. These are obtained through a Cognito token.
METHOD 1
I am using the fs.s3 protocol to read the CSV from S3 in account A and write to S3 in account B.
The PySpark code, which runs in EMR, reads from S3 (A) and writes Parquet to S3 (B); I submit about 100 jobs at a time.
Reading uses the following settings:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)
spark_df_csv = spark_session.read.option("Header", "True").csv("s3://somepath")
Writing:
I am using the s3a protocol: s3a://some_bucket/
It works, but sometimes I see a _temporary folder left in the S3 bucket and not all CSVs converted to Parquet.
When I enable EMR step concurrency at 256 (EMR 5.28) and submit 100 jobs, I get _temporary rename errors.
Issues:
This method creates a _temporary folder and sometimes doesn't delete it; I can see _temporary folders in the S3 bucket.
When I enable EMR step concurrency (latest EMR 5.28), which allows steps to run in parallel, I get a rename error on _temporary for some of the files.
METHOD 2:
I feel s3a is not good for parallel jobs.
So I want to read and write using fs.s3, since it has better file committers.
So I did this: I initially set the Hadoop configuration as above for account A, and then unset it so that writes fall back to the default account B S3 bucket.
Like this:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.unset("fs.s3.awsAccessKeyId")
hadoop_config.unset("fs.s3.awsSecretAccessKey")
hadoop_config.unset("fs.s3.awsSessionToken")
spark_df_csv.repartition(1).write.partitionBy(['org_id', 'institution_id']). \
mode('append').parquet(write_path)
Issues:
This works, but the issue is: say I trigger the Lambda, which in turn submits jobs for 100 files (in a loop); some 10-odd files fail with Access Denied while writing to the S3 bucket.
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n ... 1 more\nCaused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service:
This could be because the unset sometimes doesn't take effect, or because the set/unset of the shared Hadoop configuration happens concurrently across parallel runs. I mean, the Spark context of one job is unsetting the Hadoop configuration while another is setting it, which may cause this issue, though I'm not sure how Spark contexts behave in parallel.
Doesn't each job have a separate Spark context and session?
Please suggest alternatives for my situation.
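One alternative (a sketch, assuming the account A bucket is read through s3a://): the s3a connector in Hadoop 2.8+ supports per-bucket configuration, so the temporary credentials can be scoped to that one bucket and nothing has to be set and unset globally. Bucket names below are placeholders.
hadoop_config = sc._jsc.hadoopConfiguration()

# Session credentials apply only to account A's bucket (placeholder name: account-a-bucket).
hadoop_config.set("fs.s3a.bucket.account-a-bucket.aws.credentials.provider",
                  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_config.set("fs.s3a.bucket.account-a-bucket.access.key", dl_access_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.secret.key", dl_secret_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.session.token", dl_session_key)

# Reads from account A pick up the bucket-scoped credentials...
spark_df_csv = spark_session.read.option("header", "true").csv("s3a://account-a-bucket/somepath")

# ...while writes to account B's bucket keep using the cluster's default instance role.
spark_df_csv.write.mode("append").parquet("s3a://account-b-bucket/write_path")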

Azure Databricks - Unable to read simple blob storage file from notebook

I've set up a cluster with Databricks runtime version 5.1 (includes Apache Spark 2.4.0, Scala 2.11) and Python 3. I also installed the hadoop-azure library (hadoop-azure-3.2.0) on the cluster.
I'm trying to read a blob stored in my blob storage account, which is just a text file containing some numeric data delimited by spaces. I used the template generated by Databricks for reading blob data:
spark.conf.set(
  "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
  storage_account_access_key)
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
where file_location is my blob file (https://xxxxxxxxxx.blob.core.windows.net).
I get the following error:
No filesystem named https
I tried using sc.textFile(file_location) to read into an RDD and got the same error.
Your file_location should be in the format:
"wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>"
See: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html
You need to mount the blob container to an external location to access it via Azure Databricks.
Reference: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs
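A minimal mount sketch along the lines of that reference (all names are placeholders; the account key could also come from a secret scope):
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<storage-account-access-key>"})

df = spark.read.text("/mnt/<mount-name>/<path-to-file>")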
These three lines of code worked for me:
spark.conf.set("fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net","BIG_KEY")
df = spark.read.csv("wasbs://CONTAINER@STORAGE_ACCOUNT.blob.core.windows.net/")
df.select('*').show()
NOTE that line 2 ends with .net/ because I do not have a sub-folder.

spark read partitioned data in S3 partly in glacier

I have a Parquet dataset in S3, partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have:
s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
I want to read this dataset, but only a subset of the dates that are not yet in Glacier, e.g.:
val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
Unfortunately, I get the exception:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
It seems that Spark does not like partitioned datasets when some partitions are in Glacier. I could always read each date specifically, add a column with the date, and reduce(_ union _) at the end, but that is ugly as hell and it should not be necessary.
Is there any tip to read the available data in the datastore even with old data in Glacier?
The error you are getting is not related to Apache Spark; you are getting the exception because of the Glacier service. In short, S3 objects in the Glacier storage class are not accessible in the same way as normal objects; they need to be restored from Glacier before they can be read.
Apache Spark cannot directly handle a table/partition mapped to S3 objects in Glacier storage.
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
When S3 transitions an object from the STANDARD, STANDARD_IA, or REDUCED_REDUNDANCY storage classes to the GLACIER storage class, the object is stored in Glacier where it is not directly visible to you, and S3 bills only Glacier storage rates for it.
It is still an S3 object, but it has the GLACIER storage class.
When you need to access one of these objects, you initiate a restore, which places a temporary copy back in S3.
Moving the data back into a readable S3 storage class before reading it into Apache Spark will resolve your issue.
https://aws.amazon.com/s3/storage-classes/
Note: Apache Spark, AWS Athena, etc. cannot read objects directly from Glacier; if you try, you will get a 403 error.
If you archive objects using the Glacier storage option, you must
inspect the storage class of an object before you attempt to retrieve
it. The customary GET request will work as expected if the object is
stored in S3 Standard or Reduced Redundancy (RRS) storage. It will
fail (with a 403 error) if the object is archived in Glacier. In this
case, you must use the RESTORE operation (described below) to make
your data available in S3.
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/
The 403 error is due to the fact that you cannot read an object that is archived in Glacier (source).
Reading files from Glacier
If you want to read files from Glacier, you need to restore them to S3 before using them in Apache Spark. A copy will be available in S3 for the time specified in the restore command (for details see here); you can use the S3 console, the CLI, or any SDK/language to do this.
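For instance, a per-object restore request could look roughly like this with boto3 (bucket, key, retention days, and tier are placeholders):
import boto3

s3 = boto3.client("s3")
# Ask S3 to stage a temporary copy of the archived object for 7 days.
s3.restore_object(
    Bucket="my-bucket",
    Key="my-dataset/dt=2017-07-01/part-00000.parquet",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)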
Discarding Glacier files that you do not want to restore
Let's say you do not want to restore all the files from Glacier and instead want to discard them during processing. From Spark 2.1.1 / 2.2.0 onwards, you can ignore those files (they surface as IO/Runtime exceptions) by setting spark.sql.files.ignoreCorruptFiles to true (source).
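For example, on the session:
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")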
If you define your table through Hive and use the Hive metastore catalog to query it, Spark won't try to go into the non-selected partitions.
Take a look at the spark.sql.hive.metastorePartitionPruning setting.
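A minimal sketch of that route, assuming the data has already been registered as a partitioned Hive table (my_db.my_dataset is a made-up name): with metastore partition pruning enabled, only the partitions matching the filter are listed, so the Glacier partitions are never touched.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
df = spark.sql("SELECT * FROM my_db.my_dataset WHERE dt BETWEEN '2017-07-15' AND '2017-08-24'")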
Try this setting:
ss.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
or add this to spark-defaults.conf:
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
The S3 connectors from Amazon (s3://) and the ASF (s3a://) don't work with Glacier. Certainly nobody tests s3a against Glacier, and if there were problems, you'd be left to fix them yourself. Just copy the data into S3 or onto local HDFS and then work with it there.

How to read Azure Table Storage data from Apache Spark running on HDInsight

Is there any way of doing that from a Spark application running on Azure HDInsight? We are using Scala.
Azure Blobs are supported (through WASB). I don't understand why Azure Tables aren't.
Thanks in advance.
You can actually read from Table Storage in Spark. Here's a project by a Microsoft engineer that does just that:
https://github.com/mooso/azure-tables-hadoop
You probably won't need all the Hive stuff, just the classes at the root level:
AzureTableConfiguration.java
AzureTableInputFormat.java
AzureTableInputSplit.java
AzureTablePartitioner.java
AzureTableRecordReader.java
BaseAzureTablePartitioner.java
DefaultTablePartitioner.java
PartitionInputSplit.java
WritableEntity.java
You can read with something like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
// AzureTableInputFormat and WritableEntity come from the azure-tables-hadoop project linked above.

sparkContext.newAPIHadoopRDD(getTableConfig(tableName, account, key),
  classOf[AzureTableInputFormat],
  classOf[Text],
  classOf[WritableEntity])

def getTableConfig(tableName: String, account: String, key: String): Configuration = {
  val configuration = new Configuration()
  configuration.set("azure.table.name", tableName)
  configuration.set("azure.table.account.uri", account)
  configuration.set("azure.table.storage.key", key)
  configuration
}
You will have to write a decoding function to transform a WritableEntity into the class you want.
It worked for me!
Currently, Azure Tables are not supported. Only Azure Blobs support the HDFS interface required by Hadoop and Spark.
