Azure Databricks - Unable to read simple blob storage file from notebook - apache-spark

I've set up a cluster with Databricks runtime version 5.1 (includes Apache Spark 2.4.0, Scala 2.11) and Python 3. I also installed the hadoop-azure library (hadoop-azure-3.2.0) on the cluster.
I'm trying to read a blob stored in my blob storage account, which is just a text file containing some numeric data delimited by spaces. I used the template generated by Databricks for reading blob data:
spark.conf.set(
  "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
  storage_account_access_key)
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
where file_location is my blob file (https://xxxxxxxxxx.blob.core.windows.net).
I get the following error:
No filesystem named https
I tried using sc.textFile(file_location) to read it into an RDD and got the same error.

Your file_location should be in the format:
"wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>"
See: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html
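For example, a minimal sketch of the corrected read, reusing the template from the question (container, account, directory and file names are placeholders, and reading the space-delimited text through the csv reader with a space separator is an assumption):
spark.conf.set(
  "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
  storage_account_access_key)
file_location = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>/<your-file>.txt"
df = spark.read.format("csv").option("sep", " ").option("inferSchema", "true").load(file_location)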

You need to mount the blob container as an external location to access it from Azure Databricks.
Reference: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs
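For example, a minimal mount sketch based on those docs (container, storage account, mount name and access key are placeholders):
dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"fs.azure.account.key.<storage-account>.blob.core.windows.net": "<access-key>"})
df = spark.read.text("/mnt/<mount-name>/<path-to-file>")
df.show()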

These three lines of code worked for me:
spark.conf.set("fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net","BIG_KEY")
df = spark.read.csv("wasbs://CONTAINER@STORAGE_ACCOUNT.blob.core.windows.net/")
df.select('*').show()
NOTE that line 2 ends with .net/ because I do not have a sub-folder.

Related

Connect to Databricks managed Hive from outside

I have:
An existing Databricks cluster
Azure blob store (wasb) mounted to HDFS
A Database with its LOCATION set to a path on wasb (via mount path)
A Delta table (which ultimately writes Delta-formatted parquet files to the blob store path)
A Kubernetes cluster that reads and writes data in parquet and/or Delta format within the same Azure blob store that Databricks uses (writing in Delta format via spark-submit pyspark)
What I want to do:
Utilize the managed Hive metastore in Databricks as the data catalog for all data within the Azure blob store
To this end, I'd like to connect to the metastore from my outside pyspark job such that I can use consistent code to have a catalog that accurately represents my data.
In other words, if I were to prep my db from within Databricks:
dbutils.fs.mount(
  source = "wasbs://container@storage.blob.core.windows.net",
  mount_point = "/mnt/db",
  extra_configs = {..})
spark.sql('CREATE DATABASE db LOCATION "/mnt/db"')
Then from my Kubernetes pyspark cluster, I'd like to execute
df.write.mode('overwrite').format("delta").saveAsTable("db.table_name")
Which should write the data to wasbs://container@storage.blob.core.windows.net/db/table_name as well as register this table with Hive (and thus be able to query it with HiveQL).
How do I connect to the Databricks managed Hive from a pyspark session outside of the Databricks environment?
This doesn't answer my question (I don't think it's possible), but it mostly solves my problem: Writing a crawler to create tables from delta files.
Mount Blob container and create a DB as in question
Write a file in delta format from anywhere:
df.write.mode('overwrite').format("delta").save("/mnt/db/table") # equivalently, save to wasb:..../db/table
Create a Notebook, schedule it as a job to run regularly
import os

def find_delta_dirs(ls_path):
    for dir_path in dbutils.fs.ls(ls_path):
        if dir_path.isFile():
            pass
        elif dir_path.isDir() and ls_path != dir_path.path:
            if dir_path.path.endswith("_delta_log/"):
                yield os.path.dirname(os.path.dirname(dir_path.path))
            yield from find_delta_dirs(dir_path.path)

def fmt_name(full_blob_path, mount_path):
    relative_path = full_blob_path.split(mount_path)[-1].strip("/")
    return relative_path.replace("/", "_")

db_mount_path = "/mnt/db"

for path in find_delta_dirs(db_mount_path):
    spark.sql(f"CREATE TABLE IF NOT EXISTS {db_name}.{fmt_name(path, db_mount_path)} USING DELTA LOCATION '{path}'")

I am trying to connect to abfss directly (without mounting to DBFS) and open a JSON file using open() in Databricks

I am trying to connect to abfss directly (without mounting to DBFS) and open a JSON file using the open() method in Databricks:
json_file = open("abfss://<container>@<storage-account>.dfs.core.windows.net/test.json")
Databricks is unable to open the file present in the Azure blob container and I get the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://<container>@<storage-account>.dfs.core.windows.net/test.json'
I have done all the configuration settings using a service principal. Please suggest another way of opening the file using the direct abfss path.
The open method works only with local files; it doesn't know anything about abfss or other cloud storage. You have the following choices:
Use dbutils.fs.cp to copy the file from ADLS to the local disk of the driver node and then work with it, like: dbutils.fs.cp("abfss:/....", "file:/tmp/my-copy")
Copy the file from ADLS to the driver node using the Azure SDK
The first method is easier to use than the second.
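A minimal sketch of the first approach (container, storage account and file names are placeholders):
dbutils.fs.cp("abfss://<container>@<storage-account>.dfs.core.windows.net/test.json",
              "file:/tmp/test.json")
import json
with open("/tmp/test.json") as f:
    data = json.load(f)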

Can't access files via Local file API on Databricks

I'm trying to access a small text file stored directly on DBFS using the local file API.
I'm getting the following error:
No such file or directory
My code:
import scala.io.Source

val filename = "/dbfs/test/test.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
At the same time I can access this file without any problems using dbutils or load it into an RDD via the Spark context.
I've tried specifying the path starting with dbfs:/ or /dbfs/ or just with the test folder name, both in Scala and Python, getting the same error each time. I'm running the code from the notebook. Is it some problem with the cluster configuration?
Check whether your cluster has credential passthrough enabled. If so, the local file API is not available.
https://docs.azuredatabricks.net/data/databricks-file-system.html#local-file-apis
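As a workaround when the local file API is unavailable, the same file can still be read through dbutils or Spark (shown here in Python, with the path from the question):
print(dbutils.fs.head("dbfs:/test/test.txt"))
df = spark.read.text("dbfs:/test/test.txt")
df.show(truncate=False)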

Read CSV file from AWS S3

I have an EC2 instance running pyspark and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
This throws a long Python error message and then something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
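Putting this together end to end, here is a hedged sketch assuming the s3a connector (hadoop-aws) is available on the cluster; the credentials are placeholders, and as an alternative to embedding them in the URL they can be set on the Hadoop configuration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Basics').getOrCreate()

# Option 1: credentials embedded in the URL, as shown above
df = spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@bucketname/filename.csv", header=True)

# Option 2: credentials set on the Hadoop configuration instead
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hconf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
df = spark.read.csv("s3a://bucketname/filename.csv", header=True)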

How to read Azure Table Storage data from Apache Spark running on HDInsight

Is there any way of doing that from a Spark application running on Azure HDInsight? We are using Scala.
Azure Blobs are supported (through WASB). I don't understand why Azure Tables aren't.
Thanks in advance
You can actually read from Table Storage in Spark; here's a project done by a Microsoft guy doing just that:
https://github.com/mooso/azure-tables-hadoop
You probably won't need all the Hive stuff, just the classes at root level:
AzureTableConfiguration.java
AzureTableInputFormat.java
AzureTableInputSplit.java
AzureTablePartitioner.java
AzureTableRecordReader.java
BaseAzureTablePartitioner.java
DefaultTablePartitioner.java
PartitionInputSplit.java
WritableEntity.java
You can read with something like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text

sparkContext.newAPIHadoopRDD(getTableConfig(tableName, account, key),
  classOf[AzureTableInputFormat],
  classOf[Text],
  classOf[WritableEntity])

def getTableConfig(tableName: String, account: String, key: String): Configuration = {
  val configuration = new Configuration()
  configuration.set("azure.table.name", tableName)
  configuration.set("azure.table.account.uri", account)
  configuration.set("azure.table.storage.key", key)
  configuration
}
You will have to write a decoding function to transform the WritableEntity into the class you want.
It worked for me!
Currently Azure Tables are not supported. Only Azure blobs support the HDFS interface required by Hadoop & Spark.
