Databricks and Informatica Delta Lake connector Spark configuration

I am working with Informatica Data Integrator and trying to set up a connection with a Databricks cluster. So far everything seems to work fine, but one issue is that under the Spark configuration we had to put the SAS key for the ADLS Gen2 storage account.
The reason for this is that when Informatica tries to write to Databricks, it first has to write the data into a folder in ADLS Gen2, and then Databricks essentially takes that file and writes it as a Delta Lake table.
One issue is that the field where we put the Spark config contains the full SAS value (URL plus token and password). That is not really a good thing unless we only make one person an admin.
Has anyone worked with Informatica and Databricks? Is it possible to put the Spark config in a file and have the Informatica connector read that file? Or is it possible to add the SAS key to the Spark cluster (the interactive cluster we use) and have that cluster read the info from that file?
Thank you for any help with this.

You really don't need to put the SAS key value into the Spark configuration. Instead, store that value in an Azure Key Vault-backed secret scope (on Azure) or a Databricks-backed secret scope (on other clouds), and then reference it from the Spark configuration using the syntax {{secrets/<secret-scope-name>/<secret-key>}} (see doc). In this case, the SAS key value is read at cluster start and won't be visible to users who have access to the cluster UI.
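For example, the cluster's Spark config field could look roughly like the following - a minimal sketch assuming a secret scope named informatica-scope, a secret key named adls-sas-token, and a storage account named mystorageaccount (all placeholders):
fs.azure.account.auth.type.mystorageaccount.dfs.core.windows.net SAS
fs.azure.sas.token.provider.type.mystorageaccount.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider
fs.azure.sas.fixed.token.mystorageaccount.dfs.core.windows.net {{secrets/informatica-scope/adls-sas-token}}
With this in place, Informatica only needs to reference the cluster; the actual SAS token never appears in the connection definition.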

Related

How to access Delta Lake tables without a Databricks cluster running

I have created Delta Lake tables on a Databricks cluster, and I am able to access these tables from an external system/application. However, I need to keep the cluster up and running all the time to be able to access the table data.
Question:
Is it possible to access the Delta Lake tables when the cluster is down?
If yes, then how can I set that up?
I tried to look this up in the docs and found that the Databricks Premium tier has Table Access Control (disabled otherwise). It says:
Enabling Table Access Control will allow users to control who can
select, create, and modify databases, tables, views, and functions
that they create.
I also found this doc.
I don't think this is the option for my requirement. Please suggest.
The solution I found is to store all Delta Lake tables on ADLS Gen2 storage. That storage is accessible to external resources irrespective of Databricks clusters.
While reading a file or writing into a table we have our cluster up and running; the rest of the time it can be shut down.
From the docs: in Databricks we can create Delta tables of two types, managed and unmanaged. Managed tables are those for which data is stored in DBFS (Databricks File System), while unmanaged tables are those for which an external ADLS Gen2 location can be specified:
dataframe.write.mode("overwrite").option("path","abfss://[ContainerName]@[StorageAccount].dfs.core.windows.net").saveAsTable("table")
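A minimal sketch of reading such an unmanaged table back by its storage path, from any Spark environment that has credentials for the ADLS Gen2 account (the container, account, and table names are placeholders):
spark.read.format("delta").load("abfss://[ContainerName]@[StorageAccount].dfs.core.windows.net").show()
Because the data and the Delta transaction log live in the storage account rather than inside the cluster, the original Databricks cluster does not have to be running for this read.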

Azure Databricks Cluster Questions

I am new to Azure and am trying to understand the things below. It would be helpful if anyone could share their knowledge on this.
Can a table created in cluster A be accessed in cluster B if cluster A is down?
What is the connection between the cluster and the data in the tables?
You need a running process (cluster) to be able to access the metastore and read data, because the data is stored in the customer's location and is not directly accessible from the control plane that runs the UI.
When you write data into a table, that data will be available in another cluster under the following conditions (see the sketch below):
both clusters use the same metastore
the user has the correct permissions (which can be enforced via Table ACLs)
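As an illustration, a minimal sketch assuming a shared metastore table named default.events and a hypothetical group analysts; the GRANT statement is only relevant when Table ACLs are enabled:
# On cluster A: grant read access via Table ACLs (if enabled)
spark.sql("GRANT SELECT ON TABLE default.events TO `analysts`")
# On cluster B (attached to the same metastore): read the table by name
spark.table("default.events").show()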

Is it possible to take a snapshot of an existing HDInsight cluster in Azure

Currently we have an HDInsight cluster which we might have to shut down or delete for a few days. We need the cluster in the same state as we left it. What are the ways we can preserve the current snapshot of this cluster and restore it after a few days?
It depends on how you created the HDInsight cluster. When you created the cluster, did you specify external metastores, so that your Hive metastore is running on your own Azure SQL database and not the one that HDInsight created?
Check this documentation.
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters#use-hiveoozie-metastore
If you didn't use external metastores when you created the cluster, unfortunately, you will lose that state. Your data, however, will be persisted in Azure Blob storage or Azure Data Lake Store.

HDInsight - Azure blob storage

I have some basic questions about Azure HDInsight.
The following article gives some basic input on using HDInsight:
https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-emulator-get-started/.
It says that HDInsight internally uses Azure Blob storage.
With this in mind, my question is as follows:
I have an HDInsight cluster hd1 which uses storage account stg1.
If I just want to upload and download files to stg1 using Azure Storage Explorer, then what's the use of having hd1? I can do that without even creating an HDInsight cluster, which is expensive.
So, is Hadoop/HDInsight only used for processing data stored in stg1 to produce results like a word count? Is that the only reason we use HDInsight?
If you want to understand HDInsight and Blob storage better, you need to read https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/.
HDInsight is Microsoft's implementation of Hadoop. So far there are 4 different base types: Hadoop, HBase, Storm, and Spark. You can always install additional components on top of the base types.
Your question is really about why to use Hadoop at all. Hadoop shines when you need to process a lot of data - big data.
One of the differences between HDInsight and other Hadoop implementations is the separation of storage (Blob storage) from compute (HDInsight clusters). You would still need to copy the data (or store the data directly in Azure Blob storage). When you are ready to process, you create an HDInsight cluster, submit a job, and then delete the cluster. You delete the cluster so you don't need to pay for it anymore. Even after the cluster is deleted, your data stored in Blob storage is retained.
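To make the pattern concrete, here is a minimal PySpark sketch of the kind of job such a transient cluster might run, assuming a Spark-based cluster type and a hypothetical container logs in storage account stg1:
from operator import add

# Read log files that already live in Blob storage (wasbs://<container>@<account>...)
lines = spark.read.text("wasbs://logs@stg1.blob.core.windows.net/input/*.log")

# Classic word count: split, count, and write the result back to Blob storage
counts = (lines.rdd.flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(add))
counts.saveAsTextFile("wasbs://logs@stg1.blob.core.windows.net/output/wordcount")
The output stays in stg1 after the cluster is deleted, which is exactly why the storage/compute separation keeps costs down.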
HDInsight is a family of products, including Hadoop, Spark, HBase, and Storm. They all do different things, and storage is only one aspect.

Can we use HDInsight Service for ATS?

We have a logging system called Xtrace. We use this system to dump logs, exceptions, traces, etc. into a SQL Azure database. The Ops team then uses this data for debugging and SCOM purposes. Considering the 150 GB limitation that SQL Azure has, we are thinking of using the HDInsight (Big Data) Service.
If we dump the data in Azure Table Storage, will HDInsight Service work against ATS?
Or will it work only against Blob storage, which means the log records need to be created as files in Blob storage?
Last question: considering the scenario I explained above, is HDInsight Service a good candidate?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure"), and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150GB limit you mentioned in your question.
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (but I think writing directly to text files will be more efficient, skipping the need for doing a bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store initially to Table Storage and then extract the specific data you need whenever you launch your HDInsight jobs.
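For illustration, a minimal sketch of such a converter using the current Azure Python SDKs; the table name XtraceLogs, the column names, and the container/blob names are hypothetical placeholders:
from azure.data.tables import TableServiceClient
from azure.storage.blob import BlobServiceClient

conn_str = "<storage-connection-string>"  # placeholder

# Read log entities out of Azure Table Storage
table = TableServiceClient.from_connection_string(conn_str).get_table_client("XtraceLogs")
rows = ("\t".join(str(e.get(k, "")) for k in ("PartitionKey", "RowKey", "Message"))
        for e in table.list_entities())

# Write them as a tab-separated text blob that an HDInsight job can consume
blob = BlobServiceClient.from_connection_string(conn_str).get_blob_client(
    container="logs", blob="xtrace/export.tsv")
blob.upload_blob("\n".join(rows), overwrite=True)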
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
The answer above is slightly misleading in regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead, a standard Hadoop InputFormat or Hive StorageHandler can be written that reads directly from ATS. There are at least 2 implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball
