Customizing nodes of an Azure Synapse Workspace Spark Cluster - azure

When creating a Spark cluster within an Azure Synapse workspace, is there a means to install arbitrary files and directories onto it's cluster nodes and/or onto the node's underlying distributed filesystem?
By arbitrary files and directories, I literally mean arbitrary files and directories; not just extra Python libraries like demonstrated here.
Databricks smartly provided a means to do this on it's cluster nodes (described in this document). Now I'm trying to see if there's a means to do the same on an Azure Synapse Workspace Spark Cluster.
Thank you.

Unfortunately, Azure Synapse Analytics don't support arbitrary binary installs or writing to Spark local storage.
I would suggest you to provide feedback on the same:
https://feedback.azure.com/forums/307516-azure-synapse-analytics
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.

Related

How to get total number of clusters, jobs, libraries installed etc. in existing azure databricks workspace?

Is there a programmatic way to get total number of clusters, jobs, spark & scala runtime versions, libraries installed , other artifacts etc. in existing azure databricks workspace?
For retrieving number of cluster, jobs, runtime versions, libraries etc. you can use databricks API.
For example cluster API for information about cluster.
Access to this API can be authenticated by databricks personal access tokens.

Local instance of Databricks for development

I am currently working on a small team that is developing a Databricks based solution. For now we are small enough to work off of cloud instances of Databricks. As the group grows this will not really be practical.
Is there a "local" install of Databricks that can be installed for development purposes (it doesn't need to be a scalable version but does need to be essentially fully featured)? In other words, is there a way each developer can create their own development instance of Databricks on their local machine?
Is there another way to provide a dedicated Databricks environment for each developer?
Databricks, as a cloud-deployed platform, leverages many cloud technologies in its deployment. For example, Auto Loader incrementally ingests new data files as they arrive in AWS using EventBridge, SNS and S3, while Azure uses EventHubs, Notification Hubs and ADLS technologies. They aim to create a seamless look and feel across AWS, Azure and GCP but can do this only in the cloud.
For local deployment, you may be able to use Apache Spark and MlFlow and create a similar experience, but the notebook experience isn't open source. The workflow of Databricks is proprietary, though Databricks has open-sourced many of its technologies, like Delta Lake. The local Spark, MlFlow, may suffice for some and then use the cloud sparingly, but the seamless workflow offered by Databricks is challenging to replicate outside of the leading cloud vendors.

How does HDInsight cluster maps to Azure Storage as HDFS?

I have a fair idea of how Hadoop works as I have studied the on-premise model since that's how everyone learns. In that sense the top level idea is fairly straightforward.We have a set of machines (nodes) and we run certain processes on each one of them and then configure those processes in such a way that the entire thing starts behaving as a single logical entity that we call a Hadoop (YARN) cluster. Here HDFS is a logical layer on top of individual storage of all the machines in the cluster. But when we start of thinking of the same cluster in cloud , this becomes little confusing. Taking the case of HDInsight Hadoop cluster , lets say I already have an Azure Storage account with lots of text data and I want to do some analysis so I go ahead and spin a Hadoop cluster in the same region as the storage account. Now the whole idea behind Hadoop is that of processing closest to where data exists. In this case when we create the Hadoop cluster , a bunch of Azure Virtual Machines start behind the scenes with their own underlying storage (though in the same region). But then, while creating the cluster we do specify a default storage account and a few other storage accounts to be attached where data that is to be processed lies. So ideally the data that is to be processed needs to exist on the disks for the virtual machines. How does this thing work in Azure? I guess the virtual machines create disks that are actually pointers to azure storage accounts (default + attached) ? This part is what is not really explained well and is really cloudy. So lot of people including myself are always in dark when they learn the classic on-premise Hadoop model academically and start using cloud based clusters in the real world. If we could see more information about these virtual machines right from the cluster Overview page from the Azure portal , it would help the understanding. I know it's visible from Ambari but again Ambari is blind to Azure, it's an independent component so that is not very helpful.
There is an underlying driver which works as a bridge in mapping the Azure Storage as HDFS to other services running in HDInsight.
You can read more about this driver's functionality in the below official page.
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
If your Azure Storage Account is of type ADLS Gen 2 (Azure Data Lake Storage Gen2) then the driver used is different and can be found under the following official page. This offers some advance capabilities of ADLS Gen2 to beef up your HDInsight performance.
https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
Finally, as same as your on-prem Hadoop installation, HDInsight too has a local HDFS that is deployed across your HDInsight cluster VM Hard drives also. You can access this local HDFS using URI as below.
hdfs://mycluster/
For example you can issue the following to view your local HDFS root level content.
hdfs dfs -ls hdfs://mycluster/

Configuration of Azure for Storage and Hadoop Clustered Processing

Hope someone can offer any advice. At the moment I have been asked to scope out a possible infrastructure for a new Azure Platform. We are also going to be using HDFS / Hadoop for our ETL and Storage.
Can anyone offer any advice on the following :
Set up a Storage Optimised Server (eg, L4, 4 Core, 32gb Ram, 678 GB Storage) to hold our raw data, reference tables and final cleansed data within HDFS. This server could be running 24/7 to feed our analytics platforms.
Then, to utilise the power of Hadoop, could we spin up a set of Processing servers (eg, once a week) to read from the Storage Server, process and write back to the storage server and then shutdown until the next load & process task.
Would really appreciate anyone's thoughts advice on this or any possible configurations we could think of?
Many thanks
Fiorano
Whether your Hadoop cluster is on-premises or in the cloud, it contains two main resources: compute resources to process jobs, and storage resources to hold data. In an on-premises cluster, the storage and compute resources are combined into the same hardware tying them together. With HDInsight the storage is wholly separated from the compute resource. This is a very important distinction of HDInsight. It means that I can completely turn off the compute portion of the cluster and the data will remain accessible.
Note: To analyze data in HDInsight cluster, you can store the data either in Azure Storage, Azure Data Lake Store, or both.
For more details, refer "Azure HDInsight Documentation".

HDInsight - Azure blob storage

I have some basic clarifications about azure hdInsight.
The following article gives some basic input on using hdinsight.
https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-emulator-get-started/.
It says that HDinsight internally uses azure blob storage .
Having this in mind, my question is as follows:
I have a hdinsight hd1 which uses storage account stg1.
If I want to just uploading and download files using azure storage explorer to stg1 , then whats the use of having hd1 , I can do it without even creating hdinsight which costs heavily.
So, is hadoop hdinsight only used for processing some data stored in stg1 to produce some results like wordcount?Is that the only reason why we use HDInsight?
If you want to understand the HDInsight and blob storage better, you need to read https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/.
HDInsight is Microsoft's implementation of Hadoop. So far there 4 different base types which include Hadoop, HBase, Storm, Spark. You can always install additional components to the base types.
Your question is really about why using Hadoop. Hadoop shines when you need to process a lot of data - big data.
One of the differences between HDInsight and other Hadoop implementations is the separation of storage (blob storage) from compute (HDInsight clusters). You would still need to copy the data (or store the data directly in Azure blob storage). When you are ready to process, you create an HDInsight cluster, submit a job, and then delete the cluster. You delete the cluster so you don't need to pay for the cluster anymore. Even after the cluster is deleted, your date stored in the Blob storage retains.
HDInsight is a family of products, including Hadoop, Spark, HBase, and Storm. They all do different things, and storage is but only one aspect.

Resources