SPARK : How to access AzureFileSystemInstrumentation when using azure blob storage with spark cluster? - azure

I am working on a spark project where the storage sink is Azure Blob Storage. I write data in parquet format. I need some metrics around storage, eg. numberOfFilesCreated, writtenBytes etc. On searching for it online I came across a particular metrics that the hadoop-azure package has called the AzureFileSystemInstrumentation. I am not sure about how to access the same from spark and can't find any resources for the same. How would one access this instrumentation for the given spark job?

Based on my experience, I think there are three solution can be used in your current scenario, as below.
Directly use Hadoop API for HDFS to get HDFS Metrics Data in Spark, because hadoop-azure just implements the HDFS APIs for using Azure Blob Storage, so please see the Hadoop offical document for Metrics to know what particular metrics you want to use, such as CreateFileOps or FilesCreated as the figure below to get numberOfFilesCreated. Meanwhile, there is a similar SO thread How do I get HDFS bytes read and write for Spark applications? which you can refer to.
Directly use Azure Storage SDK for Java or other languages you used to write a program to do the statistics for files stored in Azure Blob Storage as blobs ordered by creation timestamp or others, please refer to the offical document Quickstart: Azure Blob storage client library v8 for Java to know how to use its SDK.
Use Azure Function with Blob Trigger to monitor the events of files created in Azure Blob Storage, then you can write the code for statistics on every blob created event, please refer to the offical document Create a function triggered by Azure Blob storage to know how to use Blob Trigger. Even, you can send these metrics what you want to Azure Table Storage or Azure SQL Database or other services for statistics later in the Azure Blob Trigger Function.

Related

Azure Synapse spark read from default storage

We are working on an Azure Synapse Analytics project with CI/CD pipeline. I want to read data with serverless spark-pool from storage account, but not specify the storage account name. Is this possible? We are using the default storage account but a separate container for datalake data.
I can read data with spark.read.parquet('abfss://{container_name}#{account_name}.dfs.core.windows.net/filepath.parquet) but since the name of the storage account is different between dev test and prod this will need to be parameterized and I would like to avoid it if possible. Is there any native spark way to do this? I found some documentation about doing this with pandas and FSSPEC but not with only spark.

Is there a simple way to ETL from Azure Blob Storage to Snowflake EDW?

I have the following ETL requirements for Snowflake on Azure and would like to implement the simplest possible solution because of timeline and technology constraints.
Requirements :
Load CSV data (only a few MBs) from Azure Blob Storage into Snowflake Warehouse daily into a staging table.
Transform the loaded data above within Snowflake itself where transformation is limited to just a few joins and aggregations to obtain a few measures. And finally, park this data into our final tables in a Datamart within the same Snowflake DB.
Lastly, automate the above pipeline using a schedule OR using an event based trigger (i.e. steps to kick in as soon as file lands in Blob Store).
Constraints :
We cannot use use Azure Data Factory to achieve this simplest design.
We cannot use Azure Functions to deploy Python Transformation scripts and schedule them either.
And, I found that Transformation using Snowflake SQL is a limited feature where it only allows certain things as part of COPY INTO command but does not support JOINS and GROUP BY. Furthermore, although the following THREAD suggests that scheduling SQL is possible, but that doesn't address my Transformation requirement.
Regards,
Roy
Attaching the following Idea diagram for more clarity.
https://community.snowflake.com/s/question/0D50Z00009Z3O7hSAF/how-to-schedule-jobs-from-azure-cloud-for-loading-data-from-blobscheduling-snowflake-scripts-since-dont-have-cost-for-etl-tool-purchase-for-scheduling
https://docs.snowflake.com/en/user-guide/data-load-transform.html#:~:text=Snowflake%20supports%20transforming%20data%20while,columns%20during%20a%20data%20load.
You can create snowpipe on Azure blob storage, Once snowpipe created on top of your azure blob storage, It will monitor bucket and file will be loaded into your stage table as soon as new file comes in. After copied the data into stage table you can schedule transformation SQL using snowflake task.
You can refer snowpipe creation step for azure blob storage in below link:
Snowpipe on microsoft Azure blob storage

Azure Data lake VS Azure HDInsight

I was going through the Microsoft documents:
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
I'm new to Azure Data lake and HDInsight. There is a statement in the URL which tells that
"Azure Data Lake Store can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs."
As per my initial understanding, Data lake store is a store in which any kind of data can be stored. I think, HDInsight also kind of does the same thing.
My question is what is the difference between Azure Data lake and Azure HDInsight? If HDInsight can be used for file storage or any kind of storage then Why to use Data Lake?It would be great if some one could clarify this in details. Thanks.
The easiest way to think of Data Lake is to think of this large container that has like a real lake with rivers coming into the river you never know where the rivers are coming from (or what "type" of river). Azure Data Lake was introduced to make big data easy for developers, data scientists, and analysts to store data of any size. It removes the complexities of ingesting and storing all your data while making it faster to get up and running with big data. Data Lake is able to stored the mass different types of data (Structured data, unstructured data, log files, real-time, images, etc. ) and to blend that together, to correlate many different data types. The key thing here is as we are moving from traditional way to the modern tools (like Hadoop, Cassandra, NoSQL DB, etc). Azure Data Lake includes three services:
Azure Data Lake Store, a no limits data lake that powers big data
analytics
Azure Data Lake Analytics, a massively parallel on-demand
job service
Azure HDInsight, a full managed Cloud Hadoop and Spark
offering
Azure Data Lake Store is like a cloud-based file service or file system that is pretty much unlimited in size. We can run services on top of the data that's in that store. So you could use Hadoop or Spark in an HDInsight cluster, or you could use the Azure Data Lake analytic service, which is a complement to the Azure Data Lake Store. And what that service will let you do is to run jobs that effectively query the data you have stored in the Azure Data Lake store and generate output results.
In nutshell,
Hdinsight is a managed hadoop service (to provide compute support)
Azure Data lake(ADL) is a managed storage service (to provide large amount of storage support)
(Instead of ADL, you can alternatively choose to use Blobs in HDinsight, but Blobs have some limitations (like file streaming to storage via hdinsight cluster is not supported)
Here is the definition from Azure documentation (below):
Azure uses "decomposed hardware method"
You can relate or assume HDinsight as a Hadoop Cluster, Azure Data lake (ADL) as HDFS. But they are detached.
If you want to relate with AWS, HDInsight is equivalent to EMR and ADL is equivalent to EMRFS or S3
If you terminate the cluster, ADL storage stays with the files stored in it. You can access the storage directly using another service or tool (like Azure Data bricks) or you can create one another hdinsight cluster on top of the data.
Hdinsight access the ADL using adl:// , and hdinsight never
store the file blocks in the nodes (like Hadoop does), rather it has
mappings to storage service.
Azure Data Lake Store, is just that a data store. HDInsight can also do that in the cluster that you spin up. However, when you stop that cluster, the data also goes away.
It is common that customers use either Azure Data Lake Store, or Azure storage to provide permanent storage separate from the cluster (compute) used to process the data.
Guy
HDInsight is the analytics service whereas the Azure Data Lake Storage is the storage service. You most likely need both to have functional analytics cluster.
HDInsight provides the cluster, fully manages the open-source packages for analytics (Hadoop, Spark ...etc), and you set up your cluster to use Azure Data Lake Storage which support HDFS API ( Hadoop FileSystem ) on top of Cloud Storage.
Azure Data Lake Storage Gen2 is what you are supposed to start looking at which merges the benefits of both Azure Storage and ADLS in one service.
ADLS Gen 2 documentation - https://learn.microsoft.com/en-us/azure/storage/data-lake-storage/introduction
Azure Data Lake Analytics provides server less compute while using Azure Data Lake Store for data storage, whereas in HDInsight,we need to specify and design for Compute Virtual Machine nodes as per processing requirements. It may be advantageous for developers to work with server less compute in Azure Data Lake Analytics, as scaling needs of Analytics Job are taken care out of box.

Is there a way to continuously pipe data from Azure Blob into BigQuery?

I have a bunch of files in Azure Blob storage and it's constantly getting new ones. I was wondering if there is a way for me to first take all the data I have in Blob and move it over to BigQuery and then keep a script or some job running so that all new data in there gets sent over to BigQuery?
BigQuery offers support for querying data directly from these external data sources: Google Cloud Bigtable, Google Cloud Storage, Google Drive. Not include Azure Blob storage. As Adam Lydick mentioned, as a workaround, you could copy data/files from Azure Blob storage to Google Cloud Storage (or other BigQuery-support external data sources).
To copy data from Azure Blob storage to Google Cloud Storage, you can run WebJobs (or Azure Functions), and BlobTriggerred WebJob can trigger a function when a blob is created or updated, in WebJob function you can access the blob content and write/upload it to Google Cloud Storage.
Note: we can install this library: Google.Cloud.Storage to make common operations in client code. And this blog explained how to use Google.Cloud.Storage sdk in Azure Functions.
I'm not aware of anything out-of-the-box (on Google's infrastructure) that can accomplish this.
I'd probably set up a tiny VM to:
Scan your Azure blob storage looking for new content.
Copy new content into GCS (or local disk).
Kick off a LOAD job periodically to add the new data to BigQuery.
If you used GCS instead of Azure Blob Storage, you could eliminate the VM and just have a Cloud Function that is triggered on new items being added to your GCS bucket (assuming your blob is in a form that BigQuery knows how to read). I presume this is part of an existing solution that you'd prefer not to modify though.

Can we use HDInsight Service for ATS?

We have a logging system called as Xtrace. We use this system to dump logs, exceptions, traces etc. in SQL Azure database. Ops team then uses this data for debugging, SCOM purpose. Considering the 150 GB limitation that SQL Azure has we are thinking of using HDInsight (Big Data) Service.
If we dump the data in Azure Table Storage, will HDInsight Service work against ATS?
Or it will work only against the blob storage, which means the log records need to be created as files on blob storage?
Last question. Considering the scenario I explained above, is it a good candidate to use HDInsight Service?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure"), and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150GB limit you mentioned in your question.
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (but I think writing directly to text files will be more efficient, skipping the need for doing a bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store initially to Table Storage, then extracting the specific data you need whenever launching your HDInsight jobs.
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
The answer above is sligthly misleading in regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead a standard Hadoop InputFormat or Hive StorageHandler can be written, that reads directly from ATS. There are at least 2 implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball

Resources