batch processing in azure - azure

We are planning to do batch processing on a daily basis. We generate 1 GB of CSV files every day and will manually put them into Azure Data Lake Store. I have read the Microsoft Azure documents regarding the batch processing and I have decided to use Spark as to batch processing. My question is that after we transfer the data using RDD/DF what would be the next step? how we can visualize the data? since this process is supposed to be run every day, once the data transformation done using Spark, do we need to push the data to any kind of data store like hive hdfs or cosmos before we could visualize it?

There are several options doing this on Azure. It really depends on your requirements (e.g. number of users, needed visualizations, etc). Examples for doing it:
Running Spark on Azure Databricks, you could use the Notebook capabilities to visualize your data
Use HDInsight with Jupyter or Zeppelin Notebooks
Define Spark tables on Azure Databricks and visualize them with Power BI
Load the data with Azure Data Factory V2 to Azure SQL DB or Azure SQL Data Warehouse and visualize it with Power BI.
For Time-Series-Data you could push the data via Spark to Azure EventHubs (see Example notebook with Eventhubs Sink in the following documentation) and consume it via Azure Time Series Insights. If you have an EventData-Stream this could also replace your batch oriented architecture in the future. Parquet files will be used by Azure Time Series Insights as Long-term Storage (see the following link). For Spark also have a look at Time Series Package which adds some time series capabilities to spark.

Related

Azure Architecture Implementation Ideas

We've designed a Data Architecture for our client on Azure wherein, We ingest the sources into a Raw Layer consisting of a Azure SQL Database. This Azure SQL Database acts as a Source Mirror and Has Near Real time sync enabled.
We also have an ODS Layer which is populated from the Previously mentioned Azure SQL Database (Source Mirror) as per the given Data Model. This Layer should ideally take anywhere between 30mins and 1 Hour to Load.
May I Know How Do I Handle the Concurrent Writes and Reads from the Raw Layer (Azure SQL Database, Source Mirror) ? It Syncs every 5 mins with the Sources but also read every 30mins - 1 Hour for ODS Layer.
I've to Use Azure Data Factory to Implement my Data Loads
Yes, Azure Data Factory platform is best fit for such scenarios. Its a cloud-based ETL and data integration tool that lets you build data-driven processes for managing data transportation and data transformation at scale. You can use Azure Data Factory to design and plan data-driven processes (also known as pipelines) that may consume data from a variety of sources. With data flows or computing services like Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database, you can design sophisticated ETL processes that convert data graphically.
When using control flow, you can use a GetMetadata activity to get a
list of files in a storage account, then pass that list to a for each
activity with the Sequential flag set to false to process all files
concurrently (in parallel) up to the maximum batch size according to
the activities defined in the for each loop.
Here, is the Microsoft Official Document for the Azure Datafactory Connectors Overview | Docs

Does Azure Databricks and Delta Layer make it a Lakehouse?

Even after going through many resources, I have failed to understand what constitutes a lakehouse, hence my question below.
If we have Azure Gen 2 Storage, ADF, and Azure Databricks with the possibility of converting the incoming CSV files into Delta tables can that be called a "Lakehouse" architecture or is it called a "Delta Lake"?
Or is it the "SQL analytics" engine over and above the Delta Lake layer that makes it a "Lakehouse"?
Please clarify.
At a high level a Lakehouse must contain the following properties:
Open direct access data formats (Apache Parquet, Delta Lake etc.)
First class support for machine learning and data science workloads
state of the art performance
Databricks is the first Lakehouse because it meets the above three properties. Specifically, if you are using Databricks with ADLS and converting all your data (json, csv, parquet, messages etc.) into Delta tables that are available within Databricks. Then that is the making of a Lakehouse, but it still needs to be built and supported. The Databricks platform allows us to satisfy points 2 and 3 above and Delta Lake satisfies 1 ad 3 (performance relies on the engine and the storage which is why 3 is mentioned twice).
Leveraging Databricks and accessing data stored in Delta is a Lakehouse. By adding Databricks SQL (formally SQL Analytics) we allow more users to access and use the Lakehouse. In Databricks SQL users are using the same compute and data as the data engineer does in Databricks, they just have a different UI that they are familiar with. Additionally, Databricks SQL is optimized for SQL and BI workloads while the notebook environment is better for engineering and data science
As a fun read you should check our the Lakehouse whitepaper.

How good is Azure Data Lake for storing an SQL database used for Power BI visualizations?

We have an Azure SQL database where we collect a large amount of sensor data and we regularly extract the data from it and transform it a bit with a python script. The end result is a pandas DataFrame file. We would like to store the transformed data in an Azure database and use it as a source of a power BI dashboard.
On the one hand, we want to show the "almost" real-time data on a dashboard (the latency due to the transformation etc. is acceptable, but the dashboard needs to refresh very frequently, let's say once a minute), but we also want to store the transformed data and query it later e.g. to visualize the data only for a given day.
Is it possible to convert the pandas DataFrame into SQL and store it on Data Lake and stream the data from there? I read that it is possible to store structured data on Data Lake and even query it, but I am unsure if this would be the best solution.
(My current task is to choose the best database for storing the transformed data to enable both streaming and querying it later. I am very new in Azure products and I don't have a sandbox account yet to even try around and identify possible pitfalls. I've just figured out that PowerBI does not support DirectQuery for DataLake and I feel like this can be an issue - meaning we would have to query the data on DataLake at first and store it somewhere if we wanted to visualize a subset, is that correct?)
Azure Datalake is not a database, just a store for the data both structured and unstructured, so as mentioned you can't direct query it unless you have some compute capacity (Databricks, Azure Synapse, Azure DataLake Analytics, Power BI Premium with enhanced compute)
Depending on your approach, it may be best to move from Azure SQL Database and Pandas, to Azure Databricks, that can ingest the streaming data, transform, and provide an outputted table that is stored in the data lake. You will then connect Power BI to the Databricks instance and query that. The data will only be available while the cluster is running.
Moving to Databricks, will involve rewriting your Panda code to Koalas, or preferably Pyspark.
You do have the option of using Databricks to write the items back to a Azure SQL Database table. Depending on what transformations you are doing you could keep it all in Azure SQL, or if it is sensor data streaming, take the data through Azure Event Hubs, to Azure Streaming Analytics (does transformations), to Azure SQL Database (store Realtime and historical).

Do you have to use Azure Data Factory or can you just Databricks as your ETL tool from your multiple sources?

...Or do i need to add the data into a data lake using data factory first and then use databricks as an ELT?
Depends.
Databricks can connect to datasources and ingest data. However Azure Data Factory(ADF) have more connectors than databricks. So it depends on what you need. If using ADF, you need to land the data somewhere (i.e. Azure storage) so that databricks can pick it up.
Moreover, another main feature of ADF is to orchestrate data movement or activity. Databricks do have Job feature to schedule notebooks or JAR, however it is limited within databricks. If you want to schedule anything outside of databricks (e.g. drop file to SFTP or email on completion or terminate databricks cluster etc...) then ADF is the way to go.
Indeed it depends to the scenario I think. If you have a wide variety of datascources you need to connect to then adf is probably the better option.
If your sources are datafiles (in any format) you could consider using databricks for etl.
I use databricks as a pure etl tool (without adf) by mounting a notebook to a storage container in a blobstorage, take huge xml data from there and write the data to a dataframe in databricks. Then I parse the shape of the dataframe and then writing the data into an azure sql database. Fair to say I’m not really using it for the “e” in etl, as the data has already been extracted from the real source system.
Big advantage is the power you have at your disposal to parse the files.
Best regards.

Batch processing with spark and azure

I am working for an energy provider company. Currently, we are generating 1 GB data in form of flat files per day. We have decided to use azure data lake store to store our data, in which we want to do batch processing on a daily basis. My question is that what is the best way to transfer the flat files into azure data lake store? and after the data is pushed into azure I am wondering whether it is good idea to process the data with HDInsight spark? like Dataframe API or SparkSQL and finally visualize it with azure?
For a daily load from a local file system I would recommend using Azure Data Factory Version 2. You have to install Integration Runtimes on Premise (more than one for High Avalibility). You have to consider several security topics (local firewalls, network connectivity etc.) A detailed documentation can be found here. There are also some good Tutorials available. With Azure Data Factory you can trigger your upload to Azure with a Get-Metadata-Activity and use e. g. an Azure Databricks Notebook Activity for further Spark processing.

Resources