Databricks or Functions with ADF? - azure

I'm using ADF to produce some reports, with PDF as the end goal (at least that's the plan).
ADF writes a CSV to a storage blob; I'd like to ingest that CSV, do some formatting and statistics work (with scipy and matplotlib in Python), and export the result as a PDF to the same container. This would run once a month, and I may do a few other things like this, but they are periodic reports at most, with no streaming or anything like that.
From an architectural standpoint, would this be a good application for an Azure Function (which I have some experience with) or Azure Databricks (in which I'd like to gain some experience)?
My first thought is Azure Functions, since they are serverless and pay-as-you-go, but I don't know much about Databricks except that it's primarily used for big data and long-running jobs.

Databricks would almost certainly be overkill for this. So yes, an Azure Function for Python sounds like a perfect fit for your scenario.
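As a rough sketch of the Functions route, a blob-triggered Python function can cover the whole job. This is only a sketch: the container names, connection string, and the actual stats/plots are placeholders, and the blob binding itself would be declared in function.json.
import io
import azure.functions as func
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; Functions has no display
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from azure.storage.blob import BlobClient

def main(inputblob: func.InputStream):
    # The CSV that ADF dropped into the container
    df = pd.read_csv(io.BytesIO(inputblob.read()))

    # Build the PDF in memory; the real stats and plots would go here
    buf = io.BytesIO()
    with PdfPages(buf) as pdf:
        fig, ax = plt.subplots()
        df.describe().T.plot(kind="bar", ax=ax)   # stand-in for the real charts
        pdf.savefig(fig)
        plt.close(fig)

    # Write the report back to the same storage account (placeholder names)
    BlobClient.from_connection_string(
        "<storage-connection-string>", "reports", "monthly-report.pdf"
    ).upload_blob(buf.getvalue(), overwrite=True)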

Related

Using SQL Stored Procedure vs Databricks in Azure Data Factory

I have a requirement to write up to 500k records daily to Azure SQL DB using an ADF pipeline.
The data transformation involves simple calculations that can be performed in a SQL Stored Procedure activity. I've also observed Databricks notebooks being used commonly, especially for the scalability benefits going forward. But that brings the overhead of placing files in another location after transformation, managing authentication, etc., and I want to avoid any over-engineering unless absolutely required.
I've tested the SQL stored procedure and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the two options, especially from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple of thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Databricks from ADF is if you already have that expertise and the resources to support it already exist.
Since ADF is part of the story, there is a middle ground between your two scenarios: Data Flows. Data Flows are a low-code abstraction over Databricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Databricks configuration. And they are first-class citizens in ADF pipelines.
As an experienced (former) DBA, data engineer, and data architect, I cannot see what Databricks adds in this situation. The piece of the architecture you might need to scale is the target for the INSERTs, i.e. Azure SQL Database, which is ridiculously easy to scale either manually via the portal or via the REST API, if that's even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
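For illustration, scaling the database from code is a short call to the ARM REST API. This is only a sketch: the subscription, resource group, server, database, target SKU, and api-version are placeholders, and it assumes azure-identity for the token.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = ("https://management.azure.com/subscriptions/<sub-id>/resourceGroups/<rg>"
       "/providers/Microsoft.Sql/servers/<server>/databases/<db>"
       "?api-version=2021-11-01")   # check the current api-version

# Bump the SKU before the nightly load (Standard S3 is just an example tier)
resp = requests.patch(url,
                      headers={"Authorization": "Bearer " + token},
                      json={"sku": {"name": "S3", "tier": "Standard"}})
resp.raise_for_status()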
The overhead of adding an additional component to your architecture and then moving your data through it would have to be worth it, plus the additional cost of spinning up Spark clusters while your database is also running.
Databricks is a superb tool with a number of great use cases, e.g. advanced data transforms (things you cannot do with SQL), machine learning, streaming, and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases

How can I decide whether to use the Power BI API to push data into my streaming dataset, or Azure Stream Analytics?

I am very new to Azure. I need to create a Power BI dashboard to visualize some data produced by a sensor, and the dashboard needs to be updated in "almost" real time. I have identified that I need a push dataset, as I want to show some historic data on a line chart. From an architecture point of view, though, I could use either the Power BI REST APIs (which would be completely fine in my case, as we process the data with a Python app and I could call Power BI from there) or Azure Stream Analytics (which could also work: I could dump the data into Azure Blob storage from the Python app and stream it from there).
Can you tell me, generally speaking, what are the advantages/disadvantages of the two approaches?
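For context, the REST-API option boils down to the Python app posting rows to the push dataset. A rough sketch, where the dataset ID, table name, payload fields, and AAD access token are placeholders:
import requests

access_token = "<aad-access-token>"     # obtained via MSAL / an app registration
dataset_id = "<push-dataset-id>"
rows_url = ("https://api.powerbi.com/v1.0/myorg/datasets/"
            + dataset_id + "/tables/SensorReadings/rows")   # table name is a placeholder

payload = {"rows": [{"timestamp": "2023-01-01T12:00:00Z", "value": 21.7}]}
resp = requests.post(rows_url,
                     headers={"Authorization": "Bearer " + access_token},
                     json=payload)
resp.raise_for_status()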
Azure Stream Analytics lets you define multiple sources and multiple targets, and those targets can include both Power BI and Blob storage, while at the same time applying windowing functions to the data as it comes in. It also gives you a visual way of managing your pipeline, including the windowing functions.
In your case you would essentially be replicating the incoming data, first to Blob storage and then to Power BI. But if you have a use case for applying a windowing function (one minute or so) as data arrives from multiple sources, e.g. more than one sensor, or a sensor plus another source, you would have to fiddle around a lot to get that working manually, whereas Stream Analytics makes it easy.
The following article highlights some of the pros and cons of Azure Stream Analytics:
https://www.axonize.com/blog/iot-technology/the-advantages-and-disadvantages-of-using-azure-stream-analytics-for-iot-applications/
If possible, I would recommend streaming the data to IoT Hub first; ASA can then pick it up and render it in Power BI. That will give you better latency than streaming data from Blob storage to ASA and then to Power BI. It is the recommended IoT pattern for remote monitoring, predictive maintenance, etc., and gives you longer-term options to add a lot of logic to the real-time pipeline (ML scoring, windowing, custom code, and so on).
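If you do go the IoT Hub route, getting readings into the hub from the Python app is only a few lines with the azure-iot-device SDK; the device connection string and the payload fields below are placeholders.
import json
from azure.iot.device import IoTHubDeviceClient, Message

client = IoTHubDeviceClient.create_from_connection_string("<device-connection-string>")
reading = {"timestamp": "2023-01-01T12:00:00Z", "value": 21.7}   # placeholder sensor reading
client.send_message(Message(json.dumps(reading)))
client.shutdown()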

Azure Functions: how much code can be done in one?

I am a complete newbie to Azure and Azure Functions, but my team plans to move to Azure soon. Now I'm researching how I could use Azure Functions to do what I would normally do in a .NET console application.
My question is, can Azure Functions handle quite a bit of code processing?
Our team uses several console apps that effectively pick up a pipe-delimited file, do some business logic, update a database with the data, and log everything along the way. From what I've been reading so far, Azure Functions are typically described as being for little pieces of code. How little do they mean? Is it best practice to have a bunch of Azure Functions replace one console app, e.g. one function that reads the file and creates a list of objects, another that loops through those items and applies the business logic, and then another that writes the data to the database? Or can I use one Azure Function to do all of that?
The direct answer is yes: you can run bigger pieces of code as an Azure Function; that is not a problem as long as you stay within their limitations. You can even have dependency injection. For chained scenarios, you can use Durable Functions (a small sketch follows the list below). However, Microsoft does not recommend long-running functions because of unexpected timeouts. See the best practices for Azure Functions.
Because of that, I would consider alternatives:
If all you need is to run a console app in Azure, you can use WebJobs. Here is an example of how to deploy a console app directly to Azure via Visual Studio.
For more complex logic you can use a .NET Core Worker Service, which behaves like a Windows Service and can be deployed to Azure as an App Service.
If you need long-running jobs with scheduled runs only, I have had a really great experience with Hangfire, which can be hosted in Azure as well.
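To make the Durable Functions chaining mentioned above concrete, here is a minimal orchestrator sketch in the Python programming model; the activity names are hypothetical, each would be its own small function, and the same pattern exists in C#.
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Each step is a separate activity function (names are placeholders)
    records = yield context.call_activity("ParseDelimitedFile", "incoming/batch.psv")
    enriched = yield context.call_activity("ApplyBusinessRules", records)
    rows_written = yield context.call_activity("WriteToDatabase", enriched)
    return rows_written

main = df.Orchestrator.create(orchestrator_function)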
This is really hard to answer because we don't know what kind of console app you have over there. I usually try to apply the same SOLID principles I would use for any class to my functions too. And whenever you need to coordinate actions or run things in parallel, you can use the Durable Functions framework.
The only concern is execution time. Your functions can get pretty expensive if you're running on a Consumption plan and don't pay attention to it. I recommend reading the following great article:
https://dev.to/azure/is-serverless-really-as-cheap-as-everyone-claims-4i9n
You can do all of that in one function.
If you need on-the-fly data processing, you can safely use Azure Functions, even if it involves reading files or communicating with a database.
What you do need to be careful about and configure, though, is the timeout. Their scalability is an interesting topic as well.
If you need to host a full application, you need a machine (or part of one) in Azure to do that.

Copy millions of files from root Azure Storage blob container to subfolders

I've got multiple Azure Storage blob containers, each with over 1M JSON files, including in the root. That's impossible to work with (no shocker), so I'm trying to use Data Factory to move them into multiple folders, using a timestamp inside each file to create a YYYY-MM-DD/HH folder structure as a partitioning scheme. But every approach I've tried fails with timeouts or too-many-items limits. I need to open each file, get the timestamp, and use it to move the file to a dynamic path built from that timestamp. Ideas?
UPDATE: I was able to get around this, but I wouldn't call it an "answer", so I'll just update the question. To create smaller collections, I parameterized the pipeline to accept a file-name wildcard. I then created another pipeline that loops over an array of 0-9 and a-z and passes each value as the parameter to the dataset. A brute-force workaround... I assume there has to be a better solution, but this works for now.
Read doc: Move data to and from Azure Blob storage
The following articles describe how to move data to and from Azure Blob storage using different technologies.
Azure Storage Explorer
AzCopy
Python SDK (others: .NET, Java, Node.js, Go, PHP, Ruby)
SSIS
In your case, I would suggest using the SDK, which supports .NET, Java, Node.js, Python, Go, PHP, and Ruby.
Believe me, if you want to migrate your data out of Azure Blob storage, Data Factory is not a good way; it makes the problem more complicated.
(This is my suggestion after migrating over 100 million JSON files, over 2 TB, from Azure Blob storage.)
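As a rough sketch of the SDK route in Python (azure-storage-blob v12), where the connection string, container name, and the JSON timestamp field are placeholders:
import json
from datetime import datetime
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection-string>", "mycontainer")

for blob in container.list_blobs():
    if "/" in blob.name:
        continue  # already partitioned into a folder, skip
    src = container.get_blob_client(blob.name)
    doc = json.loads(src.download_blob().readall())
    ts = datetime.fromisoformat(doc["timestamp"])          # assumed field name and format
    dest = container.get_blob_client(ts.strftime("%Y-%m-%d/%H") + "/" + blob.name)
    dest.start_copy_from_url(src.url)   # server-side copy within the same account
    src.delete_blob()                   # for large blobs, check copy status before deleting
Run it in parallel batches if you need more throughput; the point is that the SDK gives you full control without the item limits you hit in ADF.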
If you have time... I would do the following:
Create an Azure Function that reads the file, gets your timestamp, and does the move operation. Scope the function to a single file. Then use Event Grid events on the storage account to trigger the function whenever a blob is created. That way you know any new file will be moved to the right spot. (Remember, you get a million executions in the Consumption model before Functions start billing, so this is a low-cost option.)
For the current files, create another function (or, if you want some more control, use a Logic App, though the cost will be a bit higher) and set the parallelism on the function or Logic App to a low value (to keep an eye on your executions); have it run a simple for-each with limits that calls your first function. This will slowly move your files out of that container, eventually getting you to a reasonable item count to work with in tools like ADF. This might solve your problem for the long run, since any new files will be categorized accordingly and your backlog is slowly moved as required. If you need to update a DB with a pointer to where each file lives, you could put that piece of code in your function or Logic App as well. Just my two cents :)
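A sketch of what that Event Grid-triggered function could look like in Python; the container name, connection string, and JSON timestamp field are placeholders, and the move itself mirrors the SDK sketch above.
import json
from datetime import datetime
import azure.functions as func
from azure.storage.blob import ContainerClient

def main(event: func.EventGridEvent):
    # The BlobCreated event payload carries the new blob's URL
    blob_name = event.get_json()["url"].split("/mycontainer/", 1)[1]

    container = ContainerClient.from_connection_string("<connection-string>", "mycontainer")
    src = container.get_blob_client(blob_name)
    doc = json.loads(src.download_blob().readall())
    ts = datetime.fromisoformat(doc["timestamp"])      # assumed timestamp field
    dest = container.get_blob_client(ts.strftime("%Y-%m-%d/%H") + "/" + blob_name)
    dest.start_copy_from_url(src.url)
    src.delete_blob()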
It is not clear whether you are using the hierarchical folder structure provided by Azure Data Lake Storage Gen2; Gen1 simulates a folder structure, but it is not optimal.
There are several advantages of ADLS Gen2 that should help in your case, mainly related to move operations.
To migrate from ADLS Gen1 to ADLS Gen2, have a look here.
Additionally, you may explore optimizations for your specific case with the following paper here.

Using Python and Google Cloud Engine to process big data

I am an amateur in the world of Python programming and I need help. I have 10 GB of data and I have written Python code in Spyder to process it; part of the code is shown below.
The code works fine with a small sample of the data. However, my laptop cannot handle the full 10 GB, so I need to use Google Cloud Engine. How can I upload the data and use Google Cloud Engine to run the code?
import pandas as pd

df = pd.read_pickle(r'C:\user\mydata.pkl')

# Walk backwards from 2018 to 1995, keeping only rows whose OverlapYearStart
# is <= the current year, and write each cut to its own pickle file.
i = 2018
while i >= 1995:
    df = df[df.OverlapYearStart <= i]
    df.to_pickle(r'C:\user\done\{}.pkl'.format(i))
    i = i - 1
I agree with the previous answer; just to complement it, you can take a look at AI Platform Notebooks, which is a managed service offering an integrated JupyterLab environment. It can also pull your data from BigQuery and lets you scale your application on demand.
On the other hand, I don't know how you have stored your 10 GB of data: in CSV? In a database? As mentioned in the first answer, Cloud Storage allows you to create buckets to store your data. Once the data is in Cloud Storage, you can export it into BigQuery tables and work with it in your app using Google App Engine, or with the earlier suggestion of AI Platform Notebooks; this will depend on your solution.
Probably the easiest thing to start digging into is App Engine, to run the code itself:
https://cloud.google.com/appengine/docs/python/
And use Google Cloud Storage to hold your data objects:
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
I don't know what the output of your application is, so depending on what you want to do with the output, Google Compute Engine may be the right answer if AppEngine doesn't quite fit what you're doing.
https://cloud.google.com/compute/
The first two links take you to the documentation on how to get going with Python for AppEngine and Google Cloud Storage.
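To give a concrete starting point, uploading the pickle to Cloud Storage from your laptop is only a few lines with the google-cloud-storage client library; the bucket name is a placeholder and must already exist.
from google.cloud import storage

client = storage.Client()                     # picks up your gcloud credentials
bucket = client.bucket("my-data-bucket")      # placeholder bucket name
bucket.blob("mydata.pkl").upload_from_filename(r"C:\user\mydata.pkl")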
Edit to add, from the comments: you'll also need to manage the memory footprint of your app. If you're really doing everything in one giant while loop, you'll have memory problems no matter where you run the application, because all 10 GB of your data will likely get loaded into memory. Definitely still shift it into the cloud, IMO, but that data will need to be broken up somehow and handled in smaller chunks.
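One rough way to do that chunking: if the pickle is first exported to CSV (the file names, chunk size, and per-year CSV output here are assumptions, not your original pickle outputs), pandas can stream it so only one chunk sits in memory at a time.
import pandas as pd

# Process the data in 500k-row chunks instead of loading all 10 GB at once.
# Reading straight from Cloud Storage ("gs://bucket/mydata.csv") also works if gcsfs is installed.
first_chunk = True
for chunk in pd.read_csv("mydata.csv", chunksize=500_000):
    for year in range(1995, 2019):
        part = chunk[chunk.OverlapYearStart <= year]
        part.to_csv("done_{}.csv".format(year), mode="a", header=first_chunk, index=False)
    first_chunk = False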
