Which Azure products are needed for a staging database?

I have several external data APIs that I access using some Python scripts. My scripts run from an on-premises server, transform the data, and store it in a SQL Server database on the same server. I suppose it's a rudimentary ETL system run with Python and T-SQL.
The system is about to grow quite a bit with new APIs and will require more complex data pipelines (for example, some of the API data will be spun off to more than one table). I think this would be a good time to move the system onto Azure (we are heavily integrated with Microsoft so it will have to be Azure!).
I have spent a few days researching the Azure products that would let me run Python scripts to access data from web APIs and store the processed data in a cloud database. I'm looking for advice on what sort of Azure products other people have used for similar jobs. At the moment it seems I will need:
Azure SQL Database to hold the processed data that can be accessed by various colleagues.
Azure Data Factory to manage, log, and schedule the pipeline jobs and to run my custom Python scripts (is this even possible?).
Azure Batch to run the aforementioned Python scripts but I'm not sure about this.
Basically, I want to put together a proposal and start thinking about costs, but it would be good to hear from someone who has done something similar. Am I on the right track or completely off? Should I just stay on-premises? Thank you in advance.

Azure SQL Database and Azure SQL Data Warehouse are good for relational data. If you want NoSQL, you could go with Azure Cosmos DB. If you want to store the data as files, you could use Azure Data Lake.
For Python scripts, you could use a Custom Activity or Databricks with Azure Data Factory.
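For the Custom Activity route, a minimal sketch of the kind of Python script it could run on an Azure Batch pool is below. This is only an illustration: the API URL, table name, and connection-string variable are placeholders, not anything from your setup.

    # Illustrative sketch of the kind of script an ADF Custom Activity could run
    # on an Azure Batch pool: pull records from a web API, apply a small
    # transform, and load them into Azure SQL Database. The API URL, table name,
    # and connection-string variable are placeholders.
    import os

    import pyodbc
    import requests

    API_URL = "https://example.com/api/readings"        # hypothetical API
    CONN_STR = os.environ["SQL_CONNECTION_STRING"]      # supplied by the activity/pool config

    def main():
        rows = requests.get(API_URL, timeout=30).json()
        with pyodbc.connect(CONN_STR) as conn:
            cur = conn.cursor()
            for r in rows:
                # Simple row-at-a-time load; a real pipeline would batch this.
                cur.execute(
                    "INSERT INTO dbo.Readings (SourceId, Value, ReadAt) VALUES (?, ?, ?)",
                    r["id"], float(r["value"]), r["timestamp"],
                )
            conn.commit()

    if __name__ == "__main__":
        main()

The Custom Activity itself only points at the Batch pool and the command to run; the script stays ordinary Python, which makes it easy to test locally first.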

Azure SQL Data Warehouse should be used when the amount of data you want to load is in the petabyte range. Also, Azure SQL Data Warehouse is not meant for complex transformations; I would recommend it only for plain data loads with PolyBase.

Related

How to transfer all data from SQL Server to Azure API for FHIR database (managed server)?

I'm trying to transfer all of my data from SQL Server to Azure API for FHIR (managed server).
Any idea what the starting point for this task should be? I have thousands of patients in the SQL Server database.
I need to transfer all of my current data into Azure Cosmos DB. Also, our EMR stores all of its data in SQL Server, so once I'm done with this bulk transfer, how can I transfer the new data into Azure Cosmos DB every day?
I would really appreciate your help on this. Please let me know if you have any questions.
I have tried converting my data into FHIR resources. I can insert them manually into the FHIR server, but I cannot do this for all the data.
If you're looking to transfer this data every day, keeping both in sync, you probably need to look at some Middleware to handle this for you, such as Mirth Connect.
A Mirth channel could look for new records in SQL Server, transform the data into a bundle of FHIR resources, then either pass the bundle on to an API of your own which handles the FHIR server update, or run the update itself inside the channel.
The downside of this approach is that the FHIR transformer plugin for Mirth requires a commercial license. You could also use the open-source version of Mirth, transform to HL7, run the HL7 files through the open-source Microsoft FHIR Converter to produce FHIR resources, and then run the insert/update yourself. (More work on your end.)
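Whichever middleware you pick, the core step looks roughly like the Python sketch below: read rows from SQL Server, map them to FHIR resources, and POST a transaction bundle to the FHIR server. The table, columns, endpoint, and bearer token handling are placeholders, not your actual schema or auth flow.

    # Illustrative sketch: read patients from SQL Server, map them to FHIR Patient
    # resources, and POST a transaction bundle to the FHIR server. The table,
    # columns, endpoint, and bearer token are placeholders.
    import pyodbc
    import requests

    CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=EMR;Trusted_Connection=yes"
    FHIR_BASE = "https://myworkspace.azurehealthcareapis.com"   # hypothetical endpoint

    def to_patient_resource(row):
        # Minimal Patient; a real mapping needs identifiers, contacts, extensions, etc.
        return {
            "resourceType": "Patient",
            "id": str(row.PatientId),
            "name": [{"family": row.LastName, "given": [row.FirstName]}],
            "birthDate": row.DateOfBirth.isoformat(),
        }

    def main():
        with pyodbc.connect(CONN_STR) as conn:
            rows = conn.cursor().execute(
                "SELECT PatientId, FirstName, LastName, DateOfBirth FROM dbo.Patients"
            ).fetchall()

        bundle = {
            "resourceType": "Bundle",
            "type": "transaction",
            "entry": [
                {
                    "resource": to_patient_resource(r),
                    "request": {"method": "PUT", "url": f"Patient/{r.PatientId}"},
                }
                for r in rows
            ],
        }
        # Azure API for FHIR expects an OAuth2 bearer token; acquiring it is omitted here.
        resp = requests.post(FHIR_BASE, json=bundle,
                             headers={"Authorization": "Bearer <token>"})
        resp.raise_for_status()

    if __name__ == "__main__":
        main()

With thousands of patients you would chunk the rows into several smaller bundles rather than sending one huge transaction.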

What is the best way to automate creating SQL .bak or .bacpac backup files and saving them to an Azure cloud storage container?

Currently I am tasked with researching a solution for easily copying data from one environment to another (QA to DEV, for example), as well as having the flexibility to go back to different points in time to compare our data. It is an easy task to do locally with SSMS, and I am looking for the best ways to do it using Azure and its tools.
These are the options that I found so far:
Backup Service and Backup Vault (the MS solution I am not asking for; they don't generate .bak files)
Azure Function to generate and transfer the SQL backups (flexible, but the code needs to be maintained and authentication managed)
PowerShell process with Azure Automation (flexible too, but needs to be maintained)
Data Factory / SSIS (still learning and researching)
Anyone got any tools/methods that are worth looking into before I dive deeper with a solution?
For Azure SQL Database, SQL Data Sync is a feature for syncing data between Azure SQL and on-premises SQL Server. Some limits are that the Azure SQL database must be the hub and every table must have a primary key. That may not suit you.
In my experience, Data Factory is the best option for you. You can copy data between different environments, and in the sink settings you can use the upsert (insert or update) operation to sync the data.
If you only want to schedule the SQL backups automatically, a third-party tool such as SQL Backup and FTP could also meet your needs.
Since you have searched a lot, you have already found almost all the options in Azure, and any of them can achieve this. You need to be clear about your real requirement: data sync, or automatically creating a .bacpac file in storage. As asked, it's hard to point to a single best way; the approach you are most comfortable with is the best one.
I went with writing an Azure Automation PowerShell script. Including cmdlets like New-AzureRmSqlDatabaseExport and passing in the parameters was tricky, but it finally did the job.
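For anyone who would rather script this in Python than PowerShell, a rough sketch of the same export-and-upload idea is below, driving sqlpackage and the azure-storage-blob SDK. It assumes sqlpackage is installed and on PATH; the connection strings, file names, and container name are placeholders.

    # Illustrative sketch: export a .bacpac with sqlpackage, then upload it to a
    # blob container. Connection strings, paths, and the container name are
    # placeholders; sqlpackage must be installed and on PATH.
    import datetime
    import subprocess

    from azure.storage.blob import BlobServiceClient

    SQL_CONN = "Server=tcp:myserver.database.windows.net;Database=QA;User ID=...;Password=..."
    STORAGE_CONN = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."
    CONTAINER = "db-backups"

    def main():
        stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
        bacpac = f"QA_{stamp}.bacpac"

        # Export the database to a local .bacpac file.
        subprocess.run(
            ["sqlpackage", "/Action:Export",
             f"/SourceConnectionString:{SQL_CONN}",
             f"/TargetFile:{bacpac}"],
            check=True,
        )

        # Upload the file to the storage container.
        blob = BlobServiceClient.from_connection_string(STORAGE_CONN) \
            .get_blob_client(container=CONTAINER, blob=bacpac)
        with open(bacpac, "rb") as fh:
            blob.upload_blob(fh, overwrite=True)

    if __name__ == "__main__":
        main()

Scheduled from Azure Automation or a WebJob, this gives you the timestamped .bacpac files in storage that you can restore into another environment for comparison.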

Which Azure storage technology for weather forecast data

I would like some advice/tips about the right technology to select in order to store some forecast data on Azure technologies.
My team and I scrape weather forecast data every day from various sources and store it as is in Azure File Storage. The file format is "grib2", which is a standard format for weather forecast data.
We are able to extract the data from those grib2 files using a Python script running on an Azure VM.
We now have several files representing hundreds of gigabytes of data to store, and I'm struggling to work out which Azure data store best suits our needs in terms of practicality and cost.
We started with Azure Table Storage because it's a cheap solution, but I've read in many posts that it is a bit dated and not well suited to our case: for example, it does not return more than 1,000 entities per query and offers no aggregation over the data.
I considered using Azure SQL DB, but it seems it can become very expensive very fast.
I also considered Azure Data Lake Storage Gen2 (and HDInsight), but I am not very comfortable with those blob-style stores and cannot really tell whether they would suit my needs in terms of practicality and ease of querying.
For now the plan is:
1) Extract data from the grib2 files with a Python script running on an Azure VM (a rough sketch of this step follows below)
2) Insert the transformed data into [Azure storage]
3) Query the [Azure storage] from Azure Machine Learning service or a local R script (for example)
4) Insert the computed data into [Azure storage]
where the [Azure storage] technology is still to be determined.
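For illustration only, reading a grib2 file in Python could look like the sketch below. It uses xarray with the cfgrib engine, which is just one common option; the file name and variable are placeholders rather than what we actually run.

    # Illustrative only: one common way to read grib2 in Python is xarray with the
    # cfgrib engine. The file name and variable are placeholders.
    import xarray as xr

    ds = xr.open_dataset("forecast.grib2", engine="cfgrib")   # open the grib2 messages
    temps = ds["t2m"]                          # e.g. 2 m temperature, if present in the file
    df = temps.to_dataframe().reset_index()    # flat rows of time / lat / lon / value
    print(df.head())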
Any help or advice would be much appreciated, thanks.
A couple of things I would see here:
To store the downloaded files in raw format (grib2 in your case), place them on good ol' Azure Blob Storage. It's cheap storage, exactly for your needs.
Use Azure Databricks to load the data from the storage account and unpack it into memory (Python or Scala).
With the data in memory, still in Databricks, run your ML inferencing. You could also use SparkR if you really want to.
Store the computed results in a serving layer. This really depends on what you want to do with them later; often Azure SQL Database is an obvious choice, and there is a native Spark connector which efficiently writes data from Databricks to SQL DB (see the sketch below).
In addition to using Databricks as your inferencing environment, it's also a good choice for ML training (e.g. utilizing Azure ML Service).
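A rough Databricks notebook sketch of the read-from-Blob and write-to-SQL ends of that flow is below. It assumes the extracted data has already been written out as Parquet, and it uses the plain Spark JDBC writer rather than the dedicated connector; every account, secret scope, and table name is a placeholder.

    # Databricks notebook sketch: read extracted data from Blob Storage, run your
    # model over it, and write the results to Azure SQL Database over JDBC.
    # Account names, secret scopes, and table names are placeholders.
    spark.conf.set(
        "fs.azure.account.key.mystorageacct.blob.core.windows.net",
        dbutils.secrets.get(scope="weather", key="storage-key"),
    )

    raw = spark.read.parquet(
        "wasbs://forecasts@mystorageacct.blob.core.windows.net/extracted/"
    )

    scored = raw   # ... transformations / ML inferencing would go here ...

    (scored.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=weather")
        .option("dbtable", "dbo.ForecastScores")
        .option("user", "loader")
        .option("password", dbutils.secrets.get(scope="weather", key="sql-password"))
        .mode("append")
        .save())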

Google Analytics data in Azure

Has anybody ever moved Google Analytics data into Azure? I have seen a handful of ways to do it, but I am not sure what I am getting myself into. The Google Analytics data is becoming quite large, and I am wondering whether it is best to leave it in Google storage and access it from Azure, or to move it into something like HDInsight or Data Lake. I need to join the data across several disparate data stores: SQL Azure, Blob, and Table Storage. I was also looking into Apache Drill and Presto as possible solutions to unify the data access. Just looking to see if anybody out there has dealt with this same issue and has any experience to share. Thanks!
Preface
I don't have experience with Presto so I can only comment on the feasibility of doing this with Drill. Also I have not used Azure services so my advice is theoretical.
Drill Storage Plugins
Drill will allow you to perform any SQL queries you want on data originating from different sources, provided that each data source has a storage plugin. A storage plugin is simply a piece of code in Drill that allows you to interface with a data source. Since you are concerned with performing queries on 3 data sources, we need to determine whether each of those 3 data sources has a storage plugin.
SQL Azure
I assume SQL Azure has a JDBC driver for Java. If so, then Drill can be configured to use SQL Azure by following these instructions.
Azure Blob
Azure Blob Storage has an implementation of the Hadoop FileSystem API, which Drill uses to read data from file systems. So you could theoretically add the hadoop-azure jar and its dependencies (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/2.7.0) to Drill's classpath and configure Drill's DFS storage plugin to use it.
Additionally, the data in Azure Blob would have to be stored in a supported file format such as JSON, Parquet, CSV, or Hadoop sequence files.
Azure Table
This looks like Microsoft's custom NoSQL database. Currently Drill does not support it.
Conclusion
With a bit of work you could use Drill to query data in both Azure SQL and Blob Storage, but not Azure Table.
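For what it's worth, once the storage plugins are configured, cross-source queries can be submitted to Drill over its REST API, for example from Python as in the sketch below. The plugin names (azuresql, dfs), workspace, and file path are made-up examples that would need to match your own plugin configuration.

    # Illustrative sketch: submit a cross-source query to Drill's REST API. The
    # plugin names (azuresql, dfs), workspace, and file path are made-up examples
    # that would need to match your own plugin configuration.
    import requests

    DRILL = "http://localhost:8047"

    query = """
        SELECT s.customer_id, s.total, g.sessions
        FROM azuresql.dbo.sales s
        JOIN dfs.analytics.`ga_sessions.parquet` g
          ON s.customer_id = g.customer_id
    """

    resp = requests.post(
        f"{DRILL}/query.json",
        json={"queryType": "SQL", "query": query},
        timeout=300,
    )
    resp.raise_for_status()
    print(resp.json())   # result rows come back as JSON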

Is Azure Data Factory suitable for downloading data from non-Azure REST APIs?

Consider a data processing pipeline as follows:
Fetch a large amount of data from a REST API that's hosted somewhere on the internet and persist it to a data store.
Perform some complex data transformations on the persisted data.
Persist the results of the data transformations on a data store.
When implementing such a pipeline in Azure, steps 2 and 3 seem like a good fit for implementation as Azure Data Factory activities.
My question is: does it make sense to implement step 1 as an Azure Data Factory activity as well?
Technically it might be possible to code a .NET activity that performs the data download and persistence.
No - do not implement step 1 in an Azure Data Factory activity.
Technically it is possible to run the entire process from ADF, but I would argue that this choice is (relatively) more costly than other options available to you, because you will pay for each activity run in Azure Data Factory.
For instance, what if the REST API has no new data to offer when you initiate the (scheduled) activity? You'll pay for that run anyway.
You might consider the following as an easy to implement alternative:
1 - Create a .NET console app, publish it as a WebJob, and schedule it to run daily.
2 - The long-running console app can query the REST API, persist the data into Azure Storage / DocumentDB, and push a message onto a queue which triggers ADF steps 2/3 to run against the saved data (see the sketch below).
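The same pattern translates to Python if that's your stack. A minimal sketch under that assumption is below, using the v12 azure-storage SDKs; the API endpoint, container, and queue names are placeholders, and how the downstream pipeline is actually triggered off the queue is left out.

    # Illustrative sketch of step 1 outside ADF: pull from the REST API, persist
    # the raw payload to Blob Storage, then drop a message on a queue that the
    # downstream pipeline (steps 2/3) is triggered from. All names are placeholders.
    import datetime
    import json

    import requests
    from azure.storage.blob import BlobServiceClient
    from azure.storage.queue import QueueClient

    API_URL = "https://example.com/api/orders"          # hypothetical endpoint
    STORAGE_CONN = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."

    def main():
        payload = requests.get(API_URL, timeout=60).json()
        if not payload:
            return   # nothing new: no blob written, no pipeline run, no cost

        blob_name = f"orders/{datetime.datetime.utcnow():%Y%m%dT%H%M%S}.json"
        BlobServiceClient.from_connection_string(STORAGE_CONN) \
            .get_blob_client(container="raw", blob=blob_name) \
            .upload_blob(json.dumps(payload))

        QueueClient.from_connection_string(STORAGE_CONN, "ingest-ready") \
            .send_message(blob_name)

    if __name__ == "__main__":
        main()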
I have done exactly that using a .NET custom activity. I had a need to fetch data from the Salesforce API, and this has been working well for my needs. Here is a post I wrote up about creating a .NET activity and storing the data in Azure Data Lake.
As in Newport99's answer, yes, you will incur costs for that activity, but I am not sure how cost-effective it would be to run a separate web app to host a WebJob and also run the Azure Data Factory pipeline. When I was originally designing a solution, the WebJob was my first choice, but in the end I preferred to have the whole solution use one Azure service instead of several.
Hope that helps.
There have been a lot of improvements to ADF in the years since this question was posted, including a REST connector.
Here's the approach recommended in the ADF documentation at this time...
Copy data from a REST endpoint by using Azure Data Factory
