ADF, Azure Function or Hybrid [closed] - azure

I need to download data from several APIs, some using plain REST and some using GraphQL, parse the data, and map some of the fields to various Azure SQL tables, discarding the unneeded data (it will later be visualised in Power BI).
I started using Azure Data Factory but got frustrated with the lack of simple functions, such as converting a JSON field containing HTML into plain text.
I then looked at Azure Functions, probably in Python (although I'm open to Node.js). However, I have a lot of data to download and upsert into the database, and there are mentions on the internet that ADF is the most efficient way to bulk upsert data.
Then I considered a hybrid: an Azure Function to get the data and ADF to bulk copy it into the database.
So my question is: what should I be using for my use case? I'm open to any suggestions, but it needs to run on Azure and be cost-sensitive. The ingestion needs to run daily, upserting around 300,000 records.

I think this pretty much comes down to taste, as you can probably solve this entirely with ADF alone or with an Azure Function alone, depending on the more specific circumstances of your case. In my personal experience I've often ended up with the hybrid variant, because it can be easier thanks to the extra flexibility compared to the standard API components of ADF: do the extraction from the API in an Azure Function, store the data in Blob Storage or a data lake, and then load the data into the database using ADF. This setup can be pretty cost-effective in my experience, provided you can use an Azure Functions Consumption plan (cheaper than the alternatives) and/or can avoid using Data Flows in ADF (a significant cost driver in ADF).
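As a rough illustration of the extraction half of that hybrid setup, here is a minimal sketch of the kind of code a timer-triggered Azure Function could run: it pulls JSON from a REST endpoint and stages the raw payload in Blob Storage, where an ADF copy activity can pick it up. The API URL, container name, and environment variable name are placeholders, not anything from the question:

import json
import os
from datetime import datetime, timezone

import requests
from azure.storage.blob import BlobServiceClient

# Hypothetical source endpoint and target container -- adjust to your own APIs.
API_URL = "https://api.example.com/v1/records"
CONTAINER = "raw-ingest"


def extract_to_blob() -> str:
    """Fetch one API payload and stage it as a JSON blob for ADF to copy."""
    response = requests.get(API_URL, timeout=60)
    response.raise_for_status()
    records = response.json()

    blob_service = BlobServiceClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"]  # assumed app setting
    )
    blob_name = f"records/{datetime.now(timezone.utc):%Y-%m-%d}/payload.json"
    blob_client = blob_service.get_blob_client(container=CONTAINER, blob=blob_name)
    blob_client.upload_blob(json.dumps(records), overwrite=True)
    return blob_name

From there a plain Copy activity (or a stored-procedure sink doing a MERGE) can handle the daily 300,000-row upsert into Azure SQL without needing Data Flows.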

Related

What are best practices for using Azure Synapse MPP dedicated SQL pools with large data volumes? [closed]

Hello Stack Overflow community,
Regarding Microsoft's Azure Synapse Analytics (ASA) cloud-based OLAP offering, there seems to be a critical oversight with ASA dedicated MPP SQL pools.
I'm rolling my own IDENTITY surrogate keys because Azure Synapse MPP dedicated SQL pools don't seem to support this. In doing so, data sets that could be loaded with O(N) performance (i.e., pre-sorted data sets) are degraded to O(N log N), because each time I insert a record I must calculate its surrogate key (e.g., surrogateKey <- MAX(table[surrogateKey]) + 1).
INITIAL LOADING, as well as INCREMENTAL LOADING, of many large ODS datasets (sometimes exceeding billions of rows) is a requirement for the DW environments I support.
This currently does not appear to be supported. If this is a limitation of using Azure Synapse dedicated SQL pools, how do we overcome it?
How do we best implement and support the necessary surrogate keys for OLAP data architectures in ASA, without sound assistance and support from the RDBMS, and still have it be performant?
Please also help me find answers to these important, related questions:
How do I best ingress and ingest source data into ASA dedicated MPP SQL pool analytics tables that:
0.) maintain their own auto-numbered [i.e., MSSQL IDENTITY(SEED, INCREMENT)] ENFORCED, UNIQUE, PRIMARY KEY surrogate key fields?
1.) also support NONCLUSTERED indexes?
2.) allow bulk insert operations to run in O(N) with pre-sorted source data sets? How do I accomplish massive insert operations with O(N) performance in Azure Synapse SQL pool databases? It sure seems like this is now O(N log N).
3.) support PK-FK referential integrity? How do we approach this and still have a performant analytics data model with Azure SQL pool databases?
Thank you.
There must be a better way to do this with Azure dedicated SQL pool databases?
Synapse dedicated SQL pools do support IDENTITY, and there shouldn't be any reason they won't support tables with billions of rows. Large data loads should use COPY INTO or PolyBase for best performance.
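As a rough sketch of that combination (submitted here via pyodbc from Python; the connection string, table names, and storage URL are placeholders, and a staging table dbo.StageSales is assumed to exist), the pool generates the surrogate key itself, so there is no MAX(key) + 1 scan per insert:

import os

import pyodbc

# Placeholder ODBC connection string for the dedicated SQL pool endpoint.
conn = pyodbc.connect(os.environ["SYNAPSE_ODBC_CONNECTION_STRING"], autocommit=True)
cursor = conn.cursor()

# Target table: IDENTITY assigns the surrogate key (unique, but not guaranteed
# to be sequential in a dedicated SQL pool).
cursor.execute("""
CREATE TABLE dbo.FactSales
(
    SalesKey  BIGINT IDENTITY(1, 1) NOT NULL,
    SourceId  NVARCHAR(50)          NOT NULL,
    Amount    DECIMAL(18, 2)        NOT NULL
)
WITH (DISTRIBUTION = HASH(SourceId), CLUSTERED COLUMNSTORE INDEX);
""")

# Bulk-load staged files into a keyless staging table; COPY INTO parallelises
# the load across the MPP distributions.
cursor.execute("""
COPY INTO dbo.StageSales
FROM 'https://mystorageaccount.blob.core.windows.net/staging/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'));
""")

# Move the batch into the fact table; SalesKey values are generated on insert.
cursor.execute("""
INSERT INTO dbo.FactSales (SourceId, Amount)
SELECT SourceId, Amount FROM dbo.StageSales;
""")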

Best Azure serverless service to run a Python data processing project [closed]

I am quite new to Azure and I am getting a bit lost in all the available services.
What do I want to do:
I want to run a Python project serverlessly on Azure that gets data from a database, processes it, does some analysis, and writes it back to a database. After it's done, the server should stop again. This can be triggered by data uploaded to a storage location, or it has to run periodically. Ideally, I would like to be able to build and deploy it through CD (GitHub Actions).
What did I find
Reading through the documentation and some other resources, these are the services I think I can use in descending order, but I am not 100% sure.
Azure Functions
Azure Container Instances
Azure Web Apps
Also I found this, but it seems outdated.
Question:
Which Azure service matches my use case best?
What you are trying to accomplish has a name: ETL (Extract, Transform, Load). This is a general pattern for when you need to take data from its source (a DB in your case), manipulate it, and offload it to some destination (a DB in your case again).
You listed some valid options. From your list, an Azure Function will be the truly serverless option, as you aren't billed while it's idling. The other options can also accomplish the task, but you will also pay for the hours when your code does nothing.
There's also a service built just for your need: Azure Data Factory. You can design your data flow using the UI and include your Python functions as steps. The overall result will be a data pipeline (like CD, but for data). And of course it's serverless: you will be billed only for the time the pipeline is executing.
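To show what the truly serverless option can look like, here is a minimal sketch of a timer-triggered Azure Function using the v2 Python programming model; the schedule, connection-string setting, table names, and the trivial analysis step are all placeholders standing in for your own database access and processing code:

import os

import azure.functions as func
import pyodbc

app = func.FunctionApp()


@app.timer_trigger(schedule="0 0 2 * * *", arg_name="timer")  # every day at 02:00 UTC
def process_data(timer: func.TimerRequest) -> None:
    """Read rows, run the analysis, write results back; the host then scales to zero."""
    conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])  # assumed app setting
    cursor = conn.cursor()

    rows = cursor.execute("SELECT id, value FROM dbo.RawMeasurements").fetchall()

    # Placeholder for the actual analysis step.
    results = [(row.id, row.value * 2) for row in rows]

    cursor.executemany(
        "INSERT INTO dbo.ProcessedMeasurements (source_id, score) VALUES (?, ?)",
        results,
    )
    conn.commit()

The same function app can be deployed from GitHub Actions, which covers the CD requirement.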

Trigger BigQuery Scheduled Queries from a Cloud Function [closed]

I need to run some Scheduled Queries on-demand.
I believe Cloud Functions triggered by Pub/Sub events is a solution that provides good decoupling.
However, I can't find a reliable solution.
This solution crashes
BigQuery + Cloud Functions:
This one works only on the documentation page
Method: transferConfigs.startManualRuns
What is the best way to trigger on-demand Scheduled Queries from a Cloud Function?
I understood that you don't want a scheduled query; you want a query you can easily invoke without rewriting it.
I can propose 2 solutions:
Store your query in a file on Cloud Storage. When your Cloud Function is invoked, read the file content and run a BigQuery job with it.
PRO: you simply have to update the file content to update the query.
CONS: you need to read a file from storage and then call BigQuery -> two API calls and a query file to manage.
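A minimal sketch of that first option (Python, with a placeholder bucket and object name) could look like this:

from google.cloud import bigquery, storage

# Placeholder bucket and object names -- point these at your own query file.
QUERY_BUCKET = "my-query-bucket"
QUERY_OBJECT = "queries/daily_report.sql"


def run_stored_query(request):
    """HTTP-triggered Cloud Function: load the SQL from Cloud Storage and run it."""
    sql = (
        storage.Client()
        .bucket(QUERY_BUCKET)
        .blob(QUERY_OBJECT)
        .download_as_text()
    )
    job = bigquery.Client().query(sql)  # start the BigQuery job
    job.result()                        # wait for it to finish
    return f"Query finished, {job.total_bytes_processed} bytes processed"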
Use stored procedure
Firstly, create your stored procedure
CREATE OR REPLACE PROCEDURE `my_project.my_dataset.my_procedure`()
BEGIN
SELECT * FROM `my_project.my_dataset.my_table`;
.......
END;
Then invoke it from your Cloud Function (it's just a query to BigQuery):
CALL `my_project.my_dataset.my_procedure`();
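For example, a minimal Cloud Function sketch (Python, with the same placeholder project and dataset names) that issues that CALL through the BigQuery client library:

from google.cloud import bigquery


def trigger_procedure(request):
    """HTTP-triggered Cloud Function that runs the stored procedure on demand."""
    client = bigquery.Client()
    job = client.query("CALL `my_project.my_dataset.my_procedure`();")
    job.result()  # block until the procedure has finished
    return "Procedure completed"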
PRO: simply update the stored procedure to update the query. It can perform complex queries.
CONS: you don't have a query history (in the previous solution you can enable bucket versioning for this).
Are these acceptable solutions?

Good architecture for Azure for streaming analytics? [closed]

I have JSON data coming from sensors every second into Azure IoT Hub. The data is a time series with 15 variables. I want to process this data in real time using a C# application, which is quite complex, and send the output events to some other service (it can be storage or Power BI).
What do you think is the best architectural approach for it?
1. Try to process the data in Stream Analytics with C# code. I know there is .NET support for Azure Stream Analytics, but I think it is very immature? Any experience with this approach? Does Azure Stream Analytics support complex C# algorithms?
2. Store the data in Azure Data Lake and use Data Lake Analytics to process it?
Your experiences and recommendations are very much appreciated.
Many thanks
Try to process the data in Stream Analytics with C# code
Azure Stream Analytics uses the Stream Analytics Query Language to perform transformations and computations over streams of events. The C# SDK is just a way to create and run a Stream Analytics job; all the transformation and computation work should be written in the Stream Analytics Query Language.
Store the data in Azure Data Lake and use Data Lake Analytics to process it?
Stream Analytics is better for real-time data-handling scenarios. I suggest you combine the two approaches: use Azure Stream Analytics to do the preliminary, necessary data processing and conversion, output the data to Azure Data Lake, and then use Data Lake Analytics to process the data further.
If you're open to alternative solutions, you could also use an HTTP API like Stride, which enables the creation of networks of continuous SQL queries, chained together, with the ability to subscribe to streams of changes as a means of streaming data out to applications.
If your computational needs fit within the confines of SQL this approach might work well for you. You can check out the Stride docs to see some examples.

Comparisons between Big Data solutions [closed]

I've been researching Big Data for the last couple of months and have started my FYP, which is to analyze Big Data using MapReduce and HDInsight on Windows Azure.
I've now run into a particular confusion: which platform, such as Amazon, Oracle, or IBM, would be better for Big Data analytics in terms of cost, performance, stability, etc.? This question may be too broad, but I just wanted to get a basic idea of how they can be differentiated when compared to Azure HDInsight.
In short: HDInsight vs. other Big Data solutions for Big Data analytics. Any help would be appreciated.
Comparison among different Hadoop distributors (image):
You can find a reference about the Microsoft distribution in this article.
One important detail is that it is not only about the platform. I agree that it is important to understand your options, but I humbly suggest that you also take your (and your team's) skills into consideration.
One platform may be better than another one but if you are starting from scratch then you may fail in achieving your goals in terms of timelines, budget or even fail completely.
Combining Operational and Analytical Technologies Using Hadoop
New technologies like NoSQL, MPP databases, and Hadoop have emerged to address Big Data challenges and to enable new types of products and services to be delivered by the business.
One of the most common ways companies are leveraging the capabilities of both systems is by integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made by existing APIs and allows analysts and data scientists to perform complex, retroactive queries for Big Data analysis and insights while maintaining the efficiency and ease-of-use of a NoSQL database.
NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to capture Big Data and provide operational intelligence to users, and MPP databases and Hadoop should be used to provide analytical insight for analysts and data scientists. Together, NoSQL, MPP databases and Hadoop enable businesses to capitalize on Big Data.
As per my knowledge, each cloud service provider has its positives and negatives.
I have good knowledge of Google Cloud, so I tried to compare with respect to Google Cloud.
Below are two links that provide product mappings with respect to Google Cloud:
https://cloud.google.com/free/docs/map-azure-google-cloud-platform
https://cloud.google.com/free/docs/map-aws-google-cloud-platform
For example, Azure HDInsight maps to Google Cloud Dataproc and Google Cloud Dataflow. Using Dataproc, we can run Hadoop MapReduce jobs; Dataflow can be used for both batch and stream data processing.
In AWS, Azure HDInsight maps to Amazon Elastic MapReduce (EMR).
Each service provider has a different pricing mechanism based on CPU type, number of cores, and storage options. In Google Cloud, there is the option of preemptible instances, which are very cheap but can only be used for short periods (max 24 hrs).
You can compare pricing from below links:
https://cloud.google.com/dataproc/pricing
https://cloud.google.com/dataflow/pricing
https://azure.microsoft.com/en-us/pricing/details/hdinsight/
https://aws.amazon.com/emr/pricing/
There is a tool on the market for comparing different cloud services:
https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
