We have a requirement to extract data from SAP ECC tables to Azure Data Lake in real time.
Azure Data Factory has a connectivity option for the SAP ECC system, but it does not support real-time ingestion.
Please let me know if there is any native implementation available within Azure / SAP to support this type of requirement.
For real-time ingestion scenarios, you can use the SAP change data capture (CDC) solution in Azure Data Factory; see SAP change data capture (CDC) solution in Azure Data Factory. The native connector is in Public Preview at the moment.
Related
We've designed a data architecture for our client on Azure in which we ingest the sources into a Raw Layer consisting of an Azure SQL Database. This Azure SQL Database acts as a source mirror and has near-real-time sync enabled.
We also have an ODS Layer which is populated from the previously mentioned Azure SQL Database (source mirror) as per the given data model. This layer should ideally take anywhere between 30 minutes and 1 hour to load.
How do I handle the concurrent writes and reads on the Raw Layer (Azure SQL Database, source mirror)? It syncs with the sources every 5 minutes but is also read every 30 minutes to 1 hour to load the ODS Layer.
I have to use Azure Data Factory to implement my data loads.
Yes, Azure Data Factory is a good fit for such scenarios. It's a cloud-based ETL and data integration service that lets you build data-driven workflows to orchestrate data movement and transform data at scale. You can use Azure Data Factory to design and schedule data-driven workflows (also known as pipelines) that can consume data from a variety of sources. With data flows or compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database, you can build sophisticated ETL processes that transform data visually.
When using control flow, you can use a GetMetadata activity to get a list of files in a storage account, then pass that list to a ForEach activity with the Sequential flag set to false to process all files concurrently (in parallel), up to the maximum batch size, according to the activities defined in the ForEach loop.
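For illustration, here is a minimal sketch of the pipeline definition that pattern describes, written as a Python dict mirroring the ADF pipeline JSON; the dataset name and batch size are assumptions, not taken from the thread.

```python
# Sketch of an ADF pipeline (expressed as a Python dict mirroring the pipeline JSON)
# that lists files with GetMetadata and fans out over them with a parallel ForEach.
# "SourceFolderDataset" and the batch size are illustrative assumptions.
pipeline = {
    "name": "ProcessFilesInParallel",
    "properties": {
        "activities": [
            {
                "name": "ListFiles",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": {"referenceName": "SourceFolderDataset", "type": "DatasetReference"},
                    "fieldList": ["childItems"],  # returns the files in the folder
                },
            },
            {
                "name": "ForEachFile",
                "type": "ForEach",
                "dependsOn": [{"activity": "ListFiles", "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {
                    "isSequential": False,  # run iterations concurrently
                    "batchCount": 20,       # max parallel iterations (up to 50)
                    "items": {"value": "@activity('ListFiles').output.childItems", "type": "Expression"},
                    "activities": [
                        # per-file copy/transform activities go here,
                        # referencing @item().name for the current file
                    ],
                },
            },
        ]
    },
}
```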
Here is the official Microsoft documentation for the Azure Data Factory connectors: Connector overview | Docs
We are in the process of analyzing which database will be the best choice for time series data (like stock market data / trading data, market sentiment, etc.).
Is Azure Synapse a good choice for time series data?
Azure Synapse data explorer (Preview) provides you with a dedicated query engine optimized and built for log and time series data workloads.
With this new capability now part of Azure Synapse's unified analytics platform, you can easily access your machine and user data to surface insights that can directly improve business decisions.
To complement the existing SQL and Apache Spark analytical runtimes, Azure Synapse data explorer is optimized for efficient log analytics, using powerful indexing technology to automatically index structured, semi-structured, and free-text data commonly found in telemetry data.
For more info, please refer to the related articles below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/data-explorer/data-explorer-overview
Time series solution - Azure Architecture
Please note that the feature is in public preview.
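For illustration only, here is a minimal sketch of querying time-series data with KQL from Python via the azure-kusto-data client; the pool URI, database, table, and column names are placeholders and not taken from the thread.

```python
# Minimal sketch: run a KQL time-series query against a Data Explorer pool from Python.
# Requires the azure-kusto-data package; cluster URI, database and table names are placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://<pool>.<workspace>.kusto.azuresynapse.net"  # placeholder endpoint
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

# Hypothetical table of trades; bins the price into 1-minute averages per symbol.
query = """
Trades
| where Timestamp > ago(1d)
| summarize avg(Price) by Symbol, bin(Timestamp, 1m)
| order by Timestamp asc
"""
response = client.execute("marketdata", query)  # database name is an assumption
for row in response.primary_results[0]:
    print(row["Symbol"], row["Timestamp"], row["avg_Price"])
```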
Question
Today I discovered another Azure service called Azure Data Explorer (ADX). Sorry for such a comparison of services; I have a good understanding of all of them except ADX. I feel like there is a big functionality overlap, so I want to know the exact role of ADX in the Azure infrastructure.
What is the use case when ADX is significantly better than Synapse/Databricks?
My understanding of ADX
AFAIK, ADX is a cluster (with per-hour billing, like Databricks or Synapse, not like ADLA) that handles the database for you and is optimized for streaming ingestion and ad-hoc queries at scale. It also supports external tables, which have worse performance but are cheaper (you only pay for Blob/ADLS storage).
Details
I don't understand why we need ADX if:
Azure Synapse has a similar pricing model (cluster, per-hour) and also supports streaming ingestion and ad-hoc querying at scale. Azure Synapse supports querying Blob Storage/ADLS through PolyBase external tables.
Databricks is another service capable of doing this. Using Databricks Ingest and Delta Lake, you can ingest streaming data and consume it in both streaming and batch fashion. You can even have an interactive cluster that handles ad-hoc queries for you.
Also, if you want real-time analytics, use Azure Stream Analytics. If you want an Athena-like experience, use ADLA (though it still doesn't support ADLS Gen2).
Azure Data Explorer is focused on high velocity, high volume, high variance data (the 3 Vs of big data). It provides super-fast interactive queries over such data as it streams in. It supports JSON and text natively, including full-text search and indexing.
It is used in a broad set of scenarios associated with sensing activity and time series across a large set of verticals: IoT, API logs, transaction monitoring, and ad-hoc data exploration.
Microsoft offers ADX as a service because it is the major service Microsoft uses for its own telemetry, and all the analytics services Microsoft offers in security, operational monitoring, game analytics, product usage insights, IoT, and connected vehicles are built on ADX. You can find a full list in our docs. For clarity, SQL, Synapse, and Cosmos DB store their telemetry in Azure Data Explorer.
SQL DW (a.k.a. Synapse SQL pool) is an excellent data warehouse and implements the modern data warehouse pattern: ETL -> curated data model -> load and serve via Analysis Services or Power BI.
ADX is for real-time analytics, enabling schema-on-read (SOR) on data as fresh as seconds old.
Consider ADX a fully managed platform when replacing Solr/Lucene-based variants used for logs, time series databases, and more.
Try it out in large workloads and you will see it is dramatically cheaper than the alternatives and much more powerful and performant.
Reach out to me if you need help.
Azure Data Explorer, alias Kusto, is focused on high-volume data ingestion and near-real-time query and analytics. It was invented at Microsoft for log and telemetry analytics, but it can be used for other purposes, e.g. IoT, sensor data, or web analytics. The same technology is used in internal Azure services like Azure Monitor and Log Analytics.
Similar capabilities could be built on Synapse, Databricks, or HDInsight, but I see those as tools that fit a much broader set of use cases; ADX has quite a narrow focus. ADX supports its own query language ("KQL") but has very limited SQL support. It is good for append-only data, not for updates. It is not a data warehouse, database, or data lake.
Microsoft material refers to the technology behind ADX by the name Kusto. More info on this at https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/. A good comparison of services can be found in this blog post: https://vincentlauzon.com/2020/02/19/azure-data-explorer-kusto
I want to copy data from an on-premises Oracle DB to SQL Server on a real-time basis.
Data Factory has many strengths, but high-frequency (near-real-time) scheduling is not one of them. Have you considered a different approach?
For real-time integrations I would recommend a Function App, a Logic App and/or Service Bus. I would call this app whenever there is a relevant change in the Oracle DB. Alternatively, you could put an API on top of the on-premises Oracle DB and call it from a scheduled app.
If you expect heavy traffic, you might want to consider using Service Bus. The illustration below shows how Azure Service Bus sends data from a publisher (on-premises) to a subscriber (Azure SQL DB) using a topic message.
[Illustration: large-scale Service Bus data flow from publisher to subscriber]
Ref. Azure Service Bus
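As a minimal sketch of the publish side, assuming the azure-servicebus Python SDK; the connection string, topic name, and payload are placeholders:

```python
# Minimal sketch: publish a change event from the on-premises side to a Service Bus topic.
# Requires the azure-servicebus package; connection string and topic name are placeholders.
import json
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<service-bus-connection-string>"   # placeholder
TOPIC_NAME = "oracle-changes"                  # illustrative topic name

def publish_change(row: dict) -> None:
    """Send one changed Oracle row as a JSON message to the topic."""
    with ServiceBusClient.from_connection_string(CONN_STR) as client:
        with client.get_topic_sender(topic_name=TOPIC_NAME) as sender:
            sender.send_messages(ServiceBusMessage(json.dumps(row)))

# A subscriber (e.g. an Azure Function with a Service Bus trigger) would then
# pick up each message and apply it to the Azure SQL DB.
publish_change({"id": 42, "status": "UPDATED"})
```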
Welcome to Stack Overflow!
You can copy data from an Oracle database to any supported sink data store. For a list of data stores that are supported as sources or sinks by the copy activity, see the Supported data stores table.
To copy data from and to an Oracle database that isn't publicly accessible, you need to set up a Self-hosted Integration Runtime. The integration runtime provides a built-in Oracle driver. Therefore, you don't need to manually install a driver when you copy data from and to Oracle.
For more details and a step-by-step procedure, refer to Copy data from and to Oracle by using Azure Data Factory.
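For illustration, here is a rough sketch of such a copy activity (expressed as a Python dict mirroring the ADF activity JSON), assuming a self-hosted integration runtime is configured and using illustrative dataset names:

```python
# Sketch of an ADF copy activity (as a Python dict mirroring the activity JSON) that reads
# from Oracle through a self-hosted integration runtime and writes to SQL Server.
# Dataset names and the query are illustrative assumptions.
copy_activity = {
    "name": "CopyOracleToSqlServer",
    "type": "Copy",
    "inputs": [{"referenceName": "OracleSourceDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SqlServerSinkDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {
            "type": "OracleSource",
            "oracleReaderQuery": "SELECT * FROM SALES.ORDERS",  # illustrative query
        },
        "sink": {"type": "SqlSink"},  # SQL Server sink
    },
}
```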
You can also use a custom activity as per your needs. Refer to custom activities.
Hope this helps.
You could copy data incrementally, but there is a frequency limitation. Please see this post: https://social.msdn.microsoft.com/Forums/en-US/54380f98-716b-4a95-88af-cad2ab7e47b5/what-type-of-data-ingestion-does-azure-data-factory-use?forum=AzureDataFactory
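For context, incremental copy in ADF typically follows a watermark pattern: store the highest timestamp loaded so far and select only newer rows on each run. A minimal sketch of the source side follows; the table, column, and parameter names are illustrative assumptions.

```python
# Sketch of the watermark pattern for incremental copy (source side of a copy activity,
# as a Python dict mirroring the ADF JSON). Table, column, and parameter names are
# illustrative assumptions.
incremental_source = {
    "type": "OracleSource",
    "oracleReaderQuery": {
        "type": "Expression",
        "value": (
            "SELECT * FROM SALES.ORDERS "
            "WHERE LAST_MODIFIED > TO_TIMESTAMP('@{pipeline().parameters.last_watermark}', "
            "'YYYY-MM-DD HH24:MI:SS')"
        ),
    },
}
# After each successful run, the pipeline stores MAX(LAST_MODIFIED) as the new watermark
# (e.g. in a control table) so the next run only picks up newer rows.
```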
I've been trying to copy some data from an on-premises SAP BW system to a cloud Azure Data Lake Store. I've already configured the sink as the Data Lake Store, but I'm having trouble configuring the source. I've already downloaded the NetWeaver library, put the DLLs in my System32 folder, and created the integration runtime, which is running on my local machine. Has anyone tried this before?
Thanks
I recommend using the SAP Open Hub function in SAP BW to generate flat files if you do not have any other SAP Data Services or HANA tools in place. The files can then be loaded into Azure Data Lake Storage (which exposes an HDFS-compatible interface); a sketch of the upload step follows the list below.
The reason for this recommendation:
1. SAP BW Open Hub is easy for developers and even for non-SAP-BW people.
2. I do not recommend the NetWeaver RFC (DLL library) approach to integrating with SAP BW, as it mostly uses MDX to read BW data and requires significant coding and understanding of BW metadata.
3. This approach ensures no violation of SAP data distribution licensing.
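As a minimal sketch of that upload step, assuming the azure-storage-file-datalake Python SDK; the account, container, and path names are placeholders:

```python
# Minimal sketch: upload an Open Hub flat-file export into ADLS Gen2.
# Requires azure-storage-file-datalake and azure-identity; names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storageaccount>.dfs.core.windows.net",  # placeholder account
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client("raw")               # illustrative container
file_client = file_system.get_file_client("sap_bw/openhub_export.csv")

with open("openhub_export.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)  # creates or replaces the file
```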
Hope this can help.
Lei
@Lei, the NW library Fabricio was mentioning has to be the nwlibrfc32.dll, which has to be manually injected by customers so that ADF can work with the BW data source. This is the same shared framework through which the Power BI service accesses BW data sources. Technically speaking it's doable, and Fabricio has to ensure the right DLL is injected (32-bit vs. 64-bit). Without an error log we can only speculate.
However, from a solution standpoint, please understand that this is not going to be a performant approach, due to the bottleneck in the underlying MDX engine and the result processor of the BW connector. If the volume is small, there is no issue. Otherwise, we need to review other options. Open Hub could be an option, if the user is OK with managing a set of batch jobs in BW and a separate set of ADF jobs in Azure. From an IT agility standpoint, coupling two sets of operational processes is not the best approach.
Another option to consider is to hold off on ADF and opt for SSIS instead: use SSIS to load SAP data with the Azure Feature Pack, as we do. But this may not be the best approach either, since it means dropping ADF, which is a problem if Fabricio's team has already invested in it. Or maybe they favor SSIS. All in all, there has to be some level of tradeoff towards a sustainable solution.
Back to the original question: please post the error details and we can help investigate.