Synchronizing on-premises DB with Cloud (Azure)

I have a mainframe DB2 database within my corporate network. I have been asked to come up with an approach to create a miniature of this DB in Azure. What would be the best way to implement this? What is the best practice for establishing a reliable and secure synchronization between these two DBs?

There are commercial products that do this sort of thing; Google "DB2 SQL Server synchronization". If you want a simple way to start, the ETL (Extract, Transform, Load) vendors - Informatica, Tibco, SyncSort, etc. - all have variations on this capability.
It is much more challenging than it sounds because the two databases have such different feature sets. While you might get a simple set of tables to work, as soon as you introduce triggers, stored procedures, EBCDIC vs. ASCII issues and so forth, you'll want all the help you can get.

Related

Mapping Data Flow vs SQL Stored Procedure in an ADF pipeline

I have a requirement where I need to choose between Mapping Data Flow vs SQL Stored Procedures in an ADF pipeline to implement some business scenarios. The data volume is not too huge now but might get larger at a later stage.
The business logic is at times complex: I will have to join multiple tables, write subqueries, use window functions, nested CASE statements, etc.
All of my business requirements could easily be implemented through an SP, but there is a slight inclination towards Mapping Data Flow, considering that it runs Spark underneath and can scale up as required.
Does ADF Mapping Data Flow have an upper hand over SQL stored procedures when used in an ADF pipeline?
Some of the concerns that I have with the mapping data flow are as below.
The time taken to implement complex logic using data flows is much greater than with a stored procedure.
The execution time for a mapping data flow is much higher, considering the time it takes to spin up the Spark cluster.
Now, if I decide to use SQL SPs in the pipeline, what could be the disadvantages?
Would there be issues with the scalability if the data volume grows rapidly at some point in time?
This is kind of an opinion question, which doesn't tend to do well on Stack Overflow, but the fact that you're comparing Mapping Data Flows with stored procs tells me that you have Azure SQL Database (or similar) and Azure Data Factory (ADF) in your architecture.
If you think about the fact that Mapping Data Flows is backed by Spark clusters, and you already have Azure SQL DB, then what you really have is two types of compute. So why have both? There's nothing better than SQL at doing joins, nested queries, etc. Azure SQL DB can easily be scaled up and down (e.g. via its REST API) - that seemed to be one of your points.
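As a rough illustration of that scale call (a sketch only, not a definitive implementation: the subscription, resource group, server and database names are placeholders, and the api-version may need updating), a PATCH against the ARM REST API changes the database SKU:

// Sketch: scale an Azure SQL database up or down through the ARM REST API.
// Requires an Azure AD bearer token (e.g. from a service principal); uses Node 18+ global fetch.
async function scaleAzureSqlDatabase(bearerToken: string, targetSku: string): Promise<void> {
  const url =
    "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>" +
    "/providers/Microsoft.Sql/servers/<server>/databases/<database>?api-version=2021-11-01";

  const response = await fetch(url, {
    method: "PATCH",
    headers: {
      Authorization: `Bearer ${bearerToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ sku: { name: targetSku } }), // e.g. "S1" to scale down, "P2" to scale up
  });

  if (!response.ok) {
    throw new Error(`Scale request failed: ${response.status} ${await response.text()}`);
  }
}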
Having said that, Mapping Data Flows is powerful and offers a nice low-code experience. So if your requirement is to have low-code with powerful transforms, then it could be a good choice. Just bear in mind that if your data is already in a database and you're using Mapping Data Flows, what you're doing is taking data out of SQL, up into a Spark cluster, processing it, then pushing it back down. This seems like duplication to me, and I reserve Mapping Data Flows (and Databricks notebooks) for things I cannot already do in SQL; advanced analytics, hard maths, and complex string manipulation might be good candidates. Another use case might be work offloading, where you deliberately want to offload work from your DB. Just remember the cost implication of having two types of compute running at the same time.
I also saw an example recently where someone had implemented a slowly changing dimension type 2 (SCD2) using Mapping Data Flows but had used 20+ different MDF components to do it. To me this is low-code in name only: high complexity, hard to maintain and debug. The same process can be done with a single MERGE statement in SQL.
So my personal view is, use Mapping Data Flows for things that you can't already do with SQL, particularly when you already have SQL databases in your architecture. I personally prefer an ELT pattern, using ADF for orchestration (not MDF) which I regard as easier to maintain.
Some other questions you might ask are:
what skills do your team have? SQL is a fairly common skill. MDF is still low-code but niche.
what skills do your support team have? Are you going to train them on MDF when you hand this over?
how would you rate the complexity and maintainability of the two approaches, given the above?
HTH
One disadvantage of using SPs in your pipeline is that your SP will run directly against the database server. So if you have any other queries, transactions, or jobs running against the DB at the same time your SP is executing, you may experience longer run times for each (depending on query complexity, records read, etc.). This issue could compound as data volume grows.
We have decided to use SPs in our organization instead of Mapping Data Flows. The cluster spin-up time was an issue for us as we scaled up. To address the issue I mentioned previously with SPs, we stagger our workload and schedule jobs to run during off-peak hours.

Data access layer patterns using Azure Functions

We are currently working on a design using Azure functions with Azure storage queue binding.
Each message in the queue represents a complete transaction. An Azure function will be bound to that queue so that the function will be triggered as soon as there is a new message in the queue.
The function will then commit the transaction in a SQL DB.
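A simplified sketch of that flow (TypeScript and the mssql package here are only for illustration; the table, columns and message shape are made up):

import { AzureFunction, Context } from "@azure/functions";
import * as sql from "mssql";

// Triggered by a new queue message (binding declared in function.json); the
// message is treated as one complete transaction against the SQL DB.
const processOrder: AzureFunction = async function (context: Context, message: any): Promise<void> {
  const pool = await sql.connect(process.env.SQL_CONNECTION_STRING as string);
  const transaction = new sql.Transaction(pool);
  await transaction.begin();
  try {
    await new sql.Request(transaction)
      .input("orderId", sql.Int, message.orderId)
      .input("amount", sql.Decimal(10, 2), message.amount)
      .query("INSERT INTO Orders (OrderId, Amount) VALUES (@orderId, @amount)");
    await transaction.commit();
  } catch (err) {
    await transaction.rollback();
    throw err; // rethrow so the runtime can retry / poison-queue the message
  }
};

export default processOrder;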
The first-cut implementation is also complete, and it's working fine. However, in retrospect, we are considering the following:
In a typical DAL, there are well-established design patterns using Entity Framework, repository patterns, etc. However, we didn't find similar guidance/best practices for implementing a DAL within serverless code.
Therefore, my question is: should such patterns be implemented with Azure Functions (this would be challenging :) ), or should the serverless code be kept as light as possible, or is this not a use case for Azure Functions at all?
It doesn't take anything too special. We're using a routine set of library DLLs for all kinds of things -- database access, interacting with other parts of Azure (like retrieving Key Vault secrets for connection strings), parsing file uploads, business rules, and so on. The libraries target netstandard2.0 so we can more easily migrate to Functions v2 when the right triggers become available.
Mainly just design your libraries so they're highly modularized, so you can minimize how much you load to get the job done (assuming reuse in other areas of the system is important, which it usually is).
It would be easier if dependency injection was available today. See this for a few ways some of us have hacked it together until we get official DI support. (DI is on the roadmap for Functions, I believe the 3.0 release.)
At first I was a little worried about startup time with the library approach, but the underlying WebJobs stack itself is already pretty heavy, and Functions startup performance seems to vary wildly anyway (on the cheaper tiers, at least). During testing, one of our infrequently executed Functions has varied from just ~300ms to a peak of about ~3800ms to parse the exact same test file, with all but ~55ms spent on startup.
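A minimal sketch of that "create once, reuse on warm invocations" idea, shown for the Node worker (the answer above is .NET-based; the vault URL and secret name are placeholders):

import { DefaultAzureCredential } from "@azure/identity";
import { SecretClient } from "@azure/keyvault-secrets";

// Module-scope client: created once per worker process and reused across warm
// invocations, rather than being rebuilt inside every function execution.
const secrets = new SecretClient(
  "https://<vault-name>.vault.azure.net",
  new DefaultAzureCredential()
);

let cachedConnectionString: string | undefined;

export async function getConnectionString(): Promise<string> {
  if (!cachedConnectionString) {
    const secret = await secrets.getSecret("sql-connection-string");
    cachedConnectionString = secret.value ?? "";
  }
  return cachedConnectionString;
}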
should such patterns be implemented with Azure Functions (this would be challenging :) ), or should the serverless code be kept as light as possible, or is this not a use case for Azure Functions at all?
My answer is NO.
There should be patterns to follow, but the traditional repository patterns and CRUD operations do not seem to be valid in the cloud era.
Many strong concepts we were raised to adhere to have become invalid these days.
Denormalizing the database has become not only acceptable but preferable.
Which pattern to design will depend on the database you selected for your solution, and also on the type of your application and the type of your data.
Here is a link to general guidelines for Table Storage design.
Is your application read-heavy or write-heavy? The design will vary accordingly.
Are you using Azure Tables or Mongo? There are design decisions based on that: indexing is important in Mongo, while in Azure Tables there is none you can define.
Sharding considerations.
Redundancy considerations.
In modern development and architecture many principles have changed; each microservice has its own database, which might be totally different from any other microservice's.
If you read through the guidelines I provided, you will see what I mean.
Designing your Table service solution to be read efficient:
Design for querying in read-heavy applications. When you are designing your tables, think about the queries (especially the latency sensitive ones) that you will execute before you think about how you will update your entities. This typically results in an efficient and performant solution.
Specify both PartitionKey and RowKey in your queries. Point queries such as these are the most efficient table service queries.
Consider storing duplicate copies of entities. Table storage is cheap so consider storing the same entity multiple times (with different keys) to enable more efficient queries.
Consider denormalizing your data. Table storage is cheap so consider denormalizing your data. For example, store summary entities so that queries for aggregate data only need to access a single entity.
Use compound key values. The only keys you have are PartitionKey and RowKey. For example, use compound key values to enable alternate keyed access paths to entities.
Use query projection. You can reduce the amount of data that you transfer over the network by using queries that select just the fields you need.
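As a rough sketch of the point-query and projection advice above, using the @azure/data-tables SDK (table, key and property names are illustrative):

import { TableClient } from "@azure/data-tables";

// One client per table; the connection string comes from app settings.
const orders = TableClient.fromConnectionString(
  process.env.STORAGE_CONNECTION_STRING as string,
  "Orders"
);

async function getOrderSummary(customerId: string, orderId: string) {
  // Point query: PartitionKey and RowKey are both specified, and only the
  // properties we actually need are projected back over the wire.
  return orders.getEntity(customerId, orderId, {
    queryOptions: { select: ["total", "orderDate"] },
  });
}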
Designing your Table service solution to be write efficient:
Do not create hot partitions. Choose keys that enable you to spread your requests across multiple partitions at any point of time.
Avoid spikes in traffic. Smooth the traffic over a reasonable period of time and avoid spikes in traffic.
Don't necessarily create a separate table for each type of entity. When you require atomic transactions across entity types, you can store these multiple entity types in the same partition in the same table.
Consider the maximum throughput you must achieve. You must be aware of the scalability targets for the Table service and ensure that your design will not cause you to exceed them.
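And a sketch of the "same partition, same table" advice for atomic writes, again with @azure/data-tables (entity shapes are illustrative):

import { TableClient, TransactionAction } from "@azure/data-tables";

async function saveOrderAndSummary(orders: TableClient): Promise<void> {
  const partitionKey = "customer-1001"; // both entity types share one partition
  const actions: TransactionAction[] = [
    ["create", { partitionKey, rowKey: "order-000123", total: 42.5 }],
    ["upsert", { partitionKey, rowKey: "summary-2024-06", total: 1042.5 }],
  ];
  // Entity group transaction: all actions succeed or fail together because
  // they target the same partition of the same table.
  await orders.submitTransaction(actions);
}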

Benefits of using a hosted search service over building your own

I'm building a B2B Node app which has heavily related data models. We currently have our own search queries, but as we scale some of the queries appear to be becoming sluggish.
We will need to support multilingual search as well as content-based searches (searching matching content within related data).
The queries are growing more and more complicated (each has multiple joins on joins on joins) and I'm now considering a hosted search tool such as Algolia.
Given my concerns below, why should I use a hosted cloud search service rather than continue building my own queries?
Data privacy is important
Data is hosted in our own Postgres DB - integrations with that are important (e.g., will I now need to manually maintain our DB data and the data in Algolia?)
Speed will be important, but not so much now
Must be able to do content-based searches across multiple languages
We are a tiny team of devs now, so dev resource time is vital
What other things should I be concerned about that could help me make a decision on search capabilities?
Regarding maintenance of both DB and Cloud data, it seems it's as simple as getting all data, caching it, and storing it in the cloud:
var algoliasearch = require('algoliasearch');
// App ID and admin API key are placeholders
var client = algoliasearch('YOUR_APP_ID', 'YOUR_ADMIN_API_KEY');
var index = client.initIndex('contacts');
var contactsJSON = require('./contacts.json');

// Push the cached records into the Algolia index
index.addObjects(contactsJSON, function (err, content) {
  if (err) {
    console.error(err);
  }
});
Search services like Algolia or self-hosted Elasticsearch/Solr operate as full-text search, not relational DB queries.
But it sounds like the bottleneck is the continual re-joining. If you can make your relational data act like a full-text document DB, then that could be a more efficient type of index (pre-joined, in a sense).
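A sketch of what that pre-joining might look like before documents are handed to the search engine (the row shapes here are made up):

interface OrderRow { id: number; customerName: string; }
interface LineItemRow { orderId: number; productName: string; description: string; }

// Flatten an order and its related rows into a single denormalized document,
// so the search index never has to "join" anything at query time.
function toSearchDocument(order: OrderRow, items: LineItemRow[]) {
  return {
    objectID: String(order.id),
    customer: order.customerName,
    products: items.map(i => i.productName),
    // one concatenated text field gives full-text queries a single target
    content: items.map(i => `${i.productName} ${i.description}`).join(" "),
  };
}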
You might also look into views, or a data warehouse (maybe star schema).
But if you are going the search route, maybe investigate hosting your own Elasticsearch.
You could specify database, schema, SQL, index, and query details if you want more help.
Full Disclosure: I founded a company called SearchStax on the premise that companies and developers should not spend time setting up, managing, scaling or building tools for the search infrastructure (ops) - they are better off investing time of their employees into building value for the company, whether that be features, capabilities, product or customers.
Open-source search solutions built on top of Lucene (Apache Solr / Elasticsearch) have what you need now, and what you might need in the near future, from a search-engine capability perspective. Find a mature as-a-service provider that specializes in open-source search and let them deal with all of it. It may look like a small effort right now, but it's probably not worth your devs' time and effort to run the operations side of that.
For your concerns mentioned above:
Data privacy is important
Your concerns around privacy and security are addressable. There are multiple ways you can secure your Solr environment, and the right MSP or managed solution provider should be able to address those.
a. Security at the transport layer can be addressed by SSL certificates. All the data going over the wire is encrypted.
b. IP Filtering and User Based Authentication should address who has access to what. Solr-as-a-Service offering by Measured Search supports both.
c. Security at rest can be addressed in multiple ways - OS level / File encryption, but you can even go further by ensuring not even your services provider has access to that data by using Searchable Encryption technology.
Privacy concerns are all addressed by terms and conditions - I am sure your legal department will address that from a service provider's perspective.
Data is hosted in our own postgres DB - integrations with that are important
Solr provides the ability to import data directly from a traditional relational database (MySQL, Postgres, Oracle, etc.) through its Data Import Handler (DIH). You can either use that so Solr pulls data periodically, or write your own simple script to push data through the Solr APIs.
If you are hosted in the cloud (AWS) and you choose to go the DIH route, a tunnel can be created so that only the Solr deployments can pull data from your servers, and your database servers are not exposed to the world.
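A minimal sketch of the push option, assuming Solr's JSON update handler and Node's built-in fetch (host and core name are placeholders):

// Push a batch of JSON documents to Solr's update handler over HTTP.
async function pushToSolr(docs: object[]): Promise<void> {
  const response = await fetch("https://<solr-host>:8983/solr/<core>/update?commit=true", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(docs), // Solr accepts a plain JSON array of documents
  });
  if (!response.ok) {
    throw new Error(`Solr update failed: ${response.status}`);
  }
}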
Speed will be important, but not so much now
Solr is built for search speed - I don't think that's where your problems are going to be. With a service offering like Measured Search's, you can spin up a cluster in any data center supported by AWS or Azure and make sure your search deployments are close to your application servers, so the latency overhead is minimal.
Must be able to do content-based searches across multiple languages
Yes, Solr supports that. More than 30 languages.
We are a tiny team of devs now, so dev resource time is vital
I am biased here, but I would not have my developers spend much time on operations; I would let them focus on what they do best: building great product capabilities that push the limits and deliver business value.
If you are interested in doing a comparison and ROI of doing it yourself vs using a solr-as-a-service like offered by SearchStax, check this paper out - https://www.searchstax.com/white-papers/why-measured-search-is-better-than-diy-solr-infrastructure/

Azure Table Storage - Entity Design Best Practices Question

I'm writing a 'proof of concept' application to investigate the possibility of moving a bespoke ASP.NET ecommerce system over to Windows Azure during a necessary rewrite of the entire application.
I'm tempted to look at using Azure Table Storage as an alternative to SQL Azure, as the entities being stored are likely to change their schema (properties) over time as the application matures further, and I won't need to make endless database schema changes. In addition, we can build referential integrity into the application code - so the case for considering Azure Table Storage is a strong one.
The only potential issue I can see at this time is that we do a small amount of simple reporting - i.e. value of sales between two dates, number of items sold for a particular product, etc.
I know that Table Storage doesn't support aggregate-type functions, and I believe we can achieve what we want with clever use of partitions, multiple entity types to store subsets of the same data, and possibly pre-aggregation, but I'm not 100% sure how to go about it.
Does anyone know of any in-depth documents about Azure Table Storage design principles, so that we can make proper and efficient use of tables, PartitionKeys, entity design, etc.?
There are a few simplistic documents around, and the current books available tend not to go into this subject in much depth.
FYI - the ecommerce site has about 25,000 customers and takes about 100,000 orders per year.
Have you seen this post?
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
Pretty thorough coverage of tables
I think there are three potential issues in porting your app to Table Storage.
The lack of reporting - including aggregate functions - which you've already identified
The limited availability of transaction support - with 100,000 orders per year I think you'll end up missing this support.
Some problems with costs - $1 per million operations is only a small cost, but you may need to factor this in if you get a lot of page views.
Honestly, I think a hybrid approach might work best - perhaps EF or NH to SQL Azure for the critical data, with large objects stored in Table/Blob storage?
Enough of my opinion! For "in depth":
try the storage team's blog http://blogs.msdn.com/b/windowsazurestorage/ - I've found this very good
try the PDC sessions from Jai Haridas (couldn't spot a link - but I'm sure it's still there)
try articles inside Eric's book - http://geekswithblogs.net/iupdateable/archive/2010/06/23/free-96-page-book---windows-azure-platform-articles-from.aspx
there's some very good best-practice-based advice on http://azurescope.cloudapp.net/ - but this is somewhat performance-oriented
If you have started looking at Azure storage such as Table Storage, it would do no harm to look at other NoSQL offerings in the market (especially document databases). This would give you insight into the NoSQL space and how solutions around such storage are designed.
You can also think about a hybrid approach of a SQL DB plus a NoSQL solution. Parts of the system may lend themselves very well to the Azure Table Storage model.
NoSQL solutions such as Azure Tables have their own challenges, such as:
Schema changes for data
Transactional support
ACID constraints
All table design papers I have seen are pretty much exclusively focused on the topics of scalability and search performance. I have not seen anything related to design considerations for reporting or BI.
Now, Azure Tables are accessible through REST APIs and via the Azure SDK. Depending on what reporting you need, you might be able to pull out the information you require with minimal effort. If your reporting requirements are very sophisticated, then perhaps SQL Azure together with Windows Azure SQL Reporting Services might be a better option to consider.
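For example, a simple date-range pull plus client-side aggregation with the @azure/data-tables SDK might look roughly like this (the key scheme, where PartitionKey is the customer and RowKey encodes the order date, is an assumption):

import { TableClient, odata } from "@azure/data-tables";

// Sum order totals for one customer over a date range, assuming RowKey is an
// ISO-style order date so it can be range-filtered lexicographically.
async function salesBetween(orders: TableClient, customerId: string, from: string, to: string) {
  let total = 0;
  const entities = orders.listEntities<{ total: number }>({
    queryOptions: {
      filter: odata`PartitionKey eq ${customerId} and RowKey ge ${from} and RowKey le ${to}`,
    },
  });
  for await (const entity of entities) {
    total += entity.total; // aggregation happens client-side; Table Storage has no SUM
  }
  return total;
}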

What design decisions can I make today, that would make a migration to Azure and Azure Tables easier later?

I'm rebuilding an application from the ground up. At some point in the future (not sure if it's near or far yet) I'd like to move it to Azure. What decisions can I make today that will make that migration easier?
I'm going to be dealing with large amounts of data, and I like the idea of Azure Tables. Are there some specific persistence choices I can make now that will mimic Azure Tables, so that when the time comes the pain of migration will be lessened?
A good place to start is the Windows Azure Guidance.
If you want to use Azure Tables eventually, you could design your database so that every table is just a primary key plus a field containing XML data.
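One way to sketch that idea (the interface and names are illustrative) is to hide the key-plus-payload shape behind a small abstraction that a SQL implementation satisfies today and a Table Storage implementation can satisfy later:

interface KeyValueRecord {
  partitionKey: string; // a plain column today, PartitionKey after migration
  rowKey: string;       // a plain column today, RowKey after migration
  payload: string;      // serialized XML/JSON holding the entity's actual fields
}

interface KeyValueStore {
  get(partitionKey: string, rowKey: string): Promise<KeyValueRecord | undefined>;
  put(record: KeyValueRecord): Promise<void>;
}

// Today: implement KeyValueStore against SQL. Later: swap in an implementation
// backed by Azure Table Storage without touching the calling code.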
I would advise planning along the lines of almost-infinitely scalable solutions (see Pat Helland's paper "Life beyond Distributed Transactions") and the CQRS approach in general. This way you'll be able to avoid common pitfalls of distributed apps in general and the peculiarities of Azure Table Storage in particular.
This approach really helps us work with Azure and cloud computing at Lokad (our data sets are quite large, and various levels of scalability are needed).
