Achieving Master Data deduplication on Azure

I am looking at achieving Master Data deduplication based on match percentages in Azure SQL DB. I was looking for something equivalent to Master Data Services / DQS (Data Quality Services) in SQL Server 2012:
https://channel9.msdn.com/posts/SQL11UPD05-REC-06
Broadly, I am looking for control over match rules (exact, close match, etc.), handling of dependencies, and an audit trail (undo capability, etc.).
I reckon this must be available in the Azure cloud if it is available in SQL Server. Could you please point me to how I can get this done on Azure SQL DB?
Please note: I am NOT looking for data sources like MelissaData or D&B that are listed on the Azure Marketplace.

Master Data Services is not just a database process: it also centrally involves a website component, which still (as of 2021) requires some Windows server running IIS.
This can be an Azure Virtual Machine (link to documentation) but there is no serverless offering for this at this time.
The database itself can be hosted on an Azure SQL Managed Instance (link to documentation) but not on a standalone Azure SQL DB, as far as I can tell. This is presumably because some of the essential components of MDS sit outside the database, much as other services such as SSIS are more than just a database.
Data Quality Services is a similar story: it uses three databases (link to documentation) and seemingly some components outside the databases, so it wouldn't be possible to deploy in standalone Azure SQL DBs. It may be possible to run on a Managed Instance; I couldn't find a clear answer to that. And again, there is no fully serverless offering at this time.
Of course, all of this can easily be run via IaaS (Infrastructure as a Service) using an Azure virtual machine running SQL Server.

Related

What is a good way of working when moving from on-premises to the Azure PaaS model?

We plan to migrate to the Azure cloud and would like to know the recommended way of working, especially when the teams are used to working with dedicated UAT/DEV servers.
We have about 10 physical UAT database servers, each used for a different purpose: some are used for integration testing, a few for user acceptance testing, others for chain testing, and so on.
We plan to use services such as Azure Data Factory (SSIS IR), Azure SQL Managed Instance, Virtual machines for custom applications
Creating the same number of database servers in the cloud would be too expensive. How can we best deal with such a scenario, and how is it handled within your teams? Any pointers are appreciated.
As the use case for the migration is lift and shift, and since you have multiple cross-database queries, SQL Managed Instance would be the best fit (a quick illustration follows below).
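To illustrate the point (database and table names here are hypothetical): on a Managed Instance the familiar three-part-name syntax keeps working exactly as it does on-premises, which is what makes it a natural lift-and-shift target.

    -- Cross-database query using three-part names; works on SQL Server
    -- and SQL Managed Instance, but not on a standalone Azure SQL DB.
    SELECT o.OrderId, c.CustomerName
    FROM   SalesDb.dbo.Orders  AS o
    JOIN   CrmDb.dbo.Customers AS c
           ON c.CustomerId = o.CustomerId;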
Since SQL MI is very costly, having separate MI instances for different environments would prove to be very expensive (especially with fewer than 10 databases per instance).
I would suggest having two SQL MI instances:
one for the lower environments (Dev, Test, UAT) and one for the Prod environment.
Have database names like xyz_env, and create SQLCMD variables within the DACPAC to parameterize deployments, avoiding manual code changes for each database deployment (a sketch follows below).
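As a rough sketch of that parameterization (the variable name Env and the seed table are made up for illustration): a SQLCMD variable declared in the database project can be referenced in a post-deployment script and supplied at publish time.

    -- Post-deployment script inside the DACPAC project.
    -- $(Env) is a SQLCMD variable passed at publish time, e.g.:
    --   sqlpackage /Action:Publish /TargetDatabaseName:xyz_uat /v:Env=uat ...
    IF '$(Env)' <> 'prod'
    BEGIN
        -- Seed reference/test data only in non-production environments.
        INSERT INTO dbo.AppConfig (KeyName, KeyValue)
        VALUES ('Environment', '$(Env)');
    END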
Alternatively, you could leverage Azure SQL Database and the concept of elastic queries (for the cross-database queries). This takes some additional effort, but it is worth it to gain the complete benefits of PaaS: there is no VNet to manage, the databases are directly accessible from other Azure resources, and it would be very cost-effective. A minimal elastic-query setup is sketched below.
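For reference, an elastic-query setup in the "head" database looks roughly like this (server, credential, and table names are placeholders):

    -- One-time setup in the database that issues the cross-database query.
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    CREATE DATABASE SCOPED CREDENTIAL RemoteCred
    WITH IDENTITY = '<sql login>', SECRET = '<password>';

    CREATE EXTERNAL DATA SOURCE RemoteDb
    WITH (
        TYPE = RDBMS,
        LOCATION = 'yourserver.database.windows.net',
        DATABASE_NAME = 'xyz_uat',
        CREDENTIAL = RemoteCred
    );

    -- Mirror the remote table's schema as an external table...
    CREATE EXTERNAL TABLE dbo.Customers (
        CustomerId   INT,
        CustomerName NVARCHAR(100)
    )
    WITH (DATA_SOURCE = RemoteDb);

    -- ...and query it as if it were local.
    SELECT TOP (10) * FROM dbo.Customers;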

Microsoft Assessment and Planning Toolkit doesn't discover MSSQL on Azure VMs

I need to run discovery on many instances of SQL Server with 100s of databases running on Azure VMs.
The Microsoft Assessment and Planning Toolkit seems a great tool for that, and it works fine with on-premises VMs, but it doesn't discover MSSQL on Azure VMs. The Azure VMs are joined to the local AD domain, with DCs running in the same VNet.
I tried AD discovery as well as a manual IP range and computer names. It does detect the machines (with unknown host type), but gives empty results in SQL Server discovery: all object counters (WMI, SQL, Registry) are zero. All ports are open inside the VNet.
I can't find any source that explains such a limitation.
I was in the same situation, and I can confirm that the Microsoft Assessment and Planning Toolkit still works as of May 2021.
The problem I had is that the user I was using for the discovery was a member of the sysadmin role in some domains but not in all of them.
As a result, in some domains the MAP toolkit was returning plenty of instances; in other domains, nothing.
So basically the risk is:
If the user you are using has maximum privileges in that domain, the MAP toolkit will discover everything.
If the user you are using has some privileges in that domain, the MAP toolkit will discover something here and there; for example, it will not see SQL Server instances that the account cannot access.
If the user you are using has WMI access but no SQL access in that domain, expect the MAP toolkit to enumerate the SQL Server instances but without meaningful information like database size, collation, SQL Server version, etc. (it can vary by privilege). A quick permission check is sketched after this list.
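One way to verify what the discovery account can actually see is to run something like this against each instance as that account (a sketch only; MAP may require more than these permissions):

    -- Run as the MAP discovery account on each SQL Server instance.
    SELECT
        SUSER_SNAME()                AS discovery_login,
        IS_SRVROLEMEMBER('sysadmin') AS is_sysadmin,  -- 1 = full inventory expected
        HAS_PERMS_BY_NAME(NULL, NULL, 'VIEW SERVER STATE')
                                     AS has_view_server_state;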
But I can confirm that I'm using it to discover and assess a SQL Server estate on Azure VMs.
So discuss this with the AD manager.
I also wrote a post about it: https://www.jeeja.biz/2021/07/08/how-to-discover-sql-server-instances-on-azure-vms/

Cheapest way to host MongoDB on Azure

We have been developing a RESTful web api using node and MongoDB. For hosting options, we decided to use Azure through BizSpark. We used DocumentDB with protocol support for MongoDB.
The problem now is that DocumentDB is consuming all the credit, causing downtime, and we haven't started making money yet. We are now considering switching from DocumentDB to MongoDB. The question becomes: what is the cheapest way to host MongoDB on Azure?
So far on our research, we have found two options:
Using a VM (Linux or Windows)
Using a worker role
Please advise if there are other options, and how easy it would be to switch between these options at a later stage.
You can use the Azure pricing calculator to get estimates for DocumentDB and for a VM with the settings your company needs, to see which one is cheaper.
If you are using BizSpark, remember that you have 5 accounts across which you can distribute all your costs to optimize them better.
Personal recommendations (subjective view):
If you use the PaaS solution (DocumentDB), you get full functionality out of the box: you don't have to set anything up, you can scale it very easily, and it plugs into very powerful tools like Power BI.
In the case of the IaaS solution (VMs), you have to install, maintain, and create all the connection settings for this to work yourself. If you want to scale, you have to be more dedicated, since you scale through the use of more VMs, traffic managers, and a more robust architecture. If this is the path you are taking, I would recommend using containers like Docker inside the VM and their power to manage this.

Is it possible to deploy an application using a Cassandra database on Windows Azure?

I recently got a trial version of Windows Azure and wanted to know if there is any way I can deploy an application using Cassandra.
I can't speak specifically to Cassandra working or not in Azure, unfortunately. That's likely a question for that product's development team.
But the challenge you'll face with this, MySQL, or any other role-hosted database is persistence. Azure Roles are in and of themselves not persistent, so whatever back-end store Cassandra is using would need to be placed onto something like an Azure Drive (which is persisted to Azure Blob Storage). However, this would limit the scalability of the solution.
Basically, you run Cassandra as a worker role in Azure. Then, you can mount an Azure drive when a worker starts up and unmount when it shuts down.
This provides some insight re: how to use Cassandra on Azure: http://things.smarx.com/#Run Cassandra
Some help w/ Azure drives: http://azurescope.cloudapp.net/CodeSamples/cs/792ce345-256b-4230-a62f-903f79c63a67/
This should not limit your scalability at all. Just spin up another Cassandra instance whenever processing throughput or contiguous storage become an issue.
You might want to check out AppHarbor. AppHarbor is a .NET PaaS built on top of Amazon. It gives users the portability and infrastructure of Amazon, and they provide a number of the rich services that Azure offers, such as background tasks and load balancing, plus some that it doesn't, like third-party add-ons, dead-simple deployment, and more. They already have add-ons for CouchDB, MongoDB, and Redis; if Cassandra got high enough on the requested-features list, I'm sure they could set it up.

Minimize downtime in Azure

We are experiencing a very serious unscheduled downtime of our Azure application today for what is now coming up to 9 hours. We reported to Azure support and the ops team is actively trying to fix the problem and I do not doubt that. We managed to get our application running on another "test" hosted service that we have and redirected our CNAME to point at the instance so our customers are happy, but the "main" hosted service is still unavailable.
My own "finger in the air" instinct is that the issue is network related within our data center (west europe), and indeed, later on in the day the service dash board has gone red for that region with a message to that effect. (Our application is showing as "Healthy" in the portal, but is unreachable via our cloudapp.net URL. Additionally threads within our application are logging sql connection exceptions into our storage account as it cannot contact the DB)
What is very strange, though, is that the "test" instance I referred to above is also in the same data centre and has no issues contacting the DB and it's external endpoint is fully available.
I would like to ask the community if there is anything that I could have done better to avoid this downtime? I obeyed the guidance with respect to having at least 2 roles instances per role, yet I still got burned. Should I move to a more reliable data centre? Should I deploy my application to multiple data centres? How would I manage the fact that my SQL-Azure DB is in the same datacentre?
Any constructive guidance would be appreciated - being a techie, I've never had a more frustrating day being able to do nothing to help fix the issue.
There was an outage in the European data center today with respect to SQL Azure. Some of our clients got hit and had to move to another data center.
If you are running mission-critical applications that cannot be down, I would deploy the application into multiple regions. DNS resolution is obviously a weak link right now in Azure, but it can be worked around (if you only run a website, it can be done very simply using Response.Redirect or similar).
Now, there is a data synchronization service from Microsoft that will sync up multiple SQL Azure databases. Check here. This way, you can have mirror sites up in different regions and keep them in sync from the SQL Azure perspective.
It would also be a good idea to employ a third-party monitoring service that detects problems with your deployed instances externally. AzureWatch can notify you, or even deploy new nodes if you so choose, when some of the instances turn "Unresponsive".
Hope this helps
I can offer some guidance based on our experience:
Host your application in multiple data centers, complete with SQL Azure databases. You can connect each application to its data-center-specific SQL Server. You can also cache any external assets (images/JS/CSS) on the data-center-specific Windows Azure machine or leverage Azure Blob Storage. Note: extra costs will be incurred.
Set up one-way SQL replication between your primary SQL Azure DB and the instance in the other data center. If you want to do bidirectional replication, take a look at the MSDN site for guidance.
Leverage Azure Traffic Manager to route traffic to the data center closest to the user. It has geo-detection capabilities, which will also improve the latency of your application: you can map http://myapp.com to the internal URL of your data center, and a user in Europe should automatically get redirected to the European data center, and vice versa for the USA. Note: at the time of writing this post, there is no way to automatically detect a failure and fail over to another data center. Manual steps will be involved once a failure is detected, and failover is a complete set (i.e., you will fail over both the Windows Azure AND SQL Azure instances). If you want micro-level failover, then I suggest putting all your config in the service config file and encrypting the values, so you can edit the connection string to connect instance X to DB Y.
You are all set now. I would create or install a local application to detect the availability of the site. A better solution would be to check for the availability of application-specific components by writing a diagnostic page or web service and then polling it from a local computer.
HTH
As you're deploying to Azure, you don't have much control over how SQL Server is set up. MS have already set it up so that it is highly available.
Having said that, it seems that MS has been having some issues with SQL Azure over the last few days. We've been told that it only affected "a small number of users". At one point the service dashboard had 5 data centres affected by a problem. I had 3 databases in one of those data centres down twice for about an hour each time, but one database in another affected data centre had no interruption.
If having a database connection is critical to your app, then the only way in the Azure environment to insure against problems that MS haven't prepared for (this latest technical problem, earthquakes, meteor strikes) would be to co-locate your SQL data in another data centre. At the moment the most practical way to do this is to use the Sync Framework. There is an ability to copy SQL Azure databases (see the snippet below), but this only works within a data centre. With your data located elsewhere, you could then point your app at the new database if the main one becomes unavailable.
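For completeness, the database-copy mechanism mentioned above is plain T-SQL (names are placeholders; at the time of this answer, copies only worked within the same data centre):

    -- Run in the master database of the destination server.
    CREATE DATABASE MyAppDb_Copy AS COPY OF sourceserver.MyAppDb;

    -- The copy is asynchronous; poll until state_desc reads 'ONLINE'.
    SELECT name, state_desc
    FROM sys.databases
    WHERE name = 'MyAppDb_Copy';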
While this looks good on paper though, this may not have helped you with the latest problem as it did affect multiple data centres. If you'd just been making database copies on a regular basis, that might have been enough to get you through. Or not.
(I would have posted this answer on Server Fault, but I couldn't find the question.)
This is just about a programming/architecture issue, but you may also want to ask the question on webmasters.stackexchange.com.
You need to find out the root cause before drawing any conclusions.
However, my guess is that one of two things was the problem:
The ISP connectivity differs between the test system and your production system. Either they use different ISPs, or different lines from the same ISP. When I worked in a hosting company, we made sure that our IP connectivity went through at least two different ISPs who did not share fibre to our premises (and, where we could, they had different physical routes to the building; the homing ability of backhoes when there's a critical piece of fibre to dig up is well proven).
Your data centre had an issue with some shared production infrastructure. This might be edge routers, firewalls, load balancers, intrusion detection systems, traffic shapers, etc. These are typically installed only on production systems. Defences here involve understanding the architecture and making sure the provider has a (tested!) DR plan for restoring SOME service when things go pear-shaped. The neatest hack I saw here was persuading an IPS (intrusion prevention system) that its own management servers were malicious, so it couldn't be reconfigured at all.
Just a thought: your DC doesn't host any of the WikiLeaks mirrors, or PayPal/Mastercard/Amazon (who are getting DDoS'd by WikiLeaks supporters at the moment)?
