We are evaluating ksqlDB as an ETL tool for our organization. Our entire app is hosted on Microsoft Azure, and PaaS offerings are generally preferred in our organization. One use case is that we have multiple microservices, each with its own database, and we want to join tables across those databases to produce data in a denormalized format for other tasks. For example, a Users table contains user data while an Orders table contains all the orders; Users may live in MySQL while Orders may live in MongoDB. We now need to generate a report by joining Orders and Users on user_id. This can be done in ksqlDB by adding a source connector for each database and joining the resulting streams/tables. We can then write a sink connector to a new MongoDB database that holds the joined Users_Orders data. So as long as the connectors and joins are running, our joined data in Users_Orders will also stay up to date.
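To make the idea concrete, a minimal ksqlDB sketch of such a pipeline could look like the following. The topic names, columns, and connector settings are hypothetical, and it assumes source connectors (created the same way with CREATE SOURCE CONNECTOR) are already landing the MySQL Users table and the MongoDB Orders collection in the 'users' and 'orders' topics:

-- Users as a ksqlDB TABLE keyed by user_id, read from the topic fed by the MySQL source connector
CREATE TABLE users (user_id VARCHAR PRIMARY KEY, name VARCHAR, email VARCHAR)
  WITH (KAFKA_TOPIC = 'users', VALUE_FORMAT = 'JSON');

-- Orders as a STREAM, read from the topic fed by the MongoDB source connector
CREATE STREAM orders (order_id VARCHAR KEY, user_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC = 'orders', VALUE_FORMAT = 'JSON');

-- Stream-table join producing the denormalized Users_Orders data as a new topic
CREATE STREAM users_orders WITH (KAFKA_TOPIC = 'users_orders') AS
  SELECT o.order_id, o.user_id AS user_id, u.name, u.email, o.amount
  FROM orders o
  JOIN users u ON o.user_id = u.user_id
  EMIT CHANGES;

-- Sink connector writing the joined topic into a MongoDB collection
CREATE SINK CONNECTOR users_orders_sink WITH (
  'connector.class' = 'com.mongodb.kafka.connect.MongoSinkConnector',
  'connection.uri'  = 'mongodb://mongo-host:27017',
  'database'        = 'reporting',
  'collection'      = 'users_orders',
  'topics'          = 'users_orders'
);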
I read that using ksqlDB in production with Azure Event Hubs will not be possible because of licensing issues. So my question is:
Before moving to other products like Azure HDInsight or Confluent Cloud, is there any way of running ksqlDB to achieve the same solution (perhaps by managing your own Kafka cluster)?
You don't necessarily need ksqlDB; you should be able to do something similar with Spark SQL, which is offered in Azure via Databricks. You don't necessarily need Kafka / Event Hubs either, since Spark can read, join, and write Mongo/JDBC data all on its own (with the appropriate connectors).
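As a rough sketch of that Spark SQL alternative (connection details, credentials, and table names here are hypothetical, and it assumes the MySQL JDBC driver and the MongoDB Spark connector are available on the cluster):

-- Expose the MySQL Users table to Spark SQL through the built-in JDBC data source
CREATE OR REPLACE TEMPORARY VIEW users
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:mysql://mysql-host:3306/appdb',
  dbtable 'Users',
  user 'report_user',
  password '<secret>'
);

-- The MongoDB Orders collection would be registered as an 'orders' view through the
-- MongoDB Spark connector (for example, loaded with the DataFrame API and exposed via
-- createOrReplaceTempView), after which the join itself is plain SQL:
SELECT o.order_id, o.user_id, u.name, o.amount
FROM orders o
JOIN users u ON o.user_id = u.user_id;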
The main reason ksqlDB isn't offered as a hosted service by Azure is that it conflicts with Confluent's licensing, but that doesn't prevent you from running it yourself, as long as you also adhere to the licensing restriction of not offering the ksqlDB REST API as a publicly available / paid API. I haven't personally tried it, but ksqlDB should work against Event Hubs on its own; I don't think you need to self-manage Kafka as the documentation suggests.
I am currently creating a data warehouse in Azure Synapse; however, Synapse does not allow the creation of foreign keys. This is vital for referential integrity between the fact and dimension tables. Does anyone have suggestions as to what the alternatives are in Synapse to enforce a PK/FK relationship?
I researched this topic and found that the focus of Synapse is performance rather than integrity enforcement. We can still create primary keys and structure the star schema with fact and dimension tables and code the joins between them.
It confused me too until I worked through this tutorial and read it carefully:
Load Contoso retail data to Synapse SQL
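For what it's worth, the DDL in a dedicated SQL pool ends up looking roughly like this (table and column names are made up): PRIMARY KEY is only accepted as NONCLUSTERED NOT ENFORCED, and FOREIGN KEY constraints are not available at all, so the fact table's key column is only a logical reference.

CREATE TABLE dbo.DimCustomer
(
    CustomerKey   INT PRIMARY KEY NONCLUSTERED NOT ENFORCED,
    CustomerName  NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE dbo.FactSales
(
    SalesKey     BIGINT NOT NULL,
    CustomerKey  INT NOT NULL,   -- "logical" FK to DimCustomer; enforced by the load process, not the engine
    SalesAmount  DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);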
In a star schema, any referential integrity should be enforced within the ETL tool used to load the data, not in the DB itself.
Some DBs support logical FKs that can help with query execution plans, but they should never be physicalised.
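In practice that ETL-side enforcement can be as simple as an anti-join check after each load that flags or quarantines fact rows with no matching dimension row (again with hypothetical table names):

-- Fact rows whose CustomerKey has no matching dimension row
SELECT f.CustomerKey, COUNT(*) AS orphan_rows
FROM dbo.FactSales AS f
LEFT JOIN dbo.DimCustomer AS d
    ON f.CustomerKey = d.CustomerKey
WHERE d.CustomerKey IS NULL
GROUP BY f.CustomerKey;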
We have on-prem SQL Server Analysis Services (SSAS) Multidimensional with lots of complex custom calculations, many measure groups, and a complex model with many more features. We process a few billion rows per day and have a custom Excel add-in that provides a custom pivot, as well as the standard PivotTable functionality, used to create reports, run ad-hoc queries, and so on.
Below are the possible solutions in Azure:
Approach 1: Azure Synapse, SSAS Multidimensional (ROLAP), Excel, and Power BI. Note that SSAS Multidimensional would run as IaaS, hosted in a VM. Desktop Excel / Excel 365 would be able to connect, as would cloud Power BI.
Approach 2: Azure Synapse, an Azure Analysis Services Tabular model in DirectQuery mode, Excel, and Power BI. Desktop Excel / Excel 365 would be able to connect, as would cloud Power BI.
Questions:
Which approach is better given the huge data volume, the processing load, the complex logic, maintenance, and the custom calculations?
Can users access these cloud-based cubes, especially SSAS Multidimensional, via either desktop Excel or Excel 365?
How will ROLAP performance compare with DAX in DirectQuery mode?
What will be the cost of moving and processing fairly large amounts of data?
With 12TB of data you will probably be looking at 500 - 1200GB of compressed Tabular model size unless you can reduce the model size by not keeping all of history, pruning unused rows from dimensions, and skipping unnecessary columns. That’s extremely large even for a Tabular model that’s only processed weekly. So I agree an import model wouldn’t be practical.
My recommendation would be a Tabular model. A ROLAP Multidimensional model still requires MOLAP dimensions to perform decently and your dimension sizes and refresh frequency will make that impractical.
So a Tabular model in Azure Analysis Services in DirectQuery mode should work. If you optimize Synapse well, you should hopefully get query response times in the 10-60 second range; if you do an amazing job, you could probably get it even faster. But performance will largely depend on Synapse, so materialized views, enabling result set caching, ensuring proper distributions, and ensuring good-quality columnstore compression will be important. If you aren't an expert in Synapse and Azure Analysis Services, find someone who is to help.
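As a rough sketch of that Synapse-side tuning (the database, table, and column names are hypothetical):

-- Enable result set caching for the dedicated SQL pool (run from the master database)
ALTER DATABASE MySynapseDW SET RESULT_SET_CACHING ON;

-- Pre-aggregate the fact table at a commonly queried grain with a materialized view
CREATE MATERIALIZED VIEW dbo.mvSalesByDayCustomer
WITH (DISTRIBUTION = HASH(CustomerKey))
AS
SELECT DateKey, CustomerKey, COUNT_BIG(*) AS RowCnt, SUM(SalesAmount) AS SalesAmount
FROM dbo.FactSales
GROUP BY DateKey, CustomerKey;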
In Azure Analysis Services, ensure you mark relationships to assume referential integrity, which changes the generated SQL to inner joins and helps performance. And keep the model and the calculations as simple as possible, since your model is so large.
Another alternative if you want very snappy interactive dashboard performance for previously anticipated visualizations would be to use Power BI Premium instead of Azure Analysis Services and to do composite models. That allows you to create some smaller agg tables which are imported and respond fast to queries at an anticipated grain. But then other queries will “miss” aggs and run SQL queries against Synapse. Phil Seamark describes aggregations in Power BI well.
I have an Azure SQL database with many tables that I want to update frequently with any change made, be it an update or an insert, using Azure Data Factory v2.
There is a section in the documentation that explains how to do this.
However, the example is about two tables, and for each table a TYPE needs to be defined, and for each table a Stored Procedure is built.
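For reference, the per-table pattern the documentation describes looks roughly like this for one hypothetical table (dbo.Customers); it is exactly this TYPE / stored procedure pair that has to be repeated for every table:

CREATE TYPE dbo.CustomersType AS TABLE
(
    CustomerId    INT NOT NULL,
    Name          NVARCHAR(100),
    ModifiedDate  DATETIME2
);
GO

-- The copy activity sink passes the incoming rows as a table-valued parameter of that type
CREATE PROCEDURE dbo.usp_UpsertCustomers
    @Customers dbo.CustomersType READONLY
AS
BEGIN
    MERGE dbo.Customers AS target
    USING @Customers AS source
        ON target.CustomerId = source.CustomerId
    WHEN MATCHED THEN
        UPDATE SET target.Name = source.Name,
                   target.ModifiedDate = source.ModifiedDate
    WHEN NOT MATCHED THEN
        INSERT (CustomerId, Name, ModifiedDate)
        VALUES (source.CustomerId, source.Name, source.ModifiedDate);
END
GO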
I don't know how to generalize this for a large number of tables.
Any suggestion would be welcome.
You can follow my answer at https://stackoverflow.com/a/69896947/7392069. I don't know how to generalise the creation of table types and stored procedures either, but the metadata table of the metadata-driven copy task at least provides a lot of help in achieving what you need.
I am at the planning stage of a web application that will be hosted in Azure with ASP.NET for the web site and Silverlight within the site for a rich user experience. Should I use Azure Tables or SQL Azure for storing my application data?
Azure Table Storage appears to be less expensive than SQL Azure. It is also more highly scalable than SQL Azure.
SQL Azure is easier to work with if you've been doing a lot of relational database work. If you were porting an application that was already using a SQL database, then moving it to SQL Azure would be the obvious choice, but that's the only situation where I would recommend it.
The main limitation on Azure Tables is the lack of secondary indexes. This was announced at PDC '09 and is currently listed as coming soon, but there hasn't been any time-frame announcement. (See http://windowsazure.uservoice.com/forums/34192-windows-azure-feature-voting/suggestions/396314-support-secondary-indexes?ref=title)
I've seen the proposed use of a hybrid system where you use table and blob storage for the bulk of your data, but use SQL Azure for indexes, searching and filtering. However, I haven't had a chance to try that solution yet myself.
Once the secondary indexes are added to table storage, it will essentially be a cloud based NoSQL system and will be much more useful than it is now.
Despite the similar names, SQL Azure tables and Table Storage have very little in common.
Here are two links that might help you:
Table Storage, a 100x cost factor
Fat Entities on Table Storage
Basically, the first question you should ask yourself is: does my app really need to scale? If not, then go for SQL Azure.
For those trying to decide between the two options, be sure to factor reporting requirements into the equation. SQL Azure Reporting and other reporting products support SQL Azure out of the box. If you need to generate complex or flexible reports, you'll probably want to avoid Table Storage.
Azure Tables are cheaper, simpler, and scale better than SQL Azure. SQL Azure is a managed SQL environment, multi-tenant in nature, so you should analyze whether your performance requirements are a fit for it. A premium version of SQL Azure has been announced and is in preview as of this writing (see HERE).
I think the decisive factors to decide between SQL Azure and Azure tables are the following:
Do you need to do complex joins and use secondary indexes? If yes, SQL Azure is the best option.
Do you need stored procedures? If yes, SQL Azure.
Do you need auto-scaling capabilities? Azure Tables are the best option.
Entities (rows) in Azure Table Storage cannot exceed 1 MB in size. If you need to store large data within a row, it is better to store it in blob storage and reference the blob's URI in the table row.
Do you need to store massive amounts of semi-structured data? If yes, Azure tables are advantageous.
Although Azure tables are tremendously beneficial in terms of simplicity and cost, there are some limitations that need to be taken into account. Please see HERE for some initial guidance.
One other consideration is latency. There used to be a site that Microsoft ran with microbenchmarks on the throughput and latency of various object sizes against Table Storage and SQL Azure. Since that site is no longer available, I'll give you a rough approximation from what I recall: Table Storage tends to have much higher throughput than SQL Azure, while SQL Azure tends to have lower latency (sometimes as little as a fifth of Table Storage's).
It's already been mentioned that table store is easy to scale. However, SQL Azure can scale as well with Federations. Note that Federations (effectively sharding) adds a lot of complexity to your application. I'm also not sure how much Federations affects performance, but I imagine there's some overhead.
If business continuity is a priority, consider that with Azure Storage you get cheap geo-replication by default. With SQL Azure, you can accomplish something similar, but with more effort, using SQL Data Sync. Note that SQL Data Sync also incurs a performance overhead, since it requires triggers on all of your tables to watch for data changes.
I realize this is an old question, but still a very valid one, so I'm adding my reply to it.
CoderDennis and others have pointed out some of the facts: Azure Tables are cheaper, can be much larger, more efficient, etc. If you are 100% sure you will stick with Azure, go with Tables.
However, this assumes you have already decided on Azure. By using Azure Tables, you are locking yourself into the Azure platform: it means writing code very specific to Azure Tables that will not simply port over to Amazon; you would have to rewrite those areas of your code. On the other hand, programming against a SQL database with LINQ will port much more easily to another cloud service.
This may not be an issue if you've already decided on your cloud platform.
I suggest looking at Azure Cache in combination with Azure Table. Table alone has 200-300ms latencies, with occasional spikes higher, which can significantly slow down response times / UI interactivity. Cache + Table seems to be a winning combination, for me.
For your question, I want to talk about how to decide logically when to choose SQL tables and when to use Azure Tables.
As we know, SQL tables live in a relational database engine, but if you have a huge amount of data in one table, a SQL table is not a good fit, because queries over that much data become slow.
In that case you can choose Azure Tables, since Azure Table queries are much faster than SQL table queries for big data. For example, on our website users subscribe to many articles, and we turn the articles into a per-user feed, so every user has a copy of each article's title and description. The article table therefore contains a huge amount of data; if we used a SQL table, each query might take more than 30 seconds, but in Azure Tables fetching a user's article feed by PartitionKey and RowKey is very fast.
From this example you can see how to choose between SQL tables and Azure Tables.
I wonder whether we are going to end up with some "vendor independent" cloud api libraries in due course?
I think you first have to define what your application usage patterns are. Will your data model be subject to frequent changes, or is it stable? Do you need ultra-fast inserts while reads are less demanding? Do you need advanced, Google-like search? Will you be storing BLOBs?
Those are some of the questions you have to ask and answer for yourself in order to decide whether a NoSQL or a SQL approach is the better fit for storing your data.
Please consider that both approaches can easily coexist and can be extended with BLOB storage as well.
Azure Tables and SQL Azure are two different beasts, meant for different scenarios. One con of Azure Tables is that you cannot move from Azure to any other platform unless you write providers in your code that can handle such a shift.