Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
Does anyone know the best way to implement continuous, incremental replication of some DB tables from one Azure SQL DB to another Azure SQL DB (PaaS)?
I have tried the Data Sync preview (the schema still had not loaded after a couple of hours)
and Data Factory (Copy Data), which is fast but always copies the entire data set (producing duplicate records) rather than working incrementally.
Please suggest.
What is the business requirement behind this request?
1 - Do you have some reference data in database 1 and want to replicate that data to database 2?
If so, then use cross database querying if you are in the same logical server. See my article on this for details.
2 - Can you have a duplicate copy of the database in a different region? If so, use active geo-replication to keep the database in sync. See my article on this for details.
3 - If you just need a couple of tables replicated and the data volume is low, then just write a simple PowerShell program (workflow) to trickle load the target from the source.
Schedule the program in Azure Automation on a timing of your choice. I would use a flag to indicate which records have been replicated.
Place the insert into the target and the update of the source flag in a transaction to guarantee consistency. On its own, this is a row-by-agonizing-row pattern.
You can batch the records instead. Look into using the SqlBulkCopy class in the System.Data.SqlClient namespace of .NET.
4 - Last but not least, Azure SQL Database now supports the OPENROWSET command. Unfortunately, in the cloud this feature only reads from files in blob storage. The older, on-premises versions of the command allow you to write to a file.
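The flag-based trickle load from option 3 can be sketched roughly as follows. This is a minimal illustration in Python rather than PowerShell, and it uses SQLite with an attached in-memory database as a stand-in so that a single transaction can cover both the target insert and the source flag update. The table name `orders`, the `replicated` flag column, and the helper names are all assumptions made for the example; against two real Azure SQL databases the transaction handling would need more care.

```python
import sqlite3


def make_demo_connection():
    """Build a stand-in 'source' (main) and 'target' database on one connection."""
    # isolation_level=None puts the connection in autocommit mode,
    # so BEGIN / COMMIT / ROLLBACK are issued explicitly below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("ATTACH DATABASE ':memory:' AS target")
    conn.execute(
        "CREATE TABLE main.orders ("
        "id INTEGER PRIMARY KEY, amount REAL, "
        "replicated INTEGER NOT NULL DEFAULT 0)"  # hypothetical flag column
    )
    conn.execute("CREATE TABLE target.orders (id INTEGER PRIMARY KEY, amount REAL)")
    return conn


def replicate_batch(conn, batch_size=100):
    """Copy up to batch_size unflagged rows and flag them in one transaction.

    Returns the number of rows copied.
    """
    conn.execute("BEGIN")
    try:
        rows = conn.execute(
            "SELECT id, amount FROM main.orders "
            "WHERE replicated = 0 ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        # Insert into the target and flag the source rows together,
        # so a failure leaves neither side half-updated.
        conn.executemany(
            "INSERT INTO target.orders (id, amount) VALUES (?, ?)", rows
        )
        conn.executemany(
            "UPDATE main.orders SET replicated = 1 WHERE id = ?",
            [(row[0],) for row in rows],
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return len(rows)
```

A scheduler would call `replicate_batch` repeatedly until it returns 0. Working in batches keeps each transaction short while avoiding a round trip per row.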
I hope these suggestions help.
Happy Coding.
John
The Crafty DBA
If you wanted to use Azure Data Factory, in order to do incremental updates, you would need to change your query to look at a created/modified date on the source table. You can then take that data and put it into a "Staging Table" on the destination side, then use a stored proc activity to do your insert/update into the "Real table" and finally truncate the staging table.
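As a rough sketch of that flow (filter on a modified date, land the rows in a staging table, merge into the real table, then clear staging), here is a small Python example that uses SQLite as a stand-in for both databases. The `customers` and `customers_staging` tables, the `modified_at` column, and the function names are assumptions for illustration. In Data Factory the same steps would be a source query, a copy activity, and a stored procedure activity rather than Python code.

```python
import sqlite3


def make_demo_databases():
    """Create stand-in source and destination databases (hypothetical schema)."""
    source = sqlite3.connect(":memory:")
    dest = sqlite3.connect(":memory:")
    ddl = "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)"
    source.execute(ddl)
    dest.execute(ddl)
    dest.execute(
        "CREATE TABLE customers_staging (id INTEGER, name TEXT, modified_at TEXT)"
    )
    return source, dest


def incremental_copy(source, dest, last_watermark):
    """Copy rows modified after last_watermark and return the new watermark."""
    # 1. Only pull rows changed since the previous run.
    changed = source.execute(
        "SELECT id, name, modified_at FROM customers "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    if not changed:
        return last_watermark

    with dest:  # commits on success, rolls back on error
        # 2. Land the changed rows in the staging table.
        dest.executemany("INSERT INTO customers_staging VALUES (?, ?, ?)", changed)
        # 3. Insert/update into the real table (the stored procedure's job).
        dest.execute(
            "INSERT OR REPLACE INTO customers "
            "SELECT id, name, modified_at FROM customers_staging"
        )
        # 4. Clear staging for the next run.
        dest.execute("DELETE FROM customers_staging")

    # The highest modified date seen becomes the next run's starting point.
    return max(row[2] for row in changed)
```

The caller has to persist the returned watermark between runs. ISO-8601 timestamps are used here because they sort correctly as plain strings.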
Hope this helps.
I was able to achieve cloud-to-cloud migration using the Data Sync preview from the Azure ASM portal.
Below are the limitations:
Maximum number of sync groups any database can belong to : 5
Characters that cannot be used in object names : The names of objects (databases, tables, columns) cannot contain the printable characters period (.), left square bracket ([) or right square bracket (]).
Supported limits on DB Dimensions
Reference: http://download.microsoft.com/download/4/E/3/4E394315-A4CB-4C59-9696-B25215A19CEF/SQL_Data_Sync_Preview.pdf
I'm trying to implement a process using Data Factory and Databricks to ingest data into Data Lake and convert it all to a standard format i.e. parquet. So we'll have a raw data tier and a clean/standardized data tier.
When the source system is a DB or delimited files it's (relatively) easy, but in some cases we will have Excel sources. I've been testing the conversion process with com.crealytics.spark.excel, which is OK because we can infer the schema, BUT it's not able to iterate through multiple sheets OR get the list of sheet names so that I can iterate through each one and convert it into a single file.
I need this to be as dynamic as possible so that we can ingest almost any file regardless of its type or schema.
Does anyone know of any alternative methods of doing this? I'm open to moving away from Databricks if necessary, for example to Azure Batch with a custom C# script.
thanks in advance!
Since you are aiming to store the data in Azure Data Lake, another approach may be to use Azure Data Lake Analytics with a custom Excel extractor. U-SQL then can convert it into Parquet. See here for a sample Excel extractor.
How much variability do you expect with the Excel sheets?
The main problem here will be that it is hard to be completely schema agnostic, especially if you have many columns. To handle variability of the schema, you could change the extractor to output the columns either as key/value pairs or, if the number of columns and the size of a row are reasonable, as a SqlMap (or a few for different target types). You would probably have to pivot the result into a column format before creating the Parquet, though, which would require either a second script to create the pivoting script or a custom outputter (instead of the built-in Parquet outputter).
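To make the key/value idea concrete, here is a small schema-agnostic sketch in plain Python (not U-SQL). Rows with differing columns are flattened into (row id, column, value) triples and later pivoted back into columns, with gaps filled by None. The function names and the use of the row position as the row id are assumptions for illustration only.

```python
from collections import defaultdict


def to_key_value(rows):
    """Flatten dict rows into (row_id, column, value) triples.

    The row's position in the input is used as its id.
    """
    return [
        (row_id, column, value)
        for row_id, row in enumerate(rows)
        for column, value in row.items()
    ]


def pivot_to_columns(triples):
    """Pivot triples back into a column-oriented dict, filling gaps with None."""
    row_ids = sorted({row_id for row_id, _, _ in triples})
    position = {row_id: n for n, row_id in enumerate(row_ids)}
    columns = defaultdict(lambda: [None] * len(row_ids))
    for row_id, column, value in triples:
        columns[column][position[row_id]] = value
    return dict(columns)
```

For example, two rows `{"name": "a", "qty": 1}` and `{"name": "b", "price": 2.5}` pivot into three columns of length two, so sheets with different layouts end up in one columnar structure.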
I thought the whole point of using a Data Lake versus a Data Warehouse was to invert the ETL (Extract, Transform, Load) process to ELT (Extract, Load, Transform). Doesn't extracting this data, transforming it and loading it into a table get us right back where we started?
IMHO the point of a data lake is to store all types of data: unstructured, semi-structured and structured. The Azure version of this is Azure Data Lake Store (ADLS) and its primary function is scalable, high-volume storage.
Separately, there is a product called Azure Data Lake Analytics (ADLA). This analytics product can interact with ADLS, but also with blob storage, SQL on a VM (IaaS), the two PaaS database offerings (SQL Database and SQL Data Warehouse), and HDInsight. It has a powerful batch language called U-SQL, a combination of SQL and .NET, for interrogating and manipulating these data stores. It also has a database option which, where appropriate, allows you to store data you have processed in table format.
One example might be where you have some unstructured data in your lake, you run your batch output and want to store the structured intermediate output. This is where you might store the output in an ADLA database table. I tend to use them where I can prove I can get a performance improvement out of them and/or want to take advantage of the different indexing options.
I do not tend to think of these as warehouse tables because they don't interact well with other products yet, i.e. they don't as yet have endpoints and aren't visible; e.g. Azure Data Factory can't move tables from there yet.
Finally I tend to think of ADLS as analogous to HDFS and U-SQL/ADLA as analogous to Spark.
HTH
By definition, a data lake is a huge repository storing raw data in its native format until needed. Lakes use a flat architecture rather than a nested one (http://searchaws.techtarget.com/definition/data-lake). Data in the lake has a unique ID and metadata tags, which are used in queries.
So data lakes can store structured, semi-structured and unstructured data. Structured data would include SQL database type data in tables with rows and columns. Semi-structured would be CSV files and the like. And unstructured data is anything and everything -- emails, PDFs, video, binary. It's that ID and the metadata tags that help users find data inside the lake.
To keep a data lake manageable, successful implementers rotate, archive or purge data from the lake on a regular basis. Otherwise it becomes what some have called a "data swamp", basically a graveyard of data.
The traditional ETL (Extract, Transform, Load) process is better suited to data warehouses because they are more structured and data in a warehouse is there for a purpose. Data lakes, being less structured, are more suited to other approaches such as ELT (Extract, Load, Transform), because they store raw data that is only categorized by each query. (See this article by Panoply for a discussion of ELT vs ETL.) For example, say you want to see customer data from 2010. When you query a data lake for that, you will get everything from accounting data and CRM records to emails from 2010. You cannot analyze that data until it has been transformed into usable formats where the common denominators are customers + 2010.
To me, the answer is "money" and "resources"
(and probably correlated to use of Excel to consume data :) )
I've been through a few migrations from RDBMS to Hadoop/Azure platforms and it comes down to the cost/budget and use-cases:
1) Port legacy reporting systems to new architectures
2) Skillset of end-users who will consume the data to drive business value
3) The type of data being processed by the end user
4) Skillset of support staff who will support the end users
5) Whether the purpose of migration is to reduce infrastructure support costs, or enable new capabilities.
Some more details for a few of the above:
Legacy reporting systems often are based either on some analytics software or homegrown system that, over time, has a deeply embedded expectation for clean, governed, structured, strongly-typed data. Switching out the backend system often requires publishing the exact same structures to avoid replacing the entire analytics solution and code base.
Skillsets are a primary concern as well, because you're often talking about hundreds to thousands of folks who are used to using Excel, with some knowing SQL. Few end-users, in my experience, and few Analysts I've worked with know how to program. Statisticians and Data Engineers tend towards R/Python. And developers with Java/C# experience tend towards Scala/Python.
Data Types are a clincher for what tool is right for the job... but here you have a big conflict, because there are folks who understand how to work with "Data Rectangles" (e.g. dataframes/tabular data), and those who know how to work with other formats. However, I still find folks consistently turning semi-structured/binary/unstructured data into a table as soon as they need to get a result operationalized... because support is hard to find for Spark.
What is the maximum load Azure Table storage can handle (in one account)? For example, can it handle 2000 reads/sec where each response must arrive in less than a second? Requests would be made from many different machines, and the payload of one entity is around 500Kb on average. What are the practices for accommodating such a load: how many tables and partitions, given that there is only one type of entity and in principle there could be any number of tables and partitions? Also, the RowKeys are uniformly distributed 32-character hash strings, and the PartitionKeys are also uniformly distributed.
Check the Azure Storage Scalability and Performance Targets documentation page. That should answer part of your question.
http://msdn.microsoft.com/en-us/library/azure/dn249410.aspx
I would suggest reading the best practices here: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
The following are the scalability targets for a single storage account:
•Transactions – Up to 5,000 entities/messages/blobs per second
Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
◦Up to 500 entities per second
◦Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to a few thousand requests per second (up to the storage account target).
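Plugging the question's numbers into the targets quoted above gives a quick sanity check. The sketch below is only arithmetic: the two constants come from the quoted targets, the helper names are made up for the example, and "500Kb" is assumed to mean roughly 500 kilobytes per entity.

```python
import math

# Targets quoted above (requests/entities per second).
ACCOUNT_TARGET_PER_SEC = 5000
PARTITION_TARGET_PER_SEC = 500


def min_partitions(reads_per_sec, partition_target=PARTITION_TARGET_PER_SEC):
    """Fewest partitions the load must spread over, assuming an even spread."""
    return math.ceil(reads_per_sec / partition_target)


def fits_account(reads_per_sec, account_target=ACCOUNT_TARGET_PER_SEC):
    """True if the request rate is within the single-account target."""
    return reads_per_sec <= account_target


def bandwidth_mb_per_sec(reads_per_sec, entity_kb):
    """Approximate outbound data rate in MB/s (1 MB taken as 1000 KB)."""
    return reads_per_sec * entity_kb / 1000


# The question's scenario: 2000 reads/sec of ~500 KB entities.
partitions_needed = min_partitions(2000)          # 4
within_account = fits_account(2000)               # True
data_rate = bandwidth_mb_per_sec(2000, 500)       # 1000.0 MB/s
```

So the request rate alone fits within one account as long as reads spread evenly over at least four partitions, which uniformly distributed PartitionKeys should achieve. The data rate of roughly 1 GB/s is a separate concern that the request targets above do not address, so the bandwidth targets and entity size limits on the linked page are worth checking as well.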
As long as you partition your data correctly, so that you don't have a bunch of data all going to one machine, one table should be fine. Also keep in mind how you will query the data: if you don't use the index (PartitionKey|RowKey), the query will have to do a full table scan, which is very expensive with a large dataset.
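The difference between a keyed lookup and a scan can be illustrated with a toy in-memory model. This is not the Azure SDK; `ToyTable` and its methods are hypothetical and only mimic the idea that entities are indexed by (PartitionKey, RowKey) and by nothing else.

```python
class ToyTable:
    """In-memory stand-in for a table indexed only by (PartitionKey, RowKey)."""

    def __init__(self):
        self._rows = {}  # (partition_key, row_key) -> entity dict

    def insert(self, partition_key, row_key, **properties):
        self._rows[(partition_key, row_key)] = {
            "PartitionKey": partition_key,
            "RowKey": row_key,
            **properties,
        }

    def point_lookup(self, partition_key, row_key):
        """Uses the index: a single keyed access, regardless of table size."""
        return self._rows.get((partition_key, row_key))

    def scan(self, predicate):
        """Filters on a non-key property: every entity has to be examined.

        Returns the matches and the number of entities examined.
        """
        examined = 0
        matches = []
        for entity in self._rows.values():
            examined += 1
            if predicate(entity):
                matches.append(entity)
        return matches, examined
```

With 100 entities loaded, `point_lookup` touches one entry, while `scan` on a non-key property examines all 100 to find a single match. That cost grows with the size of the table.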
I am working with a system that takes in a write stream of 50 messages per second at 10kb each, running 24 hours a day. The data is ingested via a messaging system into a SQL database and then used in an overnight aggregation that takes around 15 hours to produce queryable data for an application.
This is currently all in SQL, but we are moving to a new architecture.
The plan is to move the ingested writes into a distributed database like Cassandra or DynamoDB, and then perform the aggregation in Hadoop. This makes those parts of the system scalable.
My question is: when people have this architecture, where do they put the data after the writes and aggregation have been performed so that it can be queried?
In more detail:
The query model our application uses is quite complicated. To make the data queryable in Cassandra, we would have to denormalise it for all queries. This is possible, but would mean a massive growth in data size. Is this normal practice? Or would you prefer to move the data back into SQL?
We could move the data into Redshift, but this seems to be more for ad hoc data analytics, and its purpose is not to be the backend for a data analytics application. I also think the queries are too complicated in their current form to be written in an ORM, which is what is required for Redshift.
Does this mean that I still need to put the data into SQL Server?
I am looking for examples of what people are doing at the moment.
I am sorry this question is a bit abstract; please do not close it, I will add more detail. I have read a lot on big data, but most articles are about the ingestion of data using messaging / workers and distributed databases. I have not found any that show what is done with this ingested data and how it is queried from the application.
*Answer to JosefN's comment: Yes, we are not planning to denormalise into a SQL DB. One choice is to denormalise into Cassandra for all clients and queries; this would probably mean 100x the current data size, as there will be so much duplication in the denormalised model. The other option is to store it as it is now, so that it is queryable, but then is my only option a SQL DB?
*After more research I have more information. The best options at the moment seem to be:
store back in SQL
denormalise in Cassandra
use one of the real-time SQL engines on top of Hadoop / HDFS, like Impala
DRPC with Storm
I do not have any experience with Impala or DRPC with Storm, so if anyone has any info on latency and the type of queries that can be performed with these, that would be great.
Please do not refer to documentation or blog posts. I know how these technologies work; I only want to know if someone has used them in production and has their own information on this subject. Thanks.
I would suggest moving the aggregated data into HDFS. Using Hive, which provides a relational view over data stored in HDFS, you can run ad hoc SQL-like queries. At the same time, you will benefit from the parallelism of the MapReduce jobs that are invoked when you use Hive. This would help to decrease the query latencies you would have with an RDBMS. Also think about doing the aggregation jobs in Hadoop itself.
Since the data after aggregation is small and you are looking for good latency, keeping it in HDFS and querying it using Hive is not preferable.
I have seen people use HBase to store aggregated data and query it, but as you mentioned earlier, you will have to denormalize the data. For this case, I would recommend writing the aggregated data back to MySQL and querying it there, if the aggregated data is not big.
I think one traditional approach is to run your Hadoop/Hive jobs to aggregate across all possible dimensions, then store the results in a key/value store like HBase and look them up at runtime with a key based on the aggregation done (i.e. /state=NJ/dt=20131225/). This can cause an explosion in size, especially if there are many dimensions to roll up.
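A small sketch of that pre-aggregation idea, in plain Python with a dict standing in for the key/value store. The record fields (`state`, `dt`, `sales`) and the `rollup` function are assumptions for illustration. In practice the aggregation would run as Hadoop/Hive jobs and the results would be written to something like HBase.

```python
from collections import defaultdict
from itertools import combinations


def rollup(records, dimensions, measure):
    """Sum `measure` for every non-empty subset of `dimensions`.

    Keys look like '/state=NJ/dt=20131225/' so that a lookup at query time
    is a single key fetch.
    """
    store = defaultdict(float)
    for size in range(1, len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for record in records:
                key = "/" + "/".join(f"{d}={record[d]}" for d in dims) + "/"
                store[key] += record[measure]
    return dict(store)


records = [
    {"state": "NJ", "dt": "20131225", "sales": 10.0},
    {"state": "NJ", "dt": "20131226", "sales": 5.0},
    {"state": "NY", "dt": "20131225", "sales": 7.0},
]
store = rollup(records, ["state", "dt"], "sales")
nj_christmas = store["/state=NJ/dt=20131225/"]  # 10.0
nj_total = store["/state=NJ/"]                  # 15.0
```

With n dimensions there are 2^n - 1 dimension subsets to roll up, each with its own set of distinct value combinations, which is where the size explosion comes from.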
If you want/need a more realtime solution as well, take a look at Twitter's summingbird.
I know secondary indexes are not here yet: they are on the wish list and "planned".
I would like to get some ideas (or information from a reliable source) about the upcoming secondary indexes.
1st question: I noticed MS planned "secondary indexes". Does that mean we can create as many indexes as we want on one table?
2nd question: The current index is "PartitionKey+RowKey". If the answer to the question above is no, will the secondary index be "RowKey+PartitionKey", or is there a good chance that we can customize it?
I would like some ideas because I am currently designing a table. Since there won't be much data at the beginning, I think I can wait for the secondary index feature without creating multiple tables at this moment.
Please share your ideas or any sources you have, thanks.
There's currently no information on secondary indexes, other than what's written at the site you referenced. So, there's no way to answer either of your two questions.
Several customers I work with, that use Table Storage, have taken the multiple-table approach to provide additional indexing. For those requiring extensive index coverage, that data typically has found its way into SQL Azure (or a combination of SQL Azure + Table Storage).
As a Windows Azure MVP, I don't have any information about secondary indexes in the table service. If we need more indexes in the table service but don't want to use SQL Azure (and not just because of the pricing...), then I would denormalize my data: split the same data into more than one table, with a different row key in each serving as the index.
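The multiple-table approach amounts to writing each entity once per access path. Here is a minimal in-memory sketch in Python, with nested dicts standing in for tables. The table names, the order fields, and the helper functions are hypothetical, and a real implementation would also have to handle a write that succeeds in one table and fails in the other.

```python
from collections import defaultdict


def make_tables():
    """Two 'tables' holding the same orders under different keys."""
    return {
        "orders_by_customer": defaultdict(dict),  # partition key: customer_id
        "orders_by_date": defaultdict(dict),      # partition key: order_date
    }


def write_order(tables, order):
    """Store the same entity twice, once per access path."""
    row_key = order["order_id"]
    tables["orders_by_customer"][order["customer_id"]][row_key] = order
    tables["orders_by_date"][order["order_date"]][row_key] = order


def query_partition(table, partition_key):
    """Fetch every entity in one partition without scanning the others."""
    return list(table.get(partition_key, {}).values())
```

Each query then goes to whichever table has the wanted field as its key, at the cost of storing every entity twice and keeping the copies in step.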
This question is now two years old. And still no sign of secondary indexes in Azure Table Storage. My guess is that it is now very unlikely to ever eventuate.
Azure Cosmos DB provides the Table API for applications that are written for Azure Table storage and that need capabilities like:
Automatic Secondary Indexing
From: Introduction to Azure Cosmos DB: Table API
So if you are willing to move your tables over to Cosmos, then you will get all the indexing you could ever want.