I have a table containing more than 50 million records in Azure. I'm trying to create a nonclustered index on it using the following statement:
CREATE NONCLUSTERED INDEX market_index_1 ON MarketData (symbol, currency) WITH (ONLINE = ON)
But I get an error message:
Msg -2, Level 11, State 0, Line 0 Timeout expired. The timeout period
elapsed prior to completion of the operation or the server is not
responding.
Any suggestions would be greatly appreciated.
Check out the Azure SQL Database Resource Limits document, then compare your error code with the error codes listed in that document.
With data of that size, I believe the only way to create a new index on that table would be:
Create new table with same structure and only one clustered index
Copy the data from original table into the new one
Truncate the original table
Create desired indexes
Copy data back into original table
Note that moving the data between the tables may once again exceed the resource limits, so you might have to do these operations in chunks. A rough sketch of the whole procedure is shown below.
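A T-SQL sketch of these steps, assuming the MarketData table from the question and a hypothetical staging table named MarketData_staging (adjust the chunk predicate to whatever splits your data evenly):

    -- 1. Create a new table with the same structure
    --    (SELECT ... INTO creates a heap; script out the original CREATE TABLE
    --     instead if you want a clustered index on the staging table)
    SELECT *
    INTO dbo.MarketData_staging
    FROM dbo.MarketData
    WHERE 1 = 0;

    -- 2. Copy the data over in chunks so each batch stays under the resource limits
    --    (repeat with different ranges; use an explicit column list and
    --     SET IDENTITY_INSERT if the table has an IDENTITY column)
    INSERT INTO dbo.MarketData_staging
    SELECT * FROM dbo.MarketData
    WHERE symbol BETWEEN 'A' AND 'M';   -- example chunk predicate

    -- 3. Truncate the original table
    TRUNCATE TABLE dbo.MarketData;

    -- 4. Create the desired index while the table is empty
    CREATE NONCLUSTERED INDEX market_index_1
        ON dbo.MarketData (symbol, currency);

    -- 5. Copy the data back, again in chunks if necessary
    INSERT INTO dbo.MarketData
    SELECT * FROM dbo.MarketData_staging
    WHERE symbol BETWEEN 'A' AND 'M';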
Another possible approach is to upgrade the database server to the new Preview Version of Azure SQL Database (note: you cannot downgrade the server later!).
I have 100-150 Azure databases with the same table schema. There are 300-400 tables in each database. Separate reports are enabled on all these databases.
Now I want to merge these databases into a centralized database and generate some different Power BI reports from this centralized database.
The approach I am thinking of is:
There will be a Master table on the target database which will have DatabaseID and Name.
All the tables on the target database will have a composite primary key created from the source primary key and the DatabaseID (see the sketch below).
There will be multiple (30-35) instances of an Azure Data Factory pipeline, and each instance will be responsible for merging data from 10-15 databases.
These ADF pipelines will be scheduled to run weekly.
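A minimal T-SQL sketch of this schema, using hypothetical names (DatabaseMaster for the master table, Customer as an example merged table with source key CustomerID):

    -- Master table mapping each source database to an ID
    CREATE TABLE dbo.DatabaseMaster
    (
        DatabaseID   int           NOT NULL PRIMARY KEY,
        DatabaseName nvarchar(128) NOT NULL
    );

    -- Example merged table: the source primary key plus DatabaseID
    -- form the composite primary key
    CREATE TABLE dbo.Customer
    (
        DatabaseID   int           NOT NULL
            REFERENCES dbo.DatabaseMaster (DatabaseID),
        CustomerID   int           NOT NULL,  -- primary key in the source database
        CustomerName nvarchar(200) NULL,
        CONSTRAINT PK_Customer PRIMARY KEY (DatabaseID, CustomerID)
    );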
Can anyone please tell me whether the above approach will be feasible in this scenario, or whether there is another option we could go for?
Thanks in Advance.
You are trying to create a data warehouse.
I hope you never manage to merge 150 Azure SQL Databases as-is, because as soon as you try to query that beefy archive you will run into errors. This is because Power BI, like any other tool, comes with limitations:
Limitation of Distinct values in a column: there is a 1,999,999,997 limit on the number of distinct values that can be stored in a column.
Row limitation: If the query sent to the data source returns more than one million rows, you see an error and the query fails.
Column limitation: The maximum number of columns allowed in a dataset, across all tables in the dataset, is 16,000 columns.
A data warehouse is not just the merge of ALL of your data. You need to clean the data and import only the most useful parts.
So the approach you are proposing is overall OK, but just import what you need.
I have an Azure Data Factory pipeline that is copying data from an SQL DB to an Azure Blob. It works by scanning an SQL table and querying the BatchRecord table (a sort of metadata table) and finding all records that have the same projectID and UniqueID (just two ids used to identify a customer and an account). Once all these matching records are found, they are grouped together to form a batch. This batch of records is then used to pull records from a second table and move them to the Azure Blob. There are many records that have the same projectID and UniqueID (representing multiple transactions made by the same client on different dates).
Each copy activity (and some other code) is contained inside a forEach loop. The forEach loop is guided by a lookup activity that uses the projectID and UniqueID to determine how many times it needs to loop. All the copy code is found inside the forEach loop.
Because the number of records is quite large, some of the copy activities in each batch fail because of network issues. My question is this: Is there some way to retry only these failed copy activities after all the records have been attempted once? That is, I want to process all records (regardless of failures) then I want to re-attempt only the failed records. If possible, I would also like to be able to control how many times a retry is attempted (maybe two retries, maybe ten depending on what each client needs). How can I do this?
The Copy Activity in ADF has retry and retry interval settings. By default, retry is set to 0, so the copy is not retried if it fails. You could increase the number of retries on the copy activity.
We've got an Azure SQL table I'm moving from the traditional DB into an Azure Synapse Analytics (DW) table for long-term storage, so it can be removed from our production DB. This is a system table for a deprecated system we used to use (Salesforce). There's a column in this table that is a varchar(max), and it's massive: MAX(LEN(field)) is 1,585,521. I've tried using Data Factory to move the table into the DW, but it fails on that massive column. I modeled the DW table to be a mirror of the production DB table, but it fails to load, and I have tried several times. I changed the failing DW column to nvarchar(max), but it's still failing (I thought non-Unicode might be causing the failure). Any ideas? It's confusing me because the data exists in our production DB, but won't peacefully move to our DW.
I've tried several times and have received these error messages (the second one after changing the DW column from varchar(max) to nvarchar(max)):
HadoopSqlException: Arithmetic overflow error converting expression to data type NVARCHAR."}
HadoopExecutionException: Too long string in column [-1]: Actual len = [4977]. MaxLEN=[4000]
PolyBase currently has this 1 MB limitation, and the length of your column is greater than that. The work-around is to use bulk insert in the copy activity of ADF, or to chunk the source data into 8K columns and load it into a target staging table that also has 8K columns (a sketch of the chunking approach follows below). Check this document for more details on this limitation.
If you're using PolyBase external tables to load your tables, the defined length of the table row can't exceed 1 MB. When a row with variable-length data exceeds 1 MB, you can load the row with BCP, but not with PolyBase.
It worked for me when I used "bulk insert" option in the ADF pipeline.
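If you go down the chunking route instead of bulk insert, a rough T-SQL sketch of splitting the oversized column into 8000-character pieces could look like this (the table and column names SalesforceArchive, Id and BigText are hypothetical):

    -- Split the oversized varchar(max) column into 8000-character pieces,
    -- one row per (Id, ChunkNo), so every piece fits within PolyBase's limits
    WITH Numbers AS
    (
        SELECT TOP (250) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.all_objects            -- 250 * 8000 covers ~2,000,000 characters
    )
    SELECT s.Id,
           n.n AS ChunkNo,
           SUBSTRING(s.BigText, (n.n - 1) * 8000 + 1, 8000) AS ChunkText
    FROM dbo.SalesforceArchive AS s
    JOIN Numbers AS n
      ON (n.n - 1) * 8000 < LEN(s.BigText)   -- only the chunks that exist for each row
    ORDER BY s.Id, n.n;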
I have an ADF pipeline which executes a Data Flow.
The Data Flow has:
a Source table (A) which has around 1 million rows,
a Filter which has a query to select only yesterday's records from the source table,
an Alter Row setting which uses upsert,
and a Sink, which is the archival table where the records get upserted.
This whole pipeline takes around 2 hours, which is not acceptable; only around 3,000 records are actually being transferred/upserted.
The core count is 16. I tried partitioning with round robin and 20 partitions.
A similar archival doesn't take more than 15 minutes for another table which has around 100K records.
I thought of creating a source which would select only yesterday's records, but in the dataset we can select only a table.
Please suggest if I am missing anything to optimize it.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
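For example, the source query could be something like the following (dbo.SourceTableA and ModifiedDate are hypothetical names; use whatever column marks the record date in your table):

    -- Select only yesterday's records instead of scanning the whole table
    SELECT *
    FROM dbo.SourceTableA
    WHERE ModifiedDate >= CAST(DATEADD(day, -1, GETDATE()) AS date)
      AND ModifiedDate <  CAST(GETDATE() AS date);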
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data into an Azure Storage Blob before using another Copy activity to pull from that Blob into the destination. But for what you're describing here, that doesn't seem necessary.
I have created a linked service that takes the data from on-prem and stores it into an Azure blob, but my data is dynamic. How can I build a pipeline that takes the updated table into the blob, and then takes that blob and transfers it into the Azure data warehouse? I need this in such a way that all my tables are in real-time sync with the Azure data warehouse.
What you are probably looking for is incrementally loading data into your data warehouse.
The procedure described below is documented here. It assumes you have periodic snapshots of your whole source table into blobstorage.
You need to elect a column to track changes in your table.
If you are only appending and never changing existing rows, the primary key will do the job.
However, if you have to cope with changes in existing rows, you need a way to track those changes (for instance with a column named "timestamp-of-last-update", or any better, more succinct name).
Note: if you don't have such a column, you will not be able to track changes and therefore will not be able to load data incrementally.
For a given snapshot, we are interested in the rows added or updated in the source table. This content is called the delta associated with the snapshot. Once the delta is computed, it can be upserted into your table with a Copy Activity that invokes a stored procedure. Here you can find details on how this is done.
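As an illustration, such a stored procedure could be a simple MERGE, assuming the delta has first been landed in a staging table (dbo.TargetTable, dbo.DeltaStaging and the column names are hypothetical):

    CREATE PROCEDURE dbo.UpsertDelta
    AS
    BEGIN
        -- Upsert the staged delta into the target table
        MERGE dbo.TargetTable AS t
        USING dbo.DeltaStaging AS s
            ON t.Id = s.Id
        WHEN MATCHED THEN
            UPDATE SET t.SomeColumn  = s.SomeColumn,
                       t.LastUpdated = s.LastUpdated
        WHEN NOT MATCHED THEN
            INSERT (Id, SomeColumn, LastUpdated)
            VALUES (s.Id, s.SomeColumn, s.LastUpdated);

        -- Clear the staging table for the next run
        TRUNCATE TABLE dbo.DeltaStaging;
    END;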
Assuming the values of the elected column will only grow as rows are added/updated in the source table, it is necessary to keep track of its maximum value through the snapshots. This tracked value is called watermark. This page describes a way to persist the watermark into SQL Server.
Finally, you need to be able to compute the delta for a given snapshot given the last watermark stored. The basic idea is to select the rows where the elected column is greater than the stored watermark. You can do so using SQL Server (as described in the referred documentation), or you can use Hive on HDInsight to do this filtering.
Do not forget to update the watermark with the maximum value of the elected column once the delta is upserted into your data warehouse.
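Putting the last few steps together, a rough T-SQL sketch (dbo.WatermarkTable, dbo.SourceTable and the column LastModified are hypothetical names) could be:

    -- A small table that persists the watermark per source table
    CREATE TABLE dbo.WatermarkTable
    (
        TableName      nvarchar(128) NOT NULL PRIMARY KEY,
        WatermarkValue datetime2     NOT NULL
    );

    -- Compute the delta: rows added or updated since the stored watermark
    SELECT *
    FROM dbo.SourceTable
    WHERE LastModified > (SELECT WatermarkValue
                          FROM dbo.WatermarkTable
                          WHERE TableName = 'SourceTable');

    -- After the delta has been upserted, advance the watermark
    UPDATE dbo.WatermarkTable
    SET WatermarkValue = (SELECT MAX(LastModified) FROM dbo.SourceTable)
    WHERE TableName = 'SourceTable';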