I have a worktable that we use to store the merged/joined selections from other tables with a large volume of records (could be millions of rows). If I have an entity mapping for this table - is there a way to insert results into this table without pulling the data across to my context in the app? I would prefer to do this without handwriting the SQL (that's how it's done now).
A Stored Procedure is a consideration.
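For context, the kind of hand-written SQL I'd like to stop maintaining looks roughly like this (table, column, and procedure names are made up for the example); the point is that the whole insert runs server-side, so none of the rows are pulled into the application:

    CREATE PROCEDURE dbo.PopulateWorkTable
        @FromDate date
    AS
    BEGIN
        -- The INSERT ... SELECT runs entirely inside SQL Server, so none of the
        -- (potentially millions of) merged/joined rows travel to the app context.
        INSERT INTO dbo.WorkTable (CustomerId, OrderId, Amount)
        SELECT c.CustomerId, o.OrderId, o.Amount
        FROM dbo.Customers AS c
        JOIN dbo.Orders AS o ON o.CustomerId = c.CustomerId
        WHERE o.OrderDate >= @FromDate;
    END;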
Thanks
I have 100-150 Azure databases with the same table schema. There are 300-400 tables in each database. Separate reports are enabled on each of these databases.
Now I want to merge these databases into a centralized database and generate different Power BI reports from it.
The approach I am thinking of is:
There will be a Master table on the target database which will hold the DatabaseID and Name of each source database.
All the tables on the target database will have a composite primary key made up of the source primary key and the DatabaseID (a sketch of this schema follows below this list).
There will be multiple (30-35) instances of an Azure Data Factory pipeline, and each instance will be responsible for merging data from 10-15 databases.
These ADF pipelines will be scheduled to run weekly.
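For clarity, here is a rough sketch of the schema this implies; only the DatabaseID/Name master table and the composite key come from the points above, all other names are illustrative:

    -- Master table on the target database: one row per source database.
    CREATE TABLE dbo.SourceDatabase (
        DatabaseID int           NOT NULL PRIMARY KEY,
        Name       nvarchar(128) NOT NULL
    );

    -- Example merged table: the composite primary key combines the source
    -- table's own key with the DatabaseID the row came from.
    CREATE TABLE dbo.Orders (
        DatabaseID int   NOT NULL REFERENCES dbo.SourceDatabase (DatabaseID),
        OrderID    int   NOT NULL,   -- primary key value from the source database
        OrderDate  date  NULL,
        Amount     money NULL,
        CONSTRAINT PK_Orders PRIMARY KEY (DatabaseID, OrderID)
    );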
Can anyone please guide me on whether the above approach is feasible in this scenario, or whether there is any other option we could go for?
Thanks in Advance.
You are trying to create a Data Warehouse.
I hope you never end up simply merging all 150 Azure SQL Databases, because as soon as you try to query that beefy archive you will run into errors.
This is because Power BI, like any other tool, comes with limitations:
Limitation of Distinct values in a column: there is a 1,999,999,997 limit on the number of distinct values that can be stored in a column.
Row limitation: If the query sent to the data source returns more than one million rows, you see an error and the query fails.
Column limitation: The maximum number of columns allowed in a dataset, across all tables in the dataset, is 16,000 columns.
A data warehouse is not just a merge of ALL of your data. You need to clean the data and import only the most useful parts.
So the approach you are proposing is overall OK, but just import what you need.
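For example, rather than landing every detail row in front of Power BI, the "most useful" data can be exposed pre-aggregated at the grain the reports actually need; a minimal sketch, reusing the hypothetical Orders/SourceDatabase schema sketched in the question above:

    -- Illustrative only: aggregate to the reporting grain before the data reaches
    -- Power BI, instead of exposing millions of detail rows per source database.
    CREATE VIEW dbo.vw_MonthlySales
    AS
    SELECT
        DatabaseID,   -- which source database the rows came from
        DATEFROMPARTS(YEAR(OrderDate), MONTH(OrderDate), 1) AS SalesMonth,
        SUM(Amount) AS TotalAmount,
        COUNT(*)    AS OrderCount
    FROM dbo.Orders
    GROUP BY DatabaseID, DATEFROMPARTS(YEAR(OrderDate), MONTH(OrderDate), 1);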
There is a use case in my company to enable business users with no technical knowledge to use the data from the Azure cloud. Back in the SQL Server days this was easily solved through OLAP cubes: you could write a query for the data backing the cube, and business people could then just connect to the cube and pull the data down as a pivot table. The only problem with large datasets there was compute (the larger the data, the slower the pivot table), not really a row limit.
With the current Azure Synapse setup, it seems that Excel tries to download the entire data set and so always hits the 1M-row limit. Is there any way to use the data directly in a pivot table without bringing it into Excel in full? All my tables are >1M rows.
Update: You can load data directly into a pivot table, but it loads the data into RAM and the loading takes time. I am looking for a cube-like solution, where the pivot table is available immediately and the querying happens as you add fields and calculations to it.
I have an ADF Pipeline which executes a Data Flow.
The Data Flow has:
a Source (table A) which has around 1 million rows,
a Filter with a query that selects only yesterday's records from the source table,
Alter Row settings that use upsert,
a Sink, which is the archival table where the records are upserted.
This whole pipeline takes around 2 hours or so, which is not acceptable; the records actually being transferred/upserted number only around 3,000.
The core count is 16. I tried partitioning with round robin and 20 partitions.
A similar archival doesn't take more than 15 minutes for another table which has around 100K records.
I thought of creating a source that would select only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
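As a sketch, the query handed to the source (in place of the whole table) could look something like this; the table name and the ModifiedDate column are assumptions about your schema:

    -- Assumed: dbo.SourceA has a ModifiedDate column recording when each row changed.
    -- Only yesterday's rows (roughly the 3,000 you mention) enter the data flow.
    SELECT *
    FROM dbo.SourceA
    WHERE ModifiedDate >= DATEADD(day, -1, CAST(GETDATE() AS date))
      AND ModifiedDate <  CAST(GETDATE() AS date);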
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data into an Azure Storage Blob before using another Copy activity to pull from that Blob into the destination. But, for what you're describing here, that doesn't seem necessary.
How can we load data into a table that has computed columns using ADF V1? Currently we cannot insert the data into the Azure SQL Database.
Suppose the source system is a JDBC connection and the target is an Azure SQL Database table that has a computed column (say, extracting the year from a date string such as "20181022"). When ADF tries to load the table, it complains about a column mismatch with the target system.
Is there a way to get around this?
I have a similar situation and am also looking for a better solution.
I am copying data from a prod db and updating the QA db with it.
Here's how I have been handling it so far: modify the hashbytes column in the target QA db table so that it is a varchar. But this solution is less than ideal because this will have to be done manually every single time a code change is pushed to the affected table.
Your situation is different.
If it were me, I would do a straight copy of the data into the Azure SQL Database. No computed columns. No modifications. Then I would stage the data into another table that had the computed columns I was after.
It is an Extract, Load, Transform approach, compared to the traditional Extract, Transform, Load method.
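A minimal sketch of that Extract, Load, Transform shape, using the year-from-"20181022" example from the question (all object names are hypothetical):

    -- 1. Landing table: plain columns only, so the copy from the source maps 1:1.
    CREATE TABLE dbo.Events_Staging (
        EventDate char(8)       NOT NULL,   -- e.g. '20181022'
        Payload   nvarchar(max) NULL
    );

    -- 2. Final table: the computed column derives the year from the raw string.
    CREATE TABLE dbo.Events (
        EventDate char(8)       NOT NULL,
        Payload   nvarchar(max) NULL,
        EventYear AS CAST(LEFT(EventDate, 4) AS int) PERSISTED
    );

    -- 3. Transform step: insert only the real columns; EventYear computes itself.
    INSERT INTO dbo.Events (EventDate, Payload)
    SELECT EventDate, Payload
    FROM dbo.Events_Staging;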
This question is really old, but hey, I found it just now; maybe others will, too.
I have Table A prompted on Year/Month and Table B. Table B also has a Year/Month column. Table A is the default data table (gets pulled in first). I have set up a relationship between Table A and B on the common Year/Month column.
The goal is to get Table B to only pull through data where the Year/Month matches the Year/Month on Table A (what the user entered). The purpose is to keep the user from entering the Year/Month multiple times.
The issue is that Table B contains almost 35 million records. What I do not want is for Spotfire to pull across all 35 million records. What currently happens is that Spotfire pulls all of those records, and then, by setting filtering on Table B to include Filtered Rows Only, I limit what is seen in the visualization to under 200,000 rows. I would much rather pull across only those 200,000 rows to start with.
The question: Is there a way to force Spotfire to filter the data table (Table B) by another data table (Table A) as it pulls the data table (Table B) across, thus only pulling a small number of records into memory?
I'm writing this on the basis that most people use information links to get data into Spotfire, especially for large data sets where the data is not embedded in the analysis. That being said, I prefer to handle as much (if not all) of the joining / filtering / massaging at the data source rather than in the Spotfire application. Here are my views on the best practices and why.
Tables / Views vs Procedures as Information Links
Most people are familiar with the Table / View structure and get data into Spotfire in one of two ways:
Create all joins / links in Information Designer, based on data relations defined by the author, by selecting individual tables from the available data sources
Create a view (or similar object) at the data source where all joining / data relations are done, thus giving Spotfire a single flat file of data
Personally, I find option 2 much easier IF you have access to the data source, since the data source is designed to handle this type of work. Spotfire just makes it available, but with limited functionality (complex queries, IntelliSense, etc. aren't available; there is no native IDE). What's even better, IMHO, is stored procedures, and here is why.
In options 1 and 2 above, if you want to add a column you have to change the view / source code at the data source, or individually add a column in Information Designer. This fragments your objects and clutters up your library: when you create an information link there is a folder with all the elements associated with it, and if you add columns later you'll have another folder for those added columns, which gets confusing and hard to manage.
If you instead create a procedure at the data source to return the data you need, and later want to add some columns, you only have to change the procedure at the data source. Everything else is inherited by Spotfire; all you have to do is click the "reload data" button. You don't have to change anything in Information Designer. Additionally, you can easily add new parameters, set default parameter properties, or prompt the user, making this a very efficient method of data retrieval. This is perfect when the data source is an OLTP system rather than a data mart / data warehouse (i.e. the data isn't already aggregated / cleansed), but it can be powerful in data warehouse environments as well.
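To illustrate (procedure, table, and column names here are mine, not from the question), a procedure along these lines returns one flat result set that is already filtered at the source, and adding a column later only means changing the procedure itself:

    -- Hypothetical procedure backing an information link: Spotfire calls it and
    -- gets back a single flat, already-filtered result set.
    CREATE PROCEDURE dbo.GetSalesForMonth
        @YearMonth char(6)   -- e.g. '202401', supplied via a prompt/parameter
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT b.YearMonth, b.Region, b.Amount
        FROM dbo.TableB AS b
        WHERE b.YearMonth = @YearMonth;   -- only the requested month leaves the database
    END;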
Ditch the GUI, Edit the SQL
I find managing conditions, parameters, join paths, etc a bit annoying--but that's me. Instead, when possible, I prefer to click "Edit SQL" next to all the elements in my Information Link and alter the SQL there. This will allow database guys to work in an environment which is more familiar.