I am currently working on an Excel file ("Main") that has two sources of input from other Excel files, so we have three Excel files: "Main", "Source A", and "Source B".
Source A: load the table with Power Query (from Source A into our Main file), apply some transformations to the data, and load the results into a tab in the Main file.
Source B: load the table with Power Query (from Source B into our Main file), apply some transformations to the data, and load the results into a tab in the Main file.
The problem arises when we wish to refresh some, but not all, of the queries.
The transformations performed are intensive, and a refresh of the queries does take some time.
It seems that when a refresh is performed on the queries that transform data from Source A, the queries that transform data from Source B also refresh. Since no change has been made to Source B, we do not want the refresh to occur there.
My question is: how can we manage this, i.e. refresh only the queries that actually relate to the changes made at that specific time?
Thank you in advance for your time.
So, I want to create one Excel file for every manager, each with only a single sheet, and place it on OneDrive.
In order to get the data in one place, I am creating another Excel file called combined.xlsx.
Now, I can import each workbook into a tab using Data -> Get Data -> From File -> From Workbook.
This is great so far: I can read the data of 10 Excel files on 10 sheets in combined.xlsx.
Now, I want to modify the contents of one of the tabs and make sure the change is reflected in the original file. How can I do this?
To elaborate on why it is not possible, you need to understand how Power Query deals with data:
You load your data into Power Query via the "Data" tab. The source can be anything Microsoft allows.
You then manipulate the data any which way in Power Query.
As a last step, you decide if and where to load the results. If you only want to create a connection to the query, you click the arrow next to "Close and Load" and select "Close and Load To". Otherwise, the only other options are loading the query results to a table, a PivotTable report, or a PivotChart.
Because the output sheets you have are connected to the query that produced them, any time you refresh the query, whatever manual changes you made in the table the query originally created will be wiped out and overwritten with the refreshed data.
If you were able to write back to the source here, you'd in effect create a circular reference.
Check out this article about having Power Query output your data after manipulating it; maybe it helps.
Suppose I have two tables: TableA (embedded data) and TableB (external data).
Scenario 1:
TableB is set to On-Demand based on the markings from TableA. When you mark something in TableA, it takes some n seconds to populate the data in TableB. The On-Demand setting on the external table is shown in the screenshot named LOD.png.
Scenario 2:
On-Demand settings have not been applied to TableB (please note that TableB is still external). A relationship has been created between TableA and TableB, and TableB is now limited based on the marking from TableA by the option "Limit data using Markings" (screenshot named ss2).
Questions:
1. Which scenario fetches data quicker?
2. From the debug log, the query passed in both scenarios is the same. Does that mean both scenarios are the same, or are they different?
Scenario 1 is good if Table B is really large, or records take a long time to fetch from the database. In this case, Table B is only returning rows that are based on what you marked in Table A. This means that the number of rows could be significantly less, but this also means that every time the marking changes, those rows have to be fetched at that time. If the database takes a long time, this can become frustrating. On the flip side, if the database is really fast and you are limiting rows down enough, this can be almost seamless. At the end of the day, you are pulling this data into memory after the query runs, so all Spotfire functionality is available.
Scenario 2 is good if calculations are highly complex and need to take advantage of the power of the external DB to perform. This means that any change to the report, a change of visualization etc., will require a new query to be sent to the external data source resulting in a new table of aggregated data. This means that no changes to a visualization using an in-db data table can be made when you are not connected to the external data source. Please note, there is functionality in Spotfire that is available to in memory data like data on demand that is not available to external data.
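On question 2 specifically: in both setups the external source can receive essentially the same filtered SQL, built from the key values related to the marking, so identical statements in the debug log don't mean the scenarios behave the same; the difference is what happens to the result set. The sketch below is purely illustrative (table, column, and key values are invented, not what Spotfire actually generates for your source):

    -- Illustrative only: roughly the kind of statement either scenario could push
    -- to the external source when rows with key values 101 and 205 are marked in TableA.
    SELECT  b.*
    FROM    TableB AS b
    WHERE   b.KeyColumn IN (101, 205);   -- values taken from the marking on TableA

    -- Scenario 1 (on-demand): the result set is loaded into Spotfire's memory,
    --                         so all in-memory functionality is then available.
    -- Scenario 2 (in-database): the result stays external, and each visualization
    --                           change sends a new aggregation query to the source.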
I personally like to keep data close to Spotfire to take advantage of all Spotfire functionality, but I cannot tell you exactly which is the correct method in your case. Perhaps these TIBCO links on the difference between in memory data and external data can help:
https://docs.tibco.com/pub/spotfire/6.5.1/doc/html/data/data_overview.htm
https://docs.tibco.com/pub/spotfire/6.5.1/doc/html/data/data_working_with_in-database_data.htm
I have Table A prompted on Year/Month and Table B. Table B also has a Year/Month column. Table A is the default data table (gets pulled in first). I have set up a relationship between Table A and B on the common Year/Month column.
The goal is to get Table B to only pull through data where the Year/Month matches the Year/Month on Table A (what the user entered). The purpose is to keep the user from entering the Year/Month multiple times.
The issue is that Table B contains almost 35 million records. What I do not want to do is have Spotfire pull across all 35 million records. What is currently happening is that Spotfire pulls all those records, and then, by setting filtering to include Filtered Rows Only on Table B, I limit what is seen in the visualization to under 200,000 rows. I would much rather just pull across 200,000 rows to start with.
The question: Is there a way to force Spotfire to filter the data table (Table B) by another data table (Table A) as it pulls the data table (Table B) across, thus only pulling a small number of records into memory?
I'm writing this on the basis that most people use information links to get data into Spotfire, especially for large data sets where the data is not embedded in the analysis. With that being said, I prefer to handle as much, if not all, of the joining / filtering / massaging at the data source rather than in the Spotfire application. Here are my views on the best practices and why.
Tables / Views vs Procedures as Information Links
Most people are familiar with the Table / View structure and get data into Spotfire in one of two ways:
Create all joins / links in Information Designer, based on data relations defined by the author, by selecting individual tables from the available data sources
Create a view (or similar object) at the data source where all joining / data relations are done, thus giving Spotfire a single flat file of data
Personally, I find option 2 much easier IF you have access to the data source, since the data source is designed to handle this type of work. Spotfire just makes the data available, but with limited functionality (i.e. complex queries, IntelliSense, etc. aren't available; there is no native IDE). What's even better, IMHO, is stored procedures, and here is why.
In options 1 and 2 above, if you want to add a column you have to change the view / source code at the data source, or individually add a column in the Information Designer. This creates fragmented objects and clutters up your library. For example, when you create an information link there is a folder with all the elements associated with it. If you want to add columns later, you'll have another folder for any columns added, and this gets confusing and hard to manage.
If you create a procedure at the data source to return the data you need, and later want to add some columns, you only have to change this at the data source, i.e. change the procedure. Everything else will be inherited by Spotfire: all you have to do is click the "reload data" button in Spotfire. You don't have to change anything in the Information Designer. Additionally, you can easily add new parameters, set default parameter properties or prompt the user, making this a very efficient method of data retrieval. This is perfect when the data source is an OLTP rather than a data mart / data warehouse (i.e. the data isn't already aggregated / cleansed), but it can be powerful in data warehouse environments as well.
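A rough sketch of the procedure approach (all object, column, and parameter names below are invented for illustration; adapt them to your own schema). The procedure owns the joins and any columns you add later, and Spotfire just calls it through the information link:

    -- Hypothetical procedure returning the flat result set Spotfire consumes.
    -- Adding a column later only means changing this procedure at the data source
    -- and clicking "reload data" in Spotfire.
    CREATE PROCEDURE dbo.usp_GetSalesReport
        @YearMonth INT,                 -- e.g. 202401; can be prompted in Spotfire
        @Region    VARCHAR(50) = NULL   -- optional parameter with a default
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT  s.SaleId,
                s.YearMonth,
                s.Region,
                c.CustomerName,
                s.Amount
        FROM    dbo.Sales AS s
        JOIN    dbo.Customers AS c
          ON    c.CustomerId = s.CustomerId
        WHERE   s.YearMonth = @YearMonth
          AND   (@Region IS NULL OR s.Region = @Region);
    END;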
Ditch the GUI, Edit the SQL
I find managing conditions, parameters, join paths, etc. a bit annoying, but that's me. Instead, when possible, I prefer to click "Edit SQL" next to the elements in my information link and alter the SQL there. This will allow database guys to work in an environment which is more familiar to them.
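For example, for the Table B question above, the edited SQL of the information link could carry the Year/Month restriction so that only the matching rows ever leave the database. The object names below are placeholders, and you should check how your Spotfire version renders prompt/parameter placeholders in "Edit SQL" (they typically appear as ?Name):

    -- Hypothetical information-link SQL after editing: the generated SELECT with a
    -- hand-added WHERE clause so only the requested Year/Month is pulled across.
    SELECT  T1."YEAR_MONTH",
            T1."ACCOUNT_ID",
            T1."AMOUNT"
    FROM    "DW"."FACT_TABLE_B" T1
    WHERE   T1."YEAR_MONTH" IN (?YearMonth)   -- placeholder filled from the prompt/parameter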
We use an MS Access database which has a linked Excel table. This Excel file is linked to a SQL database report. From this Excel file we load the data into an internal table of the Access database. The loading process is the following:
run a delete query (this deletes all data from the internal table)
run an append query, which loads the data from Excel into the emptied internal table
These two steps cause the file size to grow (~1150 KB per run) even though the amount of data does not change after loading!
Because of this, we need to compact and repair the database frequently.
How can I stop this growth?
Don't import the file.
You have it already linked, so create a simple select query with the linked table as the only source, and use it to filter the data, convert values (for example, text dates to true Date values), and create the expressions you may need to proceed.
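A minimal sketch of such a select query in Access SQL, with placeholder table and field names (tblExcelLinked stands in for the linked Excel table). CDate converts the text dates to true Date values, CCur converts text amounts to currency, and the criterion drops the empty rows a linked sheet often carries. Access's SQL view doesn't accept comments, so the query is shown bare:

    SELECT  CDate([TextDate])  AS OrderDate,
            CCur([AmountText]) AS Amount,
            [CustomerName]
    FROM    tblExcelLinked
    WHERE   [CustomerName] Is Not Null;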
Then use this query as the source for your further processing.
If this is not possible, (re)create a temporary database from scratch, fill a table in it, and link that table to your main database.
That said, any Access database will grow when used if it is not write protected. This normally causes no harm.
Background:
I currently run reporting on a monthly basis from a source CSV file that's roughly 50 columns by 15k rows. I have an existing system where I import the data into SQL, use multiple stored procedures to handle the data transformations, and then use Excel connections to view the reports in Excel after the transformations. These transformations are relatively complex and consist of ~4 stored procedures at ~5 pages and around 200 lines of code each.
Problem:
The amount of code and tables in SQL to handle the transformations is becoming overwhelming. QA is a pain in the ass, having to track through all the tables and stored procedures to find out where the problem lies. This whole process, including extensive QA, is taking me 3 days to complete, where ideally I'd like it to take half a day total. I can run through all the stored procedures and Excel connections/formatting in a few hours, but currently it's more efficient to run QA after every single step.
Potential Solutions:
Would integrating SSIS help the automation and QA process?
I am new to SSIS, so how do data transformations work with SSIS?
Do I just link a stored proc as a step in the SSIS flow?
Note: I should specify that the results need to be displayed in Excel on a heavily formatted worksheet. Currently, there is a feeder sheet in Excel that fetches data from SQL views, and the report page has formula links to that feeder sheet.
I appreciate all the help in advance.
I've done something similar and partially converted a SQL stored procedure (SP) solution to SSIS.
You can call existing SPs using the SSIS Execute SQL Task, so I would start with an SSIS "Master" package that just executes all your SPs. Right out of the box that gives you control over dependencies, parallel execution, restartability & logging.
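As a sketch, each Execute SQL Task in that master package would simply run one of your existing procedures (the names below are placeholders); precedence constraints between the tasks then give you the dependency ordering and parallelism:

    -- Hypothetical statements, one per Execute SQL Task in the "Master" package.
    -- With an OLE DB connection, "?" is the placeholder mapped to an SSIS variable
    -- or parameter on the task's Parameter Mapping page.
    EXEC dbo.usp_Stage_SourceCsv @LoadDate = ?;
    EXEC dbo.usp_Transform_Step1;
    EXEC dbo.usp_Transform_Step2;
    EXEC dbo.usp_Build_ReportViews;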
Then I would incrementally chip away at replacing the SPs with SSIS Data Flow Tasks - this opens up the full range of SSIS transformation capabilities and is almost always a lot faster to build and run than SPs.
I would replace the Excel layer with a Reporting Services report, but this would probably be a lower priority.