Suppose I have two tables: TableA (embedded data) and TableB (external data).
Scenario 1:
TableB is set to on-demand based on the marking in TableA. When you mark something in TableA, it takes some "n" seconds to populate the data in TableB. The on-demand setting on the external table is shown in the screenshot named LOD.png.
Scenario 2:
On-demand settings have not been applied to TableB (note that TableB is still external). A relationship has been created between TableA and TableB, and TableB is now limited based on the marking in TableA via the option "Limit data using markings" (see the screenshot named ss2).
Questions:
1. Which scenario fetches data more quickly?
2. From the debug log, the query passed in both scenarios is the same. Does that mean both scenarios are the same, or are they different?
Scenario 1 is good if Table B is really large, or records take a long time to fetch from the database. In this case, Table B is only returning rows that are based on what you marked in Table A. This means that the number of rows could be significantly less, but this also means that every time the marking changes, those rows have to be fetched at that time. If the database takes a long time, this can become frustrating. On the flip side, if the database is really fast and you are limiting rows down enough, this can be almost seamless. At the end of the day, you are pulling this data into memory after the query runs, so all Spotfire functionality is available.
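To make scenario 1 concrete, the statement Spotfire builds for an on-demand table is essentially a parameterized query whose WHERE clause is fed by the marked rows. The sketch below is only illustrative, with hypothetical schema, table, and column names; the actual SQL in your debug log will look different.

SELECT b."KEY_COL", b."VALUE_COL"
FROM "MY_SCHEMA"."TABLE_B" b
WHERE b."KEY_COL" IN (?markedKeys)  -- values taken from the rows currently marked in TableA

Every time the marking changes, a query of this shape is re-sent and only the matching rows are loaded into memory.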
Scenario 2 is good if calculations are highly complex and need to take advantage of the power of the external database to perform. This means that any change to the report, such as a change of visualization, will require a new query to be sent to the external data source, resulting in a new table of aggregated data. It also means that no changes to a visualization using an in-db data table can be made when you are not connected to the external data source. Please note that some functionality Spotfire offers for in-memory data, like data on demand, is not available for external data.
I personally like to keep data close to Spotfire to take advantage of all Spotfire functionality, but I cannot tell you exactly which is the correct method in your case. Perhaps these TIBCO links on the difference between in memory data and external data can help:
https://docs.tibco.com/pub/spotfire/6.5.1/doc/html/data/data_overview.htm
https://docs.tibco.com/pub/spotfire/6.5.1/doc/html/data/data_working_with_in-database_data.htm
I have an ADF pipeline which executes a Data Flow.
The Data Flow has a source table (Table A) with around 1 million rows,
a Filter with a condition that selects only yesterday's records from the source table,
an Alter Row transformation set to upsert,
and a sink, which is the archival table where the records get upserted.
This whole pipeline takes around 2 hours or so, which is not acceptable. In fact, only around 3,000 records are actually being transferred/upserted.
The core count is 16. I have tried partitioning with round robin and 20 partitions.
A similar archival takes no more than 15 minutes for another table which has around 100K records.
I thought of creating a source which would select only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table of the dataset really doesn't matter. Whichever activity you use to access that dataset can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
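As a rough illustration, a source query along these lines would push the date filter down to the database; it assumes an Azure SQL source and hypothetical table/column names (dbo.SourceA, ModifiedDate), so adjust them to your schema:

SELECT *
FROM dbo.SourceA
WHERE ModifiedDate >= DATEADD(day, -1, CAST(GETDATE() AS date))  -- start of yesterday
  AND ModifiedDate <  CAST(GETDATE() AS date);                   -- start of today

With a query like this, only yesterday's ~3,000 rows ever enter the Data Flow instead of the full 1 million.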
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data into an Azure Storage blob before using another Copy activity to pull from that blob into the destination. But for what you're describing here, that doesn't seem necessary.
I am very new to Microsoft Access. I work in an immunization program where I routinely collect data about vaccinated children. I used to have Excel spreadsheets (a different spreadsheet for every campaign), but as the number of spreadsheets grew, comparing data across campaigns became difficult.
Now I am trying to get all the data into a database program in which I can bring data from multiple campaigns into a single report easily.
After jumping into Access, I first need to get done the basic things that Excel could do very easily, e.g.:
This is sample data from day 1 and day 2 of the campaign. In Access I can create a query which shows data from day 1 only, with a totals row at the end. But how do I create a single query/report which shows a separate totals row for each day? In other words, how do I reproduce the data in the pictures above in a single Access report?
Edit:
I am planning a single table that contains data from all the campaigns in various columns.
The table in Microsoft Access looks like this:
Link to the Access database file:
link to access database file
June7 has provided the correct advice - though it is cryptic.
Your table structure appears correct - you want a single table with the identity of the campaign and date (or whatever parameters you seek to differentiate them).
Your query is what collects those records into a record set. (Although there is an aggregate query type, it is not meant for what you seek to do.) You probably don't want to report on the entire database each time, so one uses a query, rather than the table itself, as the record source of the report. How that query is delimited is up to you.
The report object is where one then groups - and a group can have a total. You will want to look into online/text instructions on this for actual implementation.
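As a hedged sketch, suppose your table is named CampaignData with hypothetical columns CampaignID, VaccinationDate, and ChildrenVaccinated (rename these to match your actual fields). The report's record source could be a plain select query such as:

SELECT CampaignID, VaccinationDate, ChildrenVaccinated
FROM CampaignData
WHERE CampaignID = [Enter campaign number:];

In the report design you then group on VaccinationDate (Sorting & Grouping) and add a text box with a control source like =Sum([ChildrenVaccinated]) in the group footer, which gives you a separate totals row for each day within a single report.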
I have Table A prompted on Year/Month and Table B. Table B also has a Year/Month column. Table A is the default data table (gets pulled in first). I have set up a relationship between Table A and B on the common Year/Month column.
The goal is to get Table B to only pull through data where the Year/Month matches the Year/Month on Table A (what the user entered). The purpose is to keep the user from entering the Year/Month multiple times.
The issue is Table B contains almost 35 million records. What I do not want to do is have Spotfire pull across all 35 Million records. What is currently happening is Spotfire is pulling all those records, then by setting filtering to include Filtered Rows Only on Table B, I am limiting what is seen in the visualization to under 200,000 rows. I would much rather just pull across 200,000 rows to start with.
The question: Is there a way to force Spotfire to filter the data table (Table B) by another data table (Table A) as it pulls the data table (Table B) across, thus only pulling a small number of records into memory?
I'm writing this on the basis that most people utilize information links to get data into Spotfire, especially for large data sets where the data is not embedded in the analysis. With that being said, I prefer to handle as much, if not all, of the joining / filtering / massaging at the data source rather than in the Spotfire application. Here are my views on the best practices and why.
Tables / Views vs Procedures as Information Links
Most people are familiar with the Table / View structure and get data into Spotfire in one of two ways:
Create all joins / links in Information Designer, based on data relations defined by the author, by selecting individual tables from the available data sources
Create a view (or similar object) at the data source where all joining / data relations are done, thus giving Spotfire a single flat file of data
Personally, option 2 is much easier IF you have access to the data source, since the data source is designed to handle this type of work. Spotfire just makes it available, but with limited functionality (i.e. complex queries, IntelliSense, etc. aren't available; there is no native IDE). What's even better, IMHO, is stored procedures, and here is why.
In options 1 and 2 above, if you want to add a column you have to change the view / source code at the data source, or individually add a column in the information designer. This creates dwarfed objects and clutters up your library. For example, when you create an information link there is a folder with all the elements associated with it. If you want to add columns later, you'll have another folder for any columns added, and this gets confusing and hard to manage.

If you create a procedure at the data source to return the data you need, and later want to add some columns, you only have to change this at the data source, i.e. change the procedure. Everything else will be inherited by Spotfire... all you have to do is click the "reload data" button in Spotfire. You don't have to change anything in the information designer. Additionally, you can easily add new parameters, set default parameter properties or prompt the user, making this a very efficient method of data retrieval.

This is perfect when the data source is an OLTP and not a data-mart/data-warehouse (i.e. the data isn't already aggregated / cleansed), but it can also be powerful in data warehouse environments as well.
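To make the procedure approach concrete, here is a minimal sketch assuming a SQL Server source and hypothetical table, column, and parameter names (none of these come from the question):

CREATE PROCEDURE dbo.GetSalesForPeriod
    @YearMonth char(6)  -- e.g. '202401'; Spotfire can prompt for this or supply a default
AS
BEGIN
    SET NOCOUNT ON;
    -- Add a column to this SELECT later and Spotfire picks it up on the next reload,
    -- with no changes needed in Information Designer.
    SELECT s.SaleId, s.SaleDate, s.Amount, c.CustomerName
    FROM dbo.Sales AS s
    JOIN dbo.Customers AS c ON c.CustomerId = s.CustomerId
    WHERE CONVERT(char(6), s.SaleDate, 112) = @YearMonth;
END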
Ditch the GUI, Edit the SQL
I find managing conditions, parameters, join paths, etc. in the GUI a bit annoying--but that's me. Instead, when possible, I prefer to click "Edit SQL" next to the elements in my information link and alter the SQL there. This allows database guys to work in an environment which is more familiar.
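For illustration, the SQL behind an information link element might end up looking something like the sketch below; the schema, table, and column names are made up, and ?yearMonth stands for a Spotfire prompt/parameter (check your own "Edit SQL" dialog for the exact placeholder it generates):

SELECT f."YEAR_MONTH", f."REGION", f."SALES_AMOUNT"
FROM "DW"."FACT_SALES" f
WHERE f."YEAR_MONTH" = ?yearMonth  -- limits the rows before they ever reach Spotfire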
The objective I am trying to achieve is to have two slicers in PowerPivot, ClientID and CSQName. When a ClientID is selected, only the CSQNames that are related to that ClientID show up, and vice versa.
Relationship diagram link: https://goo.gl/photos/PnCZrnsXXTx3oFGh8
I am having a problem linking a many-to-many relationship in PowerPivot. A brief background on the application I am trying to build...
I am trying to combine a SQL database (IDM) and an Informix SQL database (Cisco Call Data). The IDM database includes the Client Data and TBAS Open Case Data. Each client has a specific ClientID. The Cisco database includes Call Detail info and CSQNames (queue names). A many-to-many relationship exists: for example, a ClientID can have multiple CSQNames (ClientID 3 has CSQ names "A" and "B"), and a CSQName can have multiple ClientIDs (CSQName "Z" includes ClientIDs "99", "98" and "97"). Therefore I created an innerjoin (junction) table called "Clients_CSQ" to model the many-to-many relationship.
I am trying to use this innerjoin table for both "TBAS Open Cases" and "Call Detail". When I use this table for my filters, PowerPivot states that no relationships exist. Are there any solutions? If this does not make sense, please let me know and I will try to be more specific. I have read many posts but am unable to grasp how to make the DAX many-to-many relationship work with the CALCULATE function. If someone can shed some light on the issue I am having, it would be greatly appreciated. Thank you.
This really depends upon the data you are looking to report on.
When you add two slicers to a PowerPivot table, the available selections in each slicer will be affected by the selection in the other slicer IF and ONLY IF all of the fields in the Values section of the Pivot Table are reliant on the entries in both of the slicer fields.
In your case, it is possible to make this work (as an example) by creating 3 measures:
[Call Total]=SUM('TBAS Open Cases'[Case duration])
[Number of Calls]=COUNTA('Call Detail'[appname])
[Calls by Duration]=SUMX('Clients_CSQ',DIVIDE([Call Total],[Number of Calls]))
Place the last of these 3 measures in a pivot table with the slicers set to use 'Clients_IDM'[ic_client_id] and 'CSQ Name'[csqname] and "Hey Presto!"
The first two measures are straightforward enough. The third one cycles through each entry in the only table that these two slicer fields have in common (Clients_CSQ) and performs a calculation using the data from your FACT tables. I have no idea if the [Calls by Duration] measure that I've come up with makes any sense with your data set, but hopefully the example will help you reach the solution you want. Again, depending on what data you want to show, it doesn't really matter if this measure returns junk; the important thing is that it pulls your two data sets together.
Remember that as soon as you add any raw field from either of the fact tables to this 'unifying pivot table', the inter-relationship between the slicers will break. !!!BUT!!! there is nothing to stop you from linking the csqname slicer to another pivot on the same sheet which contains fields from your Call Detail table and likewise linking the ic_client_id slicer to a pivot that contains TBAS Open Cases data. In fact, the 'unifying pivot table' could be on a different sheet from your slicers, so you only see the two sets of data that you are interested in.
And ignore that warning about no relationships existing!
Here is the situation we have:
a) I have an Access database / application that records a significant amount of data. Significant fields would be hours, # of sales, # of unreturned calls, etc.
b) I have an Excel document that connects to the Access database and pulls data in to visualize it
As it stands now, the Excel file has a Refresh button that loads new data. The data is loaded into a large PivotTable. The main 'visual form' then uses VLOOKUP to get the results from the form, based on the related hours.
This operation is slow (~10 seconds) and seems to be redundant and inefficient.
Is there a better way to do this?
I am willing to go just about any route - just need directions.
Thanks in advance!
Update: I have confirmed (thanks to helpful comments/responses) that the problem is with the data loading itself. Removing all the VLOOKUPs only took a second or two off the load time. So the question stands: how can I rapidly and reliably get the data without so much time involved (it loads around 3000 records into the PivotTables)?
You need to find out if it's the PivotTable refresh or the VLOOKUP that's taking the time.
(Try removing the VLOOKUPs to see how long it takes just to do the refresh.)
If it's the VLOOKUP, you can usually speed that up.
(see http://www.decisionmodels.com/optspeede.htm for some hints)
If it's the PivotTable refresh, then it depends on which method you are using to get the data (Microsoft Query, ADO/DAO, ...) and how much data you are transferring.
One way to speed this up is to minimize the amount of data you are reading into the pivot cache by reducing the number of columns and/or predefining a query to subset the rows.
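For instance, instead of pointing Microsoft Query / ADO at the whole table, you could base the connection on a query like the sketch below; the table and column names are hypothetical, so substitute your own:

SELECT LogDate, Hours, SalesCount, UnreturnedCalls
FROM ActivityLog
WHERE LogDate >= DateAdd('m', -3, Date());

This Access-SQL example pulls only the columns the PivotTable actually uses and only the last three months of rows, which keeps the pivot cache (and the refresh time) small.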