I'm trying to analyze a set of data across three Excel tables in Tableau.
When I use just one table (transactions), the data is correct and behaves normally. However, when I connect the other tables, Tableau seems to take the originally correct data and delete or duplicate entries seemingly at random (I've looked for patterns to explain the problem but can't find any).
Here are some screenshots to illustrate the problem. I'm using inner joins. The tables I'm joining should have no impact on the data that I'm currently talking about.
This is a large data set (> 400,000 rows).
Correct Data:
From Excel, ordered by date
Tableau's Data Set (notice the missing entries (PK field) and incorrect order despite order by date)
Looking at PK and Distribution Amount
This is my first time using Stack Exchange, so I apologize if I breached any kind of etiquette requirements! Any suggestions as to how to correct the data set Tableau is detecting would be extremely helpful!
EDIT: Also, going deeper into the bad data, it seems that even when sorted chronologically the data in Tableau is not in the correct order.
I am trying to figure out how to create the most useful PivotTable for a user to view data for BI purposes. Here are two options I was considering:
(1) Traditional PivotTable, pivot values on top:
(2) Drill-down type PivotTable:
What are the pros and cons of each method? For example, one for each to start might be:
Drilldown
PRO: trivial to add additional drilldown variables.
Pivot:
PRO: can easily sort by the column headers in the table UI.
And, are there any other possible tabular displays of data, either another type of PivotTable or another type altogether?
I suggest keeping it simple. If the objective is to present a view of the revenue figures of each region summarized by gender, then the pivot table in option 1 is the more effective of the two: it shows everything relevant at a glance and keeps similar data at the same level, which makes it easier to compare.
Bear in mind that management requested that view to be able to effectively see how each region is performing on that specific category.
If the focus is revenue across genders, option 1 shows both genders in the same row for each region. It is easy to see that the best performer on revenue generated by females is the US, while the best performer on revenue generated by males is Canada. That is not easy to see in option 2.
If the focus is revenue for the same gender, option 1 shows it in one continuous column, which is not the case with option 2.
Option 2 is useful if the primary focus is revenue by region: if there is then a need for more detail on any region's performance, management can drill down to see what makes up the headline number. That is not the objective in this case, as the request is to show both.
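To make the difference between the two layouts concrete, here is a minimal Python sketch that builds both from the same records. The revenue figures and the function names (`crosstab`, `drilldown`, `best_region`) are made up for illustration; only the region and gender labels come from the discussion above.

```python
from collections import defaultdict

# Hypothetical revenue records: (region, gender, revenue). Figures are invented.
records = [
    ("US", "Female", 120), ("US", "Male", 80),
    ("Canada", "Female", 70), ("Canada", "Male", 110),
    ("Mexico", "Female", 60), ("Mexico", "Male", 50),
]

def crosstab(rows):
    """Option 1: one row per region, one column per gender."""
    table = defaultdict(dict)
    for region, gender, revenue in rows:
        table[region][gender] = table[region].get(gender, 0) + revenue
    return dict(table)

def drilldown(rows):
    """Option 2: a total per region with the gender breakdown nested under it."""
    return {
        region: {"total": sum(by_gender.values()), "detail": by_gender}
        for region, by_gender in crosstab(rows).items()
    }

def best_region(rows, gender):
    """In the crosstab layout this is a scan down a single column."""
    table = crosstab(rows)
    return max(table, key=lambda region: table[region].get(gender, 0))

print(crosstab(records))
print(drilldown(records)["US"])
print(best_region(records, "Female"), best_region(records, "Male"))
```

In the crosstab, comparing one gender across regions touches one column; in the drill-down structure the same comparison has to reach into every region's nested detail.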
Also, the best advice is to always agree on requirements with your clients (internal and/or external). You may find that they requested only what they believed was possible, and that once they have it they apply some "manual steps" to reach their ultimate goal, something you could have done entirely had you only known.
Pivot tables are used to:
summarize data
analyze data
explore data
present summary data
Both kinds of pivot table (traditional and drill-down) can do everything listed above.
It depends on what you want to achieve in BI.
If detailed data does not need to be shown or sorted, you can use the drill-down.
In BI, data is mostly used in summarised form, so the drill-down method works well for displaying it. You can always double-click to see the detailed data. See how to get details of drill
Drill Down:
a. Pros
Summarise Easily
Add sub points for summary
"Get Details" of Pivot for more details
b. Cons
Sorting the way you do in the traditional pivot is not possible. Instead, I would suggest using the pivot to drill down, so you can sort (and this moves to the Pros section :P) and check the pivot details in another form.
Traditional Way
This way you make full use of pivot tables for your data. You should explore this further in the links given below.
5 pivot tables you probably haven't seen before
pivot tables save your job
23 things you should know about Excel pivot tables
Generally every representation of data has a purpose and with this purpose there come certain advantages and disadvantages.
Obviously with any kind of report, the audience matters most. Which would put you in the classical Requirement Analysis situation where you need to figure out what your customer wants (What data is of interest? How should it be sliced? What medium is it consumed on?)
Is the Revenue by Gender an important KPI?
If it is not, why include it at all?
If it is, let's see what a potential reader would do to answer a question like "How do the women's sales for Mexico compare with Canada?"
Drill down table:
Understanding the table will take a couple of seconds, since they have to understand the different levels and their representation, work out the meaning of the bold and regular lines, and realize that the men's and women's values add up to the total value for a region
Find Mexico in the list
Find the row for women and the value of it on the right
Find Canada in the list
Find the row for women and the value of it on the right
Remember the value for Mexico or look it up again
Important here is also that this process will be repeated more or less exactly for every follow up question.
Traditional table
Understanding the table will be faster, they see a country name on the left and male/female on the top. Generally people are used to these tables since primary school and won't need further explanation.
Find Mexico in the list, go to the right until they find the value for women (if you try it you will see that you automatically see the values and the heading)
Find Canada in the list, (realize that it is only one line above) go to the right and have both values on top of each other.
For all the following questions the structure is easy to remember and it's a find and match game between rows and columns
I know that might be a bit subjective, but I hope the general idea is understandable
If you now have a question like "In which region do we sell more to men than women?" the advantage of a traditional over a drill down table becomes even more obvious.
With the drill down you will have to juggle several rows and their values while with the traditional you just skim through one column and look for the biggest value.
Is the Revenue per Region the main KPI?
In that case you should use a drill down table, possibly with additional levels (i.e. North America if the data is international, or US state, since I would assume it is of interest whether your product sells better in Alaska than in Florida).
Your audience can then decide which granularity they want to see and adjust it accordingly. The gender is on the bottom of the hierarchy so either you have curious people who are interested in just another figure or they don't care and just don't drill down that deep.
The assumption here is that you deliver the table on the highest aggregate level.
One could argue that the same problem of finding row etc. exists as well for this case but I would assume you wouldn't necessarily compare the sales for Yucatán with Alberta so you stay in one group of states for example and again just have to skim up or down to find the states of the same country so you can compare it.
Using drilldowns in pivot tables is, in my opinion, a tool to be used by analysts, and not managers. Pivot tables are not quite intuitive enough to be used on reports that are being sent on for BI review by management. Typically any report which is being circulated for review by the powers that be should be consistent from user to user. That means using drilldowns would display different numbers if different items are selected - which could lead to 2 people talking about different values without knowing it.
Many people in management level positions outside of the core analytical group will still print anything you e-mail them before they look at it. I suspect that this is more likely to be true in a less technologically advanced company (i.e. one which uses Excel as its database analysis tool instead of a full ERP-type system). In either case, anything being submitted for high level review should already be formatted exactly as you want them to see it.
The key in Excel deliverables within the workplace is to make review as easy as possible. That means all necessary information should be immediately visible on each tab, with a minimum of scrolling (maybe scrolling down if necessary, but never scrolling right), and absolutely no clicking required.
Conclusion - Do not force a reviewer to manipulate your Excel file to use it
You may like drilldowns because you see how powerful they are for adjusting reports as you analyze data, but once you have made your own analytical conclusions, those conclusions should be immediately apparent from the visible workspace that you leave for review.
Therefore, in order to achieve simplicity in high level review documents, you should use the 'traditional' format as you have shown it, which shows all numbers next to each other in an easy to read table.
We are evaluating whether we can migrate from SQL Server to Cassandra for OLAP. Given its internal storage structure, we can have wide rows. We almost always need to access data by date, and often within a date range, as we have financial data. If we use the date as the partition key to support filtering by date, we end up with fewer rows, each with a huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in the future, given that we process millions of transactions every day?
Do we need to change the access pattern so that we have more rows with fewer columns per row?
We need some performance insight to proceed in either direction.
Using wide rows is typically fine with Cassandra. There are, however, a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case.
The whole wide row is stored on the same node, so it needs to fit on that node's disk. Also, if some dates are accessed more frequently than others (e.g. today), you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance, however. Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision one needs to know all typical filter conditions. If you have any other fields you typically filter for as an exact match, you could add them to the partition key as well.
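As a rough illustration of the last point, the Python sketch below counts how many rows land in each partition for two key choices. It does not talk to Cassandra at all; the transaction records and the `account_id` field are hypothetical stand-ins for "another field you always filter on by exact match".

```python
from collections import Counter
from datetime import date

# Hypothetical transactions: (trade_date, account_id, amount).
transactions = [
    (date(2015, 1, 5), "acct-1", 10.0),
    (date(2015, 1, 5), "acct-2", 20.0),
    (date(2015, 1, 5), "acct-1", 30.0),
    (date(2015, 1, 6), "acct-2", 40.0),
]

def partition_sizes(rows, key_fn):
    """Count how many rows fall into each partition for a given key choice."""
    return Counter(key_fn(row) for row in rows)

# Partition key = date only: every transaction of a day shares one wide row.
by_date = partition_sizes(transactions, lambda r: r[0])

# Partition key = (date, account_id): each day is split across several partitions.
by_date_and_account = partition_sizes(transactions, lambda r: (r[0], r[1]))

print(max(by_date.values()), max(by_date_and_account.values()))
```

The composite key only helps if queries always supply the extra field; a query by date alone would then have to read several partitions.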
I have recently been working on a data table in Excel containing measurements of fossil specimens. In addition to containing things like the specimen number, species name, etc., the table also contains measurements from the fossils in question. However, because several specimens have data from both the left and right sides of the specimen, I often end up with situations where a single entry spans multiple rows, which means I cannot sort the data.
I have looked elsewhere on the Internet for a solution, and the only response I have gotten is that Excel doesn't really work well with entries spanning multiple rows, and I should reorganize my data. I understand that, and I have been looking for an alternate way of organizing the data. However, I have not been able to find an easy way to organize the data. I have tried reorganizing the information so each entry spans multiple rows, but when I do this it becomes very easy to make mistakes and to lose track of the data. It also becomes difficult to compare the data, since the measurements on the left and right side of the specimen are essentially the same thing and I cannot easily compare them if one specimen has a bone only preserved on the right side and the other specimen has the same bone preserved but only on the left side.
I have also tried organizing the measurements into a separate sheet which could be accessed by a hyperlink from the main sheet, but this has also posed problems. Because in this case the measurement data still cannot be sorted by specimen number or species name, if a specimen number or species name changes (which it has in the past), I have to reorganize all the hyperlinks by hand.
Finally, I have also tried adding an identifier to the multi-row entries, but this has a tendency to get screwed up if I sort the data, and it also mixes up any equations I use in the sheet. I might be doing it wrong somehow.
The good news is I am not interested in sorting the specimens by measurements, so if there is any way to organize the table so it is sortable but the measurements cannot be sorted, that is fine. At the same time, because all specimens technically have a left and right side (plus the average measurement between them), I could also work with a system wherein each "entry" spanned a set number of rows or subrows.
I was also wondering if it would be possible to write a macro to sort the data (especially since I am just sorting by the first five columns or so), or else do the database in some other program like Microsoft Access. Any help would be greatly appreciated.
Everything you describe really breaks down to: "Excel is great for analysis; but I really should have stored my source data in a database." Accountants I have worked with almost always come to this conclusion eventually, once their data and reporting needs get sufficiently complex.
I suggest you invest the moderate effort to upload your data to a proper database, and learn how to download it as appropriate to Excel for specific analyses. The time and effort will be well spent, and simpler by far than coercing Excel into tasks for which it is ill-suited.
MS-Access, MySQL, and SQL-Server Express are all suitable for this type of upgrade. MS-Access, if already available in your Office subscription, has the advantage of integrating even more easily with Excel than the other two, and also uses VBA as its macro language. The other two offer more complete and powerful implementations of the SQL language. All told, use the one most easily available to you.
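To give a feel for what the database layout could look like, here is a minimal sketch using SQLite through Python's standard `sqlite3` module (chosen only because it needs no installation; it is not one of the three products named above). The table names, column names and sample values are all assumptions. Each specimen gets exactly one row, and each left or right measurement is its own row in a second table, so sorting specimens never splits an entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE specimen (
        id INTEGER PRIMARY KEY,
        specimen_number TEXT NOT NULL UNIQUE,
        species TEXT NOT NULL
    );
    CREATE TABLE measurement (
        specimen_id INTEGER NOT NULL REFERENCES specimen(id),
        bone TEXT NOT NULL,
        side TEXT NOT NULL CHECK (side IN ('left', 'right')),
        length_mm REAL NOT NULL,
        PRIMARY KEY (specimen_id, bone, side)
    );
""")
# Hypothetical sample data.
conn.executemany("INSERT INTO specimen VALUES (?, ?, ?)", [
    (1, "SP-002", "Species B"),
    (2, "SP-001", "Species A"),
])
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", [
    (1, "femur", "left", 100.0),
    (1, "femur", "right", 102.0),
    (2, "femur", "right", 90.0),  # only the right side is preserved
])

# One output row per specimen and bone: left, right and their mean side by side.
rows = conn.execute("""
    SELECT s.specimen_number, s.species, m.bone,
           MAX(CASE WHEN m.side = 'left' THEN m.length_mm END) AS left_mm,
           MAX(CASE WHEN m.side = 'right' THEN m.length_mm END) AS right_mm,
           AVG(m.length_mm) AS mean_mm
    FROM specimen AS s
    JOIN measurement AS m ON m.specimen_id = s.id
    GROUP BY s.id, m.bone
    ORDER BY s.specimen_number
""").fetchall()

for row in rows:
    print(row)
```

Renaming a species or specimen number is then a change to a single row in `specimen`, and the measurements follow automatically because they refer to the specimen by `id`.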
I have an Excel workbook that utilises a data table (A).
I now want to create another data table (B) that effectively sits on top of the other data table. That is, each "iteration" of B calls A.
This approach fails although I cannot find any documentation about data tables that indicates that this would not work.
Basically I'd like to know if anyone has tried this before and whether I am missing something?
Is there a workaround? Do you know of any documentation that spells out whether and why this is not supported?
No.
I tried this at length some years ago in both xl03 and xl07, and my conclusion was that it can't be done: each data table seems to be an independent one-off run, and they don't talk to each other if you try to link them.
I couldn't find any documentation on this issue either, whether on the process itself or from anyone else looking at a similar problem.
I want to share my experience using the data tables.
We have found a workaround for this problem.
Say you have two variables, A and B, that need to run through a data table to produce one or more results.
What we've done is:
List every combination of A and B (for binary inputs, every binary combination) and give each combination an id (A=0 & B=0 => id=1).
You then run a single data table with a length of A*B.
The drawback here is the time it takes to calculate the data (7 min with 25 data tables and 2 data tables with a length of 8000 rows).
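The idea of flattening two nested tables into one id-indexed table can be sketched outside Excel as follows. This is only an illustration in Python: `model` is a hypothetical stand-in for whatever the workbook calculates, and the input values are made up.

```python
from itertools import product

def model(a, b):
    """Hypothetical stand-in for the workbook calculation."""
    return a * 10 + b

a_values = [0, 1]
b_values = [0, 1]

# Give every (A, B) pair its own id, starting at 1 (A=0 & B=0 => id=1).
scenarios = {
    scenario_id: pair
    for scenario_id, pair in enumerate(product(a_values, b_values), start=1)
}

# One flat pass over the ids replaces the nested pair of data tables.
results = {sid: model(a, b) for sid, (a, b) in scenarios.items()}

print(scenarios)
print(results)
```

In the workbook, the equivalent would be a lookup range mapping each id to its A and B values, with the single data table varying only the id.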
Hope it helps!
I have a horse racing database that holds the results for all handicap races in the 2010 flat season. The spreadsheet has now got too big and I want to convert it to a MySQL database. I have looked at many sites about normalizing data and database structures, but I just can't work out what goes where, or what the PRIMARY KEYS, FOREIGN KEYS etc. should be. I have over 30000 lines in the spreadsheet. The column headings are:
RACE_NO,DATE,COURSE,R_TIME,AGE,FURS,CLASS,PRIZE,RAN,Go,BHB,WA,AA,POS,DRW,BTN,HORSE,WGT,SP,TBTN,PPL,LGTHS,BHB,BHBADJ,BEYER
Most of the columns are obvious; the following explains the less obvious ones. BHB is the class of race, WA and AA are weight allowances for age and weight, TBTN is total distance beaten, PPL is pounds per length, and the last 4 are ratings.
I managed to import the data into MySQL as a flat file by saving the spreadsheet as a comma delimited file, but I need to structure the data into a normalized state with the proper KEYS.
I would appreciate any advice.
Many thanks
Davey H
To do this in the past, I've done it in several steps...
Import your Excel spreadsheet into Microsoft Access
Import your Microsoft Access database into MySQL using the MySQL Workbench (previously MySQL GUI Tools + MySQL Migration Toolkit)
It's a bit disjointed, but it usually works pretty well and saves me time in the long run.
It's kind of an involved question, and it would be difficult to give you a precise answer without knowing a little bit more about your system, but I can try to give you a high level overview of how Relational Database Management Systems (RDBMSs) are structured.
A primary key is an identifier for a particular record, and it must be unique to that record. In this case, your RACE_NO column might be a suitable primary key for a table of races. That way, you can identify every race by its unique number.
Foreign keys are values that describe the relationships between tables in your database: a foreign key column holds primary key values from another table. For example, you may want to create a table that lists all the different classes of races. Each record in that table would have a primary key, unique to that class. If you wanted to indicate in your "races" table which class each race was, you might have a column for each record called class_id. The value of that column would be populated with primary keys from the "classes" table. You can then use join operations to bring all the information together into one view.
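The "classes" and "races" example above can be sketched as follows. This uses SQLite through Python's standard `sqlite3` module rather than MySQL, purely so that it runs without a server; the column choices and sample rows are assumptions, and the MySQL syntax would differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE classes (
        class_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE races (
        race_no INTEGER PRIMARY KEY,
        course TEXT NOT NULL,
        class_id INTEGER NOT NULL REFERENCES classes(class_id)
    );
""")
# Hypothetical sample rows.
conn.executemany("INSERT INTO classes VALUES (?, ?)",
                 [(1, "Class 2"), (2, "Class 4")])
conn.executemany("INSERT INTO races VALUES (?, ?, ?)",
                 [(101, "Ascot", 1), (102, "York", 2)])

# The join brings the class name back alongside each race.
rows = conn.execute("""
    SELECT r.race_no, r.course, c.name
    FROM races AS r
    JOIN classes AS c ON c.class_id = r.class_id
    ORDER BY r.race_no
""").fetchall()
print(rows)

# A race pointing at a class that does not exist is rejected.
try:
    conn.execute("INSERT INTO races VALUES (103, 'Epsom', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The per-runner columns (HORSE, POS, WGT and so on) would go in a further table whose rows each carry a race_no foreign key, since one race has many runners.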
For more on data structures and MySQL, I suggest the W3Schools tutorials on SQL: http://www.w3schools.com/sql/sql_intro.asp
Before anything else, you need to define your data: you have to fit every column into a value space known to MySQL.
Numeric value
http://dev.mysql.com/doc/refman/5.0/en/numeric-types.html
Textual value
http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html
Date/Time value
http://dev.mysql.com/doc/refman/5.0/en/date-and-time-type-overview.html