Background
I am building a simple dashboard with the Power BI plugins for Excel (Power Query, Power Pivot & Power View) to get some experience with Power BI. The dashboard is for presenting simple time reports made by a consultant (i.e. myself). The format I want to use for inputting data is an Excel table as follows:
InputData:
Date | Timecode | Duration[hrs] | Tags
-----------|-----------|---------------|----------------------
2016-02-01 | CustomerA | 1.2 | Support;ProductA
2016-02-01 | CustomerB | 0.3 | Support;ProductB
2016-02-02 | Internal | 4.2 | Development;ProductA
The Date field is simple: the date that the time report is for. The Timecode is the "name" of the hours reported. In common software this is usually a Project code or similar, but I want to keep it on a customer basis. The Duration is a float representing the number of hours spent on that Timecode that day. The Tags column is the interesting part: to simplify input I want it to be a (semicolon-)delimited string, but that won't do when creating a data model for Power View.
What I am trying to make is a separate table with all the tags, and a link table to connect the tags to the corresponding rows from the time report. In the Power View report, I want to be able to filter my time reports on the tags, such as analyzing the time spent on ProductA or Support.
Question
How do you take a non-normalized field such as Tags above and replace it with a dimension table and a link table, using the Power BI plugins for Excel? How do I end up with the following three tables:
TimeReport:
Date | Timecode | Duration[hrs] | TimeReportID
-----------|-----------|---------------|----------------------
2016-02-01 | CustomerA | 1.2 | 1
2016-02-01 | CustomerB | 0.3 | 2
2016-02-02 | Internal | 4.2 | 3
LinkTable:
TimeReportID | TagID
-------------|--------
1 | 1
1 | 2
2 | 1
2 | 3
3 | 4
3 | 2
TagsTable:
TagID | TagName
-------|----------
1 | Support
2 | ProductA
3 | ProductB
4 | Development
Attempt
By picking out only the Tags column and then splitting, pivoting and removing duplicates (inspired by this link), I have managed to create the list of all tags, as in:
Tags:
TagName
----------
Support
ProductA
ProductB
Development
But I can't work out how to link the tables to each other. Please help me with this.
I think you have 2 options:
using Power Query, add Merge & Expand Column steps to join TimeReport to LinkTable and then TagsTable
using Power Pivot, load all 3 tables, then go to the Diagram view and establish relationships between them. Use "Hide from Client Tools" to hide the columns that are meaningless to the user, e.g. TagID
I prefer Power Query as the functionality is more flexible and it is much easier to debug.
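For the Power Query option, here is a rough M sketch of the shape of those queries. It assumes the input table has been loaded as a query named InputData with the columns shown above; the query names, step names and join details are only illustrative, and the editor's Add Index Column, Split Column by Delimiter and Merge Queries commands generate equivalent steps.

// TimeReport: add a surrogate key and drop the raw Tags column
let
    Source = Excel.CurrentWorkbook(){[Name="InputData"]}[Content],
    AddKey = Table.AddIndexColumn(Source, "TimeReportID", 1, 1),
    TimeReport = Table.RemoveColumns(AddKey, {"Tags"})
in
    TimeReport

// TagsTable: split Tags into one row per tag, keep distinct names, add a TagID
let
    Source = Excel.CurrentWorkbook(){[Name="InputData"]}[Content],
    ToList = Table.TransformColumns(Source, {{"Tags", each Text.Split(_, ";"), type list}}),
    OneRowPerTag = Table.ExpandListColumn(ToList, "Tags"),
    DistinctTags = Table.Distinct(Table.SelectColumns(OneRowPerTag, {"Tags"})),
    Renamed = Table.RenameColumns(DistinctTags, {{"Tags", "TagName"}}),
    TagsTable = Table.AddIndexColumn(Renamed, "TagID", 1, 1)
in
    TagsTable

// LinkTable: one row per (TimeReportID, TagID) pair, merged with the TagsTable query to pick up TagID.
// The index is re-added from the same source order here for brevity; in practice you would reference
// the TimeReport query instead, so the IDs are guaranteed to line up.
let
    Source = Excel.CurrentWorkbook(){[Name="InputData"]}[Content],
    AddKey = Table.AddIndexColumn(Source, "TimeReportID", 1, 1),
    ToList = Table.TransformColumns(AddKey, {{"Tags", each Text.Split(_, ";"), type list}}),
    OneRowPerTag = Table.ExpandListColumn(ToList, "Tags"),
    Merged = Table.NestedJoin(OneRowPerTag, {"Tags"}, TagsTable, {"TagName"}, "TagDetails", JoinKind.Inner),
    WithTagID = Table.ExpandTableColumn(Merged, "TagDetails", {"TagID"}),
    LinkTable = Table.SelectColumns(WithTagID, {"TimeReportID", "TagID"})
in
    LinkTable

Load the three queries into the data model, then relate TimeReport to LinkTable on TimeReportID and LinkTable to TagsTable on TagID in Power Pivot's Diagram view to get the tag filtering described above.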
Related
I often build pivot table reports on products or customers for users. Users often request that additional information be displayed within the pivot table for those row items. For example, in a pivot table showing item sales by year, they want me to include a column that shows what TYPE of item it is or who the author is. These attributes are additional properties of the item, and they are 1:1 for each item. I don't want them to be grouped. How can I do this?
For example, this is a pivot of books by year. How do I add the Author attribute without it being grouped?
Title | Author | 2019 | 2020 | 2021 | Totals
Awesome Book | J. T. Joel | 100 | 120 | 90 | 320
Crappy Book | Karl Klapp | 50 | 40 | 10 | 100
I'm used to doing this pretty easily in SQL, so I'm confused as to why it is hard in this context. Probably I have some fundamental misunderstanding about how pivot tables work, but maybe I'm just missing something obvious?
Thanks.
I'm trying to understand the best practices around storing aggregated time series based data.
For instance, if I am building a weather service application that ingests lots of weather metrics from sensors around the world and stores that data as the weather for the day, the week, and the month, what's a good way to model that?
Would the day level, week level, and month level each have their own column family?
Then there's the factor of location. Each location would have its own weather data, so would partitioning by, say, some zip code or geohash for a specific area make sense?
The access patterns would be querying for the daily or weekly or monthly weather in a city.
Let's say every 5 minutes. Would that have an impact on the design?
Yes. So sensor updates every 5 minutes happen at 12x per hour or 288x per day.
That also makes for 2016x per week and 8640x per month (30 days). The reason this is important is that Cassandra has hard limits of 2 GB and 2 billion cells per partition. This means that storing time series data by city only would eventually hit this limit (although things would likely grind to a halt long before that).
But the general idea is that you want to model your tables around:
How you're going to query your data.
Avoiding unlimited partition growth.
So if we're just talking about temperatures and maybe a few other data points (precipitation, etc), partitioning by month and city should work just fine.
CREATE TABLE weather_sensor_data (
    city TEXT,
    month INT,
    temp FLOAT,
    recorded_time TIMESTAMP,
    PRIMARY KEY ((city, month), recorded_time))
WITH CLUSTERING ORDER BY (recorded_time DESC);
Now, I could query for weather sensor data since 8AM, like this:
> SELECT * FROM weather_sensor_data
WHERE city='Minneapolis, MN'
AND month=202111
AND recorded_time > '2021-11-01 08:00';
city | month | recorded_time | temp
-----------------+--------+---------------------------------+------
Minneapolis, MN | 202111 | 2021-11-01 08:35:00.000000+0000 | 3
Minneapolis, MN | 202111 | 2021-11-01 08:30:00.000000+0000 | 3
Minneapolis, MN | 202111 | 2021-11-01 08:25:00.000000+0000 | 2
Minneapolis, MN | 202111 | 2021-11-01 08:20:00.000000+0000 | 2
Minneapolis, MN | 202111 | 2021-11-01 08:15:00.000000+0000 | 2
(5 rows)
This should help you get started.
@dipen, you could also refer to this documentation, which walks developers through various data models by use case. @AlexOtt has asked great questions to begin with for the data models for your use case, and @aaron has given a great example demonstration.
Here is an example that you could very much customize for your weather use case: for a given access pattern requirement, we would go ahead and design a Cassandra table to answer it, as shown in the example.
When using the summary function "Group" in Saved Searches (SS), is it possible to show the total count inline or even use it inline? For example, I have a SS that counts the number of cases closed in a certain date range and it groups by the assigned employee and the total is listed at the bottom, as per usual. However, when trying to calculate the percent of the total each employee closed, they all show as 100%.
Here is a picture of the results, and I have also added the formulas I am currently using. Here is what it looks like when I'm editing the search. The rightmost columns were my attempt at getting the total inline.
I'm fairly certain this is because I am grouping by the employees (or else there would be almost 3k lines in the report), but I don't think there is a better way to solve that problem other than by grouping by the employee.
We have tried doing an actual report in NetSuite (as opposed to saved search), however, the report times out quickly and we are hoping for a quicker solution. We also considered a KPI scorecard, but the issue would be that we would need to make a SS for each employee which isn't a good long-term solution due to team changes.
Is there a way of calculating the percent of the total when using grouping? Sorry for the long post, I was trying to be as descriptive as possible. The goal is to see how much (percent wise) each employee contributed to the total cases closed.
Take your SUM column that counts the closed cases and duplicate it, but add the "% of Total" function. So:
| Field           | Summary Type | Function   | Formula                                              |
| --------------- | ------------ | ---------- | ---------------------------------------------------- |
| Formula Numeric | Sum          |            | Case When {status} like 'Closed%' then 1 else 0 End  |
| Formula Numeric | Sum          | % of Total | Case When {status} like 'Closed%' then 1 else 0 End  |
I want to get the number of people absent on a given day from a dump file that has data as follows:
Name | no of hrs | production | date
-----|-----------|------------|----------
ABC  | 6         | 55         | 11/26/17
PQR  | Absent    | Absent     | 11/26/17
MNO  | 5         | 50         | 11/26/17
XYZ  | Absent    | Absent     | 11/26/17
The rest of the data, for the remaining days, follows the same format.
I want to show a recurring graph with the percentage of people absent for a week, but I am unable to get the number of people absent on a particular day. I tried using COUNTIF, but it gives me the total number of people absent across the whole dump file. How can I get it day by day (using functions)?
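One way to get a per-day count, if the dump is pulled into Power Query rather than summarized with worksheet functions, is to group by date and count the "Absent" rows. This is only a sketch; the query name AttendanceDump is an assumption, and the column names are taken from the sample above.

let
    Source = Excel.CurrentWorkbook(){[Name="AttendanceDump"]}[Content],
    // one row per date, counting how many people are marked "Absent" that day
    AbsentPerDay = Table.Group(Source, {"date"},
        {{"AbsentCount", each Table.RowCount(Table.SelectRows(_, each [no of hrs] = "Absent")), Int64.Type}})
in
    AbsentPerDay

Dividing that count by the day's headcount then gives the percentage to plot for the week.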
I'm doing some work across two visualisation packages, primarily Gephi but also a bespoke package called Linkoder.
They can both use .CSV formats, but the data must be laid out differently, and I'm trying to find a way to do this quickly in Excel. The Transpose, Offset and Direct commands all seem to come close, but not quite.
Based on my transcript, I need to convert this matrix:
Statement | links
No1 | 5 4 3 2 1
No2 | 3 1
No3 | 6 4 2
No4 | 5 2 1
...to this target-link format:
Target | Link
No1 | 5
No1 | 4
No1 | 3
No1 | 2
No1 | 1
No2 | 3
No2 | 1
No3 | 6
I am struggling to find a simple way to do this, but that's likely because I'm less adept at Excel formulas than I would like to be.
Can anyone refer me to a command (or set of commands) which can quickly convert between these formats? I'm looking at thousands of lines of links to convert...
Thanks in advance!
hey_arno
Even though there are ways to do that in Excel, they are semi-automatic AFAIK. I would recommend having a look at OpenRefine, which is a tool for manipulating and tidying datasets. It can read from many sources, including Excel. What you need is to split the Link column by space and then transpose the resulting columns.
Check this tutorial and scroll down to Transposing columns for instructions on how to do it.
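If you would rather stay inside Excel, Power Query (Get & Transform) can also do this kind of split-and-expand. Below is a minimal M sketch, assuming the matrix is loaded as a query named StatementLinks with the two columns shown; the query name and step names are just illustrative.

let
    Source = Excel.CurrentWorkbook(){[Name="StatementLinks"]}[Content],
    // turn the space-delimited links into a list per row, then expand to one row per link
    ToList = Table.TransformColumns(Source, {{"links", each Text.Split(Text.Trim(_), " "), type list}}),
    OneRowPerLink = Table.ExpandListColumn(ToList, "links"),
    Renamed = Table.RenameColumns(OneRowPerLink, {{"Statement", "Target"}, {"links", "Link"}})
in
    Renamed

Recent versions of the Power Query editor also expose the same idea through Split Column by Delimiter with the "Split into Rows" option, so no hand-written code is strictly needed.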