In Excel, I have a log of web requests that I need to analyze for bandwidth usage. I have parsed the log into a number of fields that I will group by in different ways for different reports. Each page load fetches multiple resources, and each resource is a separate line. The data structure:
RequestID | SIZE | IsImage | IsStatic | Language
A | 100 | TRUE | TRUE | EN
A | 110 | TRUE | FALSE | EN
A | 90 | FALSE | FALSE | EN
...
Report 1: I need the AVERAGE request size: AVERAGE( SELECT SUM(SIZE) GROUPBY RequestID ). I do not need to see the size of each individual request.
Report 2: More elaborate pivot table reports showing average request size broken down by IsStatic / IsImage / Language / etc. This way I can check "average total images per request per language".
Is there a way to define a field/item "SUM(SIZE) GROUPBY RequestID" ?
As far as I know, this is not possible in a single pivot table, because you need to apply two separate aggregations to the same set of numbers based on a condition (RequestId).
It is possible to get what you are looking for using two pivot tables. I would not recommend it, but this is how you would do it.
Create the first pivot table on your base table, with RequestId in the rows and Size in the values. This gives you an intermediate table with the sum of size per RequestId. Then build a second pivot table that uses the first pivot table as its source. In this one, add only the 'Sum of Size' value and take its average. See below for an example.
Again, I would not recommend this approach for anything but the simplest analysis.
A better way to do this is to use Power Pivot, a separate but related technology to the pivot tables you have used. You will need to import the table (I have assumed the name [Logs], with columns [RequestId] and [Size]) and then add a calculation:
AverageSizeOfRequests:=AVERAGEX(SUMMARIZE(Logs;Logs[RequestId];"SumOfSize";CALCULATE(SUM(Logs[Size])));[SumOfSize])
This will give you the following result
The first is the straight sum, which you already have. The second is the average, which will be the same per RequestId but will aggregate differently.
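To make the two-step aggregation concrete, here is a minimal Python sketch of the same logic (sum per request first, then average those sums). It is not DAX and not how Power Pivot evaluates the measure. Request "B" and the function name average_request_size are made up for illustration; only request "A" comes from the question.

```python
from collections import defaultdict
from statistics import mean

# Rows mirror the question's layout: (RequestID, SIZE, IsImage, IsStatic, Language).
# Request "A" is taken from the question; request "B" is hypothetical.
rows = [
    ("A", 100, True, True, "EN"),
    ("A", 110, True, False, "EN"),
    ("A", 90, False, False, "EN"),
    ("B", 200, True, True, "EN"),
    ("B", 50, False, True, "EN"),
]


def average_request_size(rows, keep=lambda row: True):
    """Sum SIZE per RequestID over the rows that pass `keep`, then average the sums."""
    totals = defaultdict(int)
    for row in rows:
        if keep(row):
            totals[row[0]] += row[1]
    return mean(totals.values()) if totals else None


# Report 1: average total size per request.
avg_total = average_request_size(rows)
# Report 2 style slice: average total image size per request.
avg_images = average_request_size(rows, keep=lambda row: row[2])

print(avg_total, avg_images)
```

Note that a request with no rows left after the filter drops out of the average entirely instead of counting as zero, so decide which behaviour you want before trusting the sliced numbers.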
I guess I am not understanding your Q because I expect the group by for Request ID to be automatic (unavoidable in a PT with that as a Row label). Perhaps pick holes in the following and I might understand what I have misunderstood:
I have added i and s to your data just so it is clearer which column is which. It is possible it would be better to convert TRUE and FALSE into 1 and 0 so the PT might count or average these as well.
This seems vaguely along the right lines, so let's try a different PT layout. If RequestID is of little or no relevance for the required analysis, don't include it in the PT or, as here, park it as a Report Filter:
in which case however many millions of rows of data of the kind in the OP there are, the PT will always in effect be a 2x2 matrix at most (assuming Language is suited to Report Filter also). There is only one value per record (SIZE) and only two, boolean, variables. Language could make a difference but worst case is one such PT per Language (and bearing in mind only one such is shown in the example!...)
I am trying to analyze data based on the following scenario:
A group of places, each with its own ID, becomes available for visiting from time to time for a limited number of people. This number varies according to how well the last visit season performed. So far, visit seasons have been opened 3 times.
Let's suppose ID_01 in those three seasons had the following available slots/sold-out slots ratio: 25/24, 30/30, and 30/30, ID_02 had: 25/15, 20/18, and 25/21, and ID_03 had: 25/10, 15/15 and 20/13.
What would be the best way to design the database for such analysis on a single table?
So far I have used a table for each ID with all their available slots and sold-out amounts, but as the number of IDs gets higher and the number of visit seasons too (way beyond three at this point) it has been proving to be not ideal, hard to keep track of, and terrible to work with.
The best solution I could come up with was putting all IDs on a column and adding two columns for each season (ID | 1_available | 1_soldout | 2_available | 2_soldout | ...).
The Wikipedia article on database normalization would be a good starting point.
Based on the information you provided in your question, you create one table.
AvailableDate
-------------
AvailableDateID
LocationID
AvailableDate
AvailableSlots
SoldOutSlots
...
You may also have other columns you haven't mentioned. One possibility is SoldOutTimestamp.
The primary key is AvailableDateID. It's an auto-incrementing integer that has no meaning, other than to sort the rows in input order.
You also create a unique index on (LocationID, AvailableDate) and another unique index on (AvailableDate, LocationID). This allows you to retrieve the row by LocationID or by AvailableDate.
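As a rough illustration of the single-table layout above, here is a sketch using Python's built-in sqlite3 module. The slot figures are the ones from the question; the season dates, the index names and the text type for LocationID are assumptions made only so the example runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE AvailableDate (
        AvailableDateID INTEGER PRIMARY KEY AUTOINCREMENT,
        LocationID      TEXT    NOT NULL,
        AvailableDate   TEXT    NOT NULL,
        AvailableSlots  INTEGER NOT NULL,
        SoldOutSlots    INTEGER NOT NULL
    )
    """
)
# Index names are hypothetical; the column pairs follow the answer.
conn.execute("CREATE UNIQUE INDEX ix_loc_date ON AvailableDate (LocationID, AvailableDate)")
conn.execute("CREATE UNIQUE INDEX ix_date_loc ON AvailableDate (AvailableDate, LocationID)")

# (available, sold) pairs come from the question; the dates are placeholders.
seasons = ["2021-01-01", "2021-06-01", "2022-01-01"]
figures = {
    "ID_01": [(25, 24), (30, 30), (30, 30)],
    "ID_02": [(25, 15), (20, 18), (25, 21)],
    "ID_03": [(25, 10), (15, 15), (20, 13)],
}
for location, pairs in figures.items():
    for date, (available, sold) in zip(seasons, pairs):
        conn.execute(
            "INSERT INTO AvailableDate (LocationID, AvailableDate, AvailableSlots, SoldOutSlots) "
            "VALUES (?, ?, ?, ?)",
            (location, date, available, sold),
        )

# One row per location per season makes per-location analysis a plain GROUP BY.
fill_rates = dict(
    conn.execute(
        """
        SELECT LocationID, 1.0 * SUM(SoldOutSlots) / SUM(AvailableSlots)
        FROM AvailableDate
        GROUP BY LocationID
        """
    ).fetchall()
)
print(fill_rates)
```

Adding a fourth season is then just more rows, not two more columns per ID.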
I'm storing in a delta table the prices of products. The schema of the table is like this:
id | price | updated
1 | 3 | 2022-03-21
2 | 4 | 2022-03-20
3 | 3 | 2022-03-20
I upsert rows using the id field as primary key and updating the price and updated field.
I'm trying to get the series of prices over time using Databricks time travel. But looking at the documentation, apparently I can only compare 2 versions of a table, like this:
%sql
SELECT count(distinct id) - (
SELECT count(distinct id)
FROM table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM table
Is there a way to select the different prices of all versions? Like: distinct prices.
I would really not recommend using time travel for that, for the following reasons:
If your data is updated frequently, you will have a lot of versions, and performance will degrade over time, because handling a huge number of versions (tens of thousands) puts a lot of pressure on the driver.
It's very hard to do historical analysis, as you can already see: for each version you will need subqueries and a union of the data.
Instead, you can use two tables: the first with the current data, and the second with historical data, ideally built as an SCD Type 2 (Slowly Changing Dimension) with markers for the period in which each price was active. You can build that second table using Change Data Feed (CDF) functionality to pull changes from the first table, and apply them to the second table using a MERGE operation. The Databricks documentation includes an example of using MERGE to build SCD Type 2 (although without CDF).
With this approach it will be easy for you to perform historical analysis, as all data will be in the same table and you don't need to use time travel.
Due to performance issues I need to remove a few distinct counts in my DAX. However, I have a particular scenario and I can't figure out how to do it.
As an example, let's say one or more restaurants can be hired at one or more feasts and prepare one or more menus (see data below).
I want a PowerPivot table that shows in how many feasts each restaurant was present (see table below). I achieved this by using distinctcount.
Why not precalculate this in Power Query? The real data I have is a bit more complex (more ID columns), and in order to be able to pivot the data I would have to calculate thousands of possible combinations.
I tried adding a Feast dimension table to my model (in the example this would be only 1 column of 2 rows). I was hoping to use that relationship to make a straight count, but I haven't been able to come up with the right DAX to do so.
You could use COUNTROWS() combined with VALUES().
Specifically, COUNTROWS() gives you the number of rows in a table, which means it expects a table as input. Here's the magic part: VALUES() returns a table, and that table contains the distinct values of the table/column that you pass as its argument.
I'm not sure if I'm explaining it well, so for the sample data you provided, the measure would look like this (assuming the table is named Table1):
Unique Feasts:=COUNTROWS(VALUES('Table1'[Feast Id]))
You can then create a pivot table from Powerpivot, and drag Restaurant Id into Rows, and drag the measure above into Values. Same result as DISTINCTCOUNT, but with less performance overhead (I think).
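If it helps to see the idea outside DAX, here is a small Python sketch of "count the rows of the distinct values" per restaurant. The rows are invented, since the question's sample data is not reproduced here, and the tuple layout is an assumption.

```python
from collections import defaultdict

# Hypothetical rows: (restaurant_id, feast_id, menu_id).
rows = [
    (1, 10, 100),
    (1, 10, 101),
    (1, 11, 100),
    (2, 10, 102),
    (2, 10, 103),
]

# The set plays the role of VALUES(): distinct feast ids within each restaurant.
feasts_by_restaurant = defaultdict(set)
for restaurant_id, feast_id, _menu_id in rows:
    feasts_by_restaurant[restaurant_id].add(feast_id)

# len() plays the role of COUNTROWS().
unique_feasts = {
    restaurant_id: len(feasts)
    for restaurant_id, feasts in feasts_by_restaurant.items()
}
print(unique_feasts)
```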
I have 2 tables.
First table: dimensional table to show available units of cars at start of selling cycle and for how long these units will be available.
Second table: to show how many cars were sold on a given month within their "available cycle".
I'd like to compare the "selling behaviour" within each cycle. Thus, I want to display the total initial units available next to the units sold at each stage within the cycle. The second dimension works fine, but not the first one.
This is what I get:
And this is the desired output (note rows 4 and 5 for available_units):
I tried the below DAX code without success:
SumAvailableUnits:=CALCULATE(SUM([available_units]),FILTER(ALL(Table1[month_within_cycle]),[month_within_cycle]>=MAX([months_available])))
First, DAX Formatter is your friend. You may like writing unreadable single line measures, but no one else likes reading them.
I've also taken the liberty of cleaning up your table names and adding fully qualified column references. (Ignoring that your dimension isn't a pure dimension, as it holds numeric values that you aggregate in a measure)
SumAvailableUnits :=
CALCULATE (
SUM ( DimCar[available_units] ),
FILTER (
ALL ( FactSale[month_within_cycle] ),
FactSale[month_within_cycle] >= MAX ( DimCar[months_available] )
)
)
And immediately we see a problem. With the fully qualified column references, it is clear that you're trying to filter the lookup table (the one side) by the base table (the many side). In Power Pivot for Excel, we do not have bi-directional relationships (though they're available in Power BI and coming for Excel 2016). Our relationships' filter context only flows from the lookup table to the base table, typically from the dimension to the fact.
Additionally, your DimCar, by holding [available_units] and [months_available], encodes an implicit assumption that a specific [car_id] can only ever refer to a single, unchanging lot. You will never see another row with [car_id] = 1. This strikes me as highly unlikely. Even if it is the case, the better solution is a model change.
In general, anything that goes onto a row or column label should come from a dimension, not a fact. Similarly, anything you're aggregating in a measure should live in a fact table. The usual exception is dimension counts - either bare, or as a denominator in a measure. Following these will get you 80% of the way in terms of dimensional modeling. You can see the tables and model diagram I've ended up with in the image below.
Here are the measure definitions
SumAvailableUnits:=SUM( FactAvailability[available_units] )
SumSold:=SUM( FactSale[cars_sold] )
Here are my source tables, my model diagram with relationships, and a pivot table built from these pieces and the measures above. Note the source of [month_within_cycle] in the pivot.
Finally, you might notice that my grand total behaves in a different way than in your original. Since the values are repeated monthly, we get a much larger grand total. If you need to instead end with the sum from the latest month (which it looks like you have in your sample), you can use an alternate measure, provided below. I don't understand why this would be your desired grand total, but you can achieve it fairly easily. Personally, I'd probably blank the measure at the grand total level.
SumAvailableUnits - GrandTotal:=
SUMX(
TOPN(
1
,FactAvailability
,FactAvailability[month_within_cycle]
,0
)
,FactAvailability[available_units]
)
This uses SUMX() to step through the table provided, defined by TOPN(). TOPN() returns the first row (in this case) in FactAvailability, including ties, after sorting by [month_within_cycle], out of all rows available in the filter context. In the context of a specific month, you get all the rows associated with that month - identical to the simple sum. In the grand total context, you get the rows associated with the last month.
SUMX() iterates over that table and accumulates the values of [available_units] in a sum.
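For anyone who finds the TOPN() with ties behaviour hard to picture, here is a Python sketch of the same logic. The rows and the function name are hypothetical; the point is only that "top 1 by month, including ties" means "every row of the latest month in the current context".

```python
# Hypothetical FactAvailability rows: (car_id, month_within_cycle, available_units).
fact_availability = [
    (1, 1, 10), (1, 2, 10), (1, 3, 10),
    (2, 1, 5), (2, 2, 5), (2, 3, 5),
]


def sum_available_units_latest_month(rows):
    """Sum available_units over all rows that share the highest month in `rows`."""
    if not rows:
        return None  # roughly what a blank measure would be
    last_month = max(month for _, month, _ in rows)
    top_rows = [row for row in rows if row[1] == last_month]  # ties are kept
    return sum(units for _, _, units in top_rows)


# Grand total context: every row is visible, so only month 3 contributes.
grand_total = sum_available_units_latest_month(fact_availability)

# Single month context: only that month's rows are visible, same as a plain sum.
month_2 = sum_available_units_latest_month(
    [row for row in fact_availability if row[1] == 2]
)

# The plain SUM() measure at the grand total, for comparison.
plain_sum = sum(units for _, _, units in fact_availability)
print(grand_total, month_2, plain_sum)
```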
I have a student database, and I'm trying to show different metrics based on a student's score range in a PivotTable. Specifically (this is a simplified example, so don't worry about the content) I want to show this in my pivot:
StudentGPACat | Avg Post-Grad Salary
3-3.2 | 64,323
3.2-3.4 | 71,225
3.4-3.6 | etc
3.6-3.8 | etc
3.8-4.0 | etc
So I want the rows in my pivot table to show the range the student's average score falls in.
In order to generate that metric, right now, I did 2 things:
(1) Added a new column in my master table in PowerPivot called [avgGrade] that shows the value of the [TableAvgGrade] calculated field from the "Grades" table for each student (i.e., each row in the master table)
=CALCULATE([TableAvgGrade],
FILTER(Grades,Grades[studentID]=Master[studentID]))
(2) Created a new column [StudentGPACat] in PowerPivot and the formula goes:
=If([avgGrade]<3,"3",
If([avgGrade]<3.2,"3-3.2",
If([avgGrade]<3.4,"3.2-3.4",
If([avgGrade]<3.6,"3.4-3.6",
If([avgGrade]<3.8,"3.6-3.8","3.8-4.0")))))
This feels bulky and computationally expensive. Is there an easier way to create these ranges to use as rows in my PivotTable?
EDIT: made some edits to clarify my question
EDIT2: typo
What you've done is the appropriate pattern for creating this sort of column. If you're concerned about the gnarly nested IF()s, you can replace with a SWITCH(), which is just syntactic sugar for nested IF()s, but what you've posted is all you need.
In a PivotTable (I don't know with PowerPivot), if you use a numeric value as a Row Label, you can Right click the field, choose Group, define the Starting at value, Ending at value and By step, and you will get an equivalent result quite easily.
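For comparison, here is how the same bucketing could look as a lookup against a list of edges, sketched in Python with the standard bisect module. It is not DAX. The edges and labels are copied from the nested IF() in the question (including the "3" label for anything below 3); the names EDGES, LABELS and gpa_category are made up.

```python
from bisect import bisect_right

# Lower edges and labels copied from the nested IF() in the question.
EDGES = [3.0, 3.2, 3.4, 3.6, 3.8]
LABELS = ["3", "3-3.2", "3.2-3.4", "3.4-3.6", "3.6-3.8", "3.8-4.0"]


def gpa_category(avg_grade):
    """Return the label of the range that avg_grade falls in."""
    # bisect_right counts the edges that are <= avg_grade, so a grade sitting
    # exactly on an edge goes to the upper range, like the < tests in the IF().
    return LABELS[bisect_right(EDGES, avg_grade)]


categories = [gpa_category(grade) for grade in (2.5, 3.0, 3.25, 3.8, 4.0)]
print(categories)
```

Keeping the edges in one list means a new range is a data change and not another level of nesting.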