Using BigQuery to find outliers with standard deviation results combined with WHERE clause - statistics

Standard deviation analysis can be a useful way to find outliers. Is there a way to incorporate the result of this query (finding the value of the fourth standard deviation away from the mean)...
SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as high FROM [publicdata:samples.natality];
result = 12.721342001626912
...into another query that reports which states and dates have the most babies born heavier than 4 standard deviations above the average?
SELECT state, year, month ,COUNT(*) AS outlier_count
FROM [publicdata:samples.natality]
WHERE
(weight_pounds > 12.721342001626912)
AND
(state != '' AND state IS NOT NULL)
GROUP BY state, year, month
ORDER BY outlier_count DESC;
Result:
Row state year month outlier_count
1 MD 1990 12 22
2 NY 1989 10 17
3 CA 1991 9 14
Essentially it would be great to combine this into a single query.

You can abuse JOIN for this (and thus performance will suffer):
SELECT n.state, n.year, n.month ,COUNT(*) AS outlier_count
FROM (
SELECT state, year, month, weight_pounds, 1 as key
FROM [publicdata:samples.natality]) as n
JOIN (
SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as giant_baby,
1 as key
FROM [publicdata:samples.natality]) as o
ON n.key = o.key
WHERE
(n.weight_pounds > o.giant_baby)
AND
(n.state != '' AND n.state IS NOT NULL)
GROUP BY n.state, n.year, n.month
ORDER BY outlier_count DESC;
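To make the logic of the combined query concrete, here is a minimal Python sketch of the same two steps: derive the threshold from all rows, then filter and group against it. The function name `count_outliers`, the tuple layout, and the sample rows are hypothetical, and the sketch assumes STDDEV means the sample standard deviation.

```python
from collections import Counter
from statistics import mean, stdev


def count_outliers(rows, k=4.0):
    """rows: iterable of (state, year, month, weight_pounds) tuples (hypothetical layout)."""
    rows = list(rows)
    weights = [w for (_, _, _, w) in rows if w is not None]
    # Same role as AVG(weight_pounds) + STDDEV(weight_pounds) * 4 in the subquery.
    # Assumption: STDDEV is the sample standard deviation, like statistics.stdev.
    threshold = mean(weights) + k * stdev(weights)
    counts = Counter(
        (state, year, month)
        for (state, year, month, w) in rows
        # `state` is falsy for both '' and None, matching the WHERE clause.
        if w is not None and w > threshold and state
    )
    # most_common() sorts by count descending, like ORDER BY outlier_count DESC.
    return threshold, counts.most_common()


# Made-up data: twenty ordinary births and one very heavy one.
rows = [("NY", 1989, 10, 7.0)] * 20 + [("MD", 1990, 12, 15.0)]
threshold, outliers = count_outliers(rows)
print(threshold, outliers)
```

The threshold is computed once and reused for every row, which is what the constant-key JOIN achieves in the query above.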

Related

DAX: Using Calendar Month as a column and as a criteria for counting records based on current row

I have the following pivot table report built on my data model. I want my measure 'Branches Per Cluster' to take the month or year of the current column into account.
Aside from a generated calendar table, I have the following two tables, which are related by 'CODE'.
A dim table named 'Branch Profiles'
CODE  AREA    CLUSTER    DATE OPENED
------------------------------------
AAA   Area 1  Cluster 1  01/05/1990
AAB   Area 1  Cluster 1  05/03/2022
ABA   Area 2  Cluster 1  01/03/2005
BAA   Area 3  Cluster 2  01/03/2024
A fact table named 'BasicData'
CODE  Volume  Value  Date
--------------------------------
AAA   1000    10000  06/01/1990
AAB   2000    20000  06/01/2020
ABA   3000    30000  06/01/2005
BAA   4000    40000  06/01/2008
This is what I currently have for my Branches Per Cluster measure. Experienced users may find it obviously wrong syntactically, but I believe it shows what I was trying to do; I'm not sure how to reference the column as a filter. Basically, I just want to count the branches ("CODE") for the specific area that have a date opened before the month specified by the column filters.
=CALCULATE(
DISTINCTCOUNT('Branch Profiles'[CODE]),
ALLEXCEPT('Branch Profiles',
'Branch Profiles'[AREA],
'Branch Profiles'[CODE]
),
YEAR('Branch Profiles'[DATE STARTED]) <= 'Calendar'[Year],
MONTH('Branch Profiles'[DATE STARTED]) <= 'Calendar'[Month Number]
)
Here is the working code for the measure I was trying to write. What I was missing earlier was a way to reference the current value of the Year and Month columns (the filter context) and use it as a criterion on the [DATE STARTED] column, to count the [CODE] values opened on or before the currently specified month or year.
In Power BI the solution would have used the SELECTEDVALUE function; however, the same logic can be implemented in Excel through the longer syntax shown below in my measure.
VAR cur_year =
    IF (
        HASONEVALUE('Calendar'[Year]),
        VALUES('Calendar'[Year])
    )
VAR cur_month =
    IF (
        HASONEVALUE('Calendar'[Month]),
        VALUES('Calendar'[Month Number])
    )
RETURN
    IF(
        NOT ISBLANK([Send Volume]),
        CALCULATE(
            DISTINCTCOUNT('Branch Profiles'[CODE]),
            ALLEXCEPT('Branch Profiles',
                'Branch Profiles'[AREA],
                'Branch Profiles'[CODE]
            ),
            'Branch Profiles'[DATE STARTED] < DATE(cur_year, cur_month + 1, 1)
        )
    )
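As a rough illustration of the date test in that measure, the Python sketch below counts the distinct codes per area that opened before the first day of the month after the selected one. The helper names are hypothetical, grouping by area is a simplification of the ALLEXCEPT filter, and the sample dates assume the 'Branch Profiles' table uses dd/mm/yyyy.

```python
from datetime import date


def first_day_of_next_month(year, month):
    # Mirrors DATE(cur_year, cur_month + 1, 1): a month of 13 rolls into the next year.
    return date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)


def branches_open_by(profiles, year, month):
    """profiles: iterable of (code, area, date_opened) tuples (hypothetical layout)."""
    cutoff = first_day_of_next_month(year, month)
    codes_by_area = {}
    for code, area, opened in profiles:
        if opened < cutoff:
            # A set gives the DISTINCTCOUNT behaviour for repeated codes.
            codes_by_area.setdefault(area, set()).add(code)
    return {area: len(codes) for area, codes in codes_by_area.items()}


# Rows from 'Branch Profiles', assuming the dates are dd/mm/yyyy.
profiles = [
    ("AAA", "Area 1", date(1990, 5, 1)),
    ("AAB", "Area 1", date(2022, 3, 5)),
    ("ABA", "Area 2", date(2005, 3, 1)),
    ("BAA", "Area 3", date(2024, 3, 1)),
]
print(branches_open_by(profiles, 2022, 3))
```

Comparing against the first day of the following month avoids having to know how many days the selected month has.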

How to do calculated values in Excel Pivot Table

I have a table like this:
Year Num Freq. Exam Grade Course
2014 102846 SM SM Astronomy 3
2015 102846 12,6 1,7 NC Astronomy 2
2017 102846 20 11,8 17 Astronomy 2
2015 102846 SM NC Defence Against the Dark Arts 4
2015 102846 11 4,5 NC Herbology 2
2015 102846 15 13,99 14 Herbology 2
I am trying to get the percentage of approved students (Grade >= 10) for each course by year and global average.
I've been trying for nearly 3 hours to do a calculated field but so far the only thing I could get was the sum of each student per year:
I have tried to do a calculated field with = Grade >= 10 hoping that it would give me a list of approved students but it gives me 1.
What am I doing wrong in here? It's my first time working with pivot tables.
I would strongly recommend not mixing text with numbers in the same column. It will cause a lot of headaches once the data is used for calculations (this applies to both Freq. and Grade). I would rather use 0 or some other numeric value to represent the text.
Not recommended, but yes, it's doable =)
You need a helper column that marks which rows hold a number and which hold text, so I created Grade Type. We can now count only the rows that have a number in the Grade column by filtering on Grade Type = Number.
I create a table from the data and add the column Grade Type. I use this formula to get Grade Type:
=IF(ISNUMBER([@Grade]),"Number","Text")
I then create the following measures:
Nr of Approved Students
=COUNTX(FILTER(Table1, Table1[Grade Type]="Number"),
IF((VALUE(Table1[Grade])>=10),VALUE(Table1[Grade]),BLANK()))
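FILTER first restricts the evaluation to the rows whose Grade Type is "Number" (the <table> argument of COUNTX). For those rows, the <expression> argument returns the grade only when it is >= 10 and BLANK() otherwise, so COUNTX counts only the approved rows. VALUE() converts the numeric string to a number.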
Nr of Student (w/ Grade Number)
=COUNTX(FILTER(Table1, Table1[Grade Type]="Number"), VALUE(Table1[Grade]))
Count all rows that have a number
Approved (% of Total)
=[Nr of Approved Students]/[Count of Grade]
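The same idea can be sketched outside Excel. The Python below skips text grades, counts the numeric ones per course and year, and divides approved by graded. The function names are hypothetical, the sample rows are simplified from the question's table, and the "8,5" row is made up to show a failing grade.

```python
from collections import defaultdict


def parse_grade(text):
    """Return the grade as a float, or None for text markers such as 'NC' or 'SM'."""
    try:
        # The question's data uses a decimal comma.
        return float(text.replace(",", "."))
    except ValueError:
        return None


def approval_rates(rows, passing=10.0):
    """rows: iterable of (year, course, grade_text) tuples (hypothetical layout)."""
    graded = defaultdict(int)
    approved = defaultdict(int)
    for year, course, grade_text in rows:
        grade = parse_grade(grade_text)
        if grade is None:
            continue  # same role as filtering on Grade Type = "Number"
        graded[(course, year)] += 1
        if grade >= passing:
            approved[(course, year)] += 1
    return {key: approved[key] / graded[key] for key in graded}


rows = [
    (2015, "Astronomy 2", "NC"),
    (2017, "Astronomy 2", "17"),
    (2015, "Herbology 2", "NC"),
    (2015, "Herbology 2", "14"),
    (2015, "Herbology 2", "8,5"),  # made-up failing grade
]
rates = approval_rates(rows)
print(rates)
```

Course and year combinations with no numeric grade at all are left out of the result instead of producing a division by zero.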
Setup the PowerPivot Table
Create the PowerPivot and add the data to the data Model
Then create a new measure by clicking your pivot table and then "Measures" -> "New Measure..."
Fill in all the relevant data.
Result should be something like:

how to solve given Correlated Subqueries is not supported within case when statement

I have this code where I try to count every distinct user listed within a rolling window of the prior 30 days, for each of the past 40 days.
For example: on 12 Feb (12/02) I need to count all listed users from 13 Jan to 12 Feb (30 days), then on 11 Feb I need to count from 12 Jan to 11 Feb, and so on. I need to do this for a lot of other dates; is there a way to do it in Presto? It does not support correlated subqueries, and when I try the code below, it returns
"Given correlated subquery is not supported"
WITH
get_date as
(
select
distinct date_trunc('day',date(uph.last_login)) as dates
from user_profile_id_history uph
where uph.last_login >= date('2020-02-04')-interval '30' day and uph.last_login <= ( date'2020-02-04')
order by 1 desc
)
select
get_date.dates,
(
case when (dates >= date('2020-02-04')-interval '30' day and dates <= ( date'2020-02-04'))
then
(
select
count(distinct CASE WHEN date_trunc('month',date(up.registration_time)) <= date_trunc('month',uph.last_login) THEN uph.userid END)
FROM
user_profile_id_history uph
LEFT JOIN
user_profile up ON uph.userid = up.userid
where uph.last_login >= dates-interval '30' day and uph.last_login <= dates
) end
) as mauser
from get_date
group by 1
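The computation being asked for can be pinned down with a small Python sketch. This is not a Presto solution, only an illustration of the intended result: every report date is paired with the logins that fall inside its window, and distinct users are counted per date. The function name, the tuple layout, and the sample logins are hypothetical, and the registration-month condition from the query is left out.

```python
from datetime import date, timedelta


def rolling_distinct_users(logins, end, days_back=40, window=30):
    """logins: iterable of (userid, last_login_date) tuples (hypothetical layout)."""
    logins = list(logins)
    result = {}
    for offset in range(days_back + 1):
        day = end - timedelta(days=offset)
        start = day - timedelta(days=window)
        # Same bounds as the query: last_login >= dates - 30 days AND last_login <= dates.
        result[day] = len({uid for uid, d in logins if start <= d <= day})
    return result


logins = [
    ("u1", date(2020, 2, 12)),
    ("u2", date(2020, 1, 13)),
    ("u1", date(2020, 1, 20)),
    ("u3", date(2020, 1, 12)),
]
counts = rolling_distinct_users(logins, end=date(2020, 2, 12), days_back=1)
print(counts)
```

Pairing each date with the rows in its range and then grouping by date is the shape of a join on a date range followed by GROUP BY, which may be one way to express the same thing without a correlated subquery.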

DAX - Distinct SUM thru 2 dimensions

I am trying to calculate "Distinct Sum" in DAX Powerpivot. I already have found help here: http://stackoverflow.com/questions/22613333/dynamic-sum-in-dax-picking-distinct-values
My case is similar but goes further: I am looking for such a distinct sum across two additional dimensions (Month + Country).
The example data below shows Revenue at Part Number granularity. The data also has a Shop dimension, but Revenue is repeated (duplicated) across shops.
The post mentioned above gives the following solution:
Support:=MAX(Table1[Revenue])
DistinctSumOfRev:=SUMX(DISTINCT(Table1[Part_Num]),[Support])
It works perfectly if you use Country and Month as a filter, column, or row.
But if you aggregate over all countries, or show performance for a whole quarter, the solution takes the MAX Revenue across all countries/months for each Part Number, which is not correct.
How can I include those two additional dimensions in the solution above?
Basically, I need to tell DAX that the unique combination is PartNum+Country+Month.
Country Month Part_Num Shop Revenue
----------------------------------------
UK 1 ABCD X 1000
France 1 ABCD X 500
France 1 ABCD Y 500
UK 2 ABCD X 1500
UK 2 ABCD Y 1500
UK 1 FGHJ X 3000
France 1 FGHJ X 600
UK 2 FGHJ X 2000
Add a calculated column to your Table1:
PartNumCountryMonth = [Part_Num]&[Country]&[Month]
Then create your measure as follows:
DistinctSumOfRev:=SUMX(DISTINCT(Table1[PartNumCountryMonth]),[Support])
Update
Alternative solution, calculated column is NOT required:
DistinctSumOfRev :=
SUMX ( SUMMARIZE ( 'Table1', [Country], [Part_Num], [Month] ), [Support] )
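A quick way to check what the measure should return is to reproduce it on the sample data. The Python sketch below takes the MAX revenue for each distinct (Country, Part_Num, Month) combination and sums the results, which is the role SUMMARIZE plus [Support] play above. The function name is hypothetical.

```python
def distinct_sum_of_revenue(rows):
    """rows: iterable of (country, month, part_num, shop, revenue) tuples."""
    best = {}
    for country, month, part_num, shop, revenue in rows:
        key = (country, part_num, month)  # the unique combination
        # Keep the MAX revenue per combination; rows repeated across shops collapse here.
        best[key] = max(best.get(key, revenue), revenue)
    return sum(best.values())


rows = [
    ("UK", 1, "ABCD", "X", 1000),
    ("France", 1, "ABCD", "X", 500),
    ("France", 1, "ABCD", "Y", 500),
    ("UK", 2, "ABCD", "X", 1500),
    ("UK", 2, "ABCD", "Y", 1500),
    ("UK", 1, "FGHJ", "X", 3000),
    ("France", 1, "FGHJ", "X", 600),
    ("UK", 2, "FGHJ", "X", 2000),
]
total = distinct_sum_of_revenue(rows)
print(total)  # 8600
```

For comparison, taking the MAX per Part_Num alone over the whole table would give 1500 + 3000 = 4500, which is the undercount described in the question.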

Retention Rate within Cohort

I want to run a cohort analysis on a user base. We have two tables, "signups" and "sessions", where users and sessions both have a "date" field. I'm looking to formulate a query that yields a table of numbers (with some blanks) showing, for each signup day, the count of users who created an account on that day and who also have a session on the 1st, 3rd, 7th, and 14th day afterwards, indicating that they returned on that day.
created_at d1 d3 d7 d14
05/07/2007 12 * * *
04/07/2007 49 21 1 2
03/07/2007 45 30 * 3
02/07/2007 47 41 18 12
...
In this case, of the users who created an account on 02/07/2007, 47 returned after 1 day (d1) and 41 returned after 3 days (d3).
Can I perform this in a single MySQL query?
Yes you can:
SELECT signups.date AS created_at,
  COUNT(DISTINCT CASE WHEN DATEDIFF(sessions.date, signups.date) = 1 THEN signups.users ELSE NULL END) AS d1,
  COUNT(DISTINCT CASE WHEN DATEDIFF(sessions.date, signups.date) = 3 THEN signups.users ELSE NULL END) AS d3,
  COUNT(DISTINCT CASE WHEN DATEDIFF(sessions.date, signups.date) = 7 THEN signups.users ELSE NULL END) AS d7,
  COUNT(DISTINCT CASE WHEN DATEDIFF(sessions.date, signups.date) = 14 THEN signups.users ELSE NULL END) AS d14
FROM signups
LEFT JOIN sessions USING (users)
GROUP BY 1
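To sanity-check the numbers such a query returns, the same counting can be sketched in Python: for each signup date, count the distinct users who have a session exactly 1, 3, 7, or 14 days later. The function name, the data layout, and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date


def retention(signups, sessions, offsets=(1, 3, 7, 14)):
    """signups: {user: signup_date}; sessions: iterable of (user, session_date)."""
    returned = defaultdict(set)
    for user, session_date in sessions:
        signup_date = signups.get(user)
        if signup_date is None:
            continue  # session from a user with no signup row
        gap = (session_date - signup_date).days  # same role as DATEDIFF
        if gap in offsets:
            # A set per (cohort, offset) gives the COUNT(DISTINCT ...) behaviour.
            returned[(signup_date, gap)].add(user)
    table = {}
    for signup_date in sorted(set(signups.values()), reverse=True):
        table[signup_date] = {f"d{n}": len(returned[(signup_date, n)]) for n in offsets}
    return table


signups = {"a": date(2007, 7, 2), "b": date(2007, 7, 2), "c": date(2007, 7, 3)}
sessions = [
    ("a", date(2007, 7, 3)),
    ("b", date(2007, 7, 3)),
    ("a", date(2007, 7, 5)),
    ("a", date(2007, 7, 5)),  # duplicate session, counted once
    ("c", date(2007, 7, 10)),
]
table = retention(signups, sessions)
print(table)
```

Signup dates with no matching sessions still appear with zero counts, which corresponds to using a LEFT JOIN in the query.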

Resources