Calculations within groupby in pandas, python - python-3.x

My data set contains house prices for 4 different house types (A, B, C, D) in 4 different countries (USA, Germany, UK, Sweden). The house price movement can take only three values (Upward, Downward, and Not Changed). I want to calculate the Diffusion index (DI) for each house type (A, B, C, D) in each country (USA, Germany, UK, Sweden) based on house price.
The formula that I want to use to calculate the Diffusion index (DI) is:
DI = (Total Number of Upward * 1 + Total Number of Downward * 0 + Total Number of Not Changed * 0.5) / (Total Number of Upward + Total Number of Downward + Total Number of Not Changed)
Here is my data:
and the expected result is:
I really need your help.
Thanks.

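You can do this with groupby, assuming your file is named test.xlsx:
import pandas as pd
df = pd.read_excel('test.xlsx')
# Map each price movement to its weight in the DI formula
df = df.replace({'Upward':1,'Downward':0,'Notchanged':0.5})
# The mean of the weights per country is the DI
df.groupby('Country').mean().reset_index()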
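Since the original data is not shown, here is a minimal self-contained sketch of the same idea on made-up data. The column layout (one column per house type, one row per observation) and the label spelling 'Not Changed' are assumptions; adjust them to match the real file.

```python
import pandas as pd

# Hypothetical data: one row per observation, one column per house type.
df = pd.DataFrame({
    "Country": ["USA", "USA", "Germany", "Germany"],
    "A": ["Upward", "Downward", "Not Changed", "Upward"],
    "B": ["Not Changed", "Not Changed", "Downward", "Upward"],
})

# Weights taken from the DI formula in the question.
weights = {"Upward": 1.0, "Downward": 0.0, "Not Changed": 0.5}
house_cols = ["A", "B"]  # assumed house-type columns

# Replace each label with its weight; the per-country mean of the
# weights equals (sum of weights) / (number of observations), i.e. the DI.
scored = df[house_cols].apply(lambda col: col.map(weights))
di = scored.groupby(df["Country"]).mean()
print(di)
```

Mapping only the house-type columns keeps the Country column out of the numeric calculation, which avoids errors when taking the mean.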

Related

DAX - compare selected group averages with category averages

I have data on objects which belong to different categories. I want to be able to compare the average across a selection of objects to averages across the categories where selected objects belong. I have written out measures, but they do not produce the expected results,
My data looks like this. I am using Power Pivot to set up the data model for MS Excel pivot charts.
Table 1 has unique stores (Store names are guaranteed to be unique)
Store    Branch   Region
------------------------
Store A  North 1  Plain
Store B  North 1  Plain
Store K  West 3   Plain
Store F  West 3   Plain
Store T  East 1   Coast
Store P  East 1   Coast
Table 2
Store    Area, sq ft
--------------------
Store A  3000
Store B  4000
Store K  2000
Store F  5000
Store T  5000
Store P  4000
Table 3
Store    Year  Month      Expenses
----------------------------------
Store A  2022  September  10000
Store A  2022  October    15000
Store B  2022  September  20000
Store B  2022  October    22000
There is more than one year included in the dataset.
Table 2 and 3 are connected to table 1.
First, I write measures that I expect to compute costs per sq ft for selected objects.
Costs:=sum('Table 3'[Expenses])
Area:= sum('Table 2'[Area, sq ft])
Costs_per_sq_ft:= [Costs] / [Area]
Then, I write identical measures for averages:
Costs_avg:=average('Table 3'[Expenses])
Area_avg:= average('Table 2'[Area, sq ft])
Costs_per_sq_ft_avg:= [Costs_avg] / [Area_avg]
Finally, I define measures to average across a selected group (assuming all selected elements belong to the same category):
Costs_avg_branch:=var StoreBranch = max('Table 1'[Branch]) return calculate([Costs_avg], filter(all('Table 1'), 'Table 1'[Branch] = StoreBranch))
Area_avg_branch:=var StoreBranch = max('Table 1'[Branch]) return calculate([Area_avg], filter(all('Table 1'), 'Table 1'[Branch] = StoreBranch))
Costs_per_sq_ft_avg_branch:=var StoreBranch = max('Table 1'[Branch]) return calculate([Costs_per_sq_ft_avg], filter(all('Table 1'), 'Table 1'[Branch] = StoreBranch))
and identical measures with Region as the variable.
On selection of Store A and September 2022, I expected to have
         Costs_avg  Costs_avg_branch
------------------------------------
Store A  3.33       4.28
i.e. the average for the selected store and the average for the branch where it belongs.
On selection of Store A and September-October I expected to have:
         Costs_avg  Costs_avg_branch
------------------------------------
Store A  4.17       4.78
(the average over the chosen period for a selected store and the same average for the branch).
On selection of the entire branch I intended the average across selection to match that of the category. E.g., for stores A, B in September-October 2022:
         Costs_avg  Costs_avg_branch
------------------------------------
North 1  4.78       4.78
Unfortunately, the averages for individual selected objects seem to be consistently near zero. When I select entire branches, the object average and the branch average do not match.
Is there any way to obtain the correct averages? Is it possible to get the desired behavior when objects from different categories are selected, as I originally wanted?
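As a cross-check of the expected figures (not a DAX solution), the numbers quoted in the question can be reproduced with a short pandas sketch. It assumes the intended quantity is total expenses divided by the number of selected months and by the total area of the selected stores; that reading is inferred from the figures, not stated in the question. The function name cost_per_sq_ft is hypothetical.

```python
import pandas as pd

# Sample rows copied from Table 2 and Table 3 (branch North 1 only).
areas = pd.DataFrame({"Store": ["Store A", "Store B"], "Area": [3000, 4000]})
expenses = pd.DataFrame({
    "Store": ["Store A", "Store A", "Store B", "Store B"],
    "Month": ["September", "October", "September", "October"],
    "Expenses": [10000, 15000, 20000, 22000],
})

def cost_per_sq_ft(store_names, months):
    """Average monthly expenses per square foot for the selected stores.

    Assumed definition: sum of expenses / number of months / total area.
    """
    selected = expenses[
        expenses["Store"].isin(store_names) & expenses["Month"].isin(months)
    ]
    monthly_avg = selected["Expenses"].sum() / len(months)
    total_area = areas.loc[areas["Store"].isin(store_names), "Area"].sum()
    return monthly_avg / total_area

store_sep = cost_per_sq_ft(["Store A"], ["September"])
branch_sep = cost_per_sq_ft(["Store A", "Store B"], ["September"])
store_both = cost_per_sq_ft(["Store A"], ["September", "October"])
branch_both = cost_per_sq_ft(["Store A", "Store B"], ["September", "October"])
print(store_sep, branch_sep, store_both, branch_both)
```

Under this definition the four values come out at roughly 3.33, 4.29, 4.17 and 4.79, which matches the question's figures if 4.28 and 4.78 were truncated rather than rounded.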

calculate percentage of occurrences in column pandas

I have a column with thousands of rows. I want to select the most significant ones. Let's say I want to select all the rows that would represent 90% of my sample. How would I do that?
I have a dataframe with 2 columns, one for product_id and one showing whether it was purchased or not (the value is either 0 or 1)
product_id purchased
a 1
b 0
c 0
d 1
a 1
. .
. .
With df['product_id'].value_counts() I can get all my product ids ranked by number of occurrences.
Let's say now I want to get the product_ids that I should consider in my future analysis, namely those that represent 90% of the total number of occurrences.
Is there a way to do that?
If you want all product_id values whose cumulative share of occurrences is under 0.9, use:
s = df['product_id'].value_counts(normalize=True).cumsum()
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
Or, if you want all rows sorted by the count of their product_id and then the top 90% of them:
s1 = df['product_id'].map(df['product_id'].value_counts()).sort_values(ascending=False)
df2 = df.loc[s1.index[:int(len(df) * 0.9)]]
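To see what the first approach selects, here is a small sketch on made-up data (the counts are hypothetical). It also shows an optional variant, not part of the original answer, that includes the product which crosses the 90% mark, since the strict `s < 0.9` filter can cover less than 90% of the rows.

```python
import pandas as pd

# Hypothetical sample: 'a' appears 10 times, 'b' 6, 'c' 3, 'd' once.
df = pd.DataFrame({"product_id": ["a"] * 10 + ["b"] * 6 + ["c"] * 3 + ["d"]})
df["purchased"] = 1

# Cumulative share of occurrences, most frequent product first:
# a -> 0.50, b -> 0.80, c -> 0.95, d -> 1.00
s = df["product_id"].value_counts(normalize=True).cumsum()

# Approach from the answer: products whose cumulative share stays under 0.9.
keep_under = s.index[s < 0.9]

# Optional variant: keep a product if the share accumulated *before* it
# is still under 0.9, so the selection covers at least 90% of the rows.
keep_cover = s.index[s.shift(fill_value=0) < 0.9]

df_under = df[df["product_id"].isin(keep_under)]
df_cover = df[df["product_id"].isin(keep_cover)]
print(list(keep_under), len(df_under))
print(list(keep_cover), len(df_cover))
```

Here the strict filter keeps 'a' and 'b' (80% of rows), while the shifted variant also keeps 'c' (95% of rows). Which one is appropriate depends on whether 90% is meant as an upper bound or a minimum coverage.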

Find a growth rate that creates values adding to a determined total

I am trying to create a forecast tool that shows a smooth growth rate over a determined number of steps while adding up to a determined value. We have variables tied to certain sales values and want to illustrate different growth patterns. I am looking for a formula that would help us to determine the values of each individual step.
As an example: say we wanted to illustrate 100 units sold over 4 months, starting with sales of 19 units and growing by an even amount each month. We would need individual month sales of 19, 23, 27 and 31. We can find these values with a lot of trial and error, but I am hoping that there is a formula that I could use to automatically calculate the values.
We will have a starting value (current or last month sales), a total amount of sales that we want to illustrate, and a period of time that we want to evaluate -- so all I am missing is a way to determine the change needed between individual values.
This is basically an arithmetic series problem. If the starting sales number is a, the difference in sales numbers between consecutive months is d, and the number of months is n, then the total sales is
S = n/2 * [2*a + (n-1) * d]
In your example, a=19, n=4, and S=100, with d unknown. That equation is easy to solve for d, and we get
d = 2 * (S - a * n) / (n * (n - 1))
There are other ways to write that, of course. If you substitute your example values into that expression, you get d=4, so the sales values increase by 4 each month.
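The formula for d can be wrapped in a small helper. This is a sketch only; the function name arithmetic_schedule and its parameter names are made up for illustration.

```python
def arithmetic_schedule(start, total, periods):
    """Return per-period values that start at `start`, grow by a constant
    step, and add up to `total` over `periods` periods."""
    if periods < 2:
        raise ValueError("need at least 2 periods to solve for a step")
    # d = 2 * (S - a * n) / (n * (n - 1)), solved from S = n/2 * [2a + (n-1)d]
    step = 2 * (total - start * periods) / (periods * (periods - 1))
    return [start + k * step for k in range(periods)]

schedule = arithmetic_schedule(start=19, total=100, periods=4)
print(schedule)  # [19.0, 23.0, 27.0, 31.0]
```

Note that the step can come out negative if the target total is smaller than the starting value times the number of periods.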
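For Excel you can use this formula (as written, it reads the total from B1, the starting value from B2, the number of periods from B3, and the period number 1, 2, ... from D1):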
=IF(D1<>"",(D1-1)*($B$1-$B$2*$B$3)/SUMPRODUCT(ROW($A$1:INDEX(A:A,$B$3-1)))+$B$2,"")
I would recommend using Excel.
This is simply a Y=mX+b equation.
Assuming you want a steady growth rate over a time with x periods you can use this formula to determine the slope of your line (growth rate - designated as 'm'). As long as you have your two data points (starting sales value & ending sales value) you can find 'm' using
m = (y2-y1) / (x2-x1)
That will calculate the slope. Y2 represents your final sales goal. Y1 represents your current sales level. X2 is your number of periods in the period of performance (so how many months are you giving to achieve the goal). X1 = 0 since it represents today which is time period 0.
Once you solve for 'm' this will plug into the formula y=mX+b. Your 'b' in this scenario will always be equal to your current sales level (this represents the y intercept).
Then all you have to do to calculate the new 'Y', which represents the sales level at any period, is plug in the X value you choose. So if you are in the first month, then X=1. If you are in the second month, X=2. The 'm' and 'b' stay the same.
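A minimal sketch of this straight-line approach follows. Unlike the series formula in the other answer, it needs a target sales level for the final period rather than a target total. The function name linear_forecast and the numbers (100 now, 160 in 6 periods) are made up for illustration.

```python
def linear_forecast(current, goal, periods):
    """Sales level at each period x = 0..periods along the line y = m*x + b."""
    slope = (goal - current) / (periods - 0)  # m = (y2 - y1) / (x2 - x1), x1 = 0
    intercept = current                       # b is the current sales level
    return [slope * x + intercept for x in range(periods + 1)]

levels = linear_forecast(current=100, goal=160, periods=6)
print(levels)  # [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0]
```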
See the Excel template below which serves as a rudimentary model. The yellow boxes can be filled in by the user and the white boxes should be left as formulas.

DAX - Distinct SUM thru 2 dimensions

I am trying to calculate "Distinct Sum" in DAX Powerpivot. I already have found help here: http://stackoverflow.com/questions/22613333/dynamic-sum-in-dax-picking-distinct-values
My question is similar but goes further: I am looking for such a distinct sum across two additional dimensions (Month + Country).
In the data example below, Revenue is recorded at Part Number granularity. The data also has a Shop dimension, but Revenue is repeated (duplicated) across shops.
In the post mentioned above there is following solution:
Support:=MAX(Table1[Revenue])
DistinctSumOfRev:=SUMX(DISTINCT(Table1[Part_Num]),[Support])
It works perfectly if you use Country and Month as a Filter/Column/Row.
But if you aggregate over all countries, or show performance for a whole quarter, the solution takes the MAX Revenue across all countries/months for each Part Number, which is not correct.
How can I also include those two additional dimensions in the above solution?
Basically, I need to tell DAX that the unique combination is PartNum+Country+Month.
Country Month Part_Num Shop Revenue
----------------------------------------
UK 1 ABCD X 1000
France 1 ABCD X 500
France 1 ABCD Y 500
UK 2 ABCD X 1500
UK 2 ABCD Y 1500
UK 1 FGHJ X 3000
France 1 FGHJ X 600
UK 2 FGHJ X 2000
Add a calculated column to your Table1:
PartNumCountryMonth = [Part_Num]&[Country]&[Month]
Then create your measure as follows:
DistinctSumOfRev:=SUMX(DISTINCT(Table1[PartNumCountryMonth]),[Support])
Update
Alternative solution, calculated column is NOT required:
DistinctSumOfRev :=
SUMX ( SUMMARIZE ( 'Table1', [Country], [Part_Num], [Month] ), [Support] )
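To check what the measure should return on the sample data, the same logic can be sketched in pandas (used here only as a cross-check, not as a replacement for the DAX): take the MAX revenue per Country/Month/Part_Num combination, then sum those values.

```python
import pandas as pd

# Sample rows copied from the question.
table1 = pd.DataFrame(
    [
        ("UK", 1, "ABCD", "X", 1000),
        ("France", 1, "ABCD", "X", 500),
        ("France", 1, "ABCD", "Y", 500),
        ("UK", 2, "ABCD", "X", 1500),
        ("UK", 2, "ABCD", "Y", 1500),
        ("UK", 1, "FGHJ", "X", 3000),
        ("France", 1, "FGHJ", "X", 600),
        ("UK", 2, "FGHJ", "X", 2000),
    ],
    columns=["Country", "Month", "Part_Num", "Shop", "Revenue"],
)

# Equivalent of SUMX over the distinct (Country, Part_Num, Month) combinations
# with MAX(Revenue) as the per-combination value.
support = table1.groupby(["Country", "Month", "Part_Num"])["Revenue"].max()
distinct_sum = int(support.sum())

# The original measure at grand-total level: MAX per Part_Num only.
naive_sum = int(table1.groupby("Part_Num")["Revenue"].max().sum())

print(distinct_sum, naive_sum)  # 8600 4500
```

The gap between 8600 and 4500 is the under-counting described in the question when the measure is evaluated across all countries and months.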

How to get weighted sum depending on two conditions in Excel?

I have this table in Excel:
I am trying to get weighted sum depending on two conditions:
Whether it is Company 1 or Company 2 (shares quantity differ)
Whether column A (Company 1) and column B (Company 2) has 0 or 1 (multipliers differ)
Example:
Let's calculate the weighted sum for row 2:
Sum = 2 (multiplier 1) * 50 (1 share price) * 3 (shares quantity for Company 1)
+ 0.5 (multiplier 0) * 50 (1 share price) * 6 (shares quantity for Company 2) = 450
So, Sum for Row 2 = 450.
For now I am checking only for multipliers (1 or 0) using this code:
=COUNTIF(A2:B2,0)*$B$9*$B$8 + COUNTIF(A2:B2,1)*$B$9*$B$7
But it does not take into account the shares quantities for Company 1 or Company 2 (I only multiply the price of 1 share by the multipliers, but not by the shares quantity).
How can I also check whether it is Company 1 or Company 2 in order to multiply by corresponding Shares quantity?
Upd:
Rasmus0607 gave a solution when there are only two companies:
=$B$9*$E$8*IF(A2=1;$B$7;$B$8)+$B$9*$E$9*IF(B2=1;$B$7;$B$8)
Tom Sharpe gave a more general solution (number of companies can be greater than 2)
I uploaded my Excel file to DropBox:
Excel file
I can offer a more general way of doing it with the benefit of hindsight that you can apply to more than two columns by altering the second CHOOSE statement:-
=SUM(CHOOSE(2-A2:B2,$B$7,$B$8)*CHOOSE(COLUMN(A:B),$E$8,$E$9))*$B$9
Unfortunately it's an array formula that you have to enter with Ctrl+Shift+Enter. But it's a moot point whether or not it would be better just to use one of the other answers with some repetition and keep it simple.
You could also try this:-
=SUMPRODUCT(N(OFFSET($B$6,2-A2:B2,0)),N(OFFSET($E$7,COLUMN(A:B),0)))*$B$9
Here's how it would be for three companies
=SUM(CHOOSE(2-A2:C2,$B$7,$B$8)*CHOOSE(COLUMN(A:C),$F$8,$F$9,$F$10))*$B$9
(array formula) or
=SUMPRODUCT(N(OFFSET($B$6,2-A2:C2,0)),N(OFFSET($F$7,COLUMN(A:C),0)))*$B$9
=$B$9*$E$8*IF(A2=1;$B$7;$B$8)+$B$9*$E$9*IF(B2=1;$B$7;$B$8)
Since with COUNTIF you don't know beforehand which company column contains a 0 or a 1, I would suggest a longer, but more systematic solution using IF:
=$B$9*$E$8*IF(A2=1;2;0,5)+$B$9*$E$9*IF(B2=1;2;0,5)
This is a bit less general, but should produce the result you expect in this case.
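The calculation behind these formulas can be written out in a few lines of Python as a sanity check. The values used (share price 50, multipliers 2 and 0.5, share quantities 3 and 6) come from the worked example in the question; the function and parameter names are made up for illustration.

```python
def weighted_sum(flags, shares, share_price=50, mult_one=2.0, mult_zero=0.5):
    """Sum over companies of multiplier * share price * shares quantity.

    flags  -- the 0/1 value in each company's column for one row
    shares -- the shares quantity of each company, in the same order
    """
    total = 0.0
    for flag, quantity in zip(flags, shares):
        multiplier = mult_one if flag == 1 else mult_zero
        total += multiplier * share_price * quantity
    return total

row2 = weighted_sum(flags=[1, 0], shares=[3, 6])
print(row2)  # 450.0
```

Adding a third company only means passing one more flag and one more quantity, which corresponds to extending the ranges in the CHOOSE or SUMPRODUCT formulas.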
