Power BI Create Running Total using two columns - powerbi-desktop

I have a matrix with dates as rows and data in column values. I am trying to create a running total using values from 2 columns. The only issue is that one of the values is updated on a prior day basis. In other words, both values aren't on the same date. I want to sum the value for TODAY + Value from yesterday.
See example below. I hope it makes sense.
DATES
COL A
COL B
Running Total
3/27
5
3/28
10
15
3/29
10
25
3/30
3
28

Simply join [COL A + DATE] on [COL B + DATE-1Day].
Below is an example of how you can do it in SQL, before the data gets into Power BI. If you want it in another language, please specify. Personally, I try to do most of my calculations in the dataset itself so the dashboard is not that heavy. After this, you can compute the running total either with a DAX measure or with a SQL window function (SUM(...) OVER (ORDER BY ...)).
MS SQL Server 2017
http://sqlfiddle.com/#!18/8efed/6
CREATE TABLE testjoin
([date] datetime, [col a] int, [col b] int);
INSERT INTO testjoin
([date], [col a], [col b])
VALUES
('2020-01-01', 1, 1),
('2020-01-02', 1, 0),
('2020-01-03', 2, 2),
('2020-01-04', 3, 1);
-- Self-join: each row of a is matched with the row of b dated one day earlier
select
a.date,
coalesce(a.[col a], 0) as [Col A],
coalesce(b.[col b], 0) as [Col B], -- taken from the previous date, 0 if there is none
coalesce(a.[col a], 0) + coalesce(b.[col b], 0) as [Total]
from testjoin a
left join testjoin b
on DATEADD(day, -1, a.date) = b.date;
Result (note that Col B is from the previous date):
Date        Col A  Col B (from previous date)  Total
2020-01-01  1      0                           1
2020-01-02  1      1                           2
2020-01-03  2      0                           2
2020-01-04  3      2                           5
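The answer leaves the running-total step itself to DAX or a SQL window function. As a rough sketch only, here is the same previous-day join plus a cumulative sum in pandas, using the sample rows from the SQL example. The column names col_a, col_b, col_b_prev, total and running_total are chosen here for illustration and are not part of the original answer.

```python
import pandas as pd

# Same sample rows as the SQL example above
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    "col_a": [1, 1, 2, 3],
    "col_b": [1, 0, 2, 1],
})

# Move each col_b value forward by one calendar day so it lines up
# with the col_a value of the following date
prev_b = df[["date", "col_b"]].assign(date=lambda d: d["date"] + pd.Timedelta(days=1))
prev_b = prev_b.rename(columns={"col_b": "col_b_prev"})

# Left join keeps every date; dates with no previous day get 0
out = df[["date", "col_a"]].merge(prev_b, on="date", how="left")
out["col_b_prev"] = out["col_b_prev"].fillna(0).astype(int)

out["total"] = out["col_a"] + out["col_b_prev"]
out["running_total"] = out["total"].cumsum()
print(out)
```

The total column should match the Total column of the SQL result, and running_total accumulates it in date order.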

Related

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd
df = pd.DataFrame(columns = ["A","B","C"],
data = [[1,2,"00X"],
[1,3,"010"],
[1,2,"002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, the row at index 2 would be removed because the row at index 0 has the same information in columns A and B, and an X in column C.
As this data is fairly large, I hope to avoid iterating over rows if possible. The ignore_index parameter of the built-in drop_duplicates() is the closest thing I've found.
If there is no X in column C then the row should require that C is identical to be deduplicated.
In the case in which there are matching A and B in a row, but have multiple versions of having an X in C, the following would be expected.
df = pd.DataFrame(columns=["A","B","C"],
data = [[1,2,"0X0"],
[1,2,"X00"],
[1,2,"0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 that is True where the (A, B) pair has not appeared before. Then use Series.str.contains + Series.duplicated on column C to create a boolean mask m2 that is True where C contains the string X and the value of C has not appeared before. Finally, use these masks to filter the rows in df.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
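Note that the masks above depend on row order: m1 keeps the first row of each (A, B) pair even when a later row in the same pair is the one containing X. A possible order-independent variant is sketched below. It is not from the original answer; the function name dedup_prefer_x and its parameters are hypothetical, and it assumes column C contains no missing values.

```python
import pandas as pd

def dedup_prefer_x(df, key_cols=("A", "B"), flag_col="C", marker="X"):
    """Keep marker rows where a key group has any; otherwise drop only full duplicates."""
    keys = list(key_cols)
    # True for rows whose flag column contains the marker
    has_marker = df[flag_col].str.contains(marker, regex=False)
    # True for every row of a key group that has at least one marker row
    group_has_marker = has_marker.groupby([df[k] for k in keys]).transform("any")
    # True for the first occurrence of each (keys + flag column) combination
    not_full_dup = ~df.duplicated(subset=keys + [flag_col])
    # In groups with a marker keep only marker rows; elsewhere keep all distinct rows
    keep = not_full_dup & (has_marker | ~group_has_marker)
    return df[keep]

df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "00X"], [1, 3, "010"], [1, 2, "002"]])
df2 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "0X0"], [1, 2, "X00"], [1, 2, "0X0"]])
# Same pair as in df1, but the row with X comes second
df3 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "002"], [1, 2, "00X"]])

res1, res2, res3 = dedup_prefer_x(df1), dedup_prefer_x(df2), dedup_prefer_x(df3)
print(res1, res2, res3, sep="\n")
```

Everything here is vectorized, so no Python-level iteration over rows is needed.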
Does the column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X or 0 if it does not. Then just sort the values using sort_values and finally use drop_duplicates with keep='last'
import pandas as pd
df = pd.DataFrame(columns = ["A","B","C"],
data = [[1,2,"00X"],
[1,3,"010"],
[1,2,"002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This is assuming you also want to drop duplicates in case there is no X in the 'C' column among the duplicates of columns A and B
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B'])['C'].transform('count')  # select one column so extra columns in df do not break the assignment
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1

Compare two classes with range of Marks

I have a dataframe with two classes (A or B) and marks and I want to present the mark ranges per class.
Dataframe:
Class Mark Department
A 74.0 1
A 73.0 2
B 72.0 1
A 75.0 1
B 64.0 2
What I want to achieve:
Class Mark Range
A 73.0-75.0
B 64.0-72.0
I was thinking of using min and max (creating a new field for the range). But as a start, I tried to just group it:
df['count'] = 1
result = df.pivot_table('count', index='Mark', columns='Class', aggfunc='sum').fillna(0)
which is complex, so I quickly abandoned it.
I then kept only two columns in my dataframe (Mark and Class) and used the following:
df[['Mark','Class']].values
Now I just have to create the Mark Range column. I was wondering whether there is a simpler way, without the pivot step, to get the range directly (min and max of column A grouped by column B).
We can use GroupBy.apply and get the max and min per group and represent them as string with f-strings:
df = (
df.groupby('Class')['Mark'].apply(lambda x: f'{x.min()}-{x.max()}')
.reset_index(name='Mark Range')
)
Class Mark Range
0 A 73.0-75.0
1 B 64.0-72.0
Simple but ugly:
temp = df.groupby('Class')['Mark'].agg(['min', 'max'])  # dict-style renaming in agg is no longer supported on a grouped Series
temp['range'] = temp['min'].map(str) + '-' + temp['max'].map(str)
Result of doing temp[['range']]:
range
Class
A 73.0-75.0
B 64.0-72.0
If you are interested in using pivot_table:
df_new = (df.pivot_table('Mark', 'Class', aggfunc=lambda x: f'{x.min()}-{x.max()}')
.add_suffix(' Range').reset_index())
Out[1543]:
Class Mark Range
0 A 73.0-75.0
1 B 64.0-72.0
As requested in your comment, to add Department, just use the list ['Class', 'Department'] for the index, as follows:
df_new = (df.pivot_table('Mark', ['Class', 'Department'],
aggfunc=lambda x: f'{x.min()}-{x.max()}')
.add_suffix(' Range').reset_index())
Out[259]:
Class Department Mark Range
0 A 1 74.0-75.0
1 A 2 73.0-73.0
2 B 1 72.0-72.0
3 B 2 64.0-64.0

Count the negative values in a column and put them in another column using groupby

I want to count the number of negative values in the delay column for each customer, using groupby:
merged_inner['delayed payments']=merged_inner.groupby('Customer Name')['delay'].apply(lambda x: x [x < 0].count())
But the delayed payments column shows null.
I believe the problem here is that you are trying to put the results back to the same dataframe as you did .groupby with, which won't have Customer Name as index.
Consider following minified example:
df = pd.DataFrame({
'Customer Name':['a', 'b','c','a', 'c','a','b','a'],
'Delay':[1, 2, -3, 0, -1,-2, -3,2]
})
You can try:
df.loc[df['Delay']<0].groupby('Customer Name')['Delay'].size()
Output:
Customer Name
a 1
b 1
c 2
Name: Delay, dtype: int64
You can get the result as a DataFrame using:
df.loc[df['Delay']<0].groupby('Customer Name')['Delay'].size().reset_index(name='delayed_payment')
Output:
Customer Name delayed_payment
0 a 1
1 b 1
2 c 2
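If the goal is still a new column on the original dataframe, as in the question, one possible sketch is to use transform, which returns a result aligned to the original rows and so avoids the index mismatch described above. This is an assumption about what the asker wants; the column name delayed payments is taken from the question, the sample data from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer Name': ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'a'],
    'Delay': [1, 2, -3, 0, -1, -2, -3, 2]
})

# Flag negative delays, then sum the flags within each customer.
# transform keeps the original index, so the result can be assigned directly.
is_negative = df['Delay'].lt(0)
df['delayed payments'] = is_negative.groupby(df['Customer Name']).transform('sum')
print(df)
```

Every row of a customer then carries that customer's count of negative delays.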

Transpose Excel Row Data into columns based on Unique Identifier

I have an Excel table in the format below.
Sr. No.  Column 1 (X)  Column 2 (Y)  Column 3 (Z)
1        X             Y             Z
2                      Y             Z
3                      Y
4        X             Y
5        X
I want to transpose it into the following format in MS Excel.
Sr. No. Value
1 X
1 Y
1 Z
2 Y
2 Z
3 Y
4 X
4 Y
5 X
The actual data contains more than 30 columns, which need to be transposed into 2 columns.
Please guide me.
Select the complete table data and name it SourceData using Formulas > Name Manager.
Now use the following formula to get the first column:
=INDEX(SourceData,CEILING(ROWS($A$1:A1)/(COLUMNS(SourceData)-1),1),1)
And for second column:
=INDEX(SourceData,CEILING(ROWS($A$1:A1)/(COLUMNS(SourceData)-1),1),MOD(ROWS($A$1:A1)-1,COLUMNS(SourceData)-1)+2)
Fill both formulas down, then copy the results, paste them as values (Paste Special > Values), and delete the blanks / zeroes.
You will get the required result.
If you were using other databases, there might be a formal unpivot operator/function available, but MySQL does not have one. However, one approach that should work here is to take a union of the three columns (your_table is a placeholder for the actual table name):
SELECT sr_no, col1 AS value FROM your_table WHERE col1 IS NOT NULL
UNION ALL
SELECT sr_no, col2 FROM your_table WHERE col2 IS NOT NULL
UNION ALL
SELECT sr_no, col3 FROM your_table WHERE col3 IS NOT NULL
ORDER BY sr_no;
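The question asks for Excel, but if the sheet can be loaded into pandas (an assumption, not something the asker mentioned), the same unpivot can be sketched with DataFrame.melt. The frame below is typed in by hand from the question's sample table; with a real workbook it would come from something like pd.read_excel.

```python
import pandas as pd

# Sample table from the question; None marks an empty cell
wide = pd.DataFrame({
    "Sr. No.": [1, 2, 3, 4, 5],
    "Column 1 (X)": ["X", None, None, "X", "X"],
    "Column 2 (Y)": ["Y", "Y", "Y", "Y", None],
    "Column 3 (Z)": ["Z", "Z", None, None, None],
})

long = (
    wide.melt(id_vars="Sr. No.", value_name="Value")  # one row per (Sr. No., column) cell
        .dropna(subset=["Value"])                     # drop the empty cells
        .sort_values("Sr. No.", kind="stable")        # stable sort keeps the column order within each Sr. No.
        [["Sr. No.", "Value"]]
        .reset_index(drop=True)
)
print(long)
```

melt works the same way with 30 or more value columns, since every column other than Sr. No. is unpivoted.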

excel formula depending on dynamic values in different columns

I am trying to create an Excel formula using SUM and SUMIF but cannot work out how.
Column A holds the total time allocated to each task, and for each row the following columns (B, C, ...) hold the time spent on that task on each day.
For each day (columns B, C, ...), the formula should return the sum of the values in column A for the tasks completed on that day, i.e. the tasks whose cumulative time up to and including that day equals or exceeds the time allocated to the task.
Example for one 12-hours task:
A B C D E
12 4 6 2 0
Using the formula:
A B C D E
12 4 6 2 0
0 0 0 12 0
where 12 is displayed in column D because 4 + 6 + 2 = 12(Column A)
Second example(3 tasks):
A B C D E
10 9 0 1 0
21 8 8 5 0
5 0 0 3 2
Using the formula:
A B C D E
10 9 0 1 0
21 8 8 5 0
5 0 0 3 2
0 0 0 31 5
Where:
31(Day D) = 10(Task 1 is finished that day) + 21(Task 2 is finished that day too)
5(Day E) = Task 3 is finished that day
Tried this formula (for Day B):
SUMIF(B1:B3,">=A1:A3",A1:A3)
(Sum the values in column A for which the cells in that row up to column B (in this case just B) are >= the allocated value.)
Then for column C, it would be,
SUMIF(C1:C3 + B1:B3,">=A1:A3",A1:A3)
The above examples did not work (the first returns zero, the second is an invalid formula).
Any ideas?
Thank you.
Formula below given by user ServerS works fine:
Col B:
=IF(SUM(B2)=A2,A2,0)+IF(SUM(B3)=A3,A3,0)+IF(SUM(B4)=A4,A4,0)+IF(SUM(B5)=A5,A5,0)
Col C:
=IF(SUM(B2:C2)=A2,A2,0)+IF(SUM(B3:C3)=A3,A3,0)+IF(SUM(B4:C4)=A4,A4,0)+IF(SUM(B5:C5)=A5,A5,0)
Col D
=IF(SUM(B2:D2)=A2,A2,0)+IF(SUM(B3:D3)=A3,A3,0)+IF(SUM(B4:D4)=A4,A4,0)+IF(SUM(B5:D5)=A5,A5,0)
However, there are two drawbacks:
If new rows are added, the formula needs to be adapted to include another IF(). It would be better to have a generic SUM of IFs.
Propagating the formula to adjacent cells is not possible, as it would change parts of the formula such as "=A2,A2,0" to "=B2,B2,0", which need to stay the same.
Any other ideas that improve this, if possible, are appreciated.
You can avoid using IF with SUMPRODUCT. This method allows you to insert any rows you want. Make sure the ranges are correct (e.g. A2:A5, with 5 being the last row used). I would go for this:
in column B :
=SUMPRODUCT(($A$2:$A$5)*($A$2:$A$5=(B2:B5)))
in column C :
=SUMPRODUCT(($A$2:$A$5)*($A$2:$A$5=(B2:B5+C2:C5)))-B6
in column D
=SUMPRODUCT(($A$2:$A$5)*($A$2:$A$5=(B2:B5+C2:C5+D2:D5)))-C6-B6
in column E
=SUMPRODUCT(($A$2:$A$5)*($A$2:$A$5=(B2:B5+C2:C5+D2:D5+E2:E5)))-D6-C6-B6
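To make the logic behind these formulas easier to check, here is a small NumPy sketch of the same calculation on the three-task example from the question. It is an illustration, not an Excel solution. The names allocated, spent and per_day are chosen here, and it uses >= as in the question's wording rather than the exact equality in the formulas.

```python
import numpy as np

# Column A: time allocated to each task
allocated = np.array([10, 21, 5])
# Columns B..E: time spent on each task per day
spent = np.array([
    [9, 0, 1, 0],
    [8, 8, 5, 0],
    [0, 0, 3, 2],
])

# Cumulative time per task up to and including each day
cumulative = spent.cumsum(axis=1)
# True once a task has reached its allocated time
done = cumulative >= allocated[:, None]
# A task finishes on the first day it is done: done today but not the day before.
# This plays the role of subtracting the earlier totals (-B6, -C6, ...) in the formulas.
done_before = np.hstack([np.zeros((len(allocated), 1), dtype=bool), done[:, :-1]])
finished_today = done & ~done_before

# Sum the allocated time of the tasks finished on each day
per_day = (allocated[:, None] * finished_today).sum(axis=0)
print(per_day)
```

For this data the per-day totals are 0, 0, 31 and 5, matching the expected row in the question.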
