How to add all values in a column in MS Excel?

I have a worksheet with the following values:

col1 | col2
-----|--------------
 1   | 2 min 30 secs
 2   | 1 min 24 secs
 3   | 0 min 10 secs
 4   | 1 min 3 secs

Now I would like to sum up all the values in col2. The sum will be 4 min 67 secs (i.e. 5 min 7 secs).

Here is a working example: enter the durations as Excel time values, format the time cells with the custom number format "m \m\i\n s \s\e\c\s", and simply sum them up:
http://sourcemonk.com/sum_of_formatted_time.xlsx
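As a minimal sketch of what the linked workbook does (assuming the durations sit in cells B2:B5 and were entered as true Excel time values such as 0:02:30), apply the custom format above to B2:B5 and to the total cell, then sum with:

=SUM(B2:B5)

If the total could exceed 59 minutes, use the elapsed-minutes code [m] in the format ([m] \m\i\n s \s\e\c\s) so the minutes do not wrap around at 60.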

Related

Finding a formula to compute partial sums for consecutive rows that satisfy a condition

How do I do this in Excel? An example is provided below. If we don't sell any products on a specific day, I would like to move those hours to the next date.
If you need to compute partial sums of Hours for all the consecutive rows between rows that satisfy the condition that Sold is greater than zero, you can do this with an auxiliary column:
        A     B     C     D
  --------------------------------
1 | Hours  Sold  Sums  Solution
2 |   300    30   300       300
3 |    30     0   300         0
4 |     0     0   300         0
5 |    30     0   300         0
6 |   300    50   660       360
7 |    23     0   660         0
8 |   100    25   783       123
Here Sums is defined by the following formula in C2:
=IF(B2>0,SUM(A2:$A$2),C1)
which you can fill down to populate the cells below.
This formula puts the partial sum of Hours up to the current row into the cell if Sold is nonzero; otherwise it copies the previous partial sum. We need this value so we can subtract it in the next step.
When column C is filled, it is sufficient to put the following formula in D2 and fill it down:
=IF(ROW(B2)>2,IF(B2>0,C2-C1,0),C2)
This formula correctly handles both D2, which has no preceding row of values, and the remaining cells in column D.
In fact, you can combine the two formulas and avoid the auxiliary column altogether. Put the following formula in C2 and fill it down the rest of column C
=IF(ROW(B2)>2,IF(B2>0,SUM(A2:$A$2)-SUM(C1:$C$2),0),A2)
to get
        A     B     C
  -------------------------
1 | Hours  Sold  Solution
2 |   300    30       300
3 |    30     0         0
4 |     0     0         0
5 |    30     0         0
6 |   300    50       360
7 |    23     0         0
8 |   100    25       123

How can I get the count of sequential events pairs from a Pandas dataframe?

I have a dataframe that looks like this:
ID  EVENT  DATE
 1      1   142
 1      5   167
 1      3   245
 2      1    54
 2      5    87
 3      3   165
 3      2   178
And I would like to generate something like this:
EVENT_1  EVENT_2  COUNT
      1        5      2
      5        3      1
      3        2      1
The idea is to count how many items (IDs) go from one event to the next. I don't care about earlier states; I only want the transition from the current state to the next one (e.g. for ID 1, I don't want to count a transition from 1 to 3, because it first goes to event 5 and only then to 3).
The date format is the number of days from a specific date (sort of like the SAS date format).
Is there a clean way to achieve this?
Let's try this:
(df.groupby([df['EVENT'].rename('EVENT_1'),
             df.groupby('ID')['EVENT'].shift(-1).rename('EVENT_2')])['ID']
   .count()).rename('COUNT').reset_index().astype(int)
Output:
| | EVENT_1 | EVENT_2 | COUNT |
|---:|----------:|----------:|--------:|
| 0 | 1 | 5 | 2 |
| 1 | 3 | 2 | 1 |
| 2 | 5 | 3 | 1 |
Details: Groupby on 'EVENT' and shifted 'EVENT' within each ID, then count.
You could use groupby and shift. We'll also use rename_axis and reset_index to tidy up the final output:
(pd.concat([f.groupby([f['EVENT'], f['EVENT'].shift(-1).astype('Int64')]).size()
            for _, f in df.groupby('ID')])
   .groupby(level=[0, 1]).sum()
   .rename_axis(['EVENT_1', 'EVENT_2']).reset_index(name='COUNT'))
[out]
   EVENT_1  EVENT_2  COUNT
0        1        5      2
1        3        2      1
2        5        3      1
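For completeness, a minimal runnable sketch that rebuilds the example dataframe from the question and applies the groupby/shift approach from the first answer (values copied from the post):

import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'ID':    [1, 1, 1, 2, 2, 3, 3],
    'EVENT': [1, 5, 3, 1, 5, 3, 2],
    'DATE':  [142, 167, 245, 54, 87, 165, 178],
})

# Pair each event with the next event of the same ID, then count each pair;
# the last event of every ID has no successor and drops out of the groupby.
pairs = (df.groupby([df['EVENT'].rename('EVENT_1'),
                     df.groupby('ID')['EVENT'].shift(-1).rename('EVENT_2')])['ID']
           .count()
           .rename('COUNT')
           .reset_index()
           .astype(int))
print(pairs)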

How to convert multi-indexed datetime index into integer?

I have a multi-indexed dataframe (the result of a groupby on 'id' and 'date').
                x  y
id  date
abc 3/1/1994  100  7
    9/1/1994   90  8
    3/1/1995   80  9
bka 5/1/1993   50  8
    7/1/1993   40  9
I'd like to convert those dates into an integer-like label, such as

             x  y
id  date
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9

I thought it would be simple, but I couldn't get there easily. Is there a simple way to do this?
Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
              x  y
id
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples((x, y) for x, y in zip(df.index.get_level_values('id'), lvl1))
Output:

             x  y
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
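For reference, a minimal runnable sketch that recreates the example frame and applies the cumcount/set_index approach from the first answer (the level name 'day' is my addition):

import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'id':   ['abc', 'abc', 'abc', 'bka', 'bka'],
    'date': ['3/1/1994', '9/1/1994', '3/1/1995', '5/1/1993', '7/1/1993'],
    'x':    [100, 90, 80, 50, 40],
    'y':    [7, 8, 9, 8, 9],
}).set_index(['id', 'date'])

# Number the rows within each id, turn that into a 'day N' label,
# append it as a new index level, and drop the original 'date' level.
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index(s.rename('day'), append=True).droplevel('date')
print(df1)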

Looping to create a new column based on other column values in Python Dataframe [duplicate]

This question already has answers here:
How do I create a new column from the output of pandas groupby().sum()?
(4 answers)
Closed 3 years ago.
I want to create a new column in a Python dataframe based on the values of another column across multiple rows.
For example, my dataframe df is:
A | B
------------
10 | 1
20 | 1
30 | 1
10 | 1
10 | 2
15 | 3
10 | 3
I want to create a variable C based on the values of variable A, conditioned on variable B across multiple rows: when variable B has the same value in rows i, i+1, ..., the value of C is the sum of variable A over those rows. In this case, my output dataframe will be:
A | B | C
--------------------
10 | 1 | 70
20 | 1 | 70
30 | 1 | 70
10 | 1 | 70
10 | 2 | 10
15 | 3 | 25
10 | 3 | 25
I haven't got any idea of the best way to achieve this. Can anyone help?
Thanks in advance.
First, recreate the data:
import pandas as pd
A = [10,20,30,10,10,15,10]
B = [1,1,1,1,2,3,3]
df = pd.DataFrame({'A':A, 'B':B})
df
    A  B
0  10  1
1  20  1
2  30  1
3  10  1
4  10  2
5  15  3
6  10  3
Then I'll create a lookup Series from the df:
lookup = df.groupby('B')['A'].sum()
lookup

      A
B
1    70
2    10
3    25
Then I'll use that lookup on the df with apply:
df.loc[:, 'C'] = df.apply(lambda row: lookup[lookup.index == row['B']].values[0], axis=1)
df
    A  B   C
0  10  1  70
1  20  1  70
2  30  1  70
3  10  1  70
4  10  2  10
5  15  3  25
6  10  3  25
You can use the groupby() method to group the rows on B, then broadcast the sum of A back to every row with transform():
df['C'] = df.groupby('B')['A'].transform(sum)
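As a side note, a simpler (and usually faster) way to use the lookup Series from the first answer is to map it onto column B instead of calling apply; a minimal sketch with the same example data:

import pandas as pd

# Same example data as above.
df = pd.DataFrame({'A': [10, 20, 30, 10, 10, 15, 10],
                   'B': [1, 1, 1, 1, 2, 3, 3]})

# Per-group sums of A, indexed by the values of B.
lookup = df.groupby('B')['A'].sum()

# Map each row's B value to its group sum; equivalent to the transform answer.
df['C'] = df['B'].map(lookup)
print(df)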

Create "leakage-free" Variables in Python?

I have a pandas dataframe with several thousand observations, and I would like to create "leakage-free" variables in Python, i.e. I am looking for a way to calculate, for example, a group-specific mean of a variable without including the single observation in row i.
For example:
| Group | Price | leakage-free Group Mean |
-------------------------------------------
|   1   |  20   |            25           |
|   1   |  40   |            15           |
|   1   |  10   |            30           |
|   2   |  ...  |            ...          |
I would like to do this for several variables, creating the mean, median and variance, so a computationally fast method would be good. If a group has only one row, I would like to enter 0 in the leakage-free variable.
As I am rather a beginner in Python, a piece of code would be very helpful. Thank you!
With a one-liner:
df = pd.DataFrame({'Group': [1,1,1,2], 'Price':[20,40,10,30]})
df['lfgm'] = df.groupby('Group').transform(lambda x: (x.sum()-x)/(len(x)-1)).fillna(0)
print(df)
Output:
   Group  Price  lfgm
0      1     20  25.0
1      1     40  15.0
2      1     10  30.0
3      2     30   0.0
Update:
For median and variance (not one-liners unfortunately):
df = pd.DataFrame({'Group': [1,1,1,1,2], 'Price':[20,100,10,70,30]})
def f(x):
    for i in x.index:
        z = x.loc[x.index != i, 'Price']
        x.at[i, 'mean'] = z.mean()
        x.at[i, 'median'] = z.median()
        x.at[i, 'var'] = z.var()
    return x[['mean', 'median', 'var']]
df = df.join(df.groupby('Group').apply(f))
print(df)
Output:
   Group  Price       mean  median          var
0      1     20  60.000000    70.0  2100.000000
1      1    100  33.333333    20.0  1033.333333
2      1     10  63.333333    70.0  1633.333333
3      1     70  43.333333    20.0  2433.333333
4      2     30        NaN     NaN          NaN
Use:
grp = df.groupby('Group')
n = grp['Price'].transform('count')
mean = grp['Price'].transform('mean')
df['new_col'] = (mean*n - df['Price'])/(n-1)
print(df)
   Group  Price  new_col
0      1     20     25.0
1      1     40     15.0
2      1     10     30.0
Note: This solution will be faster than using apply; you can compare the two with %%timeit.
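Tying the last answer back to the original request (a 0 for single-row groups), here is a minimal runnable sketch using the example data from the question; the column name loo_mean is mine:

import pandas as pd

# Example data from the question; Group 2 has a single row.
df = pd.DataFrame({'Group': [1, 1, 1, 2], 'Price': [20, 40, 10, 30]})

grp = df.groupby('Group')
n = grp['Price'].transform('count')
mean = grp['Price'].transform('mean')

# Leave-one-out mean: (group sum minus this row) / (group count minus one).
# Single-row groups give 0/0 -> NaN, which is replaced by 0 as requested.
df['loo_mean'] = ((mean * n - df['Price']) / (n - 1)).fillna(0)
print(df)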
