How do I add a new column that sums counts from an existing column? - python-3.x

I have this python code:
counting_bach_new = counting_bach.groupby(['User Name', 'time_diff', 'Logon Time']).size()
print("\ncounting_bach_new")
print(counting_bach_new)
...getting this neat result:
counting_bach_new
User Name  time_diff            Logon Time
122770     -132 days +21:38:00  1             1
           -122 days +00:41:00  1             1
123526     -30 days +12:04:00   1             1
           -29 days +16:39:00   1             1
           -27 days +18:16:00   1             1
                                             ..
201685     -131 days +21:21:00  1             1
202047     -106 days +10:14:00  1             1
202076     -132 days +10:22:00  1             1
           -132 days +14:46:00  1             1
           -131 days +21:21:00  1             1
So how do I add a new column that sums up counts from an existing column? The rightmost column of 1's should be disregarded; instead, I would like to add a new column summing up the counts of time_diffs per 'User Name', i.e. the new column should hold the number of observations listed per user (counting either time_diffs or Logon Times). For User Name 122770 the new column should sum to 2, for 123526 it should sum to 3, and so on.
I made several attempts, none of which worked, including:
counting_bach_new.groupby('User Name').agg(MySum=('Logon Time', 'sum'), MyCount=('Logon Time', 'count'))
Any help would be appreciated. Thank you for your kind support. Christmas greetings from #Hubsandspokes

Use DataFrame.join with Series.reset_index:
df = (counting_bach_new.to_frame('count')
        .join(counting_bach_new.reset_index()
                               .groupby('User Name')
                               .agg(MySum=('Logon Time', 'sum'),
                                    MyCount=('Logon Time', 'count')),
              on='User Name'))
print(df)
                                            count  MySum  MyCount
User Name time_diff           Logon Time
122770    -132 days +21:38:00 1                 1      2        2
          -122 days +00:41:00 1                 1      2        2
123526    -30 days +12:04:00  1                 1      3        3
          -29 days +16:39:00  1                 1      3        3
          -27 days +18:16:00  1                 1      3        3
201685    -131 days +21:21:00 1                 1      1        1
202047    -106 days +10:14:00 1                 1      1        1
202076    -132 days +10:22:00 1                 1      3        3
          -132 days +14:46:00 1                 1      3        3
          -131 days +21:21:00 1                 1      3        3
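A rough equivalent that skips the reset_index/join round trip is to group directly on the 'User Name' index level (a sketch against the counting_bach_new Series shown above; it sums the size column rather than the 'Logon Time' level, which gives the same numbers here because both are 1 on every row):
df = counting_bach_new.to_frame('count')
grp = df.groupby(level='User Name')['count']
df['MySum'] = grp.transform('sum')      # total of the counts per user
df['MyCount'] = grp.transform('count')  # number of rows per user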

If I understand the request correctly, try:
counting_bach_new.reset_index().groupby(['User Name'])['Logon Time'].count()
If you need to keep the starting number of rows (so the result can be added back as a column), try:
counting_bach_new.reset_index().groupby(['User Name'])['Logon Time'].transform('count')
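To attach that count as a new column, a minimal sketch (assuming you first reset the index into a regular DataFrame) would be:
out = counting_bach_new.reset_index(name='count')
out['MyCount'] = out.groupby('User Name')['Logon Time'].transform('count')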

Related

Row comparison on different tables

Friends, I'm trying to figure out a formula that verifies whether each row of Table 2 has a matching row in Table 1. If not, the formula should show that the row is not listed, as in column E (CHECK). Is that possible? Or maybe a VBA macro?
TABLE 1
A    B    C    D
29   1    1    1
29   2    1    2
30   3    1    2
15   1    1    1
15   2    1    2
15   3    1    2
20   1    1    1
20   2    1    2
20   3    2    1
20   4    2    2
20   5    1    3
TABLE 2
A    B    C    D    CHECK
29   1    1    1    EXISTS
15   1    1    2    NOT
15   2    1    2    EXISTS
15   3    1    2    EXISTS
20   6    1    1    NOT
100  1    2    3    NOT LISTED
Thanks, guys, would appreciate some help.

How to add days based on another date?

I want to add +2 days to a column based on another column. I'm using this table:
Company  Type  Joinning Date  Starting day
1        1     19/01/2019
2        0     19/01/2019
3        0     19/01/2019
4        1     20/01/2019
5        0     20/01/2019
6        1     21/01/2019
I want to put Joining Date + 2 days into the Starting day column whenever the company has Type 1. How can I do it?
What I've tried: [screenshot omitted]
Desired Results
Company  Type  Joinning Date  Starting day
1        1     19/01/2019     21/01/2019
2        0     19/01/2019
3        0     19/01/2019
4        1     20/01/2019     22/01/2019
5        0     20/01/2019
6        1     21/01/2019     23/01/2019
Just to show that my comment of:
=IF(B2=1,C2+2,"")
works. The output cell must be formatted as a date (or whatever display format you want).

How to convert multi-indexed datetime index into integer?

I have a multi-indexed DataFrame, the result of a groupby (by 'id' and 'date').
                x  y
id  date
abc 3/1/1994  100  7
    9/1/1994   90  8
    3/1/1995   80  9
bka 5/1/1993   50  8
    7/1/1993   40  9
I'd like to convert those dates into integer-like labels, such as
             x  y
id  date
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
I thought it would be simple but I couldn't get there easily. Is there a simple way to work on this?
Try this:
# enumerate the rows within each 'id' (level 0) and prefix with 'day '
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
# append the new labels as an index level, then drop the original 'date' level
df1 = df.set_index([s], append=True).droplevel(1)
             x  y
id
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
You can calculate the new level and create a new index:
# enumerate the rows within each 'id' to build the new second-level labels
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
# rebuild the MultiIndex by pairing each 'id' with its new label
df.index = pd.MultiIndex.from_tuples((x, y) for x, y in zip(df.index.get_level_values('id'), lvl1))
output:
             x  y
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
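If you also want to keep the level names on the rebuilt index, a small variation (a sketch reusing the same lvl1, used in place of the from_tuples line) is pd.MultiIndex.from_arrays, which accepts a names argument:
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values('id'), lvl1],
    names=['id', 'date'])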

How to randomly generate unobserved data in Python 3

I have a DataFrame which contains the observed data:
import pandas as pd

d = {'humanID': [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     'dogID':   [1, 2, 1, 5, 4, 6, 7, 20, 9, 7],
     'month':   [1, 1, 2, 3, 1, 2, 3, 1, 2, 2]}
df = pd.DataFrame(data=d)
The df is as follows:
   humanID  dogID  month
0        1      1      1
1        1      2      1
2        2      1      2
3        2      5      3
4        2      4      1
5        2      6      2
6        2      7      3
7        2     20      1
8        2      9      2
9        2      7      2
We have two humans and twenty dogs in total, and the above df contains the observed data. For example:
The first row means: human1 adopted dog1 in January
The second row means: human1 adopted dog2 in January
The third row means: human2 adopted dog1 in February
========================================================================
My goal is to randomly generate two unobserved records for each (human, month) pair, i.e. (humanID, dogID, month) triples that do not appear in the original observed data.
For example, human1 in January did not adopt dogs [3, 4, 5, 6, 7, ..., 20], and I want to randomly create two unobserved samples in triple form:
humanID  dogID  month
1        20     1
1        10     1
However, the following sample is not allowed since it appears in the original df:
humanID  dogID  month
1        2      1
For human1, there is no activity in February, so we don't need to sample unobserved data for that month.
For human2, there is activity in January, February and March, so for each of those months we want to randomly create unobserved data. For example, in January human2 adopted dog1, dog4 and dog20, so two random unobserved samples could be:
humanID  dogID  month
2        2      1
2        6      1
The same process applies to February and March.
I want to put all of the unobserved samples in one DataFrame, such as the following:
   humanID  dogID  month
0        1     20      1
1        1     10      1
2        2      2      1
3        2      6      1
4        2     13      2
5        2     16      2
6        2      1      3
7        2     20      3
Any fast way to do this?
PS: this is a code interview for a start-up company.
Using groupby and random.choices:
import random

dogs = list(range(1, 21))
dfs = []
n_sample = 2
for i, d in df.groupby(['humanID', 'month']):
    h_id, month = i
    sample = pd.DataFrame([(h_id, dogID, month)
                           for dogID in random.choices(list(set(dogs) - set(d['dogID'])),
                                                       k=n_sample)])
    dfs.append(sample)
new_df = pd.concat(dfs).reset_index(drop=True)
new_df.columns = ['humanID', 'dogID', 'month']
print(new_df)
   humanID  dogID  month
0        1     11      1
1        1      5      1
2        2     19      1
3        2     18      1
4        2     15      2
5        2     14      2
6        2     16      3
7        2     18      3
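One caveat: random.choices samples with replacement, so the same dog can occasionally be drawn twice for a single (human, month). If the two samples must be distinct, random.sample is a small change; a sketch assuming the df and pandas import from the question:
import random

n_sample = 2
dogs = set(range(1, 21))
dfs = []
for (h_id, month), d in df.groupby(['humanID', 'month']):
    # sample without replacement so the two unobserved dogs are always distinct
    picks = random.sample(sorted(dogs - set(d['dogID'])), k=n_sample)
    dfs.append(pd.DataFrame({'humanID': h_id, 'dogID': picks, 'month': month}))
new_df = pd.concat(dfs, ignore_index=True)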
If I understand you correctly, you can use np.random.permutation() on the dogID column to generate a random permutation of that column:
import numpy as np

df_new = df.copy()
df_new['dogID'] = np.random.permutation(df.dogID)
print(df_new.sort_values('month'))
   humanID  dogID  month
0        1      1      1
1        1     20      1
4        2      9      1
7        2      1      1
2        2      4      2
5        2      5      2
8        2      2      2
9        2      7      2
3        2      7      3
6        2      6      3
Or, to create a random sampling of values within the range of dogID:
df_new = df.copy()
a = np.random.permutation(range(df_new.dogID.min(), df_new.dogID.max()))
df_new['dogID'] = np.random.choice(a, df_new.shape[0])
print(df_new.sort_values('month'))
   humanID  dogID  month
0        1     18      1
1        1     16      1
4        2      1      1
7        2      8      1
2        2      4      2
5        2      2      2
8        2     16      2
9        2     14      2
3        2      4      3
6        2     12      3
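Note that neither permutation-based variant guarantees that the generated rows are actually unobserved or that exactly two rows are produced per (human, month). A merge-based sketch that enforces both, assuming the df from the question, 20 possible dogs, and pandas >= 1.2 for how='cross':
all_dogs = pd.DataFrame({'dogID': range(1, 21)})
# every candidate (humanID, month, dogID) for the (human, month) pairs that were observed
candidates = df[['humanID', 'month']].drop_duplicates().merge(all_dogs, how='cross')
# anti-join: keep only candidates that never appear in the observed data
unobserved = (candidates.merge(df.drop_duplicates(), how='left', indicator=True,
                               on=['humanID', 'dogID', 'month'])
                        .query('_merge == "left_only"')
                        .drop(columns='_merge'))
# draw exactly two unobserved rows per (human, month)
sampled = (unobserved.groupby(['humanID', 'month'], group_keys=False)
                     .apply(lambda g: g.sample(2))
                     .reset_index(drop=True)[['humanID', 'dogID', 'month']])
print(sampled)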

excel 2016 - delete rows based on multiple

I am trying to delete rows when the date in column B is not present exactly 4 times for a given filekey in column C. Sample data below:
A     B          C
Row   Date       Filekey
2     1/6/2014   1
3     1/6/2014   1
4     1/6/2014   1
5     1/6/2014   1
6     1/7/2014   1
7     1/7/2014   1
8     1/8/2014   1
9     1/9/2014   1
10    1/9/2014   1
11    1/9/2014   1
12    1/9/2014   1
13    1/9/2014   1
14    1/6/2014   2
15    1/6/2014   2
16    1/6/2014   2
17    1/6/2014   2
The result I am looking for:
Row   Date       Filekey
2     1/6/2014   1
3     1/6/2014   1
4     1/6/2014   1
5     1/6/2014   1
14    1/6/2014   2
15    1/6/2014   2
16    1/6/2014   2
17    1/6/2014   2
Please note that rows 6-7 were removed because their date appears only 2 times (too few), row 8 because its date appears once (too few), and rows 9-13 because their date appears 5 times (too many).
Rows 14-17 were kept because:
there are exactly 4 rows with that date, and they have a different filekey (column C) than rows 2-5, even though they share the same date.
Thanks for your help.
In cell D2 use this formula and copy down:
=COUNTIFS(B:B,B2,C:C,C2)
Then filter column D for everything other than 4 and delete those rows. Remove the filter, and you can then delete the formulas in column D.
