How to use groupby in a filter condition in pandas - python-3.x

I have the data below stored in a dataframe, and I want to remove the rows where id equals finalid and there are multiple rows for that same id.
Example:
df_target
id finalid month year count_ph count_sh
1 1 1 2012 12 20
1 2 1 2012 6 18
1 32 1 2012 6 2
2 2 1 2012 2 6
2 23 1 2012 2 6
3 3 1 2012 2 2
output
id finalid month year count_ph count_sh
1 2 1 2012 6 18
1 32 1 2012 6 2
2 23 1 2012 2 6
3 3 1 2012 2 2
The functionality I need is something like the following: remove those records and get the final dataframe.
(df_target.groupby(['id','month','year']).size() > 1) & (df_target['id'] == df_target['finalid'])

groupby(...).size() returns one value per group, so it cannot be combined with a row-level mask; I think you need transform to get a Series the same length as the original DataFrame, and ~ to invert the final boolean mask:
df = df_target[~((df_target.groupby(['id','month','year'])['id'].transform('size') > 1) &
                 (df_target['id'] == df_target['finalid']))]
Alternative solution (the same mask with De Morgan's laws applied, so no inversion is needed):
df = df_target[((df_target.groupby(['id','month','year'])['id'].transform('size') <= 1) |
                (df_target['id'] != df_target['finalid']))]
print (df)
id finalid month year count_ph count_sh
1 1 2 1 2012 6 18
2 1 32 1 2012 6 2
4 2 23 1 2012 2 6
5 3 3 1 2012 2 2
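For reference, a minimal end-to-end sketch that reproduces the result above; the frame is rebuilt here from the question's table:

import pandas as pd

df_target = pd.DataFrame({
    'id':       [1, 1, 1, 2, 2, 3],
    'finalid':  [1, 2, 32, 2, 23, 3],
    'month':    [1, 1, 1, 1, 1, 1],
    'year':     [2012] * 6,
    'count_ph': [12, 6, 6, 2, 2, 2],
    'count_sh': [20, 18, 2, 6, 6, 2],
})

# broadcast each (id, month, year) group size back to the rows, then drop
# rows where the id is duplicated within its group and equals finalid
size = df_target.groupby(['id', 'month', 'year'])['id'].transform('size')
df = df_target[~((size > 1) & (df_target['id'] == df_target['finalid']))]
print(df)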

Related

Row comparison on different tables

Friends, I'm trying to figure out a formula that checks whether each row of TABLE 2 has a matching row in TABLE 1. If a row does not, the formula should flag it as not listed, as shown in column E (CHECK). Is that possible? Or maybe a VBA macro?
TABLE 1
A   B  C  D
29  1  1  1
29  2  1  2
30  3  1  2
15  1  1  1
15  2  1  2
15  3  1  2
20  1  1  1
20  2  1  2
20  3  2  1
20  4  2  2
20  5  1  3
TABLE 2
A    B  C  D  CHECK
29   1  1  1  EXISTS
15   1  1  2  NOT LISTED
15   2  1  2  EXISTS
15   3  1  2  EXISTS
20   6  1  1  NOT LISTED
100  1  2  3  NOT LISTED
Thanks, guys, would appreciate some help.
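This question asks for an Excel formula or VBA (in Excel itself, a COUNTIFS across the four columns would do the job), but since this page is pandas-centric, here is a hedged pandas sketch of the same row check; the t1/t2 names are mine:

import pandas as pd

# TABLE 1 and TABLE 2 rebuilt from the question
t1 = pd.DataFrame([(29, 1, 1, 1), (29, 2, 1, 2), (30, 3, 1, 2), (15, 1, 1, 1),
                   (15, 2, 1, 2), (15, 3, 1, 2), (20, 1, 1, 1), (20, 2, 1, 2),
                   (20, 3, 2, 1), (20, 4, 2, 2), (20, 5, 1, 3)],
                  columns=list('ABCD'))
t2 = pd.DataFrame([(29, 1, 1, 1), (15, 1, 1, 2), (15, 2, 1, 2), (15, 3, 1, 2),
                   (20, 6, 1, 1), (100, 1, 2, 3)], columns=list('ABCD'))

# a left merge with indicator=True flags which rows of t2 also occur in t1
merged = t2.merge(t1.drop_duplicates(), on=list('ABCD'), how='left', indicator=True)
t2['CHECK'] = merged['_merge'].map({'both': 'EXISTS', 'left_only': 'NOT LISTED'})
print(t2)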

pandas - rolling sum last seven days over different rows

Starting from this data frame:
id  date    value
1   01.01.  2
2   01.01.  3
1   01.03.  5
2   01.03.  3
1   01.09.  5
2   01.09.  2
1   01.10.  5
2   01.10.  2
I would like to get a weekly sum of value:
id  date    value
1   01.01.  2
2   01.01.  3
1   01.03.  7
2   01.03.  6
1   01.09.  10
2   01.09.  5
1   01.10.  15
2   01.10.  8
I use this command, but it is not working:
df['value'] = df.groupby('id')['value'].rolling(7).sum()
Any ideas?
You can do this with groupby and apply. rolling(7) looks back over the last 7 rows rather than the last 7 days, so convert the dates to datetime first and use a time-based '7D' window:
df['date'] = pd.to_datetime(df['date'], format='%m.%d.')
df['value'] = (df.groupby('id', as_index=False, group_keys=False)
                 .apply(lambda g: g.rolling('7D', on='date')['value'].sum()))
Note that for 1900-01-10 the rolling window spans 1900-01-04 through 1900-01-10, which is why the 01.03. values are no longer included there.
print(df)
id date value
0 1 1900-01-01 2.0
1 2 1900-01-01 3.0
2 1 1900-01-03 7.0
3 2 1900-01-03 6.0
4 1 1900-01-09 10.0
5 2 1900-01-09 5.0
6 1 1900-01-10 10.0
7 2 1900-01-10 4.0
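For completeness, the frame used above can be rebuilt from the question's table like this (a sketch; pandas fills in 1900 as the year for the %m.%d. format, which is why the printed dates start with 1900):

import pandas as pd

df = pd.DataFrame({
    'id':    [1, 2, 1, 2, 1, 2, 1, 2],
    'date':  ['01.01.', '01.01.', '01.03.', '01.03.',
              '01.09.', '01.09.', '01.10.', '01.10.'],
    'value': [2, 3, 5, 3, 5, 2, 5, 2],
})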

Replacing the first column values according to the second column pattern

How can I use a regex to replace values in a DataFrame, specifically in the 5th column, according to a pattern in the 1st column? Column 5 consists only of ones for now, but I would like to change it every time the pattern 34444 appears in the 1st column: the program should replace the ones with 11111, 22222, 33333, etc., incrementing at each occurrence of the pattern until the end of the file.
Sample of the file:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 1 3 138.998480 12.596951 0.223780
22 12 1 4 138.333252 11.884713 -0.281429
23 13 1 4 139.498084 13.356891 -0.480091
24 14 1 4 139.710930 11.981460 0.697098
25 15 1 4 138.452807 13.136061 0.990663
Expected result:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 2 3 138.998480 12.596951 0.223780
22 12 2 4 138.333252 11.884713 -0.281429
23 13 2 4 139.498084 13.356891 -0.480091
24 14 2 4 139.710930 11.981460 0.697098
25 15 2 4 138.452807 13.136061 0.990663
Yeah, if you really want re, there is a way, but I doubt it would really be more efficient than a for-loop.
1. re.finditer
import pandas as pd
import numpy as np
import re

# represent column '1' as one string of digits
arr1 = df['1'].values
str1 = "".join([str(i) for i in arr1])
ans = np.ones(len(str1), dtype=int)

# each time the pattern is found, increase all later elements by 1
for match in re.finditer('34444', str1):
    e = match.end()
    ans[e:] += 1

# replace column 5
df['5'] = ans
# Output
df[['0', '5', '1']]
Out[50]:
0 5 1
11 1 1 1
12 2 1 1
13 3 1 1
14 4 1 1
15 5 1 1
16 6 1 3
17 7 1 4
18 8 1 4
19 9 1 4
20 10 1 4
21 11 2 3
22 12 2 4
23 13 2 4
24 14 2 4
25 15 2 4
2. naïve for-loop
This checks the array directly, element by element. Compared with re.finditer no type casting is involved, but an explicit for-loop is needed; the output is the same. Benchmark it yourself if efficiency becomes relevant, say with tens of millions of rows.
arr1 = df['1'].values
ans = np.ones(len(arr1), dtype=int)
n = len(arr1)
for i, el in enumerate(arr1):
    # termination: fewer than 5 elements remain
    if i > n - 5:
        break
    # ignore non-3 elements
    if el != 3:
        continue
    # if a full 3,4,4,4,4 run is found, increase all later elements by 1
    if np.all(arr1[i+1:i+5] == 4):
        ans[i+5:] += 1
df['5'] = ans
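For completeness, a vectorized sketch of the same logic (an addition, not part of either answer above): shifted comparisons mark where a 34444 run in column '1' ends, and a cumulative sum adds 1 to everything after each completed run, matching re.finditer's result for this pattern.

s = df['1']
# True at the last element of each 3,4,4,4,4 run
pat = (s.shift(4).eq(3) & s.shift(3).eq(4) & s.shift(2).eq(4)
       & s.shift(1).eq(4) & s.eq(4))
# start at 1 and step up right after each completed run
df['5'] = 1 + pat.shift(fill_value=False).cumsum()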

How to randomly generate unobserved data in Python 3

I have a dataframe which contains the observed data:
import pandas as pd
d = {'humanID': [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     'dogID': [1, 2, 1, 5, 4, 6, 7, 20, 9, 7],
     'month': [1, 1, 2, 3, 1, 2, 3, 1, 2, 2]}
df = pd.DataFrame(data=d)
The df looks like this:
humanID dogID month
0 1 1 1
1 1 2 1
2 2 1 2
3 2 5 3
4 2 4 1
5 2 6 2
6 2 7 3
7 2 20 1
8 2 9 2
9 2 7 2
In total we have two humans and twenty dogs, and the df above contains the observed data. For example:
The first row means: human1 adopted dog1 in January.
The second row means: human1 adopted dog2 in January.
The third row means: human2 adopted dog1 in February.
========================================================================
My goal is to randomly generate, for each (human, month) pair, two unobserved samples that do not appear in the original observed data.
For example, human1 in January did not adopt the dogs [3,4,5,6,7,...,20], and I want to randomly create two unobserved samples in triple form:
humanID dogID month
1 20 1
1 10 1
However, the following sample is not allowed, since it appears in the original df:
humanID dogID month
1 2 1
Human1 has no activity in February, so we don't need to sample unobserved data for him that month.
Human2 has activity in January, February and March, so for each of those months we want to randomly create unobserved data. For example, in January human2 adopted dog1, dog4 and dog20, so the two random unobserved samples could be:
humanID dogID month
2 2 1
2 6 1
The same process applies to February and March.
I want to put all of the unobserved samples in one dataframe, such as the following:
humanID dogID month
0 1 20 1
1 1 10 1
2 2 2 1
3 2 6 1
4 2 13 2
5 2 16 2
6 2 1 3
7 2 20 3
Is there a fast way to do this?
PS: this is from a coding interview at a start-up company.
Using groupby and random.choices:
import random

dogs = list(range(1, 21))
dfs = []
n_sample = 2
for i, d in df.groupby(['humanID', 'month']):
    h_id, month = i
    sample = pd.DataFrame([(h_id, dogID, month)
                           for dogID in random.choices(list(set(dogs) - set(d['dogID'])),
                                                       k=n_sample)])
    dfs.append(sample)
new_df = pd.concat(dfs).reset_index(drop=True)
new_df.columns = ['humanID', 'dogID', 'month']
print(new_df)
humanID dogID month
0 1 11 1
1 1 5 1
2 2 19 1
3 2 18 1
4 2 15 2
5 2 14 2
6 2 16 3
7 2 18 3
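One design note on the snippet above: random.choices samples with replacement, so the two picks can coincide. If the two unobserved dogs must be distinct, random.sample is a drop-in replacement inside the loop (assuming each group still has at least n_sample unobserved dogs):

# sample without replacement; sorted() avoids sampling from a set directly
unseen = sorted(set(dogs) - set(d['dogID']))
sample = pd.DataFrame([(h_id, dogID, month)
                       for dogID in random.sample(unseen, k=n_sample)])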
If I understand you correctly, you can use np.random.permutation() on the dogID column to generate a random permutation of the column:
import numpy as np

df_new = df.copy()
df_new['dogID'] = np.random.permutation(df.dogID)
print(df_new.sort_values('month'))
humanID dogID month
0 1 1 1
1 1 20 1
4 2 9 1
7 2 1 1
2 2 4 2
5 2 5 2
8 2 2 2
9 2 7 2
3 2 7 3
6 2 6 3
Or, to draw a random sample of IDs within the range of dogID:
df_new = df.copy()
# max() + 1 so the largest dogID can also be drawn
a = np.random.permutation(range(df_new.dogID.min(), df_new.dogID.max() + 1))
df_new['dogID'] = np.random.choice(a, df_new.shape[0])
print(df_new.sort_values('month'))
humanID dogID month
0 1 18 1
1 1 16 1
4 2 1 1
7 2 8 1
2 2 4 2
5 2 2 2
8 2 16 2
9 2 14 2
3 2 4 3
6 2 12 3

python/pandas - converting column headers into index

The data I have to deal with treats the hours as columns, and I want to convert them into an index. A sample looks like this:
year month day 1 2 3 4 5 ... 24
2015 1 1 a b ................... c
2015 1 2 d e ................... f
2015 1 3 g h ................... i
I want to make the output file something like this:
year month day hour value
2015 1 1 1 a
2015 1 1 2 b
. . . . .
2015 1 1 24 c
2015 1 2 1 d
. . . . .
Currently using Python 3.4 with the pandas module.
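For reference, both answers below can be reproduced with a reduced frame that keeps only three of the 24 hour columns (a sketch; the elided hours are omitted):

import pandas as pd

df = pd.DataFrame({'year': [2015, 2015, 2015], 'month': [1, 1, 1],
                   'day': [1, 2, 3],
                   1: ['a', 'd', 'g'], 2: ['b', 'e', 'h'], 24: ['c', 'f', 'i']})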
Use set_index with stack:
print (df.set_index(['year','month','day'])
         .stack()
         .reset_index(name='value')
         .rename(columns={'level_3':'hour'}))
year month day hour value
0 2015 1 1 1 a
1 2015 1 1 2 b
2 2015 1 1 24 c
3 2015 1 2 1 d
4 2015 1 2 2 e
5 2015 1 2 24 f
6 2015 1 3 1 g
7 2015 1 3 2 h
8 2015 1 3 24 i
Another solution with melt and sort_values:
print (pd.melt(df, id_vars=['year','month','day'], var_name='hour')
         .sort_values(['year', 'month', 'day','hour']))
year month day hour value
0 2015 1 1 1 a
3 2015 1 1 2 b
6 2015 1 1 24 c
1 2015 1 2 1 d
4 2015 1 2 2 e
7 2015 1 2 24 f
2 2015 1 3 1 g
5 2015 1 3 2 h
8 2015 1 3 24 i
