Arrange the dataframe in increasing or decreasing order based on specific group/id in Python - python-3.x

I have a dataframe given as such:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
               'B', 'B', 'B', 'B', 'B', 'B',
               'C', 'C', 'C', 'C', 'C', 'C',
               'D', 'D', 'D', 'D',
               'E', 'E', 'E', 'E', 'E', 'E', 'E', 'E', 'E'],
        'cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                  1, 2, 3, 4, 5, 6,
                  1, 2, 3, 4, 5, 6,
                  1, 2, 3, 4,
                  1, 2, 3, 4, 5, 6, 7, 8, 9],
        'Salary': [7, 7, 7, 8, 9, 10, 11, 12, 13, 14, 15,
                   4, 4, 4, 4, 5, 6,
                   8, 9, 10, 11, 12, 13,
                   8, 9, 10, 11,
                   7, 7, 9, 10, 11, 12, 13, 14, 15],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No',
                     'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
                     'No', 'Yes', 'Yes', 'No', 'No', 'Yes',
                     'Yes', 'No', 'Yes', 'Yes',
                     'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
        'Days': [123, 128, 66, 66, 120, 141, 52, 96, 120, 141, 52,
                 96, 120, 120, 141, 52, 96,
                 15, 123, 128, 66, 120, 141,
                 141, 123, 128, 66,
                 123, 128, 66, 123, 128, 66, 120, 141, 52],
        }
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The above dataframe looks as such:
Here,
id 'A' has 11 cycles
id 'B' has 6 cycles
id 'C' has 6 cycles
id 'D' has 4 cycles
id 'E' has 9 cycles
I need to regroup the dataframe based on the following two cases:
Case 1: Increasing order of the cycle
The dataframe needs to be arranged in increasing order of the cycle count,
i.e. D (4 cycles) comes first, then B (6 cycles), C (6 cycles), E (9 cycles), A (11 cycles).
Case 2: Decreasing order of the cycle
The dataframe needs to be arranged in decreasing order of the cycle count,
i.e. A (11 cycles) comes first, then E (9 cycles), B (6 cycles), C (6 cycles), D (4 cycles).
In both cases, ids 'B' and 'C' have 6 cycles, so it is immaterial which of the two comes first.
Also, the index numbers don't change between the original and regrouped dataframes.
Can somebody please let me know how to achieve this task in Python?

Use groupby.transform('size') as the sorting value:
Either using a temporary column:
(df.assign(size=df.groupby('id')['cycle'].transform('size'))
   .sort_values(by=['size', 'id'], kind='stable',
                # ascending=False  # uncomment for descending order
                )
   .drop(columns='size')
)
Or, passing it as key to sort_values:
df.sort_values(by='id',
               key=lambda x: df.groupby(x)['cycle'].transform('size'),
               kind='stable')
Output:
id cycle Salary Children Days
23 D 1 8 Yes 141
24 D 2 9 No 123
25 D 3 10 Yes 128
26 D 4 11 Yes 66
11 B 1 4 Yes 96
12 B 2 4 Yes 120
13 B 3 4 No 120
14 B 4 4 Yes 141
15 B 5 5 Yes 52
16 B 6 6 Yes 96
17 C 1 8 No 15
18 C 2 9 Yes 123
19 C 3 10 Yes 128
20 C 4 11 No 66
21 C 5 12 No 120
22 C 6 13 Yes 141
27 E 1 7 No 123
28 E 2 7 Yes 128
29 E 3 9 No 66
30 E 4 10 No 123
31 E 5 11 Yes 128
32 E 6 12 Yes 66
33 E 7 13 Yes 120
34 E 8 14 Yes 141
35 E 9 15 No 52
0 A 1 7 No 123
1 A 2 7 Yes 128
2 A 3 7 Yes 66
3 A 4 8 Yes 66
4 A 5 9 Yes 120
5 A 6 10 No 141
6 A 7 11 No 52
7 A 8 12 Yes 96
8 A 9 13 Yes 120
9 A 10 14 Yes 141
10 A 11 15 No 52
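As a quick, self-contained check of the key-based variant (on toy data, not the asker's frame):

```python
import pandas as pd

# Toy frame: group 'B' has 1 row, group 'A' has 2 rows
df = pd.DataFrame({'id': ['A', 'A', 'B'], 'cycle': [1, 2, 1]})

# Sort rows by the size of their id group; kind='stable' keeps
# the original row order (and original index) within each group.
out = df.sort_values(by='id',
                     key=lambda s: df.groupby(s)['cycle'].transform('size'),
                     kind='stable')
print(out['id'].tolist())  # ['B', 'A', 'A'] -- smallest group first
```

Note that the original index labels travel with the rows, which matches the requirement that index numbers stay unchanged after regrouping.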

col1 = df.groupby('id').cycle.transform('max')
case1 = df.assign(col1=col1).sort_values(['col1', 'id'])
case1
case2 = df.assign(col1=col1).sort_values(['col1', 'id'], ascending=False)
case2
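One caveat on this variant: transform('max') only matches the group size here because cycle is numbered 1..n within each id; transform('size') is the more general choice. A minimal illustration:

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'B'], 'cycle': [1, 2, 3, 1]})

# With cycle numbered 1..n per id, the per-group max equals the
# group size, so both transforms produce the same sort key.
by_max = df.groupby('id')['cycle'].transform('max')
by_size = df.groupby('id')['cycle'].transform('size')
print(by_max.tolist() == by_size.tolist())  # True, only because cycle is 1..n
```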

Related

Truncate and re-number a column that corresponds to a specific id/group by using Python

I have a dataset given as such in Python:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        'runs': [6, 6, 6, 6, 6, 6, 7, 8, 9, 10, 3, 3, 3, 4, 5, 6, 5, 5, 5, 5, 5, 6, 7, 8],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
        'Days': [123, 128, 66, 120, 141, 123, 128, 66, 120, 141, 52, 96, 120, 141, 52, 96, 120, 141, 123, 15, 85, 36, 58, 89],
        }
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The above dataframe looks as such:
Here, for every 'id', I wish to drop the rows where 'runs' is repeated and make the numbering continuous within that id.
For example,
For id=1, truncate the 'runs' at 6 and re-number the dataset starting from 1.
For id=2, truncate the 'runs' at 3 and re-number the dataset starting from 1.
For id=3, truncate the 'runs' at 5 and re-number the dataset starting from 1.
The net result needs to look as such:
Can somebody please let me know how to achieve this task in Python?
Filter out the duplicates with loc and duplicated, then renumber with groupby.cumcount:
out = (df[~df.duplicated(subset=['id', 'runs'], keep=False)]
         .assign(runs=lambda d: d.groupby(['id']).cumcount().add(1))
       )
Output:
id runs Children Days
6 1 1 Yes 128
7 1 2 Yes 66
8 1 3 Yes 120
9 1 4 No 141
13 2 1 Yes 141
14 2 2 Yes 52
15 2 3 Yes 96
21 3 1 Yes 36
22 3 2 Yes 58
23 3 3 No 89
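The mechanics of the duplicated filter are easiest to see on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1], 'runs': [6, 6, 7, 8]})

# keep=False marks every member of a duplicated (id, runs) group,
# so ~mask removes all copies; cumcount then renumbers per id from 1.
out = (df[~df.duplicated(subset=['id', 'runs'], keep=False)]
         .assign(runs=lambda d: d.groupby('id').cumcount().add(1)))
print(out['runs'].tolist())  # [1, 2]: only the unique runs 7 and 8 survive
```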
You can create a loop that goes through each id and its run cutoff value. On each iteration, select the new segment of the dataframe by the id and run values of the original dataframe, renumber it, and append it to the final dataframe.
df_truncated = pd.DataFrame(columns=df.columns)
for id_, run_cutoff in zip([1, 2, 3], [6, 3, 5]):
    df_chunk = df[(df['id'] == id_) & (df['runs'] > run_cutoff)].copy()
    df_chunk['runs'] = range(1, len(df_chunk) + 1)
    df_truncated = pd.concat([df_truncated, df_chunk])
Result:
id runs Children Days
6 1 1 Yes 128
7 1 2 Yes 66
8 1 3 Yes 120
9 1 4 No 141
13 2 1 Yes 141
14 2 2 Yes 52
15 2 3 Yes 96
21 3 1 Yes 36
22 3 2 Yes 58
23 3 3 No 89
def function1(dd: pd.DataFrame):
    dd1 = dd.drop_duplicates(subset='runs', keep=False)
    return dd1.assign(runs=dd1.runs.rank().astype(int))

df.groupby('id').apply(function1).reset_index(drop=True)
out:
id runs Children Days
0 1 1 Yes 128
1 1 2 Yes 66
2 1 3 Yes 120
3 1 4 No 141
4 2 1 Yes 141
5 2 2 Yes 52
6 2 3 Yes 96
7 3 1 Yes 36
8 3 2 Yes 58
9 3 3 No 89
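Here rank() does the renumbering: it maps each surviving value to its 1-based position in sorted order, preserving the relative order of the kept rows. A minimal sketch:

```python
import pandas as pd

s = pd.Series([7, 10, 8])
# rank() assigns each value its 1-based position in sorted order:
# 7 -> 1, 10 -> 3, 8 -> 2
print(s.rank().astype(int).tolist())  # [1, 3, 2]
```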

Removing rows from DataFrame based on different conditions applied to subset of a data

Here is the Dataframe I am working with:
You can create it using the snippet:
my_dict = {'id': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 3, 3],
           'category': ['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
           'value': [1, 12, 34, 12, 12, 34, 12, 35, 34, 45, 65, 55, 34, 25]
           }
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on the condition: for category a, the count of values should be 2 and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for individual categories and start removing id's manually by using the code:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But, I don't want to do it manually since there are multiple categories. Is there a better way of doing it by considering all the categories at once and remove id's based on the condition listed in a list/dictionary?
If you need to filter the MultiIndex Series s by the dictionary, use Index.get_level_values with Series.map and keep the groups whose counts match via boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print (s[s.eq(s.index.get_level_values(0).map(d))])
category id
a 2 2
3 2
b 1 3
2 3
Name: value, dtype: int64
If you need to filter the original DataFrame:
s = x.groupby(['category','id'])['value'].transform('count')
print (s)
0 3
1 2
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 3
10 1
11 3
12 2
13 2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print (x[s.eq(x['category'].map(d))])
id category value
1 2 a 12
2 1 b 34
3 2 b 12
4 1 b 12
5 2 b 34
7 2 a 35
8 1 b 34
9 2 b 45
12 3 a 34
13 3 a 25
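The same transform-and-map pattern, shown on a small made-up frame:

```python
import pandas as pd

x = pd.DataFrame({'id':       [1, 1, 3, 2, 2, 2],
                  'category': ['a', 'a', 'a', 'b', 'b', 'b'],
                  'value':    [10, 20, 30, 40, 50, 60]})
d = {'a': 2, 'b': 3}

# Row-aligned counts per (category, id) group...
s = x.groupby(['category', 'id'])['value'].transform('count')
# ...kept only where the count equals the requirement for that category.
out = x[s.eq(x['category'].map(d))]
print(sorted(out['id'].unique().tolist()))  # [1, 2]: id 3 has only one 'a' row
```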

Range mapping in Python

MAPPER DATAFRAME
col_data = {'p0_tsize_qbin_': [1, 2, 3, 4, 5],
            'p0_tsize_min': [0.0, 7.0499999999999545, 16.149999999999977, 32.65000000000009, 76.79999999999973],
            'p0_tsize_max': [7.0, 16.100000000000023, 32.64999999999998, 76.75, 6759.850000000006]}
map_df = pd.DataFrame(col_data, columns=['p0_tsize_qbin_', 'p0_tsize_min', 'p0_tsize_max'])
map_df
In the above dataframe map_df, columns 2 and 3 (p0_tsize_min, p0_tsize_max) define the range, and column 1 (p0_tsize_qbin_) is the value to map into the new dataframe.
MAIN DATAFRAME
raw_data = {'id': ['1', '2', '2', '3', '3', '1', '2', '2', '3', '3', '1', '2', '2', '3', '3'],
            'val': [3, 56, 78, 11, 5000, 37, 756, 78, 49, 21, 9, 4, 14, 75, 31]}
df = pd.DataFrame(raw_data, columns=['id', 'val', 'p0_tsize_qbin_mapped'])
df
EXPECTED OUTPUT (marked in blue in the screenshot):
Look up each val of the df dataframe against the p0_tsize_min/p0_tsize_max ranges of map_df; wherever it falls, return the corresponding p0_tsize_qbin_ value.
For example: from the df dataframe, val = 3 lies in the p0_tsize_min to p0_tsize_max range where p0_tsize_qbin_ == 1, so 1 is returned.
Try using pd.cut()
bins = map_df['p0_tsize_min'].tolist() + [map_df['p0_tsize_max'].max()]
labels = map_df['p0_tsize_qbin_'].tolist()
df.assign(p0_tsize_qbin_mapped = pd.cut(df['val'],bins = bins,labels = labels))
or
bins = pd.IntervalIndex.from_arrays(map_df['p0_tsize_min'],map_df['p0_tsize_max'])
map_df.loc[bins.get_indexer(df['val'].tolist()),'p0_tsize_qbin_'].to_numpy()
Output:
id val p0_tsize_qbin_mapped
0 1 3 1
1 2 56 4
2 2 78 5
3 3 11 2
4 3 5000 5
5 1 37 4
6 2 756 5
7 2 78 5
8 3 49 4
9 3 21 3
10 1 9 2
11 2 4 1
12 2 14 2
13 3 75 4
14 3 31 3
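How the IntervalIndex variant resolves a value to a label, shown with hypothetical bins standing in for map_df:

```python
import pandas as pd

# Hypothetical (min, max] ranges and their labels (closed='right' by default)
bins = pd.IntervalIndex.from_arrays([0.0, 10.0, 20.0], [10.0, 20.0, 30.0])
labels = pd.Series([1, 2, 3])

# get_indexer returns, for each value, the position of the interval containing it
pos = bins.get_indexer([3, 15, 25])
print(labels.iloc[pos].tolist())  # [1, 2, 3]
```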

Increment Count column by 1 base on a another column

I have this data frame where I need to create a count column based on my distance column. I grouped the result by the model column. What I anticipate is an increment of 1 on the next count row each time the distance is 100. For example, here is what I have so far, but with no success yet with the increment:
import pandas as pd
df = pd.DataFrame(
    [['A', '34', 3], ['A', '55', 5], ['A', '100', 7], ['A', '0', 1], ['A', '55', 5],
     ['B', '90', 3], ['B', '0', 1], ['B', '1', 3], ['B', '21', 1], ['B', '0', 1],
     ['C', '9', 7], ['C', '100', 4], ['C', '50', 1], ['C', '100', 6], ['C', '22', 4]],
    columns=['Model', 'Distance', 'v1'])
df = df.groupby(['Model']).apply(lambda row: callback(row) if row['Distance'] is not None else callback(row)+1)
print(df)
import numpy as np
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                                      .eq('100').replace(False, np.nan)
                                      .ffill().fillna(0)))
      .reset_index(level=0, drop=True)
)
Result with your code solution
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 6.0
My expected result is:
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 7.0
Take a look at the C group, where there are two distances equal to 100.
Setup:
df = pd.DataFrame({'Model': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B', 10: 'C', 11: 'C', 12: 'C', 13: 'C', 14: 'C'},
                   'Distance': {0: '34', 1: '55', 2: '100', 3: '0', 4: '55', 5: '90', 6: '0', 7: '1', 8: '21', 9: '0', 10: '9', 11: '23', 12: '100', 13: '33', 14: '23'},
                   'v1': {0: 3, 1: 5, 2: 7, 3: 1, 4: 5, 5: 3, 6: 1, 7: 3, 8: 1, 9: 1, 10: 7, 11: 4, 12: 1, 13: 6, 14: 4},
                   'Count': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 1, 6: 2, 7: 3, 8: 4, 9: 5, 10: 1, 11: 2, 12: 3, 13: 4, 14: 5}})
If the logic needs to be applied across the whole Model column, you can shift, compare, and add 1 for eligible rows:
df.loc[df.Distance.shift().eq('100'), 'Count'] += 1
If the logic needs to be applied per Model group, then you can use a groupby:
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Distance.shift().eq('100') + x.Count))
      .reset_index(level=0, drop=True)
)
Based on #StringZ's updates, below is the updated solution:
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                                      .eq('100').replace(False, np.nan)
                                      .ffill().fillna(0)))
      .reset_index(level=0, drop=True)
)
Model Distance v1 Count
0 A 34 3 1.0
1 A 55 5 2.0
2 A 100 7 3.0
3 A 0 1 5.0
4 A 55 5 6.0
5 B 90 3 1.0
6 B 0 1 2.0
7 B 1 3 3.0
8 B 21 1 4.0
9 B 0 1 5.0
10 C 9 7 1.0
11 C 23 4 2.0
12 C 100 1 3.0
13 C 33 6 5.0
14 C 23 4 6.0
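The shift/replace/ffill chain can be checked in isolation on a short toy Distance column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Distance': ['34', '100', '0', '55'],
                   'Count': [1, 2, 3, 4]})

# shift().eq('100') flags the row right after a 100; replacing False
# with NaN and forward-filling turns that one-off flag into a
# persistent +1 applied to every later row.
bump = (df['Distance'].shift().eq('100')
          .replace(False, np.nan).ffill().fillna(0))
df['Count'] = df['Count'] + bump
print(df['Count'].astype(float).tolist())  # [1.0, 2.0, 4.0, 5.0]
```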

Pandas: How to build a column based on another column which is indexed by another one?

I have this dataframe presented below. I tried a solution below, but I am not sure if this is a good solution.
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable var to take the value of column 'var-A', 'var-B', or 'var-C' depending on the region given in column 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try with lookup
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
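Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; the same row-wise pick can be done with plain numpy integer indexing (a sketch using toy data with the already-renamed columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Region': ['A', 'C', 'B'],
                   'A': [2, 6, 4], 'B': [20, 40, 50], 'C': [3, 5, 1]})

# Translate each row's Region into its column position, then pick
# one cell per row from the underlying array.
cols = df.columns.get_indexer(df['Region'])
vals = df.to_numpy()[np.arange(len(df)), cols]
print(vals.tolist())  # [2, 5, 50]
```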
