I have a large multi-indexed DataFrame and I want to fill values for a column based on values in another group of columns. It looks like passing a DataFrame or dictionary to groupby.fillna() forces you to match on the index. How do I fill values using just the groupby values as an index, ignoring the DataFrame's index?
import pandas as pd
import numpy as np
nan = np.nan
def print_df(df):
    # Print the full frame without truncating rows or columns
    with pd.option_context('display.max_rows', None,
                           'display.max_columns', None,
                           ):
        print(df)
d1 = {'I1': {0: 1, 1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2},
'I2': {0: 1, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2},
'I3': {0: 1, 1: 1, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 1, 9: 2},
'A': {0: 1, 1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1, 8: 1, 9: 1},
'B': {0: 3, 1: 2, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1, 9: 3},
'C': {0: 2, 1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 1, 7: 1, 8: 2, 9: 1},
'D': {0: nan, 1: nan, 2: 7.0, 3: nan, 4: nan, 5: 8.0, 6: 8.0, 7: 4.0, 8: 1.0, 9: nan},
'E': {0: nan, 1: nan, 2: 1.0, 3: nan, 4: nan, 5: 1.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: nan}}
df1 = pd.DataFrame(d1)
df1.set_index(["I1","I2","I3"], inplace=True)
d2 = {'A': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2},
'B': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 1, 7: 1, 8: 2, 9: 2, 10: 3, 11: 3},
'C': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2},
'D': {0: 5, 1: 1, 2: 4, 3: 3, 4: 2, 5: 2, 6: 3, 7: 5, 8: 5, 9: 4, 10: 4, 11: 3},
'E': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 5, 10: 5, 11: 5}}
df2 = pd.DataFrame(d2)
df2.set_index(["A","B","C"], inplace=True)
print("dataframe with values to impute")
print_df(df1)
print("lookup values to fill dataframe")
print_df(df2)
# what I expected to work that is instead using the I1 I2 I3 index
df1.groupby(["A","B","C"]).fillna(df2.to_dict())
This is the desired output (** marks the imputed rows):
data frame with values to impute
A B C D E
I1 I2 I3
1 1 1 1 3 2 2.0 5.0 **
2 1 1 2 1 4.0 5.0 **
2 1 1 2 1 2 7.0 1.0
Relevant lookup values I showed above for reference
D E
A B C
1 1 1 5 5
2 1 5
2 1 4 5**
2 3 5
3 1 2 5
2 2 5**
I would first align the two DataFrames with a merge, because they do not share the same index. The merged frame keeps df1's index, so fillna can then match rows on it:
cols = ['A', 'B', 'C']
df1.fillna(df1[cols].merge(df2, left_on=cols, right_index=True, how='left'))
output:
A B C D E
I1 I2 I3
1 1 1 1 3 2 2.0 5.0
2 1 1 2 1 4.0 5.0
2 1 1 2 1 2 7.0 1.0
2 2 2 1 5.0 5.0
3 2 3 2 3.0 5.0
4 2 3 1 8.0 1.0
5 1 2 1 8.0 1.0
6 1 2 1 4.0 1.0
2 1 1 1 2 1.0 1.0
2 1 3 1 2.0 5.0
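An alternative sketch of the same alignment idea, shown on a three-row subset of the data above: look up each row's (A, B, C) key in df2 with reindex, relabel the result with df1's own index, and pass it to fillna. The names lookup and filled are illustrative, and the sketch assumes the keys in df2 are unique.

```python
import numpy as np
import pandas as pd

# Three rows taken from the question's df1 (index I1, I2, I3).
df1 = pd.DataFrame(
    {"A": [1, 1, 2], "B": [3, 2, 1], "C": [2, 1, 2],
     "D": [np.nan, np.nan, 7.0], "E": [np.nan, np.nan, 1.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1, 1), (1, 2, 1), (2, 1, 1)], names=["I1", "I2", "I3"]),
)

# The matching lookup rows from the question's df2 (index A, B, C).
df2 = pd.DataFrame(
    {"A": [1, 1, 2], "B": [3, 2, 1], "C": [2, 1, 2],
     "D": [2, 4, 3], "E": [5, 5, 5]}
).set_index(["A", "B", "C"])

cols = ["A", "B", "C"]

# One lookup row per row of df1, selected by that row's (A, B, C) key.
lookup = df2.reindex(pd.MultiIndex.from_frame(df1[cols]))

# Relabel with df1's index so that fillna aligns row by row.
lookup.index = df1.index

filled = df1.fillna(lookup)
print(filled)
```

Only the missing cells are touched: the third row keeps its original D and E values.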
I have a DataFrame like this, which has about 40 unique values in the key column, some of which are integers. I do not want anything aggregated, if possible:
hash ft timestamp id key value
1ab 1 2022-01-02 12:21:11 1 aaa 121289
1ab 1 2022-01-02 12:21:11 1 bbb 13ADF
1ab 1 2022-01-02 12:21:11 2 aaa 13ADH
1ab 1 2022-01-02 12:21:11 2 bbb 13ADH
2ab 2 2022-01-02 12:21:11 3 aaa 121382
2ab 2 2022-01-02 12:21:11 3 bbb 121381
2ab 2 2022-01-02 12:21:11 3 ccc 121389
I am trying to pivot the data on only two columns, key and value, while keeping the remaining columns and the index unchanged.
When I run the code below, the column names become grouped tuples built from the id, ft, and value columns. One of the actual column names, including the parentheses, is ('id', '1', '121289'). I am also forced to select an index, which I don't want to do.
Code:
df.pivot_table(index='hash',columns=['ft','key'])
I am not sure what I am doing wrong, such that I can't use the value column for values. I get an empty DataFrame:
df.pivot_table(index='hash',columns=['ft','key'], values='value')
A possible solution, using pandas.DataFrame.pivot:
(df.pivot(index=['hash', 'ft', 'id', 'timestamp'],
columns=['key'], values='value')
.reset_index().rename_axis(None, axis=1))
Output:
hash ft id timestamp aaa bbb ccc
0 1ab 1 1 2022-01-02 12:21:11 121289 13ADF NaN
1 1ab 1 2 2022-01-02 12:21:11 13ADH 13ADH NaN
2 2ab 2 3 2022-01-02 12:21:11 121382 121381 121389
Data:
df = pd.DataFrame.from_dict(
{'hash': {0: '1ab',
1: '1ab',
2: '1ab',
3: '1ab',
4: '2ab',
5: '2ab',
6: '2ab'},
'ft': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2},
'timestamp': {0: '2022-01-02 12:21:11',
1: '2022-01-02 12:21:11',
2: '2022-01-02 12:21:11',
3: '2022-01-02 12:21:11',
4: '2022-01-02 12:21:11',
5: '2022-01-02 12:21:11',
6: '2022-01-02 12:21:11'},
'id': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3},
'key': {0: 'aaa', 1: 'bbb', 2: 'aaa', 3: 'bbb', 4: 'aaa', 5: 'bbb', 6: 'ccc'},
'value': {0: '121289',
1: '13ADF',
2: '13ADH',
3: '13ADH',
4: '121382',
5: '121381',
6: '121389'}}
)
EDIT
To work around an error that the OP reported in a comment, the OP suggests the following solution:
(pd.pivot_table(df,index=['hash','ft', 'id'] ,
columns = ['key'] ,
values = "value", aggfunc='sum')
.reset_index().rename_axis(None, axis=1))
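The error itself is not quoted here. Assuming it came from duplicate (hash, ft, id, key) combinations, which make DataFrame.pivot raise a ValueError, the sketch below reproduces that on a small made-up frame (df_dup is hypothetical, not the OP's data) and shows pivot_table with aggfunc='first' as one possible variant that keeps the first value per cell rather than summing.

```python
import pandas as pd

# Hypothetical data: the last row repeats the (hash, ft, id, key)
# combination of the first row.
df_dup = pd.DataFrame({
    "hash": ["1ab", "1ab", "1ab"],
    "ft": [1, 1, 1],
    "id": [1, 1, 1],
    "key": ["aaa", "bbb", "aaa"],
    "value": ["121289", "13ADF", "13ADH"],
})

# pivot cannot reshape when an index/column pair occurs more than once.
try:
    df_dup.pivot(index=["hash", "ft", "id"], columns="key", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves the duplicates through an aggregation function;
# 'first' keeps the first value encountered for each cell.
wide = (
    pd.pivot_table(df_dup, index=["hash", "ft", "id"], columns="key",
                   values="value", aggfunc="first")
    .reset_index()
    .rename_axis(None, axis=1)
)
print(pivot_failed)
print(wide)
```

Whether 'first', 'last', or another aggregation is appropriate depends on which of the duplicate values should win.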
I am trying to convert the DataFrame below to a dictionary.
Dataframe:
import pandas as pd
df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[4,3,5,5,5,3], 'd':[3,4,5,5,7,8]})
print(df)
Sample Dataframe:
a b c d
0 A 1 4 3
1 A 2 3 4
2 B 5 5 5
3 B 5 5 5
4 B 4 5 7
5 C 6 3 8
I need this DataFrame in the following format (a list of dictionaries):
[{"a":"A","data_values":[{"b":1,"c":4,"d":3},{"b":2,"c":3,"d":4}]},
{"a":"B","data_values":[{"b":5,"c":5,"d":5},{"b":5,"c":5,"d":5},
{"b":4,"c":5,"d":7}]},{"a":"C","data_values":[{"b":6,"c":3,"d":8}]}]
Use DataFrame.groupby with a custom lambda function that converts each group's rows to dictionaries with DataFrame.to_dict:
L = (df.set_index('a')
.groupby('a')
.apply(lambda x: x.to_dict('records'))
.reset_index(name='data_values')
.to_dict('records')
)
print (L)
[{'a': 'A', 'data_values': [{'b': 1, 'c': 4, 'd': 3},
{'b': 2, 'c': 3, 'd': 4}]},
{'a': 'B', 'data_values': [{'b': 5, 'c': 5, 'd': 5},
{'b': 5, 'c': 5, 'd': 5},
{'b': 4, 'c': 5, 'd': 7}]},
{'a': 'C', 'data_values': [{'b': 6, 'c': 3, 'd': 8}]}]
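The same structure can also be built by iterating over the groups directly. This is a minimal sketch of that alternative, not the method used above; the name records is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [4, 3, 5, 5, 5, 3],
                   'd': [3, 4, 5, 5, 7, 8]})

# Iterating over a groupby yields (key, sub-frame) pairs. Dropping the
# grouping column leaves only b, c, d in each record.
records = [
    {"a": key, "data_values": group.drop(columns="a").to_dict("records")}
    for key, group in df.groupby("a")
]
print(records)
```

Groups come out sorted by key, which matches the order of the desired output here.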
I have this DataFrame, where I need to create a count column based on my Distance column. I grouped the result by the Model column. What I expect is that, each time the distance is 100, the count on the next row is incremented by 1. Here is what I have so far, but I have had no success with the increment yet:
import pandas as pd
df = pd.DataFrame(
[['A', '34', 3], ['A', '55', 5], ['A', '100', 7], ['A', '0', 1],['A', '55', 5],
['B', '90', 3], ['B', '0', 1], ['B', '1', 3], ['B', '21', 1],['B', '0', 1],
['C', '9', 7], ['C', '100', 4], ['C', '50', 1], ['C', '100', 6],['C', '22', 4]],
columns=['Model', 'Distance', 'v1'])
df = df.groupby(['Model']).apply(lambda row: callback(row) if row['Distance'] is not None else callback(row)+1)
print(df)
import numpy as np
(
    df.groupby('Model')
    .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                              .eq('100').replace(False, np.nan)
                              .ffill().fillna(0)
                              )
           )
    .reset_index(level=0, drop=True)
)
Result with your code solution
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 6.0
My expected result is:
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 7.0
Take a look at group C, where two rows have a distance equal to 100.
Setup:
df = pd.DataFrame({'Model': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B', 10: 'C', 11: 'C', 12: 'C', 13: 'C', 14: 'C'},
'Distance': {0: '34', 1: '55', 2: '100', 3: '0', 4: '55', 5: '90', 6: '0', 7: '1', 8: '21', 9: '0', 10: '9', 11: '23', 12: '100', 13: '33', 14: '23'},
'v1': {0: 3, 1: 5, 2: 7, 3: 1, 4: 5, 5: 3, 6: 1, 7: 3, 8: 1, 9: 1, 10: 7, 11: 4, 12: 1, 13: 6, 14: 4},
'Count': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 1, 6: 2, 7: 3, 8: 4, 9: 5, 10: 1, 11: 2, 12: 3, 13: 4, 14: 5}})
If the logic needs to be applied across the whole frame, ignoring the Model groups, you can shift, compare, and add 1 to the eligible rows:
df.loc[df.Distance.shift().eq('100'), 'Count'] += 1
If the logic needs to be applied per Model group, then you can use a groupby:
(
df.groupby('Model')
.apply(lambda x: x.assign(Count=x.Distance.shift().eq('100') + x.Count))
.reset_index(level=0, drop=True)
)
Based on #StringZ's updates, below is the updated solution:
(
df.groupby('Model')
.apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
.eq('100').replace(False, np.nan)
.ffill().fillna(0)
)
)
.reset_index(level=0, drop=True)
)
Model Distance v1 Count
0 A 34 3 1.0
1 A 55 5 2.0
2 A 100 7 3.0
3 A 0 1 5.0
4 A 55 5 6.0
5 B 90 3 1.0
6 B 0 1 2.0
7 B 1 3 3.0
8 B 21 1 4.0
9 B 0 1 5.0
10 C 9 7 1.0
11 C 23 4 2.0
12 C 100 1 3.0
13 C 33 6 5.0
14 C 23 4 6.0
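The asker's expected result increments Count once for every earlier 100 within a group, so the last row of group C reaches 7, whereas a forward-filled flag adds at most 1. A minimal sketch that accumulates the increments with a per-group cumulative sum, using the data from the question; the names base, hit, and bumps are illustrative.

```python
import pandas as pd

# Data from the question; group C contains two rows with Distance '100'.
df = pd.DataFrame(
    [['A', '34', 3], ['A', '55', 5], ['A', '100', 7], ['A', '0', 1], ['A', '55', 5],
     ['B', '90', 3], ['B', '0', 1], ['B', '1', 3], ['B', '21', 1], ['B', '0', 1],
     ['C', '9', 7], ['C', '100', 4], ['C', '50', 1], ['C', '100', 6], ['C', '22', 4]],
    columns=['Model', 'Distance', 'v1'])

grouped = df.groupby('Model')

# Plain 1-based row counter within each Model.
base = grouped.cumcount() + 1

# True on rows whose previous row (within the same Model) has Distance '100'.
hit = grouped['Distance'].shift().eq('100')

# Running number of such hits per Model: each one bumps all later rows by 1.
bumps = hit.astype(int).groupby(df['Model']).cumsum()

df['Count'] = base + bumps
print(df)
```

Distance is compared as a string because the question stores it that way; with numeric data the comparison would be against 100 instead.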
I have the data in the .csv file uploaded in the following link
Click here for the data
In this file, I have the following columns:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
The Team column contains duplicates. Another column, SimStage, holds values from 0 to N (in this case 0 to 4).
I would like to keep one row for each team at each SimStage value and remove the rest. Among the duplicate rows for a given Team and SimStage, the rows with the lower value in the Points column should be removed.
Since it is slightly difficult to explain using words alone, I attached a picture here.
In this picture, the rows highlighted in red boxes will be removed.
I used df.duplicates() but it does not work.
It looks like you want to keep only the row with the highest value in the 'Points' column for each team and stage. To do that, sort the rows and then use the 'first' aggregation function in pandas.
Create the dataframe and call it df
data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil', 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},
'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E', 5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO', 5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
'3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
Method 1
# Sort by 'SimStage' ascending, then by 'Points' descending within each stage.
# A single multi-key sort avoids relying on the second of two sorts being stable.
df = df.sort_values(['SimStage', 'Points'], ascending=[True, False])
# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the original column order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
Method 2
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.
import pandas as pd
namesDf = pd.DataFrame()
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]
Now, for each SimStage, you want the highest Points value. So, I first group the rows by Team and SimStage and then sort each group by Points in descending order.
namesDf = namesDf.groupby(['Team', 'SimStage'], as_index = False).apply(lambda x: x.sort_values(['Points'], ascending = False)).reset_index(drop = True)
This makes my DataFrame look like this. Notice the change in the rows with SimStage 3:
Team SimStage Points
0 Brazil 0 4
1 Brazil 0 4
2 Brazil 1 4
3 Brazil 1 4
4 Brazil 2 4
5 Brazil 2 1
6 Brazil 3 4
7 Brazil 3 2
8 Brazil 4 4
9 Brazil 4 1
And now I remove the duplicates by keeping the first instance of every team and sim stage.
namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')
Final result:
Team SimStage Points
0 Brazil 0 4
2 Brazil 1 4
4 Brazil 2 4
6 Brazil 3 4
8 Brazil 4 4
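Another way to express "keep the row with the most points per team and stage", sketched here as an alternative to sorting, is to ask each group for the index label of its maximum and select those rows. The name best_rows is illustrative; on ties, idxmax returns the first occurrence.

```python
import pandas as pd

# Same mini-dataset as above.
df = pd.DataFrame({
    'Team': ['Brazil'] * 10,
    'SimStage': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'Points': [4, 4, 4, 4, 4, 1, 2, 4, 4, 1],
})

# Index label of the highest-Points row within each (Team, SimStage) group.
best_idx = df.groupby(['Team', 'SimStage'])['Points'].idxmax()

# Select those rows; all original columns are kept as they are.
best_rows = df.loc[best_idx]
print(best_rows)
```

This keeps the original row labels, so the selected rows can be traced back to the source data.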