How to create a dictionary for each of 10,000 dataframe records and then access each dictionary to make some calculations? - python-3.x

I have a pandas dataframe that has 10,000 records. The dataframe consists of 0s and 1s and looks like this:
C1 C2 C3 C4
0 0 1 1
0 1 0 0
1 0 1 1
My aim is to turn each cell of a record into a dictionary, assigning a fixed value per column (every cell in a column gets the same value):
C1 C2 C3 C4
{0: 10} {0: 11} {1: 15} {1: 13}
{0: 10} {1: 11} {0: 15} {0: 13}
{1: 10} {0: 11} {1: 15} {1: 13}
and then access the dictionaries and do some calculations row by row, comparing the total for 0 against the total for 1. The number with the higher total will be added to a new column. For example, for the first row 0 = 21 and 1 = 28, therefore 1 will be added to the new column. For the second row 0 = 38 and 1 = 11, therefore 0 will be added:
C1 C2 C3 C4 New_column
{0: 10} {0: 11} {1: 15} {1: 13} 1
{0: 10} {1: 11} {0: 15} {0: 13} 0
{1: 10} {0: 11} {1: 15} {1: 13} 1

If you don't need the intermediate dicts, you can do some multiplications and sums:
values = [10, 11, 15, 13]
zeros = df.eq(0).mul(values).sum(axis=1)
ones = df.eq(1).mul(values).sum(axis=1)
df['New_column'] = ones.gt(zeros).astype(int)
# C1 C2 C3 C4 New_column
# 0 0 0 1 1 1
# 1 0 1 0 0 0
# 2 1 0 1 1 1
And if you do want the dicts, I would do it backwards and create the dicts after computing New_column:
df[['C1', 'C2', 'C3', 'C4']] = df.filter(like='C').apply(
    lambda column: [{x: values[df.columns.get_loc(column.name)]} for x in column])
# C1 C2 C3 C4 New_column
# 0 {0: 10} {0: 11} {1: 15} {1: 13} 1
# 1 {0: 10} {1: 11} {0: 15} {0: 13} 0
# 2 {1: 10} {0: 11} {1: 15} {1: 13} 1
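If you do keep the dict columns and later need to recompute New_column from them directly, here is a minimal sketch (assuming every cell is a one-item dict like {0: 10}, as above; row_winner is a hypothetical helper name):
def row_winner(row):
    # Sum the values per key (0 or 1) across the row's one-item dicts
    totals = {0: 0, 1: 0}
    for cell in row:
        (key, val), = cell.items()  # unpack the single key/value pair
        totals[key] += val
    # 1 if the 1-total wins, else 0 (ties fall to 0, matching gt above)
    return int(totals[1] > totals[0])

df['New_column'] = df[['C1', 'C2', 'C3', 'C4']].apply(row_winner, axis=1)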

Related

Pandas how to ignore multiindex to use groupby as index when using fillna from a lookup dataframe

I have a large multi-indexed dataframe and I want to fill values for a column based on values in another group of columns. It looks like passing a dataframe or dictionary to groupby.fillna() forces you to match on index. How do I fill values using just the groupby values as an index, ignoring the dataframe's index?
import pandas as pd
import numpy as np
nan = np.nan

def print_df(df):
    with pd.option_context('display.max_rows', None,
                           'display.max_columns', None,
                           ):
        print(df)
d1 = {'I1': {0: 1, 1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2},
      'I2': {0: 1, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2},
      'I3': {0: 1, 1: 1, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 1, 9: 2},
      'A': {0: 1, 1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1, 8: 1, 9: 1},
      'B': {0: 3, 1: 2, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1, 9: 3},
      'C': {0: 2, 1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 1, 7: 1, 8: 2, 9: 1},
      'D': {0: nan, 1: nan, 2: 7.0, 3: nan, 4: nan, 5: 8.0, 6: 8.0, 7: 4.0, 8: 1.0, 9: nan},
      'E': {0: nan, 1: nan, 2: 1.0, 3: nan, 4: nan, 5: 1.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: nan}}
df1 = pd.DataFrame(d1)
df1.set_index(["I1", "I2", "I3"], inplace=True)
d2 = {'A': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2},
      'B': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 1, 7: 1, 8: 2, 9: 2, 10: 3, 11: 3},
      'C': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2},
      'D': {0: 5, 1: 1, 2: 4, 3: 3, 4: 2, 5: 2, 6: 3, 7: 5, 8: 5, 9: 4, 10: 4, 11: 3},
      'E': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 5, 10: 5, 11: 5}}
df2 = pd.DataFrame(d2)
df2.set_index(["A", "B", "C"], inplace=True)
print("dataframe with values to impute")
print_df(df1)
print("lookup values to fill dataframe")
print_df(df2)
# what I expected to work that is instead using the I1 I2 I3 index
df1.groupby(["A","B","C"]).fillna(df2.to_dict())
This is the desired output (** marks the imputed rows):
dataframe with values to impute
          A  B  C    D    E
I1 I2 I3
1  1  1   1  3  2  2.0  5.0  **
   2  1   1  2  1  4.0  5.0  **
2  1  1   2  1  2  7.0  1.0
Relevant lookup values from df2, shown above for reference:
       D  E
A B C
1 1 1  5  5
    2  1  5
  2 1  4  5  **
    2  3  5
  3 1  2  5
    2  2  5  **
I would first align the two DataFrames with a merge, since they do not share the same index:
cols = ['A', 'B', 'C']
df1.fillna(df1[cols].merge(df2, left_on=cols, right_index=True, how='left'))
output:
A B C D E
I1 I2 I3
1 1 1 1 3 2 2.0 5.0
2 1 1 2 1 4.0 5.0
2 1 1 2 1 2 7.0 1.0
2 2 2 1 5.0 5.0
3 2 3 2 3.0 5.0
4 2 3 1 8.0 1.0
5 1 2 1 8.0 1.0
6 1 2 1 4.0 1.0
2 1 1 1 2 1.0 1.0
2 1 3 1 2.0 5.0
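An alternative sketch without merge, assuming (as in this example) that the (A, B, C) combinations in df2's index are unique: look up each row's triple with reindex and realign to df1's index:
cols = ['A', 'B', 'C']
# Fetch df2's rows in df1's row order via the (A, B, C) triples
lookup = df2.reindex(pd.MultiIndex.from_frame(df1[cols]))
lookup.index = df1.index  # realign to df1's original MultiIndex
filled = df1.fillna(lookup)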

How to maintain column names while pivoting data, with no aggregation and without using an artificial index?

I have a dataframe like this, which has about 40 unique values in the key column, some of which are integers. I do not want anything aggregated if possible:
hash ft timestamp id key value
1ab 1 2022-01-02 12:21:11 1 aaa 121289
1ab 1 2022-01-02 12:21:11 1 bbb 13ADF
1ab 1 2022-01-02 12:21:11 2 aaa 13ADH
1ab 1 2022-01-02 12:21:11 2 bbb 13ADH
2ab 2 2022-01-02 12:21:11 3 aaa 121382
2ab 2 2022-01-02 12:21:11 3 bbb 121381
2ab 2 2022-01-02 12:21:11 3 ccc 121389
I am trying to pivot the data only on the two columns key and value, while keeping the remaining columns and index the same. Example:
When I run the code below, the column names become grouped tuples over the columns id, ft, and value. One of the actual column names, with parentheses: ('id', '1', '121289'). I am also forced to select an index, which I don't want to do.
Code:
df.pivot_table(index='hash',columns=['ft','key'])
I am not sure what I am doing wrong that I can't use the value column for values. I get an empty dataframe:
df.pivot_table(index='hash',columns=['ft','key'], values='value')
A possible solution, using pandas.DataFrame.pivot, which does not aggregate (the empty dataframe above is expected with pivot_table: its default aggfunc, mean, silently drops the non-numeric value column):
(df.pivot(index=['hash', 'ft', 'id', 'timestamp'],
          columns=['key'], values='value')
   .reset_index().rename_axis(None, axis=1))
Output:
hash ft id timestamp aaa bbb ccc
0 1ab 1 1 2022-01-02 12:21:11 121289 13ADF NaN
1 1ab 1 2 2022-01-02 12:21:11 13ADH 13ADH NaN
2 2ab 2 3 2022-01-02 12:21:11 121382 121381 121389
Data:
df = pd.DataFrame.from_dict(
    {'hash': {0: '1ab', 1: '1ab', 2: '1ab', 3: '1ab', 4: '2ab', 5: '2ab', 6: '2ab'},
     'ft': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2},
     'timestamp': {0: '2022-01-02 12:21:11', 1: '2022-01-02 12:21:11',
                   2: '2022-01-02 12:21:11', 3: '2022-01-02 12:21:11',
                   4: '2022-01-02 12:21:11', 5: '2022-01-02 12:21:11',
                   6: '2022-01-02 12:21:11'},
     'id': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3},
     'key': {0: 'aaa', 1: 'bbb', 2: 'aaa', 3: 'bbb', 4: 'aaa', 5: 'bbb', 6: 'ccc'},
     'value': {0: '121289', 1: '13ADF', 2: '13ADH', 3: '13ADH',
               4: '121382', 5: '121381', 6: '121389'}}
)
EDIT
To overcome an error the OP reported in a comment, the OP himself suggests the following solution:
(pd.pivot_table(df, index=['hash', 'ft', 'id'],
                columns=['key'],
                values='value', aggfunc='sum')
   .reset_index().rename_axis(None, axis=1))
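For context, DataFrame.pivot raises "Index contains duplicate entries" when an index/columns combination repeats, which is presumably the error the aggfunc works around. A quick sketch to check before choosing between pivot and pivot_table:
# True if any (hash, ft, id, key) combination repeats, in which case
# pivot would raise and pivot_table with an aggfunc is needed
print(df.duplicated(subset=['hash', 'ft', 'id', 'key']).any())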

Conversion of dataframe to required dictionary format

I am trying to convert the below data frame to a dictionary
Dataframe:
import pandas as pd
df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[4,3,5,5,5,3], 'd':[3,4,5,5,7,8]})
print(df)
Sample Dataframe:
a b c d
0 A 1 4 3
1 A 2 3 4
2 B 5 5 5
3 B 5 5 5
4 B 4 5 7
5 C 6 3 8
I need this data frame in the below-mentioned dictionary format:
[{"a":"A","data_values":[{"b":1,"c":4,"d":3},{"b":2,"c":3,"d":4}]},
{"a":"B","data_values":[{"b":5,"c":5,"d":5},{"b":5,"c":5,"d":5},
{"b":4,"c":5,"d":7}]},{"a":"C","data_values":[{"b":6,"c":3,"d":8}]}]
Use DataFrame.groupby with a custom lambda function to convert the values to dictionaries via DataFrame.to_dict:
L = (df.set_index('a')
       .groupby('a')
       .apply(lambda x: x.to_dict('records'))
       .reset_index(name='data_values')
       .to_dict('records')
     )
print(L)
[{'a': 'A', 'data_values': [{'b': 1, 'c': 4, 'd': 3},
{'b': 2, 'c': 3, 'd': 4}]},
{'a': 'B', 'data_values': [{'b': 5, 'c': 5, 'd': 5},
{'b': 5, 'c': 5, 'd': 5},
{'b': 4, 'c': 5, 'd': 7}]},
{'a': 'C', 'data_values': [{'b': 6, 'c': 3, 'd': 8}]}]
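An equivalent sketch using a plain list comprehension over groupby, which avoids the set_index/apply round trip (same df as above):
# groupby yields (key, subframe) pairs; drop 'a' so it appears only once
L = [{'a': key, 'data_values': group.drop(columns='a').to_dict('records')}
     for key, group in df.groupby('a')]
print(L)  # same structure as the groupby/apply result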

Increment Count column by 1 based on another column

I have this data frame where I need to create a count column based on my distance column. I grouped the result by the model column. What I anticipate is that the next count row increments by 1 each time the distance is 100. Here is what I have so far, but no success yet with the increment:
import pandas as pd

df = pd.DataFrame(
    [['A', '34', 3], ['A', '55', 5], ['A', '100', 7], ['A', '0', 1], ['A', '55', 5],
     ['B', '90', 3], ['B', '0', 1], ['B', '1', 3], ['B', '21', 1], ['B', '0', 1],
     ['C', '9', 7], ['C', '100', 4], ['C', '50', 1], ['C', '100', 6], ['C', '22', 4]],
    columns=['Model', 'Distance', 'v1'])
# callback is not defined anywhere, so this attempt does not run
df = df.groupby(['Model']).apply(lambda row: callback(row) if row['Distance'] is not None else callback(row) + 1)
print(df)
import numpy as np

(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                                        .eq('100').replace(False, np.nan)
                                        .ffill().fillna(0)
                                )
             )
      .reset_index(level=0, drop=True)
)
Result with your code solution
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 6.0
My expected result is:
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 7.0
Take a look at the C group, where there are two distances equal to 100.
Setup:
df = pd.DataFrame({
    'Model': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'B', 8: 'B',
              9: 'B', 10: 'C', 11: 'C', 12: 'C', 13: 'C', 14: 'C'},
    'Distance': {0: '34', 1: '55', 2: '100', 3: '0', 4: '55', 5: '90', 6: '0', 7: '1',
                 8: '21', 9: '0', 10: '9', 11: '23', 12: '100', 13: '33', 14: '23'},
    'v1': {0: 3, 1: 5, 2: 7, 3: 1, 4: 5, 5: 3, 6: 1, 7: 3, 8: 1, 9: 1, 10: 7, 11: 4,
           12: 1, 13: 6, 14: 4},
    'Count': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 1, 6: 2, 7: 3, 8: 4, 9: 5, 10: 1, 11: 2,
              12: 3, 13: 4, 14: 5}})
If the logic needs to be applied across the whole Model column, you can use a shift, compare, and add 1 for eligible rows:
df.loc[df.Distance.shift().eq('100'), 'Count'] += 1
If the logic needs to be applied per Model group, then you can use a groupby:
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Distance.shift().eq('100') + x.Count))
      .reset_index(level=0, drop=True)
)
Based on #StringZ's updates, below is the updated solution:
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                                        .eq('100').replace(False, np.nan)
                                        .ffill().fillna(0)
                                )
             )
      .reset_index(level=0, drop=True)
)
Model Distance v1 Count
0 A 34 3 1.0
1 A 55 5 2.0
2 A 100 7 3.0
3 A 0 1 5.0
4 A 55 5 6.0
5 B 90 3 1.0
6 B 0 1 2.0
7 B 1 3 3.0
8 B 21 1 4.0
9 B 0 1 5.0
10 C 9 7 1.0
11 C 23 4 2.0
12 C 100 1 3.0
13 C 33 6 5.0
14 C 23 4 6.0
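Note that the ffill-based version only ever adds 1, so after a second 100 in a group (see the expected C rows 4.0, 5.0, 7.0) the count falls one short. If the intended rule is one extra increment per preceding 100 in the group (an assumption based on the expected output), a cumsum sketch:
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count +
                                x.Distance.shift().eq('100').cumsum()))
      .reset_index(level=0, drop=True)
)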

using duplicate values from one column to remove entire rows in a pandas dataframe

I have the data in a .csv file with the following columns:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
There will be duplicates in the Team column. Another column is SimStage, which contains values from 0 to N (in this case 0 to 4).
I would like to keep one row for each team at each SimStage value (the rest will be removed). When removing, the duplicate row with the lower value in the Points column will be removed for each Team and SimStage.
Since it is slightly difficult to explain using words alone, I attached a picture; the rows highlighted in red boxes will be removed.
I used df.duplicates() but it does not work.
It looks like you want to keep only the highest value from the 'Points' column. Therefore, use the 'first' aggregation function in pandas.
Create the dataframe and call it df
import pandas as pd

data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil',
                 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},
        'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E',
                  5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
        'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO',
                  5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
        'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
        'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
        'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
        'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
        '3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
        '4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
Method 1
# Sort by 'SimStage' ascending and, within each stage, 'Points' descending;
# a single multi-key sort keeps the Points order intact within each SimStage
df = df.sort_values(['SimStage', 'Points'], ascending=[True, False])
# Group by the key columns and take the first (highest-Points) row per group
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the original column order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
Method 2
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.
import pandas as pd
namesDf = pd.DataFrame()
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]
Now, for each Sim Stage, you want the highest Point value. So, I first group them by Team and Sim Stage and then Sort them by Points.
namesDf = (namesDf.groupby(['Team', 'SimStage'], as_index=False)
                  .apply(lambda x: x.sort_values(['Points'], ascending=False))
                  .reset_index(drop=True))
This will make my dataframe look like this; notice the change in the SimStage group with value 3:
Team SimStage Points
0 Brazil 0 4
1 Brazil 0 4
2 Brazil 1 4
3 Brazil 1 4
4 Brazil 2 4
5 Brazil 2 1
6 Brazil 3 4
7 Brazil 3 2
8 Brazil 4 4
9 Brazil 4 1
And now I remove the duplicates by keeping the first instance of every team and sim stage.
namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')
Final result:
Team SimStage Points
0 Brazil 0 4
2 Brazil 1 4
4 Brazil 2 4
6 Brazil 3 4
8 Brazil 4 4
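An equivalent one-liner sketch on the original (pre-deduplication) frame: Series.idxmax returns the index label of the first maximum per group, so selecting those labels keeps the highest-Points row per Team and SimStage:
result = namesDf.loc[namesDf.groupby(['Team', 'SimStage'])['Points'].idxmax()]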
