How to maintain column names while pivoting data, with no aggregation and without an artificial index? - python-3.x

I have a dataframe like this, which has about 40 unique values in the key column, some of them integers. I do not want anything aggregated, if possible:
hash ft timestamp id key value
1ab 1 2022-01-02 12:21:11 1 aaa 121289
1ab 1 2022-01-02 12:21:11 1 bbb 13ADF
1ab 1 2022-01-02 12:21:11 2 aaa 13ADH
1ab 1 2022-01-02 12:21:11 2 bbb 13ADH
2ab 2 2022-01-02 12:21:11 3 aaa 121382
2ab 2 2022-01-02 12:21:11 3 bbb 121381
2ab 2 2022-01-02 12:21:11 3 ccc 121389
I am trying to pivot the data on just the two columns key and value, while keeping the remaining columns and the index the same. Example:
When I run the code below, the column names become grouped tuples spanning the id, ft, and value columns, one of the actual column names being ('id', '1', '121289'), and I am forced to select an index, which I don't want to do.
Code:
df.pivot_table(index='hash',columns=['ft','key'])
I am not sure what I am doing wrong such that I can't use the value column for values; I get an empty dataframe:
df.pivot_table(index='hash',columns=['ft','key'], values='value')

A possible solution, using pandas.DataFrame.pivot:
(df.pivot(index=['hash', 'ft', 'id', 'timestamp'],
          columns=['key'], values='value')
   .reset_index().rename_axis(None, axis=1))
Output:
hash ft id timestamp aaa bbb ccc
0 1ab 1 1 2022-01-02 12:21:11 121289 13ADF NaN
1 1ab 1 2 2022-01-02 12:21:11 13ADH 13ADH NaN
2 2ab 2 3 2022-01-02 12:21:11 121382 121381 121389
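Note that pivot, unlike pivot_table, performs no aggregation at all: it raises ValueError: Index contains duplicate entries, cannot reshape if any (hash, ft, id, timestamp, key) combination repeats. A quick sanity check before pivoting (a minimal sketch, run against the df defined under Data just below):
df.duplicated(['hash', 'ft', 'id', 'timestamp', 'key']).any()  # should be False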
Data:
df = pd.DataFrame.from_dict(
{'hash': {0: '1ab',
1: '1ab',
2: '1ab',
3: '1ab',
4: '2ab',
5: '2ab',
6: '2ab'},
'ft': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2},
'timestamp': {0: '2022-01-02 12:21:11',
1: '2022-01-02 12:21:11',
2: '2022-01-02 12:21:11',
3: '2022-01-02 12:21:11',
4: '2022-01-02 12:21:11',
5: '2022-01-02 12:21:11',
6: '2022-01-02 12:21:11'},
'id': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3},
'key': {0: 'aaa', 1: 'bbb', 2: 'aaa', 3: 'bbb', 4: 'aaa', 5: 'bbb', 6: 'ccc'},
'value': {0: '121289',
1: '13ADF',
2: '13ADH',
3: '13ADH',
4: '121382',
5: '121381',
6: '121389'}}
)
EDIT
To overcome an error reported by the OP in a comment below, the OP himself suggests the following solution:
(pd.pivot_table(df, index=['hash', 'ft', 'id'],
                columns=['key'],
                values='value', aggfunc='sum')
 .reset_index().rename_axis(None, axis=1))
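For context on the empty-dataframe symptom: pivot_table's default aggfunc is 'mean', and in older pandas non-numeric columns such as value were silently dropped before aggregating (newer versions raise a TypeError instead), which explains the empty result. Since each (hash, ft, id) / key combination here holds exactly one value, aggfunc='first' simply passes values through; a sketch, assuming pandas imported as pd and the df defined under Data above:
(pd.pivot_table(df, index=['hash', 'ft', 'id'],
                columns='key', values='value',
                aggfunc='first')  # a pass-through when each group has one row
 .reset_index().rename_axis(None, axis=1))
Compared with aggfunc='sum', which concatenates strings, 'first' will not silently merge values if a duplicate ever slips in.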

Related

Add count of elements within a list for each group in pandas

I have a dataframe such as:
The_list=["A","B","D"]
Groups Values
G1 A
G1 B
G1 C
G1 D
G2 A
G2 B
G2 A
G3 A
G3 D
G4 Z
G4 D
G4 E
G4 C
And I would like to add, for each group, the count of Values elements that are within The_list, putting this number in a New_column.
Here I should then get:
Groups Values New_column
G1 A 3
G1 B 3
G1 C 3
G1 D 3
G2 A 2
G2 B 2
G2 A 2
G3 A 1
G3 D 1
G4 Z 0
G4 D 0
G4 E 0
G4 C 0
Thanks a lot for your help
Here is the table in dict format, if it helps:
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G2', 5: 'G2', 6: 'G2', 7: 'G3', 8: 'G3', 9: 'G4', 10: 'G4', 11: 'G4', 12: 'G4'}, 'Values': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'A', 7: 'A', 8: 'D', 9: 'Z', 10: 'D', 11: 'E', 12: 'C'}}
In your case, do transform after the isin check:
df['new'] = df['Values'].isin(The_list).groupby(df['Groups']).transform('sum')
Out[37]:
0 3
1 3
2 3
3 3
4 3
5 3
6 3
7 2
8 2
9 1
10 1
11 1
12 1
Name: Values, dtype: int64
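For completeness, an equivalent two-step sketch (assuming the same df and The_list): compute the per-group counts once, then map them back onto each row.
# Count, per group, how many Values fall in The_list, then broadcast back
counts = df['Values'].isin(The_list).groupby(df['Groups']).sum()
df['New_column'] = df['Groups'].map(counts)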

How to create dictionaries for 10,000 dataframe records and then access each dictionary record to make some calculations?

I have a pandas dataframe that has 10,000 records. The dataframe consists of 0s and 1s and looks like this:
C1 C2 C3 C4
0 0 1 1
0 1 0 0
1 0 1 1
My aim is to make each cell a dictionary, assigning a value per column (all cells in a column get the same value):
C1 C2 C3 C4
{0: 10} {0: 11} {1: 15} {1: 13}
{0: 10} {1: 11} {0: 15} {0: 13}
{1: 10} {0: 11} {1: 15} {1: 13}
and then access the dictionaries and do some calculation row by row, comparing the total for 0 against the total for 1. The key with the highest total will be added to a new column. For example, in the first row 0 = 21 and 1 = 28, therefore 1 will be added to the new column. In the second row 0 = 38 and 1 = 11, therefore 0 will be added:
C1 C2 C3 C4 New_column
{0: 10} {0: 11} {1: 15} {1: 13} 1
{0: 10} {1: 11} {0: 15} {0: 13} 0
{1: 10} {0: 11} {1: 15} {1: 13} 1
If you don't need the intermediate dicts, you can do some multiplications and sums:
values = [10, 11, 15, 13]
zeros = df.eq(0).mul(values).sum(axis=1)
ones = df.eq(1).mul(values).sum(axis=1)
df['New_column'] = ones.gt(zeros).astype(int)
# C1 C2 C3 C4 New_column
# 0 0 0 1 1 1
# 1 0 1 0 0 0
# 2 1 0 1 1 1
And if you do want the dicts, I would do it backwards and create the dicts after computing New_column:
df[['C1','C2','C3','C4']] = df.filter(like='C').apply(
lambda column: [{x:values[df.columns.get_loc(column.name)]} for x in column])
# C1 C2 C3 C4 New_column
# 0 {0: 10} {0: 11} {1: 15} {1: 13} 1
# 1 {0: 10} {1: 11} {0: 15} {0: 13} 0
# 2 {1: 10} {0: 11} {1: 15} {1: 13} 1
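A note on the second snippet: it relies on df.columns.get_loc to line each C column up with its entry in values, so it assumes New_column sits after C1..C4. A self-contained sketch of the whole flow (the data is just the example above, reconstructed):
import pandas as pd

values = [10, 11, 15, 13]
df = pd.DataFrame({'C1': [0, 0, 1], 'C2': [0, 1, 0],
                   'C3': [1, 0, 1], 'C4': [1, 0, 1]})

# Weight the 0s and 1s by each column's value and total per row
zeros = df.eq(0).mul(values).sum(axis=1)
ones = df.eq(1).mul(values).sum(axis=1)
df['New_column'] = ones.gt(zeros).astype(int)

# Wrap each remaining cell in its {cell: weight} dict afterwards
df[['C1', 'C2', 'C3', 'C4']] = df.filter(like='C').apply(
    lambda col: [{x: values[df.columns.get_loc(col.name)]} for x in col])
print(df)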

Pandas filter a column based on another column

Given this dataframe
df = pd.DataFrame(\
{'name': {0: 'Peter', 1: 'Anna', 2: 'Anna', 3: 'Peter', 4: 'Simon'},
'Status': {0: 'Finished',
1: 'Registered',
2: 'Cancelled',
3: 'Finished',
4: 'Registered'},
'Modified': {0: '2019-03-11',
1: '2019-03-19',
2: '2019-05-22',
3: '2019-10-31',
4: '2019-04-05'}})
How can I compare and filter based on the Status? I want the Modified column to keep the date where the Status is either "Finished" or "Cancelled", and fill in blank where the condition is not met.
Wanted output:
name Status Modified
0 Peter Finished 2019-03-11
1 Anna Registered
2 Anna Cancelled 2019-05-22
3 Peter Finished 2019-10-31
4 Simon Registered
Check with where + isin
df.Modified.where(df.Status.isin(['Finished','Cancelled']),'',inplace=True)
df
Out[68]:
name Status Modified
0 Peter Finished 2019-03-11
1 Anna Registered
2 Anna Cancelled 2019-05-22
3 Peter Finished 2019-10-31
4 Simon Registered
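Note that Series.where's inplace argument is deprecated in recent pandas (and does not write through under copy-on-write); an equivalent assignment form, as a sketch:
df['Modified'] = df['Modified'].where(df['Status'].isin(['Finished', 'Cancelled']), '')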

Using duplicate values from one column to remove entire rows in a pandas dataframe

I have the data in a .csv file uploaded at the following link
Click here for the data
In this file, I have the following columns:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
There will be duplicates in the Team column. Another column is SimStage, which contains values from 0 to N (in this case 0 to 4).
I would like to keep one row per team at each SimStage value (the rest will be removed). When removing, the duplicate row with the lower value in the Points column should be dropped for each Team and SimStage.
Since it is slightly difficult to explain using words alone, I attached a picture here.
In this picture, the rows highlighted in red boxes will be removed.
I used df.duplicates() but it does not work.
It looks like you want to keep only the highest value from the 'Points' column. Therefore, use the 'first' aggregation function in pandas.
Create the dataframe and call it df
data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil', 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},
'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E', 5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO', 5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
'3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
Method 1
# Sort by 'SimStage' ascending and 'Points' descending in one pass, so the
# highest-Points row comes first within each (Team, SimStage) group
df = df.sort_values(['SimStage', 'Points'], ascending=[True, False])
# Group by the key columns and keep the first (highest-Points) row
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the original order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
Method 2
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
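Another common idiom for keeping the max-Points row per group (a sketch against a fresh df rebuilt from the data dict above) picks row labels with idxmax and avoids sorting entirely:
# idxmax returns the index label of the maximum Points within each group
idx = df.groupby(['Team', 'SimStage'])['Points'].idxmax()
df.loc[idx, columns_order]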
I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.
import pandas as pd
namesDf = pd.DataFrame()
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]
Now, for each SimStage, you want the highest Points value. So, I first group by Team and SimStage and then sort by Points.
namesDf = namesDf.groupby(['Team', 'SimStage'], as_index = False).apply(lambda x: x.sort_values(['Points'], ascending = False)).reset_index(drop = True)
This will make my dataframe look like this; notice the change in the rows with SimStage 3:
Team SimStage Points
0 Brazil 0 4
1 Brazil 0 4
2 Brazil 1 4
3 Brazil 1 4
4 Brazil 2 4
5 Brazil 2 1
6 Brazil 3 4
7 Brazil 3 2
8 Brazil 4 4
9 Brazil 4 1
And now I remove the duplicates by keeping the first instance of every team and sim stage.
namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')
Final result:
Team SimStage Points
0 Brazil 0 4
2 Brazil 1 4
4 Brazil 2 4
6 Brazil 3 4
8 Brazil 4 4

Trouble pivoting in pandas (spread in R)

I'm having some issues with the pd.pivot() or pivot_table() functions in pandas.
I have this:
df = pd.DataFrame({'site_id': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c',
                               6: 'a', 7: 'a', 8: 'b', 9: 'b', 10: 'c', 11: 'c'},
                   'dt': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2},
                   'eu': {0: 'FGE', 1: 'WSH', 2: 'FGE', 3: 'WSH', 4: 'FGE', 5: 'WSH', 6: 'FGE', 7: 'WSH', 8: 'FGE', 9: 'WSH', 10: 'FGE', 11: 'WSH'},
                   'kw': {0: '8', 1: '5', 2: '3', 3: '7', 4: '1', 5: '5', 6: '2', 7: '3', 8: '5', 9: '7', 10: '2', 11: '5'}})
df
Out[140]:
dt eu kw site_id
0 1 FGE 8 a
1 1 WSH 5 a
2 1 FGE 3 b
3 1 WSH 7 b
4 1 FGE 1 c
5 1 WSH 5 c
6 2 FGE 2 a
7 2 WSH 3 a
8 2 FGE 5 b
9 2 WSH 7 b
10 2 FGE 2 c
11 2 WSH 5 c
I want this:
dt site_id FGE WSH
1 a 8 5
1 b 3 7
1 c 1 5
2 a 2 3
2 b 5 7
2 c 2 5
I've tried everything!
df.pivot_table(index = ['site_id','dt'], values = 'kw', columns = 'eu')
or
df.pivot(index = ['site_id','dt'], values = 'kw', columns = 'eu')
should have worked. I also tried unstack():
df.set_index(['dt','site_id','eu']).unstack(level = -1)
Your last try (with unstack) works fine for me; I'm not sure why it gave you a problem. FWIW, I think it's more readable to use the index names rather than levels, so I did it like this:
>>> df.set_index(['dt','site_id','eu']).unstack('eu')
kw
eu FGE WSH
dt site_id
1 a 8 5
b 3 7
c 1 5
2 a 2 3
b 5 7
c 2 5
But again, your way looks fine to me and is pretty much the same as what #piRSquared did (except their answer adds some more code to get rid of the multi-index).
I think the problem with pivot is that you can only pass a single variable, not a list? Anyway, this works for me:
>>> df.set_index(['dt','site_id']).pivot(columns='eu')
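(Note: in newer pandas, pivot does accept lists for index — the pivot solution at the top of this page relies on exactly that — so a sketch like the following should also work:)
>>> df.pivot(index=['dt','site_id'], columns='eu', values='kw')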
For pivot_table, the main issue is that 'kw' is an object/character and pivot_table will attempt to aggregate with numpy.mean by default. You probably got the error message: "DataError: No numeric types to aggregate".
But there are a couple of workarounds. First, you could just convert to a numeric type and then use your same pivot_table command:
>>> df['kw'] = df['kw'].astype(int)
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu')
Alternatively you could change the aggregation function:
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu',
                   aggfunc=sum)
That's using the fact that strings can be summed (concatenated) even though you can't take a mean of them. Really, you can use most functions here (including lambdas) that operate on strings.
Note, however, that pivot_table's aggfunc must be a reduction, even though each cell here holds only a single value, so there is nothing real to reduce; the internal check still requires one, so you have to supply it.
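For instance (a sketch, using the df above), a lambda that just returns the lone element per group satisfies that check; the built-in 'first' aggregator does the same:
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu',
                   aggfunc=lambda x: x.iloc[0])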
df.set_index(['dt', 'site_id', 'eu']).kw \
  .unstack().rename_axis(None, axis=1).reset_index()
