Getting values in a pivot table

Here is the pivot table; I want to extract the values marked in the red and green rectangles in my screenshot.
import pandas as pd
import numpy as np
# Create sample dataframes
df_user = pd.DataFrame( {'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2', 4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3', 8: 'a6', 9: 'a7', 10: 'a8'}, 'query': {0: 'orange', 1: 'strawberry', 2: 'pear', 3: 'orange', 4: 'strawberry', 5: 'lemon', 6: 'orange', 7: 'banana', 8: 'meat', 9: 'beer', 10: 'juice'}} )
df_user['count']=1
df_pivot=pd.pivot_table(df_user,index=['query'],columns=['user'],values=['count'],aggfunc=np.sum).fillna(0)
# getting the values in the red rectangle: incorrect result
print(df_pivot.loc['banana':'beer','a1':'a2'])
# getting the values in the green rectangle: raises an error
print(df_pivot.loc[:,'a8'])
What's the right way to get them?

Use:
>> df_pivot.loc[["banana","beer"], ('count',['a1','a2'])]
count
user a1 a2
query
banana 0.0 0.0
beer 0.0 0.0
and
>>> df_pivot.loc[:, ('count', ['a8'])]
            count
user           a8
query
banana        0.0
beer          0.0
juice         1.0
lemon         0.0
meat          0.0
orange        0.0
pear          0.0
strawberry    0.0

Alternatively, drop the first column level first, like so:
df_pivot = df_pivot.droplevel(0, axis=1)
After that, the columns are a plain user index, so df_pivot.loc['banana':'beer', 'a1':'a2'] and df_pivot.loc[:, 'a8'] both work as written in the question.
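A side note worth adding (my observation, not from the answers above): the extra 'count' level only exists because values is passed as a list; passing it as a scalar keeps the columns single-level, and pd.IndexSlice handles the slicing when the MultiIndex is kept. A minimal sketch using the question's sample data:

import pandas as pd

df_user = pd.DataFrame({
    'user':  ['a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a3', 'a3', 'a6', 'a7', 'a8'],
    'query': ['orange', 'strawberry', 'pear', 'orange', 'strawberry',
              'lemon', 'orange', 'banana', 'meat', 'beer', 'juice'],
})
df_user['count'] = 1

# Scalar values= -> single-level columns, so plain label slicing works.
flat = pd.pivot_table(df_user, index='query', columns='user',
                      values='count', aggfunc='sum').fillna(0)
print(flat.loc['banana':'beer', 'a1':'a2'])
print(flat.loc[:, 'a8'])

# List values= -> MultiIndex columns; pd.IndexSlice slices the inner level.
nested = pd.pivot_table(df_user, index='query', columns='user',
                        values=['count'], aggfunc='sum').fillna(0)
print(nested.loc['banana':'beer', pd.IndexSlice['count', 'a1':'a2']])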

Related

How to maintain column names while pivoting data, with no aggregation and without using an artificial index?

I have a dataframe like this, with about 40 unique values in the key column, some of them integers. I do not want anything aggregated, if possible:
hash  ft  timestamp            id  key  value
1ab   1   2022-01-02 12:21:11  1   aaa  121289
1ab   1   2022-01-02 12:21:11  1   bbb  13ADF
1ab   1   2022-01-02 12:21:11  2   aaa  13ADH
1ab   1   2022-01-02 12:21:11  2   bbb  13ADH
2ab   2   2022-01-02 12:21:11  3   aaa  121382
2ab   2   2022-01-02 12:21:11  3   bbb  121381
2ab   2   2022-01-02 12:21:11  3   ccc  121389
I am trying to pivot the data on just the two columns key and value, while keeping the remaining columns and index the same. Example:
When I run the code below, the column names become grouped tuples combining the id, ft, and value columns. One of the actual column names ends up as a tuple like ('id', '1', '121289'), and I am forced to select an index, which I don't want to do.
Code:
df.pivot_table(index='hash',columns=['ft','key'])
I am not sure what I am doing wrong that I can't use the value column for values; I get an empty dataframe:
df.pivot_table(index='hash',columns=['ft','key'], values='value')
A possible solution, using pandas.DataFrame.pivot:
(df.pivot(index=['hash', 'ft', 'id', 'timestamp'],
          columns=['key'], values='value')
   .reset_index().rename_axis(None, axis=1))
Output:
hash ft id timestamp aaa bbb ccc
0 1ab 1 1 2022-01-02 12:21:11 121289 13ADF NaN
1 1ab 1 2 2022-01-02 12:21:11 13ADH 13ADH NaN
2 2ab 2 3 2022-01-02 12:21:11 121382 121381 121389
Data:
df = pd.DataFrame.from_dict(
    {'hash': {0: '1ab', 1: '1ab', 2: '1ab', 3: '1ab', 4: '2ab', 5: '2ab', 6: '2ab'},
     'ft': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2},
     'timestamp': {0: '2022-01-02 12:21:11', 1: '2022-01-02 12:21:11',
                   2: '2022-01-02 12:21:11', 3: '2022-01-02 12:21:11',
                   4: '2022-01-02 12:21:11', 5: '2022-01-02 12:21:11',
                   6: '2022-01-02 12:21:11'},
     'id': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3},
     'key': {0: 'aaa', 1: 'bbb', 2: 'aaa', 3: 'bbb', 4: 'aaa', 5: 'bbb', 6: 'ccc'},
     'value': {0: '121289', 1: '13ADF', 2: '13ADH', 3: '13ADH',
               4: '121382', 5: '121381', 6: '121389'}}
)
EDIT
To work around an error the OP later reported in a comment, the OP suggested the following solution:
(pd.pivot_table(df, index=['hash', 'ft', 'id'],
                columns=['key'],
                values='value', aggfunc='sum')
 .reset_index().rename_axis(None, axis=1))
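One caveat worth flagging (my note, not part of the original answer): in the sample data the value column holds strings, and aggfunc='sum' concatenates strings rather than adding numbers. When each (hash, ft, id, key) group contains a single row, aggfunc='first' states the intent more directly; a minimal sketch under that assumption:

# Assumes df is the sample frame above; 'first' simply picks the lone
# value in each group instead of (string-)summing it.
out = (pd.pivot_table(df, index=['hash', 'ft', 'id'],
                      columns=['key'],
                      values='value', aggfunc='first')
       .reset_index().rename_axis(None, axis=1))
print(out)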

Change values in a column to np.nan based upon row index

I want to selectively change column values to np.nan.
I have a column with a lot of zero (0) values.
I am getting the row indices of a subset of the total.
I place the indices into a variable (s0).
I then use this to set the column value to np.nan for just the rows whose index is in s0.
It runs, but it is changing every single row (i.e., the entire column) to np.nan.
Here is my code:
print((df3['amount_tsh'] == 0).sum()) # 41639 <-- there are this many zeros to start
# print(df3['amount_tsh'].value_counts()[0])
s0 = df3['amount_tsh'][df3['amount_tsh'].eq(0)].sample(37322).index # grab 37322 row indexes
print(len(s0)) # 37322
df3['amount_tsh'] = df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan # change the value in the column to np.nan if its index is in s0
print(df3['amount_tsh'].isnull().sum())
Let's try:
s0 = df3.loc[df3['amount_tsh'].eq(0), ['amount_tsh']].sample(37322)
df3.loc[df3.index.isin(s0.index), 'amount_tsh'] = np.nan
For a quick fix, I used this data I had in a notebook, and it worked for me:
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'Symbol': {0: 'ABNB', 1: 'DKNG', 2: 'EXPE', 3: 'MPNGF', 4: 'RDFN', 5: 'ROKU', 6: 'VIACA', 7: 'Z'},
    'Number of Buys': {0: np.nan, 1: 2.0, 2: np.nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: np.nan},
    'Number of Sell s': {0: 1.0, 1: np.nan, 2: 1.0, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan, 7: 1.0},
    'Gains/Losses': {0: 2106.0, 1: -1479.2, 2: 1863.18, 3: -1980.0, 4: -1687.7, 5: -1520.52, 6: -1282.4, 7: 1624.59},
    'Percentage change': {0: 0.0, 1: 2.0, 2: 0.0, 3: 0.0, 4: 1.5, 5: 0.0, 6: 0.0, 7: 0.0},
})
rows = ['ABNB','DKNG','EXPE']
data
Symbol Number of Buys Number of Sell s Gains/Losses \
0 ABNB NaN 1.0 2106.00
1 DKNG 2.0 NaN -1479.20
2 EXPE NaN 1.0 1863.18
3 MPNGF 1.0 NaN -1980.00
4 RDFN 2.0 NaN -1687.70
5 ROKU 1.0 NaN -1520.52
6 VIACA 1.0 NaN -1282.40
7 Z NaN 1.0 1624.59
Percentage change
0 0.0
1 2.0
2 0.0
3 0.0
4 1.5
5 0.0
6 0.0
7 0.0
Using your approach:
(data['Number of Buys'] == 1.0).sum()
s0 = data.loc[data['Number of Buys'] == 1.0, ['Number of Buys']].sample(2)
data.loc[data.index.isin(s0.index), 'Number of Buys'] = np.nan
Symbol Number of Buys Number of Sell s Gains/Losses \
0 ABNB NaN 1.0 2106.00
1 DKNG 2.0 NaN -1479.20
2 EXPE NaN 1.0 1863.18
3 MPNGF 1.0 NaN -1980.00
4 RDFN 2.0 NaN -1687.70
5 ROKU NaN NaN -1520.52
6 VIACA NaN NaN -1282.40
7 Z NaN 1.0 1624.59
Percentage change
0 0.0
1 2.0
2 0.0
3 0.0
4 1.5
5 0.0
6 0.0
7 0.0
Hmm...
I removed the re-assignment and it worked??
s0 = df3['amount_tsh'][df3['amount_tsh'].eq(0)].sample(37322).index
df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan
The second line was:
df3['amount_tsh'] = df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan
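For anyone wondering why that line wiped out the whole column: in Python, a chained assignment a = b = value assigns value to both targets from left to right, so df3['amount_tsh'] = np.nan runs first and overwrites every row before the .loc assignment happens. A minimal demonstration on a hypothetical toy frame:

import numpy as np
import pandas as pd

df3 = pd.DataFrame({'amount_tsh': [0, 5, 0, 7]})
s0 = df3[df3['amount_tsh'].eq(0)].sample(2).index

# The left-most target (the whole column) is assigned np.nan too,
# which is what clobbered every row in the question.
df3['amount_tsh'] = df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan
print(df3['amount_tsh'].isna().sum())  # 4 -- the entire column is NaN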

Find users' most frequent recommendations based on input queries

I have an input query table as follows:
query
0 orange
1 apple
2 meat
which I want to match against the following user query table:
user query
0 a1 orange
1 a1 strawberry
2 a1 pear
3 a2 orange
4 a2 strawberry
5 a2 lemon
6 a3 orange
7 a3 banana
8 a6 meat
9 a7 beer
10 a8 juice
Given a query in the input table, I want to match it to queries made by users in the user query table, and return the top 3 of their other queries ranked by total count.
For example,
orange in the input query matches users a1, a2, a3 in the user query table, since all of them have queried orange; the other items they have queried are strawberry (count of 2) and pear, lemon, banana (count of 1 each).
The answer will be strawberry (since it has the max count), then pear and lemon (since we only return the top 3).
Similar reasoning applies to apple (no user has queried it, so the output is 'nothing') and to the meat query.
So the final output table is
query recommend
0 orange strawberry
1 orange pear
2 orange lemon
3 apple nothing
4 meat nothing
Here is the code:
import pandas as pd
import numpy as np
# Create sample dataframes
df_input = pd.DataFrame( {'query': {0: 'orange', 1: 'apple', 2: 'meat'}} )
df_user = pd.DataFrame( {'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2', 4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3', 8: 'a6', 9: 'a7', 10: 'a8'}, 'query': {0: 'orange', 1: 'strawberry', 2: 'pear', 3: 'orange', 4: 'strawberry', 5: 'lemon', 6: 'orange', 7: 'banana', 8: 'meat', 9: 'beer', 10: 'juice'}} )
target_users = df_user[df_user['query'].isin(df_input['query'])]['user']
mask_users = df_user['user'].isin(target_users)
mask_queries = df_user['query'].isin(df_input['query'])
df1 = df_user[mask_users & mask_queries]
df2 = df_user[mask_users]
df = df1.merge(df2, on='user').rename(columns={"query_x": "query", "query_y": "recommend"})
df = df[df['query'] != df['recommend']]
df = df.groupby(['query', 'recommend'], as_index=False).count().rename(columns={"user": "count"})
df = df.sort_values(['query', 'recommend'], ascending=False, ignore_index=False)
df = df.groupby('query').head(3)
df = df.drop(columns=['count'])
df = df_input.merge(df, how='left', on='query').fillna('nothing')
df
Where df is the result. Is there any way to make the code more concise?
Unless there is a particular reason to favor pears over bananas (since they both count for one), I would suggest a more idiomatic way to do it:
import pandas as pd
df_input = pd.DataFrame(...)
df_user = pd.DataFrame(...)
df_input = (
    df_input
    .assign(
        recommend=df_input["query"].map(
            lambda x: df_user[
                (df_user["user"].isin(df_user.loc[df_user["query"] == x, "user"]))
                & (df_user["query"] != x)
            ]
            .value_counts(subset="query")
            .index[0:3]
            .to_list()
            if x in df_user["query"].unique()
            else "nothing"
        )
    )
    .explode("recommend")
    .fillna("nothing")
    .reset_index(drop=True)
)
print(df_input)
# Output
query recommend
0 orange strawberry
1 orange banana
2 orange lemon
3 apple nothing
4 meat nothing
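A design note on ties: value_counts orders equal counts by their grouped order, which is why this version surfaces banana where the question's code surfaced pear. If a deterministic tie-break matters, an explicit sort makes it visible; a minimal sketch of the counting step for a single query (my variation, not part of the answer):

import pandas as pd

df_user = pd.DataFrame({
    'user':  ['a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a3', 'a3', 'a6', 'a7', 'a8'],
    'query': ['orange', 'strawberry', 'pear', 'orange', 'strawberry',
              'lemon', 'orange', 'banana', 'meat', 'beer', 'juice'],
})

q = 'orange'  # one input query, for illustration
cohort = df_user.loc[df_user['query'] == q, 'user']
others = df_user[df_user['user'].isin(cohort) & (df_user['query'] != q)]
top3 = (others.groupby('query').size()
        .reset_index(name='count')
        .sort_values(['count', 'query'], ascending=[False, True])  # alphabetical tie-break
        .head(3)['query'].tolist())
print(top3)  # ['strawberry', 'banana', 'lemon']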

When I try to handle missing values in pandas, some methods are not working

I am trying to handle some missing values in a dataset; this is the tutorial I am learning from. Below is the code I am using to read the data.
import pandas as pd
import numpy as np
questions = pd.read_csv("./archive/questions.csv")
print(questions.head())
This is what my data looks like (see the values from questions.head() in the edit below).
These are the methods I am using to handle the missing values. None of them is working:
questions.replace(to_replace = np.nan, value = -99)
questions = questions.fillna(method ='pad')
questions.interpolate(method ='linear', limit_direction = 'forward')
Then I tried to drop the rows with missing values. None of these works either; all of them return an empty DataFrame:
questions.dropna()
questions.dropna(how = "all")
questions.dropna(axis = 1)
What am I doing wrong?
Edit:
Values from questions.head()
[[1 '2008-07-31T21:26:37Z' nan '2011-03-28T00:53:47Z' 1 nan 0.0]
[4 '2008-07-31T21:42:52Z' nan nan 458 8.0 13.0]
[6 '2008-07-31T22:08:08Z' nan nan 207 9.0 5.0]
[8 '2008-07-31T23:33:19Z' '2013-06-03T04:00:25Z' '2015-02-11T08:26:40Z'
42 nan 8.0]
[9 '2008-07-31T23:40:59Z' nan nan 1410 1.0 58.0]]
Values from questions.head() in a dictionary form.
{'Id': {0: 1, 1: 4, 2: 6, 3: 8, 4: 9}, 'CreationDate': {0: '2008-07-31T21:26:37Z', 1: '2008-07-31T21:42:52Z', 2: '2008-07-31T22:08:08Z', 3: '2008-07-31T23:33:19Z', 4: '2008-07-31T23:40:59Z'}, 'ClosedDate': {0: nan, 1: nan, 2: nan, 3: '2013-06-03T04:00:25Z', 4: nan}, 'DeletionDate': {0: '2011-03-28T00:53:47Z', 1: nan, 2: nan, 3: '2015-02-11T08:26:40Z', 4: nan}, 'Score': {0: 1, 1: 458, 2: 207, 3: 42, 4: 1410}, 'OwnerUserId': {0: nan, 1: 8.0, 2: 9.0, 3: nan, 4: 1.0}, 'AnswerCount': {0: 0.0, 1: 13.0, 2: 5.0, 3: 8.0, 4: 58.0}}
Information regarding the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17203824 entries, 0 to 17203823
Data columns (total 7 columns):
 #   Column        Dtype
---  ------        -----
 0   Id            int64
 1   CreationDate  object
 2   ClosedDate    object
 3   DeletionDate  object
 4   Score         int64
 5   OwnerUserId   float64
 6   AnswerCount   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 918.8+ MB
Can you try specifying the axis explicitly and see if that works? A plain fillna() still works without an axis, but for pad the axis tells it in which direction to fill the missing values.
>>> questions.fillna(method='pad', axis=1)
Id CreationDate ClosedDate DeletionDate Score OwnerUserId AnswerCount
0 1 2008-07-31T21:26:37Z 2008-07-31T21:26:37Z 2011-03-28T00:53:47Z 1 1 0
1 4 2008-07-31T21:42:52Z 2008-07-31T21:42:52Z 2008-07-31T21:42:52Z 458 8 13
2 6 2008-07-31T22:08:08Z 2008-07-31T22:08:08Z 2008-07-31T22:08:08Z 207 9 5
3 8 2008-07-31T23:33:19Z 2013-06-03T04:00:25Z 2015-02-11T08:26:40Z 42 42 8
4 9 2008-07-31T23:40:59Z 2008-07-31T23:40:59Z 2008-07-31T23:40:59Z 1410 1 58
A plain fillna() applied to the entire DataFrame works as expected:
>>> questions.fillna('-')
Id CreationDate ClosedDate DeletionDate Score OwnerUserId AnswerCount
0 1 2008-07-31T21:26:37Z - 2011-03-28T00:53:47Z 1 - 0.0
1 4 2008-07-31T21:42:52Z - - 458 8 13.0
2 6 2008-07-31T22:08:08Z - - 207 9 5.0
3 8 2008-07-31T23:33:19Z 2013-06-03T04:00:25Z 2015-02-11T08:26:40Z 42 - 8.0
4 9 2008-07-31T23:40:59Z - - 1410 1 58.0
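A further point worth checking (my addition, not from the original answers): most pandas methods return a new DataFrame rather than modifying the one they are called on, so calls like questions.replace(...) or questions.interpolate(...) appear to do nothing unless their result is assigned back. A minimal sketch with a toy frame:

import numpy as np
import pandas as pd

questions = pd.DataFrame({'Score': [1.0, np.nan, 3.0]})

questions.fillna(-99)              # returns a modified copy; `questions` is unchanged
questions = questions.fillna(-99)  # assign the result back so the change sticks
print(questions)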

Using duplicate values from one column to remove entire rows in a pandas dataframe

I have the data in the .csv file uploaded at the following link:
Click here for the data
In this file, I have the following columns:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
There will be duplicates in the Team column. Another column is SimStage, which contains values from 0 to N (in this case 0 to 4).
I would like to keep one row for each team at each SimStage value and remove the rest; when removing, the duplicate row with the lower value in the Points column should be dropped for each Team and SimStage.
Since it is slightly difficult to explain in words alone, I attached a picture here; the rows highlighted in red boxes are the ones to be removed.
I used df.duplicates() but it does not work.
It looks like you want to keep only the highest value from the 'Points' column. Therefore, use the 'first' aggregation function in pandas.
Create the dataframe and call it df
data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil', 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},
'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E', 5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO', 5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
'3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
Method 1
# Sort the values by 'Points' descending and 'SimStage' ascending
df = df.sort_values('Points', ascending=False)
df = df.sort_values('SimStage')
# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the original order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
Method 2
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.
import pandas as pd
namesDf = pd.DataFrame()
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]
Now, for each SimStage, you want the highest Points value, so I first group by Team and SimStage and then sort by Points:
namesDf = (namesDf.groupby(['Team', 'SimStage'], as_index=False)
           .apply(lambda x: x.sort_values(['Points'], ascending=False))
           .reset_index(drop=True))
This will make my dataframe look like this; notice the change in the SimStage group with value 3:
Team SimStage Points
0 Brazil 0 4
1 Brazil 0 4
2 Brazil 1 4
3 Brazil 1 4
4 Brazil 2 4
5 Brazil 2 1
6 Brazil 3 4
7 Brazil 3 2
8 Brazil 4 4
9 Brazil 4 1
And now I remove the duplicates, keeping the first instance of every Team and SimStage:
namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')
Final result:
Team SimStage Points
0 Brazil 0 4
2 Brazil 1 4
4 Brazil 2 4
6 Brazil 3 4
8 Brazil 4 4
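For completeness, a third idiom (my suggestion, not from the answers) selects the winning rows directly with groupby(...).idxmax(), avoiding both the sort and the duplicate drop:

import pandas as pd

namesDf = pd.DataFrame({
    'Team': ['Brazil'] * 10,
    'SimStage': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'Points': [4, 4, 4, 4, 4, 1, 2, 4, 4, 1],
})

# idxmax returns, per (Team, SimStage) group, the index label of the first
# row with the maximum Points; .loc then keeps exactly those rows.
result = namesDf.loc[namesDf.groupby(['Team', 'SimStage'])['Points'].idxmax()]
print(result)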
