Complex pandas aggregation - python-3.x

I have a table as below :
User_ID Cricket Football Chess Video_ID Category Time
1 200 150 100 111 A Morning
1 200 150 100 222 B Morning
1 200 150 100 111 A Afternoon
1 200 150 100 333 A Morning
2 100 160 80 444 C Evening
2 100 160 80 222 C Evening
2 100 160 80 333 A Morning
2 100 160 80 333 A Morning
The table above is transactional; each row represents one event of a user watching a video.
For example, "User_ID" 1 has watched videos 4 times.
The videos watched are given in "Video_ID": 111, 222, 111, 333.
NOTE:
Video_ID 111 was watched twice by this user.
Cricket, Football, Chess: these values are duplicated on every row. That is, the number of times "User_ID" 1 played cricket, football, and chess is 200, 150, and 100 respectively (the same values repeat on every row for that "User_ID").
Category: which category that particular Video_ID belongs to.
Time: what time of day the Video_ID was watched.
I am trying to get the below information from the table :
User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
1 Cricket Football A B Morning
2 Football Cricket C A Evening
NOTE: If two categories have the same count, either one can be kept as Top_1_Cat.
It's a bit complex; can anyone help with this?
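For reference, a minimal sketch that rebuilds the sample data from the table above (values copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'User_ID':  [1, 1, 1, 1, 2, 2, 2, 2],
    'Cricket':  [200, 200, 200, 200, 100, 100, 100, 100],
    'Football': [150, 150, 150, 150, 160, 160, 160, 160],
    'Chess':    [100, 100, 100, 100, 80, 80, 80, 80],
    'Video_ID': [111, 222, 111, 333, 444, 222, 333, 333],
    'Category': ['A', 'B', 'A', 'A', 'C', 'C', 'A', 'A'],
    'Time': ['Morning', 'Morning', 'Afternoon', 'Morning',
             'Evening', 'Evening', 'Morning', 'Morning'],
})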

First get the top value per group of User_ID and Video_ID with Series.value_counts and index[0]:
df1 = df.groupby(['User_ID','Video_ID']).agg(lambda x: x.value_counts().index[0])
Then get the second top Category with GroupBy.nth:
s = df1.groupby(level=0)['Category'].nth(1)
Remove duplicates by User_ID with DataFrame.drop_duplicates:
df1 = df1.reset_index().drop_duplicates('User_ID').drop('Video_ID', axis=1)
cols = ['User_ID','Category','Time']
cols1 = df1.columns.difference(cols)
Get the top 2 games by sorting the game-count columns in descending order:
df2 = pd.DataFrame(cols1[np.argsort(-df1[cols1].values, axis=1)[:, :2]],
                   columns=['Top_1_Game','Top_2_Game'],
                   index=df1['User_ID'])
Select the Category and Time columns and rename them:
df3 = (df1[cols].set_index('User_ID')
                .rename(columns={'Category':'Top_1_Cat','Time':'Top_Time'}))
Join everything with DataFrame.join and insert the Top_2_Cat values with DataFrame.insert:
df = df2.join(df3).reset_index()
df.insert(4, 'Top_2_Cat', s.values)
print (df)
User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
0 1 Cricket Football A B Morning
1 2 Football Cricket C A Evening
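A hedged alternative sketch for just the Top_1/Top_2 games, assuming the sample df built above: melt the per-game counts to long form, sort within each user, and take the first two rows.
# long form: one row per (User_ID, Game, Count), highest count first per user
top2 = (df.drop_duplicates('User_ID')
          .melt(id_vars='User_ID',
                value_vars=['Cricket', 'Football', 'Chess'],
                var_name='Game', value_name='Count')
          .sort_values(['User_ID', 'Count'], ascending=[True, False])
          .groupby('User_ID')['Game']
          .apply(lambda s: pd.Series(s.iloc[:2].values,
                                     index=['Top_1_Game', 'Top_2_Game']))
          .unstack())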

Related

Make a proper data frame from a pandas crosstab output

I have the output of the pandas crosstab function, shown below:
sports cricket football tennis
nationality
IND 180 18 1
UK 10 30 10
US 5 30 65
From the above, I would like to prepare the df below.
Expected output:
nationality cricket football tennis
IND 180 18 1
UK 10 30 10
US 5 30 65
I tried the code below, which gives the wrong data frame.
df_tab.reset_index().iloc[:, 1:]
sports cricket football tennis
IND 180 18 1
UK 10 30 10
US 5 30 65
If you need the index and column names shown together, you can move the name: the first column is still the index and all the others are regular columns (the printout just looks the same):
df = df_tab.rename_axis(index = None, columns= df_tab.index.name)
print (df)
nationality cricket football tennis
IND 180 18 1
UK 10 30 10
US 5 30 65
print (df.index)
Index(['IND', 'UK', 'US'], dtype='object')
If you need to print the DataFrame without the index:
print (df_tab.reset_index().to_string(index=False))
nationality cricket football tennis
IND 180 18 1
UK 10 30 10
US 5 30 65
EDIT: A DataFrame always needs an index, so if you need nationality as a real column, use:
df = df_tab.reset_index().rename_axis(columns = None)
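For reference, a minimal stand-in sketch for the crosstab output above (values copied from the question; an actual pd.crosstab call would set the same axis names):
import pandas as pd

df_tab = pd.DataFrame(
    {'cricket': [180, 10, 5], 'football': [18, 30, 30], 'tennis': [1, 10, 65]},
    index=pd.Index(['IND', 'UK', 'US'], name='nationality'))
df_tab.columns.name = 'sports'   # crosstab also names the column axis
print(df_tab.index.name, df_tab.columns.name)   # nationality sports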

filtering rows in one dataframe based on two columns of another dataframe

I have two data frames. One dataframe (dfA) looks like:
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (dfB) looks like
Name position string
Peter 89 aa
Jennie 568 bb
Jennie 90 cc
I want to filter dfA and dfB so that a row is kept when the position from dfB falls within the interval of dfA (start_coordinate to end_coordinate) and the names match. For example, the position in row 1 of dfB falls within the interval in row 1 of dfA and the names match, so I want this row. In contrast, row 3 of dfB also falls within the interval in row 1 of dfA, but the name differs, so I don't want that record.
The expected output therefore becomes:
##new_dfA
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
##new_dfB
Name position string
Peter 89 aa
Jennie 568 bb
In reality, dfB is of size (443068765, 10) and dfA is of size (100000, 3), so I don't want to use numpy broadcasting because I run into memory errors. Is there a way to deal with this problem within the pandas framework? Insights will be appreciated.
If you have that many rows, pandas might not be well suited for your application.
That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:
dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]
output:
>>> dfA_new
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
>>> dfB_new
Name position string
0 Peter 89 aa
1 Jennie 568 bb
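Given dfB's size, a hedged sketch of the same merge-and-filter done in chunks, so the intermediate result never holds all rows at once (the chunk size is an arbitrary assumption to tune to your memory budget):
import pandas as pd

chunks = []
step = 1_000_000   # assumed chunk size
for start in range(0, len(dfB), step):
    # merge and filter one slice of dfB at a time
    part = dfB.iloc[start:start + step].merge(dfA, on='Name')
    part = part[part['position'].between(part['start_coordinate'],
                                         part['end_coordinate'])]
    chunks.append(part)
dfC = pd.concat(chunks, ignore_index=True)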
Use pandasql (its sqldf function runs SQL against the dataframes in scope):
from pandasql import sqldf
sqldf("select dfA.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
pd.sql("select df2.* from df1 inner join df2 on df2.name=df1.name and df2.position between df1.start_coordinate and df1.end_coordinate",globals())
Name position string
0 Peter 89 aa
1 Jennie 568 bb

Change column value based on calculated mean of other column in python pandas

I am a newbie to pandas. I have gone through many questions but haven't found an answer.
I have the following datasets.
Name || Price || Cuisine Category || City || Region || Cuisine Types || Rating Types || Rating
Pizza || 600 || Fast Food,Pizza || Ajmer || Ana Saga || Quick Bites || Good || 3.9
... ... ... ... ... ... ... ... ...
Chawla's || 300 || Beverages || Ajmer || Sagar Lake || Cafe || Average || 3.3
Masala || 0 || North,South Indian || Ajmer || Ram Ganj || Mess || None || NEW
I want to change the value of:
Rating where it is NEW, based on the average Rating of that particular Cuisine Type, and then set the Rating Type based on the calculated Rating
Price where it is 0, based on the average Price of that particular Region
My try for changing price:
Reading the CSV file:
data = pd.read_csv('/content/Ajmer.csv')
Calculating the Region-wise mean of Price:
gregion = round(data.groupby('Region')['Price'].mean())
Trying to replace 0 in the Price column:
data['Price'] = data['Price'].replace(0, gregion[data['Region']])
But my Price column is unchanged.
My try for changing Rating:
Reading the CSV file:
data2 = pd.read_csv('/content/Ajmer.csv')
Creating a separate data frame so that it won't affect the mean value:
filtered_rating = data2[(data2['Rating'] == 'NEW') | (data2['Rating'] == '-') | (data2['Rating'] == 'Opening')]
Dropping those rows from the original data2:
data2.drop(data2.loc[data2['Rating']=='NEW'].index, inplace=True)
data2.drop(data2.loc[data2['Rating']=='-'].index, inplace=True)
data2.drop(data2.loc[data2['Rating']=='Opening'].index, inplace=True)
Calculating the Cuisine Types-wise mean of Rating:
c = round(data2.groupby('Cuisine Types')['Rating'].mean(),1)
which gives me output as follows:
Cuisine Types
Bakery 3.4
Confectionery 3.4
Dessert Parlor 3.5
...
Quick Bites 3.4
Sweet Shop 3.4
Name: Rating, dtype: float64
Trying to replace the values:
filtered_rating['Rating'].replace('NEW', c[data2['Region']], inplace=True)
filtered_rating['Rating'].replace('-', c[data2['Region']], inplace=True)
filtered_rating['Rating'].replace('Opening', c[data2['Region']], inplace=True)
But my Rating column is unchanged.
Expected output:
the mean Price of that particular Region in rows where Price is zero
the mean Rating of that particular Cuisine Type in rows where Rating is NEW
Can anyone help me out with this?
Thanks in advance! Any help is much appreciated.
Let's say you have data like the following.
data
name region price cuisine_type rating_type rating
0 pizza NY 500 fast food average 3.3
1 burger NY 350 fast food good 4.1
2 lobster LA 1500 seafood good 4.5
3 mussels LA 1000 seafood average 3.9
4 shawarma NY 300 mediterranean average 3.4
5 kabab LA 600 mediterranean good 4
6 pancake NY 250 breakfast average 3.7
7 waffle LA 450 breakfast good 4.2
8 fries NY 0 fast food None NEW
9 crab LA 0 seafood None Opening
10 tuna sandwich NY 0 seafood None NEW
11 onion rings LA 0 fast food None Opening
Now, according to your question, we need to replace the rating with the mean rating of the respective cuisine_type when it is NEW or Opening, replace the price with the mean price of the respective region when it is 0, and update the rating type for the None rows at the end.
#get a list of cuisine types
cuisine_type_list=data.cuisine_type.unique().tolist()
cuisine_type_list
['fast food', 'seafood', 'mediterranean', 'breakfast']
#get a list of regions
region_list=data.region.unique().tolist()
region_list
['NY', 'LA']
#replace the ratings
for i in cuisine_type_list:
    data.loc[(data.cuisine_type==i) & (data.rating.isin(['NEW', 'Opening'])), 'rating'] = round(
        data.loc[(data.cuisine_type==i) & (data.rating.isin(['NEW', 'Opening'])==False)].rating.mean(), 2)
#replace price when 0
for i in region_list:
    data.loc[(data.region==i) & (data.price==0), 'price'] = round(
        data.loc[(data.region==i) & (data.price!=0)].price.mean(), 2)
#function to assign rating type (assuming good for rating>=4)
def calculate_rating_type(row):
    if row['rating'] >= 4:
        return 'good'
    else:
        return 'average'
#update rating type
data.loc[data.rating_type.isnull(), 'rating_type'] = data.loc[data.rating_type.isnull()].apply(
    lambda row: calculate_rating_type(row), axis=1)
This is the data after updating:
data
name region price cuisine_type rating_type rating
0 pizza NY 500 fast food average 3.3
1 burger NY 350 fast food good 4.1
2 lobster LA 1500 seafood good 4.5
3 mussels LA 1000 seafood average 3.9
4 shawarma NY 300 mediterranean average 3.4
5 kabab LA 600 mediterranean good 4
6 pancake NY 250 breakfast average 3.7
7 waffle LA 450 breakfast good 4.2
8 fries NY 350 fast food average 3.7
9 crab LA 887.5 seafood good 4.2
10 tuna sandwich NY 350 seafood good 4.2
11 onion rings LA 887.5 fast food average 3.7
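As a hedged, vectorized alternative to the loops above, a sketch using groupby().transform (same column names assumed; pd.to_numeric turns NEW/Opening into NaN so they are excluded from the means):
import numpy as np
import pandas as pd

# rows whose rating must be imputed
mask = data['rating'].isin(['NEW', 'Opening'])
data['rating'] = pd.to_numeric(data['rating'], errors='coerce')
# per-cuisine mean computed over the valid (non-NaN) ratings only
cuisine_mean = data.groupby('cuisine_type')['rating'].transform('mean')
data.loc[mask, 'rating'] = cuisine_mean[mask].round(2)

# same idea for price: treat 0 as missing, fill with the regional mean
data['price'] = data['price'].replace(0, np.nan)
region_mean = data.groupby('region')['price'].transform('mean')
data['price'] = data['price'].fillna(region_mean.round(2))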
You can try the following code:
gregion = round(data.groupby('Region')['Price'].mean())
# convert your group by to DataFrame
gregion = pd.DataFrame(gregion)
gregion.reset_index(inplace=True)
# merge the datas and drop the new column that is created
data = data.merge(gregion, left_on='Region', right_on='Region', suffixes=('_x', ''))
data = data.drop(columns={'Price_x'})
filtered_rating = data[(data['Rating'] == 'NEW') | (data['Rating'] == '-') | (data['Rating'] == 'Opening')]
# you don't need to re-upload the file
data2 = data.copy()
data2.drop(data2.loc[data2['Rating']=='NEW'].index, inplace=True)
data2.drop(data2.loc[data2['Rating']=='-'].index, inplace=True)
data2.drop(data2.loc[data2['Rating']=='Opening'].index, inplace=True)
# do the same with c
c = round(data2.groupby('Cuisine Types')['Rating'].mean(),1)
c = pd.DataFrame(c)
c.reset_index(inplace=True)
filtered_rating = filtered_rating.merge(c, left_on='Cuisine Types', right_on='Cuisine Types', how='left', suffixes=('_x', ''))
filtered_rating = filtered_rating.drop(columns={'Rating_x'})
Hope this helps.

How to fill in between rows gap comparing with other dataframe using pandas?

I want to compare df1 with df2 and fill only the blanks, without overwriting other values. I have no idea how to achieve this without overwriting values or creating extra columns.
Can I do this by converting df2 into a dictionary and mapping it onto df1?
df1 = pd.DataFrame({'players name': ['ram', 'john', 'ismael', 'sam', 'karan'],
                    'hobbies': ['jog', '', 'photos', '', 'studying'],
                    'sports': ['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df1:
players name hobbies sports
0 ram jog cricket
1 john basketball
2 ismael photos chess
3 sam kabadi
4 karan studying volleyball
And df2:
df2 = pd.DataFrame({'players name': ['jagan', 'mohan', 'john', 'sam', 'karan'],
                    'hobbies': ['riding', 'tv', 'sliding', 'jumping', 'studying']})
df2:
players name hobbies
0 jagan riding
1 mohan tv
2 john sliding
3 sam jumping
4 karan studying
I want output like this:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
Try this:
df1['hobbies'] = (df1['players name'].map(df2.set_index('players name')['hobbies'])
                                     .fillna(df1['hobbies']))
df1
Output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
If the blank spaces are NaN values, e.g.
import numpy as np

df1 = pd.DataFrame({"players name": ["ram", "john", "ismael", "sam", "karan"],
                    "hobbies": ["jog", np.nan, "photos", np.nan, "studying"],
                    "sports": ["cricket", "basketball", "chess", "kabadi", "volleyball"]})
then
dicts = df2.set_index("players name")['hobbies'].to_dict()
df1['hobbies'] = df1['hobbies'].fillna(df1['players name'].map(dicts))
output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
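A hedged alternative sketch using DataFrame.combine_first, assuming the blanks in df1 are empty strings; note that combine_first returns the union of both indexes, so df1's players are re-selected afterwards:
import numpy as np

# treat '' as missing, align both frames on the player name
a = df1.replace('', np.nan).set_index('players name')
b = df2.set_index('players name')
# fill NaNs in a from b, then keep only df1's players (and their order)
filled = a.combine_first(b).loc[a.index].reset_index()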

Issue with the Normalizer transformation in Informatica PowerCenter

I am trying to normalize records of my source table using the Normalizer transformation in Informatica, but the sequence is not restarting for different rows.
Below is the source table:
Store_Name Sales_Quarter1 Sales_Quarter2 Sales_Quarter3 Sales_Quarter4
DELHI 150 240 455 100
MUMBAI 100 500 350 340
Target table:
Store_name
Sales
Quarter
I am using Occurrence = 4 on the Sales column to get GCID Sales.
For Quarter, I am using the GCID Sales column.
Output:
STORE_NAME SALES_COLUMN QUARTER
Mumbai 100 1
Mumbai 500 2
Mumbai 350 3
Mumbai 340 4
Delhi 150 5
Delhi 240 6
Delhi 455 7
Delhi 100 8
Why is the Quarter value not restarting from 1 for Delhi, and instead continuing from 5?
There is a GK column that generates sequential numbers across all rows, while GCID is the column that numbers the occurrences within each row. So double-check that it is the GCID port, and not the GK port, that is linked to the QUARTER port in the target.
It would help to provide a screenshot of the mapping and of the Normalizer transformation (Normalizer tab) to make the question more informative.
That said, I suppose you have the 'Store_Name' port at level 1 and the 'Sales_Quarter1' through 'Sales_Quarter4' ports grouped at level 2 on the Normalizer tab (using the >> button at the top left), with the Occurrence set to 4 at the group level for those four ports.
