Removing outliers using percentile in pandas dataframe groupby - python-3.x

I have a dataframe df:
Transportation_Mode  time_delta  trip_id  segmentid  Vincenty_distance  velocity     acceleration  jerk
walk                 1           1        1          1.551676553        1.551676553   0.550163852  -1.017629555
walk                 1           1        1          1.70920675         1.70920675    0.16257622   -0.39166534
walk                 1           1        1          1.871782971        1.871782971  -0.22908912   -0.734438511
walk                 12          1        1          23.16466284        1.93038857    0.324972586  -0.331839143
walk                 1           1        1          5.830059603        5.830059603  -3.657097132   2.614438854
bus                  1           16       5          8.418372046        8.418372046  -7.259019484   7.40735053
bus                  23          16       5          26.66510892        1.159352562   0.148331046  -0.036318522
bus                  1           16       5          4.570966614        4.570966614  -0.68699497   -0.889126918
I want to remove outlier values within each group of Transportation_Mode, based on the percentile values [0.05, 0.95].
My problem is similar to the discussion Remove outliers in Pandas dataframe with groupby.
The code I wrote is:
res = df.groupby("Transportation_Mode")["Vincenty_distance"].quantile([0.05, 0.95]).unstack(level=1)
df.loc[ (res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95]) ]
but I get the error ValueError: cannot reindex from a duplicate axis. I don't know where I am going wrong here.
Complete input data is available at the link https://drive.google.com/file/d/1JjvS7igTmrtLA4E5Rs5D6tsdAXqzpYqX/view?usp=sharing

Actually, if we look at
(res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95])
it returns a boolean Series that can be used to select rows in the original df. The catch is the Series' index: because df.Transportation_Mode contains repeated values, res.loc[...] produces a Series with a duplicated index, which pandas cannot align with df's index (hence the "cannot reindex from a duplicate axis" error). Passing the underlying array instead, by adding .values before giving it to df.loc[], sidesteps the alignment. Below should work:
df.loc[ ((res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95])).values]

Use map to build Series the same size as the original DataFrame, which makes boolean filtering possible:
m1 = df.Transportation_Mode.map(res[0.05]) < df.Vincenty_distance
m2 = df.Vincenty_distance < df.Transportation_Mode.map(res[0.95])
df = df[m1 & m2]
print (df)
Transportation_Mode time_delta trip_id segmentid Vincenty_distance \
1 walk 1 1 1 1.709207
2 walk 1 1 1 1.871783
4 walk 1 1 1 5.830060
5 bus 1 16 5 8.418372
velocity acceleration jerk
1 1.709207 0.162576 -0.391665
2 1.871783 -0.229089 -0.734439
4 5.830060 -3.657097 2.614439
5 8.418372 -7.259019 7.407351
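For completeness, a sketch of an equivalent route (assuming the same column names as above) uses GroupBy.transform, which returns the per-group bounds already aligned to df's index, so neither .values nor map is needed:
import pandas as pd

# Hypothetical reconstruction of a slice of the sample data.
df = pd.DataFrame({
    'Transportation_Mode': ['walk'] * 5 + ['bus'] * 3,
    'Vincenty_distance': [1.551676553, 1.70920675, 1.871782971, 23.16466284,
                          5.830059603, 8.418372046, 26.66510892, 4.570966614],
})

# transform broadcasts each group's quantile back to that group's rows.
g = df.groupby('Transportation_Mode')['Vincenty_distance']
lo = g.transform(lambda s: s.quantile(0.05))
hi = g.transform(lambda s: s.quantile(0.95))
print(df[(df.Vincenty_distance > lo) & (df.Vincenty_distance < hi)])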

Related

Calculation using shifting is not working in a for loop

The problem consists of calculating the column "accumulated" of a dataframe, using the columns "accumulated" and "weekly". The formula is: accumulated at t = weekly at t + accumulated at t-1.
The desired result should be:
weekly accumulated
2 0
1 1
4 5
2 7
The result I'm obtaining is:
weekly accumulated
2 0
1 1
4 4
2 2
What I have tried is:
for key, value in df_dic.items():
    df_aux = df_dic[key]
    df_aux['accumulated'] = 0
    df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
    #df_aux["accumulated"] = df_aux.iloc[:,2] + df_aux.iloc[:,3].shift(1)
    df_aux.iloc[0,3] = 0  # I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0,3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes, seen as {0: df1, 1: df2, 2: df3}. All the dataframes have the same size and the same column names, so I use the for loop to apply the same calculation to every dataframe inside the dictionary.
EDIT2: I'm trying to do the computation outside the for loop and it is not working.
What I'm doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with the dictionary interaction...
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1, df2, df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0,0]
    print (d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
To solve for 1 dataframe
You need to use cumsum and then subtract the value from the first row. cumsum gives the running total [2, 3, 7, 9]; forcing the first cell to 0 amounts to shifting every value down by the first week's amount, hence the subtraction of df.iloc[0,0]. Here's how to do it:
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
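As for why the original shift-based loop goes wrong: the assignment df_aux['accumulated'] = df_aux.weekly + df_aux.accumulated.shift(1) is evaluated once, against the column as it exists at that moment (all zeros), not row by row, so later rows never see the freshly accumulated values. A minimal sketch of the effect:
import pandas as pd

df = pd.DataFrame({'weekly': [2, 1, 4, 2]})
df['accumulated'] = 0

# shift(1) reads the current, all-zero column, so every row gets
# weekly + 0 instead of weekly + the running total.
df['accumulated'] = df.weekly + df.accumulated.shift(1)
df.iloc[0, 1] = 0
print(df)   # accumulated is 0, 1, 4, 2 - exactly the wrong result above
cumsum, by contrast, computes the whole running total in one vectorized pass.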

groupby and ranking based on the string in one column

I am working on a data frame which contains over 70 actions. I have a column that groups those 70 actions. I want to create a new column that is the rank of strings from an existing column. The following is a sample of the data frame:
import pandas as pd

DF = pd.DataFrame()
DF['template'] = ['Attk','Attk','Attk','Attk','Attk','Attk','Def','Def','Def','Def','Def','Def','Accuracy','Accuracy','Accuracy','Accuracy','Accuracy','Accuracy']
DF['Stats'] = ['Goal','xG','xA','Goal','xG','xA','Block','interception','tackles','Block','interception','tackles','Acc.passes','Acc.actions','Acc.crosses','Acc.passes','Acc.actions','Acc.crosses']
DF = DF.sort_values(['template','Stats'])
The new column I want to create should group by template and rank the Stats in alphabetical order (the Order column in the output below). I have 10 to 15 Stats under each template.
Use GroupBy.transform with a lambda function and factorize; because Python counts from 0, 1 is added. Note that factorize numbers values by order of first appearance, which is alphabetical here only because of the sort_values call above:
f = lambda x: pd.factorize(x)[0]
DF['Order'] = DF.groupby('template')['Stats'].transform(f) + 1
print (DF)
template Stats Order
13 Accuracy Acc.actions 1
16 Accuracy Acc.actions 1
14 Accuracy Acc.crosses 2
17 Accuracy Acc.crosses 2
12 Accuracy Acc.passes 3
15 Accuracy Acc.passes 3
0 Attk Goal 1
3 Attk Goal 1
2 Attk xA 2
5 Attk xA 2
1 Attk xG 3
4 Attk xG 3
6 Def Block 1
9 Def Block 1
7 Def interception 2
10 Def interception 2
8 Def tackles 3
11 Def tackles 3
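If you would rather not depend on the frame being pre-sorted, a sketch of an order-independent alternative (assuming the same DF) dense-ranks the strings themselves:
# rank(method='dense') orders the strings alphabetically within each
# template group; equal strings share a rank and ranks have no gaps.
DF['Order'] = (DF.groupby('template')['Stats']
                 .transform(lambda s: s.rank(method='dense'))
                 .astype(int))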

how to split a dataframe into equal-sized subsets in python

I have a dataframe
import pandas as pd
d = {'user': [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     'friends': [1, 2, 1, 5, 4, 6, 7, 20, 9, 7]}
df = pd.DataFrame(data=d)
I am trying to split df into n pieces in a loop. For example, for n=3:
n = 3
for i in range(3):
    subdata = dosomething(df)
    print(subdata)
the output will be something like
# first loop
user friends
0 1 1
1 1 2
2 2 1
3 2 5
# second loop
user friends
0 2 4
1 2 6
2 2 7
3 2 20
#third loop
user friends
0 2 9
1 2 7
You can use iloc to loop through the dataframe in chunks, putting each new dataframe in a dictionary for later recall.
dfs = {}
chunk = 4
Loop through the dataframe in chunk-sized slices and add each piece to the dict; rounding the number of chunks up means the final slice automatically picks up any leftover rows, so no separate special case is needed.
for n in range((df.shape[0] + chunk - 1) // chunk):
    df_temp = df.iloc[n*chunk:(n+1)*chunk]
    df_temp = df_temp.reset_index(drop=True)
    dfs[n] = df_temp
Access the dataframes in the dictionary.
print(dfs[0])
user friends
0 1 1
1 1 2
2 2 1
3 2 5
print(dfs[1])
user friends
0 2 4
1 2 6
2 2 7
3 2 20
print(dfs[2])
user friends
0 2 9
1 2 7
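As an aside, numpy can produce the same chunking in one call (a sketch, assuming the df above); np.array_split accepts either a number of pieces or explicit split points:
import numpy as np

# Split the row positions before rows 4 and 8, giving pieces of
# 4, 4 and 2 rows, matching the chunked output above.
for idx in np.array_split(np.arange(len(df)), range(4, len(df), 4)):
    print(df.iloc[idx].reset_index(drop=True))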

How to find if a column value is undervalued or overvalued based on other column?

So I'm trying to solve this pandas exercise. I got this dataset of a real-estate firm from Kaggle, and the data frame df looks like this:
id location type price
0 44525 Golden Mile House 4400000
1 44859 Nagüeles House 2400000
2 45465 Nagüeles House 1900000
3 50685 Nagüeles Plot 4250000
4 130728 Golden Mile House 32000000
5 130856 Nagüeles Plot 2900000
6 130857 Golden Mile House 3900000
7 130897 Golden Mile House 3148000
8 3484102 Marinha Plot 478000
9 3484124 Marinha Plot 2200000
10 3485461 Marinha House 1980000
So now I have to find which properties are undervalued or overvalued, and which have a genuine price, on the basis of the columns location and type. The desired result should look like this:
id location type price Over_val Under_val Norm_val
0 44525 Golden Mile House 4400000 0 0 1
1 44859 Nagüeles House 2400000 0 0 1
2 45465 Nagüeles House 1900000 0 0 1
3 50685 Nagüeles Plot 4250000 0 1 0
4 130728 Golden Mile House 32000000 1 0 0
5 130856 Nagüeles Plot 2900000 0 1 0
6 130857 Golden Mile House 3900000 0 0 1
7 130897 Golden Mile House 3148000 0 0 1
8 3484102 Marinha Plot 478000 0 0 1
9 3484124 Marinha Plot 2200000 0 0 1
10 3485461 Marinha House 1980000 0 1 0
I have been stuck on it for a while. What logic should I try to solve this problem?
Here's my solution; explanation included as inline comments. There are probably ways to do this in fewer steps; I'll be interested to learn too.
import pandas as pd
# Replace this with whatever you have to load your data. This is set up for a sample data file I used
df = pd.read_csv('my_sample_data.csv', encoding='latin-1')
# Mean by location - type
mdf = df.set_index('id').groupby(['location','type'])['price'].mean().rename('mean').to_frame().reset_index()
# StdDev by location - type
sdf = df.set_index('id').groupby(['location','type'])['price'].std().rename('sd').to_frame().reset_index()
# Merge back into the original dataframe
df = df.set_index(['location','type']).join(mdf.set_index(['location','type'])).reset_index()
df = df.set_index(['location','type']).join(sdf.set_index(['location','type'])).reset_index()
# Add the indicator columns
df['Over_val'] = 0
df['Under_val'] = 0
df['Normal_val'] = 0
# Update the indicators
df.loc[df['price'] > df['mean'] + 2 * df['sd'], 'Over_val'] = 1
df.loc[df['price'] < df['mean'] - 2 * df['sd'], 'Under_val'] = 1
df['Normal_val'] = df['Over_val'] + df['Under_val']
df['Normal_val'] = df['Normal_val'].apply(lambda x: 1 if x == 0 else 0)
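For what it's worth, a shorter route (a sketch, assuming the same df and column names) gets the per-group statistics with GroupBy.transform, which returns them already aligned to each row, so no joins are needed:
grp = df.groupby(['location', 'type'])['price']
mean = grp.transform('mean')   # group mean, broadcast to every row
sd = grp.transform('std')      # group std dev, broadcast to every row

df['Over_val'] = (df['price'] > mean + 2 * sd).astype(int)
df['Under_val'] = (df['price'] < mean - 2 * sd).astype(int)
df['Normal_val'] = 1 - df['Over_val'] - df['Under_val']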
Here is another possible method. At 2 standard deviations there are no qualifying properties; there is one property at one standard deviation.
import pandas as pd
df = pd.DataFrame(data={}, columns=["id", "location", "type", "price"])
# data is already entered, left out for this example
df["id"] = prop_id
df["location"] = location
df["type"] = prop_type
df["price"] = price
# a function that returns the mean and standard deviation
def mean_std_dev(row):
    mask1 = df["location"] == row["location"]
    mask2 = df["type"] == row["type"]
    df_filt = df[mask1 & mask2]
    mean_price = df_filt["price"].mean()
    std_dev_price = df_filt["price"].std()
    return [mean_price, std_dev_price]

# create two columns and populate with the mean and std dev from function mean_std_dev
df[["mean", "standard deviation"]] = df.apply(
    lambda row: pd.Series(mean_std_dev(row)), axis=1
)
# create final columns
df["Over_val"] = df.apply(
    lambda x: 1 if x["price"] > x["mean"] + x["standard deviation"] else 0, axis=1
)
df["Under_val"] = df.apply(
    lambda x: 1 if x["price"] < x["mean"] - x["standard deviation"] else 0, axis=1
)
df["Norm_val"] = df.apply(
    lambda x: 1 if x["Over_val"] + x["Under_val"] == 0 else 0, axis=1
)
# delete the helper columns (drop returns a copy, so assign it back)
df = df.drop(["mean", "standard deviation"], axis=1)
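Note that the row-wise apply refilters the whole frame for every row; on larger data, the transform sketch above computes each group's statistics only once.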

Python 3.x - Merge pandas data frames

I am using Python for the Titanic disaster competition on Kaggle. The dataset (df) contains 3 attributes for each passenger: 'Gender' (1/0), 'Age' and 'Pclass' (1/2/3). I want to obtain the median age corresponding to each Gender-Pclass combination.
The end result should be a dataframe as -
Gender Class
1 1
0 2
1 3
0 1
1 2
0 3
Median age will be calculated later
I tried to create the data frame as follows -
unique_gender = pd.DataFrame(df.Gender.unique())
unique_class = pd.DataFrame(df.Class.unique())
reqd_df = pd.merge(unique_gender, unique_class, how = 'outer')
But the output obtained is -
0
0 3
1 1
2 2
3 0
Can someone please help me get the desired output?
You want df.groupby(['gender','class'])['age'].median() (per JohnE). The merge attempt fails because pd.merge joins the two single-column frames on their shared column name 0, producing the union of the values rather than the cross product of Gender and Class.
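A minimal sketch of the groupby route, assuming the column names from the question ('Gender', 'Pclass', 'Age'); reset_index flattens the grouped result back into a dataframe with one row per combination:
median_ages = (df.groupby(['Gender', 'Pclass'])['Age']
                 .median()
                 .reset_index()
                 .rename(columns={'Age': 'MedianAge'}))
print(median_ages)   # one row per Gender-Pclass combination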
