Removing outliers based on column variables or multi-index in a dataframe - python-3.x

This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.). Not the whole row, just the cell. And I would like to do it for each of the 'red', 'yellow', 'green' columns.
Is there a way to do this without breaking the dataframe into a whole bunch of sub-dataframes with all of the conditions broken out separately? I'm not sure whether this would be easier if 'Season' and 'Treatment' were handled as columns or as indices; I'm fine with either way.
I've tried a few things with .iloc and .loc but I can't seem to make them work.

If you need to replace the outliers with missing values, use GroupBy.transform with quantile to get the per-group bounds, compare against them with DataFrame.lt and DataFrame.gt, chain the masks with | (bitwise OR), and set missing values with DataFrame.mask (NaN is the default replacement, so it is not specified):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print (df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]
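The question frames this as an IQR problem; if you want the classic 1.5×IQR fences instead of the fixed 5%/95% quantiles, a minimal sketch reusing g and c from above (the 1.5 multiplier is the usual convention, not something specified in the question):
q1 = g.transform('quantile', 0.25)
q3 = g.transform('quantile', 0.75)
iqr = q3 - q1
# mask cells outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] within each Season/Treatment group
mask = df[c].lt(q1 - 1.5 * iqr) | df[c].gt(q3 + 1.5 * iqr)
df[c] = df[c].mask(mask)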

Related

How to find the correspondence of unique values between 2 tables?

I am fairly new to Python and I am trying to create a new function for my project.
The function aims to detect which unique values in a column of one table are present in the same column of another table.
First, the function keeps only the unique values of the two tables, then merges them into a new dataframe.
The rest is where it gets complicated, because I would like to report which value is missing and from which table.
If you have any other leads or approaches, I'm also interested.
Here is my code :
def correspondance_cle(df1, df2, col):
    df11 = pd.DataFrame(df1[col].unique())
    df11.columns = [col]
    df11['test1'] = 1
    df21 = pd.DataFrame(df2[col].unique())
    df21.columns = [col]
    df21['test2'] = 1
    df3 = pd.merge(df11, df21, on=col, how='outer')
    df3 = df3.loc[(df3['test1'].isna() == True) | (df3['test2'].isna() == True), :]
    df3.info()
    for row in df3[col]:
        if df3['test1'].isna() == True:
            print(row, "is not in df1")
        else:
            print(row, 'is not in df2')
Thanks to everyone who took the time to read the post.
First build an outer join of the deduplicated key columns: remove duplicates with Series.drop_duplicates and call Series.reset_index so the original indices are kept as columns rather than lost:
df1 = pd.DataFrame({'a':[1,2,5,5]})
df2 = pd.DataFrame({'a':[2,20,5,8]})
col = 'a'
df = (df1[col].drop_duplicates().reset_index()
         .merge(df2[col].drop_duplicates().reset_index(),
                indicator=True,
                how='outer',
                on=col))
print (df)
index_x a index_y _merge
0 0.0 1 NaN left_only
1 1.0 2 0.0 both
2 2.0 5 2.0 both
3 NaN 20 1.0 right_only
4 NaN 8 3.0 right_only
Then filter rows by helper column _merge:
print (df[df['_merge'].eq('left_only')])
index_x a index_y _merge
0 0.0 1 NaN left_only
print (df[df['_merge'].eq('right_only')])
index_x a index_y _merge
3 NaN 20 1.0 right_only
4 NaN 8 3.0 right_only
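To get back to the function shape the question asked for, a minimal sketch built on the same idea (the name correspondance_cle and the printed messages mirror the question; note that left_only means the value appears only in df1, i.e. it is missing from df2):
def correspondance_cle(df1, df2, col):
    df = (df1[col].drop_duplicates().reset_index()
             .merge(df2[col].drop_duplicates().reset_index(),
                    indicator=True, how='outer', on=col))
    # rows present only on one side are the missing correspondences
    for val in df.loc[df['_merge'].eq('left_only'), col]:
        print(val, 'is not in df2')
    for val in df.loc[df['_merge'].eq('right_only'), col]:
        print(val, 'is not in df1')
    return df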

Subtract two ECDF time series

Hi, I have an ECDF plot made with seaborn, which I obtain by doing sns.ecdfplot(data=df2, x='time', hue='seg_oper', stat='count').
My dataframe is very simple:
In [174]: df2
Out[174]:
time seg_oper
265 18475 1->0:ADD['TX']
2342 78007 0->1:ADD['RX']
2399 78613 1->0:DELETE['TX']
2961 87097 0->1:ADD['RX']
2994 87210 0->1:ADD['RX']
... ... ...
330823 1002281 1->0:DELETE['TX']
331256 1003545 1->0:DELETE['TX']
331629 1004961 1->0:DELETE['TX']
332375 1006663 1->0:DELETE['TX']
333083 1008644 1->0:DELETE['TX']
[834 rows x 2 columns]
How can I subtract the series 0->1:ADD['RX'] from 1->0:DELETE['TX']?
I like seaborn because most of this data wrangling is done inside the library, but in this case I need to subtract these two series ...
Thanks.
So the first thing is to obtain what seaborn does, but manually. After that (because I need to) I can subtract one series from the other.
Cumulative Count
First we need to obtain a cumulative count for each series.
In [304]: df2['cum'] = df2.groupby(['seg_oper']).cumcount()
In [305]: df2
Out[305]:
time seg_oper cum
265 18475 1->0:ADD['TX'] 0
2961 87097 0->1:ADD['RX'] 1
2994 87210 0->1:ADD['RX'] 2
... ... ... ...
332375 1006663 1->0:DELETE['TX'] 413
333083 1008644 1->0:DELETE['TX'] 414
Pivot the data
Rearrange the DF.
In [307]: df3 = df2.pivot(index='time', columns='seg_oper',values='cum').reset_index()
In [308]: df3
Out[308]:
seg_oper time 0->1:ADD['RX'] 1->0:ADD['TX'] 1->0:DELETE['TX']
0 18475 NaN 0.0 NaN
1 78007 0.0 NaN NaN
2 78613 NaN NaN 0.0
3 87097 1.0 NaN NaN
4 87210 2.0 NaN NaN
.. ... ... ... ...
828 1002281 NaN NaN 410.0
829 1003545 NaN NaN 411.0
830 1004961 NaN NaN 412.0
831 1006663 NaN NaN 413.0
832 1008644 NaN NaN 414.0
[833 rows x 4 columns]
Fill the gaps
I'm assuming that each NaN can be filled with the previous value in its column, carried forward until the next real value.
df3 = df3.fillna(method='ffill')
At this point, if you plot df3 you'll obtain the same as doing sns.ecdfplot(df2) with seaborn.
I still want to subtract one series from the other.
df3['diff'] = df3["0->1:ADD['RX']"] - df3["1->0:DELETE['TX']"]
df3.plot(x='time')
The resulting plot (image omitted) shows this difference over time.
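Putting the steps above together, a compact sketch of the whole pipeline (same df2 as in the question):
df2['cum'] = df2.groupby('seg_oper').cumcount()          # cumulative count per series
df3 = (df2.pivot(index='time', columns='seg_oper', values='cum')
          .reset_index()
          .fillna(method='ffill'))                       # carry each series forward
df3['diff'] = df3["0->1:ADD['RX']"] - df3["1->0:DELETE['TX']"]
df3.plot(x='time')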

Table has several columns with the same type of information

My table has 4 columns: order_id, item_id_1, item_id_2 and item_id_3. The last three columns hold the same type of information (the ids of products). I want to transform this table into a 2-column table with order_id and item_id, so that each column holds a single type of information. That means if there were 3 products ordered under a particular order_id, I will get three rows (instead of one) in my new table.
This will allow me, for example, to perform a groupby operation on the item_id column to count how many times a particular product was ordered.
What is this table transformation process called?
This operation is usually called melting (also unpivoting, or wide-to-long reshaping). For example, if you have a dataframe like this -
df = pd.DataFrame({'order_id':[1,2,3], 'item_id_1':['a','b','c'], 'item_id_2':['x','y',''], 'item_id_3':['','q','']})
df
order_id item_id_1 item_id_2 item_id_3
0 1 a x
1 2 b y q
2 3 c
pd.melt(df, id_vars=['order_id'],
        value_vars=['item_id_1', 'item_id_2', 'item_id_3'],
        var_name='item_num', value_name='item_id')\
    .replace('', np.nan).dropna()\
    .sort_values(['order_id'])\
    .reset_index(drop=True)[['order_id', 'item_id']]
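An equivalent alternative, since the columns share the item_id stub, is pd.wide_to_long; a minimal sketch on the same df:
pd.wide_to_long(df, stubnames='item_id', i='order_id', j='item_num', sep='_')\
    .replace('', np.nan).dropna()\
    .reset_index()[['order_id', 'item_id']]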
So I'm not aware of any method that expands rows automatically the way you're suggesting, but you can easily reach your goal without one. Let's start from a similar data frame; I put nan in the cells of items that have not been ordered:
import pandas as pd
import numpy as np
data = {'order_id':[1,2,3],'item_id_1':[11,12,13],'item_id_2':[21,np.nan,23],'item_id_3':[31,np.nan,np.nan]}
df = pd.DataFrame(data)
cols = ['item_id_1','item_id_2','item_id_3']
print(df)
Out:
order_id item_id_1 item_id_2 item_id_3
0 1 11 21.0 31.0
1 2 12 NaN NaN
2 3 13 23.0 NaN
Then you can define a new empty data frame and fill it by iterating through the rows of the initial one. For every item, a new row is added to the empty data frame with the same order_id and a different item_id.
new_df = pd.DataFrame(columns=['order_id', 'item_id'])  # , 'item_num']
for ind, row in df.iterrows():
    new_row = {}
    new_row['order_id'] = row['order_id']
    for col in cols:  # for num, col in enumerate(cols):
        item = row[col]
        if not pd.isna(item):
            new_row['item_id'] = item
            # new_row['item_num'] = num + 1
            new_df = new_df.append(new_row, ignore_index=True)
print(new_df)
Out: # shape (6, 2), ok because 6 items have been ordered
order_id item_id
0 1.0 11.0
1 1.0 21.0
2 1.0 31.0
3 2.0 12.0
4 3.0 13.0
5 3.0 23.0
If you want, you can also add a third column to keep track of the category of each item (i.e. whether it was item_1, 2 or 3) by uncommenting the lines in the code, which gives you this output:
order_id item_id item_num
0 1.0 11.0 1.0
1 1.0 21.0 2.0
2 1.0 31.0 3.0
3 2.0 12.0 1.0
4 3.0 13.0 1.0
5 3.0 23.0 2.0
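Note that DataFrame.append was removed in pandas 2.0; a sketch of the same loop that collects plain dicts and builds the frame once at the end:
rows = []
for ind, row in df.iterrows():
    for num, col in enumerate(cols):
        if not pd.isna(row[col]):
            rows.append({'order_id': row['order_id'],
                         'item_id': row[col],
                         'item_num': num + 1})
new_df = pd.DataFrame(rows)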

How do I fill these `NaN` values properly?

Here's my original dataframe with the NaN values I'm trying to fill:
https://prnt.sc/i40j33
If I use df.interpolate(axis=1) to fill the NaN values, only some of the rows fill up properly with a number.
For example:
https://prnt.sc/i40mgq
As you can see in the screenshot, the cell at column 1981, row 3, which had a NaN value, has been filled properly with a value other than NaN. I want to fill the rest of the NaNs like that as well. Any idea how I do that?
Using DataFrame.interpolate()
In your case it is failing because there are no columns to the left, and therefore the interpolate method doesn't know what to interpolate towards: missing_value = (left_value + right_value) / 2
So you could, for example, insert a column of all 0's to the left (which imputes the missing values in the first column with half of the next value), as such:
df.insert(loc=0, column='allZeroes', value=0)
After this, you can interpolate as you were doing and then remove the helper column.
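Continuing from the insert above, a minimal sketch of the remaining steps:
df = df.interpolate(axis=1)        # every NaN now has a value to its left to interpolate from
df = df.drop(columns='allZeroes')  # remove the helper column again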
General missing value imputation
Either use df.fillna('DEFAULT-VALUE'), as Alex mentioned in the comments to the question,
or do something like:
df.loc[df.my_col.isnull(), 'my_col'] = 'DEFAULT-VALUE'
I'd recommend fillna, since you can use methods such as forward fill (ffill) -- imputing the missings with the previous value -- and other similar methods.
It seems like you might want to interpolate on axis=0, column-wise:
>>> df = pd.DataFrame(np.arange(35, dtype=float).reshape(5,7),
...                   columns=[1951, 1961, 1971, 1981, 1991, 2001, 2001],
...                   index=range(0, 5))
>>> df.iloc[1:3, 0] = np.nan
>>> df.iloc[3, 3] = np.nan
>>> df.interpolate(axis=0)
1951 1961 1971 1981 1991 2001 2001
0 0.0 1.0 2.0 3.0 4.0 5.0 6.0
1 7.0 8.0 9.0 10.0 11.0 12.0 13.0
2 14.0 15.0 16.0 17.0 18.0 19.0 20.0
3 21.0 22.0 23.0 24.0 25.0 26.0 27.0
4 28.0 29.0 30.0 31.0 32.0 33.0 34.0
Currently you're interpolating row-wise. NaNs that "begin" a Series aren't flanked by a value on each side, making interpolation impossible for them.
Update: pandas added more options for this in v0.23.0.
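For instance, a minimal sketch of two ways to also fill those leading NaNs (same df as above):
df.interpolate(axis=0, limit_direction='both')  # pads leading NaNs from the first valid value
df.interpolate(axis=0).bfill()                  # or interpolate, then back-fill whatever is left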

Pandas rolling mean: don't change numbers to NaN in DataFrame

I'm working with a pandas DataFrame which looks like this:
(N.B. the offset is set as the index of the DataFrame)
offset X Y Z
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.126465 -1.926758 -0.414062
50 -0.137207 -1.916992 -0.404297
60 -0.130371 -3.784591 -0.987654
70 -0.125000 -1.918457 -0.403809
80 -0.123456 -1.917480 -0.413574
90 -0.126465 -1.926758 -0.333554
I have applied a rolling mean with window size = 5 to the data frame using the following code.
I need to keep this window size of 5, and I need values for the whole dataframe, for all of the offset values (no NaNs).
df = df.rolling(center=False, window=5).mean()
Which gives me:
offset X Y Z
0.0 NaN NaN NaN
10.0 NaN NaN NaN
20.0 NaN NaN NaN
30.0 NaN NaN NaN
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
I would like the DataFrame to keep the original values in the rows where the rolling mean produced NaN, and use the rolling mean everywhere else. Is there a simple way to do this? Thanks
i.e.
offset X Y Z
0.0 -0.140137 -1.924316 -0.426758
10.0 -2.789123 -1.111212 -0.416016
20.0 -0.133789 -1.923828 -4.408691
30.0 -0.101112 -1.457891 -0.425781
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
You can fill with the original df:
df.rolling(center=False, window=5).mean().fillna(df)
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
There is also an argument, min_periods, that you can use. If you pass min_periods=1, it takes the first value as-is, the second value as the mean of the first two, and so on. It might make more sense in some cases.
df.rolling(center=False, window=5, min_periods=1).mean()
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -1.464630 -1.517764 -0.421387
20 -1.021016 -1.653119 -1.750488
30 -0.791040 -1.604312 -1.419311
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
Assuming you don't have other rows that are all NaNs, you can identify which rows of your rolling result are all NaN and replace them with the corresponding rows from the original. Example:
df = pd.DataFrame(np.random.rand(13, 5))
df_rolling = df.rolling(center=False, window=5).mean()
# identify which rows are all NaN
idx = df_rolling.index[df_rolling.isnull().all(1)]
# replace those rows with the original data
df_rolling.loc[idx, :] = df.loc[idx, :]
