Mean imputation based on certain Conditions - python-3.x

I have the below dataframe,
Category Value
A 100
A -
B -
C 50
D 200
D 400
D -
As you can see, there are some values which have the hyphen symbol '-'. I want to replace those hyphons with the means of the corresponding category.
In the example, there are two entries for "A" - One row with value 100 and other with hyphen. So the mean would be 100 itself. For B, since there are no valid values, the mean would be the mean of the entire column which would be (100+50+200+400/4 = 187.5). For C, no changes and for D, the hyphen will be replaced by 300 (same logic as for "A").
Output:
Category Value
A 100
A 100
B 187.5
C 50
D 200
D 400
D 300

Try:
df = df.replace("-", np.nan)
df["Value"] = pd.to_numeric(df["Value"])
avg = df["Value"].mean()
df["Value"] = df["Value"].fillna(
df.groupby("Category")["Value"].transform(
lambda x: avg if x.isna().all() else x.mean()
)
)
print(df)
Prints:
Category Value
0 A 100.0
1 A 100.0
2 B 187.5
3 C 50.0
4 D 200.0
5 D 400.0
6 D 300.0

Related

Concat values on dataframe columns excluding NaN's

I have a dataframe with n store columns, here I'm just showing the first 2:
ref_id store_0 store_1
0 100 c b
1 300 d NaN
I want a way to concat only the non-NaN values from store columns into a new column adding a comma between each value, and finally drop those columns. Desired output is:
ref_id stores
0 100 c,b
1 300 d
Right now I've tried df['stores'] = df['store_0'] + ',' + df['store_1'] with this result:
ref_id store_0 store_1 stores
0 100 c b c,b
1 300 d NaN NaN
You can use:
cols = df.filter(like='store_').columns
df2 = (df
.drop(columns=cols)
.assign(stores=df[cols].agg(lambda s: s.dropna()
.str.cat(sep=','),
axis=1))
)
Or, for in place modification:
cols = df.filter(like='store_').columns
df['stores'] = df[cols].agg(lambda s: s.dropna().str.cat(sep=','), axis=1)
df.drop(columns=cols, inplace=True)
Output:
ref_id stores
0 100 c,b
1 300 d
You can try
df_ = df.filter(like='store')
df = (df.assign(store=df_.apply(lambda row : row.str.cat(sep=','), axis=1))
.drop(df_.columns, axis=1))
print(df)
ref_id store
0 100 c,b
1 300 d
Try with
df['store'] = df.filter(like = 'store').apply(lambda x : ','.join(x[x==x]),1)
df
Out[60]:
ref_id store_0 store_1 store
0 100 c b c,b
1 300 d NaN d

Drop by multiple columns groups if specific values not exit in another column in Pandas

How can I drop the whole group by city and district if date's value of 2018/11/1 not exits in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create helper column by DataFrame.assign, compare by datetime and test if at least one true per groups with GroupBy.any and GroupBy.transform for possible filter by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
.groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If error with misisng values in mask one possivle idea is replace misisng values in columns used for groups:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
city= df['city'].fillna(-1),
district= df['district'].fillna(-1))
.groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is add possible misisng index values by reindex and also replace missing values to False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
.groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
#Looking for missing data and then handling it accordingly
def find_missing(data):
# number of missing values
count_missing = data_final.isnull().sum().values
# total records
total = data_final.shape[0]
# percentage of missing
ratio_missing = count_missing/total
# return a dataframe to show: feature name, # of missing and % of missing
return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing},
index=data.columns.values)
find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
In the output I would just want to see :
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum
to count by columns. We can also use DataFrame.isnull instead DataFrame.isna.
new_df = (df.isna()
.sum()
.to_frame('missing_count')
.assign(missing_ratio = lambda x: x['missing_count']/len(df))
.loc[df.isna().any()] )
print(new_df)
We can also use pd.concat instead DataFrame.assign
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
count.div(len(df))
.rename('missing_ratio')],axis = 1)
.loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
IIUC, we can assign the missing and total count to two variables do some basic math and assign back to a df.
a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0),2)
missing_df = pd.DataFrame({'missing_vals' : a,
'missing_ratio' : b})
print(missing_df)
missing_vals ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
you can filter out columns that don't have any missing vals
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s,s/len(df)],1)
result.columns = ["missing_count","missing_ratio"]
print (result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667

Pandas: Calculating value of difference between current column value and next column value depending if it meets criteria at a different column

I have a dataframe:
df = pd.DataFrame.from_items([('A', [10, 'foo']), ('B', [440, 'foo']), ('C', [790, 'bar']), ('D', [800, 'bar']), ('E', [7000, 'foo'])], orient='index', columns=['position', 'foobar'])
Which looks like the below:
position foobar
A 10 foo
B 440 foo
C 790 bar
D 800 bar
E 7000 foo
I would like to know the difference between each position and the next position that has the opposite value in the foobar column. Normally I would use the shift method to move down the position column:
df[comparisonCol].shift(-1) - df[comparisonCol]
but as I am using the foobar column to decide which position is applicable, I am not sure how to do this.
The result should look like:
position foobar difference
A 10 foo 780
B 440 foo 350
C 790 bar 6210
D 800 bar 6200
E 7000 foo NaN
I think you need if unique values in foobar are only 2, so is possible shift between groups in a Series:
#identify consecutive groups
a = df['foobar'].ne(df['foobar'].shift()).cumsum()
print (a)
A 1
B 1
C 2
D 2
E 3
Name: foobar, dtype: int32
#get first value by a of position column
b = df.groupby(a)['position'].first()
print (b)
foobar
1 10
2 790
3 7000
Name: position, dtype: int64
#subtract mapped value, but for next group is added 1 to a Series
df['difference'] = a.add(1).map(b) - df['position']
print (df)
position foobar difference
A 10 foo 780.0
B 440 foo 350.0
C 790 bar 6210.0
D 800 bar 6200.0
E 7000 foo NaN
Detail:
print (a.add(1).map(b))
A 790.0
B 790.0
C 7000.0
D 7000.0
E NaN
Name: foobar, dtype: float64

Get column names from pandas DataFrame in format dtype:object

I have a similar doubt to the one in the mentioned link. Instead of returning column names in a list, I want column names in the format dtype:object.
For example,
A
B
C
D
Name:x,dtype:object
I am using Excel file in xlsx format.
Link: Get list from pandas DataFrame column headers
I think you need read_excel first for df and then Series constructor or Index.to_series for Series from column names:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5]})
print (df)
A B C D
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
s = pd.Series(df.columns.values, name='x')
print (s)
0 A
1 B
2 C
3 D
Name: x, dtype: object
s1 = df.columns.to_series().rename('x')
print (s1)
A A
B B
C C
D D
Name: x, dtype: object

Resources