I have a dataframe where each row is a dictionary on which I'd like to use seaborn's horizontal box plot.
The x axis should be the float values for each 'dialog'
The y axis should show the 4 different models
There should be a plot for each part of speech: one graph for 'INTJ', another for 'ADV', and so on.
I'm thinking I'll have to do a pd.melt first to restructure the data so that the new columns are 'dialog_num', 'model_type', and 'value' (the automatic variable name after a melt; the values would still be the rows of dictionaries).
After that, perhaps break the 'value' column apart so that each part of speech ('ADV', 'INTJ', 'VERB', etc.) becomes its own column (this part seems tricky to me). Past this point...do a for loop over all of the columns and apply the horizontal boxplot?
import pandas as pd
pos =\
{'dialog_num': {0: 0, 1: 1, 2: 2},
'model1': {0: {'ADV': 0.072, 'INTJ': 0.03, 'PRON': 0.133, 'VERB': 0.109},
1: {'ADJ': 0.03, 'NOUN': 0.2, 'PRON': 0.13},
2: {'ADV': 0.083, 'PRON': 0.125, 'VERB': 0.0625}},
'model2': {0: {'ADJ': 0.1428, 'ADV': 0.1428, 'AUX': 0.1428, 'INTJ': 0.285},
1: {'ADJ': 0.1, 'DET': 0.1, 'NOUN': 0.1, 'PROPN': 0.1, 'VERB': 0.2},
2: {'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166, 'VERB': 0.3333}},
'model3': {0: {'ADJ': 0.06, 'CCONJ': 0.06, 'NOUN': 0.2, 'PRON': 0.266, 'SPACE': 0.066, 'VERB': 0.333},
1: {'AUX': 0.15, 'PRON': 0.25, 'PUNCT': 0.15, 'VERB': 0.15},
2: {'ADP': 0.125, 'PRON': 0.0625, 'PUNCT': 0.0625, 'VERB': 0.25}},
'model4': {0: {'ADJ': 0.25, 'ADV': 0.08, 'CCONJ': 0.083, 'PRON': 0.166},
1: {'AUX': 0.33, 'PRON': 0.2, 'VERB': 0.0667},
2: {'CCONJ': 0.125, 'NOUN': 0.125, 'PART': 0.125, 'PRON': 0.125, 'SPACE': 0.125, 'VERB': 0.375}}}
df = pd.DataFrame.from_dict(pos)
display(df)
dialog_num model1 model2 model3 model4
0 0 {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109} {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428} {'PRON': 0.266, 'VERB': 0.333, 'ADJ': 0.06, 'NOUN': 0.2, 'CCONJ': 0.06, 'SPACE': 0.066} {'PRON': 0.166, 'ADV': 0.08, 'ADJ': 0.25, 'CCONJ': 0.083}
1 1 {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2} {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1} {'PRON': 0.25, 'AUX': 0.15, 'VERB': 0.15, 'PUNCT': 0.15} {'PRON': 0.2, 'AUX': 0.33, 'VERB': 0.0667}
2 2 {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625} {'VERB': 0.3333, 'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166} {'PRON': 0.0625, 'VERB': 0.25, 'PUNCT': 0.0625, 'ADP': 0.125} {'PRON': 0.125, 'VERB': 0.375, 'PART': 0.125, 'CCONJ': 0.125, 'NOUN': 0.125, 'SPACE': 0.125}
sns.boxplot expects data to be supplied in a long form when specifying x= and y=.
In this case, since each speech type should be a separate plot, sns.catplot is used instead: its col= parameter creates a separate subplot for each speech type.
As mentioned in the OP, use .melt to unpivot the wide dataframe.
.json_normalize can be used to convert the 'value' column (dict type) into a flat table.
See Split / Explode a column of dictionaries into separate columns with pandas if there are issues with this step.
Join the flattened table (vals) to dfm with .join.
This works because vals and dfm have matching indices.
.melt the dataframe again.
Plot the box plot from the long form dataframe.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
import pandas as pd
import seaborn as sns
# load the dict into a dataframe
df = pd.DataFrame(pos)
# unpivot the dataframe
dfm = df.melt(id_vars='dialog_num', var_name='model')
# convert the 'value' column of dicts to a flat table
vals = pd.json_normalize(dfm['value'])
# combine vals to dfm, without the 'value' column
dfm = dfm.iloc[:, 0:-1].join(vals)
# unpivot the dataframe again
dfm = dfm.melt(id_vars=['dialog_num', 'model'])
Plot all of the speech types together:
p = sns.boxplot(data=dfm, x='value', y='model')
Plot the speech types separately. Most speech types have only a single value, or no values:
p = sns.catplot(kind='box', data=dfm, x='value', y='model', col='variable', col_wrap=4, height=4)
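If the near-empty facets are a distraction, one option (a sketch, not part of the original answer) is to keep only the speech types with more than one observed value before faceting:
# count() ignores NaN, so this finds speech types with 2+ observations
counts = dfm.groupby('variable')['value'].count()
keep = counts[counts > 1].index
p = sns.catplot(kind='box', data=dfm[dfm['variable'].isin(keep)], x='value', y='model', col='variable', col_wrap=4, height=4)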
DataFrames at each step
1: dfm.head()
dialog_num model value
0 0 model1 {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109}
1 1 model1 {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2}
2 2 model1 {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625}
3 0 model2 {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428}
4 1 model2 {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1}
2: vals.head()
INTJ ADV PRON VERB ADJ NOUN AUX PROPN DET CCONJ SPACE PUNCT ADP PART
0 0.030 0.0720 0.133 0.1090 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 0.130 NaN 0.0300 0.2 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 0.0830 0.125 0.0625 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0.285 0.1428 NaN NaN 0.1428 NaN 0.1428 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN 0.2000 0.1000 0.1 NaN 0.1 0.1 NaN NaN NaN NaN NaN
3: dfm.head()
dialog_num model INTJ ADV PRON VERB ADJ NOUN AUX PROPN DET CCONJ SPACE PUNCT ADP PART
0 0 model1 0.030 0.0720 0.133 0.1090 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 model1 NaN NaN 0.130 NaN 0.0300 0.2 NaN NaN NaN NaN NaN NaN NaN NaN
2 2 model1 NaN 0.0830 0.125 0.0625 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0 model2 0.285 0.1428 NaN NaN 0.1428 NaN 0.1428 NaN NaN NaN NaN NaN NaN NaN
4 1 model2 NaN NaN NaN 0.2000 0.1000 0.1 NaN 0.1 0.1 NaN NaN NaN NaN NaN
4: dfm.head()
dialog_num model variable value
0 0 model1 INTJ 0.030
1 1 model1 INTJ NaN
2 2 model1 INTJ NaN
3 0 model2 INTJ 0.285
4 1 model2 INTJ NaN
Related
Here is my code in scikit-learn for using a random forest. I have already manually filled NAs with 0s. I have also gotten this code to run with the same data before, so it must be something fixable in the code and not the data itself:
Code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': [2, 3, 4, 5, 7], 'n_estimators': [1, 10, 25, 50, 100, 256, 512], 'random_state': [42]}

def perform_grid_search(X_data, y_data):
    """Perform a grid search and return the best n_estimators and max_depth."""
    rf = RandomForestClassifier(criterion='entropy')
    clf = GridSearchCV(rf, parameters, cv=4, scoring='roc_auc', n_jobs=3)
    clf.fit(X_data, y_data)
    print(clf.cv_results_['mean_test_score'])
    return clf.best_params_['n_estimators'], clf.best_params_['max_depth']
and when running:
#next function
n_estimator, depth = perform_grid_search(X_train, y_train)
c_random_state = 42
print(n_estimator, depth, c_random_state)
this error comes back:
model_selection_search.py:922: UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan] warnings.warn(
and ValueError: could not convert string to float: ''
Please let me know as this has totally broken down my code process!
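No answer appears here, but the traceback itself points at the likely cause: filling NAs only replaces NaN, so empty strings ('') survive it, and the estimator then cannot coerce '' to float during fitting; every fold errors out, which is also why all the mean test scores are nan. A diagnostic sketch, assuming X_train is a pandas DataFrame:
import numpy as np
# locate feature columns that still contain empty strings
bad_cols = X_train.columns[(X_train == '').any()]
print(bad_cols)
# one possible fix: treat '' as missing, then fill and cast to numeric
X_train = X_train.replace('', np.nan).fillna(0).astype(float)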
I'm trying to split a dataframe when NaN rows are found using grps = dfs.isnull().all(axis=1).cumsum().
But this is not working when some of the rows have NaN entry in a single column.
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, np.nan, 2, 3, 1],
}
df = pd.DataFrame(d)
dup = df['t'].diff().lt(0).cumsum()
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
pprint(dfs)
grps = dfs.isnull().all(axis=1).cumsum()
temp = [dfs.dropna() for _, dfs in dfs.groupby(grps)]
i = 0
dfm = pd.DataFrame()
for df in temp:
    df["name"] = f'name{i}'
    i = i + 1
    df = df.append(pd.Series(dtype='object'), ignore_index=True)
    dfm = dfm.append(df, ignore_index=True)
print(dfm)
Input df:
t input type value
0 0.0 2.0 A 0.1
1 1.0 2.0 A 0.2
2 2.0 2.0 A 0.3
NaN NaN NaN NaN
3 0.0 2.0 B NaN
4 2.0 2.0 B 2.0
NaN NaN NaN NaN
5 0.0 2.0 B 3.0
6 1.0 4.0 A 1.0
Output obtained:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 2.0 2.0 B 2.0 name1
5 NaN NaN NaN NaN NaN
6 0.0 2.0 B 3.0 name2
7 1.0 4.0 A 1.0 name2
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Expected:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
I am basically doing this to append names as the last column of the dataframe, after splitting df using
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
and appending NaN rows.
Again, I use the NaN rows to split the df into a list and add the new column, but dfs.isnull().all(axis=1).cumsum() isn't working for me, and I also get an additional NaN row at the end of the output obtained.
Suggestions on how to get the expected output will be really helpful.
Setup
df = pd.DataFrame(d)
print(df)
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
3 0 2 B NaN
4 2 2 B 2.0
5 0 2 B 3.0
6 1 4 A 1.0
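Before simplifying, note the immediate bug: .dropna() defaults to how='any', which is why the partially-NaN row (0, 2, B, NaN) vanishes from your output; dropna(how='all') would drop only the all-NaN separator rows. A quick check (a sketch, reusing your dfs):
print(len(dfs.dropna()))           # drops the partially-NaN row too
print(len(dfs.dropna(how='all')))  # drops only the separator rows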
Simplify your approach
# assign name column before splitting
m = df['t'].diff().lt(0)
df['name'] = 'name' + m.cumsum().astype(str)
# Create null dataframes to concat
nan_rows = pd.DataFrame(index=m[m].index)
last_nan_row = pd.DataFrame(index=df.index[[-1]])
# Concat (separators first) and sort by index: each separator row shares the
# index label of its group's first row, so a stable sort places it just before
# that row; kind='stable' guarantees this ordering
df_out = pd.concat([nan_rows, df, last_nan_row]).sort_index(ignore_index=True, kind='stable')
Result
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
Alternatively, if you still want to start from dfs as constructed above, here is another approach:
dfs = dfs.reset_index(drop=True)  # the separator rows all share the label '', so rebuild the index
m = dfs.isna().all(axis=1)        # True only for the all-NaN separator rows
dfs.loc[~m, 'name'] = 'name' + m.cumsum().astype(str)
I have a small sample, given below, of a very large dataframe.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'Start_x':['Tom', NaN, NaN, NaN,NaN],
'Start_y':[NaN, 'Nick', NaN, NaN, NaN],
'Start_z':[NaN, NaN, 'Alison', NaN, NaN],
'Start_a':[NaN, NaN, NaN, 'Mark',NaN],
'Start_b':[NaN, NaN, NaN, NaN, 'Oliver'],
'Sex': ['Male','Male','Female','Male','Male']}
df = pd.DataFrame(data)
df
The 'Start' columns have to be merged into a single new column, but the 'Sex' column should stay as it is.
Any help is greatly appreciated. Thank you!
One option could be to backfill the Start columns along the rows (axis=1) and then take the first column:
df['New_Column'] = df.filter(like='Start').bfill(axis=1).iloc[:, 0]
df
Start_x Start_y Start_z Start_a Start_b Sex New_Column
0 Tom NaN NaN NaN NaN Male Tom
1 NaN Nick NaN NaN NaN Male Nick
2 NaN NaN Alison NaN NaN Female Alison
3 NaN NaN NaN Mark NaN Male Mark
4 NaN NaN NaN NaN Oliver Male Oliver
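An equivalent, if usually slower, alternative (a sketch, not from the original answer) is a row-wise apply that grabs the first non-null value; it assumes every row has at least one non-null Start entry:
# first non-null 'Start' value per row (row-wise, so slower on large frames)
df['New_Column'] = df.filter(like='Start').apply(lambda row: row.dropna().iloc[0], axis=1)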
I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
[2, np.nan, np.nan, np.nan, np.nan],
[np.nan, 2, np.nan, np.nan, np.nan],
[np.nan, np.nan, 2, np.nan, np.nan],
[np.nan, 2, 2, np.nan, np.nan],
[2, np.nan, 2, np.nan, 2],
[2, np.nan, np.nan, 2, np.nan],
[np.nan, 2, 2, 2, np.nan],
[2, np.nan, np.nan, np.nan, 2]
],
index=list('abcdefgh'), columns=list('ABCDE')
)
df
A B C D E
a 2.0 NaN NaN NaN NaN
b NaN 2.0 NaN NaN NaN
c NaN NaN 2.0 NaN NaN
d NaN 2.0 2.0 NaN NaN
e 2.0 NaN 2.0 NaN 2.0
f 2.0 NaN NaN 2.0 NaN
g NaN 2.0 2.0 2.0 NaN
h 2.0 NaN NaN NaN 2.0
I would like to fill NaNs with 0, but only the single NaN immediately before and the single NaN immediately after each non-NaN value in a row, using pandas.
so my desired output would be the following:
A B C D E
a 2.0 0.0 NaN NaN NaN
b 0.0 2.0 0.0 NaN NaN
c NaN 0.0 2.0 0.0 NaN
d 0.0 2.0 2.0 0.0 NaN
e 2.0 0.0 2.0 0.0 2.0
f 2.0 0.0 0.0 2.0 0.0
g 0.0 2.0 2.0 2.0 0.0
h 2.0 0.0 NaN 0.0 2.0
I know how to do it with for loops, but I was wondering if it is possible do it only with pandas.
Thank you very much for your help!
You can use shift forward and backward along the columns (axis=1) and mask:
cond = (df.notna().shift(axis=1, fill_value=False) # check left
|df.notna().shift(-1, axis=1, fill_value=False) # check right
)&df.isna() # cell is NA
df.mask(cond, 0)
output:
A B C D E
a 2.0 0.0 NaN NaN NaN
b 0.0 2.0 0.0 NaN NaN
c NaN 0.0 2.0 0.0 NaN
d 0.0 2.0 2.0 0.0 NaN
e 2.0 0.0 2.0 0.0 2.0
f 2.0 0.0 0.0 2.0 0.0
g 0.0 2.0 2.0 2.0 0.0
h 2.0 0.0 NaN 0.0 2.0
NB. This transformation is called a binary dilation; you can also use scipy.ndimage.binary_dilation for it. The advantage of this method is that you can use various structuring elements (not only left/right/top/bottom):
import numpy as np
from scipy.ndimage import binary_dilation
# flag cells whose left or right neighbour is non-NA...
struct = np.array([[True, False, True]])
dilated = binary_dilation(df.notna().to_numpy(), structure=struct)
# ...but, as above, only overwrite cells that are themselves NA
df.mask(df.isna() & dilated, 0)
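As a sketch of that flexibility (the cross-shaped element is an assumption, not from the original answer), a structuring element can also dilate into the rows above and below:
# hypothetical cross-shaped element: also flags cells whose vertical
# neighbour is non-NA (reuses the imports from the previous snippet)
struct_cross = np.array([[False, True, False],
                         [True, False, True],
                         [False, True, False]])
df.mask(df.isna() & binary_dilation(df.notna().to_numpy(), structure=struct_cross), 0)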
I'm trying to reshape this sample dataframe from long to wide format, without aggregating any of the data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SubjectID': ['A', 'A', 'A', 'B', 'B', 'C', 'A'],
    'Date': ['2010-03-14', '2010-03-15', '2010-03-16', '2010-03-14',
             '2010-05-15', '2010-03-14', '2010-03-14'],
    'Var1': [1, 12, 4, 7, 90, 1, 9],
    'Var2': [0, 0, 1, 1, 1, 0, 1],
    'Var3': [np.nan, 1, 0, np.nan, 0, 1, np.nan],
})
df['Date'] = pd.to_datetime(df['Date'])
df
Date SubjectID Var1 Var2 Var3
0 2010-03-14 A 1 0 NaN
1 2010-03-15 A 12 0 1.0
2 2010-03-16 A 4 1 0.0
3 2010-03-14 B 7 1 NaN
4 2010-05-15 B 90 1 0.0
5 2010-03-14 C 1 0 1.0
6 2010-03-14 A 9 1 NaN
To get around the duplicate values, I'm grouping by the "Date" column and getting the cumulative count for each value. Then I make a pivot table:
df['idx'] = df.groupby('Date').cumcount()
dfp = df.pivot_table(index = 'SubjectID', columns = 'idx'); dfp
Var1 Var2 Var3
idx 0 1 2 3 0 1 2 3 0 2
SubjectID
A 5.666667 NaN NaN 9.0 0.333333 NaN NaN 1.0 0.5 NaN
B 90.000000 7.0 NaN NaN 1.000000 1.0 NaN NaN 0.0 NaN
C NaN NaN 1.0 NaN NaN NaN 0.0 NaN NaN 1.0
However, I want the idx column index to be the values from the "Date" column, and I don't want to aggregate any data. The expected output is:
Var1_2010-03-14 Var1_2010-03-14 Var1_2010-03-15 Var1_2010-03-16 Var1_2010-05-15 Var2_2010-03-14 Var2_2010-03-14 Var2_2010-03-15 Var2_2010-03-16 Var2_2010-05-15 Var3_2010-03-14 Var3_2010-03-14 Var3_2010-03-15 Var3_2010-03-16 Var3_2010-05-15
SubjectID
A 1 9 12 4 NaN 0 1 0 1.0 NaN NaN NaN 1.0 0.0 NaN
B 7.0 NaN NaN NaN 90 1 NaN NaN NaN 1.0 NaN NaN NaN NaN 0.0
C 1 NaN NaN NaN NaN 0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
How can I do this? Eventually, I'll merge the two column indexes by dfp.columns = [col[0]+ '_' + str(col[1]) for col in dfp.columns].
You are on the correct path:
# group
df['idx'] = df.groupby('Date').cumcount()
# set index and unstack
new = df.set_index(['idx','Date', 'SubjectID']).unstack(level=[0,1])
# drop the 'idx' level from the column MultiIndex
new.columns = new.columns.droplevel(1)
new.columns = [f'{val}_{date}' for val, date in new.columns]
I think this is your expected output.
Using map looks like it will be a little faster:
df['idx'] = df.groupby('Date').cumcount()
df['Date'] = df['Date'].astype(str)
new = df.set_index(['idx','Date', 'SubjectID']).unstack(level=[0,1])
new.columns = new.columns.droplevel(1)
#new.columns = [f'{val}_{date}' for val, date in new.columns]
new.columns = new.columns.map('_'.join)
Here is a 50,000 row test example:
#data
data = pd.DataFrame(pd.date_range('2000-01-01', periods=50000, freq='D'))
data['a'] = list('abcd')*12500
data['b'] = 2
data['c'] = list('ABCD')*12500
data.rename(columns={0:'date'}, inplace=True)
# list comprehension:
%%timeit -r 3 -n 200
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = [f'{x}_{y}' for x,y in new.columns]
# 98.2 ms ± 13.3 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)
# map with join:
%%timeit -r 3 -n 200
data['date'] = data['date'].astype(str)
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = new.columns.map('_'.join)
# 84.6 ms ± 3.87 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)