dataframes list concatenation/merging introduces nan values

dataframes list concatenation/merging introduces nan values - python-3.x

I have these dataframes:
import pandas as pd
import numpy as np
from functools import reduce
a = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'A_val': [0.1, np.nan, 0.3, np.nan, 0.5], 'B_val': [1.233, np.nan, 1.4, np.nan, 1.9]})
b = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'A_val': [np.nan, 0.2, np.nan, 0.4, np.nan], 'B_val': [np.nan, 1.56, np.nan, 1.1, np.nan]})
c = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'C_val': [121, np.nan, 334, np.nan, 555], 'D_val': [10.233, np.nan, 10.4, np.nan, 10.9]})
d = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'C_val': [np.nan, 322, np.nan, 454, np.nan], 'D_val': [np.nan, 10.56, np.nan, 10.1, np.nan]})
I am dropping the nan values:
a.dropna(inplace=True)
b.dropna(inplace=True)
c.dropna(inplace=True)
d.dropna(inplace=True)
And then , I want to merge them and have this result:
id gr_code A_val B_val C_val D_val
1 121 0.1 1.233 121.0 10.233
2 121 0.2 1.56 322 10.56
3 134 0.3 1.400 334.0 10.400
4 155 0.4 1.10 454.0 10.10
5 156 0.5 1.900 555.0 10.900
but whatever I try , it introduces nan values.
For example:
df = pd.concat([a, b, c, d], axis=1)
df = df.loc[:,~df.columns.duplicated()]
gives:
id gr_code A_val B_val C_val D_val
1.0 121.0 0.1 1.233 121.0 10.233
3.0 134.0 0.3 1.400 334.0 10.400
5.0 156.0 0.5 1.900 555.0 10.900
NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN
If I try:
df_list = [a, b, c, d]
df = reduce(lambda left, right: pd.merge(left, right,
on=['id', 'gr_code'],
how='outer'), df_list)
it gives:
id gr_code A_val_x B_val_x A_val_y B_val_y C_val_x D_val_x C_val_y D_val_y
1 121 0.1 1.233 NaN NaN 121.0 10.233 NaN NaN
3 134 0.3 1.400 NaN NaN 334.0 10.400 NaN NaN
5 156 0.5 1.900 NaN NaN 555.0 10.900 NaN NaN
2 121 NaN NaN 0.2 1.56 NaN NaN 322.0 10.56
4 155 NaN NaN 0.4 1.10 NaN NaN 454.0 10.10
more dataframes:
e = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'E_val': [0.11, np.nan, 0.13, np.nan, 0.35], 'F_val': [11.233, np.nan, 11.4, np.nan, 11.9]})
f = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
'E_val': [np.nan, 3222, np.nan, 4541, np.nan], 'F_val': [np.nan, 110.56, np.nan, 101.1, np.nan]})

You can use concat and merge the duplicated columns:
df = (pd.concat([d.set_index(['id', 'gr_code']) for d in df_list], axis=1)
.groupby(level=0, axis=1).first().reset_index()
)
output:
id gr_code A_val B_val C_val D_val
0 1 121 0.1 1.233 121.0 10.233
1 2 121 0.2 1.560 322.0 10.560
2 3 134 0.3 1.400 334.0 10.400
3 4 155 0.4 1.100 454.0 10.100
4 5 156 0.5 1.900 555.0 10.900

Related

How to create box plots from columns of dicts in pandas

I have a dataframe where each row is a dictionary on which I'd like to use seaborn's horizontal box plot.
The x axis should be the float values for each 'dialog'
The y axis should show the 4 different models
There should be a plot for each parts of speech, meaning there should be a graph for 'INTJ', another for 'ADV' and so on.
I'm thinking I'll have to do a pd.melt first to restructure the data first so that the new columns would be 'dialog_num', 'model_type', and 'value' (automatic variable name after doing a melt, but basically the rows of dictionaries).
After that, perhaps break the 'value' variable so that each column is a part of speech ('ADV', 'INTJ', 'VERB', etc.) (this part seems tricky to me). Past this point...do a for loop on all of the columns and apply the horizontal boxplot?
import pandas as pd
pos =\
{'dialog_num': {0: 0, 1: 1, 2: 2},
'model1': {0: {'ADV': 0.072, 'INTJ': 0.03, 'PRON': 0.133, 'VERB': 0.109},
1: {'ADJ': 0.03, 'NOUN': 0.2, 'PRON': 0.13},
2: {'ADV': 0.083, 'PRON': 0.125, 'VERB': 0.0625}},
'model2': {0: {'ADJ': 0.1428, 'ADV': 0.1428, 'AUX': 0.1428, 'INTJ': 0.285},
1: {'ADJ': 0.1, 'DET': 0.1, 'NOUN': 0.1, 'PROPN': 0.1, 'VERB': 0.2},
2: {'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166, 'VERB': 0.3333}},
'model3': {0: {'ADJ': 0.06, 'CCONJ': 0.06, 'NOUN': 0.2, 'PRON': 0.266, 'SPACE': 0.066, 'VERB': 0.333},
1: {'AUX': 0.15, 'PRON': 0.25, 'PUNCT': 0.15, 'VERB': 0.15},
2: {'ADP': 0.125, 'PRON': 0.0625, 'PUNCT': 0.0625, 'VERB': 0.25}},
'model4': {0: {'ADJ': 0.25, 'ADV': 0.08, 'CCONJ': 0.083, 'PRON': 0.166},
1: {'AUX': 0.33, 'PRON': 0.2, 'VERB': 0.0667},
2: {'CCONJ': 0.125, 'NOUN': 0.125, 'PART': 0.125, 'PRON': 0.125, 'SPACE': 0.125, 'VERB': 0.375}}}
df = pd.DataFrame.from_dict(pos)
display(df)
dialog_num model1 model2 model3 model4
0 0 {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109} {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428} {'PRON': 0.266, 'VERB': 0.333, 'ADJ': 0.06, 'NOUN': 0.2, 'CCONJ': 0.06, 'SPACE': 0.066} {'PRON': 0.166, 'ADV': 0.08, 'ADJ': 0.25, 'CCONJ': 0.083}
1 1 {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2} {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1} {'PRON': 0.25, 'AUX': 0.15, 'VERB': 0.15, 'PUNCT': 0.15} {'PRON': 0.2, 'AUX': 0.33, 'VERB': 0.0667}
2 2 {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625} {'VERB': 0.3333, 'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166} {'PRON': 0.0625, 'VERB': 0.25, 'PUNCT': 0.0625, 'ADP': 0.125} {'PRON': 0.125, 'VERB': 0.375, 'PART': 0.125, 'CCONJ': 0.125, 'NOUN': 0.125, 'SPACE': 0.125}

sns.boxplot expects data to be supplied in a long form when specifying x= and y=.
In this case, based on the specifications of having each speech type as a separate plot, sns.catplot will be used because there is a col= parameter, which can be used to create separate plots for speech types.
As mentioned in the OP, use .melt to unpivot the wide dataframe.
.json_normalize can be used to convert the the 'value' column (dict type) into a flat table.
See Split / Explode a column of dictionaries into separate columns with pandas if there are issues with this step.
Join the flattened table (vals) to dfm with .join.
This works because vals and dfm have matching indices.
.melt the dataframe again.
Plot the box plot from the long form dataframe.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
import pandas as pd
import seaborn as sns
# load the dict into a dataframe
df = pd.DataFrame(pos)
# unpivot the dataframe
dfm = df.melt(id_vars='dialog_num', var_name='model')
# convert the 'value' column of dicts to a flat table
vals = pd.json_normalize(dfm['value'])
# combine vals to dfm, without the 'value' column
dfm = dfm.iloc[:, 0:-1].join(vals)
# unpivot the dataframe again
dfm = dfm.melt(id_vars=['dialog_num', 'model'])
plot all of the speech types together
p = sns.boxplot(data=dfm, x='value', y='model')
plot speech types separately
Most speech types have only a single value, or no values.
p = sns.catplot(kind='box', data=dfm, x='value', y='model', col='variable', col_wrap=4, height=4)
DataFrames at each step
1: dfm.head()
dialog_num model value
0 0 model1 {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109}
1 1 model1 {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2}
2 2 model1 {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625}
3 0 model2 {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428}
4 1 model2 {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1}
2: vals.head()
INTJ ADV PRON VERB ADJ NOUN AUX PROPN DET CCONJ SPACE PUNCT ADP PART
0 0.030 0.0720 0.133 0.1090 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 0.130 NaN 0.0300 0.2 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 0.0830 0.125 0.0625 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0.285 0.1428 NaN NaN 0.1428 NaN 0.1428 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN 0.2000 0.1000 0.1 NaN 0.1 0.1 NaN NaN NaN NaN NaN
3: dfm.head()
dialog_num model INTJ ADV PRON VERB ADJ NOUN AUX PROPN DET CCONJ SPACE PUNCT ADP PART
0 0 model1 0.030 0.0720 0.133 0.1090 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 model1 NaN NaN 0.130 NaN 0.0300 0.2 NaN NaN NaN NaN NaN NaN NaN NaN
2 2 model1 NaN 0.0830 0.125 0.0625 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0 model2 0.285 0.1428 NaN NaN 0.1428 NaN 0.1428 NaN NaN NaN NaN NaN NaN NaN
4 1 model2 NaN NaN NaN 0.2000 0.1000 0.1 NaN 0.1 0.1 NaN NaN NaN NaN NaN
4: dfm.head()
dialog_num model variable value
0 0 model1 INTJ 0.030
1 1 model1 INTJ NaN
2 2 model1 INTJ NaN
3 0 model2 INTJ 0.285
4 1 model2 INTJ NaN

How to fill before and after NANs by 0 for just one NAN in pandas?

I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
[2, np.nan, np.nan, np.nan, np.nan],
[np.nan, 2, np.nan, np.nan, np.nan],
[np.nan, np.nan, 2, np.nan, np.nan],
[np.nan, 2, 2, np.nan, np.nan],
[2, np.nan, 2, np.nan, 2],
[2, np.nan, np.nan, 2, np.nan],
[np.nan, 2, 2, 2, np.nan],
[2, np.nan, np.nan, np.nan, 2]
],
index=list('abcdefgh'), columns=list('ABCDE')
)
df
A B C D E
a 2.0 NaN NaN NaN NaN
b NaN 2.0 NaN NaN NaN
c NaN NaN 2.0 NaN NaN
d NaN 2.0 2.0 NaN NaN
e 2.0 NaN 2.0 NaN 2.0
f 2.0 NaN NaN 2.0 NaN
g NaN 2.0 2.0 2.0 NaN
h 2.0 NaN NaN NaN 2.0
I would like to fill NaNs by 0 for each row, before and after there is a non-NaN value, only for one NaN for each side of the non-NaN value with pandas.
so my desired output would be the following:
A B C D E
a 2.0 0.0 NaN NaN NaN
b 0.0 2.0 0.0 NaN NaN
c NaN 0.0 2.0 0.0 NaN
d 0.0 2.0 2.0 0.0 NaN
e 2.0 0.0 2.0 0.0 2.0
f 2.0 0.0 0.0 2.0 0.0
g 0.0 2.0 2.0 2.0 0.0
h 2.0 0.0 NaN 0.0 2.0
I know how to do it with for loops, but I was wondering if it is possible do it only with pandas.
Thank you very much for your help!

You can use shift backward and forward on both axes and mask:
cond = (df.notna().shift(axis=1, fill_value=False) # check left
|df.notna().shift(-1, axis=1, fill_value=False) # check right
)&df.isna() # cell is NA
df.mask(cond, 0)
output:
A B C D E
a 2.0 0.0 NaN NaN NaN
b 0.0 2.0 0.0 NaN NaN
c NaN 0.0 2.0 0.0 NaN
d 0.0 2.0 2.0 0.0 NaN
e 2.0 0.0 2.0 0.0 2.0
f 2.0 0.0 0.0 2.0 0.0
g 0.0 2.0 2.0 2.0 0.0
h 2.0 0.0 NaN 0.0 2.0
NB. This transformation is called a binary dilation, you can also use scipy.ndimage.morphology.binary_dilation for that. The advantage with this method is that you can use various structurating elements (not only Left/Right/Top/Bottom)
import numpy as np
from scipy.ndimage.morphology import binary_dilation
struct = np.array([[True, False, True]])
df.mask(binary_dilation(df.notna(), structure=struct), 0)

Creating a column in pandas dataframe

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ORDER':["A", "A", "A", "A", "B","B"], 'A':[80, 23, np.nan, 60, 1,22], 'B': [80, 55, 5, 76, 67,np.nan]})
df
ORDER A B
0 A 80.0 80.0
1 A 23.0 55.0
2 A NaN 5.0
3 A 60.0 76.0
4 B 1.0 67.0
5 B 22.0 NaN
I want to create a column "new" as below:
If ORDER == 'A', then new=df['A']
If ORDER == 'B', then new=df['B']
This can be achieved using the below code:
df['new'] = np.where(df['ORDER'] == 'A', df['A'], np.nan)
df['new'] = np.where(df['ORDER'] == 'B', df['B'], df['new'])
The tweak here is if ORDER doesnot have the value "B", Then B will not be present in the dataframe.So the dataframe might look like below. And if we use the above code o this dataframe, it will give an error because column "B" is missing from this dataframe.
ORDER A
0 A 80.0
1 A 23.0
2 A NaN
3 A 60.0
4 A 1.0
5 A 22.0

Use DataFrame.lookup, so you dont need to hardcode df['B'], but it looksup the column value:
df['new'] = df.lookup(df.index, df['ORDER'])
ORDER A B new
0 A 80.0 80.0 80.0
1 A 23.0 55.0 23.0
2 A NaN 5.0 NaN
3 A 60.0 76.0 60.0
4 B 1.0 67.0 67.0
5 B 22.0 NaN NaN

Data partition on known and unknown rows

I have a dataset with known and unknown variables (just one column). I'd like to separate rows for 2 lists - First list of rows with all known variables and the Second list of rows with all missed (unknown) variables.
df = {'Id' : [1, 2, 3, 4, 5],
'First' : [30, 22, 18, 49, 22],
'Second' : [80, 28, 16, 56, 30],
'Third' : [14, None, None, 30, 27],
'Fourth' : [14, 85, 17, 22, 14],
'Fifth' : [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns = ['Id', 'First', 'Second', 'Third', 'Fourth'])
df
Two separate lists with all Known variables and another one with Unknown variables

Let me know if this helps :
df['TF']= df.isnull().any(axis=1)
df_without_none = df[df['TF'] == 0]
df_with_none = df[df['TF'] == 1]
print(df_without_none.head())
print(df_with_none.head())
#### Input ####
Id First Second Third Fourth Fruit Total TF
0 1 30 80 14.0 14 124.0 False
1 2 22 28 NaN 85 50.0 True
2 3 18 16 NaN 17 34.0 True
3 4 49 56 30.0 22 135.0 False
4 5 22 30 27.0 14 79.0 False
#### Output ####
Id First Second Third Fourth Fruit Total TF
0 1 30 80 14.0 14 124.0 False
3 4 49 56 30.0 22 135.0 False
4 5 22 30 27.0 14 79.0 False
Id First Second Third Fourth Fruit Total TF
1 2 22 28 NaN 85 50.0 True
2 3 18 16 NaN 17 34.0 True

Shift the values of a column by x rows in Python

I would like to shift the values of one column in a data frame by x rows(hours).
For example, in the following dataframe:
ind = pd.date_range('01 / 01 / 2000', periods=5, freq='12H')
df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
"B": [10, 20, 30, 40, 50],
"C": [11, 22, 33, 44, 55],
"D": [12, 24, 51, 36, 2]},
index=ind)
I would like to shift the values in column A by two hours.
I use the following:
mask = (df.columns.isin(['A']))
cols_to_shift = df.columns[mask]
df[cols_to_shift] = df[cols_to_shift].shift(2,freq='H')
However, all column A's values are filled with NA. I guess it is because the values are shifted to hours that do not exist in the index column.
Is there a way to fix it?
This is the input:
And this is the output:
Thanks

IIUC, you could try assigning your shifted values, then use pandas.concat to extend your original DataFrame. I'm also using DataFrame.sort_index and DataFrame.fillna here to sort the results and deal with NaN:
# Example setup
ind = pd.date_range('01 / 01 / 2000', periods=5, freq='12H')
df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
"B": [10, 20, 30, 40, 50],
"C": [11, 22, 33, 44, 55],
"D": [12, 24, 51, 36, 2]},
index=ind)
mask = (df.columns.isin(['A']))
cols_to_shift = df.columns[mask]
shifted = df[cols_to_shift].shift(2, freq='H')
df[cols_to_shift] = shifted
df = pd.concat([df, shifted]).sort_index().fillna(0)
print(df)
[out]
A B C D
2000-01-01 00:00:00 0.0 10.0 11.0 12.0
2000-01-01 02:00:00 1.0 0.0 0.0 0.0
2000-01-01 12:00:00 0.0 20.0 22.0 24.0
2000-01-01 14:00:00 2.0 0.0 0.0 0.0
2000-01-02 00:00:00 0.0 30.0 33.0 51.0
2000-01-02 02:00:00 3.0 0.0 0.0 0.0
2000-01-02 12:00:00 0.0 40.0 44.0 36.0
2000-01-02 14:00:00 4.0 0.0 0.0 0.0
2000-01-03 00:00:00 0.0 50.0 55.0 2.0
2000-01-03 02:00:00 5.0 0.0 0.0 0.0

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

dataframes list concatenation/merging introduces nan values - python-3.x

Related

How to create box plots from columns of dicts in pandas

How to fill before and after NANs by 0 for just one NAN in pandas?

Creating a column in pandas dataframe

Data partition on known and unknown rows

Shift the values of a column by x rows in Python

Categories

Resources