Pandas: categorical column and insertion of rows for every category - python-3.x

I can't seem to insert rows with missing values while keeping one column Categorical.
Assume the following dataframe df, where column B is categorical and categories should appear in the order of 'd', 'b', 'c', 'a'.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['i', 'i', 'i', 'j', 'k'],
                   'B': pd.Categorical(['d', 'c', 'b', 'b', 'a'],
                                       categories=['d', 'b', 'c', 'a'],
                                       ordered=True),
                   'C': [1, 0, 3, 2, np.nan]})
I need to convert df into the following format:
A B C
0 i d 1.0
1 i b 3.0
2 i c 0.0
3 i a NaN
4 j d NaN
5 j b 2.0
6 j c NaN
7 j a NaN
8 k d NaN
9 k b NaN
10 k c NaN
11 k a NaN
Thank you in advance!

You could set the dataframe index to column B; this way we can use reindex later on to fill in the missing categorical values for each group. Group by column A and select column C, then apply reindex as mentioned, using the desired category sequence. Afterwards, use reset_index to move the indices (A and B) back into dataframe columns.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['i', 'i', 'i', 'j', 'k'],
                   'B': pd.Categorical(['d', 'c', 'b', 'b', 'a'],
                                       categories=['d', 'b', 'c', 'a'],
                                       ordered=True),
                   'C': [1, 0, 3, 2, np.nan]})
print(df)

df = df.set_index('B')
df = (df.groupby('A')['C']
        .apply(lambda x: x.reindex(['d', 'b', 'c', 'a']))
        .reset_index())
# restore the ordered categorical dtype on B
df.B = pd.Categorical(df.B, categories=['d', 'b', 'c', 'a'], ordered=True)
print(df)
Output from the final print(df):
A B C
0 i d 1.0
1 i b 3.0
2 i c 0.0
3 i a NaN
4 j d NaN
5 j b 2.0
6 j c NaN
7 j a NaN
8 k d NaN
9 k b NaN
10 k c NaN
11 k a NaN
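As an aside, a minimal sketch of the same reshape without groupby/apply, using pd.MultiIndex.from_product to build the full A x B grid (this assumes the same df and category order as above):
import pandas as pd
import numpy as np

cats = ['d', 'b', 'c', 'a']
df = pd.DataFrame({'A': ['i', 'i', 'i', 'j', 'k'],
                   'B': pd.Categorical(['d', 'c', 'b', 'b', 'a'],
                                       categories=cats, ordered=True),
                   'C': [1, 0, 3, 2, np.nan]})

# every (A, B) combination, in the desired category order
full = pd.MultiIndex.from_product([df['A'].unique(), cats], names=['A', 'B'])
out = (df.assign(B=df['B'].astype(str))  # plain strings sidestep categorical-index quirks
         .set_index(['A', 'B'])
         .reindex(full)
         .reset_index())
out['B'] = pd.Categorical(out['B'], categories=cats, ordered=True)
print(out)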

Related

How to create a weighted edgelist directed from a pandas dataframe with weights in two columns?

I have the following pandas DataFrame (df):
>>> import pandas as pd
>>> df = pd.DataFrame([
['A', 'B', '1'],
['A', 'B', '2'],
['B', 'A', '41'],
['A', 'C', '11'],
['C', 'B', '3'],
['B', 'D', '4'],
['D', 'B','51']
], columns=('station_i', 'station_j','UID'))
I used
>>> df2=df.groupby(by=['station_i', 'station_j']).size().to_frame(name = 'counts_ij').reset_index()
to obtain the dataframe df2:
>>> print(df2)
station_i station_j counts_ij
0 A B 2
1 A C 1
2 B A 1
3 B D 1
4 C B 1
5 D B 1
Now I would like to obtain the dataframe df3, built as shown below, where pairs with the same values but reversed are dropped and counted in an extra column:
>>> print(df3)
station_i station_j counts_ij counts_ji
0 A B 2 1
1 A C 1 0
2 C B 1 0
3 B D 1 1
I would really appreciate some suggestions.
import pandas as pd
import numpy as np

# find the indices that are reverse duplicates
dupe = pd.concat([
    np.maximum(df2.station_i, df2.station_j),
    np.minimum(df2.station_i, df2.station_j)
], axis=1).duplicated()
df2[dupe]
station_i station_j counts_ij
2 B A 1
5 D B 1
df2[~dupe]
station_i station_j counts_ij
0 A B 2
1 A C 1
3 B D 1
4 C B 1
# split on dupe, swap the station columns on the duplicates,
# then merge them back onto the non-duplicates
df2[~dupe].merge(
    df2[dupe].rename(
        columns={'station_i': 'station_j', 'station_j': 'station_i', 'counts_ij': 'counts_ji'}
    ), how='left'
).fillna(0).astype({'counts_ji': int})
station_i station_j counts_ij counts_ji
0 A B 2 1
1 A C 1 0
2 B D 1 1
3 C B 1 0
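For what it's worth, another way to flag the reversed pairs is a canonical sorted key per row; a short sketch assuming the df2 built above:
# the key is identical for (A, B) and (B, A), so the second occurrence
# of an unordered pair shows up as a duplicate
key = df2[['station_i', 'station_j']].apply(lambda r: tuple(sorted(r)), axis=1)
dupe = key.duplicated()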

Unique values across columns row-wise in pandas with missing values

I have a dataframe like
import pandas as pd
import numpy as np
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']})
I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.
The code I have right now to do this is
def handle_missing(s):
    return np.unique(s[s.notnull()])

def unique_across_rows(data):
    unique_vals = data.apply(handle_missing, axis=1)
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    return merged_vals

df['Combos'] = unique_across_rows(df)
This returns the expected output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
It seems to me that there should be a more vectorized approach that exists within Pandas to do this: how could I do that?
You can try a simple list comprehension which might be more efficient for larger dataframes:
df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
Or you can wrap the above list comprehension in a more readable function:
def combos():
    for v in df.values:
        unique = set(filter(pd.notnull, v))
        yield '_'.join(sorted(unique))

df['combos'] = list(combos())
Col1 Col2 Col3 combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
You can also use agg/apply on axis=1 like below:
df['Combos'] = df.agg(lambda x: '_'.join(sorted(x.dropna().unique())), axis=1)
print(df)
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Try this (explanation in the inline comments):
df['Combos'] = (df.stack() # this removes NaN values
.sort_values() # so we have A_B instead of B_A in 3rd row
.groupby(level=0) # group by original index
.agg(lambda x: '_'.join(x.unique())) # join the unique values
)
Output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Fill the NaN values with a string placeholder '-'. Create a unique array from the Col1, Col2, Col3 list and remove the placeholder, then join the unique array values with '_'.
import pandas as pd
import numpy as np

def unique(list1):
    # drop the placeholder that stands in for a missing value
    if '-' in list1:
        list1.remove('-')
    x = np.array(list1)
    return np.unique(x)

df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
                   "Col2": ['A', 'B', 'B', 'A', 'C'],
                   "Col3": ['A', 'B', 'C', 'A', 'C']}).fillna('-')
s = "_"
for key, row in df.iterrows():
    df.loc[key, 'combos'] = s.join(unique([row.Col1, row.Col2, row.Col3]))
print(df.head())

fill values after condition with NaN

I have a df like this:
df = pd.DataFrame(
[
['A', 1],
['A', 1],
['A', 1],
['B', 2],
['B', 0],
['A', 0],
['A', 1],
['B', 1],
['B', 0]
], columns = ['key', 'val'])
df
which prints:
key val
0 A 1
1 A 1
2 A 1
3 B 2
4 B 0
5 A 0
6 A 1
7 B 1
8 B 0
I want to fill the rows after the 2 in the val column with NaN (in the example, all values in the val column from row 3 to 8 are replaced with NaN).
I tried this:
df['val'] = np.where(df['val'].shift(-1) == 2, np.nan, df['val'])
and iterating over rows like this:
for row in df.iterrows():
    df['val'] = np.where(df['val'].shift(-1) == 2, np.nan, df['val'])
but I can't get it to fill NaN forward.
You can use boolean indexing with cummax to fill the NaN values: df['val'].eq(2) flags the rows equal to 2, and cummax() carries True forward from the first match, so every row from that point on is selected:
df.loc[df['val'].eq(2).cummax(), 'val'] = np.nan
Alternatively you can also use Series.mask:
df['val'] = df['val'].mask(lambda x: x.eq(2).cummax())
key val
0 A 1.0
1 A 1.0
2 A 1.0
3 B NaN
4 B NaN
5 A NaN
6 A NaN
7 B NaN
8 B NaN
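To see why this works, inspect the intermediate masks on the original df (before the assignment above):
print(df['val'].eq(2))           # True only on the row holding 2
print(df['val'].eq(2).cummax())  # True from that row onward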
You can try:
ind = df.loc[df['val']==2].index
df.iloc[ind[0]:,1] = np.nan
Once you get the index with df.index[df.val.shift(-1).eq(2)].item(), you can use slicing:
idx = df.index[df.val.shift(-1).eq(2)].item()
df.iloc[idx:, 1] = np.nan
df
key val
0 A 1.0
1 A 1.0
2 A NaN
3 B NaN
4 B NaN
5 A NaN
6 A NaN
7 B NaN
8 B NaN

How can I create a new dataframe by taking the rolling COLUMN total/sum of another dataframe?

import pandas as pd
df = {'a': [1,1,1], 'b': [2,2,2], 'c': [3,3,3], 'd': [4,4,4], 'e': [5,5,5], 'f': [6,6,6], 'g': [7,7,7]}
df1 = pd.DataFrame(df, columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g'])
dg = {'h': [10,10,10], 'i': [14,14,14], 'j': [18,18,18], 'k': [22,22,22]}
df2 = pd.DataFrame(dg, columns = ['h', 'i', 'j', 'k'])
df1
a b c d e f g
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
df1 is my original dataframe. I would like to create another dataframe by summing each set of 4 consecutive columns (a rolling column sum).
df2
h i j k
0 10 14 18 22
1 10 14 18 22
2 10 14 18 22
df2 is the resulting dataframe after adding 4 consecutive columns of df1.
For example: column h in df2 is the sum of columns a, b, c, d in df1; column i in df2 is the sum of columns b, c, d, e in df1; column j in df2 is the sum of columns c, d, e, f in df1; column k in df2 is the sum of columns d, e, f, g in df1.
I could not find any similar question/answer/example like this.
I would appreciate any help.
You can use rolling with a window of 4 along axis=1 and take the sum. Finally, drop the first 3 columns, which are NaN because their windows are incomplete.
df1.rolling(4, axis=1).sum().dropna(axis=1)
d e f g
0 10.0 14.0 18.0 22.0
1 10.0 14.0 18.0 22.0
2 10.0 14.0 18.0 22.0
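Note that recent pandas releases deprecate the axis argument of rolling, so here is a sketch of an equivalent that rolls over the transpose instead (the target column names h, i, j, k are taken from the question):
# roll a window of 4 over the rows of the transpose, then transpose back;
# the first three incomplete windows become all-NaN columns and are dropped
df2 = df1.T.rolling(4).sum().T.dropna(axis=1)
df2.columns = ['h', 'i', 'j', 'k']
print(df2)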

append one dataframe column value to another dataframe

I have two dataframes. df1 is an empty dataframe and df2 has some data, as shown. A few columns are common to both dfs. I want to append df2's column data into df1's columns; df3 is the expected result.
I have referred to Python + Pandas + dataframe : couldn't append one dataframe to another, but it is not working. It gives the following error:
ValueError: Plan shapes are not aligned
df1:
Empty DataFrame
Columns: [a, b, c, d, e]
Index: []
df2:
c e
0 11 55
1 22 66
df3 (expected output):
    a    b    c    d    e
0            11        55
1            22        66
I tried with append but am not getting the desired result.
import pandas as pd

l1 = ['a', 'b', 'c', 'd', 'e']
l2 = []
df1 = pd.DataFrame(l2, columns=l1)
l3 = ['c', 'e']
l4 = [[11, 55],
      [22, 66]]
df2 = pd.DataFrame(l4, columns=l3)
print("concat", "\n", pd.concat([df1, df2]))  # column order of df1 is preserved
print("merge NaN", "\n", pd.merge(df2, df1, how='left', on=l3))  # column order is not preserved
#### Output ####
#concat
a b c d e
0 NaN NaN 11 NaN 55
1 NaN NaN 22 NaN 66
#merge
c e a b d
0 11 55 NaN NaN NaN
1 22 66 NaN NaN NaN
Append seems to work for me. Does this not do what you want?
df1 = pd.DataFrame(columns=['a', 'b', 'c'])
print("df1: ")
print(df1)
df2 = pd.DataFrame(columns=['a', 'c'], data=[[0, 1], [2, 3]])
print("df2:")
print(df2)
print("df1.append(df2):")
print(df1.append(df2, ignore_index=True, sort=False))
Output:
df1:
Empty DataFrame
Columns: [a, b, c]
Index: []
df2:
a c
0 0 1
1 2 3
df1.append(df2):
a b c
0 0 NaN 1
1 2 NaN 3
Have you tried pd.concat?
pd.concat([df1, df2])
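For reference, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the future-proof choice. A minimal sketch with the frames from this question:
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'])
df2 = pd.DataFrame([[11, 55], [22, 66]], columns=['c', 'e'])

# concat aligns on column names: columns missing from df2 are filled
# with NaN, and df1's column order is preserved
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)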
