Add a row to pandas dataframe based on dictionary - python-3.x

Here is my example dataframe row:
A B C D E
I have a dictionary formatted like:
{'foo': ['A', 'B', 'C'], 'bar': ['D', 'E']}
I would like to add a row above my original dataframe so my new dataframe is:
foo foo foo bar bar
A B C D E
I think maybe the df.map function should be able to do it, but I've tried it and can't seem to get the syntax right.

I believe you want to set the column names from a row of the DataFrame, using a dict together with map:
d = {'foo': ['A', 'B', 'C'], 'bar': ['D', 'E']}
# invert the dict: each column label maps to its group key
d1 = {k: grp for grp, labels in d.items() for k in labels}
print (d1)
{'E': 'bar', 'A': 'foo', 'D': 'bar', 'B': 'foo', 'C': 'foo'}
df = pd.DataFrame([list('ABCDE')])
df.columns = df.iloc[0].map(d1).values
print (df)
foo foo foo bar bar
0 A B C D E
If you need to keep the mapped labels as the first row of a one-row DataFrame instead:
df = pd.DataFrame([list('ABCDE')])
df.loc[-1] = df.iloc[0].map(d1)
df = df.sort_index().reset_index(drop=True)
print (df)
0 1 2 3 4
0 foo foo foo bar bar
1 A B C D E
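If you would rather avoid the negative-index trick, a pd.concat sketch (reusing d1 from above) produces the same two-row frame:
df = pd.DataFrame([list('ABCDE')])
header_row = df.iloc[[0]].replace(d1)  # one-row frame with the mapped group labels
df = pd.concat([header_row, df], ignore_index=True)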

Related

Unique values across columns row-wise in pandas with missing values

I have a dataframe like
import pandas as pd
import numpy as np
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']})
I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.
The code I have right now to do this is
def handle_missing(s):
    return np.unique(s[s.notnull()])

def unique_across_rows(data):
    unique_vals = data.apply(handle_missing, axis=1)
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    return merged_vals

df['Combos'] = unique_across_rows(df)
This returns the expected output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
It seems to me that there should be a more vectorized approach that exists within Pandas to do this: how could I do that?
You can try a simple list comprehension, which might be more efficient for larger dataframes:
df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
Or you can wrap the above list comprehension in a more readable function:
def combos():
    for v in df.values:
        unique = set(filter(pd.notnull, v))
        yield '_'.join(sorted(unique))

df['combos'] = list(combos())
Col1 Col2 Col3 combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
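If you want to check the efficiency claim on your own data, an IPython comparison along these lines would do (results will vary with frame size):
In [1]: %timeit ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
In [2]: %timeit df.agg(lambda x: '_'.join(sorted(x.dropna().unique())), axis=1)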
You can also use agg (or apply) on axis=1, as below:
df['Combos'] = df.agg(lambda x: '_'.join(sorted(x.dropna().unique())), axis=1)
print(df)
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Try the following (explained in the inline comments):
df['Combos'] = (df.stack()                              # stack removes the NaN values
                  .sort_values()                        # so we get A_B instead of B_A in the row with index 3
                  .groupby(level=0)                     # group by the original index
                  .agg(lambda x: '_'.join(x.unique()))  # join the unique values
                )
Output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
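One caveat with this approach: groupby(level=0) assumes each row has a unique index label; if your index contains duplicates, reset it first:
df = df.reset_index(drop=True)  # ensure a unique row index before stacking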
Fill the NaN values with a string placeholder '-', build a unique array from the [Col1, Col2, Col3] list, remove the placeholder, and join the unique values.
import pandas as pd
import numpy as np
def unique(list1):
    if '-' in list1:
        list1.remove('-')
    x = np.array(list1)
    return np.unique(x)

df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
                   "Col2": ['A', 'B', 'B', 'A', 'C'],
                   "Col3": ['A', 'B', 'C', 'A', 'C']}).fillna('-')

s = "-"
for key, row in df.iterrows():
    df.loc[key, 'combos'] = s.join(unique([row.Col1, row.Col2, row.Col3]))
print(df.head())
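Note that as written this joins with '-' (giving B-C in row 2), while the question's expected output uses '_'; only the join separator needs to change for that:
s = "_"  # join with underscore to match the question's B_C style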

Concatenate Each cell in column A with Column B in Python DataFrame

I need help concatenating each row of one column with another column of a dataframe.
Input (posted as an image): Col1 = [A, B, C], Col2 = [E, F, G]
Output (also an image): a single column Col3 with every pairwise concatenation, AE through CG
Use itertools.product in a list comprehension:
from itertools import product
L = [''.join(x) for x in product(df['Col1'], df['Col2'])]
# alternative with explicit unpacking
L = [a + b for a, b in product(df['Col1'], df['Col2'])]
df = pd.DataFrame({'Col3':L})
print (df)
Col3
0 AE
1 AF
2 AG
3 BE
4 BF
5 BG
6 CE
7 CF
8 CG
Or a cross-join solution with a helper column a:
df1 = df.assign(a=1)
df1 = df1.merge(df1, on='a')
df = (df1['Col1_x'] + df1['Col2_y']).to_frame('Col3')
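On pandas 1.2+ the helper column can be skipped, since merge supports a built-in cross join; a sketch assuming the same df with Col1 and Col2:
df1 = df.merge(df, how='cross')
df = (df1['Col1_x'] + df1['Col2_y']).to_frame('Col3')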
Remark: it's easier to help if you copy the code for creating the input rather than posting images; for example:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D'],
                   'col2': ['E', 'F', 'G', 'H']})
Solution: the least-effort approach is the itertools library:
from itertools import product
lst1 = ['A', 'B', 'C', 'D']
lst2 = ['E', 'F', 'G', 'H']
reslst = list(product(lst1, lst2))
or, using the dataframe columns directly:
reslst = list(product(df['col1'].values, df['col2'].values))
print(reslst)
Note: the result is a list with n*m entries (the product of the two column lengths), and hence cannot be assigned back to the original dataframe as a column.
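If you do want the pairs in tabular form, you can build a new dataframe from the product instead (a sketch using reslst from above):
res_df = pd.DataFrame(reslst, columns=['col1', 'col2'])
res_df['joined'] = res_df['col1'] + res_df['col2']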

Create a new column in a dataframe, based on Groupby and values in a separate column

I have a df like so:
df = pd.DataFrame({'Info': ['A','B','C', 'D', 'E'], 'Section':['1','1', '2', '2', '3']})
I want to be able to create a new column, like 'Unique_Info', like so:
df = pd.DataFrame({'Info': ['A', 'B', 'C', 'D', 'E'], 'Section': ['1', '1', '2', '2', '3'],
                   'Unique_Info': [['A', 'B'], ['A', 'B'], ['C', 'D'], ['C', 'D'], ['E']]})
So a list is created with all the unique values from the Info column belonging to that Section; e.g. Section=1 gives ['A', 'B'].
I assume groupby is the most convenient way, and I've used the following:
df['Unique_Info'] = df.groupby('Section').agg({'Info':'unique'})
Any ideas where I'm going wrong?
df.groupby().agg returns a Series indexed by the group keys (the Section numbers), not by your original row index. You should use map to assign it back to your dataframe:
s = df.groupby('Section')['Info'].agg('unique')
df['Unique_Info'] = df['Section'].map(s)
Output:
Info Section Unique_Info
0 A 1 [A, B]
1 B 1 [A, B]
2 C 2 [C, D]
3 D 2 [C, D]
4 E 3 [E]
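For reference, a minimal check of why the original attempt misbehaves: agg returns a frame indexed by Section, so direct assignment aligns on the wrong index:
print(df.groupby('Section').agg({'Info': 'unique'}))
#            Info
# Section
# 1        [A, B]
# 2        [C, D]
# 3           [E]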
Use groupby.agg together with df.merge:
In [1531]: grp = df.groupby('Section')['Info'].agg(list).reset_index()
In [1535]: df.merge(grp, on='Section').rename(columns={'Info_y': 'unique'})
Out[1535]:
Info_x Section unique
0 A 1 [A, B]
1 B 1 [A, B]
2 C 2 [C, D]
3 D 2 [C, D]
4 E 3 [E]
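Note that the merge leaves an Info_x suffix on the original column; including it in the rename restores the original name:
df = df.merge(grp, on='Section').rename(columns={'Info_x': 'Info', 'Info_y': 'unique'})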

Change values in pandas dataframe based on values in certain columns

How can I convert this first dataframe to the one below it? Based on different scenarios of the first three columns matching, I want to change the values in the rest of the columns.
import pandas as pd
df = pd.DataFrame([['foo', 'foo', 'bar', 'a', 'b', 'c', 'd'], ['bar', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['spa', 'foo', 'bar', 'a', 'b', 'c', 'd']], columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar a b c d
1 bar foo bar a b c d
2 spa foo bar a b c d
If col1 = col2, I want to change all a's to 2, all b's and c's to 1, and all d's to 0. This is row 1 in my example df.
If col1 = col3, I want to change all a's to 0, all b's and c's to 1, and all d's to 2. This is row 2 in my example df.
If col1 != col2/col3, I want to delete the row and add 1 to a counter so I have a total of deleted rows. This is row 3 in my example df.
So my final dataframe would look like this, with counter = 1:
df = pd.DataFrame([['foo', 'foo', 'bar', '2', '1', '1', '0'], ['bar', 'foo', 'bar', '0', '1', '1', '2']],
                  columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar 2 1 1 0
1 bar foo bar 0 1 1 2
I was reading that using df.iterrows is slow and so there must be a way to do this on the whole df at once, but my original idea was:
for idx, row in df.iterrows():
    if row["col1"] == row["col2"]:
        df.replace(to_replace=['a'], value='2', inplace=True)
        df.replace(to_replace=['b', 'c'], value='1', inplace=True)
        df.replace(to_replace=['d'], value='0', inplace=True)
    elif row["col1"] == row["col3"]:
        df.replace(to_replace=['a'], value='0', inplace=True)
        df.replace(to_replace=['b', 'c'], value='1', inplace=True)
        df.replace(to_replace=['d'], value='2', inplace=True)
    else:
        pass  # delete row, add 1 to counter
The original df is massive, so speed is important to me. I'm hoping it's possible to do the conversions on the whole dataframe without iterrows. Even if it's not possible, I could use help getting the syntax right for the iterrows.
You can remove rows by boolean indexing first:
m1 = df["col1"] == df["col2"]
m2 = df["col1"] == df["col3"]
m = m1 | m2
Get the number of removed rows by inverting the combined mask m with ~ and summing:
counter = (~m).sum()
print (counter)
1
df = df[m].copy()
print (df)
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar a b c d
1 bar foo bar a b c d
and then replace values using a dictionary chosen by condition:
d1 = {'a':2,'b':1,'c':1,'d':0}
d2 = {'a':0,'b':1,'c':1,'d':2}
m1 = df["col1"] == df["col2"]
# replace in all columns except col1, col2 and col3
cols = df.columns.difference(['col1','col2','col3'])
df.loc[m1, cols] = df.loc[m1, cols].replace(d1)
df.loc[~m1, cols] = df.loc[~m1, cols].replace(d2)
print (df)
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar 2 1 1 0
1 bar foo bar 0 1 1 2
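If replace is still slow over many s-columns, a per-column map sketch is usually faster, under the assumption that every value in those columns appears in the dict (map leaves unmapped values as NaN):
for c in cols:
    df.loc[m1, c] = df.loc[m1, c].map(d1)
    df.loc[~m1, c] = df.loc[~m1, c].map(d2)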
Timings:
In [138]: %timeit (jez(df))
872 ms ± 6.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [139]: %timeit (hb(df))
1.33 s ± 9.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Setup:
np.random.seed(456)
a = ['foo','bar', 'spa']
b = list('abcd')
N = 100000
df1 = pd.DataFrame(np.random.choice(a, size=(N, 3))).rename(columns=lambda x: 'col{}'.format(x+1))
df2 = pd.DataFrame(np.random.choice(b, size=(N, 20))).rename(columns=lambda x: 's{}'.format(x+1))
df = df1.join(df2)
#print (df.head())
def jez(df):
    m1 = df["col1"] == df["col2"]
    m2 = df["col1"] == df["col3"]
    m = m1 | m2
    counter = (~m).sum()
    df = df[m].copy()
    d1 = {'a': 2, 'b': 1, 'c': 1, 'd': 0}
    d2 = {'a': 0, 'b': 1, 'c': 1, 'd': 2}
    m1 = df["col1"] == df["col2"]
    cols = df.columns.difference(['col1', 'col2', 'col3'])
    df.loc[m1, cols] = df.loc[m1, cols].replace(d1)
    df.loc[~m1, cols] = df.loc[~m1, cols].replace(d2)
    return df

def hb(df):
    counter = 0
    df[df.col1 == df.col2] = df[df.col1 == df.col2].replace(['a', 'b', 'c', 'd'], [2, 1, 1, 0])
    df[df.col1 == df.col3] = df[df.col1 == df.col3].replace(['a', 'b', 'c', 'd'], [0, 1, 1, 2])
    index_drop = df[(df.col1 != df.col3) & (df.col1 != df.col2)].index
    counter = counter + len(index_drop)
    df = df.drop(index_drop)
    return df
You can use:
import pandas as pd
df = pd.DataFrame([['foo', 'foo', 'bar', 'a', 'b', 'c', 'd'], ['bar', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['spa', 'foo', 'bar', 'a', 'b', 'c', 'd']], columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
counter = 0
# replace values where col1 matches col2 or col3, then drop the remaining rows and count them
df[df.col1 == df.col2] = df[df.col1 == df.col2].replace(['a', 'b', 'c', 'd'], [2, 1, 1, 0])
df[df.col1 == df.col3] = df[df.col1 == df.col3].replace(['a', 'b', 'c', 'd'], [0, 1, 1, 2])
index_drop = df[(df.col1 != df.col3) & (df.col1 != df.col2)].index
counter = counter + len(index_drop)
df = df.drop(index_drop)
print(df)
print(counter)
Output:
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar 2 1 1 0
1 bar foo bar 0 1 1 2
1 # counter

Looking for an analogue to pd.DataFrame.drop_duplicates() where order does not matter

I would like something similar to dropping the duplicates of a DataFrame, but where the columns' order does not matter. What I mean is that the function should consider a row consisting of the entries 'a', 'b' to be identical to a row consisting of the entries 'b', 'a'. For example, given
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])
0 1
0 a b
1 c d
2 a b
3 b a
I would like to obtain:
0 1
0 a b
1 c d
where the preference is for efficiency, as I run this on a huge dataset within a groupby operation.
Call np.sort first, and then drop duplicates.
df[:] = np.sort(df.values, axis=1)
df.drop_duplicates()
0 1
0 a b
1 c d
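An alternative sketch that leaves the stored values untouched is to deduplicate on a frozenset key (caveat: a frozenset also collapses repeated values within a row, which matters if rows can contain duplicates):
key = df.apply(frozenset, axis=1)   # order-insensitive row signature
df_unique = df[~key.duplicated()]   # keep the first occurrence of each signature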
