Combine text in dataframe python - python-3.x

Suppose I have this DataFrame:
df = pd.DataFrame({'col1': ['AC1', 'AC2', 'AC3', 'AC4', 'AC5'],
                   'col2': ['A', 'B', 'B', 'A', 'C'],
                   'col3': ['ABC', 'DEF', 'FGH', 'IJK', 'LMN']})
I want to combine the text of 'col3' where the values in 'col2' are duplicated. The result should look like this:
col1 col2 col3
0 AC1 A ABC, IJK
1 AC2 B DEF, FGH
2 AC3 B DEF, FGH
3 AC4 A ABC, IJK
4 AC5 C LMN
I started this exercise by finding the duplicated values in the DataFrame:
col2 = df['col2']
df1 = df[col2.isin(col2[col2.duplicated()])]
Any suggestions on what I should do next?

You can use groupby to build one joined string per col2 value, then map it back onto every row:
a = df.groupby('col2').apply(lambda group: ','.join(group['col3']))
df['col3'] = df['col2'].map(a)
Output
print(df)
col1 col2 col3
0 AC1 A ABC,IJK
1 AC2 B DEF,FGH
2 AC3 B DEF,FGH
3 AC4 A ABC,IJK
4 AC5 C LMN
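A shorter route, assuming the same comma-joined result is wanted, is groupby.transform, which returns a Series already aligned with the original index, so no separate map step is needed:
df['col3'] = df.groupby('col2')['col3'].transform(','.join)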

You might want to leverage the groupby and apply functions in pandas:
df.groupby('col2').apply(lambda group: ','.join(group['col3']))

Related

pd dataframe from lists and dictionary using series

I have a few lists and a dictionary and would like to create a pd DataFrame.
Could someone help me out? I seem to be missing something.
One simple example below:
dict = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using Series I would do it like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and I would have the lists within the df as expected.
For the dict I would do:
df = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
And would expect this result:
col1 col2 col3 col4
1    x    a    1
2    y    b    3
3         c    text1
4
The problem is that, written like this, the first df is overwritten by the second call to pd.DataFrame.
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict into 2 separate lists and just use Series over 4 lists, but I would think there is a better way to go directly from 2 lists and 1 dict, as above, to a single df with 4 columns.
Thanks for the help.
You can also use pd.concat to concatenate the two DataFrames:
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
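Given the inputs above, print(df) should show:
   col1 col2 col3   col4
0     1    x    a      1
1     2    y    b      3
2     3  NaN    c  text1
3     4  NaN  NaN    NaN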
Why not build each column separately via dict.keys() and dict.values() instead of using dict.items()?
df = pd.DataFrame({
    'col1': pd.Series(l1),
    'col2': pd.Series(l3),
    'col3': pd.Series(list(dict.keys())),    # wrap the views in list() so pandas accepts them
    'col4': pd.Series(list(dict.values()))
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, list(dict.keys()), list(dict.values())]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values, start=1)}
df = pd.DataFrame(data)
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can unpack the zipped values of the list generated from d.items() and pass everything to itertools.zip_longest, which fills in missing values so every column matches the length of the longest list:
from itertools import zip_longest
import numpy as np
import pandas as pd

# dict would shadow the built-in type, so use d for the variable
d = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()),
                              fillvalue=np.nan),
                  columns=['col1', 'col2', 'col3', 'col4'])
print (df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN

Unique values across columns row-wise in pandas with missing values

I have a dataframe like
import pandas as pd
import numpy as np
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']})
I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.
The code I have right now to do this is
def handle_missing(s):
    return np.unique(s[s.notnull()])

def unique_across_rows(data):
    unique_vals = data.apply(handle_missing, axis=1)
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    return merged_vals

df['Combos'] = unique_across_rows(df)
This returns the expected output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
It seems to me that there should be a more vectorized approach that exists within Pandas to do this: how could I do that?
You can try a simple list comprehension, which might be more efficient for larger DataFrames:
df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
Or you can wrap the above list comprehension in a more readable function:
def combos():
    for v in df.values:
        unique = set(filter(pd.notnull, v))
        yield '_'.join(sorted(unique))

df['combos'] = list(combos())
Col1 Col2 Col3 combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
You can also use agg/apply on axis=1 like below:
df['Combos'] = df.agg(lambda x: '_'.join(sorted(x.dropna().unique())), axis=1)
print(df)
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Try this (the explanation is in the comments):
df['Combos'] = (df.stack()                              # this removes NaN values
                  .sort_values()                        # so we get A_B instead of B_A in the 3rd row
                  .groupby(level=0)                     # group by the original index
                  .agg(lambda x: '_'.join(x.unique()))  # join the unique values
               )
Output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Fill the NaN values with the string placeholder '-'. Build a unique array from the [Col1, Col2, Col3] list after dropping the placeholder, then join the unique values with '_' (the separator the question expects):
import pandas as pd
import numpy as np

def unique(list1):
    # drop the NaN placeholder before taking the unique values
    list1 = [v for v in list1 if v != '-']
    return np.unique(np.array(list1))

df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
                   "Col2": ['A', 'B', 'B', 'A', 'C'],
                   "Col3": ['A', 'B', 'C', 'A', 'C']}).fillna('-')
s = "_"
for key, row in df.iterrows():
    df.loc[key, 'combos'] = s.join(unique([row.Col1, row.Col2, row.Col3]))
print(df.head())
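Note that the '-' placeholder stays in the frame itself, so this should print something like:
  Col1 Col2 Col3 combos
0    A    A    A      A
1    -    B    B      B
2    B    B    C    B_C
3    B    A    A    A_B
4    C    C    C      C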

Duplicate rows in a dataframe according to a criterion from the table

I have a dataframe like this:
d = {'col1': ['a', 'b'], 'col2': [2, 4]}
df = pd.DataFrame(data=d)
df
>> col1 col2
0 a 2
1 b 4
and I want to duplicate the rows according to col2 and get a table like this:
>> col1 col2
0 a 2
1 a 2
2 b 4
3 b 4
4 b 4
5 b 4
Thanks to everyone for the help!
Here's my solution using some numpy:
import numpy as np
import pandas as pd

numRows = np.sum(df.col2)
blankSpace = np.zeros(numRows).astype(str)
d2 = {'col1': blankSpace, 'col2': blankSpace}
df2 = pd.DataFrame(data=d2)
counter = 0
for i in range(df.shape[0]):
    letter = df.col1[i]
    numRowsForLetter = df.col2[i]
    for j in range(numRowsForLetter):
        df2.at[counter, 'col1'] = letter
        df2.at[counter, 'col2'] = numRowsForLetter
        counter += 1
df2 is your output dataframe!
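A more concise alternative, assuming you want the same repeated rows, is Index.repeat, which repeats each row label col2 times and then selects those rows:
df2 = df.loc[df.index.repeat(df['col2'])].reset_index(drop=True)
reset_index(drop=True) just renumbers the rows 0..5 as in the desired output.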

How can I update the rows of a DataFrame to turn them into dictionaries keyed by the column names?

I have a dataframe like this.
ID Name id2 name2
101 A 1 d_a
103 B 2 d_b
101 A 3 d_c
103 B 4 d_d
and I want the output df to look like this:
ID Name id2 name2
101 A [{'id2':1},{'id2':3}] [{'name2':'d_a'},{'name2':'d_c'}]
103 B [{'id2':2},{'id2':4}] [{'name2':'d_b'},{'name2':'d_d'}]
Use a list comprehension with DataFrame.to_dict, spelling out the 'records' orient (the 'r' shorthand is deprecated). The output below is from a generic two-column frame:
df1 = pd.DataFrame([[df[[x]].to_dict('records') for x in df]], columns=df.columns)
print (df1)
col1 \
0 [{'col1': 1}, {'col1': 2}, {'col1': 3}]
col2
0 [{'col2': 'def'}, {'col2': 'bb'}, {'col2': 'ra'}]
EDIT: Use GroupBy.agg with a lambda function:
cols = ['id2', 'name2']
df2 = df.groupby(['ID', 'Name'])[cols].agg(lambda x: x.to_frame().to_dict('records')).reset_index()
print (df2)
ID Name id2 name2
0 101 A [{'id2': 1}, {'id2': 3}] [{'name2': 'd_a'}, {'name2': 'd_c'}]
1 103 B [{'id2': 2}, {'id2': 4}] [{'name2': 'd_b'}, {'name2': 'd_d'}]
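If agg with a list-returning lambda complains on your pandas version, a sketch of an alternative (assuming the same column names) is to wrap each cell in a single-key dict first and then collect them with GroupBy.agg(list):
tmp = df.copy()
for c in ['id2', 'name2']:
    tmp[c] = [{c: v} for v in tmp[c]]  # one {column: value} dict per cell
df2 = tmp.groupby(['ID', 'Name'], as_index=False).agg(list)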

fill a new column in a pandas dataframe from the value of another dataframe [duplicate]

This question already has an answer here:
Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe
(1 answer)
Closed 4 years ago.
I have two dataframes:
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1]})
col1 col2 col3
0 a c 1
1 b c 2
2 a d 3
3 a d 4
4 b c 5
5 h i 1
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'f'], 'col2': ['c', 'c', 'd', 'k'], 'col3': [12, 23, 45, 78]})
col1 col2 col3
0 a c 12
1 b c 23
2 a d 45
3 f k 78
and I'd like to build a new column in the first one according to the values of col1 and col2 that can be found in the second one. That is, this new one:
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1], 'col4': [12, 23, 45, 45, 23, np.nan]})
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
5 h i 1 NaN
How am I able to do that?
Thanks for your attention :)
Edit: it has been advised to look for the answer in this subject, Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe, but it is not the same question.
Here, not only does the ID not exist, since it is split across col1 and col2, but above all, although it is unique in the second DataFrame, it is not unique in the first one. This is why I think that neither a merge nor a join can be the answer to this.
Edit2: In addition, (col1, col2) pairs of df1 may not be present in df2, in which case NaN is expected in col4, and (col1, col2) pairs of df2 may not be needed in df1. To illustrate these cases, I added some rows to both df1 and df2 to show the worst-case scenario.
You could also use map, like so:
In [130]: cols = ['col1', 'col2']
In [131]: df1['col4'] = df1.set_index(cols).index.map(df2.set_index(cols)['col3'])
In [132]: df1
Out[132]:
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
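Despite the concern in the question, a plain left merge also handles the duplicate (col1, col2) pairs in df1; a minimal sketch, assuming df2's col3 is first renamed to col4 to avoid a name clash:
df1 = df1.merge(df2.rename(columns={'col3': 'col4'}), on=['col1', 'col2'], how='left')
Pairs missing from df2 simply come out as NaN in col4.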
