fill a new column in a pandas dataframe from the value of another dataframe [duplicate] - python-3.x

This question already has an answer here:
Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe
I have two dataframes:
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1]})
col1 col2 col3
0 a c 1
1 b c 2
2 a d 3
3 a d 4
4 b c 5
5 h i 1
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'f'], 'col2': ['c', 'c', 'd', 'k'], 'col3': [12, 23, 45, 78]})
col1 col2 col3
0 a c 12
1 b c 23
2 a d 45
3 f k 78
and I'd like to build a new column in the first one according to the values of col1 and col2 that can be found in the second one. That is, this new one:
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1], 'col4': [12, 23, 45, 45, 23, np.nan]})
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
5 h i 1 NaN
How can I do that?
Thanks for your attention :)
Edit: it has been advised to look for the answer in this subject: Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe. But it is not the same question.
Here, not only does the ID not exist as such (it is split across col1 and col2), but above all, although the (col1, col2) pair is unique in the second dataframe, it is not unique in the first one. This is why I think that neither a merge nor a join can be the answer to this.
Edit2: In addition, (col1, col2) pairs of df1 may not be present in df2, in which case NaN is expected in col4, and (col1, col2) pairs of df2 may not be needed in df1. To illustrate these cases, I added some rows to both df1 and df2 to show how it could look in the worst-case scenario.

You could also use map, like this:
In [130]: cols = ['col1', 'col2']
In [131]: df1['col4'] = df1.set_index(cols).index.map(df2.set_index(cols)['col3'])
In [132]: df1
Out[132]:
col1 col2 col3 col4
0 a c 1 12.0
1 b c 2 23.0
2 a d 3 45.0
3 a d 4 45.0
4 b c 5 23.0
5 h i 1 NaN
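For completeness, a plain left merge also works here despite the duplicated pairs in df1: the (col1, col2) pairs are unique in df2, a left join keeps every row of df1, and unmatched pairs come out as NaN. A minimal sketch, assuming the edited sample frames from the question:
import pandas as pd

df1 = pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1]})
df2 = pd.DataFrame(data={'col1': ['a', 'b', 'a', 'f'], 'col2': ['c', 'c', 'd', 'k'], 'col3': [12, 23, 45, 78]})

# rename df2's col3 to col4 so the merged column arrives under the desired name
df1 = df1.merge(df2.rename(columns={'col3': 'col4'}), on=['col1', 'col2'], how='left')
print(df1)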

Related

pd dataframe from lists and dictionary using series

I have a few lists and a dictionary and would like to create a pd dataframe.
Could someone help me out? I seem to be missing something.
One simple example below:
d = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using Series I would do it like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and would have the lists within the df as expected
and for the dict I would do:
df = pd.DataFrame(list(d.items()), columns=['col3', 'col4'])
And would expect this result:
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
The problem is that, written like this, the first df is overwritten by the second call to pd.DataFrame.
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict into 2 separate lists and just use Series over 4 lists, but I would think there is a better way to build one df with 4 columns directly out of 2 lists and 1 dict as above.
Thanks for the help.
You can also use pd.concat to concatenate the two dataframes.
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(d.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
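With the sample inputs above, the two frames align on the default integer index and the shorter one is padded with NaN, which should give:
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN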
Why not build each column separately via d.keys() and d.values() instead of using d.items()?
df = pd.DataFrame({
    'col1': pd.Series(l1),
    'col2': pd.Series(l3),
    'col3': pd.Series(list(d.keys())),
    'col4': pd.Series(list(d.values()))
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, list(d.keys()), list(d.values())]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values, start=1)}
df = pd.DataFrame(data)
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can unpack the zipped values of the list generated from d.items() and pass them to itertools.zip_longest, which pads the shorter inputs with a fill value so every column matches the maximum length:
from itertools import zip_longest

import numpy as np
import pandas as pd

# 'dict' is a Python built-in name, so use d for the variable
d = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()), fillvalue=np.nan),
                  columns=['col1', 'col2', 'col3', 'col4'])
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN

Unique values across columns row-wise in pandas with missing values

I have a dataframe like
import pandas as pd
import numpy as np
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']})
I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.
The code I have right now to do this is
def handle_missing(s):
    return np.unique(s[s.notnull()])

def unique_across_rows(data):
    unique_vals = data.apply(handle_missing, axis=1)
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    return merged_vals
df['Combos'] = unique_across_rows(df)
This returns the expected output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
It seems to me that there should be a more vectorized approach that exists within Pandas to do this: how could I do that?
You can try a simple list comprehension which might be more efficient for larger dataframes:
df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
Or you can wrap the above list comprehension in a more readable function:
def combos():
    for v in df.values:
        unique = set(filter(pd.notnull, v))
        yield '_'.join(sorted(unique))

df['combos'] = list(combos())
Col1 Col2 Col3 combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
You can also use agg/apply on axis=1 like below:
df['Combos'] = df.agg(lambda x: '_'.join(sorted(x.dropna().unique())),axis=1)
print(df)
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Try this (the inline comments explain each step):
df['Combos'] = (df.stack()                              # stack drops the NaN values
                  .sort_values()                        # so we get A_B instead of B_A in row 3
                  .groupby(level=0)                     # group by the original index
                  .agg(lambda x: '_'.join(x.unique()))  # join the unique values
               )
Output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Fill the NaN values with a string placeholder '-'. Create a unique array from the [Col1, Col2, Col3] list and remove the placeholder, then join the unique array values with '_'.
import pandas as pd
import numpy as np

def unique(list1):
    if '-' in list1:
        list1.remove('-')
    x = np.array(list1)
    return np.unique(x)

df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
                   "Col2": ['A', 'B', 'B', 'A', 'C'],
                   "Col3": ['A', 'B', 'C', 'A', 'C']}).fillna('-')
s = '_'
for key, row in df.iterrows():
    df.loc[key, 'combos'] = s.join(unique([row.Col1, row.Col2, row.Col3]))
print(df.head())

Using groupby and filters on a dataframe

I have a dataframe with both string and integer values.
Here is a sample data dictionary to illustrate the dataframe that I have:
data = {
    'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
    'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
    'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
    'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
    'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}
I need to extract data as follows:
Max value from col4
Grouped by col1
Rows filtered out of the result if col3 is Y
Rows filtered on col5 to show only values not more than 5.
So I tried a few things and faced the following problems.
1 - I used the following method to find the max value in the data, but I am not able to find the max value within each group:
print(dataframe['col4'].max())  # this worked to get one max value
print(dataframe.groupby('col1').max())  # this doesn't work
The second one doesn't work for me, as it returns the maximum of col2 as well; I need the result to carry the col2 value of the max row within each group.
2 - I am not able to apply a filter on both col3 (str) and col5 (int) in one command. Any way to do that?
print(dataframe[dataframe['col3'] != 'Y' & dataframe['col5'] < 6])  # generates an error
The output that I am expecting through this is:
col1 col2 col3 col4 col5
0 A 10 X 45 3
3 B 10 X 56 4
6 C 10 X 87 4
10 D 20 X 43 4
#
# 78 is max in group A, but ignored as col5 is 6 (we need < 6)
# Similarly, 89 is max in group D, but ignored as col3 is Y.
I apologize if I am doing something wrong. I am quite new to this.
Thank you.
I'm not a python developer, but in my opinion you are doing it the wrong way.
You should have a list of structures instead of a structure of lists.
Then you can start working on such a list.
This is an example solution, so it could probably be done in a much smoother way:
data = {
    'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
    'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
    'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
    'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
    'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}
newData = []
for i in range(len(data['col1'])):
    newData.append({'col1': data['col1'][i], 'col2': data['col2'][i], 'col3': data['col3'][i], 'col4': data['col4'][i], 'col5': data['col5'][i]})
withoutY = list(filter(lambda d: d['col3'] != 'Y', newData))
lessThen5 = list(filter(lambda d: d['col5'] < 5, withoutY))
values = set(map(lambda d: d['col1'], lessThen5))
groupped = [[d1 for d1 in lessThen5 if d1['col1']==d2] for d2 in values]
result = []
for i in range(len(groupped)):
    result.append(max(groupped[i], key=lambda g: g['col4']))
sortedResult = sorted(result, key = lambda r: r['col1'])
print (sortedResult)
result:
[
{'col1': 'A', 'col2': 10, 'col3': 'X', 'col4': 45, 'col5': 3},
{'col1': 'B', 'col2': 10, 'col3': 'X', 'col4': 56, 'col5': 4},
{'col1': 'C', 'col2': 10, 'col3': 'X', 'col4': 87, 'col5': 4},
{'col1': 'D', 'col2': 20, 'col3': 'X', 'col4': 43, 'col5': 4}
]
OK, I didn't actually notice.
So I tried something like this:
# fd is the filtered data
fd = data.query('col3 != "Y"').query('col5 < 6')
# or fd = data[(data.col3 != 'Y') & (data.col5 < 6)]
# m is the max of col4 grouped by col1
m = fd.groupby('col1')['col4'].max()
This will group by col1 and get the max of col4, but the result has only 2 columns (col1 and col4).
I don't know exactly what you want to achieve. If you want the whole rows, here is the code:
result = fd[fd['col4'] == fd['col1'].map(m)]
You need to be careful, because you will not always get exactly one row per col1 value.
E.g., for this data:
data = pd.DataFrame({
'col1': ['A','A','A','A','B','B','B','B','C','C','C','D','D','D'],
'col2': [20,10,20,30,10,20,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,45,23,78,45,56,12,34,87,54,43,89,43,12],
'col5': [1,3,4,6,1,4,3,2,4,3,5,3,4,6]})
Result will be:
col1 col2 col3 col4 col5
0 A 20 X 45 1
1 A 10 X 45 3
5 B 20 X 56 4
8 C 10 X 87 4
12 D 20 X 43 4
Additionally, if you want to keep the full original index (with non-matching rows as NaN) instead of the gaps like 0, 1, 5, 8, 12, you could use where instead of query.
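For reference, a minimal idiomatic-pandas sketch combining both steps, assuming the sample data from the question. Note that & binds tighter than comparison operators, which is why each condition needs its own parentheses, and idxmax returns the row label of each group's maximum, so the whole row is kept:
import pandas as pd

df = pd.DataFrame({
    'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
    'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
    'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
    'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
    'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
})

# parenthesize each condition: & binds tighter than != and <
fd = df[(df['col3'] != 'Y') & (df['col5'] < 6)]
# idxmax gives the row label of the max col4 within each group,
# so col2/col3/col5 of that row are preserved
result = fd.loc[fd.groupby('col1')['col4'].idxmax()]
print(result)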

How can I create a new dataframe by taking the rolling COLUMN total/sum of another dataframe?

import pandas as pd
data = {'a': [1,1,1], 'b': [2,2,2], 'c': [3,3,3], 'd': [4,4,4], 'e': [5,5,5], 'f': [6,6,6], 'g': [7,7,7]}
df1 = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
dg = {'h': [10,10,10], 'i': [14,14,14], 'j': [18,18,18], 'k': [22,22,22]}
df2 = pd.DataFrame(dg, columns = ['h', 'i', 'j', 'k'])
df1
a b c d e f g
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
df1 is my original data frame. I would like to create another dataframe by adding each consecutive 4 columns (rolling column sum).
df2
h i j k
0 10 14 18 22
1 10 14 18 22
2 10 14 18 22
df2 is the resulting dataframe after adding 4 consecutive columns of df1.
For example: column h in df2 is the sum of columns a, b, c, d in df1; column i in df2 is the sum of columns b, c, d, e in df1; column j in df2 is the sum of columns c, d, e, f in df1; column k in df2 is the sum of columns d, e, f, g in df1.
I could not find any similar question/answer/example like this.
I would appreciate any help.
You can use rolling over a window of 4 columns and take the sum. Finally, drop the first 3 columns, which are NaN because the window is not yet full:
df1.rolling(4, axis=1).sum().dropna(axis=1)
d e f g
0 10.0 14.0 18.0 22.0
1 10.0 14.0 18.0 22.0
2 10.0 14.0 18.0 22.0
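Note that this keeps the labels from the right edge of each window (d through g) rather than the h through k you wanted, and rolling(..., axis=1) is deprecated in recent pandas releases. A minimal sketch of both adjustments, rolling over a transpose instead (assuming the df1 above):
import pandas as pd

df1 = pd.DataFrame({'a': [1,1,1], 'b': [2,2,2], 'c': [3,3,3], 'd': [4,4,4],
                    'e': [5,5,5], 'f': [6,6,6], 'g': [7,7,7]})
# roll down the transposed frame, transpose back, and drop the first
# three columns, which are all-NaN because the window was not yet full
df2 = df1.T.rolling(4).sum().T.dropna(axis=1)
df2.columns = ['h', 'i', 'j', 'k']  # relabel to the desired names
print(df2)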

Combine text in dataframe python

Suppose I have this DataFrame:
df = pd.DataFrame({'col1': ['AC1', 'AC2', 'AC3', 'AC4', 'AC5'],
'col2': ['A', 'B', 'B', 'A', 'C'],
'col3': ['ABC', 'DEF', 'FGH', 'IJK', 'LMN']})
I want to combine the text of col3 if the values in col2 are duplicated. The result should be like this:
col1 col2 col3
0 AC1 A ABC, IJK
1 AC2 B DEF, FGH
2 AC3 B DEF, FGH
3 AC4 A ABC, IJK
4 AC5 C LMN
I started this exercise by finding the duplicated values in this dataframe:
col2 = df['col2']
df1 = df[col2.isin(col2[col2.duplicated()])]
Any suggestions on what I should do next?
You can use
a = df.groupby('col2').apply(lambda group: ','.join(group['col3']))
df['col3'] = df['col2'].map(a)
Output
print(df)
col1 col2 col3
0 AC1 A ABC,IJK
1 AC2 B DEF,FGH
2 AC3 B DEF,FGH
3 AC4 A ABC,IJK
4 AC5 C LMN
You might want to leverage the groupby and apply functions in Pandas
df.groupby('col2').apply(lambda group: ','.join(group['col3']))
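A shorter variant, sketched here with the same sample df, is transform, which broadcasts the joined string back to every row of its group in a single step (using ', ' to match the separator in the expected output):
import pandas as pd

df = pd.DataFrame({'col1': ['AC1', 'AC2', 'AC3', 'AC4', 'AC5'],
                   'col2': ['A', 'B', 'B', 'A', 'C'],
                   'col3': ['ABC', 'DEF', 'FGH', 'IJK', 'LMN']})
# transform returns one value per input row, aligned with df's index
df['col3'] = df.groupby('col2')['col3'].transform(', '.join)
print(df)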
