Proper way to update pandas dataframe with values from another - python-3.x

What is the proper way to update multiple columns in one dataframe with values from another dataframe?
Say I have these two dataframes:
import pandas as pd
df1 = pd.DataFrame([['4', 'val1', 'val2.4', 'val3.4'],
                    ['5', 'val1', 'val2.5', 'val3.5'],
                    ['6', 'val1', 'val2.6', 'val3.6'],
                    ['7', 'val1', 'val2.7', 'val3.7']],
                   columns=['account_id', 'field1', 'field2', 'field3'])
df2 = pd.DataFrame([['6', 'VAL2.6', 'VAL3.6'],
                    ['5', 'VAL2.5', 'VAL3.5']],
                   columns=['account_id', 'field2', 'field3'])
Of note, df2 has only a subset of df1's rows (in some random order) and columns.
I'd like to replace values in df1 with values from df2 where they exist, joining on account_id, like an SQL UPDATE.
One solution is something like
cols_to_update = ['field2', 'field3']
df1.loc[df1.account_id.isin(df2.account_id), cols_to_update] = df2[cols_to_update].values
But that doesn't handle the join and results in
  account_id field1  field2  field3
0          4   val1  val2.4  val3.4
1          5   val1  VAL2.6  VAL3.6
2          6   val1  VAL2.5  VAL3.5
3          7   val1  val2.7  val3.7
where account_id 6 now has the wrong values.
My questions are:
How do I use indexes to make something like that work?
Is there a merge() or join() solution that isn't so tedious with combining duplicate columns?

Sort the values of df2 before assigning, i.e.
cols_to_update = ['field2', 'field3']
df1.loc[df1.account_id.isin(df2.account_id), cols_to_update] = df2.sort_values(['account_id'])[cols_to_update].values
  account_id field1  field2  field3
0          4   val1  val2.4  val3.4
1          5   val1  VAL2.5  VAL3.5
2          6   val1  VAL2.6  VAL3.6
3          7   val1  val2.7  val3.7

I would suggest using pandas' DataFrame.update function:
df = pd.DataFrame({'A': [1, 2, 3],'B': [400, 500, 600]})
new_df = pd.DataFrame({'B': [4, 5, 6],'C': [7, 8, 9]})
df.update(new_df)
df
   A  B
0  1  4
1  2  5
2  3  6
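Note that update() aligns on the index (and on column names), so applied to the original question it can handle the join if account_id is first made the index on both frames. A minimal sketch using the df1/df2 defined above:
# update() matches rows by index label, so index both frames by account_id first
df1 = df1.set_index('account_id')
df1.update(df2.set_index('account_id'))
df1 = df1.reset_index()
Columns present only in the other frame (like C above) are ignored, which is why df keeps only A and B.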

Related

How to map/replace multiple values in a column for each row in pandas dataframe

I have this sample:
col1    result
1       A
1,2,3
2       B
2,3,4
3,4
4       D
1,3,4
3       C
Here's my map variable.
vals_to_replace = {'1':'A', '2':'B', '3':'C' , '4':'D'}
I map this to col1, but only some values show up in the result column; I'm not sure why only the single values got mapped.
Any ideas on how to solve it?
Thanks
Maybe this is what works for you:
import pandas as pd
df = pd.DataFrame({'col1': ['1', '1,2,3', '2', '2,3,4', '3, 4', '4', '1,3,4', '3']})
translation = {'1':'A', '2':'B', '3':'C' , '4':'D'}
df['result'] = df.col1.str.translate(str.maketrans(translation))
print(df)
Result:
    col1 result
0      1      A
1  1,2,3  A,B,C
2      2      B
3  2,3,4  B,C,D
4   3, 4   C, D
5      4      D
6  1,3,4  A,C,D
7      3      C
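str.translate works here because every key in the mapping is a single character. If the codes were longer than one character, a split-and-map approach would be needed instead; a sketch, assuming every token appears in the translation dict:
# Split on commas, map each token, and rejoin; strip() handles entries like '3, 4'
df['result'] = df['col1'].str.split(',').map(
    lambda vals: ','.join(translation[v.strip()] for v in vals))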

Sort values in a data frame with duplicate values

I have a dataframe with a format like this:
d = {'col1': ['PC', 'PO', 'PC', 'XY', 'XY', 'AB', 'AB', 'PC', 'PO'],
     'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
df.sort_values(by='col1')
This sorts col1 alphabetically. Instead, I want to sort the values based on col1 in a custom order of my choosing (e.g. PO, XY, AB, PC), keeping the duplicates.
Any idea?
Thanks in advance!
You can create an order beforehand and then sort values as below.
order = ['PO','XY','AB','PC']
df['col1'] = pd.Categorical(df['col1'], ordered=True, categories=order)
df = df.sort_values(by = 'col1')
df
  col1  col2
1   PO     2
8   PO     9
3   XY     4
4   XY     5
5   AB     6
6   AB     7
0   PC     1
2   PC     3
7   PC     8
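On pandas 1.1+, the key argument of sort_values is an alternative that avoids converting the column to a categorical dtype; a sketch:
# Rank each value by its position in the desired order, then sort by that rank
order_map = {v: i for i, v in enumerate(order)}
df = df.sort_values(by='col1', key=lambda s: s.map(order_map))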

pd dataframe from lists and dictionary using series

I have a few lists and a dictionary and would like to create a pd dataframe.
Could someone help me out? I seem to be missing something.
One simple example below:
dict={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using Series I would do it like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and the lists end up within the df as expected.
For the dict I would do
df = pd.DataFrame(list(d.items()), columns=['col3', 'col4'])
and would expect this result:
col1  col2  col3  col4
1     x     a     1
2     y     b     3
3           c     text1
4
The problem is that, done like this, the first df gets overwritten by the second call to pd.DataFrame.
How would I do this to end up with only one df with 4 columns?
I know one way would be to split the dict into 2 separate lists and just use Series over 4 lists, but I would think there is a better way to go directly from 2 lists and 1 dict to one df with 4 columns.
Thanks for the help.
You can also use pd.concat to concatenate the two dataframes:
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(d.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
Why not build each column separately via d.keys() and d.values() instead of using d.items()?
df = pd.DataFrame({
    'col1': pd.Series(l1),
    'col2': pd.Series(l3),
    'col3': pd.Series(d.keys()),
    'col4': pd.Series(d.values())
})
print(df)
   col1 col2 col3   col4
0     1    x    a      1
1     2    y    b      3
2     3  NaN    c  text1
3     4  NaN  NaN    NaN
Alternatively:
column_values = [l1, l3, d.keys(), d.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
   col0 col1 col2   col3
0     1    x    a      1
1     2    y    b      3
2     3  NaN    c  text1
3     4  NaN  NaN    NaN
You can unpack the zipped values of the list generated from d.items() and pass them to itertools.zip_longest, which fills in missing values so all columns match the maximum list length:
from itertools import zip_longest
import numpy as np

df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()), fillvalue=np.nan),
                  columns=['col1', 'col2', 'col3', 'col4'])
print(df)
   col1 col2 col3   col4
0     1    x    a      1
1     2    y    b      3
2     3  NaN    c  text1
3     4  NaN  NaN    NaN

Concatenating dataframes in Pandas using iteration doesn't work

I have several dataframes indexed more or less by the same MultiIndex (a few values may be missing from each dataframe, but the total number of rows exceeds 70K and there are always fewer than 10 missing values). I want to attach/merge/concatenate a given dataframe (with the same indexation) to all of them. I tried doing this using a for loop over a tuple, as in the example here. However, in the end my data frames do not merge. I provide a simple example where this happens. Why do they not merge?
df1 = pd.DataFrame(np.arange(12).reshape(4, 3), index=["A", "B", "C", "D"],
                   columns=["1st", "2nd", "3rd"])
df2 = df1 + 2
df3 = df1 - 2
for df in (df1, df2):
    df = pd.merge(df, df3, left_index=True, right_index=True, how="inner")
df1, df2
What is your expected result?
In the for loop, df is the loop variable and also the result on the left-hand side of the assignment statement. Here is the same loop with print statements to provide additional information. I think you are over-writing intermediate results.
for df in (df1, df2):
    print(df)
    print('-----')
    df = pd.merge(df, df3, left_index=True, right_index=True, how="inner")
    print(df)
    print('==========', end='\n\n')
print(df)
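Since rebinding df inside the loop never touches df1 or df2, one direct fix is to assign the merged results back to the original names explicitly; a sketch:
# Build the merged frames and rebind df1 and df2 to them
df1, df2 = (pd.merge(d, df3, left_index=True, right_index=True, how="inner")
            for d in (df1, df2))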
You could combine df1, df2 and df3 like this.
print(pd.concat([df1, df2, df3], axis=1))
   1st  2nd  3rd  1st  2nd  3rd  1st  2nd  3rd
A    0    1    2    2    3    4   -2   -1    0
B    3    4    5    5    6    7    1    2    3
C    6    7    8    8    9   10    4    5    6
D    9   10   11   11   12   13    7    8    9
UPDATE
Here is an idiomatic way to import and concatenate several CSV files, possibly in multiple directories. In short: read each file into a separate data frame; add each data frame to a list; concatenate once at the end.
Reference: https://pandas.pydata.org/docs/user_guide/cookbook.html#reading-multiple-files-to-create-a-single-dataframe
import pandas as pd
from pathlib import Path

df = list()
for filename in Path.cwd().rglob('*.csv'):
    with open(filename, 'rt') as handle:
        t = pd.read_csv(handle)
        df.append(t)
        print(filename.name, t.shape)
df = pd.concat(df)
print('\nfinal: ', df.shape)
penny.csv (62, 8)
penny-2020-06-24.csv (144, 9)
...etc
final: (474, 20)
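For what it's worth, the same read-then-concat pattern can be written more compactly, at the cost of the per-file shape printout, since read_csv accepts Path objects directly; a sketch:
# Read every CSV under the current directory and concatenate once
df = pd.concat(map(pd.read_csv, Path.cwd().rglob('*.csv')))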

How to sum columns in Python based on columns with non-empty strings

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3],
    'Sum over columns': [1, 10, 8, 5, 10]})
Hi everybody, could you please help me with the following issue:
I'm trying to sum over columns to get a sum of data1 and data2.
For each row, data1 should count toward the sum only if the string column key1 is not NaN, and data2 only if the string column key2 is not NaN. The result I want is shown in the 'Sum over columns' column. Thank you for your help!
Try using the .apply method of df on axis=1 and numpy's array multiplication function to get your desired output:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3]})

# Select by label (not position) so the result does not depend on column order:
# multiply each data value by whether its key is non-null, then sum the row
df['Sum over columns'] = df.apply(
    lambda x: np.multiply(x[['data1', 'data2']].to_numpy(),
                          x[['key1', 'key2']].notna().to_numpy()).sum(),
    axis=1)
Or, vectorized over the whole frame (using .to_numpy() so the differing column names are not aligned away):
df['Sum over columns'] = np.multiply(
    df[['data1', 'data2']].to_numpy(),
    df[['key1', 'key2']].notna().to_numpy()).sum(axis=1)
Either one of them should yield:
#   key1  data1 key2  data2  Sum over columns
# 0  NaN      2   ab      1                 1
# 1    a      5   aa      5                10
# 2    b      8  NaN      9                 8
# 3    b      5  NaN      6                 5
# 4    a      7  one      3                10
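A more explicit alternative with the same result masks each data column with where() before adding; a sketch:
# Zero out data values whose key is missing, then add the two columns
df['Sum over columns'] = (df['data1'].where(df['key1'].notna(), 0)
                          + df['data2'].where(df['key2'].notna(), 0))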
I hope this helps.
