I have a df like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
A B
0 3 5
1 1 6
2 2 7
3 3 8
And I have a dictionary like this:
{'A': 1, 'B': 2}
Is there a simple way to perform a function (e.g. divide) on the df values based on the values from the dictionary?
For example, all values in column A are divided by 1, and all values in column B are divided by 2.
Division by the dictionary works directly, because the dict keys match the column names:
d = {'A': 1, 'B': 2}
df1 = df.div(d)
Or:
df1 = df / d
print(df1)
A B
0 3.0 2.5
1 1.0 3.0
2 2.0 3.5
3 3.0 4.0
If you want to do it with a for loop you can try this. Note that dict is a bad variable name (it shadows the built-in), and the inner loop over the keys is redundant, since a direct lookup is enough:
df = pd.DataFrame({'A': [3, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9]})
d = {'A': 1, 'B': 2}
final_dict = {}
for col in df.columns:
    if col in d:
        # divide every value in the column by the matching dict value
        final_dict[col] = [i / d[col] for i in df[col]]
df = pd.DataFrame(final_dict)
import pandas as pd

A = [1, 3, 7]
B = [6, 4, 8]
C = [2, 2, 8]
datetime = ['2022-01-01', '2022-01-02', '2022-01-03']
df1 = pd.DataFrame({'DATETIME': datetime, 'A': A, 'B': B, 'C': C})
df1.set_index('DATETIME', inplace=True)
df1
A = [1, 3, 7, 6, 8]
B = [3, 8, 10, 5, 8]
C = [5, 7, 9, 6, 5]
datetime = ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05']
df2 = pd.DataFrame({'DATETIME': datetime, 'A': A, 'B': B, 'C': C})
df2.set_index('DATETIME', inplace=True)
df2
I want to compare every row of df1 to every row of df2 and, for each row in df1, output the date of the closest df2 row. Let's take the first row of df1 (2022-01-01), where A=1, B=6 and C=2. Comparing it to the df2 row 2022-03-01, where A=1, B=3 and C=5, we get differences of 1-1=0, 6-3=3 and |2-5|=3, for a total difference of 0+3+3=6. Comparing 2022-01-01 to the rest of df2, 2022-03-01 has the lowest total difference, so I would like that date next to the row in df1.
I'm assuming that you want the lowest total absolute difference.
The fastest way is probably to convert the DataFrames to numpy arrays, and use numpy broadcasting to efficiently perform the computations.
# for each row of df1 get the (positional) index of the df2 row corresponding to the lowest total absolute difference
min_idx = abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(axis=-1).argmin(axis=1)
df1['min_diff_date'] = df2.index[min_idx]
Output:
>>> df1
A B C min_diff_date
DATETIME
2022-01-01 1 6 2 2022-03-01
2022-01-02 3 4 2 2022-03-01
2022-01-03 7 8 8 2022-03-03
Steps:
# Each 'block' corresponds to the absolute difference between a row of df1 and all the rows of df2
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy())
array([[[0, 3, 3],
[2, 2, 5],
[6, 4, 7],
[5, 1, 4],
[7, 2, 3]],
[[2, 1, 3],
[0, 4, 5],
[4, 6, 7],
[3, 1, 4],
[5, 4, 3]],
[[6, 5, 3],
[4, 0, 1],
[0, 2, 1],
[1, 3, 2],
[1, 0, 3]]])
# sum the absolute differences over the columns of each block
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1)
array([[ 6, 9, 17, 10, 12],
[ 6, 9, 17, 8, 12],
[14, 5, 3, 6, 4]])
# for each row of the previous array get the column index of the lowest value
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1).argmin(1)
array([0, 0, 2])
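For reference, the snippets above assembled into one runnable sketch, reconstructing the question's frames (here with the dates passed directly as the index):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 7], 'B': [6, 4, 8], 'C': [2, 2, 8]},
                   index=['2022-01-01', '2022-01-02', '2022-01-03'])
df2 = pd.DataFrame({'A': [1, 3, 7, 6, 8], 'B': [3, 8, 10, 5, 8], 'C': [5, 7, 9, 6, 5]},
                   index=['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05'])

# (3,1,3) minus (5,3) broadcasts to (3,5,3): one block per df1 row
totals = abs(df1.to_numpy()[:, None] - df2.to_numpy()).sum(axis=-1)  # shape (3, 5)
# for each df1 row, the positional index of the df2 row with the lowest total
min_idx = totals.argmin(axis=1)
df1['min_diff_date'] = df2.index[min_idx]
```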
I am trying to convert the below data frame to a dictionary
Dataframe:
import pandas as pd
df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[4,3,5,5,5,3], 'd':[3,4,5,5,7,8]})
print(df)
Sample Dataframe:
a b c d
0 A 1 4 3
1 A 2 3 4
2 B 5 5 5
3 B 5 5 5
4 B 4 5 7
5 C 6 3 8
I need this data frame in the dictionary format shown below:
[{"a":"A","data_values":[{"b":1,"c":4,"d":3},{"b":2,"c":3,"d":4}]},
{"a":"B","data_values":[{"b":5,"c":5,"d":5},{"b":5,"c":5,"d":5},
{"b":4,"c":5,"d":7}]},{"a":"C","data_values":[{"b":6,"c":3,"d":8}]}]
Use DataFrame.groupby with a custom lambda function that converts the values to dictionaries via DataFrame.to_dict:
L = (df.set_index('a')
.groupby('a')
.apply(lambda x: x.to_dict('records'))
.reset_index(name='data_values')
.to_dict('records')
)
print (L)
[{'a': 'A', 'data_values': [{'b': 1, 'c': 4, 'd': 3},
{'b': 2, 'c': 3, 'd': 4}]},
{'a': 'B', 'data_values': [{'b': 5, 'c': 5, 'd': 5},
{'b': 5, 'c': 5, 'd': 5},
{'b': 4, 'c': 5, 'd': 7}]},
{'a': 'C', 'data_values': [{'b': 6, 'c': 3, 'd': 8}]}]
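An equivalent sketch without set_index, iterating df.groupby('a') directly and building each outer dict by hand (same df as in the question):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [4, 3, 5, 5, 5, 3],
                   'd': [3, 4, 5, 5, 7, 8]})

# to_dict('records') turns each group's remaining columns into a list of row dicts
L = [{'a': key, 'data_values': grp[['b', 'c', 'd']].to_dict('records')}
     for key, grp in df.groupby('a')]
```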
I have a dataframe with a "range" column and some value columns:
In [1]: df = pd.DataFrame({
    "range": [[1,2], [[1,2], [6,11]], [4,5], [[1,3], [5,7], [9, 11]], [9,10], [[5,6], [9,11]]],
    "A": range(1, 7),
    "B": range(6, 0, -1)
})
Out[1]:
range A B
0 [1, 2] 1 6
1 [[1, 2], [6, 11]] 2 5
2 [4, 5] 3 4
3 [[1, 3], [5, 7], [9, 11]] 4 3
4 [9, 10] 5 2
5 [[5, 6], [9, 11]] 6 1
For every row I need to check whether its range is entirely included (with all of its parts) in the range of another row; if so, the other value columns (A and B) are summed up and the longer range is kept. The rows are arbitrarily ordered.
The detailed steps for the example dataframe would look like: row 0 is entirely included in rows 1 and 3; rows 1, 2 and 3 are not entirely included in any other row's range; and row 4 is included in rows 1, 3 and 5, but because row 5 is itself included in row 3, row 4 should only be merged once.
Hence my output dataframe would be:
Out[2]:
range A B
0 [[1, 2], [6, 11]] 8 13
1 [4, 5] 3 4
2 [[1, 3], [5, 7], [9, 11]] 16 12
I thought about sorting the rows first in order to put the longest ranges at the top so it would be easier and more efficient to merge the ranges, but unfortunately I have no idea how to perform this in pandas...
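No answer is included here, but a minimal sketch of one reading of the rules is possible: a row counts as included if every one of its parts fits inside some part of the other row, rows not included anywhere are kept, and each kept row absorbs every row included in it exactly once. The helpers intervals and contained are hypothetical names, not from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "range": [[1,2], [[1,2], [6,11]], [4,5], [[1,3], [5,7], [9, 11]], [9,10], [[5,6], [9,11]]],
    "A": range(1, 7),
    "B": range(6, 0, -1),
})

def intervals(r):
    # normalise a range cell to a list of [lo, hi] pairs
    return r if isinstance(r[0], list) else [r]

def contained(r, s):
    # r is entirely included in s if every part of r fits inside some part of s
    return all(any(slo <= lo and hi <= shi for slo, shi in intervals(s))
               for lo, hi in intervals(r))

rng = df["range"].tolist()
# keep only the rows whose range is not contained in any other row's range
keep = [i for i in df.index
        if not any(j != i and contained(rng[i], rng[j]) for j in df.index)]

rows = []
for i in keep:
    # every other row entirely contained in row i, counted exactly once
    inside = [j for j in df.index if j != i and contained(rng[j], rng[i])]
    rows.append({"range": rng[i],
                 "A": df.loc[i, "A"] + df.loc[inside, "A"].sum(),
                 "B": df.loc[i, "B"] + df.loc[inside, "B"].sum()})
out = pd.DataFrame(rows)
```

This reproduces the expected totals above (e.g. row 4 contributes to both kept rows 1 and 3, but only once each); whether that is the intended handling of the row-4/row-5 overlap is exactly the ambiguity the question raises, so treat the containment rule as an assumption.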
I have Pandas DataFrame that looks like:
id a b c col
1 a 1 2 Null 'aa'
2 a 2 2 3 'aa'
3 b 4 3 1 'bb'
4 c 1 Null 3 'gg'
5 c Null 2 Null 'gg'
I want to groupby the columns to get the following:
id new_col col
1 a [1, 2, 2, 2, 3] 'aa'
2 b [4, 3, 1] 'bb'
3 c [1, 3, 2] 'gg'
Is it possible to do it using pd.groupby?
Thanks
You can use df.melt with groupby + agg (np here is numpy, imported as import numpy as np):
final = (df.replace('Null', np.nan)
           .melt(['id', 'col'], value_name='new_col')
           .groupby('id', as_index=False)
           .agg({'new_col': lambda x: x.dropna().tolist(), 'col': 'first'}))
Or set_index first with stack, then groupby + agg:
final1 = (df.replace('Null', np.nan)
            .set_index(['id', 'col'])
            .stack()
            .rename('new_col')
            .reset_index('col')
            .groupby(level=0)
            .agg({'new_col': list, 'col': 'first'}))
id new_col col
0 a [1, 2, 2, 2, 3] 'aa'
1 b [4, 3, 1] 'bb'
2 c [1, 2, 3] 'gg'
Use GroupBy.apply with DataFrame.stack on all columns not in the specified list, selected via Index.difference:
df = df.replace('Null', np.nan)
c = df.columns.difference(['id','col'])
f = lambda x: x.stack().tolist()
df = df.groupby(['id','col'])[c].apply(f).reset_index(name='new_col')[['id','new_col','col']]
print (df)
id new_col col
0 a [1, 2, 2, 2, 3] 'aa'
1 b [4, 3, 1] 'bb'
2 c [1, 3, 2] 'gg'
An alternative is to collect each row's values into a list column and concatenate the lists per group with sum; note this keeps the 'Null' entries, so replace or drop them first:
df["d"] = df[['a', 'b', 'c']].values.tolist()
dup = df.groupby(['id', 'col'])['d'].sum().reset_index(name='new_col')
I have a pandas dataframe of size (3x10000). I need to create a dict such that the keys are column headers and column values are arrays.
I understand there are many options to create such a dict where values are saved as lists. But I could not find a way to have the values as arrays.
Dataframe example:
A B C
0 1 4 5
1 6 3 2
2 8 0 9
Expected output:
{'A': array([1, 6, 8, ...]),
'B': array([4, 3, 0, ...]),
'C': array([5, 2, 9, ...])}
I guess the following does what you need:
>>> import numpy as np
>>> # assuming df is your dataframe
>>> result = {header: np.array(df[header]) for header in df.columns}
>>> result
{'A': array([1, 6, 8]), 'B': array([4, 3, 0]), 'C': array([5, 2, 9])}
pandas added DataFrame.to_numpy in 0.24, and it should be more efficient, so you might want to check it:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
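A sketch of the same dict comprehension using to_numpy instead of np.array (assuming the question's small frame):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 6, 8], 'B': [4, 3, 0], 'C': [5, 2, 9]})

# Series.to_numpy() (pandas >= 0.24) is the recommended modern spelling
# of np.array(df[header]); each value in the dict is a numpy array
result = {header: df[header].to_numpy() for header in df.columns}
```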