How to create dict from pandas dataframe with headers as keys and column values as arrays (not lists)?

I have a pandas dataframe of size (3x10000). I need to create a dict such that the keys are column headers and column values are arrays.
I understand there are many options to create such a dict where values are saved as lists. But I could not find a way to have the values as arrays.
Dataframe example:
   A  B  C
0  1  4  5
1  6  3  2
2  8  0  9
Expected output:
{'A': array([1, 6, 8, ...]),
'B': array([4, 3, 0, ...]),
'C': array([5, 2, 9, ...])}

I guess the following does what you need:
>>> import numpy as np
>>> # assuming df is your dataframe
>>> result = {header: np.array(df[header]) for header in df.columns}
>>> result
{'A': array([1, 6, 8]), 'B': array([4, 3, 0]), 'C': array([5, 2, 9])}
pandas added DataFrame.to_numpy in 0.24, and it should be more efficient, so you might want to check it:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
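For instance, a minimal sketch of the same dict comprehension using to_numpy (assuming df is the example dataframe above; Series.to_numpy may avoid the extra copy that np.array makes):
>>> # same result, but each column's underlying array is returned directly
>>> result = {col: df[col].to_numpy() for col in df.columns}
>>> result
{'A': array([1, 6, 8]), 'B': array([4, 3, 0]), 'C': array([5, 2, 9])}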

Related

Update values of non NaN positions in a dataframe column

I want to update the values of non-NaN entries in a dataframe column
import pandas as pd
from pprint import pprint
import numpy as np
d = {
    't': [0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
    'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
The data for updating the value column is in a list
new_value = [10, 15, 1, 18]
I can get a mask of the non-NaN entries in column value:
df["value"].notnull()
I'm not sure how to assign the new values.
Suggestions will be really helpful.
df.loc[df["value"].notna(), 'value'] = new_value
By df["value"].notna() you select the rows where value is not NaN; then you specify the column to assign into (value in this case). It is important that the number of rows selected by the mask matches the number of values in new_value.
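A minimal end-to-end sketch with the df from the question, to show that the shapes line up (the mask selects four rows and new_value has four entries):
>>> df.loc[df["value"].notna(), "value"] = [10, 15, 1, 18]
>>> df["value"].tolist()
[10.0, 15.0, nan, nan, 1.0, 18.0, nan]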
You can first identify the indices that hold NaN values (note that, unlike the one-liner above, this fills the NaN entries rather than updating the non-NaN ones).
import pandas as pd
from pprint import pprint
import numpy as np
d = {
    't': [0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
    'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
print(df)
r, _ = np.where(df.isna())
new_value = [10, 15, 18] # There are only 3 nans
df.loc[r,'value'] = new_value
print(df)
Output:
   t  input type  value
0  0      2    A    0.1
1  1      2    A    0.2
2  2      2    A   10.0
3  0      2    B   15.0
4  2      2    B    2.0
5  0      2    B    3.0
6  1      4    A   18.0
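One caveat: np.where returns positional row indices, while df.loc selects by label; the two coincide here only because df has the default RangeIndex. It also scans the whole frame, which is fine here because only the value column contains NaN. A more defensive sketch restricts the check to that column and maps positions back to labels:
r, _ = np.where(df[['value']].isna())   # positions of NaN in the value column only
df.loc[df.index[r], 'value'] = new_value  # translate positions to index labels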

Edit multiple values with df.at()

Why does
>>> offset = 2
>>> data = {'Value': [7, 9, 21, 22, 23, 100]}
>>> df = pd.DataFrame(data=data)
>>> df.at[:offset, "Value"] = 99
>>> df
   Value
0     99
1     99
2     99
3     22
4     23
5    100
change the values at indices [0, 1, 2]? I would expect only [0, 1] to change, to conform with regular slicing.
Like when I do
>>> arr = [0, 1, 2, 3, 4]
>>> arr[0:2]
[0, 1]
.at behaves like .loc, in that it selects rows/columns by label, and label slicing in pandas is inclusive. Note that .iloc, which slices on integer positions, behaves as you would expect. See this good answer for the motivation.
Also note that the pandas documentation suggests using .at only when selecting or setting a single value. For slices, use .loc instead.
On line 4, when you write :offset (i.e. :2), it means all rows from 0 to 2 inclusive. If you want to change only the 3rd row, change it to 2:2.
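A small sketch of the label/position contrast, on a fresh copy of the df from the question (before the assignment):
>>> df = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})
>>> df.loc[:2, 'Value']   # label slice: inclusive, rows 0, 1 and 2
0     7
1     9
2    21
Name: Value, dtype: int64
>>> df.iloc[:2]           # positional slice: exclusive, rows 0 and 1
   Value
0      7
1      9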

Smallest difference from every row in a dataframe

import pandas as pd

A = [1,3,7]
B = [6,4,8]
C = [2, 2, 8]
datetime = ['2022-01-01', '2022-01-02', '2022-01-03']
df1 = pd.DataFrame({'DATETIME':datetime,'A':A,'B':B, 'C':C })
df1.set_index('DATETIME', inplace = True)
df1
A = [1,3,7,6, 8]
B = [3,8,10,5, 8]
C = [5, 7, 9, 6, 5]
datetime = ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05']
df2 = pd.DataFrame({'DATETIME':datetime,'A':A,'B':B, 'C':C })
df2.set_index('DATETIME', inplace = True)
df2
I want to compare every row of df1 against every row of df2 and, for each row in df1, output the df2 date with the smallest total difference. Take the first row of df1 (2022-01-01), where A=1, B=6, and C=2. Comparing it to df2's 2022-03-01, where A=1, B=3, and C=5, the differences are 1-1=0, 6-3=3, and |2-5|=3, for a total difference of 0+3+3=6. Comparing 2022-01-01 to the rest of df2, 2022-03-01 has the lowest total difference, so I would like that date next to the 2022-01-01 row in df1.
I'm assuming that you want the lowest total absolute difference.
The fastest way is probably to convert the DataFrames to numpy arrays, and use numpy broadcasting to efficiently perform the computations.
# for each row of df1 get the (positional) index of the df2 row corresponding to the lowest total absolute difference
min_idx = abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(axis=-1).argmin(axis=1)
df1['min_diff_date'] = df2.index[min_idx]
Output:
>>> df1
            A  B  C min_diff_date
DATETIME
2022-01-01  1  6  2    2022-03-01
2022-01-02  3  4  2    2022-03-01
2022-01-03  7  8  8    2022-03-03
Steps:
# Each 'block' corresponds to the absolute difference between a row of df1 and all the rows of df2
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy())
array([[[0, 3, 3],
        [2, 2, 5],
        [6, 4, 7],
        [5, 1, 4],
        [7, 2, 3]],

       [[2, 1, 3],
        [0, 4, 5],
        [4, 6, 7],
        [3, 1, 4],
        [5, 4, 3]],

       [[6, 5, 3],
        [4, 0, 1],
        [0, 2, 1],
        [1, 3, 2],
        [1, 0, 3]]])
# sum the absolute differences over the columns of each block
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1)
array([[ 6,  9, 17, 10, 12],
       [ 6,  9, 17,  8, 12],
       [14,  5,  3,  6,  4]])
# for each row of the previous array get the column index of the lowest value
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1).argmin(1)
array([0, 0, 2])
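A slower but arguably more readable pandas-only alternative (a sketch, assuming the frames are small enough that a Python-level loop is acceptable):
df1['min_diff_date'] = [
    (df2 - row).abs().sum(axis=1).idxmin() for _, row in df1.iterrows()
]
Here df2 - row aligns the row (a Series indexed by A, B, C) against df2's columns, and idxmin returns the df2 date with the smallest total absolute difference.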

Store whole dict in one element/cell of Pandas DataFrame?

Is it possible to store a complex dict in one element/cell of a Pandas DataFrame, please?
And later fill the whole column with similarly structured dictionaries?
My Mini-Example
import pandas as pd
import numpy as np
#create an example dict
dict={}
dict['key1']=np.array([[1, 2, 3], [4, 5, 6]])
dict['key2']=np.array([2])
dict['key3']='Mexico'
#create the pd DataFrame
df=pd.DataFrame(index=['0','1'], columns=['A','B'])
df
The following code
df[0,'A']=[dict]
df[1,'A']=[dict]
fails with
ValueError: Length of values (1) does not match length of index (2)
In reality, my dict contains around 20 entries and I don't want to store each entry in a column for the same index. Or will this be the only way, please?
I thought I could create with Pandas some kind of small database.
Thank you for your advice.
You can use .loc to do that:
df.loc['0','A'] = [5,2,3]
df.loc['1','A'] = [dict]
Result:
                                                   A   B
0                                          [5, 2, 3] NaN
1  [{'key1': [[1 2 3], [4 5 6]], 'key2': [2], 'ke... NaN
You can also add new entries (rows):
df.loc['5','A'] = [{'test':'dummy'}]
                                                   A   B
0                                          [5, 2, 3] NaN
1  [{'key1': [[1 2 3], [4 5 6]], 'key2': [2], 'ke... NaN
5                                [{'test': 'dummy'}] NaN
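If you'd rather store the dict itself instead of a one-element list wrapping it, a sketch using .at, which sets exactly one cell, might look like this (assuming an object-dtype column, as in the empty frame above):
df.at['0', 'B'] = {'key3': 'Mexico'}   # the dict is stored as a single object
print(type(df.at['0', 'B']))           # <class 'dict'>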

Function on column from dictionary

I have a df like this:
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
   A  B
0  3  5
1  1  6
2  2  7
3  3  8
And I have a dictionary like this:
{'A': 1, 'B': 2}
Is there a simple way to perform a function (e.g. divide) on df values based on the values from the dictionary?
For example, all values in column A are divided by 1, and all values in column B are divided by 2?
Division by the dictionary works here, because the dict keys match the column names:
d = {'A': 1, 'B': 2}
df1 = df.div(d)
Or:
df1 = df / d
print(df1)
     A    B
0  3.0  2.5
1  1.0  3.0
2  2.0  3.5
3  3.0  4.0
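If your pandas version does not accept a plain dict in div, an equivalent sketch converts it to a Series first (the dict keys become the Series index, which aligns with the column names):
df1 = df.div(pd.Series(d))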
If you want to do it using a for loop, you can try this:
df = pd.DataFrame({'A': [3, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9]})
d = {'A': 1, 'B': 2}  # named d to avoid shadowing the builtin dict
final_dict = {}
for col in df.columns:
    if col in d:  # only divide columns that have a divisor in the dict
        final_dict[col] = [i / d[col] for i in df[col]]
df = pd.DataFrame(final_dict)
