Conversion of dataframe to required dictionary format - python-3.x

I am trying to convert the below data frame to a dictionary
Dataframe:
import pandas as pd
df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[4,3,5,5,5,3], 'd':[3,4,5,5,7,8]})
print(df)
Sample Dataframe:
a b c d
0 A 1 4 3
1 A 2 3 4
2 B 5 5 5
3 B 5 5 5
4 B 4 5 7
5 C 6 3 8
I need this data frame in the dictionary format below:
[{"a":"A","data_values":[{"b":1,"c":4,"d":3},{"b":2,"c":3,"d":4}]},
{"a":"B","data_values":[{"b":5,"c":5,"d":5},{"b":5,"c":5,"d":5},
{"b":4,"c":5,"d":7}]},{"a":"C","data_values":[{"b":6,"c":3,"d":8}]}]

Use DataFrame.groupby with a custom lambda function to convert the values to dictionaries via DataFrame.to_dict:
L = (df.set_index('a')
.groupby('a')
.apply(lambda x: x.to_dict('records'))
.reset_index(name='data_values')
.to_dict('records')
)
print (L)
[{'a': 'A', 'data_values': [{'b': 1, 'c': 4, 'd': 3},
{'b': 2, 'c': 3, 'd': 4}]},
{'a': 'B', 'data_values': [{'b': 5, 'c': 5, 'd': 5},
{'b': 5, 'c': 5, 'd': 5},
{'b': 4, 'c': 5, 'd': 7}]},
{'a': 'C', 'data_values': [{'b': 6, 'c': 3, 'd': 8}]}]
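The same result can also be built with a plain list comprehension over groupby, which may read more directly (a sketch equivalent to the approach above, not a different method):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [4, 3, 5, 5, 5, 3],
                   'd': [3, 4, 5, 5, 7, 8]})

# groupby yields (key, group) pairs; drop the grouping column and
# convert the remaining columns of each group to a list of dicts
L = [{'a': key, 'data_values': g.drop(columns='a').to_dict('records')}
     for key, g in df.groupby('a')]
```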


Update values of non NaN positions in a dataframe column

I want to update the values of non-NaN entries in a dataframe column
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
The data for updating the value column is in a list
new_value = [10, 15, 1, 18]
I can get a mask of the non-NaN entries in the value column:
df["value"].notnull()
I'm not sure how to assign the new values.
Suggestions will be really helpful.
df.loc[df["value"].notna(), 'value'] = new_value
With df["value"].notna() you select the rows where value is not NaN; then you specify the column (value in this case). It is important that the number of rows selected by the condition matches the number of values in new_value.
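A minimal sketch of that assignment, using a shortened version of the frame (only the t and value columns, since the others do not affect the selection):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    't': [0, 1, 2, 0, 2, 0, 1],
    'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
})
new_value = [10, 15, 1, 18]  # one value per non-NaN row

# notna() builds a boolean mask; .loc assigns new_value positionally
# to exactly the rows where the mask is True
df.loc[df['value'].notna(), 'value'] = new_value
```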
You can first identify the row indices which have NaN values.
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
print(df)
r, _ = np.where(df.isna())
new_value = [10, 15, 18] # There are only 3 nans
df.loc[r,'value'] = new_value
print(df)
Output:
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 10.0
3 0 2 B 15.0
4 2 2 B 2.0
5 0 2 B 3.0
6 1 4 A 18.0

Function on column from dictionary

I have a df like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
'B': [5, 6, 7, 8]})
A B
0 3 5
1 1 6
2 2 7
3 3 8
And I have a dictionary like this:
{'A': 1, 'B': 2}
Is there a simple way to perform a function (e.g. divide) on the df values based on the values from the dictionary?
For example, all values in column A are divided by 1, and all values in column B are divided by 2.
Division by the dictionary works here, because the dict keys match the column names:
d = {'A': 1, 'B': 2}
df1 = df.div(d)
Or:
df1 = df / d
print(df1)
A B
0 3.0 2.5
1 1.0 3.0
2 2.0 3.5
3 3.0 4.0
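Under the hood the division aligns on the column labels; the same thing written out explicitly with a Series (a sketch of the alignment, not a different method):

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2, 3], 'B': [5, 6, 7, 8]})
d = {'A': 1, 'B': 2}

# Wrapping the dict in a Series makes the column alignment explicit:
# each column of df is divided by the value whose index matches its name
df1 = df.div(pd.Series(d), axis=1)
```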
If you want to do it using a for loop, you can try this (note: avoid naming the dictionary dict, which shadows the built-in):
df = pd.DataFrame({'A': [3, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9]})
d = {'A': 1, 'B': 2}
final_dict = {}
for col in df.columns:
    if col in d:  # only divide columns that appear in the dict
        final_dict[col] = [i / d[col] for i in df[col]]
df = pd.DataFrame(final_dict)

How to create dict from pandas dataframe with headers as keys and column values as arrays (not lists)?

I have a pandas dataframe of size (3x10000). I need to create a dict such that the keys are column headers and column values are arrays.
I understand there are many options to create such a dict where values are saved as lists. But I could not find a way to have the values as arrays.
Dataframe example:
A B C
0 1 4 5
1 6 3 2
2 8 0 9
Expected output:
{'A': array([1, 6, 8, ...]),
'B': array([4, 3, 0, ...]),
'C': array([5, 2, 9, ...])}
I guess the following does what you need:
>>> import numpy as np
>>> # assuming df is your dataframe
>>> result = {header: np.array(df[header]) for header in df.columns}
>>> result
{'A': array([1, 6, 8]), 'B': array([4, 3, 0]), 'C': array([5, 2, 9])}
pandas added to_numpy in 0.24, and it should be more efficient, so you might want to check it:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
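The same dict comprehension using to_numpy (a sketch on a small frame matching the example above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 6, 8], 'B': [4, 3, 0], 'C': [5, 2, 9]})

# to_numpy() returns each column's values as a numpy ndarray
result = {col: df[col].to_numpy() for col in df.columns}
```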

Remove exact rows and frequency of rows of a data.frame where certain column values match with column values of another data.frame in python 3

Consider the following two data.frames created using pandas in python 3:
a1 = pd.DataFrame(({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
'A': [1, 2, 3, 4, 5, 2, 4, 2],
'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']}))
a2 = pd.DataFrame(({'NO': ['d9', 'd10', 'd11', 'd12'],
'A': [1, 2, 3, 2],
'B': ['a', 'b', 'c', 'b']}))
I would like to remove the exact rows of a1 that are in a2 wherever the values of columns 'A' and 'B' are the same (ignoring the 'NO' column), so that the result should be:
A B NO
4 d d4
5 e d5
4 d d7
2 b d8
Is there any built-in function in pandas or any other library in python 3 to get this result?
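One way to sketch this with pandas built-ins is a left merge with indicator=True. A plain anti-join would drop every (A, B) pair that appears in a2, so a cumcount rank is added to handle the frequency requirement: only as many duplicates are removed from a1 as occur in a2 (an approach sketch, not the only option):

```python
import pandas as pd

a1 = pd.DataFrame({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
                   'A': [1, 2, 3, 4, 5, 2, 4, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']})
a2 = pd.DataFrame({'NO': ['d9', 'd10', 'd11', 'd12'],
                   'A': [1, 2, 3, 2],
                   'B': ['a', 'b', 'c', 'b']})

# Number repeated (A, B) pairs within each frame: first occurrence 0,
# second occurrence 1, and so on
a1['rank'] = a1.groupby(['A', 'B']).cumcount()
a2['rank'] = a2.groupby(['A', 'B']).cumcount()

# Rows of a1 whose (A, B, rank) has no counterpart in a2 are kept
out = (a1.merge(a2[['A', 'B', 'rank']], on=['A', 'B', 'rank'],
                how='left', indicator=True)
         .query("_merge == 'left_only'")
         .drop(columns=['rank', '_merge']))
```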

mean of all the columns of a pandas dataframe?

I'm trying to calculate the mean of all the columns of a DataFrame, but it looks like having a value in the B column of row 6 prevents pandas from calculating the mean of the C column. Why?
import pandas as pd
from decimal import Decimal
d = [
{'A': 2, 'B': None, 'C': Decimal('628.00')},
{'A': 1, 'B': None, 'C': Decimal('383.00')},
{'A': 3, 'B': None, 'C': Decimal('651.00')},
{'A': 2, 'B': None, 'C': Decimal('575.00')},
{'A': 4, 'B': None, 'C': Decimal('1114.00')},
{'A': 1, 'B': 'TEST', 'C': Decimal('241.00')},
{'A': 2, 'B': None, 'C': Decimal('572.00')},
{'A': 4, 'B': None, 'C': Decimal('609.00')},
{'A': 3, 'B': None, 'C': Decimal('820.00')},
{'A': 5, 'B': None, 'C': Decimal('1223.00')}
]
df = pd.DataFrame(d)
In : df
Out:
A B C
0 2 None 628.00
1 1 None 383.00
2 3 None 651.00
3 2 None 575.00
4 4 None 1114.00
5 1 TEST 241.00
6 2 None 572.00
7 4 None 609.00
8 3 None 820.00
9 5 None 1223.00
Tests:
# no mean for C column
In : df.mean()
Out:
A 2.7
dtype: float64
# mean for C column when row 6 is left out of the DF
In : df.head(5).mean()
Out:
A 2.4
B NaN
C 670.2
dtype: float64
# no mean for C column when row 6 is part of the DF
In : df.head(6).mean()
Out:
A 2.166667
dtype: float64
dtypes:
In : df.dtypes
Out:
A int64
B object
C object
dtype: object
In : df.head(5).dtypes
Out:
A int64
B object
C object
dtype: object
You could use particular columns if you need only columns with numbers:
In [90]: df[['A','C']].mean()
Out[90]:
A 2.7
C 681.6
dtype: float64
or change the type, as @jezrael advises in a comment:
df['C'] = df['C'].astype(float)
df.mean probably tries to convert each object column to numeric; if that conversion fails, it rolls back and calculates the mean only for the columns that hold actual numbers.
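A minimal sketch of the fix: once the Decimal column is cast to float, mean covers it too (a two-row frame for illustration):

```python
import pandas as pd
from decimal import Decimal

df = pd.DataFrame({'A': [2, 1],
                   'B': [None, None],
                   'C': [Decimal('628.00'), Decimal('383.00')]})

# C holds Decimal objects, so its dtype is object and mean() skips it;
# casting to float makes it a regular numeric column
df['C'] = df['C'].astype(float)
means = df.mean(numeric_only=True)
```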
