mean of all the columns of a pandas DataFrame? - python-3.x

I'm trying to calculate the mean of all the columns of a DataFrame, but it looks like having a value in the B column of row 6 prevents the mean from being calculated on the C column. Why?
import pandas as pd
from decimal import Decimal
d = [
    {'A': 2, 'B': None, 'C': Decimal('628.00')},
    {'A': 1, 'B': None, 'C': Decimal('383.00')},
    {'A': 3, 'B': None, 'C': Decimal('651.00')},
    {'A': 2, 'B': None, 'C': Decimal('575.00')},
    {'A': 4, 'B': None, 'C': Decimal('1114.00')},
    {'A': 1, 'B': 'TEST', 'C': Decimal('241.00')},
    {'A': 2, 'B': None, 'C': Decimal('572.00')},
    {'A': 4, 'B': None, 'C': Decimal('609.00')},
    {'A': 3, 'B': None, 'C': Decimal('820.00')},
    {'A': 5, 'B': None, 'C': Decimal('1223.00')}
]
df = pd.DataFrame(d)
In : df
Out:
A B C
0 2 None 628.00
1 1 None 383.00
2 3 None 651.00
3 2 None 575.00
4 4 None 1114.00
5 1 TEST 241.00
6 2 None 572.00
7 4 None 609.00
8 3 None 820.00
9 5 None 1223.00
Tests:
# no mean for C column
In : df.mean()
Out:
A 2.7
dtype: float64
# mean for C column when row 6 is left out of the DF
In : df.head(5).mean()
Out:
A 2.4
B NaN
C 670.2
dtype: float64
# no mean for C column when row 6 is part of the DF
In : df.head(6).mean()
Out:
A 2.166667
dtype: float64
dtypes:
In : df.dtypes
Out:
A int64
B object
C object
dtype: object
In : df.head(5).dtypes
Out:
A int64
B object
C object
dtype: object

You could select particular columns if you only need the numeric ones:
In [90]: df[['A','C']].mean()
Out[90]:
A 2.7
C 681.6
dtype: float64
or change the type, as jezrael advises in the comments:
df['C'] = df['C'].astype(float)
df.mean probably tries to convert each object column to numeric; when that conversion fails, it falls back to computing the mean only for the columns that are already numeric.
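That explanation can be checked by converting the C column to float first; a minimal sketch using a shortened version of the data above:

```python
import pandas as pd
from decimal import Decimal

df = pd.DataFrame([
    {'A': 2, 'B': None, 'C': Decimal('628.00')},
    {'A': 1, 'B': None, 'C': Decimal('383.00')},
    {'A': 1, 'B': 'TEST', 'C': Decimal('241.00')},
])

# C holds Decimal objects (dtype object); cast it to float so mean() can use it.
df['C'] = df['C'].astype(float)
m = df.mean(numeric_only=True)  # numeric_only skips the object column B
print(m)
```

With the cast in place, the mean of C is computed regardless of the string value in B.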

Related

Update values of non NaN positions in a dataframe column

I want to update the values of non-NaN entries in a dataframe column
import pandas as pd
from pprint import pprint
import numpy as np
d = {
    't': [0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
    'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
The data for updating the value column is in a list
new_value = [10, 15, 1, 18]
I could get the non-NaN entries in column value
df["value"].notnull()
I'm not sure how to assign the new values.
Suggestions will be really helpful.
df.loc[df["value"].notna(), 'value'] = new_value
With df["value"].notna() you select the rows where value is not NaN; then you specify the column to assign to (value in this case). It is important that the number of rows selected by the condition matches the number of values in new_value.
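Put together as a runnable sketch of that approach:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    't': [0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
    'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
})

# Rows 0, 1, 4 and 5 are the non-NaN entries, so new_value needs four items.
new_value = [10, 15, 1, 18]
df.loc[df['value'].notna(), 'value'] = new_value
print(df['value'].tolist())
```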
You can first identify the indices that have NaN values.
import pandas as pd
from pprint import pprint
import numpy as np
d = {
    't': [0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
    'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
print(df)
r, _ = np.where(df.isna())  # row indices of the NaN cells (only 'value' has NaN here)
new_value = [10, 15, 18]  # there are only 3 NaNs
df.loc[r, 'value'] = new_value
print(df)
Output:
   t  input type  value
0  0      2    A    0.1
1  1      2    A    0.2
2  2      2    A   10.0
3  0      2    B   15.0
4  2      2    B    2.0
5  0      2    B    3.0
6  1      4    A   18.0

Conversion of dataframe to required dictionary format

I am trying to convert the below data frame to a dictionary
Dataframe:
import pandas as pd
df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[4,3,5,5,5,3], 'd':[3,4,5,5,7,8]})
print(df)
Sample Dataframe:
a b c d
0 A 1 4 3
1 A 2 3 4
2 B 5 5 5
3 B 5 5 5
4 B 4 5 7
5 C 6 3 8
I need this data frame in the dictionary format below:
[{"a":"A","data_values":[{"b":1,"c":4,"d":3},{"b":2,"c":3,"d":4}]},
{"a":"B","data_values":[{"b":5,"c":5,"d":5},{"b":5,"c":5,"d":5},
{"b":4,"c":5,"d":7}]},{"a":"C","data_values":[{"b":6,"c":3,"d":8}]}]
Use DataFrame.groupby with a custom lambda function to convert the values to dictionaries via DataFrame.to_dict:
L = (df.set_index('a')
       .groupby('a')
       .apply(lambda x: x.to_dict('records'))
       .reset_index(name='data_values')
       .to_dict('records'))
print(L)
[{'a': 'A', 'data_values': [{'b': 1, 'c': 4, 'd': 3},
{'b': 2, 'c': 3, 'd': 4}]},
{'a': 'B', 'data_values': [{'b': 5, 'c': 5, 'd': 5},
{'b': 5, 'c': 5, 'd': 5},
{'b': 4, 'c': 5, 'd': 7}]},
{'a': 'C', 'data_values': [{'b': 6, 'c': 3, 'd': 8}]}]
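An equivalent sketch without set_index, looping over the groups directly (assumes the default sorted group order):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [4, 3, 5, 5, 5, 3],
                   'd': [3, 4, 5, 5, 7, 8]})

# For each group, drop the key column and turn the remaining rows into records.
L = [{'a': key, 'data_values': g.drop(columns='a').to_dict('records')}
     for key, g in df.groupby('a')]
print(L)
```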

Function on column from dictionary

I have a df like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
A B
0 3 5
1 1 6
2 2 7
3 3 8
And I have a dictionary like this:
{'A': 1, 'B': 2}
Is there a simple way to perform a function (e.g. divide) on the df values based on the values from the dictionary?
For example, all values in column A are divided by 1, and all values in column B are divided by 2?
Division by a dictionary works here, because the dict keys match the column names:
d = {'A': 1, 'B': 2}
df1 = df.div(d)
Or:
df1 = df / d
print(df1)
A B
0 3.0 2.5
1 1.0 3.0
2 2.0 3.5
3 3.0 4.0
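If your pandas version does not accept a plain dict in div, wrapping it in a Series and dividing column-wise should behave the same; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
d = {'A': 1, 'B': 2}

# The Series index (the dict keys) aligns with the column names.
df1 = df.div(pd.Series(d), axis=1)
print(df1)
```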
If you want to do it with a for loop, you can try this:
df = pd.DataFrame({'A': [3, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9]})
d = {'A': 1, 'B': 2}  # renamed from 'dict' to avoid shadowing the built-in
final_dict = {}
for col in df.columns:
    if col in d:
        final_dict[col] = [i / d[col] for i in df[col]]
df = pd.DataFrame(final_dict)

How to create dict from pandas dataframe with headers as keys and column values as arrays (not lists)?

I have a pandas dataframe of size (3x10000). I need to create a dict such that the keys are column headers and column values are arrays.
I understand there are many options to create such a dict where values are saved as lists. But I could not find a way to have the values as arrays.
Dataframe example:
A B C
0 1 4 5
1 6 3 2
2 8 0 9
Expected output:
{'A': array([1, 6, 8, ...]),
'B': array([4, 3, 0, ...]),
'C': array([5, 2, 9, ...])}
I guess the following does what you need:
>>> import numpy as np
>>> # assuming df is your dataframe
>>> result = {header: np.array(df[header]) for header in df.columns}
>>> result
{'A': array([1, 6, 8]), 'B': array([4, 3, 0]), 'C': array([5, 2, 9])}
pandas added to_numpy in 0.24, and it should be more efficient, so you might want to check it:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
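With to_numpy, the dict comprehension above could be written as (a sketch, assuming pandas >= 0.24):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 6, 8], 'B': [4, 3, 0], 'C': [5, 2, 9]})

# Series.to_numpy() returns the column's values as a NumPy array.
result = {col: df[col].to_numpy() for col in df.columns}
print(result)
```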

pandas: convert multiple columns to string

I have some columns ['a', 'b', 'c', etc.] (a and c are float64 while b is object)
I would like to convert all columns to string and preserve nans.
Tried using df[['a', 'b', 'c']] = df[['a', 'b', 'c']].astype(str) but that left blanks for the float64 columns.
Currently I am going through one by one with the following:
df['a'] = df['a'].apply(str)
df['a'] = df['a'].replace('nan', np.nan)
Is the best way to use .astype(str) and then replace '' with np.nan? Side question: is there a difference between .astype(str) and .apply(str)?
Sample Input: (dtypes: a=float64, b=object, c=float64)
a, b, c, etc.
23, 'a42', 142, etc.
51, '3', 12, etc.
NaN, NaN, NaN, etc.
24, 'a1', NaN, etc.
Desired output: (dtypes: a=object, b=object, c=object)
a, b, c, etc.
'23', 'a42', '142', etc.
'51', '3', '12', etc.
NaN, NaN, NaN, etc.
'24', 'a1', NaN, etc.
This gives you the list of column names
lst = list(df)
This converts all the columns to string type (note: NaN becomes the literal string 'nan' here rather than being preserved):
df[lst] = df[lst].astype(str)
df = pd.DataFrame({
    'a': [23.0, 51.0, np.nan, 24.0],
    'b': ["a42", "3", np.nan, "a1"],
    'c': [142.0, 12.0, np.nan, np.nan]})
for col in df:
    df[col] = [np.nan if (not isinstance(val, str) and np.isnan(val))
               else (val if isinstance(val, str) else str(int(val)))
               for val in df[col].tolist()]
>>> df
a b c
0 23 a42 142
1 51 3 12
2 NaN NaN NaN
3 24 a1 NaN
>>> df.values
array([['23', 'a42', '142'],
['51', '3', '12'],
[nan, nan, nan],
['24', 'a1', nan]], dtype=object)
You could apply the .astype() function to every element of the dataframe, or select the columns of interest and convert them to string, as follows.
In [41]: df1 = pd.DataFrame({
...: 'a': [23.0, 51.0, np.nan, 24.0],
...: 'b': ["a42", "3", np.nan, "a1"],
...: 'c': [142.0, 12.0, np.nan, np.nan]})
...:
In [42]:
In [42]: df1
Out[42]:
a b c
0 23.0 a42 142.0
1 51.0 3 12.0
2 NaN NaN NaN
3 24.0 a1 NaN
### Shows current data type of the columns:
In [43]: df1.dtypes
Out[43]:
a float64
b object
c float64
dtype: object
### Applying .astype() on each element of the dataframe converts the datatype to string
In [45]: df1.astype(str).dtypes
Out[45]:
a object
b object
c object
dtype: object
### Or, you could select the column of interest to convert it to strings
In [48]: df1[["a", "b", "c"]] = df1[["a","b", "c"]].astype(str)
In [49]: df1.dtypes ### Datatype update
Out[49]:
a object
b object
c object
dtype: object
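A more compact way to keep NaN (which is what the question actually asks for) is to convert everything and then mask the positions that were missing; a sketch — note that floats render as '23.0' rather than '23' with this approach:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [23.0, 51.0, np.nan, 24.0],
    'b': ['a42', '3', np.nan, 'a1'],
    'c': [142.0, 12.0, np.nan, np.nan]})

# astype(str) turns NaN into the string 'nan'; mask() restores real NaN
# wherever the original frame had missing values.
out = df.astype(str).mask(df.isna())
print(out)
```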
I did it this way.
Get all the values from a specific column, e.g. 'text':
k = df['text'].values
Then append each value to a newly declared string, e.g. 'thestring':
thestring = ""
for i in range(len(k)):
    thestring += k[i]
print(thestring)
Hence, all strings in the pandas column 'text' have been put into one string variable.
cheers,
fairuz
