Format the output in Pandas - python-3.x

I have an Excel sheet:
I read it out:
import pandas as pd
import numpy as np
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
print(df)
it shows:
name value
0 a 10000.000000
1 b 20000.000000
2 c 30000.000000
3 d 40000.000000
4 e 50000.000000
5 f 1.142857
6 g 1.285714
How can I format the numbers like %.2f? Can I do it directly in print(df), for example by passing something like %.2f, or must I first modify the content, write it back to df, and then print again?
UPDATE:
I tried the answer below, but df['value'].apply("{0:.2f}".format) doesn't work:
import pandas as pd
import numpy as np
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
print(df['value'])
df['value'].apply("{0:.2f}".format)
print(df['value'])
print(df)
it shows:
0 10000.000000
1 20000.000000
2 30000.000000
3 40000.000000
4 50000.000000
5 1.142857
6 1.285714
Name: value, dtype: float64
0 10000.000000
1 20000.000000
2 30000.000000
3 40000.000000
4 50000.000000
5 1.142857
6 1.285714
Name: value, dtype: float64
name value
0 a 10000.000000
1 b 20000.000000
2 c 30000.000000
3 d 40000.000000
4 e 50000.000000
5 f 1.142857
6 g 1.285714
pd.set_option('display.float_format', lambda x: '%.2f' % x) works:
0 10000.00
1 20000.00
2 30000.00
3 40000.00
4 50000.00
5 1.14
6 1.29
Name: value, dtype: float64
name value
0 a 10000.00
1 b 20000.00
2 c 30000.00
3 d 40000.00
4 e 50000.00
5 f 1.14
6 g 1.29

You can change pandas' float_format display option with set_option, like this:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
If you want to change the formatting of just one column, you can use apply. Note that apply returns a new Series of strings and does not modify df, so assign the result if you want to keep it:
df['value'].apply("{0:.2f}".format)
# Output
0 10000.00
1 20000.00
2 30000.00
3 40000.00
4 50000.00
5 1.14
6 1.29
Name: value, dtype: object

The default precision is 6. You can override this with pandas.set_option:
pd.set_option('display.precision', 2)
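The update in the question sees no change because apply returns a new Series and leaves df alone. Here is a minimal sketch, assuming a small stand-in frame instead of test.xlsx (the data is made up to match the printed output). It also uses pd.option_context to limit the display change to one block rather than setting it globally:

```python
import pandas as pd

# Stand-in for the Excel data from the question (test.xlsx is not available here).
df = pd.DataFrame({
    "name": list("abcdefg"),
    "value": [10000.0, 20000.0, 30000.0, 40000.0, 50000.0, 8 / 7, 9 / 7],
})

# apply() returns a new Series of strings; df is untouched unless you assign it.
formatted = df["value"].apply("{:.2f}".format)

# Change only how floats are displayed, and only inside this block;
# the stored values stay float64.
with pd.option_context("display.float_format", "{:.2f}".format):
    shown = df.to_string()
print(shown)
```

If you only need the formatted text once, df.to_string(float_format="{:.2f}".format) should give the same result without touching any option.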

Related

Sum in Column based on condition in rows in pandas dataframe [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is add a new column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
An in-place version is also possible, but I think inplace is not good practice:
df3.reset_index(inplace=True)
But if you need new column, use:
df3['new'] = df3.index
I think you can simplify the read_csv call:
df = pd.read_csv('university2.csv',
sep=";",
skiprows=1,
index_col='YYYY-MO-DD HH-MI-SS_SSS',
parse_dates=['YYYY-MO-DD HH-MI-SS_SSS']) #parse_dates takes a list; if parsing doesn't work, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If the MultiIndex or Index comes from a groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
'B':list('ccddeeff'),
'C':range(8),
'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
C D
A B
a c 1 9
d 5 13
b e 9 17
f 13 21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
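The original code fails because df3.reset_index() is called without keeping its result. A minimal sketch of both fixes, assuming a made-up frame with a datetime index named timestamp (hypothetical, standing in for university2.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: a datetime index and one column.
idx = pd.date_range("2022-01-01", periods=6, freq="20min", name="timestamp")
df = pd.DataFrame({"magnetic_mag": np.arange(6.0)}, index=idx)

# Hourly mean/std, then flatten the two-level column names as in the question.
df3 = df.resample("60min").agg(["mean", "std"])
df3.columns = [" ".join(col) for col in df3.columns]

# reset_index() returns a new frame, so this bare call changes nothing in df3.
df3.reset_index()

# Option 1: keep the index and add a copy of it as a regular column.
with_copy = df3.assign(Time=df3.index)

# Option 2: move the index into a column by keeping the returned frame.
flat = df3.reset_index()
print(flat)
```

Either frame can then be passed to plt.plot with the time column on the x axis.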
You can also access the index directly and plot it without copying it into a column. Here is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E

Converting dataframe fraction to float

I would like to convert the string values in column B to float. How should I do it?
A B
1 16-1/4
2 3-1/4
3 21-1/4
4 8-1/4
Update:
Try map to avoid the 100-row limit of pd.eval:
df['C'] = df.B.str.replace('-', '+').map(pd.eval)
Original:
As per your comment, it seems you are adding the fraction to the whole number, so the solution would be:
df['C'] = pd.eval(df.B.str.replace('-', '+'))
Out[5]:
A B C
0 1 16-1/4 16.25
1 2 3-1/4 3.25
2 3 21-1/4 21.25
3 4 8-1/4 8.25
Use the built-in Python function eval(). Note that this treats the hyphen as subtraction, so 16-1/4 becomes 15.75:
df.B = df.B.apply(eval)
Test:
In[1]: df
A B
0 1 16-1/4
1 2 3-1/4
2 3 21-1/4
3 4 8-1/4
In[2]: df.B = df.B.apply(eval)
In[3]: df
A B
0 1 15.75
1 2 2.75
2 3 20.75
3 4 7.75

Replace values on dataset and apply quartile rule by row on pandas

I have a dataset with lots of variables. So I've extracted the numeric ones:
numeric_columns = transposed_df.select_dtypes(np.number)
Then I want to replace all 0 values with 0.0001:
transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.eq(0, axis=0), 0.0001)
And here is the first problem. This line does not replace the 0 values with 0.0001; it replaces all non-zero values with 0.0001 instead.
Also, after replacing the 0 values with 0.0001, I want to replace every value that is less than the first quartile of its row with -1 and leave the others as they were. But I can't work out how.
To answer your first question, look at the docstring of where:
In [36]: from pprint import pprint
In [37]: pprint( numeric_columns.where.__doc__)
('\n'
'Replace values where the condition is False.\n'
'\n'
'Parameters\n'
'----------\n'
Because of that, all the values except 0 are getting replaced.
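To make the difference concrete, here is a small sketch on a made-up two-column frame (the data is illustrative only). where keeps values where the condition is True, while mask replaces them:

```python
import pandas as pd

# Illustrative data only.
nums = pd.DataFrame({"B": [0.0, 0.5, 4.0], "D": [1.0, 3.0, 0.0]})
is_zero = nums.eq(0)

# where: keep cells where the condition is True, replace the rest.
kept_zeros = nums.where(is_zero, 0.0001)

# mask: replace cells where the condition is True, keep the rest.
fixed_a = nums.mask(is_zero, 0.0001)

# Same result as mask, using where with the inverted condition.
fixed_b = nums.where(~is_zero, 0.0001)
print(fixed_a)
```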
Use DataFrame.mask and for second condition compare by DataFrame.quantile:
transposed_df = pd.DataFrame({
'A':list('abcdef'),
'B':[0,0.5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,0,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)
m1 = numeric_columns.eq(0)
m2 = numeric_columns.lt(numeric_columns.quantile(q=0.25, axis=1), axis=0)
transposed_df[numeric_columns.columns] = numeric_columns.mask(m1, 0.0001).mask(m2, -1)
print (transposed_df)
A B C D E F
0 a -1.0 7 1.0 5 a
1 b -1.0 8 3.0 3 a
2 c 4.0 9 -1.0 6 a
3 d 5.0 -1 7.0 9 b
4 e 5.0 2 -1.0 2 b
5 f 4.0 3 -1.0 4 b
EDIT:
from scipy.stats import zscore
print (transposed_df[numeric_columns.columns].apply(zscore))
B C D E
0 -2.236068 0.570352 -0.408248 0.073521
1 0.447214 0.950586 0.408248 -0.808736
2 0.447214 1.330821 -0.816497 0.514650
3 0.447214 -0.570352 2.041241 1.838037
4 0.447214 -1.330821 -0.408248 -1.249865
5 0.447214 -0.950586 -0.816497 -0.367607
EDIT1:
transposed_df = pd.DataFrame({
'A':list('abcdef'),
'B':[0,1,1,1,1,1],
'C':[1,8,9,4,2,3],
'D':[1,3,0,7,1,0],
'E':[1,3,6,9,2,4],
'F':list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)
from scipy.stats import zscore
df1 = pd.DataFrame(numeric_columns.apply(zscore, axis=1).tolist(),index=transposed_df.index)
transposed_df[numeric_columns.columns] = df1
print (transposed_df)
A B C D E F
0 a -1.732051 0.577350 0.577350 0.577350 a
1 b -1.063410 1.643452 -0.290021 -0.290021 a
2 c -0.816497 1.360828 -1.088662 0.544331 a
3 d -1.402136 -0.412393 0.577350 1.237179 b
4 e -1.000000 1.000000 -1.000000 1.000000 b
5 f -0.632456 0.632456 -1.264911 1.264911 b

Count positive values for each column in all dataframe

Is it possible to count positive values of each column in a dataframe ?
I tried to do this with 'count'
import pandas as pd
import numpy as np
np.random.seed(18)
df = pd.DataFrame(np.random.randint(-10,10,size=(5, 4)), columns=list('ABCD'))
print(df)
A B C D
0 0 9 -5 7
1 4 8 -8 -2
2 -8 7 -5 5
3 0 0 1 -6
4 -6 1 -9 -7
positive_count = df.gt(0).count()
print(positive_count)
A 5
B 5
C 5
D 5
dtype: int64
The "gt" (greater than) comparison doesn't seem to work.
I tried with 'value_counts', and it works for column 'A' in this example:
positive_count = df['A'].gt(0).value_counts()[1]
But I would like to get this result for all columns at one time.
Does anyone have an idea to help me?
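One likely explanation: gt(0) does work, but count() counts every non-missing cell of the resulting boolean frame, True and False alike. A minimal sketch, with the values typed in from the printed frame rather than regenerated from the seed:

```python
import pandas as pd

# Values copied from the frame printed in the question.
df = pd.DataFrame({
    "A": [0, 4, -8, 0, -6],
    "B": [9, 8, 7, 0, 1],
    "C": [-5, -8, -5, 1, -9],
    "D": [7, -2, 5, -6, -7],
})

# count() ignores only missing values, so it returns the number of rows.
all_rows = df.gt(0).count()

# sum() treats True as 1 and False as 0, so it counts only the positives.
positive_count = df.gt(0).sum()
print(positive_count)
```

This gives one count per column in a single call.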

Pandas use variable for column names [duplicate]

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
How can I access columns via a variable?
I tried this:
cols='A','B'
df[cols]
...which resulted in this:
KeyError: ('A', 'B')
Bonus Question:
What if my data frame were like this?:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
and I wanted to do this?:
cols=['A','B']
cols2=['C','D']
df[cols,'F',cols2]
Thanks in advance!
You can select a subset with a list of column names:
cols=['A','B']
print(df[cols])
A B
0 1 4
1 2 5
2 3 6
It is the same as:
print(df[['A','B']])
A B
0 1 4
1 2 5
2 3 6
Bonus answer:
cols=['A','B']
cols2=['C','D']
allcols = cols + ['F'] + cols2
print(df[allcols])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
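As for why the original attempt fails: cols='A','B' creates a tuple, and pandas looks a tuple up as one column label instead of a list of labels. A short sketch of that behaviour, plus an unpacking variant of the bonus answer (the names raised, subset and ordered are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                   "D": [1, 3, 5], "E": [5, 3, 6], "F": [7, 4, 3]})

cols = "A", "B"  # no brackets: this is the tuple ('A', 'B')

try:
    df[cols]  # looked up as a single column label
    raised = False
except KeyError:
    raised = True

# Converting the tuple to a list selects both columns.
subset = df[list(cols)]

# Unpacking builds one flat list from several pieces.
cols2 = ["C", "D"]
ordered = df[[*cols, "F", *cols2]]
print(ordered)
```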
