Convert dataframe display float format to human-readable for output display purposes only - python-3.x

I wish to display the dataframe column values in human readable format like 10, 100, 1K, 1M, 1B, etc.
So far, I have been able to convert scientific notation (say, 1.111111e1) to plain float format using the pandas display options:
pd.options.display.float_format = '{:.2f}'.format
Note: the 2 in the line above means 2 decimal places; change it as you like.
But the output is still hard to read when a column holds values of widely varying magnitude, especially in financial use cases with columns such as currency, turnover, profit, etc.
How can this be done?
Note: I do not wish to convert the stored values into string format. I perform calculations on the column values, so that is not feasible. Further, I do not want to create new columns just for display purposes, so df['new_col'] = df['col']/1000000 won't work either.
Sample dataframe:
pd.DataFrame([10.,100.,1000.,10000.,100000.,1000000.,10000000.,100000000.,1000000000.,10000000000.])
0 1.000000e+01
1 1.000000e+02
2 1.000000e+03
3 1.000000e+04
4 1.000000e+05
5 1.000000e+06
6 1.000000e+07
7 1.000000e+08
8 1.000000e+09
9 1.000000e+10

Use the following function as the display.float_format option in pandas to get the desired outcome:
lambda x : '{:.2f}'.format(x) if abs(x) < 1000 else ('{:.2f} K'.format(x/1000) if abs(x) < 1000000 else ('{:.2f} M'.format(x/1000000) if abs(x) < 1000000000 else '{:.2f} B'.format(x/1000000000)))
Output:
0 10.00
1 100.00
2 1.00 K
3 10.00 K
4 100.00 K
5 1.00 M
6 10.00 M
7 100.00 M
8 1.00 B
9 10.00 B
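For readability, the same thresholds can also be written as a named function and registered with pd.set_option (a standard pandas call); this is just a sketch equivalent to the lambda above, and the stored values remain floats:

import pandas as pd

def human_format(x):
    # format magnitudes with plain, K, M, or B suffixes (2 decimal places)
    if abs(x) < 1_000:
        return '{:.2f}'.format(x)
    if abs(x) < 1_000_000:
        return '{:.2f} K'.format(x / 1_000)
    if abs(x) < 1_000_000_000:
        return '{:.2f} M'.format(x / 1_000_000)
    return '{:.2f} B'.format(x / 1_000_000_000)

pd.set_option('display.float_format', human_format)
# pd.reset_option('display.float_format')  # restore the default later if needed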

Related

Pandas Dataframe: Dropping Selected rows with 0.0 float type values

I have a dataset that contains an amount column of float type. Some of the rows contain values of 0.00, and because they skew the dataset, I need to drop them. I have temporarily set the amount column as the index and sorted the values as well.
Afterwards, I attempted to drop the rows after subsetting with loc, but I keep getting an error of the form ValueError: Buffer has wrong number of dimensions (expected 1, got 3):
mortgage = mortgage.set_index('Gross Loan Amount').sort_values('Gross Loan Amount')
mortgage.drop([mortgage.loc[0.0]])
I also tried this:
mortgage.drop(mortgage.loc[0.0])
which raised an error of the form KeyError: "[Column_names] not found in axis".
How else can I accomplish this task?
You could make a boolean frame and then use any:
df = df[~(df == 0).any(axis=1)]
In this code, all rows that contain at least one zero are removed.
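As a minimal illustration of that pattern (the frame and column names here are made up):

import pandas as pd

df = pd.DataFrame({'a': [1.0, 0.0, 3.0], 'b': [2.0, 5.0, 0.0]})
mask = (df == 0).any(axis=1)   # True for every row containing at least one zero
df = df[~mask]                 # keep only the rows with no zeros at all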
Let me see if I get your problem. I created this sample dataset:
df = pd.DataFrame({'Values': [200.04,100.00,0.00,150.15,69.98,0.10,2.90,34.6,12.6,0.00,0.00]})
df
Values
0 200.04
1 100.00
2 0.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
9 0.00
10 0.00
Now, in order to get rid of the 0.00 values, you just have to do this:
df = df[df['Values'] != 0.00]
Output:
df
Values
0 200.04
1 100.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
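Applied back to the original question's frame (assuming the column really is named 'Gross Loan Amount', as in the asker's snippet, and without setting it as the index first), the equivalent would be:
mortgage = mortgage[mortgage['Gross Loan Amount'] != 0.00]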

Pandas shift datetimeindex takes too long time running

I have a running-time issue when shifting a large dataframe with a datetime index.
Example using dummy data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [0,1,2,3,4,5,6,7,8,9,10,11,12,13] * 10**5,
    'col3': list(np.random.randint(0, 100000, 14 * 10**5)),
    'col2': list(pd.date_range('2020-01-01', '2020-08-01', freq='M')) * 2 * 10**5
})
df.col3 = df.col3.astype(str)
df.drop_duplicates(subset=['col3', 'col2'], keep='first', inplace=True)
If I shift without using the datetime index, it only takes about 12 s:
%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s
But when I use the datetime index, which is what my situation requires, it takes about 40 minutes:
%%time
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s
In my situation, I need the data from shift(1) through shift(6) and then merge them with the original data by col2 and col3, so I use a for loop and merge.
Is there any solution for this? Any response would be much appreciated.
Ben's answer solves it:
%%time
tmp=df1[['col1','col3', 'col2']].assign(col2 = lambda x: x['col2'] + MonthEnd(2)).set_index(['col3', 'col2']).add_suffix(f'_{2}').fillna(0).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 5.94 s
The same approach also applies to the looping version:
%%time
res=(pd.concat([df1.assign(col2 = lambda x: x['col2'] + MonthEnd(i)).set_index(['col3', 'col2']).add_suffix(f'_{i}') for i in range(0,7)],axis=1).fillna(0)).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 1min 44s
Actually, my real data already uses MonthEnd(0), so I just loop over range(1,7). I also apply this to multiple columns, so I skip the astype step, and I keep the reindex because I use a left merge.
The two operations are slightly different, and the results are not the same, because your data (at least the dummy data here) is not ordered, especially if you have missing dates for some col3 values. That said, the time difference seems enormous, so I think you should approach it a bit differently.
One way is to add X MonthEnd offsets to col2 for X from 0 to 6, set col3 and col2 as the index, add_suffix to keep track of the "shift" value, and concat all of them. Then fillna and convert the dtype back to the original one. The rest is mostly cosmetic, depending on your needs.
from pandas.tseries.offsets import MonthEnd

res = (
    pd.concat([
        df.assign(col2 = lambda x: x['col2'] + MonthEnd(i))
          .set_index(['col3', 'col2'])
          .add_suffix(f'_{i}')
        for i in range(0, 7)],
        axis=1)
    .fillna(0)
    # depends on your original data
    .astype(df['col1'].dtype)
    # if you want a left merge ordered like the original df
    # .reindex(pd.MultiIndex.from_frame(df[['col3', 'col2']]))
    # if you want col2 and col3 back as columns
    # .reset_index()
)
Note that concat does an outer join by default, so you end up with months that were not in your original data, and col1_0 is actually the original data (with my random numbers).
print(res.head(10))
col1_0 col1_1 col1_2 col1_3 col1_4 col1_5 col1_6
col3 col2
0 2020-01-31 7 0 0 0 0 0 0
2020-02-29 8 7 0 0 0 0 0
2020-03-31 2 8 7 0 0 0 0
2020-04-30 3 2 8 7 0 0 0
2020-05-31 4 3 2 8 7 0 0
2020-06-30 12 4 3 2 8 7 0
2020-07-31 13 12 4 3 2 8 7
2020-08-31 0 13 12 4 3 2 8
2020-09-30 0 0 13 12 4 3 2
2020-10-31 0 0 0 13 12 4 3
This is an issue with groupby + shift. The problem is that if you specify an axis other than 0 or a frequency, it falls back to a very slow loop over the groups. If neither of those is specified, it can use a much faster path, which is why you see an order of magnitude difference in performance.
The relevant code for DataFrameGroupBy.shift is:
def shift(self, periods=1, freq=None, axis=0, fill_value=None):
    """..."""
    if freq is not None or axis != 0:
        return self.apply(lambda x: x.shift(periods, freq, axis, fill_value))
Previously, this issue extended to specifying a fill_value as well.

Combining the respective columns from 2 separate DataFrames using pandas

I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here; there may be more in the actual data) into single values in the same columns (see the required output below). I have a quick way of doing this using np.vectorize and df.to_numpy(), but I am looking for a way to do it strictly with pandas. The criteria here are, first, readability of the code and, then, time complexity.
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast:
# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)
conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
The criteria here are, first, readability of the code.
Another simple way is to use add and radd:
df1.astype(str).add(df2.astype(str).radd('-'))
A B
0 1-10 5-50
1 2-20 4-40
2 3-30 3-30
3 4-40 2-20
4 5-50 1-10
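If you want to reproduce the exact '_' separator from the required output, a variant of the same idea (plain elementwise string concatenation after astype(str), which pandas broadcasts across the frame) should be:
df1.astype(str) + '_' + df2.astype(str)
which then matches the required output shown in the question.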

Python - Calculating standard deviation (row level) of dataframe columns

I have created a Pandas DataFrame and am able to determine the standard deviation of one or more columns of this dataframe (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands I have tried so far:
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the command below, I get NaN for all the records. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
It is expected behavior; check DataFrame.std:
Normalized by N-1 by default. This can be changed using the ddof argument.
If you have only one element, you're dividing by 0. So if you have one column and want the sample standard deviation across columns, you get all missing values.
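For reference, a sketch of the sample standard deviation formula being applied per row (the textbook definition, not the pandas source):

s = \sqrt{ \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - \mathrm{ddof}} }

With a single column, each row has N = 1, so with the default ddof=1 the denominator N - 1 is 0 and the result is NaN; with ddof=0 the denominator is 1 and the result is 0.0.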
Sample:
inp_df = pd.DataFrame({'salary':[10,20,30],
'num_months':[1,2,3],
'no_of_hours':[2,5,6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column by one [] for Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get std of Series - get a scalar:
print (inp_df['salary'].std())
10.0
Select one column by double [] for one column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get std of DataFrame per index (default value) - get one element Series:
print (inp_df[['salary']].std())
#same like
#print (inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get std of DataFrame per columns (axis=1) - get all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If the default ddof=1 is changed to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std by two or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64

Randomly select a percentage of columns and rows in a dataframe (Pandas, Python 3)

I am trying to randomly select a certain percentage of rows and columns in my dataframe and fit these features into a logistic regression over 10 iterations. My dependent variable is whether a team won (1) or lost (0).
If I have a df that looks something like this (data is made up):
Won Field Injuries Weather Fouls Players
1 2 3 1 2 8
0 3 2 0 1 5
1 4 5 3 2 6
1 3 2 1 4 5
0 2 3 0 1 6
1 4 2 0 2 8
...
For example, let's say I want to select 50% (but this could change). I want to randomly select 50% (or the closest amount to 50% if it's an odd number) of the columns (field, injuries, weather, fouls, players) and 50% of the rows in those columns to place in my model.
Here is my current code, which right now selects all of the columns and rows and fits them into my model, but I would like to use a random percentage instead:
import numpy as np
import statsmodels.api as sm

z = []
for i in range(10):
    train_cols = df.columns[1:]
    logit = sm.Logit(df['Won'], df[train_cols])
    result = logit.fit()
    exp = np.exp(result.params)
    z.append([i, exp])
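A minimal sketch of one way to do the random selection uses DataFrame.sample, which accepts a frac argument and, for columns, axis=1 (both standard pandas parameters); random_state is added here only for reproducibility, and 'Won' is kept out of the feature sampling:

import numpy as np
import statsmodels.api as sm

z = []
for i in range(10):
    # sample ~50% of the feature columns (everything except 'Won')
    cols = df.drop(columns='Won').sample(frac=0.5, axis=1, random_state=i).columns
    # sample 50% of the rows
    rows = df.sample(frac=0.5, random_state=i)
    logit = sm.Logit(rows['Won'], rows[cols])
    result = logit.fit()
    z.append([i, np.exp(result.params)])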
