Use Pandas to round based on another column - python-3.x

I have a dataframe where 1 column is a list of values and another is the number of digits I need to round to. It looks like this:
   ValueToPlot  B_length
0        13.80       1.0
1        284.0       0.0
2          5.9       0.0
3         1.38       1.0
4        287.0       0.0
I am looking for an output that looks like this:
   ValueToPlot  B_length  Rounded
0        13.80       1.0     13.8
1        284.0       0.0      284
2          5.9       0.0        6
3         1.38       1.0      1.4
4        287.0       0.0      287
Lastly, I would like the Rounded column to be in a string format, so the final result would be:
   ValueToPlot  B_length  Rounded
0        13.80       1.0   '13.8'
1        284.0       0.0    '284'
2          5.9       0.0      '6'
3         1.38       1.0    '1.4'
4        287.0       0.0    '287'
I have attempted to use the apply function in Pandas but have not been successful. I would prefer to avoid looping if possible.

Use chained formats
'{{:0.{}f}}'.format(3) evaluates to '{:0.3f}'; the doubled braces '{{' and '}}' tell format to emit literal braces. Then '{:0.3f}'.format(1) evaluates to '1.000'. We can capture this concept by chaining.
f = lambda x: '{{:0.{}f}}'.format(int(x[1])).format(x[0])
df.assign(Rounded=df.apply(f, 1))
ValueToPlot B_length Rounded
0 13.80 1.0 13.8
1 284.00 0.0 284
2 5.90 0.0 6
3 1.38 1.0 1.4
4 287.00 0.0 287
A little more explicit with the column names
f = lambda x: '{{:0.{}f}}'.format(int(x['B_length'])).format(x['ValueToPlot'])
df.assign(Rounded=df.apply(f, 1))
ValueToPlot B_length Rounded
0 13.80 1.0 13.8
1 284.00 0.0 284
2 5.90 0.0 6
3 1.38 1.0 1.4
4 287.00 0.0 287
I generally like to use assign because it produces a copy of the data frame with the new column attached. Alternatively, I can edit the original data frame in place:
f = lambda x: '{{:0.{}f}}'.format(int(x[1])).format(x[0])
df['Rounded'] = df.apply(f, 1)
Or I can use assign with an actual dictionary
f = lambda x: '{{:0.{}f}}'.format(int(x[1])).format(x[0])
df.assign(**{'Rounded': df.apply(f, 1)})
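On Python 3.6+, the same chained-format idea can be written as a single nested f-string; this is a sketch of an equivalent, not part of the original answer:
f = lambda x: f"{x['ValueToPlot']:.{int(x['B_length'])}f}"  #precision comes from B_length
df.assign(Rounded=df.apply(f, axis=1))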

A little long ... but it works:
df.apply(lambda x: str(round(x['ValueToPlot'], int(x['B_length'])))
                   if x['B_length'] > 0
                   else str(int(round(x['ValueToPlot'], int(x['B_length'])))),
         axis=1)
Out[1045]:
0 13.8
1 284
2 6
3 1.4
4 287
dtype: object
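If apply proves slow on a large frame, a plain list comprehension over the two columns builds the same strings; a sketch under the same column names, reusing the nested f-string from above:
#zip pairs each value with its precision; no row-wise apply needed
df['Rounded'] = [f'{v:.{int(d)}f}' for v, d in zip(df['ValueToPlot'], df['B_length'])]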

Related

Frequency calculations on subgroups of pandas-groupby, insertion of new rows and rearrangement of columns

I need some help performing a few operations over subgroups, but I am getting really confused. I will try to describe the operations and the desired output briefly in the comments below.
(1) Calculate the % frequency of appearance per subgroup
(2) Appear a record that does not exist with 0
(3) Rearrange order of records and columns
Assume the df below as the raw data:
df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                                'clothes', 'bags', 'bags', 'clothes', 'clothes']})
The grouped_df below is close to what I have in mind but I can't get the desired output.
grouped_df=df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan:0})
# output:
products accessories bags clothes shoes
store branch
1 A 0.0 0.0 1.0 1.0
C 0.0 0.0 1.0 0.0
2 C 1.0 0.0 1.0 1.0
3 A 0.0 2.0 1.0 0.0
C 0.0 0.0 1.0 0.0
# desirable output: if (1), (2) and (3) take place somehow...
products clothes shoes accessories bags
store branch
1 B 0 0 0 0 #group 1 has 2 clothes and 1 shoes across A and C, 3 in total, which turns each count into 33.3%
A 33.3 33.3 0 0
C 33.3 0.0 0 0
2 B 0 0 0 0
A 0 0 0 0
C 33.3 33.3 33.3 0
3 B 0 0 0 0 #group 3 has 2 bags and 2 clothes across A and C, 4 in total, which turns the 2 bags into 50% and so on
A 25 0 0 50
C 25 0 0 0
# (3) rearrangement of columns with "clothes" and "shoes" going first
# (3)+(2) branch B appeared and the order of branches changed to B, A, C
# (1) percentage calculations of the occurrences have been performed over groups that hopefully have made sense with the comments above
I have tried to handle each group separately, but (i) it does not take the replaced NaN values into account, and (ii) I should avoid handling each group individually, because I would have to concatenate a lot of groups afterwards (this df is just an example) in order to plot the whole group later on.
grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products accessories bags clothes shoes
store branch
1 A NaN NaN 50.0 100.0 #why has it transformed on axis='columns'?
C NaN NaN 50.0 0.0
Hopefully my question makes sense. Any insight into what I am trying to do is much appreciated in advance. Thank you!
With the help of @Quang Hoang, who tried to help with this question a day before I posted my answer, I managed to find a solution.
To explain the last bit of the calculation: I transformed every element by dividing it by the sum of counts for its store group, to get the frequency of each element per 0th-level group (store) rather than per row, column, or grand total.
grouped_df = df.groupby(['store', 'branch', 'products']).size()\
.unstack('branch')\
.reindex(['B','C','A'], axis=1, fill_value=0)\
.stack('branch')\
.unstack('products')\
.replace({np.nan:0})\
.transform(
lambda x: x*100/df.groupby(['store']).size()
).round(1)\
.reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
Running the piece of code above, produces the desired output:
products      clothes  shoes  accessories  bags
store branch
1     B           0.0    0.0          0.0   0.0
      C          33.3    0.0          0.0   0.0
      A          33.3   33.3          0.0   0.0
2     B           0.0    0.0          0.0   0.0
      C          33.3   33.3         33.3   0.0
3     B           0.0    0.0          0.0   0.0
      C          25.0    0.0          0.0   0.0
      A          25.0    0.0          0.0  50.0

Create multiple new columns based on multiple conditions in Pandas

I am trying to get new columns a and b based on the following dataframe:
a_x b_x a_y b_y
0 13.67 0.0 13.67 0.0
1 13.42 0.0 13.42 0.0
2 13.52 1.0 13.17 1.0
3 13.61 1.0 13.11 1.0
4 12.68 1.0 13.06 1.0
5 12.70 1.0 12.93 1.0
6 13.60 1.0 NaN NaN
7 12.89 1.0 NaN NaN
8 11.68 1.0 NaN NaN
9 NaN NaN 8.87 0.0
10 NaN NaN 8.77 0.0
11 NaN NaN 7.97 0.0
If b_x or b_y is 0.0 (in that case a_x and a_y have the same value when both exist), I take either of them for the new columns a and b; if b_x or b_y is 1.0, a_x and a_y differ, so I take the mean of a_x and a_y as a and either b_x or b_y as b.
If only one of the pairs a_x, b_x or a_y, b_y is non-null, I take the existing values as a and b.
My expected results will like this:
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0
1 13.42 0.0 13.42 0.0 13.420 0
2 13.52 1.0 13.17 1.0 13.345 1
3 13.61 1.0 13.11 1.0 13.360 1
4 12.68 1.0 13.06 1.0 12.870 1
5 12.70 1.0 12.93 1.0 12.815 1
6 13.60 1.0 NaN NaN 13.600 1
7 12.89 1.0 NaN NaN 12.890 1
8 11.68 1.0 NaN NaN 11.680 1
9 NaN NaN 8.87 0.0 8.870 0
10 NaN NaN 8.77 0.0 8.770 0
11 NaN NaN 7.97 0.0 7.970 0
How can I get an result above? Thank you.
Use:
#filter all a and b columns
b = df.filter(like='b')
a = df.filter(like='a')
#test if at least one 0 or 1 value
m1 = b.eq(0).any(axis=1)
m2 = b.eq(1).any(axis=1)
#get means of a columns
a1 = a.mean(axis=1)
#forward fill missing values and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]
a2 = a.ffill(axis=1).iloc[:, -1]
#new Dataframe with 2 conditions
df1 = pd.DataFrame(np.select([m1, m2], [[a2, b1], [a1, b1]]), index=['a','b']).T
#join to original
df = df.join(df1)
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
But I think the solution can be simplified, because the mean can be used for both conditions (the mean of identical values is the value itself):
b = df.filter(like='b')
a = df.filter(like='a')
#the masks m1 and m2 are no longer needed
a1 = a.mean(axis=1)
b1 = b.ffill(axis=1).iloc[:, -1]
df['a'] = a1
df['b'] = b1
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
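Taking that observation one step further: since b_x and b_y agree whenever both are present, a row-wise mean with NaN-skipping covers b as well; a sketch using only the columns shown above:
#mean skips NaN, so a single non-null value passes through unchanged
df['a'] = df[['a_x', 'a_y']].mean(axis=1)
df['b'] = df[['b_x', 'b_y']].mean(axis=1)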

How to substitute None values in python pandas

I'm trying to work with a dataset that has None values. My loading code is the following:
import pandas as pd
import io
import requests
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
c = pd.DataFrame(s_rows_cols, columns = header_row)
The output from c shows that some columns have None values. How do I replace these None values with zeros?
Thanks
I think it is not necessary if you use read_csv with sep='\s+' for a whitespace separator, plus the names parameter to specify the new column names:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
cols = ['age','sex','chestpain','restBP','chol','sugar','ecg',
'maxhr','angina','dep','exercise','fluor','thal','diagnosis']
df = pd.read_csv(url, sep='\s+', names=cols)
print (df)
age sex chestpain restBP chol sugar ecg maxhr angina dep \
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2
.. ... ... ... ... ... ... ... ... ... ...
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
exercise fluor thal diagnosis
0 2.0 3.0 3.0 2
1 2.0 0.0 7.0 1
2 1.0 0.0 7.0 2
3 2.0 1.0 7.0 1
4 1.0 1.0 3.0 1
.. ... ... ... ...
265 1.0 0.0 7.0 1
266 1.0 0.0 7.0 1
267 2.0 0.0 3.0 1
268 2.0 0.0 6.0 1
269 2.0 3.0 3.0 2
[270 rows x 14 columns]
The data then contains no Nones and no missing values:
print (df.isna().any(axis=1).any())
False
EDIT:
If you do need to replace missing values or Nones with a scalar, use fillna:
c = c.fillna(0)
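Since the manual split leaves every value as a string, it may also help to convert the columns to numbers before filling; this combination is my suggestion, not part of the original answer:
#coerce strings to numbers (unparseable values become NaN), then fill with 0
c = c.apply(pd.to_numeric, errors='coerce').fillna(0)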

Fill the missing values in the data set

I have a dataset as below.
building_id meter meter_reading primary_use square_feet air_temperature dew_temperature sea_level_pressure wind_direction wind_speed hour day weekend month
0 0 0 NaN 0 7432 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
1 1 0 NaN 0 2720 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
2 2 0 NaN 0 5376 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
3 3 0 NaN 0 23685 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
4 4 0 NaN 0 116607 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
You can see that the values under meter_reading are NaN, and I would like to fill them with that column's mean grouped by the "primary_use" and "square_feet" columns. Which API could I use to achieve this? I am currently using scikit-learn's imputer.
Thanks, your help is highly appreciated.
If you use a pandas data frame, it already provides everything you need.
Note that primary_use is a categorical feature while square_feet is continuous. So you would first split square_feet into categories, and then you can calculate the mean meter_reading per group, as in the sketch below.
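A minimal sketch of that advice, assuming the frame above is named df; the helper column name sqft_bin and the bin count of 5 are my assumptions, not part of the answer:
import pandas as pd
#bin the continuous square_feet column into 5 categories (bin count is a guess)
df['sqft_bin'] = pd.cut(df['square_feet'], bins=5)
#fill each missing meter_reading with its (primary_use, sqft_bin) group mean;
#groups whose readings are all NaN simply stay NaN
group_mean = df.groupby(['primary_use', 'sqft_bin'])['meter_reading'].transform('mean')
df['meter_reading'] = df['meter_reading'].fillna(group_mean)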

Pandas: How to sum (dynamic) columns that are between two specific columns?

I'm working with dynamic .csvs, so I never know what the column names will be. Examples:
1)
ETC META A B C D E %
0 2.0 A 0.0 24.564 0.000 0.0 0.0 -0.00%
1 4.2 B 0.0 2.150 0.000 0.0 0.0 3.55%
2 5.0 C 0.0 0.000 15.226 0.0 0.0 6.14%
2)
META A C D E %
0 A 0.00 0.00 2.90 0.0 -0.00%
1 B 3.00 0.00 0.00 0.0 3.55%
2 C 0.00 21.56 0.00 0.0 6.14%
3)
FILL ETC META G F %
0 T 2.0 A 0.00 6.70 -0.00%
1 F 4.2 B 2.90 0.00 3.55%
2 T 5.0 C 0.00 34.53 6.14%
As I would like to create a new column with the SUM of all columns between META and %, I need to get the names of all those columns, so I can write something like:
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
Since the column names change, the code above only works for example 1). So I need to: 1) identify all the columns; and 2) sum them.
The solution has to work for all three examples above (1, 2 and 3).
Note that the only certainty is that the columns to sum lie between META and %, but even those two are not in fixed positions.
Select all columns except the first and last with DataFrame.iloc, then sum:
df['Total'] = df.iloc[:, 1:-1].sum(axis=1)
Or remove META and % columns by DataFrame.drop before sum:
df['Total'] = df.drop(['META','%'], axis=1).sum(axis=1)
print (df)
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
EDIT: You can select columns between META and %:
#META and % are inside the slice but not numeric, so sum skips them (or pass numeric_only=True)
df['Total'] = df.loc[:, 'META':'%'].sum(axis=1)
#META is inside the slice but not numeric
df['Total'] = df.iloc[:, df.columns.get_loc('META'):df.columns.get_loc('%')].sum(axis=1)
#most general: take only the columns strictly between META and %
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)
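Wrapped up as a small helper (a sketch; the function name total_between is mine, not from the answer):
def total_between(df, left='META', right='%'):
    #sum the columns strictly between `left` and `right`
    lo = df.columns.get_loc(left) + 1
    hi = df.columns.get_loc(right)
    return df.iloc[:, lo:hi].sum(axis=1)
df['Total'] = total_between(df)  #works for all three layouts above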
