Checking a condition in a pd.DataFrame with np.where() and applying a function to more than one column - python-3.x

Assume a dataframe df with two columns, one holding a mass value and the other its unit of measurement. The two columns look like this:
df.head()
Mass Unit
0 14 g
1 1.57 kg
2 701 g
3 0.003 tn
4 0.6 kg
I want to have a consistent system of measurements and thus, I perform the following:
df['Mass']=np.where(df['Unit']=='g', df['Mass']/1000, df['Mass']) #1
df['Unit']=np.where(df['Unit']=='g', 'kg', df['Unit']) #2
df['Mass']=np.where(df['Unit']=='tn', df['Mass']*1000, df['Mass']) #3
df['Unit']=np.where(df['Unit']=='tn', 'kg', df['Unit']) #4
a) Is there a way to perform #1 & #2 in one line, maybe using apply?
b) Is it possible to perform #1, #2, #3 and #4 in only one line?
Thank you for your time!

It is possible with numpy.select, BUT because numeric and string columns are mixed, the numeric values in Mass get converted to strings, so the last step converts back to floats:
df['Mass'], df['Unit'] = np.select([df['Unit']=='g', df['Unit']=='tn'],
                                   [(df['Mass']/1000, np.repeat(['kg'], len(df))),
                                    (df['Mass']*1000, np.repeat(['kg'], len(df)))],
                                   (df['Mass'], df['Unit']))
df['Mass'] = df['Mass'].astype(float)
print (df)
Mass Unit
0 0.014 kg
1 1.570 kg
2 0.701 kg
3 3.000 kg
4 0.600 kg
The same problem occurs with numpy.where, and since it handles only one condition the tn row stays unconverted:
df['Mass'], df['Unit'] = np.where(df['Unit']=='g',
                                  (df['Mass']/1000, np.repeat(['kg'], len(df))),
                                  (df['Mass'], df['Unit']))
df['Mass'] = df['Mass'].astype(float)
print (df)
Mass Unit
0 0.014 kg
1 1.570 kg
2 0.701 kg
3 0.003 tn
4 0.600 kg

You can do the following, which does not use any numpy functions:
(i,j) = (df[(df['Unit']=='g')].index, df[(df['Unit']=='tn')].index)
df.loc[i,'Mass'], df.loc[j,'Mass'], df['Unit'] = df.loc[i,'Mass']/1000, df.loc[j,'Mass']*1000,'kg'
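For completeness, here is a loop-free alternative sketch (not from the answers above) using a conversion-factor dictionary with Series.map; it assumes g, kg and tn are the only units present:
import pandas as pd

df = pd.DataFrame({'Mass': [14, 1.57, 701, 0.003, 0.6],
                   'Unit': ['g', 'kg', 'g', 'tn', 'kg']})

# map each unit to its factor relative to kilograms, then convert in one assignment
df['Mass'] = df['Mass'] * df['Unit'].map({'g': 0.001, 'kg': 1, 'tn': 1000})
df['Unit'] = 'kg'
print(df)
#     Mass Unit
# 0  0.014   kg
# 1  1.570   kg
# 2  0.701   kg
# 3  3.000   kg
# 4  0.600   kg
This handles all conversions in a single Mass assignment; the Unit column still needs its own assignment.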

Related

Convert different units of a column in pandas

I'm working on a Kaggle project. Below is my CSV file column:
total_sqft
1056
1112
34.46Sq. Meter
4125Perch
1015 - 1540
34.46
10Sq. Yards
10Acres
10Guntha
10Grounds
The column is of type object. First I want to convert all the values to float, then replace the string 1015 - 1540 with its average value, and finally convert the units to square feet. I've tried different StackOverflow solutions but none of them seem to work. Any help would be appreciated.
Expected Output:
total_sqft
1056.00
1112.00
370.307
1123031.25
1277.5
34.46
90.00
435600
10890
24003.5
1 square meter = 10.764 square feet
1 perch = 272.25 square feet
1 square yard = 9 square feet
1 acre = 43560 square feet
1 guntha = 1089 square feet
1 ground = 2400.35 square feet
First extract the numeric values with Series.str.extractall, convert them to floats, and take the mean per row:
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
                             .astype(float)
                             .groupby(level=0)
                             .mean())
print (df)
total_sqft avg
0 1056 1056.00
1 1112 1112.00
2 34.46Sq. Meter 34.46
3 4125Perch 4125.00
4 1015 - 1540 1277.50
5 34.46 34.46
Then more information is needed to convert to square feet.
EDIT: Create a dictionary of unit factors, extract the unit from the column, map the factors (default 1), and finally multiply the columns:
d = {'Sq. Meter': 10.764, 'Perch': 272.25, 'Sq. Yards': 9,
     'Acres': 43560, 'Guntha': 1089, 'Grounds': 2400.35}
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
                             .astype(float)
                             .groupby(level=0)
                             .mean())
df['unit'] = df['total_sqft'].str.extract(f'({"|".join(d)})', expand=False)
df['map'] = df['unit'].map(d).fillna(1)
df['total_sqft'] = df['avg'].mul(df['map'])
print (df)
total_sqft avg unit map
0 1.056000e+03 1056.00 NaN 1.000
1 1.112000e+03 1112.00 NaN 1.000
2 3.709274e+02 34.46 Sq. Meter 10.764
3 1.123031e+06 4125.00 Perch 272.250
4 1.277500e+03 1277.50 NaN 1.000
5 3.446000e+01 34.46 NaN 1.000
6 9.000000e+01 10.00 Sq. Yards 9.000
7 4.356000e+05 10.00 Acres 43560.000
8 1.089000e+04 10.00 Guntha 1089.000
9 2.400350e+04 10.00 Grounds 2400.350

Convert dataframe display float format to human readable for output display purpose only

I wish to display the dataframe column values in human readable format like 10, 100, 1K, 1M, 1B, etc.
So far, I could convert scientific values such as 1.111111e1 to plain float format using pandas options with the following argument:
pd.options.display.float_format = '{:.2f}'.format
Note: the 2 in the above line means 2 decimal places. Change it as you like.
But, still the output is pretty hard to read when the column has so many varying numeric values. Especially in financial use case, with columns such as currency, turnover, profit, etc.
How to do this?
Note: I do not wish to convert the stored values into string format. I have calculations on the column values, so that is not feasible. Further, I won't create new columns for display purposes, so df['new_col'] = df['col']/1000000 won't work either.
Sample dataframe:
pd.DataFrame([10.,100.,1000.,10000.,100000.,1000000.,10000000.,100000000.,1000000000.,10000000000.])
0 1.000000e+01
1 1.000000e+02
2 1.000000e+03
3 1.000000e+04
4 1.000000e+05
5 1.000000e+06
6 1.000000e+07
7 1.000000e+08
8 1.000000e+09
9 1.000000e+10
Use the following function as the display.float_format option in pandas to get the desired outcome; a registration sketch follows the output below.
lambda x : '{:.2f}'.format(x) if abs(x) < 1000 else ('{:.2f} K'.format(x/1000) if abs(x) < 1000000 else ('{:.2f} M'.format(x/1000000) if abs(x) < 1000000000 else '{:.2f} B'.format(x/1000000000)))
Output:
0 10.00
1 100.00
2 1.00 K
3 10.00 K
4 100.00 K
5 1.00 M
6 10.00 M
7 100.00 M
8 1.00 B
9 10.00 B
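For completeness, a minimal sketch of wiring this formatter into the pandas display options (the helper name human_format is just illustrative); since it only affects display, the stored values stay numeric:
import pandas as pd

def human_format(x):
    """Render a float as e.g. 10.00, 1.00 K, 1.00 M or 1.00 B depending on magnitude."""
    if abs(x) < 1_000:
        return '{:.2f}'.format(x)
    if abs(x) < 1_000_000:
        return '{:.2f} K'.format(x / 1_000)
    if abs(x) < 1_000_000_000:
        return '{:.2f} M'.format(x / 1_000_000)
    return '{:.2f} B'.format(x / 1_000_000_000)

# display-only change: calculations on the column still use the raw floats
pd.set_option('display.float_format', human_format)

df = pd.DataFrame([10., 1000., 1000000., 1000000000.])
print(df)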

Limiting float value in for loop Python

I'm working on a banking project, where my team asked me to limit all float values to 2 decimal places of precision.
My dataSet.head()
Goal: To find max of all stocks comparatively
My present output:
Bank Ticker
BAC 54.900002
C 564.099976
GS 247.919998
JPM 70.080002
MS 89.300003
WFC 58.520000
dtype: float64
My expected output:
Bank Ticker
BAC 54.90
C 564.10
GS 247.91
JPM 70.08
MS 89.30
WFC 58.52
dtype: float64
Please help me with this!
You are using "{:.2f}" incorrectly in your print statement; you should use .format() to format your float.
You can use print("{:.2f}".format(some_float)) to print a float with 2 decimals, as explained here.
You could use the pandas.Series.round method.
I've got a toy DataFrame df:
l1 c1 c2
l2 a b c a b c
0 0.066667 0.666667 6.666667 0.0002 0.002 0.02
1 0.133333 1.333333 13.333333 0.0004 0.004 0.04
2 0.200000 2.000000 20.000000 0.0006 0.006 0.06
3 0.266667 2.666667 26.666667 0.0008 0.008 0.08
df.xs('c', axis=1, level='l2').max().round(2)
Results into this:
l1
c1 26.67
c2 0.08
dtype: float64
I guess in your case
res = bank_stocks.xs('Close', axis=1, level='Stock Info').max().round(2)
would result in a Series res indexed by tickers, with name Bank Ticker and the desired values rounded to 2 decimal places.
According to this answer you can then print it with
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(res)
I am not a very advanced Python programmer, but this should work:
"{:.2f}".format(max())
At the least, that will print out 564.099976 as 564.10.
This works for me; it is reliable and avoids a loop:
bank_stocks.xs(key='Close',axis=1,level='Stock Info').max().round(2)
The O(n²) version is now O(n).
This is not a clean solution, but you can multiply the max stock price by 100, do a floor division by 1, and then divide by 100; that truncates the values to 2 decimal places (see the sketch below). This would solve your problem.
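A minimal sketch of that truncation idea (the series below is hypothetical, standing in for the max of the closing prices); note that floor division truncates rather than rounds, so 564.099976 becomes 564.09:
import pandas as pd

# hypothetical stand-in for bank_stocks.xs('Close', axis=1, level='Stock Info').max()
max_close = pd.Series({'BAC': 54.900002, 'C': 564.099976, 'GS': 247.919998,
                       'JPM': 70.080002, 'MS': 89.300003, 'WFC': 58.520000})

# scale up, drop everything after the second decimal, scale back down
truncated = (max_close * 100) // 1 / 100
print(truncated)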

Returning non-NaN values when calling df.X in pandas

This is a rather specific follow-up to this question on creating pandas dataframes when entries have different lengths.
I have a dataset where I have:
general environmental variables that apply to the whole problem (e.g. avg precipitation)
values at, say, specific depth (e.g. average amount of water at any depth after rainfall)
so my data looks like
d = {'depth': [1, 2, 3], 'var1': [.01, .009, .002], 'globalvar': [2.5]}
df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
>>
depth globalvar var1
0 1 2.5 0.010
1 2 NaN 0.009
2 3 NaN 0.002
Is there a way to call globalvar, e.g. df.globalvar without calling df.globalvar[1]? Is there a more pythonic way to do this?
You can do it with stack:
df.stack().loc[pd.IndexSlice[:,'globalvar']]
Out[445]:
0 2.5
dtype: float64
Or with dropna:
df.globalvar.dropna()
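If you want the scalar itself rather than a one-element Series, a small sketch (not part of the original answer, built on the df above):
import pandas as pd

d = {'depth': [1, 2, 3], 'var1': [.01, .009, .002], 'globalvar': [2.5]}
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})

# dropna leaves a one-element Series; iloc[0] pulls out the scalar 2.5
globalvar = df['globalvar'].dropna().iloc[0]
print(globalvar)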

reading data with varying length header

I want to read, in Python, a file which contains a varying-length header and then extract into a dataframe/series the variables which come after the header.
The data looks like :
....................................................................
Data coverage and measurement duty cycle:
When the instrument duty cycle is not in measure mode (i.e. in-flight
calibrations) the data is not given here (error flag = 2).
The measurements have been found to exhibit a strong sensitivity to cabin
pressure.
Consequently the instrument requires calibrated at each new cabin
pressure/altitude.
Data taken at cabin pressures for which no calibration was performed is
not given here (error flag = 2).
Measurement sensivity to large roll angles was also observed.
Data corresponding to roll angles greater than 10 degrees is not given
here (error flag = 2)
......................................................................
High Std: TBD ppb
Target Std: TBD ppb
Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
31638 0.62 0
31639 0.64 0
31640 0.71 0
.....
.....
So what I want is to extract the data as :
Time C2H6 Flag
0 31636 0.69 0 NaN
1 31637 0.66 0 NaN
2 31638 0.62 0 NaN
3 31639 0.64 0 NaN
4 31640 0.71 0 NaN
5 31641 0.79 0 NaN
6 31642 0.85 0 NaN
7 31643 0.81 0 NaN
8 31644 0.79 0 NaN
9 31645 0.85 0 NaN
I can do that with
infile = "/nfs/potts.jasmin-north/scratch/earic/AEOG/data/mantildas_faam_20180911_r1_c118.na"
flightdata = pd.read_fwf(infile, skiprows=53, header=None, names=['Time', 'C2H6', 'Flag'])
but I'm skipping approximately 53 rows because I counted how many I should skip. I have a bunch of these files and some don't have exactly 53 header rows, so I was wondering what would be the best way to deal with this, and what criterion would let Python always read only the three columns of data once it finds them. Suppose, for instance, I wanted Python to start reading the data from where it encounters
Mole fraction error flag description :
0 : Valid data
2 : Missing data
what should I do? Or is there another criterion that would work better?
You can split on the header delimiter, like so:
import io
import pandas as pd

with open(filename, 'r') as f:
    myfile = f.read()

# keep only the text after the header delimiter
infile = myfile.split('Mole fraction error flag description :')[-1]
# drop the remaining flag-description lines such as "0 : Valid data"
# (' : ' is a rough indicator; you know the data better)
infile = '\n'.join([line for line in infile.split('\n') if ' : ' not in line])
# read the cleaned text from an in-memory buffer
flightdata = pd.read_fwf(io.StringIO(infile), header=None, names=['Time', 'C2H6', 'Flag'])
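An alternative sketch that keeps read_fwf pointed at the file itself: first scan for the delimiter line to measure the header length, then pass that as skiprows. The assumption here, taken from the sample above, is that exactly two flag-description lines follow the delimiter:
import pandas as pd

infile = "mantildas_faam_20180911_r1_c118.na"  # hypothetical path

# count header lines up to and including the delimiter
with open(infile, 'r') as f:
    for lineno, line in enumerate(f, start=1):
        if line.startswith('Mole fraction error flag description'):
            header_end = lineno + 2  # also skip the two "... : ..." description lines
            break

flightdata = pd.read_fwf(infile, skiprows=header_end, header=None,
                         names=['Time', 'C2H6', 'Flag'])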
