Build a function which takes values from the row above in a pandas dataframe - python-3.x

I have the following dataframe:
I want to build a function to apply on column 'c' that takes the difference between columns 'u' and 'd' and adds the value from the row above in column 'c'.
So the table will look as follows:
For example, in row number 2 the calculation will be: 44.37 - 0 + 149.77 = 194.14
In row number 4 the calculation will be: 11.09 - 6.45 + 210.78 = 215.42
And so on.
I tried to build the function using iloc with a while loop, and also with shift, but neither worked; I got these errors:
("'numpy.float64' object has no attribute 'iloc'", 'occurred at index 0')
("'numpy.float64' object has no attribute 'shift'", 'occurred at index 0')
Any idea how to make this function work would be great.
Thanks!

You can subtract the columns directly and use a cumulative sum to carry the running value:
d u
0 0.000 149.75
1 0.000 44.37
2 0.000 16.64
3 6.450 11.09
4 77.345 5.54
5 64.520 16.40
df1['c'] = (df1['u'] - df1['d']).cumsum()
Out:
d u c
0 0.000 149.75 149.750
1 0.000 44.37 194.120
2 0.000 16.64 210.760
3 6.450 11.09 215.400
4 77.345 5.54 143.595
5 64.520 16.40 95.475
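A minimal, self-contained sketch of the whole answer; the values are the sample data above, not the asker's original frame:
import pandas as pd

df1 = pd.DataFrame({'d': [0.0, 0.0, 0.0, 6.45, 77.345, 64.52],
                    'u': [149.75, 44.37, 16.64, 11.09, 5.54, 16.40]})
# each row adds (u - d) to the previous row's result,
# which is exactly a cumulative sum of the per-row differences
df1['c'] = (df1['u'] - df1['d']).cumsum()
print(df1)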

Related

How to get the column name of a dataframe from values in a numpy array

I have a df with 15 columns:
df.columns:
0 class
1 name
2 location
3 income
4 edu_level
--
14 marital_status
After some transformations I got a numpy.ndarray with shape (15, 3), named loads:
0.52 0.33 0.09
0.20 0.53 0.23
0.60 0.28 0.23
0.13 0.45 0.41
0.49 0.9
...
So: 3 columns, each with 15 values.
What I need to do:
I want to get the df column names for the values in the first column of loads that are greater than 0.50.
For this example, the columns of df related to the first column of loads with values higher than 0.5 should be:
0 class
2 location
The same for the second column of loads should return:
1 name
3 income
4 edu_level
and the same logic applies to the 3rd column of loads.
I managed to get the numpy array loads the way I need it, but I am having a bad time with this last part. I know I can simply pick the columns manually, but that becomes a hard task when df has more than 15 features.
Can anyone help me, please?
Given your threshold, you can create a boolean mask to filter df.columns:
threshold = 0.5
for j in range(loads.shape[1]):
    print(df.columns[loads[:, j] > threshold])
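A small reproducible sketch; the column names and the loads values below are made up for illustration, following the example in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['class', 'name', 'location', 'income', 'edu_level'])
loads = np.array([[0.52, 0.33, 0.09],
                  [0.20, 0.53, 0.23],
                  [0.60, 0.28, 0.23],
                  [0.13, 0.55, 0.41],
                  [0.49, 0.90, 0.10]])

threshold = 0.5
for j in range(loads.shape[1]):
    # the boolean mask over the rows of loads selects the matching column names
    print(j, list(df.columns[loads[:, j] > threshold]))
# 0 ['class', 'location']
# 1 ['name', 'income', 'edu_level']
# 2 []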

Pandas Dataframe: Dropping Selected rows with 0.0 float type values

I have a dataset that contains an amount as a float type. Some of the rows contain values of 0.00, and because they skew the dataset, I need to drop them. I have temporarily set the amount as the index and sorted the values as well.
Afterwards, I attempted to drop the rows after subsetting with loc, but I keep getting an error message of the form ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
mortgage = mortgage.set_index('Gross Loan Amount').sort_values('Gross Loan Amount')
mortgage.drop([mortgage.loc[0.0]])
I also tried this:
mortgage.drop(mortgage.loc[0.0])
which flagged an error of the form KeyError: "[Column_names] not found in axis"
How else can I accomplish the task?
You could make a boolean frame and then use any:
df = df[~(df == 0).any(axis=1)]
In this code, all rows that have at least one zero in their data have been removed.
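A quick sketch of what that mask does, with a made-up two-column frame:
import pandas as pd

df = pd.DataFrame({'a': [1.0, 0.0, 3.0], 'b': [4.0, 5.0, 0.0]})
# (df == 0) flags each zero cell; .any(axis=1) marks rows with at least one zero;
# ~ inverts the mask so only fully non-zero rows are kept
print(df[~(df == 0).any(axis=1)])
#      a    b
# 0  1.0  4.0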
Let me see if I get your problem. I created this sample dataset:
df = pd.DataFrame({'Values': [200.04,100.00,0.00,150.15,69.98,0.10,2.90,34.6,12.6,0.00,0.00]})
df
Values
0 200.04
1 100.00
2 0.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
9 0.00
10 0.00
Now, in order to get rid of the 0.00 values, you just have to do this:
df = df[df['Values'] != 0.00]
Output:
df
Values
0 200.04
1 100.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
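Since the question sets 'Gross Loan Amount' as the index, here is a sketch of the index-based route (assuming the index really holds the float amounts):
mortgage = mortgage[mortgage.index != 0.0]
# or, if the amounts stay in a regular column:
# mortgage = mortgage[mortgage['Gross Loan Amount'] != 0.0]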

How to find the index of a row for a particular value in a particular column and then create a new column with that starting point?

Example of dataframe
What I'm trying to do with my dataframe...
Locate the first 0 value in a certain column (G in the example photo).
Create a new column (Time) with the value 0 lining up on the same row as that first 0 in column G.
Then each row after the 0 in the Time column increases by (1/60) until the end of the data,
and each row before the 0 decreases by (1/60) back to the beginning of the data.
What is the best method to achieve this?
Any advice would be appreciated. Thank you.
Pretty straightforward:
identify the index of the row that contains the value you are looking for,
then construct an array that starts negative, hits zero at that index row, and runs on to the end of the series.
import pandas as pd
import numpy as np
df = pd.DataFrame({"Time":np.full(25, 0), "G":[i if i>0 else 0 for i in range(10,-15,-1)]})
# find the first row where the value is zero
idx = df[df["G"]==0].index[0]
intv = round(1/60, 3)
# construct a numpy array from a range of values (start, stop, increment)
df["Time"] = np.arange(-idx*intv, (len(df)-idx)*intv, intv)
df.loc[idx, "Time"] = 0 # just remove any rounding at zero point
print(df.to_string(index=False))
output
Time G
-0.170 10
-0.153 9
-0.136 8
-0.119 7
-0.102 6
-0.085 5
-0.068 4
-0.051 3
-0.034 2
-0.017 1
0.000 0
0.017 0
0.034 0
0.051 0
0.068 0
0.085 0
0.102 0
0.119 0
0.136 0
0.153 0
0.170 0
0.187 0
0.204 0
0.221 0
0.238 0
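An equivalent sketch that skips building the range by hand: Time is just each row's offset from idx scaled by 1/60, so it can be computed directly (this keeps full precision instead of the rounded 0.017 step):
df["Time"] = (df.index - idx) / 60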

Checking a condition in a pd.DataFrame with np.where() and applying a function to more than one column

Assume a dataframe df with two columns, one holding a value for mass and the other its unit of measurement. The two columns look like this:
df.head()
Mass Unit
0 14 g
1 1.57 kg
2 701 g
3 0.003 tn
4 0.6 kg
I want to have a consistent system of measurements and thus, I perform the following:
df['Mass']=np.where(df['Unit']=='g', df['Mass']/1000, df['Mass']) #1
df['Unit']=np.where(df['Unit']=='g', 'kg', df['Unit']) #2
df['Mass']=np.where(df['Unit']=='tn', df['Mass']*1000, df['Mass']) #3
df['Unit']=np.where(df['Unit']=='tn', 'kg', df['Unit']) #4
a) Is there a way to perform #1 & #2 in one line, maybe using apply?
b) Is it possible to perform #1, #2, #3 and #4 in only one line?
Thank you for your time!
It is possible with numpy.select, BUT because it mixes a numeric column and a string column, the numeric values in Mass are converted to strings, so the last step converts them back to floats:
df['Mass'], df['Unit'] = np.select([df['Unit']=='g', df['Unit']=='tn'],
                                   [(df['Mass']/1000, np.repeat(['kg'], len(df))),
                                    (df['Mass']*1000, np.repeat(['kg'], len(df)))],
                                   (df['Mass'], df['Unit']))
df['Mass'] = df['Mass'].astype(float)
print (df)
Mass Unit
0 0.014 kg
1 1.570 kg
2 0.701 kg
3 3.000 kg
4 0.600 kg
The same problem occurs with numpy.where; note that it only tests the single 'g' condition, so the 'tn' row stays unconverted:
df['Mass'], df['Unit'] = np.where(df['Unit']=='g',
                                  (df['Mass']/1000, np.repeat(['kg'], len(df))),
                                  (df['Mass'], df['Unit']))
df['Mass'] = df['Mass'].astype(float)
print (df)
Mass Unit
0 0.014 kg
1 1.570 kg
2 0.701 kg
3 0.003 tn
4 0.600 kg
You can do the following, which does not use any numpy functions:
i, j = df[df['Unit']=='g'].index, df[df['Unit']=='tn'].index
df.loc[i,'Mass'], df.loc[j,'Mass'], df['Unit'] = df.loc[i,'Mass']/1000, df.loc[j,'Mass']*1000, 'kg'
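A compact alternative sketch using a conversion map; the factor dict and the assumption that only g, kg, and tn ever appear are mine, not from the question:
factors = {'g': 0.001, 'kg': 1.0, 'tn': 1000.0}
df['Mass'] = df['Mass'] * df['Unit'].map(factors)  # scale every row to kilograms
df['Unit'] = 'kg'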

Python - Calculating standard deviation (row level) of dataframe columns

I have created a Pandas Dataframe and am able to determine the standard deviation of one or more columns of this dataframe (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the below command I am getting "NaN" for all the records. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
It is expected; from the DataFrame.std documentation:
Normalized by N-1 by default. This can be changed using the ddof argument.
If you have only one element, you are dividing by N-1 = 0, so with a single column the sample standard deviation across columns comes out as all missing values.
Sample:
inp_df = pd.DataFrame({'salary':[10,20,30],
'num_months':[1,2,3],
'no_of_hours':[2,5,6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column by one [] for Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get std of Series - get a scalar:
print (inp_df['salary'].std())
10.0
Select one column by double [] for one column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get std of DataFrame per index (default value) - get one element Series:
print (inp_df[['salary']].std())
# same as
# print(inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get std of DataFrame per columns (axis=1) - get all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If changed default ddof=1 to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std by two or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64
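As a sanity check of the N-1 normalization (my own arithmetic, reproducing the first row of the last output):
import numpy as np

row = np.array([10.0, 2.0])  # salary and no_of_hours in row 0
print(np.sqrt(((row - row.mean())**2).sum() / (len(row) - 1)))
# 5.656854..., matching inp_df[['salary','no_of_hours']].std(axis=1)[0]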
