How to create a new pandas column as the difference of two consecutive rows of another column?

I have a pandas dataframe
x
1
3
4
7
10
I want to create a new column y as y[i] = x[i] - x[i-1] (and y[0] = x[0]).
So the above data frame will become:
x y
1 1
3 2
4 1
7 3
10 3
How to do that with python-3? Many thanks

Using .shift() and fillna():
df['y'] = (df['x'] - df['x'].shift(1)).fillna(df['x'])
To explain what this is doing, if we print(df['x'].shift(1)) we get the following series:
0 NaN
1 1.0
2 3.0
3 4.0
4 7.0
Which is your values from 'x' shifted down one row. The first row gets NaN because there is no value above it to shift down. So, when we do:
print(df['x'] - df['x'].shift(1))
We get:
0 NaN
1 2.0
2 1.0
3 3.0
4 3.0
Which is your subtracted values, but in our first row we get a NaN again. To clear this, we use .fillna(), telling it that we want to just take the value from df['x'] whenever a null value is encountered.
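As a side note, the same result can probably be reached with Series.diff(), which computes x[i] - x[i-1] directly. This is a minimal sketch of an alternative to the shift() line, not part of the original answer; the sample frame is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 3, 4, 7, 10]})

# diff() is x[i] - x[i-1]; the first row has no predecessor, so it is NaN
# and gets filled with the value of x itself.
df["y"] = df["x"].diff().fillna(df["x"])

# Optional: cast back to int, since the intermediate NaN made the column float.
df["y"] = df["y"].astype(int)
print(df)
```

The cast at the end is only safe because every NaN has been filled before it runs.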

Related

Pandas: GroupBy Shift And Cumulative Sum

I want to do a groupby, shift and cumsum, which seems like a pretty trivial task, but I am still banging my head over the result I'm getting. Can someone please tell me what I am doing wrong? All the results I found online show the same approach as mine, or a variation of it. Below is my implementation.
temp = pd.DataFrame(data=[['a',1],['a',1],['a',1],['b',1],['b',1],['b',1],['c',1],['c',1]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID')['X'].cumsum().shift()
print(temp)
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 3.0
4 b 1 1.0
5 b 1 2.0
6 c 1 3.0
7 c 1 1.0
This is wrong; the result I am looking for is shown below:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
Thanks a lot in advance.
You could use transform() to feed the separate groups that are created at each level of groupby into the cumsum() and shift() methods.
temp['transformed'] = \
temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
For more info on transform() please see here:
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#Transformation
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#transformation
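You need to use apply, because cumsum is applied per group by the groupby object, while a chained shift is applied to the whole result rather than per group. (On recent pandas versions, group_keys=False keeps the original index so the result can be assigned back as a column.)
temp['transformed'] = temp.groupby('ID', group_keys=False)['X'].apply(lambda x : x.cumsum().shift())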
temp
Out[287]:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
While working on this problem, I noticed that using lambdas with transform gets very slow as the DataFrame grows. I found that the built-in DataFrameGroupBy methods (like cumsum and shift) are much faster than lambdas.
So here's my proposed solution, creating a 'temp' column to save the cumsum for each ID and then shifting in a different groupby:
df['temp'] = df.groupby("ID")['X'].cumsum()
df['transformed'] = df.groupby("ID")['temp'].shift()
df = df.drop(columns=["temp"])
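The two-step idea can also be sketched without a temporary column, by grouping the intermediate Series on the same keys. This is only an illustration that reuses the sample data from the question; the variable name running is an assumption introduced here.

```python
import pandas as pd

temp = pd.DataFrame(
    data=[["a", 1], ["a", 1], ["a", 1], ["b", 1], ["b", 1], ["b", 1], ["c", 1], ["c", 1]],
    columns=["ID", "X"],
)

# Per-group running total, computed by the built-in groupby cumsum.
running = temp.groupby("ID")["X"].cumsum()

# Group the intermediate Series by the same keys and shift within each group,
# so the first row of every group becomes NaN.
temp["transformed"] = running.groupby(temp["ID"]).shift()
print(temp)
```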

How to change the value of a cell that contains NaN to another specific value?

I have a dataframe that contains NaN values in a particular column. While iterating through the rows, whenever I come across a NaN (detected with the isnan() method) I need to change it to some other value, which depends on some conditions. I tried replace() and fillna(), also with the limit parameter, but they modify the whole column as soon as they come across the first NaN value. Is there a method that lets me assign a value to one specific NaN rather than changing all the values of a column?
Example: the dataframe looks like this:
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 NaN
2 x3 3 'cat' 1 2 3 1 1 NaN
3 x4 6 'lion' 8 4 3 7 1 NaN
4 x5 4 'lion' 1 1 3 1 1 NaN
5 x6 8 'cat' 10 10 9 7 1 0.0
and I have a list like
a = [1.0, 0.0]
and I expect the result to be
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
I wanted to change the target_class values based on some conditions and assign values of the above list.
I believe you need to replace the NaN values with 1 only for the indexes specified in the list idx (the remaining NaNs become 0):
mask = df['target_class'].isnull()
idx = [1,2,3]
df.loc[mask, 'target_class'] = df[mask].index.isin(idx).astype(int)
print (df)
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
Or:
idx = [1,2,3]
s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
EDIT:
From the comments, the solution is to assign values by index and column labels with DataFrame.loc:
df2.loc['x2', 'target_class'] = list1[0]
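To illustrate the point that only the NaN cells need to be written, here is a minimal sketch that assigns one value per NaN cell through a boolean mask. The reduced frame and the list new_values are assumptions made for the example; how the values are actually derived depends on conditions the question does not spell out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "points": ["x2", "x3", "x4", "x5", "x6"],
        "target_class": [np.nan, np.nan, np.nan, np.nan, 0.0],
    },
    index=[1, 2, 3, 4, 5],
)

# Hypothetical values, one per NaN cell, in row order.
new_values = [1.0, 1.0, 1.0, 0.0]

mask = df["target_class"].isna()

# Only the rows selected by the mask are written; row 5 keeps its 0.0.
df.loc[mask, "target_class"] = new_values
print(df)
```

The list must have exactly as many entries as there are NaN cells, otherwise the assignment raises an error.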
I suppose your conditions for imputing the NaN values do not depend on how many of them there are in a column. In the code below I put all the imputation rules in one function that receives the entire row (containing the NaN) and the column you are investigating as parameters. If the imputation rules also need the whole dataframe, just pass it to the replace_nan function as well. In the example I impute the col element with the mean of the values in the other columns.
import pandas as pd
import numpy as np
def replace_nan(row, col):
    row[col] = row.drop(col).mean()
    return row
df = pd.DataFrame(np.random.rand(5,3), columns = ['col1', 'col2', 'col3'])
col_to_impute = 'col1'
df.loc[[1, 3], col_to_impute] = np.nan
df = df.apply(lambda x: replace_nan(x, col_to_impute) if np.isnan(x[col_to_impute]) else x, axis=1)
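For this particular rule (mean of the other columns) a vectorized variant is possible; the sketch below is an alternative to the row-wise apply, not a replacement for the general replace_nan pattern. The seeded generator is an assumption added only to make the example reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=["col1", "col2", "col3"])
col_to_impute = "col1"
df.loc[[1, 3], col_to_impute] = np.nan

# Mean of the other columns, computed for every row at once.
row_means = df.drop(columns=col_to_impute).mean(axis=1)

# fillna aligns on the index, so only the NaN cells take their row mean.
df[col_to_impute] = df[col_to_impute].fillna(row_means)
print(df)
```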
The only thing you need to do is make the assignment in the right place, that is, only in the rows that contain nulls.
Example dataset:
,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
Note: the last two rows contain nulls in the 'label' column. Then, we load the dataset:
df = pd.read_csv('dataset.csv')
Now, we build the appropriate condition:
cond = df['label'].isnull()
Now, we make the assignment over these rows (I don't know the logic behind the assignment, so I simply assign the value 1 to the NaNs):
df.loc[cond,'label'] = 1
There are other, more precise approaches; the fillna() method could be used, for example. You should describe the assignment logic so that we can help you further.
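A self-contained sketch of the same idea, combining the null mask with a further condition. The rule used here (clicks get 1, touches get 0) is purely hypothetical, since the question does not state the real logic, and the CSV is read from an in-memory string instead of dataset.csv so the snippet runs as-is.

```python
import io

import pandas as pd

csv_text = """,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

cond = df["label"].isnull()

# Hypothetical rule: a missing label becomes 1 for clicks and 0 for touches.
df.loc[cond & (df["type"] == "click"), "label"] = 1
df.loc[cond & (df["type"] == "touch"), "label"] = 0
print(df)
```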

finding mean across rows in a dataframe with pandas

I have a dataframe
L1_1 L1_2 L3_1 L2_1 L2_2 L1_3 L2_3 L3_2 ....
3 5 1 5 7 2 3 2
4 2 4 1 4 1 4 2
I need to find the mean of all the "L1" columns, then of all the "L2" columns, and then of all the "L3" columns.
I tried
data["Mean"]=data.mean(axis=1)
but that gives me the mean across all the "L1", "L2" and "L3" columns together
I also tried
data[['L1_1','L1_2','L1_3','Mean']].head()
but I have L1_1 to L1_20
so a loop sounds good. However, I cannot get a loop to work.
for i in range(1,21):
c = "'L1_" + i + "'," + c
Is a loop a good way to go here, or is there a better way?
If a loop is the way to go, How do you get a loop to work in a data frame?
Use groupby over the columns (axis=1) with a custom function that splits the column names and keeps the prefix:
df1 = df.groupby(lambda x: x.split('_')[0], axis=1).mean()
#another solution
#df1 = df.groupby(df.columns.str.split('_').str[0], axis=1).mean()
print (df1)
L1 L2 L3
0 3.333333 5.0 1.5
1 2.333333 3.0 3.0
If you want to add the new columns to the original df, use join; add add_suffix if you also want to rename the new columns:
df = df.join(df1.add_suffix('_mean'))
print (df)
L1_1 L1_2 L3_1 L2_1 L2_2 L1_3 L2_3 L3_2 L1_mean L2_mean L3_mean
0 3 5 1 5 7 2 3 2 3.333333 5.0 1.5
1 4 2 4 1 4 1 4 2 2.333333 3.0 3.0
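Recent pandas releases deprecate the axis=1 argument of groupby, so the following sketch shows one possible way to get the same per-prefix means by transposing first. It assumes, as in the question, that every column name has the form prefix_number.

```python
import pandas as pd

df = pd.DataFrame(
    [[3, 5, 1, 5, 7, 2, 3, 2], [4, 2, 4, 1, 4, 1, 4, 2]],
    columns=["L1_1", "L1_2", "L3_1", "L2_1", "L2_2", "L1_3", "L2_3", "L3_2"],
)

# Transpose so the column names become the index, group the index labels by
# the part before "_", take the mean, and transpose back.
df1 = df.T.groupby(lambda name: name.split("_")[0]).mean().T
print(df1)
```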

Replacing values in specific columns in a Pandas Dataframe, when number of columns are unknown

I am brand new to Python and Stack Exchange. I have been trying to replace invalid values (x < -3 or x > 12) with np.nan in specific columns.
I don't know how many columns I will have to deal with, so I have to write general code that takes this into account. I do know, however, that the first two columns are ids and names respectively. I have searched Google and Stack Exchange but haven't been able to find a solution that meets my specific objective.
My question is: how would one replace values found in the third column and onwards?
My dataframe looks like this;
Data
I tried this line:
Data[Data > 12.0] = np.nan
This replaced the first two columns with NaN.
1st attempt
I tried this line:
Data[(Data.iloc[(range(2,Columns))] >=12) & (Data.iloc[(range(2,Columns))]<=-3)] = np.nan
where,
Columns = len(Data.columns)
This is clearly wrong, as it replaces all values in rows 2 to 6 (Columns = 7).
2nd attempt
Any thoughts would be greatly appreciated.
Python 3.6.1 64bits, Qt 5.6.2, PyQt5 5.6 on Darwin
You're looking for the applymap() method.
import pandas as pd
import numpy as np
# get the columns after the second one
cols = Data.columns[2:]
# apply mask to those columns
new_df = Data[cols].applymap(lambda x: np.nan if x > 12 or x < -3 else x)
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html
This approach assumes your columns after the second contain float or int values.
You can set values in specific columns of a dataframe by using iloc and slicing the columns that you need. The values to keep can then be selected with where.
A short example using some random data
df = pd.DataFrame(np.random.randint(0,10,(4,10)))
0 1 2 3 4 5 6 7 8 9
0 7 7 9 4 2 6 6 1 7 9
1 0 1 2 4 5 5 3 9 0 7
2 0 1 4 4 3 8 7 0 6 1
3 1 4 0 2 5 7 2 7 9 9
Now we use iloc to select both the region to update and the region the new values come from, slicing from the column at position 2 to the last column
df.iloc[:,2:] = df.iloc[:,2:].where((df < 7) & (df > 2))
This keeps the values that satisfy the condition and sets all other values in that region to NaN.
0 1 2 3 4 5 6 7 8 9
0 7 7 NaN 4.0 NaN 6.0 6.0 NaN NaN NaN
1 0 1 NaN 4.0 5.0 5.0 3.0 NaN NaN NaN
2 0 1 4.0 4.0 3.0 NaN NaN NaN 6.0 NaN
3 1 4 NaN NaN 5.0 NaN NaN NaN NaN NaN
For your data the code would be this
Data.iloc[:,2:] = Data.iloc[:,2:].where((Data.iloc[:,2:] <= 12) & (Data.iloc[:,2:] >= -3))
Operator clarification
The condition passed to where() marks the values to keep, so the line above keeps everything with
-3 <= Data <= 12
and replaces the rest with NaN.
The invalid values themselves cannot be selected with the & operator, because a number cannot be both less than -3 and greater than 12 at the same time. They are selected with the | (or) operator instead:
Data < -3 or Data > 12
Since this condition marks the values to replace rather than the values to keep, it goes with mask() instead of where():
Data.iloc[:,2:] = Data.iloc[:,2:].mask((Data.iloc[:,2:] > 12) | (Data.iloc[:,2:] < -3))
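A small self-contained sketch of the where() approach on a frame whose first two columns are text. The column names and values are made up for illustration; the comparison is restricted to the value columns so that the id and name strings are never compared with numbers.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "id": ["a1", "a2", "a3"],
        "name": ["Ann", "Bob", "Cy"],
        "grade_1": [7.0, -5.0, 12.0],
        "grade_2": [13.0, 4.0, -3.0],
    }
)

# Everything from the third column onwards, however many columns there are.
value_cols = data.columns[2:]
values = data[value_cols]

# where() keeps values inside [-3, 12] and replaces the rest with NaN.
data[value_cols] = values.where((values >= -3) & (values <= 12))
print(data)
```

The boundary values -3 and 12 are kept, matching the question's definition of invalid as x < -3 or x > 12.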

assign a new column in pd.dataframe (pandas) by existing column and defining its starting position

I want to assign a new column in a pandas DataFrame from an existing column while defining its starting position (that is, shift the values of a column up or down).
For example, I create a new pandas DataFrame and then, from its column, I want to make a new column that starts from, e.g., the 2nd, 3rd or 4th element, with the remaining values replaced by a specific value or string. Can this be done easily, in one line?
# Creating a dataframe with 10 ones:
ww=pd.DataFrame(10*[1])
# then try to assign a new column that starts from its 2nd element;
# however, for some reason (can you explain why?), it places a NaN
# as the first element of that new column.
ww=ww.assign(ww=ww[1:])
ww
the output given is that:
0 ww
0 1 NaN
1 1 1.0
2 1 1.0
3 1 1.0
4 1 1.0
5 1 1.0
6 1 1.0
7 1 1.0
8 1 1.0
9 1 1.0
while the expected output will be:
0 ww
0 1 1.0
1 1 1.0
2 1 1.0
3 1 1.0
4 1 1.0
5 1 1.0
6 1 1.0
7 1 1.0
8 1 1.0
9 1 NaN
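One likely explanation is index alignment: the slice ww[1:] keeps the labels 1 to 9, so when it is assigned back, label 0 has no matching value and becomes NaN. The sketch below, which is only one possible approach, uses Series.shift(-1) to move the values up instead; the fill value 0 in the last line is an arbitrary illustrative choice.

```python
import pandas as pd

# A dataframe with 10 ones, as in the question.
ww = pd.DataFrame(10 * [1])

# shift(-1) moves every value up by one position, so the new column starts
# from the 2nd element and the last row is left as NaN.
ww = ww.assign(ww=ww[0].shift(-1))
print(ww)

# The gap can be filled with a specific value instead of NaN.
filled = ww[0].shift(-1, fill_value=0)
print(filled)
```

Changing -1 to -2 or -3 starts the new column from the 3rd or 4th element instead.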
