Create new columns by comparing the current row's values with the previous row's in Pandas - python-3.x

Given a dummy dataset df as follow:
year v1 v2
0 2017 0.3 0.1
1 2018 0.1 0.1
2 2019 -0.2 0.5
3 2020 NaN -0.3
4 2021 0.8 0.0
or:
[{'year': 2017, 'v1': 0.3, 'v2': 0.1},
{'year': 2018, 'v1': 0.1, 'v2': 0.1},
{'year': 2019, 'v1': -0.2, 'v2': 0.5},
{'year': 2020, 'v1': nan, 'v2': -0.3},
{'year': 2021, 'v1': 0.8, 'v2': 0.0}]
I need to create two more columns trend_v1 and trend_v2 based on v1 and v2 respectively.
The logic to create the new columns is this: for each column, if the current value is greater than the previous one, the trend value is increase; if it is less than the previous one, decrease; if it is equal to the previous one, equal; and if either the current or the previous value is NaN, the trend is also NaN.
year v1 v2 trend_v1 trend_v2
0 2017 0.3 0.1 NaN NaN
1 2018 0.1 0.1 decrease equal
2 2019 -0.2 0.5 decrease increase
3 2020 NaN -0.3 NaN decrease
4 2021 0.8 0.0 NaN increase
How could I achieve that in Pandas? Thanks for your help in advance.

You can test the trend for the selected columns by comparing them with their shifted values, handling missing values separately with nested numpy.where:
import numpy as np

cols = ['v1', 'v2']
arr = np.where(df[cols] < df[cols].shift(), 'decrease',
      np.where(df[cols] > df[cols].shift(), 'increase',
      np.where(df[cols].isna() | df[cols].shift().isna(), None, 'equal')))
df = df.join(pd.DataFrame(arr, columns=cols, index=df.index).add_prefix('trend_'))
print (df)
year v1 v2 trend_v1 trend_v2
0 2017 0.3 0.1 None None
1 2018 0.1 0.1 decrease equal
2 2019 -0.2 0.5 decrease increase
3 2020 NaN -0.3 None decrease
4 2021 0.8 0.0 None increase
Or:
cols = ['v1','v2']
m1 = df[cols] < df[cols].shift()
m2 = df[cols] > df[cols].shift()
m3 = df[cols].isna() | df[cols].shift().isna()
arr = np.select([m1, m2, m3],['decrease','increase', None], default='equal')
df = df.join(pd.DataFrame(arr, columns=cols, index=df.index).add_prefix('trend_'))
EDIT:
A nice improvement, suggested in the comments, is to change m3 so that equality is tested explicitly and NaN rows fall through to the default:
cols = ['v1','v2']
m1 = df[cols] < df[cols].shift()
m2 = df[cols] > df[cols].shift()
m3 = df[cols] == df[cols].shift()
arr = np.select([m1, m2, m3],['decrease','increase', 'equal'], default=None)
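An equivalent per-column formulation (not from the original answer) uses diff() with np.sign: diff() is NaN whenever the current or the previous value is missing, and Series.map leaves unmapped NaN as NaN, so the NaN rule comes for free. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2017, 2018, 2019, 2020, 2021],
                   'v1': [0.3, 0.1, -0.2, np.nan, 0.8],
                   'v2': [0.1, 0.1, 0.5, -0.3, 0.0]})

mapping = {-1.0: 'decrease', 0.0: 'equal', 1.0: 'increase'}
for c in ['v1', 'v2']:
    # np.sign of the row-to-row difference is -1/0/1, or NaN if either value is NaN
    df['trend_' + c] = np.sign(df[c].diff()).map(mapping)
```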

Related

Convert one dataframe's format and check if each row exists in another dataframe in Python

Given a small dataset df1 as follow:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create a date range in quarters from 2019-q2 to 2021-q1 as column names, then check whether each row's year and quarter in df1 exists in df2 for each city.
If they exist, return y for that cell; otherwise, return NaN.
The final result will look like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert the counts to bool, replace False/True with the desired values, reindex to add the missing columns, and clean up the axis name and index to get the exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
    pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
      .astype(bool)                                   # counts to boolean
      .replace({False: np.nan, True: 'y'})            # fill values
      .reindex(columns=col_names.strftime('%Y-q%q'))  # add missing columns
      .rename_axis(columns=None)                      # clean up axis name
      .reset_index()                                  # move city out of the index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})
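The '%Y-q%q' format used in the reindex step relies on Period.strftime's %q directive (the quarter number). A quick check of the labels, here built with period_range (an assumption; it sidesteps the freq='Q' alias of date_range, which recent pandas deprecates in favor of 'QE'):

```python
import pandas as pd

# 2019Q1 through 2021Q1 inclusive: nine quarterly periods
col_names = pd.period_range('2019Q1', '2021Q1', freq='Q')
labels = col_names.strftime('%Y-q%q')  # e.g. '2019-q1', '2019-q2', ...
```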

Read multi-index excel file and reshape the headers in Pandas

Given an excel file data.xlsx (screenshot omitted) with two header rows and two index columns,
I have read it with df = pd.read_excel('data.xlsx', header=[0, 1], index_col=[0, 1], sheet_name='Sheet1'),
Out:
district 2018 2019
price ratio price ratio
bj cy 12 0.01 6 0.02
sh hp 4 0.02 3 0.05
I wonder if it's possible to transform it to a long format, with one row per city, district and year and columns price and ratio? Thank you for your help.
Use DataFrame.stack with DataFrame.rename_axis and DataFrame.reset_index:
df = df.stack(0).rename_axis(('city','district','year')).reset_index()
print (df)
city district year price ratio
0 bj cy 2018 12 0.01
1 bj cy 2019 6 0.02
2 sh hp 2018 4 0.02
3 sh hp 2019 3 0.05
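Since data.xlsx isn't available here, the same MultiIndex frame can be built directly to try the stack call (a sketch; the column and index values are taken from the output above):

```python
import pandas as pd

cols = pd.MultiIndex.from_product([[2018, 2019], ['price', 'ratio']])
idx = pd.MultiIndex.from_tuples([('bj', 'cy'), ('sh', 'hp')])
df = pd.DataFrame([[12, 0.01, 6, 0.02], [4, 0.02, 3, 0.05]],
                  index=idx, columns=cols)

# stack level 0 of the columns (the years) into the index, then flatten
out = df.stack(0).rename_axis(('city', 'district', 'year')).reset_index()
```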

crosstab pandas with condition on columns doesn't display summed values

I have a problem displaying what I want with pd.crosstab
I tried those lines:
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][df_temp['state'] >= 20], margins=True)
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][df_temp['state'] >= 20], margins=True, aggfunc = lambda x: x.count(), values = df_temp['state'][df_temp['state'] >= 20])
And they both display this:
state 20.0 30.0 32.0 50.0 All
date
2017 303.0 327.0 6.0 118.0 754.0
2018 328.0 167.0 3.0 58.0 556.0
All 631.0 494.0 9.0 176.0 1310.0
But what I want is not, for each state, the count of values equal to that state. For example, for state 20 in each year I want the count of all values greater than or equal to 20, so it should be 754. For 30 it should be 754 - 303 = 451, and so on for the other states.
I also tried this command, but it doesn't work either:
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][(df_temp['state'] >= 20) | (df_temp['state'] == 30)], margins=True, aggfunc = lambda x: x.count(), values = df_temp['state'][(df_temp['state'] == 20) | (df_temp['state'] == 30)])
It displays the following table:
state 20.0 30.0 32.0 50.0 All
date
2017 303.0 327.0 0.0 0.0 630.0
2018 328.0 167.0 0.0 0.0 495.0
All 631.0 494.0 NaN NaN 1125.0
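One way to get what's described, counting for each state the values greater than or equal to it, is a reversed cumulative sum across the crosstab columns (a sketch with made-up data, since df_temp isn't shown in the question):

```python
import pandas as pd

df_temp = pd.DataFrame({
    'date': ['2017-01-05', '2017-06-01', '2017-09-09', '2018-03-03', '2018-12-31'],
    'state': [20.0, 30.0, 50.0, 20.0, 30.0],
})

ct = pd.crosstab(pd.to_datetime(df_temp['date']).dt.year, df_temp['state'])
# Reverse the columns, cumulate left-to-right, then restore the order:
# each cell becomes the count of values >= that state within the year.
ct_ge = ct.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]
```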

How to check if a value is positive or negative and insert a new value in a column of a data frame

I am working with pandas and Python, and I need to set a sign column based on the diff column:
GDP diff sign
Quarter
1999q4 12323.3 NaN None
2000q1 12359.1 35.8 None
2000q2 12592.5 233.4 None
2000q3 12607.7 15.2 None
2000q4 12679.3 71.6 None
Let the above dataframe be df. When I do
if df.iloc[2]['diff'] > 0:
    df.iloc[2]['sign'] = "Positive"
the value is not getting updated in the dataframe. Is there something I'm doing wrong? It's a direct assignment, like df['something'] = 'some value', which normally inserts that value into df under that column. But when I try to determine positive or negative as above, df.iloc[2]['sign'] still shows None.
I tried using apply with lambdas, but couldn't get what I wanted.
Some help would be appreciated.
Thank you.
You can use a double numpy.where: first filter out NaN values with isnull, then test the condition df['diff'] > 0:
df['sign'] = np.where(df['diff'].isnull(), np.nan,
             np.where(df['diff'] > 0, 'Positive', 'Negative'))
print (df)
Quarter GDP diff sign
0 1999q4 12323.3 NaN NaN
1 2000q1 12359.1 35.8 Positive
2 2000q2 12592.5 233.4 Positive
3 2000q3 12607.7 15.2 Positive
4 2000q4 12679.3 -71.6 Negative
because using only df['diff'] > 0 yields Negative for NaN values:
df['sign'] = np.where(df['diff'] > 0, 'Positive', 'Negative')
print (df)
Quarter GDP diff sign
0 1999q4 12323.3 NaN Negative
1 2000q1 12359.1 35.8 Positive
2 2000q2 12592.5 233.4 Positive
3 2000q3 12607.7 15.2 Positive
4 2000q4 12679.3 -71.6 Negative
I'd create a categorical column:
d = df['diff']
sign = np.where(d < 0, 'Negative',
       np.where(d == 0, 'UnChanged',
       np.where(d > 0, 'Positive', np.nan)))
df['sign'] = pd.Categorical(sign,
                            categories=['Negative', 'UnChanged', 'Positive'],
                            ordered=True)
df
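As for why the original assignment had no effect (not addressed above): df.iloc[2]['sign'] is chained indexing, so the assignment goes into a temporary copy. A single indexer that names both row and column writes to the frame itself, sketched here on a two-row version of the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'GDP': [12323.3, 12359.1], 'diff': [np.nan, 35.8],
                   'sign': [None, None]})

# df.iloc[1]['sign'] = 'Positive' would modify a temporary copy.
# One .iloc call with both row and column positions updates df in place:
if df.iloc[1]['diff'] > 0:
    df.iloc[1, df.columns.get_loc('sign')] = 'Positive'
```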

Updating values in a pandas dataframe using another dataframe

I have an existing pandas Dataframe with the following format:
sample_dict = {'ID': [100, 200, 300], 'a': [1, 2, 3], 'b': [.1, .2, .3], 'c': [4, 5, 6], 'd': [.4, .5, .6]}
df_sample = pd.DataFrame(sample_dict)
Now, I want to update df_sample using another dataframe that looks like this:
sample_update = {'ID': [100, 300], 'a': [3, 2], 'b': [.4, .2], 'c': [2, 5], 'd': [.7, .1]}
df_updater = pd.DataFrame(sample_update)
The rules for the update are:
For columns a and c, just add the values from df_updater.
For column b, it depends on the updated value of a. Let's say the update function is b = old_b + (new_b / updated_a).
For column d, the rule is similar to b's, except that it uses the updated c and the new d.
Here is the desired output:
new = {'ID': [100, 200, 300], 'a': [4, 2, 5], 'b': [.233333, .2, .33999999], 'c': [6, 5, 11], 'd': [.51666666, .5, .609090]}
df_new = pd.DataFrame(new)
My actual problem is a slightly more complicated version of this, but I think this example is enough to illustrate it. Also, in my real DataFrame I have more columns following the same rules, so I would like the method to loop over the columns if possible. Thanks!
You can use merge together with add and div:
df = pd.merge(df_sample,df_updater,on='ID', how='left')
df[['a','c']] = df[['a_x','c_x']].to_numpy() + df[['a_y','c_y']].fillna(0).to_numpy()
df['b'] = df['b_x'].add(df['b_y'].div(df['a_y']), fill_value=0)
df['d'] = df['d_x'].add(df['d_y'].div(df['c_y']), fill_value=0)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y a c b d
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7 4.0 6.0 0.233333 0.75
1 200 2 0.2 5 0.5 NaN NaN NaN NaN 2.0 5.0 0.200000 0.50
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1 5.0 11.0 0.400000 0.62
print (df[['a','b','c','d']])
a b c d
0 4.0 0.233333 6.0 0.75
1 2.0 0.200000 5.0 0.50
2 5.0 0.400000 11.0 0.62
Instead of merge, it is possible to use concat:
df = pd.concat([df_sample.set_index('ID'), df_updater.set_index('ID')], axis=1, keys=('_x', '_y'))
df.columns = [''.join((col[1], col[0])) for col in df.columns]
df.reset_index(inplace=True)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7
1 200 2 0.2 5 0.5 NaN NaN NaN NaN
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1
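A loop-friendly variant (not from the original answer): set ID as the index so arithmetic aligns by label, then iterate over the column pairs. This assumes, as in the answer's output above, that b and d divide by the new a and c values from df_updater:

```python
import pandas as pd

df_sample = pd.DataFrame({'ID': [100, 200, 300], 'a': [1, 2, 3],
                          'b': [.1, .2, .3], 'c': [4, 5, 6],
                          'd': [.4, .5, .6]}).set_index('ID')
df_updater = pd.DataFrame({'ID': [100, 300], 'a': [3, 2], 'b': [.4, .2],
                           'c': [2, 5], 'd': [.7, .1]}).set_index('ID')

upd = df_updater.reindex(df_sample.index)  # align; missing IDs become NaN

new = df_sample.copy()
for col in ['a', 'c']:                     # plain additive columns
    new[col] = df_sample[col].add(upd[col], fill_value=0)
for col, denom in {'b': 'a', 'd': 'c'}.items():  # ratio-adjusted columns
    new[col] = df_sample[col].add(upd[col] / upd[denom], fill_value=0)
```

More column pairs can be handled by extending the two loops, which was the OP's stated goal.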
