Search value in Next Month Record Pandas - python-3.x

Given that I have a DataFrame like this:
ID Date Amount
0 a 2014-06-13 12:03:56 13
1 b 2014-06-15 08:11:10 14
2 a 2014-07-02 13:00:01 15
3 b 2014-07-19 16:18:41 22
4 b 2014-08-06 09:39:14 17
5 c 2014-08-22 11:20:56 55
...
129 a 2016-11-06 09:39:14 12
130 c 2016-11-22 11:20:56 35
131 b 2016-11-27 09:39:14 42
132 a 2016-12-11 11:20:56 18
I need to create a column df['Checking'] to show whether the ID appears again in the next month or not, and I tried the code below:
df['Checking'] = df.apply(lambda x: check_nextmonth(x.Date, x.ID), axis=1)
where
def check_nextmonth(date, id):
    # requires: from dateutil.relativedelta import relativedelta
    # True if this ID also appears among the rows that fall in the following month
    x = id in df['ID'][df['Date'].dt.to_period('M') ==
                       (date + relativedelta(months=1)).to_period('M')].values
    return x
but it takes too long to process even a single row.
How can I improve this code, or is there another way to achieve what I want?

Using pd.to_datetime with timestamp tricks:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df['tmp'] = (df['Date'] - pd.DateOffset(months=1)).dt.month
s = df.groupby('ID').apply(lambda x:x['Date'].dt.month.isin(x['tmp']))
df['Checking'] = s.reset_index(level=0)['Date']
Output:
ID Date Amount tmp Checking
0 a 2014-06-13 12:03:56 13 5 True
1 b 2014-06-15 08:11:10 14 5 True
2 a 2014-07-02 13:00:01 15 6 False
3 b 2014-07-19 16:18:41 16 6 True
4 b 2014-08-06 09:39:14 17 7 False
5 c 2014-08-22 11:20:56 18 7 False

Here's one method: within each ID group, check whether the next record's month equals the current month + 1, then assign the result back after sorting by ID.
check = (df.groupby('ID')
           .apply(lambda x: x['Date'].dt.month.shift(-1) == x['Date'].dt.month + 1)
           .stack().values)
df = df.sort_values('ID').assign(checking=check).sort_index()
ID Date Amount checking
0 a 2014-06-13 12:03:56 13 True
1 b 2014-06-15 08:11:10 14 True
2 a 2014-07-02 13:00:01 15 False
3 b 2014-07-19 16:18:41 16 True
4 b 2014-08-06 09:39:14 17 False
5 c 2014-08-22 11:20:56 18 False
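Note that both snippets compare calendar month numbers only, so a December record followed by a January record of the next year would not be matched. A period-based variant (a hedged sketch, not from the original answers) that also handles year boundaries:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df['month'] = df['Date'].dt.to_period('M')                      # year-aware month
seen = set(zip(df['ID'], df['month']))                          # (ID, month) pairs that occur
df['Checking'] = [(i, m + 1) in seen for i, m in zip(df['ID'], df['month'])]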

Related

Replace values in Columns

I want to replace values in columns using an if loop:
If the value in column [D] is not the same as any of the values in [A, B, C], then replace the first NaN in that row with the value of D; and if there is no NaN in the row, create a new column [E] and put the value from column [D] into column [E].
ID A B C D
0 22 32 NaN 22
1 25 13 NaN 15
2 27 NaN NaN 20
3 29 10 16 29
4 12 92 33 55
I want output to be:
ID A B C D E
0 22 32 NaN 22
1 25 13 15 15
2 27 20 NaN 20
3 29 10 16 29
4 12 92 33 55 55
List = [[22, 32, None, 22],
        [25, 13, None, 15],
        [27, None, None, 20],
        [29, 10, 16, 29],
        [12, 92, 33, 55]]
for Row in List:
    Target_C = Row[3]
    if Row.count(Target_C) < 2:                    # skip rows where D already matches another column
        None_Found = False                         # small bool to check later whether a None was found
        for enumerate_Column in enumerate(Row):    # get (index, value) for each element of the row
            if None in enumerate_Column:           # if this element is None
                Row[enumerate_Column[0]] = Target_C  # replace the None with column D
                None_Found = True                    # remember that a None was replaced
            if None_Found:                         # break the loop once a None was found
                break
        if None_Found == False:                    # if no None was found, append a new column
            Row.append(Target_C)
My Code example
You can do it this way:
a = df.isnull()
b = a[a.any(axis=1)].idxmax(axis=1)   # first NaN column label for each row that has one
nanindex = b.index
check = (df.A != df.D) & (df.B != df.D) & (df.C != df.D)
commonind = check[~check].index       # rows where D already matches A, B or C
replace_ind_list = list(nanindex.difference(commonind))
new_col_list = df.index.difference(list(set(commonind.tolist() + nanindex.tolist()))).tolist()
df['E'] = ''
for index, row in df.iterrows():
    for val in new_col_list:
        if index == val:
            df.at[index, 'E'] = df['D'][index]
    for val in replace_ind_list:
        if index == val:
            df.at[index, b[val]] = df['D'][index]
df
Output
ID A B C D E
0 0 22 32.0 NaN 22
1 1 25 13.0 15.0 15
2 2 27 20.0 NaN 20
3 3 29 10.0 16.0 29
4 4 12 92.0 33.0 55 55
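For reference, a more compact row-wise sketch of the same rule (not part of the original answer; column names are assumed to match the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [22, 25, 27, 29, 12],
                   'B': [32, 13, np.nan, 10, 92],
                   'C': [np.nan, np.nan, np.nan, 16, 33],
                   'D': [22, 15, 20, 29, 55]})

def fill_first_nan(row):
    row = row.copy()                      # work on a copy; apply should not mutate its input
    if row['D'] in row[['A', 'B', 'C']].values:
        return row                        # D already matches A, B or C: leave the row alone
    nans = row[['A', 'B', 'C']].isna()
    if nans.any():
        row[nans.idxmax()] = row['D']     # replace the first NaN (left to right) with D
    else:
        row['E'] = row['D']               # no NaN available: spill D into a new column E
    return row

print(df.apply(fill_first_nan, axis=1))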

How to select rows in a DataFrame based on every transition for particular values in a particular column?

I have a DataFrame that has an ID column and a Value column that only consists of (0, 1, 2). I want to capture only those rows where there is a transition from 0 to 1 or from 1 to 2 in the Value column. This has to be done for each ID separately.
I tried a groupby on ID with a difference aggregation, so that I can take those rows for which the difference of values is 1, but it fails in certain conditions.
df=df.loc[df['values'].isin([0,1,2])]
df = df.sort_values(by=['Id'])
df.value.diff()
Given DataFrame:
Index UniqID Value
1    a    1
2    a    0
3    a    1
4    a    0
5    a    1
6    a    2
7    b    0
8    b    2
9    b    1
10    b    2
11    b    0
12    b    1
13    c    0
14    c    1
15    c    2
16    c    2
Expected Output:
2    a    0
3    a    1
4    a    0
5    a    1
6    a    2
9    b    1
10    b    2
11    b    0
12    b    1
13    c    0
14    c    1
15    c    2
Only expecting those rows when there is a transition from either 0-1 or 1-2.
Thank you in advance.
Use my solution, which works per group with tuples of patterns:
np.random.seed(123)
N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0, 1, 2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)
pat = [(0, 1), (1, 2)]
a = np.array(pat)
s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None, :] == a, axis=1).any(), raw=True))
mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))
df = df[mask]
print (df)
UniqID Value
99 a 1
98 a 2
12 a 1
63 a 2
38 a 0
41 a 1
9 a 1
72 a 2
64 b 1
67 b 2
33 b 0
68 b 1
57 b 1
71 b 2
10 b 0
8 b 1
61 c 1
66 c 2
46 c 0
0 c 1
40 c 2
21 d 0
74 d 1
15 d 1
85 d 2
6 d 1
88 d 2
91 d 0
83 d 1
4 d 1
34 d 2
96 d 0
48 d 1
29 d 0
84 d 1
32 e 0
62 e 1
37 e 1
55 e 2
16 e 0
23 e 1
Assuming the transition is strictly from 0 -> 1 and 1 -> 2 (this assumption is valid as well).
Similar sample data:
index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2
Load this into a pandas DataFrame.
Then, using the code below:
def grp_trns(x):
    x['dif'] = x.value.diff().fillna(0)
    return pd.DataFrame(list(x[x.dif == 1]['index'] - 1) + list(x[x.dif == 1]['index']))

target_index = df.groupby('id').apply(lambda x: grp_trns(x)).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id', 'value']])
It gives the desired DataFrame, based on the assumption:
index id value
1 2 a 0
2 3 a 1
3 4 a 0
4 5 a 1
5 6 a 2
8 9 b 1
9 10 b 2
10 11 b 0
11 12 b 1
12 13 c 0
13 14 c 1
14 15 c 2
Edit: To also include the 1 -> 0 transition, below is the updated function:
def grp_trns(x):
    x['dif'] = x.value.diff().fillna(0)
    index1 = list(x[x.dif == 1]['index'] - 1) + list(x[x.dif == 1]['index'])
    index2 = list(x[(x.dif == -1) & (x.value == 0)]['index'] - 1) + list(x[(x.dif == -1) & (x.value == 0)]['index'])
    return pd.DataFrame(index1 + index2)
My version uses shift and diff() to delete all rows whose diff value is equal to 0, 2 or -2:
df = pandas.DataFrame({'index': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
                       'UniqId': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
                       'Value': [1, 0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 0, 1, 2, 2]})
df['diff'] = np.NaN
for element in df['UniqId'].unique():
    df['diff'].loc[df['UniqId'] == element] = df.loc[df['UniqId'] == element]['Value'].diff()
df['diff'] = df['diff'].shift(-1)
df = df.loc[(df['diff'] != -2) & (df['diff'] != 2) & (df['diff'] != 0)]
print(df)
Actually, I'm waiting for updates about the 2-1 and 1-2 relationship.
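For comparison, a compact shift-based sketch (not from any of the answers above; it assumes the rows of each UniqID are contiguous, as in the sample):
prev = df.groupby('UniqID')['Value'].shift()          # previous Value within each UniqID
is_transition = ((prev == 0) & (df['Value'] == 1)) | ((prev == 1) & (df['Value'] == 2))
keep = is_transition | is_transition.shift(-1, fill_value=False)
print(df[keep])                                       # both rows of every 0->1 or 1->2 pair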

pandas DataFrame row-wise iteration referencing previous row values for conditional matching

I have to find out how many times a bike was over the speed limit, and for each instance, for how long (for simplicity, for how many kms).
df = pd.DataFrame({'bike':['b1']*15, 'km':list(range(1,16)), 'speed':[20,30,38,33,28,39,26,33,35,46,53,27,37,42,20]})
>>> df
bike km speed
0 b1 1 20
1 b1 2 30
2 b1 3 38
3 b1 4 33
4 b1 5 28
5 b1 6 39
6 b1 7 26
7 b1 8 33
8 b1 9 35
9 b1 10 46
10 b1 11 53
11 b1 12 27
12 b1 13 37
13 b1 14 42
14 b1 15 20
#Expected result is
bike last_OS_loc for_how_long_on_OS
b1 4 2km
b1 11 5km
b1 15 1km
Now the logic:
Flag speed >= 30 as Overspeed_Flag.
If the speed stays above 30 for 1 km or more of continuation, those rows are treated as one overspeed session (e.g. when b1 was between 2 to 4 km, 6 to 11, and 13-14 km; mark that it was not an overspeed session when b1 was at 6 km alone, as it was only for that row, with no continuation above 30 found).
Then, for each session, measure how long / for how many kms the bike stays over the speed limit. Refer to the expected result above.
Also find, for each overspeed session, what the last km mark was.
Kindly suggest how I can achieve this, and do let me know if anything in the question is unclear.
P.S.: I am also trying this myself, but it is a little complex for me (pretty confused about how to mark whether a row is a continuation of the overspeed flag or a single instance). Will get back if I succeed. Thanks in advance.
You can use:
#boolean mask
mask = df['speed'] >= 30
#consecutive groups
df['g'] = mask.ne(mask.shift()).cumsum()
#get size of each group
df['count'] = mask.groupby(df['g']).transform('size')
#filter by mask and remove unique rows
df = df[mask & (df['count'] > 1)]
print (df)
bike km speed g count
1 b1 2 30 2 3
2 b1 3 38 2 3
3 b1 4 33 2 3
7 b1 8 33 6 4
8 b1 9 35 6 4
9 b1 10 46 6 4
10 b1 11 53 6 4
12 b1 13 37 8 2
13 b1 14 42 8 2
#aggregate first and last km values per session
df1 = df.groupby(['bike', 'g'])['km'].agg([('last_OS_loc', 'last'),
                                           ('for_how_long_on_OS', 'first')])
#subtract the first km from the last km
df1['for_how_long_on_OS'] = df1['last_OS_loc'] - df1['for_how_long_on_OS']
#data cleaning
df1 = df1.reset_index(level=1, drop=True).reset_index()
print (df1)
bike last_OS_loc for_how_long_on_OS
0 b1 4 2
1 b1 11 3
2 b1 14 1
EDIT:
print (pd.concat([mask,
mask.shift(),
mask.ne(mask.shift()),
mask.ne(mask.shift()).cumsum()], axis=1,
keys=('mask', 'shifted', 'not equal (!=)', 'cumsum')))
mask shifted not equal (!=) cumsum
0 False NaN True 1
1 True False True 2
2 True True False 2
3 True True False 2
4 False True True 3
5 True False True 4
6 False True True 5
7 True False True 6
8 True True False 6
9 True True False 6
10 True True False 6
11 False True True 7
12 True False True 8
13 True True False 8
14 False True True 9
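As a hedged follow-up (not part of the original answer): the filtered frame above still carries the helper column 'g', so counting distinct groups per bike answers the "how many times" part of the question.
print(df.groupby('bike')['g'].nunique())   # b1    3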
Here is another approach using a couple of helper Series and a lambda func:
os_session = (df['speed'].ge(30) & (df['speed'].shift(-1).ge(30) | df['speed'].shift().ge(30))).astype(int)
groups = (os_session.diff(1) != 0).astype('int').cumsum()
f_how_long = lambda x: x.max() - x.min()
grouped_df = (df.groupby([os_session, groups, 'bike'])['km']
                .agg([('last_OS_loc', 'max'),
                      ('for_how_long_on_OS', f_how_long)])
                .xs(1, level=0)
                .reset_index(level=0, drop=True))
print(grouped_df)
last_OS_loc for_how_long_on_OS
bike
b1 4 2
b1 11 3
b1 14 1

Skipping every nth row in pandas

I am trying to slice my DataFrame by skipping every 4th row. The best way I could get it done is by getting the index of every 4th row and then selecting all the other rows, like below:
df[~df.index.isin(df[::4].index)]
I was wondering if there is a simpler and/or more pythonic way of getting this done.
One possible solution is to create a mask by modulo and filter by boolean indexing:
df = pd.DataFrame({'a':range(10, 30)}, index=range(20))
#print (df)
b = df[np.mod(np.arange(df.index.size),4)!=0]
print (b)
a
1 11
2 12
3 13
5 15
6 16
7 17
9 19
10 20
11 21
13 23
14 24
15 25
17 27
18 28
19 29
Details:
print (np.mod(np.arange(df.index.size),4))
[0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3]
print (np.mod(np.arange(df.index.size),4)!=0)
[False True True True False True True True False True True True
False True True True False True True True]
If the index values are unique, use a slightly changed version of #jpp's solution from the comments:
b = df.drop(df.index[::4], axis=0)
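A purely positional variant (a hedged sketch, not from the answers) that works regardless of what the index labels are:
import numpy as np

keep = np.ones(len(df), dtype=bool)
keep[::4] = False                 # mark every 4th position for removal
b = df.iloc[keep]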

Python correlation matrix 3d dataframe

I have in SQL Server a historical return table by date and asset Id like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix, for a given date range, for all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I am not able to do it for my n-variable DataFrame.
I see some examples, but they are always for a DataFrame with one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For it, I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head()):
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head()):
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns):
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
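One possible cause of the empty Index([]) in the question (an assumption, not confirmed in the thread): if 1DReturn comes back from SQL Server as Decimal/object dtype, corr() silently drops that column, leaving nothing to unstack. Coercing it to a numeric dtype first would avoid that:
# hypothetical fix, assuming '1DReturn' arrives as Decimal/object dtype
df1d['1DReturn'] = pd.to_numeric(df1d['1DReturn'], errors='coerce')
corr = df1d.set_index(['Date', 'Id_RiskFactor']).unstack().corr()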
